esProc, A Script Language for Data Analytics with Parallel Mechanism: Code: the way for parallelizing Python and NumPy nodes.

The example code is as follows:

#!python
import os
import threading
import time

mylock = threading.RLock()  # guards the shared total
result = 0

class myThread(threading.Thread):
    def __init__(self, s, e, fn):
        threading.Thread.__init__(self)
        self.s = s    # start offset of this part, in bytes
        self.e = e    # end offset of this part, in bytes
        self.fn = fn

    def run(self):  # main function of the thread class
        global result
        value = 0
        # binary mode keeps seek() and tell() in plain byte offsets
        with open(self.fn, 'rb') as myfile:
            myfile.seek(self.s)
            if self.s > 0:  # not the first part: skip the line the previous part finishes
                myfile.readline()
            # a line that starts at or before e belongs to this part; tell() gets the position
            while myfile.tell() <= self.e:
                line = myfile.readline()
                if not line:  # end of file
                    break
                value += int(line.split(b'\t')[0])
        with mylock:  # update the shared total under the lock
            result += value

def main():
    start = time.time()
    file = "d:/T22.txt"
    m = 4  # split the file into m parts
    size = os.stat(file).st_size
    n = size // m  # read about n bytes for each part
    thread = []
    for count in range(m):
        s = count * n
        e = size if count == m - 1 else (count + 1) * n  # last part runs to end of file
        thread.append(myThread(s, e, file))
        thread[count].start()
    for count in range(m):
        thread[count].join()
    end = time.time()
    print("result", result)
    print("time:", end - start)

if __name__ == '__main__':
    main()
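
The delicate part of the example is the byte-range split: a part boundary usually falls in the middle of a line, and every line must be counted exactly once. The sketch below isolates one possible boundary rule (a line belongs to the part in which it starts) and checks it against a plain sequential sum on a small temporary file. The helper names `sum_range`, `split_sum` and `demo`, and the sample data, are hypothetical and only serve the illustration; the parts are summed one after another here so that only the split itself is tested.

```python
import os
import tempfile


def sum_range(path, s, e):
    """Sum column 1 of every line starting in (s, e], or in [0, e] when s is 0."""
    total = 0
    with open(path, "rb") as f:  # binary mode: seek()/tell() are byte offsets
        f.seek(s)
        if s > 0:
            f.readline()  # the line containing byte s belongs to the previous part
        while f.tell() <= e:
            line = f.readline()
            if not line:  # end of file
                break
            total += int(line.split(b"\t")[0])
    return total


def split_sum(path, m):
    """Add up the partial sums of m byte ranges (assumes the file has at least m bytes)."""
    size = os.stat(path).st_size
    n = size // m
    bounds = [(i * n, size if i == m - 1 else (i + 1) * n) for i in range(m)]
    return sum(sum_range(path, s, e) for s, e in bounds)


def demo():
    # Hypothetical sample data: a number, a tab, and padding of varying length
    rows = [(i, "x" * (i % 7)) for i in range(1, 201)]
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                     newline="\n") as tmp:
        for num, pad in rows:
            tmp.write(f"{num}\t{pad}\n")
        path = tmp.name
    try:
        expected = sum(num for num, _ in rows)
        return expected, [split_sum(path, m) for m in (1, 2, 3, 4, 7)]
    finally:
        os.remove(path)
```

Calling `demo()` returns the sequential sum together with the sums obtained for several part counts; if the boundary rule is right, they all agree.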

The parallel computing facilities that Python offers essentially inherit the thread mechanism of the C language. The benefit of this mechanism is flexibility: you can write almost anything with it. Its drawback is that the support it provides is too low-level: coding and debugging are difficult, and you have to solve a series of problems yourself, such as shared memory conflicts.
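
A typical shared memory conflict is a read-modify-write on a variable that several threads update, such as `+=` on a shared total. As a minimal sketch (the names `add_many` and `run_counter` are hypothetical, not taken from the example above), the update can be serialized with a lock from the standard `threading` module:

```python
import threading


def add_many(counter, lock, times):
    """Increment the shared counter `times` times, one locked update at a time."""
    for _ in range(times):
        with lock:  # only one thread at a time may run the read-modify-write
            counter[0] += 1


def run_counter(workers=4, times=10000):
    counter = [0]  # one-element list used as a shared mutable cell
    lock = threading.Lock()
    threads = [
        threading.Thread(target=add_many, args=(counter, lock, times))
        for _ in range(workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter[0]
```

With the lock in place, `run_counter(4, 10000)` returns 40000 on every run. The price is that the programmer has to find every such shared update and guard it by hand, which is the low-level burden described above.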

In short, this kind of mechanism suits system-level programmers better than the application-level programmers who turn to Python. It is therefore not a core feature of Python: it is provided, but it is not very specialized.

If there is a huge number of parallel tasks, Python does not count as an appropriate language. Erlang, a language specialized for parallel computing, no longer uses the thread mechanism with its high resource consumption. It uses a lightweight Actor mechanism instead, which avoids having to deal with shared memory conflicts, so that it can support tens of thousands or even a few million parallel tasks. The Java-based Scala language also provides a similar Actor mechanism, but it is not as specialized as Erlang's.
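
The idea behind the Actor mechanism can be imitated in Python to show why it avoids shared memory conflicts. The sketch below is only an illustration under stated assumptions: it is not how Erlang or Scala implement actors, it still runs on ordinary (heavyweight) threads, and the names `SumActor`, `producer` and `run_actor_demo` are hypothetical. The point is that the total is owned by a single actor, and the other threads never touch it; they only send messages to its mailbox.

```python
import queue
import threading


class SumActor(threading.Thread):
    """Owns its total; other threads communicate with it only through messages."""

    def __init__(self):
        super().__init__()
        self.mailbox = queue.Queue()  # thread-safe message queue
        self.total = 0                # modified only by this actor's own thread

    def run(self):
        while True:
            msg = self.mailbox.get()  # block until a message arrives
            if msg is None:           # None is used here as the stop signal
                break
            self.total += msg


def producer(actor, values):
    """Send each value to the actor instead of updating shared state."""
    for v in values:
        actor.mailbox.put(v)


def run_actor_demo():
    actor = SumActor()
    actor.start()
    # Four producers together send the numbers 0..99 exactly once
    producers = [
        threading.Thread(target=producer, args=(actor, range(i, 100, 4)))
        for i in range(4)
    ]
    for p in producers:
        p.start()
    for p in producers:
        p.join()
    actor.mailbox.put(None)  # all messages are queued; ask the actor to stop
    actor.join()
    return actor.total
```

No lock appears in the user code: the queue is the only point of contact between threads, so there is no shared variable to protect.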

If you need to compute over a large amount of structured data (including logs, etc.), another choice is esProc. esProc still uses the thread mechanism, but its syntax is much simpler than Python's (its drawback, of course, is that it cannot be used to develop system-level applications). esProc also supports multi-machine clusters, so it is more convenient for dealing with big data.


June 10, 2014
