esProc, A Script Language for Data Analytics with Parallel Mechanism: An Illustration of esProc’s Parallel Processing of Big Text File

esProc can parallelly process big text files conveniently. The following case will illustrate its operating method.

Let’s assume that there is a text file, sales.txt, with ten million sales records. Its main fields include SellerID, OrderDate and Amount. Now compute each salesman’s total Amount of big orders in the past four years.The big orders refer to those whose amount is above 2000.

To do parallel process, first we must segment the file. esProc provides the cursor data object and its function, which segment big text files conveniently and read easily. Take file(“e:/sales.txt”).cursor@tz(;, 3:24) for an example, it means, basically, the file isdivided averagely into 24 segments by bytes, then the third one will be read. By doing so, half line of data, or half line of record, will occur, and further programming and processing are required. However, we have to traversal across all the previous rows if segmenting the file by rows, which cannot reach an expected high efficiency by adopting segmenting algorithm and parallel processing. Unlike these two types of solution, esProc can do rounding off job automatically to ensure the data’s validity when segmenting the file.

With esProc at hand, only simple parallel processing will be required after the file is segmented. The code is as follow:

Main Program

A1: Set the number of parallel tasks as 24, that is, divide the file into 24 segments.
A2: Call the subprogram to do multi-threading parallel computation. There are two task parameters: to(A1),A1. The value of to(A1) is [1,2,3…24], which corresponds the serial number of each segment; A1 represents the total segments. When all the tasks is done, the computed results will be uniformly stored in A2.
A3: Merge the computed result of each task in A2 according to Sellerld.
A4: Group and summarize the merged results to seek each salesman’s amount.

Subprogram

Both segment and total, which represent respectively the current segment and total segments, are parameters of the subprogram. Say, the value of segment’s parameter is 3 and that of total is a permanent 24 in task 3.

A1: Read the file with cursor. Note that which segment should be processed in current task should be decided according to parameter delivered by main program.
A2: Select those records whose order amount is above 2000 after the year of 2011.
A3: Group and summarize the filtered data.
A4: Return the task’s computed result to main program.

Code description

For CPU with N cores, it appears natural to set N tasks. The fact is that some tasks always operate faster than others, say, according to the differences of filtered data. So the common situation is that some cores are in an idle state after finishing those faster tasks, but a few other cores are still in operation with slower tasks. Comparatively, if each core performs multiple tasks in turn, the speed of different tasks will be averaged out and the general operation will become more stable. In the above example, therefore, the task is divided into 24 parts and given to CPU’s 8 cores to process (how many parallel threads are allowed at the same time to configurate in the environment of esProc). But, on the other hand, too many segmented tasks may have weaknesses. One is the decline of whole performance, the other is that there will be more computed results produced by each task when added together, and they will occupy more memory.
callx makes develop process simpler by encapsulating complex multithread computation, which enables programmers to focus on business algorithm instead of being distracted by complicated semaphore control.

Computed results of A3 in the above main program have been sorted automatically according to Sellerld, so it is not necessary to sort again in A4 when grouping and summarizing. @o, the function option of groups, can group and summarize efficiently without sorting.

Further illustration
Sometimes when the data size of a text file amount to several TB, it is required to use multiple node based on cluster to make parallel computation. Now esProc can do parallel computation easily. Because its cursor and relative functions support non-expensive scale-out and distributed file system (DFS). As to the above example, all that is needed is a node list added to A2 in the main program. It will be like this: =callx("sub.dfx", to(A1),A1; ["192.168.1.200:8281", "192.168.1.201:8281", ”......”]).

menu

July 13, 2014

An Illustration of esProc’s Parallel Processing of Big Text File

No comments:

Post a Comment