esProc, A Script Language for Data Analytics with Parallel Mechanism: esProc Improves Text Processing

The usual way to insert summary values into the grouped data is to process data group by group. Import a group of data, append them and their summary value to a new file and then do the same with the next group, and so on. But it is not easy to realize this in hard coding. esProc, however, supports group cursor with which a whole group of data can be imported automatically. The following example will show how esProc deals with this kind of computation.

The log webdata.log has three columns separated by commas. The first column is the identifier for grouping data. The other two columns hold numerical values. Some of the data are as follows:

Notice that the first and fourth group has the same identifier for grouping data.

Now we are asked to insert the average value of the second column and an empty row between each group, as shown below:

esProc code for doing this task:

A1=file("E: \\webdata.log").cursor(;",")

This line of code opens the log file in the form of a cursor. cursor function returns a cursor object according to the corresponding file object. In the function, comma is used as the column separator (default separator is tab) and default column names are _1,_2…_n, in which _1 is the column to mark data grouping. We can also specify the column names like cursor(groupName,data1,data2;”,”).

The code only creates cursor objects but does not import the data. The data importing will be started by for statement or fetch function.

B1=file("e:\\result.txt"). This line of code creates a file object for storing the computed results.

A2:for A1;_1

This line of code fetches data from the cursor in A1 by loop, importing a group of data with the same first column (the name is _1) each time. It is in this step that data are really imported into the memory.

The for statement here is worth special attention. for cs,n means fetching n rows from cursor cs at a time. While for cs;x means fetching a group of records with the same x field from cursor cs in which data need to be grouped beforehand by x. In this example, the data are already grouped. But if the data are ungrouped, they can be prepared them by using other esProc functions (like sortx, a function for sorting cursors).

The x in the statement for cs;x can be an expression, according to which multiple rows will be imported each time uninterruptedly until the expression changes. For example, for A14 ;left(_1,4) will judge the first four characters of the first column according to the expression and corresponding records will be classified into the same group until the characters change .

B2-B4 is the loop body of for statement in A2. The loop body processes every group of data in the same way. Its working scope, as can be seen from the cellset, is represented by indentation rather than by parentheses or other identifiers like begin/end. What’s more, the loop variable can be represented by the name of the cell where for statement resides, which, in this example, means A2 represents the records of the current group. Seen in debug mode, the value of A2 in the first-run loop is as follows:

B2=B1.export@a(A2;",")

This line of code appends A2 to the defined file object. export function exports a group of records to the file, in which @a option means appending. In order to keep consistent with the source data, comma is used here as the separator (though the default separator is tab). Open result.txt after the first loop and we can see the following data:

B3=A2._1+"_avg,"+string(A2.avg(_2))+”\r\n”

This line of code is used to piece together the summarizing string. A2._1 represents the first column of the current group. Its value is “webcat_service” as with the first group. The expression A2.avg(_2) means getting the average value of the second column of the current group. The value is 2.25 as with the first group. string function will formatting the variable of floating point type into the string.

For the first group of data, B3’s value is this:

B4=B1.write@a(B3)

This line of code appends B3 to the result file. Both export function and write function can write data into a new file. The former writes structured data into the file, whereas the latter writes strings or an array of strings into the file. @a option appends data, which is preceded by writing the carriage return into the file.

At this point, the above script has finished processing all data. The final result can be viewed in result.txt as follows:

menu

May 26, 2015

esProc Improves Text Processing – Insert Summary values into Grouped Data

No comments:

Post a Comment