esProc, A Script Language for Data Analytics with Parallel Mechanism: Group Cursor in esProc

In big data computing, besides grouping and aggregation, you sometimes need to retrieve one group of data at a time for analysis. For example, you might analyze sales by date, collect statistics on the sales curve of each product, or study the purchase habits of each client.

In esProc, you can use the function cs.fetch(;x) or cs.skip(;x) to fetch or skip records until the value of expression x changes. In this way, a group of consecutive records that share the same value of x is obtained. For example, retrieve the records of one product at a time in order to examine the sales data of each product:

From B7, the records of the 20th product can be retrieved like this:

Data retrieval through an esProc cursor is a one-way street: records can only be read forward. Thus the data in the cursor must already be in order if you want to retrieve a complete group of records each time.
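To make the "fetch until x changes" behaviour concrete, here is a minimal Python sketch of a forward-only cursor. It is an analogue for illustration only, not esProc code: the class `GroupCursor`, its methods `fetch_group` and `skip_group`, and the sample records are all hypothetical names and data invented for this example.

```python
class GroupCursor:
    """Forward-only cursor that returns one run of equal-key records at a time.

    Hypothetical Python analogue of fetching by group; not the esProc API.
    """

    _EMPTY = object()  # sentinel meaning "no record"

    def __init__(self, records):
        self._it = iter(records)
        self._pending = self._EMPTY  # one-record lookahead buffer

    def _next(self):
        # Hand back the buffered record first, then continue reading forward.
        if self._pending is not self._EMPTY:
            rec, self._pending = self._pending, self._EMPTY
            return rec
        return next(self._it, self._EMPTY)

    def fetch_group(self, key):
        """Read records until key(record) changes; return them as a list."""
        first = self._next()
        if first is self._EMPTY:
            return []
        group, current_key = [first], key(first)
        while True:
            rec = self._next()
            if rec is self._EMPTY:
                break
            if key(rec) != current_key:
                self._pending = rec  # belongs to the next group
                break
            group.append(rec)
        return group

    def skip_group(self, key):
        """Discard the next group and return how many records were skipped."""
        return len(self.fetch_group(key))


def pid(record):
    return record[0]


# Made-up (product id, quantity) records, sorted by product id.
orders = [("A1", 5), ("A1", 3), ("B2", 7), ("C3", 1), ("C3", 4)]
cur = GroupCursor(orders)
first = cur.fetch_group(pid)    # both A1 records
skipped = cur.skip_group(pid)   # skips the single B2 record
third = cur.fetch_group(pid)    # both C3 records

# Unsorted input: the cursor cannot go back, so the A1 group is fragmented.
fragment = GroupCursor([("A1", 5), ("B2", 7), ("A1", 3)]).fetch_group(pid)
```

The last line shows why the ordering requirement matters: with unsorted input, only the first A1 record is returned, and the second one surfaces later as a separate "group".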

As we know, the @z option can be used to retrieve data from a file or a cursor by block. However, when retrieving by block, esProc decides how the data is divided, and this can sometimes cause trouble.

First, let’s prepare a data file. Take the data used above, which are already sorted by the sequence number, and store them in a new binary file, Order_Products:

In later computation, if the data is retrieved by segment, the following situation arises:

With all data divided into 100 segments, A3 retrieves the data of the 1st segment and A5 retrieves the data of the 2nd segment, as shown below:

Here a problem appears: the sales records of product number B1445 show up in both segments. If you aggregate after each retrieval, the same product number may appear more than once in the results, and a second aggregation is needed to get the final result. Such piecewise computation is quite common in parallel computation over big data, and this situation makes the computation more complicated. In this case, the data should be segmented by group when it is stored.
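The effect can be reproduced with a small Python sketch. This is an illustration under stated assumptions, not esProc's segmentation logic: `split_by_count` and `total_by_pid` are hypothetical helpers, the quantities are made-up sample data, and only the product number B1445 comes from the example above.

```python
from collections import Counter


def split_by_count(records, n_segments):
    """Cut records into n_segments nearly equal slices, ignoring group keys."""
    size, extra = divmod(len(records), n_segments)
    segments, start = [], 0
    for i in range(n_segments):
        end = start + size + (1 if i < extra else 0)
        segments.append(records[start:end])
        start = end
    return segments


def total_by_pid(segment):
    """Sum the quantity per product id within one segment."""
    totals = Counter()
    for pid, quantity in segment:
        totals[pid] += quantity
    return totals


# Made-up (product id, quantity) records, sorted by product id.
orders = [
    ("B1443", 2), ("B1444", 1), ("B1445", 4),
    ("B1445", 6), ("B1446", 3), ("B1446", 5),
]

segments = split_by_count(orders, 2)
partials = [total_by_pid(s) for s in segments]

# B1445 has a partial total in each segment, so the per-segment
# results have to be aggregated a second time.
merged = sum(partials, Counter())
```

Here the boundary falls in the middle of the B1445 records, so neither partial total for B1445 is correct on its own; only the merged result is.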

When storing binary data from a cursor, simply use the @g option. The data written to the file will then be segmented by group, so that when the data is retrieved by block, all records of the same group are sure to be retrieved together. For example:

The data sorted by the sequence number of products are saved as a binary file, Order_Products_G, segmented by group according to PID. This differs slightly from the method used earlier to write the data to the file Order_Products. Please note that piecewise storage is only valid for binary files.

Now retrieving by segment behaves differently:

In this step, the data retrieved in A3 and A5 are as follows:

This time, segment 1 contains all records of product number B1445, and segment 2 starts from the next product. As can be seen, if segmenting by group is specified when a binary file is written, the data of a whole group is placed in a single segment for retrieval through the cursor. Segmenting by group guarantees the integrity of the data in each group and makes piecewise computation over big data simpler.
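A group-aligned split can be sketched in Python as follows. How esProc actually chooses segment boundaries with @g is not described here, so the rule below (close a segment at the first group boundary at or after the even split point) is an assumption made for illustration; `split_by_group` is a hypothetical helper and the records are made-up sample data.

```python
from itertools import groupby


def split_by_group(records, n_segments, key):
    """Cut sorted records into at most n_segments slices, never splitting a group.

    Assumed rule: a segment is closed at the first group boundary at or
    after its share of the records. Not taken from esProc.
    """
    target = len(records) / n_segments
    segments, current, consumed = [], [], 0
    for _, group in groupby(records, key=key):
        group = list(group)
        current.extend(group)
        consumed += len(group)
        reached_share = consumed >= target * (len(segments) + 1)
        if reached_share and len(segments) < n_segments - 1:
            segments.append(current)
            current = []
    if current:
        segments.append(current)
    return segments


def pid(record):
    return record[0]


# Made-up (product id, quantity) records, sorted by product id.
orders = [
    ("B1443", 2), ("B1444", 1), ("B1445", 4),
    ("B1445", 6), ("B1446", 3), ("B1446", 5),
]

segments = split_by_group(orders, 2, pid)
pids_per_segment = [{pid(r) for r in s} for s in segments]
```

Because no product id occurs in more than one segment, each segment can be aggregated independently (for example by parallel workers) and the results simply concatenated, with no second aggregation pass. The trade-off is that segments are no longer exactly equal in size.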


August 7, 2014
