June 3, 2015

Processing Big Text Files with esProc Cursor: Code Examples

esProc supports cursor objects and provides related functions to process big text files conveniently. Below is an example.

Suppose a text file, sales.txt, stores ten million sales records. Its fields mainly include SellerID, OrderDate and Amount. Requirement: Computing each seller’s total amount of big orders in the past four years. Big orders are those whose amount is above 2,000.


esProc code:

Code explanation:
A1: Importing all the ten million rows into memory at once will cause memory overflow. Here they are imported in batches.

A2: Fetch data by loop, with 100,000 rows at a time.

B2: Filter each batch of data to select those records whose amount is above 2,000 after the year of 2011.

B3: Group and summarize the filtered data to get every seller’s sales amount for the current batch.

B4: Append the computed result of the current batch to a certain variable (B1), and move on to the computation on next batch.

A5: After computations on all batches are completed, each seller’s sales amount for every batch can be found in B1. Finally, group and summarize B1’s data to get each seller’s total sales amount.

Code analysis:

In cell A1, esProc cursor is created with cursor function. The cell name is used as the name of the cursor variable. The creation of the cursor does not result in fetching data into the memory. Data will only be fetched when fetch operation or other equal operations are executed, like the statement for A1,100000 in cell A2, which indicates fetching data from cursor by loop, with 100,000 rows at a time. We can see that the data volume in memory is always kept in a relatively small level and no overflows will occur.

select and groups are functions for processing structured data. After data is fetched into memory using esProc cursor, it can be processed and analyzed by library functions specializing for structured computation. This is easier than writing underlying code by hand. 

Also equipped with functions and syntax for processing semi-structured data, e.g. functions for splitting and merging data, loop and traversal statement and fork statement, the esProc cursor can perform complex data cleansing and produce easily computed structured data.   

Data Splitting and analysis
One example is the blog whose format is too complex to be used directly for computation and analysis. A typical file containing blogs needs to be parsed into the two-dimensional table of standard format so as to be directly used for structured data computation or be stored in a database.  

This is a record from the original blog:
10.10.10.145 - - [01/May/2013:03:24:56 -0400] "GET /product/p0040001/review.jsp?page=5 HTTP/1.1" 200 8100 "http://www.xxx.com/xxxx.html""Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.66 Safari/537.36" 0 tankLee in 9fc0792eb272b78916c3872e9ad –

It is then split and parsed into one that is stored in a two-dimensional table:

The following code omits the file access and the final combination of batches of data (see the previous example), and only displays steps for data splitting and analysis: 

Data cleansing
Let’s look at an example of typical data cleansing. The format of employee table read from a file is not standard. It needs to be reformatted into standard structured data in batches. The current batch of data will be stored temporarily in cell D3. The reformatting rules are:

1.The record is invalid if either UserID or firstName is null or an empty string.

2.UserID should only contain digits; the record is invalid if it contains letters.

3.For duplicate UserIDs, only the one last entered will be kept.

4.Delete the possible space before and after a piece of data.

5.Capitalize all the first letters of firstName.

6.Combine the Full name in the form of “firstName+”.”+“lastName”. If lastName is null or an empty string, fullname equals firstName.

The following script also omits the file access and the combination of batches of data, and only includes the code for data cleansing: