esProc supports
cursor objects and provides related functions to process big text files
conveniently. Below is an example.
Suppose a text file, sales.txt,
stores ten million sales records. Its fields mainly include SellerID, OrderDate
and Amount. Requirement: Computing each seller’s total amount of big orders in
the past four years. Big orders are those whose amount is above 2,000.
esProc
code:
Code explanation:
A1: Importing all the ten million rows into memory at once will
cause memory overflow. Here they are imported in batches.
A2: Fetch data by loop, with 100,000 rows at a time.
B2: Filter each batch of data to select those records whose amount
is above 2,000 after the year of 2011.
B3: Group and summarize the filtered data to get every seller’s sales
amount for the current batch.
B4: Append the computed result of the current batch to a certain
variable (B1), and move on to the computation on next batch.
A5: After computations on all batches are completed, each seller’s sales
amount for every batch can be found in B1. Finally, group and summarize B1’s data
to get each seller’s total sales amount.
Code analysis:
In cell A1, esProc cursor is created with cursor
function. The cell name is used as the name of the cursor variable. The creation
of the cursor does not result in fetching data into the memory. Data will only
be fetched when fetch operation or other
equal operations are executed, like the statement for
A1,100000 in cell A2, which indicates fetching data from cursor by loop,
with 100,000 rows at a time. We can see that the data volume in memory is
always kept in a relatively small level and no overflows will occur.
select and groups are
functions for processing structured data. After data is fetched into memory using
esProc cursor, it can be processed and analyzed by library functions specializing
for structured computation. This is easier than writing underlying code by
hand.
Also equipped with functions and syntax for
processing semi-structured data, e.g. functions for splitting and merging data,
loop and traversal statement and fork statement, the esProc cursor can perform
complex data cleansing and produce easily computed structured data.
Data Splitting and analysis
One example is the blog whose format is too
complex to be used directly for computation and analysis. A typical file containing
blogs needs to be parsed into the two-dimensional table of standard format so
as to be directly used for structured data computation or be stored in a
database.
This is a record from the original blog:
10.10.10.145 - - [01/May/2013:03:24:56
-0400] "GET /product/p0040001/review.jsp?page=5 HTTP/1.1" 200 8100
"http://www.xxx.com/xxxx.html""Mozilla/5.0 (Windows NT 6.1)
AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.66 Safari/537.36"
0 tankLee in 9fc0792eb272b78916c3872e9ad –
The following code omits the file access and the final combination of batches of data (see the previous example), and only displays steps for data splitting and analysis:
Data cleansing
Let’s look at an example of typical data cleansing. The format of
employee table read from a file is not standard. It needs to be reformatted
into standard structured data in batches. The current batch of data will be
stored temporarily in cell D3. The reformatting rules are:
1.The record is invalid if either
UserID or firstName is null or an empty string.
2.UserID should only contain digits;
the record is invalid if it contains letters.
3.For duplicate UserIDs, only the
one last entered will be kept.
4.Delete the possible space
before and after a piece of data.
5.Capitalize all the first
letters of firstName.
6.Combine the Full name in the
form of “firstName+”.”+“lastName”. If lastName is null or an empty string,
fullname equals firstName.
No comments:
Post a Comment