August 25, 2014

A Method of Grouping and Summarizing Data of Big Text Files in R Language

It is common to use R language to group and summarize data of files. Sometimes we may find ourselves processing comparatively big files which have smaller computed result and bigger source data. We cannot load them wholly to the memory when we need to compute them. The only solutions could be batch importing and computing as well as result merging. We’ll use an example in the following to illustrate the way of R language to group and summarize data from big text files.

Here is a file, sales.txt, of 1G size, which contains a great number of records of sales orders. We want to group field CLIENT and summarize field AMOUNT. "\t" is used in the file as the column separator. The first rows of data are as follows:
R's solution:
con <- file("E: \\sales.txt", "r")
ORDERID","Group.1","SELLERID","x","ORDERDATE")))!=0) {

Part of the computed result:

Code interpretation:
The 1stline: Open the file handle.
The 2nd ~ 3rdline: Import the first batch of 100,000 rows of data, group and summarize them and save the result in result.
The 4th ~ 8thline: Import data by loop, with 100,000 rows of data per batch, and store them in the variable databatch. Then get the second and fourth field, i.e. "CLIENT" and "AMOUNT", merge databatch into result, and execute grouping operation.
It can be seen that, at a certain moment, only databatch,which includes 100,000 rows of data, and result, the summarizing result, have memory usage. Usually, the size of the latter is small and will not result in a memory overflow.
The 11thline: Close the file handle.

Matters needing attention:
Data frame. Because the data frame of R language cannot directly perform the computing of big files, loop statement is necessary to help to do the job in this occasion. The steps are: import a batch of data and merge them into the data frame result; group and summarize result and then import the next batch of data. You can see that this part of code of loop statement is a little complicated.

Column name. As the first row of data is the column name, header=TRUE can be used in the first batch of data to directly set the column name. But the subsequent data hasn't column names and header=FALSE should be used to import data. The default column names are V1, V2 and so forth when header=FALSE is used. But the default column names are Group.1 and x after grouping and summarizing are executed, and col.names is needed to change the column names in order to maintain structure consistency both before and after grouping and summarizing and set the stage for the subsequent merging. The code about column names is worth our notice because it is easy to get wrong.

Alternative solutions:
Python, esProc and Perl can also perform the same operation. They can execute the grouping and summarizing of data from big text files and the subsequent structured data computing as R language does. We'll briefly introduce the coding methods used by esProc and Python.
esProc can process data in batches automatically, which requires no manual control from the programmers by loop statement and produces quite simple code:

Cursor is a data type used for structured data computing in esProc. Its usage is similar to that of the data frame, but it is better at processing big files and performing complicated computations. What’s more, @t option in the code indicates that the first line of the file is the column name. So it is convenient to use the column name directly in subsequent computation.

Python's code structure, which also requires manual loop control, is similar to that of R language. But Python itself hasn't the structured data type, like data frame or cursor, so its code is executed in a lower level:
from itertools import groupby
from operator import itemgetter
result = []
myfile = open("E:\\sales.txt",'r')
BUFSIZE = 10240000
lines = myfile.readlines(BUFSIZE)
while lines:
    for line in lines:
    result=sorted(result,key=lambda x:(x[0]))                #the sorting before grouping is executed
    for key, items in groupby(result, itemgetter(0)):    # group using groupBy function
        for subItem in items:value+=subItem[1]
batch.append([key,value])                 # finally, merger the summarizing results into a two-dimensional array

    lines = myfile.readlines(BUFSIZE)

Except for the above two-dimensional array, Python can execute the operation with the third-party packages. For example, pandas has the structured data object similar to the data frame. pandassimplifies the code in a similar way asR language. But it lacks sufficient ability to perform big file computing, thus loop statement is still needed while programming.