August 26, 2014

Method of Filtering Data of Big Text Files with R Language

It's common to use R language to filter data of files. Sometimes we need to filter data of big files which cannot be imported to the memory all at once. Solutions to this problem include batch importing and filtering as well as result merging. Here is an example to illustrate the method of filtering data of big files with R language.

There is a file sales.txt of 1G size containing a great number of sales order records. Please filter out records in which the field value of AMOUNT is between 2,000 and 3,000. The column separator used in the file is "\t". The first rows of data are as follows: 

R's solution:
con <- file("E:\\sales.txt", "r")
result<-result[result$V4>=2000 & result$V4<=3000,]
while(length(databatch<-read.table(con,header=FALSE,nrows=100000,sep="\t"))!=0) {

These are some of the computed results:

Code interpretation:
The 1st rowOpen the file handle.

The 2nd row: Skip the first row, i.e. the column name.

The 3rd ~4th row: Import the first batch of 100,000 rows of data, then perform 

filtering and save the result in result.

The 5th ~ 8th row: Import by loop. 100,000 rows data are imported per batch. Append the filtering result to result variable, and then begin the processing of next batch.

The 9th row: Close the file handle.

Points for attention:

For the small files, only a line of code is enough for executing data importing and the first row can be set as the column name of data frame. But this won't work with the big files whose data require to be imported in batches. And for the second batch of imported data,the first row cannot be set as the column name. The column names are by default V1, V2, V3 and so forth.

In order to import data from big files in batches, R language has to use while statement in performing the operation. Besides, it is not convenient in using column names. These will produce fairly complicated code.

Alternative methods:
Python, esProc and Perl can also perform the same operation to do this task. As R language, all of them can perform file data filtering and structured data computing. The following will briefly introduce the methods used by esProc and Python.

esProc can automatically process data in batches and it’s not necessary for programmers to control the operation manually with loop statement, so its code is quite simple:

Cursor is an esProc data type used for structured data computing, which has almost the same usage as data frame but is better at processing big files and performing complex computation. What's more, cursor can set the first row of the file as the column name using @t option.

Python's code structure is similar to that of R language. Both require controlling loop manually. But Python hasn't structured data types like data frame or cursor, so its code is classified into a lower level:
result = []
myfile = open("E: \\sales.txt",'r')
BUFSIZE = 10240000
lines = myfile.readlines(BUFSIZE)
while lines:
    for line in lines:
        if(AMOUNT>=2000 and AMOUNT<=3000):
    lines = myfile.readlines(BUFSIZE)

Though Python can perform the above operation using the third-party packages, like pandas, which has structured data object similar to the data frame, it lacks sufficient ability in processing big files. Thus it is hard for it to further simplify the code.