There is a 1GB file, sales.txt, containing a great number of sales order records. Filter out the records whose AMOUNT field value is between 2,000 and 3,000. The column separator used in the file is "\t".
R's solution:
con <- file("E:\\sales.txt", "r")
readLines(con, n=1)
databatch <- read.table(con, header=FALSE, nrows=100000, sep="\t")
result <- databatch[databatch$V4>=2000 & databatch$V4<=3000,]
while (nrow(databatch <- read.table(con, header=FALSE, nrows=100000, sep="\t")) != 0) {
  databatch <- databatch[databatch$V4>=2000 & databatch$V4<=3000,]
  result <- rbind(result, databatch)
}
close(con)
Code interpretation:
The 1st row: Open the file handle.
The 2nd row: Skip the first row, i.e. the column names.
The 3rd~4th rows: Import the first batch of 100,000 rows, filter it, and save the result in result.
The 5th~8th rows: Import the remaining data in a loop, 100,000 rows per batch. Filter each batch, append the filtered rows to the result variable, and then process the next batch.
The 9th row: Close the file handle.
Points for attention:
For a small file, a single line of code is enough to import the data, and the first row can be set as the column names of the data frame. This won't work with a big file, whose data must be imported in batches: from the second batch onward, the first row is no longer the column names, so the columns get the default names V1, V2, V3 and so forth.
To import data from a big file in batches, R has to control the operation with a while statement, and working without real column names is inconvenient. Both make the code fairly complicated.
Alternative methods:
Python, esProc and Perl can also perform this task. Like R, all of them can filter file data and compute over structured data. The following briefly introduces the methods used by esProc and Python.
esProc processes data in batches automatically, so programmers don't have to control the operation manually with a loop statement, and the code is quite simple.
Cursor is an esProc data type used for structured data computing. It has almost the same usage as the data frame but is better at processing big files and performing complex computations. What's more, a cursor can set the first row of the file as the column names with the @t option.
Python's code structure is similar to that of R: both require controlling the loop manually. But Python has no structured data type like the data frame or the cursor, so its code works at a lower level:
result = []
myfile = open("E:\\sales.txt", 'r')
BUFSIZE = 10240000
myfile.readline()
lines = myfile.readlines(BUFSIZE)
while lines:
    for line in lines:
        record = line.split('\t')
        AMOUNT = float(record[3])
        if AMOUNT >= 2000 and AMOUNT <= 3000:
            result.append(record)
    lines = myfile.readlines(BUFSIZE)
myfile.close()
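Since a Python file object yields its lines lazily when iterated, the same scan can also be written without managing a read buffer by hand. A minimal sketch, in which a tiny generated sample file stands in for the 1GB sales.txt and the column names (ORDERID, CLIENT, SELLERID, AMOUNT) are assumed for illustration:

```python
import os
import tempfile

# Create a tiny tab-separated stand-in for sales.txt: a header row plus 4 records.
sample = (
    "ORDERID\tCLIENT\tSELLERID\tAMOUNT\n"
    "1\tA\t10\t1500.0\n"
    "2\tB\t11\t2500.0\n"
    "3\tC\t12\t2999.5\n"
    "4\tD\t13\t3200.0\n"
)
path = os.path.join(tempfile.mkdtemp(), "sales.txt")
with open(path, "w") as f:
    f.write(sample)

result = []
with open(path, "r") as myfile:
    next(myfile)                          # skip the header row
    for line in myfile:                   # the file object yields one line at a time
        record = line.rstrip("\n").split("\t")
        if 2000 <= float(record[3]) <= 3000:   # AMOUNT is the 4th field
            result.append(record)

print(len(result))  # 2 of the sample records fall in [2000, 3000]
```

Iterating the file directly keeps memory usage bounded by a single line rather than a batch, at the cost of giving up the explicit BUFSIZE control of the version above.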
Though Python can perform the above operation with third-party packages like pandas, which has a structured data object similar to the data frame, pandas lacks sufficient ability to process big files, so the code is hard to simplify much further.
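For reference, pandas can read a big file in chunks via read_csv's chunksize parameter, though, as with the loops above, each chunk still has to be filtered and appended manually. A sketch under the same assumed four-column layout, again with a tiny generated file standing in for sales.txt:

```python
import os
import tempfile

import pandas as pd

# Tiny tab-separated stand-in for the 1GB sales.txt.
path = os.path.join(tempfile.mkdtemp(), "sales.txt")
with open(path, "w") as f:
    f.write("ORDERID\tCLIENT\tSELLERID\tAMOUNT\n"
            "1\tA\t10\t1500.0\n"
            "2\tB\t11\t2500.0\n"
            "3\tC\t12\t3200.0\n")

parts = []
# With chunksize set, read_csv returns an iterator of DataFrames
# instead of loading the whole file at once.
for chunk in pd.read_csv(path, sep="\t", chunksize=100000):
    parts.append(chunk[(chunk["AMOUNT"] >= 2000) & (chunk["AMOUNT"] <= 3000)])
result = pd.concat(parts, ignore_index=True)

print(len(result))  # 1 matching record in this sample
```

The structure mirrors the R solution: a loop over batches, a per-batch filter, and an append, so the chunked pandas version is shorter than raw Python but not fundamentally simpler.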