June 10, 2014

How to speed up data processing in R?

Whenever R's speed comes up, everyone recommends avoiding loops, using library functions instead, or writing low-level code in another language and calling it from R. But if that is the case, what is the point of R itself?

Data processing frequently requires traversing data, especially data stored in files, and that cannot be done without loops. And if we always write the low-level code in another language, where is R's advantage? Why not do the whole job in that language?

Parallel computing is another option.

R's interpreter is essentially single-threaded, but RHadoop lets it work with Hadoop. Unfortunately, RHadoop is still very slow. MapReduce splits a task into pieces and hands each piece to R for processing (this is the principle of MapReduce). R is good at operating on whole sets, but that strength is of little use on small pieces.
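To make the split-then-combine principle concrete, here is a minimal sketch in Python. It is not Hadoop or RHadoop code: the function names, the piece size and the sample records are all made up for illustration. It only shows that each piece is processed in isolation and that the partial results have to be merged afterwards.

```python
def split_into_pieces(records, piece_size):
    """Split step: cut the full record list into fixed-size pieces."""
    return [records[i:i + piece_size] for i in range(0, len(records), piece_size)]

def map_piece(piece):
    """Map step: sum the amounts per key, looking at one piece only."""
    partial = {}
    for key, amount in piece:
        partial[key] = partial.get(key, 0) + amount
    return partial

def reduce_partials(partials):
    """Reduce step: merge the per-piece partial sums into one result."""
    total = {}
    for partial in partials:
        for key, amount in partial.items():
            total[key] = total.get(key, 0) + amount
    return total

# Hypothetical sample data: (region, amount) pairs.
records = [("east", 10), ("west", 5), ("east", 7), ("north", 3), ("west", 1)]
pieces = split_into_pieces(records, 2)
totals = reduce_partials([map_piece(p) for p in pieces])
```

No single map call ever sees the whole data set, which is why a tool whose strength is whole-set operations gains little at this stage.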

At the Reduce stage there is a set to work on, but R does not directly support cursor-style (iterator) operations. R's strengths apply only after the whole set has been fetched into memory, and that can cause memory overflow.

R is developed in C, while Hadoop is mainly based on Java. Passing data from Java to C requires type conversion, which costs performance. A large cluster can compensate, but it brings higher hardware and maintenance costs, and R programmers are generally not skilled with networks and clusters.
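The cursor-style access mentioned for the Reduce stage can be sketched as follows. This is only an illustration in Python, assuming a CSV input with hypothetical `region` and `amount` columns; the function name is made up too. The point is that rows are pulled one at a time, so only the running totals stay in memory instead of the whole set.

```python
import csv
import io

def sum_by_key_streaming(lines, key_field, value_field):
    """Aggregate row by row; only the running totals are kept in memory."""
    totals = {}
    # csv.DictReader pulls rows lazily from any iterable of text lines.
    for row in csv.DictReader(lines):
        key = row[key_field]
        totals[key] = totals.get(key, 0.0) + float(row[value_field])
    return totals

# Hypothetical sample data. In practice `lines` would be an open file object,
# so the file is never loaded into memory as a whole.
sample = io.StringIO("region,amount\neast,10\nwest,5\neast,7.5\n")
result = sum_by_key_streaming(sample, "region", "amount")
```

The same loop works unchanged on a file of any size, because memory use grows with the number of distinct keys and not with the number of rows.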

If data processing speed is your main concern, my advice is to skip R. At present, there seems to be little hope of making R faster.

In fact, data processing rarely involves mathematical or statistical computing; it is mostly validation, aggregation, filtering and similar operations, where R scripts can be replaced by many other tools. Perl, Python and esProc can be dozens of times faster than R for this kind of work. esProc in particular offers a far more powerful data object than the data frame, and also supports cursor-style access to external files and a simple parallel computing mechanism. It is indeed the best choice.