esProc, A Script Language for Data Analytics with Parallel Mechanism: Comparison Between esProc’s Sequence Table Object and R’s Data Frame (II)

Comparison Between esProc’s Sequence Table Object and R’s Data Frame (I)

Actual case

In this part we use a real case for a comprehensive comparison of the data frame and the sequence table.
Computation target: based on daily transaction data, select the blue-chip stocks whose prices rose for 5 days in a row.

Ideas: import the data; keep only the previous month's data; group the records by ticker; sort each group by date; compute the change in closing price over the previous day; count the number of consecutive days of positive growth; select the stocks that rose for 5 or more days in a row.

Sequence Table Solution:

Data frame Solution:

library(gdata) # provides read.xls for Excel files
A1 <- read.xls("e:\\data\\all.xlsx") # import data
A1$Date <- as.Date(A1$Date) # parse the dates once
A2 <- subset(A1, Date >= as.Date("2012-06-01") & Date <= as.Date("2012-06-30")) # filter by date
A3 <- split(A2, A2$Code) # group by Code
A8 <- list()
for (i in seq_along(A3)) {
  g <- A3[[i]]
  if (nrow(g) == 0) next # skip codes with no rows in the period
  g <- g[order(g$Date), ] # sort by Date in each group (the result must be assigned)
  g$INC <- g$Close - c(0, g$Close[-nrow(g)]) # add a column: price increase; the first row is compared with 0
  g$CID <- 1 # add a column: continuous days of increase; the first row starts at 1
  for (j in seq_len(nrow(g))[-1]) {
    if (g$INC[j] > 0) {
      g$CID[j] <- g$CID[j - 1] + 1
    } else {
      g$CID[j] <- 0
    }
  }
  if (max(g$CID) >= 5) { # the stock's maximum CID is at least 5
    A8[[length(A8) + 1]] <- g
  }
}
A9 <- lapply(A8, function(x) x$Code[[1]]) # finally, the stock codes
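The streak-counting step can be tried without the Excel file. The following is a minimal sketch, not the author's code: `rising_streak` is a hypothetical helper name and the prices are made up. It follows the same convention as the listing above, where the first day is compared with 0 and therefore starts the count at 1.

```r
# Hypothetical helper: length of the rising streak that ends on each day.
rising_streak <- function(close) {
  # Change over the previous day; the first day is compared with 0.
  inc <- close - c(0, close[-length(close)])
  cid <- numeric(length(close))
  for (j in seq_along(close)) {
    if (j == 1) {
      cid[j] <- 1                 # the first day starts the count
    } else if (inc[j] > 0) {
      cid[j] <- cid[j - 1] + 1    # price rose: extend the streak
    } else {
      cid[j] <- 0                 # price fell or stayed flat: reset
    }
  }
  cid
}

toy_close <- c(10, 11, 12, 13, 14, 13, 15)  # made-up closing prices
streak <- rising_streak(toy_close)
print(streak)                 # 1 2 3 4 5 0 1
print(max(streak) >= 5)       # TRUE: this toy stock would be selected
```

Under this convention a maximum of 5 means four consecutive rises after the first recorded day of the period.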

Comparison:

1. The data frame's function set is not rich enough and is not specialized for this kind of computation. In this case we need nested loops to meet the requirement, which is computationally inefficient. The sequence table has rich and diverse functions, so the same goal can be reached without loop statements. The code is shorter and simpler, and the performance is higher.

2. Code written against the data frame is obscure and hard to write. With the sequence table, the code is clear and easy to understand, so the learning cost is lower.

3. When a large amount of data is involved in this scenario, memory consumption is huge. The sequence table computes by reference, which consumes less memory. The data frame computes by passing values, so its memory consumption is several times that of the sequence table, and it can easily run out of memory in this scenario.

4. To import Excel data into a data frame, R requires third-party packages, and they seem to have difficulty working together: the data import takes ten minutes to complete. With the sequence table it takes only tens of seconds.
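The memory point in item 3 can be observed on the R side with a small experiment. This is only a sketch on made-up data: it shows that `subset()` returns a new data frame that occupies memory of its own, and it measures nothing about esProc. The variable names and the row count are assumptions chosen for illustration.

```r
set.seed(1)
df <- data.frame(x = rnorm(1e5), y = rnorm(1e5))  # made-up data

# subset() builds a new data frame that holds its own copy of the selected rows.
pos <- subset(df, x > 0)

print(object.size(df), units = "Kb")   # size of the full data frame
print(object.size(pos), units = "Kb")  # additional memory held by the filtered copy
```

Each intermediate result kept this way (A1, A2, A3 and A8 in the solution above) adds to the total, which is one way the memory use can grow to a multiple of the input size.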

Test Performance

Test 1: Generate 10 million records in memory, each consisting of three fields whose values are all random numbers. Filter the records and sum each field.

Sequence table:

Data frame:

> library(timeDate)
> start=Sys.timeDate()
> col1=rnorm(n=10000000,mean=20000,sd=10000)
> col2=rnorm(n=10000000,mean=40000,sd=10000)
> col3=rnorm(n=10000000,mean=80000,sd=10000)
> data1=data.frame(col1,col2,col3)
> data2=subset(data1,col1>90)
> result=colSums(data2)
> print(result)
        col1         col2         col3
200844165732 390691612886 781453730448
> end=Sys.timeDate()
> print(end-start)
Time difference of 1.533333 mins

Comparison: the sequence table needs 50.534 seconds, while the data frame needs 91.999 seconds. The gap is obvious.
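For readers who want to repeat the data frame side of Test 1 without the timeDate package, the same steps can be timed with base R's `system.time()`. This is a sketch under assumptions: the row count `n` is reduced to one million so it runs quickly, and the seed is arbitrary, so the sums and the elapsed time will not match the figures above.

```r
n <- 1e6        # assumption: 10x fewer rows than the article's 10 million
set.seed(42)    # arbitrary seed, only for repeatability

elapsed <- system.time({
  data1 <- data.frame(
    col1 = rnorm(n, mean = 20000, sd = 10000),
    col2 = rnorm(n, mean = 40000, sd = 10000),
    col3 = rnorm(n, mean = 80000, sd = 10000)
  )
  data2 <- subset(data1, col1 > 90)   # filter the records
  result <- colSums(data2)            # sum each field
})

print(result)
print(elapsed[["elapsed"]])           # wall-clock seconds for the whole block
```

Setting `n <- 1e7` reproduces the article's data volume.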

Test 2: Read a 1.2 GB text file, filter the records, and sum two fields.

Sequence Table:

Data frame:

> library(timeDate)
> start=Sys.timeDate()
> data<-read.table("d:/T21.txt",sep = "\t")
> data1=subset(data,V1>90,select=c(V9,V11))
> result=colSums(data1)
> print(result)
         V9         V11
 5942982895 59484930179
> end=Sys.timeDate()
> print(end-start)
Time difference of 1.134722 hours

Comparison: the sequence table takes 87.122 seconds, while the data frame takes 1.1347 hours, a difference of tens of times. The main reason is the extremely low speed of file reading.
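Since file reading dominates the data frame's time here, it may be worth giving `read.table()` the hints described in its documentation: `colClasses` to skip type guessing and `nrows` as an upper bound on the row count. The sketch below was not part of the original test and makes no claim about how much time it saves. It uses a small temporary file as a hypothetical stand-in for d:/T21.txt, and it assumes that all 11 columns are numeric.

```r
# Hypothetical stand-in for d:/T21.txt: 1000 rows, 11 numeric columns, tab-separated.
set.seed(1)
tmp <- tempfile(fileext = ".txt")
toy <- as.data.frame(matrix(round(runif(11 * 1000, 0, 200)), ncol = 11))
write.table(toy, tmp, sep = "\t", row.names = FALSE, col.names = FALSE)

# colClasses tells read.table the column types up front;
# nrows gives it the maximum number of rows to read.
data <- read.table(tmp, sep = "\t", colClasses = "numeric", nrows = 1000)

data1 <- subset(data, V1 > 90, select = c(V9, V11))  # same filter as Test 2
result <- colSums(data1)
print(result)

unlink(tmp)  # remove the temporary file
```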

From the above comparison, we can see that the sequence table is better than the data frame in terms of feature richness, ease of syntax, memory consumption, development effort, library function performance and code performance. Of course, the data frame is not the full strength of the R language. R has powerful vector and matrix types and a mass of associated functions, which make it more professional than esProc in scientific and engineering computation.

July 16, 2014
