Actual case
In this part we use a real case for a comprehensive comparison of data frame and sequence table.
Computation target: from daily transaction data, select the blue-chip stocks whose prices rise for 5 days in a row.
Ideas: import the data; keep only the previous month's data; group it by ticker; sort each group by date; compute the increase in closing price over the previous day; compute the number of consecutive days of positive growth; select the stocks that rise for 5 or more days in a row.
Data frame Solution:
library(gdata) # provides read.xls for importing Excel files
A1 <- read.xls("e:\\data\\all.xlsx") # import data
A2 <- subset(A1, as.POSIXlt(Date) >= as.POSIXlt('2012-06-01') & as.POSIXlt(Date) <= as.POSIXlt('2012-06-30')) # filter by date
A3 <- split(A2, A2$Code) # group by Code
A8 <- list()
for (i in seq_along(A3)) {
  g <- A3[[i]]
  if (nrow(g) == 0) next # skip empty groups (unused factor levels)
  g <- g[order(as.Date(g$Date)), ] # sort by Date in each group (assumes yyyy-mm-dd dates)
  g$INC <- with(g, Close - c(0, Close[-length(Close)])) # add a column, increased price
  g$CID <- 0 # add a column, continuous increased days
  g$CID[1] <- 1 # the first day starts the count at 1
  for (j in seq_len(nrow(g))[-1]) { # rows 2..n; empty when the group has a single row
    if (g$INC[j] > 0) {
      g$CID[j] <- g$CID[j - 1] + 1
    } else {
      g$CID[j] <- 0
    }
  }
  A3[[i]] <- g # store the sorted group with its new columns
  if (max(g$CID) >= 5) { # the longest rising streak is at least 5 days
    A8[[length(A8) + 1]] <- g
  }
}
A9 <- lapply(A8, function(x) x$Code[[1]]) # finally, stock code
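To see what the INC and CID columns look like, the counting step can be tried on a single stock. This is a minimal sketch with made-up closing prices (the vector `close_px` is hypothetical, not taken from the Excel file); it follows the same convention as the solution, where the first day is compared against 0 and starts the count at 1.

```r
# Hypothetical closing prices for one stock, already sorted by date.
close_px <- c(10.0, 10.2, 10.5, 10.4, 10.6, 10.9, 11.0, 11.3, 11.8)

# Day-over-day change; the first day is compared against 0, as in the solution.
inc <- close_px - c(0, close_px[-length(close_px)])

# Count consecutive rising days, resetting to 0 on any non-positive change.
cid <- numeric(length(close_px))
cid[1] <- 1
for (j in seq_along(close_px)[-1]) {
  cid[j] <- if (inc[j] > 0) cid[j - 1] + 1 else 0
}

print(cid)

# The stock qualifies when its longest streak reaches 5.
qualifies <- max(cid) >= 5
print(qualifies)
```

With these prices the counter resets on the fourth day, where the price falls, and reaches 5 on the last day, so this stock would be selected.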
Comparison:
1. Data frame functions are not rich enough and lack specialization for this kind of computation. Nested loops are needed to meet the requirement in this case, and the computational efficiency is low. Sequence table has rich and diverse functions and can achieve the same purpose without loop statements. The code is shorter and simpler, and the performance is higher.
2. Code written for data frame is obscure and hard to write. With sequence table, the code is clear and easy to understand, so the learning cost is lower.
3. When a large amount of data is involved in this scenario, memory consumption is huge. Sequence table computes by reference, which consumes less memory. Data frame computes by value, so its memory consumption is several times that of sequence table, and it easily runs out of memory in this scenario.
4. To import Excel data into a data frame, R requires third-party packages, and they seem to have difficulty working together: the data import takes ten minutes to complete. With sequence table it takes only tens of seconds.
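The by-value behavior on the R side in point 3 can be observed at a small scale. The following is a minimal sketch, assuming a reduced row count `n` and a fixed seed (both chosen here for illustration). It only shows that `subset` returns a second data frame that occupies memory alongside the original; it does not measure sequence table.

```r
n <- 100000  # assumed small size for a quick check
set.seed(1)  # assumed seed, only to make the run repeatable

data1 <- data.frame(col1 = rnorm(n, mean = 20000, sd = 10000),
                    col2 = rnorm(n, mean = 40000, sd = 10000),
                    col3 = rnorm(n, mean = 80000, sd = 10000))

# Filtering builds a new data frame; the original stays in memory as well.
data2 <- subset(data1, col1 > 90)

size1 <- as.numeric(object.size(data1))
size2 <- as.numeric(object.size(data2))

cat("original data frame:", size1, "bytes\n")
cat("filtered data frame:", size2, "bytes\n")
cat("held at the same time:", size1 + size2, "bytes\n")
```

Because almost every row passes the filter here, the filtered result is nearly as large as the source, so the two together take about twice the memory of the data itself.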
Test Performance
Test 1: Generate 10 million records in memory, each consisting of three fields whose values are random numbers. Filter the records, then sum each field.
Data frame:
> library(timeDate)
> start=Sys.timeDate()
> col1=rnorm(n=10000000,mean=20000,sd=10000)
> col2=rnorm(n=10000000,mean=40000,sd=10000)
> col3=rnorm(n=10000000,mean=80000,sd=10000)
> data1=data.frame(col1,col2,col3)
> data2=subset(data1,col1>90)
> result=colSums(data2)
> print(result)
col1 col2 col3
200844165732 390691612886 781453730448
> end=Sys.timeDate()
> print(end-start)
Time difference of 1.533333 mins
Comparison: sequence table needs 50.534 seconds, while data frame needs 91.999 seconds, about 1.8 times as long.
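The data frame side of Test 1 can be re-run at a smaller scale with base R only. This is a minimal sketch, assuming a reduced record count `n` and a fixed seed (neither is part of the original test), and using `system.time` from base R in place of the timeDate package. The timings it prints depend on the machine and are not comparable to the figures above.

```r
n <- 1000000  # assumed reduced size; the test above uses 10000000
set.seed(42)  # assumed seed, only to make the run repeatable

elapsed <- system.time({
  # Generate three random fields, as in Test 1.
  data1 <- data.frame(col1 = rnorm(n, mean = 20000, sd = 10000),
                      col2 = rnorm(n, mean = 40000, sd = 10000),
                      col3 = rnorm(n, mean = 80000, sd = 10000))
  # Filter the records, then sum each field.
  data2 <- subset(data1, col1 > 90)
  result <- colSums(data2)
})["elapsed"]

print(result)
cat("elapsed seconds:", elapsed, "\n")
```

Wrapping all three steps in one `system.time` call measures generation, filtering and summing together, which matches what the start and end timestamps in the transcript cover.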
Test 2: Read a 1.2 GB txt file, filter the records, and sum two fields.
Data frame:
> library(timeDate)
> start=Sys.timeDate()
> data<-read.table("d:/T21.txt",sep = "\t")
> data1=subset(data,V1>90,select=c(V9,V11))
> result=colSums(data1)
> print(result)
V9 V11
5942982895 59484930179
> end=Sys.timeDate()
> print(end-start)
Time difference of 1.134722 hours
Comparison: sequence table takes 87.122 seconds, while data frame takes 1.1347 hours (about 4,085 seconds), roughly 47 times as long. The main reason is the extremely low speed of file reading.
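The size of the gap in both tests follows directly from the reported timings once they are converted to seconds. A short sketch of that arithmetic, using only the figures printed above (the variable names are chosen here for illustration):

```r
# Reported timings, converted to seconds.
test1_df_sec  <- 1.533333 * 60    # Test 1, data frame: 1.533333 mins
test1_seq_sec <- 50.534           # Test 1, sequence table
test2_df_sec  <- 1.134722 * 3600  # Test 2, data frame: 1.134722 hours
test2_seq_sec <- 87.122           # Test 2, sequence table

# How many times longer the data frame run took in each test.
ratio1 <- test1_df_sec / test1_seq_sec
ratio2 <- test2_df_sec / test2_seq_sec

cat(sprintf("Test 1: data frame takes %.2f times as long\n", ratio1))
cat(sprintf("Test 2: data frame takes %.1f times as long\n", ratio2))
```

The in-memory test differs by less than a factor of two, while the file-based test differs by a factor of about 47, which is consistent with file reading being the dominant cost in Test 2.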
From the above comparison, we can see that sequence table is better than data frame in terms of feature richness, ease of syntax, memory consumption, development effort, library function performance, and coding performance. Of course, data frame is not the full strength of the R language. R has powerful vector and matrix types with a large number of associated functions, which make it more professional than esProc in scientific and engineering computation.