esProc, A Script Language for Data Analytics with Parallel Mechanism: Comparison of Loop Function in esProc and R Language

A loop function traverses every member of an array or a set, so a complicated loop statement can be expressed as a simple function call, which reduces the amount of code and improves readability. Both esProc and R language support loop functions. The following compares their similarities and differences in usage.

1. Generating data

Generate odd numbers between 1 and 10.

esProc:

x=to(1,10).step(2)

In the code, to(1,10) generates consecutive integers from 1 to 10. The step function then picks members non-consecutively from that result, taking every second member, and the final result is [1,3,5,7,9]. This type of data in esProc is called a sequence.

The code has a simpler version: x=10.step(2).

R language:

x<-seq(from=1,to=10,by=2)

This piece of code generates the non-consecutive integers between 1 and 10 directly. The computed result is c(1,3,5,7,9). This type of data in R language is called a vector.

A simpler version of this piece of code is x<-seq(1,10,2).
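As a quick check of the R expression above, the following minimal sketch prints the generated vector and then reuses the same kind of odd-number sequence as an index. The character vector words is a hypothetical example, not data from this article.

```r
# Odd numbers between 1 and 10, generated directly by seq()
x <- seq(from = 1, to = 10, by = 2)
print(x)  # 1 3 5 7 9

# Hypothetical character vector, used only for illustration
words <- c("a", "bc", "def", "ghij", "klmno")

# Members in odd positions: index the vector with an odd-number sequence
odd_members <- words[seq(1, length(words), by = 2)]
print(odd_members)  # "a" "def" "klmno"
```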

Comparison:

1. Both can solve the problem in this example. esProc needs two steps (generate all the integers, then pick members), which in theory means poorer performance, while R language resolves it in a single step, which in theory performs better.

2. esProc's approach is to get members from a set according to their sequence numbers, which is a general method. For example, given a string sequence A1=["a", "bc", "def"……], get the strings in odd positions. There is no need to change the way the code is written; the code is x=A1.step(2).

R language generates the data directly, so it has better performance. It can write general expressions, too. For example, to get the strings in odd positions from the string vector A1=c("a", "bc", "def"……), the expression in R language can be x=A1[seq(1,length(A1),2)].

3. The esProc loop function has features that R language does not: built-in loop variables and operators. “~” represents the loop variable, “#” represents the loop count, “[]” represents a relative position and “{}” represents a relative interval. With these variables and operators, esProc can produce concise, general expressions. For example, compute the square of each member of the set A2=[2,3,4,5,6]:

A2.(~*~) / Result is [4,9,16,25,36]. It can also be written as A2**A2, but that form is less intuitive and less general. R language would typically express this as A2*A2.

Get the first three members:

A2.select(#<=3) / Result is [2,3,4]

Get each member’s previous member and create a new set:

A2.(~[-1]) / Result is [null,2,3,4,5]

Growth rate:

A2.((~ - ~[-1])/ ~[-1]) /Result is [null,0.5,0.33333333333,0.25,0.2]

Moving average:

A2.(~{-1,1}.avg()) /Result is [2.5, 3.0, 4.0, 5.0, 5.5]
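For comparison, here is a minimal sketch of how the same five results could be produced in base R. It is one possible translation, not the only one; the variable names are chosen for illustration, and NA stands in for the null that esProc returns.

```r
A2 <- c(2, 3, 4, 5, 6)
n <- length(A2)

squares <- sapply(A2, function(m) m * m)   # like A2.(~*~)
first3  <- A2[seq_along(A2) <= 3]          # like A2.select(#<=3)
prev    <- c(NA, A2[-n])                   # like A2.(~[-1])
growth  <- (A2 - prev) / prev              # like A2.((~ - ~[-1])/ ~[-1])

# like A2.(~{-1,1}.avg()): the window shrinks at both ends of the vector
mov_avg <- sapply(seq_along(A2), function(i) {
  mean(A2[max(1, i - 1):min(n, i + 1)])
})

print(mov_avg)  # 2.5 3.0 4.0 5.0 5.5
```

The index i has to be carried explicitly through seq_along, which is the role that “#” and the relative operators play implicitly in esProc.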

Summary:

In this example, R language generates data directly and can still write general expressions, which shows that it is more flexible and uses less memory than esProc.

2. Filtering records

The objects a loop function computes on can be an array or a set whose members are single values, or a two-dimensional structured data object whose members are records. In fact, loop functions are mainly used to process the latter. For example, select from sales, the order records, the orders dated on or after January 1, 2010 whose amount is greater than 2,000.
Note: sales originates from a text file. Some of its data are as follows:

esProc:

sales.select(ORDERDATE>=date("2010-01-01") && AMOUNT>2000)

Some of the results are:

R language:

sales[as.POSIXlt(sales$ORDERDATE)>=as.POSIXlt("2010-01-01") &sales$AMOUNT>2000,]

Some of the results are:

Comparison:

1. Both esProc and R language can realize this function. The difference is that esProc uses the select loop function while R language indexes directly with a logical condition, but there is no essential distinction between them. In addition, R language can further simplify the expression with the attach function. Once attach(sales) has been called, the column names can be used on their own:

sales[as.POSIXlt(ORDERDATE)>=as.POSIXlt("2010-01-01") & AMOUNT>2000,]

Thus, there are more similarities between them.

2. Besides querying, loop functions can be used to find sequence numbers, sort, rank, find the Top N, group and summarize, etc. For example, find the sequence numbers of records.

sales.pselect@a(ORDERDATE>=date("2010-01-01") && AMOUNT>2000) /esProc

which(as.POSIXlt(sales$ORDERDATE)>=as.POSIXlt("2010-01-01") &sales$AMOUNT>2000) #R language

For example, sort records by SELLERID in ascending order and by AMOUNT in descending order.

sales.sort(SELLERID,AMOUNT:-1) /esProc

sales[order(sales$SELLERID,-sales$AMOUNT),] #R language

For example, seek the top three records by AMOUNT.

sales.top(-AMOUNT;3) /esProc

head(sales[order(-sales$AMOUNT),],n=3) #R language

3. Sometimes R language computes with an index, as in filtering; sometimes with a function, as in finding the sequence numbers of records; sometimes it programs in the form of “data set + function + data set”, as in sorting; and at other times in the form of “function + data set + function”, as in finding the Top N. This programming method seems flexible but is liable to confuse programmers. By comparison, esProc always adopts the object-style form “data set + function + function …”. The method has a simple and uniform structure and is easy for programmers to grasp.

Here is an example of performing continuous computations: filter records and then find the Top N. esProc computes like this:

sales.select(ORDERDATE>=date("2010-01-01") && AMOUNT>2000).top(-AMOUNT;3)

And R language will compute in this way:

Mid<-sales[as.POSIXlt(sales$ORDERDATE)>=as.POSIXlt("2010-01-01") &sales$AMOUNT>2000,]

head(Mid[order(-Mid$AMOUNT),],n=3)

As you can see, esProc is better at programming multi-step continuous computations.
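The R operations in this section can be tried on a small table. The sketch below is only an illustration: the data frame is hypothetical (the column names follow the article, the values and the ORDERID column are invented), and as.Date is swapped in for as.POSIXlt because the sample dates carry no time part.

```r
# Hypothetical sample of the sales table; values are invented for illustration
sales <- data.frame(
  ORDERID   = 1:6,
  CLIENT    = c("A", "B", "A", "C", "B", "C"),
  SELLERID  = c(2, 1, 2, 1, 3, 3),
  AMOUNT    = c(2500, 1800, 3200, 2100, 900, 4000),
  ORDERDATE = c("2009-12-30", "2010-01-05", "2010-02-11",
                "2010-03-01", "2010-04-17", "2010-05-20"),
  stringsAsFactors = FALSE
)

# Logical condition shared by the filtering and the position query
cond <- as.Date(sales$ORDERDATE) >= as.Date("2010-01-01") & sales$AMOUNT > 2000

filtered  <- sales[cond, ]                                  # filter records
positions <- which(cond)                                    # sequence numbers of matches
sorted    <- sales[order(sales$SELLERID, -sales$AMOUNT), ]  # SELLERID ascending, AMOUNT descending
top3      <- head(sales[order(-sales$AMOUNT), ], n = 3)     # top three records by AMOUNT

# Continuous computation: filter first, then take the top three of the filtered rows
top3_filtered <- head(filtered[order(-filtered$AMOUNT), ], n = 3)

print(positions)               # 3 4 6
print(top3_filtered$ORDERID)   # 6 3 4
```

Storing the condition once in cond avoids repeating the long expression, though the intermediate variable filtered is still needed for the two-step query.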

Summary: In this example, esProc gains the upper hand in syntax consistency and in performing continuous computations, and is more beginner-friendly.

3. Grouping and summarizing

The loop function is often employed in grouping and summarizing records. For example, group by CLIENT and SELLERID, and then sum up AMOUNT and seek the maximum value.

esProc:

sales.groups(CLIENT,SELLERID;sum(AMOUNT),max(AMOUNT))

Some of the results are as follows:

R language:

result1<-aggregate(sales[,4],sales[c(3,2)],sum)

result2<-aggregate(sales[,4],sales[c(3,2)],max)

result<-cbind(result1,result2[,3])

Some of the results are as follows:

Comparison:

1. In this case, more than one summarizing method is required. esProc can complete the task in one step. The R code here goes through two steps to compute the sum and the maximum value and finally combines the results with cbind, because aggregate, as used here, applies one summarizing function per call. Besides, R language uses more memory in completing the task this way.

2. Another issue is a design in R language that reads as illogical. In sales[c(3,2)], the grouping order in the code puts SELLERID ahead of CLIENT, whereas the business requirement states them in the opposite order. In the result, the order changes again and becomes the same as that in the code. In short, there is no unified standard across the business logic, the code and the computed result.
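The two-pass approach can be tried on a small table. The sketch below uses a hypothetical data frame (values invented) and addresses columns by name through aggregate's formula interface rather than by position, which is a substitution for readability and not the article's code. As an aside, it also shows that aggregate accepts a function returning several values; the result then holds a matrix column, which is less convenient to work with.

```r
# Hypothetical sample data; only the column names follow the article
sales <- data.frame(
  CLIENT   = c("A", "A", "B", "B", "A"),
  SELLERID = c(1, 1, 1, 2, 2),
  AMOUNT   = c(100, 300, 200, 500, 50),
  stringsAsFactors = FALSE
)

# Two passes over the data, then combine the results column-wise
sums   <- aggregate(AMOUNT ~ CLIENT + SELLERID, data = sales, FUN = sum)
maxes  <- aggregate(AMOUNT ~ CLIENT + SELLERID, data = sales, FUN = max)
result <- cbind(sums, MAX = maxes$AMOUNT)
names(result)[names(result) == "AMOUNT"] <- "SUM"

# Single pass: FUN returns a named vector, so AMOUNT becomes a two-column matrix
both <- aggregate(AMOUNT ~ CLIENT + SELLERID, data = sales,
                  FUN = function(a) c(SUM = sum(a), MAX = max(a)))

print(result)
```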

Summary: In this example, esProc has the advantages of high efficiency, small memory usage and having a unified standard.

4. Seeking quadratic sum

Use a loop function to seek the quadratic sum (the sum of squares) of the set v=[2,3,4,5].

Please note that both esProc and R language have functions to seek quadratic sum, but a loop function will be used here to perform this task.

esProc:

v.loops(~~+~*~;0)

R language:

Reduce(function(x,y) x+y*y, c(0,v))
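To see the intermediate values, Reduce can be asked to keep every step. This minimal sketch passes the initial value through the init argument instead of prepending 0 to the vector; the function name add_square is chosen for illustration.

```r
v <- c(2, 3, 4, 5)

# acc is the result of the previous step, member is the current member of v
add_square <- function(acc, member) acc + member * member

# init = 0 plays the role of the leading 0 in c(0, v)
total <- Reduce(add_square, v, 0)

# accumulate = TRUE keeps every intermediate result, starting with the initial value
steps <- Reduce(add_square, v, 0, accumulate = TRUE)

print(total)  # 54
print(steps)  # 0 4 13 29 54
```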

Comparison:

1. Both esProc and R language can realize this function easily.

2. With the loops function, esProc sets zero as the initial value, computes every member of v in order and returns the final result. In the code, "~" represents the member being computed and "~~" represents the computed result of the previous step. For example, the arithmetic in the first step is 0+2*2, that in the second step is 4+3*3, and so forth. The final result is 54.

With the Reduce function, R language computes the members of [0,2,3,4,5] in order and passes the computed result of the current step into the next one to go on with the computation. As in esProc, the arithmetic in the first step is 0+2*2, that in the second step is 4+3*3, and so forth.

3. R language employs a lambda expression to perform the operation. This is a way of programming with anonymous functions, which can be executed directly without giving the function a name. In this example, function(x,y), the signature, defines two parameters; x+y*y, the body, performs the operation; c(0,v) combines 0 and v into [0,2,3,4,5], in which every member takes part in the operation in order. Because a complete function can be passed in, this programming method is quite flexible and can perform operations containing complicated functions.

The esProc programming method can be regarded as an implicit lambda expression, which is essentially the same as the explicit expression in R language. But it is a bare expression without function name, signature or declared variables, so its structure is simpler. In this example, "~" is the built-in loop variable, which need not be defined; ~~+~*~ is the expression that performs the operation; v is the fixed parameter in which every member takes part in the operation in order. Since a function cannot be passed in, it is theoretically not as good as R language in flexibility and expressive power.

4. Although less flexible in theory, the esProc programming method offers convenient built-in variables and operators, like ~, ~~, #, [], {}, etc., and is more expressive in practical use. For example, esProc uses “~~” to directly represent the computed result of the previous step, while R language needs the Reduce function and extra variables to do this. esProc can use “#” to directly represent the current loop number, which is difficult to do in R language. Also, esProc can use “[]” to represent a relative position. For example, ~[1] represents the value of the next member and Close[-1] represents the value of the field Close in the previous record.

In addition, esProc can use “{}” to represent a relative interval. For example, {-1,1} represents the three members from the previous member to the next member. Therefore, the general expression v.(~{-1,1}.avg()) can be used to compute a moving average, while R language needs specific functions to do this. For example, the expression filter(v/3, rep(1, 3), sides = 1) contains no function that even suggests “average”, which makes it difficult for beginners to understand.
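The behavior of the filter expression is easier to judge when it is run. The following sketch compares three ways to compute a three-point moving average of v in base R; the variable names are illustrative. Note that sides = 1 gives a trailing window, sides = 2 gives a centered window, and in both cases positions without a full window come back as NA, unlike the shrinking window of ~{-1,1}.

```r
v <- c(2, 3, 4, 5)
n <- length(v)

# Trailing window (current and two previous members), as in the quoted expression
trailing <- as.numeric(stats::filter(v / 3, rep(1, 3), sides = 1))

# Centered window (previous, current, next); the two ends have no full window
centered <- as.numeric(stats::filter(v, rep(1 / 3, 3), sides = 2))

# Explicit version whose window shrinks at the ends, like v.(~{-1,1}.avg())
shrinking <- sapply(seq_along(v), function(i) {
  mean(v[max(1, i - 1):min(n, i + 1)])
})

print(trailing)   # NA NA  3  4
print(centered)   # NA  3  4 NA
print(shrinking)  # 2.5 3.0 4.0 4.5
```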

Summary: In this case, the lambda expression in R language is more powerful in theory but is a little difficult to understand. By comparison, esProc programming method is easier to understand.

5. Inter-row and inter-group operations

Here is a table stock containing daily trade data of multiple stocks. Compute the daily increase in the closing price of each stock.

Some of the original data are as follows:

esProc:

A10=stock.group(Code)

A11=A10.(~.sort(Date))

A12=A11.(~.derive((Close-Close[-1]):INC))

R language:

A10<-split(stock, stock$Code)

for(i in seq_along(A10)){

A10[[i]]<-A10[[i]][order(as.numeric(A10[[i]]$Date)),] #sort by Date in each group and keep the sorted result

A10[[i]]$INC<-with(A10[[i]], Close-c(0,Close[-length(Close)])) #add a column, increased price

}
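The same computation can also be written with lapply instead of an explicit for loop. The sketch below is one possible variant, run on a hypothetical stock table with invented values. It differs from the loop above in one assumption: the first day of each stock gets NA rather than being compared with 0, which mirrors the null that esProc returns for Close[-1] on the first record.

```r
# Hypothetical trade data for two stocks; values are invented
stock <- data.frame(
  Code  = c("X", "Y", "X", "Y", "X"),
  Date  = as.Date(c("2014-01-03", "2014-01-02", "2014-01-02",
                    "2014-01-03", "2014-01-06")),
  Close = c(10.5, 20, 10, 19, 11),
  stringsAsFactors = FALSE
)

by_code <- lapply(split(stock, stock$Code), function(g) {
  g <- g[order(g$Date), ]          # sort by Date within the group
  g$INC <- c(NA, diff(g$Close))    # increase over the previous day, NA on the first day
  g
})

print(by_code$X$INC)  # NA 0.5 0.5
print(by_code$Y$INC)  # NA -1
```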

Comparison:

1. Both esProc and R language can achieve the task. esProc uses only loop functions in the computation, achieving high performance and concise code. R language requires writing the loop manually with a for statement, which results in poorer performance and readability.

2. To complete the task, two layers of loop are required: loop over each stock, and then loop over each record of the stock. Although good at expressing the innermost loop, the loop function of R language (including lambda syntax) has no built-in loop variables and is hard to use for multi-layer loops. Even if the code can be worked out, it is hard to read.

The loop function of esProc can not only use “~” to represent the loop variable but can also be nested, so it is well suited to expressing multi-layer loops. For example, A10.(~.sort(Date)) in the code is in fact the abbreviation of A10.(~.sort(~.Date)). The first “~” represents the current stock, and the second "~" represents the current record of this stock.

3. This is a typical ordered operation: the closing price of the previous day must be subtracted from the current closing price. With the useful built-in variables and operators, such as #, [] and {}, esProc expresses this type of ordered operation easily. For example, Close-Close[-1] represents the increase. R language can also perform the ordered operation, but its syntax is much more complicated because it lacks facilities like loop number, relative position and relative interval. For example, the expression for the increase is Close-c(0,Close[-length(Close)]).

It is hard enough for the loop function in R language to perform the relatively simple ordered operation in this example, let alone more complicated operations. In those cases, a multi-layer for loop is usually needed. For example, find out how many consecutive days each stock has been rising:

A10<-split(stock, stock$Code)

for(i in seq_along(A10)){

A10[[i]]<-A10[[i]][order(as.numeric(A10[[i]]$Date)),] #sort by Date in each group and keep the sorted result

A10[[i]]$INC<-with(A10[[i]], Close-c(0,Close[-length(Close)])) #add a column, increased price

if(nrow(A10[[i]])>0){ #add a column, continuous increased days

A10[[i]]$CID<-1 #the first record starts the count at 1

for(j in seq_len(nrow(A10[[i]]))[-1]){ #rows 2..n, empty when the group has a single row

if(A10[[i]]$INC[j]>0){

A10[[i]]$CID[j]<-A10[[i]]$CID[j-1]+1

}else{

A10[[i]]$CID[j]<-0

}

}

}

}

The code in esProc is still concise and easy to understand:

A10=stock.group(Code)

A11=A10.(~.sort(Date))

A12=A11.(~.derive((Close-Close[-1]):INC, if(INC>0,CID=CID[-1]+1, 0):CID))
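For reference, the counting logic can be isolated in a small helper so that the per-stock loop stays short. This is a sketch under stated assumptions: rising_days is a hypothetical helper name, the sample data are invented, and the first day of each stock is counted as 0 (as the esProc expression would yield), not 1 as in the R loop above.

```r
# Count consecutive rising days for one stock's closing prices, already sorted by date.
# Assumption: the first day counts as 0 because there is no previous close.
rising_days <- function(close) {
  cid <- integer(length(close))            # starts as all zeros
  for (j in seq_along(close)[-1]) {        # rows 2..n, empty for 0 or 1 rows
    if (close[j] > close[j - 1]) {
      cid[j] <- cid[j - 1] + 1L            # extend the current run
    }                                      # otherwise the count stays 0
  }
  cid
}

# Hypothetical trade data for two stocks; values are invented
stock <- data.frame(
  Code  = c("X", "X", "X", "X", "Y", "Y"),
  Date  = as.Date(c("2014-01-02", "2014-01-03", "2014-01-06", "2014-01-07",
                    "2014-01-02", "2014-01-03")),
  Close = c(10, 11, 12, 11, 20, 19),
  stringsAsFactors = FALSE
)

by_code <- lapply(split(stock, stock$Code), function(g) {
  g <- g[order(g$Date), ]           # sort by Date within the group
  g$CID <- rising_days(g$Close)     # consecutive rising days per record
  g
})

print(by_code$X$CID)  # 0 1 2 0
print(by_code$Y$CID)  # 0 0
```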

Summary: In performing multi-layer loops or inter-row and inter-group operations, esProc loop function has higher computational performance and more concise code.

August 14, 2014
