November 15, 2012

Computation after Grouping: R Language vs. esProc

Original post:  http://it.toolbox.com/blogs/data-analytics/computation-after-grouping-r-language-vs-esproc-53756

For SQL, the grouping and summarizing actions are inseparable and must be performed at the same time, which compromises its ability to analyze interactively. By comparison, for R language and esProc, the users can group first and then decide whether to summarize or carry out any more complex computation. For example, without summarizing, R or esProc users can perform the inter-row computation within the group. They can select one of the groups to regroup after studying the values summarized.


Some of you smart readers may have discovered that the latter one is just the OLAP drilling indeed. Well, we will discuss it in details in another essay. Now, let's focus on the subject of this essay and check the respective characteristics of R
and esProc.


Please find the Orders table from the typical Northwind database, as given in the blow example.


Example 1: Group by year without summarizing.


R solution: orderByYear<-split data-blogger-escaped-br="br" data-blogger-escaped-format="format" data-blogger-escaped-rderdate="rderdate" data-blogger-escaped-result="result">
esProc solution: A2=A1.group(year(OrderDate))


Comments: Regarding the basic computation functionalities performed in steps, both R and esProc are perfect in achieving the goal.


Example 2: On the basis of example 1, summarize the freightage of the data from each group by totaling up.

R solution: sumByYear<-mapply data-blogger-escaped-br="br" data-blogger-escaped-function="function" data-blogger-escaped-orderbyyear="orderbyyear" data-blogger-escaped-reight="reight" data-blogger-escaped-sum="sum" data-blogger-escaped-x="x">
esProc solution: A3=A2.(~.sum(Freight))


Comments: Both solutions are perfect.

Example 3: On the basis of example 1, regroup the data by month this time, and then sum up the freight.

R solution:

orderBymonth<-orderbyyear data-blogger-escaped-br="br">
for(i in seq(orderByYear)){

orderBymonth[[i]]<-aggregate data-blogger-escaped-br="br" data-blogger-escaped-format="format" data-blogger-escaped-i="i" data-blogger-escaped-list="list" data-blogger-escaped-m="m" data-blogger-escaped-orderbyyear="orderbyyear" data-blogger-escaped-rderdate="rderdate" data-blogger-escaped-reight="reight" data-blogger-escaped-sum="sum">
}

esProc solution: A4=A2.(~.group(month(OrderDate);~.sum(Freight)))


Comments: As can be seen from the above, it seems that the R solution is fairly complicated. However, it follows the same procedure of that of esProc actually. They all aim to the data of each year (the orderByYear[[i]] for R, and the ~ for esProc,) for grouping operation (the aggregate for R, and the group for esProc). The difference lies in that esProc use ".()" to represent the looping of array. While in R, the looping is represented with for/while/loop. One point worthy of notice is that there are several representations of various usages are available for R to represent the grouping action, which increases the learning difficulties quite a bit. Even the SQL of 10 years ago can beat R in this aspect. We will discuss the flexibility of syntax in a specific essay. Let's proceed with the topic to see the examples below.


Example 4: From the data in the Example 3, we can see that only the statistics of year 1997 is complete with the data of all 12 months. Let's compute the month-on-month value in the year 1997. In which, the data of year 1997 is represented as orderBymonth$"1998" in R, and it is represented as A(2) in esProc.


R solution: orderBymonth$"1997"$lrr<- data-blogger-escaped--="-" data-blogger-escaped-br="br" data-blogger-escaped-c="c" data-blogger-escaped-length="length" data-blogger-escaped-orderbymonth="orderbymonth" data-blogger-escaped-x="x">
esProc solution: A5=A4(2).derive((#1-#1[-1])/#1[-1])


Comments: In this case, the two solutions differ totally. Let's talk about esProc first, #1 represents the first field, that is, the summarized value of Freight. It corresponds to the orderBymonth$"1997"$x in R, and #[-1] represents the previous piece of data in this loop. Therefore, compared with that of the previous period, the expression is simply: (This month – Previous month)/Previous month. Such style of presentation is really simple and clear.


For R user, they can also adopt the similar practices with loops as a must. In fact, if the month-on-month comparison is to be conducted every year, esProc users will only need to replace the 2 in A4(2).derive((#1-#1[-1])/#1[-1]) with ~, while R user will have to use the nested double loops.


Considering the business expert is less skilled in IT field than that of the program developer or the IT pro, I remove the loop with a trick so as to let the business people to handle it smoothly. The actual solution of R is to move the original X down a row, acting as the computational column. Based on this new column and the original column to compute, the computation of relative position can be converted to the inter-column computation. Someone may think the solution becomes simpler at the cost of the adding difficulty of comprehension. Because the R lacks the means to represent the relative positions, the R users will have no choice. They get such a dilemma of compromising the complexity of coding or that of the understanding.


Judging from above examples, we can find that both R and esProc are highly capable for users to compute after grouping. However, esProc provides a much simpler and easier style of representation for users to understand.