August 28, 2014

Methods of Grouping and Summarizing in R Language

The operation of grouping and summarizing includes grouping one or more certain fields of two-dimensional structured data and then summarizing fields of each group. The following will introduce methods of grouping and summarizing in R language through an example. In order to make the example more typical, we’ll set two fields to be grouped and two summarizing operations.

Case description:
Please group data frame orders according to CLIENT and SELLERID, and then summate field AMOUNT and seek its maximum value respectively in each group.

Note: orders contains records of sales orders. Its source can be a database or a file. Such as orders<-read.table("sales.txt",sep="\t", header=TRUE).The first rows of data are as follows: 

Method 1: aggregate function
    result1<-aggregate(orders$AMOUNT, orders[,c("SELLERID","CLIENT")],sum)
    result2<-aggregate(orders$AMOUNT, orders[,c("SELLERID","CLIENT")],max)

Part of the computed result:

Code interpretation:
1.The name aggregate implies that it is a function specializing in grouping and summarizing. Both its input parameters and computed result are data frame and its usage is relatively simple.
2.aggregate function cannot perform multiple summarizing operations on grouped data, thus two lines of code are required to realize the operations of seeking sum and max respectively, then their results are combined using cbind. Obviously, the code is not satisfactory in performance and usability.
3. aggregate function has a strange requirement about the order of the fields to be grouped, that is, the fields must be in reversed order. In view of this, the code for grouping CLIENT first and then SELLERID should only be written as orders[,c("SELLERID","CLIENT")]. The code written according to the normal way of thinking will be incorrect instead.
4.  Not only the code is written in an unnatural way, but the computed result is weird too by putting filed SELLERID before CLIENT. In reality, the code should be improved in order to make the computed result conform to the business logic.

aggregate function manages to perform the task after a fashion. But it is not good in performance and usability because the way of coding, computed result and business logic are inconsistent with each other.

    result1<-lapply(sp,FUN=function(x) sum(x$AMOUNT))
    result2<-lapply(sp,FUN=function(x) max(x$AMOUNT))

Part of the computed result:

Code interpretation:
1.  The role of split function is to group the data frame according to specified fields. No further computation is involved. lapply function can perform the same operation on data of each group. By working with each other, split and lapply can fulfill the task.
2.  Because the grouped data can be reused, this operation performs better than that using aggregate function.
3.  As lapply function doesn’t support multiple statistical approaches, two lines of code are required too to realize the operations of seeking sum and max respectively, and then use cbind to combine the results. What’s more, this operation requires an extra split function, so instead of enhancing the usability, it reduces it.
4. The grouping order is still unnatural and the code has to be written reversely as orders[,c("SELLERID","CLIENT")].
5. The computed result needs a lot of modification which brings great inconvenience. It can be seen that the first column of the computed result is, in fact, the “SELLERID.CLIENT”. The column needs tobe split into two columns whose orders should be exchanged.

This operation improves some performance but the usability is obviously poor with inconsistency in the aspects of way of coding, business logic and the computed result.

lapply belongs to the family of apply function. Similar functions include sapply and tapply, whose usages differ on parameters. For example:
    result1<-sapply(sp,simplify=FALSE,FUN=function(x) sum(x$AMOUNT))
    result2<-sapply(sp,simplify=FALSE,FUN=function(x) max(x$AMOUNT))

tapply specializes in data frame, which, by rights, is the most suitable one for fulfilling this task. But it isn’t in fact. It applies only to the situation where a single field is required to be grouped. When it is used to group two fields together, the result will be two-dimensional matrix. This requires users to make further complicated processing. For example, the computed result of the line of code tapply(orders$AMOUNT, orders[,c("SELLERID","CLIENT")],function(x) sum(x))is as follows:

Third-party library functions
There are various disadvantages when using R's built-in functions to group and summarize. In response to the problem, we may consider using the third-party library functions, such as reshape, stack, etc. The stability and computational efficiency of these library functions is generally not as good as those of the built-in functions, and the information for their use is not many. Therefore it is difficult for them to fulfil the task. Here we won’t go into any example about their use.

Third-party languages
Python, esProc and Perl can also be employed to fulfil this task. All of them can perform grouping and summarizing as well as structured data computing as R language can. We’ll briefly introduce solutions of esProc and python.

esProc can fulfil this task by simply using groups function. Its syntax is concise and easy to understand, as well as in line with the natural way of thinking. The code is as follows:      result=orders.groups(CLIENT,SELLERID;sum(Amount),max(Amount))

The computed result, syntax and business logic are highly consistent with each other in esProc. Some of the computed results are as follows:

If python's built-in functions are used to deal with this task, the code will be rather complicated. Here pandas, the third-party function library, comes to help. pandas will first perform grouping operation using groupby function, then summarize using agg function. Its code, which is simpler than R language but not as good as esProc,is written like this:     result=orders.groupby(['CLIENT','SELLERID']).agg({'AMOUNT':[sum,max]}).

Pandas' computed result and syntax are highly consistent with the business logic. Part of the computed result is as follows: