esProc, A Script Language for Data Analytics with Parallel Mechanism: Methods of Realizing Fixed Grouping and Summarizing in R Language

In grouping and summarizing operation, if the names and numbers of the groups have been given, then we call the operation as fixed grouping and summarizing. The grouping criterion of this type of operation comes from outside of the data set, such as grouping by the client list of parameters list, or by conditions list. This type of operation involves some problems like if the grouping criterion is beyond the boundary of the data set, if extra groups are needed and if there is overlap among the data. To solve these problems requires some effort. In the following, we'll look at several methods for realizing grouping and summarizing in R language.

Case 1: Group criterion is within the data set

Data frame sales contains records of sales orders, in which column CLIENT contains clients list and column AMOUNT contains the orderamounts. Please group sales by “potential clients list” and then summate columns AMOUNT in each group. The potential clients list is [ARO,BON,CHO], which happens to be a subset of column CLIENT.

Note: Source of sales may be a database or a file, such asorders<-read.table("sales.txt",sep="\t", header=TRUE). The first rows of data are as follows:

Code:

byFac<-factor(sales$CLIENT,levels=c("ARO","BON","CHO"))

result<-aggregate(sales$AMOUNT,list(byFac),sum)

Computed result:

Code interpretation:

1.factor function generates a grouping criterion (which is called as factor in R language), and aggregate function groups and summarizes according to the grouping criterion. The structure of the code is quite clear.

2. Note that the grouping criterion is neither a vector quantity nor an array, so it cannot be written as byFac<- c("ARO","BON","CHO"); it cannot be used directly too unless it is converted into the list type. They are not easy to understand for beginners and require special attention.

3. If column CLIENT is set as the grouping criterion (that is, the unfixed grouping), only a line of code is enough:result<-aggregate(sales$AMOUNT,list(sales$CLIENT),sum).

Summary:

aggregate function can easily fulfill this task.

Case 2: Grouping criterion is beyond the boundary of the data set

It is the exceptional case that the grouping criterion is within the column data. In fact, since the grouping criterion comes from outside of the data set (like the external parameters), its members may not in the column data. In this case, we try to solve such a problem.

Supposed that the value of “potential clients list” is [ARO,BON,CHO,ZTOZ], please divide data into four groups according to “potential clients list” and summate column AMOUNT in each group. Note that client ZTOZ is not in column CLIENT.

Code similar to that of case 1:

byFac<-factor(sales$CLIENT,levels=c("ARO","BON","CHO","ZTOZ"))

result<-aggregate(sales$AMOUNT,list(byFac),sum)

The computed result of the above code:

It can be seen that there are only three groups of data instead of the four groups as required in the computed result. Apparently, the above code fails to fulfill the task and needs some improvement.

The improved code:

byFac<-factor(sales$CLIENT,levels=c("ARO","BON","CHO","ZTOZ"))

tapply(sales$AMOUNT, list(byFac),function(x) sum(x))

Computed result is:

Code interpretation:

1. The improved code is in line with the business logic, which makes all the required four groups be displayed in the result.

2. tapply is used to group and summarize. It is more universal than aggregate. But it confuses beginners most of the time as it lacks a sense of immediacy compared with aggregate.

3. That the summating value of ZTOZ is NA means that ZTOZ is not in column CLIENT. If the summating value of ZTOZ is zero, ZTOZ is in column CLIENT and order amount is zero too.

4. In this case, there are just four groups in the grouping and summarizing result. Extra clients shouldn't be here. These extra clients can be called "extra group" and the code for computing its summating value cannot be produced only through some simple modification on the basis of the current code. Instead, a new function is needed:

filtered<-sales[!is.element(sales$CLIENT,byFac),]

redundant<-sum(filtered$AMOUNT)

It is not a piece of complicated code, but the way of coding is apparently different from the previous one.

Summary:

tapply can fulfill this task easily.

Case 3: No overlap among the grouping criteria

It is also one of the fixed groupings that set some conditions as the grouping criteria. For example, divide the order amount into four intervals by 1000, 2000 and 4000. Each interval is a group of orders. Then summate the order amount of each group.

Code:

byFac<-cut(sales$AMOUNT,breaks=c(0,1000,2000,4000,Inf))

result<-tapply(sales$AMOUNT, list(byFac),function(x) sum(x))

Computed result:

Code interpretation:

cut function divide the data frame into four intervals, then tapply function groups the data frame by these intervals and summarize the result of each group.

Summary:

The cooperation of cut and tapply can easily perform the simplest conditional grouping.

Case 4: Re-compute the overlapping data in grouping criteria

The simplest conditional grouping not involves overlapping among the conditions. But in reality, it is common that conditions are overlapping. For example, group the order amount according to the following conditions:

1000 ~ 4000：standard orders r14

<2000：non-important orders r2

>3000：important orders r3

The group of standard orders overlaps with the other two groups. In this circumstance, the problem of whether we should re-compute the overlapping data arises. Let's first deal with situations which require the re-computing.

Code:

r14<-subset(sales,AMOUNT>=1000 & AMOUNT<=4000 )

r2<-subset(sales,AMOUNT<2000)

r3<-subset(sales,AMOUNT>3000 )

grouped<-list(r14=r14,r2=r2,r3=r3)

result<-lapply(grouped,FUN=function(x) sum(x$AMOUNT))

Computed result:

Note: r2 and r3 contain some of the data of r14.

Code interpretation:

1. The above code can complete this task but is too complicated. If there are more conditions, the code will become lengthy.

2. A new function lapply is used in the code. So far we have used a lot of functions, including factor, aggregate, list, tapply, cut, subset, lapply, etc, in order to perform the operation of fixed grouping. We have to use different functions and ways of coding to perform the operation of conditional grouping just because overlapping conditions are involved or not. To be honest, it is difficult to master all these methods.

3. The computed result of the above code is list, while some results of the code in the previous cases are data.frame, and some others are arrays. This inconsistency will bring trouble in practice.

Summary:

The method can complete the task, but the code is complicated and effort is needed to learn to use a lot of functions.

Case 5: Overlapping conditions and non-repeating results

In the previous case, we solved the problem when re-computing the overlapping conditions is needed. But sometimes, we need the non-repeating computed result, that is, the grouped data shouldn't appear again. In this case, the operation will be like this: r2 shouldn't contain data of r14, and r3 shouldn't contain data of r2 and r14.

Code:

r14<-subset(sales,AMOUNT>=1000 & AMOUNT<=4000 )

r2<-subset(sales,AMOUNT<2000 & !(AMOUNT>=1000 & AMOUNT<=4000))

r3<-subset(sales,AMOUNT>3000 & !((AMOUNT>=1000 & AMOUNT<=4000)) & !(AMOUNT<2000))

grouped<-list(r14=r14,r2=r2,r3=r3)

result<-lapply(grouped,FUN=function(x) sum(x$AMOUNT))

Computed result:

Note that when it is no need to re-compute the overlapping data, the value of r2 and r3 is smaller than the result computed previously.

Code interpretation:

More logical judgments are used in the above code, which further complicates the code. It thus can be imagined that the amount of code will be huge if groups are many and conditions are complicated.

Summary：

This method can complete the task but the code is complicated.

The third-party solution

We can also fulfil the task in this case using Python, esProc and Perl. All of them can perform the operation of fixed grouping and summarizing as well as the computation of structured data. The following will briefly introduce the solution in esProc.

Case 1:

byFac=["ARO","BON","CHO"]

grouped=sales.align@a(byFac, CLIENT)

grouped.new(byFac(#), ~.sum(AMOUNT))

Computed result:

Case 2:

The code is omitted because it is completely the same as that in case 1.

Computed result:

If we want to summate the extra group, a little modification will be enough:

byFac=["ARO","BON","CHO","ZTOZ"]

grouped=sales.align@a@n(byFac,CLIENT)

grouped.new((byFac|"redundant")(#), ~.sum(AMOUNT))

Modified part is marked in red. @n represents that an extra group is added to the result set. This way of coding is easy to master than that of R language.

Computed result:

Case 3:

For simple conditional grouping, esProc will only need to change the align function to enum function. The other part will remain unchanged.

byFac=["?<=1000" ,"?>1000 && ?<=2000","?>2000 && ?<=4000","?>4000"]

grouped=sales.enum(byFac,AMOUNT)

grouped.new(byFac(#),~.sum(AMOUNT))

Computed result:

esProc:

@r option is added to the previous code when re-computing the overlapping data is needed.

byFac=["?>=1000 && ?<=4000","?<2000" ,"?>3000"]

grouped=sales.enum@r(byFac,AMOUNT)

grouped.new(byFac(#),~.sum(AMOUNT))

Computed result:

esProc:

Remove the @r option when it is no need to compute the overlapping data. This is the same as the operation of simple condition grouping.

byFac=["?>=1000 && ?<=4000","?<2000" ,"?>3000"]

grouped=sales.enum(byFac,AMOUNT)

grouped.new(byFac(#),~.sum(AMOUNT))

Computed result:

It can be seen that esProc needs only two functions: align and enum, to realize all types of fixed grouping and summarizing. Its code is consistent and simple.

menu

September 1, 2014

Methods of Realizing Fixed Grouping and Summarizing in R Language

No comments:

Post a Comment