esProc, A Script Language for Data Analytics with Parallel Mechanism: Grouping Function Comparison: R Language vs. esProc

Recently, I participated in the analysis for several marketing questionnaires in a row. Various typical grouping issues are encountered during the process. To share with the interested fellow in the same business, I classified and summarized these issues.

Grouping is to allocate the samples into several groups according to a specific flag. There is a difference between groups and the relative commonness shared by group members. The grouping plays an important role in statistical analytics. For example, the type grouping is used to differentiate the types of economy, society, sciences, and other phenomena. The structural grouping is used to study the internal structure, and the analysis group is used to analyze the coexistence relation between data.

As the mainstream structured data analysis language, both R language and esProc (download) provide the rich functions of grouping. Let's use some examples to have an idea of their difference. In these cases, we will use the Orders table from Northwind database as the sample data.

Basic grouping: group by a certain column, for example, view data by employee.

R: orderByEmp<-split(result,result$EmployeeID)
esProc: A2=A1.group(EmployeeID)

Regarding the basic functions, both R and esProc implement it well. In addition, users can expand the basic grouping functions, such as, group by multi-columns, group and summarize concurrently, first group and then summarize, and continuous grouping at one level after another, and the interrow computation on data to be grouped. For details on the expanded functions/features of the basic groupings, the interested readers can refer to another essay Computation after Grouping - R language vs. esProc.

The basic grouping, in effect, can be characterized as follows: the original member will always be assigned to a certain group, and no duplicate member is allowed. This is the completely-partitioned grouping that is provided by the relation algebra (i.e. SQL). In some cases, the conditions are even more complicated. For example, the Marketing department sent a list of advertisements & regions (AdCountry by name). These regions are the location where the advertisement campaigns are intensively launched. Currently, we need to analyze the order conditions in these regions. Such conditions are characterized as follows:

The advertisement list has definitely less countries than those in the Orders table because it is the advertisement for partial countries.
The advertisement list may comprise more countries than those in the Orders table because it is quite normal for some countries having no orders.

Such type of grouping can be referred to as "incompletely-partitioned grouping". This is not supported by SQL and hard to implement. Let's check whether R and esProc can over-perform SQL in this respect:

R solution:
AdCountry<-c("USA","Finland","Canada","NotInOrders")
AdCountryFac<-factor(result$ShipCountry,levels=AdCountry)
groupByAd<-split(result,AdCountryFac)

esProc solution:
A3=["USA","Finland","Canada","NotInOrders"]
A4=A1.align(A3,ShipCountry)

Comments:
Both R and esProc can solve this problem well. The data is grouped into 4 groups. There are some data in the first 3 groups, and the last group is empty, as expected.

Let us then check the grouping on simple conditions: classify the freightage into 3 categories of 0-30, 30-100, and 100.

R solution:
Frekind<-cut(result$Freight,breaks=c(0,30,100,Inf))
orderByFei<-split(result,Frekind)

esProc solution:
A5=["?<=30", "?<=100 && ?>=30" ,"?>100"]
A6=A1.enum(A5,Freight)

Comments: Both solutions solve the problem perfectly. However, you may have noticed that the representation of esProc is much more flexible. For example, esProc users can carry out the boolean judgment on character string or data of other types. They can also compute the boolean expression on 2 fields concurrently. By comparison, it is much more complicated for R to implement the similar functions. Since R users can only perform grouping on one field of numeric type to group it into a non-overlapped range/category. The limitations are really not just a few.

Then, let us check a case of much more complicated example of grouping on conditions. For example, the freightage belongs to these 3 categories: the $5-$15 is the freightage range that is most easily to be accepted by users, the low freightage range is for those below the $50, and the high freightage range is for those higher than 50. In this case, there is some overlapped area of category 1 and category 2. Then, the record with less than $10 freightage must exist in both these 2 groups.

R solution:
subset(result,Freight>=5 & Freight<=15)->g5to15
subset(result,Freight<=50 )->g0to50
subset(result,Freight>50 )->g50toinf
gFeight<-list(g5to15=g5to15,g0to50=g0to50,g50toinf=g50toinf)

esProc solution:
A5=["?<=5 &&?>=15", "?<=50 " ,"?>50"]
A6=A1.enum@r(A5,Freight)

Comments: R does not provide the function/feature to implement the grouping on complicated conditions. In fact, such grouping is made out awkwardly. So, it by no means resembles the "grouping action". esProc solution is the same to that for the previous example. In which, @r is used to indicate that the duplicates are allowed in the groups. Such syntax style is flexible, and you can implement lots of functions or features on the basis of the limited number of functions, not having to name any new functions. I am considering the necessity to discuss the related topic in my next essay.

As can be seen from several examples above, R can be used to implement the advanced grouping function. R is much more powerful than SQL in this respect. However, it is still less superior and easy-to-use than esProc regarding the flexibility and usability.

menu

November 15, 2012

Grouping Function Comparison: R Language vs. esProc

2 comments: