esProc, A Script Language for Data Analytics with Parallel Mechanism: Grouping Function Comparison R Language vs. esProc

Grouping is to allocate the samples into several groups according to a specific flag. There is a difference between groups and the relative commonness shared by group members. The grouping plays an important role in statistical analysis. For example, the type grouping is used to differentiate the types of economy, society, sciences, and other phenomena. The structural grouping is used to study the internal structure, and the analysis group is used to analyze the coexistence relation between data.

As the data analysis language, both R language and esProc provide the rich functions of grouping. Let’s use some examples to have an idea of their difference. In these cases, we will use the Orders table from Northwind database as the sample data.

Basic grouping: group by a certain column, for example, view data by employee.

R：orderByEmp<-split(result,result$EmployeeID)

esProc：A2=A1.group(EmployeeID)

Regarding the basic functions, both R and esProc implement it well. In addition, users can expand the basic grouping functions, such as, group by multi-columns, group and summarize concurrently, first group and then summarize, and continuous grouping at one level after another, and the inter-row computation on data to be grouped.

The basic grouping, in effect, can be characterized as follows: the original member will always be assigned to a certain group, and no duplicate member is allowed. This is the completely-divided grouping that is championed by the relation algebra (i.e. SQL). In some cases, the conditions are even more complicated. For example, the Marketing department sent a list of advertisements & regions (AdCountry by name). These regions are the location where the advertisement campaigns are intensively launched. Currently, we need to analyze the order conditions in these regions. Such conditions are characterized as follows:

The advertisement list has definitely less countries than those in the Orders table because it is the advertisement for partial countries.

The advertisement list may comprise more countries than those in the Orders table because it is quite normal for some countries having no orders.

Such type of grouping can be referred to as "incompletely divided grouping". This is not supported by SQL theory and hard to implement. Let's check whether R language and esProc can over-perform SQL in this respect:

R solution:

AdCountry<-c("USA","Finland","Canada","NotInOrders")

AdCountryFac<-factor(result$ShipCountry,levels=AdCountry)

groupByAd<-split(result,AdCountryFac)

esProc solution:

A3=["USA","Finland","Canada","NotInOrders"]

A4=A1.align(A3,ShipCountry)

Comments:Both R and esProc can solve this problem well. The data is grouped into 4 groups. There are some data in the first 3 groups, and the last group is empty, as expected.

Let us then check the grouping on simple conditions: classify the freightage into 3 categories of 0-30, 30-100, and 100.

R solution:

Frekind<-cut(result$Freight,breaks=c(0,30,100,Inf))

orderByFei<-split(result,Frekind)

esProc solution:

A5=["?<=30", "?<=100&& ?>=30" ,”?>100”]

A6=A1.enum(A5,Freight)

Comments: Both solutions solve the problem perfectly. However, you may have noticed that the representation of esProc is much more flexible. For example, esProc users can carry out the Boolean judgment on character string or data of other types. They can also compute the Boolean expression on 2 fields concurrently. By comparison, it is much more complicated for R language users to implement the similar functions. Since R language users can only perform grouping on one field of numeric type to group it into a non-overlapped range/category. The limitations are really not just a few.

Then, let us check a case of much more complicated example of grouping on conditions. For example, the freightage belongs to these 3 categories: the 5$-15$ is the freightage range that is most easily to be accepted by users, the low freightage range is for those below the 50$, and the high freightage range is for those higher than 50. In this case, there is some overlapped area of category 1 and category 2. Then, the record whose freightage below 10$ must exist in both these 2 groups.

R solution:

subset(result,Freight>=5 & Freight<=15)->g5to15

subset(result,Freight<=50 )->g0to50

subset(result,Freight>50 )->g50toinf

gFeight<-list(g5to15=g5to15,g0to50=g0to50,g50toinf=g50toinf)

esProc solution:

A5=["?<=5 &&?>=15", "?<=50" ,”?>50”]

A6=A1.enum@r(A5,Freight)

Comments: R does not provide the function / feature to implement the grouping on complicated conditions. In fact, such grouping is made out awkwardly. So, it by no means resembles the "grouping action". esProc solution is the same to that for the previous example. In which, @r is used to indicate that the duplicates are allowed in the groups. Such syntax style is flexible, and you can implement lots of functions or features on the basis of the limited number of functions, not having to name any new functions.

As can be seen from several examples above, R language can be used to implement the advanced grouping function. R language is more powerful than SQL in this respect. However, it is still behind esProc regarding the flexibility and usability.

menu

July 28, 2014

Grouping Function Comparison R Language vs. esProc

No comments:

Post a Comment