August 27, 2014

Code Examples of Common In-Memory Grouping with esProc

It is convenient to realize some common in-memory grouping with esProc, such as, equal grouping, alignment grouping and enumeration grouping. They are to be illustrated with the following examples.

Equal grouping

Grouping basis of equal grouping is certain fields (or computed columns derived from fields) within a data set. Each group is a subset of original data set.
Case description: Group sales orders by the year.

Data description: Data of sales orders are shown below: 
The above data set (table sequence) can be accessed from a database or a file. For example:
A1=file("E:/sales.txt").import@t()

esProc code:
         A2=A1.group(year(OrderDate))

Computed results:
Code interpretation:

1. In this example, grouping basis comes from OrderDate. The date of sales order will be converted into the year through year(OrderDate), and data of the same year will be grouped together.

2. There may be multiple fields for grouping. For example, regroup data of different years and sellers according to year and sellers. The code is:

A1.group(year(OrderDate),SellerId)

3. Often, the grouped data are used to perform aggregation operations, such as, compute each year's sales amount according to A2. The code is:
A2.new(year(OrderDate):y,~.sum(Amount):a)

Computed results are:

Or, combine grouping and summarizing into one step with groups function:
A1.groups(year(OrderDate):y; sum(Amount):a)

Of course, sometimes we have to execute grouping and summarizing separately in order to reuse the code and improve computational efficiency. For example, filter one of the groups of A2, and perform association computation for another group. Another situation is that, if, after summarizing, data of a certain group are unusual and worth further study, then this group can be used directly to go on with the computations. It's no need to filter the group again.  

4. By default, esProc's group function will group data by using hash algorithm. But, comparing adjacent rows can have higher performance for ordered data. This can be executed by using option @o in group function. For example:
A1.group@o(year(OrderDate),SellerId)

Alignment grouping


Criterions used for equal grouping come from within a dataset. But sometimes, they originate from without, like fields of other data sets, arrays created by users, parameter list and so on. Thus the alignment grouping comes into being.
   
Different from equal grouping, method of alignment grouping may produce empty subsets, which have no members to correspond to data of a group. It may also produce incomplete groups, meaning that some data won’t be included in any group. These things won't happen for equal grouping. 

Case description: Table of top 10 sellers has been worked out in the light of performance, please group sales orders according to the table order.
The data set before grouping:

The sales orders are the same as those in above example. Data are stored in A1.

Table of the top 10 sellers is stored in B1 as follows: 

Sellers table may come from a temporary table, or be generated by a piece of code. The generating process is not the focus of this example.
esProc code:
    A1.align@a(B1:empID,SellerId)

Computed results:

Code interpretation:
1. In this example, the grouping basis (sellers table) comes from without the data set to be grouped. After grouping is completed, a group contains only the data of one seller, and groups are sorted according to sellers table. 

2. Because sellers in sales orders outnumber those in sellers table, some of the orders won't appear in any groups. If we want to create one more group to store these orders, we can use function option @n as follows:
A1.align@a@n(B1:empID,SellerId)

The one more group will be put last as follows:  

3. Sometimes, the grouping basis is not within the data set to be grouped, such as, "newly-employed sellers table". In this case, it's normal to produce empty groups. Modify the first group of data in the table into empID=100, for example, the computed results will be: 

Enumeration grouping 


The grouping basis for enumeration grouping could be more flexible. It could be any Boolean expressions. Those records consistent with the expression will get into the same group.

Similar to alignment grouping, enumeration grouping is also of incomplete grouping, probably producing empty subsets or results that some records are not included in any group. In addition, with this grouping method, it is likely that some records may appear in more than one group. 

Case description: Dividing sales orders into four groups, they are: A. order amount is less than 1000; B. order amount is less than 2000; C. order amount is less than 3000; D. order amount is less than 10,000. Note that the data cannot be grouped repeatedly, that is, if an order has been in group A, it must not be put into group B, C, or D.

The data set before grouping:
The sales orders are the same as those in above example. Data are stored in A1.

esProc code:
         A2=["?<=1000","?<=2000","?<=3000","?<=10000"]
         A3=A1.enum(A2,Amount)

Computed results:
Case interpretation:

1.In this example, grouping basis for grouping is multiple flexible expressions. Each record will be compared with the expressions. Those consistent with the same expression will be put into the same group. Groups are sorted according to order of grouping basis as well.

2. By default, enumeration grouping will not produce identical results. The method, which is showed in the above example, is that after group A’s data are selected, the rest of data will be compared with expression B to see their consistency. While the use of function option @r, which represents that all data are compared with expression B, may produce identical results. For example: A3=A1.enum@r(A2,Amount), computed results are: 

3. Similar to alignment grouping, if the expression for enumeration grouping is inconsistent with any data to be grouped, empty group will appear. Besides, if some data are inconsistent with any expression, function option @n can be used to put them into a surplus group. 

No comments:

Post a Comment