esProc, A Script Language for Data Analytics with Parallel Mechanism: Examples of esProc as Used in Set Operations

Set operations are frequently used in statistical analysis of structured data: for example, listing all students who have published papers, listing all staff who have participated in every previous training session, or selecting the students who qualify for a re-examination. Within esProc, sets are everywhere. The most commonly used data types, the sequence and the sequence table, are both sets. A better understanding of sets therefore helps to complete data computations in a more reasonable and faster way.

For example, the table below contains some sales data:

Now we need to select the customers who were among the Top 20 revenue contributors (Top 20 customers) in every month of 2013. To solve this problem, we can first select all sales data for 2013 and group it by month. Then we can loop over the months to select the Top 20 customers for each one. The intersection of the Top 20 lists for all 12 months contains the names of the customers we want. Problems of this kind are hard to express in SQL or stored procedures.

With esProc, we can split a complex problem into steps and compute step by step to reach the final result. First, we retrieve the 2013 records from the sales data and group them by month:

esProc's grouping is real grouping: it actually separates the data into different groups according to the criterion. This is different from SQL, in which the "group by" clause can only return the aggregated result of a grouping. After grouping, the data in A3 is as follows:

Before grouping, all data is sorted automatically. Each group is a set of sales records. For example, the data for March is as follows:
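To make the idea of "real grouping" concrete outside esProc, here is a minimal Python sketch of the same step. It is not esProc code. The record layout (order date, customer, amount), the sample values and the helper name `group_by_month` are all assumptions made for illustration, since the original sales table is not reproduced here.

```python
from collections import defaultdict
from datetime import date

# Hypothetical sales records: (order date, customer, amount).
# Field layout and values are illustrative, not taken from the original table.
sales = [
    (date(2013, 1, 5), "ACME", 120.0),
    (date(2013, 1, 9), "Bolt", 80.0),
    (date(2013, 3, 2), "ACME", 50.0),
    (date(2012, 12, 30), "Bolt", 999.0),  # not in 2013, so it is filtered out
    (date(2013, 3, 17), "Core", 75.0),
]

def group_by_month(records, year):
    """Return {month: [records]}; each group keeps its member records."""
    groups = defaultdict(list)
    for rec in records:
        order_date = rec[0]
        if order_date.year == year:
            groups[order_date.month].append(rec)
    # Order the groups by month so the result is an ordered collection.
    return dict(sorted(groups.items()))

monthly = group_by_month(sales, 2013)
```

Unlike an aggregating "group by", `monthly[3]` still holds the individual March records, so later steps can regroup or sort them.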

To find the total sales revenue for each customer in every month, we need to split the data further by customer. In esProc, we only need to loop over the data for each month and group each month's records by customer. We can use A.(x) to loop over the members of a set, without writing an explicit loop.

After further grouping, the monthly data in A4 is a set of sets:

Now, the data for March is as follows:

We can see that each group in the March data holds the transaction records of one customer.
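The nested grouping can be sketched in Python as follows. This is an illustration under assumptions, not esProc syntax: the `monthly` structure, the sample values and the helper `group_by_customer` are hypothetical. A dictionary comprehension stands in for the idea of applying an expression to every member of a set without an explicit loop statement.

```python
from collections import defaultdict

# Hypothetical monthly groups: {month: [(customer, amount), ...]}.
monthly = {
    1: [("ACME", 120.0), ("Bolt", 80.0), ("ACME", 30.0)],
    3: [("Core", 75.0), ("ACME", 50.0)],
}

def group_by_customer(records):
    """Split one month's records into {customer: [records]}."""
    groups = defaultdict(list)
    for customer, amount in records:
        groups[customer].append((customer, amount))
    return dict(groups)

# Apply the same grouping to every month; the result is a set of sets.
nested = {month: group_by_customer(recs) for month, recs in monthly.items()}
```

Here `nested[1]["ACME"]` contains both January transactions for that customer, which mirrors the "set of sets" described above.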

Sets in esProc differ from sets in the mathematical sense: they are ordered, so they support operations that depend on order, such as sorting and selection by position. With this, we can find the Top 20 customers for each month:

In A5, we loop over the data for each month to get that month's Top 20 customers. In A6, we list the names and monthly revenues of these customers. The computation result in A6 is as follows:
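As a rough Python sketch of this step (not the esProc code in A5 and A6), the following totals each customer's revenue per month, sorts in descending order and selects the leading entries by position. The data, the helper `top_customers` and the use of a Top 2 instead of a Top 20 are assumptions chosen to keep the example small.

```python
# Hypothetical nested groups: {month: {customer: [amounts]}}.
nested = {
    1: {"ACME": [120.0, 30.0], "Bolt": [80.0], "Core": [200.0]},
    2: {"ACME": [10.0], "Bolt": [90.0], "Core": [60.0]},
}

def top_customers(customer_groups, n):
    """Return the n (customer, total revenue) pairs with the highest totals."""
    totals = [(name, sum(amounts)) for name, amounts in customer_groups.items()]
    # Sort by revenue, highest first, then select by position.
    totals.sort(key=lambda pair: pair[1], reverse=True)
    return totals[:n]

# Top 2 per month here; the article's problem uses n = 20.
top_by_month = {month: top_customers(groups, 2) for month, groups in nested.items()}
```

Because the sorted list is ordered, "Top N" reduces to taking the first N positions, which is the kind of position-based selection ordered sets make possible.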

Finally, we can complete the solution:

In A7, we generate the list of Top 20 customer names for each month. Finally, in A8, we find the intersection of the monthly Top 20 lists, as follows:
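The final intersection can be sketched in Python as below. This is an illustrative equivalent, not the esProc expression used in A8. The name lists and the helper `ordered_intersection` are hypothetical, and three short lists stand in for the twelve monthly Top 20 lists.

```python
from functools import reduce

# Hypothetical per-month lists of top customer names.
top_names = [
    ["Core", "ACME", "Bolt"],
    ["Bolt", "Core", "Dyne"],
    ["Core", "Echo", "Bolt"],
]

def ordered_intersection(first, second):
    """Keep the members of `first` that also occur in `second`, in order."""
    members = set(second)
    return [name for name in first if name in members]

# Fold the pairwise intersection across all monthly lists.
# Note: reduce raises TypeError if top_names is empty.
always_top = reduce(ordered_intersection, top_names)
```

The result keeps the order of the first list, which matches the notion of an ordered set rather than an unordered mathematical one.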

This example shows that ordered sets in esProc make problem solving more intuitive. Within a set, we can easily perform grouping, sorting and other computations, which keeps the goal of each processing step clear and easy to understand. The set concept also reduces the complexity and the amount of code needed for loops over set members and for set operations such as intersection.


July 16, 2014

