esProc, A Script Language for Data Analytics with Parallel Mechanism: esProc Simplifies SQL-style Computations

During database application development, we often need to handle complicated SQL-style computations. Data grouping with fixed criteria is just one type of these. It refers to the computation whose grouping criteria not come from the to-be-grouped data, but from the outside, such as another table, the external parameters or a list of conditions. For the operation of data grouping with fixed criteria in certain cases, like the one in which the grouping criteria are defined by another table and no specific grouping order is required, it can be realized easily in SQL. But for the other cases, SQL has difficulty in realizing the operation. esProc, however, provides easy solutions to various problems of this operation, which will be illustrated through the following examples.

A table – sales - stores the records of orders, in which column CLIENT contains names of clients and column AMOUNT contains order amounts. Now group the table according to the “list of potential clients” and sum up column AMOUNT of each group. Some of the data of sales are displayed in the following figure:

Example 1：The “list of potential clients” comes from the Std field of another table - potential – and only has four records, which are TAS, DSGC, GC and ZTOZ in order. Client ZTOZ is not in sales. It is required to group and summarize the data according to these four records in their order.

If no specific order is required, this operation can be realized easily with SQL. The corresponding code is as follows:

select potential.std as client, sum(sales.amount) as amount from potential left join client on potential.std=sales.client group by potential.std.

But since a specific order is required in this example, a field for sorting the data must be added if SQL is used to realize the operation and this field should finally be removed using subquery . By contrast, it will be much easier to perform the operation in esProc, as the following code shows:

A1,B1：Query data from the database, and name the two groups of selected data respectively as sales and potential, as shown below:

A3:=sales.align@a(potential:Std,Client) This line of code uses align function to divide the Client field of sales into four groups according to the Std field of potential, as shown in the following figure:

It can be seen that the data of the first three groups have existed in sales, while those in the fourth group are not included in it. So the value of the fourth group is empty. The @a option of align function means fetching all the eligible data of each group; without it only the first eligible data of each group will be fetched.

A4:=pjoin(potential.(Std),A3.(~.sum(Amount)))

This line of code joins two groups of data using pjoin function. One part is potential.(Std), which represents the Std field of potential, and the other part is A3.(~.sum(Amount)), which means summing up the Amount field of each group in A3. The final result of this example is as follows:

Example 2：The list of potential clients has fixed values but there are many clients in it many. If there are only a small number of clients, union statement can be used in SQL to combine all the clients into a pseudo table. But the method is not desirable if there are a large number of clients, and a new table must be created to store the data permanently. Yet the use of esProc can save us the trouble of creating the table. The esProc code is as follows:

In the above code, A2 contains strings which are separated from each other by commas and which represent a great number of fixed values conveniently.

Example 3：The list of potential client is the external parameters, like TAS, BON, CHO, ZTOZ.
As the parameters change frequently, it is inconvenient to create a pseudo table using union statement in SQL. Alternatively, we have to create a temporary table, parse the parameters and insert the parsed parameters into this table for later computation. In esProc, it is unnecessary to create a temporary table for this case. The realization process is as follows:

First, define a parameter arg1, as shown in the following figure:

Then modify the script file, as shown below:

Run the s esProc_sql_fixedcriteria_cript and enter the parameter values, which are assumed to be “TAS, BON, CHO, ZTOZ“, as shown below:

Because the grouping criteria are the same as those in example 1, their final results are the same.

Note: The code in A2 converts the strings “TAS, BON, CHO, ZTOZ” into the sequence ["TAS","DSGC","GC","ZTOZ"]. But this step of conversion can be omitted if the parameters are entered and ["TAS","DSGC","GC","ZTOZ"] is got directly.

Example 4：The criteria for the operation of data grouping with fixed criteria can be either the numerical values or conditions. For example, divide the order amounts into four sections by 1000, 2000 and 4000, with each section being a group of orders, and sum up the total amount of each group.

If the criteria have been given, they can be fixed in SQL; if the criteria are the dynamic external parameters, high-level programming languages, like Java, should be used to write a program to compose a SQL statement and the process will be rather complicated. By supporting dynamic expressions, esProc can realize the operation in this example easily. The following shows the esProc code:

In the above code, the variable byFac represents the grouping criteria in this example, which include four string conditions. byFac can be external parameters, or a view or a table in a database as well. The final result of A4 s as follows:

Example 5：In the previous operations of conditional grouping, it happens that there are no overlapped conditions. But actually it is common to find that conditions overlap with each other. For example, group the order amounts according to the following conditions:

1,000~4,000：Common orders - r14

Less than 2000：Non-important orders - r2

Above 3000：Important orders - r3

Here both r2 and r3 overlap with r14. In this case, if we don’t want the overlapped data, we can first select the data satisfying r14 and then select from the rest of the data those satisfying r2, and so forth.

esProc provides enum function to perform the conditional grouping whose conditions overlap with each other, as shown below:

The grouping result of A3 is as follows:

Computed result is as follows:

But sometimes overlapped data is needed. For this purpose, data satisfying r14 will be selected from sales – one of the two groups of selected data mentioned above, and then data satisfying r2 will be selected from the original table of sales, and so forth. Here @r option will be used in enum function to modify the code in A3 into =sales.enum@r(byFac,Amount). Now the grouping result is as follows:

In the above figure, the data in red boxes overlap with each other. The final computed result is as follows:

In addition, esProc program can be called by the reporting tool or a Java program in a way similar to that in which a Java program calls an ordinary database. The JDBC provided by esProc can be used to return a computed result of the form of ResultSet to the Java main program. For more details, please refer to the related documents.

menu

August 26, 2015

esProc Simplifies SQL-style Computations – Data Grouping with Fixed Criteria

No comments:

Post a Comment