esProc, A Script Language for Data Analytics with Parallel Mechanism: esProc Simplifies SQL-style Computations

During database application development, we often need to deal with complicated SQL-style computations. The transposition of rows and columns is one of them. Oracle uses pivot function to realize this computation. The other databases haven’t the counterparts to realize it directly, which makes the corresponding code difficult to write, understand and maintain. Besides, even the pivot function can only transpose the fixed columns, but is powerless about the unfixed ones. So are the other databases. Generally all of them must resort to the high level programming languages to realize the dynamic SQL.

However, coding this computation with esProc will be concise and easy to understand. We’ll use an an example to illustrate this.

The following figure shows part of the SALES – a database table where order data are stored.

It is required to compute the total order amount, the maximum and minimum order amount, and the total number of orders of each month of the year 2013, and then transpose the data into a table with thirteen columns and four rows, in which the four operations occupy the first column, with subtotal being the column name, and every month covers a column, with the column names being 1, 2, 3, 4… The first five fields are as follows:

esProc code:

A1：Execute the SQL statement of selecting the data of the year 2013 and grouping and summarizing the data by the month. Result is as follows:

This simple SQL statement for data grouping and summarizing is supported by any database. The difficulty is the transposition of rows and columns following it.

A2=create(subtotal).record(["OSum","OMAX","OMIN","OCount"])

This line of code creates an empty table sequence where there is only one field: subtotal, as shown below:

Note: A table sequence is a data type in esProc. It is a structured two-dimension table similar to the data table of in SQL, but with more powerful function and more flexible usage. By the way, the result of A1 is a table sequence as well.

B2=A2.derive(${to(A1.len()).string()}).

This line of code adds twelve columns to the table sequence in A2 and thus forms the data structure after the transposition, as shown below:

derive function is used to add new columns to an existing table sequence so as to form a new one. For example, derive(1) means adding one column, where 1 is the field name and the field value is the same as the column name. derive(0:field1, null:field2) means adding two columns, where, respectively, field names are field1 and field2 and field values are 0 and null.

According to the requirement of transposition, twelve columns should be added here, for which the code should be derive(1,2,3,4,5,6,7,8,9,10,11,12). A macro, that is ${}, whose role is to convert a string into an expression, is used here in order to generate the code dynamically. to(A1.len()) in the macro is a sequence, whose value is [1,2,3,4,5,6,7,8,9,10,11,12]. The function string() is used to convert the sequence into the string “1,2,3,4,5,6,7,8,9,10,11,12”.

A3-A5：Perform loop on A1, accessing one record each time, rearranging it vertically and, at the same time, modifying the corresponding column in the table sequence in B2. Please note the working range of the loop statement can be represented by the indentation, with no need of using braces ({}), or begin/end. So both B4 and B5 are in the working range and neither A4 nor A5 is in it.

Note: In esProc’s loop body, the loop variable is the cell where for statement is entered. In other word, A3 can be used to reference the current record and A3.MONTH can be used to reference the MONTH field of the current record.

B4=A3.OSum | A3.OMAX | A3.OMIN | A3.OCount

This line of code concatenates the summarized fields of the current record in columns. The operator “|” represents concatenation. For example, the records of December in A1 should be like this after being concatenated:

A3.OSum in the code represents the OSum field of the current record. Since OSum is the second field of the records, it can be referenced by its sequence number, thus the code shall be written as A3.#2.

Equally, the above line of code can be put as B4=A3.#2 | A3.#3 | A3.#4 | A3.#5.

B5=eval(“B2.run(B4(#):#”+ string(#A3+1)+ “)”).

This line of code means modifying the fields in B2 based on the result of B4.

eval function parses strings into expressions dynamically. For example, the computed result of eval(“2+3″) is 5; and here B2.run(B4(#): #13), the loop code for December, in eval function means inserting members of B4 in order into the 13th column (i.e. December) in B2 according to the sequence numbers of the records in B2.

run function is used to modify the fields. For instance, run(field1+field2:field1, 0:#2) means modifying the value of field1 into field1+field2 and the value of the second field (i.e. #2) into 0.

#A3 means the current loop number. Its value is 1 when the first loop is being executed, and the value is 2 when the second one is being executed, and so on and so forth.

When the loop statement in A3-B5 is executed, the final result will be got in B2. The first several columns are as follows:

In addition, esProc program can be called by the reporting tool or a Java program in a way similar to that in which a Java program calls an ordinary database. The JDBC provided by esProc can be used to return a computed result of the form of ResultSet to the Java main program. For more details, please refer to the related documents。

menu

August 28, 2015

esProc Simplifies SQL-style Computations – Transpose Rows and Columns

No comments:

Post a Comment