esProc, A Script Language for Data Analytics with Parallel Mechanism: esProc Simplifies Order-related SQL-style Computations: Examples

It is difficult for SQL to handle order-related computations, for it does not support ordered sets. But as these computations are very common in real-world business, there are a lot of online discussions related to them. For example:

Usually SQL approaches order-related computations by:

1.Using window functions

In most cases, use of window functions can help simplify the SQL if the working database (such as Oracle MSSQL) supports window functions. The dependence on database support limits their working sphere.

2.Using complex statements

If the database does not support window functions, you need to use very complicated nested subqueries in composing SQL statements. Of course some databases (such as MySQL) that support variables and ordered execution could simplify certain part of the algorithm.

3.Composing a stored procedure

But in handling a complicated scenario, it is really not easy to compile queries for the stored procedure step by step.

If the main program is Java-based, you can use esProc (free version is available) to assist the computing process. esProc supports ordered data. It is much simpler to perform order-related calculations with esProc than with SQL. Here are some examples.

Calculating year-on-year growth rate and link relative ratio

Inter-row calculations (such as calculating year-on-year growth rate and link relative ratio) are difficult to be performed if the current database does not support window functions. In that case you need to change the way of approaching the problem by performing join operations. But by doing so, the code becomes unreadable and inefficient. Even if you can use the window functions, you still have to write the nested subqueries, producing quite lengthy SQL statements.

Let’s look at how esProc handles the inter-row calculation. The sales table stores orders data in years. You need to calculate the link relative ratio and year-on-year growth rate of each month in a specified time period based on this table. Below is a selection of the source data:

esProc script:

A1: Query the database by the specified time period, and group Amount by the year and month in OrderDate. Both begin and end are external parameters. Suppose begin="2011-01-01 00:00:00" and end="2014-07-08 00:00:00".

A2: Add a new field lrr to A1’s table for holding the monthly link relatives and the year-on-year growth rate of each month. The expression representing the former is mAmount/mAmount[-1]. esProc supports using [N] or [-N] to represent the Nth row after or before the current row. In the expression, mAmount represents the amount of the current month and mAmount[-1] represents that of the previous month. Note that the link relative ratio value of the first month (here it is January 2011) is null.

A3: Sort A2 by the month and the year to calculate year-on-year growth rate. The complete expression should’ve been =A2.sort(m,y), but as A2 has been sorted by the year, you can just sort it by the month with A2.sort(m), which is more efficient.

A4: Calculate year-on-year growth rate of the sales amount based on A3. Note that the calculation can only be performed between the same months.

The execution result of the esProc script can be the data source of the report, or can be called through JDBC from the Java application. The code for Java to call the esProc script is as follows:

Class.forName("com.esproc.jdbc.InternalDriver");

con= DriverManager.getConnection("jdbc:esproc:local://");

//Call esProc script (which is similar to the stored procedure); p1 is the name of the script file

st =(com. esproc.jdbc.InternalCStatement)con.prepareCall("call p1()");

st.setObject(1,"2011-01-01 00:00:00");

st.setObject(2,"2014-07-08 00:00:00");

//Execute the script

st.execute();

//Get the result set

ResultSet rs = st.getResultSet();

……

The returned value is a ResultSet object in accordance with JDBC standard. The method of calling an esProc script is the same as that of accessing a database. Programmers can master it fast as long as they are familiar with JDBC.

More details about deploying esProc JDBC and calling script through it can be found in esProc Integration & Application: Java Invocation.

A special type of inter-row calculation

There is another type of inter-row calculation, which will be explained through an example. Below is the source data:

Target: To add columns store1 and store2. For the first record, the rule is store1=store and store2=store. From the second record up, it would be store1=store+store1[-1] and store2=store1+store2[-1] in which [-1] represents the previous record.

The targeted result:

esProc script:

A1: Execute the SQL statement to retrieve data.

A2: Create the result set and set the rule that store1=store+store1[-1] and store2=store1+store2[-1]. esProc uses [-1] to represent a position relative to the current one, that is, the previous record.

SQL approach for comparison:

Calculating continuous rising/declining values

Below is the stock records, based on which you need to count the number of continuous days when Price values are positive or negative.

The targeted result:

esProc script:

A1: Execute the SQL statement to retrieve data and sort it by Date

A2: Add a result field. If both the current Price value and the previous Price value are positive or negative, add 1 to the result value; otherwise the result value remains 1.

esProc uses [-1] to represent the previous record relative to the current one, (-2) to reference the second-to-last record and {-1,1} to represent a dynamic interval. They can help simplify the complicated inter-row calculations.

Querying continuous same values

Below is the source data:

Target: To get the num values that have appeared continuously for at least three times.

The targeted result:

esProc script:

A1: Execute the SQL statement to retrieve data.

A2: Create the variable lx to which the initial value assigned is 1.

A3: Judge if the num value of the current record is the same as that of the previous record. If the result is true, add 1 to lx. Finally get num values whose corresponding lx value is 3.

Querying neighboring record

Below is the source data:

Target: To select the record whose Property2 value is greater than 100 and find its next record.

The targeted result:

esProc script:

A1: Execute the SQL statement to retrieve data.

A2: Get the position of the record where Property2>100.

A3: Get the record in the desired position and the one next to it based on A2’s sequence of position.

Counting number of rows by specified interval

Below is the table data:

Target: To count the number of rows where value<10 (or whatever value). If the value is less than 10 in multiple continuous rows, count them one row. Thus the computing result based on the above data should be 3.

esProc script:

A1: Execute the SQL statement to retrieve data.

A2: Group rows according to the condition that value>10 and value<10, in which @o means only comparing the neighboring rows. Then count the number of groups where value<10.

Different from SQL grouping which must be followed by an aggregate, esProc group operation can retain the members of each group for potential use. This design helps make an SQL algorithm simpler.

SQL approach:

Filtering on dynamic intervals

Below is the source data:

Target: To sort the records by ID in ascending order and find the records where DDATE is non-ordered, that is, DDATE hasn’t been sorted in ascending order.

The targeted result:

esProc script:

A1: Execute the SQL statement to retrieve data and sort it by ID.

A2: Select the records where DDATE value is not the biggest compared with all previous records, and where DDATE value is not the smallest compared with all the following records. In this statement, DDATE{,0} represents a set consisting of all DDATE values from the first to the current record and DDATE{0,} represents a set consisting of all DDATE values from the current record to the last.

SQL approach:

Finding lost values for an interval

Below is the source data:

Target: To find the lost numbers in an interval of numbers generated by the smallest value and the biggest value in field A

The targeted result:

esProc script:

A1: Execute the SQL statement to retrieve data and sort it by A.

A2: Generate an interval consisting of continuous numbers according to the value of the first record, and then get difference of this interval and A1.

SQL approach:

Merging intervals

Below is the source data:

Target: To judge whether intervals generated by values of num1 and num2 in each record are overlapped. Merge them if the result is true.

The targeted result:

The way of approaching it is to judge whether a row can merge with the previous row. If the result is false, make it a separate group (+1); if true, group them into one (+0). Finally get the biggest and the smallest values from each group to generate the interval.

esProc script:

A1: Execute the SQL statement to retrieve data, and sort it by num1.

A2: Create a temporary variable a, and assign 0 to it as its initial value.

A3: In the sorted table, compare each num1 value with the num2 value in the previous row. If the num1 value is bigger, start a new group with the row; otherwise merge the row to which num1 belongs with the previous row.

SQL approach:

Generating intervals

Below is the source data:

Target: To divide RBD_VALUE values into segments and compose them into strings, which are like A-F,H,J-K,N-O,Q-S,U-V,X-Z.

Rules:

If RBD_VALUE= “ALL”, then discard the value;

According to the alphabetical order if certain RBD_VALUE values are consecutive letters , then make them a segment with both the first and last letter included and “-” between them, like “A-F”;

If two values of RBD_VALUE are inconsecutive, like K and N between which L and M are absent, then unite the former with the letter or letters before it and join the latter with the letter or letters after it. Separate the two segments with a comma. This is like “J-K”, “N-O”;

If a value in RBD_VALUE has not consecutive letters both before and after it, then take it as a separate segment, like the letter H. G and I are absent respectively before and after it.

The targeted result:

esProc script:

A1: Execute the SQL statement to retrieve data.

A2: Align RBD_VALUE field in A1 with the alphabet.

A3: Group rows in A2. Group rows where RBD_VALUE is null together and put the other rows where RBD_VALUE is non-null into another group. @o means performing merge-style data grouping, which only compares the neighboring values.

A4: Select the group in which RBD_VALUE is non-null.

A5: Compose strings according to the number of members in each group. ~.m(-1) means getting the last row from the current group.

menu

June 18, 2015

esProc Simplifies Order-related SQL-style Computations: Examples

Calculating year-on-year growth rate and link relative ratio

A special type of inter-row calculation

Calculating continuous rising/declining values

Querying continuous same values

Querying neighboring record

Counting number of rows by specified interval

Filtering on dynamic intervals

Finding lost values for an interval

Merging intervals

Generating intervals

No comments:

Post a Comment