esProc, A Script Language for Data Analytics with Parallel Mechanism: esProc Simplifies SQL-style Computations

There are many complicated SQL-style computations in database application development. Interval merging is one of them, such as the summation of non-overlapping time intervals and merging of overlapping time intervals, etc. Due to the lack of orderliness of an SQL set, SQL solves the problems by resorting to recursive method that is difficult to be used with a database that supports not enough recursive functions. Besides, we need to write a nested SQL statement containing multilayered subqueries that is too lengthy for realization and modification.

In contrast, esProc has an easier way for the computation. It merges intervals step by step through intuitive script. Let’s look at an example.

Table Udetail stores user operation records in detail. Some of the original data are as follows:

ID UID ST ET

1 1001 2014-9-1 10:00:00 2014-9-2 11:30:00

2 1001 2014-9-1 10:30:00 2014-9-2 11:00:00

3 1001 2014-9-3 11:00:00 2014-9-4 12:00:00

4 1001 2014-9-4 10:00:00 2014-9-5 13:00:00

5 1001 2014-9-4 15:00:00 2014-9-5 18:00:00

6 1002 2014-9-1 11:00:00 2014-9-2 11:30:00

7 1002 2014-9-1 10:30:00 2014-9-2 11:00:00

In this table, ST and ET represent respectively the starting time and ending time of the operations and each user may have overlapping operational time intervals. Based on a specified user, we are to:

1. Merge the overlapping time intervals so as to create a new record of time intervals;

2. Summarize the total time of non-overlapping intervals.

esProc does it as follows:

An explanation of the script：

A1=db1.query("select * from udetail where UID=?",arg1)

Retrieve data of the specified user from the database. arg1 is an external parameter. Suppose the value of it is 1001, then the query result is as follows:

A2= A1.sort(ST).select(ET>ET[-1]).run(max(ST,ET[-1]):ST)

This line of code first sorts A1’s table by the starting time (.sort(ST)) and selects records by the condition that each ET is greater than the previous one, i.e. to delete the intervals completely covered by others. For every two overlapping records, get the minimum ST value and the maximum ET value to create a new time interval. The result is as follows:

It can be seen that sets are ordered in esProc, so the previous record can be referenced by its number. It is very different from SQL. If the continuous time intervals need to be combined into one interval, A2’s code can be like this：A2=A1.sort(ST).select(ET>ET[-1]).run(if(ST<ET[-1],ST[-1],ST):ST).group(ST;~.m(-1).ET:ET)

And the following result will be got:

A3=A2.sum(interval@s(ST,ET))

Summarize the total overlapping time. The result is as follows:

If the total non-overlapping time is wanted (without details), the code can be

A1.sort(ST).select(ET>ET[-1]).sum(interval@s(max(ST,ET[-1]),ET))

If the whole step of time is relatively small, the code can be modified as

A1.(periods@s(ST,ET)).union().len()-2, which gets the result by counting the number of time points.

Finally, an esProc script can be called by the reporting tool or the Java program much like they call a database. It returns a result in the form of ResultSet through JDBC provided esProc. See related documents for more details.

menu

July 7, 2015

esProc Simplifies SQL-style Computations – Interval Merging

No comments:

Post a Comment