esProc, A Script Language for Data Analytics with Parallel Mechanism: esProc Simplifies SQL-style Computation

During developing database applications, we are often faced with complicated SQL-style computations like relational computation on multilayered groups. As SQL requires summarizing during data grouping and it doesn't support object-style relational access, it is difficult to deal with these computations. We have to resort to an advanced technique of window functions nested with multilayered subqueries. esProc, however, can handle them more easily by supporting real grouping and visual object-style relational access.

In practice, there are a lot of cases in which relational computation on groups is needed, such as the one presented on the website http://forums.bit-tech.net/showthread.php?t=207052. Based on this practical example, we have designed a more common one for illustrating in detail how esProc works to realize the relational computation on groups.

Computing target: Query branches of a DVD store where there are less than four categories of DVD copies.

Data Structure：Branch Table stores information of the branch stores. DVD Table stores titles and categories of DVDs, in which DVDs, like Transformers IV, are virtual data items instead of physical disks. DVDCopy Table stores physical DVD copies branch stores have. Note: DVDCopy Table is related to Branch Table through BranchID field, and to DVD Table through DVDID field. The following is part of the data:

Branch Table：

DVD Table：

DVDCopy Table：

Description：

1.Computed results should be certain records in Branch Table.

2.That the Status field of a record in DVDCopy shows Miss means the DVD is missing, and that a record’s LastDateReturned field is empty means the DVD has been rented out but not returned. Obviously the DVDs that are missing and haven't been returned are outside of our computing objects and should be filtered away.

3.We should consider the situation that there may be certain branch stores whose information isn't included in DVDCopy Table, though it is rarely seen.

Solution:

1.Select valid, existing DVD copies the branch stores have from DVDCopy Table.

2.Group DVDCopy Table by BID. Each group will contain all DVD copies a branch store has.

3.Select the DVDs corresponding to the DVD copies each branch store has, and compute the number of categories to which these DVDs belong.

4.Select branch stores where the number of categories of existing DVDs is less than four. These branch stores are eligible.

5.Select branch stores which DVDCopy Table hasn't. They are also eligible.

6. Combine the two kinds of eligible branch stores.

Code written in esProc:

A1-A3：Query data from three tables in the database. The three tables are made variables which are named respectively as Branch, DVD and DVDCopy. Computed results are as follows:

A4： Switch the DVDID field and BID field in DVDCopy Table to corresponding records in DVD Table and Branch Table respectively. Note: This step is the basis of object-style relational access, which requires the use of switch function. After computing, DVDCopy becomes as follows:

Fields in blue have corresponding records. Click one and you can see the details, as shown in the following figure:

Now we can perform object-style relational access only with the operator “.”. For instance, DVDCopy.(DVDID). (Category) represents the category of each DVD copy, and DVDCopy.(BID) gets the detailed information (complete record) about the branch store corresponding to each DVD copy.

A5：=DVDCopy.select(Statues!="Miss" && LastDateReturend!=null). This line of code aims to filter away the DVD copies that are missing and that haven’t been returned. After the code is executed, values of A5 are shown as follows:

A6：=A5.group(BID) is to group data in A5 by BID, with each row representing all DVD copies a branch store has. The result is as follows:

Click the data in blue and you’ll see members of each group:

It can be seen that group function only groups data, but doesn't summarize the data at the same time. In this point, it is different from the function for grouping in SQL. Sometimes, we need to further process the grouped data, rather than simply summarizing them. To do this, esProc's group function is more convenient to use, as shown in the code in A7:

A7：=A6.new( ~.BID:BonList, ~.(DVDID).id(Category).count():CatCount )

The above line of code computes the number of categories of DVD copies to which each branch store corresponds. new function can generate a new object A7 based on the data in A6. A7 has two columns: BonList and CatCount. BonList originates directly from column BID of the grouped data in A6, and CatCount originates from column DVDID of the grouped data. There are three steps to compute CatCount: ~.(DVDID) finds the DVD records corresponding to all DVD copies each branch store has; id(Category) removes repeated records of Category from these DVD records; and count() computes the number of categories. The result is as follows:

That is, branch store B002 has three categories of DVD copies, B003 also has three categories and B001 has four categories.

A8：A7.select(CatCount<4).This line of code makes query to select branch stores whose CatCount is less than 4. The result is as follows:

The above branch stores that are in short supply are computed according to DVDCopy Table. But maybe some branch stores with serious supply shortage are not in the DVDCopy, such as the cases that all the DVD copies in the branch store has been rented out, or that the branch store hasn't any DVD copies. So these branch stores should also be counted. The code for this step is as follows:

A9：=A8.(BonList) | (Branch \ A7.(BonList))

In the above code, operator "|" represents the union operation of two data sets (which can be replaced by union function); operator "\" represents the complement operation (which can be replaced by diff function). A8.(BonList), Branch and A7.(BonList) represent respectively branch stores with supply shortage in DVDCopy Table, all branch stores and branch stores which are included in DVDCopy Table. Their respective values are:

A9 computes the final result of this example. Its values are:

Or we can realize the computation indirectly. For instance, first compute "branch stores that are not in short supply", and then compute the complement of the result and Branch Table. The final result should be the same as that of A9.

Please note variables like A8 or Branch cannot be used to represent a data set in SQL because it doesn't support explicit set. Thus the simple code in the above have to be replaced by lines of complicated SQL statements.

Besides, esProc can be called by a Java program. The method is similar to that with which a Java program calls an ordinary database. The JDBC provided by esProc can be used to return a computed result of the form of ResultSet to Java main program. For more details, please refer to the related documents

menu

October 20, 2014

esProc Simplifies SQL-style Computation – Relational Computation on Groups

No comments:

Post a Comment