esProc, A Script Language for Data Analytics with Parallel Mechanism: The Three-layered Reporting Architecture

A reporting architecture consists of three layers from bottom to top – storage layer, computing layer and displaying layer. The storage layer contains raw data, which may be stored in a relational database (RDB), a NoSQL database, and a local or HDFS file, or may just be a JSON stream. The computing layer can access the storage layer to process the structured and semi-structured raw data to generate the result set that is suitable for reporting. The displaying layer is responsible for presenting and flavoring data in the form of a list, a table, a cross table or a chart.

Today, the computing layer is usually embedded into the storage layer or the displaying layer. In the first case, generally it is the database that does the computing work; and in the second case, computing work will be accomplished by reporting scripts (that use the cell function or inter-cell calculation).

The advantages of performing computations in databases are high data consistency, rich basic algorithms and high performance, but we shouldn’t ignore its disadvantages:

1．Despite the high performance the in-database computation achieves, much of it could be canceled by the performance loss resulted from retrieving data via JDBC and conversion from data flow to objects with the reporting tool. That remains a problem that refuses to be solved.

2．Simple SQL statements can be executed fast, but complex ones are not easy to optimize. The performance of running a stored procedure that uses a for statement for data fetching could reduce exponentially.

3．SQL isn’t thoroughly set-oriented, doesn’t promote stepwise computation and lacks support for ordered sets, rendering natural thinking in SQL and in the stored procedure impossible and programming in them difficult.

4．Often a lot of money is spent in database building and maintenance in order to increase security and reliability. So it’s reasonable to store only the core business data, instead of the large number of intermediate tables, in the database. Unfortunately, most report data sources originate from the intermediate tables.

5．A database is flat-structured, unable to organize data in multilevel directories and only capable of managing a small number of tables. In-database computation will generate many intermediate tables that are related to each other and that may have confusing names, causing difficulty to data administration.

6．It’s difficult to handle multiple/heterogeneous data sources in databases, such as cross-database computations and computations between database and text file.

Handling computation with reporting tools may address the shortcomings of in-database computation to some extent, but there are downsides as well:

1．In handling a computation, the reporting tool needs to hide its rows and columns to make room for storing the intermediate results, increasing memory usage and decreasing performance.

2．Apart from storing values, the grid cells of a report have appearance properties (such as font, size and background color) that may affect performance.

3．Reporting tools can only handle relatively simple computations, such as aggregation and getting computed columns. They lack the ability to deal with most of the data handling tasks involving complex computations.

In sum, there are major drawbacks when embedding the computing layer into the storage layer or the displaying layer in a reporting architecture. A better solution is to make it an independent layer. Specifically, an independent computing layer has the following advantages:

Out-of-database algorithms: Reducing dependence on stored procedures

Stored procedures are managed by the database, while the report templates are stored in a file system independent of the database. It’s difficult to map one onto the other. Editing stored procedures requires programmers to have certain administrative rights for performing the recompilation, opening the door for potential security issues. Multiple reporting routines could share one stored procedure, which is very likely to cause tight coupling between applications, and that, in turn, would mess up the invocation relationships. It’s complicated to write stored procedures and their computation-oriented feature brings little advantage in performance. The frequent use of stored procedures needs outstanding data management skills and huge data administration cost, but that is beyond most development teams can achieve or afford.

An independent computing layer can significantly reduce the use of the database stored procedures. Algorithms, together with the report templates, are stored outside databases and belong exclusively to a single application, which will reduce coupling between reporting routines and won’t cause coupling between a reporting routine and other types of routines.

Here’s an example.

The database table stateSales contains the sales information of the products in each state. It has 3 fields – state, product and amount – where there are duplicate values. Now you need to remove those records with duplicate values and find out the “popular products whose sales amount in every state ranks top 10” and present the data of these products in a report.

Traditionally, we would write a stored procedure to do this. Below is the code:

01 create or replace package salesPkg

02 as

03 type salesCur is ref cursor;

04 end;

05 CREATE OR REPLACE PROCEDURE topPro(io_cursor OUT salesPkg.salesCur)

06 is

07 varSql varchar2(2000);

08 tb_count integer;

09 BEGIN

10 select count(*) into tb_count from dba_tables where table_name='TOPPROTMP';

11 if tb_count=0 then

12 strCreate:='CREATE GLOBAL TEMPORARY TABLE TOPPROTMP (

stateTmp NUMBER not null,

productTmp varchar2(10) not null,

amountTmp NUMBER not null

)

ON COMMIT PRESERVE ROWS';

13 execute immediate strCreate;

14 end if;

15 execute immediate 'truncate table TOPPROTMP';

16 insert into TOPPROTMP(stateTmp,productTmp,amountTmp)

select state,product,amount from stateSales a

where not(

(a.state,a.product) in (

select state,product from stateSales group by state,product having count(*) > 1

)

and rowid not in (

select min(rowid) from stateSales group by state,product having count(*)>1

)

order by state,product;

17 OPEN io_cursor for

18 SELECT productTmp FROM (

SELECT stateTmp,productTmp,amountTmp,rankorder

FROM (SELECT stateTmp,productTmp,amountTmp,RANK() OVER(PARTITION BY stateTmp ORDER BY amountTmp DESC) rankorder

FROM TOPPROTMP

)

WHERE rankorder<=10 order by stateTmp

)

GROUP BY productTmp

HAVING COUNT(*)=(SELECT COUNT(DISTINCT stateTmp ) FROM TOPPROTMP);

END;

With a separate computing layer, the algorithm can be implemented with coupling not being an issue any more. We can choose to use high-level languages like JAVA, or data-computing class libraries such as Pandas and esProc, to perform the computation.

Using JAVA as the computing layer raises two problems. First, the computing layer should be taken care of separately or with report templates. But since JAVA is a compiled language, the computing code that is ready to use needs to be compiled and packaged with the main reporting tool program, thus increasing, instead of decreasing the coupling and disabling realization of independent modules. Second, JAVA is not as professional as the data-computing class libraries, in that it needs to do bottom-up programming starting with the basic algorithms. That demands a lot of work.

Both Pandas and esProc are professional data-computing libraries. But Pandas lacks an easy-to-use JDBC interface for receiving invocation from the reporting tool. That’s why we take esProc here for illustration.

A2 removes the records with duplicate values. A3 groups records by state. B3 gets the sequence numbers of the top 10 products in each group. A4 gets the products from each group according to their sequence numbers. A5 calculates the intersection between groups.

Save the above script as the topPro.dfx file and call it from the reporting tool via JDBC. The calling method is the same as that of calling a stored procedure. With BIRT, for instance, we can call the script using this statement: call topPro()

The above esProc script is simple, clear, and easy to develop and maintain, as well as allows interpreted execution. The script requires only the select right to access the database, without the need of higher levels of rights to perform modification and editing. It can be stored with the corresponding report template to gain easy maintenance.

Out-of-database data: Reducing intermediate tables

An intermediate table is an intermediate result set got through data arrangement and aggregation that needs to be performed in advance because of the large amount of data or the computational complexity. It serves as the source of data of the report and makes report development easier and more efficient. However, similar to the stored procedures, intermediate tables are difficult to manage, potentially vulnerable to security threats, error-prone during invocations and costly to develop. Particularly, those tables will take up large database space, holding hardware and software costs on a rather high level.

Generally, intermediate tables are calculated from static historical data, so they don’t demand the ability of maintaining transaction consistency the databases have, and, for being the redundant data, a high degree of security and steadiness. They are stored in databases just for obtaining computational power. We can retrieve these intermediate results from databases as external data files which will then be handled by the computing layer. In this way, intermediate tables will become much fewer.

The intermediate data that has been moved out of a database can be managed by a file system in a tree structure and stored with the matching report templates and data preprocessing algorithms. An independent computing layer can sustain the intermediate data’s powerful computational ability for use in report creation. With this arrangement, data is more easily and conveniently managed and occupies less database space while maintaining a lower overhead. Moreover, file systems have higher IO performance than databases, and thus more suitable for being used in report development tasks.

Here’s an example.

The Orders table contains core business data from which a number of intermediate tables are generated. One of them is the sumOrders table in which sales amounts are summarized according to clients, sellers, years and month. Now we need to query data in certain years in sumOrders, calculate the link relative ratio of each month and present the result in a report. Below is the table:

Client	SellerId	year	month	amount
ARO	498	2014	10	1814
ARO	499	2013	10	4709
BDR	26	2014	10	990
BDR	41	2014	1	554
BDR	51	2014	11	7
BDR	56	2013	1	4437

Import the sumOrders table as an external file (or we can import the desired data from the core business data table), handle the file in the computing layer, and then return result to the reporting tool. Take esProc as an example and the code will be as follows:

Explanation: Import the text file into the memory with tab being the separator and the first row being the filed names. Then query desired data according to the parameter and perform aggregation by month. Finally, calculate the link relative ratio, in which total[-1] represents the sales amount of the previous month. Result is as follows:

A controlling computing layer: Enabling T+0 reports

A T+0 report is one that presents both the current and historical transaction data.

Historical data has two characteristics: First, it is huge, which causes high database space usage and hardware and software costs. Second, it’s continually expanding, which could affect the current transaction. The traditional idea in response to the problems is keeping only the useful part of the historical data in the current transaction database and moving the rest of it out to build a separate database.

Consequently, programmers and developers are only able to create reports based only on either the present data or the past data, but unable to create a report based on both present and past data, making a T+0 report impossible. To build and view such a report, cross-database querying is needed. Theoretically, many databases support cross-database queries by mapping data from another database into the database currently handled and retrieving the data into the current database for handling if an analysis is needed. But this way, the actual performance and steadiness is rather unsatisfactory and the overhead is fairly high. On the other hand, a historical database has large storage capacity and doesn’t require transaction consistency, so it is most probably that it is a different type of database. Under such circumstances, the computing mechanisms provided by database vendors for cross-database data handling are hard to put into practice.

Yet, with an independent computing layer, high-performance, low-cost, stable cross-database data handling can be achieved.

Independent of any specific database, a computing layer can realize T+0 reports solidly. It permits each database to handle its own data while it collects result sets to summarize and returns the final result to the reporting tool for presentation. Apparently, this mechanism supports scaling out better and allows for different types of historical databases.

Here’s an example.

A telecom company stores user service information in the database table userService. You need to build a T+0 report for displaying the call duration, number of calls, duration of local calls and number of local calls. The actual operation shows that the report building performance is not good because data is huge but the queries are very inefficient.

We can greatly increase the reporting performance by using an independent computing layer. This is how it goes: store the userService table in multiple databases and perform parallel processing in the computing layer.

In esProc, the code will be like this:

The fork statement launches the parallel processing with four subthreads, each of which retrieves data from a database and returns its own group and aggregate result to the main thread that merges the subthreads’ results and performs a second group and aggregate to obtain the desired result set for reporting.

Popular reporting tools generally fail to provide the capability of performing parallel retrieval and aggregation, and, therefore, turn to high-level languages, like Java, to do that. But it’s difficult to write parallel code in Java and the language lacks the ability of handling structured data, so the effect is no match for the one achieved through an independent computing layer.

Handling various data sources directly

An independent computing layer supports many different types of data sources, such as NoSQL databases, file data and HTTP data, and can directly handle them without importing them into the RDBs, saving the effort of building extra databases, as well as lowering cost and the risk of data inconsistency.

In some aspects, NoSQL databases offer superior performance than RDBs, but they don’t have sufficient computational power or each do computations in a distinctive way. If we can enable cooperation between NoSQL databases and the independent computing layer, we can sustain the former’s strengths. For example, the throughput of MongoDB is significantly higher than the throughput ordinary RDBs can achieve when transferring append-only log data, but since the NoSQL database is weak in handling structured data, it needs a database middleware or a Java middleware. An independent computing layer can sustain the high-throughput of MongoDB while injecting into it sufficient ability of processing structured data.

Here’s an example.

A MongoDB Collection last3 has two fields: variable and timestamp. We want to group data by variable and find from each group of documents the three ones with the biggest timestamp values and the ten ones with the smallest timestamp values.

Below is a selection of the raw data:

{"_id" : ObjectId("54f69645e4b077ed8d997857"),

"variable" : "A",

"timestamp" : ISODate("1995-01-01T00:00:00Z")}

{"_id" : ObjectId("54f69645e4b077ed8d997856"),

"variable" : "A",

"timestamp" : ISODate("1995-01-02T00:00:00Z")}

{"_id" : ObjectId("54f69645e4b077ed8d997855"),

"variable" : "A",

"timestamp" : ISODate("1995-01-03T00:00:00Z")}

{"_id" : ObjectId("54f69645e4b077ed8d997854"),

"variable" : "B",

"timestamp" : ISODate("1995-01-02T00:00:00Z")}

{"_id" : ObjectId("54f69645e4b077ed8d997853"),

"variable" : "B",

"timestamp" : ISODate("1995-01-01T00:00:00Z")}

{"_id" : ObjectId("54f69645e4b077ed8d997852"),

"variable" : "B",

"timestamp" : ISODate("1994-01-03T00:00:00Z")}

{"_id" : ObjectId("54f69645e4b077ed8d997851"),

"variable" : "C",

"timestamp" : ISODate("1994-01-03T00:00:00Z")}

{"_id" : ObjectId("54f69645e4b077ed8d997850"),

"variable" : "C",

"timestamp" : ISODate("1994-01-02T00:00:00Z")}

{"_id" : ObjectId("54f69645e4b077ed8d997858"),

"variable" : "C",

"timestamp" : ISODate("1994-01-01T00:00:00Z")}

{"_id" : ObjectId("54f69645e4b077ed8d997859"),

"variable" : "C",

"timestamp" : ISODate("1993-01-01T00:00:00Z")}

In a conventional way, we need to use a Java middleware and return the result to the reporting tool through a user-defined data source interface. The code is cumbersome and the process is complicated. An independent computing layer can directly use the MongoDB data source and return result to the reporting tool. Take esProc for example, the code will be like this:

Explanation: Connect to MongoDB, group data by variable and retrieve the collection last3 as a cursor (big data handling supported). Then run a loop to import a group of documents with the same variable value into the memory each time, and during each loop get the three documents with biggest timestamp values and append them to B2. After that A5 gets from B2 the ten documents with the smallest timestamp values.

menu

January 13, 2016

The Three-layered Reporting Architecture

Out-of-database algorithms: Reducing dependence on stored procedures

Out-of-database data: Reducing intermediate tables

A controlling computing layer: Enabling T+0 reports

Handling various data sources directly

No comments:

Post a Comment