A reporting architecture consists of three layers, from bottom to top: the storage layer, the computing layer and the displaying layer. The storage layer contains the raw data, which may be stored in a relational database (RDB), a NoSQL database, or a local or HDFS file, or may just be a JSON stream. The computing layer accesses the storage layer to process the structured and semi-structured raw data and generate a result set suitable for reporting. The displaying layer is responsible for presenting and styling data in the form of a list, a table, a cross table or a chart.
Today, the computing layer is usually embedded into the storage layer or the displaying layer. In the first case, it is generally the database that does the computing work; in the second, the work is done by the reporting tool's scripts (using cell functions or inter-cell calculations).
Performing computations in the database brings high data consistency, rich basic algorithms and high performance, but the disadvantages shouldn't be ignored:
1. Despite the high performance of in-database computation, much of the gain is offset by the cost of retrieving data via JDBC and converting the data stream into objects in the reporting tool. That remains an unsolved problem.
2. Simple SQL statements execute fast, but complex ones are hard to optimize, and a stored procedure that fetches data row by row with a for statement can see its performance degrade drastically.
3. SQL isn't thoroughly set-oriented, doesn't encourage stepwise computation and lacks support for ordered sets, which makes it impossible to express a natural train of thought in SQL or in a stored procedure and makes both difficult to program.
4. A lot of money is usually spent on building and maintaining a database to guarantee security and reliability, so it's reasonable to store only the core business data in it, rather than a large number of intermediate tables. Unfortunately, most report data sources originate from intermediate tables.
5. A database is flat-structured: it cannot organize data in multilevel directories and can manage only a limited number of tables. In-database computation generates many interrelated intermediate tables with potentially confusing names, making data administration difficult.
6. It's difficult to handle multiple or heterogeneous data sources inside a database, for example cross-database computations or computations between a database and a text file.
Handling computations in the reporting tool may address the shortcomings of in-database computation to some extent, but there are downsides as well:
1. To handle a computation, the reporting tool has to hide rows and columns to make room for storing intermediate results, which increases memory usage and decreases performance.
2. Apart from storing values, the grid cells of a report carry appearance properties (such as font, size and background color) that add further overhead.
3. Reporting tools can only handle relatively simple computations, such as aggregation and computed columns; they lack the ability to deal with data preparation tasks that involve complex computations.
In sum, there are major drawbacks when embedding
the computing layer into the storage layer or the displaying layer in a
reporting architecture. A better solution is to make it an independent layer.
Specifically, an independent computing layer has the following advantages:
Out-of-database algorithms: Reducing dependence on stored procedures
Stored procedures are managed by the database, while report templates are stored in a file system independent of the database, so it's difficult to map one onto the other. Editing a stored procedure requires certain administrative rights to recompile it, opening the door to potential security issues. Multiple reporting routines may share one stored procedure, which is very likely to cause tight coupling between applications and, in turn, mess up the invocation relationships. Stored procedures are complicated to write, and their computation-oriented design brings little advantage in performance. Frequent use of stored procedures demands outstanding data management skills and a huge data administration budget, which is beyond what most development teams can achieve or afford.
An independent computing layer can significantly reduce the use of database stored procedures. Algorithms, together with the report templates, are stored outside the database and belong exclusively to a single application, which reduces coupling between reporting routines and avoids coupling between a reporting routine and other types of routines.
Here’s an example.
The database table stateSales contains the sales information of the products in each state. It has three fields – state, product and amount – and contains duplicate records. The task is to remove the duplicate records, find the "popular products whose sales amount ranks in the top 10 in every state" and present these products in a report.
Traditionally, we would write a stored
procedure to do this. Below is the code:
create or replace package salesPkg
as
  type salesCur is ref cursor;
end;
CREATE OR REPLACE PROCEDURE topPro(io_cursor OUT salesPkg.salesCur)
is
  strCreate varchar2(2000);
  tb_count  integer;
BEGIN
  select count(*) into tb_count from dba_tables where table_name='TOPPROTMP';
  if tb_count=0 then
    strCreate:='CREATE GLOBAL TEMPORARY TABLE TOPPROTMP (
      stateTmp   NUMBER not null,
      productTmp varchar2(10) not null,
      amountTmp  NUMBER not null
    ) ON COMMIT PRESERVE ROWS';
    execute immediate strCreate;
  end if;
  execute immediate 'truncate table TOPPROTMP';
  insert into TOPPROTMP(stateTmp,productTmp,amountTmp)
    select state,product,amount from stateSales a
    where not(
      (a.state,a.product) in (
        select state,product from stateSales group by state,product having count(*)>1
      )
      and rowid not in (
        select min(rowid) from stateSales group by state,product having count(*)>1
      )
    )
    order by state,product;
  OPEN io_cursor for
    SELECT productTmp FROM (
      SELECT stateTmp,productTmp,amountTmp,rankorder
      FROM (
        SELECT stateTmp,productTmp,amountTmp,
               RANK() OVER(PARTITION BY stateTmp ORDER BY amountTmp DESC) rankorder
        FROM TOPPROTMP
      )
      WHERE rankorder<=10
      ORDER BY stateTmp
    )
    GROUP BY productTmp
    HAVING COUNT(*)=(SELECT COUNT(DISTINCT stateTmp) FROM TOPPROTMP);
END;
With a separate computing layer, the algorithm can be implemented without coupling being an issue any more. We can use a high-level language like Java, or a data-computing class library such as Pandas or esProc, to perform the computation.
Using Java as the computing layer raises two problems. First, the computing code should be maintained separately or together with the report templates, but since Java is a compiled language, the code has to be compiled and packaged with the main reporting program, which increases rather than decreases coupling and prevents the computing layer from becoming an independent module. Second, Java is not as specialized as the data-computing class libraries: it requires bottom-up programming starting from the basic algorithms, which demands a lot of work.
Both Pandas and esProc are professional data-computing libraries, but Pandas lacks an easy-to-use JDBC interface for receiving invocations from the reporting tool, which is why esProc is used here for illustration.
In the esProc script, A2 removes the records with duplicate values, A3 groups the records by state, B3 gets the sequence numbers of the top 10 products in each group, A4 gets the products from each group according to those sequence numbers, and A5 calculates the intersection between the groups.
Save the script as the file topPro.dfx and call it from the reporting tool via JDBC, in the same way as calling a stored procedure. With BIRT, for instance, we can call it with this statement: call topPro()
The esProc script is simple, clear, easy to develop and maintain, and runs interpreted. It requires only the select privilege on the database, with no higher rights for modification or recompilation, and it can be stored alongside the corresponding report template for easy maintenance.
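For readers who want to try the logic outside esProc, here is a rough pandas sketch of the same algorithm (the file name and loading method are assumptions for illustration; in practice the data would be queried from the database):

import pandas as pd

# stateSales has the columns state, product and amount (see above);
# reading it from a CSV file here is an assumption for illustration.
stateSales = pd.read_csv("stateSales.csv")

dedup = stateSales.drop_duplicates()          # remove duplicate records (A2)

# Per state, keep the ten products with the largest amounts (A3, B3, A4).
top10 = (dedup.sort_values("amount", ascending=False)
              .groupby("state")
              .head(10))

# Products that appear in the top 10 of every state (A5: intersection).
states = dedup["state"].nunique()
counts = top10.groupby("product")["state"].nunique()
popular = counts[counts == states].index.tolist()
print(popular)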
Out-of-database data: Reducing intermediate tables
An intermediate table is an intermediate result set obtained through data arrangement and aggregation that must be performed in advance because of the large data volume or the computational complexity. It serves as the report's data source and makes report development easier and more efficient. However, like stored procedures, intermediate tables are difficult to manage, potentially vulnerable to security threats, error-prone during invocation and costly to develop. In particular, they occupy a large amount of database space, keeping hardware and software costs at a rather high level.
Generally, intermediate tables are calculated from static historical data, so they don't need the transaction consistency a database maintains nor, being redundant data, a high degree of security and stability. They are kept in databases only to obtain computational power. If we instead export these intermediate results from the database as external data files and let the computing layer handle them, far fewer intermediate tables are needed.
The intermediate data moved out of a database can be managed by the file system in a tree structure and stored with the matching report templates and data-preprocessing algorithms. An independent computing layer preserves the full computational ability over this intermediate data for report creation. With this arrangement, data is managed more easily and conveniently, occupies less database space and incurs a lower overhead. Moreover, file systems offer higher I/O performance than databases, making them more suitable for report development tasks.
Here’s an example.
The Orders table contains core business data from which a number of intermediate tables are generated. One of them is the sumOrders table, in which sales amounts are summarized by client, seller, year and month. Now we need to query the data of certain years in sumOrders, calculate the link relative ratio of each month and present the result in a report. Below is the table:
Client | SellerId | year | month | amount
ARO    | 498      | 2014 | 10    | 1814
ARO    | 499      | 2013 | 10    | 4709
BDR    | 26       | 2014 | 10    | 990
BDR    | 41       | 2014 | 1     | 554
BDR    | 51       | 2014 | 11    | 7
BDR    | 56       | 2013 | 1     | 4437
Explanation: Import the text file into memory with tab as the separator and the first row as the field names. Then query the desired data according to the parameter and aggregate it by month. Finally, calculate the link relative ratio, in which total[-1] represents the sales amount of the previous month.
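As a rough illustration of these steps in Python with pandas (the file name and parameter values are assumptions for illustration):

import pandas as pd

# Read the intermediate data exported as a tab-separated file; the first
# row supplies the field names (Client, SellerId, year, month, amount).
sumOrders = pd.read_csv("sumOrders.txt", sep="\t")

query_years = [2013, 2014]                    # report parameter (assumed)
selected = sumOrders[sumOrders["year"].isin(query_years)]

# Aggregate sales amount by month, then compute the link relative ratio;
# shift(1) plays the role of total[-1], the previous month's total.
monthly = (selected.groupby(["year", "month"], as_index=False)["amount"].sum()
                   .sort_values(["year", "month"]))
monthly["ratio"] = monthly["amount"] / monthly["amount"].shift(1)
print(monthly)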
A controlling computing layer: Enabling T+0 reports
A T+0 report is one that presents both the
current and historical transaction data.
Historical data has two characteristics. First, it is huge, which drives up database space usage and hardware and software costs. Second, it keeps expanding, which could affect current transaction processing. The traditional response to these problems is to keep only the useful part of the historical data in the current transaction database and move the rest out into a separate database.
Consequently, developers can create reports based only on either the present data or the past data, but not on both, making a T+0 report impossible. To build and view such a report, cross-database querying is needed. In theory, many databases support cross-database queries by mapping data from another database into the current one and retrieving it there for handling when an analysis is needed. In practice, the performance and stability of this approach are rather unsatisfactory and the overhead is fairly high. Moreover, a historical database has a large storage capacity and doesn't require transaction consistency, so it is quite likely to be a different type of database, and under such circumstances the cross-database computing mechanisms provided by database vendors are hard to put into practice.
Yet, with an independent computing layer,
high-performance, low-cost, stable cross-database data handling can be
achieved.
Independent of any specific database, a computing layer can realize T+0 reports reliably. It lets each database handle its own data, collects the result sets to summarize them, and returns the final result to the reporting tool for presentation. This mechanism also scales out better and accommodates different types of historical databases.
Here’s an example.
A telecom company stores user service information in the database table userService. You need to build a T+0 report displaying the call duration, the number of calls, the duration of local calls and the number of local calls. In practice, report building performs poorly because the data volume is huge and the queries are inefficient.
We can greatly increase the reporting
performance by using an independent computing layer. This is how it goes: store
the userService table in multiple
databases and perform parallel processing in the computing layer.
In esProc, the script uses a fork statement to launch the parallel processing with four subthreads. Each subthread retrieves data from its own database and returns its group-and-aggregate result to the main thread, which merges the subthreads' results and performs a second group and aggregation to obtain the desired result set for reporting.
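The same scheme can be sketched in Python (the database files, the table columns and the local-call flag are assumptions for illustration; SQLite stands in for the four segment databases):

import sqlite3
import pandas as pd
from concurrent.futures import ThreadPoolExecutor

# Four segment databases holding the userService table (assumed names).
DB_PATHS = ["seg0.db", "seg1.db", "seg2.db", "seg3.db"]

SQL = """SELECT userId,
                SUM(duration)                                   AS duration,
                COUNT(*)                                        AS calls,
                SUM(CASE WHEN local=1 THEN duration ELSE 0 END) AS localDuration,
                SUM(CASE WHEN local=1 THEN 1 ELSE 0 END)        AS localCalls
         FROM userService
         GROUP BY userId"""

def query_one(path):
    # Each worker aggregates its own database, like one fork subthread.
    con = sqlite3.connect(path)
    try:
        return pd.read_sql_query(SQL, con)
    finally:
        con.close()

with ThreadPoolExecutor(max_workers=4) as pool:
    parts = list(pool.map(query_one, DB_PATHS))

# Second-stage aggregation over the partial results, as the main thread does.
result = pd.concat(parts).groupby("userId", as_index=False).sum()
print(result)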
Popular reporting tools generally cannot perform parallel retrieval and aggregation themselves and therefore turn to high-level languages like Java to do it. But it's difficult to write parallel code in Java, and the language lacks built-in support for handling structured data, so the effect is no match for what an independent computing layer achieves.
Handling various data sources directly
An independent computing layer supports many different types of data sources, such as NoSQL databases, file data and HTTP data, and can handle them directly without importing them into an RDB, saving the effort of building extra databases and lowering both the cost and the risk of data inconsistency.
In some respects, NoSQL databases offer better performance than RDBs, but they lack sufficient computational power, or each does computations in its own distinctive way. If NoSQL databases can cooperate with an independent computing layer, their strengths are preserved. For example, MongoDB achieves significantly higher throughput than ordinary RDBs when ingesting append-only log data, but since it is weak at handling structured data, it normally needs a database middleware or a Java middleware. An independent computing layer sustains MongoDB's high throughput while supplying the structured-data processing ability it lacks.
Here’s an example.
A MongoDB collection last3 has two fields: variable and timestamp. We want to group the documents by variable, take from each group the three documents with the biggest timestamp values, and then find among those the ten documents with the smallest timestamp values.
Below is a selection of the raw data:
{"_id" : ObjectId("54f69645e4b077ed8d997857"), "variable" : "A", "timestamp" : ISODate("1995-01-01T00:00:00Z")}
{"_id" : ObjectId("54f69645e4b077ed8d997856"), "variable" : "A", "timestamp" : ISODate("1995-01-02T00:00:00Z")}
{"_id" : ObjectId("54f69645e4b077ed8d997855"), "variable" : "A", "timestamp" : ISODate("1995-01-03T00:00:00Z")}
{"_id" : ObjectId("54f69645e4b077ed8d997854"), "variable" : "B", "timestamp" : ISODate("1995-01-02T00:00:00Z")}
{"_id" : ObjectId("54f69645e4b077ed8d997853"), "variable" : "B", "timestamp" : ISODate("1995-01-01T00:00:00Z")}
{"_id" : ObjectId("54f69645e4b077ed8d997852"), "variable" : "B", "timestamp" : ISODate("1994-01-03T00:00:00Z")}
{"_id" : ObjectId("54f69645e4b077ed8d997851"), "variable" : "C", "timestamp" : ISODate("1994-01-03T00:00:00Z")}
{"_id" : ObjectId("54f69645e4b077ed8d997850"), "variable" : "C", "timestamp" : ISODate("1994-01-02T00:00:00Z")}
{"_id" : ObjectId("54f69645e4b077ed8d997858"), "variable" : "C", "timestamp" : ISODate("1994-01-01T00:00:00Z")}
{"_id" : ObjectId("54f69645e4b077ed8d997859"), "variable" : "C", "timestamp" : ISODate("1993-01-01T00:00:00Z")}
Conventionally, we would need a Java middleware and return the result to the reporting tool through a user-defined data source interface; the code is cumbersome and the process complicated. An independent computing layer can use the MongoDB data source directly and return the result to the reporting tool. Taking esProc as the example:
Explanation: Connect to MongoDB and retrieve the collection last3 as a cursor grouped by variable (big data handling is supported). Then run a loop that imports one group of documents with the same variable value into memory at a time; each pass gets the three documents with the biggest timestamp values and appends them to B2. Finally, A5 gets from B2 the ten documents with the smallest timestamp values.
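The same logic can be sketched in Python with pymongo (the connection string and database name are assumptions for illustration):

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")   # assumed connection
coll = client["test"]["last3"]                      # assumed database name

# For each variable value, collect the three documents with the biggest
# timestamps, mirroring the per-group step of the esProc script.
top3_per_group = []
for var in coll.distinct("variable"):
    docs = coll.find({"variable": var}).sort("timestamp", -1).limit(3)
    top3_per_group.extend(docs)

# From the collected documents, keep the ten with the smallest timestamps.
result = sorted(top3_per_group, key=lambda d: d["timestamp"])[:10]
for d in result:
    print(d["variable"], d["timestamp"])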