April 17, 2014

esProc Helps Database realize Real-time Big Data computing

The Big Data Real-time Application is a scenario to return the computation and analysis results in real time even if there are huge amount of data. This is an emerging demand on database applications in recent years.

In the past, because there are not so many data, the computation is simple, and few parallelisms, the pressure on the database is not great. A high-end or middle-range database server or cluster can allocate enough resource to meet the demand. Moreover, in order to rapidly and parallelly access to the current business data and the historic data, users also tend to arrange a same database server for both the query analysis system and the production system. By this way, the database cost can be lowered, the data management streamlined, and the parallelism ensured to some extent. We are in the prime time of database real time application development.

In recent years, due to the data explosion, and the more and more diversified and complex application, new changes occur to the database system. The obvious change is that the data is growing at an accelerated pace with ever higher and higher volume. Applications are increasingly complex, and the number of concurrent accesses makes no exception. In this time of big data, the database is under increasing pressure, posing a serious challenge to the real-time application.

The first challenge is the real-time. With the heavy workload on the database, the database performance drops dramatically, the response is sluggish, and user experience is going from bad to worse quickly. The normal operation of the critical business system has been affected seriously. The real-time application has actually become the half real-time.

The second challenge is the cost. In order to alleviate the performance pressure, users have to upgrade the database. The database server is expensive, so are the storage media and user license agreement. Most databases require additional charges on the number of CPUs, cluster nodes, and size of storage space. Due to the constant increase of data volume and pressure on database, such upgrade will be done at intervals.

The third challenge is the database application. The increasing pressure on database can seriously affect the core business application. Users would have to off-load the historic data from the database. Two groups of database servers thus come into being: one group for storing the historical data, and the other group for storing the core business data. As we know, the native cross-database query ability of databases are quite weak, and the performance is very low. To deliver the latest and promptest analysis result on time, applications must perform the cross-database query on the data from both groups of databases. The application programing would be getting ever more complex.

The forth challenge is the database management. In order to deliver the latest and promptest analysis result on time, and avoid the complex and inefficient cross-database programming, most users choose to accept the management cost and difficulty increase - timely update the historic library with the latest data from the business library. The advanced edition of database will usually provide the similar subscription & distribution or data duplicate functions.
The real-time big data application is hard to progress when beset with these four challenges.

How to guarantee the parallelism of the big data application? How to reduce the database cost while ensuring the real-time? How to implement the cross-database query easily? How to reduce the management cost and difficulty? This is the one of hottest topics being discussed among the CIOs or CTOs.

esProc is a good remedy to this stubborn headache. It is the database middleware with the complete computational capability, offering the support for the computing no matter in external storage, across databases, or parallelly. The combination of database and esProc can deliver enough capability to solve the four challenges to big data applications.

esProc supports for the computation over files from external storage and the HDFS. This is to say, you can store a great volume of historical data in several cheap hard disks of average PCs, and leave them to esProc to handle. By comparison, database alone can only store and manage the current core business data. The goal of cutting cost and diverting computational load is thus achieved.

esProc supports the parallel cluster computing, so that the computational pressure can be averted to several cheap node machines when there are heavy workload and a great many of parallel and sudden access requests. Its real-time is equal or even superior to that of the high-end database.

esProc offers the complete computational capability especially for the complex data computing. Even it alone can handle those applications involving the complex business logics. What's even better, esProc can do a better job when working with the database. It supports the computations over data from multiple data sources, including various structural data, non-structural data, database data, local files, the big data files in the HDFS, and the distributed databases. esProc can provide a unified JDBC interface to the application at upper level. Thus the coupling difficulty between big data and traditional databases is reduced, the limitation on the single-source report removed, and the difficulty of the big data application reduced.

With the seamless support for the combined computation over files stored in external storage and the database data, users no longer need the complex and expensive data synchronization technology. The database only focus on the current data and core business applications, while esProc enable users to access both the historic data in external storage and the current business data in database. By doing so, the latest and promptest analysis result can be delivered on time.

The cross-database computation and external storage computation capability of esProc can ensure the real-time query while alleviating the pressure on database. Under the assistance of esProc, the big data real-time application can be implemented efficiently at relatively low cost.

About esProc: http://www.raqsoft.com/product-esproc

April 16, 2014

Inter Row Computation to Showcase the Difference Between R Language and esProc

The inter row computation is quite common, such as the aggregate, comparison with same period of any previous year, and link relative ratio. Both R language and esProc provides the pretty good inter-row computation ability with slight difference to each other. In the below case, the utilization of some basic inter-row computations is demonstrated to study on the differences between the two methods:

A sales department of a company wants to make statistics on the outstanding sales persons, that is, the sales persons achieve half of the total sales of the company. The data are mainly from the order table of MSSQL database salesOrder. The main fields include the ID of order: ordered, Name of sales person: name, Sales amount: sales, and date of order: salesDate.
The straightforward solution is as shown below:
1.Group by sales person to calculate the sales amount of each sales person.
2.Sort by sales amount in reverse order on the basis of the data from the previous step.
3.According to the previous step, calculate the aggregate value of each record, and calculate the standard of comparison: the half of total sales of this company.
4.Of the aggregate values calculated in the previous step, select out the list of sales persons whose sales achievement meet the below conditions: lower or equal to the standard of comparison; or although higher than the standard of comparison, the sales achievement of previous sales person is lower than the standard of comparison.

The detailed solution of R language is as shown below:
01 library(RODBC)
02 odbcDataSources()
03 conn<-odbcConnect("sqlsvr")
04 originalData<-sqlQuery(conn,'select * from salesOrder')
05 odbcClose(conn)
06 nameSum<-gNameMonth<-aggregate(originalData$sales,list(originalData$name),sum)
07 names(nameSum)<-c('name','salesSum')
08 orderData<-nameSum[rev(order(nameSum$salesSum)),]
09 halfSum<-sum(orderData$salesSum)/2
10 orderData$addup<-cumsum(orderData$salesSum)
11 subset(orderData,addup<=halfSum | (addup>halfSum& c(0,addup[-length(addup)])<halfSum))
Please find the detailed solution of esProc below:

Then, let us study on the differences between aggregate values:
The R language uses cumsum to compute the aggregate value in the line 10.
esProc uses cumulate in A4 to calculate the aggregate value.
Both writing styles are very convenient for users. However, the operation principle of esProc is aimed to each record: firstly, calculate the cumulate, then, get the aggregate value corresponding to this record according to the #row number. By comparison, R language enjoys a higher efficiency than esProc on this respect since the computation will be only carried out once if using R language.

Dividing one statement of esProc into two statements can solve the efficiency issue, that is, firstly, calculate the list of aggregate value separately, and then insert it to the original data set. However, such writing style is not as concise as the R language that only requires one line of code.

Then, let us check the qualified sales person and the differences:

The R language completes the computation at the Line 11, mainly by moving the line, and using c(0,addup[-length(addup)])to construct a column for the new data. Compared with the column addup, the new column just moves down one column, and the last entry of data is removed and filled with 0 of the first entry. Then, you can compare whether the aggregate value is lower than the standard of comparison, or although it is higher than the standard of comparison, its previous record is lower than the standard.

The R language does not provide the ability to access the data at the relative position. Therefore, the method of “move the data in the relative position to the current position” is adopted. Though the result is still the same, the style of writing is not intuitive enough, and it requires the analyst a relatively higher ability in logic thinking.

The writing style of esProc is select(addup<=B3 || (addup>B3 &&addup[-1]<B3)). Simple and clear indeed! This is the expression of relative position featured by esProc. Users can use the method of [-1] to represent the record in a position one record before or several records after the current record. For example, the aggregation value calculation in A4 can also be rewritten to A3.derive(addup[-1]+salesSum:addup).

Unlike the fixed algorithm of aggregate value, the algorithm of this step is relatively much freer. You may find that the style of expression regarding the relative position of esProc is very agile with great advantages.
Compared with the fixed algorithms, this step of algorithm is much freer.

As we can see from the above case, the computations of relative position and inter-row computations can solve many problems which are apparently complex. esProc is more flexible in expressing the relative positions. Therefore, esProc users can feel more relax when calculating the complex problems.

Regarding the R language, appending to the whole column/row and the fixed algorithm are relative more concise.

About esProc: http://www.raqsoft.com/product-esproc

April 15, 2014

Data Computing Within Oracle, MSSQL and Mysql in Report Developing

Recently a development team met some difficulties in data source computation when developing iReport reports. After the use of esProc for cross database computation, the problem is resolved.

This is a project payment progress report, as part of a project management system. It’s based on an Oracle database. The reports needs to present the project name, payment amount, contract value, payment progress (in percentage), and name of the project manager, etc., for all ongoing projects. Amount these the first 3 items contributes to the difficulties met by the development team.

Note that the project payment record is stored in accounting system, rather than the Oracle database used by the development team. The accounting system is built on MSSQL database. The contract value information is also stored in a separate MSSQL database, used by contract management system. After analysis to the database, the project information record has a one-to-many relationship with payment record, and an one-to-one relationship with contract information.

By simplifying the table and fields other than the hardest part of the problem, the issue can then be described as: we need to join table projectpayProcess and contract from three heterogeneous databases. 
         Main fields from project include: projectNo, projectName.
         Main fields from payProcess include: payID, projectNo, amount.
         Main fields from contract include: contractNo, projectNo, conAmount.
         It’s fairly easy to write the SQL statements if the three tables are in the same database. However they now belong to three heterogeneous databases, the join is then difficult. Meanwhile, the two MSSQL databases are used by two independent commercial software, which the development team can load data from, but not able to modify or administrate. This adds to the difficulty of the development. Although iReport support simple cross database computation with two tables, it’s very difficult to do this with three tables. The development team cannot leverage iReport to meet this requirement.

The use of new reporting tool that supports multiple data sources might help. However as iReport is the only one used in the development process so far, a sudden change is not realistic. iReport supports user-defined JavaBean as data source, this can also be used to handle cross database computation. But JAVA is not a language for computation. The “join” computation between multiple sets requires a lot of coding. The workload makes it a non-sense. ETL tool is also a possible way to consolidate tables from multiple databases into one. The ETL approach normally requires incremental update to databases, which means timestamp and triggers need to be used in payProcess and contract table. As we know, commercial software does not allow databases to be modified in this way. ETL cannot provide the real time updates required for payment progress monitoring, which means it’s not suitable for this report.

Under such condition, the use of esProc for cross database computation is a handy approach. Here are the codes

A1, A2, A3 are for loading data from different databases, which are native syntax 
for each database. Note the difference in SQL for each heterogeneous database.
A4 is an association statement. From here on the syntax has nothing to do with certain database. The “join” function associate A1, A2, A3 according to projectID. Note that this is inner join. If it’s left join, the syntax of “join@1” should be used, whereas for full join, the syntax is “join@f”.
A5 is to output the required fields from the join, which could be part or all fields from A1, A2 and A3.

The statement of result A5 means to output the computation result by JDBC, so that iReport could retrieve them directly through JDBC. It’s also possible for esProc to be used in JAVA codes directly by means of JDBC call.

The above codes are just a prototype, which can actually run successfully. In reality more table association and data fields need to be added, plus some and parameter filters.

As we could see, the solving of cross database computation issue with esProc is fairly easy. esProc supports JAVA well, and is handy to be used with iReport.

About esProc: http://www.raqsoft.com/product-esproc

An Alternative of IOE in the Big Data World

What is IOE? I=IBM, O=Oracle, and E=EMC. They represent the typical high-end database and data warehouse architecture. The high-end servers include HP, IBM, and Fujitsu, the high-end database software includes Teradata, Oracle, Greenplum; the high-end storages include EMC, Violin, and Fusion-io.
In the past, such typical high performance database architecture is the preference of large and middle sized organizations. They can run stably with superior performance, and became popular when the informatization degree was not so high and the enterprise application was simple. With the explosive data growth and the nowadays diversified and complex enterprise applications, most enterprises have gradually realized that they should replacing IOE, and quite a few of them have successfully implemented their road map to cancel the high-end database totally, including Intel, Alibaba, Amazon, eBay, Yahoo, and Facebook.
The data explosion has brought about sharp increase in the storage capacity demand, and the diversified and complex applications pose the challenge to meet the fast-growing computation pressure and parallel access requests. The only solution is to upgrade even more frequently. More and more enterprise managements get to feel the pressure of the great cost to upgrade IOE. More often than not, enterprises still suffer from the slow response and high workloads even if they've invested heavily. That is why these enterprises are determined to replace IOE.
Hadoop is one of the IOE solutions on which the enterprise management have pinned great hope.
It supports the cheap desktop hard disk as a replacement to high-end storage media of IOE.
Its HDFS file system can replace the disk cabinet of IOE, ensuring the secure data redundancy.
It supports the cheap PC to replace the high-end database server.
It is the open source software, not incurring any cost on additional CPUs, storage capacities, and user licenses.
With the support for parallel computing, the inexpensive scale-out can be implemented, and the storage pressure can be averted to multiple inexpensive PCs at less acquisition and management cost, so as to have greater storage capacity, higher computing performance, and a number of paralleling processes far more than that of IOE. That's why Hadoop is highly anticipated.
However, Hadoop’s structured data computation still cannot reach that level as IOE did, especially in relational database led by Oracle. The data computing is the most important software function for the modern enterprise data center. Nowadays, it is normal to find some data computing involving the complex business logics, in particular the applications of enterprise decision-making, procedure optimizing, performance benchmarking, time control, and cost management. However, Hadoop alone cannot replace IOE. As a matter of facts, those enterprises of high-profile champions for replacing IOE have to partly keep the IOE. With the drawback of insufficient computing capability, Hadoop can only be used to compute the simple ETL, data storage and locating, and is awkward to handle the truly massive business data computation.
To replace IOE, we need to have the computational capability no weaker than the enterprise-level database and seamlessly incorporating this capability to Hadoop to give full play to the advantageous computing solution of Hadoop. esProc can meet this demand.
esProc is a parallel computing middleware which is built with pure Java and focused on powering Hadoop. It can access Hive via JDBC or directly read and write to HDFS. With the complete data computing system, esProc can replace the most data computing ability of IOE in a simpler way. It is especially good at the computation requiring complex business logics and stored procedures.
esProc supports the professional data scripting languages, offering the true set data type, easy for algorithm design from business client's perspective, and effortless to implement the complex business logics. In addition, esProc supports the ordered set for arbitrary access to the member of set and perform the serial-number-related computation. The set of set can be used to represent the complex grouping style easily, for example, the equal grouping, align grouping, and enum grouping. esProc also provides the complete code editing and debugging functions. It can be regarded as a dynamic set-lized language which has something in common with R language, and offers native support for distributed parallel computation from the core. Programmers can surely be benefited from the efficient parallel computation of esProc while still having the simple syntax of R. It is built for the data computing, and optimized for data processing. For the complex analysis business, both its development efficiency and computing performance are beyond the existing solution of Hadoop in structured data computing.
The combined use of Hadoop + esProc can fully remedy the drawback to Hadoop, empowering Hadoop to replace the very most of IOE features and improving its computing capability dramatically.

April 14, 2014

A More Convenient Data Computing Script than SQL

Database plays an irreplaceable role in the modern economy and is widely used in the business computing areas like Enterprise Resources Planning (ERP), Customer Relation Management (CRM), Supply Chain Management (SCM), and the Decision Support System (DSS).

Computation of structured data in the database mainly relies on SQL (Structured Query Language)SQL is the powerful, simple-to-use, and widely-applied database computing script. However, it has some native drawbacks: non-stepwise computation, incomplete set-lization, and no object reference available. Although almost all vendors have introduced and launched some non-compatible solution, such as various stored procedure like PL-SQLT-SQL. These improved alternatives cannot remedy the native SQL drawbacks.

esProc solves these drawbacks with more powerful computational capability, much lower technical requirement, and broader scope of application. It is a more convenient database computing scripts.
Case Description

A multinational retail enterprise needs to collect statistics on the newly opened retail store, including: How many new retail stores will open in this year? Of which how many companies have the sales over 1 million dollars? Among these companies with over-1-million sales, how many companies are abased overseas?

This question is progressive. The three questions are mutually related, the next question can be regarded as the further exploring on the current question, fit for step-by-step computation.

The original data is from the database of stores table with the main fields:storeCode, storeNameopenedTimeprofit, and nation. Let's check the SQL solution first.

SQL Solution
To solve such problem with SQL, you will need to write 3 SQL statements as given below.
l  SELECT COUNT(*) FROM stores WHERE to_char (openedTime, 'yyyy')  =  to_char (sysdate,'yyyy');
l  SELECT COUNT(*) FROM stores WHERE to_char (openedTime, 'yyyy')  =  to_char (sysdate,'yyyy') and profit>1000000;
l  SELECT COUNT(*) FROM stores WHERE to_char (openedTime, 'yyyy')  =  to_char (sysdate,'yyyy') and profit>1000000 and nation<>’local’;
SQL1:Get the result of question 1.

SQL2:Solve the problem 2.Because the step-by-step computation is impossible (that is, the results of previous computation cannot be utilized), you can only solve and take it as an individual problem.
SQL3: Solve the problem 3,and you are not allowed to compute in steps either.

esProc Solution

A1 cell: Get the records requested in problem 1.
A2 cell: Step-by-step computation. Operate on the basis of cell A1, and get the record meeting the conditions of problem 2.
A3 cell: Proceed with the step-by-step computation, and get the records requested in the problem 3.
B1B2, and B3 cell: It is still the step-by-step computation. Count the corresponding records.

For the SQL, there are 3 associations for you to compute in steps, and explore progressively. However, because step-by-step computation is hard to implement with SQL, this problem has to be divided into 3 individual problems.

esProc is to compute in steps following the natural habit of thinking: Decompose the general objective into several simple objective; Solve every small objective step by step; and ultimately complete the final objective.

In case that you proceed with the computation on the basis of the original 3problems, for example, seek "proportion of problem 3 taken in the problem 2", or"onproblem 3, group by country". As for esProc users, they can simply write ”=A3/A2”, and ”A3.group(nation)”. In each step, there is a brief and clear expression of highly readable, without any requirements on a strong technical background. By comparison, SQL requires redesigning the statement. The redesigned statement will undoubtedly become more and more complex and longer. Such job can only be left to those who have the advanced technical ability in SQL.

esProc can decompose the complex problem into simple computation procedure based on the descriptions from the business perceptive. This is just the advantage of the step-by-step computation. By comparison, SQL does not allow for computation by step or problem decomposition, and thus it is against the scientific methodology, and not fit for the complex computation.
Complete Set-lization
Case Description
A certain advertisement agency needs to compute the clients whose annual sales values are among the top 10.

The data are from the sales table, which records the annual sales value of each client with the fields like customer, time, and amount.

SQL solution
SELECT customer
    SELECT customer
    FROM (
         SELECT customer,RANK() OVER(PARTITION BY time ORDER BY amount DESC) rankorder 
         FROM  sales ) 
    WHERE rankorder<=10) 
GROUP BY customer
Such Problem requires ranking the sets of a set, that is, group by “time” and then rank by “customer” in the group. Since the popular SQL-92 syntax is still hard to represent this, the SQL-2003 standard, which is gradually supported by several vendors, will be used to solve this problem barely.

Just a tip to compute the customer intersections in the last step, the count of years equals to the count of clients.

esProc Solution

A1: Group the original dataset by year so that A1 will become a set of sets.
B1: Get the serial number of records whose sales values are among the top 10 of each group. The rank() is used to rank in every group, and pselect() can be used to retrieve the serial number on conditions. ~ is used to represent every member in the set. B1 is the “set of set”.
A2: Retrieve the record from A1 according to the serial number stored in B2, and get the customer field of the record.
A3: Compute the intersection of sets.


The SQL set-lization is incomplete and can only be used to represent the simple result set. Developers cannot use SQL to represent the concept of “set of set”. Only the queries of 3-level-nested-loops are available to barely perform the similar computations. In addition, SQL cannot be used to perform the intersection operation easily that developers with advanced techniques can only resort to the unreadable statements to perform the similar operations, such as “count of years equal to the count of clients”. It equals to compute the intersection of client sets.
The set is the base of massive data. esProc can achieve set-lization completely, represent the set, member, and other related generic or object reference conveniently, and perform the set operations easily, such as intersection, complement, and union.

When analyzing the set-related data, esProc can greatly reduce the computation complexity. By taking the advantage of set, esProc can solve many problems agilely and easily that are hard to solve with SQL.

Case Description
Suppose that a telecommunication equipment manufacturer needs to compute the monthly link relative ratio of sales value (i.e. the increase percent of sales value of each month compared with that of the previous month). The sales data is stored in the sales table with the main fields including salesMonth, and salesAmount.

SQL solution
select salesAmount, salesMonth,
        (case when
prev_price !=0 then ((salesAmount)/prev_price)-1
else 0
end) compValue
from (select salesMonth, salesAmount,
lag(salesAmount,1,0) over(order by salesMonth) prev_price
from sales) t

The popular SQL-92 has not introduced the concept of serial number, which adds many difficulties to the computation. Considering this, the designer of SQL-2003 has partly remedied this drawback. For example, the window function lag() is used to retrieve the next record in this example.
In addition, in the above statement, the “case when” statement is used to avoid the error of division by zero on the first record.

esProc Solution
sales.derive(salesAmount / salesAmount [-1]-1: compValue)

The derive() is an esProc function to insert the newly computed column to the existing data. The new column is compValue by name, and the algorithm is “(Sales value of this month/Sales value of previous month)-1”. The “[n]” is used to indicate the relative position, and so [-1] is to represent the data of the previous month.

On the other hand, for the data of the first record, the additional procedure for division by zero is not required in esProc.

From the above example, even if using SQL-2003, the solution to such problem is lengthy and complex, while the esProc solution is simple and clear owing to its support for the ordered set.

Moreover, SQL-2003 only provides the extremely limited computation capability. For example, esProc user can simply use the ”{startPosition,endPosition}” to represent the seeking of a range, and simply use ”(-1)” to represent the seeking of the last record. Regarding the similar functionality, it will be much harder for SQL user to implement.

In the practical data analysis, a great many of complex computations are related to the order of data. SQL users are unable to handle such type of computations as easily as esProc users because SQL lacks of the concept of Being Ordered.

An insurance enterprise has the below analysis demands: to pick out the annual outstanding employees (Employee of the Year) whose Department Manager has been awarded with the President Honor. The data are distributed in two tables: department table (main fields are deptName, and manager), and employee table (main fields are empName, empHonor, and empDept).

empHonor has three types of values: null value; ”president's award”, PA for short; and ”employee of the year”, EOY for short. There are 2 groups of correspondence relations: empDept and deptName, and Manager and empName.

SQL solution
FROM employee A,department B,employee C 
WHERE A.empDept=B.deptName AND B.manager=C.empName AND A.empHonor=EOY AND C.empHornor=PA

SQL users can use the nested query or associated query to solve such kind of problems. In this case, we choose the association query that is both concise and clear. The association statement behind the “where” has established the one-to-many relation between deptName and empDept, and the one-to-one relation between manager and empName.

esProc Solution


esProc solution is intuitive: select the employee of “EOY” whose Department Manager has be awarded with “PA”.

The SQL statement to solve such kind of question is lengthy and not intuitive. In fact, the complete association query language is “inner join…on…” style. This statement is simplified in the above example. Otherwise it will be much hard to understand.

esProc users can use ”.” for object reference. Such style is intuitive and easy to understand. The complex and lengthy association statement for multiple tables can thus be converted to the simple object access, which is unachievable for SQL. When there are more and more tables, the complexity of SQL association query will rise in geometric series. By comparison, the esProc user can always access the data intuitively and easily by taking the advantage of object reference.

Regarding the multi-table associations of complex computation, esProc can handle it more intuitively and conveniently than SQL.

From the comparison of the above four examples, we can see that esProc is not only characterized with step-by-step computation, complete set-lization, sorted sets, and object reference. The analysis style is intuitive, the syntax style is agile, and the function is powerful. esProc is a tool especially designed for mass data computation, and a more convenient database computing script.

About esProc: http://www.raqsoft.com/product-esproc