June 20, 2013

New Algorithm Helps Java with Massive Structured Data Computing


The Java language has no competitive advantage in data computing, in particular massive structured data computing. For example, given order detail data, suppose we need to find the salespersons whose sales grew by more than 10% in each of 3 consecutive months.

Java has no built-in advanced functions for this, so it is hard to handle such a computation with Java alone. A large amount of time and effort goes into implementing the details by hand: first, define classes and represent every piece of data as an object; second, use a List to store the multiple pieces of data; third, use nested multi-level loops to compute. Apart from sorting, almost every massive data processing algorithm involved in the computation, such as aggregating, filtering, and grouping, has to be implemented manually. Such computations usually involve set operations and relational operations over massive data, or computations on the relative positions of objects and object attributes. Implementing the underlying logic for these computations takes great effort.
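The manual steps described above can be sketched roughly as follows. This is only an illustration, not code from any real project: the class name `SalesGrowth`, the `Order` fields, and the method `findSteadyGrowers` are all hypothetical, and the sketch assumes that `month` is a running month index and that every salesperson has sales in every month.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class SalesGrowth {

    /** One order detail row; the field names are assumptions for this sketch. */
    static class Order {
        final String salesMan;
        final int month;       // assumed to be a running month index, e.g. 1..12
        final double amount;

        Order(String salesMan, int month, double amount) {
            this.salesMan = salesMan;
            this.month = month;
            this.amount = amount;
        }
    }

    /** Returns salespersons whose monthly total grew by more than 10% three times in a row. */
    static List<String> findSteadyGrowers(List<Order> orders) {
        // Grouping and aggregating by hand: salesMan -> (month -> total amount).
        Map<String, TreeMap<Integer, Double>> totals =
                new TreeMap<String, TreeMap<Integer, Double>>();
        for (Order o : orders) {
            TreeMap<Integer, Double> byMonth = totals.get(o.salesMan);
            if (byMonth == null) {
                byMonth = new TreeMap<Integer, Double>();
                totals.put(o.salesMan, byMonth);
            }
            Double sum = byMonth.get(o.month);
            byMonth.put(o.month, (sum == null ? 0.0 : sum) + o.amount);
        }

        // Nested loop: walk each salesperson's months in order and count the growth streak.
        List<String> result = new ArrayList<String>();
        for (Map.Entry<String, TreeMap<Integer, Double>> e : totals.entrySet()) {
            int streak = 0;
            Double previous = null;
            for (double current : e.getValue().values()) {
                boolean grew = previous != null && previous > 0
                        && current / previous - 1 > 0.10;
                streak = grew ? streak + 1 : 0;
                if (streak >= 3) {
                    result.add(e.getKey());
                    break;
                }
                previous = current;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Order> orders = new ArrayList<Order>();
        double[] ann = {100, 120, 150, 200};   // grows by more than 10% every month
        double[] bob = {100, 105, 130, 160};   // first month-over-month growth is only 5%
        for (int m = 0; m < 4; m++) {
            orders.add(new Order("Ann", m + 1, ann[m]));
            orders.add(new Order("Bob", m + 1, bob[m]));
        }
        System.out.println(findSteadyGrowers(orders));
    }
}
```

Even in this reduced form, the grouping, aggregation, and relative-position logic (comparing each month with the previous one) all have to be spelled out explicitly, which is the burden the paragraph above describes.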

That is why we must improve Java's computational capability. We need a tool tailored to making structured data computation easy!

How about SQL? Not every Java application can use a database. In addition, a lot of data lives in text and Excel files, and problems with cross-database computation and code reuse may arise. Moreover, SQL is still inconvenient for many computations. Taking the above-mentioned computation as an example, the SQL is by no means easy to compose:

-- rising_range is the month-over-month growth rate (0.1 means 10%)
WITH A AS
      (SELECT salesMan, month,
              amount / lag(amount)
                  OVER (PARTITION BY salesMan ORDER BY month) - 1 rising_range
         FROM sales),
     B AS
      (SELECT salesMan,
              CASE WHEN rising_range > 0.1 AND
                        lag(rising_range) OVER (PARTITION BY salesMan
                             ORDER BY month) > 0.1 AND
                        lag(rising_range, 2) OVER (PARTITION BY salesMan
                             ORDER BY month) > 0.1
                   THEN 1 ELSE 0 END is_three_consecutive_month
         FROM A)
SELECT DISTINCT salesMan FROM B WHERE is_three_consecutive_month = 1

In this case, esProc is a better choice.

esProc is a development tool for database computing that specializes in simplifying complex computations and is convenient to integrate with Java. For esProc, the corresponding script is shown below:


esProc can retrieve data directly from multiple databases, text files, and Excel sheets and compute across them. Its grid style and agile syntax are designed specifically for massive structured data computation. It supports external parameters, and the result can be returned directly via JDBC. So, with esProc, the computational capability of Java is dramatically improved. In addition, esProc natively supports cross-database computation and code reuse, and it provides complete debugging functions. As a result, development productivity with esProc is also higher than with SQL.

1 comment:

  1. I am not sure about java but SQL does seem to give me headaches! I use STATISTICA which is excellent with both structured/unstructured data.
