esProc, A Script Language for Data Analytics with Parallel Mechanism: June 2013

June 26, 2013

Computing with Java for Massive Data without Database

Many Java applications run without a database. How, then, can such an application query or compute on structured data? For example, given an Excel sheet downloaded from a finance website, find the shares that rose for N consecutive days in a certain period.

For computation on structured data, programmers usually embed SQL statements in Java code and access the database server via JDBC. SQL has many structured-data algorithms built in, whereas Java lacks the high-level functions to implement these operations directly. Without a database, it is therefore quite hard to implement such computation with Java alone.

Programmers must spend a great deal of time and effort implementing every detail of the computation by hand. Apart from sorting, almost all algorithms for massive data computing, such as aggregating, filtering, and grouping, have to be written manually. Typically, one defines a class, represents every record as an object, stores the records in a List, and then computes through nested multi-level loops. Such computations usually also involve set and relational operations over massive data, or depend on the relative positions of objects or object properties. Implementing this underlying logic is quite cumbersome.
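To make the amount of manual work concrete, here is a minimal sketch of the consecutive-rise example in plain Java. The `Quote` and `ConsecutiveRise` names, the field layout, and the sample prices are assumptions made for illustration; reading the Excel sheet is left out, and "rising for N consecutive days" is taken to mean N closing prices in a row that are each higher than the previous one.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical record type: one already-parsed row of the downloaded sheet.
class Quote {
    final String code;   // share code
    final int day;       // trading day encoded as yyyymmdd (assumption)
    final double close;  // closing price

    Quote(String code, int day, double close) {
        this.code = code;
        this.day = day;
        this.close = close;
    }
}

class ConsecutiveRise {
    // Returns the codes of shares whose close rose on at least n consecutive trading days.
    static List<String> risingFor(List<Quote> quotes, int n) {
        // Manual "group by share code".
        Map<String, List<Quote>> byCode = new LinkedHashMap<>();
        for (Quote q : quotes) {
            byCode.computeIfAbsent(q.code, k -> new ArrayList<>()).add(q);
        }
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, List<Quote>> e : byCode.entrySet()) {
            List<Quote> rows = e.getValue();
            rows.sort(Comparator.comparingInt(q -> q.day)); // order by trading day
            int streak = 0;
            // Relative-position computation: compare each row with the one before it.
            for (int i = 1; i < rows.size(); i++) {
                streak = rows.get(i).close > rows.get(i - 1).close ? streak + 1 : 0;
                if (streak >= n) {
                    result.add(e.getKey());
                    break;
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample data, for illustration only.
        List<Quote> quotes = new ArrayList<>();
        double[] a = {10.0, 10.2, 10.5, 10.9};
        double[] b = {20.0, 20.5, 20.1, 20.3};
        for (int i = 0; i < a.length; i++) {
            quotes.add(new Quote("AAA", 20130603 + i, a[i]));
            quotes.add(new Quote("BBB", 20130603 + i, b[i]));
        }
        System.out.println(risingFor(quotes, 3)); // prints [AAA]
    }
}
```

Even this small case needs hand-written grouping, sorting, and a positional loop, which is the kind of underlying logic the paragraph above describes.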

Embedding a database and then performing ETL is obviously awkward. Is there a more agile and convenient method?

In this case, esProc is the best choice. It is a professional database computing and development tool.

esProc is good at simplifying complex computation, and it allows a Java application to access esProc results via JDBC. The esProc solution to this case is given below:

esProc can directly retrieve data from, and compute on, multiple databases, text files, and Excel sheets. It offers a grid style and an agile syntax specially tailored for massive structured data computation. It supports external parameters, and the result can be exported via JDBC and invoked by Java programs and reporting tools. So, esProc can boost the computational capability of Java dramatically. In addition, it enables cross-database computation and supports code reuse by nature, and its debugging functionality is also very good. Considering all these advantages, esProc is more efficient to develop with than SQL for computations of this kind.

June 20, 2013

New Algorithm Helps Java for Massive Structured Data Computing

Java has no competitive advantage in data computing, in particular massive structured data computing. For example, from the order details, we need to find the salespersons whose sales grew by more than 10% in each of 3 consecutive months.

Java has no high-level function for this, so it is hard to handle such computation with Java alone. Programmers need a large amount of time and effort to write the details of the computation by hand: first, define classes and represent every record as an object; second, store the records in a List; third, compute through nested multi-level loops. Apart from sorting, almost all massive data processing algorithms involved, such as aggregating, filtering, and grouping, require manual implementation. Such computations usually involve set and relational operations over massive data, or depend on the relative positions of objects and object attributes. Implementing the underlying logic for these computations takes great effort.
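The three steps described above can be sketched in plain Java as follows. This is only an illustration under stated assumptions: the `OrderLine` and `SalesGrowth` names are hypothetical, months are encoded as `yyyymm` integers, every salesperson is assumed to have orders in every month (no gaps), monthly totals are assumed to be non-zero, and the sample figures are made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical order-detail row.
class OrderLine {
    final String salesMan;
    final int month;      // encoded as yyyymm (assumption)
    final double amount;

    OrderLine(String salesMan, int month, double amount) {
        this.salesMan = salesMan;
        this.month = month;
        this.amount = amount;
    }
}

class SalesGrowth {
    // Salespersons whose monthly total grew by more than minGrowth
    // in each of `months` consecutive months.
    static List<String> growingSalesmen(List<OrderLine> lines, double minGrowth, int months) {
        // Manual grouping and aggregation: salesperson -> (month -> total).
        // TreeMap keeps the months in ascending order.
        Map<String, TreeMap<Integer, Double>> totals = new TreeMap<>();
        for (OrderLine l : lines) {
            totals.computeIfAbsent(l.salesMan, k -> new TreeMap<>())
                  .merge(l.month, l.amount, Double::sum);
        }
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, TreeMap<Integer, Double>> e : totals.entrySet()) {
            int streak = 0;
            Double previous = null;
            // Relative-position computation: compare each month with the previous one.
            for (double current : e.getValue().values()) {
                if (previous != null) {
                    streak = current / previous - 1 > minGrowth ? streak + 1 : 0;
                    if (streak >= months) {
                        result.add(e.getKey());
                        break;
                    }
                }
                previous = current;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<OrderLine> lines = new ArrayList<>();
        // Ann: 100, 120, 150, 200 (January is split across two orders).
        lines.add(new OrderLine("Ann", 201301, 60));
        lines.add(new OrderLine("Ann", 201301, 40));
        lines.add(new OrderLine("Ann", 201302, 120));
        lines.add(new OrderLine("Ann", 201303, 150));
        lines.add(new OrderLine("Ann", 201304, 200));
        // Bob: 100, 105, 130, 150 (only 5% growth in February).
        lines.add(new OrderLine("Bob", 201301, 100));
        lines.add(new OrderLine("Bob", 201302, 105));
        lines.add(new OrderLine("Bob", 201303, 130));
        lines.add(new OrderLine("Bob", 201304, 150));
        System.out.println(growingSalesmen(lines, 0.10, 3)); // prints [Ann]
    }
}
```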

That is why the computational capability of Java must be improved. We need a tool tailored to make structured data computation easy to implement!

How about SQL? Not every Java application can use a database. In addition, much data resides in text or Excel files, and problems of cross-database computation and code reuse sometimes arise. Moreover, SQL is still inconvenient for many computations. Taking the above computation as an example, the SQL is by no means easy to compose:

-- rising_range is the growth rate (ratio minus 1), so growth over 10% means rising_range > 0.1
WITH A AS
  (SELECT salesMan, month, amount/lag(amount)
          OVER(PARTITION BY salesMan ORDER BY month)-1 rising_range
   FROM sales),
B AS
  (SELECT salesMan,
          CASE WHEN rising_range>0.1 AND
                    lag(rising_range) OVER(PARTITION BY salesMan
                                           ORDER BY month)>0.1 AND
                    lag(rising_range,2) OVER(PARTITION BY salesMan
                                             ORDER BY month)>0.1
               THEN 1 ELSE 0 END is_three_consecutive_month
   FROM A)
SELECT DISTINCT salesMan FROM B WHERE is_three_consecutive_month=1

In this case, esProc is the better choice.

esProc is a development tool for database computing that specializes in simplifying complex computation and is easy to integrate with Java. For esProc, the corresponding scripts are shown below:

esProc allows direct retrieval and computation across multiple databases, text files, and Excel sheets. Its grid style and agile syntax are designed especially for massive structured data computation. It supports external parameters, and the result can be exported directly via JDBC. So, with esProc, the computational capability of Java is dramatically improved. In addition, esProc supports cross-database computation and code reuse by nature, and it has very good debugging functions. As a result, development productivity with esProc is also higher than with SQL.

June 14, 2013

Why Was esProc Created?

Data computing is widely used, and business users hope to complete data computation independently. Although SQL, R, Java, C, and other current solutions have powerful computational ability, coding complex computations with them is rather cumbersome (R is better in this respect, but too difficult to understand).

Data computing demands are both common and complex

Data computing is widely used
Abundant data exists in databases but is difficult to compute on directly

Data analysis and query are essential forms of data computing. Report data source preparation, data management, and ETL also involve data computing. Most of the problems are complex and diverse, and business computing is usually characterized by timeliness and unpredictability. The objects of computation change constantly and may arise at any time. Users hope to handle such data computation conveniently.

Stock Rise to the Limit for 3 Consecutive Days in a Month

Settle Outstanding Traffic Fines and Late Fee

Current Solution: SQL (or MDX)

Advantage: Enough computational ability to handle the structured data
Disadvantage: Difficult to program and understand

SQL provides comprehensive computational ability for massive structured data. However, SQL does not support step-by-step computation, and it lacks explicit set data, ordered sequences, and object references. SQL completes a computation in a way that is unnatural to human thinking, which makes it harder to write and understand.

Related Readings:
The Disadvantages of SQL Computation

Example & Comments for SQL Computation Disadvantage

Bid Farewell to Stored Procedure

Current Solution: High-level Programming Languages

Advantage: Powerful enough to control the procedure
Disadvantage: Complex application environment
Disadvantage: No direct support for structured data, so coding complexity is very high

Java, C#, C++, and other high-level programming languages have complete mechanisms for branching and looping, and they are very flexible in terms of data computation. However, their application environments are too complex. In addition, they do not support massive structured data well: it is inconvenient to operate directly on records, sets, datasets, and other such data types.

Current Solution: R Language

Advantage: open-source and massive library functions
Disadvantage: difficult to understand and higher technical requirement

R boasts an elegant, agile syntax and an open interface for secondary development, so there are a great number of third-party packages. But R lacks a good user interface, and a strong technical background and expertise are required to master it. R is also not specialized for structured data computing, and its support in this area is not elaborate enough.

Syntax Agility Comparison: R Language vs. esProc

June 6, 2013

Intelligent Formula Copy Brings Flexible Spreadsheet Calculation

Spreadsheets are popular with business users for their simplicity and usability. But it is a pity that some common computations are still tough to solve with spreadsheets. Inter-row computation on summary values is one such tough problem.

Given the order data below, how do we calculate the rate of sales increase for each month?

Obviously, the rate of increase in February should be (D58-D2)/D2, which is a typical inter-row computation. The difficulty is that traditional business spreadsheet software only allows such formulas to be entered manually. Copying or dragging the formula to other cells leads to wrong results. For example, copy the formula to the cell for March, as shown in the figure below:

A 75-fold increase? Obviously wrong! The correct formula should be (D113-D58)/D58, while the formula produced by copying is (D113-D57)/D57. The reason is that traditional business spreadsheet software only pastes formulas rigidly according to relative positions and lacks an intelligent adjustment mechanism.
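The arithmetic behind the wrong result can be reproduced with a short sketch of a position-based copy. It assumes the formula sits in the February summary row (row 58) and is pasted into the March summary row (row 113), so every row reference moves down by 113 - 58 = 55 rows. The `RelativeCopy` class is a hypothetical illustration of relative copying in general, not the code of any particular spreadsheet product, and it handles only simple references such as D58.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RelativeCopy {
    // A simple cell reference: column letters followed by a row number.
    private static final Pattern CELL = Pattern.compile("([A-Z]+)(\\d+)");

    // Shifts the row number of every cell reference by rowOffset,
    // which is what a plain position-based copy does.
    static String shiftRows(String formula, int rowOffset) {
        Matcher m = CELL.matcher(formula);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int row = Integer.parseInt(m.group(2)) + rowOffset;
            m.appendReplacement(out, m.group(1) + row);
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // February summary row 58 -> March summary row 113.
        System.out.println(shiftRows("(D58-D2)/D2", 113 - 58)); // prints (D113-D57)/D57
    }
}
```

D58 correctly becomes D113, but D2 becomes D57, the row just above the February summary, rather than D58.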

Obviously, if the data volume is huge, entering all the formulas manually is painful and error-prone.

Then, let us talk about esCalc. As brand-new business spreadsheet software with a reputation for great computing capability, esCalc is expected to handle this problem well.

The same data is shown below:

In esCalc, you only need to input the formula once to solve this problem! For example, enter (D58-D2)/D2 for February, and the result is shown below:

No doubt you have seen that all the computations are finished by entering the formula just once, with no need to copy or adjust it. Take the formula for March, for example: it is (D113-D58)/D58, just as expected.

esCalc boasts a unique homocell model, which arranges cells not by simple relative position but by automatically established business associations. The immediate benefit is that formulas are copied automatically: a formula is copied and pasted to the cells at the same business level. In the case above, the corresponding cells in the summary rows of the February, March, and April bands are homocells of one another. Therefore, the formula written in the February cell is copied and pasted to the corresponding cells for March and April.

Needless to say, this copying is not the relative-position shift of traditional business spreadsheet software. It is a kind of Intelligent Migration: when the formula for February is migrated to the homocell for March, it becomes (D113-D58)/D58, as mentioned above.
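One way to picture the difference is to move references by band rather than by row. The sketch below is a toy model written for this comparison, not esCalc's actual implementation: the `BandCopy` class and its parameters are hypothetical, and the summary rows 2, 58, and 113 are inferred from the formulas quoted in this post.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BandCopy {
    private static final Pattern CELL = Pattern.compile("([A-Z]+)(\\d+)");

    // summaryRows lists the summary row of each monthly band, in month order.
    // A reference to a summary row is moved by (toBand - fromBand) bands;
    // any other reference is left unchanged in this toy model.
    static String migrate(String formula, List<Integer> summaryRows, int fromBand, int toBand) {
        Matcher m = CELL.matcher(formula);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int band = summaryRows.indexOf(Integer.parseInt(m.group(2)));
            if (band < 0) {
                m.appendReplacement(out, m.group());
                continue;
            }
            int target = band + (toBand - fromBand);
            if (target < 0 || target >= summaryRows.size()) {
                throw new IllegalArgumentException("no band for reference " + m.group());
            }
            m.appendReplacement(out, m.group(1) + summaryRows.get(target));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        List<Integer> summaryRows = Arrays.asList(2, 58, 113); // Jan, Feb, Mar
        // Migrate the February formula (band 1) to March (band 2).
        System.out.println(migrate("(D58-D2)/D2", summaryRows, 1, 2)); // prints (D113-D58)/D58
    }
}
```

Because the reference to the previous month's summary is resolved through the band structure, it lands on D58 instead of D57.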

Through automatic pasting and intelligent migration of formulas, esCalc removes a great amount of manual work. Because the process is automatic, the chance of errors is also greatly reduced.

Seeing is believing. The computational capability of esCalc, as legend has it, is truly powerful. Let Excel tremble before esCalc!

June 3, 2013

What is esProc? Developer Tool for Business Computing!

esProc is a developer tool for business computing as well as a desktop analysis application. It specializes in computation on structured data and in complicated multi-step computation, to meet fast-changing demands.

Developer tool for business computing
esProc is a developer tool for data computing that offers higher development efficiency, better debugging features, easier code maintenance, and Big Data support. It is a database computing script language with a more advanced core model, and it specializes in complex computing goals.

Independent data-computing layer for Java application
The JDK provides few functions for structured data computing, while esProc can effectively enhance the ability of Java in this respect. esProc provides a richer and more complete system of structured data computing than SQL, easily meets a variety of complex computing demands, and integrates seamlessly with the main program, embedded into Java applications through a standard JDBC interface.

esProc separates complex computations from applications and databases, thus effectively reducing the burden on the database (the cost of database expansion is high). esProc can also be applied where there is no database but batch computing is still needed.

Data source for reporting tool
For a system where Java reporting tool is adopted, esProc is ideal to perform the complex computation, compute with multiple data sources, and clean the dirty data sources. The reporting tool can receive the result returned by esProc via JDBC by taking esProc as a database.

esProc supports various data sources, such as databases accessed through JDBC and non-database sources like Excel and text files. esProc can access data from multiple, diverse data sources for interactive computing. The result is exported as a single data source that can be invoked by reporting tools or other external applications.

Related Articles:
Called by Reporting Tool via JDBC
Fit for the Heterogeneous Data Environment

Lightweight ETL
Although esProc is not a professional ETL tool, it can save you from cumbersome SQL and stored procedures and provide the application system with ready-to-use data. esProc has powerful data processing ability: messy data can be made clean and usable through Extraction, Transformation, and Load (ETL).

Desktop BI Tool
esProc is a database script language whose agile, easy-to-use statements support interactive analysis of structured data. It is especially good at dealing with complex, flexible, or occasional data.

Plug-and-use, without any deployment
esProc is a desktop BI tool that helps users complete a series of computations independently, especially for complex analytic goals.

Related Readings:
A Desktop Application of Plug and use Design

Without modeling beforehand
esProc does not require data modeling in advance, so users can conduct data analysis freely. It can conveniently reference and process prior calculation results to carry out complicated multi-step analysis, and it is capable of real-time computing and analysis.

Related Readings:
Prepare Test Data for Sales Management System
Statistics on Sales Values of the Top 3 Salespersons Distributed in Respective Product Categories
