esProc, A Script Language for Data Analytics with Parallel Mechanism: July 2013

July 31, 2013

Five Useful Computation Layers Solve Java’s Data Calculation

The data computation layer sits between the data persistence layer and the application layer. It computes on data retrieved from the persistence layer and returns the result to the application layer. In Java, the computation layer aims to reduce the coupling between these two layers and to take the computational workload off them. A typical computation layer has the following features:

1. Ability to compute on data from arbitrary persistence layers: not only databases, but also non-database sources such as Excel, Txt, or XML files. Of all these computations, the key one is computation on structured data, the most common kind.

2. Ability to perform mixed computations across various data sources in a uniform way, including computation between different databases as well as between databases and non-database sources.

3. The coupling between the database and the computation layer, and between the computation layer and Java code, is kept as low as possible to facilitate migration.

4. The architecture need not be Java, but it should integrate with Java conveniently.

5. High development efficiency in terms of scripting, readability, debugging, and daily maintenance.

6. Direct support for the growing demand for complex computation and big data computation.
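To make the idea of low coupling in the list above more concrete, here is a minimal Java sketch. Everything in it is hypothetical and only for illustration: the `ComputationLayer` interface, the `InMemoryLayer` class, the computation name `ordersAbove`, and the sample rows are assumptions, not part of any of the products compared below. The point is only that the application code depends on an interface, while the data source behind it could be a database, a file, or a script engine.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative sketch: the application depends only on an interface, not on the data source. */
final class ComputationLayerSketch {

    /** Hypothetical contract between the application layer and the computation layer. */
    interface ComputationLayer {
        List<Map<String, Object>> compute(String name, Map<String, Object> params);
    }

    /** Toy implementation over in-memory rows; a real one might sit on JDBC, files, or a script engine. */
    static final class InMemoryLayer implements ComputationLayer {
        private final List<Map<String, Object>> orders;

        InMemoryLayer(List<Map<String, Object>> orders) {
            this.orders = orders;
        }

        @Override
        public List<Map<String, Object>> compute(String name, Map<String, Object> params) {
            if (!"ordersAbove".equals(name)) {
                throw new IllegalArgumentException("Unknown computation: " + name);
            }
            double min = ((Number) params.get("minAmount")).doubleValue();
            List<Map<String, Object>> result = new ArrayList<>();
            for (Map<String, Object> row : orders) {
                if (((Number) row.get("amount")).doubleValue() > min) {
                    result.add(row);
                }
            }
            return result;
        }
    }

    /** Builds one sample row; the column names are made up for this sketch. */
    static Map<String, Object> row(String client, double amount) {
        Map<String, Object> r = new HashMap<>();
        r.put("client", client);
        r.put("amount", amount);
        return r;
    }

    static List<Map<String, Object>> sampleOrders() {
        return Arrays.asList(row("Acme", 250.0), row("Bolt", 80.0), row("Core", 120.0));
    }

    public static void main(String[] args) {
        ComputationLayer layer = new InMemoryLayer(sampleOrders());
        Map<String, Object> params = new HashMap<>();
        params.put("minAmount", 100.0);
        // The caller sees only rows; where they came from is hidden behind the interface.
        for (Map<String, Object> r : layer.compute("ordersAbove", params)) {
            System.out.println(r.get("client") + " " + r.get("amount"));
        }
    }
}
```

Swapping `InMemoryLayer` for another implementation would leave the calling code unchanged, which is the kind of migration freedom features 3 and 4 ask for.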

In this survey, five data computation layers (Hibernate, esProc, SQL, iBATIS, and the R language) are compared on the following metrics: maturity, low coupling, scripting, integration, UI friendliness, performance, complex computation, support for big data, non-database computation, cross-database computation, and convenience for debugging.

Hibernate

Hibernate is a lightweight ORM framework created by Gavin King and now owned by JBoss. It is an outstanding computation layer in non-distributed environments such as an intranet. Hibernate provides a fully object-based access mode, while esProc and iBATIS can only be regarded as semi-object or object-like.
esProc is a newly emerged Java development tool designed especially for complex cross-database computations. Unlike other data computation layers, esProc treats SQL only as a data source. Once data is retrieved through SQL, the computation in esProc is completely independent of SQL. By comparison, JPA/Hibernate is forced to open an interface to SQL for the computations it cannot handle itself.
Hibernate almost completely decouples computational scripts, Java code, and the database. Unfortunately, it still relies heavily on SP/SQL in many respects because of its limited computational capability.

Besides, EJB JPA is one of the computation layer specifications. Since Hibernate is effectively a JPA implementation, we will not dwell on JPA here.

Maturity: 4 stars. With more than a decade of market testing, Hibernate is already well developed and mature.

Low coupling: 4 stars. Hibernate was introduced for this advantage. But native SQL is still unavoidable, so a perfect migration is hard to achieve with Hibernate.

Scripting: 2 stars. Hibernate's computation modes are object reference and HQL. The former deserves 5 stars since it is quite easy. The latter gets 2 stars because it is harder to learn than SQL, extremely tough to debug, and less efficient than SQL. Because some computations unavoidably fall back on SQL, users face the added challenge of combining the two languages. Let’s give it 2 stars and overall 3 stars on average.

Integration: 2 stars. Hibernate is built on a pure Java architecture. To integrate it, just copy the jar package and several mapping files, and take care to manage the session properly. It is easy to start with, but far more difficult once the mandatory Hibernate cache comes into play, which demands such strong architectural design skills that ordinary programmers are not encouraged to explore it. Needless to say, this inherent disadvantage of ORM does not exist in the other computation layers.

UI friendliness: 0 stars. Hibernate provides an object generator but lacks the most important piece, a graphical design interface for HQL. Almost no usable GUI is available.

Performance: 3 stars. Its support for L3 cache is less capable than that of SQL. Overall, it reaches about 60% of SQL's performance in my personal experience.

Complex computation: 0 stars. No support is available for complex computation; you will need SQL or external tools.

Support for big data: 1 star. No direct support is available for the Hadoop architecture, although relevant research has been carried out.

Non-database computation: 0 stars. No direct support is available for non-database computation.

Cross-database computation: 0 stars. No direct support is available for cross-database computation. Each HQL statement supports only a single database.

Convenience for debugging: 0 stars. The awkwardness of debugging is a fatal drawback for programmers.

esProc:

Maturity: 1 star. Only one year has passed since it entered the marketplace. The breadth and depth of its application lag behind the other data computation layers.

Low coupling: 4 stars. Its scripts are independent of databases and Java code, and its algorithms are not tied to any concrete database, which amounts to low coupling. Migration between databases can be implemented easily. Because the output interface is JDBC, migration to reporting tools is easy as well. This is an unmatched feature compared with the other computation layers.

Scripting: 4 stars. It allows scripting in a grid, referencing results by cell name, verifying the result directly at each step, and decomposing a complex goal into several simple steps. Its syntax focuses on object reference but is not object-only. Unlike the descriptive style of SQL statements, this brand new experience requires some knowledge of esProc. Java programmers will make their own choices based on their own experience.

Integration: 5 stars. A pure Java architecture is adopted by esProc. The JDBC interface is offered for easy integration by any database programmer without esProc background.

UI friendliness: 4 stars. The standalone graphical editor is intuitive and easy to use. However, the helpdesk system is not friendly enough.

Performance: 2 stars. All computations are performed in memory, so it is not recommended for very large amounts of data.

Complex computation: 5 stars. This is the purpose esProc was developed for.

Non-database computation: 3 stars. Support for Excel/Txt, but not XML or Web Service.

Support for big data: 4 stars. HDFS access is offered. In addition, parallel computing is said to be supported, but the details have yet to be revealed.

Cross-database computation: 5 stars. esProc syntax is independent from the specific database, and supports the cross-database computation by nature.

Convenience for debugging: 5 stars. The debugging function is excellent and quite convenient to use. Since it supports verifying computation steps at the finest granularity, no other computation layer offers such convenience.

SQL

SQL, SP, and JDBC are of one kind. They are old computation layers that boast performance and flexibility. However, with the continuing growth of Java, the data explosion, and the emergence of complex problems, SQL alone can hardly satisfy today's needs. Although SQL does not achieve many high scores, the metrics on which it scores well generally carry the highest weight.

Maturity: 5 stars. The most mature one.

Low coupling: 0 stars. High coupling! Except in lab use, it’s nearly impossible to write computational scripts that are independent of the database and the code.

Scripting: 3 stars. SQL is actually very difficult to write and maintain, and costs users a great deal of time to learn. Fortunately, SQL is mature and well established, and there are many forums with rich content. Hibernate became popular because it addresses the incompatibility between various databases.

Integration: 5 stars. Connecting to a database through JDBC is the first lesson Java programmers are taught.

UI friendliness: 5 stars. Abundant SQL development tools with high maturity are available. I myself have tried more than 10 tools.

Performance: 5 stars. Databases support this language directly with the highest performance.

Complex computation: 3 stars. SQL is a good fit for ordinary computation. Complex problems can also be solved with SQL, though the procedure is unavoidably tough, while Hibernate cannot cope with them at all. Stored procedures bring no great improvement in practice. The code is hard to split, mainly because a complex goal is hard to decompose into several simple steps.

Support for big data: 1 star. Several database vendors have declared their support for big data. This adds to the incompatibility of SQL statements, and I have not seen a successful case yet.

Non-database computation: 1 star. No direct support is available. ETL or a data warehouse can achieve the goal at great cost.

Cross-database computation: 1 star. Several databases support it, but the performance is poor. In addition, middleware such as DBLink and Linked Server can barely support it, and falls far short of “arbitrary and convenient”.

Convenience for debugging: 1 star. It is quite hard to debug and check intermediate results. Users can check the final result only after the whole script has run. The only workaround is “programming with debugging”, that is, deliberately creating a great number of temporary tables along the way.

iBATIS:

This computation layer is powerful for its simplicity and agility. Unlike Hibernate, iBATIS encourages programmers to write SQL statements, so its learning cost is the lowest. In addition, iBATIS decouples computational scripts from Java code at the least cost: 80% of Hibernate's functions at 20% of the cost. The remaining unsolved 20% is the decoupling of computational scripts from the database.

Its weak point is complex computing environments, for example distributed computation, complex computation, non-database computation, and cross-database computation.

Maturity: 4 stars. iBATIS is a proven framework with years of market testing. It is my favorite framework, though it still has the drawback of insufficient cache support.

Low coupling: 2 stars. SQL can be replaced seamlessly. However, it is still SQL for a specific database. In fact, the latter is a database-related problem. Since vendors are determined to retain their customers, the incompatibility of SQL dialects ensures that migration stays hard. Programmers, on the other hand, always look for the freedom to migrate by whatever means.

Scripting: 3 stars. It is SQL.

Integration: 4 stars. Basically, it’s not difficult. It takes the beginner half a day to grasp it well.

UI friendliness: 4 stars. No graphic design interface for computational procedure is available, but SQL tools can be used instead.

Performance: 3 stars. It’s slightly slower than SQL, mainly because converting between ResultSet and map/list takes a bit more time. In addition, its support for cache is worse than that of Hibernate. Overall, the differences are not great. In my opinion, introducing ORM is a mistake when it brings a performance problem with it.

Complex computation: 3 stars. It’s the same as SQL, and stronger than Hibernate.

Support for big data: 1 star. Same as SQL

Non-database computation: 1 star. Same as SQL

Cross-database computation: 1 star. Same as SQL

Convenience for debugging: 1 star. Same as SQL

R language

R is not easy to integrate with Java. However, it is worth mentioning because of its powerful computational capability, wide community support, and advantages in big data. Needless to say, it is the hardest to learn of all these computation layers.

Maturity: 5 stars. Only SQL has a longer history than R. R has been a hot topic in many forums, particularly in the age of big data.

Low coupling: 4 stars. R is no different from esProc in this respect.

Scripting: 3 stars. In this respect, R is very similar to esProc. esProc is more agile, more flexible in scripting, and more professional in supporting structured data, while R has a great number of built-in models. Let’s call it even.

Integration: 1 star. R is not built on a Java architecture and is hard to integrate with Java. Given its comparatively unsatisfactory performance to begin with, performance would drop dramatically once integrated.

UI friendliness: 3 stars. A dedicated IDE is offered. However, it is not finely built and suffers from low usability, which is common to all open source products.

Performance: 2 stars. Full memory computation makes it hard to handle large data volume.

Complex computation: 5 stars. It’s similar to esProc.

Support for big data: 3 stars. There is a mechanism for using R together with Hadoop. But combining a non-Java system with a Java system is not easy, and it also compromises performance greatly.

Non-database computation: 5 stars. It’s similar to esProc.

Cross-database computation: 5 stars. It’s similar to esProc.

Convenience for debugging: 2 stars. Debugging support is minimal and very unprofessional.

July 28, 2013

Best Java Development Is Fast and Efficiency

In Java programming, how can data calculation scripts be debugged conveniently? How can massive structured data from an Excel sheet or Txt file be computed? How can complex computational problems be solved more easily? All in all, how can the data computing efficiency of Java be improved?

For most computations, Java is powerful enough and quite convenient to debug. However, Java does not directly implement the common data computation algorithms. Java programmers still have to spend a great deal of time and effort implementing details like aggregating, filtering, grouping, sorting, and ranking. For data storage and access, programmers have to use List and other objects to assemble every 2D table and every record, and walk through nested multi-level loops. In addition, such computation involves set and relational operations on massive data, or relative positions between objects and object properties. The underlying logic for these computations demands great effort, not to mention Excel or text data, set-oriented data, and complex computational goals.

So, Java alone cannot improve the efficiency for data computation.

An SQL database is another option. SQL implements many data computation algorithms and eases the workload to some extent. But the shortcomings below are unavoidable:

First, SQL takes a long query as its basic computation unit. Programmers can view only the final result, not the details of the run. It is awkward to prepare stored procedures and a great many temporary tables just to debug. Writing special scripts for debugging? Not a good idea indeed! And as an SQL statement gets longer, the difficulty of reading and writing it increases exponentially.

Second, to handle Excel, text, or heterogeneous data with SQL, programmers have to build a data mart or global view with ETL or Linked Server at great cost. In addition, SQL does not support step-by-step computation for decomposing a complex goal. Its incomplete support for sets still makes some complex problems tough to solve. So we can conclude that SQL has limited impact on improving the computational efficiency of Java.

In this case, esProc is highly recommended: a database computation development tool that simplifies complex computations, is tailored for cross-database computation and explicit sets, offers convenient debugging, and supports JDBC directly for easy integration with Java apps.

Take this typical set operation for example. Retrieve the following sets of contracts: 1. all valid contracts; 2. contracts signed in 2012; 3. contracts with quantities ordered above 40 (large volume); 4. contracts with unit prices above 2000 dollars (high unit price); 5. contracts that meet conditions 2, 3, and 4; 6. all contracts except those that meet condition 5.

esProc script:

esProc allows data to be retrieved directly from, and computed across, multiple databases, text files, or Excel sheets. It is designed with a grid-like style and agile syntax for massive structured data computation. With native support for external parameters, cross-database computation, and code reuse, esProc greatly boosts data computation and development efficiency for Java.
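For comparison, the six contract sets in the example above can also be spelled out by hand in plain Java. This is only a sketch under assumptions: the `Contract` class, its field names, and the sample rows are invented for illustration, and sets 2 to 4 are assumed to be drawn from the valid contracts (set 1), which the post does not state explicitly.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

final class ContractSets {

    /** Hypothetical contract record; the field names are assumptions, not from the post. */
    static final class Contract {
        final String id;
        final boolean valid;
        final int signYear;
        final int quantity;
        final double unitPrice;

        Contract(String id, boolean valid, int signYear, int quantity, double unitPrice) {
            this.id = id;
            this.valid = valid;
            this.signYear = signYear;
            this.quantity = quantity;
            this.unitPrice = unitPrice;
        }
    }

    /** Returns the members of source that satisfy the condition, in their original order. */
    static List<Contract> filter(List<Contract> source, Predicate<Contract> condition) {
        List<Contract> out = new ArrayList<>();
        for (Contract c : source) {
            if (condition.test(c)) {
                out.add(c);
            }
        }
        return out;
    }

    /** Set 5: the intersection of sets 2, 3, and 4. */
    static List<Contract> meetingAll(List<Contract> valid) {
        List<Contract> signedIn2012 = filter(valid, c -> c.signYear == 2012); // set 2
        List<Contract> largeVolume = filter(valid, c -> c.quantity > 40);     // set 3
        List<Contract> highPrice = filter(valid, c -> c.unitPrice > 2000);    // set 4
        List<Contract> result = new ArrayList<>(signedIn2012);
        result.retainAll(largeVolume);
        result.retainAll(highPrice);
        return result;
    }

    /** Set 6: the valid contracts minus set 5. */
    static List<Contract> others(List<Contract> valid) {
        List<Contract> result = new ArrayList<>(valid);
        result.removeAll(meetingAll(valid));
        return result;
    }

    static List<String> ids(List<Contract> contracts) {
        List<String> out = new ArrayList<>();
        for (Contract c : contracts) {
            out.add(c.id);
        }
        return out;
    }

    /** Made-up sample data. */
    static List<Contract> sample() {
        return Arrays.asList(
                new Contract("C1", true, 2012, 50, 2500),
                new Contract("C2", true, 2012, 30, 2500),
                new Contract("C3", true, 2011, 60, 1500),
                new Contract("C4", false, 2012, 80, 3000),
                new Contract("C5", true, 2012, 45, 2100));
    }

    public static void main(String[] args) {
        List<Contract> valid = filter(sample(), c -> c.valid); // set 1
        System.out.println("Set 5: " + ids(meetingAll(valid)));
        System.out.println("Set 6: " + ids(others(valid)));
    }
}
```

With the sample rows, set 5 contains C1 and C5, and set 6 contains C2 and C3. Even in this small case, the intermediate sets have to be named and combined explicitly, which is the bookkeeping the post argues a dedicated computation script takes over.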

July 24, 2013

How to Leverage Big Data like Google?

Recently, I read Why Big Data Projects Fail by Stephen Brobst at: http://data-informed.com/why-big-data-projects-fail. I can’t agree more with his opinions, which expose the problem I’ve been worried about. In this article, I will discuss this topic further to remind enterprises to beware of falling into the same pitfall.

Let’s have a look at a positive example. As an enterprise that leverages big data successfully, how does Google make use of it?

1. Collect the raw data, capture the contents of each website, e-mail, or cookie, and extract the key information.

2. Create the complex syndetic index for this information. Needless to say, the advertisement-related index must also be created.

3. Store these indices and corresponding contents in the distributed servers.

4. When users browse websites, search, or view e-mails, Google puts their requests through a complex translation procedure, and several index entries are located accordingly.

5. Retrieve data from the servers according to the index, and return the search result or advertisement.

Which of the above items are related to the Hadoop architecture? Items No. 3 and No. 5, that is, data storage and data retrieval.

Can items No. 3 and No. 5 be implemented easily? Yes. Hadoop-like solutions offer good scalability at a low purchase cost.

Can I operate like Google once items No. 3 and No. 5 are implemented? No, because you have not yet implemented the key items, No. 2 and No. 4.

What are items No. 2 and No. 4? They are business analysis algorithms, designed meticulously by business experts on the basis of data, business knowledge, and market trends. They form the core competency and the business decision-making procedure of many enterprises. This is the “Value” component of the 4V Theory.

Why do big data projects fall into the pitfall of failure? Because current big data technology only provides a solution for data storage and query. It lacks a good solution for the business analysis that enhances competitiveness, which is the most crucial part. There is a great gap in between. In fact, current big data technology is a tool for IT experts. They are able to implement MapReduce functions in C++ or Java, but unable to reach the ultimate goal: providing valuable business algorithms.

To avoid the pitfall, enterprises must use an advanced analysis tool that is oriented to business experts, usable regardless of the user's technical background, and capable of converting business logic into business algorithms rapidly, intuitively, and conveniently. How about NoSQL or SQL? Neither is ideal. They are for IT personnel only, owing to the strong technical background they require, their complex operations, and their comparatively weak computation capability.

What are the ideal tools for business experts? From the TCO perspective, I would rather choose the lightweight R language and esProc Desktop than pin my hopes on the heavyweight Teradata Aster and SAP Visual Intelligence. This is especially true of esProc: this desktop tool for business computation is designed for business experts, with a syntax that is easy to use and understand and low technical requirements. The scripts are aligned automatically, allowing users to observe the results of each step clearly and visually. The results can be referenced directly through the names of the cells, enabling users to compute freely according to business logic.

July 16, 2013

Integrating Dynamic Calculation Script for Set Operations in Java

In Java development, we may encounter complex set operations. With Java alone, programmers have to implement the computation details themselves, which is time-consuming and yields little code reuse. In view of this, programmers usually resort to a dynamic calculation script for set operations.

SQL is surely the first kind of script that comes to most programmers’ minds. To their disappointment, however, SQL does not support explicit sets. It cannot represent sets of sets, ordered sets, or generic sets; only the result set is recognized as a set. It therefore covers only a subset of true set semantics, and many set operations are hard to implement in SQL. Moreover, the computation is not limited to databases: the data may come from Excel, and there may be no database in the application environment at all. In such cases SQL is even less applicable.

As true dynamic calculation scripts for set operations, both R and esProc are more suitable for such computation, given their support for generic sets, ordered sets, and embedding in Java. In addition, they not only let users retrieve data from one or more databases, Excel sheets, and txt files, but also let users compute step by step, and ultimately solve many complex computations far more conveniently.

Let’s check it out with this example: based on the sales records, a sales department needs to identify the outstanding salesmen, that is, the top sellers who together account for half of the total sales.

esProc scripts:

R scripts:

library(RODBC)
odbcDataSources()   # list the ODBC data sources available
conn <- odbcConnect("sqlsvr")
originalData <- sqlQuery(conn, 'select * from salesOrder')
odbcClose(conn)
nameSum <- aggregate(originalData$sales, list(originalData$name), sum)   # total sales per salesman
names(nameSum) <- c('name', 'salesSum')
orderData <- nameSum[rev(order(nameSum$salesSum)), ]   # sort by total sales, descending
halfSum <- sum(orderData$salesSum) / 2
orderData$addup <- cumsum(orderData$salesSum)   # running total
subset(orderData, addup <= halfSum | (addup > halfSum & c(0, addup[-length(addup)]) < halfSum))   # keep rows up to and including the one that crosses half

esProc supports JDBC, so reporting tools and Java code can reference it directly. By comparison, embedding R in Java is more complex: it relies on the RServer library or Perl as a bridge, and no simple interface is available.
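For readers who want to see the logic of the R script above without R, here is a rough plain-Java equivalent of its computation part. It is a sketch only: the `Sale` class and the sample figures are invented, the rows are assumed to be already loaded into memory, and ties between salesmen are not handled in any particular way.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class TopSellers {

    /** Hypothetical sales record: one row per order. */
    static final class Sale {
        final String name;
        final double amount;

        Sale(String name, double amount) {
            this.name = name;
            this.amount = amount;
        }
    }

    /** Names of the top salesmen, in descending order, whose totals together reach half of all sales. */
    static List<String> topHalf(List<Sale> sales) {
        // Total sales per salesman, plus the grand total.
        Map<String, Double> totals = new HashMap<>();
        double grandTotal = 0;
        for (Sale s : sales) {
            totals.merge(s.name, s.amount, Double::sum);
            grandTotal += s.amount;
        }

        // Sort salesmen by their totals, largest first.
        List<Map.Entry<String, Double>> ranked = new ArrayList<>(totals.entrySet());
        ranked.sort(Map.Entry.<String, Double>comparingByValue().reversed());

        // Walk down the ranking until the running total has reached half.
        double half = grandTotal / 2;
        double running = 0;
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Double> e : ranked) {
            if (running >= half) {
                break;
            }
            result.add(e.getKey());
            running += e.getValue();
        }
        return result;
    }

    /** Made-up sample data. */
    static List<Sale> sample() {
        return Arrays.asList(
                new Sale("Ann", 250), new Sale("Bob", 350), new Sale("Ann", 150),
                new Sale("Cid", 300), new Sale("Dee", 150));
    }

    public static void main(String[] args) {
        System.out.println(topHalf(sample()));
    }
}
```

As in the R script, the salesman whose total pushes the running sum past the halfway mark is included in the result.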

July 10, 2013

What’s the Best Way for Structured Data Computing in Java?

When developing Java programs, we sometimes face this challenge: performing massive structured data computation on data from text or Excel files. For example, given the policy details, how do we find the salesman who sold the most or the fewest insurance products during a certain period?

How about importing the data into a database manually and then computing with SQL? Bad idea! A one-off import cannot keep the data up to date. Once the data changes, users have to import it all over again, which leads to cumbersome procedures and a heavy workload.

How about importing the data automatically on a schedule with an ETL tool? Still bad! This practice usually requires PHP/Perl/VBScript/JavaScript knowledge and massive data update algorithms, which is very costly in time and money. Moreover, changing requirements make maintenance difficult and affect the performance and stability of the existing database. It is not worth building something close to a data warehouse just to meet one computing demand.

In addition, not every Java application has a database. What if there is none? Installing a database just to meet one computation demand is obviously a bad idea.

Even if the database and the scheduled ETL update are established successfully, it is still inconvenient in many cases, for example for cross-database computation, and it is sometimes hard to reduce the coupling between the data computing scripts and the Java code. Take the SQL statement for this case as an example. It is already quite hard to understand and maintain, let alone to compose, as shown below:

SELECT salesMan
FROM (SELECT salesMan,
             row_number() OVER (ORDER BY isrCount DESC) descOrder,
             row_number() OVER (ORDER BY isrCount ASC) ascOrder
      FROM (SELECT salesMan,
                   COUNT(*) isrCount
            FROM insurance
            WHERE salesDate >= ? AND salesDate <= ?
            GROUP BY salesMan
           ) t
     ) r
WHERE descOrder = 1 OR ascOrder = 1
ORDER BY descOrder

How about using Java directly? Even worse! Java does not offer functions for querying, grouping, sorting, and summarizing directly, while these are basic requirements of massive structured data computation. You will have to implement all these details yourself.

Then, is there no way out? There is. esProc is an ideal tool for the job. It is a professional Java development tool for database computing, good at simplifying complex computation and easy to integrate with Java. For esProc, the scripts are shown below:

esProc can retrieve data directly from, and compute on, multiple databases, txt files, and Excel sheets. It offers a grid style and agile syntax specially tailored for massive structured data computation. The result can be exposed through JDBC to boost the computational capability of Java dramatically. In addition, it enables cross-database computation and supports code reuse by nature. Even the debugging functionality is excellent. Considering all these advantages, it is clear that esProc is more efficient than SQL.
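For reference, the hand-written Java route mentioned above might look roughly like the sketch below, which mirrors the SQL statement: count policies per salesman within a date range, then pick the salesman with the most and the one with the fewest. The `Policy` class, its fields, and the sample rows are assumptions made for illustration, and the rows are assumed to have been read from the text or Excel file already. It uses `java.time` and `Map.merge`, which belong to Java releases newer than this post.

```java
import java.time.LocalDate;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class PolicyRanking {

    /** Hypothetical policy record: who sold it and when. */
    static final class Policy {
        final String salesMan;
        final LocalDate salesDate;

        Policy(String salesMan, LocalDate salesDate) {
            this.salesMan = salesMan;
            this.salesDate = salesDate;
        }
    }

    /**
     * Returns [salesman with most policies, salesman with fewest policies] for the
     * inclusive date range, or an empty list if nothing was sold in that range.
     */
    static List<String> mostAndFewest(List<Policy> policies, LocalDate from, LocalDate to) {
        // Group by salesman and count; TreeMap keeps names sorted so ties resolve alphabetically.
        Map<String, Integer> counts = new TreeMap<>();
        for (Policy p : policies) {
            if (!p.salesDate.isBefore(from) && !p.salesDate.isAfter(to)) {
                counts.merge(p.salesMan, 1, Integer::sum);
            }
        }
        if (counts.isEmpty()) {
            return Collections.emptyList();
        }

        // Scan once for the largest and the smallest count.
        String most = null;
        String fewest = null;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (most == null || e.getValue() > counts.get(most)) {
                most = e.getKey();
            }
            if (fewest == null || e.getValue() < counts.get(fewest)) {
                fewest = e.getKey();
            }
        }
        return Arrays.asList(most, fewest);
    }

    /** Made-up sample data. */
    static List<Policy> sample() {
        return Arrays.asList(
                new Policy("Ann", LocalDate.of(2013, 1, 15)),
                new Policy("Ann", LocalDate.of(2013, 2, 10)),
                new Policy("Ann", LocalDate.of(2013, 3, 5)),
                new Policy("Bob", LocalDate.of(2013, 4, 1)),
                new Policy("Bob", LocalDate.of(2012, 12, 31)),
                new Policy("Cid", LocalDate.of(2013, 5, 20)),
                new Policy("Cid", LocalDate.of(2013, 6, 30)));
    }

    public static void main(String[] args) {
        System.out.println(mostAndFewest(sample(), LocalDate.of(2013, 1, 1), LocalDate.of(2013, 6, 30)));
    }
}
```

Like the SQL version, a salesman with no policies in the period does not appear in the counts at all, so "fewest" here means fewest among those who sold at least one.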

July 8, 2013

Don’t Limit On SQL for Database Computing in Java Development

Original post: http://www.sourcecodester.com/blog/5454/don%E2%80%99t-limit-sql-database-computing-java-development.html
Check the full content below:

During Java development, SQL is often used for database-related computation. However, "ordered computation" is very inconvenient to express in SQL. For example, based on a Contract table, compute the monthly growth rate of contract value for each salesman over a specified period.

SQL result sets follow an awkward convention: the result set returned has no sequence numbers, and SQL offers no direct support for algorithms that depend on sequence numbers. The former problem is easy to solve, since most JDBC ResultSet classes support row numbers, and for the few that do not, the class can be wrapped or overridden. Another well-known solution, at the cost of portability, is a vendor feature such as rownum in Oracle.

The real trouble is that SQL offers no direct support for algorithms relating to sequence numbers, such as the last three records, the record before the current one, the top five, the last but one, or the ranking of a certain record. Take the problem mentioned at the beginning: we need to first group by salesman, then group by year within each salesman, then group by month within each year and summarize, and finally perform an inter-row computation between each month and its previous month. These algorithms are inconvenient to express in SQL. Many trivial inconveniences add up to a tough problem. That is how a “tough computational problem” comes into being.
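To make the steps just described concrete (group by salesman, group by month, summarize, then compare each month with the one before it), here is a minimal sketch in plain Java over rows already fetched into memory. The `Contract` class, its fields, and the sample data are assumptions for illustration. "Previous month" is taken to mean the previous month that has data for that salesman, which may differ from what the original requirement intends.

```java
import java.time.LocalDate;
import java.time.YearMonth;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class MonthlyGrowth {

    /** Hypothetical row of the Contract table. */
    static final class Contract {
        final String salesman;
        final LocalDate signDate;
        final double amount;

        Contract(String salesman, LocalDate signDate, double amount) {
            this.salesman = salesman;
            this.signDate = signDate;
            this.amount = amount;
        }
    }

    /** For each salesman, maps a month to its growth rate relative to the previous month with data. */
    static Map<String, Map<YearMonth, Double>> monthlyGrowth(List<Contract> contracts) {
        // Step 1: group by salesman, then by year and month, summing the contract values.
        Map<String, TreeMap<YearMonth, Double>> totals = new TreeMap<>();
        for (Contract c : contracts) {
            totals.computeIfAbsent(c.salesman, k -> new TreeMap<>())
                  .merge(YearMonth.from(c.signDate), c.amount, Double::sum);
        }

        // Step 2: inter-row computation; months come out of the TreeMap in chronological order.
        Map<String, Map<YearMonth, Double>> growth = new TreeMap<>();
        for (Map.Entry<String, TreeMap<YearMonth, Double>> e : totals.entrySet()) {
            Map<YearMonth, Double> rates = new TreeMap<>();
            Double previous = null;
            for (Map.Entry<YearMonth, Double> month : e.getValue().entrySet()) {
                if (previous != null && previous.doubleValue() != 0.0) {
                    rates.put(month.getKey(), (month.getValue() - previous) / previous);
                }
                previous = month.getValue();
            }
            growth.put(e.getKey(), rates);
        }
        return growth;
    }

    /** Made-up sample data. */
    static List<Contract> sample() {
        return Arrays.asList(
                new Contract("Ann", LocalDate.of(2013, 1, 5), 60),
                new Contract("Ann", LocalDate.of(2013, 1, 20), 40),
                new Contract("Ann", LocalDate.of(2013, 2, 11), 150),
                new Contract("Ann", LocalDate.of(2013, 3, 2), 120),
                new Contract("Bob", LocalDate.of(2013, 1, 9), 200),
                new Contract("Bob", LocalDate.of(2013, 2, 14), 200));
    }

    public static void main(String[] args) {
        monthlyGrowth(sample()).forEach((name, rates) -> System.out.println(name + " " + rates));
    }
}
```

The "previous record" reference that the text calls inconvenient in SQL shows up here as the `previous` variable carried through an ordered loop.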

Are there any better solutions?

The answer is yes. esProc can address this problem well.

esProc is a professional Java development tool for database computation, and direct support for ordered computation is one of its characteristics. For instance, the script for the above-mentioned computational goal is shown below:

esProc can retrieve data from, and compute on, multiple databases, text files, or Excel sheets. Its grid style and agile syntax are tailored for massive structured data computation. esProc supports external parameters, and its computational results can be exposed through JDBC for direct invocation by Java code and reporting tools. Therefore, esProc can greatly enhance the computational capability of Java. With native support for cross-database computation and code reuse, and a complete debugging function, esProc offers higher development efficiency than SQL and is ideal to work with Java.

July 2, 2013

Great Progress Made in Java for Structured Data without Database

In Java development, we occasionally encounter computations similar to data processing in a database. For instance, there are two frequently updated Excel sheets, one holding client information and the other holding orders. Given a dynamic product list as input, we need to query the clients who have bought all the products on the list.

The "computation similar to data processing in database" refers to structured data computation in an application without a database. Although Java is capable of handling such computation, the procedure is cumbersome and verbose.

It takes programmers a lot of time and effort to implement the computational details, for example finding the maximum value, ranking, filtering, grouping, and averaging. In addition, defining the data types is cumbersome: define a class, use an object to represent every record, and then use a List to store multiple records. The computation itself is implemented with nested multi-level loops. It involves set and relational computation over massive data, or computation on relative positions between objects and object attributes. It takes great effort to implement the underlying logic for these computations.
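As an illustration of what this hand-coding involves, the client query described above might be written roughly as follows. This is a sketch under assumptions: the `Order` class and the sample rows are invented, and the order rows are assumed to have been read from the Excel sheet already (for example with a spreadsheet library), a step that is left out here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

final class ClientsWhoBoughtAll {

    /** Hypothetical row of the orders sheet. */
    static final class Order {
        final String client;
        final String product;

        Order(String client, String product) {
            this.client = client;
            this.product = product;
        }
    }

    /** Returns, in alphabetical order, the clients whose orders cover every product in wanted. */
    static List<String> clientsWhoBoughtAll(List<Order> orders, Set<String> wanted) {
        // Group by client: collect the set of products each client has bought.
        Map<String, Set<String>> productsByClient = new TreeMap<>();
        for (Order o : orders) {
            productsByClient.computeIfAbsent(o.client, k -> new HashSet<>()).add(o.product);
        }

        // Keep the clients whose product set contains the whole wanted list.
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Set<String>> e : productsByClient.entrySet()) {
            if (e.getValue().containsAll(wanted)) {
                result.add(e.getKey());
            }
        }
        return result;
    }

    /** Made-up sample data. */
    static List<Order> sample() {
        return Arrays.asList(
                new Order("Acme", "A"), new Order("Acme", "B"), new Order("Acme", "C"),
                new Order("Bolt", "A"),
                new Order("Core", "B"), new Order("Core", "A"));
    }

    public static void main(String[] args) {
        Set<String> wanted = new HashSet<>(Arrays.asList("A", "B"));
        System.out.println(clientsWhoBoughtAll(sample(), wanted));
    }
}
```

The product list is passed in as a parameter, which corresponds to the "dynamic product list" in the example; joining the result back to the client information sheet would be one more hand-written step.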

Obviously, such computation is hard to perform with Java alone. Then how can database-like computation be implemented conveniently in Java? The answer is esProc.

esProc is a Java development tool especially designed for database computations. It offers native support for cross-database computation and code reuse, along with a complete set of debugging functions, so development productivity in esProc is also superior to that in SQL. esProc can retrieve data from, and operate on, multiple databases, text files, or Excel sheets. Its grid style and agile syntax are especially designed for massive structured data computation. esProc supports external parameters, and its computational results can be exposed through JDBC for direct invocation by Java code and reporting tools. So esProc can boost the computational capability of Java dramatically.

With regard to the above-mentioned computational goal, esProc code is as follows:

esProc is good at simplifying the complex computation, and can be integrated with Java in a convenient way.

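// Load the esProc JDBC driver and open a local connection (classes from java.sql)
Class.forName("com.esproc.jdbc.InternalDriver");
Connection con = DriverManager.getConnection("jdbc:esproc:local://");
// Prepare a call to the esProc script p31, which takes one parameter
PreparedStatement st = con.prepareStatement("call p31(?)");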

Clear and simple: from now on, let's handle database-like computation in Java with esProc.
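The three JDBC lines above stop before the parameter is set and the result is read. A fuller sketch could look like the following. Only the connection URL and the `call p31(?)` form come from the post; the driver class name is written as `com.esproc.jdbc.InternalDriver` on the assumption that the spelling in the post is a typo. The helper names `callStatement` and `run`, the comma-separated placeholder form for several parameters, and the assumption that the driver returns its result through `execute()` and `getResultSet()` are all illustrative guesses that should be checked against the esProc documentation.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class EsProcCall {

    /** Builds a call string such as "call p31(?)" with one placeholder per parameter. */
    static String callStatement(String script, int paramCount) {
        StringBuilder sb = new StringBuilder("call ").append(script).append('(');
        for (int i = 0; i < paramCount; i++) {
            if (i > 0) {
                sb.append(',');
            }
            sb.append('?');
        }
        return sb.append(')').toString();
    }

    /**
     * Runs an esProc script through JDBC and copies the result into plain maps.
     * Needs the esProc JDBC driver on the classpath; not exercised in this sketch.
     */
    static List<Map<String, Object>> run(String script, Object... params)
            throws ClassNotFoundException, SQLException {
        Class.forName("com.esproc.jdbc.InternalDriver"); // assumed driver class name
        List<Map<String, Object>> rows = new ArrayList<>();
        try (Connection con = DriverManager.getConnection("jdbc:esproc:local://");
             PreparedStatement st = con.prepareStatement(callStatement(script, params.length))) {
            for (int i = 0; i < params.length; i++) {
                st.setObject(i + 1, params[i]); // JDBC parameters are 1-based
            }
            st.execute();
            try (ResultSet rs = st.getResultSet()) {
                if (rs == null) {
                    return rows; // the script returned no result set
                }
                ResultSetMetaData meta = rs.getMetaData();
                while (rs.next()) {
                    Map<String, Object> row = new LinkedHashMap<>();
                    for (int c = 1; c <= meta.getColumnCount(); c++) {
                        row.put(meta.getColumnLabel(c), rs.getObject(c));
                    }
                    rows.add(row);
                }
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        // Only the statement text is shown here, since running it needs an esProc installation.
        System.out.println(callStatement("p31", 1));
    }
}
```

With an esProc installation available, a call such as `EsProcCall.run("p31", productList)` would hand the rows back to ordinary Java code, where `productList` stands for whatever parameter the script expects.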
