October 14, 2013

Why esProc is needed in Hadoop

esProc is a new parallel computing framework that can read and write files in HDFS. It is designed to improve the computational capability, performance, and development efficiency of Hadoop.

>Enhance Computational Capability
Hadoop's computational capability rests on the Java language and the MapReduce parallel framework. Java is a general-purpose language that is widely used in common applications, but it is not specialized for computation in many fields. MapReduce lacks library functions for even the simplest data algorithms, and it offers no direct support for typical ones such as associated computation, sub-queries, inter-row computation, and ordered computation. As a result, its computational capability is rather weak.

esProc is also a Java-based parallel framework, but it provides a scripting language designed specifically for big data computation. Working with HDFS, esProc can greatly improve the computational capability of Hadoop.

To boost computational capability, Hadoop offers Hive SQL, which is built on top of MapReduce. However, Hive SQL is quite limited: it is only a subset of SQL, has no support for stored procedures, and cannot complete complex data computations.
With its complete computational system and powerful computational capability, esProc aims to meet a wide range of computational demands, and it solves complex data computations more easily than stored procedures do. esProc can also invoke the computational results of Hive, so the two can work together to improve the computational capability of Hadoop.

>Boost Computational Performance
Built on a rigid framework, MapReduce is inflexible in decomposing and allocating tasks, extremely resource-consuming, and relatively poor at real-time work. By comparison, esProc allows arbitrary task allocation. In extreme cases, the time esProc spends on allocating a task is only one ten-millionth of the time MapReduce requires, and the parallel performance of esProc is superior.

In MapReduce, intermediate results exchanged between machines must be written out as files: map output is spilled to disk before the shuffle, and the output passed between chained jobs is stored in HDFS. This helps fault tolerance, but it also introduces considerable delay. By comparison, esProc lets users choose according to how long the computation runs: the intermediate result can either be used directly to reduce delay or be stored in HDFS to increase fault tolerance.

MapReduce is awkward for common data computations such as multi-table association and year-over-year or link relative ratio comparisons. Implementing them through workarounds in MapReduce causes performance to decline dramatically. By comparison, esProc supports such computations natively. Used together, esProc and HDFS dramatically boost the computational efficiency of Hadoop.

Hive still runs on MapReduce, so it implements common algorithms such as associated computation at the cost of performance, usually ending up about one order of magnitude slower than an RDB. The performance of esProc is close to that of an RDB, and in some cases superior. esProc can work with Hive via JDBC to take on computational tasks with strict real-time requirements.

>Improve Development Efficiency
Even for the simplest computation, MapReduce users have to write code by hand, so development efficiency is low. Moreover, MapReduce demands stronger development skills and a greater workload to implement associated computation, ordered computation, equal grouping, and year-over-year or link relative ratio comparisons. Hive does not support stored procedures and still has to rely on MapReduce for anything slightly more complex.

For common algorithms, esProc provides abundant library functions for direct use; for complex algorithmic logic, it provides an agile syntax and a professional IDE that make implementation easy. Working with HDFS and Hive, esProc can greatly boost the development efficiency of Hadoop. With true support for the set data type, esProc enables ordered sets and set-based groupings such as equal grouping, align grouping, and enum grouping. esProc scripts are written in a grid-style cellset, so users can reference an intermediate computational result directly, without defining a variable for it.
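The three kinds of grouping named above may be easier to picture with a small Python sketch. This is not esProc syntax, and the definitions are assumptions about what the terms mean: equal grouping collects records that share a key value, align grouping lays the groups out in the order of a reference list (keeping empty groups and dropping keys that are not listed), and enum grouping builds one group per condition, so in this sketch a record may appear in more than one group. The function names and sample data are hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, List, Sequence, Tuple

# Hypothetical record layout: (name, department, age).
Row = Tuple[str, str, int]


def equal_group(rows: Sequence[Row], key: Callable[[Row], Hashable]) -> Dict[Hashable, List[Row]]:
    """Assumed meaning of 'equal grouping': rows with equal key values share a group."""
    groups: Dict[Hashable, List[Row]] = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row)
    return dict(groups)


def align_group(rows: Sequence[Row], key: Callable[[Row], Hashable],
                reference: Sequence[Hashable]) -> List[List[Row]]:
    """Assumed meaning of 'align grouping': one group per reference value, in
    reference order; empty groups are kept, unlisted keys are dropped."""
    groups: Dict[Hashable, List[Row]] = {value: [] for value in reference}
    for row in rows:
        k = key(row)
        if k in groups:
            groups[k].append(row)
    return [groups[value] for value in reference]


def enum_group(rows: Sequence[Row],
               conditions: Sequence[Callable[[Row], bool]]) -> List[List[Row]]:
    """Assumed meaning of 'enum grouping': one group per condition; groups may overlap."""
    return [[row for row in rows if condition(row)] for condition in conditions]


employees: List[Row] = [("Ann", "Sales", 28), ("Bob", "R&D", 41), ("Cid", "Sales", 35)]

by_dept = equal_group(employees, key=lambda r: r[1])
aligned = align_group(employees, key=lambda r: r[1], reference=["R&D", "HR", "Sales"])
by_age = enum_group(employees, [lambda r: r[2] < 30, lambda r: r[2] >= 30, lambda r: r[2] >= 40])
```

Here `aligned` has an empty middle group because nobody works in the made-up "HR" department, and Bob appears in two of the `by_age` groups because he satisfies two conditions. In each case the result is an ordinary collection of groups that can be indexed and processed further, which is the sense in which the groupings are set-based.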

Debugging support in MapReduce is so limited that users can often identify an error only by checking messages in log files. By comparison, esProc supports breakpoints, step-by-step execution, run to cursor, start, end, and other dedicated debugging functions that help development efficiency.

To define the task scale arbitrarily, MapReduce users have to customize the MapReduce framework, which is not only difficult but also seriously compromises development efficiency. esProc allocates tasks flexibly and arbitrarily, so development efficiency is quite high.

esProc has all the outstanding features of Hadoop: parallel computation on multiple nodes, inexpensive hardware for scale-out, and open external interfaces. In addition, esProc renovates Hadoop with a flexible parallel framework, a specialized script for big data, an agile syntax, and a professional IDE.
