November 5, 2013

esProc Acting as Stored Procedure for Hadoop

Hadoop is a typical big data solution. Thanks to its inexpensive scale-out capability, it is attractive to many enterprise users, such as eBay, Yahoo, Facebook, China Mobile, Amazon, IBM, and Intel. For simple computations in Hadoop, we can use high-level languages such as Pig Latin or the SQL-like HiveQL. Sometimes, however, we run into big data computations that involve complex business logic. In a database, these are easy to handle with stored procedures. What can we do in Hadoop?
        
For example, suppose there are two tables: an Order table holding all orders and an Employee table listing the salespeople. We need to summarize the large Order table to get the total order value for each salesperson, and show each salesperson's full name in place of the userID stored in the Order table. Note that the Employee table contains junk data, which must be cleaned up according to the following rules:

1. If userID or firstName is null or an empty string, the record is invalid.

2. userID may contain only digits; a value containing any letters is invalid.

3. If a userID is duplicated, keep only the last entry.

4. Remove leading and trailing spaces from each field.

5. Capitalize the first letter of firstName.

6. Assemble the full name as firstName + "." + lastName. If lastName is null or an empty string, the full name is just firstName.
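
To make the six rules concrete, here is a minimal Python sketch of the cleanup. It is not the esProc script, only an illustration under stated assumptions: each Employee record is a dict with the keys userID, firstName, and lastName; rule 1 rejects a record when either userID or firstName is missing; trimming (rule 4) is applied before validation; and a later invalid duplicate does not displace an earlier valid entry. The function name clean_employees is hypothetical.

```python
from typing import Dict, Iterable, Optional


def clean_employees(rows: Iterable[Dict[str, Optional[str]]]) -> Dict[str, str]:
    """Return {userID: fullName} after applying the six cleanup rules.

    Assumption: rows arrive in table order, so "last entry" means the
    record that appears latest in the input.
    """
    cleaned: Dict[str, str] = {}
    for row in rows:
        # Rule 4: trim leading/trailing spaces (None is treated as empty).
        user_id = (row.get("userID") or "").strip()
        first = (row.get("firstName") or "").strip()
        last = (row.get("lastName") or "").strip()

        # Rule 1: userID and firstName must both be present.
        if not user_id or not first:
            continue

        # Rule 2: userID must consist of ASCII digits only.
        if not (user_id.isascii() and user_id.isdigit()):
            continue

        # Rule 5: capitalize the first letter of firstName.
        first = first[0].upper() + first[1:]

        # Rule 6: firstName.lastName, or firstName alone if lastName is empty.
        full_name = f"{first}.{last}" if last else first

        # Rule 3: a later valid record replaces an earlier one.
        cleaned.pop(user_id, None)
        cleaned[user_id] = full_name
    return cleaned
```

Even in a general-purpose language the cleanup is a single record-by-record loop. The difficulty described in this article lies in expressing the same procedural logic inside a SQL-like query language that lacks stored procedures.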

The current Hadoop stack has no stored procedures. HiveQL is only a subset of SQL, which is why Hive is less capable than a relational database for data computing. Because Hive offers no stored procedures for complex business logic, the computations above are cumbersome to complete with Hive alone.

Such problems are usually solved in Hadoop by hand-coding MapReduce jobs in Java. Combined with Hive, MapReduce can handle more flexible and complex computations. However, writing MapReduce programs is complicated and development efficiency is low. In addition, MapReduce offers relatively weak support for set-oriented data processing. Reaching the computational goal above takes strong technical skills and considerable time. Hive was designed to improve the development efficiency of MapReduce, so using MapReduce to assist Hive defeats the purpose and is even harder.

Computations that uncover real business value usually involve complex business logic. Since Hadoop has no stored procedures, complex big data computing usually comes at too high a cost, and the applicability of Hadoop is limited. Apart from large users who can invest heavily in development, ordinary users regard it as an "inexpensive ETL tool for simple algorithms".

How can we give Hadoop stored procedures? How can we perform big data computations that involve complex business logic? esProc is a good choice.

esProc is a parallel computing framework written in pure Java and focused on powering Hadoop. It can access Hive via JDBC and can read from and write to HDFS directly. Acting as the stored procedure of Hadoop, esProc can handle big data computations that involve complex business logic. For the example above, the esProc solution is shown below:

As can be seen, the esProc solution is intuitive and clear:

A1, A2: Use HiveQL to total the order value for each salesperson; retrieve the Employee data.
D2: Create an empty result table that will later hold the userID and fullName.
Rows 3-12: Traverse the Employee table from A2, perform the initial cleanup, and store the result in D2.
A12, A14: De-duplicate the records in D2 by userID, using a set-oriented algorithm.
A15-A16: Join A1 with D2 to compute the final result.
A17: Output the result via JDBC, for example to feed a report or to be embedded in Java code.
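
For readers without access to the grid, the aggregation and join steps of this walkthrough can be approximated in plain Python. This is only a rough sketch, not esProc code. It assumes that orders arrive as (userID, amount) pairs, that the cleaned Employee data is already available as a userID-to-fullName mapping, and that orders whose userID has no valid employee are dropped (an inner join). The field name amount and both function names are hypothetical, since the article does not name them.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def total_by_user(orders: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Sum the order amounts per userID (the role the HiveQL query plays)."""
    totals: Dict[str, float] = defaultdict(float)
    for user_id, amount in orders:
        totals[user_id] += amount
    return dict(totals)


def join_totals_with_names(
    totals: Dict[str, float], names: Dict[str, str]
) -> List[Tuple[str, float]]:
    """Replace each userID with its full name.

    Assumption: inner join, so totals without a valid employee are dropped.
    The result is sorted by full name for stable output.
    """
    return sorted(
        (names[user_id], total)
        for user_id, total in totals.items()
        if user_id in names
    )
```

In the setup the article describes, the per-user totals come back from Hive via JDBC, so only the much smaller aggregated result needs to be joined with the cleaned Employee data.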

esProc is a scripting language specialized for big data. It offers a true set data type, which makes algorithms easy to design from the user's point of view and complex business logic straightforward to implement. In addition, esProc supports ordered sets, allowing arbitrary access to set members and computations based on their serial numbers. A set of sets can represent complex grouping styles easily, for example equal grouping, align grouping, and enum grouping. Users can operate on a single record in the same way as on an object. esProc scripts are written and presented in a grid, so an intermediate result can be referenced by its cell without being defined first. For convenience, complete code editing and debugging facilities are provided.

In short, esProc can be regarded as a dynamic set-oriented language that has something in common with the R language and offers native support for distributed parallel computation at its core. esProc programmers benefit from efficient parallel computation while keeping a syntax as simple as that of R. esProc is designed for data computing and optimized for big data processing. Working with HDFS and Hive, esProc can act as the stored procedure of Hadoop to improve both development and computation efficiency.

Without stored procedures, the current Hadoop stack is convenient only for simple queries, summaries, and joins. For complex computations that yield real business results, the development cost is too high and the applicability is limited. esProc gives Hadoop this stored-procedure capability and thereby greatly expands its applicability.