February 26, 2015

esProc Helps Process Heterogeneous Data sources in Java - Hive

It is easy for Java to connect to Hive using JDBC. But the computational ability of Hive is less than that of SQL in other databases. So to deal with uncommon computations, data should be retrieved before further operation is performed using Java. Thus the code for will be complicated.

But if esProc is used to help with the Java programming, the complicated operation as a result of using Hive in Java will become simpler. The following example will show how esProc works with Java in detail. orders is a table in Hive containing the detailed data of sales orders. Now it is required to compute the year-on-year comparison and link relative ratio. The data is as follows:

ORDERID CLIENT     SELLERID AMOUNT ORDERDATE
1       UJRNP      17     392  2008/11/2 15:28
2       SJCH         6       4802         2008/11/9 15:28
3       UJRNP      16     13500       2008/11/5 15:28
4       PWQ         9       26100       2008/11/8 15:28
5       PWQ         11     4410         2008/11/12 15:28
6       HANAR     18     6174         2008/11/7 15:28
7       EGU 2       17800       2008/11/6 15:28
8       VILJX         7       2156         2008/11/9 15:28
9       JAYB          14     17400       2008/11/12 15:28
10     JAXE          19     19200       2008/11/12 15:28
11     SJCH         7       13700       2008/11/10 15:28
12     QUICK      11     21200       2008/11/13 15:28
13     HL    12     21400       2008/11/21 15:28
14     JAYB          1       7644         2008/11/16 15:28
15     MIP  16     3234         2008/11/19 15:28


Link relative ratio refers to comparison between the current data and data of the previous period, using month as the time interval. For example, divide the sales figure in April by that in March and we get the link relative ratio of April. Year-on-year comparison is the comparison between the current data and data of the corresponding period of the previous year, which means, for example, dividing the sales figure of April 2014 by that of April 2013. Since Hive provides no window functions, it cannot complete the computation unless using nested SQL. But Hive supports very poor subquery and usually the computation should be performed outside of the database. With esProc, however, the computation can be realized easily. The code is as follows:

A1: Connect to the database through JDBC using the datasource Hive defined in advance.

A2: Query the data in the database by the time period using external parameters begin and end. Such as begin="2011-01-01 00:00:00", end="2014-07-08 00:00:00" (i.e. the current date which can be obtained using now() function).

A3: Group orders by the year and the month and sum up to get the sales of each month.

A4: Add a new field Irr, which is the monthly link relative ratio. The expression is mAmount/mAmount[-1], in which mAmount represents the sales in the current time period and mAmount[-1] represents the sales in the previous one. Note that the link relative ratio of the initial month (January of 2011) is empty.

A5: Sort the data in A4 by the month and the year before we compute the year-on-year comparison. Complete code should be =A4.sort(m,y). But since A4 has been sorted by the year, here we just need to sort it by the month, that is A4.sort(m), which has a better performance.

A6: Add another new field yoy, which is the year-on-year comparison of the 
monthly sales figure. The expression is if(m==m[-1],mAmount/mAmount[-1],null), which means the year-on-year comparison is valid only between the same months of the two time periods. The year-on-year comparison of each month in the initial year (the year of 2011) is empty.

A7Sort the data in A6 by the year in descending order and by the month in ascending order. Note that the data is valid up to July of 2014. The result is as follows:
A8: Close Hive database connection.

A9: Return the result.

This block of code can be called by Java using esProc JDBC to get the final result (the above esProc program will be saved as test.dfx). The code for this is as follows:
          // create a connection using esProc jdbc
Class.forName("com.esproc.jdbc.InternalDriver");
con= DriverManager.getConnection("jdbc:esproc:local://");
// call the esProc program (the stored procedure); test is the file name of dfx
st =(com.esproc.jdbc.InternalCStatement)con.prepareCall("call test(?,?)");
// set the parameters
st.setObject(1,"2011-01-01 00:00:00");//begin
st.setObject(1,"2014-07-08 00:00:00");//end
// execute esProc stored procedure
st.execute();
// get the result set
ResultSet set = st.getResultSet();

It is the same way in which esProc accesses Hive and other ordinary databases. Just configure their JDBC while detailed process is omitted here.