esProc, A Script Language for Data Analytics with Parallel Mechanism: esProc Helps Process Heterogeneous Data Sources in Java

Java can access HDFS easily through the API that Hadoop provides, but performing computations such as grouping, filtering and sorting on HDFS files in Java is cumbersome. esProc helps here: it can access HDFS itself, and it gives Java the ability to perform these structured and semi-structured data computations. Let’s look at how it works through an example.

The text file employee.gz in HDFS contains employee data. The task is to import the data and select the female employees who were born on or after January 1, 1981. The file is compressed with gzip in HDFS and is too large to be loaded into memory in its entirety.

The data in employee.gz is as follows:

EID  NAME      SURNAME  GENDER  STATE         BIRTHDAY    HIREDATE    DEPT       SALARY
1    Rebecca   Moore    F       California    1974-11-20  2005-03-11  R&D        7000
2    Ashley    Wilson   F       New York      1980-07-19  2008-03-16  Finance    11000
3    Rachel    Johnson  F       New Mexico    1970-12-17  2010-12-01  Sales      9000
4    Emily     Smith    F       Texas         1985-03-07  2006-08-15  HR         7000
5    Ashley    Smith    F       Texas         1975-05-13  2004-07-30  R&D        16000
6    Matthew   Johnson  M       California    1984-07-07  2005-07-07  Sales      11000
7    Alexis    Smith    F       Illinois      1972-08-16  2002-08-16  Sales      9000
8    Megan     Wilson   F       California    1979-04-19  1984-04-19  Marketing  11000
9    Victoria  Davis    F       Texas         1983-12-07  2009-12-07  HR         3000
10   Ryan      Johnson  M       Pennsylvania  1976-03-12  2006-03-12  R&D        13000
11   Jacob     Moore    M       Texas         1974-12-16  2004-12-16  Sales      12000
12   Jessica   Davis    F       New York      1980-09-11  2008-09-11  Sales      7000
13   Daniel    Davis    M       Florida       1982-05-14  2010-05-14  Finance    10000
…
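For a sense of what this task involves in plain Java, here is a minimal sketch of the filter written by hand. It is not part of the esProc solution. It assumes the file is tab-separated with a header row, and it accepts any InputStream; with Hadoop, that stream would typically come from FileSystem.open, which is left out so the example runs on the JDK alone. The class and method names are hypothetical. The gzip stream is read line by line, so the whole file never has to sit in memory.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper, for illustration only.
final class GzipEmployeeFilter {
    private static final LocalDate CUTOFF = LocalDate.of(1981, 1, 1);

    /** Streams gzip-compressed, tab-separated rows and keeps females born on or after the cutoff. */
    static List<String> femaleBornSince1981(InputStream gzipped) throws IOException {
        List<String> matches = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(gzipped), StandardCharsets.UTF_8))) {
            String header = reader.readLine();
            if (header == null) {
                return matches; // empty file
            }
            // Locate the two columns by name instead of hard-coding positions.
            List<String> cols = Arrays.asList(header.split("\t"));
            int gender = cols.indexOf("GENDER");
            int birthday = cols.indexOf("BIRTHDAY");
            if (gender < 0 || birthday < 0) {
                throw new IllegalArgumentException("GENDER or BIRTHDAY column missing");
            }
            String line;
            while ((line = reader.readLine()) != null) {
                String[] fields = line.split("\t");
                if (fields.length <= Math.max(gender, birthday)) {
                    continue; // skip short or malformed rows
                }
                boolean female = "F".equals(fields[gender]);
                boolean bornOnOrAfter = !LocalDate.parse(fields[birthday]).isBefore(CUTOFF);
                if (female && bornOnOrAfter) {
                    matches.add(line);
                }
            }
        }
        return matches;
    }

    /** Compresses a string in memory so the demo needs no external file. */
    static byte[] gzip(String text) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(bytes)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        String sample = String.join("\n",
                "EID\tNAME\tSURNAME\tGENDER\tSTATE\tBIRTHDAY\tHIREDATE\tDEPT\tSALARY",
                "1\tRebecca\tMoore\tF\tCalifornia\t1974-11-20\t2005-03-11\tR&D\t7000",
                "4\tEmily\tSmith\tF\tTexas\t1985-03-07\t2006-08-15\tHR\t7000",
                "6\tMatthew\tJohnson\tM\tCalifornia\t1984-07-07\t2005-07-07\tSales\t11000");
        for (String row : femaleBornSince1981(new ByteArrayInputStream(gzip(sample)))) {
            System.out.println(row); // only the Emily Smith row matches
        }
    }
}
```

Note that the condition is fixed in the code here. Changing it means editing and recompiling the Java program, which is the inflexibility the esProc approach is meant to remove.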

Implementation approach: a Java program calls the esProc script, which imports and computes the data and then returns the result to the Java program in the form of a ResultSet.

First, develop and debug the program in esProc’s Integrated Development Environment (IDE). As preparation, copy Hadoop’s core packages and configuration packages, such as commons-configuration-1.6.jar, commons-lang-2.4.jar and hadoop-core-1.0.4.jar (Hadoop 1.0.4), to “esProc’s installation directory\esProc\lib”.

Because esProc can parse and evaluate expressions dynamically, it lets Java filter the data in an HDFS file as flexibly as SQL does. For example, to query the female employees who were born on or after January 1, 1981, esProc takes an input parameter “where” as the condition, as shown in the figure below:

“where” is a string whose value is BIRTHDAY>=date(1981,1,1) && GENDER=="F".

The code in esProc is as follows:

A1: Defines a cursor over the HDFS file object, with the first row as the header and tab as the default field separator. The compression format is determined by the filename extension; here it is gzip, and esProc supports other compression formats as well. UTF-8 is the charset; if no charset is specified, the JVM’s default charset is used.

A2: Filters the cursor according to the condition. A macro is used here to parse the expression dynamically, with “where” as the input parameter. esProc first computes the expression enclosed in ${…}, takes the computed result as the macro string value and substitutes it for ${…}, and then interprets and executes the resulting code. The code finally executed in this example is =A1.select(BIRTHDAY>=date(1981,1,1) && GENDER=="F").
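The macro step in A2 is essentially text substitution that happens before the code is interpreted. The sketch below imitates that idea in Java under a simplifying assumption: the text inside ${…} is treated as a plain parameter name and looked up in a map, whereas esProc evaluates it as an expression. The class and method names are hypothetical, and this is not how esProc is implemented internally.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of macro expansion; not esProc's implementation.
final class MacroSketch {
    // Matches ${name} and captures the text between the braces.
    private static final Pattern MACRO = Pattern.compile("\\$\\{([^}]*)\\}");

    /** Replaces every ${name} in the cell text with the matching parameter value. */
    static String expand(String cell, Map<String, String> params) {
        Matcher matcher = MACRO.matcher(cell);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            String name = matcher.group(1).trim();
            String value = params.get(name);
            if (value == null) {
                throw new IllegalArgumentException("No value for macro: " + name);
            }
            // quoteReplacement keeps '$' and '\' in the value from being read as group references.
            matcher.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        matcher.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String where = "BIRTHDAY>=date(1981,1,1) && GENDER==\"F\"";
        // prints =A1.select(BIRTHDAY>=date(1981,1,1) && GENDER=="F")
        System.out.println(expand("=A1.select(${where})", Map.of("where", where)));
    }
}
```

Because the substitution happens on text, the parameter can carry any condition the script language accepts, which is what makes the filter changeable without touching the script.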

A3: Returns the cursor. If the filtering condition changes, you only need to change the parameter “where”; the code stays the same. For example, to query the female employees who were born on or after January 1, 1981, or any employee whose NAME+SURNAME is “RebeccaMoore”, the value of “where” can be written as BIRTHDAY>=date(1981,1,1) && GENDER=="F" || NAME+SURNAME=="RebeccaMoore".
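The combined condition mixes && and || without parentheses, so its meaning depends on operator precedence. The sketch below evaluates the same logic in Java against a few of the sample rows, assuming esProc groups the expression as (A && B) || C, which is how Java groups it and what the description above implies. The class is hypothetical and carries only the fields the condition uses.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of how the combined condition groups.
final class WherePrecedenceSketch {
    /** Only the fields the condition touches; values are taken from the sample rows. */
    static final class Employee {
        final int eid;
        final String name;
        final String surname;
        final String gender;
        final LocalDate birthday;

        Employee(int eid, String name, String surname, String gender, String birthday) {
            this.eid = eid;
            this.name = name;
            this.surname = surname;
            this.gender = gender;
            this.birthday = LocalDate.parse(birthday);
        }
    }

    /** && binds tighter than ||, so this reads as (bornOnOrAfter && female) || nameMatch. */
    static boolean matches(Employee e) {
        boolean bornOnOrAfter = !e.birthday.isBefore(LocalDate.of(1981, 1, 1));
        return bornOnOrAfter && e.gender.equals("F")
                || (e.name + e.surname).equals("RebeccaMoore");
    }

    static List<Employee> sampleRows() {
        return List.of(
                new Employee(1, "Rebecca", "Moore", "F", "1974-11-20"),
                new Employee(4, "Emily", "Smith", "F", "1985-03-07"),
                new Employee(6, "Matthew", "Johnson", "M", "1984-07-07"),
                new Employee(9, "Victoria", "Davis", "F", "1983-12-07"),
                new Employee(13, "Daniel", "Davis", "M", "1982-05-14"));
    }

    static List<Integer> matchingIds(List<Employee> employees) {
        List<Integer> ids = new ArrayList<>();
        for (Employee e : employees) {
            if (matches(e)) {
                ids.add(e.eid);
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        System.out.println(matchingIds(sampleRows())); // prints [1, 4, 9]
    }
}
```

Employee 1 is selected only through the name clause, since her birthday is before 1981. If the intended grouping were different, explicit parentheses in the “where” string would make it unambiguous.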

The Java code that calls this script through esProc JDBC is as follows (save the esProc program as test.dfx and put the Hadoop jars that HDFS requires on Java’s classpath):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;

// create a connection using esProc JDBC
Class.forName("com.esproc.jdbc.InternalDriver");
Connection con = DriverManager.getConnection("jdbc:esproc:local://");
// call the esProc program as a stored procedure; test is the name of the dfx file
com.esproc.jdbc.InternalCStatement st = (com.esproc.jdbc.InternalCStatement) con.prepareCall("call test(?)");
// set the parameter, which is the dynamic filtering condition
st.setObject(1, "BIRTHDAY>=date(1981,1,1) && GENDER==\"F\" || NAME+SURNAME==\"RebeccaMoore\"");
// execute the esProc stored procedure
st.execute();
// get the result set, which is the set of eligible employees
ResultSet set = st.getResultSet();


March 1, 2015

