esProc, A Script Language for Data Analytics with Parallel Mechanism: esProc’s Participation in Java Structured Text Processing

It is simple for Java to sort data in text files of small size. But when big files are involved, you need to import data segmentally, write out the sorting result of each segment in temporary files and at last, merge these temporary files. The programming will be rather complicated. Even if the file is small enough to be loaded to the memory, you will have to analyze the data type of the text file. The analysis is not difficult but the code is long.

However, these problems can be avoided by using esProc to help with programming in Java. Let's look at in detail how this will happen. Now you are required to sort the employee data in the text file employee.txt by STATE in ascending order and by BIRTHDAY in descending order. It is assumed that the data size of the file is huge and exceeds the memory capacity.

employee.txt is of the following format:

EID NAME SURNAME GENDER STATE BIRTHDAY HIREDATE DEPT SALARY

1 Rebecca Moore F California 1974-11-20 2005-03-11 R&D 7000

2 Ashley Wilson F New York 1980-07-19 2008-03-16 Finance 11000

3 Rachel Johnson F New Mexico 1970-12-17 2010-12-01 Sales 9000

4 Emily Smith F Texas 1985-03-07 2006-08-15 HR 7000

5 Ashley Smith F Texas 1975-05-13 2004-07-30 R&D 16000

6 Matthew Johnson M California 1984-07-07 2005-07-07 Sales 11000

7 Alexis Smith F Illinois 1972-08-16 2002-08-16 Sales 9000

8 Megan Wilson F California 1979-04-19 1984-04-19 Marketing 11000

9 Victoria Davis F Texas 1983-12-07 2009-12-07 HR 3000

10 Ryan Johnson M Pennsylvania 1976-03-12 2006-03-12 R&D 13000

11 Jacob Moore M Texas 1974-12-16 2004-12-16 Sales 12000

12 Jessica Davis F New York 1980-09-11 2008-09-11 Sales 7000

13 Daniel Davis M Florida 1982-05-14 2010-05-14 Finance 10000

…

Implementation approach: call the esProc script with Java, import and compute the data, and then return the result in the form of ResultSet to Java. To perform sorting by STATE in ascending order and by BIRTHDAY in descending order, esProc will use an input parameter "sortBy" as the sorting expression, as shown in the following figure:

The value of "sortBy" is STATE,-BIRTHDAY. The field with a minus before it represents the opposite number, which is valid for string data, numerical data and date data.

The code in esProc is as follows:

A1: Define a file cursor object, with the first row being the title and tab being the field separator by default.

A2: Perform sorting according to the expression, using macro to realize parsing the expression dynamically. The "sortBy" in this process is an input parameter. In executing, esProc will first compute the expression enclosed by ${…}, then replace ${…} with the computed result acting as the macro string value and interpret and execute the code. The final code to be executed in this example is =A1.sortx(STATE,-BIRTHDAY;1000000).

A3: Return the result cursor to external program. While Java receives the returned result and traverses the data with ResultSet, esProc will automatically fetch the data corresponding to the cursor. If the sorted data are to be written into other files, the code in A3 should be modified into =file("D:/employee_result.txt").export@t(A2).

If the sorting fields and order are changed, you just modify sortBy– the parameter. For example, if the data are required to be sorted by NAME in ascending order and by STATE and BIRTHDAY in descending order, the value of sortBy will be written as NAME,-STATE,-BIRTHDAY.

sortx function performs sorting by importing data segmentally according to buffer rows, write the results of sorting each segment into temporary files, redistribute the memory usage and then merge these temporary files. Here the parameter 1000000 refers to buffer rows. The principle of assigning value to it is to make the best of the memory to reduce the number of temporary files as far as possible. The number of temporary files is related to the size of both the physical memory and the records, and should be evaluated during programming. Generally, the recommended number is between magnitudes of several hundred thousand to a magnitude of one million.

The code of calling this piece of code (which is saved as test.dfx) in Java with esProc JDBC is as follows:

//create a connection between esProc and its JDBC

Class.forName("com.esproc.jdbc.InternalDriver");

con= DriverManager.getConnection("jdbc:esproc:local://");

//call the program in esProc (the stored procedure); test is the name of file dfx

st =(com.esproc.jdbc.InternalCStatement)con.prepareCall("call test(?)");

//set the parameters

st.setObject(1,"NAME,-STATE,-BIRTHDAY");

//execute the esProc stored procedure

st.execute();

//get the result set, which is the eligible set of employees

ResultSet set = st.getResultSet();

If the script is simple, the code can be written directly into the program in Java that calls the esProc JDBC. It won’t be necessary to write a special esProc script file (test.dfx):

st=(com. esproc.jdbc.InternalCStatement)con.createStatement();

ResultSet set=st.executeQuery("=file(\"D:/employee.txt\").cursor@t().sortx(NAME,-STATE,BIRTHDAY;1000000)");

This piece of code in Java calls a line of code in esProc script directly, that is, get the data from the text file, compute the mand return the result set to set– the object of ResultSet.

If the data in employee.txt can be loaded to the memory altogether, sort function can thus be used to perform all-in-memory sorting, in which esProc won't generate the temporary files. The computational speed will be much faster. The code is as follows:

Parameter sortBy can be written as STATE,-BIRTHDAY, or STATE:1,BIRTHDAY:-1. And it is no need to modify the calling program of Java.

menu

September 30, 2014

esProc’s Participation in Java Structured Text Processing - Sorting

No comments:

Post a Comment