September 30, 2014

esProc’s Participation in Java Structured Text Processing - Sorting

It is simple for Java to sort data in text files of small size. But when big files are involved, you need to import data segmentally, write out the sorting result of each segment in temporary files and at last, merge these temporary files. The programming will be rather complicated. Even if the file is small enough to be loaded to the memory, you will have to analyze the data type of the text file. The analysis is not difficult but the code is long.

However, these problems can be avoided by using esProc to help with programming in Java. Let's look at in detail how this will happen. Now you are required to sort the employee data in the text file employee.txt by STATE in ascending order and by BIRTHDAY in descending order. It is assumed that the data size of the file is huge and exceeds the memory capacity.

employee.txt is of the following format:
EID   NAME       SURNAME        GENDER  STATE        BIRTHDAY        HIREDATE         DEPT         SALARY
1       Rebecca   Moore      F       California 1974-11-20       2005-03-11       R&D          7000
2       Ashley      Wilson      F       New York 1980-07-19       2008-03-16       Finance    11000
3       Rachel      Johnson   F       New Mexico     1970-12-17       2010-12-01       Sales         9000
4       Emily         Smith        F       Texas        1985-03-07       2006-08-15       HR    7000
5       Ashley      Smith        F       Texas        1975-05-13       2004-07-30       R&D          16000
6       Matthew Johnson   M     California 1984-07-07       2005-07-07       Sales         11000
7       Alexis        Smith        F       Illinois       1972-08-16       2002-08-16       Sales         9000
8       Megan     Wilson      F       California 1979-04-19       1984-04-19       Marketing        11000
9       Victoria    Davis        F       Texas        1983-12-07       2009-12-07       HR    3000
10     Ryan         Johnson   M     Pennsylvania    1976-03-12       2006-03-12       R&D          13000
11     Jacob        Moore      M     Texas        1974-12-16       2004-12-16       Sales         12000
12     Jessica     Davis        F       New York 1980-09-11       2008-09-11       Sales         7000
13     Daniel       Davis        M     Florida      1982-05-14       2010-05-14       Finance    10000

Implementation approach: call the esProc script with Java, import and compute the data, and then return the result in the form of ResultSet to Java. To perform sorting by STATE in ascending order and by BIRTHDAY in descending order, esProc will use an input parameter "sortBy" as the sorting expression, as shown in the following figure:

The value of "sortBy" is STATE,-BIRTHDAY. The field with a minus before it represents the opposite number, which is valid for string data, numerical data and date data.

The code in esProc is as follows: 

A1Define a file cursor object, with the first row being the title and tab being the field separator by default.

A2Perform sorting according to the expression, using macro to realize parsing the expression dynamically. The "sortBy" in this process is an input parameter. In executing, esProc will first compute the expression enclosed by ${…}, then replace ${…} with the computed result acting as the macro string value and interpret and execute the code. The final code to be executed in this example is =A1.sortx(STATE,-BIRTHDAY;1000000).

A3Return the result cursor to external program. While Java receives the returned result and traverses the data with ResultSet, esProc will automatically fetch the data corresponding to the cursor. If the sorted data are to be written into other files, the code in A3 should be modified into =file("D:/employee_result.txt").export@t(A2).

If the sorting fields and order are changed, you just modify sortBy– the parameter. For example, if the data are required to be sorted by NAME in ascending order and by STATE and BIRTHDAY in descending order, the value of sortBy will be written as NAME,-STATE,-BIRTHDAY.

sortx function performs sorting by importing data segmentally according to buffer rows, write the results of sorting each segment into temporary files, redistribute the memory usage and then merge these temporary files. Here the parameter 1000000 refers to buffer rows. The principle of assigning value to it is to make the best of the memory to reduce the number of temporary files as far as possible. The number of temporary files is related to the size of both the physical memory and the records, and should be evaluated during programming. Generally, the recommended number is between magnitudes of several hundred thousand to a magnitude of one million.

The code of calling this piece of code (which is saved as test.dfx) in Java with esProc JDBC is as follows:
          //create a connection between esProc and its JDBC
Class.forName("com.esproc.jdbc.InternalDriver");
con= DriverManager.getConnection("jdbc:esproc:local://");
//call the program in esProc (the stored procedure); test is the name of file dfx
st =(com.esproc.jdbc.InternalCStatement)con.prepareCall("call test(?)");
//set the parameters
st.setObject(1,"NAME,-STATE,-BIRTHDAY");
//execute the esProc stored procedure
st.execute();
//get the result set, which is the eligible set of employees
ResultSet set = st.getResultSet();
If the script is simple, the code can be written directly into the program in Java that calls the esProc JDBC. It won’t be necessary to write a special esProc script file (test.dfx):
st=(com. esproc.jdbc.InternalCStatement)con.createStatement();
ResultSet set=st.executeQuery("=file(\"D:/employee.txt\").cursor@t().sortx(NAME,-STATE,BIRTHDAY;1000000)");

This piece of code in Java calls a line of code in esProc script directly, that is, get the data from the text file, compute the mand return the result set to set– the object of ResultSet.

If the data in employee.txt can be loaded to the memory altogether, sort function can thus be used to perform all-in-memory sorting, in which esProc won't generate the temporary files. The computational speed will be much faster. The code is as follows: 

Parameter sortBy can be written as STATE,-BIRTHDAY, or STATE:1,BIRTHDAY:-1. And it is no need to modify the calling program of Java.