March 11, 2015

esProc Improves Text Processing – Set Operations on Big Files

It is common to perform set operations on big files in text processing. For example, find different rows between two files. The code for handling the operations with command line grep or cat command is simple but inefficient. While the operational efficiency is high when high-level languages are used to handle the operations, the code is difficult to write.

esProc supports performing set operations on big files and multithreaded parallel computing. Its code is concise and its performance is remarkable. The following examples will show the esProc method in detail.


file.1txt and file2.txt hold a large number of strings respectively. Find their common rows (that is, compute the intersection). Some of the data are shown as follows:

Both files are big
When both files are too big to be loaded into the memory, esProc’s way of cursor merge can be used to realize the intersection operation. The code is as follows:

A1, B1Open the files as cursors. cursor function doesn’t import all the data into the memory, it opens a file in the form of a cursor (stream) without the memory footprint. The function uses default parameters to import all the fields and to name the new columns automatically as _1, _2, _3…_n, with tab being the column separator. There is only one field - _1 - in this example.

A2=[A1.sortx(_1),B1.sortx(_1)].merge@xi(_1)

The above code uses the merge operation to find the common rows of the two files, that is, to compute the intersection. A merge operation requires ordered data, so sortx function is used to sort the cursors first. The corresponding code is A1.sortx(_1) and A2.sortx(_1), in which _1 is the default field name in the files. merge function is used to merge multiple groups of data (two groups as with this example). Without parameter options, it merges the sets in the memory; the use of parameter option @x means to merge the cursors and @i means that the merge result is the intersection.

Notice that the result of A2 is still a cursor without the memory footprint. Only when the computation involves functions like export, fetch, groups and etc. will esProc engine allocate suitable buffers and automatically convert the cursor computing into in-memory computing. 

A3=file("E:\\result.txt").export(A2)

This line of code writes cursor A2 to a file. export function can be used to write both the in-memory data and cursors to files. Here default parameters are used, that is, no column name, using tab as the column separator, overwriting instead of appending files and writing as text files instead of binary files. Open result.txt and you can see some of the data as shown below:

Actually the above three-step esProc code can be combined into a single line of code: 
A1=file("e:\\result.txt").export([file("E:\\file1.txt").cursor().sortx(_1),file("E:\\file2.txt").cursor().sortx(_1)].merge@xi(_1))   

In addition to the intersection operation performed in this example, there are also operations of union, concatenation and difference. Just to modify the option of merge function in A2 to realize them. For example, union file1.txt and file2.txt and remove the duplicate members to get their union. Function option @u can be used to compute the union with the following code: [A1.sortx(_1),B1.sortx(_1)].merge@xu(_1). Their union is as follows:

By not removing the duplicate members, concatenation will be computed. The code is [A1.sortx(_1),B1.sortx(_1)].merge@x(_1), which shows the real meaning of merge operation. The result is as follows:

Use function option @d to compute the difference, the data which are included in file1.txt but not included in file2.txt. The code is [A1.sortx(_1),B1.sortx(_1)].merge@xd(_1). Result is as follows:

Note: The commutative law doesn’t apply to difference operation. So the code for getting the data which are included in file2.txt but not included in file1.txt is [B1.sortx(_1),A1.sortx(_1)].merge@xd(_1). The result is as follows:

One file is big, the other is small
When there is only one big file that runs out of the memory, the smaller one can be loaded into the memory and then use a hash table to perform the set operations. Thus the efficiency can be increased significantly. esProc code is as follows:

A1Open the files as cursors.

A2=file("e:\\file2.txt").import(). This line of code imports the smaller file2.txt into the memory entirely. Similar to cursor function, import function’s default field names are _1 and _2 as well. Click B1 in esProc’s Integrated Development Environment (IDE) to see the computed result of the current cell:

B2>B1.primary(_1).index()

The above code defines a primary key for B1 and creates a hash index. The query speed can be increased greatly in this way. B1 and B2 can be combined into a single line of code : B1=file("E:\\ file2is.txt").import().primary(_1).index()

A3=A1.select(B1.find(~._1))

This line of code selects the common data of cursor A1 and cursor B1, which is computing the intersection. select function is used to execute the query statement and eligible data will be selected. In the function, ~ represents the current record.

The query criterion is B1.find(~._1), meaning to find in B1 the _1 field of the current record of A1.

Notice that the result is still a cursor when the code is executed.

A4=file("E:\\result.txt").export(A3). This line of code writes the final result into a file.

When computing difference, modify the code in A3 to A1.select(!B1.find(~._1)), which selects from A1 the rows which are not included in B1.

To compute the union of file1 and file2, first compute the difference of file1 and file2 and then union it with file2.

conj function in the expression [A3,B1.cursor()].conj@x() in A4 can concatenate multiple sets (which is concatenation operation) – [[1,2,3],[2,3],[]]=[1,2,3,2,3], for example. Function option @x is used to concatenate the data in multiple cursors. Since B1 is not a cursor, cursor function should be used to convert it to a cursor, i.e. B1.cursor(). This is faster than importing data from the file.

Compute a big file and a small one in parallel
The sequential computation is used in the above case, but parallel computation can further enhance the performance. The method is to import files using multithreads. Each thread accesses a part of the file with a cursor, and meanwhile performs a set operation and finally combines the result of each cursor together.

Test a big file of 2.77G size and a small one of 39.93M size under the same hardware environment. It takes an average of 85 seconds to accomplish the sequential computation while it takes an average of 47 seconds to accomplish it with parallel computing. The speed has nearly been doubled. As the set operations are not that complicated in themselves, the computation hits a bottleneck in the hard drive’s ability to import data. The performance will be improved more greatly while the operation is getting more complex.
esProc code for parallel computing is as follows:

B1Import the small file into the memory, define a primary and create an index.
A2=4. A2 represents the number of segments into which a file is going to be divided. Here the number is 4. Generally the number of segments should not exceed the number of CPU cores; otherwise the tasks will be queued for processing and the speed won’t be really increased. The maximum parallel number can be configured in the environment option.

A3=A2.(file("e:\\file1.txt").cursor@z(;, ~:A2))

The above code generates four cursors according to the number of segments. A2.(express) means computing the expression with members of A2 in order. In the parentheses, “~” can be used to represent the current member. Generally A2 is a set, like ["file1", " file2"] or [2,3]. If members of the set are a series of consecutive numbers beginning from 1, like [1,2,3,4], A2 can be abbreviated to 4.( express), as with the code in this example.

In the expression - file("e:\\file1.txt").cursor@z(;, ~:A2) - surrounded by the parentheses, cursor function uses @z option to divide the file into multiple segments and to import one of them using a cursor. ~:A2 means that the file will be roughly divided into 4 segments (A2=4), with “~” representing the current member in A2. Thus the cursors correspond to the first, the second, the third and the fourth segment of file respectively.

The reason for dividing a file “roughly” is that half rows will appear with the exact division. esProc can make sure of importing whole rows automatically by skipping the head incomplete row and making up the tail incomplete row, this is tedious to realize in Java.

A4=A3.(~.select(B1.find(~._1))). This line of code computes intersection on each cursor in A3 and the result is a set of cursors.

A5=A4.conj@xm(). This line of code combines the multiple cursors in A4 in parallel.

A6=file("e:\\result.txt”).export(A5). The code writes the final result into a file. The intersection operation has been accomplished at this point. You can refer to the previous examples to compute difference and union.

esProc can be used alone or be integrated into a Java program. The following is to integrate the esProc script into the Java program through JDBC. The Java code is as follows:
  // create a connection using esProc jdbc
  Class.forName("com.esproc.jdbc.InternalDriver");
  con= DriverManager.getConnection("jdbc:esproc:local://");
  // call esProc script, whose name is test.dfx
  st =(com.esproc.jdbc.InternalCStatement)con.prepareCall("call test()");
  st.execute();//execute esProc stored procedure, or input a parameter and output the computed result

For the simple script as with the first example, it can be embedded in the Java code and executed directly, without writing the script file (text.dfx). The corresponding Java code is as follows:
ResultSet set=st.executeQuery("file(\"e:\\result.txt\").export([file(\"E:\\file1.txt\").cursor().sortx(_1),file(\"E:\\file2.txt\").cursor().sortx(_1)].merge@xi(_1))");