It is common to perform set operations on
big files in text processing. For example, find different rows between two
files. The code for handling the operations with command line grep or cat command is simple but inefficient. While the operational
efficiency is high when high-level languages are used to handle the operations,
the code is difficult to write.
esProc supports performing set operations
on big files and multithreaded parallel computing. Its code is concise and its
performance is remarkable. The following examples will show the esProc method
in detail.
file.1txt and file2.txt hold a large number
of strings respectively. Find their common rows (that is, compute the intersection).
Some of the data are shown as follows:
Both files are big
When both files are too big to be loaded
into the memory, esProc’s way of cursor merge can be used to realize the
intersection operation. The code is as follows:
A1, B1:Open the files as cursors.
cursor function doesn’t import all
the data into the memory, it opens a file in the form of a cursor (stream) without
the memory footprint. The function uses default parameters to import all the
fields and to name the new columns automatically as _1, _2, _3…_n, with tab being the column separator. There is only one field - _1 - in this example.
A2=[A1.sortx(_1),B1.sortx(_1)].merge@xi(_1)
The above code uses the merge operation to
find the common rows of the two files, that is, to compute the intersection. A
merge operation requires ordered data, so sortx
function is used to sort the cursors first. The corresponding code is A1.sortx(_1) and A2.sortx(_1),
in which _1 is the default
field name in the files. merge
function is used to merge multiple groups of data (two groups as with this
example). Without parameter options, it merges the sets in the memory; the use
of parameter option @x means to merge
the cursors and @i means that the
merge result is the intersection.
Notice that the result of A2 is still a
cursor without the memory footprint. Only when the computation involves
functions like export, fetch, groups and etc. will esProc engine allocate
suitable buffers and automatically convert the cursor computing into in-memory
computing.
A3=file("E:\\result.txt").export(A2)
This line of code writes cursor A2 to a file. export function can be used to write both the in-memory data and cursors to files. Here default parameters are used, that is, no column name, using tab as the column separator, overwriting instead of appending files and writing as text files instead of binary files. Open result.txt and you can see some of the data as shown below:
Actually the above three-step esProc code
can be combined into a single line of code:
A1=file("e:\\result.txt").export([file("E:\\file1.txt").cursor().sortx(_1),file("E:\\file2.txt").cursor().sortx(_1)].merge@xi(_1))
By not removing the duplicate members, concatenation will be computed. The code is [A1.sortx(_1),B1.sortx(_1)].merge@x(_1), which shows the real meaning of merge operation. The result is as follows:
Use function option @d to compute the difference, the data which are included in file1.txt but not included in file2.txt. The code is [A1.sortx(_1),B1.sortx(_1)].merge@xd(_1). Result is as follows:
Note: The commutative law doesn’t apply to difference operation. So the code for getting the data which are included in file2.txt but not included in file1.txt is [B1.sortx(_1),A1.sortx(_1)].merge@xd(_1). The result is as follows:
One file is big, the other is
small
When there is only one big file that runs
out of the memory, the smaller one can be loaded into the memory and then use a
hash table to perform the set operations. Thus the efficiency can be increased
significantly. esProc code is as follows:
A1:Open the files as
cursors.
A2=file("e:\\file2.txt").import(). This line of code imports the smaller file2.txt into the memory entirely. Similar to cursor function, import function’s default field names are _1 and _2 as well. Click B1 in esProc’s Integrated Development Environment (IDE) to see the computed result of the current cell:
B2>B1.primary(_1).index()
The
above code defines a primary key for B1 and creates a hash index. The query
speed can be increased greatly in this way. B1 and B2 can be combined into a
single line of code : B1=file("E:\\ file2is.txt").import().primary(_1).index()
A3=A1.select(B1.find(~._1))
This
line of code selects the common data of cursor A1 and cursor B1, which is computing
the intersection. select function is
used to execute the query statement and eligible data will be selected. In the
function, ~ represents the current record.
The
query criterion is B1.find(~._1), meaning to
find in B1 the _1 field of the
current record of A1.
Notice
that the result is still a cursor when the code is executed.
A4=file("E:\\result.txt").export(A3). This line of code writes the final result into a file.
When computing difference, modify the code in A3 to A1.select(!B1.find(~._1)),
which selects from A1 the rows which are not included in B1.
To compute the union of file1 and file2, first compute the difference of file1 and file2 and then union it with file2.
conj function in the expression [A3,B1.cursor()].conj@x()
in A4 can concatenate multiple sets (which is concatenation operation) –
[[1,2,3],[2,3],[]]=[1,2,3,2,3], for example. Function option @x is used to concatenate the data in
multiple cursors. Since B1 is not a cursor, cursor
function should be used to convert it to a cursor, i.e. B1.cursor(). This is faster than importing data from the file.
Compute
a big file and a small one in parallel
The sequential
computation is used in the above case, but parallel computation can further enhance the performance. The
method is to import files using multithreads. Each thread accesses a part of
the file with a cursor, and meanwhile performs a set operation and finally
combines the result of each cursor together.
Test a big file of 2.77G size and a small
one of 39.93M size under the same hardware environment. It takes an average of
85 seconds to accomplish the sequential computation
while it takes an average of 47 seconds to accomplish it with parallel
computing. The speed has nearly been doubled. As the set operations are not
that complicated in themselves, the computation hits a bottleneck in the hard
drive’s ability to import data. The performance will be improved more greatly
while the operation is getting more complex.
esProc code for parallel computing is as
follows:
B1:Import the small
file into the memory, define a primary and create an index.
A2=4. A2 represents the number of segments
into which a file is going to be divided. Here the number is 4. Generally the
number of segments should not exceed the number of CPU cores; otherwise the tasks
will be queued for processing and the speed won’t be really increased. The maximum
parallel number can be configured in the environment option.
A3=A2.(file("e:\\file1.txt").cursor@z(;,
~:A2))
The above code generates four cursors
according to the number of segments. A2.(express) means
computing the expression with members of A2 in order. In the parentheses, “~” can
be used to represent the current member. Generally A2 is a set, like ["file1", " file2"] or [2,3]. If members of the set are a series of consecutive
numbers beginning from 1, like [1,2,3,4], A2 can be abbreviated to 4.( express), as with the code in this example.
In the
expression - file("e:\\file1.txt").cursor@z(;,
~:A2) - surrounded by the parentheses, cursor function uses @z
option to divide the file into multiple segments and to import one of them
using a cursor. ~:A2 means that the file
will be roughly divided into 4 segments (A2=4), with “~” representing the
current member in A2. Thus the cursors correspond to the first, the second, the
third and the fourth segment of file respectively.
The
reason for dividing a file “roughly” is that half rows will appear with the
exact division. esProc can make sure of importing whole rows automatically by skipping
the head incomplete row and making up the tail incomplete row, this is tedious
to realize in Java.
A4=A3.(~.select(B1.find(~._1))).
This line of code computes intersection on each cursor
in A3 and the result is a set of cursors.
A5=A4.conj@xm(). This line of code combines the multiple cursors in A4 in parallel.
A6=file("e:\\result.txt”).export(A5). The code writes the final result into a file. The intersection
operation has been accomplished at this point. You can refer to the previous
examples to compute difference and union.
esProc can be used alone or be integrated
into a Java program. The following is to integrate the esProc script into the
Java program through JDBC. The Java code is as follows:
// create a connection using esProc
jdbc
Class.forName("com.esproc.jdbc.InternalDriver");
con=
DriverManager.getConnection("jdbc:esproc:local://");
// call esProc script, whose name is
test.dfx
st =(com.esproc.jdbc.InternalCStatement)con.prepareCall("call
test()");
st.execute();//execute esProc stored
procedure, or input a parameter and output the computed result
For the simple script as with the first
example, it can be embedded in the Java code and executed directly, without
writing the script file (text.dfx). The corresponding Java code is as follows:
ResultSet set=st.executeQuery("file(\"e:\\result.txt\").export([file(\"E:\\file1.txt\").cursor().sortx(_1),file(\"E:\\file2.txt\").cursor().sortx(_1)].merge@xi(_1))");
No comments:
Post a Comment