Text processing often requires string matching against large files. Doing this
with command-line tools such as grep and cat is simple but inefficient. A
high-level language can be more efficient, but the code is considerably harder
to write. esProc handles this operation, as well as multithreaded parallel
computing, with more concise code and much better performance. The following
examples show the esProc approach in detail.
file1.txt contains a large number of strings. Find the rows ending with “.txt”
and export them to result.txt. Some of the original data are as follows:
The esProc code for this task:
A1: Open the file as a cursor. Instead of loading all the data into memory at
once, the cursor function opens the file as a stream, so the data does not
occupy memory. With default parameters, the function imports all the fields,
uses tab as the column separator, and automatically names the fields _1, _2,
_3…_n. There is only one field, _1, in this example.
A2=A1.select(like@c(_1,"*.txt"))
This line of code selects the rows ending with “.txt” from cursor A1. The
select function executes the query and the like function performs the string
matching. _1 represents the first field. The @c option of the like function
makes the matching case-insensitive.
Note that the result of A2 is still a cursor and holds no data in memory. Only
when a function such as export, fetch, or groups is called does esProc allocate
suitable memory buffers and convert the cursor computation into in-memory
computation.
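The lazy behaviour described above can be illustrated outside esProc. The following is a minimal Python sketch, not esProc code: a generator stands in for the cursor, and nothing is read until the export loop pulls rows from it. The names `select_like` and `export_matches` and the UTF-8 encoding are assumptions made for illustration only.

```python
import fnmatch
from typing import Iterable, Iterator


def select_like(lines: Iterable[str], pattern: str) -> Iterator[str]:
    """Lazily yield lines that match a shell-style wildcard, ignoring case."""
    lowered_pattern = pattern.lower()
    for line in lines:
        text = line.rstrip("\r\n")
        # fnmatchcase never normalises case itself, so both sides are lowered.
        if fnmatch.fnmatchcase(text.lower(), lowered_pattern):
            yield text


def export_matches(src_path: str, dst_path: str, pattern: str = "*.txt") -> int:
    """Stream src_path line by line and write the matching rows to dst_path."""
    count = 0
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for text in select_like(src, pattern):  # rows are pulled one at a time
            dst.write(text + "\n")
            count += 1
    return count
```

Only one line is held in memory at a time, which is the property the cursor provides in the esProc script.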
The matching rule in the example above is relatively simple. A complex rule
calls for a regular expression. For example, find the rows that start with
“c:\windows” and do not end with “.txt”.
The regex function performs string matching with a regular expression. Just
change A2’s code to A1.regex@c("^c:\\\\windows.*(?<!\\\\(.txt)$)"), in which the
@c option means case-insensitive.
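The rule "starts with c:\windows and does not end with .txt" can also be sketched with Python's `re` module. This is an illustration in Python regex syntax, not the esProc pattern string; the negative lookbehind `(?<!\.txt)` before `$` is one assumed way to express "does not end with .txt", and `select_regex` is a hypothetical helper name.

```python
import re

# Python `re` syntax: the row must start with c:\windows and must not end
# with .txt. re.IGNORECASE plays the role of the case-insensitive option.
PATTERN = re.compile(r"^c:\\windows.*(?<!\.txt)$", re.IGNORECASE)


def select_regex(lines):
    """Lazily yield the rows that satisfy PATTERN."""
    for line in lines:
        text = line.rstrip("\r\n")
        if PATTERN.search(text):
            yield text
```

Because the lookbehind is evaluated at the end of the row, a row such as `c:\windows\notes.TXT` is rejected while `C:\Windows\system32\a.dll` is kept.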
Although a regular expression can express complex matching rules, its
performance is unsatisfactory. For example, finding the rows ending with “.txt”
in a 2.13 GB file in the same test environment takes 206 seconds with a regular
expression, but only 119 seconds with an ordinary expression (the select
statement).
The @t option of the cursor function means that the first row is imported as
column names, so that the data can be accessed by name later.
The three query conditions can be expressed as EId>100, like@c(Name,"a*"), and
Birthday>=date("1984-01-01") respectively. The logical relation between the
conditions is “AND”, which is expressed with &&.
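For comparison, the same three conditions joined by AND can be sketched in Python. This is only an illustration: it assumes a tab-separated file whose header row contains the fields EId, Name, and Birthday, and that Birthday is stored in ISO format (YYYY-MM-DD). The function names are hypothetical.

```python
import csv
from datetime import date


def matches(row: dict) -> bool:
    """EId > 100 AND Name starts with 'a' (any case) AND Birthday >= 1984-01-01."""
    return (
        int(row["EId"]) > 100
        and row["Name"].lower().startswith("a")
        and date.fromisoformat(row["Birthday"]) >= date(1984, 1, 1)
    )


def select_rows(stream):
    """Lazily yield matching rows; the first line supplies the column names."""
    reader = csv.DictReader(stream, delimiter="\t")
    return (row for row in reader if matches(row))
```

`csv.DictReader` takes the column names from the first row, which corresponds to the role the text gives to the @t option.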
The algorithm above is sequential. Performance can be improved further with
parallel computing: read the file with multiple threads, each of which accesses
one part of the file through its own cursor and performs its computation at the
same time; finally, merge the results of all the cursors.
A test on a 2.13 GB file under the same hardware environment took an average of
119 seconds with sequential computation and only an average of 56 seconds with
parallel computing, which is roughly twice as fast. The algorithm in this
example is not very complex, so the bottleneck is the hard drive’s read speed.
As the computation becomes more complex, the performance gain will be greater.
A1=4. A1 is the number of segments, meaning the file will be divided into 4
segments. This number equals the number of parallel tasks, which generally
should not exceed the number of CPU cores; otherwise the tasks will be queued
and performance will not really improve. The maximum number of parallel tasks
can be configured in the environment options.
A2=A1.(file("e:\\file1.txt").cursor@z(;, ~:A1))
This line of code generates four cursors according to the specified number of
segments. A1.(expression) means computing the expression against each member of
A1 in turn; “~” can be used inside the parentheses to represent the current
member. Generally A1 is a set, such as ["file1", "file2"] or [2,3]. If the
members of the set are consecutive integers starting from 1, such as [1,2,3,4],
the code can be written in the short form 4.(expression), as in this example.
In the expression inside the parentheses,
file("e:\\file1.txt").cursor@z(;, ~:A1), the cursor function uses the @z option
to segment the file and fetch one segment with each cursor. ~:A1 means that the
file is roughly divided into four segments (A1=4) and the ~th segment is
fetched. “~” represents the current member of A1, so the four cursors
correspond to the first, second, third, and fourth segments respectively.
An exact division would cut lines in half, but esProc imports complete lines
automatically by skipping the partial line at the beginning of a segment and
completing the partial line at its end. This is why the file is said to be
divided “roughly”.
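The idea of rough segmentation with line alignment can be sketched in Python. This is one common way to implement it, assumed here for illustration and not taken from esProc's internals: a line belongs to the segment in which its first byte falls, so each segment discards a leading partial line and reads its last line to the end. The function name `read_segment` and the UTF-8 decoding are assumptions.

```python
import io


def read_segment(f, size: int, k: int, n: int) -> list:
    """Return the complete lines of segment k (1-based) out of n.

    f is a seekable binary stream and size is its length in bytes.
    """
    start = size * (k - 1) // n
    end = size * k // n
    # Step back one byte so that a line starting exactly at `start` is kept:
    # the readline() below then consumes only the preceding newline.
    f.seek(max(start - 1, 0))
    if start > 0:
        f.readline()  # discard the partial line owned by the previous segment
    lines = []
    while f.tell() < end:  # a line is ours if it starts before `end`
        raw = f.readline()  # may run past `end` to complete the last line
        if not raw:
            break
        lines.append(raw.rstrip(b"\r\n").decode("utf-8"))
    return lines


# Usage: the four segments together contain every line exactly once.
data = b"a.txt\nbb.doc\nccc.TXT\ndddd\ne.txt\n"
parts = [read_segment(io.BytesIO(data), len(data), k, 4) for k in range(1, 5)]
```

Each segment can be read independently of the others, which is what makes it possible to hand one segment to each thread.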
A3=A2.(~.select(like@c(_1,"*.txt"))). This line of code queries each cursor (i.e. ~) in A2 and selects the eligible rows. The results are still four cursors.
A4=A3.conj@xm(). This line of code merges the four cursors in A3 in parallel.
A5=file("e:\\result.txt").export(A4). This line of code exports the final result to a file.
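The overall shape of steps A1 to A5 (split, filter each part concurrently, merge) can be sketched in Python as follows. This is an illustration of the structure only, not a performance claim: it works on in-memory lists of lines rather than file segments, and the names `select_txt` and `parallel_select` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import chain


def select_txt(lines):
    """Keep the rows ending with .txt, ignoring case."""
    return [s for s in lines if s.lower().endswith(".txt")]


def parallel_select(segments, workers: int = 4):
    """Filter every segment in its own task, then concatenate the results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the order of the input segments.
        results = list(pool.map(select_txt, segments))
    return list(chain.from_iterable(results))
```

In this sketch the merged result follows segment order because `pool.map` preserves input order; whether the esProc merge guarantees any particular order is not stated in the text.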
An esProc script can not only run independently in an Integrated Development
Environment (IDE) but also be called from a Java program through the JDBC
interface, in the same way as an ordinary database is called. A one-step esProc
script can be embedded directly in the Java program without a script file. In
fact, the steps above can be combined into a single step:
file("e:\\result.txt").export(4.(file("e:\\file1.txt").cursor@z(;, ~:4)).(~.select(like@c(_1, "*.txt"))).conj@xm())
Such a one-step script can also be run from the operating system’s command
line. Please refer to the related documentation for further information.