March 13, 2015

esProc Improves Text Processing – String Matching with Big Files

There are many occasions during text processing which require performing string matching with big files. Coding with command line grep\cat is simple yet inefficient. Though higher efficiency can be achieved with high-level languages, coding will be rather difficult.

Yet this operation, as well as multithreaded parallel computing, can be handled more easily in esProc, with more concise code and much better performance. The following examples will show esProc method in detail.


file1.txt has a great many strings. Find out the rows ending with “.txt” and export them to result.txt. Some of the original data are as follows:

esProc code for doing this task:

A1: Open the file in the form of cursors. Instead of importing all the data into the memory at a time, cursor function opens the file in the form cursors (stream) without memory footprint. The function uses default parameters to import all the fields with tab being the column separator and to automatically name them _1, _2, _3…_n respectively. There is only one field, _1, in this example.

A2=A1.select(like@c(_1,"*.txt"))

This line of code selects rows ending with “.txt” from cursor A1. select function executes the query and like function performs string matching. _1 represents the first field. The use of @c option in like function means the matching is case insensitive.

One point worth noting is that the result of A2 is still a cursor without memory footprint. Only with the use of functions like export/fetch/groups will esProc allocate suitable memory buffers and convert the cursor computing to memory computing.

A3=file("e:\\result.txt").export(A2). This line of code exports the final result to a file. Some of the data are as follows:

The matching rule in the example above is relatively simple. If the rule is complex, a regular expression will be needed. For example, find out rows starting with “c:\windows” and not ending with “.txt”.

regex function is used to perform string matching with the regular expression. Just modify A2’s code to A1.regex@c("^c:\\\\windows.*(?<!\\\\(.txt)$)") , in which @c option means case insensitive.

Though the regular expression can be used to realize the string matching with complex rule, its performance is not satisfactory. For example, to find out rows ending with “.txt” from a file of 2.13G size in the same test environment, it takes 206 seconds with a regular expression, while it takes only 119 seconds with an ordinary expression (the select statement).

In fact, many tasks of string matching with complex rule can also be realized with the ordinary expression. Moreover, the syntax is more visual and cost of learning is lower. For example, emp.txt holds a large number of user records, each of which has multiple fields, separated by tab and with the first row being the column names. Suppose you are to find out rows with the rule that “Eid field is lesser than 100, the first letter of Name filed is a and Birthday field is greater than 1984-01-01”. You can do it in esProc as follows:

The @t option used with cursor function means that the first row will be imported as column names for the use of accessing data at a later time.

The three query conditions can be represented by EId>100, like@c(Name,"a*") and Birthday>=date("1984-01-01") respectively. The logic relation between the conditions is “AND”, which can be represented by &&.

The above algorithm is sequential computation. The performance can be further improved if parallel computing is used. The method is this: Import the file using multithreads, each of which will access some of the data of the file with a cursor, and perform set operations at the same time; finally, merge the result of each cursor together.

Test the processing of a file of 2.13G size under the same hardware environment. It takes an average of 119 seconds with the sequential computation, whereas it takes only an average of 56 seconds with the parallel computing, which speeds the performance almost doubly. The algorithm used in the example is not so complex, so the bottleneck is the hard driver’s ability to import data. With the increase of the complexity of the computation, the performance will be improved more greatly.

esProc code for parallel computing:

A1=4. A1 is the number of segments, which means the file will be divided into 4 segments. The number is equal to the number of parallel tasks in operation, which generally should not exceed the number of CPU cores. Otherwise the tasks will be queued for processing and the performance won’t be really increased. The maximum number of the parallel tasks can be configured in the environment option.

A2=A1.(file("e:\\file1.txt").cursor@z(;, ~:A1))
This line of code will generate four cursors according to the specified number of segments. A1.(express) means computing the expression with members of A1 respectively. “~” can be used in the parentheses to represent the current member. Generally A1 is a set, like ["file1", " file2"] or [2,3]. If members of the set are consecutive numbers starting with 1, like [1,2,3,4], the code can be written in a simple form as 4.( express), as with the code in this example.

In the expression, file("e:\\file1.txt").cursor@z(;, ~:A1), surrounded in the parentheses, cursor function uses @z option to segment the file and fetch each part with a cursor. ~:A1 means that the file is roughly divided into four segments (A1=4) and the ~th segment is fetched. “~” represents the current member in A1 and each cursor corresponds to the first, the second, the third and the fourth segment respectively.

Besides, though exact division will result in incomplete lines, esProc can import complete lines automatically by skipping the beginning half line of a segment and completing the ending half line of the segment. This is why the file should be divided “roughly”.

A3=A2.(~.select(like@c(_1,"*.txt"))). This line of code queries data of each cursor (i.e. ~) in A2 and selects the eligible rows. The computed results are still four cursors.
A4=A3.conj@xm(). This line of code merges the four cursors in A3 in parallel.
A5=file("e:\\result.txt”).export(A4). This line of code exports the final result to a file.

An esProc script not only can work independently in an Integration Development Environment (IDE), it also can be called by a Java program through JDBC interface. The calling method is the same as the method of calling an ordinary database. A one-step esProc script can be embedded in the Java program directly without script file. Actually the above steps can be combined into one single step:
file("e:\\result.txt").export(4.(file("e:\\file1.txt").cursor@z(;, ~:4)).(~.select(like@c(_1, "*.txt"))).conj@xm())

It is also allowed to run this kind of one-step script in operating system’s command line. Please refer to related documents for further information. 

No comments:

Post a Comment