May 27, 2015

esProc Improves Text Processing – Characters Matching

During text processing you sometimes need to find the words that contain certain characters. The logic is simple, but it is hard to express as a regular expression because the characters may appear in any order, and the approach is also inefficient. You could write the program yourself, but many high-level languages offer little built-in support for set operations, which makes the coding harder. By contrast, esProc can parse a string dynamically, so it can match specific characters with simple and intuitive code. The following example shows how it works.

Find the words that contain all of the characters e, a and c in the following Sample.txt. Part of the original data is shown below:

esProc Improves Text Processing – Conditional Query on Big Files
During text processing, you often have the tasks of querying data from a big file on one or more conditions. Command line grep\cat command can be used to handle some simple situations with simple command yet low efficiency. Or high-level languages can be used to get a much higher efficiency with complicated code. If the query conditions are complex or dynamic, you need to create an additional SQL-like low-level class library, which increases the complexity of the computation.
……

esProc code for doing the task:
A1=file("e:\\sample.txt").read()

This line of code reads the whole file into memory as one big string, as shown below:

In addition, the read function can read the data line by line when used with the @n option. For example, the result of executing file("e:\\sampleB.txt").read@n() is as follows:

The import function can be used if the data are structured. For example, to import a file that uses tab as the separator and has the column names in the first row, the code can be file("e:\\sampleC.txt").import@t(). The result is as follows:
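For readers more at home in Python, the three ways of loading a file described above can be sketched roughly as follows. This is only an analogy, not esProc code: the helper names read_text, read_lines and import_tab are made up for illustration, and UTF-8 encoding is an assumption.

```python
import csv

def read_text(path):
    """Return the whole file as one string (roughly what read() does)."""
    with open(path, encoding="utf-8") as f:
        return f.read()

def read_lines(path):
    """Return the file as a list of lines without line endings (roughly read@n())."""
    with open(path, encoding="utf-8") as f:
        return f.read().splitlines()

def import_tab(path):
    """Return tab-separated rows as dicts keyed by the header row (roughly import@t())."""
    with open(path, encoding="utf-8", newline="") as f:
        return list(csv.DictReader(f, delimiter="\t"))
```

Note that in this sketch every imported field stays a string; any type conversion would have to be added separately.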

A2=A1.words()

This line of code splits the big string into words and creates a set from them. The words function automatically filters out numbers and symbols and keeps only the alphabetic characters. Add the @d option to select only the numbers, or the @a option to select both the words and the numbers. The result of A2 is as follows:
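The splitting behaviour just described can be approximated in Python with regular expressions. This is a sketch under assumptions: the post does not spell out the exact tokenization rules of words(), so the function below (a hypothetical helper, also called words) simply treats runs of ASCII letters and runs of digits as tokens.

```python
import re

def words(text, mode="letters"):
    """Split text into tokens.

    mode="letters": runs of alphabetic characters only (like words())
    mode="digits":  runs of digits only (like the @d option)
    mode="all":     runs of letters and runs of digits (like the @a option)
    """
    patterns = {
        "letters": r"[A-Za-z]+",
        "digits": r"[0-9]+",
        "all": r"[A-Za-z]+|[0-9]+",  # assumption: letters and digits are separate tokens
    }
    return re.findall(patterns[mode], text)
```

For instance, words("handles 2 files") keeps only the alphabetic tokens, while mode="digits" keeps only the number.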

A3=A2.(~.array(""))

This line of code splits each word in A2 into characters. “~” represents each member of the set (a word), and there is no space between the double quotation marks (""). After the code is executed, A3 is a set whose members are themselves sets of characters, as shown below:

A4=A3.select(set==set^~)

This line of code selects the words that contain all the characters in set. The select function executes a query in which “~” represents the member of A3 currently being evaluated and the operator “^” represents intersection. The condition “set==set^~” means that if the intersection of set and the current member equals set itself, the current member is a word that satisfies the query. “==” is a comparison operator; others of the same kind include “!=” (not equal to), “<” (less than) and “>=” (greater than or equal to). “^” is a binary operator representing intersection; others of the same kind include “&” (union) and “\” (difference).

set is an external parameter that can be passed in from either the command line or a Java program, depending on how the script is used. It can be defined in the Integrated Development Environment (IDE) interface, as shown below:

Suppose the value of the parameter set is ["e","a","c"]; the above line of code is then equivalent to A3.select(["e","a","c"]==["e","a","c"]^~). Once it is executed, the result is as follows:

It can be seen that both “complicated” and “Rebecca” contain the three characters: e, a, c.
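The intersection test can be illustrated with Python's built-in set type. This is a minimal sketch rather than a translation of the esProc code: the function names and the four-word sample list are invented for illustration, and matching is assumed to be case-sensitive.

```python
def contains_all(word, required):
    """True if the intersection of `required` and the word's characters
    equals `required` itself, i.e. every required character occurs in the word."""
    required_set = set(required)
    return required_set & set(word) == required_set

def select_words(words, required):
    """Keep the words that contain all required characters, preserving order."""
    return [w for w in words if contains_all(w, required)]

# Hypothetical sample; the real Sample.txt contains many more words.
sample = ["complicated", "Rebecca", "grep", "data"]
print(select_words(sample, ["e", "a", "c"]))  # ['complicated', 'Rebecca']
```

Because the characters are compared as sets, their order inside the word does not matter, which is exactly what makes this condition awkward to write as a regular expression.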

Besides computing the intersection, the same operation can be performed with a position query. The corresponding code is A3.select(~.pos(set)). The pos function locates the members of set in ~ (which is also a set). If all of them are found, it returns a sequence of their sequence numbers (which counts as true); if not, it returns null (which counts as false).
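The position-based variant can be mimicked in Python along the following lines. The helper positions is hypothetical and only imitates the behaviour described above (1-based positions when every character is found, otherwise nothing); it is not the esProc pos function.

```python
def positions(word, required):
    """Return the 1-based position of the first occurrence of each required
    character in `word`, or None if any of them is missing."""
    found = [word.find(ch) + 1 for ch in required]  # str.find gives -1, so 0 means absent
    return found if all(found) else None

def select_by_position(words, required):
    """Keep the words in which every required character can be located."""
    return [w for w in words if positions(w, required) is not None]
```

With this sketch, positions("complicated", "eac") gives [10, 8, 1], while a word lacking any of the three characters yields None and is filtered out.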

After A4 has found the words satisfying the query condition, join the characters of each set (each set is in fact a word) back together using the following code:
A5=A4.(~.conj@s())

The conj function concatenates multiple sets into a single set. Used with the @s option, it combines all the members of a set into a string. The final result of this example is as follows:

The step-by-step computation above is intuitive and easy to understand. In practice you can omit the steps that split each word into characters and then concatenate them again, so the code becomes a single line:

 file("e:\\sample.txt").read().words().select(set==set ^ ~.array(""))
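For comparison, the whole pipeline (read the file, split it into words, keep those containing every required character) might look like this in Python. It is a sketch under assumptions: words_containing is a made-up name, words are taken to be runs of ASCII letters, and the file is assumed to be UTF-8.

```python
import re

def words_containing(path, required):
    """Return the alphabetic words in the file at `path` that contain
    every character in `required`, in order of appearance."""
    with open(path, encoding="utf-8") as f:
        text = f.read()
    needed = set(required)
    # A word qualifies when the required characters are a subset of its characters.
    return [w for w in re.findall(r"[A-Za-z]+", text) if needed <= set(w)]
```

As in the single-line esProc version, no step is needed to reassemble the words, because the characters are only split out temporarily for the comparison.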
