Sometimes during text processing you need to find the words that contain certain characters. The logic of this computation is simple, but it is hard to express with a regular expression because the characters may appear in any order, and the method is also inefficient. You could write the program yourself, but many high-level languages offer little built-in support for set operations, which makes the coding harder than it should be. By contrast, esProc can parse a string dynamically and can therefore match specific characters with simple and intuitive code. Let’s look at how it works through the following example.
Find the words containing e, a and c in the following Sample.txt. Part of the original data is shown below:
esProc Improves Text Processing – Conditional Query on Big Files
During text processing, you often have
the tasks of querying data from a big file on one or more conditions. Command
line grep\cat command can be used to handle some simple situations with
simple command yet low efficiency. Or high-level languages can be used to get
a much higher efficiency with complicated code. If the query conditions are
complex or dynamic, you need to create an additional SQL-like low-level class
library, which increases the complexity of the computation.
……
esProc code for doing the task:
A1=file("e:\\sample.txt").read()
This line of code reads the whole file into memory as one big string.
The read function can also be used with the @n option to read the data line by line, as in file("e:\\sampleB.txt").read@n().
The import function can be used if the data are structured. For example, to import a file that uses tab as the separator and has the column names in its first row, the code can be file("e:\\sampleC.txt").import@t().
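For readers who know Python better than esProc, the three ways of reading described above can be sketched roughly as follows. This is only an analogue, not esProc code: the helper names read_text, read_lines and import_tab are made up for illustration, and UTF-8 encoding is an assumption.

```python
import csv
from pathlib import Path


def read_text(path):
    """Read the whole file into one string (compare read())."""
    return Path(path).read_text(encoding="utf-8")


def read_lines(path):
    """Read the file as a list of lines (compare read@n())."""
    return Path(path).read_text(encoding="utf-8").splitlines()


def import_tab(path):
    """Read a tab-separated file whose first row holds the column
    names and return one dict per data row (compare import@t())."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f, delimiter="\t"))
```

Note that every value returned by import_tab is a string; this sketch does no type conversion.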
A2=A1.words()
This line of code splits the big string into words and creates a set from them. The words function automatically filters out numbers and symbols and keeps only the alphabetic characters. Add the @d option to select only the numbers, or the @a option to select both the words and the numbers.
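The splitting step can be approximated in Python with a regular expression. The sketch below is an analogue built on assumptions: the function name words and its mode argument are hypothetical, a word is taken to be a run of ASCII letters, and how esProc treats mixed tokens such as "abc123" is not covered by this article.

```python
import re

# Hypothetical modes standing in for words(), words@d() and words@a().
_PATTERNS = {
    "letters": r"[A-Za-z]+",        # words only
    "digits": r"[0-9]+",            # numbers only
    "all": r"[A-Za-z]+|[0-9]+",     # both words and numbers
}


def words(text, mode="letters"):
    """Return the tokens of `text` in order of appearance,
    dropping punctuation and whitespace."""
    return re.findall(_PATTERNS[mode], text)
```

For example, words("grep\\cat in 2 steps") keeps "grep", "cat", "in" and "steps" and drops the backslash and the digit.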
A3=A2.(~.array(""))
This line of code splits each word in A2 into characters. “~” represents the current member of the set (a word), and the double quotation marks ("") enclose an empty string with no space in it. After the code is executed, A3 is a set of subsets, one set of characters for each word.
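In Python terms, this step resembles a list comprehension in which the loop variable plays the role of “~”. A minimal sketch, with split_chars as a hypothetical name:

```python
def split_chars(word_list):
    """Turn each word into a list of its characters.
    `w` stands in for esProc's "~" (the current member)."""
    return [list(w) for w in word_list]
```

Calling split_chars(["Or", "can"]) gives [["O", "r"], ["c", "a", "n"]].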
A4=A3.select(set==set^~)
This line of code selects the words that contain all the characters in set, a parameter holding the characters to look for. The select function executes a query in which “~” represents the member of A3 currently being evaluated and the operator “^” represents intersection. The expression “set==set^~” means that if the intersection of set and the current member is equal to set itself, the current member is a word that satisfies the query condition. “==” is a comparison operator; other operators of the same kind include “!=” (not equal to), “<” (less than) and “>=” (greater than or equal to). “^” is a binary operator representing intersection; other operators of the same kind include “&” (union) and “\” (difference).
Suppose the value of the parameter set is ["e","a","c"]; the above line of code is then equivalent to A3.select(["e","a","c"]==["e","a","c"]^~).
The result includes both “complicated” and “Rebecca”, each of which contains the three characters e, a and c.
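The intersection test can be mimicked in Python as below. This is a sketch under assumptions, not the esProc implementation: intersect and select_containing are hypothetical names, the intersection is assumed to keep the order of its left operand, and required is assumed to contain no duplicate characters.

```python
def intersect(left, right):
    """Members of `left` that also occur in `right`, in `left`'s order."""
    return [x for x in left if x in right]


def select_containing(char_lists, required):
    """Keep the character lists whose intersection with `required`
    equals `required`, i.e. those containing every required character."""
    return [chars for chars in char_lists
            if intersect(required, chars) == required]


sample = [list(w) for w in ["complicated", "Rebecca", "class"]]
matches = select_containing(sample, ["e", "a", "c"])
```

Here matches holds the character lists of "complicated" and "Rebecca"; "class" is dropped because it has no "e". The comparison is case-sensitive.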
Besides computing the intersection, the same query can be done by position lookup. The corresponding code is A3.select(~.pos(set)). The pos function locates the members of set in ~ (which is also a set). If all of them are found, it returns a sequence of their positions, which counts as true; otherwise it returns null, which counts as false.
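The position-based variant can be sketched in Python as follows. The behaviour is taken from the description above; the function names, the use of 1-based positions and the choice of the first occurrence of each character are assumptions made for illustration.

```python
def pos(members, wanted):
    """Return the 1-based position of the first occurrence of each
    item of `wanted` in `members`, or None if any item is missing."""
    positions = []
    for x in wanted:
        if x not in members:
            return None
        positions.append(members.index(x) + 1)
    return positions


def select_by_pos(char_lists, required):
    """Keep the character lists in which every required item is found."""
    return [chars for chars in char_lists
            if pos(chars, required) is not None]
```

The explicit "is not None" check matters in Python, where an empty list would otherwise count as false.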
After A4 finds the words that satisfy the query condition, join the characters of each set (that is, each word) back together with the following code:
A5=A4.(~.conj@s())
The conj function concatenates multiple sets into a single set. When used with the @s option, it combines all the members of a set into a string.
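As a rough Python counterpart of the two behaviours just described (the names conj and conj_s are hypothetical, chosen only to mirror the text):

```python
from itertools import chain


def conj(sets):
    """Concatenate several lists into one list."""
    return list(chain.from_iterable(sets))


def conj_s(chars):
    """Combine the members of one list of characters into a string."""
    return "".join(chars)


rejoined = [conj_s(chars) for chars in [list("complicated"), list("Rebecca")]]
```

After this runs, rejoined is ["complicated", "Rebecca"].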
The step-by-step computation above is intuitive and easy to understand. In practice, you can skip the steps that split the words up and concatenate the characters again, so the code becomes a single line:
file("e:\\sample.txt").read().words().select(set==set^~.array(""))
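For comparison, the whole task can be written compactly in Python with the built-in set type. This is a sketch under the same assumptions as before (words are runs of ASCII letters, matching is case-sensitive), and words_containing is a hypothetical name.

```python
import re


def words_containing(text, required):
    """Return the words of `text` that contain every character
    in `required`, in order of appearance."""
    needed = set(required)
    return [w for w in re.findall(r"[A-Za-z]+", text)
            if needed <= set(w)]  # subset test replaces set==set^~


hits = words_containing("a complicated case for Rebecca and each class", "eac")
```

Here hits is ["complicated", "case", "Rebecca", "each"]. A word that occurs several times in the text is returned once per occurrence.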