Text processing sometimes requires replacing multiple strings in a source file according to a criteria file that lists each string to find and its replacement. Command-line tools make it easy to replace a single string, but batch replacement is awkward with them. In a high-level language, the task typically takes explicit nested loops, and it becomes harder still when the source file is too big to be loaded into memory. esProc handles the loop with an iterative function and reads big files with cursors, so it can perform batch string replacement with much less code. The following examples explain the method in detail.
The following is part of the source file, source.txt:
esProc Improves Text Processing – Conditional Query on Big Files
During text processing, you often have
the tasks of querying data from a big file on one or more conditions. Command
line grep\cat command can be used to handle some simple situations with
simple command yet low efficiency. Or high-level languages can be used to get
a much higher efficiency with complicated code. If the query conditions are
complex or dynamic, you need to create an additional SQL-like low-level class
library, which increases the complexity of the computation.
esProc supports performing conditional query on big files and
multithreaded parallel computing, and its code for handling this kind of
problem is both concise and efficient. The following example will teach you
the esProc method of doing the job.
…
esProc code for doing the task:
A1=file("e:\\condition.txt").import@t()
This line of code imports the criteria file. The import function reads a text file or a binary file as a two-dimensional table (a table sequence), with tab as the default column separator. The @t option makes the first row the column names. The result of A1 is as follows:
A2=file("e:\\source.txt").read()
This line of code reads the source file. The read function reads a text file as one big string. The result of A2 is as follows:
A3=A1.loops(replace(~~,before,after);A2)
This line of code replaces the strings in A2 in batch according to A1. loops is an iterative function: it takes the members of a set (here A1, a set of records) in order and evaluates the specified expression (here replace(~~,before,after)) once for each member. The result of each round feeds into the next round, where ~~ represents the previous result, and the result of the last round is returned. A2 is the initial value of ~~.
The replace function performs string replacement. It has three parameters: the source string, the string to be replaced, and the replacement string, represented here by ~~, before and after respectively. before and after are column names (field names) of the table sequence in A1.
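The iteration that loops performs is what many languages call a fold or reduce. As a rough analogy only (this is Python, not esProc, and the rule pairs below are made up for illustration), the same accumulation can be sketched with functools.reduce:

```python
from functools import reduce

# Hypothetical replacement rules standing in for the rows of A1;
# each pair plays the role of the before/after fields.
rules = [("cat", "dog"), ("dog", "bird")]


def apply_rule(text, rule):
    """One iteration step: `text` is the previous result (the role of ~~)."""
    before, after = rule
    return text.replace(before, after)


def replace_all(text, rules):
    # `text` is the initial value, like A2 in the esProc call.
    return reduce(apply_rule, rules, text)


result = replace_all("cat and dog", rules)
print(result)  # bird and bird
```

Because each round works on the previous round's output, the order of the rules matters: in this sketch the text introduced by the first rule is rewritten again by the second.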
Only A3 actually performs the replacement. The following line of code writes the result to a file:
A4=file("e:\\result.txt").write(A3)
Here the write function writes the string to a file.
You can also combine the first three steps into a single line of code and then write the result:
A1=file("e:\\condition.txt").import@t().loops(replace(~~,before,after);file("e:\\source.txt").read())
A2=file("e:\\result.txt").write(A1)
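For comparison, the same small-file flow can be sketched in a general-purpose language. The following is a minimal Python sketch, not esProc code. It assumes a tab-separated criteria file whose header row names the columns before and after (as the esProc code implies), and it assumes UTF-8 encoding, which the original does not specify. The function names are hypothetical.

```python
import csv
from functools import reduce
from pathlib import Path


def load_rules(path):
    """Read a tab-separated file with a header row naming the columns
    `before` and `after`; return (before, after) pairs in file order."""
    with open(path, newline="", encoding="utf-8") as f:
        # QUOTE_NONE treats quotation marks as ordinary characters.
        reader = csv.DictReader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        return [(row["before"], row["after"]) for row in reader]


def replace_file(rules_path, source_path, result_path):
    """Load the rules, read the whole source file as one string,
    apply every rule in order, and write the result to a new file."""
    rules = load_rules(rules_path)
    text = Path(source_path).read_text(encoding="utf-8")
    text = reduce(lambda acc, rule: acc.replace(rule[0], rule[1]), rules, text)
    Path(result_path).write_text(text, encoding="utf-8")
```

A call such as replace_file("condition.txt", "source.txt", "result.txt") would mirror the two esProc cells above. Like them, it holds the whole source file in memory, so it suits only files that fit there.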
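If the source file is too big to be loaded into memory, it can be processed in batches with a cursor. The code is explained cell by cell below.
A1: Imports the criteria file.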
A2=file("e:\\source.txt").cursor@s()
This line of code opens the source file. The cursor function does not load the whole file into memory; instead it opens the file as a cursor (a stream). The @s option means the data is imported as a single-column table sequence whose column is named _1. Without the option, each row is split by the separator into a multi-column table sequence whose columns are automatically named _1, _2, _3, …, _n.
A3: for A2,1000
This loop fetches data from the cursor in A2 in batches. Each iteration fetches one batch (1,000 rows in this case).
Cells B3-B5 form the loop body of A3. They work much like the previous example: they perform batch string replacement on the current batch of rows and append the result to the new file. Note that esProc represents a loop body visually by indentation rather than by braces or keywords like begin/end.
B3=A3.(_1).string@d("\r\n")
This line of code converts the current batch of data into one big string. A3 is the loop variable and represents the current batch. A3.(_1) fetches column _1 from A3. The string function concatenates the members of a set into a big string with the specified separator, which in this example is a carriage return followed by a line feed (\r\n). The @d option prevents each member from being surrounded with double quotation marks.
B4=A1.loops(replace(~~,before,after);B3)
This line of code performs batch string replacement on the big string built from the current batch.
B5=file("e:\\result.txt").write@a(B4)
This line of code writes the replacement result of the current batch to the new file. The @a option appends the result to the file instead of overwriting it.
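The batch-by-batch approach can also be sketched outside esProc. The following minimal Python sketch is an analogy, not the esProc implementation. It assumes UTF-8 encoding and takes the rules as (before, after) pairs; the function name and the batch_size parameter are hypothetical. Each line keeps its original line ending, so no separator is needed when the batch is joined.

```python
from functools import reduce
from itertools import islice


def replace_in_batches(rules, source_path, result_path, batch_size=1000):
    """Stream the source file in batches of `batch_size` lines, apply every
    (before, after) rule to each batch, and append the output to the result."""
    # newline="" leaves line endings untouched on both read and write.
    with open(source_path, encoding="utf-8", newline="") as src, \
         open(result_path, "a", encoding="utf-8", newline="") as dst:
        while True:
            batch = list(islice(src, batch_size))  # at most batch_size lines
            if not batch:
                break  # end of file reached
            text = "".join(batch)
            text = reduce(lambda acc, rule: acc.replace(rule[0], rule[1]),
                          rules, text)
            dst.write(text)
```

Only one batch is held in memory at a time. Two caveats apply to this sketch: a string to be replaced that contains a line break and straddles two batches will not be matched, and because the result file is opened in append mode, any earlier output should be removed before a rerun.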