esProc, A Script Language for Data Analytics with Parallel Mechanism: esProc Improves Text Processing

During text processing, we sometimes need to replace multiple strings in a source file according to a criteria file. A command-line tool can easily replace a single string, but it is less convenient for batch replacement driven by a file. In a high-level language, the task typically requires nested loops. If the source file is too big to be loaded into memory, the task becomes harder still. esProc handles the loop with an iterative function and reads big files through cursors, so it can perform batch string replacement much more easily. The following examples explain the method in detail.

A criteria file, condition.txt, has two columns separated by tabs. The column before holds the strings to be replaced, and the column after holds their replacements. Suppose we need to replace strings in source.txt in batch according to this criteria file and write the result to result.txt. Some of the data in condition.txt is shown below (the first row holds the column names):

The following is part of the source file, source.txt:

esProc Improves Text Processing – Conditional Query on Big Files

During text processing, you often have the tasks of querying data from a big file on one or more conditions. Command line grep\cat command can be used to handle some simple situations with simple command yet low efficiency. Or high-level languages can be used to get a much higher efficiency with complicated code. If the query conditions are complex or dynamic, you need to create an additional SQL-like low-level class library, which increases the complexity of the computation.

esProc supports performing conditional query on big files and multithreaded parallel computing, and its code for handling this kind of problem is both concise and efficient. The following example will teach you the esProc method of doing the job.

…

esProc code for doing the task:

A1=file("e:\\condition.txt").import@t()

This line of code imports the criteria file. The import function reads a text file or a binary file as a two-dimensional table (a table sequence), with tab as the default column separator. The @t option makes the first row the column names. The result of A1 is as follows:
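For readers who think in Python, the effect of import@t() can be pictured roughly as below. This is only an analogy, not esProc's implementation; the helper name import_with_header and the sample rows are hypothetical, since the real contents of condition.txt are not reproduced here.

```python
import csv
import io

def import_with_header(text):
    """Parse tab-separated text whose first row holds the column names."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    return list(reader)

# Hypothetical sample data standing in for condition.txt.
sample = "before\tafter\nfoo\tbar\ncat\tdog\n"
rules = import_with_header(sample)
print(rules)
```

Each row becomes a record whose fields can be read by column name, which is what lets the later steps refer to before and after directly.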

A2=file("e:\\source.txt").read()

This line of code reads the source file. The read function reads a text file as one big string. The result of A2 is as follows:

A3=A1.loops(replace(~~,before,after);A2)

This line of code replaces the strings in A2 in batch according to A1. loops is an iterative function: it takes the members of a set (here A1, a set of records) in order and evaluates the specified expression (here replace(~~,before,after)) against each one. The result of each round is passed to the next round, where ~~ represents the previous result, until the last member has been processed. A2 is the initial value of ~~.

The replace function performs string replacement. It has three parameters: the source string, the string to be replaced and the replacement string, represented here by ~~, before and after respectively. before and after are column names (field names) of the table sequence in A1.
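The iteration described above resembles a fold (reduce) over the rules, with the source string as the initial accumulator. The Python sketch below illustrates that idea only; batch_replace and the sample rules are hypothetical, and it assumes every occurrence of a string is replaced, as Python's str.replace does.

```python
from functools import reduce

def batch_replace(text, rules):
    """Apply each before/after rule in order, feeding each result to the next."""
    return reduce(
        lambda acc, rule: acc.replace(rule["before"], rule["after"]),
        rules,
        text,  # initial value, playing the role of A2
    )

# Hypothetical rules, chosen to show that order matters.
rules = [
    {"before": "cat", "after": "dog"},
    {"before": "dog", "after": "bird"},
]
result = batch_replace("cat and dog", rules)
print(result)  # bird and bird
```

Because each round works on the previous round's output, a later rule also acts on text produced by an earlier rule, so the order of rows in the criteria file can change the result.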

A3 only computes the replaced string in memory. The following line of code writes the result to a file.

A4=file("e:\\result.txt").write(A3)

Here the write function writes the string to the file.

You can also combine these steps into a single line of code:

A1=file("e:\\condition.txt").import@t().loops(replace(~~,before,after);file("e:\\source.txt").read())

A2=file("e:\\result.txt").write(A1)

If the file is too big to be loaded into memory, the data can be imported in segments: each segment is processed and its result appended to the new file, until the whole file has been handled. The corresponding esProc code is as follows:

A1: Import the criteria file.

A2=file("e:\\source.txt").cursor@s()

This line of code opens the source file. The cursor function does not load the whole file into memory; instead it opens the file as a cursor (a stream). The @s option means the data will be imported as a single-column table sequence, with _1 as the column name. Without the option, the data will be imported as a multi-column table sequence split by the separator, and the columns will be named _1, _2, _3, …, _n automatically.

A3:for A2,1000

This line loops over the cursor in A2, fetching one batch of data (1,000 rows in this case) on each iteration.
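Fetching a fixed number of rows per iteration from a stream can be pictured with the Python sketch below. It is an analogy under stated assumptions, not esProc code: batches is a hypothetical helper, and a generator of 2,500 made-up lines stands in for the cursor.

```python
from itertools import islice

def batches(lines, size):
    """Yield successive lists of up to `size` items without loading them all."""
    it = iter(lines)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

# A generator stands in for the file cursor; only one batch is held at a time.
fake_cursor = (f"line {i}" for i in range(2500))
sizes = [len(batch) for batch in batches(fake_cursor, 1000)]
print(sizes)  # [1000, 1000, 500]
```

The last batch is simply shorter when the row count is not a multiple of the batch size.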

Cells B3-B5 form the loop body of A3, which works much like the previous example: it performs batch string replacement on the current batch of rows and appends the result to the new file. Note that esProc represents a loop body visually by indentation rather than by braces or keywords like begin/end.

B3=A3.(_1).string@d("\r\n")

This line of code converts the current batch of data to one big string. A3 is the loop variable, representing the current batch of data. A3.(_1) fetches column _1 from A3. The string function concatenates the members of a set into a big string using the specified separator, which in this example is a carriage return plus line feed (\r\n). The @d option prevents each member from being surrounded by double quotation marks.

B4=A1.loops(replace(~~,before,after);B3)

This line of code performs batch string replacement on the big string built from the current batch.

B5=file("e:\\result.txt").write@a(B4)

This line of code writes the replacement result of the current batch to the new file. The @a option appends the result to the file instead of overwriting it.
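The whole batch-by-batch flow (read a batch, join it, replace, append) can be sketched end to end in Python as follows. This is an illustration under assumptions rather than a translation of the esProc cells: replace_big_file, the temporary file names and the sample rule are hypothetical, and the sketch keeps each line's original terminator instead of re-joining rows with "\r\n".

```python
import os
import tempfile
from functools import reduce
from itertools import islice

def replace_big_file(src_path, dst_path, rules, batch_size=1000):
    """Stream src_path in batches of lines, apply the rules, write dst_path."""
    # newline="" keeps the original line endings untouched on read and write.
    with open(src_path, encoding="utf-8", newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        while True:
            lines = list(islice(src, batch_size))  # next batch of lines
            if not lines:
                break
            chunk = "".join(lines)  # lines still carry their terminators
            dst.write(reduce(lambda acc, r: acc.replace(r[0], r[1]),
                             rules, chunk))

# Small demonstration on temporary files with a hypothetical rule.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "source.txt")
    dst = os.path.join(tmp, "result.txt")
    with open(src, "w", encoding="utf-8", newline="") as f:
        f.write("foo 1\r\nfoo 2\r\nbaz 3\r\n")
    replace_big_file(src, dst, [("foo", "bar")], batch_size=2)
    with open(dst, encoding="utf-8", newline="") as f:
        output = f.read()
print(repr(output))
```

One limitation of this sketch is that a string to be replaced which spans the boundary between two batches is never seen whole, so it will not be matched.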

Thus you have completed the batch string replacement with regard to a big file. See the final data in result.txt:


June 1, 2015

esProc Improves Text Processing –Batch String Replacement
