esProc, A Script Language for Data Analytics with Parallel Mechanism: esProc Improves Text Processing

Text processing sometimes requires fetching specific data from many files spread across a multi-level directory. The task is too complicated to handle well at the command line. It can be done in a high-level language, but the code is hard to write, and big files make it harder still. esProc can read big files through cursors and call a script recursively, so it can fetch the data in batch. The following example shows how.

A directory, “D:\files”, has multiple levels of subdirectories, and each subdirectory contains many text files. The task is to fetch a specified line (say, the second line) from each of these files and write the lines into a new file, result.txt. Part of the structure of D:\files is as follows:

esProc code for doing this:

First define a parameter, path, and set its initial value to “D:\files” so that the script starts fetching data from this directory, as shown below:

A1=directory@p(path)

The directory function returns the list of files located directly in the directory given by the parameter path. The @p option means the file names are returned with their full paths. The following shows some of the results:

A2=A1.(file(~).cursor@s())

This line of code opens each of A1’s files as a cursor. A1.(…) means processing A1’s members in order; “~” represents the current member; the file function creates a file object, and the cursor function returns a cursor object for that file object.

By default the cursor function uses tab as the separator, and the default column names are _1, _2, …, _n. The @s option means ignoring the separator and importing each line of the file as a string in a single column named _1. Note that this code only creates the cursor objects and doesn’t fetch any data. Data fetching starts when the fetch function is called. The results of A2 are as follows:

A3=A2.((~.skip(1),~.fetch@x(1)))

This line of code fetches the second row from each file cursor in A2. A2.(…) means computing A2’s cursors one by one. (~.skip(1),~.fetch@x(1)) means computing the expressions in the parentheses in order and returning the result of the last one. ~.skip(1) skips one row. ~.fetch@x(1) fetches the row at the current position (i.e. the second row); the @x option closes the cursor automatically once the data are fetched. The result of ~.fetch@x(1) is therefore what the parentheses operator returns.

The skip function can skip multiple rows; a parameter determines how many. The fetch function can likewise fetch multiple rows. For example, to skip the first ten rows and fetch the next two (the 11th and 12th rows), the code is (~.skip(10),~.fetch@x(2)).
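The skip-then-fetch pattern can also be illustrated outside esProc. The following is a minimal Python sketch of the same idea, not esProc code: `fetch_rows` is a hypothetical helper name, and `itertools.islice` stands in for the cursor, reading lines lazily and stopping as soon as the requested rows have been read.

```python
import io
from itertools import islice

def fetch_rows(lines, skip, count):
    """Skip `skip` lines of an iterable of lines, then return the next `count` lines."""
    # islice consumes the iterable lazily, so a big file is never loaded whole.
    return [line.rstrip("\n") for line in islice(lines, skip, skip + count)]

# Six sample rows standing in for a text file (illustrative data only).
text = "".join(f"row{i}\n" for i in range(1, 7))

print(fetch_rows(io.StringIO(text), 1, 1))  # skip one row, fetch one: ['row2']
print(fetch_rows(io.StringIO(text), 3, 2))  # skip three rows, fetch two: ['row4', 'row5']
```

If the input has fewer lines than `skip`, the sketch simply returns an empty list.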

The following shows some of the results of A3:

A4=A3.union()

This line of code unions the results in A3. The union function performs the union operation and removes duplicate data at the same time. For example, the code for computing the union of two sets, [1,2] and [2,3], is [[1,2],[2,3]].union(), and the result is [1,2,3]. If duplicate data should be kept, use the conj function (for concatenation) instead. Some of the results of A4 are as follows:

A5=file("d:\\result.txt").export@a(A4)

This line of code exports the results of A4 to result.txt. The export function writes data to a file, and the @a option means appending to it.
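The merge-then-append steps can be mirrored in Python to make the difference between union and conj concrete. This is only a sketch under stated assumptions: the helper names `union`, `conj` and `append_lines` are hypothetical Python functions that borrow the esProc names for readability, and keeping the first-seen order when removing duplicates is an assumption of the sketch.

```python
import os
import tempfile

def union(groups):
    """Merge lists in order, keeping only the first occurrence of each item."""
    return list(dict.fromkeys(item for group in groups for item in group))

def conj(groups):
    """Concatenate lists in order, keeping duplicates."""
    return [item for group in groups for item in group]

def append_lines(out_path, lines):
    """Append each string as one line; the file is created if it does not exist."""
    with open(out_path, "a", encoding="utf-8") as out:
        out.writelines(line + "\n" for line in lines)

print(union([[1, 2], [2, 3]]))  # [1, 2, 3]
print(conj([[1, 2], [2, 3]]))   # [1, 2, 2, 3]

# Appending twice accumulates lines instead of overwriting the file.
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "result.txt")  # throwaway path for the demo
    append_lines(target, union([["x"], ["x", "y"]]))
    append_lines(target, ["z"])
    with open(target, encoding="utf-8") as f:
        print(f.read().splitlines())  # ['x', 'y', 'z']
```

Appending matters here because the script runs once per directory, and each run has to add to what earlier runs wrote.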

At this point, all data have been fetched as required from the current directory. The remaining work is to get the subdirectories of the current directory and call this script recursively on each of them.

A6=directory@dp(path)

Here the directory function fetches all the subdirectories of the current directory. The d option means fetching subdirectory names, and the p option means returning full paths. Thus A6 gets the subdirectories from D:\files:

A7=A6.(call("c:\\readfile.dfx",~))

This line of code processes A6’s members (the subdirectories) one by one: it calls the esProc script c:\\readfile.dfx and passes the current member (one of the subdirectories) as the input parameter. Note that readfile.dfx is the name of this very script, so the call is recursive.
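For readers who want to see the overall control flow in one place, here is a rough Python analog of the whole script. It is a sketch, not a translation of esProc semantics: the function names `nth_line` and `collect` are hypothetical, files shorter than the requested line are silently skipped (an assumption), and the output file is assumed to live outside the directory tree being scanned.

```python
import os
import tempfile
from itertools import islice

def nth_line(file_path, n):
    """Return line number `n` (1-based) of a text file, or None if the file is shorter."""
    with open(file_path, "r", encoding="utf-8", errors="replace") as f:
        line = next(islice(f, n - 1, n), None)  # lazy: stops reading after line n
    return None if line is None else line.rstrip("\r\n")

def collect(path, out_path, n=2):
    """Append line `n` of every file under `path` to `out_path`, recursing into subdirectories."""
    with os.scandir(path) as it:
        entries = sorted(it, key=lambda e: e.name)
    # Files of the current directory first (the roles of A1 to A3).
    lines = [nth_line(e.path, n) for e in entries if e.is_file()]
    # Drop missing lines and duplicates, keeping first-seen order (the role of A4).
    lines = list(dict.fromkeys(l for l in lines if l is not None))
    with open(out_path, "a", encoding="utf-8") as out:  # append (the role of A5)
        out.writelines(l + "\n" for l in lines)
    # Then the same procedure for each subdirectory (the roles of A6 and A7).
    for e in entries:
        if e.is_dir():
            collect(e.path, out_path, n)

# Demo on a throwaway tree; all names below are illustrative.
with tempfile.TemporaryDirectory() as root:
    top = os.path.join(root, "files")
    os.makedirs(os.path.join(top, "sub"))
    samples = (("a.txt", "a1\na2\n"), (os.path.join("sub", "b.txt"), "b1\nb2\nb3\n"))
    for rel, text in samples:
        with open(os.path.join(top, rel), "w", encoding="utf-8") as f:
            f.write(text)
    result = os.path.join(root, "result.txt")
    collect(top, result)
    with open(result, encoding="utf-8") as f:
        print(f.read().splitlines())  # ['a2', 'b2']
```

As in the esProc version, each call handles only one directory level and relies on recursion to reach deeper levels.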

Through the recursive call in A7, esProc fetches data from all the files in the multi-level directory D:\files. You can see the final result in result.txt:


May 30, 2015

esProc Improves Text Processing – Fetching Data from a Batch of Files
