May 27, 2015

esProc Improves Text Processing – Characters Matching

Sometimes during text processing you need to find words that contain certain characters. The logic of this computation is simple, but the code is hard to write with a regular expression because the characters may appear in any order, and the regex approach is also inefficient. You could write the program yourself, but high-level languages lack built-in set operations, which makes the coding tedious too. By contrast, esProc can parse a string dynamically and thus match specific characters more easily, with simple and intuitive code. Let's look at how it works through the following example.

Find out words containing e, a, c from the following Sample.txt. Some of the original data are as follows:

esProc Improves Text Processing – Conditional Query on Big Files
During text processing, you often have the tasks of querying data from a big file on one or more conditions. Command line grep\cat command can be used to handle some simple situations with simple command yet low efficiency. Or high-level languages can be used to get a much higher efficiency with complicated code. If the query conditions are complex or dynamic, you need to create an additional SQL-like low-level class library, which increases the complexity of the computation.
……

esProc code for doing the task:
A1=file("e:\\sample.txt").read()

This line of code reads the file into the memory as a big string, as shown below: 

In addition, the read function used with the @n option reads the data by lines. For example, the result of executing file("e:\\sampleB.txt").read@n() is as follows: 

The import function can be used if the data are structured. For example, to import a file that uses tab as the separator and whose first row holds the column names, the code is file("e:\\sampleC.txt").import@t(). The result is as follows: 

A2=A1.words()

This line of code splits the big string into multiple words and creates a set with them. The words function automatically filters out numbers and symbols and keeps only the alphabetic words. Add the @d option to select only the numbers, or the @a option to select both the words and the numbers. The result of A2 is as follows: 

A3=A2.(~.array(""))

This line of code splits each word in A2 into characters. "~" represents each member of the set (each word); there is no space between the double quotation marks (""). When the code is executed, A3 holds a set whose members are themselves sets of characters, as shown below: 

A4=A3.select(set==set^~)

This line of code selects the words containing set's characters. The select function executes the query. In the expression, "~" represents the A3 member currently being computed, the operator "^" computes the intersection, and "set==set^~" means that if the intersection of set and the current member equals set itself, the current member is an eligible word. "==" is a comparison operator; operators of the same kind include "!=" (not equal to), "<" (less than) and ">=" (greater than or equal to). "^" is a binary operator representing intersection; other set operators include "&" (union) and "\" (difference).

set is an external parameter, which can be passed in from either the command line or a Java program, depending on how the script is used. It can be defined on the Integrated Development Environment (IDE) interface, as shown below: 

Suppose the value of parameter set is ["e","a","c"]; then the above line of code is equivalent to A3.select(["e","a","c"]==["e","a","c"]^~). Once it is executed, the result is as follows: 

It can be seen that both “complicated” and “Rebecca” contain the three characters: e, a, c.

Besides computing the intersection, the operation can also be implemented as a position query. The corresponding code is A3.select(~.pos(set)). The pos function locates the members of set in ~ (also a set). If all of them are found, it returns a sequence of their positions, which counts as true; if not, it returns null, which counts as false.
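Written as a single line in the same way as the intersection version shown at the end of this section, this alternative would read roughly as follows (a sketch, not taken from the original article):

file("e:\\sample.txt").read().words().select(~.array("").pos(set))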

After A4 finds the words satisfying the query condition, join the characters of each set (each word, in fact) back together using the following code:
A5=A4.(~.conj@s())

The conj function concatenates multiple sets into a single set. When used with the @s option, it combines all the members of a set into a string. The final result of this example is as follows: 

The above step-by-step computation is intuitive and easy to understand. Actually, you can omit the steps of splitting each word into characters and concatenating them again, reducing the code to a single line:

 file("e:\\sample.txt").read().words().select(set==set ^ ~.array(""))

May 26, 2015

esProc Improves Text Processing – Insert Summary values into Grouped Data

The usual way to insert summary values into grouped data is to process the data group by group: import a group of data, append it and its summary value to a new file, then do the same with the next group, and so on. But this is not easy to implement with hand-written code. esProc, however, supports group cursors, with which a whole group of data can be imported automatically. The following example shows how esProc deals with this kind of computation.


The log webdata.log has three columns separated by commas. The first column is the identifier for grouping data. The other two columns hold numerical values. Some of the data are as follows:
Notice that the first and the fourth groups have the same grouping identifier.

Now we are asked to insert the average value of the second column and an empty row between each group, as shown below: 

esProc code for doing this task: 

A1=file("E: \\webdata.log").cursor(;",")

This line of code opens the log file in the form of a cursor. The cursor function returns a cursor object based on the corresponding file object. In the function, comma is used as the column separator (the default separator is tab) and the default column names are _1,_2…_n, in which _1 is the column that marks the data grouping. We can also specify the column names, like cursor(groupName,data1,data2;",").

The code only creates cursor objects but does not import the data. The data importing will be started by for statement or fetch function.

B1=file("e:\\result.txt"). This line of code creates a file object for storing the computed results.

A2:for A1;_1

This line of code fetches data from the cursor in A1 in a loop, importing a group of data with the same first column (named _1) each time. It is in this step that data are actually imported into the memory. 

The for statement here is worth special attention. for cs,n means fetching n rows from cursor cs at a time, while for cs;x means fetching a group of records with the same x field from cursor cs; in the latter case the data need to be grouped by x beforehand. In this example, the data are already grouped. If the data were ungrouped, they could be prepared using other esProc functions (like sortx, a function for sorting cursors).

The x in the statement for cs;x can be an expression, according to which multiple rows are imported each time until the value of the expression changes. For example, for A1;left(_1,4) evaluates the first four characters of the first column, and consecutive records are put into the same group until those characters change.

B2-B4 is the loop body of for statement in A2. The loop body processes every group of data in the same way. Its working scope, as can be seen from the cellset, is represented by indentation rather than by parentheses or other identifiers like begin/end. What’s more, the loop variable can be represented by the name of the cell where for statement resides, which, in this example, means A2 represents the records of the current group. Seen in debug mode, the value of A2 in the first-run loop is as follows: 

B2=B1.export@a(A2;",")

This line of code appends A2 to the defined file object. export function exports a group of records to the file, in which @a option means appending. In order to keep consistent with the source data, comma is used here as the separator (though the default separator is tab). Open result.txt after the first loop and we can see the following data: 
B3=A2._1+"_avg,"+string(A2.avg(_2))+"\r\n"

This line of code pieces together the summary string. A2._1 represents the first column of the current group; for the first group its value is "webcat_service". The expression A2.avg(_2) gets the average value of the second column of the current group; for the first group the value is 2.25. The string function formats the floating-point value as a string.

For the first group of data, B3’s value is this: 

B4=B1.write@a(B3)

This line of code appends B3 to the result file. Both the export function and the write function can write data into a new file: the former writes structured data, whereas the latter writes a string or an array of strings. The @a option appends the data, and a carriage return is written to the file before the appended content.

At this point, the above script has finished processing all data. The final result can be viewed in result.txt as follows: 
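For reference, here are the cells of the script described above gathered in one place; in the cellset, B2-B4 sit in column B and form the loop body of A2:

A1=file("E:\\webdata.log").cursor(;",")
B1=file("e:\\result.txt")
A2:for A1;_1
B2=B1.export@a(A2;",")
B3=A2._1+"_avg,"+string(A2.avg(_2))+"\r\n"
B4=B1.write@a(B3)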

May 22, 2015

esProc Improves Text Processing – Remove Duplicate Rows

When processing text files, we sometimes need to remove duplicate rows from grouped data. But the operation becomes complicated when the file under processing is too big to be entirely loaded into the memory. esProc's group operation supports importing a whole group of data with a file cursor, as well as many useful options, so it can handle this kind of operation effortlessly. The following example shows how it works.

The file EPRom.log has four columns separated by tab and its data have been grouped by the second column. Duplicate rows need to be removed (only the first row of each group is wanted). Some of the source data are as follows:

esProc code for doing this

A1=file("E:\\EPRom.log").import()

This line of code imports the file into the memory. By default, the separator is tab and the column names are _1, _2, _3…. If it is a CSV file, the separator needs to be specified with import(;","). If the file's first row contains column names, the @t option can be used to import them, that is, import@t(). The result of A1 is as follows:

A2=A1.group@1o(_2)

This line of code gets the first row from every group. The group field is _2, the second field. This is the case’s final result, as shown below.

By default, group function will regroup the data. For instance, A1.group(_2) will divide A1 into two groups by the second field, as the following shows:

But the use of @o option won’t make that happen. For instance, result of A1.group@o(_2) is as follows:

With the @1 option, the function gets the first row of every group. By using both @1 and @o, we achieve the goal of this case.
        
When the file is too big to be wholly loaded into the memory, an esProc cursor can be used to deal with it. The corresponding code is as follows:

A1=file("E:\\EPRom.log").cursor()

This line of code opens the log file in the form of cursor. cursor function returns a cursor object according to the corresponding file object, with tab being the separator and _1,_2…_n being column names by default. Notice that the code merely creates the cursor object without importing data. Data importing will be started by for statement or fetch function.

B1= file("e:\\result.txt")

This line of code creates a file object to which the computed results can be exported later on.

A2:for A1;_2

This line of code imports data from cursor A1 by loop. Each time it imports a group of data with the same second column (column name is _2). In this step the data are truly being imported into the memory.

Here the for statement deserves special attention. In esProc, for cs,n imports n records from cursor cs each time, while for cs;x imports a group of records with the same x field from cursor cs each time; the precondition is that the records have been grouped by the x field.

The x in the for cs;x statement can be an expression, which means multiple rows are imported until the result of computing expression x changes. Take for A1;floor(_1/5) as an example. It divides the _1 field by 5, rounds the result down, and puts the records with the same result into the same group, such as the first row through the fifth row.

B2=file("e:\\result.txt").export@a([A2(1)])

As the loop body of the for statement in A2, it processes every group of data in the same way. The method is to get the first row of the current group and append it to the file result.txt. A2 is the loop variable, which represents all the records in the current group, and A2(1) represents the first record in A2. The export function writes structured data into a new file; its @a option means appending. Since A2(1) is a single record, it needs to be converted into an array with the operator [].
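Putting the cells together, the cursor-based script described above looks like this (B2 sits in column B and forms A2's loop body; as in the walkthrough, B1 creates the result file object while B2 opens the result file directly):

A1=file("E:\\EPRom.log").cursor()
B1=file("e:\\result.txt")
A2:for A1;_2
B2=file("e:\\result.txt").export@a([A2(1)])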

We can see the final result in result.txt:

In esProc, the working range of a for statement can be represented by indentation instead of parentheses or identifiers like begin/end. In the following code block, for instance, B2-B5 is A2's working range.

May 19, 2015

esProc Improves Text Processing – Parse Logs with Arbitrary Number of Lines

When parsing logs into structured data, we often find that each record consists of a variable number of lines. This makes the conversion, as well as the corresponding operations, quite complicated. Equipped with flexible functions for structured data processing, such as regular expressions, string splitting, fetching data located in a different row, and data concatenation, esProc is well suited to processing this kind of text. The following example shows how it works.


The log file reportXXX.log holds records, each of which consists of multiple lines with 14 data items (fields) and starts with the string “Object Type”. Our goal is to rearrange the log into structured data and write the result to a new text file. Some of the source data are as follows:

esProc code for doing this

A1=file("e:\\reportXXX.log").read()


This line of code reads the logs entirely into the memory. Result is as follows:

A2=A1.array("Object Type: ").to(2,)

This line of code can be divided into two parts. The first part, A1.array("Object Type: "), splits A1 into strings by "Object Type: ". The result is as follows: 

Except for the first item, every item of data is valid. to(2,) means getting the items from the second one to the last one. The result of A2 is as follows:


A3=A2.regex(regular expression;field names list)

This line of code applies the same regular expression to each member of A2 and gets the 14 fields separated by commas. The following lists the first fields:
 

A4=file("e:\\result.txt").export@t(A3)  

This line of code writes the final result to a new file. Tab is the default separator. The use of @t option will export the field names as the file’s first row. We can see the following data in result.txt:

The regular expression used in the code above is complicated. We can use esProc's built-in functions to make the operation more intuitive. For example, the ObjectType field is the first line of each record, so we can split each record by line breaks and take the first line. The left, top, right and bottom fields come from splitting each record's second line by spaces and taking items 3, 5, 7 and 9.

The task can be handled with esProc built-in functions as follows:

In the above code, the pjoin function concatenates multiple sets together; the array function splits a string into segments by the specified delimiter and creates a set with them, in which ~.array("\r\n") splits each record by the carriage return.
In the above example, we assumed that the log file is not big and can be wholly loaded into the memory for computing. But sometimes the file is big and needs to be imported, parsed and exported in batches, which makes the code extremely difficult to write. Besides, because the number of lines per record is variable, a batch of data almost always ends with a record that has not been imported completely. This further complicates the coding.

esProc can handle big log files with an arbitrary number of lines per record more easily using file cursors. The following is a code sample: 

A1=file("\\reportXXX.log").cursor@s()

This line of code opens the log file in the form of a cursor. cursor function returns a cursor object according to the file object, with tab being the default separator and _1,_2…_n being the default column names. @s option means ignoring the separator and importing the file as a one-column string, with _1 being the column name. Note that this code only creates a cursor object and doesn’t import data. Data importing will be started by for statement or fetch function.

A2: for A1,10000

A2 is a loop statement which imports a batch of data (10,000 rows) each time and sends it to the loop body; this continues until the end of the log file. As before, a loop body in esProc is represented visually by indentation instead of parentheses or identifiers like begin/end. The area B2-B7 is A2's loop body, which processes the data like this: the current batch is restored to text joined by carriage returns and split into records again by "Object Type"; the last, incomplete record is saved in B1, a temporary variable; the first and the last records, both of which are unusable, are removed; then the regular expression is applied to each of the remaining records, producing a two-dimensional table that is written to result.txt. The following explains this process in detail:

B2=B1+A2.(_1).string@d("\r\n")

This line of code concatenates the temporary variable B1 with the current text. In the first-run loop, B1 is empty. In subsequent loops, B1 holds the incomplete record left over from the previous batch, which is concatenated with the current text to make it complete.

The string function concatenates the members of a set with the specified separator, and the @d option prevents the members from being surrounded by quotation marks. The top rows in A2 are as follows:

A2.(_1) represents the set formed by field _1 in A2 :

A2.(_1).string@d("\r\n") means concatenating members of the above set into a big string, which is Object Type: Symbol Location: left: 195 top: 11 right: 123 bottom: 15 Line Color: RGB ( 1 0 0 ) Fill Color:   RGB ( 251 255 0 ) Link:l11…. 

B3=B2.array("Object Type: ")


This line of code splits the big text in B2 into strings by "Object Type". The result of B3 in the first-run loop is as follows:

Since the last string in B3 is not a complete record and cannot be computed, it will be stored in the temporary variable and concatenated with the new string created in the next loop. B4's code stores this last string in the temporary variable B1.

B4=B1="Object Type: "+B3.m(-1)+"\r\n"

The m function gets one or more members of a set in normal or reverse order. For example, m(1) gets the first member, m([1,2,3]) gets the top three and m(-1) gets the last one. B3(1) can also be used to get the first member. Here we restore the "Object Type: " prefix at the beginning of the record, which was removed by the earlier string splitting, and append the carriage return that was removed when the text was fetched from the cursor by rows.

The first member of B3 is an empty row and the last one is an incomplete row; neither can be computed. We can remove them as follows:

B5=B3.to(2,B3.len()-if(A1.fetch@0(1),1,0))

This line of code fetches the valid data from B3. If the batch under processing is not the last one, it fetches the rows from the second one to the second-last one, discarding the first empty row and the last incomplete row. If the current batch is the last one, it fetches the rows from the second one to the last one, which is complete, and discards only the first empty row.

B3.to(m,n) fetches the rows from the mth one to the nth one in B3. B3.len() is the number of records in B3, that is, the sequence number of the last record in the current batch. A1.fetch(n) fetches n rows from cursor A1, and the @0 option means only peeking at the data while leaving the cursor position unchanged. The if function has three parameters: a boolean expression, the value to return when the expression is true, and the value to return when it is false. When the current batch is not the last one, A1.fetch@0(1) returns a valid record and the if function returns 1; when it is the last one, A1.fetch@0(1) is null and the if function returns 0.

B6=B5.regex(regular expression;field names list). This line of code applies the same regular expression to each member of B5 and gets the 14 fields separated by commas. Following lists the first fields:

B7=file("e:\\result.txt").export@a(B6)

This line of code appends the results of B6 to result.txt. It will append a batch of records to the file after each loop until the loop is over. We can view this example’s final result in the big file result.txt:
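For reference, here are the cells of the cursor-based script described above gathered in one place (the initialization of the temporary variable B1 to an empty string is an assumption, since the article only says B1 is empty at first; B2-B7 sit in column B and form A2's loop body; the actual regular expression and field name list are omitted as in the original):

A1=file("e:\\reportXXX.log").cursor@s()
B1=""
A2: for A1,10000
B2=B1+A2.(_1).string@d("\r\n")
B3=B2.array("Object Type: ")
B4=B1="Object Type: "+B3.m(-1)+"\r\n"
B5=B3.to(2,B3.len()-if(A1.fetch@0(1),1,0))
B6=B5.regex(regular expression;field names list)
B7=file("e:\\result.txt").export@a(B6)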

In the above algorithm, the regular expression was used inside the loop. But regular expressions have relatively poor compilation performance, so we had better avoid that. In this case, we can use two esProc scripts together with the pcursor function to implement stream-style splitting and parsing.

First let’s look at the code for master routine main.dfx:
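The original cellset screenshot is not reproduced here. Based on the description below, a minimal sketch of main.dfx might look like the following; the argument form of the pcursor call, the cell layout and the output file name are assumptions and should be checked against the esProc documentation:

A1=pcursor("sub.dfx")
A2=A1.regex(regular expression;field names list)
A3=file("e:\\result.txt").export(A2)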

The pcursor function calls a subroutine and returns a cursor consisting of one-column records. A2 applies the regular expression to each record in A1 and returns structured data. Note that the result of A2 is a cursor rather than in-memory data: when the export function is executed, data are fetched from A2's cursor into memory segment by segment, automatically.

The subroutine sub.dfx is used to return the cursor. Its code is similar to the previous script; the difference is that the results are not written to a new file, but returned as one-column records instead, as the following code shows:

B6’s result statement can convert the result of B5 to a one-column table sequence and return it to the caller (pcursor function in main.dfx) in the form of a cursor.

With the pcursor function, the master routine main.dfx can fetch data from the subroutine sub.dfx as if it were an ordinary cursor, ignoring how the data are generated. When main.dfx needs data, the pcursor function decides whether the loop in sub.dfx should continue or whether the data can be supplied from the buffer. The whole process is automatic. 

March 13, 2015

esProc Improves Text Processing – String Matching with Big Files

There are many occasions during text processing which require performing string matching with big files. Coding with command line grep\cat is simple yet inefficient. Though higher efficiency can be achieved with high-level languages, coding will be rather difficult.

Yet this operation, as well as multithreaded parallel computing, can be handled more easily in esProc, with more concise code and much better performance. The following examples will show esProc method in detail.


file1.txt has a great many strings. Find out the rows ending with “.txt” and export them to result.txt. Some of the original data are as follows:

esProc code for doing this task:

A1=file("e:\\file1.txt").cursor()

This line of code opens the file in the form of a cursor. Instead of importing all the data into the memory at once, the cursor function opens the file as a cursor (a stream) with almost no memory footprint. The function uses default parameters: all fields are imported with tab as the column separator and are automatically named _1, _2, _3…_n. There is only one field, _1, in this example.

A2=A1.select(like@c(_1,"*.txt"))

This line of code selects rows ending with “.txt” from cursor A1. select function executes the query and like function performs string matching. _1 represents the first field. The use of @c option in like function means the matching is case insensitive.

One point worth noting is that the result of A2 is still a cursor without memory footprint. Only with the use of functions like export/fetch/groups will esProc allocate suitable memory buffers and convert the cursor computing to memory computing.

A3=file("e:\\result.txt").export(A2). This line of code exports the final result to a file. Some of the data are as follows:

The matching rule in the example above is relatively simple. If the rule is complex, a regular expression will be needed. For example, find out rows starting with “c:\windows” and not ending with “.txt”.

The regex function performs string matching with a regular expression. Just modify A2's code to A1.regex@c("^c:\\\\windows.*(?<!\\.txt)$"), in which the @c option means case insensitive.

Though the regular expression can be used to realize the string matching with complex rule, its performance is not satisfactory. For example, to find out rows ending with “.txt” from a file of 2.13G size in the same test environment, it takes 206 seconds with a regular expression, while it takes only 119 seconds with an ordinary expression (the select statement).

In fact, many string-matching tasks with complex rules can also be implemented with ordinary expressions. Moreover, the syntax is more intuitive and the cost of learning is lower. For example, emp.txt holds a large number of user records, each of which has multiple fields, separated by tab and with the first row holding the column names. Suppose you are to find the rows where the EId field is less than 100, the first letter of the Name field is "a" and the Birthday field is no earlier than 1984-01-01. You can do it in esProc as follows:
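The original cellset screenshot is not reproduced here. Based on the functions explained below, a minimal sketch of the cellset would be as follows; the file paths and the exact cell layout are assumptions:

A1=file("e:\\emp.txt").cursor@t()
A2=A1.select(EId<100 && like@c(Name,"a*") && Birthday>=date("1984-01-01"))
A3=file("e:\\result.txt").export(A2)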

The @t option used with the cursor function means that the first row is imported as the column names, so the fields can be accessed by name later.

The three query conditions can be represented by EId<100, like@c(Name,"a*") and Birthday>=date("1984-01-01") respectively. The logical relation between the conditions is "AND", which is represented by &&.

The above algorithm is sequential computation. The performance can be further improved with parallel computing. The method is this: import the file using multiple threads, each of which accesses part of the file with its own cursor and performs the filtering at the same time; finally, merge the results of the cursors together.

Testing the processing of a 2.13 GB file under the same hardware environment, the sequential computation takes an average of 119 seconds, whereas the parallel computation takes only an average of 56 seconds, which almost doubles the performance. The algorithm used in the example is not complex, so the bottleneck is the hard drive's ability to read data. As the computation grows more complex, the performance gain will be even greater.

esProc code for parallel computing:

A1=4. A1 is the number of segments, which means the file will be divided into 4 segments. The number is equal to the number of parallel tasks in operation, which generally should not exceed the number of CPU cores. Otherwise the tasks will be queued for processing and the performance won’t be really increased. The maximum number of the parallel tasks can be configured in the environment option.

A2=A1.(file("e:\\file1.txt").cursor@z(;, ~:A1))
This line of code generates four cursors according to the specified number of segments. A1.(expression) means computing the expression for each member of A1, and "~" can be used inside the parentheses to represent the current member. Generally A1 is a set, like ["file1","file2"] or [2,3]. If the members of the set are consecutive numbers starting with 1, like [1,2,3,4], the code can be written in the simpler form 4.(expression), as with the code in this example.

In the expression inside the parentheses, file("e:\\file1.txt").cursor@z(;, ~:A1), the cursor function uses the @z option to segment the file and fetch one segment with a cursor. ~:A1 means that the file is roughly divided into four segments (A1=4) and the ~th segment is fetched. "~" represents the current member of A1, so the four cursors correspond to the first, second, third and fourth segments respectively.

Besides, though exact division will result in incomplete lines, esProc can import complete lines automatically by skipping the beginning half line of a segment and completing the ending half line of the segment. This is why the file should be divided “roughly”.

A3=A2.(~.select(like@c(_1,"*.txt"))). This line of code queries data of each cursor (i.e. ~) in A2 and selects the eligible rows. The computed results are still four cursors.

A4=A3.conj@xm(). This line of code merges the four cursors in A3 in parallel.

A5=file("e:\\result.txt").export(A4). This line of code exports the final result to a file.

An esProc script can not only run independently in the Integrated Development Environment (IDE), it can also be called by a Java program through the JDBC interface, in the same way an ordinary database is called. A one-statement esProc script can be embedded in the Java program directly, without a script file. Actually, the above steps can be combined into one single statement:
file("e:\\result.txt").export(4.(file("e:\\file1.txt").cursor@z(;, ~:4)).(~.select(like@c(_1, "*.txt"))).conj@xm())

It is also allowed to run this kind of one-step script in operating system’s command line. Please refer to related documents for further information.