May 22, 2015

esProc Improves Text Processing – Remove Duplicate Rows

During processing the text file, sometimes we need to remove duplicate rows from the grouped data. But the operation becomes complicated when the file under processing is too big to be entirely loaded into the memory. esProc’s group operation supports importing a whole group of data with the file cursor, as well as many options. So it can handle this kind of operation effortlessly. The following example will show you how it works.

The file EPRom.log has four columns separated by tab and its data have been grouped by the second column. Duplicate rows need to be removed (only the first row of each group is wanted). Some of the source data are as follows:

esProc code for doing this

A1=file("E:\\EPRom.log").import()

This line of code imports the file into the memory. By default, the separator is tab and column names are _1_2_3……. If it is a CSV file, the separator needs to be specified using the code import(;”,”). If the file’s first row contains column names, @t option can be used to import them, that is, import@t(). The result of A1 is as follows:

A2=A1.group@1o(_2)

This line of code gets the first row from every group. The group field is _2, the second field. This is the case’s final result, as shown below.

By default, group function will regroup the data. For instance, A1.group(_2) will divide A1 into two groups by the second field, as the following shows:

But the use of @o option won’t make that happen. For instance, result of A1.group@o(_2) is as follows:

With @1 option, the function will get the first row of every group. By using both @1 and @o, we’ve hit the target of this case.
        
In the situation that the file is too big to be wholly loaded into the memory, esProc cursor can be used to deal with it. Corresponding code is as follows:

A1=file("E:\\EPRom.log").cursor()

This line of code opens the log file in the form of cursor. cursor function returns a cursor object according to the corresponding file object, with tab being the separator and _1,_2…_n being column names by default. Notice that the code merely creates the cursor object without importing data. Data importing will be started by for statement or fetch function.

B1= file("e:\\result.txt")

This line of code creates a file object to which the computed results can be exported later on.

A2:for A1;_2

This line of code imports data from cursor A1 by loop. Each time it imports a group of data with the same second column (column name is _2). In this step the data are truly being imported into the memory.

Here for statement deserves special attention. In esProc, for cs,n imports n records from cursor cs each time. for cs;x imports a group of records with the same x field from cursor cs each time, the condition is that records have been grouped by x field.

The x in for cs;x statement can be an expression, which means multiple rows will be imported until the result of computing expression x changes. Take for A1 ; floor(_1/5) as an example. It divides _1 field by 5 and rounds the result off, put the records with the same results into the same group, like the first row to the fifth row.

B2=file("e:\\result.txt").export@a([A2(1)])

As the loop body of for statement in A2, it processes every group of data in the same way. The method is to get the first row of the current group and append it to file result.txt. A2 is the loop variable which represents all records in the current group. A2(1) represents the first record in 2. export function is used to write the structured data into a new file, its @a option means appending. Since A2(1) is a single record, it needs to be converted into array with the operator [].

We can see the final result in result.txt:

In esProc, the working range of for statement can be represented by indentation instead of the parentheses or identifiers like begin/end. In the following code block, for instance, B2-B5 is A2’s working range.