September 3, 2014

Principle and Use of External Memory Sorting in esProc

It is often required to sort records of tables during data analysis and computing. In esProc, sort function is used to sort data of sequences or table sequences. External memory sorting is required when data being sorted are massive and cannot be loaded into memory all together, for the ordinary sorting cannot handle this situation.

1.External memory sorting of massive data

In data statistics, cursors are usually used to fetch massive data. This applies in esProc, which also processes big data with the cursor. In esProc, the function of a cursor, which reads one or more records each time according to the position(s) marked by it and won’t return all data all at once, is similar to that in a database stored procedure.

A cursor can only fetch part of the data every time, thus operations like sorting and grouping all data in the cursor cannot be executed directly. esProc uses external memory to handle these operations on massive data. Each time it reads chunks of data and computes them and records the result temporarily in the external memory. Later it will merge all the sub-results into a cursor and works out the final result.


Let's prepare a data table with huge data in which the dates and 8-digit phone numbers are generated arbitrarily. The data table will be stored in the format of a binary file for convenience. 

Altogether 100,000 rows of data are generated. Read the 50,001th ~51,000th rows of data using the cursor and the result can be seen in C10 as follows:
We'll take PhoneRecord, the generated data file, as an example to explore how to perform external sorting in esProc.

Using cs.sortx(x…;nfunction in external memory sorting, we can sort the data in cursor cs in ascending order according to the computed result of expression xand set the number of rows in buffer area by defining n and determine the number of records fetched each time when generating a temporary file. For example:

In order to know about how the external memory is used in esProc to sort data, we click  in the debugging area of the toolbar to execute the code step by step until the code in A5 begins to be executed. A2 uses binary data file PhoneRecord to create a cursor. A3 uses sortx function to sort data of the cursor. The sorting result, in fact, will be a larger cursor merged orderly by many temporary cursor files. The result of A3 is as follows:

A4 fetches the first 1,000 records from this cursor as follows after sorting:

While the code in A3 is executed, external files, which are also called as temporary files, are generated in the directory of temporary files:

Because the number of rows in buffer area was set as 20,000 by using sortx function, the 100,000 records in the cursor generated 5 temporary files. The data of one of the temporary files will be imported:

A2 imports the data as follows:

A3 works out the number of data in this temporary file as follows:

By comparing the data in A2 with the final sorting result previously obtained, we can see that the result is, actually, one that obtained by sorting a part of the data. This indicates that each temporary file is the sorting result of some data fetched according to the number of rows in buffer area.

Then go on to execute the previous cellset file. We may find that the temporary files will be deleted automatically when the cursor is closed.

sortx function can also be used to sort multiple fields. For example:

A3 uses Date and –PhoneNum to sort data, meaning sorting by date in ascending order, and then sorting the data of the same date by phone number in descending order. A4 reads the first 1,000 results as follows after sorting:

2. Application of external memory sorting

In fact, from the operation of external memory sorting we can see one of the uses of the cursor-style sorting, that is, the sorted data can be used in orderly merging. The operation of orderly merging gets data from many cursors according to a rule that reading records from the cursor that currently contains the smallest ( or biggest) data. Apparently, this type of operation can only be used when the data of every cursor are properly ordered. In addition, joining records of a cursor in alignment with join@x() function also requires that data in every cursor should be sorted.

When the data of a cursor are properly ordered, it can be specified that we fetch data continuously from the cursor until the expression is changed. For example:

A4 fetches data from the cursor in A3 until the Dateis changed, meaning the data of the first day will be fetched; A5 skips data of consecutive three days; A6 fetches the data of the fifth day. The results of A4 and A6 are as follows:

No comments:

Post a Comment