September 28, 2014

A Code Example of External Memory Sorting in esProc

It is common to sort records in tables in data analysis. In esProc, sort function is used to sort sequences or data in table sequences. External memory sorting is required when the data being sorted is too huge to be loaded into the memory all together and the ordinary sorting method cannot be used.


For example, the text file Order_Foods.txt contains 50,000 pieces of information of food orders: 

Now sort the data of these orders according to the following conditions:
     Sort data by food name in ascending order;
     Sort data by order date in ascending order, then sort data of the same date by food name in descending order;
     Sort data by order date in descending order, then sort data of the same date by order amount in descending order.

In fact 50,000 rows of data cannot be counted as huge data. We just take it as an example to illustrate how to sort huge data. In practical business, we may need to handle several hundred million records or the data files of dozens of or a hundred GB size.

In esProc, cursors are used in computing huge data. The records in data tables won't be fetched all at once. Only one or multiple records will be fetched each time according to the setting. By doing so, memory overflow can be avoided during data computing. Since only some of the data are fetched each time, operations of sorting and grouping on all data in cursors cannot be completed in one step. Using external memory, cs.sortx(x…;n) function in esProc can fetch data in cursors step by step according to n,the set number of rows in buffer area, process them separately and store the results in the external files. In this way, memory overflow won't happen. The result of sorting with cursors is still a cursor, whose method of fetching data is completely the same as that with an ordinary cursor.

Let's look at how to sort data in Order_Foods.txt:

A2 uses text file Order_Foods.txt to generate a cursor and make the first row as column name. A3 uses sortx function to sort data of the cursor in A2 by food name in ascending order. The number of rows in buffer area is set as 1,000, which indicates sorting by the unit of 1,000 records. The result will be saved temporarily as an external file which is also a cursor:

While this cursor is being used to fetch data, all generated external files will be merged automatically, and the required data will be fetched thus. As in A4, the first 1,000 records after sorting are fetched as follows: 


A5 closes the cursor and the external file resulting from sorting is automatically deleted. Since the cursor in A3 is generated by the cursor in A2, the latter will close if the former closes. If we are required to sort by order date in ascending order then by food name in descending order, a cursor need to be regenerated in A6.

A7 sorts data in cursors using external memory sorting. sortx function can specify multiple sorting fields or sorting expressions. If sorting in descending order is needed in an expression, a negative sign should be added before the expression. The sorting result in A7 is still a cursor:

This time, A8 skips 20,000 records; A9 fetches data from the 20,001th row to the 21,000th row: 

Having done their job, cursors are closed in A10 to remove the external files instantly. If all data in cursors are fetched, cursors will close automatically and it is not necessary to call cs.close() instruction. 

The method of adding a negative sign before a sorting expression in performing sorting in descending order applies to not only the string-type data in A7, but the data of date type or numerical data. In the operation of sorting in descending order, A12 uses fields of date type and numerical fields respectively. A13 fetches the first 1,000 records after sorting as follows:

No comments:

Post a Comment