It is common to sort records in tables in data analysis. In esProc, sort function is used to sort sequences
or data in table sequences. External memory
sorting is required when the data being sorted is too huge to be loaded
into the memory all together and the ordinary sorting method cannot be used.
For example, the text file Order_Foods.txt contains
50,000 pieces of information of food orders:
Now sort the data of these orders according to the following
conditions:
①
Sort data by food name in
ascending order;
②
Sort data by order date in
ascending order, then sort data of the same date by food name in descending
order;
③
Sort data by order date in
descending order, then sort data of the same date by order amount in descending
order.
In fact 50,000 rows of data cannot be counted as huge data. We just
take it as an example to illustrate how to sort huge data. In practical
business, we may need to handle several hundred million records or the data
files of dozens of or a hundred GB size.
In esProc, cursors are used in computing huge data. The records in
data tables won't be fetched all at once. Only one or multiple records will be fetched
each time according to the setting. By doing so, memory overflow can be avoided
during data computing. Since only some of the data are fetched each time,
operations of sorting and grouping on all data in cursors cannot be completed
in one step. Using external memory, cs.sortx(x…;n) function in esProc
can fetch data in cursors step by step according to n,the set number of rows in
buffer area, process them separately and store the results in the external
files. In this way, memory overflow won't happen. The result of sorting with
cursors is still a cursor, whose method of fetching data is completely the same
as that with an ordinary cursor.
A2 uses text file Order_Foods.txt to generate a cursor and make the first row as column name. A3 uses sortx function to sort data of the cursor in A2 by food name in ascending order. The number of rows in buffer area is set as 1,000, which indicates sorting by the unit of 1,000 records. The result will be saved temporarily as an external file which is also a cursor:
While this cursor is being used to fetch data, all generated external
files will be merged automatically, and the required data will be fetched thus.
As in A4, the first 1,000 records after sorting are fetched as follows:
A5 closes the cursor and the external file resulting from sorting is
automatically deleted. Since the cursor in A3 is generated by the cursor in A2,
the latter will close if the former closes. If we are required to sort by order
date in ascending order then by food name in descending order, a cursor need to
be regenerated in A6.
This time, A8 skips 20,000 records; A9 fetches data from the 20,001th
row to the 21,000th row:
Having done their job, cursors are closed in A10 to remove the
external files instantly. If all data in cursors are fetched, cursors will
close automatically and it is not necessary to call cs.close() instruction.
No comments:
Post a Comment