It is often required to sort records of tables during data analysis
and computing. In esProc, sort function
is used to sort data of sequences or table sequences. External memory sorting is required when data being sorted are
massive and cannot be loaded into memory all together, for the ordinary sorting
cannot handle this situation.
1.External memory sorting of massive data
In data statistics, cursors are usually used to fetch massive data. This
applies in esProc, which also processes big data with the cursor. In esProc, the
function of a cursor, which reads one or more records each time according to
the position(s) marked by it and won’t return all data all at once, is similar
to that in a database stored procedure.
A cursor can only fetch part of the data every time, thus operations
like sorting and grouping all data in the cursor cannot be executed directly. esProc
uses external memory to handle these operations on massive data. Each time it
reads chunks of data and computes them and records the result temporarily in
the external memory. Later it will merge all the sub-results into a cursor and works
out the final result.
Let's prepare a data table with huge data in which the dates and
8-digit phone numbers are generated arbitrarily. The data table will be stored
in the format of a binary file for convenience.
Altogether 100,000 rows of data are generated. Read the 50,001th ~51,000th rows of data using the cursor and the result can be seen in C10 as follows:
We'll take PhoneRecord,
the generated data file, as an example to explore how to perform external
sorting in esProc.
In order to know about how the external memory is used in esProc to sort data, we click
A4 fetches the first 1,000 records from this cursor as follows after sorting:
While the code in A3 is executed, external files, which are also called as temporary files, are generated in the directory of temporary files:
Because the number of rows in buffer area was set as 20,000 by using sortx function, the 100,000 records in the cursor generated 5 temporary files. The data of one of the temporary files will be imported:
A2 imports the data as follows:
A3 works out the number of data in this temporary file as follows:
By comparing the data in A2 with the final sorting result previously
obtained, we can see that the result is, actually, one that obtained by sorting
a part of the data. This indicates that each temporary file is the sorting
result of some data fetched according to the number of rows in buffer area.
Then go on to execute the previous cellset file. We may find that
the temporary files will be deleted automatically when the cursor is closed.
A3 uses Date and –PhoneNum to sort data, meaning sorting by date in ascending order, and then sorting
the data of the same date by phone number in descending order. A4 reads the
first 1,000 results as follows after sorting:
2. Application of external memory sorting
In fact, from the operation of external memory sorting we can see
one of the uses of the cursor-style sorting, that is, the sorted data can be used
in orderly merging. The operation of orderly merging gets data from many
cursors according to a rule that reading records from the cursor that currently
contains the smallest ( or biggest) data. Apparently, this type of operation
can only be used when the data of every cursor are properly ordered. In
addition, joining records of a cursor in alignment with join@x() function
also requires that data in every cursor should be sorted.
A4 fetches data from the cursor in A3 until the Dateis changed, meaning the data of the first day will be fetched; A5 skips data of consecutive three days; A6 fetches the data of the fifth day. The results of A4 and A6 are as follows:
No comments:
Post a Comment