esProc, A Script Language for Data Analytics with Parallel Mechanism: Cursor Thoughts in esProc

With esProc, the big structured data can be handled easily, and the computation and analysis over big data can be accomplished smoothly. Because the big data table cannot be loaded into the memory all at once, the basic solution is cursor when handling the big data.

1.Basic principle of cursor

The esProc cursor is like the database cursor in the stored procedure: In fact, the cursor is a pointer. Records can be retrieved row by row according to the position in cursor. The data will not be returned in full. By using the cursor, the data retrieval can be completed progressively so as to ensure the big data table will not bring overburden pressure on system memory.

In order to facilitate the usage, esProc saves the effort to process the records one by one in the cursor. However, esProc also limits the usage of cursor. In esProc cursor,moving backward is forbidden, only moving forward is allowed, which is similar to TYPE_FORWARD_ONLY type of result set in JDBC. In this way, esProc cursor can only traverse the records in TSeq once, and the related computations are also to be completed in this traversal. This differs from the computational mode of normal TSeq greatly. Once a traversal is completed, the cursor will be closed automatically and invalidated. So, records can no longer be retrieved with it anymore.

In esProc, no matter the big data tables are from the databases or data files, they can all form cursors. Sometimes, the existing TSeq in memory can form a cursor to get involved in the cursor computation. For example:

In which, A3 holds a file cursor, A4 holds a database cursor, and A5 holds a memory RSeq cursor.

When using the cursor, the function cs.fetch() can be used to retrieve the records of one or several rows according to the position specified by the cursor to compute or debug.

2.Inheritance thoughts for cursor
When using cursor to handle the computation, the result returned by cursor is usually the records from a data table or result set. However, the data in cursor is usually not the final result we demand, and we need to process the query result.

When handling the result in the cursor, you may need to perform filtering, summarizing, sorting, and other jobs. Sometimes, you may also need to consolidate or merge the data in multiple cursors. To complete these jobs, in esProc, you can directly use a cursor to generate a new one. No need to query the data actually when generating new cursors. The actual retrieval will be performed when you actually get the data. For example, to complete a certain data processing job, use cursor A to generate new cursor B. The cursor B will use A to retrieve data, and return the retrieved data after it processes the data correspondingly. Based on the cursor B, you can generate cursor C as necessary, and further the processing of the data in B……When handling the joining and consolidating jobs, you can use multiple cursors to generate a new cursor.

In the mode of multiple inheritances, the complex data processing job can be divided into multiple steps to accomplish, with a clear purpose for each step. When the data is processed in each cursor, the records are retrieved through the original cursor. So, for the final data computing, the data retrieval is actually conducted by the original cursor. Such mode can ensure that the original data can only be traversed once. The data conflict can be avoided and the communications with databases can be reduced while ensuring the data efficiency.

In the mode of multiple inheritances, processing the data in the cursor will not incur the extra data operations. In addition, the multiple-inheritance cursor object only takes very limited amount of memory. So, this mode will not compromise the performance of accessing the external storage. Likewise, the computational performance also won’t be obviously affected if inheritance hierarchy is added.

3.Basic computation of cursor

3.1 Basic usage of cursor
Once the cursor is created, the cs.fetch() function can be used to retrieve data from data table. You can also use cs.skip() function to skip several rows of records. To close the cursor before all data is traversed, you can use cs.close(). For example:

This cellset is designed to find the name of the first employee who is older than 40. Once the file cursor is generated in A2, loop in A3, and retrieve one employee record from B3 each time. If all records are traversed with the cursor, then the loop will break in C4;otherwise in C6, where the application will judge if the age of employee is over 40. If the condition is met, then exit the loop and close the cursor. In A8, you will see the computation result:

In retrieving data with a cursor, you may retrieve multiple records all at once:

For the above program, all data of text data cursor in A2 will be traversed. Retrieve 100 rows of records each time, and compute the Total number of employees above 40. The final result can be viewed in A7:

Because the cursor will be closed when the traversal over all data is completed, there is no need to use the cs.close(). Please note that the cs.fetch() always get one TSeq each time, instead of an RSeq composed of various records from a same TSeq.

3.2 Filtering on cursor

When you need to filter the desired data out from the records in the cursor, you can use function cs.select(). Thus, the above example problem can be solved with the following method:

The final result in A7 is the same as above:

Because the returned result is also the big data when filtering the big data in the cursor, the cs.select() also returns a cursor,you can perform the summarizing and other operations in the further processing. If the data returned by cursor is relatively less, then all of them can be returned at once. For example, find all female employees who are older than 40in California:

In A6, you will see the results:

In the computation, multiple inheritance of cursor is used for filtering step by step. Data can be retrieved from text data table and computed to get the required result only if they are retrieved in A6. This method almost will not affect performance.

3.3 Creating and modifying cursor

Sometimes,the data retrieved from big data table is not the final result we need, and further computation maybe required to get the new records. In such case, you can use the cs.new() function to generate a new cursor:

Likewise, data will be retrieved from text data table and computed to get the required result only if they are retrieved in A7. The result in A7 is shown below:

Besides the cs.new()used for generating new data, function cs.run() can also be used to modify the fields of records directly:

The results in A8are shown below:

3.4 Foreign key in cursor
Sometimes, the big data table requires foreign keys to relatetodata in other tables. In this case, you can use cs.switch() function, for example:

In this example, find the profiles for all female employees whose name initials are W and older than 40. In this example, A7 sets the field State as the foreign key, and references the record of TSeq in A6.The results in A8are shown below:

menu

August 6, 2014

Cursor Thoughts in esProc

No comments:

Post a Comment