esProc, A Script Language for Data Analytics with Parallel Mechanism: esProc’s Prime Keys and Their Index Function

In an esProc table sequence, a field or certain fields can be designated as the prime key(s). In this case, some special functions can be employed to make query based on the prime key(s). This can both simplify the code and increase computational performance effectively.

1. find and pfind

Prime keys are common in database tables. The field value of a prime key is solely used to mark a record, so identical values of prime keys are forbidden, which is a mandatory rule in many databases.

Under esProc, it is assumed that each value of the prime keys is unique for the records. But no mandatory check will be executed and there won't be error reporting even if there are identical values. Both T.pfind(v) function and T.find(v) function can be used to query records in table sequence T according to v, value(s) of the prime key(s). find returns the first found record and pfind returns the sequence number of this record.

It is not necessary to set a primary key, or primary keys. When no primary key is set in esProc, the first field will be regarded as the primary key. If we want to set specified fields as primary keys, we use T.primary(F_i,…) function, meaning setting F_i,… as the primary keys of table sequence T.

Usually, we use conventional position function T.select() and T.pselect() to query records in a table sequence. Now we'll compare usages of this pair of function with pfind and find.

Here EMPLOYEE table in demo database is the table sequence in which query will be executed. Add field FullName to it and use the employees' full names to make query:

In order to manifest the performance advantages of using prime keys to query data, 10,000 full names will be generated arbitrarily and use pselect and pfind to make query according to these full names. Compute the time consumed by these two methods:

Based on the same data, A5 and A9 use pselect and pfind respectively to query positions of the records in the table sequence. In A9, primary function is used to set primary keys for the table sequence before pfind starts to work. A6 and A10 work out respectively the milliseconds consumed as follows:

Though the query results in A5 and A9 are same:

Using similar method, we may compare select function and find function. To keep in line with find function, @1 option is used in select function to get the first result and return it:

A6 and A10 compute respectively the milliseconds consumed as follows:

The query results in A5 and A9 are still same:

It can be seen easily from the comparison that the query functions based on primary keys are much more efficient than conventional position functions.

2. Primary keys and the index table

Why will the efficiency increase greatly when primary keys are employed to make query in esProc? The reason is that the index table of the primary keys is used in computing.

In calling a primary function, or making query based on primary keys in a table sequence without an index table for the first time, an index table will be generated according to the primary keys. While an index table is generated, a hash table will be generated according to all values of the primary keys which will be divided into many groups by hash values. These hash values are the corresponding group numbers.

Normally, when querying a certain record in a table sequence according to field value, it needs to examine the records one by one until the target is found. For a table sequence containing n records, an average of n/2 examinations are needed.

Thanks to the index table, it's turning around in the case of querying a certain record in a table sequence according to field value of the primary key. First, we can compute the hash value according to the primary key value. This enables us to find out directly the corresponding group in the index table. Then we just need to examine records of the same group. In the same way, for a table sequence containing n records, if its primary keys are distributed in k groups according to hash values, we only have to make an average of n/2k comparisons. Using this method, despite the price paid due to computing hash values before generating an index table and executing the query, the number of comparisons is reduced significantly, and especially, it is enough to generate an index table once. Therefore, the more the data in a table sequence and the times needed for query, the higher the efficiency.

During computing, T.index(n) function can be used to create an index table for T's primary keys in advance. n represents the length of the index table. Default length will be used if there is no particular setting.

What we should know is that the reason that the functions that execute query using values of primary keys, such as find and pfind, can enhance computational performance effectively is because an index table has been created for primary keys in a table sequence. For this reason, if the primary keys themselves can be used as indexes to locate records, it is unnecessary to create an index table. Field EID itself in the above-mentioned EMPLOYEE table, for example, represents the positions of records in the table sequence, thus it is more efficient to use itself to query the corresponding records:

This time 10,000 sequence numbers of employees are generated arbitrarily. A4 queries the corresponding records directly according to these sequence numbers, yet A7 still uses find function to query. A5 and A8 compute respectively the time consumed as follows:

We can see that it is faster to locate records directly using sequence numbers. Because this method doesn't have to compare field values, nor does it compute the hash values and create an index table. The query results in A4 and A7 are same:

Thus it can be seen that suitability should be taken into consideration in esProc if we are to use index function of primary keys in a table sequence to increase efficiency.

3. switch function

The role of switch function is to relate the value of a certain field F in table sequence T₁ to the primary keys or specified fields of another table sequence T₂, thus foreign keys can be created in T₁to reference corresponding records in T_2.The operation of creating foreign references is similar to that of using find to query records according to values of primary keys in T_2.This process also involves creating an index table. However,T₁.run(F=T₂.select@1(...))can also be used to perform this operation. We’ll first compare the efficiency of the two methods:

Both A2 and A3 contain personnel information loaded from binary text file PersonnelInfo:

A4 contains states information:

In both A6 and A9, field State of PersonnelInfo is switched into corresponding states information. Their difference is that A6 uses select@1 function while A9 uses switch function. A7 and A9 compute respectively the time consumed as follows:

After the code in the cellset is executed, values of A2 and A3 are same:

Field State has been switched into corresponding records in states information table.

Before switch is executed, an index table will also be created for corresponding fields in the table sequence. In the example, an index table is created for field ABBR of states information table in A4 to increase matching efficiency. Note that here field ABBR equals a primary key set temporarily. The requirement for the field of a primary key that no identical values are allowed applies to its data too.

So the computational efficiency of switch function is obviously higher than that of cases using run+select@1. Meanwhile, since the relational computation of single foreign key field is relatively common, esProc specially offers the switch function to simplify the code produced by using functions as well as to increase the computational efficiency effectively.

menu

September 2, 2014

esProc’s Prime Keys and Their Index Function

1. find and pfind

2. Primary keys and the index table

3. switch function

No comments:

Post a Comment