September 17, 2015

An esProc Code Example of Computing Link-Relative Ratio and Year-on-year Comparison

A link-relative ratio is a percentage got by comparing the data of current period and that of the previous period. Usually the interval between the two periods is a month (sometimes an hour, a day, a week and a quarter). The link-relative ratio of April is computed by dividing, for example, its sales amount by that of March. A year-on-year comparison is a comparison between the current period and the same one from the previous year. For example, the comparison between sales amount of April 2014 and that of April 2013. In practical business, data of multiple periods is usually compared in order to find the variation trend.

The inter-row and inter-group computations related to link-relative ratio and year-on-year comparison are common and can be easily handled with esProc. The following example will show you how esProc works.

Case description:

Compute link-relative ratio and year-on-year comparison of growth rate of each month’s sales amount in a specified period. Below is a selection from the Orders table:

esProc code:
A1=esProc.query("select * from sales3 where OrderDate>=? and OrderDate<=?",begin,end)

Code explanation:
A1: Retrieve data from database according to the specified period. begin and end are external parameters. Such as, begin="2011-01-01", end="2014-07-08"(i.e. the data of the current day which can be obtained through now() function). Below is a selection of the query result:

A2: Group orders by the year and the month, and summarize data to get each month’s sales amount. Some of the computed results are as follows:

A3: Add a new field Irr, i.e, the link-relative ratio on a month-on-month basis. The expression is mAmount/mAmount[-1], in which mAmount represents sales amount of the current month and mAmount[-1] represents that of the previous month. Note that value of the first month’s (i.e. January 2011) link-relative ratio is null. Computed results are:

A4: Sort A3 by the month and the year to compute the growth rate of year-on-year comparison. Complete code should be: =A3.sort(m,y). Since A3 has been sorted by the year, so we just need to sort it by the month, that is, A3.sort(m). This has a higher performance. Some of the computed results are: 

A5: Add a new field yoy, i.e., the year-on-year comparison of monthly sales amount. The expression is if(m==m[-1],mAmount/mAmount[-1],null), which means that the comparison can only be made between the same months. Note that the values of year-on-year comparison of months in the initial year (i.e. the year 2011) are always nulls. Some of the computed results are:  

For the convenience of observation, one more line of code, A6=A5.sort(y:-1,m), can be added. That is, sort A5 by the year in descending order and by the month in ascending order. Note that the data ends in July 2014. Results are as follows: 

September 15, 2015

Grouping In-memory Data Using esProc: Code Examples

Using esProc, it is quite convenient to group data in memory. There are several main types of grouping based on how data is grouped. Here we’ll illustrate equal grouping, alignment grouping and enumeration grouping respectively with an example.

Equal grouping

That data is grouped by one or more certain fields (or one or more computed fields derived from fields) of the current data set is equal grouping. By this method each group is a subset of the original data set.

Case description: Group sales orders by the year.

Data description: The original data is as follow: 

The above data set (table sequence) can be imported from a database or a file. For example:


esProc code:

Computed result: 

Code explanation:
1. In this example, the grouping criterion comes from the OrderDate field. The order dates can be converted to the years with year(OrderDate), then data of the same year will be grouped together.

2. There may be more than one field for grouping data. For example, the data could be grouped by both the year and the sellerID to put records of each seller in each year into a group. Below is the code for doing it:,SellerId)

3.Often, the grouped data are to be aggregated, like getting each year’s sales amount based on A2’s data. The code is:,~.sum(Amount):a)

Computed result is as follows: 

Or we can combine grouping and summarizing into one step:,~.sum(Amount):a)

Alternatively, we can choose the groups function with better performance yet less flexibility:
A1.groups(year(OrderDate):y; sum(Amount):a)

Of course, sometimes we have to perform grouping and aggregate separately in order to reuse the code and improve computational efficiency, like the scenario in which one of A2’s group needs filtering and another one requires relational computing. In another scenario, the summarized data of a certain group is unusual and worth further study, then this group can be used directly in the subsequent computations without the need of filtering it again.  

4. By default, esProc’s group function will group data using hash algorithm. But for ordered data, the comparison of adjacent rows, which is equivalent to merge operation, may have higher performance. This approach can be implemented by using @o option with group function. For example:,SellerId)

Alignment grouping

Equal grouping groups data by the field(s) coming from within the dataset. If the grouping criterion is one or more fields of another data set, a user-defined array, or a parameter list, etc., the grouping model will be referred as an alignment grouping.

Different from the equal grouping, alignment grouping may produce empty subsets, which means no members in the original data can satisfy a certain grouping condition. It may also lead to incomplete grouping, that is, there may be members that will not appear in any group. Neither would happen with equal grouping. 

Case description: Group the orders table according to the list of best 10 sellers selected by KPIs.

The orders table in the previous example will also be used here. Data is stored in A1.

The best 10 sellers list is stored in B1 as follows: 

The list of sellers may come from an intermediate table, or be generated by a piece of code. It’s not important how it is produced in this example.

Computed result: 

Code explanation:

This group will be put in the end:  

3. Sometimes not all members of the grouping criterion will fall in the data set to be grouped. For instance, the grouping criterion is “a list of newly-employed sellers”. In this case, it’s normal to produce empty groups. If we modify the first record of the list into empID=100, the result will be: 

Enumeration grouping 

The grouping criterion for enumeration grouping is even more flexible. It could be any boolean expression. The records satisfying the value of the expression will be put into the same group.

Similar to alignment grouping, this is also the incomplete grouping as it probably produces empty subsets or a result in which some members are not included in any group. Moreover, this type of grouping may have the result that certain members appear in more than one group. 

Case description: Dividing orders into four groups, they are: A. order amount is less than 1,000; B. order amount is less than 2,000; C. order amount is less than 3,000; D. order amount is less than 10,000. Special requirement: data cannot be grouped repeatedly, that is, if an order has been in group A, it must not be put into group B, C, or D.

Ungrouped data set:

The orders table in previous examples will still be used. Data is stored in A1.

esProc code:


Computed result: 

Case explanation:

1.In this example, grouping criteria are multiple flexible expressions. Each record will be compared with each of the expressions. Those records that can match the same expression will be put into the same group. Groups are arranged according to the order of the grouping criteria as well.

2.By default, enumeration grouping will not produce duplicate members in the result. Because after group A’s data is selected out, expression B will be matched with the rest of the records, as this example has shown so far. But the use of function option @r allows us to match expression B with all records, which will produce duplicate members. For example the result of A3=A1.enum@r(A2,Amount) is as follows: 

3. Likewise, if values of an enumeration expression fall outside of the data to be grouped, it will correspond to an empty group. Besides, if certain records cannot match any expression, function option @n can be used to group these surplus records together. 

September 14, 2015

Code Examples of esProc Foreign Key Function

If k (a single field or a combination of fields) is table M’s primary key, and k, at the same time, exists in table B, then k is regarded as B’s foreign key. The foreign key defines a relationship between two tables and is one of the most important concepts for structured data computing. Through object references, esProc makes foreign key easy to function. The following examples aim to illustrate the how foreign key is used in esProc.

Example 1 Associate a referenced table and a referencing table
Order (containing orders information) is the referencing table, emp(containing employee information) is the referenced table. It is required to connect emp with order and display emp’s Name field, Gender field and Salary field and order’s OrderID field and Amount field in the result table.

Note: Besides emp table and order table used here, dep table (containing departments information) is to be used in subsequent examples. The relationship between emp and order/dep through the foreign key is shown as follows:

Data may originate from databases or text files. For example:
  order=esProc.query("select OrderID,SellerId,Amount,OrderDate from sales")
  emp=esProc.query("select EId,Name,Gender,Birthday,Dept,Salary from emp")
  dep=esProc.query("select * from department")

esProc code for doing this:
  A3=order.switch(SellerId, emp:EId)

Computed result:

Code explanation:

A3: Replace records of order’s SellerID with their corresponding ones in emp to create a foreign key relationship between the two tables. 
A4: Get OrderID field and Amount field from order, and get Name, Gender and Salary field from emp through foreign key references. We can see that, with object references, fields in emp can be accessed directly from order, thus saving us the trouble of writing complex and difficult join statements.

Example 2: Query referencing table according to condition existing in referenced table

Find orders signed by female sellers whose salary is greater than 10,000.
esProc code for doing this:
  A3=order.switch(SellerId, emp:EId)         / the same as above example>10000 && SellerId.Gender=="F")

Computed results:

Click the above hyperlinks in blue and corresponding employee information will be shown:

Example 3: Group data according to referenced table
Compute sales amount of each department.

esProc code for doing this:
   A3=order.switch(SellerId, emp:EId)   / the same as above example

Computed results:

You can rename fields, like order.groups(SellerId.Dept:dt;sum(Amount):amt). The effect of name changing is shown below:

Example 4: Complex association between multiple tables
Find managers of departments whose sales amount is greater than 50,000. 

esProc code for doing this:
   A3=order.switch(SellerId, emp:EId)        

Computed results:

Code explanation:
A3, A4, A5Create complete foreign key relationships.

A6: Compute sales amount of each department (See above example). The result is:

A7Use object references to solve the problem intuitionally. Expression<=50000).(dt).(Manager).(Name) can be divided into four steps according to full stops. They are:

1. Find records whose sales amount is greater than 50,000 from A6.

2Get records corresponding to those in dt field (from dep table)

3. Get records corresponding to those in Manager field (from emp table)

4. Get Name field.

Details are as follows:<=50000)




September 11, 2015

An Example of esProc Set Operations

There are a lot of occasions when set operations are needed to do structured data summarizing and analysis. For example, to list all students who have published papers, find the employees who have participated in all trainings and select students who passed the examination for the re-examination. esProc uses sets everywhere. Its most commonly used data types, such as sequences and table sequences, are all sets. Therefore, a better understanding and manipulation of sets helps to perform data computing in a more reasonable and faster way.

For example, the table below contains sales data:

We are to find the clients whose monthly sales amount is always in the top 20 in 2013. To do this we need to first retrieve the 2013 sales data, group it by the month, then select the top 20 clients of each month by loops and finally get the result by performing intersection. It is too difficult to handle such a complex problem in either SQL or the stored procedure.

esProc handles a complex problem by splitting it into different steps, and gets the final result step by step. First, from the sales data we retrieve that of 2013 and group it by the month:

esProc performs veritable data grouping by dividing data into multiple sets as required. This is different from SQL "group by" command, which returns only the aggregate results. The grouping result of A3 is as follows:

Data has been automatically sorted before it is grouped. Each group is a set of sales records. Below are the records of March:

To know the total sales amount of each client in each month, we need to group the data by clients. In esProc, we just loop through data of every month and group it by customers. We can use A.(x) to loop through set members without the need to write loop code.

After the second grouping, in A4 every month’s data is a set of sets

Now, the data of March is as follows:

We can see that each subgroup in March’s data is a set of transaction records of a certain client.

The set used in esProc is different from that in mathematical concept. esProc sets are ordered and therefore, based on them, sorting, selection by positions and other statistical operations can be performed. Then let’s find the top 20 clients of each month:

A5 loops through each month’s data to get records of the top 20 clients of each month. A6 lists their names and monthly sales amount. The result of A6 is as follows:

Now it’s time to get the final result:

List names of the top 20 clients of every month in A7 and, finally, get the intersection of the 12 groups of top 20 names in A8:

From this example we can see that esProc ordered sets make the problem-solving more intuitive. With sets, we can easily perform grouping, sorting and other computations, making each step of data processing focused and easy to understand. Moreover, the use of set concept can reduce the complexity and coding workload of loop operation as well as the set operations like intersection. 

September 10, 2015

Preparing Test Data with esProc

Test data preparation is a critical work in software testing. High-quality test data can better simulate the business case. It helps to meet the testing requirements by timely and effective evaluation of software performance, or finding potential issues in the software builds. Most of the time, the amount of data used in testing is relatively large, and the data needs to be randomly generated according to specific requirements. Sometimes there is certain relationship between the data, and there is the need to retrieve data from an existing database. Therefore, the preparation of test data often means complexity and and huge workload.

esProc is a handy tool for test data preparation.

Now we need to prepare the test data for employee’s information in text format, including employee number, name, gender, date of birth, city and state of residence, etc. Through this example, we can understand the way test data are being prepared.

We have the following requirements for test data: the employee numbers are generated sequentially. Name and gender are randomly generated. Birthdays are randomly generated, however we need to ensure that the current age of the employees are between 18 to 55 years old. City and states are randomly obtained from a table in database.

In 3 text files Top100MaleNames.txt, Top100FemaleNames.txt and Top100Surnames.txt, there are 100 most used male and female names, and surnames stored.

The cities of employees need to be retrieved randomly from the CITIES table in database: 

According to the STATEID field in CITIES table, we can retrieve the abbreviations of the states for the employees from STATES table

The code for preparing test data is as follows:

The following is the explanation of the code in the cellcet.

The first two lines generate the raw data of names. Note that when generating the employee information, the name of the employee is related to his/her gender. Therefore we need to retrieve the text data first, combine the most used male and female names, and add the gender field to them:

After data is arranged, we can see in C2 the following table sequence consisting of names and genders

Similarly, the cities and the abbreviations of states are also related. After retrieving data from database in line 3, the abbreviations of states are added to city information:

After data arrangement, the table sequence of city information in C3 is as follows:

Then the basic information for generating data is prepared in line 4, including the data structure for employee information table, and the amount of test data to be generated, etc.:

Among this, the number in C4 is the definition of cache, meaning that after generation of every 1,500 records we need to output data to the text file. This way we can control the use of memory. In B5 the data structure of employee information table is output to the text file.

As the next step, we can now run a loop from line 6 to line 15 to generate the test data for every employee:

B6 generates a random sequence number as reference to a name, while C6 generates one for a surname. They are used to generate the name and gender of an employee. Accordiing to the requirements, B9 randomly generates the age, and according to the age, line 10 selects a random date in the corresponding year as this employee's birthday. In line 11, 12 of the code, randomly select a city and get the city and state for the employee. After the required data is generated, B13 will add the data to the table sequence of employee information created in A4. C14 controls the data output, and write data to text file after every 1,500 records are generated. After data output, A4 is dumped to avoid too much memory use.

After all data output, the text file is as follows:

When preparing test data with esProc, we can run a loop to generate large amount of random data. Meanwhile, in the loop, we can invoke existing database data or text data easily, to generate data according to business needs and to avoid writing complex programs.