September 16, 2014

Computing the Online Time for Users with esProc (IV)

In the last article we mentioned that the IT engineers of the Web Company used esProc to write a single-machine, multi-threaded program that could handle the large data volume and complex requirements, making full use of one multi-core, multi-CPU machine. Now these engineers have run into a new issue: with the number of users of the online application growing explosively, colleagues from the Operations Department complained that the online time computation program still runs too slowly.

The IT engineers leveraged esProc's multi-machine parallel computing capability to split the task across multiple machines, and the performance problem was resolved. Single-machine parallel processing was shifted to multi-machine parallel processing at a relatively low cost in hardware and software upgrades.

To improve performance, the Web Company increased the number of servers from one to three. Accordingly, the following steps were needed to shift from single-machine to multi-machine parallelism:


The first step: modify the esProc program that preprocesses the weekly log files. Divide each user ID by 3 and split the weekly log file into three files according to the remainder, so that every server processes one of them. This reduces the file sizes and shortens the file transfer time. The three files are then uploaded to the three servers, where parallel programs perform the computation. The actual program is as follows:


Note in the last screenshot that A6 uses the @g option of the export function to write the week's log data into three binary files. During subsequent parallel processing, the content of these files can be retrieved block by block for different users. The @g option ensures that segmented retrieval is aligned to group borders, so data for the same user can never be assigned to two blocks.
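The splitting idea in this first step is simple enough to sketch outside esProc. Purely as an illustration (not the script in the screenshot), the following Python sketch splits a week's log into three files by userid % 3; the file name week_log.txt and the tab-separated layout are assumptions:

# Hypothetical illustration: split a weekly log file into 3 parts by userid % 3,
# so each server receives only the users it will process.
# Assumes a tab-separated file "week_log.txt" with columns: logtime, userid, action.

N_SERVERS = 3

def split_week_log(path="week_log.txt"):
    outputs = [open(f"week_log_part{i}.txt", "w", encoding="utf-8") for i in range(N_SERVERS)]
    try:
        with open(path, encoding="utf-8") as src:
            header = src.readline()
            for out in outputs:
                out.write(header)
            for line in src:
                userid = int(line.split("\t")[1])
                # Every record of a user shares the same remainder, so a user's
                # data never straddles two output files.
                outputs[userid % N_SERVERS].write(line)
    finally:
        for out in outputs:
            out.close()

if __name__ == "__main__":
    split_week_log()

Because all records of one user share the same remainder, each user's data lands in exactly one of the three files, which is the same kind of guarantee the @g option later provides at block level inside each file.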

The second step: the single-machine, multi-threaded subroutine stays essentially unchanged. Let's look at it again.

The subroutine parameters are shown below. They pass in the week's log file name, the block number, and the total number of blocks when the subroutine is called by the main program. Here the week's log file is already one of the three segmented files, the one corresponding to this machine.


The subroutine is as follows:


The above screenshot illustrates that:
1. Because we previously used export@g to output the file grouped by user ID, the @z option of the cursor function in A2, which fetches a specific block (the block number) out of total (the total number of blocks) from the file, retrieves complete groups for each userid. Data for one user will not be split across two blocks.

2. The code line in the red box returns the result file as a cursor to the main program. Since multi-machine parallel processing is used here, this is a remote cursor (see esProc's documentation for a detailed introduction to remote cursors).
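esProc's export@g and cursor@z handle this alignment internally. Just to make the idea concrete, here is a hypothetical Python sketch (assuming a tab-separated file already sorted by userid, with no header) of how block borders can be snapped to user boundaries so that block k of n never splits a user:

# Hypothetical sketch: read block `block` (1-based) of `total` from a file sorted
# by userid, moving both borders forward to the next user boundary so that one
# user's rows never split across blocks.
def read_user_aligned_block(path, block, total):
    with open(path, encoding="utf-8") as f:
        lines = f.readlines()              # a real reader would stream; this is a sketch
    n = len(lines)
    start = n * (block - 1) // total
    end = n * block // total

    def uid(i):
        return lines[i].split("\t")[1]

    # Advance each raw boundary to the first row where a new user begins.
    while 0 < start < n and uid(start) == uid(start - 1):
        start += 1
    while 0 < end < n and uid(end) == uid(end - 1):
        end += 1
    return lines[start:end]

Since both borders are snapped by the same rule, consecutive blocks neither overlap nor leave gaps, which mirrors the guarantee described above.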

The third step: write the main program for parallel computing, which calls the parallel subroutine. As illustrated below, the main program launches parallel tasks on three machines, which effectively improves the computation performance.

The server list in the program could also be moved into a configuration file, so that servers can easily be added or removed later.

Note: for specific measurements of esProc's performance gain from parallel computing, please refer to the related esProc test reports.

Notes on the above screen capture:

1. The parameters of callx specify the three servers in A1 to A3 to handle the three log files in B1 to B3.

2. The syntax of callx's input parameters specifies the three servers through A5 and six parallel computing tasks per server in A6.

3. The server list, the number of servers, and the number of tasks per server can be adjusted to the actual situation to exploit the servers' full performance potential.
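In esProc the distribution itself is done by callx against remote esProc servers. Purely to make the fan-out pattern visible, here is a hypothetical Python sketch that builds the same kind of task list, three servers by six blocks, and hands it to a local thread pool; process_block is a placeholder for the work a remote server would do, and the host and file names are assumptions:

# Hypothetical sketch of the fan-out pattern: 3 servers x 6 blocks per server.
# In the real setup esProc's callx sends each task to a remote esProc server;
# here a local thread pool and a placeholder function stand in for that.
from concurrent.futures import ThreadPoolExecutor

SERVERS = ["server1", "server2", "server3"]          # assumed host names
FILES = ["week_log_part0.btx", "week_log_part1.btx", "week_log_part2.btx"]
TASKS_PER_SERVER = 6

def process_block(server, path, block, total):
    # Placeholder: on the real system, the parallel subroutine on `server`
    # would open block `block` of `total` from `path` and return its result.
    return (server, path, block)

tasks = [(srv, f, b, TASKS_PER_SERVER)
         for srv, f in zip(SERVERS, FILES)
         for b in range(1, TASKS_PER_SERVER + 1)]

with ThreadPoolExecutor(max_workers=len(tasks)) as pool:
    results = list(pool.map(lambda t: process_block(*t), tasks))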

The fourth step: deploy the esProc servers and upload the related program and data files. Refer to the esProc documentation for the specific steps and methods.

After the move to multi-machine parallel computing, the Operations Department saw a significant improvement in the speed of the online time computation. The cost of this transformation is much lower than upgrading the application database; on the hardware side, only two additional PC servers were needed.

With this, the Web Company has finished implementing its esProc-based user behavior analysis and computation platform. Its main advantages are:

1. The platform can easily be adjusted to more complex algorithms in the future, shortening response time and saving engineering labor costs.

2. It can easily scale out to even larger data volumes in the future, with shorter project time and lower upgrade costs.

September 15, 2014

Computing the Online Time for Users with esProc (III)

In the last article we mentioned that the IT engineers of the Web Company used esProc to write a program that could handle the large data volume and complex requirements. Not only did it meet the demands of the online time computation, it was also relatively easy to extend with new conditions.

However, these engineers found that the single-threaded program did not take full advantage of the server's computing power. Practice proved that esProc's multi-threading capability can exploit the server's multiple CPU cores, and the change from single-threaded to multi-threaded requires very little work.

The Operations Department provided the following requirements for computing users' online time:

1. Login should be considered the starting point of online time, and sessions that span midnight should be taken into account.

2. If the time interval between any two operations is less than 3 seconds, then this interval should not be added to online time.


3. If after login, the time interval between any two operations is longer than 600 seconds, then the user should be considered as logged out.


4. If there is only login, without logout, then the last operation time should be treated as time for logout.


5. For a user who performs a post operation, the online time of the current session will be tripled in the computation.


To shift from single-threaded to parallel computing, the following steps need to be done:


The first step: adjust the log file preprocessor to use the @g option of the export function, writing the week's log data into a segmented binary file. In subsequent parallel processing, the log file can then be retrieved block by block for different users. The @g option ensures that segmented retrieval is aligned to group borders, so data for the same user can never be assigned to two blocks. The actual procedure is as follows:


The second step: rewrite the online time computation program as a parallel subroutine. The part in the red box below is what needs to be modified for parallel processing. Because different parallel tasks compute for different users, very few changes are required; the only change is to replace reading whole files with reading different blocks of the binary file.

First we add parameters to the subroutine to pass the week's log file name, the block number, and the total number of blocks when it is called by the main program.


Then modify the program as follows:

The above screenshot illustrates that:
1. Because we previously used export@g to output the file grouped by user ID, the @z option of the cursor function, which fetches a specific block (the block number) out of total (the total number of blocks) from the file, as shown in the red box, retrieves complete groups for each userid. Data for one user will not be split across two blocks.

2. A16 returns the result file as a cursor to the main program.

The third step: write the main program for parallel computing, which calls the parallel subroutine. Because the server's CPUs have eight cores in total, the IT engineers decided to use six threads for parallel computing. This takes full advantage of the multi-core CPUs to improve performance.
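The esProc grid in the screenshot uses esProc's own multithreading constructs for this. As a language-neutral illustration only, the same pattern, six workers each handling one block of the weekly binary file and the per-user partial results merged at the end, might look like this in Python; compute_block is a hypothetical stand-in for the subroutine above:

# Hypothetical sketch: six parallel workers, one block each, partial results merged.
from multiprocessing import Pool

TOTAL_BLOCKS = 6

def compute_block(block):
    # Stand-in for the parallel subroutine: it would read block `block` of
    # TOTAL_BLOCKS from the weekly binary file and return {userid: onlinetime}.
    return {}

def main():
    with Pool(processes=TOTAL_BLOCKS) as pool:
        partials = pool.map(compute_block, range(1, TOTAL_BLOCKS + 1))
    totals = {}
    for part in partials:
        for userid, seconds in part.items():
            totals[userid] = totals.get(userid, 0) + seconds
    return totals

if __name__ == "__main__":
    main()

Because the blocks are aligned to user boundaries, each user appears in exactly one partial result, so merging is a simple union.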

Note: for specific measurements regarding esProc's performance gain with parallel computing, please refer to related test reports for esProc.

Once this requirement was met, the IT engineers of the Web Company faced a new problem: the number of users of the online application grew explosively, and colleagues from the Operations Department complained that the online time computation program still ran too slowly. The single-machine, multi-threaded approach could no longer raise the computing speed significantly. Can these IT engineers solve the performance issue using esProc's multi-machine parallel computing capability? Is it too costly to move to a multi-machine parallel mode? See "Computing the Online Time for Users with esProc (IV)".

September 14, 2014

Using esProc to Compute the Online Time of Users (II)

In the last part we mentioned that the Operations Department of the Web Company raised a new demand: adding new conditions to the way the online time is computed. Since the IT Department uses esProc as its computation tool, such changes in requirements are easy to handle. At the same time, the increasing amount of data can be accommodated by out-of-memory computation with esProc's file cursor functionality.

Previously, the Operations Department provided the following requirements for computing users' online time:


1. Login should be considered the starting point of online time, and sessions that span midnight should be taken into account.

2. If the time interval between any two operations is less than 3 seconds, then this interval should not be added to online time.

3. If after login, the time interval between any two operations is longer than 600 seconds, then the user should be considered as logged out.

4. If there is only login, without logout, then the last operation time should be treated as time for logout.

Over time, the Operations Department found that there are some "key points" in user behavior: between login and logout, users who perform post actions are more loyal to the online application. Therefore, the Web Company plans to introduce an incentive: on top of the original rules, if a user performs a post operation, his/her online time will be tripled in the computation.

After receiving the task, the IT engineers considered the possibility of future adjustments to the computation, plus the need for added conditions. They decided to use an out-of-memory cursor and a for loop to implement the computation.

Analysis showed that most user behavior analyses are done for each user independently. Thus, if the log files are pre-sorted by userid, the performance of the various analytical computations will improve, with reduced difficulty and shorter processing time. The pre-processing program is as follows:

As we can see, pre-processing means sorting the seven days of log files and outputting them to a binary file. This eliminates the need for subsequent consolidation and sorting. Meanwhile, the binary file format provided by esProc also helps raise data read/write performance.
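The pre-processing script itself is shown only as a screenshot. As a rough Python sketch of the same idea (the daily file names YYYY-MM-DD.log, the tab-separated layout with a header line, and pickle standing in for esProc's binary format are all assumptions), it could look like this:

# Hypothetical sketch: merge 7 daily log files, sort by (userid, logtime),
# and store the result in a binary file (pickle stands in for esProc's binary format).
import pickle
from datetime import date, timedelta

def preprocess_week(start=date(2014, 1, 5), days=7, out_path="week_sorted.bin"):
    rows = []
    for i in range(days):
        day = start + timedelta(days=i)
        with open(f"{day.isoformat()}.log", encoding="utf-8") as f:
            next(f)                          # skip the header line shown in the sample
            for line in f:
                logtime, userid, action = line.rstrip("\n").split("\t")
                rows.append((int(userid), logtime, action))
    rows.sort()                              # by userid, then logtime (ISO strings sort chronologically)
    with open(out_path, "wb") as out:
        pickle.dump(rows, out)

if __name__ == "__main__":
    preprocess_week()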

After pre-processing, the code for the online time computation can be written as follows:

Note that:
1. The volume of data for one user over seven days is not big, so in cell A5 we can retrieve all log data for a user into memory in one batch.

2. In the loop that handles one user per iteration, the code in the red box implements the new business logic: for every post operation performed, the user's online time for the current session is tripled. Unqualified records are removed in cell B9, and in B10 a serial number (lognum) is calculated for every login. Records are grouped by lognum to compute the sum of online time for each group; if there is at least one "post" action in the current group of operations, the sum of online time for that group is tripled.

3. Since the result is relatively large, whenever the computation has covered 10,000 users and the result has reached 10,000 lines, we output the data from memory to the result file in one batch. This improves performance while avoiding memory overflow.
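The esProc cells B9 and B10 mentioned above are visible only in the screenshot. As a hedged illustration of the business rules themselves, not of the author's script, the per-user part could be written in Python as below; the input is assumed to be one user's chronologically ordered (timestamp-in-seconds, action) pairs:

# Hypothetical sketch of the per-user rules: records is a chronologically ordered
# list of (logtime_in_seconds, action) pairs for one user.
def user_online_time(records):
    total = 0
    session = 0          # accumulated time of the current login session
    has_post = False     # whether the current session contains a post action
    prev_time = None
    logged_in = False

    def close_session():
        nonlocal total, session, has_post
        total += session * 3 if has_post else session
        session, has_post = 0, False

    for logtime, action in records:
        if logged_in and prev_time is not None:
            gap = logtime - prev_time
            if gap > 600:                    # rule 3: long silence counts as logout
                close_session()
                logged_in = False
            elif gap >= 3:                   # rule 2: ignore intervals under 3 seconds
                session += gap
        if action == "login":                # rule 1: login starts the online time
            if logged_in:
                close_session()
            logged_in = True
        elif action == "logout":
            if logged_in:
                close_session()
            logged_in = False
        elif action == "post" and logged_in:
            has_post = True                  # rule 5: a post triples this session
        prev_time = logtime
    if logged_in:                            # rule 4: the last operation acts as logout
        close_session()
    return total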


After meeting this demand, the IT engineers of the Web Company found that the single-threaded program did not take full advantage of the server's computing power. Here comes another question: can these engineers leverage esProc's multi-threaded parallel computing capabilities to exploit the server's multi-core CPUs? Is it troublesome to shift from single-threaded to multi-threaded? See "Computing the Online Time for Users with esProc (III)".

September 11, 2014

Using esProc to Compute the Online Time of Users (I)

As the operator of an online system, the Web Company believes that the users' time spent with their online application is a key analysis scenario. Specifically, the online time refers to the cumulative time a user spent with their online business application over a certain period of time.

As the company's online application has evolved, the total number of users has grown and the task of user behavior analysis has become more complex. Here we use the example of computing users' online time to show the various computing scenarios, ranging from simple to complex, hoping it can serve as a reference for similar development projects. In fact, the following approach is also applicable to other categories of user behavior analysis, such as user activity level, user churn, etc.

Let's start from the time when the application first went online. The Operations Department needed to know users' weekly online time with the application, and the engineers from the IT Department provided the following solution.

At the Web Company, user behavior information is recorded in log files, with a separate log file generated every day. For example, the following log file, "2014-01-07.log", contains the users' online actions on January 7, 2014. To compute users' online time for the week of 2014-01-05 to 2014-01-11, we need to retrieve data from 7 log files:


logtime    userid       action
2014-01-07 09:27:56        258872799       login
2014-01-07 09:27:57        264484116       login
2014-01-07 09:27:58        264484279       login
2014-01-07 09:27:58        264548231       login
2014-01-07 09:27:58        248900695       login
2014-01-07 09:28:00        263867071       login
2014-01-07 09:28:01        264548400       login
2014-01-07 09:28:02        264549535       login
2014-01-07 09:28:02        264483234       login
2014-01-07 09:28:03        264484643       login
2014-01-07 09:28:05        308343890       login
2014-01-07 09:28:08        1210636885     post
2014-01-07 09:28:09        263786154       login
2014-01-07 09:28:12        263340514       get
2014-01-07 09:28:13        312717032       login
2014-01-07 09:28:16        263210957       login
2014-01-07 09:28:19        116285288       login
2014-01-07 09:28:22        311560888       login
2014-01-07 09:28:25        652277973       login
2014-01-07 09:28:34        310100518       login
2014-01-07 09:28:38        1513040773     login
2014-01-07 09:28:41        1326724709     logout
2014-01-07 09:28:45        191382377       login
2014-01-07 09:28:46        241719423       login
2014-01-07 09:28:46        245054760       login
2014-01-07 09:28:46        1231483493     get
2014-01-07 09:28:48        266079580       get
2014-01-07 09:28:51        1081189909     post
2014-01-07 09:28:51        312718109       login
2014-01-07 09:29:00        1060091317     login
2014-01-07 09:29:02        1917203557     login
2014-01-07 09:29:16        271415361       login
2014-01-07 09:29:18        277849970       login

Log files record, in chronological order, the user's operation (action), the user ID (userid), and the time the action took place (logtime). User operations fall into three types: login, logout, and get/post actions.

The Operations Department provided the following requirements for computing users' online time:
1. Login should be considered the starting point of online time, and sessions that span midnight should be taken into account.

2. If the time interval between any two operations is less than 3 seconds, then this interval should not be added to online time.

3. If after login, the time interval between any two operations is longer than 600 seconds, then the user should be considered as logged out.

4. If there is only login, without logout, then the last operation time should be treated as time for logout.

As the online application had just been rolled out, the data volume of the log files was relatively small. To compute on the log files from 2014-01-05 to 2014-01-11, we can retrieve all the data into memory in one batch and write the result out to a file. Thus all the code here is written for in-memory computing.

The IT Department leverages esProc to meet the above requirements.

The actual code is as follows:


The ideas behind the program design are:

1. First, retrieve all the log files for the week (2014-01-05 to 2014-01-11) and merge them in chronological order, then sort them by userid and logtime. Add two extra fields, onlinetime and loginflag, for the subsequent calculations.

2. onlinetime holds the interval between two consecutive operations by the same user. If the difference between the current operation time and the previous one is less than 3 seconds, or if the userid of the current operation differs from that of the previous one, onlinetime is set directly to 0.

3. loginflag indicates a valid online time. If onlinetime does not exceed 10 minutes (600 seconds), or the operation type is logout, loginflag is set to true; otherwise it is set to false. For a login operation, loginflag is set directly to true.

4. On the sorted table resulting from the previous steps, compute loginflag again. If loginflag was originally false, leave it false. If it was originally true, the new value depends on the type of the previous operation: if the previous operation was login, loginflag stays true; otherwise it is set to false.

5. On the sorted table resulting from the previous steps, group the data by userid and compute the sum of onlinetime over all records whose loginflag is true. This is the total online time for each user.

6. Output the result of the last step to the file onlinetime.data.

The advantage of the above code lies in its step-by-step style of computation, which is easy to maintain and modify.
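The esProc grid itself appears only in the screenshot. To make the four rules concrete, here is a hedged end-to-end Python sketch; the daily file names, the tab-separated layout, and the session-based reading of the rules (rather than the cell-by-cell loginflag logic above) are assumptions:

# Hypothetical end-to-end sketch of rules 1-4: read the week's logs, group by user,
# sum each user's valid online time, and write the totals to onlinetime.data.
from datetime import datetime, date, timedelta

def load_week(start=date(2014, 1, 5), days=7):
    rows = []
    for i in range(days):
        day = start + timedelta(days=i)
        with open(f"{day.isoformat()}.log", encoding="utf-8") as f:
            next(f)                                   # skip the header line
            for line in f:
                logtime, userid, action = line.rstrip("\n").split("\t")
                ts = datetime.strptime(logtime, "%Y-%m-%d %H:%M:%S").timestamp()
                rows.append((int(userid), ts, action))
    rows.sort()                                       # by userid, then time
    return rows

def user_online_time(events):
    """events: chronological (timestamp, action) pairs of one user."""
    total, last_ts, logged_in = 0, None, False
    for ts, action in events:
        if logged_in and last_ts is not None:
            gap = ts - last_ts
            if gap > 600:                             # rule 3: treated as logged out
                logged_in = False
            elif gap >= 3:                            # rule 2: drop tiny intervals
                total += gap
        if action == "login":                         # rule 1: login starts counting
            logged_in = True
        elif action == "logout":
            logged_in = False
        last_ts = ts                                  # rule 4: the last op ends the session
    return total

def main():
    rows = load_week()
    totals, current_user, events = {}, None, []
    for userid, ts, action in rows:
        if userid != current_user:
            if current_user is not None:
                totals[current_user] = user_online_time(events)
            current_user, events = userid, []
        events.append((ts, action))
    if current_user is not None:
        totals[current_user] = user_online_time(events)
    with open("onlinetime.data", "w", encoding="utf-8") as out:
        for userid, seconds in totals.items():
            out.write(f"{userid}\t{int(seconds)}\n")

if __name__ == "__main__":
    main()

Because the merged rows carry full timestamps across the seven days, intervals that span midnight are handled naturally, which is what rule 1 asks for.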

After running for a while, a new problem appeared. On the one hand, the Operations Department said that the original way of computing online time should be adjusted, with new conditions added. On the other hand, as the number of users increased, the log files grew too big to fit into memory in one batch. How should the IT Department cope with these changes in requirements?

Please see "Using esProc to Compute the Online Time of Users (II)".

September 9, 2014

Principle and Use of External Memory Grouping in esProc

After data are imported from a data table, they usually need to be grouped as required and a grouping and summarizing result worked out. In esProc, the groups function computes the result of grouping and summarizing; alternatively, the data can be grouped first and further analysis and computation performed later.

But the case is different when processing huge data volumes, where the records cannot all be loaded into memory and distributed into groups. At other times the number of groups is huge and even the grouping and summarizing result cannot be returned all at once. On these two occasions, external memory grouping is required.

1. Grouping with cursor by directly specifying group numbers 


Let's create a big, simple data table containing employee information, with three fields: employee ID, state, and birthday. The IDs are generated in order; the states are abbreviations picked at random from the STATES table of the demo database; the birthdays are dates selected at random within the 10,000 days before 1994-12-31. The data table is stored as a binary file for convenience.

Altogether 1,000,000 rows of data are generated. The result of reading rows 50,001 to 51,000 with a cursor can be seen in C10 as follows:

In the following, we'll take the generated data file, BirthStateRecord, as an example to explore how to group in cursor computing by directly specifying group numbers. Because the data of the big table cannot all be loaded into memory, we cannot group it as we would an ordinary table sequence. To solve this problem, esProc offers the cs.groupx(x) function, which distributes the records of cursor cs into groups numbered by the computed result of expression x and returns a sequence of cursors. For example:

To explain how the cs.groupx(x) function performs external-memory grouping with specified group numbers, the code is executed step by step, by clicking in the debugging area of the toolbar, up to A6. A2 creates a cursor on the binary data file BirthStateRecord. A4 creates a sequence of the state abbreviations from the STATES table. A5 uses the groupx function to group the cursor data; in this process, the serial number of each state in A4 is looked up and used as the group number. During the execution of groupx in A5, a temporary file is generated for each group to record the grouping result, and the sequence of temporary cursor files is returned as follows:

While the code in A5 is executed, external files are generated in the directory of temporary files:

In the groupx operation, the number of temporary files equals the number of groups. We can import data from one of the temporary files:

The data A2 imports are as follows:

It can be seen that the data in a temporary file is, in fact, the employee information of one state; here it is the data of the state of Missouri. Resume execution from the toolbar of the previous cellset file and go on with that cellset. In A8, when all cursor files are closed, the temporary files are deleted automatically. A6 reads from the first cursor file the employee information of the state of Alabama, as follows:

A7 works out the grouping and summarizing result of the 22nd group using groups function, that is, the number of employees from the state of Michigan:

It is thus clear that, when grouping cursor records by directly specified group numbers, a sequence of temporary cursor files is returned. Each cursor file contains the records of one group, and the data in a cursor can be processed further.
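groupx does all of this internally. Just to make the mechanism tangible, here is a hypothetical Python sketch of the same idea: stream the big table once, append every record to the temporary file of its group (here, its state), and then read one group back on demand; the (id, state, birthday) layout follows the example above:

# Hypothetical sketch of external-memory grouping: stream the big table once and
# append every record to the temporary file of its group (its state), so no
# single group ever has to share memory with all the others.
import csv, os, tempfile

def groupx_by_state(rows, states):
    """rows: iterable of (emp_id, state, birthday); states: list of state codes.
    Returns {state: path_to_temp_file}, numbered in the order of `states`."""
    tmpdir = tempfile.mkdtemp(prefix="groupx_")
    paths = {s: os.path.join(tmpdir, f"group_{i}.csv") for i, s in enumerate(states, 1)}
    handles = {s: open(p, "w", newline="", encoding="utf-8") for s, p in paths.items()}
    writers = {s: csv.writer(h) for s, h in handles.items()}
    try:
        for emp_id, state, birthday in rows:
            writers[state].writerow([emp_id, state, birthday])
    finally:
        for h in handles.values():
            h.close()
    return paths

def read_group(paths, state):
    """Import one group's records, like opening one of the temporary cursor files."""
    with open(paths[state], newline="", encoding="utf-8") as f:
        return list(csv.reader(f))

Once every group is on disk, any single group (one state's employees) can be loaded and processed on its own, which mirrors the sequence of temporary cursor files returned by groupx.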

2. Grouping and summarizing result sets of huge data

When grouping cursor data, most of the time we don't need the detailed data of each group; all we need is the grouping and summarizing result. To get the number of employees of each state from BirthStateRecord, for example, we use the groups function to compute the grouping and summarizing result:

Thus we can get the result in A3:

Notice that the groups function returns the grouping and summarizing result as a table sequence once the computation is completed. When processing massive data, it is sometimes necessary to produce a great many groups, and the grouping and summarizing result set itself is too big to be returned, for example when a telecom company computes statistics for each customer's bill, or an online shopping mall computes sales statistics for each commodity. In these cases, using the groups function may cause a memory overflow. We can use the groupx(x:F,…;y:F,…;n) function instead to perform grouping and summarizing with the help of external memory; here n is the number of rows in the buffer. For example:

Again, the code is executed step by step, up to A4. In A3, the groupx function uses external memory to perform grouping and summarizing. In cursor computing, groupx serves both for external-memory grouping and summarizing and for grouping by directly specified group numbers; the two uses differ only in their parameters. A3 groups by employees' birthdays, then sums the number of employees born on each day, with the buffer size set to 1,000 rows. The result returned by A3 is a cursor, as follows:

After the code in A3 is executed, external files will be generated in the directory of temporary files:

The data of one of the temporary files can be imported:

The data A2 imports are as follows:

The data of A3 is as follows:

It can be seen that each temporary file holds the grouping and summarizing result, by employees' birthdays, of one part of the data. esProc merges all the temporary files and returns them as one larger cursor. When the temporary files are generated, esProc selects a group count suitable for the computation, so the number of rows in a temporary file may be a little larger than the buffer size we set; this point deserves special attention.
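Again, this is roughly what groupx does under the hood. A hedged Python sketch of the buffer-and-spill idea, counting employees per birthday, spilling sorted partial counts to temporary files whenever the buffer reaches n keys, and merging the spills at the end, might look like this:

# Hypothetical sketch of external grouping-and-summarizing with a row buffer:
# count employees per birthday, spilling sorted partial counts to temp files
# whenever the in-memory buffer reaches `buffer_rows` keys, then merging the spills.
import heapq, os, tempfile
from collections import Counter
from itertools import groupby

def groupx_count(rows, key_index=2, buffer_rows=1000):
    tmpdir = tempfile.mkdtemp(prefix="groupx_sum_")
    spills, buffer = [], Counter()

    def spill():
        path = os.path.join(tmpdir, f"part_{len(spills)}.txt")
        with open(path, "w", encoding="utf-8") as f:
            for key, cnt in sorted(buffer.items()):
                f.write(f"{key}\t{cnt}\n")
        spills.append(path)
        buffer.clear()

    for row in rows:
        buffer[row[key_index]] += 1
        if len(buffer) >= buffer_rows:
            spill()
    if buffer:
        spill()

    # Merge the sorted spill files and combine the counts for equal keys.
    def read(path):
        with open(path, encoding="utf-8") as f:
            for line in f:
                key, cnt = line.rstrip("\n").split("\t")
                yield key, int(cnt)

    merged = heapq.merge(*(read(p) for p in spills), key=lambda kv: kv[0])
    for key, pairs in groupby(merged, key=lambda kv: kv[0]):
        yield key, sum(cnt for _, cnt in pairs)

Because each spill file is written in key order, the final pass is a simple ordered merge, which is why the overall result can be returned as a cursor-like stream rather than one big in-memory table.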

Go on with the execution of the previous cellset file. When the cursors are closed in A5, the temporary files are deleted automatically. A4 fetches the first 1,000 birthdays from the cursor generated in A3; the numbers of employees for each birth date are as follows: