esProc, A Script Language for Data Analytics with Parallel Mechanism: Using esProc to Compute the Online Time of Users (II)

In the previous part we mentioned that the Operation Department of the Web Company raised a new demand: adding new conditions to the way online time is computed. Because the IT department was using esProc as its computation tool, such changes in requirements are easy to handle. In addition, the growing volume of data can be accommodated by out-of-memory computation with esProc's file cursor functionality.

Previously, the Operation Department provided the following requirements for computing users' online time:

1. Login should be considered the starting point of online time, and overnight sessions should be taken into consideration.

2. If the time interval between any two operations is less than 3 seconds, then this interval should not be added to online time.

3. If after login, the time interval between any two operations is longer than 600 seconds, then the user should be considered as logged out.

4. If there is only a login, without a logout, then the time of the last operation should be treated as the logout time.

Over time, the Operation Department found some "key points" in user behavior: users who performed post actions between login and logout are more loyal to the online application. Therefore, the Web Company plans to introduce an incentive: on top of the original rules, if a user performed a post operation, his/her online time will be tripled in the computation.
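The four rules plus the new incentive can be sketched in a few lines of Python. This is only an illustration, not the esProc script used by the IT department. The action names ("login", "logout", "post"), the event layout, and the handling of boundary cases are assumptions: timestamps are absolute seconds (so overnight sessions need no special handling), a gap of exactly 3 or 600 seconds is counted, a repeated login starts a new session, and operations outside a session are ignored.

```python
from typing import Iterable, Tuple

MIN_GAP = 3      # rule 2: gaps shorter than this are not added
MAX_GAP = 600    # rule 3: a longer gap means the user has logged out
POST_FACTOR = 3  # incentive: a session containing a post counts three times

Event = Tuple[int, str]  # (timestamp in seconds, action name), assumed layout


def weighted(session: int, has_post: bool) -> int:
    """Apply the incentive to the seconds accumulated in one login session."""
    return session * (POST_FACTOR if has_post else 1)


def online_seconds(events: Iterable[Event]) -> int:
    """Return the weighted online time for one user's time-ordered events."""
    total = 0
    session = 0       # seconds accumulated in the current login session
    has_post = False  # whether the current session contains a post
    prev = None       # time of the previous operation; None when logged out

    for ts, action in events:
        timed_out = prev is not None and ts - prev > MAX_GAP
        if timed_out or action == "login":
            # Close the current session (rule 3 or a new login, rule 1).
            total += weighted(session, has_post)
            session, has_post = 0, False
            prev = ts if action == "login" else None
            continue
        if prev is None:
            continue  # operation outside any session: ignored (assumption)
        gap = ts - prev
        if gap >= MIN_GAP:  # rule 2: skip very short intervals
            session += gap
        has_post = has_post or action == "post"
        prev = ts
        if action == "logout":
            total += weighted(session, has_post)
            session, has_post, prev = 0, False, None

    # Rule 4: no logout seen, so the last operation closes the session.
    return total + weighted(session, has_post)
```

For example, a user who logs in at second 0, acts at seconds 10 and 12, posts at second 100, and then goes silent for more than 600 seconds accumulates 10 + 88 = 98 seconds (the 2-second gap is dropped), which the post triples to 294.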

After receiving the task, the IT engineer considered the possibility of future adjustments to the computation, plus the need for added conditions. The decision was to use an out-of-memory cursor and a for loop to implement the computation.

Analysis showed that most user behavior analyses are done for each user independently. Thus, if the log files are pre-sorted by userid, the various analytical computations become faster to run, easier to write, and shorter in processing time. The pre-processing program is as follows:

As we can see, pre-processing means that we sort the seven days of log files and output them to a binary file. This eliminates the need for subsequent consolidation and sorting. Meanwhile, the binary files provided by esProc also help to raise read/write performance.
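The same idea can be sketched in Python for readers without esProc at hand. This is a rough analogue rather than the esProc program itself. The CSV column layout (userid, timestamp in seconds, action) and the function names are assumptions made for illustration, and pickle stands in for esProc's binary file format. The sketch holds each daily file in memory while sorting, whereas a true out-of-memory sort would spill sorted runs to disk.

```python
import csv
import heapq
import pickle
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, int, str]  # (userid, timestamp, action), assumed layout


def sort_key(record: Record) -> Tuple[str, int]:
    """Order records by userid first, then by time."""
    return (record[0], record[1])


def read_day_sorted(path: str) -> List[Record]:
    """Load one daily CSV log and sort it by (userid, timestamp)."""
    with open(path, newline="") as f:
        rows = [(uid, int(ts), action) for uid, ts, action in csv.reader(f)]
    rows.sort(key=sort_key)
    return rows


def merge_to_binary(day_paths: Iterable[str], out_path: str) -> int:
    """Merge the sorted daily logs into one binary file; return the record count."""
    days = [read_day_sorted(p) for p in day_paths]
    count = 0
    with open(out_path, "wb") as out:
        # heapq.merge combines already-sorted inputs without re-sorting them.
        for record in heapq.merge(*days, key=sort_key):
            pickle.dump(record, out)
            count += 1
    return count


def read_binary(path: str) -> Iterator[Record]:
    """Stream records back from the binary file one at a time."""
    with open(path, "rb") as f:
        while True:
            try:
                yield pickle.load(f)
            except EOFError:
                return
```

Because the merged file is ordered by userid and then by time, every later analysis can read one user's records as a contiguous run. Note that string userids sort lexicographically in this sketch.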

After pre-processing, the code for online time computation can be written as follows:

Note that:
1. The volume of data for one user over seven days is not big. Thus in cell A5 we can retrieve all log data for a user into memory in one batch.

2. In the loop that runs once per user, the code in the red box implements the new business logic: when a post operation is performed, the user's online time for the current login session is tripled in the computation. Unqualified records are removed in cell B9, and in B10 we calculate a serial number for every login (lognum). Records are grouped in B10 according to lognum, to compute the sum of online time for each group. If there is at least one "post" action in the current group of operations, the sum of online time for that group is tripled.

3. Because the result set is relatively large, each time the computation has been done for 10,000 users and the results in memory reach 10,000 lines, we output that batch from memory to a result file. This improves performance while avoiding memory overflow.
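The per-user loop and the batched output described in these notes can be sketched in Python as follows. It is a hedged illustration only: the record layout (userid, timestamp, action), the function name process_users, and the per_user callback are hypothetical, and the input is assumed to be already sorted by userid and then by time. itertools.groupby plays the role of fetching one user's records at a time.

```python
import csv
from itertools import groupby
from typing import Callable, Iterable, List, Tuple

Record = Tuple[str, int, str]  # (userid, timestamp, action), assumed layout
Event = Tuple[int, str]        # (timestamp, action) for a single user


def process_users(records: Iterable[Record],
                  per_user: Callable[[List[Event]], int],
                  out_path: str,
                  batch_size: int = 10_000) -> int:
    """Compute one result per user and write results in batches.

    Returns the number of users processed. `records` must be sorted by userid.
    """
    buffer = []  # results waiting to be written
    users = 0
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        for userid, group in groupby(records, key=lambda r: r[0]):
            # One user's data is small, so load it into memory in one batch.
            events = [(ts, action) for _, ts, action in group]
            buffer.append((userid, per_user(events)))
            users += 1
            if len(buffer) >= batch_size:
                # Flush a full batch so results never pile up in memory.
                writer.writerows(buffer)
                buffer.clear()
        writer.writerows(buffer)  # write the final partial batch
    return users
```

Passing the per-user computation as a callback keeps the streaming and batching logic separate from the business rules, so a later change to the rules does not touch the loop.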

After meeting this demand, the IT engineers at the Web Company found that the single-threaded program does not take full advantage of the server's computing power. This raises another question: can the engineers leverage esProc's multi-threaded parallel computing capabilities to take full advantage of the server's multi-core CPUs? Is it troublesome to shift from single-threaded to multi-threaded? See "Computing the Online Time for users with esProc (III)".

September 14, 2014
