September 11, 2014

Using esProc to Compute the Online Time of Users (I)

As the operator of an online system, the Web Company considers the time users spend in its online application a key metric to analyze. Specifically, online time refers to the cumulative time a user spends in the online business application over a given period.

As the company's online application has evolved, the total number of users has grown and user behavior analysis has become more complex. Here, we use the example of computing users' online time to show various computing scenarios, ranging from simple to complex. Hopefully this can serve as a reference for similar development projects. In fact, the following approach is also applicable to other categories of user behavior analysis, such as user activity level, user churn, etc.

Let's start from the time when the application had just gone online. The Operation Department needed to know users' online time with the application every week. For this, the engineers from the IT Department provided the following solution.

At the Web Company, user behavior information is recorded in log files, and a separate log file is generated every day. For example, the following log file, "2014-01-07.log", contains users' online actions on January 7, 2014. To compute users' online time for the week of 2014-01-05 to 2014-01-11, we need to retrieve data from 7 log files:


logtime    userid       action
2014-01-07 09:27:56        258872799       login
2014-01-07 09:27:57        264484116       login
2014-01-07 09:27:58        264484279       login
2014-01-07 09:27:58        264548231       login
2014-01-07 09:27:58        248900695       login
2014-01-07 09:28:00        263867071       login
2014-01-07 09:28:01        264548400       login
2014-01-07 09:28:02        264549535       login
2014-01-07 09:28:02        264483234       login
2014-01-07 09:28:03        264484643       login
2014-01-07 09:28:05        308343890       login
2014-01-07 09:28:08        1210636885     post
2014-01-07 09:28:09        263786154       login
2014-01-07 09:28:12        263340514       get
2014-01-07 09:28:13        312717032       login
2014-01-07 09:28:16        263210957       login
2014-01-07 09:28:19        116285288       login
2014-01-07 09:28:22        311560888       login
2014-01-07 09:28:25        652277973       login
2014-01-07 09:28:34        310100518       login
2014-01-07 09:28:38        1513040773     login
2014-01-07 09:28:41        1326724709     logout
2014-01-07 09:28:45        191382377       login
2014-01-07 09:28:46        241719423       login
2014-01-07 09:28:46        245054760       login
2014-01-07 09:28:46        1231483493     get
2014-01-07 09:28:48        266079580       get
2014-01-07 09:28:51        1081189909     post
2014-01-07 09:28:51        312718109       login
2014-01-07 09:29:00        1060091317     login
2014-01-07 09:29:02        1917203557     login
2014-01-07 09:29:16        271415361       login
2014-01-07 09:29:18        277849970       login

Log files record, in chronological order, the user's operation (action), the user ID (userid) and the time when the action took place (logtime) in the application. User operations fall into three types: login, logout and get/post actions.
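To make the record layout concrete, here is a minimal Python sketch of parsing one log line into its three fields. It is only an illustration, not the esProc code used by the IT Department. The names LogRecord and parse_line are hypothetical, and the sketch assumes that fields are separated by runs of whitespace, as in the sample above.

```python
from datetime import datetime
from typing import NamedTuple


class LogRecord(NamedTuple):
    """One line of the log (hypothetical helper type for this sketch)."""
    logtime: datetime
    userid: str
    action: str


def parse_line(line: str) -> LogRecord:
    # Assumption: fields are separated by runs of whitespace, and logtime
    # itself consists of two tokens (date and time).
    date_part, time_part, userid, action = line.split()
    logtime = datetime.strptime(f"{date_part} {time_part}", "%Y-%m-%d %H:%M:%S")
    return LogRecord(logtime, userid, action)


record = parse_line("2014-01-07 09:28:41        1326724709     logout")
print(record.userid, record.action, record.logtime.isoformat())
```

The userid is kept as a string here because it is only used as a grouping key, never in arithmetic.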

The Operation Department provided the following requirements for computing users' online time:
1. Login should be considered the starting point of online time, and overnight sessions should be taken into consideration.

2. If the time interval between any two operations is less than 3 seconds, then this interval should not be added to online time.

3. If after login, the time interval between any two operations is longer than 600 seconds, then the user should be considered as logged out.

4. If there is only a login, without a logout, then the time of the last operation should be treated as the logout time.
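The two thresholds in requirements 2 and 3 can be sketched in Python as follows. This is an illustrative reading of the rules, not the original esProc code. The function names and the labels "ignored", "counted" and "timeout" are hypothetical. Working with full timestamps (date plus time) is one simple way to handle the overnight case in requirement 1.

```python
from datetime import datetime

MIN_GAP_SECONDS = 3      # requirement 2: shorter intervals are not counted
TIMEOUT_SECONDS = 600    # requirement 3: longer intervals mean the user logged out


def gap_seconds(previous: datetime, current: datetime) -> float:
    # Subtracting full timestamps gives the right interval across midnight,
    # so overnight sessions need no special case.
    return (current - previous).total_seconds()


def classify_gap(seconds: float) -> str:
    # Hypothetical labels for the three possible outcomes of one interval.
    if seconds < MIN_GAP_SECONDS:
        return "ignored"
    if seconds > TIMEOUT_SECONDS:
        return "timeout"
    return "counted"


overnight = gap_seconds(datetime(2014, 1, 7, 23, 58, 0), datetime(2014, 1, 8, 0, 3, 0))
print(overnight, classify_gap(overnight))
```

An interval of exactly 3 seconds or exactly 600 seconds is treated as counted here, following the wording "less than 3 seconds" and "longer than 600 seconds".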

As the online application had just been rolled out, the data volume of the log files was relatively small. To compute on the log data for 2014-01-05 to 2014-01-11, we can load all the data into memory in one batch and write the result out to a file. Thus all the code here is written for in-memory computing.
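As a rough illustration of the in-memory approach, the following Python sketch builds the seven daily file names for the week and reads every line into one list. It is not the esProc code. The helper names week_log_names and load_lines are hypothetical, and the sketch assumes that each file is named after its date (as in "2014-01-07.log") and starts with a header row.

```python
from datetime import date, timedelta
from pathlib import Path


def week_log_names(first_day: date, days: int = 7) -> list[str]:
    # One log file per day, named after the date, e.g. "2014-01-07.log".
    return [f"{(first_day + timedelta(days=i)).isoformat()}.log" for i in range(days)]


def load_lines(log_dir: Path, names: list[str]) -> list[str]:
    lines: list[str] = []
    for name in names:
        path = log_dir / name
        if not path.exists():
            continue  # assumption: a day without a log file is simply skipped
        with path.open(encoding="utf-8") as handle:
            next(handle, None)  # assumption: the first line is the header row
            lines.extend(line.rstrip("\n") for line in handle if line.strip())
    return lines


names = week_log_names(date(2014, 1, 5))
print(names[0], names[-1])
```

Holding every line in a single list is only reasonable while the weekly data volume stays small, which is the situation described above.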

The IT Department leverages esProc to meet the above requirements.

The actual code is as follows:


The ideas for program design are:

1. First, retrieve all log files for the week (2014-01-05 to 2014-01-11) and merge them in chronological order. Sort the records by userid and logtime. Add two extra fields, onlinetime and loginflag, for subsequent calculations.

2. onlinetime holds the interval between two consecutive operations by the same user. If the difference between the time of the current record and that of the previous one is less than 3 seconds, or if the userid of the current record differs from that of the previous one, onlinetime is set directly to 0.

3. loginflag indicates whether an online time is valid. If onlinetime does not exceed 10 minutes (600 seconds), or the operation type is logout, loginflag is set to true; otherwise it is set to false. If the operation is a login, loginflag is set directly to true.

4. On the sorted table resulting from the previous steps, compute loginflag again. If loginflag was originally false, leave it false. If it was originally true, its new value depends on the type of the previous operation: if the previous operation was a login, loginflag stays true; otherwise it is set to false.

5. On the sorted table resulting from the previous steps, group the data by userid. Compute the sum of onlinetime for all records whose loginflag is true. This is the total online time for each user.

6. Output the result of the last step to a file, onlinetime.data.
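The flow above can be approximated in Python as a minimal sketch. It is not a transliteration of the esProc code. Steps 1, 2 and 5 are followed directly, but steps 3 and 4 are folded into a single pass that tracks whether the user is currently in a session. That session tracking is an assumed reading of the requirements rather than the original two-pass loginflag logic. The function name online_seconds is hypothetical, and writing the result to onlinetime.data (step 6) is left out.

```python
from collections import defaultdict
from datetime import datetime

MIN_GAP_SECONDS = 3
TIMEOUT_SECONDS = 600


def online_seconds(records):
    """records: iterable of (logtime, userid, action) tuples."""
    # Step 1: sort by userid, then by logtime.
    ordered = sorted(records, key=lambda r: (r[1], r[0]))
    totals = defaultdict(float)
    prev_user, prev_time, in_session = None, None, False
    for logtime, userid, action in ordered:
        if userid != prev_user:
            gap, in_session = 0.0, False  # first record of a new user
        else:
            gap = (logtime - prev_time).total_seconds()
        if gap > TIMEOUT_SECONDS:
            in_session = False  # the user is treated as logged out
        # Step 2: intervals shorter than 3 seconds count as 0.
        onlinetime = gap if gap >= MIN_GAP_SECONDS else 0.0
        # Assumed reading of steps 3 and 4: an interval is valid only if
        # the user was in a session when it started and did not time out.
        loginflag = in_session
        # Step 5: sum the valid intervals per user.
        totals[userid] += onlinetime if loginflag else 0.0
        if action == "login":
            in_session = True
        elif action == "logout":
            in_session = False
        prev_user, prev_time = userid, logtime
    return dict(totals)


sample = [
    (datetime(2014, 1, 7, 9, 0, 0), "a", "login"),
    (datetime(2014, 1, 7, 9, 0, 10), "a", "get"),
    (datetime(2014, 1, 7, 9, 0, 12), "a", "post"),    # 2 s gap, ignored
    (datetime(2014, 1, 7, 9, 0, 30), "a", "logout"),
    (datetime(2014, 1, 7, 9, 0, 0), "b", "get"),      # no login, not counted
    (datetime(2014, 1, 7, 9, 0, 20), "b", "get"),
]
print(online_seconds(sample))  # {'a': 28.0, 'b': 0.0}
```

In this sketch a user with no logout needs no special handling: only intervals between recorded operations are summed, so the last operation effectively acts as the logout time.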

The advantage of the above code lies in its step-by-step computation, which makes it easy to maintain and modify.

After the program had been in use for a while, a new problem arose. On the one hand, the Operation Department said that the original way of computing online time should be adjusted, with new conditions added. On the other hand, as the number of users increased, the log files grew too large to fit into memory in one batch. How should the IT Department cope with these changes in requirements?

Please see "Computing the Online Time of users with esProc (II)".