esProc, A Script Language for Data Analytics with Parallel Mechanism: Computing the Online Time for Users with esProc (IV)

In the last article, we mentioned that IT engineers at the Web Company used esProc to write a single-machine multi-threaded program that could handle large data volumes and complex requirements, leveraging the full power of one multi-core, multi-CPU machine. Now the engineers have run into a new issue: with the number of users of the online application growing explosively, colleagues from the Operations Department complained that the online time computation program was still running too slowly.

The IT engineers used esProc's multi-machine parallel computing capability to split the task across multiple machines, which resolved the performance problem. Single-machine parallel processing was replaced by multi-machine parallel processing at a relatively low cost in hardware and software upgrades.

To improve performance, the Web Company increased the number of servers from 1 to 3. Accordingly, the following steps were needed to move from single-machine parallel to multi-machine parallel processing:

The first step: modify the esProc program that processes the weekly log file. Divide each user ID by 3 and separate the weekly log file into 3 files according to the remainder, so that each server processes one of them. This reduces the file size and shortens the file transfer time. The three files are then uploaded to the three servers, where multiple parallel tasks do the computation. The actual program is as follows:

Note in the last screenshot that A6 uses the @g option of the export function to write the log data for one week into three binary files. During subsequent parallel processing, the content of each log file can be retrieved in blocks for different users. The @g option ensures that segmented data retrieval is aligned to group borders, so that data for the same user is never assigned to two blocks.
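To make the splitting rule concrete, here is a minimal Python sketch of the same idea. It is not the esProc program from the screenshot; the field names (`userID`, `logtime`, `type`) and the sample rows are assumptions made for illustration only.

```python
def split_by_user_remainder(rows, parts=3):
    """Partition log rows into `parts` buckets by userID modulo `parts`."""
    buckets = [[] for _ in range(parts)]
    for row in rows:
        # All rows of one user share the same remainder, so they land in one bucket.
        buckets[row["userID"] % parts].append(row)
    for bucket in buckets:
        # Keep each user's rows contiguous and in time order within the bucket.
        bucket.sort(key=lambda r: (r["userID"], r["logtime"]))
    return buckets


# Hypothetical sample rows; the real log layout may differ.
sample = [
    {"userID": 7, "logtime": "2014-09-01 08:00:00", "type": "login"},
    {"userID": 3, "logtime": "2014-09-01 08:05:00", "type": "login"},
    {"userID": 8, "logtime": "2014-09-01 08:10:00", "type": "login"},
    {"userID": 7, "logtime": "2014-09-01 09:00:00", "type": "logout"},
    {"userID": 3, "logtime": "2014-09-01 09:30:00", "type": "logout"},
]

parts = split_by_user_remainder(sample)
for remainder, bucket in enumerate(parts):
    print(remainder, [r["userID"] for r in bucket])
```

Because the bucket is chosen from the user ID alone, each server can compute online time for its own users without needing data held by the other two.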

The second step: the single-machine multi-threaded program is unchanged. Let's review it.

The subroutine parameters are shown below. They pass the weekly log file name, the block number, and the total number of blocks when the subroutine is called by the main program. Here the weekly log file name, week file, is already one of the three segmented files, namely the one corresponding to this machine.

The subroutine is as follows:

The above screenshot illustrates that:

1. Because we previously used export @g to write the file grouped by user ID, the @z option of the cursor in A2, which reads one specific block (given by the block number) out of the total (given by the total number of blocks), retrieves complete groups for each userID. Data for one user will not be split into two blocks.

2. The code line in the red box returns the resulting file as a cursor to the main program. Since multi-machine parallel processing is used here, this cursor is a remote cursor (see the esProc documentation for a detailed introduction to remote cursors).
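The group-aligned block reading described in point 1 can be sketched in Python as follows. This only illustrates the idea of moving block boundaries to group borders; it is not how esProc implements @g and @z internally, and the function name `read_block` and the `userID` field are assumptions.

```python
def read_block(records, block, total, key=lambda r: r["userID"]):
    """Return the block-th (1-based) of `total` segments of `records`.

    `records` must already be ordered so that rows of one user are contiguous.
    Each tentative cut point is moved forward to the next group border, so
    one user's rows never straddle two blocks.
    """
    n = len(records)

    def aligned(pos):
        # Advance while the cut would fall inside a run of the same user.
        while 0 < pos < n and key(records[pos]) == key(records[pos - 1]):
            pos += 1
        return pos

    start = aligned((block - 1) * n // total)
    end = aligned(block * n // total)
    return records[start:end]


# Hypothetical data: ten rows for four users, already grouped by userID.
records = [{"userID": u} for u in [1, 1, 1, 2, 2, 3, 3, 3, 3, 4]]
blocks = [read_block(records, b, 3) for b in (1, 2, 3)]
for b in blocks:
    print([r["userID"] for r in b])
```

The end of one block is computed the same way as the start of the next, so the blocks cover every row exactly once. The trade-off is that blocks can be uneven in size, and a block may even be empty when one user has very many rows.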

The third step: write the main program for parallel computing, which calls the parallel computing subroutine. As illustrated below, the main program runs parallel tasks on three machines, which effectively improves computation performance.

The server list in the program could also be written into a configuration file, so that adding or removing servers later would be easy.

Note: for specific measurements of esProc's performance gain with parallel computing, please refer to the related esProc test reports.

Notes on the above screen capture:

1. The callx parameter specifies the 3 servers from A1 to A3 to handle the three log files B1 to B3.

2. In callx's input parameters, the three servers are specified through A5, and 6 parallel computing tasks for each server are specified in A6.

3. The server list, the number of servers, and the number of tasks per server can be adjusted according to the actual situation, to make full use of the servers' performance.
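The dispatch pattern in these notes (3 servers, 6 tasks each, with the server list kept in configuration) can be sketched in Python as below. This is not the callx syntax. The configuration layout, the server addresses, and the file names are all hypothetical, and `run_subroutine` is a local stand-in for what would be a remote call in a real deployment.

```python
import json
from concurrent.futures import ThreadPoolExecutor

# Hypothetical configuration; addresses and file names are placeholders.
CONFIG = """
{
  "tasks_per_server": 6,
  "servers": [
    {"address": "node1.example:9000", "file": "week_part0.bin"},
    {"address": "node2.example:9000", "file": "week_part1.bin"},
    {"address": "node3.example:9000", "file": "week_part2.bin"}
  ]
}
"""


def build_tasks(config):
    """One task per (server, block): each server reads its own file in blocks."""
    total = config["tasks_per_server"]
    return [
        (server["address"], server["file"], block, total)
        for server in config["servers"]
        for block in range(1, total + 1)
    ]


def run_subroutine(task):
    # Stand-in for the remote subroutine call; it only describes the work.
    address, file_name, block, total = task
    return f"{address}: {file_name} block {block}/{total}"


def run_all(tasks, worker=run_subroutine):
    # Run all tasks concurrently; results come back in task order.
    with ThreadPoolExecutor(max_workers=len(tasks)) as pool:
        return list(pool.map(worker, tasks))


config = json.loads(CONFIG)
tasks = build_tasks(config)
results = run_all(tasks)
print(len(results), results[0])
```

With the server list in configuration, adding a fourth server or changing the number of tasks per server means editing the configuration rather than the dispatch code.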

The fourth step: deploy the esProc server and upload the related program and data files. Refer to the esProc documentation for specific steps and methods.

After the move to multi-machine parallel computing, the Operations Department saw a significant improvement in the speed of the user online time computation. The cost of this transformation was much lower than that of upgrading the application database. On the hardware side in particular, only 2 additional PC servers were needed.

So far, the Web Company has finished implementing its esProc-based user behavior analysis and computation platform. Its main advantages are:

1. The platform is easy to adjust for more complex algorithms in the future, which shortens response time and saves engineers' labor costs.

2. It is easy to scale out for even larger data volumes in the future, with shorter project time and lower upgrade cost.


September 16, 2014
