May 22, 2014

A Data Analysis Language/Script with Parallelism Feature

esProc is a data analysis language that is easy to code in, strongly interactive, equipped with dedicated debugging, and flexible in syntax. In particular, esProc can perform parallel computation, which makes it a good fit for big data analysis.

For example, a commercial website generates several thousand access logs daily. One objective in analyzing user behavior from these logs is to compute how long each user spends browsing the products of each category in a specified time period. This analysis involves computing over several TB of data. The typical single-machine implementation is impractical because it takes several hours or even days to complete. By comparison, the parallelism of esProc enables users to finish it in 10 to 20 minutes.

esProc supports the following parallel computing procedure. The summary machine receives the external parameters, decomposes one large job into N small jobs, and distributes them in order to M node machines, where N is greater than M. Each node machine analyzes the data of some users; for example, for the users whose initial is A, it computes the time they spend browsing the products of each category. On completing the computation, each node machine returns the result to the summary machine and proceeds to the next small job, for example analyzing the users whose initial is L. Once all small jobs are completed, the summary machine merges the results and outputs them, for instance by displaying them in the IDE, printing them to the console, or returning them via JDBC.
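The procedure above can be sketched in Python (not esProc syntax). This is a minimal illustration under stated assumptions: the log records, the `small_job` and `run_parallel` names, and the use of a thread pool as a stand-in for node machines are all hypothetical and are not taken from esProc itself.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Hypothetical access-log records: (user, category, seconds spent browsing).
LOGS = [
    ("alice", "books", 120),
    ("adam", "toys", 30),
    ("alice", "toys", 45),
    ("lena", "books", 60),
    ("liam", "books", 15),
    ("lena", "books", 40),
]


def small_job(initial, logs):
    """One small job: total browsing time per (user, category)
    for the users whose name starts with the given initial."""
    totals = defaultdict(int)
    for user, category, seconds in logs:
        if user[0].upper() == initial:
            totals[(user, category)] += seconds
    return dict(totals)


def run_parallel(logs, initials, workers):
    """Summary step: one small job per initial (N jobs), handed to a pool
    of `workers` threads (M workers, standing in for node machines)."""
    merged = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # All N jobs are queued; a worker takes the next job once it is free.
        for partial in pool.map(lambda i: small_job(i, logs), initials):
            # Partitions are disjoint by user, so keys never collide.
            merged.update(partial)
    return merged


# N = 3 small jobs distributed over M = 2 workers.
result = run_parallel(LOGS, initials=["A", "B", "L"], workers=2)
```

Partitioning by user initial keeps each user's records inside a single small job, so the merge step only has to combine disjoint partial results.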

esProc is a data analysis language with built-in parallelism.

Its major advantages are as follows:

Easy to code. An esProc script is written on a grid, i.e. a cellset, so the computational logic can be laid out conveniently in a 2D space, and a business algorithm can be translated into computer language more easily. The grid-style presentation gives an intuitive view of code indentation and scope, and streamlines cell reference and reuse. Each cell represents one computing unit or step. Cells can reference one another by their natural cell names, so users do not need to define variables. By clicking a cell, users can inspect its computed result directly, without searching through a long list of variables.

Strong interactivity. esProc advocates step-by-step computation: decompose a complex goal into several simple steps in a grid, then accomplish each step in turn to reach the final goal. In this way, a complex computing goal can be simplified and solved with much higher development efficiency. esProc is particularly strong at this kind of step-by-step, interactive computation. Users can decide on the best algorithm for the next step based on what they see in the current cell's data, write the next step by referencing the previous computations, and reach the final goal through incremental processing. A goal that is obscure at first becomes clearer and more concrete with each step of the interaction.

Much more convenient debugging. Designed around the step-by-step approach, esProc provides a genuinely practical debugger, with functions such as breakpoints, stepping, run to cursor, start, and end. Unlike SQL or stored procedures, which offer no true debugging, esProc lets users debug directly, without creating intermediate tables just for debugging. A breakpoint can be set at any position without altering the code. Before proceeding to a summarizing step, users can even visually check the data to ensure they are grouped as expected. In analysis work, as much as 90% of the time is spent on debugging, so a purpose-built debugger can dramatically reduce both errors and the length of the analysis cycle.

Flexible implementation of analysis objectives. esProc supports a true set data type. A member of a set can be a value of any simple data type, a record, or another set. Sets simplify structured data computation, so users find it easier to express arbitrary computations from the business perspective. esProc supports ordered sets, which means that users can access set members by position and freely perform computations related to sequence numbers, such as ranking, sorting, link relative ratio (comparison with the previous period), and comparison with the same period of an earlier year. With the "set of sets" mechanism to represent grouping, esProc can easily solve various equal, align, and enum grouping problems, such as computing relative positions in multi-level groupings, or grouping and summarizing by a specified set.
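The ordered-set and "set of sets" ideas above can be illustrated with a minimal Python sketch (not esProc syntax). The sample data and variable names are hypothetical, and the align grouping shown here simply drops records whose key is not in the specified set, which is an assumption of this sketch.

```python
# Hypothetical monthly sales, kept as an ordered list of (month, amount).
sales = [("Jan", 100.0), ("Feb", 120.0), ("Mar", 90.0)]

# Link relative ratio: each period's amount divided by the previous period's.
# Access by position is what makes this a one-liner on an ordered set.
link_relative = [
    (month, amount / sales[i - 1][1])
    for i, (month, amount) in enumerate(sales)
    if i > 0
]

# Ranking: position 1 goes to the largest amount.
by_amount = sorted(sales, key=lambda record: record[1], reverse=True)
rank = {month: pos for pos, (month, _) in enumerate(by_amount, start=1)}

# Align grouping: group records by a specified list of keys, in that order.
# The result is a "set of sets": one inner list per key, possibly empty.
orders = [("east", 5), ("west", 7), ("east", 3), ("south", 1)]
regions = ["west", "east", "north"]
groups = [[o for o in orders if o[0] == region] for region in regions]

# Summarize each group; an empty group still yields a row with total 0.
totals = [sum(quantity for _, quantity in group) for group in groups]
```

Because each group is itself a list, later steps can still reach the individual records inside a group instead of only its aggregate.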

Support for big data analysis. esProc users can easily compute over several TB of data from databases or files. With the parallel computing framework, massive data can be distributed to multiple computing nodes, so that each node only has to handle a small portion of the data. esProc supports distributed computing at multiple levels: each node can act either as a main node that distributes and summarizes, or as a sub-node that performs the detailed computation. A node machine can be a high-end server or an inexpensive PC, running Windows or Linux.

R, Python, and Perl are also common data analysis languages, but they are far less suited to big data than esProc.

R is a computing tool for scientists. Although R has extremely rich extension packages and powerful computing and analysis capabilities, its parallelism is poor. For real parallel applications, R users have to integrate R with third-party software, and the stability and reliability of such setups remain doubtful. In addition, although R has powerful library functions, it performs poorly when executing custom code, and even more poorly in data traversal and similar computations. The syntax of R is also too obscure and specialized for average users to understand.

Python and Perl are both powerful at string processing. But they offer only imperfect support for parallel computing on a single machine, and they rely on third-party software for distributed parallel computing, for example being called through the Streaming interface of Hadoop. Given the notoriously low performance of Hadoop, performance drops further when Hadoop is combined with Python or Perl. According to published reports, Python and Perl perform far worse on Hadoop than Java does. In addition, both Python and Perl lack an object type for structured two-dimensional data. Because commercial data is structured and massive in most cases, people who develop such applications in Python or Perl will find their efficiency comparatively much lower.

To conclude, let’s review the example at the beginning of this article. The core code of the node machine is as follows:
