esProc is a data analysis language that is easy to code in, highly interactive, equipped with dedicated debugging, and agile and flexible in syntax. In particular, esProc can also perform parallel computation, which makes it a good fit for big data analysis.
For example, a commercial website generates several thousands of access logs daily. To analyze user behavior based on these logs, one of the analysis objectives is to compute how long each user spends browsing the products of each category in a specified time period. This objective involves computing over several TB of data. The typical single-machine implementation is impractical because it takes several hours or even days to complete. By comparison, the parallelism of esProc enables users to achieve it in 10-20 minutes.
esProc enables the following parallel computing procedure: the summary machine receives the external parameters, decomposes one large job into N small jobs, and distributes the N small jobs in order to M node machines, where N is greater than M. Each node machine is responsible for analyzing the data of some users, for example, computing the time that users whose initial is A spend browsing the products of each category. On completing the computation, each node machine returns the computed result to the summary machine and proceeds to the next small job, for example, analyzing users whose initial is L. Once all small jobs are completed, the summary machine merges the results and outputs them, for instance by displaying them directly in the IDE, printing them on the console, or returning them via JDBC.
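The procedure above can be sketched in Python to make the data flow concrete. This is only an illustration under stated assumptions, not esProc code: the record layout, the sample logs, and the names `split_jobs`, `node_job`, and `run` are all hypothetical, and a local thread pool stands in for the M node machines.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

# Hypothetical access-log records: (user, category, visit start, visit end).
LOGS = [
    ("alice", "books", "2013-01-01 10:00:00", "2013-01-01 10:05:00"),
    ("adam", "toys", "2013-01-01 11:00:00", "2013-01-01 11:02:00"),
    ("alice", "books", "2013-01-02 09:00:00", "2013-01-02 09:01:00"),
    ("lena", "books", "2013-01-01 12:00:00", "2013-01-01 12:10:00"),
    ("bob", "toys", "2013-02-01 08:00:00", "2013-02-01 08:30:00"),
]


def split_jobs(logs):
    """Summary machine: build one small job per user initial."""
    jobs = defaultdict(list)
    for record in logs:
        jobs[record[0][0].upper()].append(record)
    return jobs


def node_job(records, period_start, period_end):
    """Node machine: seconds spent per (user, category) inside the period."""
    totals = defaultdict(float)
    for user, category, start, end in records:
        t0 = datetime.strptime(start, FMT)
        t1 = datetime.strptime(end, FMT)
        if period_start <= t0 and t1 <= period_end:
            totals[(user, category)] += (t1 - t0).total_seconds()
    return dict(totals)


def run(logs, period_start, period_end, nodes=2):
    """Distribute the N small jobs over `nodes` workers and merge the results."""
    jobs = split_jobs(logs)
    merged = {}
    with ThreadPoolExecutor(max_workers=nodes) as pool:
        futures = [
            pool.submit(node_job, records, period_start, period_end)
            for _, records in sorted(jobs.items())
        ]
        for future in futures:
            # Jobs partition the users, so keys never collide across jobs.
            merged.update(future.result())
    return merged


result = run(LOGS, datetime(2013, 1, 1), datetime(2013, 1, 31, 23, 59, 59))
```

With the sample data there are three small jobs (initials A, B, and L) and two workers, so a worker that finishes one job picks up the next, as in the description. Because the jobs split the users into disjoint groups, the merge step is a plain union of the partial results.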
esProc is a data analysis language with built-in parallelism. Its major advantages are as follows:
Easy to code. An esProc script is written on a grid, i.e. a cellset, so the computational logic can be laid out conveniently in a 2D space, and the business algorithm can be translated into computer language more easily. The grid-style presentation gives an intuitive view of code indentation and scope, and streamlines cell reference and reuse. Each cell represents one computing unit or step. Using its natural cell name, one cell can reference another without requiring users to define any variables. By clicking a cell, users can inspect its computed result directly, with no need to search through a long list of variables.
Strong interactivity. esProc advocates step-by-step computation: decompose a complex goal into several simple steps in a grid, and accomplish each step in turn to ultimately achieve the final goal. In this way, a complex computing goal can be simplified and solved with much higher development efficiency. esProc is particularly strong at step-by-step computation and interaction. Users can decide on the best algorithm for the next step based on the data in the current cell, write the next step by referencing the previous computations, and reach the final goal incrementally. A goal that starts out vague becomes clearer and more concrete with each step of the interaction.
More convenient debugging. Designed around the "step-by-step" idea, esProc provides a truly practical debugging function, including breakpoints, stepping, run to cursor, start, and end. Unlike the makeshift debugging of SQL or stored procedures, esProc debugs directly and does not require intermediate tables created just for debugging. A breakpoint can be set at any position without altering the code. Before proceeding to a summarizing step, users can even visually check that the data are grouped as expected. In the course of analysis, 90% of the time is spent on debugging, so a purpose-built debugging function can dramatically reduce errors and shorten the analysis cycle.
Flexible implementation of analysis goals. esProc supports a true set data type. A member of a set can be a value of any simple data type, a record, or another set. Sets simplify structured data computation, so users find it easier to express arbitrary computations from the business perspective. esProc also supports ordered sets, which means users can access set members by position and perform sequence-number-related computations such as ranking, sorting, link relative ratio, and same-period (year-over-year) comparison. With the "set of sets" mechanism to represent grouping, esProc can easily solve various equal, align, and enum grouping problems, such as computing relative positions in multi-level groupings, or grouping and summarizing by a specified set.
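To make the ordered-set and grouping ideas above more concrete, here is a small Python sketch. It is not esProc syntax; the sample figures and variable names are hypothetical, and plain Python lists stand in for ordered sets and for a "set of sets".

```python
# Hypothetical monthly sales, kept in chronological order (an ordered set).
sales = [100.0, 120.0, 90.0, 135.0]

# Link relative ratio: each value divided by the one just before it.
link_relative = [cur / prev for prev, cur in zip(sales, sales[1:])]

# Ranking: 1 marks the largest value; positions follow the original order.
descending = sorted(sales, reverse=True)
ranks = [descending.index(value) + 1 for value in sales]

# Grouping by a specified set: the result is a list of groups ("set of
# sets") in the order of `regions`; a region without records yields an
# empty group, and records outside the specified set are left out.
orders = [("east", 10), ("west", 5), ("east", 7), ("north", 3)]
regions = ["west", "east", "south"]
groups = [[o for o in orders if o[0] == region] for region in regions]

# Summarizing each group keeps the same order as `regions`.
totals = [sum(amount for _, amount in group) for group in groups]
```

Because the groups are kept as whole sets rather than collapsed into aggregates straight away, later steps can still look inside each group, for example to find a member's position within it.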
Support for big data analysis. esProc users can easily compute over several TB of data from databases or files. With the parallel computing framework, massive data can be distributed to multiple computing nodes, so that each node only has to process a small share of the data. esProc supports distributed computing at multiple levels: each node can act either as a main node that distributes and summarizes, or as a sub-node that carries out the detailed computing. A node machine can be a high-end server or an inexpensive PC, running a Windows client or a Linux server.
R, Python, and Perl are also common data analysis languages, but they are far less suited than esProc to big data.
R is a computing tool for scientists. Although R has extremely rich extension packages and powerful computing and analysis capabilities, its parallelism is poor. R users have to integrate R with third-party software for real parallel applications, and the stability and reliability of such setups are still doubtful. In addition, although R has powerful library functions, its performance is poor when executing custom code, and even poorer for data traversal and similar computations. The syntax of R is also too obscure and specialized for average users to understand.
Python and Perl are both powerful at string processing. But they offer only limited support for parallel computing on a single machine, and rely on third-party software for distributed parallel computing, for example by being called through the Streaming interface of Hadoop. Given Hadoop's notoriously low performance, the performance is even lower when Hadoop is integrated with Python or Perl. According to published reports, Python and Perl perform far worse than Java on Hadoop. In addition, both Python and Perl lack a native object type for structured two-dimensional data. Since commercial data is structured and massive in most cases, developing such applications in Python or Perl is comparatively much less efficient.
To conclude, let's review the example at the beginning of this article. The core code of the node machine is as follows: