In essence, most of the data analysis in routine business work
comes down to structured data computing. For example:
- Compute the link relative ratio (month-over-month change) and the year-on-year comparison of each business branch’s monthly sales over a specified period.
Implementation approach: filter the sales data by time range, then
group and summarize the data by business branch, year and month, and finally
perform ordered computations across rows and across groups.
- Select stocks whose closing price has risen for more than 10 consecutive days.
Implementation approach: group the daily transaction data by stock and
sort each group by date, compute the daily change in the share price and
the number of consecutive days on which the price rose, and keep only the
stocks that have risen for more than 10 consecutive days.
- Relate data from different sources, such as contract and payment information, to the project payment schedule and find the overdue projects.
Implementation approach: Perform relational computing between
heterogeneous data sources, then group, summarize and filter the data.
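The first example above can be sketched in plain Python. This is only an illustration under stated assumptions: the sample records, the function name `monthly_growth` and the choice to express each comparison as a ratio (current / previous) are all hypothetical. The sketch totals every month before restricting the report to the requested range, so that comparison months just outside the range are still available.

```python
from collections import defaultdict
from datetime import date

# Hypothetical sample records: (branch, sale date, amount).
SALES = [
    ("East", date(2022, 1, 15), 100.0),
    ("East", date(2022, 2, 10), 120.0),
    ("East", date(2023, 1, 20), 150.0),
    ("East", date(2023, 2, 5), 90.0),
    ("East", date(2023, 2, 25), 90.0),
    ("West", date(2023, 1, 3), 200.0),
    ("West", date(2023, 2, 8), 250.0),
]


def monthly_growth(sales, start, end):
    """Return {(branch, year, month): (total, mom_ratio, yoy_ratio)}.

    start and end are (year, month) tuples, both inclusive.
    A ratio is None when the month it compares against has no data.
    """
    # Group and summarize by branch, year and month.
    totals = defaultdict(float)
    for branch, day, amount in sales:
        totals[(branch, day.year, day.month)] += amount

    result = {}
    for (branch, year, month), total in sorted(totals.items()):
        # Keep only the months inside the requested reporting range.
        if not (start <= (year, month) <= end):
            continue
        # Cross-row lookup: the previous month of the same branch.
        if month == 1:
            prev = totals.get((branch, year - 1, 12))
        else:
            prev = totals.get((branch, year, month - 1))
        # Cross-group lookup: the same month of the previous year.
        last_year = totals.get((branch, year - 1, month))
        mom = total / prev if prev else None
        yoy = total / last_year if last_year else None
        result[(branch, year, month)] = (total, mom, yoy)
    return result


report = monthly_growth(SALES, start=(2023, 1), end=(2023, 2))
for key, values in report.items():
    print(key, values)
```

The two lookups are the "ordered" part of the problem: each summarized row needs values from a neighbouring row and from a neighbouring group.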
As these examples show, routine data analysis problems break down
into structured data operations: filtering, grouping, summarizing,
sorting, ranking and relational computing.
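As an illustration of how relational computing, grouping, summarizing and filtering combine, here is a minimal Python sketch of the overdue-projects example. The three sample sources, the function name `overdue_projects` and the definition of "overdue" (the amount scheduled up to a given date exceeds the amount actually paid by that date) are assumptions made for the sketch, not something the problem statement fixes.

```python
from collections import defaultdict
from datetime import date

# Hypothetical data from three sources, all keyed by project id.
CONTRACTS = {"P1": "Bridge repair", "P2": "Office fit-out", "P3": "Road survey"}
SCHEDULE = [  # (project, due date, amount due)
    ("P1", date(2024, 1, 31), 500.0),
    ("P1", date(2024, 3, 31), 500.0),
    ("P2", date(2024, 2, 28), 800.0),
    ("P3", date(2024, 6, 30), 300.0),
]
PAYMENTS = [  # (project, payment date, amount paid)
    ("P1", date(2024, 1, 20), 500.0),
    ("P2", date(2024, 2, 15), 300.0),
]


def overdue_projects(contracts, schedule, payments, as_of):
    """Return {project: shortfall} for projects behind schedule on `as_of`."""
    # Group and summarize: total amount scheduled up to the cut-off date.
    due = defaultdict(float)
    for project, due_date, amount in schedule:
        if due_date <= as_of:
            due[project] += amount

    # Group and summarize: total amount actually paid up to the cut-off date.
    paid = defaultdict(float)
    for project, paid_on, amount in payments:
        if paid_on <= as_of:
            paid[project] += amount

    # Relate both summaries to the contracts, then filter on the shortfall.
    overdue = {}
    for project in contracts:
        shortfall = due.get(project, 0.0) - paid.get(project, 0.0)
        if shortfall > 0:
            overdue[project] = shortfall
    return overdue


overdue = overdue_projects(CONTRACTS, SCHEDULE, PAYMENTS, as_of=date(2024, 4, 15))
print(overdue)
```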
Of course, we occasionally need to solve modeling or prediction
problems as well, for example, finding goods that are closely related to
each other or predicting which stock is likely to rise. These operations
require a good deal of mathematical knowledge that ordinary staff are
unlikely to have. They are an important kind of data analysis, but they
make up only a very small part of routine data analysis.
Structured data computing is therefore the focus. Many tools can
perform it, including R, Python, SQL and esProc.
R provides the dataframe data type for structured data computing.
However, R was originally designed for collecting and analyzing
scientific data, especially for matrix and vector computations. It
is not a specialized tool for structured data computing.
In fact, the dataframe is not where R's strength lies; R's strong
point is its algorithms for modeling and prediction, such as regression
analysis, ANOVA, agreement evaluation and the Bernoulli distribution,
which are seldom used in routine data analysis.
Pandas, a third-party Python library, can also perform structured
data computing. But, like R, it was designed for collecting and
analyzing scientific data rather than for structured data computing, so it
is not a specialized tool either. And, as with R, its functions center on
modeling and prediction and are seldom used in routine data analysis.
So although many tools can perform structured data computing, few
can be regarded as truly specialized for it. One that can is SQL, the
long-established database language.
SQL was designed purely for structured data computing. It is
specialized for the task and widely used.
Yet it also has drawbacks for routine data analysis. The most
obvious are its complicated application environment and its weakness in
ordered data computing. Installing, configuring, maintaining and managing
a SQL database is complicated. A SQL data set has no inherent serial
numbers, which puts SQL at a disadvantage in ordered data computing. This
covers common problems in routine data analysis such as the link relative
ratio, year-on-year comparison, fetching data in a relative interval,
ranking within groups and getting the top or bottom records. Most of the
examples mentioned at the beginning involve ordered data computing.
Although SQL can solve them, doing so is quite difficult.
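To show why positional access matters for ordered data computing, here is a minimal Python sketch of the consecutive-rise example from the beginning. It is an illustration only: the sample rows and the function names are hypothetical, and "rising for over 10 days" is read here as "the longest run of day-over-day increases found anywhere in the data is longer than 10", which is one possible interpretation.

```python
from collections import defaultdict
from datetime import date, timedelta


def longest_rising_streak(prices):
    """Longest run of consecutive day-over-day increases in an ordered list."""
    longest = current = 0
    for i in range(1, len(prices)):
        # Each row is compared with the row at the previous position.
        current = current + 1 if prices[i] > prices[i - 1] else 0
        longest = max(longest, current)
    return longest


def stocks_rising_over(rows, min_days):
    """Return the codes of stocks whose longest rising streak exceeds min_days."""
    # Group the daily rows by stock code.
    by_stock = defaultdict(list)
    for code, day, close in rows:
        by_stock[code].append((day, close))

    selected = []
    for code, series in by_stock.items():
        series.sort()  # order each group by date
        closes = [close for _, close in series]
        if longest_rising_streak(closes) > min_days:
            selected.append(code)
    return sorted(selected)


# Hypothetical sample: AAA rises every day, BBB falls back every third day.
START = date(2024, 1, 1)
ROWS = []
for i in range(13):
    day = START + timedelta(days=i)
    ROWS.append(("AAA", day, 10.0 + i))
    ROWS.append(("BBB", day, 20.0 + (i % 3)))

print(stocks_rising_over(ROWS, 10))
```

Because a sorted list has positions, "the previous row" is simply index `i - 1`. A SQL data set has no such position, so the same comparison needs extra constructs there.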
Like SQL, esProc is designed specifically for structured data
computing.
By comparison, esProc's application environment, installation and
configuration are simple. esProc can fetch data from databases and import
structured data directly from text files, logs and Excel. Moreover, an
esProc table sequence has inherent serial numbers, which makes ordered
data computing easy. Unfortunately, in esProc the syntax for external
memory computing differs from that for in-memory computing, so the two
cases require different code. In this respect, SQL has the more
consistent syntax.