October 1, 2014

Structured Data Computing: the Focus of Routine Data Analysis

  • Compute the link relative ratio and year-on-year comparison of each business branch’s monthly sales during a specified period of time.

Implementation approach: Filter the sales data by time range, group and summarize it by business branch, year and month, and finally perform ordered computations across rows and across groups.
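
The steps above can be sketched in plain Python. This is only an illustration under stated assumptions: the sample records, the function name monthly_comparison and the choice to express both comparisons as ratios (current total divided by the compared total) are hypothetical and not taken from any particular tool.

```python
from collections import defaultdict
from datetime import date

# Hypothetical sample records: (branch, sale date, amount).
sales = [
    ("East", date(2013, 1, 15), 100.0),
    ("East", date(2013, 2, 10), 120.0),
    ("East", date(2014, 1, 20), 150.0),
    ("East", date(2014, 2, 5), 90.0),
    ("East", date(2014, 2, 25), 90.0),
    ("West", date(2014, 1, 8), 200.0),
    ("West", date(2014, 2, 14), 250.0),
]


def monthly_comparison(records, start, end):
    """Return rows of (branch, year, month, total, link_relative, year_on_year).

    A ratio is None when the month it compares against is missing
    from the summarized data or has zero sales.
    """
    # Step 1 and 2: filter by time range, then group and summarize.
    totals = defaultdict(float)
    for branch, day, amount in records:
        if start <= day <= end:
            totals[(branch, day.year, day.month)] += amount

    # Step 3: compare each month with the previous month and with
    # the same month of the previous year.
    rows = []
    for (branch, year, month), total in sorted(totals.items()):
        prev_year, prev_month = (year, month - 1) if month > 1 else (year - 1, 12)
        prev = totals.get((branch, prev_year, prev_month))
        last_year = totals.get((branch, year - 1, month))
        rows.append((
            branch, year, month, total,
            total / prev if prev else None,            # link relative ratio
            total / last_year if last_year else None,  # year-on-year ratio
        ))
    return rows


result = monthly_comparison(sales, date(2013, 1, 1), date(2014, 12, 31))
for row in result:
    print(row)
```

Note that the comparison step has to look up other rows by position in time, which is what makes this an ordered computation rather than a simple aggregate.
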
  • Select stocks whose closing price has risen for more than 10 consecutive days.

Implementation approach: Group the daily transaction data by stock and sort each group by date, compute the day-to-day change in the closing price and the number of consecutive days on which the price rose, and keep the stocks that have risen for more than 10 consecutive days.
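
A minimal sketch of this approach in plain Python follows. The data is made up, the names longest_rising_streak and rising_stocks are hypothetical, and "over 10 days" is read here as more than 10 consecutive daily rises at any point in the data; other readings (for example, only the most recent streak) would need a small change.

```python
from collections import defaultdict


def longest_rising_streak(closes):
    """Longest run of consecutive days whose close exceeded the previous close."""
    longest = current = 0
    for prev, cur in zip(closes, closes[1:]):
        current = current + 1 if cur > prev else 0
        longest = max(longest, current)
    return longest


def rising_stocks(trades, min_days=10):
    """trades: iterable of (stock code, ISO date string, closing price)."""
    # Group the daily records by stock.
    by_stock = defaultdict(list)
    for code, day, close in trades:
        by_stock[code].append((day, close))

    selected = []
    for code, rows in by_stock.items():
        rows.sort()  # ISO date strings sort chronologically
        closes = [close for _, close in rows]
        if longest_rising_streak(closes) > min_days:
            selected.append(code)
    return sorted(selected)


# Hypothetical data: AAA rises 12 days in a row, BBB dips in the middle.
trades = [("AAA", f"2014-09-{i + 1:02d}", 10.0 + i) for i in range(13)]
bbb_closes = [20, 21, 22, 23, 24, 25, 24, 25, 26, 27, 28, 29, 30]
trades += [("BBB", f"2014-09-{i + 1:02d}", float(c)) for i, c in enumerate(bbb_closes)]

print(rising_stocks(trades))
```
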
  • Relate data from different sources, such as contract and payment information, to the project payment schedule and find the overdue projects.

Implementation approach: Perform relational computing between heterogeneous data sources, then group, summarize and filter the data.
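
As a rough illustration, the sketch below relates a payment schedule to actual payments and filters for projects that have paid less than was due. Everything in it is an assumption: the two lists stand in for data that would really come from different sources, the record layouts are invented, and "overdue" is taken to mean that the amount due by a given date exceeds the amount paid by that date.

```python
from collections import defaultdict
from datetime import date

# Hypothetical schedule, e.g. derived from contracts: (project, due date, amount due).
schedule = [
    ("P1", date(2014, 6, 30), 500.0),
    ("P1", date(2014, 9, 30), 500.0),
    ("P2", date(2014, 8, 31), 800.0),
    ("P3", date(2014, 12, 31), 300.0),
]

# Hypothetical payments, e.g. from a finance spreadsheet: (project, paid date, amount).
payments = [
    ("P1", date(2014, 6, 20), 500.0),
    ("P2", date(2014, 8, 15), 800.0),
    ("P1", date(2014, 10, 5), 200.0),
]


def overdue_projects(schedule, payments, as_of):
    """Return {project: outstanding amount} for projects behind schedule on as_of."""
    # Group and summarize each source by project.
    due = defaultdict(float)
    for project, due_date, amount in schedule:
        if due_date <= as_of:
            due[project] += amount

    paid = defaultdict(float)
    for project, paid_date, amount in payments:
        if paid_date <= as_of:
            paid[project] += amount

    # Relate the two summaries by project and keep only the shortfalls.
    return {
        project: amount_due - paid.get(project, 0.0)
        for project, amount_due in sorted(due.items())
        if amount_due > paid.get(project, 0.0)
    }


print(overdue_projects(schedule, payments, date(2014, 10, 1)))
```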

These routine data analysis problems can all be broken down into structured data operations: filtering, grouping, summarizing, sorting, ranking and relational computing.

Of course, we occasionally need to solve modeling or prediction problems, for example finding goods that are closely related to each other, or predicting which stocks are likely to rise. These operations require a good deal of mathematical knowledge that ordinary staff are unlikely to have. They are an important kind of data analysis, but they make up only a very small part of routine data analysis.

Structured data computing is therefore the focus. Many tools can perform it, such as the R language, Python, SQL and esProc.

The R language provides the data frame type for structured data computing. However, R was originally designed for collecting and analyzing scientific data, especially for matrix and vector computations. It is not a professional tool for structured data computing.

In fact, the data frame is a newly developed feature of the R language. R's strong point is its algorithms for modeling and prediction, such as regression analysis, ANOVA, agreement evaluation and the Bernoulli distribution, which are seldom used in routine data analysis.

Pandas, a third-party Python library, can also perform structured data computing. But it too was designed for collecting and analyzing scientific data rather than for structured data computing, so it is not professional either. And as with the R language, the functions of Pandas center on modeling and prediction and are seldom used in routine data analysis.

We can see that, although many tools can perform structured data computing, few can be regarded as truly professional. The long-established professional one is SQL.

SQL was designed purely for structured data computing. It is professional and widely used.

Yet it also has drawbacks for routine data analysis. The most obvious are its complicated application environment and its weakness in ordered data computing. Installing, configuring, maintaining and managing a SQL database is complicated. A SQL data set has no inherent serial numbers, which puts it at a disadvantage in ordered data computing. Common examples in routine data analysis include the link relative ratio, year-on-year comparison, fetching data in a relative interval, ranking within groups, and getting the top and bottom records. Most of the examples mentioned at the beginning involve ordered data computing, and although they can be solved with SQL, doing so is quite difficult.
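
To make the point about serial numbers concrete, here is a small sketch that computes the link relative ratio in SQL, run through Python's built-in sqlite3 module. The table monthly_sales and its contents are hypothetical. Because a row cannot refer to "the previous row" by position, this version relates the table to itself and matches each month to the one before it. It deliberately uses only a self-join; window functions, where a database supports them, are another way to write it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE monthly_sales (branch TEXT, year INTEGER, month INTEGER, total REAL)"
)
# Hypothetical, already summarized monthly totals.
conn.executemany(
    "INSERT INTO monthly_sales VALUES (?, ?, ?, ?)",
    [
        ("East", 2013, 12, 100.0),
        ("East", 2014, 1, 150.0),
        ("East", 2014, 2, 180.0),
        ("West", 2014, 1, 200.0),
    ],
)

# Each row is matched to the same branch's previous calendar month
# by turning (year, month) into a running month number.
query = """
SELECT cur.branch, cur.year, cur.month, cur.total,
       cur.total / prev.total AS link_relative
FROM monthly_sales AS cur
LEFT JOIN monthly_sales AS prev
       ON prev.branch = cur.branch
      AND prev.year * 12 + prev.month = cur.year * 12 + cur.month - 1
ORDER BY cur.branch, cur.year, cur.month
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

The ratio is NULL (None in Python) for months that have no preceding month in the table.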

Similar to SQL, esProc is specially designed for structured data computing.

By comparison, esProc's application environment, installation and configuration are simple. esProc can fetch data from databases and import structured data directly from text files, logs and Excel. Moreover, an esProc table sequence has inherent serial numbers, which makes ordered data computing easy. Unfortunately, esProc's syntax for external memory computing differs from its syntax for in-memory computing, so the two require different code. In this respect, SQL's syntax is more consistent.