October 13, 2014

Programming Languages for Desktop Data Analysis

The most important requirements for a programming language for desktop data analysis are that it be user-friendly and have strong computing power. Whether a language is suitable for analyzing data on the desktop can be judged from six aspects: application environment, file processing, text and string processing, structured data processing, predictive modeling algorithms, and other less important indicators.
        
Application environment
Most users who do desktop data analysis are not professional programmers. They are accustomed to working under Windows and lack the environment-configuration skills that a professional would have. The application environment of a programming language for desktop data analysis should therefore be simple, Windows-compatible, and easy to install and configure.

In this respect, both esProc and R language do well. They have simple application environments that can be used directly after installation. Python itself installs without problems, but Pandas, which is frequently used to improve its computing capacity, is complicated to install under Windows and is very strict about versions, though it installs easily under Linux. SSAS, being a Microsoft product, gets along quite well with Windows, except that its installation and configuration are a little complicated.
        
File processing
TXT and XLS are the file formats most likely to be encountered in routine work, and the degree of support for them indicates how easy a language is to use.

Generally, all analysis languages support TXT directly. XLS is more complicated, especially with regard to installing third-party modules and version compatibility. Since both Excel and SSAS are Microsoft products, SSAS supports XLS seamlessly. esProc, SPSS and SAS can also access XLS directly, without third-party packages. Pandas is a special case: it supports XLS directly, but it is itself a third-party library, and its support for certain versions of XLS is limited. R language needs a third-party library and a Perl runtime environment, and the versions of all three must match, which makes installation complicated.

SSAS is the easiest to use for reading and writing files. Pandas and R language provide abundant parameters.

The ability to process big files should also be taken into account, for example processing the data while the file is still being imported. esProc is the best at this, with the most concise code.
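To make "processing while the files are being imported" concrete, here is a minimal sketch in plain Python (standard library only). It is not esProc code and not taken from the article; the function name `sum_by_key`, the column names and the sample data are all hypothetical. The point is only that rows are consumed one at a time, so memory use depends on the number of groups rather than on the size of the file.

```python
import csv
import io
from collections import defaultdict


def sum_by_key(lines, key, value):
    """Stream tab-separated rows and keep only running totals in memory."""
    totals = defaultdict(float)
    # DictReader pulls one row at a time from the underlying file object,
    # so the whole file is never loaded at once.
    for row in csv.DictReader(lines, delimiter="\t"):
        totals[row[key]] += float(row[value])
    return dict(totals)


# Hypothetical sample standing in for a large TXT file; with a real file,
# pass open("sales.txt", newline="") instead of the StringIO object.
sample = io.StringIO("region\tamount\nEast\t10.5\nWest\t4\nEast\t2.5\n")
totals = sum_by_key(sample, "region", "amount")
print(totals)
```

The same pattern works with any iterable of lines, which is why the sketch accepts a file-like object rather than a path.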
        
Text and string processing
Sometimes disordered, semi-structured or unstructured raw data needs to be preprocessed into easy-to-use structured data. Text and string processing is therefore another focus when evaluating a language.
In this respect, Python is the best and R is satisfactory. esProc comes next, and SSAS is the worst.
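As an illustration of the kind of preprocessing meant here, the following sketch turns semi-structured text lines into structured records using Python's standard `re` module. The line layout (a date, a name, a colon, an amount), the pattern `LINE` and the function `parse` are assumptions made up for this example; real raw data would need its own pattern.

```python
import re

# Hypothetical layout: "<yyyy-mm-dd> <name> : <amount>", possibly surrounded by noise.
LINE = re.compile(
    r"(?P<date>\d{4}-\d{2}-\d{2})\s+(?P<name>[A-Za-z ]+?)\s*:\s*(?P<amount>[\d.]+)"
)


def parse(lines):
    """Return one dict per line that matches the pattern; skip the rest."""
    records = []
    for line in lines:
        m = LINE.search(line)
        if m:
            records.append(
                {
                    "date": m["date"],
                    "name": m["name"].strip(),
                    "amount": float(m["amount"]),
                }
            )
    return records


raw = [
    "2014-10-01  Alice : 120.50",
    "### junk ###",
    "order 2014-10-02 Bob Smith:80",
]
records = parse(raw)
for rec in records:
    print(rec)
```

Lines that do not match are silently dropped in this sketch; in practice one might collect them separately for inspection.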
        
Structured data processing
In practice, most of the data analyzed on the desktop is structured. Structured data computing is therefore the most common operation in desktop data analysis, and the corresponding computing power is the core competence of a programming language for this purpose.

esProc is the most professional in this field because it is designed specifically for structured data computing. R language is less professional, especially in ordered data computing, even though a dedicated data type, the dataframe, has been created for it. Pandas' dataframe is developed from and improves on R’s, which makes it as capable as R language but easier to understand. By contrast, SPSS, SAS and SSAS have little to offer in structured data computing.

Ordered data computing includes operations such as link relative ratio, year-on-year comparison, fetching data in a relative interval, ranking within groups, and getting the top or bottom records. It often involves relative positions and cross-row or cross-group calculations, and is a typical case of desktop data analysis. With its inherent serial numbers, esProc performs the best in ordered data computing. Python and R language perform well, but because their basic element is the vector rather than the record, code written in them is hard to follow and better suited to scientific use.
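Two of the operations listed above, the link relative ratio and the top records within each group, can be sketched in plain Python as follows. This is an illustration only, not esProc, Pandas or R code; the `sales` records and the helper names `link_relative` and `top_n` are hypothetical. The link relative ratio is taken here to mean each period's amount divided by the previous period's amount within the same group.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical monthly sales records
sales = [
    {"month": "2014-01", "region": "East", "amount": 100.0},
    {"month": "2014-02", "region": "East", "amount": 150.0},
    {"month": "2014-03", "region": "East", "amount": 120.0},
    {"month": "2014-01", "region": "West", "amount": 80.0},
    {"month": "2014-02", "region": "West", "amount": 60.0},
]


def link_relative(rows):
    """Add a 'ratio' field: amount / previous month's amount in the same region."""
    out = []
    ordered = sorted(rows, key=itemgetter("region", "month"))
    for _, grp in groupby(ordered, key=itemgetter("region")):
        prev = None
        for r in grp:
            # The first record of each region has no predecessor.
            ratio = None if prev is None else r["amount"] / prev
            out.append({**r, "ratio": ratio})
            prev = r["amount"]
    return out


def top_n(rows, n):
    """Return the n records with the largest amount in each region."""
    ordered = sorted(rows, key=itemgetter("region"))
    return {
        region: sorted(grp, key=itemgetter("amount"), reverse=True)[:n]
        for region, grp in groupby(ordered, key=itemgetter("region"))
    }


ratios = [r["ratio"] for r in link_relative(sales)]
best = top_n(sales, 1)
print(ratios)
print(best)
```

Both helpers depend on sorting first, because the calculation refers to a record's position relative to its neighbours, which is exactly what distinguishes ordered computing from plain aggregation.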

Predictive modeling algorithms
Predictive modeling is mainly used in scientific fields and is not common in desktop data analysis. Yet it is still an essential indicator.

R language boasts the latest, richest and most mature third-party algorithms. SSAS is easy to use but inflexible. Python/Pandas has always been trying to catch up with and imitate R language; it achieves an easier-to-understand syntax but is not yet fully developed. SAS and SPSS have established their authority in this area. By contrast, esProc has almost no ready-made predictive modeling algorithms.

Other less important indicators
Some less important indicators, such as support for databases, parallel computing and graphing ability, also deserve attention in special cases.

SSAS works the best with databases, but it is not good at heterogeneous computing across text files, databases and user-defined data. esProc also gets along well with databases and performs satisfactorily with heterogeneous data sources. Pandas, R language, SAS and SPSS, however, are not good at either.
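To show what heterogeneous computing between a text file and a database involves, here is a minimal sketch using only Python's standard library (`sqlite3` and `csv`). It says nothing about how any of the products above handle the task; the table `customers`, the column names and the sample data are hypothetical, and an in-memory SQLite database stands in for a real database server.

```python
import csv
import io
import sqlite3

# Database side: a small in-memory SQLite table (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)", [(1, "Alice"), (2, "Bob")]
)

# Text side: tab-separated orders standing in for a TXT file.
orders_txt = io.StringIO("customer_id\tamount\n1\t30\n2\t45\n1\t20\n")

# Load the lookup table from the database, then join while streaming the text.
names = dict(conn.execute("SELECT id, name FROM customers"))
totals = {}
for row in csv.DictReader(orders_txt, delimiter="\t"):
    name = names[int(row["customer_id"])]
    totals[name] = totals.get(name, 0.0) + float(row["amount"])

conn.close()
print(totals)
```

The join is done by hand in application code, which illustrates why mixed-source calculations tend to be more work than queries against a single database.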

SSAS is an expert at graphing, though it lacks flexibility. R language, esProc and Pandas, however, are flexible and also offer abundant built-in charts.


As for parallel computing, esProc has a built-in engine for it that is easy to configure and develop with. R language needs third-party software to perform parallel computing, which makes configuration and development complicated.