The most important requirements for a desktop data analysis language are
that it be user-friendly and have great computing power. We can judge whether
a language is suitable for analyzing data on the desktop from six aspects:
application environment, file processing, text and string processing,
structured data processing, predictive modeling algorithms, and other less
important indicators.
Application environment
Most users who do desktop data analysis are not professional programmers.
They are accustomed to working under Windows and lack the environment
configuration skills that a professional would have. So the application
environment of a programming language for desktop data analysis should be
simple, Windows-compatible, and easy to install and configure.
In this respect, both esProc and R language do well. They have simple
application environments that can be used directly after installation. Python
itself installs without problems, but Pandas, which is frequently used to
improve its computing capacity, is complicated to install under Windows and is
very strict about versions, though it is easily installed under Linux. As a
Microsoft product, SSAS gets along quite well with Windows, except that its
installation and configuration are a little complicated.
File processing
TXT and XLS are the file formats most likely to be generated in routine
work, and the degree of support for them shows how easy an interpreted
language is to use.
Generally, all analysis languages support TXT directly. XLS is more
complicated, especially where the installation of third-party modules and
version compatibility are concerned. Since both Excel and SSAS are Microsoft
products, SSAS supports XLS seamlessly. esProc, SPSS and SAS do not need
third-party packages either; they can access XLS directly. Pandas is a special
case: it supports XLS directly, but it is itself a third-party library, and its
support for certain versions of XLS is limited. R language needs a third-party
library and a Perl runtime environment, and the versions of all three must
match, which makes installation complicated.
SSAS is the easiest to use for reading and writing. Pandas and R language
provide abundant parameters.
The ability to process big files should also be taken into account, such as
processing the data while the file is still being imported. esProc is the best
at this, with the most concise code.
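To make "processing while the file is being imported" concrete, here is a
minimal sketch in plain Python (standard library only, not esProc or Pandas
code). It aggregates a tab-separated file one row at a time, so the whole file
never has to fit in memory. The function name sum_by_key, the column names and
the sample data are assumptions made up for illustration.

```python
import csv
import io
from collections import defaultdict


def sum_by_key(lines, key_field, value_field, delimiter="\t"):
    """Sum value_field per key_field while streaming over the rows."""
    totals = defaultdict(float)
    reader = csv.DictReader(lines, delimiter=delimiter)
    for row in reader:  # only the current row is held in memory
        totals[row[key_field]] += float(row[value_field])
    return dict(totals)


# Hypothetical sample standing in for a large TXT file on disk;
# an open file object could be passed in exactly the same way.
sample = io.StringIO(
    "region\tamount\n"
    "East\t100.5\n"
    "West\t20\n"
    "East\t4.5\n"
)
totals = sum_by_key(sample, "region", "amount")
```

Because csv.DictReader pulls lines lazily from whatever it is given, the same
function works unchanged on a real file opened with open(path, newline="").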
Text and string processing
Sometimes disordered, semi-structured or unstructured raw data needs to be
preprocessed into easy-to-use structured data. Text and string processing is
therefore another focus in evaluating an interpreted language.
In this respect, Python is the best and R language is satisfactory. esProc
comes next and SSAS is the worst.
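As an illustration of this kind of preprocessing, the following sketch uses
Python's standard re module to turn loosely formatted text lines into
structured records and to skip lines that do not fit. The line layout
(date, name, amount), the pattern and the function name parse_lines are
assumptions invented for the example, not a format taken from any of the
tools discussed here.

```python
import re

# Assumed layout: "<YYYY-MM-DD> <name> : <amount>", with irregular spacing.
LINE_PATTERN = re.compile(
    r"(?P<date>\d{4}-\d{2}-\d{2})\s+"
    r"(?P<name>[A-Za-z ]+?)\s*:\s*"
    r"(?P<amount>\d+(?:\.\d+)?)"
)


def parse_lines(lines):
    """Return one dict per line that matches the pattern; skip the rest."""
    records = []
    for line in lines:
        match = LINE_PATTERN.search(line)
        if match is None:
            continue  # noise such as page breaks or blank lines
        records.append({
            "date": match.group("date"),
            "name": match.group("name").strip(),
            "amount": float(match.group("amount")),
        })
    return records


# Hypothetical raw input with inconsistent spacing and one noise line.
raw = [
    "2014-03-01  Alice Smith : 120.50",
    "-- page break --",
    "2014-03-02 Bob:75",
]
records = parse_lines(raw)
```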
Structured data processing
In practice, most of the data analyzed on the desktop is structured data.
Structured data computing is therefore the most common operation in desktop
data analysis, and the corresponding computing power is the core competence of
a programming language for desktop data analysis.
esProc is the most specialized language in this field because it is designed
specifically for structured data computing. R language is less specialized,
especially in ordered data computing, though a dedicated data type, the
dataframe, has been created for it. Pandas' dataframe was developed and
improved on the basis of R's, which makes it as capable as R language but
easier to understand. By contrast, SPSS, SAS and SSAS offer little for
structured data computing.
Ordered data computing includes operations such as the link relative ratio
(comparison with the previous period), year-on-year comparison, fetching data
in a relative interval, ranking within groups, and getting the top or bottom
records. It often involves relative positions and calculations across rows and
across groups, and is a typical case of desktop data analysis. With its
inherent serial numbers, esProc performs the best in ordered data computing.
Python and R language perform well, but because their basic element is the
vector instead of the record, code written in them is harder to follow and
more suitable for scientific use.
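Two of the operations listed above can be sketched in a few lines of plain
Python (standard library only). This is neither esProc nor Pandas syntax; it
only shows what the calculations do. The function names, field names and
sample figures are assumptions made up for illustration.

```python
from collections import defaultdict


def period_over_period(values):
    """Link relative ratio: each value divided by the one before it."""
    if not values:
        return []
    ratios = [None]  # the first period has no predecessor
    for previous, current in zip(values, values[1:]):
        ratios.append(current / previous)
    return ratios


def top_n_per_group(records, group_key, sort_key, n):
    """Return the n records with the largest sort_key within each group."""
    groups = defaultdict(list)
    for record in records:
        groups[record[group_key]].append(record)
    return {
        group: sorted(rows, key=lambda r: r[sort_key], reverse=True)[:n]
        for group, rows in groups.items()
    }


# Hypothetical monthly sales, already in time order.
ratios = period_over_period([100.0, 110.0, 99.0])

# Hypothetical sales records grouped by region.
sales = [
    {"region": "East", "seller": "A", "amount": 50},
    {"region": "East", "seller": "B", "amount": 80},
    {"region": "West", "seller": "C", "amount": 30},
    {"region": "East", "seller": "D", "amount": 65},
]
best = top_n_per_group(sales, "region", "amount", 2)
```

Both functions depend on row position or on ordering inside a group, which is
why this class of calculation is awkward in tools that treat data as unordered
sets.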
Predictive modeling algorithms
Predictive modeling is mainly used in scientific fields and is not common in
desktop data analysis. Yet it is still an essential indicator.
R language boasts the latest, richest and most mature third-party algorithms.
SSAS is easy to use but inflexible. Python/Pandas has always been trying to
catch up with and imitate R language. It achieves an easier-to-understand
syntax but has not fully matured. SAS and SPSS have established their authority
in this regard. By contrast, esProc has almost no ready-made predictive
modeling algorithms.
Other less important indicators
Some less important indicators, such as support for databases, parallel
computing and graphing ability, also matter in special cases.
SSAS works the best with databases, but it is not good at heterogeneous
computing across text files, databases and self-defined data. esProc also gets
along well with databases and performs satisfactorily with heterogeneous data
sources. Pandas, R language, SAS and SPSS handle neither well.
SSAS is an expert at graphing, though it lacks flexibility. R language, esProc
and Pandas, however, are flexible and also offer abundant built-in charts.
As to parallel computing, esProc has a built-in engine for it, which is easy
to configure and develop with. R language needs third-party software to
perform parallel computing, resulting in complicated configuration and
development.