Programming
languages focus on various basic data types, subject to their different design
goals. Languages such as Java and C# are designed to develop the common
applications. Their basic data types are character strings, number, boolean,
and other atomic data type, array and common object. SQL, PowerBuilder, R,
esProc, and other alike languages are designed to process data. So their basic
data types are the structured 2-dimentional data sheet object. Take this SQL statement,
for example,SELECT T1.id,T1.name,T1.value FROM T1 LEFT JOIN T2 ON
T1.id=T2.id. Of which, the T1, T2, and the computed result just use such
data type. With the multiple fields to form one record and the multiple records
to form the 2-dimentional data, the combination of such data and its field name
is the structured 2-dimenional data table object.
Why
not use the atomic data type and the common object as the basic data type for
the data processing languages? If representing the T1 and T2 from the
above-mentioned SQL statement with the array or Array List object, you will
find: The complexity will increase for several times, and the length of codes
will also increase sharply for dozens of times.
The
basic data types of data processing languages are the structured 2-dimensional
data table object. This is not a coincidence, but there are subtle reasons
instead.
Correspond
to actual business. In the real world, most business data is the structured
data. As an example, the Payroll list has the employee number, employee name,
department, date, pre-tax salary, and post-tax salary; For another example, the
retail record has the order time, outlet number, checkout counter number,
cashier number, product name, and unit price; The last example of business data
is the Website log, which comprises the browse time, URL, visitor IP, browser
version, and other properties. These properties are equivalent to the field.
Each of the records has the same structure. Though they
are stored in text while not the database, they are actually still the
structured data in nature. So, it is only natural to use the 2-dimensional data
table to represent it. The structured 2-dimensional data table object can be
used to represent the business data intuitively. Representing the actual
business in the most faithful way, no matter the storage, computing, exchange
or sharing. Such kind of data is the easiest for users to understand in a most
convenient way.
Easy
for massive processing. Business data are mostly the data of the same
structure, for example, the Payroll table, Retail record, and Website log
mentioned above. In processing such data, in some cases, we will handle a
certain data of a certain record, but in most cases, we take a certain record
as unit to process all data, for example: Compute the after-tax wage based on
the pre-tax wage; Compute the amount based on the unit price and quantity of
commodity. Count the daily on-line duration for each IP. The above-mentioned
processing mode is just the massive data processing. To implement the batch
processing, we can traverse every member of array in loops by row number and column
number just as the operations for Java. Alternatively, we can operation on the
data with the business field name directly as we would do for SQL and esProc.
The latter resolution is simpler and easier-to-use without having to write loop
statements. Programmers can thus operate on data intuitively from business
perceptive, and the corresponding code become more concise and readable.
Compatible
with the Relational Algebra. The relational algebra is the underlying
theory developed for data processing and query. By which, the association and
laws of operations among business data can be expressed in full details using
the basic operation along with the join operation, aggregation operation, and
division operation. Theoretically, any computation problem of any degree of
difficulty can be implemented and solved by relational algebra in the respects
of data processing and data query. Because the relational algebra is concise
and complete, databases are largely designed based on this theory. E.F. Codd is
thus called as the father of relational database. The structured 2-dimensional
data table object is just the data type recommended by E.F. Codd. This data
type can be used to express various operations of relational algebra, so as to
solve the computation problem in data processing easily. In facts, the database
result set is the earliest structured 2-dimensional data table object.
As
can be seen, all kinds of programming languages adopt the structured 2-dimental
data table object as the basic data type because it is corresponding to the
real business data, and easy to implement the massive computation, making it
compatible to the relational algebra theory. With the 2-dimensional data table
object, codes can be simple and easy to understand, and the development
efficiency is improved. Let me explain it with a few more examples below:
Result
set of SQL (resultSet): Group by the book type to compute the average price of the
books whose average price is greater than 15 yuan.
select avg(price),type from
books group by type having avg(price)>15
Table
sequence of esProc (TSeq): Group by department to find the top 10 best sellers
for each department.
products. group(department).
(~.top(quantity;10)
Data
window of PowerBuilder (datawindow): Sort the order by price
Order.SetSort('value d')
Order.Sort()
R language data frame (data.frame):
Left-join the orders table and customer table by customerID.
merge(A1,B1,by.x="CustomerID",by.y="CustomerID",all.x=TRUE)
SQL,
esProc, and R code comparison: Group the order data by department, and
summarize the order data and sales amount of each department.
SQL:
Select
count(*),sum(sales) from orders group by Dept
esProc:
orders.groups(Dept; count(~),
sum(sales))
R
language:
result<-aggregate(orders$
sales,list(orders $ Dept),sum)
result$count<-tapply(orders
$ sales, orders $ Dept,length)
Let's
take a close look on result set, table sequence, data window, and data frame.
Although they are all structured 2-dimensional data table objects with
basically the same function. There are some slight differences between them.
SQL
result set is rich in various materials, widely applied, universal, and
simple to use. It is the top mainstream data type of all data processing
languages. However, SQL did not implement the relational algebra to the full,
making it a bit inconvenient for some computations, such as the set division.
DataWindow
usually retrieves number from SQL, and return the final result to the
database. It mainly servers the purpose of breaking through any barrier between
the data and UI controls, so that programmers can design and deliver the
database application with high interactivity soon. Another major function of
DataWindow is to render and edit data. It can be only used for form
computation, and the data processing capability is relatively poor.
Data
Frame is capable of handling the structured computation to some extent. As
can be seen from the above example, its syntax is obscure, and it is relatively
complex to implement the same functions with it. This is because the major
functions of R are scientific and statistical computing, focusing on the data
types of array and matrix. As a additional data type, the data frame was later
introduced to implement the structured data computing. Considering this point,
data frame is not so dedicated as that of the other three tools.
TSeq
is quite dedicated in data processing. Having incorporated all common
strong points of SQL result sets, TSeq can fully and completely implement the
relational algebra. TSeq is generic and sorted, especially fit for the
order-related complex computing in data processing, for example: Yearly link
relative ratio, year-on-year comparison, ranking, relative position computing,
and interval computing. TSeq is also charactered for it is generic, and easier
to establish relations between data and provide access to data with multi-level
association easily by object. Compared with SQL, TSeq is unable to directly
process the big data because it is the pure memory object.
As
can be seen, the structured 2-dimensional data table object is directly
related to the degree of dedication for the data processing languages. The more
powerful the former one is, the higher the degree of dedication for the latter
one would be, and vice versa. If a programming language lack the structured
2-dimensional data table object, then this language can hardly be regarded as a
dedicated one in processing the data. To research and examine if a programming
language can be used to develop any application for data analysis and
processing efficiently, the key is to find out if it offers the dedicated
2-dimensional data table object and the appropriate class library.
Perl
is often used to retrieve character string and is capable of processing the
data to some extent. However, since its code is lengthy and complex, it is not
the dedicated data processing language. For example, to complete the simplest
algorithm of grouping and summarizing, the code of Perl is shown below:
%groups=();
for each(@carts){
$name = $_->[1];
if($groups{$name} == null){
$groups{$name}=[$_];
}
else{
push($groups{$name},$_);
}
}
my @result=();
for each(
keys(%groups)){
$value=0;
while($row=pop $groups{$_}){
$value += $row->[2];
}
push @result,[$_,$value];
}
Python
is a bit simpler to write, but far more inefficient than SQL, esProc, and R in
developing by order of magnitude. The sample code is shown below:
result=[]
for key, items in
groupby(data, itemgetter(0)):
value1=0
value2=0
for subitem in items:
value1+=subitem[1]
value2+=subitem[2]
result.append([key,value1,value2])
print(result)
Perl
and Python is not the dedicated tools on data processing. Most importantly,
they lack the structured 2-dimensional table data object.
The
TSeq of esProc is not only the structured 2-dimensional table data object, but
also is characterized with its being order, generic, step-by-step computation,
making it more dedicated than other alike languages. For example, to implement
a relatively complex computational goal: Find the shares having been rising on
5 consecutive days. esProc solution code is shown below:
No comments:
Post a Comment