esProc, A Script Language for Data Analytics with Parallel Mechanism: Example & Comments for SQL Computation Disadvantages

The computing capacity of SQL for mass structured data is complete, that is to say, it is impossible to find anything that SQL cannot manage. But its supported layer is too low, which can lead to over-elaborate operation in practical application.
The over-elaborate operation is specifically reflected in the following four aspects:

Computation without sub-step: SQL requires computation to be written out in one statement, and it is necessary to adopt storage procedure to implement computation step by step. No sub-step not only causes difficulty in thinking, but also makes it difficult to use intermediate result.
Set is unordered: SQL does not directly provide the mechanism of using position to refer to set members, and conversion is needed to implement computation relating to order and positioning.
Set-lizing is not complete: SQL set function is simple and is only used to indicate the query result set and cannot be explicitly applied as basic data type.
Lacks object reference: SQL does not support record reference, the association between data tables adopts equivalent foreign key scheme, and in conducting multi-table joint computation, it is necessary to conduct join operation. So it is not only difficult to understand, but also low in efficiency.

Implementing data computation process based on a type of computation system is in fact the process of translating business problem into formalized computation syntax (which is similar to the case in which a primary-school student solves an application problem by translating the problem into formalized four arithmetic operations). Because of the above-mentioned four problems of SQL, in handling complex computation, its model system is inconsistent with people’s natural thinking habit. It causes a great barrier in translating problems, leading to the case that the difficulty to formalize the problem-solving method into computation syntax is much greater than to find the solution of the problem.
We give the following examples to describe respectively the problems in the four aspects.
To make the statement in the examples as simple as possible, here a large number of SQL2003 standard window functions are used. So we adopt the ORACLE database syntax that does a relatively good job in supporting SQL2003 standard as it will be generally more complex to adopt the syntax of other databases to program these SQLs.

Computation without sub-step

Carrying out complex computation step by step can reduce the difficulty of the problem to a great extent, conversely, collecting a multi-step computation into one to be completed in just one step increases the complexity of the problem.
Task1 The number of persons of the sales department, where, the number of persons whose native place is NY, and where, the number of female employees?

The number of persons of the sales department

1	select count(*) form employee where department=‘sales’

Where, the number of persons whose native place is Beijing

1	select count(*) form employee where department=‘sales’ and native_place=‘NY’

And where, the number of female employees

1	select count(*) form employee
2	where department=‘sales’ and native_place=‘NY’ and gender=‘female’

Conventional thought: Select the persons of the sales department for counting, and from it, find out the persons whose native place is NY for counting, and then further find out the number of female employees for counting. The query each time is based on the existing result last time, so it is not only simple in writing but also higher in efficiency.
But, the computation of SQL cannot be conducted in steps, and it is impossible to reuse the preceding result in answering the next question, and it is only possible to copy the query condition once more.
Task2 Each department selects a pair of male and female employees to form a game team.

1	with A as
2	(select name, department,
3	row_number() over (partition by department order by 1) seq
4	from employee where gender =‘female’),
5	B as
6	(select name, department,
7	row_number() over (partition by department order by 1) seq
8	from employee where sex =‘female’),
9	select name, department from A
10	where department in ( select distinct department from B ) and seq=1
11	union all
12	select name, department from B
13	where department in ( select distinct department from A ) and seq=1

Computation without sub-step sometimes not only causes trouble in writing and low efficiency in computation, but even causes serious deformation in the train of thought.
The intuitive thought of this task: For each department cycle, if this department has male and female employees, then select one male employee and one female employee and add them to the result set. But SQL does not support this kind of writing with which the result set is completed step by step (to implement this kind of scheme, it is necessary to use the stored procedure). At this time, it is necessary to change the train of thought into: Select male employee from each department, select female employee from each department, select out, respectively from the two result sets, members whose departments appear in another result set, and finally seek the union of the sets.
Fortunately, there are still with sub-statement and window function over (SQL2003 standard begins to support); otherwise this SQL statement will be simply ugly.

The set is unordered.

Ordered computation is very common in mass data computation (obtain the first 3 places/the third place, compare with the preceding period). But SQL adopts the mathematical concept of unordered set, so ordered computation cannot be conducted directly, and it is necessary to adjust the train of thought and change the method.
Task3 Company’s employees whose ages are in the middle.

1	select name, birthday
2	from (select name, birthday, row_number() over (order by birthday) ranking
3	from employee )
4	where ranking=(select floor((count(*)+1)/2) from employee)

Median is a very common computation, and originally it is only necessary to simple get out, from the ordered set, the members whose positions are in the middle. But SQL unordered set mechanism does not provide the mechanism which directly uses position to access member. It is necessary to create a man-made sequence number field, and then use the condition query method to select it out, causing the case in which a sub-query is needed to complete the query.
Task4 For how many trading days has this stock gone up consecutively in the longest?

1	select max(consecutive_day)
2	from (select count(*) (consecutive_day
3	from (select sum(rise_mark) over(order by trade_date) days_no_gain
4	from (select trade_date,
5	case when
6	closing_price >lag(closing_price) over(order by trade_date)
7	then 0 else 1 end rise_mark
8	from stock_price) )
9	group by days_no_gain)

Unordered set can also cause train of thought to deform.
The conventional train of thought for computing the number of consecutive days in which the stock rises: Set up a temporary variable whose initial value is 0 to record the consecutive dates in which the stock rises, and then compare it with the preceding day. If the stock does not rise, then clear the variable to 0; if it rises, add 1 to the variable, and see the maximum value appearing from the variable when the cycle is over.
In using SQL, it is impossible to describe this process, so it is necessary to change the train of thought. To compute the accumulate number of days in which stock does not rise from the initial date to the current date, and the one with the same number of days in which stock does not rise is the consecutive trading days in which the stock rises, and from its sub-group, it is possible to find out the interval in which the stock rises, and then seek its maximum count. It is already not so easy to read and understand this statement and it is more difficult to write it out.

Set-lization is not complete.

It is beyond any doubt that set is the basis of mass data computation. Although SQL has the concept of set, it is limited to describing simple result set, and it does not take the set as a basic data type to enlarge its application scope.
Task5 Employees in the company whose birthday are the same as those of others.

1	select * from employee
2	where to_char (birthday,‘MMDD’) in
3	(select to_char (birthday,‘MMDD’) from employee
4	group by to_char (birthday,‘MMDD’)
5	having count>1)

The original intention of grouping is to split the source set into several
subsets, and its returned values are also these sub-sets. But SQL cannot describe this kind of “set consisting of sets”, so it forcibly conducts the next step aggregating computation on these sub-sets and forms conventional result set.
But sometimes what we want is not the summary value on sub-sets, but rather the subsets themselves. At this time, it is necessary to use from the source set the condition obtained from grouping to query again, so sub-query appears again unavoidably.
Task6 Find out students whose scores ranks in top 10 for all subjects.

1	select name
2	from (select name
3	from (select name,
4	rank() over(partition by subject order by score DESC)ranking
5	from score)
6	where ranking<=10)
7	group by name
8	having count (*)=(select count (distinct subject) from score)

Use set-lized train of thought, order and filter the sub-sets of subjects after grouping to select the top 10 of every subject, and then it is possible to complete the task by finding out the intersection set of these sub-sets. But SQL cannot describe the “set of set” and has not the intersection operation to cope with indefinite quantity set. At this time, it is necessary to change the train of thought and use the window function to find out the top 10 of every subject, and then find out, according to student sub-group, the students whose number of appearances is the same as the quantity of subjects, which causes difficulty in understanding.

It lacks object reference.

In SQL, the reference relation between tables depends on equivalent foreign key for maintenance and it is impossible to directly use the record at which the foreign key point as the field of this record. In query, it is necessary to seek help of multi-table join or sub-query to complete the query, which causes not only trouble in writing but also low efficiency in operations.
Task7 Female manager’s male employees.
Use multi-table join.

1	select A.*
2	from employee A,department B, employee C
3	where A.department=B.department and B.manager=C.name and
4	A.sex =‘male’ AND C.gender =‘female’

Use sub-query.

1	select *from employee
2	where department in
3	(select department from department
4	where manager in
5	(select name from employee where gender =‘female’))

If the department field in the employee table points at the record in the department table while the manager field in the department table points at the record in the employee table, then it is only necessary to write this query condition simply as this kind of intuitive high-efficiency form:
where department. manager. sex ='female' and sex ='male'
But in SQL, it is only possible to use multi-table join or sub-query to write out the two kinds of obviously obscure statements.
Task8 Companies with which employees have their first jobs.
Use multi-table join.

1	select name, company, first_company
2	from (select employee.name name, resume.company company,
3	row_number() over(partition by resume. name
4	order by resume.start_date) work_seq
5	from employee, resume where employee.name = resume.name)
6	where work_seq=1

Use sub-query.

1	select name,
2	(select company from resume
3	where name=A. name and
4	start date=(select min(start_date) from resume
5	where name=A. name)) first_company
6	from employee A

Without object reference mechanism and the completely set-lized of SQL, it is naturally impossible to handle the sub-table as an attribute of the primary table (field value). Regarding the query of sub-table, there are two methods. The first is to use multi-table join, increase the complexity of the statement, and use filter or grouping to convert the result set into the situation having one-to-one correspondence with the primary table record (the joined record has one-to-one correspondence with the sub-table). The second is to adopt sub-query, and each time compute temporarily the sub-table relating to the primary table record to record sub-sets, and increase the overall computation workload (it is impossible to use with sub-statement in sub-query) and trouble in writing.

Now, if you don't want to handle complex SQL, esProc can be a good helper for you.

menu

August 3, 2012

Example & Comments for SQL Computation Disadvantages

No comments:

Post a Comment