esProc, A Script Language for Data Analytics with Parallel Mechanism: A Query Language Over-perform SQL

SQL (Structured Query Language) was initially developed in early 70s. To date, SQL is still a powerful, simple, and widely-applied database query language. However, the drawbacks of SQL are evident: non-stepwise computation, incomplete set-lization, no support for ordered set, and lack of object reference. Although several producers has developed a number of incompatible solutions, such as various stored procedures like PL-SQL®, and T-SQL®, these improvement solutions cannot solve the inherited SQL drawbacks.

As a query language of the new generation, esProc over-perform SQL in respect of computational capability, usability, and applicability.

I Stepwise Computation

Case Description

A transnational retail enterprise needs to collect statistics on the newly opened retail store, that is, how many new retail stores will open in this year? Of which how many enterprises will exceed 1 million dollars? How many enterprise open overseas?

This question is progressive. The three questions are mutually related, the next question can be regarded as on the further explore on the current question, fit for stepwise computation.

The original data is from the database of stores table with the main fields: storeCode, storeName, openedTime, profit, and nation. Let’s see the SQL solution first.

SQL Solution

To solve such problem with SQL, you will need to compose 3 statements as given below.

l SELECT COUNT(*) FROM stores WHERE to_char (openedTime, 'yyyy') = to_char (sysdate,'yyyy');

l SELECT COUNT(*) FROM stores WHERE to_char (openedTime, 'yyyy') = to_char (sysdate,'yyyy') and profit>1000000;

l SELECT COUNT(*) FROM stores WHERE to_char (openedTime, 'yyyy') = to_char (sysdate,'yyyy') and profit>1000000 and nation<>’local’;

SQL1: Get the results of problem 1.

SQL2: Get the result of problem 2. Because the stepwise computation is impossible (that is, the results of previous computation cannot be utilized), you can only solve and take it as an individual problem.

SQL3: Solve the problem 3, and you are not allowed to compute in steps either.

esProc Solution

A1 cell: Get the record meeting the problem 1.

A2 cell: Stepwise computation. Operate on the basis of cell A1 , and get the record meeting the conditions of problem 2.

A3 cell: Proceed with the stepwise computation, and get the record meeting the problem 3.

B1, B2, B3 cells: It is still the stepwise computation. Count the corresponding records.

Comparison

For the SQL, there are 3 associations for you to compute in steps, and explore progressively. However, because stepwise computation is hard to implement with SQL, this problem have to be divided into 3 individual problems.

esProc is to compute in steps following the natural habit of thinking: Break the general objective into several simple objective; Solve every small objective step by step; and complete the final objective.

In case that you proceed with the computation on the basis of the original 3 problems, for example, seek “proportion of problem 3 taken in the problem 2”, or “group by country for problem 2”, then, in the esProc, you can simply write ”=A3/A2”, and ”A3.group(nation)”. In each step, there is a brief and clear expression of highly readable, not requiring any advanced technical ability. By comparison, SQL requires redesigning the statement. The redesigned statement will undoubtedly become longer and longer, and more and more complex. Ultimately, such job can only be assigned to those who have the advanced technical ability in SQL.

esProc can break the complex problem down to simple computation procedure based on the descriptions from the business perceptive. This is the advantage of the stepwise computation. By comparison, SQL does not allow for computation by step or problem breakdown, and thus it is against the scientific methodology, and not fit for the complex computation.

II Complete Set-lization

Case Description

A certain advertisement agency needs to compute the clients whose annual sales values are among the top 10.

The data are from the sales table, which records the annual sales value of each client with the fields like: customer, time, and amount.

SQL Solution

SELECT customer

FROM (

SELECT customer

FROM (

SELECT customer,RANK() OVER(PARTITION BY time ORDER BY amount DESC) rankorder

FROM sales )

WHERE rankorder<=10)

GROUP BY customer

HAVING COUNT(*)=(SELECT COUNT(DISTINCT time) FROM sales)

Such Problem requires ranking the sets in a set, that is, group by “time” and then rank by “customer” in the group. Since the currently popular SQL-92 syntax is unable to represent this, the SQL-2003 standard, for which some vendors gradually provide the incomplete support, will be used to solve this problem barely.

A tip can be adopted when computing the customer intersections in the last step: the count of year equals to the count of clients.

esProc Solution

A1: Group the original dataset by year. A1 is a set of sets.

B1: Get the serial number of records whose sales values are among the top 10 of each group. The rank() is used to rank in every group, and pselect() can be used to retrieve the serial number on conditions. ~ is used to represent every member in the set. B1 is the “set of set”.

A2: Retrieve the record from A1 according to the serial number stored in B2, and get the customer field of the record.

A3: Compute the intersection of sets.

Comparison

The SQL set-lization is incomplete and can only be used to represent the simple result set. Developers cannot use SQL to represent the concept of “set of set”. Only the queries of 3-level-nested-loops are available to barely perform the similar computations. In addition, SQL cannot be used to perform the intersection operation easily that developers with advanced techniques can only resort to the unreadable statements to perform the similar operations, such as count of years equal to the count of clients. This equals to the intersection of clients.

The set is the base of mass data. esProc can achieve set-lization completely, represent the set, member, and other related generic or object reference conveniently, and perform the set operations easily, such as intersection, complement, and union.

When analyzing the set-related data, esProc can greatly reduce the computation complexity. By taking the advantage of set, esProc can solve many problems agilely and easily that are hard to solve with SQL.

III Ordered Set

Case Description

Suppose that a telecommunication manufacturer needs to compute the link relative ratio of sales value compared with that of last month (i.e. the increase percent of sales value of every month compared with that of last month). The sales data is stored in the sales table with the main fields including salesMonth, and salesAmount.

SQL Solution

select salesAmount, salesMonth,

(case when

prev_price !=0 then ((salesAmount)/prev_price)-1

else 0

end) compValue

from (select salesMonth, salesAmount,

lag(salesAmount,1,0) over(order by salesMonth) prev_price

from sales) t

The currently popular SQL-92 does not provide the concept of serial number, which adds many difficulties to the computation. Considering this, the designer of SQL-2003 partly makes up this disadvantage. For example, the window function lag() of SQL-2003 is used to retrieve the next record in this example.

In addition, in the above statement, the “case when” statement is used to avoid the error of division by zero on the first record.

esProc Solution

sales.derive(salesAmount / salesAmount [-1]-1: compValue)

The derive() is an esProc function to insert the newly computed column to the existing data. The new column is compValue by name, and the algorithm is “(Sales value of this month/Sales value of previous month)-1”. The “[n]” is used to represent the relative position, and so [-1] represents the data of the previous month.

On the other hand, for the data of the first record, the additional procedure for division by zero is not required in esProc.

Comparison

From the above example, even if using SQL-2003, the solution to such problem is lengthy and complex, while the esProc that support the ordered set is simple and clear.

Moreover, SQL-2003 only provides the extremely limited computation capability. For example, esProc user can simply use the ”{startPosition,endPosition}” to represent the seeking of a range, and simply use ”(-1)” to represent the seeking of the last record. Regarding the similar functionality, it will be much harder for SQL user to implement.

In the practical data analysis, a great many of complex computations are related to the order of data. SQL users are unable to handle such type of computations as easily as esProc users because SQL lacks of the concept of Being Ordered.

IV Object Reference

An insurance enterprise has the below analysis demands: to pick out the annual outstanding employees (Employee of the Year) whose Department Manager has been awarded with the President Honor. The data are distributed in two tables: department table (main fields are deptName, and manager), and employee table (main fields are empName, empHonor, and empDept).

empHonor has three types of values: null value; ”president's award”, PA for short; and ”employee of the year”, EOY for short. There are 2 groups of correspondence relations: empDept and deptName, and Manager and empName.

SQL Solution

SELECT A.*

FROM employee A,department B,employee C

WHERE A.empDept=B.deptName AND B.manager=C.empName AND A.empHonor=‘EOY’ AND C.empHornor=‘PA’

SQL users can use the nested query or associated query to solve such kind of problems. In this case, we choose the association query that is both concise and clear. The association statement behind the “where” has established the one-to-many relation between deptName and empDept, and the one-to-one relation between manager and empName.

esProc Solution

employee.select(empHonor:"EOY",empDept.manager.empHornor:"PA")

esProc solution is intuitive: select out the employee of “EOY” whose Department Manager has be awarded with “PA”.

Comparison

The SQL statement to solve such kind of question is lengthy and not intuitive. In fact, the complete association query language is “inner join…on…” style. This statement is simplified in the above example. Otherwise it will be much hard to understand.

esProc users can use ”.” for object reference. Such style is intuitive and easy to understand. The complex and lengthy association statement for multiple tables can thus be converted to the simple object access, which is unachievable for SQL. When there are more and more tables, the complexity of SQL association query will rise in geometric series. By comparison, the esProc user can always access the data intuitively and easily by taking the advantage of object reference.

Regarding the multi-table associations of complex computation, esProc can handle it more intuitively and conveniently than SQL.

From the comparison of the above four examples, we can see that esProc is not only characterized with stepwise computation, complete set-lization, sorted sets, and object reference. The analysis style is intuitive, the syntax style is agile, and the function is powerful. esProc is a specific tool for mass data computation, and a query language of the new generation over-performing SQL.

menu

May 8, 2012

A Query Language Over-perform SQL

No comments:

Post a Comment