esProc, A Script Language for Data Analytics with Parallel Mechanism: Differences and Similarities between esProc TSeq and SQL Data Table

esProc TSeq and SQL data table are all the structured 2-dimensional data objects. The records, indexes, and primary keys can all be applied to the structured data computing. Although both of them can be applied to the structured data computing, their application scenarios differ distinctly. TSeq is fit for solving the complex computing problem involving not-so-big data volume, while data table is fit for the common computation involving possible big data volume.

Their differences are determined by the underlying mechanism.

TSeq is ordered. Each record and data in each column has a specific sequence number. TSeq supports the explicit set, and set operations can be performed directly between multiple TSeq; TSeq is also the generic set. The basic element can be numeric value, reference, or another set.

SQL data table does not have the above features. But SQL data table is transparent to the memory and external storage. A consistent syntax is used to access to the memory, external storage, or blended data.

In the below sections, we will further discuss their differences and similarities.

Common basic functions

Both esProc TSeq and SQL data table are the structured 2-dimensional data objects. In other words, they are row-style 2-dimensional table based on records with one row for one record. The 2-dimensional table together with multiple column names forms a complete data structure. Because the structures are largely similar, their basic usages are almost the same.

Example 1:Query the data in the object. Find the orders in which the number of Freight is greater than 100 and placed before 2013.

SQL: SELECT * FROM Orders WHERE Freight > 100 AND OrderDate<'2013-01-01'

TSeq: =Orders.select(Freight > 100 &&OrderDate< date("'2013-01-01"))

Note:In this example, the data object name is Orders, and another data object of Customers will be used in the subsequent sections.

Example 2: Sort. Sort the orders by EmployeeID in ascending order, then sort by Freight in reverse order.

SQL: SELECT * FROM Orders ORDER BY EmployeeID ,Freight DESC

TSeq: =Orders.sort(EmployeeID,Freight:-1)

Example 3:Grouping and summarizing. Group by employee, summarize the freight charge, and count the orders.

SQL: SELECT EmployeeID, COUNT(OrderID), SUM(Freight) FROM Orders GROUP BY EmployeeID

TSeq: = Orders.groups(EmployeeID;sum(Freight),count(OrderID))

Example 4:Join. Join the two data objects Orders and Customers to form a new data object. Use the left join and the join field is CustomerID.

SQL: Select * from Orders left join Customers on Orders. CustomerID =Customers. CustomerID

TSeq: =join@1(Orders:, CustomerID; Customers:, CustomerID)

Besides the above several basic usages, esProc TSeq and SQL data table are very alike in the algorithms like distinct, count, sum up, average, maximum, and minimum, just a few example here.

Difference of being in order

Thanks to the ordered record set of TSeq, the order-related computing can be solved easily. SQL data table lacks the sequence number and sequence-number-related access method, making it a bit inconvenient to handle the order-related computing.

Example 1: As for the sales data object, compute the sales increment of this month compared with that of the previous month.

SQL:

select salesAmount, salesMonth,

(case when

prev_price ! =0 then ((salesAmount)/prev_price)-1

else 0

end) compValue

from (select salesMonth, salesAmount,

lag(salesAmount,1,0) over(order by salesMonth) prev_price

from sales) t

TSeq:

sales.derive(salesAmount / salesAmount [-1]-1: compValue)

The monthly sales have nothing to do with orders. It is represented as salesAmount in the TSeq and the data table. The one related to the order is the Sales amount of previous month, which is equivalent to the salesAmount of the previous record relative to the current record. TSeq is ordered, sosalesAmount[-1] can be used to represent the sales of previous month. By comparison, SQL data table is not ordered. Since SQL2003 standard was introduced, the order-related features are added to the window functions. It is so tedious to use the unavoidable complex method lag(salesAmount,1,0) over(order by salesMonth)to compute the sales of the previous month.

TSeq can also be used to represent the relative interval, for example, the total five months comprising the current month, the two months before, and two months after the current month. With TSeq, it can be represented like this: salesAmount{-2,2}. In SQL, the window function can also be used to represent the summarizing, but much more troublesome.

Example 2: For the sales data table, find the first 10 records with the highest sales for each product.

SQL:

select salesMan, product ,amount

from ( select salesMan, product ,amount, rank() over (partition by product order by amount desc ) ranking from sales)

where ranking <=10

TSeq:

= sales.group(product).(~.top(-amount;10))

In this example, the most intuitive thinking pattern to implement is to group data by product, and then perform the order-related computations in the groups. The simplest method is to retrieve the first ten records with the highest amount in the group. The intuitive method is to sort the data in the group in reverse order by amount, and then get the records whose sequence numbers in the group is from 1 to 10.

TSeq supports the order-related computation quite well. For TSeq, not only the function top can be used to implement the first algorithm, as demonstrated in the example, but also the sort function and the sequence number of record can be used to implement the second algorithm, as shown below: =sales.group(product).(~.sort(Amount:-1)).(~([to(10)])).

The record in the SQL data table is not ordered. We must compute out a sequence number or field to act as the sequence number, such as ranking. The algorithm adopted in the above example is to compute the data rankings in the group, and then get the records among the top 10. Obviously, SQL syntax is a bit zigzag and difficult to understand. SQL users will have to use the window function over (partition by…… order by……) and the sub-query which is hard to track and debug.

By comparison, TSeq is more straightforward and simpler for easy track and debug. For another example, programmers can firstly write =sales.group(product) code for test. This code represents that the data will be grouped. This code snippet can run and display the result independently. If the grouping results meet your expectation, then programmers can proceed to add the second paragraph of codes: sort the data in the group in reverse order, that is, .(~.sort(Amount:-1)). In which, the “.” indicates the former computing result will be taken as a whole for further processing, “~” indicates the data in the current group, and “-1” indicates the reverse order. The current code line is =sales.group(product).(~.sort(Amount:-1)), which can run and display the result independently. If the grouping results meet your expectation, then programmers can proceed to add the third paragraph of codes: retrieve the records whose sequence numbers in the group is 1-10, i.e. .(~([to(10)])).

As can be seen, the computational procedure of TSeq can be pushed forward step by step. In solving the complex computing problems, the computation can be simplified to avoid errors. In facts, the above three continuous paragraphs of code can be written into three rows of codes step by step, so as to decompose the computing goal more clearly:

One thing worthy of noticing is that SQL cannot perform data sorting before order-related computing,even with the temporary table.The two steps must be combined into one step. The reason for this is that the data table does not support the explicit set and generic set. It is rather weak regarding its syntax representation.

In addition, although SQL uses the window function up to the ANSI standard, the database vendor does not implement it fully to the standard. The writing styles vary for various databases and some databases does not provide the window function at all. The function syntax of TSeq is independent of data source. For whatever data sources such as database, TXT file, Excel file, and binary file, programmers are not required to modify the code when using TSeq for computing.

Difference between explicit sets

Although SQL has the concept of set, the explicit set is not provided, and cannot be taken as the independent variable. The set operations can only be implemented with temporary table. TSeq is the true explicit set for implementing the set operation.

Example: Compute over the Contract data object. Suppose the contract with over 40 order quantity is the big contract, and the unit price over 2000 is the important contract. Please find the contract settled in this year which are both the big contract and the important contract, and all other contracts except these contracts.

SQL:

select SellDate,Quantity,Amount,Client from Contract where to_char(SellDate,'yyyy')='2014' and quantity>=40 and AMOUNT>=2000

select SellDate,Quantity,Amount,Client from Contract where not(to_char(SellDate,'yyyy')='2014' and quantity>=40 and AMOUNT>=2000)

TSeq:

=thisYear= Contract.select(year(SellDate)=2014)

=big= Contract.select(Quantity>40)

=importance = Contract.select(AMOUNT>2000)

=answer=thieYear^big^ importance

=others= Contract\answer

Comparison and Analysis:

The typical natural thinking pattern is to find the contracts that are big contract and also the important contract of this year. The intersection operation is most intuitive. If the big contract is defined as big, the important contract is defined as importance, and the contract in the current year is defined as thisYear, then we can write the pseudocode easily: big∩importance∩thisYear. TSeq is the explicit set capable of representing the equivalent expression very intuitively, i.e. thieYear^big^importance. SQL does not allow for representing it with the variables for set. So, we have to find another way, for example, converting it to the Boolean condition, as demonstrated in the example: to_char(SellDate,'yyyy')='2014' and quantity>=40 and AMOUNT>=2000.

It is easy to solve the first problem because the respective development difficulty with these two tools are not great. With the further research into the problems, their differences become evident.

Problem 2: “Other contacts except for those big and important contracts” - it is also a typical thinking pattern - the computing goal can be achieved with difference set in one step. TSeq expression:Contract\answer, very intuitive. Using Boolean conditions, SQL can also be used to compute out the answer. But the expression writing style and business description vary greatly, as shown below: not(to_char(SellDate,'yyyy')='2014' and quantity>=40 and AMOUNT>=2000).

With SQL, we can solve it using the set operation, and the algorithm is also quite intuitive. However, the code may appear very lengthy since the data table cannot be represented with the set variables.

(select SellDate,Quantity,Amount,Client from Contract)

minus

(Select select SellDate,Quantity,Amount,Client from Contract from(

(select SellDate,Quantity,Amount,Client from Contract where to_char(SellDate,'yyyy')='2012')

Intersect

(select SellDate,Quantity,Amount,Client from Contract where quantity>=40)

Intersect

(select SellDate,Quantity,Amount,Client from Contract where AMOUNT>=2000))

Because the code is lengthy, many people would rather use Boolean condition to implement the set operations indirectly.

Undoubtedly, in most cases, it is more convenient for data table to use operation set than Boolean condition, for example, the set operations between multiple physical tables, or the set operations between multi-level sub-queries. In this case, the cost would be high if converting the set operation to the Boolean conditions, and programmers have to adopt the lengthy set operation.

Difference between generic sets

TSeq is the generic set to store both the physic data and the reference to the associated data. With this advantage, TSeq can implement the associated computing through the intuitive object reference. By comparison, the data table can only be used to store the physical data. The equivalent computation can only be completed using the associated complex statements.

Example:

Please compute which department managers received the President Award, and of these managers, which subordinates are the annual outstanding employees. This involves two data object:department and employee, in which, the deptName field of department and the empDept field of employee is in the one-to-one relation. Still, the manager field in department and the empName field in employee is also the one-to-one relation. In addition, the code for president award is PA; and the code for annual outstanding employee is EOY.

Data table:

SELECT A.*

FROM employee A,department B,employee C

WHERE A.empDept=B.deptName AND B.manager=C.empName AND A.empHonor=‘EOY’ AND C.empHornor=‘PA’

TSeq:

employee.select(empHonor: "EOY",empDept.manager.empHornor:"PA")

Comparison and Analysis:

The SQL solution can definitely work out.But the relevant associating statement is fairly complex that the average programmer could feel hard to comprehend. By comparison, TSeq solution is rather intuitive, empHonor:"EOY"is one of the conditions: “Who are Year Outstanding Employees”, while empDept.manager.empHornor indicates the “the department.the department manager.awards received by the department manager”. Obviously, if this value is PA, then the condition 2 in the problem is met: “Department managers who received the president award”. This is the object reference by definition.

The object reference enables programmers to use “.” operator to make reference to the related data. By this way, the business association relation can be translated into the computer language intuitively, so that multilevel relations can be represented conveniently, and the associating computation can be performed intuitively.

Differences regarding Memory and External Storage Transparencies

SQL

Because SQL data table does not support the generic and set data,no data will be lost when writing the data from memory into the external storage. Thus, it is transparent to the computation over memory and external storage. On accessing the data table for the first time, data are usually from the external storage; then, for the subsequent accesses to the same data table, data can be from the memory cache; for the data table with relatively large volume of data, part of the data can be from the external storage, and part from the memory. No matter the data is from memory or external storage, and data volume is big or small, the syntax for retrieving data table is always the same, and programmers need not to write different SQL statements for it.

TSeq

TSeq supports the generic type (reference in particular) and set data. Data could lost when writing the data from memory to the external storage, and can not always ensure to be loaded properly, which making the computation intransparent to the memory or external storage. TSeq is the pure memory data object, and is only capable of handling limited data volume; If the data volume is relatively big, then cursor (another data object of esProc) should be used for the computation in external storage, with different syntaxes for the cursor and the TSeq; esProc programmers also must convert the data between cursor and TSeq if they want to improve performance or handle the computation involving complex business logics.

Comparisons

TSeq is not transparent to the memory and external storage. esProc programmers need to write different codes to meet the requirement of computing in memory, external storage, or mixed computing. In addition, they need to modify the codes to meet the needs of data volume increase. Therefore, the workload for the initial design and post-maintenance is relatively great. SQL data table is transparent to the memory and external storage. Programmers only need to write one set of code to fit various scale of data. The designing and maintenance workloads are relatively small.

Through the above comparison, we can conclude: Characterized with the ordered data, explicit set, and generic set, TSeq can easily solve the order-related complex problem and reduce the complexity of set operations, with support for the intuitive object reference to handle the complex multi-table association. SQL data table is transparent to the memory and external storage, and SQL code is commoner.

menu

July 23, 2014

Differences and Similarities between esProc TSeq and SQL Data Table

Common basic functions

Difference of being in order

Difference between explicit sets

Difference between generic sets

Differences regarding Memory and External Storage Transparencies

No comments:

Post a Comment