July 29, 2014

Debugging Function Comparison RLanguage v.s. esProc

As is well known, in the development process of program, the time consumed to remove and correct the error is usually greater than time spent in coding. Therefore, a friendly debug environment can save a lot of time. In this respect, VB.NET and SQL are two extremes that the former provides almost a perfect Debug environment, while the latter nearly provides no error debugging tool.
R language and esProc as two development tools for computation and analysis are both capable to debug to some degree. We will study on their differences in this respect.

Let’s kick off by making ourselves familiar with the debugging environments of both R (take R Studio for example) and esProc from their respective interfaces:

R Studio Debugging Environment:
 
esProc's Debugging Environment

 

Let's compare the basic functions. 
Break point: For R, the break point is set by inserting browser () into the codes. Users have to remove these statements manually once debugged, which seems to be  back to the cherished old times of using BASIC to code when Windows was not invented, impressed us with a strong feeling of reminiscence. In those days, removing the stop break point statement is even an important job before releasing codes. By comparison, the break point style of esProc is similar to that of VB.NET and other alike modern programming languages. By clicking the button or pressing shortcut keys, the break point can be set to the cell in which the mouse cursor is located. This is nothing special.

Debug command: with the same style of break point, debug command of R is input from the console, including to resume running,n to go run the next statement, and Q to exit the debug mode. In addition, there are also functions like trace/set Break point/debug/undebug/stop.It is important to note that it would be best not to have any variable named after c, n, and Q in the codes. Otherwise, accidental conflicts will occur.

Regarding the procedure control, esProc is no different than VB.net and like programming languages for just requires click(s) on button or shortcut keys to implement, not requiring users to memorize any command, as we all know. 

Variable watch: The variable watch window of R language is on the right, in which all current variables will be listed. On clicking it, a new window will prompt to display the value of this variable. Alternatively, R language users can also enter the fix(variable name) at the command line window as shown below. In the right bottom corner of esProc user interface, there is a similar variable list. Seldom do esProc users use this list because esProc does not require users to specially define the variable name. The name of cell is taken as the variable name by default, and thus users can simply click the cell to review the variable value. 

One thing to notice is that R is friendly to display the variables of Frame format. However, it is comparatively not so friendly to support the irregularly-structured variables that we can say it is unreadable at all, as the below typical List for example:
 

esProc does a much better job in this respect. For the same data, in esProc, it is represented by drilling through the hyperlinks:
 

Then, let’s compare some more advanced functions, and start from checking the Immediate Running first.

As for esProc, a cell will be calculated immediately and automatically once codes are entered into this cell. Therefore, the developers can view the execution result immediately and adjust the code for re-run on conditions. This style can speed up the development speed and lower the probability of errors, allowing the green hand to become familiar with it quickly. R Studio provides the similar means that more resembles the "immediate window" of VB, that is,user's type in codes at a command line window and run immediately. If it is run correctly, then copy the codes to the formal code section. Judging on the whole, R is less convenient than esProc in this respect.

Finally, let us discuss the function to debug the functions separately.

R users can use the debug(Function Name) to debug the functions separately and directly so as to modularize in development and implement the large-scale test. esProc users,on the contrary, are not allowed to debug the function separately, which is a pity more or less. However, the debug function of R language has not implemented the true “separate” test. Its working principle is actually to add a browser () command prior to the function to be debugged, still requiring running all codes before entering the function to debug. 

From another perspective, such computational analysis software is rarely used for the large-scale development and test. There is not much significance and value for its ability to debug function separately. 

Through the above comparison, we can see that both R and esProc provide some debugging functions. In which,esProc is performing better in terms of convenience and usability.

Set Operators in esProc

Since sets are commonly used in esProc, the latter provides comprehensive set operations.

1. Binary Operation on Sets

The most basic set in esProc is sequence. Let’s look at some basic binary operations between two sequences A and B.

  • A|B

Concatenate: Concatenate the two sequence straightforwardly, of which the member of B is added behind the members of A. If A or B or both A and B are the single-value member instead of sequence, then it will be handled as a single member sequence.

  • A&B

Union: Join the two sets and remove the members of B that already exist in A. If one of A and B or both vales are the single-value member instead of sequence, then you can take it as single member sequence to process.

  • A^B

Intersection: Intersection set of A and B. Get a sequence composed of members that not only exist in A but also exist in B.

  • A\B

Complement: Members in A but not in B. If B is not a sequence, then treat it as the single member sequence.

  • k*A

A|A|…|A, copy A for k times, positions of k and A are interchangeable in the expression.



These basic operations - concatenate, complement, intersection and union - can also be used to deal with more complex set operations. Such as, to seek the "inverse intersection" of sets A and B, i.e., the set composed of all that is not the common members of A and B.
 

The "inverse intersection" of two sets is the complement of their union and intersection. With this kind of transformation, more complex binary operations will be completed.

2. Alignment Arithmetic Operation on Sets

Two sequences which have the same length will perform the alignment computation on members, and return a new sequence.

  • A++B

[A(1)+B(1),A(2)+B(2),…]

  • A--B

[A(1)-B(1),A(2)-B(2),…]

  • A**B

[A(1)*B(1),A(2)*B(2),…]

  • A//B

[A(1)/B(1),A(2)/B(2),…]

  • A%%B

[A(1)%B(1),A(2)%B(2),…], the % here is the Mod computation.

  • A\\B

[A(1)\B(1),A(2)\B(2),…], the slash here means integer division


3. Comparison of Sets

In esProc, the function cmp(A,B) can be used to compare the sequence A and B.

  • cmp(A,B)

Compare the member values of two sequences at the same location one by one, and return the 1 or -1 when encountering the first unequal member, if A is identical to B, then return 0. Specifically, cmp(A) or cmp(A,0) will compare A with the sequence with the same length and members are all 0, i.e. cmp(A,[0,0,…,0]).


Comparison of two sequences can be briefly expressed as A==B, A>B.

Note that a sequence in esProc is an ordered set, so order plays an important role in comparing the size of two sequences A and B. A.eq(B) is used to see if the two sequences have common members.



Because the member order in A1 and A2 is different, result in A3 is false, showing the two sequences are not equal. While result in A4 is true, showing the two sequences have the same members.

July 28, 2014

Implement Basic Functions of SQL through R Language & esProc

As we know, SQL users have to finish composing all codes and then run them all at one time, resulting in inconvenient interactive computing. However, the simple and easy-to-understand query syntax of SQL is always welcomed by programmers. R language and esProc as the computing and analysis tool are surely required to offer the similar query syntax. In this essay, we will discuss how they implement the basic functions of SQL through R language and esProc through some examples.

The example data is from 2 tables of the classical Northwind database:
Orders table with the main fields: OrderID, EmployeeID, OrderDate, Freight, CustomerID

Customer table with the main fields: CustomerID, CompanyName

Retrieve data of the entire table
SQL: select * from Orders

R: A1<-sqlQuery(conn,'select * from Orders')

esProc: $select * from Orders

Comments: The word count of esProc and SQL codes only differs by one word. The R language has a strong flavor of programming.

Where: Search for the order which has a freightage higher than 100 and is placed before the year of 1998
SQL: select * from Orders where Freight > 100 and OrderDate < '1998-01-01'

R : subset(A1,Freight > 100 & OrderDate < as.POSIXlt('1998-01-01'))

esProc: =A1.select(Freight > 100 && OrderDate < date("1998-01-01"))

Comments: Both SQL and R solutions are close to SQL to some extent. R is of the typical function style, and esProc is of the typical object style.

Order: Sort by employees correctly, and then sort by freightage in reverse order
SQL: select * from Orders order by EmployeeID, Freight desc

R: A1[order(A1$ EmployeeID,-A1$Freight),]

esProc: =A1.sort(EmployeeID,Freight:-1)

Comments: R solution is to retrieve 2 vectors from A1 at first, pass them to the function order to group them together for sorting, and then export the serial numbers. At last, data will be rearranged according to the serial number. The computing process of R language is rather winding, not as straightforward as SQL. It is because that R is good at handling vector, the access to structured data will take column as the basic unit, and the parameter usually takes up a whole column. By comparison, SQL takes the record (column) as the basic unit, with parameter as the column name.

esProc solution resembles that of SQL because esProc takes the record (row) as the basic unit .

Group & Sum: Summarize by employee, sum up the freightage, and count the orders:
SQL: select EmployeeID, count(OrderID), sum(Freight) from Orders group by EmployeeID

R :
A4<-aggregate (A1$Freight,list(A1$EmployeeID),sum)  
A4$count<-tapply(A1$Freight,A1$EmployeeID,length)

esProc : =A1.group(EmployeeID;EmployeeID,~.count(),~.sum(Freight))

Comments: In this case, it is obvious that R and SQL differ greatly. The algorithms available in R may be clearer for mathematicians, and means more learning efforts for users accustomed to SQL.

Join:  Perform left join on Orders table and Customers table by CustomerID.
SQL :select * from Orders leftjoin Customers on Orders.CustomerID=Customers.CustomerID

R :merge(A1,B1,by.x="CustomerID",by.y="CustomerID",all.x=TRUE)

esProc: =join@1(A1:CustomerID:Orders, B1:CustomerID:Customers)

Comments: The join of SQL equals to join of esProc or the merge of R. Similarly, the left join of SQL equals to join@1 of esProc, or merge(...all.x=TRUE) of R. Obviously, esProc is more alike SQL in the respects of both the syntax conventions and the literal meanings.

Distinct: Remove the duplicate CustomerID
SQL: selectdistinct CustomerID fromOrders

R: unique(B2$CustomerID)

esProc: =B2.id(CustomerID)
Comments: The keywords of the two solutions respectively differ to that of SQL. However, their usages are basically the same to that of SQL. In which, R is the typical function style, and esProc is the typical object style.

Like: Search for the record with Island in ShipName
SQL : select * from Orders where ShipName like '%Island%'

R: subset(A1,grepl("Island",ShipName,ignore.case = TRUE))

esProc : =A1.select(like@c(ShipName ,"*Island*"))

Comments: R supports several means to match, including the regular expressions, and is more powerful than esProc in this respect. The usages of esProc are more close to that of SQL, and fit for those who are familiar with SQL.

As can be seen from the above comparison, esProc has a coding style more close to that of SQL since esProc supports the data structure of “Record” by the infrastructure, making it more suitable for users who are familiar with SQL. In addition, compared with R language, esProc provides the representation style of "object + functions" that is much easier for those who are familiar with database to accept.
R is more resourceful in details and third party support, ideal for mathematicians.

Grouping Function Comparison R Language vs. esProc

Grouping is to allocate the samples into several groups according to a specific flag. There is a difference between groups and the relative commonness shared by group members. The grouping plays an important role in statistical analysis. For example, the type grouping is used to differentiate the types of economy, society, sciences, and other phenomena. The structural grouping is used to study the internal structure, and the analysis group is used to analyze the coexistence relation between data.

As the data analysis language, both R language and esProc provide the rich functions of grouping. Let’s use some examples to have an idea of their difference. In these cases, we will use the Orders table from Northwind database as the sample data.

Basic grouping: group by a certain column, for example, view data by employee.
   R:orderByEmp<-split(result,result$EmployeeID)
 
   esProc:A2=A1.group(EmployeeID)

Regarding the basic functions, both R and esProc implement it well. In addition, users can expand the basic grouping functions, such as, group by multi-columns, group and summarize concurrently, first group and then summarize, and continuous grouping at one level after another, and the inter-row computation on data to be grouped. 

The basic grouping, in effect, can be characterized as follows: the original member will always be assigned to a certain group, and no duplicate member is allowed. This is the completely-divided grouping that is championed by the relation algebra (i.e. SQL). In some cases, the conditions are even more complicated. For example, the Marketing department sent a list of advertisements & regions (AdCountry by name). These regions are the location where the advertisement campaigns are intensively launched. Currently, we need to analyze the order conditions in these regions. Such conditions are characterized as follows:  
The advertisement list has definitely less countries than those in the Orders table because it is the advertisement for partial countries.
The advertisement list may comprise more countries than those in the Orders table because it is quite normal for some countries having no orders.

Such type of grouping can be referred to as "incompletely divided grouping". This is not supported by SQL theory and hard to implement. Let's check whether R language and esProc can over-perform SQL in this respect:

R solution:
   AdCountry<-c("USA","Finland","Canada","NotInOrders")
   AdCountryFac<-factor(result$ShipCountry,levels=AdCountry)
   groupByAd<-split(result,AdCountryFac)

esProc solution:
   A3=["USA","Finland","Canada","NotInOrders"]
   A4=A1.align(A3,ShipCountry)

Comments:Both R and esProc can solve this problem well. The data is grouped into 4 groups. There are some data in the first 3 groups, and the last group is empty, as expected. 

Let us then check the grouping on simple conditions: classify the freightage into 3 categories of 0-30, 30-100, and 100. 

R solution:
   Frekind<-cut(result$Freight,breaks=c(0,30,100,Inf))
   orderByFei<-split(result,Frekind)

esProc solution:
   A5=["?<=30", "?<=100&& ?>=30" ,”?>100”]
   A6=A1.enum(A5,Freight)

Comments: Both solutions solve the problem perfectly. However, you may have noticed that the representation of esProc is much more flexible. For example, esProc users can carry out the Boolean judgment on character string or data of other types. They can also compute the Boolean expression on 2 fields concurrently. By comparison, it is much more complicated for R language users to implement the similar functions. Since R language users can only perform grouping on one field of numeric type to group it into a non-overlapped range/category. The limitations are really not just a few.

Then, let us check a case of much more complicated example of grouping on conditions. For example, the freightage belongs to these 3 categories: the 5$-15$ is the freightage range that is most easily to be accepted by users, the low freightage range is for those below the 50$, and the high freightage range is for those higher than 50. In this case, there is some overlapped area of category 1 and category 2. Then, the record whose freightage below 10$ must exist in both these 2 groups.

R solution:
   subset(result,Freight>=5 &  Freight<=15)->g5to15
   subset(result,Freight<=50 )->g0to50
   subset(result,Freight>50 )->g50toinf
   gFeight<-list(g5to15=g5to15,g0to50=g0to50,g50toinf=g50toinf)

esProc solution:
   A5=["?<=5 &&?>=15", "?<=50" ,”?>50”]
   A6=A1.enum@r(A5,Freight)

Comments: R does not provide the function / feature to implement the grouping on complicated conditions. In fact, such grouping is made out awkwardly. So, it by no means resembles the "grouping action". esProc solution is the same to that for the previous example. In which, @r is used to indicate that the duplicates are allowed in the groups. Such syntax style is flexible, and you can implement lots of functions or features on the basis of the limited number of functions, not having to name any new functions. 

As can be seen from several examples above, R language can be used to implement the advanced grouping function. R language is more powerful than SQL in this respect. However, it is still behind esProc regarding the flexibility and usability.

July 27, 2014

esProc’s Option Syntax

1.esProc's function options

Many functions in esProc can use function options with which the same function can have different work patterns. The basic format of function options is f@o(…) in which o is f function's option. For example:

We can use interval function to compute the days between two dates. In expressions in A3 and A4, @y and @m are interval function’s options. With these options, the function will use the year and the month as computational units in computing time interval. With function options, a function can meet multiple needs, extend its functions and avoid too many function names or function parameters.

2. Common function options
Some function options are common in esProc. They can be used by many functions.

  • @1 and @a

@1 option and @a option are frequently used by functions to position, select and join, such as, A.pos(), A.select(), A.pmax(), A.pselect(), A.minp(), P.align(), etc.
Use of @a option can make functions that return the first query result by default return all eligible results. Contrary to @a, functions that return multiple members by default will return only the first query result by using @1 option.

We’ll look at the use of these two options through some examples:



By the way, since digit 1 is difficult to distinguish from the lowercase letter l, in most cases, the former is used in esProc's options.
One other thing need to be made known is that the same option in different functions can have different meanings. For example, @a used in position functions means returning all results, while in functions for file writing, like f.write() and f.export(), it means appending.

  • @z

@z option is often used in some functions relating to order, like sort, position, select, etc. Such as A.rank(), A.sort(), A.pos(), A.pselect(), A.select() and so on.

With @z option, position or select in a sequence or a table sequence will be executed from back to front. Thus in this example, A2 returns the position of the last 2 and A3 returns results of sorting in descending order. Records obtained by A5 are also sorted in descending order.

  • @b

@b option is often used in functions to position, select, etc., like A.pos(), A.pselect(), A.select(), and so on. The use of @b option is usually accompanied by binary search algorithm, which is more efficient, in query, with the prerequisite that A is ordered; otherwise results may be wrong.


Since data in A1 are not sorted according to state ID, only one result is obtained with the binary search method when @b option is used in A2; while A4 obtains the correct results because data are sorted in A3.

But it is another thing when @b is used in functions to read and write, such as f.import(), f.export(), f.cursor(). In this case, it is the binary files that being read in or written out. In esProc, binary files have smaller storage size and faster access speed, therefore the use of binary files will bring higher efficiency.

3.Simultaneous use of multiple options

Multiple options can be used simultaneously in esProc when needed. There is no particular order among these options. For example:

Because both A 2 and A3 get positions of all the 2 in A1 from back to front, their results are identical. A5 gets the first city whose state ID is 5 from back to front.
Note that some function options are contradictory to each other, and cannot appear simultaneously. Such as @a and @1, @t and @b options of f.import().