
April 10, 2014

Leveraging esProc to Resolve a Heterogeneous Data Source Computation Issue with Crystal Report

Recently we resolved an issue with Crystal Report involving heterogeneous data source computation. Due to its complexity, the computation cannot be done with Crystal Report's built-in functionality, so esProc is used for the cross-database computation.

Project background: the customer recently rolled out a KPI system, which required modifying their original salary computation algorithm. In the past, the salary list was calculated mainly from each employee's basic salary, which is stored in the MSSQL database used by their accounting system. The new salary list consists of two parts: basic salary and performance-based salary. The performance-based salary is calculated from each employee's performance score, which is stored in the Oracle database used by the KPI system. Obviously, the new salary list has to be generated by cross-database computation over two heterogeneous databases.

The actual algorithm for computing performance-based salary is quite complicated. First, the algorithm differs by position: some positions are based on the basic salary level, while others are not; some are purely performance-based, while others consider both performance and the employee's years of service; there are also positions that have a performance score but no performance-based salary. Second, even among positions based on the basic salary level, the algorithm may differ, since the salary levels vary by position, and within each level the algorithm may differ again. Finally, the salaries for all employees need to be combined into one report.

For better understanding, we simplified the algorithm significantly and ignored tax implications. We limited the positions to two: “normal” and “sales”. Position “normal” has a performance score but no performance-based salary; its pre-tax salary equals the basic salary. For position “sales”, the pre-tax salary is the sum of the basic salary and the performance-based salary, where the performance-based salary is computed as follows:

For employees with basic salary below 2000: performance-based salary = basic salary * (performance score / 100)

For employees with basic salary between 2000 and 4000: performance-based salary = basic salary * (performance score * 0.9 / 100)

For employees with basic salary above 4000: performance-based salary = basic salary * (performance score * 0.8 / 100)

Thus we can see that to generate a complete salary list, we need to split the employees in the employee table in MSSQL into several groups (two after the simplification), compute the pre-tax salary for each group, and then combine them into one list. The computation differs by position: for employees with position “sales”, the performance table in the Oracle database needs to be joined in, with the pre-tax salary computed according to the respective salary level; for employees with position “normal”, no such join is needed.
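As a rough illustration of this algorithm, the whole flow can be sketched in plain Python. This is only a sketch, not the actual esProc script: the in-memory rows below are made up and stand in for the two databases, and the table and field names (employee, performance, empID, baseSalary, score) follow the article.

```python
def performance_salary(base, score):
    """Tiered performance-based salary for position "sales"."""
    if base < 2000:
        factor = 1.0
    elif base <= 4000:
        factor = 0.9
    else:
        factor = 0.8
    return base * (score * factor / 100)

# employee would come from MSSQL, performance from Oracle (made-up rows here).
employee = [
    {"empID": 1, "position": "sales",  "baseSalary": 1500},
    {"empID": 2, "position": "sales",  "baseSalary": 3000},
    {"empID": 3, "position": "normal", "baseSalary": 2500},
]
performance = [{"empID": 1, "score": 80}, {"empID": 2, "score": 90}]

score_by_id = {p["empID"]: p["score"] for p in performance}  # join key: empID

salary_list = []
for e in employee:
    if e["position"] == "sales":
        # basic salary plus the tiered performance-based salary
        pre_tax = e["baseSalary"] + performance_salary(
            e["baseSalary"], score_by_id[e["empID"]])
    else:  # "normal": pre-tax salary equals the basic salary
        pre_tax = e["baseSalary"]
    salary_list.append({**e, "preTax": pre_tax})
```

The two groups are processed with different logic and then combined into one list, which is exactly the shape of the cross-database computation described above.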

The difficulty in this report lies in two points: 1) the employee and performance tables belong to two heterogeneous databases, which requires cross-database computation; 2) the algorithm is too complicated: simply joining the two tables cannot do the job.

The ideal solution for cross-database computation is to do it through the reporting tool. If the reporting tool can process two heterogeneous data sources in one report, cross-database computation can be done at the report level. However, Crystal Report handles heterogeneous data sources in a very complicated way and with a high implementation cost. Moreover, a reporting tool can only perform simple inner and outer joins, not complicated computations such as conditional judgments in a loop or aggregation over multiple result sets.

Since the reporting tool cannot solve such issues, we have to turn elsewhere. Loading the data into a separate database with ETL is not a good choice, because ETL development is costly, and data synchronization as well as real-time updates need to be considered. With a user-defined data source the problem can be simplified significantly, and esProc can serve as a very good user-defined data source for reporting tools.

See the code below:



The code is easy to understand.

A1, A2: retrieve data from the Oracle and MSSQL databases. A3: add an empty column to the employee table to store the future pre-tax salary.

A4, A10: extract the employees with positions “sales” and “normal”. For later aggregation, business names are more convenient, so we name these two sets sales and normal respectively. Note that we did not define an extra variable for the temporary result in A3; instead, it is referenced in A4 by the cell name. The same goes for A1, which is referenced in A5.

A5-C9: compute the pre-tax salary for sales. A5 is a join between the basic salary of each salesperson and the corresponding performance score. A6 to C9 form a loop that computes the pre-tax salary for each row of sales according to the employee's actual salary level. Three things need to be noted here: 1) the loop body is indicated by indentation, with B7-C9 as the body; 2) the loop variable is in cell A6, after the “for” operator, and within the loop A6 refers to the current record; 3) A6.empID.score shows how object references work: it refers to the score field of the record joined via the empID field of the current record A6 (i.e., a record in performance), which is the current employee's performance score.
A11: replace the preTax value in normal with baseSalary.
A12: combine the computation result sets for the different positions. Of course, in reality the positions are not limited to two, and the algorithm for computing the pre-tax salary of each position is also more complicated than in the example above.
A13: Select some fields from A12 for output.
A14: output A13 through JDBC, so that Java code or reporting tools can retrieve it directly via a JDBC URL. We can see that this is also a way to unify heterogeneous data sources, whereas the data source consolidation in Crystal Report is too simple to handle such process-oriented cross-database computation.

Now the cross-database computation issue with Crystal Report is conveniently resolved. Moreover, Crystal Report can use esProc in the same way as an MSSQL or Oracle database, which is easy to set up.

October 14, 2013

Why esProc is needed in Hadoop


esProc is a brand-new parallel computing framework that supports reading and writing files in HDFS and is fully committed to improving the computational capability, performance, and development efficiency of Hadoop.

>Enhance Computational Capability
The computational capability of Hadoop is built on the Java language and the MapReduce parallel framework. Java is broadly used in many common applications, but it is not powerful enough for computation in many specialized fields. MapReduce lacks the library functions to support even the simplest data algorithms: there is no direct support for typical operations such as associated computation, sub-queries, inter-row computation, and ordered computation. Its computational capability is rather weak.

esProc is also a Java-based parallel framework, but it provides a script language designed specifically for big data computation. Working with HDFS, esProc can greatly improve the computational capability of Hadoop.

To boost computational capability, Hive SQL is packaged on top of MapReduce in Hadoop. However, the computational capability of Hive SQL is quite limited, since it is just a subset of SQL with no support for stored procedures, and it cannot complete complex data computations.
 
With a complete computational system and powerful capability, esProc can meet such computational demands effortlessly and solve complex data computing more easily than stored procedures do. esProc can also invoke the computational result of Hive, improving the computational capability of Hadoop by working with Hive.

>Boost Computational Performance
Built on a rigid framework, MapReduce is inflexible in decomposing and allocating tasks, extremely resource-consuming, and relatively poor at real-time processing. By comparison, esProc enables arbitrary task allocation. In extreme cases, the time spent on allocating tasks is only one ten-millionth of that required by MapReduce, and the parallel performance of esProc is superior.

In MapReduce, the intermediate result of cross-machine interactions must be stored in HDFS as a file. Although this is an advantage for fault tolerance, it also incurs significant latency. By comparison, esProc allows users to make a flexible choice according to the duration of the computation: the intermediate result can either be used directly to reduce latency, or stored in HDFS to increase fault tolerance.

It is awkward for MapReduce to complete common data computations such as multi-table association and year-over-year or link-relative-ratio comparisons. Implementing such computations with MapReduce workarounds or indirect solutions makes performance decline dramatically. By comparison, esProc provides native support for such computations, and the combination of esProc and HDFS boosts the computation efficiency of Hadoop dramatically.

Hive is still built on MapReduce, which implements common algorithms like associated computation at the cost of performance, usually resulting in performance an order of magnitude inferior to that of an RDB. The performance of esProc is close to, and in part superior to, that of an RDB. esProc can work with Hive via JDBC to undertake computational tasks with strict real-time requirements.

>Improve Development Efficiency
Even for the simplest computation, MapReduce users have to program manually, so development efficiency is low. Moreover, MapReduce requires relatively strong development skills and a greater workload to implement associated computation, ordered computation, equal grouping, and year-over-year or link-relative-ratio comparisons. Hive does not support stored procedures and still has to rely on MapReduce for even slightly more complex computations.

For common algorithms, esProc provides abundant library functions for direct use; for complex algorithmic logic, esProc provides agile syntax and a professional IDE for easy implementation. Working with HDFS and Hive, esProc can greatly boost the development efficiency of Hadoop. With true support for the set data type, esProc enables ordered sets and set-based groupings such as equal grouping, align grouping, and enum grouping. esProc scripts are written in a grid-style cellset, so users can reference intermediate computational results directly without defining anything.

The debugging facilities of MapReduce are so primitive that users can only identify errors by checking messages in the log file. By comparison, esProc supports breakpoints, step-by-step execution, run to cursor, start, end, and other specialized debugging functions that ensure development efficiency.

To define the task scale arbitrarily, MapReduce users have to customize the MapReduce framework, which is not only tough but also seriously compromises development efficiency. esProc is flexible in task allocation, and its development efficiency is quite high.

esProc has all the outstanding features of Hadoop: parallel computation on multiple nodes, inexpensive hardware for scale-out, and open external interfaces. In addition, esProc renovates Hadoop with a flexible parallel framework, a specialized script language for big data, agile syntax, and a professional IDE.

September 3, 2013

Powerful and Easy-to-use Data Computation Tool


esProc is a development tool for database computation. The esProc IDE is simple and easy to use, providing higher development efficiency than SQL. It is especially suited for reaching complex computational goals, acting as a data source computation tool for reports, or serving as the data computation layer of applications.

Characteristics: Tailored for Database Computation

1. Its basic data type supports structured data

The TSeq is the commonest basic data type introduced in esProc. Like the result set of SQL, a TSeq is a set of structured records; however, it is superior to a SQL result set in many respects owing to its support for access by sequence number, generic data, and explicit sets. The TSeq can be used for common computations on structured data, and is especially suitable for simplifying complex database computations.

2. Its syntax structure is especially tailored for the database computation.

The syntax of esProc is agile and efficient, designed especially for database computation. For example:

Basic filtering: A1.select(conditions)
Dichotomic filtering: A1.select@b(conditions)

Dichotomic filtering and filtering out the complementary set: A1.select@bx(conditions)

Getting the amount field for the last record: A1.m(-1).(amount)

The intersection of 2 result sets: A1^B1

The loop functions of esProc can simplify and streamline complex SQL, for example:

Add a computed column LRR with the year-over-year link relative ratio of amount: A1.derive(amount/amount[-1]-1: LRR)

Compute the moving average of the recent 3 days and assign it to column ma: A1.run(ma=~{-1,1}.(amount).avg())
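For comparison, the two loop computations above can be sketched in plain Python. This is only an illustration with a made-up amount list, not esProc code; the boundary handling of the moving-average window is an assumption that mirrors ~{-1,1}.

```python
amount = [100.0, 110.0, 99.0, 121.0]  # made-up data

# Link relative ratio: amount / previous amount - 1 (None for the first row,
# which has no previous value).
lrr = [None] + [amount[i] / amount[i - 1] - 1 for i in range(1, len(amount))]

# Moving average over the previous, current, and next value, clipped at the
# boundaries (mirroring the ~{-1,1} window in the esProc expression).
ma = []
for i in range(len(amount)):
    window = amount[max(0, i - 1):i + 2]
    ma.append(sum(window) / len(window))
```

The point of the esProc notation is that both computations fit in one expression, whereas general-purpose code needs explicit index arithmetic.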

3. Fascinating functions for database
In the esProc IDE, users can compose SQL statements directly and take advantage of the database structure browser, SQL wizard, and other facilitating functions, as shown in the figure below:





4. Retrieve and modify the structural data directly

Structured data is mainly stored in databases, and partly in Excel and text files. esProc supports direct retrieval and modification for all three types.

For example: first, retrieve the order information from the database and the client details from a text file; second, compute which clients have bought products of all kinds; lastly, write the result into an Excel file.
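A stdlib-only Python sketch of this example follows. The in-memory order rows and product kinds are made up and stand in for the database and text file, and CSV output stands in for Excel to keep the sketch dependency-free:

```python
import csv
import io

orders = [  # (client, product kind) -- made-up rows
    ("ACME", "red"), ("ACME", "white"), ("ACME", "rose"),
    ("Bolt", "red"), ("Bolt", "white"),
]
all_kinds = {"red", "white", "rose"}

# Collect the set of product kinds each client has bought.
kinds_by_client = {}
for client, kind in orders:
    kinds_by_client.setdefault(client, set()).add(kind)

# Clients whose purchased kinds cover every kind on offer.
full_range_clients = sorted(
    c for c, kinds in kinds_by_client.items() if kinds == all_kinds)

# Write the result out (CSV here; Excel in the original example).
buf = io.StringIO()
csv.writer(buf).writerows([[c] for c in full_range_clients])
```

In esProc the same retrieve-compute-write flow is expressed with cell formulas rather than explicit loops.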

Characteristics: Easy-to-use Development Environment

1. Cellset of grid style

The script of esProc is stored in a grid-like cellset, which saves the effort of formatting; scripts are naturally presented in a clear and readable way. For example, scripts auto-align to the ruler, and when composing judgment statements, loop statements, and other long statements, indentation indicates the scope of the computation straightforwardly. The figure is shown below:













2. Step-by-step computation

Step-by-step computation refers to decomposing the computational goal into several simple steps. This is the most effective method for solving complex computations. Because SQL lacks support for step-by-step computation, it is quite difficult to achieve complex computational goals in SQL. With its grid style, esProc achieves step-by-step computation and result reuse easily.











As shown above, the data filtering in cell B5 is completed easily. Click B5 to view the result of this cell directly on the right. Other cells, B6 for example, can reference the result of B5 directly by the cell name, with no need to define variables. B6 performs grouping and summarizing on B5, and B7 can continue to work on B6. In this way, the interactive computation can be carried on step by step: each step only performs one simple computation, yet together they solve the complex computational goal.
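Outside esProc, the same stepwise style can be sketched in plain Python, where each step is a named intermediate result that later steps reuse. The variable names b5, b6, b7 mirror the cell names above; the data is made up for illustration:

```python
rows = [  # made-up inventory rows
    {"type": "red",   "stock": 30},
    {"type": "red",   "stock": 50},
    {"type": "white", "stock": 20},
]

b5 = [r for r in rows if r["stock"] >= 25]   # step 1: filter

b6 = {}                                       # step 2: group and sum over b5
for r in b5:
    b6[r["type"]] = b6.get(r["type"], 0) + r["stock"]

b7 = max(b6, key=b6.get)                      # step 3: continue working on b6
```

Each step is trivially simple on its own; the complexity lives in the sequence, which is exactly what SQL's single-statement style makes hard to express.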

3. Debug functions
esProc provides complete debugging functions, with support for breakpoints, single-step execution, and run to cursor, as shown in the figure below:










Many long SQL/SP statements can only be composed and comprehended as a whole; their internal working details cannot be monitored, so their debugging functions are not practical. For example, esProc can implement grouping and summarizing in two separate steps, while SQL cannot.

4. Instant computation mode

esProc supports an instant computation mode that auto-computes after each step of scripting, with the result auto-appearing in the result column on the right side. With this mode, users can script and monitor the result at the same time, and the algorithm for the next step can be decided based on the characteristics of the current result. The computational procedure becomes more focused, and scripting becomes smoother and more natural. Neither R, SQL, nor other computational utilities have such an ability.

5. JDBC output

Since esProc supports a JDBC interface, other tools (reporting tools, for example) or Java code can retrieve results from esProc through JDBC. From the code reuse and maintenance perspectives, esProc can act as a loosely coupled data computation layer in the application. Regarding performance, esProc can off-load complex data computing from the database, greatly relieving the pressure on the database server.

6. Big data computation

esProc implements the Hadoop interface, can be called by MapReduce, allows easy reading and writing of HDFS, and supports big data computing directly. esProc is especially fit for complex data computation, which makes it superior to other big data computation tools.

7. Plotting arbitrarily
esProc supports graphic plotting functions and a graphic parameter editor. Not only can common statistical charts be generated directly, but the underlying chart element control functions are also made available, enabling users to plot personalized charts of any kind.



March 26, 2013

Problem: Further Computations on Grouped Data in Spreadsheet



Spreadsheets can usually group data easily, but further computation on grouped data is much more difficult. For example, the data below (a wine inventory) is shown after grouping and summarizing:


Say we are to compute the total amount of each wine, or even each type. In the spreadsheet, it is rather difficult to carry out these calculations on the grouped data above: calculate the “UnitPrice*Stock” amount; sort the data within each group by inventory level; add an additional level of grouping by Name to calculate the inventory of the same type of wine; calculate the average unit price of each type of wine; or even highlight the subtotal rows such as F4 and F10.

To calculate the “UnitPrice*Stock” amount, a new column G needs to be appended with the expression “=E4*F4”, which is then dragged down to every row of the column. Because UnitPrice is unavailable in the subtotal rows, the formula is invalid there and must be removed manually for each such row, which is error-prone when there are many types of wine.

To sort the data within each group by inventory level, you can only sort each group manually, since the data is allocated to several unrelated groups; for example, the first group spans rows 4~9. This is error-prone if the data volume is huge. Another solution is to ungroup, present the data in a 2D sheet without subtotals, sort by Type and Stock, and then summarize in groups again. In this solution, if the subtotal rows carry any derived data such as “proportion of subtotal to total”, users will have to recalculate it once the subtotal rows disappear during the procedure. All in all, both solutions have obvious drawbacks.

To add another level of grouping by Name to calculate the inventory of the same type of wine: because the data is not sorted by Name, it cannot be grouped directly, and we are back to the previous problem.

To calculate the average unit price of each type of wine, both available solutions are quite cumbersome. The first is to ungroup and paste the data into two copies, one for the average unit price calculation and the other for the inventory summary, then join the summary values of the two. The second is to write the formula in every subtotal row, for example “=SUBTOTAL(1,E4:E9)” in E3; since the ranges differ between groups, users have to write N different formulas one by one, and errors can easily occur.

To highlight the subtotal rows, users have no choice but to repeat the same simple action on each group manually. With no means to automate this, it is an irony for something called office automation.
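For readers who prefer code, the computations listed above (the amount column, the per-group sort, and the per-group average unit price) can be sketched in plain Python. This is not an esCalc script, and the wine rows below are made up for illustration:

```python
from itertools import groupby

rows = [  # made-up wine inventory
    {"type": "red",   "name": "A", "unit_price": 10.0, "stock": 5},
    {"type": "red",   "name": "B", "unit_price": 20.0, "stock": 2},
    {"type": "white", "name": "C", "unit_price": 8.0,  "stock": 4},
]

for r in rows:                       # amount column on detail rows only
    r["amount"] = r["unit_price"] * r["stock"]

rows.sort(key=lambda r: r["type"])   # groupby requires sorted input
result = {}
for wine_type, group in groupby(rows, key=lambda r: r["type"]):
    # sort within the group by inventory level, descending
    g = sorted(group, key=lambda r: r["stock"], reverse=True)
    result[wine_type] = {
        "avg_price": sum(r["unit_price"] for r in g) / len(g),
        "rows": g,
    }
```

Because the data stays in one flat structure, no subtotal rows get in the way of the formulas, which is precisely the pain point in the spreadsheet.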

In fact, the above problems can be solved with esCalc. esCalc introduces the concept of the homocell to represent the business relations between cells, enabling automatic processing of data with various business meanings. For example, if you input the formula =E4*F4 in G4 once, the formula is copied to all detail rows but not the summary rows, as shown in the sheet below:


To sort the data within each group by inventory level, likewise, just sort once in cell F4, F5, or F11; the respective positions are auto-identified, and the data in each group is auto-sorted, as shown in the sheet below:


The esCalc homocell mechanism can solve the other problems automatically as well; we will not dwell on it here, since it is the topic of another essay.


January 23, 2013

What's an Excellent Professional Reporting Software - Part I

A report is a tool that uses tables and statistical charts to present data. It is a basic need for developing business, a basic means of managing the enterprise, and a basic advantage in enhancing competitiveness. Reporting software is one of the commonest tools for business personnel. So what kind of professional reporting software is ideal for business personnel who lack the technical experience to develop reports by themselves?

For example: Bill is the sales director of a pharmaceutical products company. He urgently needs to prepare a sales report for specific products, so as to impress the clients in an important bidding event. This report needs to show the monthly sales of 3 kinds of products, the growth compared with the previous month, and the comparison with the same period of the previous year. However, no existing report was at hand, and the short lead time did not allow for requesting assistance from IT personnel. Therefore, the report had to be handled by Bill and the sales force themselves. Similar situations include:

A retail business is newly equipped with a supply chain management system. Besides the reports shipped with the system, some temporary reports may be required according to volatile market changes. The cost would obviously be quite high if a technical team were specially built to handle them; this job should be left to the business personnel.

A bank improves its business workflow and needs to prepare the corresponding reports. For data safety purposes, the business department needs to prepare the reports all by itself.

A medical products company needs to prepare a report about clinical practice. Preparing the report involves many concepts that can easily confuse the average technical person, so having business personnel prepare the report reduces errors.

As a professional reporting tool, it should first support rapid report building:

Take esCalc for example: it has a grid-style report design interface that is easy to learn and understand, and capable of building reports rapidly.

The whole interface is shown below:


The data zone is similar to an Excel spreadsheet, whose grid uses letters as column headers and numbers as row numbers. In a cell, you can reference other cells for calculation by the cell name, as shown in the figure below:


The toolbar is available for setting style elements for many cells at once, such as font, size, alignment, borders, merge & split, and a format painter to copy and paste styles, as shown in the figure below:

On the right, there is a cell property panel for setting cell properties. For example, input a formula once to give alternating rows a different color, as shown in the figure below:


Right-click a cell to show the context menu and perform grouping, querying, locating, and sorting operations, as shown in the figure below:


The grid-style design interface is simple and easy to use, saving the cumbersome data alignment procedure; referencing data by cell name avoids complicated variable definitions; and with the multi-level grid, data hierarchy and summarization are implemented more easily. In the grid, reporting becomes more intuitive, and the result is presented consistently and without distortion across design, preview, print, and export.
 
Related News from Raqsoft:

Raqsoft Organizes Training to Better Serve Customers
Instant Computing of esProc Brings Flexibility to Analysts
What Makes Self-service Statistical Computing Tools So Important?

November 15, 2012

Computation after Grouping: R Language vs. esProc

Original post:  http://it.toolbox.com/blogs/data-analytics/computation-after-grouping-r-language-vs-esproc-53756

For SQL, grouping and summarizing are inseparable and must be performed at the same time, which compromises its ability to support interactive analysis. By comparison, in R and esProc, users can group first and then decide whether to summarize or to carry out a more complex computation. For example, without summarizing, R or esProc users can perform inter-row computation within each group, or select one of the groups to regroup after studying the summarized values.


Some smart readers may have noticed that the latter is just OLAP drilling; we will discuss it in detail in another essay. Now let's focus on the subject of this essay and examine the respective characteristics of R and esProc.


Please find the Orders table from the typical Northwind database, as given in the example below.


Example 1: Group by year without summarizing.


R solution: orderByYear <- split(result, format(result$OrderDate, "%Y"))
esProc solution: A2=A1.group(year(OrderDate))


Comments: For basic stepwise computation, both R and esProc achieve the goal perfectly.


Example 2: On the basis of example 1, total up the freight of each group.

R solution: sumByYear <- mapply(function(x) sum(x$Freight), orderByYear)
esProc solution: A3=A2.(~.sum(Freight))


Comments: Both solutions are perfect.

Example 3: On the basis of example 1, regroup the data by month this time, and then sum up the freight.

R solution:

orderBymonth <- orderByYear
for(i in seq(orderByYear)){
  orderBymonth[[i]] <- aggregate(orderByYear[[i]]$Freight, list(format(orderByYear[[i]]$OrderDate, "%m")), sum)
}

esProc solution: A4=A2.(~.group(month(OrderDate);~.sum(Freight)))


Comments: As can be seen above, the R solution seems fairly complicated, but it actually follows the same procedure as the esProc one. Both operate on the data of each year (orderByYear[[i]] in R, ~ in esProc) and group it (aggregate in R, group in esProc). The difference is that esProc uses ".()" to represent looping over a set, while R represents looping with for/while/repeat. One point worth noting is that R offers several different representations for the grouping action, which increases the learning difficulty quite a bit; even the SQL of 10 years ago beats R in this respect. We will discuss syntax flexibility in a dedicated essay. Let's proceed with the topic and see the examples below.


Example 4: From the data in Example 3, we can see that only the statistics of year 1997 are complete, with data for all 12 months. Let's compute the month-on-month value for the year 1997, in which the data of year 1997 is represented as orderBymonth$"1997" in R and as A4(2) in esProc.


R solution:
x <- orderBymonth$"1997"$x
orderBymonth$"1997"$lrr <- (x - c(NA, x[-length(x)])) / c(NA, x[-length(x)])
esProc solution: A5=A4(2).derive((#1-#1[-1])/#1[-1])


Comments: In this case, the two solutions differ completely. Let's talk about esProc first: #1 represents the first field, that is, the summarized Freight value; it corresponds to orderBymonth$"1997"$x in R. #1[-1] represents the value in the previous row. Therefore the month-on-month expression is simply (this month - previous month)/previous month. This style of representation is really simple and clear.


R users can adopt a similar approach, but a loop is normally required. In fact, if the month-on-month comparison is to be conducted for every year, esProc users only need to replace the 2 in A4(2).derive((#1-#1[-1])/#1[-1]) with ~, while R users would have to use nested double loops.


Considering that business experts are less skilled in IT than professional developers, I removed the loop with a trick so that business people can handle it smoothly. The actual R solution is to move the original x column down a row to act as a computed column; based on this new column and the original one, the relative-position computation is converted into an inter-column computation. Some may think this makes the solution simpler at the cost of being harder to comprehend. Because R lacks a means to represent relative positions, R users have no choice: they face a dilemma between more complex code and harder-to-understand code.
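The same shift trick can also be sketched in plain Python (made-up freight figures, not the Northwind data): pair each value with the previous one, and the relative-position computation becomes a plain column-wise computation.

```python
freight = [120.0, 150.0, 135.0]  # made-up monthly freight totals

prev = [None] + freight[:-1]     # the original column moved down a row

# (this month - previous month) / previous month; None for the first row
lrr = [None if p is None else (x - p) / p
       for x, p in zip(freight, prev)]
```

This is the inter-column form of the computation; the esProc #1[-1] notation expresses the same relative position without building the shifted column at all.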


Judging from the above examples, both R and esProc are highly capable of computing on grouped data; however, esProc provides a much simpler and easier style of representation.