
October 14, 2013

Why esProc is needed in Hadoop


esProc is a new parallel computing framework that supports reading and writing files in HDFS and is committed to improving the computational capability, performance, and development efficiency of Hadoop.

>Enhance Computational Capability
Hadoop's computational capability is built on the Java language and the MapReduce parallel framework. Java is an outstanding general-purpose language, broadly used across common applications, but it offers little built-in support for computation in specialized data-processing fields. MapReduce lacks library functions for even the simplest data algorithms, and it provides no direct support for typical operations such as associated computation (joins), sub-queries, inter-row computation, and ordered computation. Its computational capability is therefore rather weak.

esProc is also a Java-based parallel framework, but it provides a script language designed specifically for big data computation. Working with HDFS, esProc can greatly improve the computational capability of Hadoop.

To boost computational capability, Hadoop layers Hive SQL on top of MapReduce. But the computational capability of Hive SQL is quite limited: it is only a subset of SQL, it has no support for stored procedures, and it cannot complete complex data computations.
 
With a complete computational model and strong computational capability, esProc can handle complex data computations more easily than stored procedures do. esProc can also invoke the computational results of Hive, improving the computational capability of Hadoop by working together with Hive.

>Boost Computational Performance
MapReduce is developed on a rigid framework: it is inflexible in decomposing and allocating tasks, extremely resource-consuming, and relatively poor at real-time work. By comparison, esProc allows arbitrary task allocation; in extreme cases, the time spent allocating tasks can be as little as one ten-millionth of the time required by MapReduce, giving esProc superior parallel performance.

In MapReduce, the intermediate result of a cross-machine interaction must be stored in HDFS as a file. This aids fault tolerance, but it also introduces significant latency. By comparison, esProc lets users choose flexibly according to the duration of the computation: intermediate results can be used directly to reduce latency, or stored in HDFS to increase fault tolerance.

It is awkward for MapReduce to complete common data computations such as multi-table association and year-over-year or month-over-month (link relative ratio) comparisons. Implementing such computations through MapReduce workarounds makes performance decline dramatically. By comparison, esProc provides native support for these computations, and using esProc together with HDFS boosts the computational efficiency of Hadoop dramatically.

The infrastructure of Hive is still MapReduce, which implements common algorithms such as associated computation at a cost in performance, usually an order of magnitude slower than an RDB. The performance of esProc is close to, and in some respects better than, that of an RDB. esProc can work with Hive via JDBC to take over computational tasks with strict real-time requirements.
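As a sketch of that division of labor, a Java client (esProc integrates in a similar way through JDBC) can let Hive do the heavy batch aggregation over HiveServer2's JDBC driver and hand the small result set to a faster in-memory engine; the host, table, and query below are assumptions:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcSketch {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC driver; host/port/database are placeholders.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                "jdbc:hive2://hive-host:10000/default", "user", "");
             Statement stmt = conn.createStatement();
             // Let Hive do the heavy batch aggregation ...
             ResultSet rs = stmt.executeQuery(
                 "SELECT dept, SUM(sales) FROM sales_fact GROUP BY dept")) {
            // ... then hand the (small) result set to the real-time layer.
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}
```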

>Improve Development Efficiency
Even for the simplest computation, MapReduce users have to program everything manually, so development efficiency is low. Moreover, MapReduce demands relatively strong development skills and a large workload to implement associated computation, ordered computation, equal grouping, and year-over-year or link relative ratio comparisons. Hive does not support stored procedures, and it still has to fall back on MapReduce for anything even slightly more complex.
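To give a sense of that workload, here is a minimal sketch of a plain group-and-sum in classic Hadoop MapReduce; the input layout ("dept,amount" lines) and the HDFS paths are assumptions. Even this trivial aggregation needs a mapper, a reducer, and job wiring:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SalesByDept {
    // Map each "dept,amount" line to (dept, amount).
    public static class M extends Mapper<LongWritable, Text, Text, LongWritable> {
        protected void map(LongWritable k, Text v, Context ctx)
                throws java.io.IOException, InterruptedException {
            String[] f = v.toString().split(",");
            ctx.write(new Text(f[0]), new LongWritable(Long.parseLong(f[1])));
        }
    }
    // Sum the amounts for each department.
    public static class R extends Reducer<Text, LongWritable, Text, LongWritable> {
        protected void reduce(Text k, Iterable<LongWritable> vs, Context ctx)
                throws java.io.IOException, InterruptedException {
            long sum = 0;
            for (LongWritable v : vs) sum += v.get();
            ctx.write(k, new LongWritable(sum));
        }
    }
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "sales-by-dept");
        job.setJarByClass(SalesByDept.class);
        job.setMapperClass(M.class);
        job.setReducerClass(R.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path("/data/sales"));
        FileOutputFormat.setOutputPath(job, new Path("/out/sales-by-dept"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```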

For common algorithms, esProc provides abundant library functions ready for direct use; for complex algorithmic logic, esProc offers agile syntax and a professional IDE. Working with HDFS and Hive, esProc can greatly boost the development efficiency of Hadoop. With true support for the set data type, esProc enables ordered sets and set-oriented groupings such as equal grouping, alignment grouping, and enumeration grouping. esProc scripts are written in a grid-style cellset, so users can reference intermediate results directly without defining variables.

The debugging facilities of MapReduce are so primitive that users can often only identify errors by reading messages in the log files. By comparison, esProc supports breakpoints, step-by-step execution, run-to-cursor, and other specialized debugging functions that protect development efficiency.

To define task scale arbitrarily, MapReduce users have to customize the MapReduce framework itself, which is not only difficult but also seriously compromises development efficiency. esProc allocates tasks flexibly and arbitrarily, so development efficiency remains high.

esProc retains the outstanding features of Hadoop: parallel computation across multiple nodes, scale-out on inexpensive hardware, and open external interfaces. In addition, esProc renovates Hadoop with a flexible parallel framework, a script language specialized for big data, agile syntax, and a professional IDE.

September 17, 2013

Data Source Preparation Tool Especially for Report Developers

Many report developers have had to present a KPI like this in a report: highlight the outstanding salespeople whose sales have risen by more than 10% for three consecutive months. The procedure of finding those salespeople is, in fact, data source preparation.
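As a concrete illustration, here is a minimal sketch of that ordered, inter-row computation in plain Java; the sample data, the map shape, and the "more than 10% growth each month" reading of the rule are assumptions:

```java
import java.util.List;
import java.util.Map;

public class RisingSales {
    /** True if sales grew by more than 10% month-over-month
     *  for at least three consecutive months. */
    static boolean isOutstanding(List<Double> monthlySales) {
        int run = 0;
        for (int i = 1; i < monthlySales.size(); i++) {
            if (monthlySales.get(i) > monthlySales.get(i - 1) * 1.10) {
                if (++run >= 3) return true;   // three consecutive rises
            } else {
                run = 0;                       // streak broken
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Hypothetical data: salesperson -> sales per month, in month order.
        Map<String, List<Double>> sales = Map.of(
            "Alice", List.of(100.0, 115.0, 130.0, 150.0),
            "Bob",   List.of(100.0, 105.0, 130.0, 150.0));
        sales.forEach((name, s) -> {
            if (isOutstanding(s)) System.out.println(name);
        });
    }
}
```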

Preparing the data source is both the key and the toughest part of developing a report.

There are multiple ways to prepare a data source. SQL or stored procedures (SP) can handle normal data computation within a single database; the R language suits complex data computation; ETL or a data warehouse serves cross-database computation by gathering everything into one database before computing. For structured data from non-database files or spreadsheets, high-level programming languages can generate the result sets, for example, retrieving data from a text file with a Java class. However, all of these methods have drawbacks, as discussed below.
        
Let's start the discussion with SQL/SP. First, SQL/SP alone can only work on a single database; multi-database computation requires cumbersome workarounds. Second, SQL statements are hard to debug, and the situation worsens as statements grow: a more complex computational goal inevitably means more steps and a longer statement, which is a real nightmare for data source preparation. Third, the inability of SQL to run step by step hurts maintenance and reuse. A SQL statement can only run as a whole, all computational logic must be crammed into a single statement, and it is impossible to split one statement into several examinable stages so that users can check the result at each step. Fourth, SQL lacks explicit sets and direct support for ordered computation, both of which are common in complex computations, so SQL/SP usually costs several times the effort that other tools require.
        
The R language is quite good at complex data computing, so isn't it a better choice? Not quite. R lacks a polished IDE: composing and editing computational scripts is inconvenient, and its debugging support is poor. Report developers are not professional coders, so their productivity suffers in such an environment. More importantly, R provides no JDBC or other output interface for direct use by reporting tools; to use R in a report, users must additionally write an interface program to process the data and receive parameters. Too much trouble.

As for ETL or a data warehouse: first, it usually incurs great expense in human resources, equipment, maintenance, and training. Second, report developers have to master ETL scripting languages such as PHP, Perl, VBScript, or JavaScript, and design the massive update algorithms. Considering these troubles, 99% of report developers will surely get a headache.

The real trouble with Java and other high-level languages is that users must implement every detail themselves: open the Excel file, build the records, generate a List, loop over it, find the maximum value, group, compute the average, filter, sort, and then take the top N. The greatest flexibility is obtained at the cost of the greatest workload, as the sketch below suggests.
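Here is a minimal sketch of that hand-rolled workload in plain Java, reading a hypothetical comma-separated text file of (dept, name, salary) records and then grouping, averaging, sorting, and taking the top N; the file name and column layout are assumptions:

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Comparator;
import java.util.Map;
import java.util.stream.Collectors;

public class TopDeptsByAvgSalary {
    public static void main(String[] args) throws Exception {
        int topN = 3;
        // Parse every "dept,name,salary" line, group by department,
        // and average the salaries - all spelled out by hand.
        Map<String, Double> avgByDept = Files.lines(Paths.get("emp.txt"))
            .map(line -> line.split(","))
            .collect(Collectors.groupingBy(
                f -> f[0],
                Collectors.averagingDouble(f -> Double.parseDouble(f[2]))));
        // Sort by average salary descending and keep the top N.
        avgByDept.entrySet().stream()
            .sorted(Map.Entry.<String, Double>comparingByValue(Comparator.reverseOrder()))
            .limit(topN)
            .forEach(e -> System.out.println(e.getKey() + "\t" + e.getValue()));
    }
}
```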
        
In view of all the discussion above, it would be good news for report developers if there were a data source computation tool built specially for reporting, with all the advantages of the above methods and free from their disadvantages.

esProc is such a tool. On one hand, it matches SQL and stored procedures in professional database computation; on the other, it offers convenient debugging and step-by-step computation. Compared with R, esProc also supports ordered computations and explicit sets for solving complex problems, while offering a friendlier IDE and a JDBC output interface to ensure usability. esProc matches ETL and data warehouse solutions at cross-database computation, but more cost-effectively, owing to its low TCO and efficient deployment. Finally, esProc can retrieve data directly from Excel and text files as Java does, while handling massive structured data far more conveniently than Java.

In conclusion, esProc is an ideal tool designed specifically for preparing the data sources of reports.

March 21, 2013

An Example Where Circles are Useful in Graphs

The original post by Naomi Robbins: http://www.forbes.com/sites/naomirobbins/2013/01/15/an-example-where-circles-are-useful-in-graphs/

I recently saw the infographic in Figure 1 and thought to myself, “Another application of bubbles in infographics.” After all, I criticized circles and bubbles in Misleading Graphs: Figures Not Drawn to Scale. In The Functional Art, Alberto Cairo wrote “No fashion plague is more prevalent as I write this book than the bubble.” I didn’t study Figure 1 carefully since the text is in Italian, a language that I don’t read.


This figure was originally published on July 22, 2012 in La Lettura, the Sunday cultural supplement of Corriere della Sera, the highest circulation newspaper of Italy. It was produced by Accurat, a design agency in Milan and New York that does amazing work.

Later I discovered Figure 1 was translated into English and discussed in Parson’s Journal for Information Mapping. There I learned that the circles represent the distance you can travel underground from the center of cities using their subway systems. Overlapping symbols show that you can get from one city to the other underground. I can’t think of a better way to show distances from the center of a city than with a circle.

The visualization then provides lots of other information by which each city’s subway system could be compared. Pictograms show the number of passengers, the stroke width of the circles shows the cost of a one-way ticket, and the little colored squares show the colors of the lines.

There is more than one takeaway from this story. First, our first impressions are not always correct. Second, exceptions exist for many graphical principles. There are examples where circles are an excellent choice to show the data we want to display.