Showing posts with label Analysis Tools. Show all posts

October 22, 2013

Four Methods for Structured Big Data Computation

Data only has value when it is brought into computation, and big data is no exception. The ability to compute on structured big data determines the range of practical applications of big data. In this article, I'd like to introduce the most common computation methods: API, Script, SQL, and SQL-like languages.
         
API: The "API" here refers to a self-contained API access method that does not go through JDBC or ODBC. Take MapReduce as an example: MapReduce is designed from the bottom layer up to handle parallel computation cost-effectively, so it offers excellent scale-out, hot-swap capability, and cost efficiency. MapReduce is a Hadoop component with open-source code and abundant resources.
Sample code:

public void reduce(Text key, Iterator<Text> value,
        OutputCollector<Text, Text> output, Reporter reporter)
        throws IOException {
    double sumX = 0;
    double sumY = 0;
    int count = 0;
    String[] strValue;
    while (value.hasNext()) {
        count++;
        strValue = value.next().toString().split("\t");
        sumX += Integer.parseInt(strValue[1]);
        sumY += Integer.parseInt(strValue[2]); // X and Y come from different columns
    }
    double avgX = sumX / count;
    double avgY = sumY / count;
    Text tKey = new Text("K" + key.toString().substring(1, 2));
    Text tValue = new Text(avgX + "\t" + avgY);
    output.collect(tKey, tValue);
}
Since MapReduce is written in a general-purpose programming language unsuited to specialized data computing, it is less capable than SQL and other dedicated computation languages, and development is inefficient - programmers commonly complain that it is "painful". In addition, the rigid MapReduce framework results in relatively poor performance.
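To see the contrast, the per-key averaging that the reducer above performs takes only a few lines in a language with built-in grouping primitives. Here is a plain Python sketch (purely illustrative - the tab-separated field layout is an assumption, not part of Hadoop):

```python
from collections import defaultdict

def grouped_averages(lines):
    """Average the X and Y columns (fields 1 and 2) per key (field 0)."""
    sums = defaultdict(lambda: [0.0, 0.0, 0])  # key -> [sumX, sumY, count]
    for line in lines:
        fields = line.split("\t")
        acc = sums[fields[0]]
        acc[0] += int(fields[1])
        acc[1] += int(fields[2])
        acc[2] += 1
    return {k: (sx / n, sy / n) for k, (sx, sy, n) in sums.items()}

rows = ["a\t1\t10", "a\t3\t20", "b\t5\t30"]
print(grouped_averages(rows))  # {'a': (2.0, 15.0), 'b': (5.0, 30.0)}
```

The logic is the same as the reducer's, but without the framework scaffolding of keys, collectors, and reporters.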
         
There are several products using API access; MapReduce is the most typical of them.

Script: The "Script" here refers to specialized scripting languages for computation. Take esProc as an example: esProc is designed to improve the computational capability of Hadoop. In addition to inexpensive scale-out, it offers high performance, strong computational capability, and convenient computation across heterogeneous data sources, making it especially suitable for complex computational goals. Its grid-style script is characterized by high development efficiency and complete debugging functions.

Sample code:


     A                                                           B
1    =file("hdfs://192.168.1.200/data/sales.txt").size()         //file size
2    =10                                                         //number of tasks
3    =to(A2)                                                     //1~10, 10 tasks
4    =A3.(~*int(A1/A2))                                          //parameter list for start pos
5    =A3.((~-1)*int(A1/A2)+1)                                    //parameter list for end pos
6    =callx("groupSub.dfx",A5,A4;["192.168.1.107:8281","192.168.1.108:8281"])   //sub-program call, 10 tasks on 2 parallel nodes
7    =A6.merge(empID)                                            //merging task results
8    =A7.group@i(empID;~.sum(totalAmount):orderAmount,~.max(max):maxAmount,~.min(min):minAmount,~.max(max)/~.sum(quantity):avgAmount)   //summarizing is completed
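Rows 4 and 5 of the grid compute the parameter lists that split the file into 10 byte ranges, one per subtask. Roughly the following logic, restated in Python (illustrative; the 1000-byte file size is a stand-in value, and any remainder past the last full chunk is ignored, as in the original expressions):

```python
def split_positions(size, tasks):
    """Start/end byte positions for splitting a file of `size` bytes into `tasks` chunks."""
    chunk = size // tasks
    ends = [i * chunk for i in range(1, tasks + 1)]               # row 4: ~*int(A1/A2)
    starts = [(i - 1) * chunk + 1 for i in range(1, tasks + 1)]   # row 5: (~-1)*int(A1/A2)+1
    return starts, ends

starts, ends = split_positions(1000, 10)
print(starts[:3], ends[:3])  # [1, 101, 201] [100, 200, 300]
```

Each (start, end) pair is then passed to the sub-program in row 6, so the nodes can read disjoint slices of the same HDFS file in parallel.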

Java users can invoke esProc results via JDBC, but only in the form of a stored-procedure call rather than arbitrary SQL statements. Also, esProc is not open source. These are its two disadvantages.
         
Scripting is widely used in MongoDB, Redis, and many other big data solutions, but their scripts are not specialized for computation. For example, a multi-table join in MongoDB is not only inefficient, but also requires code an order of magnitude more complex than the equivalent in SQL or esProc.
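To illustrate the coding gap: a join that SQL expresses in one line must be hand-coded in a general scripting environment. A minimal Python hash-join sketch (the table layouts here are invented for the example):

```python
def hash_join(orders, customers):
    """Join orders to customers on CustomerID, like a one-line SQL INNER JOIN."""
    by_id = {c["CustomerID"]: c for c in customers}  # build side: hash the small table
    return [
        {**o, "CompanyName": by_id[o["CustomerID"]]["CompanyName"]}
        for o in orders
        if o["CustomerID"] in by_id                  # probe side: look up each order
    ]

orders = [{"OrderID": 1, "CustomerID": "A"}, {"OrderID": 2, "CustomerID": "Z"}]
customers = [{"CustomerID": "A", "CompanyName": "Acme"}]
print(hash_join(orders, customers))
```

Even this simplest case already needs explicit index-building and probing logic, which SQL hides entirely behind the JOIN keyword.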

SQL: The "SQL" here refers to complete SQL with stored procedures (SQL/SP), i.e., the ANSI standard and its supersets. Take Greenplum as an example: the major advantages of Greenplum SQL are powerful computation, efficient development, and good performance. Other advantages include the widespread use of the language, low learning cost, simple maintenance, and migration possibilities - not to mention its trump card of supporting stored procedures to handle complex computation. In this way, business value can be conveniently extracted from big data.
Sample code:
CREATE OR REPLACE FUNCTION view.merge_emp()
RETURNS void AS $$
BEGIN
    TRUNCATE view.updated_record;
    INSERT INTO view.updated_record
        SELECT y.* FROM view.emp_edw x
        RIGHT OUTER JOIN emp_src y ON x.empid = y.empid
        WHERE x.empid IS NOT NULL;
    UPDATE view.emp_edw
        SET deptno = y.deptno, sal = y.sal
        FROM view.updated_record y
        WHERE view.emp_edw.empid = y.empid;
    INSERT INTO emp_edw
        SELECT y.* FROM emp_edw x
        RIGHT OUTER JOIN emp_src y ON x.empid = y.empid
        WHERE x.empid IS NULL;
END;
$$ LANGUAGE plpgsql;
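The stored procedure above is a classic merge (upsert): source rows that already exist in the warehouse table update it, and new rows are inserted. The same logic as a Python sketch over dicts keyed by empid (the table layouts are assumptions for illustration):

```python
def merge_emp(emp_edw, emp_src):
    """Upsert emp_src into emp_edw, keyed by empid (mirrors the PL/pgSQL merge)."""
    for empid, row in emp_src.items():
        if empid in emp_edw:
            emp_edw[empid].update(row)   # UPDATE branch: empid already present
        else:
            emp_edw[empid] = dict(row)   # INSERT branch: new empid
    return emp_edw

edw = {1: {"deptno": 10, "sal": 100}}
src = {1: {"deptno": 20, "sal": 150}, 2: {"deptno": 30, "sal": 90}}
print(merge_emp(edw, src))  # employee 1 updated, employee 2 inserted
```

Wrapping this kind of multi-statement logic in a stored procedure is exactly what SQL-like languages such as Hive QL (discussed below) cannot do.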

Other databases with a similar MPP structure include Teradata, Vertica, Oracle, and IBM products. Their syntax characteristics are mostly alike, and so are their disadvantages: acquisition cost and ongoing maintenance expenses are extremely high. Charging users by data scale, the supposedly inexpensive Greenplum is actually not a bargain at all - it looks more like making big money under cover of big data. Other disadvantages include awkward debugging, incompatible syntax dialects, lengthy downtime during expansion, and awkward computation across multiple data sources.

SQL-like languages: These expose output interfaces like JDBC/ODBC but are limited to scripting languages that implement only a subset of standard SQL. Take Hive QL as an example: its greatest advantage is the ability to scale out cost-effectively while remaining convenient for development. Because it keeps SQL syntax, the learning cost is low, development is efficient, and maintenance is simple. In addition, Hive is a Hadoop component, and being open source is another advantage.
Sample code:

SELECT e.* FROM (
    SELECT name, salary, deductions["Federal Taxes"] AS ded,
           salary * (1 - deductions["Federal Taxes"]) AS salary_minus_fed_taxes
    FROM employees
) e
WHERE round(e.salary_minus_fed_taxes) > 70000;
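For reference, the derived-column-and-filter logic of this query is simple to restate procedurally (a Python sketch with made-up sample data):

```python
def high_earners(employees, threshold=70000):
    """Keep employees whose salary net of federal taxes rounds above the threshold."""
    result = []
    for e in employees:
        net = e["salary"] * (1 - e["deductions"]["Federal Taxes"])
        if round(net) > threshold:
            result.append({**e, "salary_minus_fed_taxes": net})
    return result

staff = [{"name": "Ann", "salary": 100000, "deductions": {"Federal Taxes": 0.2}},
         {"name": "Bob", "salary": 80000, "deductions": {"Federal Taxes": 0.2}}]
print([e["name"] for e in high_earners(staff)])  # ['Ann']
```

Hive QL handles this single-pass case well; the limitations discussed next appear only when logic spans multiple dependent steps.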
         
The weak point of Hive QL is its lack of stored procedures. Because of this, it is difficult for Hive QL to undertake complex computation and thus to deliver truly valuable results; anything slightly more complex must fall back on MapReduce, with its low development efficiency. Poor performance and startup latency are also a bane, especially in task allocation, multi-table joins, inter-row computation, multi-level queries, ordered grouping, and similar algorithms. So it is quite difficult for Hive QL to support real-time big data applications on Hadoop.
         
Some other products also offer SQL-like languages - MongoDB, for example - but they are still weaker than Hive.

The big data computation methods currently available fall into these four types: API, Script, SQL, and SQL-like languages. I hope they develop steadily, and that more and more cost-effective, powerful, and practical data computing products appear.

October 14, 2013

Why esProc is needed in Hadoop


esProc is a brand-new parallel computing framework that supports reading and writing files in HDFS and is fully committed to improving the computational capability, performance, and development efficiency of Hadoop.

>Enhance Computational Capability
The computational capability of Hadoop is built on the Java language and the MapReduce parallel framework. Java is outstanding for common applications, since it is broadly and generally used, but it is not powerful enough for computation in specialized fields. MapReduce lacks library functions for even the simplest data algorithms: there is no direct support for typical operations such as joins, sub-queries, inter-row computation, or ordered computation. Its computational capability is rather weak.

esProc is also a Java-based parallel framework, and it provides a script specifically designed for, and optimized to process, big data. Working with HDFS, esProc can greatly improve the computational capability of Hadoop.

To boost computational capability, Hive SQL was built on top of MapReduce for Hadoop. But the computational capability of Hive SQL is quite limited: it is just a subset of SQL with no support for stored procedures, and it cannot complete complex data computations.
 
With a complete computational model and powerful computational capability, esProc can meet computational demands effortlessly and solve complex data computing more easily than stored procedures do. esProc can also invoke Hive's computational results, improving the capability of Hadoop by working together with Hive.

>Boost Computational Performance
Built on a rigid framework, MapReduce is inflexible in decomposing and allocating tasks, extremely resource-consuming, and relatively poor at real-time work. By comparison, esProc enables arbitrary task allocation. In extreme cases, the time spent on task allocation is only one ten-millionth of the time required by MapReduce, and esProc's parallel performance is superior.

In MapReduce, the intermediate results of cross-machine interactions must be stored in HDFS as files. This is good for fault tolerance, but it introduces significant delay. By comparison, esProc lets users choose according to the duration of the computation: intermediate results can either be used directly to reduce delay, or stored in HDFS to increase fault tolerance.

It is awkward for MapReduce to perform common data computations such as multi-table joins and year-over-year or period-over-period comparisons; implementing them with MapReduce workarounds degrades performance dramatically. By comparison, esProc supports such computations natively, and combining esProc with HDFS boosts the computational efficiency of Hadoop dramatically.

Hive's infrastructure is still MapReduce, which implements common algorithms like joins at the cost of performance - usually an order of magnitude worse than an RDB. esProc's performance is close to, and in some respects better than, an RDB's. esProc can work with Hive via JDBC to handle computational tasks with strict real-time requirements.

>Improve Development Efficiency
Even for the simplest computation, MapReduce users must program manually, so development efficiency is low. Moreover, MapReduce requires relatively strong development skills and a greater workload to implement joins, ordered computation, equal grouping, and year-over-year or period-over-period comparisons. Hive does not support stored procedures and still has to fall back on MapReduce for anything slightly more complex.

For common algorithms, esProc provides abundant library functions for direct use; for complex algorithmic logic, it provides agile syntax and a professional IDE. Working with HDFS and Hive, esProc can greatly boost the development efficiency of Hadoop. With true support for the set data type, esProc enables ordered sets and set-based groupings such as equal grouping, alignment grouping, and enumeration grouping. esProc scripts are written in a grid-style cellset, so users can reference intermediate results directly without defining variables.

MapReduce's debugging is so rudimentary that users can only identify errors by checking the log files. By comparison, esProc supports breakpoints, step-by-step execution, run-to-cursor, and other specialized debugging functions that preserve development efficiency.

To define task scale arbitrarily, MapReduce users have to customize the MapReduce framework, which is not only difficult but seriously compromises development efficiency. esProc allows flexible, arbitrary task allocation, so development efficiency stays high.

esProc has the outstanding features of Hadoop - parallel computation on multiple nodes, scale-out on inexpensive hardware, and open external interfaces - and adds to Hadoop a flexible parallel framework, a script specialized for big data, agile syntax, and a professional IDE.

March 26, 2013

Problem: Further Computations on Grouped Data in Spreadsheet



Spreadsheets can usually group data easily, but further computation on the grouped data is much harder. For example, here is a wine-inventory sheet after grouping and summarizing:


Say we want to compute the total amount for each wine, or even each type. In a spreadsheet it is difficult to carry out these calculations on the grouped sheet above: calculate the "UnitPrice*Stock" amount; sort the data within each group by inventory level; add an additional level of grouping by Name to compute the inventory of the same type of wine; calculate the average unit price of each type of wine; and even highlight the subtotal rows such as F4 and F10.

To calculate the "UnitPrice*Stock" amount, a new column G must be appended with the formula "=E4*F4", and the formula then dragged to every row of the column. Because UnitPrice is unavailable in the subtotal rows, the formula is invalid there and must be removed manually row by row, which is error-prone when there are many types of wine.

To sort the data within a group by inventory level, you can only sort each group manually, since the data sits in several unrelated groups - for example, the first group spans rows 4~9. This is error-prone when the data volume is large. Another solution is to ungroup, present the data as a flat 2D sheet without subtotals, sort by Type and Stock, and then summarize in groups again. But if the subtotal rows carried derived data such as "proportion of subtotal to total", users must recalculate it once the subtotal rows disappear during the process. Both solutions have obvious drawbacks.
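The ungroup-sort-regroup workaround amounts to an ordinary compound sort - by Type, then by Stock descending - which any scripting language does in one call. A Python sketch with invented rows:

```python
wines = [
    {"Type": "Red", "Name": "Merlot", "Stock": 5},
    {"Type": "White", "Name": "Riesling", "Stock": 7},
    {"Type": "Red", "Name": "Pinot", "Stock": 9},
]
# Sort by Type ascending, then Stock descending; regrouping afterwards keeps
# each group internally sorted by inventory level.
wines.sort(key=lambda w: (w["Type"], -w["Stock"]))
print([w["Name"] for w in wines])  # ['Pinot', 'Merlot', 'Riesling']
```

The spreadsheet's difficulty is not the sort itself but that the grouped layout hides this one-step operation behind manual per-group work.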

Adding another level of grouping by Name to compute the inventory of the same type of wine is also hard: because the data is not sorted by Name, it cannot be grouped directly, and we are back to the previous problem.

To calculate the average unit price of each type of wine, both available solutions are cumbersome. The first is to ungroup and paste the data into two copies, run the average-unit-price calculation on one and the inventory summary on the other, and then join the two sets of summary values. The second is to write a formula in each subtotal row, for example "=SUBTOTAL(1,E4:E9)" in E3; since the groups differ, users must write N different formulas one by one, and errors easily occur.
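What the spreadsheet needs N hand-written SUBTOTAL formulas for is a single grouped aggregation. A Python sketch (the sample data is invented):

```python
from collections import defaultdict

def avg_unit_price(rows):
    """Average UnitPrice per wine Type in one pass, with no per-group formulas."""
    totals = defaultdict(lambda: [0.0, 0])  # Type -> [price sum, count]
    for r in rows:
        totals[r["Type"]][0] += r["UnitPrice"]
        totals[r["Type"]][1] += 1
    return {t: s / n for t, (s, n) in totals.items()}

rows = [{"Type": "Red", "UnitPrice": 10.0},
        {"Type": "Red", "UnitPrice": 14.0},
        {"Type": "White", "UnitPrice": 8.0}]
print(avg_unit_price(rows))  # {'Red': 12.0, 'White': 8.0}
```

One grouping pass replaces all the per-group formulas, and adding a group changes nothing in the code.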

To highlight the subtotal rows, users have no choice but to repeat the same simple action on each group manually. With no way to automate this, it is an ironic state of affairs for something called Office Automation.

In fact, the above problems can be solved with esCalc. esCalc introduces the concept of the homocell to represent business relations between cells, enabling automatic processing of data with different business meanings. For example, enter the formula =E4*F4 in G4 once, and the formula is copied to every detail row while skipping the summary rows, as shown in the sheet below:


Sorting within groups by inventory level works the same way: sort once in cell F4, F5, or F11, the relevant positions are auto-identified, and the data in each group is auto-sorted, as shown in the sheet below:


The esCalc homocell mechanism can solve the other problems automatically as well; we will not dwell on them here, as they are the topic of another essay.


September 12, 2012

Implement SQL Basic Functions through R Language & esProc


As we know, SQL users have to finish composing all the code and then run it at one time, which makes interactive data analysis difficult. Still, SQL's simple, easy-to-understand query syntax is popular among programmers. As powerful computation and analysis tools, R and esProc surely need to offer similar query syntax. In this article, let's discuss how to implement the basic functions of SQL in R and esProc with some examples.

The example data is from 2 tables of the classical Northwind database:
Orders table with the main fields: OrderID, EmployeeID, OrderDate, Freight, CustomerID
Customer table with the main fields: CustomerID, CompanyName




Retrieve data of the entire table

SQL solution: select * from Orders
R solution: A1<-sqlQuery(conn,'select * from Orders')
esProc solution: $select * from Orders
Comments: The esProc code differs from the SQL by only one word. The R code has a much stronger flavor of programming.

Where: Search for the order which has a freightage higher than 100 and is placed before the year of 1998

SQL solution: SELECT * FROM Orders WHERE Freight > 100 AND OrderDate < '1998-01-01'
R solution: subset(A1,Freight > 100 & OrderDate < as.POSIXlt('1998-01-01'))
esProc solution: =A1.select(Freight > 100 && OrderDate < date("1998-01-01"))
Comments: Both the R and esProc solutions are close to SQL to some extent. R has a typical functional style, while esProc has an object style. Programmers may prefer the former, while business experts and analysts prefer esProc, which is easier for them to understand.

Order: Sort by employee in ascending order, then by freight in descending order

SQL solution: SELECT * FROM Orders ORDER BY EmployeeID ,Freight DESC
R solution: A1[order(A1$EmployeeID,-A1$Freight),]
esProc solution: =A1.sort(EmployeeID,Freight:-1)
Comments:
The R solution first retrieves two vectors from A1, passes them to the order function to produce the sort positions, and finally rearranges the data by those positions. The process is rather roundabout, not as straightforward as SQL's. This is because R excels at handling vectors: access to structured data is column-wise, and a parameter usually takes up a whole column. By comparison, SQL takes the record (row) as the basic unit, with column names as parameters.
The esProc solution resembles SQL's because esProc also takes the record (row) as the basic unit.
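The two styles can be mimicked side by side in Python: the column-wise approach computes an ordering from whole columns and then reindexes, while the record-wise approach sorts the rows directly (illustrative sketch with invented data):

```python
rows = [(3, 50.0), (1, 80.0), (1, 20.0)]  # (EmployeeID, Freight)

# Column-wise, R-style: extract whole columns, compute an ordering, reindex.
emp = [r[0] for r in rows]
freight = [r[1] for r in rows]
order = sorted(range(len(rows)), key=lambda i: (emp[i], -freight[i]))
by_columns = [rows[i] for i in order]

# Record-wise, SQL/esProc-style: sort the records themselves.
by_records = sorted(rows, key=lambda r: (r[0], -r[1]))

print(by_columns == by_records)  # True
```

Both produce the same result; the difference is purely in whether the column or the record is the unit the programmer manipulates.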

Group & Sum: Summarize by employee, sum up the freightage, and count the orders:

SQL solution: SELECT EmployeeID, COUNT(OrderID), SUM(Freight) FROM Orders GROUP BY EmployeeID
esProc solution: =A1.group(EmployeeID; EmployeeID, ~.count(), ~.sum(Freight))
R solution:
A4<-aggregate(A1$Freight,list(A1$EmployeeID),sum)  
A4$count<-tapply(A1$Freight,A1$EmployeeID,length)
Comments: In this case it is obvious that R and SQL differ greatly. R's approach may be clearer to mathematicians, but it means more learning effort for users accustomed to SQL.

SQL: http://www.sql.org/
R: http://en.wikipedia.org/wiki/R_%28programming_language%29
esProc: http://www.raqsoft.com/products

From the above examples, SQL is actually replaceable in some aspects of data computing and complex calculation. As the comparison shows, esProc's coding style is closer to SQL's because esProc supports the "record" data structure in its infrastructure, making it more suitable for users familiar with SQL. In addition, compared with R, esProc provides an "object + functions" representation style that business experts find much easier to accept for financial data analytics and marketing and sales data analysis.

R is more resourceful in the details, ideal for programmers and mathematicians. Its support for regular expressions and other functions also makes R more open, a preferred analysis tool for programmers.