
February 11, 2014

A Programming Language with Agile Syntax for Better Efficiency and Performance

In the previous article, I shared some experiences with Hadoop coding in the agile esProc syntax. This article is a supplementary, more in-depth discussion based on that one.

Firstly, let's talk about the Cellset Code.

In the previous article, I introduced the convenience of using cellset code to define variables, reference them, and achieve a complex computing goal in multiple steps. In fact, the cellset (or grid) also makes it simpler to reuse computational results. Please refer to the code block below:







As can be seen, the computational result in A2 is reused in B2 and A3.

The introduction of grid lines in the cellset is a good idea. The grid lines keep the code lines aligned naturally, for example, forming a clear and intuitive scope of work through indentation. Take the code below for example:


It looks good. The branches of the judgment statement are easy to recognize, and the code block appears clear and neat without any deliberate editing.

Then, let's talk about the Object Reference. What is an object reference? Take a previous code snippet for example: A10: =A9.sort(sumAmount:-1).select(#<=10)

The code in A10 could be rewritten in two separate cells, one for sorting and another for filtering. But in the actual code, the "." is used to chain the computations of these two steps together - this mechanism is referred to as the Object Reference. Object reference reduces the coding workload and results in more agile code.
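The idea of chaining steps in one expression versus splitting them into separate cells can be sketched in plain Python. This is an analogue, not esProc itself; the field names and sample amounts are made up for illustration.

```python
# Hypothetical sales summary rows, standing in for the table in A9.
rows = [
    {"empID": "E%02d" % i, "sumAmount": float(amt)}
    for i, amt in enumerate([500, 900, 300, 700, 800, 200, 600, 400, 100, 1000, 50, 20])
]

# Chained in one expression, like "=A9.sort(sumAmount:-1).select(#<=10)":
top10 = sorted(rows, key=lambda r: r["sumAmount"], reverse=True)[:10]

# The same computation split into two "cells", one step each:
by_amount = sorted(rows, key=lambda r: r["sumAmount"], reverse=True)  # the sort step
top10_two_step = by_amount[:10]                                       # the select step

assert top10 == top10_two_step
```

Both forms compute the same result; the chained form simply saves naming the intermediate step.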

Support for writing SQL statements directly
Big data computation usually involves access to a Hive database or a traditional database. MapReduce requires users to write complex connect/statement/result code, while esProc supports composing the SQL statement directly, saving users all that trouble. For example, to get the sales records from the data source HData of a Hive database, esProc lets users complete all the work with one statement: $(HData)select * from sales.
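The convenience here is similar to issuing SQL from a host language in one step. As a rough analogue, a sketch using Python's standard sqlite3 module (the in-memory database and the sales table are made up for illustration):

```python
import sqlite3

# Open a connection to a data source and issue the SQL directly,
# roughly analogous to "$(HData)select * from sales".
conn = sqlite3.connect(":memory:")
conn.execute("create table sales (empID text, amount real)")
conn.executemany("insert into sales values (?, ?)",
                 [("E01", 100.0), ("E02", 250.0)])

rows = conn.execute("select * from sales").fetchall()  # the whole query in one call
conn.close()
```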

Function options
Firstly, let's check out these two statements in the sample code from the first article:
Code for node machine, A2: =A1.groups(${gruopField};${method}(${sumField}):Amount)
Code for summary machine, A9: =A8.groups@o(${gruopField};${method}(Amount):sumAmount)
The former uses groups directly to group unsorted data. The latter uses the @o option to indicate that the data is already sorted, so it can be grouped at a much higher speed. @o is a function option: a modifier that avoids multiplying complex special-purpose functions and makes the names of the various functions easier to memorize. In addition to @o, the groups function also offers the @m and @n options.

The function option is a nice design that makes the function structure much simpler and the coding more agile.
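The difference between plain groups and groups@o can be sketched in Python: grouping unsorted data needs a hash-style accumulation, while already-sorted data can be grouped in one sequential pass, which is what itertools.groupby (like @o) assumes. The records here are made up.

```python
from itertools import groupby
from operator import itemgetter

records = [("E01", 10), ("E02", 5), ("E01", 7), ("E02", 3)]

# Unsorted data: hash-style grouping, like plain groups().
totals = {}
for emp, amount in records:
    totals[emp] = totals.get(emp, 0) + amount

# Sorted data: one sequential pass, like groups@o().
sorted_records = sorted(records)  # pretend the data already arrived sorted
totals_o = {emp: sum(a for _, a in grp)
            for emp, grp in groupby(sorted_records, key=itemgetter(0))}

assert totals == totals_o
```

The one-pass version never revisits a group, which is why declaring sortedness buys speed.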

Multi-level Parameter
The multi-level parameter (also called a hierarchy parameter) makes the syntax much more agile. It is a way to represent parameters at different levels of a function. For example, ranking employees by performance score:
If the performance score is higher than 90, then set it to “A”
If the performance score is between 60 and 90, then set it to “B”
If the performance score is between 30 and 60, then set it to “C”
If the performance score is below 30, then set it to “D”
In esProc, the above parameters can be represented like this: score>90:"A",score>60 && score<=90:"B",score>30 && score<=60:"C";"D"

In this case, the parameters are classified into three levels. The outermost level: the branches and the default branch are separated with ";". The middle level: each branch is separated with ",". The innermost level: the judgment expression and the result in each branch are separated with ":". This is a parameter combination with a three-level tree structure.
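The three-level structure can be mirrored as nested Python data, which makes the tree shape explicit. This is an assumed-equivalent sketch, not esProc's actual parser: the ";" level becomes (branches, default), the "," level a list, and the ":" level a (test, result) pair.

```python
# Innermost ":" level: (test, result) pairs; middle "," level: the list;
# outermost ";" level: the branch list plus the default.
branches = [
    (lambda s: s > 90, "A"),
    (lambda s: 60 < s <= 90, "B"),
    (lambda s: 30 < s <= 60, "C"),
]
default = "D"

def rank(score):
    """Return the first matching branch's result, else the default."""
    for test, result in branches:
        if test(score):
            return result
    return default

grades = [rank(s) for s in (95, 75, 45, 10)]
```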

Set-style Grouping
esProc supports set-style grouping and is also capable of coding it in an agile way. The essence of its dynamic data types is the set. Specifically, a simple data type is a set of a single value, an array is a set of alike data, and a two-dimensional table is a set of records. A member of a set can itself be another set. Therefore, esProc can represent the concept of grouping directly: each group is a member of a set, and that member is itself a set. Thanks to the agile syntax, set-style grouping can be used to solve complex grouping and computing problems. For example, find the sales persons who signed the most and the fewest insurance policies. The code is shown below:












A1 cell: Group by sales person. Each group is the set of all policies of one sales person.
A2 cell: Sort the groups by the number of policies. In the code snippet, the “~” represents the group of policies corresponding to each sales person.
A3 cell: Find the groups having the most or the fewest policies. They are the first group and the last group in cell A2.
A4 cell: List the names of the sales persons. They are the sales persons corresponding to the two groups of policies in A3.
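The four steps above can be sketched in Python, where each group is literally a member of a collection and is itself a collection. The policy records are made up for illustration.

```python
from collections import defaultdict

# Hypothetical policy records: (salesperson, policyID).
policies = [("Ann", 1), ("Bob", 2), ("Ann", 3), ("Cid", 4), ("Ann", 5), ("Bob", 6)]

# A1: group by sales person; each group is itself a set (here, a list) of policies.
groups = defaultdict(list)
for person, pid in policies:
    groups[person].append(pid)

# A2: sort the groups by the number of policies they contain.
ordered = sorted(groups.items(), key=lambda kv: len(kv[1]))

# A3/A4: the first and last groups hold the fewest and the most policies;
# their keys are the sales persons we want.
least_person, most_person = ordered[0][0], ordered[-1][0]
```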

The agile syntax of esProc boosts the efficiency of code development, and reduces the development workload dramatically.

web: http://www.raqsoft.com/product-esproc

January 27, 2014

Parallel Computing and Columnar Storage in One Programming Language

Columnar storage is good, especially when a table has lots of fields (which is quite common). When querying, far less data needs to be traversed than with row storage; less data to traverse means less I/O workload and higher query speed. Without columnar storage, a Hadoop application spends most of its time on hard disk I/O.

Both Hive and Impala support columnar storage, but it is only available through their basic SQL interfaces. For more complex data computation, it seems quite difficult to do columnar storage within the MapReduce framework.

The sample data is a big data file sales.txt on HDFS. The file has twenty fields, 100 million records, and a size of 14 GB. Let's take the GROUP computation as an example to demonstrate the whole procedure; concretely, it sums the total sales per sales person. The algorithm involves two fields, empID and amount.

Firstly, look at the classic code for grouping and summarizing, to compare against later:


Code for summary machine:

main node












Code for node machine:

sub node







The above algorithm follows this train of thought: split the large task into forty smaller tasks by file bytes; distribute them to the node machines for initial summarization; then perform a secondary summarization on the summary machine. The train of thought is similar to MapReduce's. The difference is that this pattern is simple and intuitive, because the task size can be customized, and most people can understand it easily.

As can be seen, row storage keeps all the fields of every record together in one file. Therefore, no matter whether two or twenty fields are needed, we still have to traverse all the data - the whole 14 GB file. Columnar storage is different: the data of each field is stored as its own file. If only two fields are involved in the query, then you only need to retrieve the files of those two fields, i.e. about 1.4 GB. Please note that 1.4 GB is an average value; the total volume of the data of these two fields is slightly higher.

As mentioned above, to implement columnar storage we must first decompose the file field by field, using f.cursor() to retrieve the data and then f.export() to export it. The detailed procedure is explained at the end of this article.

Code for grouping and summarizing once columnar storage is adopted:

Code for summary machine:

main node









Code for node machine:

sub node









As can be seen, the greatest change is in the code for the node machine. The A3 cell uses [file list].cursor() to merge a group of field files into a cursor over a two-dimensional table. The remaining code for grouping and summarizing is just the same as before.
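Merging per-field files back into records is essentially zipping columns into rows. A minimal Python sketch, with in-memory lists standing in for the per-field files on disk:

```python
# Each field file holds one column; merging them yields the rows of a 2-D table.
empID_col = ["E01", "E02", "E03"]        # contents of the empID field file
amount_col = [100.0, 250.0, 75.0]        # contents of the amount field file

# The analogue of [file list].cursor(): iterate the columns in lockstep.
rows = [{"empID": e, "amount": a} for e, a in zip(empID_col, amount_col)]

# Grouping and summarizing then proceeds exactly as on row storage.
total = sum(r["amount"] for r in rows)
```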

Another change concerns the task scheduling, and in fact great trouble can arise at this stage. As we know, with row storage the task scheduling can be performed based on the number of bytes. For example, a 14 GB file can be divided into 40 even segments, and each segment is assigned to a node machine as a task. But this method does not work with columnar storage: because the records become misaligned across the field files, the result will always be wrong. The right method is to divide evenly by the number of records (suggestions on dividing it more evenly are welcome; please leave a comment). 100 million records can be allocated to 40 segments, each holding 2,500,000 records.

For example, the empID column will ultimately be divided into 40 smaller files: "empID1.dat", "empID2.dat", ..., "empID40.dat". The algorithm code is shown below:









To prevent memory overflow, the above algorithm retrieves 100,000 records from the cursor at a time and appends them to the current file. A new file is created for every 2,500,000 records. Here, A3 is the file count, and C3 monitors whether the number of records in the current file has reached 2.5 million.
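The batching-and-segmenting logic can be sketched in Python. The sizes are shrunk (batches of 3, segments of 5, instead of 100,000 and 2,500,000) and the "files" are in-memory lists, so this is a structural sketch rather than the real HDFS code.

```python
BATCH, SEGMENT = 3, 5  # stand-ins for 100,000-record batches and 2,500,000-record files

def split_column(values, batch=BATCH, segment=SEGMENT):
    """Split one column into segment files, reading only one batch at a time."""
    files = []                 # each "file" is a list; on disk: empID1.dat, empID2.dat, ...
    current, count = [], 0     # the file being written and its record count (C3's role)
    for i in range(0, len(values), batch):        # fetch one batch from the "cursor"
        for v in values[i:i + batch]:
            current.append(v)                     # append the record to the current file
            count += 1
            if count == segment:                  # file is full: start the next one
                files.append(current)
                current, count = [], 0
    if current:                                   # flush the final partial file
        files.append(current)
    return files

parts = split_column(list(range(12)))
```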

The data scheduling is surprisingly troublesome, but the columnar computing after the splitting is quite simple.

January 16, 2014

To write Hadoop code using an agile programming language

Hadoop is an outstanding parallel computing system whose default parallel computing mode is MapReduce. However, MapReduce is not specially designed for parallel data computing, nor is it an agile programming language, so its coding efficiency for data computing is relatively low, and it is even more difficult to compose universal algorithms with it.

Regarding agile syntax and parallel computing, esProc and MapReduce are very similar in function.
        
Here is an example illustrating how to develop Hadoop-style parallel computing with an agile programming language. Take the common Group algorithm in MapReduce for example: according to the order data on HDFS, sum up the sales amount per sales person, and find the top N sales persons. In the example code, the big data file fileName, the field to group by groupField, the field to summarize sumField, the summarizing method method, and the top-N count topN are all parameters. In esProc, the corresponding code is shown below:
        
Agile program language code for summary machine:










Agile program language code for node machine:




        
How do we perform parallel computing over big data? The most intuitive idea is: decompose the task into several segments for parallel computing; distribute them to the node machines for initial summarization; and then summarize a second time on the summary machine.
        
From the above code, we can see that esProc divides the parallel computation into two parts: the code for the summary machine and the code for the node machines. The summary machine is responsible for task scheduling, distributing the tasks to every computing node in the form of parameters, and ultimately consolidating and summarizing the results from the node machines. Each node machine takes the segment of the whole data specified by its parameters, and then groups and summarizes the data of that segment.
        
Then, let's discuss the above code in detail.

Variable definition
        
As can be seen from the above code, esProc code is written in cells. Each cell is identified by the unique combination of its row ID and column ID. A cell name serves as a variable and requires no definition. For example, in the summary machine code:
A2: =40
A6: =["192.168.1.200:8281","192.168.1.201:8281","192.168.1.202:8281","192.168.1.203:8281"]
A2 and A6 are just two variables, representing the number of parallel tasks and the list of node machines respectively. Other code can reference these variables by cell name directly. For example, A3, A4, and A5 all reference A2, and A7 references A6.
        
Since the variables are themselves the cell names, references between cells are intuitive and convenient. Obviously, this allows a big goal to be decomposed into several simple steps, achieving the ultimate goal by invoking them progressively. In the above code, A8 references A7, A9 references A8, and A10 references A9. Each step solves a small problem; step by step, the computing goal of this example is ultimately reached.
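The cell-by-cell style amounts to naming every intermediate result. A sketch of the first few summary-machine cells as plain Python variables (the record count is made up, and A5's formula is my guessed continuation of the segment arithmetic, not quoted from the article):

```python
# Each "cell" names one step; later steps reference earlier ones directly.
A1 = 20_000_000                  # =file(fileName).size(), total record count (assumed)
A2 = 40                          # number of parallel tasks
A3 = list(range(1, A2 + 1))      # =to(A2): the task numbers 1..40, references A2
A4 = [n * A1 // A2 for n in A3]  # end position of each segment; references A3, A1, A2
A5 = [e - A1 // A2 for e in A4]  # start position of each segment (hypothetical step)
```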

External parameters and macros
        
In esProc, a parameter can be used as a normal parameter or as a macro. For example, in the summary machine code, fileName, groupField, sumField, and method are all external parameters:
A1: =file(fileName).size()
A7: =callx("groupSub.dfx",A5,A4,fileName,groupField,sumField,method;A6)
They have the following meanings:
fileName: the name of the big data file, for example "hdfs://192.168.1.10/sales.txt"
groupField: the field to group by, for example empID
sumField: the field to summarize, for example amount
method: the method for summarizing, for example sum, min, max, etc.
If a parameter is enclosed with ${}, it is used as a macro, as in this piece of summary machine code:
A8: =A7.merge(${gruopField})
A9: =A8.groups@o(${gruopField};${method}(Amount):sumAmount)
In this case, the macro is interpreted and executed by esProc as code, instead of as a normal parameter. The translated code would be:
A8: =A7.merge(empID)
A9: =A8.groups@o(empID;sum(Amount):sumAmount)
        
Macro is one of the dynamic features of an agile language. Compared with normal parameters, macros can be used directly in computations as code, in a much more flexible way, and can be reused very easily.
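Macro substitution splices parameter text into the code before evaluation, rather than passing a value. A rough Python sketch of the idea (illustration only; the records, and the use of eval, are my stand-ins, not how esProc works internally):

```python
# Hypothetical records and external parameters, as in the article's example.
records = [{"empID": "E01", "Amount": 10}, {"empID": "E01", "Amount": 5}]
gruopField, method = "empID", "sum"

# Splice the parameter text into the expression, then evaluate the result as code,
# the way ${method}(...) becomes sum(...) after macro expansion.
expr = "{m}(r['Amount'] for r in records if r['{g}'] == 'E01')".format(
    m=method, g=gruopField)
total = eval(expr, {"records": records, "sum": sum})
```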

Two-dimensional table in A10
Why does A10 deserve special discussion? Because A10 is a two-dimensional table, a type frequently used in data computing. It has two columns, of string type and float type respectively. Its structure is like this:
     






In this solution, the use of the two-dimensional table itself shows that esProc supports dynamic data types. In other words, we can organize various types of data into one variable without making any extra effort to specify the type. The dynamic data type not only saves the effort of defining data types, but is also convenient thanks to its strong expressive power. Working with the above two-dimensional table, you may find that dynamic data types make big data computing more convenient.
        
Besides the two-dimensional table, a dynamic value can also be an array. For example, with A3: =to(A2), A3 is an array whose value is [1,2,3,...,40]. Needless to say, simple values are also acceptable; I've verified data of date, string, and integer types.
        
The dynamic data type also supports nested data structures. For example, the first member of an array can be a simple value, the second member an array, and the third member a two-dimensional table. This makes the dynamic data type even more flexible.
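The nesting described above can be sketched directly in Python, whose lists are similarly dynamic; the values are made up for illustration, with a list of dicts standing in for the two-dimensional table.

```python
# One variable holding a nested mix of types: a scalar, an array, and a 2-D table.
mixed = [
    42,                                          # simple value: a set of one value
    [1, 2, 3],                                   # array: a set of alike data
    [{"empID": "E01", "sumAmount": 1000.0},      # 2-D table: a set of records
     {"empID": "E02", "sumAmount": 750.5}],
]

scalar, array, table = mixed  # members keep their own types and structure
```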

Loop functions for big data
In esProc, many functions are designed for big data computing. For example, A3 in the above code, =to(A2), generates the array [1,2,3,...,40].
        
With this array, you can compute over each of its members directly, without loop statements. For example, A4: =A3.(long(~*A1/A2)). In this formula, each member of A3 (represented with "~") is multiplied by A1 and then divided by A2. Suppose A1=20000000; then the result in A4 would be: [500000, 1000000, 1500000, 2000000, ..., 20000000].
        
The official name of such a function is the loop function, which is designed to make the language more agile by reducing loop statements.
        
Loop functions can handle all kinds of data; even two-dimensional tables from a database are acceptable. For example, A8, A9, and A10 are loop functions acting on two-dimensional tables:
A8: =A7.merge(${gruopField})
A9: =A8.groups@o(${gruopField};${method}(Amount):sumAmount)
A10: =A9.sort(sumAmount:-1).select(#<=10)

Parameters in the loop function
Check out the code in A10: =A9.sort(sumAmount:-1).select(#<=10)
        
sort(sumAmount:-1) sorts the two-dimensional table of A9 in reverse order by the sumAmount field. select(#<=10) filters the sorted result, keeping only the records whose serial numbers (represented with #) are not greater than 10.
        
The parameters of these two functions are not fixed values but computing methods; they can be formulas or functions. Such usage of parameters is called the parameter formula.
        
As can be seen, the parameter formula is also agile syntax. It makes the usage of parameters more flexible and function calls more convenient, and it greatly reduces the coding workload.

From the above example, we can see that esProc can be used to write Hadoop-style parallel programs in an agile language. By doing so, the code maintenance cost is greatly reduced, and code reuse and data migration become much more convenient, with better performance from the parallel mechanism.

Official web: http://www.raqsoft.com/