December 30, 2013

Parallel computing in hadoop with esProc instead of MapReduce

Hadoop is an outstanding distributed computing system. However, many users find MapReduce is too complex and too tough to manage, the response is too slow, the realtime is too poor, and the running is inefficient. In a word, it is too heavy. They are users of middle-and-small-scale cluster. They do not need such heavyweight solution. In this case, using esProc to realize the lightweight data parallel computing for Hadoop would be one of the choices.

Let’s demonstrate how esProc realizes the lightweight parallel computing for Hadoop with a typical grouping computing example below. In this example, we summarize and count the order data from a file in HDFS to compute the sales value for each sales person. Because the data volume is too large for a single computer to process, we need to speed up by distributing the computing workload.

As the “most intuitive idea”, the distributed computing procedure is to decompose the big data into several segments. Each node is assigned with a segment of data for interior grouping and summarizing. Then, the summary machine collects and merges the data computing result from each node and carries out the secondary grouping and summarizing. esProc implements this hadoop parallel computing like this:
Code for summary machine: Decomposing tasks and secondary summary.

Code for computing node: Aggregate, group, and summarize in any segment.

If using one figure to illustrate the relation between codes on summary machine and node machine, then it should be like this:

As can be seen, esProc coding follows the “intuitive ideas” for computing. Each step is implemented concisely and smoothly. Most importantly, esProc has a simple structure with no hidden specifics. The actual computing is carried out step by step by strictly following the code flow. To optimize, users can easily modify the code during running in each step. For example: Change the granularity of task-decomposing and specify the node machine for computing.
In the below sections, let’s analyze the realtime, data exchange performance, maintenance and development of the above-mentioned codes.

Analysis on Realtime

With esProc, the task-decomposing scale can be customized according to their practical computing environment. For example, if the node machine is a high performance PC Server, then the 100 million pieces of data can be processed in 10 shares, with each node responsible for 10 million pieces of data. If the computing node is an obsolete and outdated notebook, then the data can be arranged to process in one thousand shares, and each node responsible for 10 thousand pieces of data. The ability to adjust the task decomposing granularity freely will dramatically lower the cost of scheduling for task decomposing. The realtime will thus be boosted.

The cluster computing has a certain failure rate as a result of network or node machine failure. The failure would lead to unavoidable resource-consuming re-computing. The smaller each task is, the fewer resources would be consumed for re-computing. However, decomposing a great job into smaller tasks is itself resource-consuming. The more you decomposing, the higher the scheduling cost and the poorer the realtime would be. They are two ends of the scale, and you cannot strike the balance.

MapReduce is designed for the large scale cluster. Each Map task is designed to process one data by default. Such mechanism of task allocating results from the complex infrastructure and is extremely difficult to change. Therefore, MapReduce costs more on scheduling, and people may think Hadoop is poor regarding its realtime. Users of the middle-and-small-scale cluster only need to handle fewer nodes which seldom fail and always provide the satisfactory uptime. Such users are more eager to customize the scale of task-decompose freely.

Analysis on Data Exchange Performance

In the above esProce codes, users not only have the choice to determine the granularity of task-decomposing, but also the method of data exchange. Users can choose to store the data into HDFS and then exchange, or just exchange directly. In the sample code, data is exchanged directly. By doing so, the data exchange time between nodes is significantly reduced, and the overall performance is improved effectively.

The node failure that occurs during computing can be solved by means of decomposing and breaking-down the task into smaller jobs. What if the failure occurs in the next round computing after the computing of a certain node is completed? MapReduce offers a rather simple solution: Write the result to HDFS, exchange the data with files, and no more depend on the legacy computers. However, this practice is obviously quite inefficient. Directly sending the data from memory to the summary machine is much faster. That’s why users of middle-and-small scale cluster would say the performance of Hadoop is poor. In a large scale cluster environment of high failure rate, such practice is necessary for effectively addressing the unstable cluster environment.

Users of middle-and-small scale cluster seldom experience cluster failures, and consider it desirable to choose a more flexible method to exchange the data: In order to minimize the required time to exchange data, the data is exchanged directly in most cases and file exchange is only performed when necessary.

Analysis on Maintenance and Development

From the above code, we can see: The structure of esProc is simple and thus can be deployed and managed at extremely low cost. Without the components in complex relations, few errors might occur, and the troubleshooting is quite easy if any error. Owing to the simple structure, the study and development procedure of esProc is also quite easy, and the code-debugging becomes quite intuitive. In fact, esProc offers the true script debugging and thus saves the efforts of checking logs for debugging in MapReduce.

In the great scale environment, MapReduce is designed to have a complex architecture for leveraging resources efficiently and performing the computing task and complex logic programming steadfastly and effectively. Such degree of complexity is very necessary in the cluster environment of hundreds or thousands of machines. However, this also brings about the complexity of deployment. Not only the great workload in maintenance and management, but also the increasing complexity to study and develop grows significantly. Besides, the complex structure will degrade the overall computing performance. Various components are very likely to encounter error, and troubleshooting is difficult. The complex structure compromises the flexibility of users and gives rise to the development cost. For example, MapReduce is designed to process one data each time by default. Although a modified MapReduce architecture can process multiple data at one time. Such modification is relatively hard to implement for it is time-consuming and only achievable for those with high technical competence.

Since there are only the relatively small number of node machines in an environment of the middle-and-small scale cluster, the machine is not so diversified, and the computing environment is stable. The complex computing architecture as such is thus not necessary to get the computing task run smoothly.

With all the above code analyses, we can reach the conclusion that: The simply structured database and big data parallel computing frameworks such as esProc enable users of middle-and-small scale cluster to take full advantage of the inexpensive scale-out feature of Hadoop for the highly efficient lightweight parallel computing solution.