June 10, 2014

Features: High Performance

esProc optimizes its syntax for structured data, supports in-memory computing and ordered sets, and enables programmers to select an optimized execution path based on the characteristics of the data and algorithms. Measured results indicate that, on a single machine, the performance of esProc is close to, or even higher than, that of a database. Column storage can increase the performance significantly.
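The benefit of column storage can be illustrated with a toy sketch (in Python; esProc itself uses its own scripting language, and this is not its implementation): aggregating one column in a columnar layout scans only that column's array, instead of touching every field of every row. The table shape below is invented for illustration.

```python
# Toy illustration of row storage vs. column storage for one aggregation.

rows = [{"id": i, "amount": i * 2, "region": i % 4} for i in range(1000)]

# Row storage: every record must be visited to read a single field.
row_total = sum(r["amount"] for r in rows)

# Column storage: the same data kept as one array per column.
columns = {
    "id": [r["id"] for r in rows],
    "amount": [r["amount"] for r in rows],
    "region": [r["region"] for r in rows],
}
col_total = sum(columns["amount"])  # scans one contiguous array only

assert row_total == col_total
```

On real hardware the columnar scan also benefits from cache locality and reduced I/O, which is where the order-of-magnitude gains reported below come from.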

Standalone grouping performance

Below please find the performance comparison between esProc and Oracle in a standalone environment for grouping and summarizing big data. We compare three scenarios: single thread, four threads, and column storage. Each scenario is further divided into three cases according to the grouping columns and summarizing columns. Two sets of data are tested: a wide table of 100 columns and a narrow table of 10 columns.

Hardware: Dell Power Edge T610, CPU Intel Xeon E5620*2, RAM 20G, HDD Raid5 1T
Software: CentOS 6.4, JDK 1.6, Oracle 11g, esProc 3.1


As can be seen, on a single machine with a single thread, the performance of Oracle is greater than that of esProc. However, Oracle's parallel option has no actual effect: on a single machine with 4 threads, the computing performance of esProc reaches or even exceeds that of Oracle. Running on column storage, esProc gains a performance advantage of roughly an order of magnitude.
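The multithreaded group-and-aggregate pattern measured here can be sketched as follows (a minimal Python illustration, not esProc code; the data and group keys are invented): each thread aggregates its own slice of the data into a small partial result, and the partials are merged at the end.

```python
# Sketch of parallel grouping: partial aggregation per thread, then merge.
# Note: in CPython, threads illustrate the pattern; a process pool would
# be needed for true CPU parallelism.
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

data = [(i % 10, i) for i in range(100_000)]  # (group_key, value) pairs

def partial_sum(chunk):
    """Group-and-sum one slice of the data."""
    acc = Counter()
    for key, value in chunk:
        acc[key] += value
    return acc

def grouped_sum(pairs, threads=4):
    size = (len(pairs) + threads - 1) // threads
    chunks = [pairs[i:i + size] for i in range(0, len(pairs), size)]
    with ThreadPoolExecutor(max_workers=threads) as pool:
        partials = pool.map(partial_sum, chunks)
    total = Counter()
    for p in partials:   # merge step is cheap: only one
        total.update(p)  # entry per group survives
    return dict(total)

result = grouped_sum(data)
```

The merge step stays small because each partial result holds one entry per group, which is why this pattern scales well when the number of groups is far below the number of rows.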

Standalone join performance

Below please find the performance comparison between esProc and Oracle in a standalone environment for big data join operations. We compare three scenarios: single thread, four threads, and column storage. Two sets of data are tested: a join between a wide table and a narrow table.

Hardware: Dell Power Edge T610, CPU Intel Xeon E5620*2, RAM 20G, HDD Raid5 1T
Software: CentOS 6.4, JDK 1.6, Oracle 11g, esProc 3.1



As can be seen, as the computation becomes more complicated, the advantage of esProc's parallel computing gets more obvious. By taking advantage of multiple threads, the performance of esProc surpasses that of Oracle.
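The join workload in these tests typically follows the classic hash-join pattern, which can be sketched as below (a hedged Python illustration, not esProc's implementation; the table shapes are invented): build a hash table on the smaller, narrow table, then probe it while scanning the larger, wide table.

```python
# Sketch of a hash join: build on the small table, probe with the big one.

narrow = {i: f"dim-{i}" for i in range(100)}        # key -> attribute
wide = [(i % 100, i * 1.5) for i in range(10_000)]  # (fk, measure) rows

def hash_join(fact_rows, dim_table):
    """Return joined (key, measure, attribute) rows; inner join."""
    out = []
    for key, measure in fact_rows:
        attr = dim_table.get(key)  # O(1) probe per fact row
        if attr is not None:
            out.append((key, measure, attr))
    return out

joined = hash_join(wide, narrow)
```

Because only the probe phase scans the big table, the scan parallelizes naturally across threads, each probing the same shared hash table.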

esProc supports lightweight parallel computing, evenly distributing jobs with huge data volumes, heavy computation, or a high degree of concurrency across multiple nodes. Within a node, data sharing allows multithreaded computing performance to be raised effectively. Between nodes, users can choose either data exchange through external storage or direct in-memory exchange, depending on the size of the result set, striking a balance between fault tolerance and performance: for small tasks with a high degree of concurrency, in-memory computing boosts performance; for time-consuming large jobs, external storage ensures reliability.

Cluster grouping performance

Below please find the performance comparison between esProc, Hive, and Impala in handling big data grouping in a cluster environment. Four algorithms are compared, differing in their grouping columns and summary columns. Two sets of data are tested: a wide table of 106 columns and a narrow table of 11 columns.

Hardware: 4 PCs, CPU Intel Core i5 2500, RAM 16G, HDD 2T/7200rpm, LAN 1000M
Software: CentOS 6.4, JDK 1.6, CDH 5.0 beta, esProc 3.1


In-memory exchange is faster for small data sets, while external-storage exchange is more stable for big data sets. esProc supports both methods; in-memory exchange was used in the above test. As can be seen, compared with Impala, which supports only in-memory exchange, esProc performs better; compared with Hive, which supports only external-storage exchange, esProc performs several times better.

For other scenarios where exchanging a big data set through external storage is a must, esProc offers a more efficient solution than HDFS does, and is more powerful than the MapReduce-based Hive.
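The external-storage exchange described above can be sketched roughly as follows (an illustrative Python sketch, not esProc's actual mechanism): when grouped data exceeds memory, rows are hash-partitioned into temporary files, and each partition, now small enough to fit in memory, is aggregated on its own.

```python
# Sketch of external (spill-to-disk) grouping: hash-partition rows to
# temporary files, then aggregate one partition at a time in memory.
import os
import tempfile
from collections import Counter

def external_grouped_sum(pairs, partitions=4):
    # Phase 1: spill each row to a partition file chosen by hash(key).
    tmpdir = tempfile.mkdtemp()
    files = [open(os.path.join(tmpdir, f"part{p}.txt"), "w")
             for p in range(partitions)]
    for key, value in pairs:
        files[hash(key) % partitions].write(f"{key}\t{value}\n")
    for f in files:
        f.close()

    # Phase 2: aggregate one partition at a time, fully in memory.
    result = {}
    for p in range(partitions):
        acc = Counter()
        path = os.path.join(tmpdir, f"part{p}.txt")
        with open(path) as f:
            for line in f:
                key, value = line.split("\t")
                acc[key] += int(value)
        result.update(acc)  # partitions share no keys
        os.remove(path)
    os.rmdir(tmpdir)
    return result

totals = external_grouped_sum((str(i % 10), i) for i in range(1000))
```

The trade-off the article describes is visible here: the spill phase adds disk I/O that in-memory exchange avoids, but each partition can be replayed after a failure, which is where the reliability for large jobs comes from.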

Cluster join performance

Below please find the performance comparison between esProc, Hive, and Impala in handling big data joins in a cluster environment. Two sets of data are tested: a join between a wide table and a narrow table.

Hardware: 4 PCs, CPU Intel Core i5 2500, RAM 16G, HDD 2T/7200rpm, LAN 1000M
Software: CentOS 6.4, JDK 1.6, CDH 5.0 beta, esProc 3.1


As can be seen, as the computing intensity gets greater, the performance gap between Impala and esProc narrows gradually. This is probably the result of Impala's native code generation mechanism. Although Hive and esProc both use interpreted execution, the performance gap between them is even greater, so we can conclude that in-memory computing boosts performance significantly.