In the previous article, we tested the performance of Hive, Impala, and esProc on grouping computations. In this article, we test and compare their performance on associating (join) computations.
Associating computation test on narrow tables
Data sample:
Associated table p_narrow
Col. count: 11
Row count: 500 million
Space occupied if saving as text: 120.6 GB
Data structure: personid int, name string, sex int, cityid int, birthday int, degree int, col1 string, col2 int, col3 int, col4 int, col5 string
Dimension table d_narrow
Col. count: 9
Row count: 10 million
Space occupied if saving as text: 563 MB
Data structure: id int, parentid int, col1 int, col2 int, col3 int, col4 int, col5 int, col6 int, col7 int
Description:
Associated table: similar to the left-side table in a SQL join; it has a large number of rows, for example, an order table.
Dimension table: similar to the right-side table in a SQL join; it has relatively few rows, for example, a table mapping client IDs to client names.
Test case:
Hive:
select sum(p_narrow.col3), count(p_narrow.col5), sum(d_narrow.col7), d_narrow.id%10000 from p_narrow join d_narrow on d_narrow.id=p_narrow.col7 group by d_narrow.id%10000
esProc: The code is divided into three parts: the program for the summary machine, the main program for each node machine, and the subprogram for each node machine. (A rough conceptual sketch of this structure follows the Impala query below.)
Impala:
select sum(p_narrow.col3), count(p_narrow.col5), sum(d_narrow.col7), d_narrow.id%10000 from p_narrow join d_narrow on d_narrow.id=p_narrow.col7 group by d_narrow.id%10000
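The esProc code itself is not reproduced here. The following is only a rough conceptual sketch of the three-part structure described above, written in Python rather than esProc; the data layout and helper names are assumptions made purely for illustration.

# Conceptual sketch only, not esProc code. Each node machine runs the subprogram
# over its local data segments, the node's main program merges those partial
# results, and the summary machine merges the per-node results in the same way.
from collections import defaultdict

def node_subprogram(segment, dim):
    # Subprogram on a node machine: join one local segment of p_narrow with the
    # dimension table d_narrow (d_narrow.id = p_narrow.col7) and group by id%10000.
    partial = defaultdict(lambda: [0, 0, 0])   # key -> [sum(col3), count(col5), sum(d.col7)]
    for row in segment:                        # row: dict with keys col3, col5, col7
        d = dim.get(row["col7"])
        if d is None:
            continue
        key = d["id"] % 10000
        partial[key][0] += row["col3"]
        partial[key][1] += 1 if row["col5"] is not None else 0
        partial[key][2] += d["col7"]
    return partial

def merge(partials):
    # Used by both the node machine's main program and the summary machine:
    # accumulate several partial results into one.
    total = defaultdict(lambda: [0, 0, 0])
    for partial in partials:
        for key, (s3, c5, s7) in partial.items():
            total[key][0] += s3
            total[key][1] += c5
            total[key][2] += s7
    return total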
Test results:
Hive    Impala    esProc
773s    262s      279s
Result description:
1. esProc and Impala clearly outperform Hive, running nearly three times as fast.
2. Impala is slightly faster than esProc, but the difference is small.
Associating computation test on wide table
Data sample:
Associated table p
Col. count: 106
Row count: 60 million
Space occupied if saving as text: 127.9 GB
Data structure:
personid int,name string,sex int,cityid int,birthday int,degree int,col1
int,col2 int,col3 int,col4 int,col5 int,col6 int,col7 int,col8 int,col9
int,col10 int,col11 int,col12 int,col13 int,col14 int,col15 int,col16 int,col17
int,col18 int,col19 int,col20 int,col21 int,col22 int,col23 int,col24 int,col25
int,col26 int,col27 int,col28 int,col29 int,col30 int,col31 int,col32 int,col33
int,col34 int,col35 int,col36 int,col37 int,col38 int,col39 int,col40 int,col41
int,col42 int,col43 int,col44 int,col45 int,col46 int,col47 int,col48 int,col49
int,col50 int,col51 int,col52 int,col53 int,col54 int,col55 int,col56 int,col57
int,col58 int,col59 int,col60 int,col61 int,col62 int,col63 int,col64 int,col65
int,col66 int,col67 int,col68 int,col69 int,col70 int,col71 int,col72 int,col73
int,col74 int,col75 int,col76 int,col77 int,col78 int,col79 int,col80 int,col81
int,col82 int,col83 int,col84 string,col85 string,col86 string,col87
string,col88 string,col89 string,col90 string,col91 string,col92 string,col93
string,col94 string,col95 string,col96 string,col97 string,col98 string,col99
string,col100 string
Dimension table d
Col. count: 102
Row count: 10 million
Space occupied if saving as text: 6.8 GB
Data structure: id int, parentid int,col1 int,col2 int,col3 int,col4 int,col5
int,col6 int,col7 int,col8 int,col9 int,col10 int,col11 int,col12 int,col13
int,col14 int,col15 int,col16 int,col17 int,col18 int,col19 int,col20 int,col21
int,col22 int,col23 int,col24 int,col25 int,col26 int,col27 int,col28 int,col29
int,col30 int,col31 int,col32 int,col33 int,col34 int,col35 int,col36 int,col37
int,col38 int,col39 int,col40 int,col41 int,col42 int,col43 int,col44 int,col45
int,col46 int,col47 int,col48 int,col49 int,col50 int,col51 int,col52 int,col53
int,col54 int,col55 int,col56 int,col57 int,col58 int,col59 int,col60 int,col61
int,col62 int,col63 int,col64 int,col65 int,col66 int,col67 int,col68 int,col69
int,col70 int,col71 int,col72 int,col73 int,col74 int,col75 int,col76 int,col77
int,col78 int,col79 int,col80 int,col81 int,col82 int,col83 int,col84 int,col85
int,col86 int,col87 int,col88 int,col89 int,col90 int,col91 int,col92 int,col93
int,col94 int,col95 int,col96 int,col97 int,col98 int,col99 int,col100 int
Description:
Associated table: similar to the left-side table in a SQL join; it has a large number of rows, for example, an order table.
Dimension table: similar to the right-side table in a SQL join; it has relatively few rows, for example, a table mapping client IDs to client names.
Test case:
Hive:
select sum(p.col3), count(p.col5), sum(d.col7), d.id%10000 from p join d on d.id=p.col7 group by d.id%10000
esProc: The code again consists of three parts (the program for the summary machine, the main program for each node machine, and the subprogram for each node machine), following the same structure sketched in the narrow-table test above.
Impala:
select sum(p.col3), count(p.col5), sum(d.col7), d.id%10000 from p join d on d.id=p.col7 group by d.id%10000
Test results:
Hive    Impala    esProc
525s    269s      268s
Result description:
Let us now summarize the results of the four tests and comment on them one by one.
Grouping and summarizing for narrow table
Test case                                          Hive    Impala    esProc
1 col. for grouping and 1 col. for summarizing     501s    256s      233s
1 col. for grouping and 4 cols. for summarizing    508s    254s      237s
4 cols. for grouping and 1 col. for summarizing    509s    253s      237s
4 cols. for grouping and 4 cols. for summarizing   536s    255s      237s
1. esProc and Impala clearly outperform Hive, running roughly twice as fast or more.
2. esProc performs slightly better than Impala, but the advantage is small.
3. The number of columns used for grouping and summarizing has little impact on the performance of any of the three solutions.
Grouping and summarizing for wide table
Test case                                          Hive    Impala    esProc
1 col. for grouping and 1 col. for summarizing     457s    272s      218s
1 col. for grouping and 4 cols. for summarizing    458s    265s      218s
4 cols. for grouping and 1 col. for summarizing    475s    266s      219s
4 cols. for grouping and 4 cols. for summarizing   488s    271s      218s
1. esProc and Impala clearly outperform Hive, running roughly twice as fast or more.
2. esProc performs slightly better than Impala, but the advantage is small.
3. The number of columns used for grouping and summarizing has little impact on the performance of any of the three solutions.
4. Comparing these figures with the narrow-table results, you may find that the number of columns makes little difference to performance, while the total volume of the table has a direct impact. In addition, on the wide table the performance of Impala drops slightly, while the performance of Hive and esProc improves a bit.
Associating computation on narrow tables
Hive    Impala    esProc
773s    262s      279s
1. esProc and Impala clearly outperform Hive, running nearly three times as fast.
2. Impala performs slightly better than esProc, but the advantage is small.
Associating computation on wide table
Hive    Impala    esProc
525s    269s      268s
1. esProc and Impala outperform Hive greatly, running nearly twice as fast.
2. Impala is slower than esProc by only 1 second; with such a slight difference, the two can be regarded as performing equally well.
Interpretation and Analysis:
The performance of Hive is rather poor, which is easy to understand: MapReduce, the infrastructure underlying Hive, exchanges data between computational nodes via files in external storage, so a great deal of time is spent on hard disk IO. Impala and esProc offer better performance because they exchange intermediate results directly through memory. However, Impala is not dozens of times faster than Hive, as is widely believed.
Exchanging data in the form of files does bring some benefit: it ensures the reliability of intermediate results in the unstable environment of a large cluster. esProc supports both ways of exchanging data, leaving the choice to the programmer; Impala only supports the direct in-memory exchange, and Hive only supports the file-based exchange.
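To make the trade-off concrete, here is a toy illustration (again in Python, and not the actual mechanism of Hive, Impala, or esProc; the file name and result fields are assumptions): a partial result written to a file survives a failure and can simply be read again, at the cost of extra hard disk IO, while a result handed over in memory is faster but is lost if either side fails mid-computation.

import json

def partial_result(rows):
    # Pretend this is the partial aggregation produced on one node.
    return {"sum_col3": sum(r["col3"] for r in rows), "row_count": len(rows)}

# File-based exchange (the Hive way): persist the partial result to external storage.
def exchange_via_file(rows, path="partial_node1.json"):
    with open(path, "w") as f:
        json.dump(partial_result(rows), f)
    return path          # the consumer reads this file, possibly after a retry

# Direct exchange (the Impala way, and one of esProc's two options): hand the
# partial result straight to the consumer in memory.
def exchange_in_memory(rows):
    return partial_result(rows)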
For grouping and summarizing, esProc performs slightly better than Impala. This is mainly because esProc can access the local disk directly, while Impala must go through HDFS to reach the hard disk; an extra layer of control naturally slows the process down.
However, in the associating computation the relative performance of esProc and Impala is the opposite of that in grouping and summarizing: Impala is equal to or slightly better than esProc. This is probably because Impala implements native code generation, so in pure CPU computing it is somewhat faster than esProc, which executes code by interpreting it. Although Impala relies on HDFS to access the hard disk, its higher CPU efficiency makes up for the time lost there. As you can imagine, in grouping and summarizing the time spent on hard disk access far exceeds the time spent on CPU computing, while in the associating computation the share of CPU computing grows, which is why Impala catches up with and overtakes esProc. In addition, from this analysis it is not difficult to conclude that the ratio of CPU computation to hard disk access is greater for narrow-table operations than for wide-table operations. The test data also shows that Impala's performance advantage is much more obvious when handling the narrow table, which verifies this assumption from another perspective.
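A rough back-of-the-envelope check with the sample sizes above supports this: the narrow associated table holds 500 million rows in about 120.6 GB, roughly 0.24 KB per row, while the wide associated table holds 60 million rows in about 127.9 GB, roughly 2.1 KB per row. Scanning the same volume of data therefore touches nearly nine times as many rows in the narrow case, and every row scanned triggers join-lookup and grouping work on the CPU, so the CPU share of the total workload is naturally higher for the narrow table.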
The number of columns used for grouping and summarizing does not have much impact on performance. This is because the syntax in these cases is quite simple, and most of the time is spent on hard disk access rather than on computation. Moreover, unlike esProc, Hive and Impala are not procedural languages and cannot handle more complex computations, so this kind of idle CPU usage is common for them.