Let's illustrate this with an example:
The two files above, PersonnelInfo and PersonnelInfo.txt, store the same personnel information, 100,000 records with 6 fields each, in a binary file and a text file respectively. As can be seen, the binary file occupies less space on the hard disk than the text file.
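The size gap can be reproduced outside esProc. The sketch below is plain Python, not esProc syntax, and the record values are invented stand-ins for the personnel data; it writes 100,000 numeric records once as tab-separated text and once as fixed-width binary, and the binary file comes out smaller because each number occupies a fixed width instead of one byte per digit plus separators:

```python
import os
import struct
import tempfile

# Hypothetical stand-in for the personnel data: 100,000 records with an
# id, an age, and a salary (all invented values, not the original fields).
records = [(1_000_000_000 + i, 20 + i % 45, 30000.0 + i) for i in range(100_000)]

tmp = tempfile.mkdtemp()
txt_path = os.path.join(tmp, "PersonnelInfo.txt")
bin_path = os.path.join(tmp, "PersonnelInfo")

# Text file: every digit of every number is spelled out, plus separators.
with open(txt_path, "w") as f:
    for rid, age, salary in records:
        f.write(f"{rid}\t{age}\t{salary}\n")

# Binary file: each record packs into a fixed 16 bytes (int, int, double).
rec = struct.Struct("<iid")
with open(bin_path, "wb") as f:
    for r in records:
        f.write(rec.pack(*r))

print(os.path.getsize(bin_path) < os.path.getsize(txt_path))  # True
```

The margin depends on the data, of course: long strings gain little, while wide numeric fields gain the most.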
Next, let's compare how fast data can be retrieved from these two files:
Using the binary file, the computing time (in milliseconds) consumed in A5 is shown below:
Using the text file, the computing time is shown below:
In the above two cellsets, the binary file and the text file are used to perform the same grouping and aggregation, counting the number of employees in each state, with the time consumed (in milliseconds) computed in cell A5. As the results show, data is retrieved significantly faster from the binary file than from the text file.
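The aggregation itself is simple to picture. As a language-neutral illustration (a Python sketch with hypothetical state codes, not esProc syntax), the same count-by-state grouping over a text and a binary representation yields identical results, while the binary form skips the per-field text parsing that makes the text route slower:

```python
import struct
from collections import Counter

# Hypothetical records: (employee id, 2-letter state code); the codes and
# their distribution are invented for illustration.
records = [(i, ("CA", "NY", "TX", "WA", "OR")[i % 5]) for i in range(100_000)]

# Text form: each line must be split and its fields parsed before grouping.
text = "".join(f"{rid}\t{state}\n" for rid, state in records)
text_counts = Counter(line.split("\t")[1] for line in text.splitlines())

# Binary form: fixed 6-byte records (4-byte id + 2 state bytes); grouping
# slices the state bytes directly, with no per-field text parsing.
rec = struct.Struct("<i2s")
blob = b"".join(rec.pack(rid, state.encode()) for rid, state in records)
bin_counts = Counter(
    rec.unpack_from(blob, off)[1].decode()
    for off in range(0, len(blob), rec.size)
)

print(text_counts == bin_counts)  # True: identical grouping result
```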
In short, it is more efficient to store data in binary files in esProc.
In big data computation, a common solution is to split the data first and then compute on each part separately. When working with file data, whether a text file or a binary file, the @z option can be used to retrieve the data by block. For example:
In both A2 and A4, the @z option is used when generating the cursor to divide the cursor data into 5 parts according to the parameters. A2 returns the 1st part and A4 returns the 2nd. The data retrieved in A3 and A5 are shown below:
The file PersonnelInfo contains 100,000 records in total. It is divided into 5 parts of roughly equal byte size, rather than precisely by record count. When retrieving data by block, esProc automatically adjusts the range of each block to keep every record intact. As the 1st and 2nd blocks illustrate, adjacent blocks are exactly continuous, so no record is duplicated or lost during computation.
For text files, the @z option is used in exactly the same way as for binary files.
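The block-splitting behavior described above can be mimicked in a few lines. The following Python sketch is an analogy to @z, not esProc's actual implementation: it splits a text file into 5 byte-sized ranges and snaps each block's start forward to the next record boundary, so the blocks are continuous with no duplicates:

```python
import os
import tempfile

def read_block(path, part, total):
    """Read the part-th of total blocks (1-based) from a text file.
    Blocks are cut by approximate byte size, and each block's start is
    snapped forward to the next record boundary, so no record is split,
    duplicated, or lost."""
    size = os.path.getsize(path)
    start = size * (part - 1) // total
    end = size * part // total
    lines = []
    with open(path, "rb") as f:
        if start > 0:
            f.seek(start - 1)
            f.readline()          # move to the start of the next full record
        while f.tell() < end:     # include records that begin inside the range
            line = f.readline()
            if not line:
                break
            lines.append(line)
    return lines

# A small hypothetical file standing in for PersonnelInfo.txt.
path = os.path.join(tempfile.mkdtemp(), "PersonnelInfo.txt")
with open(path, "wb") as f:
    for i in range(1000):
        f.write(b"record %d\n" % i)

# Reassembling all 5 blocks reproduces the file exactly: no gaps, no overlaps.
blocks = [read_block(path, p, 5) for p in range(1, 6)]
with open(path, "rb") as f:
    whole = f.read()
print(b"".join(line for b in blocks for line in b) == whole)  # True
```

A record that straddles a block boundary is read in full by the block where it begins, which is why the next block starts one byte early and skips to the following record.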
If an access-intensive big data table contains many fields, you can use the columnar storage of binary files to split the table into multiple files by field. In this way, you can select only the files for the desired fields when generating the cursor, avoiding reading unnecessary data. For example:
In A9, the cursor data are saved as multiple binary files in columnar format, with each file storing one column of data. In A10, the files corresponding to the desired fields are selected to jointly build the cursor, which can be used as conveniently as a normal cursor while keeping the extra data from consuming system resources. In A11, the first 100 records are retrieved, as shown below:
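The idea behind this columnar layout can be sketched as follows. This is a Python analogy with invented field names (eid, name, state), not esProc's actual file format: each field is written to its own file, and a query opens only the files for the fields it needs:

```python
import os
import pickle
import tempfile

# Hypothetical rows; the field names eid/name/state are invented.
rows = [{"eid": i, "name": f"E{i:05d}", "state": ("CA", "NY", "TX")[i % 3]}
        for i in range(1_000)]

# Write one binary file per field: the columnar layout.
tmp = tempfile.mkdtemp()
for field in ("eid", "name", "state"):
    with open(os.path.join(tmp, field + ".col"), "wb") as f:
        pickle.dump([row[field] for row in rows], f)

def read_columns(directory, fields):
    """Rebuild row tuples by opening only the requested column files."""
    cols = []
    for field in fields:
        with open(os.path.join(directory, field + ".col"), "rb") as f:
            cols.append(pickle.load(f))
    return list(zip(*cols))

# A query on eid and state never touches name.col.
first = read_columns(tmp, ["eid", "state"])[:5]
print(first)
```

The saving grows with the number and width of the fields a query can skip.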
However, because block-based retrieval divides each file by data volume rather than by record count, consistency cannot be guaranteed across the multiple columnar files. Therefore, block-by-block access is not allowed for a file cursor composed of multiple columnar files.
If multiple files are accessed simultaneously on a mechanical disk and the file buffer is small, retrieval efficiency drops sharply because of frequent seeks between the files. The file buffer should therefore be set larger, such as 16M. Note, however, that an oversized buffer, or too many parallel threads, may cause memory overflow. On a solid-state disk this slowdown does not occur, so the default file buffer setting of 64K (65536 bytes) is sufficient.
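In Python terms, the trade-off looks like this (a sketch only: the 16M figure mirrors the suggestion above for mechanical disks, while Python's own io.DEFAULT_BUFFER_SIZE plays the role of the default setting):

```python
import os
import tempfile

# A throwaway file standing in for PersonnelInfo.
path = os.path.join(tempfile.mkdtemp(), "PersonnelInfo")
with open(path, "wb") as f:
    f.write(b"x" * 1_000_000)

# Mechanical disk: a large per-file buffer (16M) reduces seek churn when
# several files are read in parallel, at the cost of memory per open file.
with open(path, "rb", buffering=16 * 1024 * 1024) as f:
    data_hdd = f.read()

# Solid-state disk: the library's default buffer is sufficient.
with open(path, "rb") as f:
    data_ssd = f.read()

print(data_hdd == data_ssd)  # True: only the buffering strategy differs
```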
The Hadoop system is characterized by high fault tolerance, low cost, and high transfer speed. With these advantages, HDFS is a popular distributed file system often used for big data storage. However, while using HDFS data in esProc is convenient, it is not efficient: the data must be retrieved over the network, and the network transfer delays retrieval, making it significantly slower than reading data from the local disk. Therefore, the usual practice is to keep the raw data on HDFS and generate temporary files locally during computation to achieve higher performance.