August 13, 2014

Binary File in esProc

In esProc, we often use two kinds of data file: the normal text file and the binary file. The binary file adopts a compressed encoding with low CPU consumption, so a compressed binary file takes less space than an uncompressed text file and can be read more efficiently. The binary file is therefore usually the better choice when you need to use data files.

Let's illustrate it with an example:

The two files, PersonnelInfo and PersonnelInfo.txt, store the same personnel information, 100,000 records of 6 fields, as a binary file and a text file respectively. On the hard disk, the binary file is smaller than the text file.
Then, let's check how quickly data can be retrieved from each of these two files:

Using the binary file, the computing time (in milliseconds) measured in A5 is shown below:

Using the text file, the computing time is shown below:

In the above two cellsets, the binary file and the text file are used to perform the same grouping and aggregation, counting the number of employees in each state, and the time consumed (in milliseconds) is computed in cell A5. As the results show, data is retrieved significantly faster from the binary file than from the text file.
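
For readers without the cellsets at hand, the measurement can be sketched in Python. This is only an analogy to the esProc cellset, not its actual code: the function name count_by_state, the field name "State", and the tab-separated layout with a header row are all assumptions made for illustration.

```python
import csv
import time
from collections import Counter


def count_by_state(path, field="State"):
    """Count records per state in a tab-separated text file and time the run.

    The field name and file layout are assumptions for this sketch.
    Returns (counts, elapsed_milliseconds).
    """
    start = time.perf_counter()
    counts = Counter()
    with open(path, newline="", encoding="utf-8") as f:
        # The first line is treated as the header that names the fields.
        for row in csv.DictReader(f, delimiter="\t"):
            counts[row[field]] += 1
    elapsed_ms = (time.perf_counter() - start) * 1000
    return dict(counts), elapsed_ms
```

Running the same function against two encodings of the same data is what makes the timings comparable: only the file format changes, not the computation.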

In short, storing data in a binary file is the more efficient choice in esProc.

In big data computation, a common solution is to split the data into segments and compute each segment separately. When working with file data, whether in a text file or a binary file, the @z option can be used to retrieve the data by block. For example:
In both A2 and A4, the @z option is used when generating the cursor to divide the data into 5 parts according to the parameters. A2 returns the 1st part and A4 returns the 2nd. The data retrieved in A3 and A5 is shown below:
The file PersonnelInfo contains 100,000 records in total, so the data is evidently divided into 5 parts by approximate size, not precisely by number of records. When retrieving data by block, esProc automatically adjusts the range of each retrieval to keep every record intact. Taking the 1st and 2nd parts as an example, this ensures that the retrieved data is continuous across blocks, with no duplicates and no gaps.
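
The general idea of splitting by size and then adjusting to record boundaries can be sketched in Python for a plain text file. This is an illustration of the technique only, not esProc's implementation; the function name read_segment and the rule "a line belongs to the segment in which its first byte falls" are assumptions chosen for the sketch.

```python
import os


def read_segment(path, k, n):
    """Return the complete lines of segment k (1-based) out of n.

    The file is cut into n byte ranges of roughly equal size. A line is
    assigned to the range that contains its first byte, so every line is
    returned by exactly one segment.
    """
    size = os.path.getsize(path)
    start = size * (k - 1) // n
    end = size * k // n
    lines = []
    with open(path, "rb") as f:
        if start > 0:
            # Step back one byte and read to the next line break. This skips
            # the tail of a line that began in the previous segment, and
            # skips nothing if a new line begins exactly at `start`.
            f.seek(start - 1)
            f.readline()
        # Read every line that begins before `end`, even if it finishes
        # after `end`; the next segment will skip that overhang.
        while f.tell() < end:
            line = f.readline()
            if not line:
                break
            lines.append(line.rstrip(b"\r\n").decode("utf-8"))
    return lines
```

Because the cut points are byte offsets, the segments hold slightly different numbers of records, which matches the behaviour described above: roughly equal in size, never splitting a record, and continuous when read one after another.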

For text files, the @z option is used in exactly the same way as for binary files.

If an access-intensive big data table contains many fields, you can use the columnar storage of the binary file to store the table as multiple files, split by field. You can then select only the data files of the desired fields to generate the cursor, so that unnecessary data is not read. For example:
In A9, the data in the cursor is saved as multiple binary files in a columnar format, with each file storing only one column of data. In A10, the files corresponding to the desired fields are selected to jointly build a cursor, which can be used as conveniently as a normal cursor while keeping system resources from being consumed by extra data. A11 retrieves the first 100 records, as shown below:
However, because retrieving data by block divides a file by its data volume rather than by the number of records, the blocks of different columnar files cannot be guaranteed to line up with one another. Therefore, segmented access is not allowed for a file cursor composed of multiple columnar files.
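
The columnar layout itself can be sketched in Python. This is a simplified illustration under stated assumptions, not esProc's file format: the names write_columns and read_columns, the ".col" extension, and the use of pickle for encoding are all hypothetical.

```python
import os
import pickle
from contextlib import ExitStack


def write_columns(records, fields, directory):
    """Store each field of the records in its own binary file."""
    with ExitStack() as stack:
        handles = {
            name: stack.enter_context(
                open(os.path.join(directory, name + ".col"), "wb"))
            for name in fields
        }
        for rec in records:
            for name in fields:
                # One pickled value per record, appended to that column's file.
                pickle.dump(rec[name], handles[name])


def read_columns(directory, wanted):
    """Yield records built only from the column files named in `wanted`."""
    with ExitStack() as stack:
        handles = [
            stack.enter_context(
                open(os.path.join(directory, name + ".col"), "rb"))
            for name in wanted
        ]
        while True:
            try:
                # Read the next value from every selected column in step.
                values = [pickle.load(h) for h in handles]
            except EOFError:
                return
            yield dict(zip(wanted, values))
```

Reading works only because all column files are consumed in lockstep from the beginning. The files differ in size, so the same byte fraction of two column files generally lands on different records, which is why splitting them by data volume would break the alignment.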

If multiple files are accessed simultaneously on a mechanical hard disk and the file buffer is small, data retrieval efficiency drops sharply because of the frequent switching between files. In this case the file buffer should be set to a larger size, such as 16M. Note, however, that memory overflow may occur if the file buffer is too large or there are too many parallel threads. With a solid state disk instead of a mechanical hard disk, retrieval speed does not suffer this sharp drop, so you can simply keep the default file buffer setting, i.e. 64K (65536 bytes).
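
The trade-off between buffer size and memory can be made concrete with a small Python sketch. It uses Python's own buffering parameter of open() as a stand-in for the esProc file buffer setting, which is configured differently; the function names and the assumption of one buffer per open file per thread are illustrative only.

```python
DEFAULT_BUFFER = 64 * 1024          # 65536 bytes, the default cited above
LARGE_BUFFER = 16 * 1024 * 1024     # 16M, the larger setting cited above


def estimated_buffer_memory(buffer_bytes, files_per_thread, threads):
    """Rough memory held by file buffers, assuming one buffer per open
    file in each thread (an assumption of this sketch)."""
    return buffer_bytes * files_per_thread * threads


def read_interleaved(paths, buffer_size=DEFAULT_BUFFER, chunk=4096):
    """Read several files in turn, a chunk at a time, and return the
    number of bytes read from each. A larger buffer means each physical
    read fetches more data before the disk has to switch files."""
    totals = [0] * len(paths)
    files = [open(p, "rb", buffering=buffer_size) for p in paths]
    try:
        active = True
        while active:
            active = False
            for i, f in enumerate(files):
                data = f.read(chunk)
                if data:
                    totals[i] += len(data)
                    active = True
    finally:
        for f in files:
            f.close()
    return totals
```

Under these assumptions, 6 column files read by 8 parallel threads with a 16M buffer would hold 16M x 6 x 8 = 768M in buffers alone, which shows how quickly an oversized buffer combined with many threads can exhaust memory.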

The Hadoop system is characterized by high fault tolerance, low cost, and high transfer speed. With these advantages, HDFS is a popular distributed file system and is often used for big data storage. However, although it is convenient to use HDFS data in esProc, it is not efficient: the data has to be retrieved over the network, and network transfer makes retrieval significantly slower than reading data from the local disk. Therefore, when working with HDFS, usually only the raw data is stored on HDFS, while the temporary files generated during computation are kept locally to achieve higher performance.