Showing posts with label text files. Show all posts
Showing posts with label text files. Show all posts

June 30, 2015

Segmented Data Exports to Text Files of Specified Format in esProc



Below is a selection from a database table, TB1:

Based on this data, you need to generate text files of the specified format and with the following names (yyyyMMddHH) by the hour.
2011101106.txt
1|2011-10-11 6:24:21|aaa
2|2011-10-11 6:30:45|dddd

2011101107.txt
3|2011/10/11 7:10:12|dsf
4|2011/10/11 7:50:38|dffew
5|2011/10/11 7:59:59|dfae

2011101108.txt
6|2011/10/11 8:00:00|edfae

After grouping data by the hour, you export every group to a text file according to the specified format (|) and name it after the hour: 

A1: Use a SQL to create a database cursor sorted by CreateTime. db represents the connection to a database.
A2-B2: Group data by CreateTime (yyyyMMddHH), and fetch data by loop, one group each time, to export, through B2, to text files named after the create time. And specifiy format for the text data at the same time.

Text files are as follows: 

June 15, 2015

Find Differences between Text Files with esProc


Compare the following two text files of the same format and write the differences (different records) into a text file.
Content of 1.txt
Proto      Local Address Foreign Address     State
TCP 111.11.1.111:1975   221.181.70.12:5926       ESTABLISHED

Content of 2.txt
Proto      Local Address Foreign Address     State
TCP 111.11.1.111:1975   221.181.70.12:5926       ESTABLISHED
TCP 111.11.1.111:1111   169.254.241.33:2222     ESTABLISHED

Content of result.txt with exported data will be:
Proto      Local Address Foreign Address     State
TCP 111.11.1.111:1111   169.254.241.33:2222     ESTABLISHED


esProc script for doing this

A1: The path where text files are stored.

A2: fns is the script parameter for storing the two text files, 1.txt and 2.txt, under comparison. Import the files and sort each of them by all the fields.

A3: The first half line of the code - A2.merge@d(#1,#2,#3,#4) - gets difference between 1.txt and 2.txt; the second half - A2.rvs().merge@d(#1,#2,#3,#4) - gets difference beweeen 2.txt and 1.txt. Results will then be merged.

A4: Export A3’s result to result.txt, which concludes the computation. 

June 5, 2015

esProc Joins Text files and Generates a Computed Column

There are two tab-separated structured text files. chr column in AssociatedMarkers.txt is the logical foreign key pointing to Chr column in DiseaseMarkers.txt. We want to create a new structured text file, in which one column comes from AssociatedMarkers.txt’s snps_BCG24 column and the other is a computed column that will get its values through the following algorithm: If a value of AssociatedMarkers.txt’s hg19pos column falls within the startLoc and endLoc in DiseaseMarkers.txt, then output it as inLocus; otherwise output it as an empty string. Selections of the two files are as follows:

AssociatedMarkers.txt

DiseaseMarkers.txt

esProc approach

A1,A2: Import the files into memory. @t means importing column names at the same time.

A3: Perform join operation. Result is as follows:

A4: Retrieve desired columns from A3. _1.hg19pos column corresponds AssociatedMarkers.txt’s hg19pos column. The final result is as follows:

July 17, 2014

Process Text Files with esProc

It is quite convenient to process data from text files with esProc, which provides many functions for processing text files: import various text files; process big text files; visit text files of hdfs; as well as general operations, such as, file moving, deleting and checking whether a file exists. The following will illustrate these functions through examples.

File function and import function can be used if data in text files need to be read in. For example, the following text file, empolyee.txt, stores employee information:

EID   NAME SURNAME GENDER STATE BIRTHDAY   HIREDATE    DEPT SALARY
1   Rebecca Moore F California 1974-11-20 2005-03-11  R&D 7000
2 Ashley Wilson F New York 1980-07-19 2008-03-16  Finance   11000
3   Rachel Johnson F New Mexico 1970-12-17 2010-12-01 Sales 9000
4   Emily Smith F Texas 1985-03-07 2006-08-15 HR 7000
5   Ashley Smith F Texas 1975-05-13 2004-07-30 R&D 16000
6 Matthew Johnson M California 1984-07-07 2005-07-07 Sales   11000
7 Alexis Smith F Illinois   1972-08-16 2002-08-16 Sales 9000
8 Megan Wilson F California 1979-04-19 1984-04-19 Marketing 11000
9 Victoria Davis F Texas 1983-12-07 2009-12-07  HR 3000


It is convenient to import the file code:


In the above figure, cell A1 uses file function to open a designated file and cell A2 uses the file’s import function to import the file’s content. We can see in the figure that esProc's integrated development environment provides debugging functions, like single step, break point and so on. The imported file content can be displayed visually in the window on the right, making  debugging program and the observation of result more convenient. 

As the other functions provided by esProc, import function contains options and parameters. @t in import@t() in the above figure shows option t is used, and by looking up in the function reference, @t means that the data file’s first row is the field name. It is the parameter that the parentheses contains. The blank in parentheses in the above figure shows that all parameters are default.

Let’s look at how to code with those functions in esProc program:

1. Separator 
In the above figure, because tab is used as a separator in text file empolyee.txt, no input parameter is used in the parentheses of import function. The default character tab is used directly instead. If other separators need to be imported, they can be defined with parameters. An example: comma is used to separate fields in csv file supported by excel, so the code for import function is import@t(;”,”).   

2. Import a portion of the fields
When a portion of the fields, instead of all of them, is to be imported, parameters can be used to define them, e.g., import@t(EID,NAME,BIRTHDAY).
3. Data type designated to be imported
esProc automatically judges the type of imported fields. For instance, BIRTHDAY is in consistent with the format of date and time, so esProc will automatically import the datetime type. If the field is required to be imported as a string, the code may be like this: import@t(EID,NAME,BIRTHDAY:string). The data types compatible with esProc include bool, int, long, float, decimal, number, string, date, time and datetime. 

4. Importing data piecewise 
It is required to piecewise import data when the data size of a text file is larger than the memory. import function of esProc provides two input parameters to read data from designated starting bytes to end bytes. For instance, import@t(EID,NAME,BIRTHDAY;,1000:2000) reads data from the 1000th bytes to the 2000th bytes. Usually, text file data gets its meaning when read by rows. import function automatically adjusts the position of starting point and end point to ensure a full data row. In order to ensure that the imported data consists exclusively of full rows, when the starting point happens to be in the middle of a data row, the import function will skip the row and begin with the next row; but if it is the end point that happens to be in the same position, the reading will continue until the end of the row. Reading data piecewise rules that the starting character will not be read while the end character will. So by making each end point to be the starting one of the next reading, all data rows will be imported without omission and repetition. Thus, in the preceding example, import@t(EID,NAME,BIRTHDAY;,2000:3000) is the appropriate code for the next reading. 

It's worth pointing out that because the data should not be segmented according to rows, the current row number can only be determined after the traversal of all preceding rows. Otherwise, the expected excellent performance of importing data piecewise won’t be achieved.  

esProc also segments data automatically. For example, if a file is divided into three parts and the second one is to be read, the code is like this: import@tz(EID,NAME,BIRTHDAY:string;,2:3). With the option @z, meanings of the two parameters become the total number of parts and the serial number of current part waiting to be read. In this method, it is convenient for multiple parallel tasks to piecewise process a big file. 

5. The process and export of big files
After the file is imported to the memory, it can be exported or re-imported into the database. For example, after empolyee.txt is imported, NAME and SURNAME could be merged into FULLNAME to be exported to employee_result.txt. If data size of the file is big enough, text cursor can be used to process it. esProc cursor can take out file data one by one and make computations. In addition, it provides quite a lot of cursor functions, making code more concise.

A complete esProc code is as follows:


In the above figure, cell A2 defines a cursor of file; cell A3 calculates full name; cell A4 exports the result of A3 to the result file. 

6. Access hdfs files.
It’s easy to access hdfs files with esProc. Your just need to change file functions to hdfsfile functions, e.g., hdfsfile("hdfs://192.168.0.204:9000/user/root/student.txt").For a compressed file, change the extension, e.g., hdfsfile("hdfs://192.168.0.204:9000/user/root/student.gzip"). Thus hdfs compresses or decompresses according to the extensions of files when reading and writing hdfs files. Compress methods includes gzip, LZO and LZ4 etc.

7. General file operations
We can make general file operations with esProc. For example, functions, such as move, exists, date, size, property and file name, respectively have the function of move (delete), checking whether a file exists, modifying date, size checking, reading property of configuration files and splitting full path, as well as obtaining file names and extensions.