esProc, A Script Language for Data Analytics with Parallel Mechanism: esProc Counts Distinct Columns in a Text File

Problem source: http://unix.stackexchange.com/questions/161885/using-awk-to-identify-the-number-identical-columns

There are some text files under the /data directory, each with a certain number of columns. We want to know how many distinct columns each file contains. For instance, f1.txt below has 3 distinct columns: columns 2 to 4 are identical, as are columns 5 and 6.

1 0 0 0 0 0

0 1 1 1 0 0

Suppose there is only one file. Then the code could be:

(f=file("/data/f1.txt").import(),f.fno().((c=#,f.(~.field(c)))).id().len())

The fno function returns the number of columns in a two-dimensional table; ~ represents the loop variable of a loop function; # represents the loop number; and the id function returns the distinct members, which here are the distinct columns.
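As a cross-check of the logic described above, here is a minimal Python sketch of the same idea. It is not esProc code, and the function name `count_distinct_columns` is made up for illustration. It assumes the fields are separated by whitespace and that every row has the same number of fields.

```python
def count_distinct_columns(text):
    """Count the distinct columns in whitespace-separated text.

    Assumes every non-empty line has the same number of fields;
    zip() would silently truncate to the shortest row otherwise.
    """
    # Split each non-empty line into its fields (one list per row).
    rows = [line.split() for line in text.splitlines() if line.strip()]
    # zip(*rows) transposes the rows, yielding one tuple per column.
    columns = zip(*rows)
    # A set keeps a single copy of each distinct column.
    return len(set(columns))


# The two sample rows of f1.txt shown earlier.
sample = "1 0 0 0 0 0\n0 1 1 1 0 0\n"
print(count_distinct_columns(sample))  # prints 3
```

The transpose-then-deduplicate steps play the same role as collecting each column with field and then applying id in the esProc expression.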

If there are many files under the /data directory, the code becomes more complicated:

pjoin((d=directory@p("/data")),d.((f=file(~).import(),f.fno().((c=#,f.(~.field(c)))).id().count())))

This line of code calculates the number of distinct columns in each file in turn and joins the results with the corresponding file names. The result table is as follows:

_1	_2
/data/f1.txt	3
/data/f2.txt	2
/data/f3.txt	3
/data/f4.txt	4
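The pairing of file names with counts can also be sketched in Python for comparison. This is an illustrative equivalent under stated assumptions, not the esProc implementation: the helper name `distinct_columns_per_file` is hypothetical, fields are assumed to be whitespace-separated, and every regular file in the directory is treated as a data file.

```python
import os


def distinct_columns_per_file(directory):
    """Return a list of (path, distinct_column_count) pairs.

    Assumes each regular file in `directory` is whitespace-separated
    text whose rows all have the same number of fields.
    """
    results = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue  # skip subdirectories
        with open(path) as fh:
            # One list of fields per non-empty line.
            rows = [line.split() for line in fh if line.strip()]
        # Transpose rows into column tuples and count the distinct ones.
        results.append((path, len(set(zip(*rows)))))
    return results
```

Calling `distinct_columns_per_file("/data")` and printing each pair separated by a tab would give a listing with the same layout as the result table above.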

To make the computational logic easier to follow, the above code can be written across multiple cells as a long statement:

== marks the beginning of the long statement, whose working range is the indented block B2-C5. B5 is the last executable cell, and its result is returned to A2.


September 6, 2015
