July 21, 2014

Basic Computations of esProc Sequences

A sequence is an ordered set consisting of some data, which are called members of the sequence. A sequence is similar to an array in high-level language, but the data type of its members not have to be the same. The following will explain its basic computation through creation, access, operators and functions.

Creation

1.Create with Constant
Bracketing members with “[]” represents sequence constant, e.g.

A5=[15.2,"b",1]        /Members of a sequence can be float, string and integer
B5=[A1:C3]             /Members are cell area, i.e.


C5=[3,A5,B4]            /A member of a sequence can also be a sequence,i.e.


A6=[1,2,3,3]              /Members can be repeated
B6=[]                      / A empty sequence
C6=[[]]                /A non-empty sequence that its member is an empty sequence

Note: Members of a sequence can be any data types, including atom type, another sequences, records, etc. A sequence that all its members are integers is called an integer sequence.

2.Create with functions
to(2,6)                         / Numbers of the integer sequence are [2,3,4,5,6]. A integer sequence beginning with 1 can be expressed with to(6)
"1,a,b,c".array()      /Split a character string into a sequence [1,a,b,c], a reversed joint can be expressed with [1,a,b,c].string()
periods@y("2012-08-10 12:00:00",now(),1)            /Create a integer sequence for a period of time by dividing it into smaller time periods whose starting and ending points are in adjacent years, sequential value is ["2012-08-10 12:00:00","2013-01-01 00:00:00","2014-01-01 00:00:00","2014-07-01 10:10:41"]
file("e:/sales.txt").import@t()                                              /Import records from structured text files to form a sequence, sequential value is: 


A sequence whose members are records is table sequence, which is often used to make computations of structured data. It’s not the focus in this article, for more information please refer to “Basic Computation of esProc Table Sequence and Record Sequence”.

3. Create by computing
Description: read the text file sales.txt into table sequence A1, fetch column Client to create sequence A2; group the records according to Client to create sequence A3.

A1=file("e:/sales.txt").import@t()
A2=A1.(Client)
A3=A1.group(Client)

Note:

1.Value of sequence A2:


2.Value of sequence A3:


We can see that members of sequence A3 are a number of sequences whose members are records.

Accessing
4.Access members according to serial numbers

A1=[a,b,c,d,e,f,g] /Sequence A1
A1(2) /Fetch the second member whose value is string “b”, which equals to A1.m(2)
A1([2,3,4]) /Fetch members from the second to the fourth, whose value is expressed by the sequence [b,c,d]. Note that [2,3,4] is also a sequence(integer sequence). Intervals can be used to rewritten the expression as A1(to(2,4)).
A1.m(-1) /Fetch the last member. Note that m function must be used when fetching members backwards, the expression cannot be abbreviated to A1(-1). 

5.Assignment and modification
A1(2)=r /Modify the second member to r, now value of sequence A1 is [a,r,c,d,e,f,g]
A1([2,4])=["r","s"] /Modify the second and the fourth member. A1=[a,r,c,s,d,e,f,g]
A1.Modify(2,["r","s"]) /Modify in turn from the second member, expression A1= [a,r,s,d,e,f,g] equals to A1([2,3])=["r","s"]

6.Add members
A1.Insert(0,"r") /Add members at the end of the sequence, A1=[a,b,c,d,e,f,g,r]
A1.Insert(2,["r","s","t"]) /Insert three members consecutively before the second member, A1=[ a,r,s,t,b,c,d,e,f,g,r]

7.Delete members
A1.Delete(2) /Delete the second member
A1.Delete([2,4]) /Delete the second and the fourth member

Operators
8.Sets computation
Sets computation include ^ intersection, & union, \complement, and |concatenate, etc. For example:
A1=["a","b",1,2,3,4] /Sequence A1
B1= ["d","b",10,12,3,4]      /Sequence A2
A1^B1 /Intersection, return the sequence made up of members of both thetwo sequences, value is ["a","b",3,4]
A1\B1 /Complement, a new sequence created by successively removing from A1 the members of B1, value is ["a",1,2]
A1&B1 /Union, value is["a","b",1,2,3,4,"d",10,12]
A1|B1 /Concatenate, value is ["a","b",1,2,3,4,"d","b",10,12,3,4]
Note: Both union and concatenate are created by combining members of two sequences in order. Common members only appear once in union while, in concatenate, all of them will appear. 

9.Alignment arithmetic operation
Two sequences of the same length can make alignment operation according to members and return the sequence. The operation includes ++ (add), -- (subtract), ** (multiply), // (divide) and %% (complementation). For example: 

A1=[1,2,3,4] /Sequence A1
B1= [10,12,3,4] /Sequence A2
A1++B1 /Counterpoint addition, value is [11,14,16,18]

10.Boolean operation
Two sequences can compare in alignment, the result is a Boolean type.
[1,2,3]==[1,2,3] /Comparative result is true
[1,B,3]<=[1,b,4] /Comparative result is true, because B is less than b
[1,2,3]<[1,3,4] /Result is true, because the second member of [1,2,3] is “2” , which is smaller than the second member “3” of [1,3,4]
Note: “in” function is used to judge the inclusion relation between sequences.

Functions
11.Aggregate function
Functions for sequences include sum, avg, max, variance,etc. For example:
A1=[2,4,6] /Sequence
A1.sum() /Summation, result is 12
A1.sum(~*~) /Quadratic sum, which equals to 2*2+4*4+6*6, result is 56. ~ represents each member of a sequence.

12.Loop function
Loop function can make computation aiming at every member of a sequence, and express complex loop statement with simple functions, including loop computation, filter, locate, look up, rank, sort, etc.
A1=[2,4,-6] /Construct sequence A1 
A1=(~+1) /Add 1 to every member, result is [3,5,-5]
A1.select(~>1) /Filter out members that are greater than 1, result is [2,4]
A1.pselect@a(~>1) /Locate serial numbers of members that are greater than 1, result is [1,2]
A1.pos([-6,2]) /Look up serial numbers of members -6 and 2 in A1, result is [3,1]
A1.rank() /Rank of members of the sequence, result is [2,1,3]
A1.sort() /Sort in ascending order, result is [-6,2,4]; [2,4,-6].sort(~:-1) is the expression when sorting in descending order

Concepts of esProc Sequence,Table Sequence and Record Sequence

Sequence,table sequence and record sequence are commonest data types of esProc. This article tries to expound their respective characteristics as well as relations between them.

A sequence is an ordered generic set
Collectivity:

A sequence consists of multiple data,which are members of the sequence. The members can be any type of data, such as string,integer,decimal and date, or null.A sequence has the general characteristics of a set, and can make set operations. A1, A2 and A3 in the following are all sequences.

A1=[] /empty sequence
A2=[5,6,7] /a sequence whose members are integers is also called an integer sequence
A3=["red","blue","yellow"] /a sequence whose members are strings
A4=["blue","yellow","white"] /a sequence whose members are strings
A5=A3^A4 /intersection operation of sequences. A5 is also a sequence whose value is ["blue","yellow"]

Genericity
A sequence is a genericity set, which contains members of various data types. Its member can also be a sequence, like B1 and B2:

B1=[1,date("2014-05-01")] / members of sequence include an integer and a date
B2=["blue",[],[5,6,7]] /a sequence consisting of a sequence equals to ["blue",A1,A2]

Orderliness
Generally, a set is unordered, that is, two sets with same members of different order are equal.A sequence is ordered.Two sequences with same members of different order are not equal.e.g. There are two sequences A1= ["Mike","Tom"] and B1=["Tom","Mike"]. When examined with expression A1==A1 to see whether they are equal or not,the result is false.

Orderliness is a common feature of business data. For instance,that Mike comes before Tom may mean that Mike has done a better job in school study.Sorting sales amount monthly can clearly present its changing rule. It is more convenient to use a sequence to do ordered computation. Such as,

A1(2) /Fetch the second member. This operation can also be expressed with A1.m(2).
A1.m(-1) /Fetch the last member.
A1.p("Tom")/Fetch the sequence number of member Tom.
A1.rvs() /Reverse the sequence.

In addition, there are operations like insert, delete, modify, copy, compare, convert, sub-sequence,sorting,rank, sets computation, mutual transformation of strings and sequences, etc.
An integer sequence is a sequence whose members are integers. It has a more detailed access method.Such as,
to(2,5) /Create sequence [2,3,4,5]. If the sequence begins with 1, it can be abbreviated to to(5).

Basic Computations of esProc Sequences

A table sequence is a structured sequence
Structured two-dimensional data objects
Members of a sequence can be any types of data, such as,atom type, another sequence or a record. If members of a sequence are records of the same structure(of the same field number and field name), it is called a table sequence. The data objects in the following figure constitute a table sequence:

Because a table sequence isa structured two-dimensional data object,it is usually created from SQL, text files, binary files, Excel files or anempty table sequence. A1,B1 and C1 in the following are table sequences.

A1=file("e:/sales.txt").import@t() /table sequence created from a text file
B1=Oracle1.query("select * from sales") /table sequence created from SQL
C1=create(OrderID,Client,SellerId,Amount,OrderDate) /create anempty table sequence

A great deal of structured data operations can be performed within a table sequence, including query, sort, sum, average value, merging repeated records, etc. Such as,

A1.select(Amount>=2000 && year(OrderDate)==2009) /Query out records whose Amount field is greater than or equal to 2000 and OrderDate is the year 2009.
A1.sort(SellerId,OrderDate:-1) /Sort records in ascending order according to SellerID field. With the same SellerID, sort records in descending order according to OrderDate.
A1.groups(SellerId,year(OrderDate);sum(Amount),count(~)) / Sum up Amount of each group of data and count up orders of each group according to SellerID and the year.

A table sequence is a special sequence
A table sequence is still a sequence.The latter’s collectivity, orderliness and related functions apply in a table sequence. A table sequence hasn’t the feature of genericity because its members must be records of the same structure. But, the field values of the records can be generic data, which in this sense is another form of genericity.Thanks to these features, a table sequence is better at handling complicated computation than traditional program language does.

For instance, based on orderliness, we can find the growth rate of each month compared with the previous one in a table sequence. The statement will be:
sales.derive(salesAmount /salesAmount [-1]-1:compValue)

Another instance. Assume that, a big contract in business is that whose amount of order is greater than 40 and an important contract is that whose unit price is more than 2000. Please find, according to collectivity, 1. contracts of this year which are both big and important; 2. the other contracts.
thisYear=Contract.select(year(SellDate)=2014) /contracts of this year
big= Contract.select(Quantity>40) /big contracts
importance = Contract.select(AMOUNT>2000) /important contracts
answer=thisYear^big^importance /answer to question 1
others= Contract\answer /answer to question 2

Note that Contracts in the code are table sequences while thisYear, big, Importance,answer,others are record sequences originating from table sequences. The difference and relation between a table sequence and a record sequence will be explained in the following.
Make contextual computing according to genericity: Among subordinates of department managers who have won the President Award, who have been awarded the Annual Outstanding Staff?
employee.select(empHonor: "EOY",empDept.manager.empHornor:"PA")

A record sequence is the reference of table sequence records
Obviously, if each computation in a table sequence is to produce a new table sequence, a great deal of the memory will be occupied. For instance, a table sequence, sales, has 5,000 records, and 3,000 ones will be produced by query. If new table sequences are to created, there will be 8,000 records in memory. In fact, as the 3,000 records are part of the original table sequence, it is unnecessary to create table sequences anew. We just need to store their references by using certain data objects. This type of data objects is called a record sequence.

Transparency of record sequences
Usually, it’s not necessary for programmers to differentiate record sequences and table sequences, as what they do with references and physical data. Operations in the preceding example, such as, query, sorting and intersection, can be used in both record sequences and table sequences with the same syntax:
A1.select(Amount>=2000 && year(OrderDate)==2009)
A1.sort(SellerId,OrderDate:-1)
Note: when A1 is a table sequence, the result will be a record sequence; and the same result will be obtained when A1 is a record sequence. Sets operation can be performed between record sequences, e.g.,
answer=thisYear^big^importance

It hasn’t any practical significance to make sets operation between table sequences, for members of different table sequences are always different objects. For instance, intersection of this year and big is null forever if both of them are table sequences.

When data structure changes, esProc will automatically create new table sequences, like what will happen in grouping and summarizing.
A1.groups(SellerId,year(OrderDate);sum(Amount),count(~))

A table sequence has a one-way influence on a record sequence
Different table sequences represent different physical records, so modifying a table sequence will not affect the other ones. But, a record sequence is the reference of the records of a table sequence and the two have the same physical data, any change of the table sequence will affect the record sequence. Such as,

TSeq=file("e:/sales.txt").import@t() / TSeq is a table sequence, Client field value of the record represented by OrderID=5 is DSG
RSeq=TSeq.select(Amount>2000) /RSeq is a record sequence created from TSeq. Amount of the record represented by OrderID=5 is greater than 2000, so the record is a member of RSeq
TSeq.modify(5, "WVF":Client) /Modify Client field of the record represented by OrderID=1 into WVF

Now it can be seen that Client field of the record represented by OrderID=5 in RSeq has also changed into WVF. If there are multiple record sequences that originate from the same table sequence, their data will change accordingly. Often, this is not what a programmer wants to see. For instance, the Amount field value of the record represented by OrderID=5 is 3730, and if the following computation is to executed:

RSeq =TSeq.select(Amount>2000) / The record represented by OrderID=5 is in RSeq
TSeq.modify(5, 1000: Amount) /Modify Amount field of the record represented by OrderID=5 into 1000, i.e., less than 2000

Now it can be found that the record represented by OrderID=5 in RSeq is still there (because its reference didn’t change), and its Amount value is 1000, which is inconsistent with the condition Amount>2000, and sometimes incorrect in business.
In order to avoid this influence, users should finish modifying the table sequence before the record sequence is created. Such as,

TSeq.modify(5, 1000: Amount) /Modify Amount field of the record represented by OrderID=5 into 1000, i.e., less than 2000
RSeq =TSeq.select(Amount>2000) /The record represented by OrderID=5 won’t appear in RSeq

This computed result is what the business is required. It’s only natural for the above computing order as long as the reference relation between a table sequence and a record sequence is understood.

It’s important to note that operations to modify the original table sequence, like modify, cannot be accepted in a record sequence, the relation between the two is a safe one-way influence. So users can safely use record sequences and table sequences.

Index Performance Comparison between Oracle and esProc

Data table indexing is a common method to accelerate query in Oracle. esProc also provides indexing function. With the actual measurements in the several examples below, we can compare their speed after indexing.

The test is performed on data table T3 with 165 million records. The binary data file being saved in esProc format takes up 14.6 G physical storage,in which fields are shown below:

CREATE TABLE "T3"
( "L11" NUMBER(11,0),
"L4" NUMBER(9,0),
"D4" VARCHAR2(9),
"C4" VARCHAR2(10),
"R2" DATE,
"R4" DATE,
"FL6" NUMBER(9,0),
"FD6" VARCHAR2(6),
"FC6" VARCHAR2(9),
"TL1" NUMBER(2,0),
"TL11" NUMBER(7,0),
"TN1" NUMBER(5,2),
"TN11" NUMBER(23,2),
"TN21" NUMBER(9,2),
"TN31" NUMBER(9,2)
)

Provide the same hardware for both Oracle and esProc with the below environment configuration:
Model for test: Dell Power Edge T610
CPU: Intel Xeon E5620*2
RAM: 20G
HDD: Raid5 1T
Operating system: CentOS 6.4
JDK: 1.6
Oracle version: 11g
esProc version: 3.1

1.Indexing Performance Comparison
4 indexes have been created for both Oracle and esProc. The first 3 are the single-field indexes, and the forth index is the composite indexes.
Note: The test results in this article are all represented in seconds unless otherwise remarked.

As can be seen from the above figure, indexing in Oracle is faster than that in esProc. The main application scenario of esProc is the data computing in BI. In this sector, the data seldom changes. Owing to this, the comparatively slow speed is acceptable since indexing can be regarded as a one-off job.

2.   Single-field Indexing Performance Comparison

In below discussion, let’s compare the query speed between Oracle and esProc. To start with, let’s compare three single-field indexes. In which, the ind1 is the single-field index for the integer field L4; the ind2 is the single-field index for real number field TN21; and the ind3 is the single-field index for the character field C4. Compare the respective time consumed to query based on filtering criteria.

2.1.  Less than 10 records satisfying the query conditions 




As can be seen from the above figure, compared with Oracle, esProc is faster in handling the integer field, comparable in handling the real number field, and slower in handling the character field.

2.2.  Around 100 records satisfying the query conditions 


As can be seen from the figure, esProc is relatively faster for the numeric field, and the speeds of esProc and Oracle in handling the character field are close.

2.3.  Around 10000 records satisfying the query conditions.



As can be seen from the above figure, esProc runs faster, but the difference is not great.

2.4.  Around 100000 records satisfying the query conditions 


As can be seen from the above figure, esProc runs faster, demonstrating its obvious advantages.

3.   Multi-field Composite Indexing Performance Comparison

3.1.  Less than 10 records satisfying the query conditions 


As can be seen from the above figure, Oracle performance is better.

3.2.  Around 100 records satisfying the query conditions 


As can be seen from the above figure, Oracle performance is relatively better if there are around 100 records satisfying the conditions.

3.3.  Around 10000 records satisfying the query conditions. 


As can be seen from the above figure, esProc performance is obviously superior when the composite indexes of the 3 fields are all the filter criteria and there are 10000 records satisfying the condition.

3.4.  Around 100000 records satisfying the query conditions 


As can be seen from the above figure, esProc performance is better if there are 100000 records satisfying the condition.

4.   Findings on Performance Comparison

1.  Indexing in Oracle is several times faster than that in esProc. Comparatively, esProc more fit for the BI data computing. In most BI scenarios, relatively few data changes, and indexing can be regarded as a one-off job. A bit slowdown in speed is also acceptable.
2.  After indexing, in case a small number (<10000) of records satisfy the query conditions, Oracle is often superior to esProc; In case the number of records is at a medium level (>10000,<100000), their performance is close; In case a great number of records (>100000) return, esProc demonstrates an obvious performance advantage.

July 17, 2014

Process Text Files with esProc

It is quite convenient to process data from text files with esProc, which provides many functions for processing text files: import various text files; process big text files; visit text files of hdfs; as well as general operations, such as, file moving, deleting and checking whether a file exists. The following will illustrate these functions through examples.

File function and import function can be used if data in text files need to be read in. For example, the following text file, empolyee.txt, stores employee information:

EID   NAME SURNAME GENDER STATE BIRTHDAY   HIREDATE    DEPT SALARY
1   Rebecca Moore F California 1974-11-20 2005-03-11  R&D 7000
2 Ashley Wilson F New York 1980-07-19 2008-03-16  Finance   11000
3   Rachel Johnson F New Mexico 1970-12-17 2010-12-01 Sales 9000
4   Emily Smith F Texas 1985-03-07 2006-08-15 HR 7000
5   Ashley Smith F Texas 1975-05-13 2004-07-30 R&D 16000
6 Matthew Johnson M California 1984-07-07 2005-07-07 Sales   11000
7 Alexis Smith F Illinois   1972-08-16 2002-08-16 Sales 9000
8 Megan Wilson F California 1979-04-19 1984-04-19 Marketing 11000
9 Victoria Davis F Texas 1983-12-07 2009-12-07  HR 3000


It is convenient to import the file code:


In the above figure, cell A1 uses file function to open a designated file and cell A2 uses the file’s import function to import the file’s content. We can see in the figure that esProc's integrated development environment provides debugging functions, like single step, break point and so on. The imported file content can be displayed visually in the window on the right, making  debugging program and the observation of result more convenient. 

As the other functions provided by esProc, import function contains options and parameters. @t in import@t() in the above figure shows option t is used, and by looking up in the function reference, @t means that the data file’s first row is the field name. It is the parameter that the parentheses contains. The blank in parentheses in the above figure shows that all parameters are default.

Let’s look at how to code with those functions in esProc program:

1. Separator 
In the above figure, because tab is used as a separator in text file empolyee.txt, no input parameter is used in the parentheses of import function. The default character tab is used directly instead. If other separators need to be imported, they can be defined with parameters. An example: comma is used to separate fields in csv file supported by excel, so the code for import function is import@t(;”,”).   

2. Import a portion of the fields
When a portion of the fields, instead of all of them, is to be imported, parameters can be used to define them, e.g., import@t(EID,NAME,BIRTHDAY).
3. Data type designated to be imported
esProc automatically judges the type of imported fields. For instance, BIRTHDAY is in consistent with the format of date and time, so esProc will automatically import the datetime type. If the field is required to be imported as a string, the code may be like this: import@t(EID,NAME,BIRTHDAY:string). The data types compatible with esProc include bool, int, long, float, decimal, number, string, date, time and datetime. 

4. Importing data piecewise 
It is required to piecewise import data when the data size of a text file is larger than the memory. import function of esProc provides two input parameters to read data from designated starting bytes to end bytes. For instance, import@t(EID,NAME,BIRTHDAY;,1000:2000) reads data from the 1000th bytes to the 2000th bytes. Usually, text file data gets its meaning when read by rows. import function automatically adjusts the position of starting point and end point to ensure a full data row. In order to ensure that the imported data consists exclusively of full rows, when the starting point happens to be in the middle of a data row, the import function will skip the row and begin with the next row; but if it is the end point that happens to be in the same position, the reading will continue until the end of the row. Reading data piecewise rules that the starting character will not be read while the end character will. So by making each end point to be the starting one of the next reading, all data rows will be imported without omission and repetition. Thus, in the preceding example, import@t(EID,NAME,BIRTHDAY;,2000:3000) is the appropriate code for the next reading. 

It's worth pointing out that because the data should not be segmented according to rows, the current row number can only be determined after the traversal of all preceding rows. Otherwise, the expected excellent performance of importing data piecewise won’t be achieved.  

esProc also segments data automatically. For example, if a file is divided into three parts and the second one is to be read, the code is like this: import@tz(EID,NAME,BIRTHDAY:string;,2:3). With the option @z, meanings of the two parameters become the total number of parts and the serial number of current part waiting to be read. In this method, it is convenient for multiple parallel tasks to piecewise process a big file. 

5. The process and export of big files
After the file is imported to the memory, it can be exported or re-imported into the database. For example, after empolyee.txt is imported, NAME and SURNAME could be merged into FULLNAME to be exported to employee_result.txt. If data size of the file is big enough, text cursor can be used to process it. esProc cursor can take out file data one by one and make computations. In addition, it provides quite a lot of cursor functions, making code more concise.

A complete esProc code is as follows:


In the above figure, cell A2 defines a cursor of file; cell A3 calculates full name; cell A4 exports the result of A3 to the result file. 

6. Access hdfs files.
It’s easy to access hdfs files with esProc. Your just need to change file functions to hdfsfile functions, e.g., hdfsfile("hdfs://192.168.0.204:9000/user/root/student.txt").For a compressed file, change the extension, e.g., hdfsfile("hdfs://192.168.0.204:9000/user/root/student.gzip"). Thus hdfs compresses or decompresses according to the extensions of files when reading and writing hdfs files. Compress methods includes gzip, LZO and LZ4 etc.

7. General file operations
We can make general file operations with esProc. For example, functions, such as move, exists, date, size, property and file name, respectively have the function of move (delete), checking whether a file exists, modifying date, size checking, reading property of configuration files and splitting full path, as well as obtaining file names and extensions.

July 16, 2014

Comparison Between esProc’s Sequence Table Object and R’s Data Frame (II)

Comparison Between esProc’s Sequence Table Object and R’s Data Frame (I)

Actual case

In this part we use a real case for comprehensive comparison o fdata frame and sequence table.
Computation target: according to daily transactions, selecting stocks from blue-chip stocks whose prices rises in 5 days in a row.

Ideas: Importing data; filtering out previous month's data; grouped them according to the ticker; sort the data by dates; compute the growth amount for closing price over previous day; compute the number of days for continuous positive growth; filtering out the stocks which rise in 5 or more days in a row.

Sequence Table Solution:


Data frame Solution:

01     library(gdata) #use excel function library
02     A1<- read.xls("e:\\data\\all.xlsx") #import data
03     A2<-subset(A1,as.POSIXlt(Date)>=as.POSIXlt('2012-06-01') &as.POSIXlt(Date)<=as.POSIXlt('2012-06-30')) #filter by date
04     A3 <- split(A2,A2$Code) #group by Code
05     A8<-list()
06     for(i in 1:length(A3)){
07       A3[[i]][order(as.numeric(A3[[i]]$Date)),] #sort by Date in each group
08       A3[[i]]$INC<-with(A3[[i]], Close-c(0,Close[- length (Close)])) #add a column, increased price
09       if(nrow(A3[[i]])>0){  #add a column, continuous increased days
10         A3[[i]]$CID[[1]]<-1
11         for(j in 2:nrow(A3[[i]])){
12           if(A3[[i]]$INC[[j]]>0 ){
13             A3[[i]]$CID[[j]]<-A3[[i]]$CID[[j-1]]+1
14           }else{
15             A3[[i]]$CID[[j]]<-0
16           }
17         }   
18       }
19       if(max(A3[[i]]$CID)>=5){  #stock max CID is bigger than 5
20         A8[[length(A8)+1]]<-A3[[i]]
21       }
22     }
23     A9<-lapply(A8,function(x) x$Code[[1]]) #finally,stock code

Comparison
1. Data frame function is not rich enough, and is lack of professionalism. We need to use nested loops to meet the requirement in this case. It’s of low computational efficiency. Sequence table has rich and diverse functions. Without the use of loop statement we can achieve the same purpose. The code is shorter and simpler, and the performance is higher.

2. When programming for data frame, the code is obscure and hard to write. With sequence table, the code is clear and easy to understand. The cost of learning is lower.

3. When large amount of data is involved in this scenario, the memory consumption will be huge. Sequence table is computationby reference, which consumes less memory. Data frame is computation by value pass. The memory consumption is several times more than sequence table. It easy to result into memory overflow in this scenario.

4.To import Excel data into data frame, R requires third-party software packages. However they seem to have difficulty working together. Data import needs ten minutes to complete. With sequence table this only needs tens of seconds.

Test Performance

Test 1: Generating 10 million records in memory, each consists of three fields. All values ​​are random numbers. Records are filtered, and each field is summed.
         
Sequence table:


Data frame

> library(timeDate)
> start=Sys.timeDate()
> col1=rnorm(n=10000000,mean=20000,sd=10000)
> col2=rnorm(n=10000000,mean=40000,sd=10000)
> col3=rnorm(n=10000000,mean=80000,sd=10000)
> data1=data.frame(col1,col2,col3)
> data2=subset(data1,col1>90)
> result=colSums(data2)
> print(result)
        col1         col2         col3
200844165732 390691612886 781453730448
> end=Sys.timeDate()
> print(end-start)
Time difference of 1.533333 mins

Comparison: sequence table needs 50.534 seconds, while data frame needs 91.999 seconds. The gap is obvious.

Test 2: Retrieving 1.2G txt file. Do filtering and sum on two fields

Sequence Table:


Data frame:

>library(timeDate)
> start=Sys.timeDate()
> data<-read.table("d:/T21.txt",sep = "\t")
> data1=subset(data,V1>90,select=c(V9,V11))
> result=colSums(data1)
> print(result)
         V9         V11
 5942982895          59484930179
> end=Sys.timeDate()
> print(end-start)
Time difference of 1.134722 hours

Comparison: sequence table takes 87.122 seconds, while data frame takes 1.1347 hours. The performance difference is tens of times. The reason for this is mainly due to the extremely low speed for file reading.

From the above comparison, we can see that sequence table are better than data frame in terms of rich features, easy syntax, memory consumption, development effort, library function performance and coding performance, etc.. Of course, data frame is not the full strength of R language. R has a powerful vector matrix and the associated mass functions, which make it more professional than esProc in scientific and engineering computation.