In grouping and summarizing operation, if the
names and numbers of the groups have been given, then we call the operation as
fixed grouping and summarizing. The grouping criterion of this type of
operation comes from outside of the data set, such as grouping by the client
list of parameters list, or by conditions list. This type of operation involves
some problems like if the grouping criterion is beyond the boundary of the data
set, if extra groups are needed and if there is overlap among the data. To
solve these problems requires some effort. In the following, we'll look at
several methods for realizing grouping and summarizing in R language.
Case
1: Group criterion is within the data set
Data frame sales contains records of sales orders, in which column CLIENT
contains clients list and column AMOUNT contains the orderamounts. Please group
sales by “potential clients list” and
then summate columns AMOUNT in each group. The potential clients list is [ARO,BON,CHO],
which happens to be a subset of column CLIENT.
Note: Source of sales may be a database or
a file, such asorders<-read.table("sales.txt",sep="\t",
header=TRUE). The first rows of data are as follows:
Code:
Computed result:
Code
interpretation:
1.factor function generates a grouping
criterion (which is called as factor in R language), and aggregate function groups and summarizes according to the grouping
criterion. The structure of the code is quite clear.
2. Note that the grouping
criterion is neither a vector quantity nor an array, so it cannot be written as
byFac<-
c("ARO","BON","CHO"); it cannot be used
directly too unless it is converted into the list type. They are not easy to understand for beginners and
require special attention.
3. If column CLIENT is set as the
grouping criterion (that is, the unfixed grouping), only a line of code is
enough:result<-aggregate(sales$AMOUNT,list(sales$CLIENT),sum).
Summary:
aggregate function can
easily fulfill this task.
Case 2: Grouping criterion
is beyond the boundary of the data set
It is the exceptional case that the grouping criterion is within the
column data. In fact, since the grouping criterion comes from outside of the
data set (like the external parameters), its members may not in the column
data. In this case, we try to solve such a problem.
Supposed that the value of “potential clients list” is [ARO,BON,CHO,ZTOZ],
please divide data into four groups according to “potential clients list” and
summate column AMOUNT in each group. Note that client ZTOZ is not in column
CLIENT.
Code similar to that of
case 1:
result<-aggregate(sales$AMOUNT,list(byFac),sum)
The computed result of the above code:
It can be seen that there are only three groups of data instead of
the four groups as required in the computed result. Apparently, the above code
fails to fulfill the task and needs some improvement.
The improved code:
Computed
result is:
Code
interpretation:
1. The improved code is in line
with the business logic, which makes all the required four groups be displayed
in the result.
2. tapply is used to group and summarize. It is more
universal than aggregate. But it
confuses beginners most of the time as it lacks a sense of immediacy compared
with aggregate.
3. That the summating value of
ZTOZ is NA means that ZTOZ is not in column CLIENT. If the summating value of
ZTOZ is zero, ZTOZ is in column CLIENT and order amount is zero too.
4. In this case, there are just
four groups in the grouping and summarizing result. Extra clients shouldn't be
here. These extra clients can be called "extra group" and the code for
computing its summating value cannot be produced only through some simple modification
on the basis of the current code. Instead, a new function is needed:
filtered<-sales[!is.element(sales$CLIENT,byFac),]
It is not a piece of
complicated code, but the way of coding is apparently different from the
previous one.
Summary:
tapply can fulfill this task easily.
Case 3: No overlap among the
grouping criteria
It is also one of the fixed groupings that set
some conditions as the grouping criteria. For example, divide the order amount
into four intervals by 1000, 2000 and 4000. Each interval is a group of orders.
Then summate the order amount of each group.
Code:
result<-tapply(sales$AMOUNT,
list(byFac),function(x) sum(x))
Computed result:
Code interpretation:
cut function divide the data frame into four intervals, then tapply
function groups the data frame by these intervals and summarize the result of
each group.
Summary:
The cooperation of cut and tapply can easily perform the simplest conditional grouping.
The simplest conditional grouping not
involves overlapping among the conditions. But in reality, it is common that
conditions are overlapping. For example, group the order amount according to
the following conditions:
1000
~ 4000:standard orders r14
<2000:non-important
orders r2
>3000:important
orders r3
The group of standard orders overlaps
with the other two groups. In this circumstance, the problem of whether we
should re-compute the overlapping data arises. Let's first deal with situations
which require the re-computing.
Code:
r14<-subset(sales,AMOUNT>=1000
& AMOUNT<=4000 )
r2<-subset(sales,AMOUNT<2000)
r3<-subset(sales,AMOUNT>3000 )
grouped<-list(r14=r14,r2=r2,r3=r3)
result<-lapply(grouped,FUN=function(x)
sum(x$AMOUNT))
Computed result:
Note: r2 and r3 contain some of the data of
r14.
Code
interpretation:
1. The above code can complete
this task but is too complicated. If there are more conditions, the code will
become lengthy.
2. A new function lapply is used in the code. So far we
have used a lot of functions, including factor,
aggregate, list, tapply, cut, subset,
lapply, etc, in order to perform the
operation of fixed grouping. We have to use different functions and ways of
coding to perform the operation of conditional grouping just because overlapping
conditions are involved or not. To be honest, it is difficult to master all these
methods.
3. The computed result of the
above code is list, while some results of the code in the previous cases are
data.frame, and some others are arrays. This
inconsistency will bring trouble in practice.
Summary:
The method can complete the task, but the
code is complicated and effort is needed to learn to use a lot of functions.
In the previous case, we solved the problem when re-computing the overlapping
conditions is needed. But sometimes, we need the non-repeating computed result,
that is, the grouped data shouldn't appear again. In this case, the operation
will be like this: r2 shouldn't contain data of r14, and r3 shouldn't contain
data of r2 and r14.
Code:
r14<-subset(sales,AMOUNT>=1000 & AMOUNT<=4000 )
r2<-subset(sales,AMOUNT<2000 & !(AMOUNT>=1000 & AMOUNT<=4000))
r3<-subset(sales,AMOUNT>3000 & !((AMOUNT>=1000 & AMOUNT<=4000)) & !(AMOUNT<2000))
grouped<-list(r14=r14,r2=r2,r3=r3)
result<-lapply(grouped,FUN=function(x) sum(x$AMOUNT))
Computed result:
Note that when it is no need to re-compute
the overlapping data, the value of r2 and r3 is smaller than the result
computed previously.
Code interpretation:
More logical judgments are used in the
above code, which further complicates the code. It thus can be imagined that
the amount of code will be huge if groups are many and conditions are
complicated.
Summary:
This method can complete the task but the
code is complicated.
The third-party solution
We
can also fulfil the task in this case using Python, esProc and Perl. All of
them can perform the operation of fixed grouping and summarizing as well as the
computation of structured data. The following will briefly introduce the
solution in esProc.
byFac=["ARO","BON","CHO"]
grouped.new(byFac(#), ~.sum(AMOUNT))
Computed result:
Case 2:
The code is omitted because it is
completely the same as that in case 1.
Computed result:
If we want to summate the extra group, a
little modification will be enough:
byFac=["ARO","BON","CHO","ZTOZ"]
grouped=sales.align@a@n(byFac,CLIENT)
grouped.new((byFac|"redundant")(#), ~.sum(AMOUNT))
Modified part is marked in red. @n represents that an extra group is
added to the result set. This way of coding is easy to master than that of R
language.
Computed result:
Case 3:
For simple conditional grouping, esProc
will only need to change the align
function to enum function. The other
part will remain unchanged.
byFac=["?<=1000"
,"?>1000 && ?<=2000","?>2000 &&
?<=4000","?>4000"]
grouped.new(byFac(#),~.sum(AMOUNT))
Computed result:
esProc:
@r option is added to the previous code when re-computing the
overlapping data is needed.
byFac=["?>=1000
&& ?<=4000","?<2000" ,"?>3000"]
grouped=sales.enum@r(byFac,AMOUNT)
grouped.new(byFac(#),~.sum(AMOUNT))
Computed result:
esProc:
Remove the @r option when it is no need to compute the overlapping data. This is
the same as the operation of simple condition grouping.
byFac=["?>=1000
&& ?<=4000","?<2000" ,"?>3000"]
grouped=sales.enum(byFac,AMOUNT)
grouped.new(byFac(#),~.sum(AMOUNT))
Computed result:
It can be seen that esProc needs only two
functions: align and enum, to realize all types of fixed
grouping and summarizing. Its code is consistent and simple.
No comments:
Post a Comment