esProc, A Script Language for Data Analytics with Parallel Mechanism: esProc Assists Report Development – Crosstab In Which Row & Column Headers Are Intervals

It is difficult to deal with some unconventional statistical tasks using the reporting tool, like Jasper and BIRT, alone or SQL. One example is the crosstab in which both the row headers and column headers are intervals, and whose measurement comes from anther database table. With powerful structured data computing engine and being integration-friendly, esProc can conveniently handle the case. I’ll explain the process of realizing dynamic data source through the following example.

account_no is the primary key of table account_detail, which has a one-to-many relationship with both table Paysoft_result andNAEDO through foreign key custno and foreign key customer_code. Report design requires that, according to the external parameters, empirica_score field of table account_detail be divided into segments that are used as the row headers and thatmfin_score field also be divided into segments that are used as the column headers. The computation of measurement is to divide the number of records of table Paysoft_result to which the account_no, which is the intersection-point where row headers and column headers meet, corresponds by that of table mfin_score to which it corresponds.

The following figure shows relations between the database tables and between certain fields and the report:

esProc will perform the data preparation using the following code:

A1=myDB1.query("select * from account_detail order by empirica_score,mfin_score")

This line of code retrieves data from table account_detail. myDB1 is the data source name that points to the database. queryfunction executes the SQL query statement. A1’s result is as follows:

A2=myDB1.query("select * from paysoft_result")

B2=myDB1.query("select * from NAEDO")

Then retrieve data from paysoft_result and NAEDO respectively in A2 and B2 likewise. Results are shown separately as follows:

A3=rowList.array()

B3=colList.array()

These two lines of code convert the parameters passed from the report into esProc sequences. Parameter rowList represents the row headers, like “560,575,585,595,605,615,625,635,645,654,665”, which includes ten consecutive intervals; parametercolList represents the column headers, like “39, 66, 91, 116, 137, 155”, which includes five consecutive intervals. array function is used to convert a string separated by commas into a sequence. The converting results are as follows:

A4=A1.select(empirica_score>=A3(1) && mfin_score>=B3(1))

In this example certain data in the source table exceed the range of the specified interval. For instance, customer “No501”’sempirica_sore is 540, which is smaller than the lower limit of the interval – 560. This line of code will filter out the data that are smaller than the lower limit in order to increase performance and simplify the expression.

select function executes data query or data filtering. empirica_score is a field of A1, A3(1) represents the first member of A3, i.e. 560, the lower limit of the interval. The logical operator “&&” means “AND”. A4’s result is as follows:

A5=A4.group(A3.pselect(empirica_score<~[1]):row,

B3.pselect(mfin_score<~[1]):col;

~:accounts,

A2.select(accounts.(account_no).pselect(~==custno)):p,

B2.select(accounts.(account_no).pselect(~==customer_code)):n

)

This line of code groups table account_detail in A4 according to the intervals in A3 (rowList) and those in B3(colList) and find out each group’s corresponding records in A2(paysoft_result) and B2(NAEDO).

group function is used to group data according to multiple fields (or grouping criteria). The syntax is A.group(field1,field2…). It is also used to compute subtotals or perform subsequent computations based on each group of data. The syntax isA.group(field1,field2… ; subtotal1,subtotal2…) . Fields of the grouped data can be renamed with “:new name”. The result of the above grouping operation contains five fields, which are row, col, accounts, p, n respectively. A5’s result is as follows:

To compute the grouping criterion row: Group A4’s empirica_score field according to A3’s intervals using the codeA3.pselect(empirica_score<~[1]). pselect function finds the sequence numbers of the eligible members in A3. “~” represents the current member of A3, ~[-1] represents its previous member and ~[1] represents the next one. The current interval is(empirica_score>=~ && empirica_score<~[1]). As all values of account_no field are bigger than the lower limit of the interval, the expression can be simplified as empirica_score<~[1]. According to “560,575,585,595,605,615,625,635,645,654,665”, A1 can be divided into 10 intervals - 560-574,575-584,585-594,595-604,605-614,615-624,625-634,635-644,645-654,655-664 – whose sequence numbers are from 1 to 10 in order.

To compute the grouping criterion col: Similarly, group A4’s mfin_score field according to B3’s intervals with the codeB3.pselect(mfin_score<~[1]). According to “39,66,91,116,137,155”, A1 can be divided into 5 intervals - 39-65,66-90,91-115,116-136,137-154 – whose sequence numbers are from 1 to 5 respectively.

Summary field account directly gets each group of data. “~” represents members of the current group. Click on accountscolumn highlighted in blue and the detail data will be displayed. For example, as the following figure shows, “row=1,col=1” corresponds the two intervals “560-574 and 39-65”; “row=2,col=5” corresponds the two intervals “575-584 and 137-154”:

To compute the summary field p: Find out records corresponding to accounts from A2 using the code A2.select(accounts.(account_no).pselect(~==custno)). select function accesses A2’s data and selects the eligible data by the filtering criterion. In the case of “row=1,col=1” and “row=2,col=5”, the records corresponding to column p are shown separately as follows (the relationship between accounts and A2 is one-to-many):

To compute summary field n: Similarly, find out records corresponding to accounts from B2 using the code B2.select(accounts.
(account_no).pselect(~==customer_code)). In the case of “row=1,col=1” and “row=2,col=5”, the records corresponding to column n are shown separately as follows:

A6=A5.derive(p.count():pCount,n.count():nCount)

This line of code appends to A5 the new columns pCount and nCount for computing the number of records of p and n in each group. The result is as follows:

A7=A6.derive(pCount/nCount:rate)

This line of code appends column rate to A6. The arithmetic is dividing pCount by nCount. The result is as follows:

A8=A7.run(string(A3(row))+"-"+string(A3(row+1)-1):row,string(B3(col))+"-"+string(B3(col+1)-1):col)

This line of code converts the sequence numbers in row field and col field into the corresponding intervals. run function performs the same computation on every member of A6 (a member is a row, where, for instance, row=1 and col=1). stringfunction converts a number into a string. The expression “A3()” gets A3’s members by their sequence numbers, A3(1), for instance, is 560. A8’s result is as follows:

The result of A8 contains the three fields the report requires. Then we only need to combine them into a new two-dimensional table and return it to the reporting tool through JDBC interface. This job will be done in A9 with the code result A8.new(row,col,rate).

new function retrieves the specified columns (or computed columns) from A8 to create a two-dimensional table. The result of executing A8.new(row,col,rate) is as follows:

Note: esProc provides the operator parentheses to compute the expressions separated by commas in order and return the last expression’s value. With the parentheses, the code from A4 to A7 can be encapsulated into a single line:

A4=A1.select(empirica_score>=A3(1) && mfin_score>=B3(1)).group(

A3.pselect(empirica_score<~[1]):row,

B3.pselect(mfin_score<~[1]):col;

(accounts=~,A2.count(accounts.(account_no).pselect(~==custno)) /

B2.count(accounts.(account_no).pselect(~==customer_code))):rate

)

The result is as follows:

A9 is the data set the reporting tool needs. Now let’s design a simple crosstab with JasperReport in the following template:

Three points should be noted: Don’t place the crosstab in the detail band; configure the property of Data Pre Sorted as true; define parameters corresponding to those in the esProc script in the report, such as pRowLlist and pColList. A preview of the report is as follows:

The reporting tool calls the esProc script in the same way as that in which it calls the stored procedure. Save the esProc script as, say unregul.dfx, to be called by unregul $P{pRowList},$P{pColList} in JasperReport’s SQL designer. See related documents for detailed integration solution.

menu

August 11, 2015

esProc Assists Report Development – Crosstab In Which Row & Column Headers Are Intervals

No comments:

Post a Comment