esProc, A Script Language for Data Analytics with Parallel Mechanism: data analysis

Showing posts with label data analysis. Show all posts

October 13, 2014

Programming Languages for Desktop Data Analysis

For a programming language for desktop data analysis, the most important is that it must be user-friendly and have great computing power. We can judge whether a language is suitable for analyzing data on desktop from six aspects: application environment, file processing, text and string processing, structured data processing, predictive modeling algorithms and other less important indicators.

Application environment

Most users who make desktop data analysis are not professional programmers. They are accustomed to jobs under Windows, lack the skill of environment configuration which a professional should have. So the application environment of the programming language for desktop data analysis should be simple, Windows compatible and easy to install and configure.

In this respect, both esProc and R language have done well. They have simple application environment which can be used directly after installation. Python itself can manage without problem, but Pandas – frequently used to improve the computing capacity– has complex installation under Windows and is very strict about versions, though it is easily installed under Linux. Produced by Microsoft, SSAS gets along quite well with Windows except thatthe installation and configuration is a little complicated.

File processing

TXT and XLS are file formats most likely to be generated in routine work. And the degree of support for them shows the degree of ease of use of an interpreted language.

Generally, all analysis languages support TXT directly. In the case of XLS, the situation is complicated, especially the installation of third party modules and version compatibility. Since both Excel and SSAS are Microsoft products, SSAS can support XLS seamlessly. esProc, SPSS and SAS don’t need third party packages too, they can access XLS directly. Pandas is special, because it can directly support XLS, but, in itself, it is the third party library function; and its support for certain versions of XLS is limited. R language needs third party library function and Perl operating environment, and requires that versions of the three be matched; this makes installation complicated.

SSAS is easiest to use in read/write. Pandas and R language provide abundant parameters.

The ability to process big files should be taken into account, such as processing while the files are being imported. esProc is the best for it, with the most concise code.

Text and string processing

Sometimes, the disordered, semi-structured or non-structured raw data needs to be preprocessed to become the easy-to-use structured data. Thus the text and string processing is another focus in evaluating an interpreted language.

In this aspect, Python is the best and R is satisfactory. esProc comes next and SSAS is the worst.

Structured data processing

In practice, most of the data to be analyzed on desktop is the structured data. Therefore, the most common operation in desktop data analysis is the structured data computing and the corresponding computing power becomes the core competence of programming languages for desktop data analysis.

esProc is the most professional one in this field because it is specially designed for structured data computing. R language is unprofessional, especially in ordered data computing, though dataframe - a new data type – has been created for it. Pandas' dataframe is developed and improved on the basis of R’s, which makes it as able as R language but easier to understand. By contrast, SPSS, SAS and SSAS boast little in performing structured data computing.

Ordered data computing includes operations like link relative ratio, year-on-year comparison, fetching data in a relative interval, rank ordering during grouping data, and getting records in the top or bottom. It often involves relative position and cross-row and –group, and is a typical case of desktop data analysis. With inherent serial numbers, esProc performs the best in ordered data computing. Python and R language perform well, but because their basic element is the vector instead of the record, the code written in them is elusive and more suitable for scientific use.

Predictive modeling algorithms

The predictive modeling is mainly used in scientific field and not common in desktop data analysis. Yet it is still an essential indicator.

R language boasts the latest, the richest and the most mature third party algorithms. SASS is easy to use but inflexible. Python/Pandas has always been trying to catch up with and imitate R language. It achieves an easier to understand syntax but hasn't been fully-formed. SAS and SPSS have established their authority in this regard. By contrast, esProc almost hasn't any ready-made predictive modeling algorithms.

Other less important indicators

Some less important indicators, such as the support for databases and parallel computing and graphing ability, also attract attention in special cases.

SASS works the best with databases, but it is not good at heterogeneous computing between text files, databases and self-defined data. esProc also gets along well with databases and performs satisfactorily in handling situations involving heterogeneous data sources. But Pandas, R Language, SAS and SPSS are not good in dealing with the both.

SASS is an expert at graphing, though it lacks flexibility. R language, esProc and Pandas, however, are flexible and have abundant inherent charts at the same time.

As to parallel computing, esProc has a built-in engine for it, which is easy to configure and develop. R language need third party software to perform parallel computing, resulting in complicated configuration and development.

October 1, 2014

Structured Data Computing: the Focus of Routine Data Analysis

Essentially, the main content of various data analyses in our routine business is the structured data computing. For example:

Compute the link relative ratio and year-on-year comparison of each business branch’s monthly sales during a specified period of time.

Implementation approach: filter the sales data by time range, then group and summarize data by business branch, year and month, and at last, perform cross-row and –group ordered data computing.

Select stocks whose closing price has been increasing uninterruptedly for over 10 days.

Implementation approach: Group daily transaction data by stocks and sort the data of each group by dates, compute the increasing amount of the share price and the number of days during which the share price increases uninterruptedly, and filter away the stocks that have been rising uninterruptedly for over 10 days.

Relate the data of different sources, like contract and payment information, to project payments schedule and find out the overdue projects.

Implementation approach: Perform relational computing between heterogeneous data sources, then group, summarize and filter the data.

It can be seen that these routine problems of data analysis can be split into structured data operations including filtering, grouping, summarizing, sorting, ranking and relational computing.

Of course, we may need to solve data analysis problems of modeling or prediction occasionally. For example, find out goods that are closely related between each other, or predict which stock is supposed to rise, and the like. These operations require quite a lot of mathematical knowledge which ordinary staff is generally not likely to have. They are really important data analysis transaction, but they occupy only a very small part of routine data analysis.

Structured data computing is the focus. There are many tools that can perform it, like R language, Python, SQL and esProc.

R language provides dataframe data type for structured data computing. However, it was originally designed for collecting and analyzing scientific data, especially for performing matrix and vector computations. It is not professional for structured data computing.

In fact, dataframe is a newly-developed function of R language; its strong point is algorithms of modeling and prediction, such as regression analysis, ANOVA analysis, Agreementevaluation, and Bernoulli distribution, etc, which are seldom used in routine data analysis.

Pandas, Python’s third party function library, can perform structured data computing. But it was also designed for collecting and analyzing scientific data instead of structured data computing, so it is not professional too. And similar to R language, the functions of Pandas center on modeling and prediction and are seldom used in routine data analysis.

We can see that, despite lots of tools for performing structured data computing, few can be regarded as truly professional. There are only one professional, SQL, the old brand computer language.

SQL was designed purely for structured data computing. It is professional and widely used.

Yet it also has drawbacks for routine data analysis. The most obvious ones are complicated application environment and being bad at ordered data computing. The installation, configuration, maintenance and management of SQL are very complicated. SQL data set hasn't inherent serial numbers and gets disadvantaged in ordered data computing, for example, the common problems in routine data analysis like link relative ratio, year-on-year comparison, fetching data in a relative interval, performing ranking during data grouping and getting records in the top and bottom, etc. Most of the examples we mentioned at the beginning involve ordered data computing. And though we can solve them with SQL, the operation will be quite difficult.

Similar to SQL, esProc is specially designed for structured data computing.

By comparison, esProc's application environment, installation and configuration are simple. esProc can fetch data from databases, and import structured data directly from Txt, logs and Excel. Moreover, esProc table sequence has inherent serial numbers, enabling it to perform ordered data computing easily. Unfortunately, in esProc, the syntax for external memory computing is different from that for in-memory computing, which requires different code. In this respect, SQL has better consistency in its syntax.

June 16, 2014

Who uses the R programming language and how do they use it?

I think R's users are mostly mathematicians in probability and statistics.

This is also the main advantage of R, that is, R has a library with rich mathematical functions, while other free scripts can't still compete it till now;only charged SPSS and MATLAB include such content; Python’s panda package still has a gap to reach it.

It could be said this is a niche market, after all, most users are unable to understand and do not need those profound knowledge about statistics and mathematics.

Day-to-day data analytics is based on structured data, its difficulty lies in the complexity of process steps, rather than esoteric operations, such as, compute rising stocks within 5 days, find users whose arrears have exceed three months, and so on.

Besides statistical function package, R is lackluster, cryptic in syntax, and so bad in performance; it is more likely that R will be replaced by Python.

For structured data computing, R has slightly superior set-style syntax and data frame object to Python, allows you to write less loops and shorter code (but obscure). But more advanced script is none but esProc. The data object esProc provides is far more powerful than data frame, and a set-style syntax is more nature, easy to understand; the performance to traverse file data is ten times higher than R.

May 22, 2014

A Data Analysis Language/Script with Parallelism Feature

esProc is a data analysis language, featuring the easy-to-code, strong interactivity, dedicated debugging, and agile and arbitrary syntax. In particular, esProc is also capable of performing the parallel computation and fit for the big data analysis.

For example,a commercial Website generates several thousands of access logs daily. To analyze the user behaviors based on these logs, one of the analysis objectives is to compute how long each user spends the time to browse the products of each category in a specified time period. This analysis objective involves computing over several TB of data. The typical way of implementing on a single machine is unbearable because it takes several hours or days to complete. By comparison, the parallelism of esProc enables users to achieve it in 10-20 minute.

esProc enables the below parallel computing procedure: The summary machine receives the external parameters, decompose one great job into N small jobs, and distribute the N small jobs to M node machines, which is greater than N, in proper order. Each node machine is responsible for analyzing the data of some users, for example, for the users whose initials are A, analyze the time they spend to browse the products of each category; On completing the computation, each node machine will return the computed result to the summary machine, and proceed to perform next small job, for example, analyzing users whose initials are L. Once all small jobs are completed, the summary machine merges and outputs the result in the forms such as direct displaying on the IDE interface, displaying on the console, or returning via JDBC.

esProc is the data analysis language with parallelism feature.

The major advantage is shown below:

Easy-to-code. esProc script is written on a grid, i.e. cellset. So, the computational logics can be laid out in a 2D space conveniently. The business algorithm can thus be interpreted into the computer language more easily. The grid-style presentation gives an intuitive view of the code indentation and the work scope, and streamlines the cell reference and reuse. Each cell represents one computing unit or step. With the natural cell name, a cell can make reference to another cell mutually, not requiring users to define any variables. By clicking cells, users can monitor the computed result intuitively, needless to search in a long list of variables.

Strong interactivity. esProc advocates the step-by-step computation - decompose a complex goal into several simple steps in a grid, and accomplish each objective of every simple step to ultimately achieve the final goal. By doing this, a complex computing goal can be simplified and solved with much higher development efficiency. esProc is more powerful in doing the step-by-step computation and interaction. esProc users can determine which is the smartest algorithm for the next step based on the insight to the current cell data; script for the next step by referencing the previous computations, and achieve the final goal through the incremental processing gradually and progressively. An obscure computing goal can be more and more clear and concrete in the interaction step by step.

Much more convenient debugging function.Designed with the "step-by-step" thoughts, a really practical debug function is introduced with esProc, including various functions like the break point, stepping, run to cursor, start, and end. Unlike the fake debugging script as SQL/SP, esProc can perform the debugging straightforwardly, not requiring an intermediate table specific to debugging. The break point can be set in any position without altering the code. Before proceeding to the next step of summarizing, users can even visually check the data to ensure they are grouped as expected. In the procedure of analyzing, 90% time is spent on debugging. The purpose-built debugging function can reduce error and analysis cycle dramatically.

Implementing the analysis object arbitrarily. esProc supports the true data type of set. A member of a set can be the data of any simple data types, records, and/or other sets. The set can be used to simplify the structured data computation, so that users will feel easier to perform the arbitrary computation from the business prospective. esProc supports the ordered set, which means that users can access the set member and perform the sequence-number-related computations arbitrarily, such as ranking, sorting, link relative ratio, and contemporary comparison. With the ideal "set of set" mechanism to represent the grouping, esProc can be used to solve various equal, align, and enum grouping problems easily, like computing the relative positions in multi-level groupings, and grouping and summarizing by a specified set.

Support for big data analysis. esProc users can compute over several TB of data from databases or files easily. With the parallel computing framework, massive data can be distributed to multiple computing nodes. Each node is only required to undertake the computation over quite few data. esProc supports the distributed computing at multiple levels. Each node can either act as the main node for distributing and summarizing, or the sub-node for undertaking the detailed computing. The node machine can be the high-grade configuration server or inexpensive PC of the Windows client or Linux server.

R language, Python, and Perl are also the common data analysis languages, and they are far less dedicated than esProc regarding the big data.

R language is the computing tool for scientists. Although R has the extremely rich extension packages and powerful computing and analysis ability, its parallelism is poor. R users have to integrate R with the third party software in the actual parallel application, and the stability and reliability is still doubtful. In addition, though R language has the powerful library functions, its performance is poor in executing the customized codes, and gets even poorer when it comes to data traversal and other computing. The syntax of R language is too obscure and purpose-built to be understood by the average users.

Python and Perl both are powerful in analyzing the character string. But they only offers the imperfect support for the parallel computing on a single machine, and relies on the third party software to implement the distributed parallel computing, for example, calling by the Streaming interface of Hadoop. Regarding the notoriously low performance of Hadoop, the performance would be even lower if integrating Hadoop with the Python and Perl. According to the publications, the performance of Python and Perl in Hadoop is far worse than that of Java. In addition, both Python and Perl lack the object type of structured two-dimensional data. So, people who develop such applications in Python and Perl would only find that the efficiency is comparatively much lower since the commercial data is structured and massive in most cases.

To conclude, let’s review the example in the beginning of this article. The core code of the node machine is as follows:

April 14, 2014

A More Convenient Data Computing Script than SQL

Database plays an irreplaceable role in the modern economy and is widely used in the business computing areas like Enterprise Resources Planning (ERP), Customer Relation Management (CRM), Supply Chain Management (SCM), and the Decision Support System (DSS).

Computation of structured data in the database mainly relies on SQL (Structured Query Language). SQL is the powerful, simple-to-use, and widely-applied database computing script. However, it has some native drawbacks: non-stepwise computation, incomplete set-lization, and no object reference available. Although almost all vendors have introduced and launched some non-compatible solution, such as various stored procedure like PL-SQL, T-SQL. These improved alternatives cannot remedy the native SQL drawbacks.

esProc solves these drawbacks with more powerful computational capability, much lower technical requirement, and broader scope of application. It is a more convenient database computing scripts.

I. Step-by-step Computation

Case Description

A multinational retail enterprise needs to collect statistics on the newly opened retail store, including: How many new retail stores will open in this year? Of which how many companies have the sales over 1 million dollars? Among these companies with over-1-million sales, how many companies are abased overseas?

This question is progressive. The three questions are mutually related, the next question can be regarded as the further exploring on the current question, fit for step-by-step computation.

The original data is from the database of stores table with the main fields:storeCode, storeName, openedTime, profit, and nation. Let's check the SQL solution first.

SQL Solution

To solve such problem with SQL, you will need to write 3 SQL statements as given below.

l SELECT COUNT(*) FROM stores WHERE to_char (openedTime, 'yyyy') = to_char (sysdate,'yyyy');

l SELECT COUNT(*) FROM stores WHERE to_char (openedTime, 'yyyy') = to_char (sysdate,'yyyy') and profit>1000000;

l SELECT COUNT(*) FROM stores WHERE to_char (openedTime, 'yyyy') = to_char (sysdate,'yyyy') and profit>1000000 and nation<>’local’;

SQL1:Get the result of question 1.

SQL2:Solve the problem 2.Because the step-by-step computation is impossible (that is, the results of previous computation cannot be utilized), you can only solve and take it as an individual problem.

SQL3: Solve the problem 3,and you are not allowed to compute in steps either.

esProc Solution

A1 cell: Get the records requested in problem 1.

A2 cell: Step-by-step computation. Operate on the basis of cell A1, and get the record meeting the conditions of problem 2.

A3 cell: Proceed with the step-by-step computation, and get the records requested in the problem 3.

B1, B2, and B3 cell: It is still the step-by-step computation. Count the corresponding records.

Comparison

For the SQL, there are 3 associations for you to compute in steps, and explore progressively. However, because step-by-step computation is hard to implement with SQL, this problem has to be divided into 3 individual problems.

esProc is to compute in steps following the natural habit of thinking: Decompose the general objective into several simple objective; Solve every small objective step by step; and ultimately complete the final objective.

In case that you proceed with the computation on the basis of the original 3problems, for example, seek "proportion of problem 3 taken in the problem 2", or"on" problem 3, group by country". As for esProc users, they can simply write ”=A3/A2”, and ”A3.group(nation)”. In each step, there is a brief and clear expression of highly readable, without any requirements on a strong technical background. By comparison, SQL requires redesigning the statement. The redesigned statement will undoubtedly become more and more complex and longer. Such job can only be left to those who have the advanced technical ability in SQL.

esProc can decompose the complex problem into simple computation procedure based on the descriptions from the business perceptive. This is just the advantage of the step-by-step computation. By comparison, SQL does not allow for computation by step or problem decomposition, and thus it is against the scientific methodology, and not fit for the complex computation.

Complete Set-lization

Case Description

A certain advertisement agency needs to compute the clients whose annual sales values are among the top 10.

The data are from the sales table, which records the annual sales value of each client with the fields like customer, time, and amount.

SQL solution

SELECT customer

FROM (

SELECT customer

FROM (

SELECT customer,RANK() OVER(PARTITION BY time ORDER BY amount DESC) rankorder

FROM sales )

WHERE rankorder<=10)

GROUP BY customer

HAVING COUNT(*)=(SELECT COUNT(DISTINCT time) FROM sales)

Such Problem requires ranking the sets of a set, that is, group by “time” and then rank by “customer” in the group. Since the popular SQL-92 syntax is still hard to represent this, the SQL-2003 standard, which is gradually supported by several vendors, will be used to solve this problem barely.

Just a tip to compute the customer intersections in the last step, the count of years equals to the count of clients.

esProc Solution

A1: Group the original dataset by year so that A1 will become a set of sets.

B1: Get the serial number of records whose sales values are among the top 10 of each group. The rank() is used to rank in every group, and pselect() can be used to retrieve the serial number on conditions. ~ is used to represent every member in the set. B1 is the “set of set”.

A2: Retrieve the record from A1 according to the serial number stored in B2, and get the customer field of the record.

A3: Compute the intersection of sets.

Comparison

The SQL set-lization is incomplete and can only be used to represent the simple result set. Developers cannot use SQL to represent the concept of “set of set”. Only the queries of 3-level-nested-loops are available to barely perform the similar computations. In addition, SQL cannot be used to perform the intersection operation easily that developers with advanced techniques can only resort to the unreadable statements to perform the similar operations, such as “count of years equal to the count of clients”. It equals to compute the intersection of client sets.

The set is the base of massive data. esProc can achieve set-lization completely, represent the set, member, and other related generic or object reference conveniently, and perform the set operations easily, such as intersection, complement, and union.

When analyzing the set-related data, esProc can greatly reduce the computation complexity. By taking the advantage of set, esProc can solve many problems agilely and easily that are hard to solve with SQL.

III Ordered Set

Case Description

Suppose that a telecommunication equipment manufacturer needs to compute the monthly link relative ratio of sales value (i.e. the increase percent of sales value of each month compared with that of the previous month). The sales data is stored in the sales table with the main fields including salesMonth, and salesAmount.

SQL solution

select salesAmount, salesMonth,

(case when

prev_price !=0 then ((salesAmount)/prev_price)-1

else 0

end) compValue

from (select salesMonth, salesAmount,

lag(salesAmount,1,0) over(order by salesMonth) prev_price

from sales) t

The popular SQL-92 has not introduced the concept of serial number, which adds many difficulties to the computation. Considering this, the designer of SQL-2003 has partly remedied this drawback. For example, the window function lag() is used to retrieve the next record in this example.

In addition, in the above statement, the “case when” statement is used to avoid the error of division by zero on the first record.

esProc Solution

sales.derive(salesAmount / salesAmount [-1]-1: compValue)

The derive() is an esProc function to insert the newly computed column to the existing data. The new column is compValue by name, and the algorithm is “(Sales value of this month/Sales value of previous month)-1”. The “[n]” is used to indicate the relative position, and so [-1] is to represent the data of the previous month.

On the other hand, for the data of the first record, the additional procedure for division by zero is not required in esProc.

Comparison

From the above example, even if using SQL-2003, the solution to such problem is lengthy and complex, while the esProc solution is simple and clear owing to its support for the ordered set.

Moreover, SQL-2003 only provides the extremely limited computation capability. For example, esProc user can simply use the ”{startPosition,endPosition}” to represent the seeking of a range, and simply use ”(-1)” to represent the seeking of the last record. Regarding the similar functionality, it will be much harder for SQL user to implement.

In the practical data analysis, a great many of complex computations are related to the order of data. SQL users are unable to handle such type of computations as easily as esProc users because SQL lacks of the concept of Being Ordered.

IV Object Reference

An insurance enterprise has the below analysis demands: to pick out the annual outstanding employees (Employee of the Year) whose Department Manager has been awarded with the President Honor. The data are distributed in two tables: department table (main fields are deptName, and manager), and employee table (main fields are empName, empHonor, and empDept).

empHonor has three types of values: null value; ”president's award”, PA for short; and ”employee of the year”, EOY for short. There are 2 groups of correspondence relations: empDept and deptName, and Manager and empName.

SQL solution

SELECT A.*

FROM employee A,department B,employee C

WHERE A.empDept=B.deptName AND B.manager=C.empName AND A.empHonor=‘EOY’ AND C.empHornor=‘PA’

SQL users can use the nested query or associated query to solve such kind of problems. In this case, we choose the association query that is both concise and clear. The association statement behind the “where” has established the one-to-many relation between deptName and empDept, and the one-to-one relation between manager and empName.

esProc Solution

employee.select(empHonor:"EOY",empDept.manager.empHornor:"PA")

esProc solution is intuitive: select the employee of “EOY” whose Department Manager has be awarded with “PA”.

Comparison

The SQL statement to solve such kind of question is lengthy and not intuitive. In fact, the complete association query language is “inner join…on…” style. This statement is simplified in the above example. Otherwise it will be much hard to understand.

esProc users can use ”.” for object reference. Such style is intuitive and easy to understand. The complex and lengthy association statement for multiple tables can thus be converted to the simple object access, which is unachievable for SQL. When there are more and more tables, the complexity of SQL association query will rise in geometric series. By comparison, the esProc user can always access the data intuitively and easily by taking the advantage of object reference.

Regarding the multi-table associations of complex computation, esProc can handle it more intuitively and conveniently than SQL.

From the comparison of the above four examples, we can see that esProc is not only characterized with step-by-step computation, complete set-lization, sorted sets, and object reference. The analysis style is intuitive, the syntax style is agile, and the function is powerful. esProc is a tool especially designed for mass data computation, and a more convenient database computing script.

About esProc: http://www.raqsoft.com/product-esproc

April 10, 2014

Tap Utmost Value of Excel

Excel is the most widely-used spreadsheet tool. The nontechnical persons love to use it for computation and analysis though, they usually find the formulas and functions available in Excel are rather poor and the VBA is just double Dutch to them for further analysis. Thus, a huge volume of data with valuable information has been wasted in vain.

esProc is introduced to better the situation. Empowered esProc users can tap the utmost value of Excel by taking the esProc advantages of powerful computation ability, agile and easy-to-use analysis style, and programmed running mode.

I Case and Comparison

Description

In an advisement agency, a Sales Director receives Client Reports from eight Regional Account Managers by every quarter. The Client Report is an Excel spreadsheet, mainly comprising the client, sales value, and other information about the respective region, as given in the below figure:

Suppose that the Sales Director wants to compare the big client across various regions. For example, regarding the client of whom the sales value ranks top 10%, 20% or 30%, what’s the average sales, and which enterprise is among the Top 500?

First, let’s have a look at the attempt to solve it with Excel formulas.

Excel Formula Solution

The first step is to compute the average sales of clients whose sales values rank top 10%. We may adopt the following procedure: firstly, sort the sales value in descending order, and then use count( ) function to compute the total number of clients. Secondly, multiple the total number by 10%, and round the result to get the row number with the round ( ) function. Finally, copy these clients onto a new spreadsheet, and compute the average value. This procedure is not difficult for those who are familiar with Excel.

Then, let’s proceed with this computation: How many of these big clients are among Top 500? To solve the problem, you need to get the intersection of the two datasets. In other words, this is to compute the common part of big client set and the Top 500 list from the previous step. The computational expression is:

=INDEX(A:A,SMALL(IF(COUNTIF($B$2:$B$15,$A$2:$A$20),ROW($A$2:$A$20),4^8),ROW(A1)))&""

Since the above formula requires 5 various combinations of functions, it is a great challenge to compose it.

To make it worse, the computational procedure of Excel requires the user to carry out manually and only acceptable for the specific Excel spreadsheet. This is not as universal as a program, for example, if program, the whole computational procedure will rerun automatically on receiving different file names. In this case, there are 8 Excel files. Excel users will have to run the computational procedures for 8 times. Moreover, there are 3 rankings: 10%, 20%, and 30%, which means the computation will have to be repeated for 8X3=24 times.

It is obvious that it is too tough to solve this problem with Excel formula.

Let’s try VBA, the most powerful extension tool of Excel.

Excel VBA Solution

Function Collection(a As Range, b As Range)

On Error Resume Next

Dim arr1(), arr2(), times, tmpindex

Set newcoll = CreateObject("Scripting.Dictionary")

With Application.WorksheetFunction

arr1 = .Transpose(a.Value)

arr2 = .Transpose(b.Value)

times = .Mode(arr1, arr2)

If IsEmpty(times) Then

Exit Do

Else

newcoll.Add times, Empty

tmpindex = .Match(times, arr1, 0)

arr1(tmpindex) = arr1(UBound(arr1))

If UBound(arr1) = 1 Then

arr1(1) = Empty

Else

ReDim Preserve arr1(1 To UBound(arr1) - 1)

End If

tmpindex = .Match(times, arr2, 0)

arr2(tmpindex) = arr2(UBound(arr2))

If UBound(arr2) = 1 Then

arr2(1) = Empty

Else

ReDim Preserve arr2(1 To UBound(arr2) - 1)

End If

times = Empty

End If

Loop

End With

arr3 = newcoll.keys

If newcoll.Count = 0 Then

Collection = False

Else

Collection = arr3

End If

End Function

Isn't it unreadable and indigestible? The above “double Dutch” is only one step to compute the intersection set of several steps. Undoubtedly, VBA needs a great programming capability and is by no means suitable for nontechnical persons.

Then, let’s check the impressive esProc solution below.

esProc Solution

A1 and A6: Retrieve the “Client Report” of a certain region respectively and “Top 500 list”. Please note that “rangeFile” is a parameter, and you can assign various file names to get various results. In addition, the “percent” in the B2 is also a parameter, for example, 10%, 20%, and 30%.

A2: Sort the data in A1 by sales value. The ”amount” is column name retrieved automatically, and the ”-1” represents the descending order.

B2: Compute the row number of clients ranking the top 10%, 20% or 30% respectively.

A3: Compute the clients from row 1 to B2 that are all big clients. Assume that B2 equals to 3, then “to(B2)” equals to ”1,2,3”.

A4: Compute the average sales value of big clients

A7: Compute the clients which not only big clients but also among the Top 500. In other words, this is to compute the intersection set of the customer column from the Client Report and the 500Name column from the Top 500. The ”^” represents the intersecting action.

A8: Compute the number of clients in the intersection set from the previous step.

As we can see, the style of esProc expression is similar to that of Excel, agile and intuitive but more powerful in computing and capable to rerun just as a program does. It is a great analysis tool to empower the nontechnical persons who are familiar the style of Excel.

Perfect! esProc is just the best tool to solve such problems.

II Features Fit for Excel

Better Usability

esProc provides an operation interface of “cellset” style with the letter as column name and number as row no. The cells can be mutually referenced with cell name. Such style is quite friendly to people who are familiar with Excel.

The cellset allows the business analyst to work from the business perspective, process and analyze the data intuitively. Therefore, esProc demands little on technical capability from users, and thus ideal for business person with no technical background.

esProc can be installed on the normal PC with common OS , and run in a environment similar to that of Excel.

Strong Analysis Ability

As a tool specially designed to handle massive data computation, esProc has all capability of SQL statements and senior languages. On one hand, esProc can be used to query, filter, group, and collect statistics, just like SQL statements; On the other hand, it can be used in the loop and branch judgment for the procedure analysis, just like VBA.

In the practical use, esProc over-performs the SQL and senior languages, thanks should go to the below advantages: esProc users will never face the dilemma of lengthy and unreadable SQL statements and the poor computability of senior languages. Even the nontechnical person can also resort to esProc to complete the complex analysis computation all by themselves.

Programmed Running Mode

esProc has special optimizations for Excel, providing the easy-to-use functions for reading from or writing back to Excel spreadsheets of various versions from Excel97 to Excel2007.

In a programmed running mode, esProc users can analyze various Excel spreadsheets according to various parameters, which is ideal for the repetitive computation. It is indeed a timesaving and effort-saving analysis tool.

III Significance and Value to Excel

esProc is a powerful analysis tool for Excel, and particularly suits the need of nontechnical persons to implement complex computational analysis on data from Excel spreadsheets.

esProc facilitates the data mining on Excel with the convenience and power for all people to deliver and ensure the valuable data will truly support the decision-making of enterprises.

esProc saves the long-stored Excel from turning into a legacy over time. esProc will tap the utmost value of Excel.

About esProc: http://www.raqsoft.com/product-esproc

menu

October 13, 2014

Programming Languages for Desktop Data Analysis

October 1, 2014

Structured Data Computing: the Focus of Routine Data Analysis

June 16, 2014

Who uses the R programming language and how do they use it?

May 22, 2014

A Data Analysis Language/Script with Parallelism Feature

April 14, 2014

A More Convenient Data Computing Script than SQL

April 10, 2014

Tap Utmost Value of Excel