The data warehouse is essential to enterprise business intelligence and accounts for a large share of the total enterprise cost. With the global data explosion of recent years, business data volumes have grown significantly, making it a serious challenge for the enterprise data warehouse to meet diverse and complex business demands. More data, more data warehouse applications, more concurrent accesses, higher performance, and faster I/O: all of these demands put more pressure on the data warehouse. Every IT manager today is concerned with expanding data warehouse capacity at a lower cost.
Here is an example. A data warehouse is originally provisioned as follows:
Server: One cluster with two high-performance database servers.
Storage space: 5TB high-performance disk array.
CPU: 8 high-performance CPUs.
User license agreement: 100
To meet the capacity expansion needs over the next 12 months, the following increases are required:
Computational performance: Double
Storage space: Quadruple
Concurrency: Double
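The arithmetic behind these targets can be sketched in a few lines. This is only an illustration: the dictionary keys are hypothetical names, and mapping "computational performance" to CPU count and "concurrency" to user licenses is an assumption made for the example.

```python
# Baseline provisioning taken from the example above.
baseline = {"storage_tb": 5, "cpus": 8, "user_licenses": 100}

# Growth factors for the 12-month target. Mapping performance to CPUs and
# concurrency to user licenses is an assumption for illustration only.
growth = {"storage_tb": 4, "cpus": 2, "user_licenses": 2}

# Target capacity and the amount that has to be added to reach it.
targets = {name: baseline[name] * growth[name] for name in baseline}
additions = {name: targets[name] - baseline[name] for name in baseline}

for name in baseline:
    print(f"{name}: {baseline[name]} -> {targets[name]} (add {additions[name]})")
```

Under these assumptions the storage target is 20TB (15TB more than today) and the CPU target is 16 (8 more than today).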
How can an IT manager achieve this expansion goal? The common practice is to upgrade the database hardware and software: replace the existing servers with more advanced data warehouse servers or add two more servers of the same class, add a 15TB data-warehouse-specific disk or switch to a 20TB disk cabinet, and add 8 CPUs. In addition, the organization has to pay expensive software licensing fees for the additional user licenses, CPUs, and disk storage space.
No matter which way you choose to upgrade, the data warehouse vendor will ultimately lock you into their products and charge you for the expensive upgrades.
Computation outside the database is an alternative way to expand capacity. In a 20TB data warehouse (roughly 30% real data and 70% buffer), the core data usually makes up less than 1/10 of the total, for example about 1TB. The remaining 19TB holds redundant data. For instance, when a new application is deployed, the data warehouse usually gives it a copy of the data it uses rather than direct access to the core data, in order to protect the core data. Quite often, the new application also needs records that summarize or otherwise process the core data, so an intermediate table based on the core data is built to speed up access. Such redundant data keeps growing as existing and emerging business develops, while the total amount of core data stays comparatively small.
This redundant data is not core data and does not require the same high level of security protection. If it is moved to ordinary PCs and read, written, and computed with tools other than the database, the cost of database capacity expansion can be reduced dramatically. Computation outside the database, combined with computation in the database, is therefore the best choice for expanding database capacity. The benefits include:
Computational performance: Parallel computation runs across multiple nodes built from inexpensive PCs and desktop CPUs. Compared with high-performance database servers, the same or even greater computational performance can be achieved at a relatively lower cost.
Storage space: With cost-effective desktop-level disks, users can get far more storage space than a data-warehouse-specific disk provides, at an extremely low cost. HDFS also facilitates data security protection, access consistency, and non-stop disk capacity expansion.
Concurrency: Access that used to be centralized on the data warehouse can be spread across multiple node machines, which supports more concurrent accesses than the data warehouse alone. In addition, users do not have to pay for additional access licenses, CPUs, or disk storage space.
Computation outside the database looks attractive. Hadoop and similar software is available on the market to meet all of the above demands. So why do so few people choose Hadoop to relieve the pressure of expanding data warehouse capacity? Because such tools are not as powerful as a database at computing, in particular for computations involving complex logic.
What if there were software that met the above demands on computational performance, storage space, and concurrency while being equal to or even more powerful than a database at computing? With such software, the capacity expansion pressure on the database would be greatly relieved, and so would the cost of database capacity expansion.
esProc is built to meet these demands. It is middleware designed specifically to take on computation jobs between the database and the application. For the application layer, esProc provides an easy-to-use JDBC interface; for the database layer, esProc offers powerful parallel computation. By performing computation outside the database or in external storage, esProc relieves the pressure on the database's computation, storage, and concurrency. As a result, organizations can effectively cut the cost of database software and hardware while still optimizing database administration.
esProc has a comprehensive and well-defined computing architecture that is fully capable of sharing the database workload and handling computations of any complexity for applications. In addition, esProc supports parallel computation across multiple nodes, so massive or intensive computation workloads can be shared evenly among multiple ordinary servers or inexpensive PCs.
With its support for parallel computation, esProc can evenly decompose computation jobs that used to be handled centrally and allocate them to multiple ordinary PCs. Each node only needs to undertake a small share of the computation.
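The general idea of decomposing a central job evenly and combining partial results can be sketched as follows. This is not esProc's implementation or API: the function names are hypothetical, and a local thread pool stands in for separate node machines purely for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def split_evenly(items, n_parts):
    """Split items into n_parts chunks whose sizes differ by at most one."""
    base, extra = divmod(len(items), n_parts)
    chunks, start = [], 0
    for i in range(n_parts):
        size = base + (1 if i < extra else 0)
        chunks.append(items[start:start + size])
        start += size
    return chunks


def node_job(chunk):
    """Work done by one node: here, simply sum its share of the data."""
    return sum(chunk)


def parallel_total(items, n_nodes):
    """Distribute chunks to workers and combine their partial results."""
    chunks = split_evenly(items, n_nodes)
    # Assumption: threads stand in for separate node machines.
    with ThreadPoolExecutor(max_workers=n_nodes) as pool:
        partials = list(pool.map(node_job, chunks))
    return sum(partials)


values = list(range(1, 101))
total = parallel_total(values, 4)
print(total)
```

Each worker handles only a quarter of the values here, and the final result is the same as summing everything in one place.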
With esProc, the core data can stay in the database, while the intermediate tables and scripts derived from the core data can be stored outside it. By using resources sensibly, organizations can relieve the workload pressure on the database, keep database costs under control, simplify management, and handle a wide range of data warehouse applications with ease. These applications include real-time high-performance applications, non-real-time big data applications, desktop BI, reporting, and ETL.