Persistence pays off


By Mervyn Mooi, Director at Knowledge Integration Dynamics (KID).
[Johannesburg, 11 Dec 2012]

In the majority of cases, the persisted analytical database should remain a cornerstone of the modern data warehouse, rather than federated, virtualised data referenced across multiple source systems spread throughout a business.

Accessing data, whether in a virtualised environment or an analytical one, still requires processing: the data must be transformed, summarised, aggregated, de-duplicated, quality-checked and merged.
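As a rough illustration of those steps, the following sketch (in Python with pandas, using hypothetical account and branch columns) runs a miniature de-duplicate, quality-check and aggregate pass over a handful of source records:

    import pandas as pd

    # A miniature version of the processing every query implies.
    # Column names and values are assumptions for the sketch.
    raw = pd.DataFrame({
        "account": ["A1", "A1", "A2", "A2", "A2"],
        "branch":  ["JHB", "JHB", "CPT", "CPT", "CPT"],
        "amount":  [100.0, 100.0, 250.0, None, 75.0],
    })

    deduped = raw.drop_duplicates()                    # de-duplicate
    clean = deduped.dropna(subset=["amount"])          # quality check: drop bad rows
    summary = clean.groupby("branch")["amount"].sum()  # summarise and aggregate
    print(summary)  # JHB 100.0, CPT 325.0

In a virtualised environment, work like this is repeated on the fly for every query; in a persisted warehouse it happens once, during loading.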

Doing this on-the-fly, for every query and every user in the virtualised environment, is costly because it is resource-intensive. For example, a standard query may be run 100 times per day by 100 users, each run scanning half of the accessed tables. That puts an immense load on the systems, because the data must be loaded into memory, processed, and the results assembled and presented to each user – amounting to many thousands of table accesses in a large system environment.
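Back-of-the-envelope arithmetic makes the scale concrete; every figure below is an illustrative assumption, not a measurement:

    # Illustrative load arithmetic for on-the-fly federation.
    runs_per_day = 100      # the standard query's daily executions (from the article)
    tables_accessed = 100   # tables the query touches; assumed figure
    scan_fraction = 0.5     # "half of the accessed tables" are scanned

    scans_per_day = runs_per_day * tables_accessed * scan_fraction
    print(f"{scans_per_day:,.0f} full table scans per day")  # 5,000
    # Each scan repeats the transform/aggregate/de-duplicate work above;
    # a persisted warehouse pays that cost once, up front.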

Keeping with tradition
An analytical database in a traditional data warehouse architecture relieves that processing overhead because it does the accessing and processing only once, upfront: it pre-calculates or predetermines the results and then stores them. Physically storing the resultant data has a cost in computing resources such as disk space, but in many cases this technique is far more economical than a virtual solution where the same query is run multiple times per day by multiple users, because storage costs are well below those of other processing resources such as CPU, memory and bandwidth.
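A minimal sketch of that compute-once, read-many idea, with names and figures assumed purely to continue the hypothetical example above:

    # The warehouse materialises the aggregate once; every later query
    # is a cheap lookup. Names and figures are assumptions for the sketch.
    materialised = {"JHB": 100.0, "CPT": 325.0}  # pre-calculated branch totals

    def branch_total(branch: str) -> float:
        # No source scan, no on-the-fly transform: a lookup in stored results.
        return materialised.get(branch, 0.0)

    for _ in range(100):              # the same query run 100 times a day...
        result = branch_total("CPT")  # ...costs a dictionary lookup, not a re-scan
    print(result)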

One major advantage of the persisted analytical database of the data warehouse is that the data is stored in only one place, so if there is a problem with the data, there is usually only one point of resolution. It also allows subsets of the data to be created, for example franchised data marts, cubes and other derived structures.

The data is also current. It can be refreshed in the warehouse within milliseconds of entering the operational stream, so in real terms it is current and can almost be considered a mirrored instance. For example, a person could withdraw cash at an ATM and that transaction can be entered into the warehouse within milliseconds. In fact, many South African companies are using their previously largely idle disaster recovery processing and storage infrastructure to balance the load – directing writes to one system and reads to the other – to increase throughput and reduce dependence and impact on operational business systems.
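One way to picture that read/write split, as a minimal sketch in which the connection names and routing rule are purely hypothetical:

    # Minimal read/write-splitting sketch. Connection names and the
    # routing rule are assumptions; real systems use replication-aware
    # drivers or proxies rather than string inspection.
    PRIMARY = "operational-system"   # takes the writes
    REPLICA = "dr-copy"              # disaster recovery copy serves the reads

    def route(statement: str) -> str:
        is_read = statement.lstrip().upper().startswith("SELECT")
        return REPLICA if is_read else PRIMARY

    print(route("SELECT balance FROM accounts"))       # -> dr-copy
    print(route("INSERT INTO withdrawals VALUES ..."))  # -> operational-system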

All checks done
The analytical database contains qualified information: data that has been verified, validated, quality-processed, enriched, summarised and augmented. In the virtual context, data has yet to be qualified and, once it has been used, the query and its results are no longer present in the system – they are usually removed immediately, as nothing is persisted.

A data warehouse's primary purpose is to enable data discovery, and eventually knowledge, throughout an organisation. The analytical, persisted database meets the original tenets of data warehousing: serving up quality information that is accurate, consistent and reliable. These principles haven't evaporated because of virtualisation – they remain as relevant and fundamental to meeting the needs of businesses today as they were when data warehousing made its appearance in the late 1980s.

Persisted analytical databases in the modern warehouse also provide central control and maintenance while economising on resources.

By opening up a wealth of information to business people through business intelligence (BI) applications and practices, the warehouse ensures reliable results. This cannot be achieved straight from operational databases, which are intended to keep the wheels of business turning – for example, to manage business transactions and events. In one South African bank, just one department has 24 operational databases – a mix-and-match of technologies, as is typically the case with systems that have grown alongside the business. It would be impractical to apply BI in a federated or virtualised architecture across such a landscape to serve the informational and analytical needs of the business. However, both analytical and operational or transactional reports can be generated from the persisted data warehouse, and many companies are in fact doing just that.

Data warehouses are not inexpensive, so at first glance it seems to make financial sense to choose a virtualised environment as the less costly option. But in the context of what has been discussed above, that approach is highly debatable. Additionally, in a virtualised environment, a platform change requires all the logic for data access, transformation and information presentation to be rewritten or re-bound. In contrast, if that logic is contained in a database and the database platform is changed, it is a simple matter of porting the data, pre-calculated results and quality information to the new platform.

The traditional data warehouse architecture, with costs calculated over a standard five-year period, remains far more cost-effective than a virtualised data environment with its additional processing requirements. Many people, however, overlook the ongoing operational costs because these tend to be less visible. The trend is to focus on the capital costs because they can be easily ascertained, but doing so will result in higher costs for the organisation.
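A crude way to frame that five-year comparison; every figure below is an assumed, illustrative number rather than a benchmark or quotation:

    # Crude five-year total-cost-of-ownership comparison.
    YEARS = 5

    # Persisted warehouse: high capital outlay, modest running costs
    # (storage is cheap relative to repeated compute). Assumed figures.
    warehouse_capex = 1_000_000
    warehouse_opex_per_year = 150_000

    # Virtualised/federated: low capital outlay, but every query pays
    # again in CPU, memory and bandwidth. Assumed figures.
    virtual_capex = 200_000
    virtual_opex_per_year = 450_000

    warehouse_tco = warehouse_capex + YEARS * warehouse_opex_per_year
    virtual_tco = virtual_capex + YEARS * virtual_opex_per_year
    print(f"Warehouse five-year TCO:   {warehouse_tco:,}")   # 1,750,000
    print(f"Virtualised five-year TCO: {virtual_tco:,}")     # 2,450,000
    # Comparing capital costs alone (1,000,000 vs 200,000) points the
    # other way - the trap of focusing only on what is easily ascertained.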
