Knowledge Integration Dynamics

Even the best big data systems become a morass of spaghetti without robust architecture


By Mervyn Mooi, director at Knowledge Integration Dynamics (KID).
Johannesburg, 28 Feb 2014


Architecture remains the fundamental issue at the core of realising the benefits of big data in organisations today, whether they are large or small enterprises.

A problem with big data is its sheer volume and complexity: the data typically arrives in many different formats and is scattered across the organisation. It extends well beyond traditional structured data such as sales figures, employee records, product specifications, inventories, production schedules and the like. In fact, just dealing with big data generates data of its own, which only adds to the pile.

With the need to integrate so many source systems – often completely disparate – tools were needed, and the industry turned to those that already existed, which in the past comprised mostly extraction, transformation and loading (ETL) software. Many of these tools, however, were not designed for the demands of businesses today, chiefly scalability, ease of use, flexibility and throughput performance.
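As a rough, hypothetical sketch of the pattern such tools automate, the following Python snippet extracts rows from a CSV export, normalises them and loads them into a small SQLite table. The file, table and column names are illustrative assumptions, not a reference to any particular product or system mentioned in this article.

    import csv
    import sqlite3

    def extract(path):
        # Read raw rows from a source system's CSV export.
        with open(path, newline="") as f:
            yield from csv.DictReader(f)

    def transform(rows):
        # Normalise formats so disparate sources fit one target model.
        for row in rows:
            yield {
                "sale_id": int(row["SaleID"]),
                "amount": round(float(row["Amount"]), 2),
                "region": row["Region"].strip().upper(),
            }

    def load(rows, db_path="warehouse.db"):
        # Write the transformed rows into the target reporting table.
        con = sqlite3.connect(db_path)
        con.execute("CREATE TABLE IF NOT EXISTS sales "
                    "(sale_id INTEGER PRIMARY KEY, amount REAL, region TEXT)")
        con.executemany("INSERT OR REPLACE INTO sales (sale_id, amount, region) "
                        "VALUES (:sale_id, :amount, :region)", rows)
        con.commit()
        con.close()

    load(transform(extract("sales_export.csv")))

Real ETL platforms add scheduling, error handling, lineage and parallelism on top of this skeleton, which is exactly where the scalability and throughput demands mentioned above come in.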

The underlying issue is that the databases and file systems that store data were often not designed to communicate with one another, so they had to be jury-rigged together. With the advent of purpose-built tools, however, that situation has reversed. These tools sit in the middle, between data sources and systems – databases, ESBs, data and system integrators, CRM, ERP or other business systems – hence the term middleware.

An issue with middleware is that it introduces latency – precisely the opposite of the desired effect. Adding a step to the middle of an existing process is bound to slow it down and add complexity. One effective mitigation is in-memory computing, particularly if the systems are connected in a grid to provide elasticity and scalability on demand. Another is to process only changed and new (delta) data, which is now standard practice in most organisations.
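To illustrate the delta approach, the sketch below keeps a high-water mark per source and reads only rows modified since the previous run. The source table and its last_modified column are hypothetical assumptions for the example; real implementations often rely on change data capture from the database log instead.

    import sqlite3
    from datetime import datetime, timezone

    def fetch_delta(con, source_table, watermark_table="etl_watermarks"):
        # Store one high-water mark (timestamp of the last run) per source.
        con.execute("CREATE TABLE IF NOT EXISTS " + watermark_table +
                    " (source TEXT PRIMARY KEY, last_run TEXT)")
        row = con.execute("SELECT last_run FROM " + watermark_table +
                          " WHERE source = ?", (source_table,)).fetchone()
        last_run = row[0] if row else "1970-01-01T00:00:00+00:00"

        # Pull only rows created or changed since the previous run, so the
        # middleware moves a change set rather than the full data volume.
        changed = con.execute("SELECT * FROM " + source_table +
                              " WHERE last_modified > ?", (last_run,)).fetchall()

        # Advance the watermark for the next run.
        now = datetime.now(timezone.utc).isoformat()
        con.execute("INSERT OR REPLACE INTO " + watermark_table +
                    " (source, last_run) VALUES (?, ?)", (source_table, now))
        con.commit()
        return changed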

One of the most serious concerns is that this middleware layer can develop into a huge cost centre, with wasted and under-utilised resources, excess capacity, and management and administrative overheads. Often there are replicated, duplicated and overlapping processes and disparate models that drive costs even higher, and as data systems overtake business systems in size, complexity and cost, the total can quickly become astronomical. Another curiosity of this situation is an ever-growing reliance on quick-win practices and projects that ultimately amount to futility and wasted resources (the so-called "hairball" or "spaghetti junction" effect).

Architecture, therefore, is a crucial component of advancing any big data strategies, regardless of how good and efficient the underlying systems are at their specific jobs.

Accessing, integrating, disseminating, verifying, transforming, qualifying and consuming or interpreting any data or information must be done according to standards or an architected approach. If not, you run the risk of creating a spaghetti junction or hairball of crossed connections and replicated or duplicated processes and models between data sources and systems – this defeats the purpose of almost every past data project and hampers future initiatives. Organisations should strive for resource economy that leads to reliable, consistent information that business people can use to keep the organisation running efficiently and effectively.
