For several years now, we have been talking about a ‘big data’ boom. All this data was supposed to make us smarter, enable us to make near real-time decisions and maybe even predict future behaviours. Yet, counter-intuitively, many companies we work with fight the opposite problem.
They find themselves using old data, because processing all the information takes longer and longer. As volumes increase, the data they use gets even older. So, how do we stop the ageing of data? How do we make sure big data makes us smarter, not dumber?
The idea was a good one: we have more data sources than ever before, we can measure things we couldn’t measure before, and we can overlay internal sources with external data points from partners, social media and government figures. All this data paints a complete picture of the business, and many of these sources are updated not quarterly or monthly but in real time.
One would have thought that in such a world, we’d be better equipped to make critical business decisions. That was true until sheer volume made it impossible for many companies to get data in real time, and they ended up relying on old data. How old? Some of our customers tell us that even some of their time-sensitive supply chain reporting was delayed by about two to three weeks.
There are various reasons and bottlenecks that drive the ageing of data. First, and for many most importantly, companies can’t complete data extractions in time. Businesses often run a large number of enterprise apps (our banking clients, for example, count them in the thousands), and the data goes through very complex processes before it reaches the business intelligence software for analysis.
The data must move from the applications into operational data stores before it ends up in a data warehouse. There is usually a limited window in which this process has to be completed - a fairly easy task until the volume of data exploded and there was no longer enough time to get it done. Suddenly, rather than having day-old data in the reporting environment, companies ended up with data that was in some cases weeks old. One of our customers calculated the cost at 50% of their BI investment.
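To see how a fixed window turns growing volume into ageing data, here is a minimal back-of-the-envelope sketch in Python. The function name and every figure in it are hypothetical and chosen only for illustration; it assumes a single nightly load with constant throughput, which real pipelines rarely have.

```python
def reporting_lag_days(daily_volume_gb, window_hours, throughput_gb_per_hour, days):
    """Estimate how many days of data are still waiting to be loaded.

    Assumption (illustrative only): one load job per night, running for
    `window_hours` at a constant `throughput_gb_per_hour`.
    """
    capacity_gb = window_hours * throughput_gb_per_hour  # what one window can move
    backlog_gb = 0.0
    for _ in range(days):
        backlog_gb += daily_volume_gb                     # new data arrives
        backlog_gb = max(0.0, backlog_gb - capacity_gb)   # window drains what it can
    # Express the unprocessed backlog as "days of data" behind
    return backlog_gb / daily_volume_gb


# Hypothetical numbers: a 6-hour window at 150 GB/hour moves 900 GB per night.
print(reporting_lag_days(800, 6, 150, 30))    # volume fits the window
print(reporting_lag_days(1000, 6, 150, 30))   # volume exceeds it by 100 GB a day
```

With these made-up numbers, the first case stays current, while the second falls three days behind after a month and keeps slipping. Once daily volume exceeds what the window can move, the lag grows every night rather than settling.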
Secondly, there is a growing need for multiple instances of each database. An average organisation has eight to ten copies of its database for development, testing, quality assurance, user acceptance testing, pilots, training, back-up and disaster recovery.
Most of these processes, especially testing, require fresh data, but making physical copies of databases is taking longer and longer. With limited resources, someone will get the short end of the stick and will be asked to make do with an old copy. It should then come as no surprise when a lot of errors surface late in the production process.
Lastly, any organisation that deals with customer data must mask the data when it is used for development, testing or QA. Data masking is not too difficult until you need to mask multiple copies and the volume of data you are protecting is growing exponentially.
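For readers who have not seen masking up close, here is a minimal sketch of one common approach, keyed hashing, using only Python's standard library. It is an illustration under stated assumptions, not a description of any particular product: the key, the field names and the 16-character truncation are all hypothetical choices.

```python
import hashlib
import hmac

# Hypothetical key for illustration; a real one would live in a secrets manager.
SECRET_KEY = b"replace-with-a-managed-secret"


def mask_value(value: str, key: bytes = SECRET_KEY) -> str:
    """Replace a sensitive value with a keyed hash.

    The same input always yields the same token, so joins between masked
    tables still line up, but the original value is not stored.
    """
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]  # truncated for readability (an illustrative choice)


def mask_record(record: dict, sensitive_fields: set) -> dict:
    """Return a copy of the record with the named fields masked."""
    return {
        field: mask_value(str(value)) if field in sensitive_fields else value
        for field, value in record.items()
    }


# Hypothetical customer row
customer = {"name": "Alice Example", "email": "alice@example.com", "country": "UK"}
print(mask_record(customer, {"name", "email"}))
```

Every copy of the database has to pass through a step like this for every sensitive field, which is why the work grows with both the number of copies and the volume of data.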
Combine these three trends, and it is likely that every company will at some point face (or is already facing) the problems associated with old data.
Managers and executives are thus often making decisions based on inaccurate and out-of-date information. In some instances this might not make much difference; at worst, however, the business will be making potentially serious errors. It might as well have trusted its gut and disregarded the data altogether.
This might be an overstatement, but it gets to the core problem - what’s the data good for if we can’t trust it, if it’s old?
So, what’s the solution? Unfortunately, in most cases it involves a lot of compromises. One option is to make the data sets smaller by selectively choosing what’s needed and deciding whether a subset of data may suffice. Another is to be selective about what you need in real time and what can be reported weekly, monthly or quarterly, rather than applying a blanket approach. The last option is to turn to technology for help.
More recently, a new concept of database virtualisation has been used by the likes of Facebook, RBS and StubHub. Rather than creating physical copies of databases, it creates virtual copies that take seconds rather than hours or days. Whatever the option, something needs to be done, because ageing data is making us dumber, not smarter. What’s your approach to keeping your data fresh?
Posted by Iain Chidgey, vice president EMEA, for Delphix, the agile data management company