Big Data: petabytes of data are fast becoming commonplace


‘Big data’ is fast becoming the new buzzword, but the definition of ‘big’ is in the eye of the beholder. Not too long ago, terabytes of data were considered big, but today petabytes are becoming commonplace. The upshot is that many businesses are struggling to capture, track, analyse and store these large volumes of data.

Organisations now need tools that can help them load data faster, store it more compactly and reduce the cost, resources and time involved in its analysis and management. In response to this, an expanding set of new technologies has emerged - from columnar databases, to Hadoop clusters, to document-oriented databases and other “NoSQL” variants.

It has become more confusing than ever to figure out what to use for a given problem. In many cases, the right answer is a combination of technologies. In regard to data-intensive online businesses, for example, some store huge volumes of cookie-level detailed data in Hadoop, then extract summary data into an analytic database to provide users with easy access for ad-hoc analytics.
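The detail-then-summary pattern described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `raw_events` list stands in for cookie-level data held in a Hadoop cluster, an in-memory SQLite database stands in for the analytic database, and the table name, column names and sample values are all made up for the example.

```python
import sqlite3
from collections import Counter

# Stand-in for cookie-level detail held in a large raw store (hypothetical data).
raw_events = [
    ("cookie-1", "2020-01-01", "/home"),
    ("cookie-1", "2020-01-01", "/pricing"),
    ("cookie-2", "2020-01-01", "/home"),
    ("cookie-2", "2020-01-02", "/home"),
]

# Extract step: collapse the detail into one row per (day, page).
page_views = Counter((day, page) for _cookie, day, page in raw_events)

# Load step: put the much smaller summary into a database users can query.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE daily_page_views (day TEXT, page TEXT, views INTEGER)")
db.executemany(
    "INSERT INTO daily_page_views VALUES (?, ?, ?)",
    [(day, page, views) for (day, page), views in page_views.items()],
)

# Ad-hoc query against the summary rather than the raw detail.
home_views = db.execute(
    "SELECT SUM(views) FROM daily_page_views WHERE page = ?", ("/home",)
).fetchone()[0]
print(home_views)
```

The point of the split is that the raw store keeps every event, while analysts only ever touch the compact summary table.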

Machine-generated information - such as call detail records, Web and event logs, financial trade data and RFID readings - is increasing at the highest rate. It’s hardly surprising that rapidly expanding data volumes are bumping up against many organisations’ ability to store and manage them.

In the past, a data warehouse was seen as the only solution to the information management challenges presented by machine-generated data. Yet throwing more servers and storage systems at the problem only creates massive infrastructure footprints that are extremely costly to scale, house, power and maintain.

What’s more, with users requiring near-real-time access to data and the ability to analyse it without constraints, the days of database administrators spending hours or days tuning databases for high performance are over.

The fact is that the traditional, hardware-centric approaches to information management are no longer enough. Traditional databases are well suited to initially storing machine-generated data, but they are ill-suited for analysing it.

They cannot cope with the volume of data, the query speeds required or users’ need for immediacy. In the past, the instinctive response to this challenge has been to recruit more people (database administrators), to increase investment (expanded disk storage subsystems), or to scale back by archiving data to reduce the size of the dataset. This works, but it is only a short-term and expensive fix.

Just as there are growing numbers of purpose-built databases for very large social networks and SaaS environments, it makes commercial sense to deploy databases that are specifically designed to manage machine-generated data. In an ideal world, users should have near immediate access to their data and the flexibility to run new reports, perform ad-hoc analytics or do complex data mining without IT intervention to tune or manage the database.

Capabilities such as deep data compression should also be integral to these tools, so that they can exploit the characteristics of machine-generated data to achieve 10:1, 20:1 or even 50:1 compression and require less hardware to run.
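To see why machine-generated data tends to compress well, consider a column of values that repeat in long runs, as status codes in a Web log often do. The sketch below uses run-length encoding purely as an illustration; it is not presented as the method any particular product uses, and the sample data and the resulting ratio are toy figures rather than a benchmark.

```python
from itertools import groupby

def run_length_encode(values):
    """Collapse consecutive repeats into (value, count) pairs."""
    return [(value, sum(1 for _ in group)) for value, group in groupby(values)]

def run_length_decode(runs):
    """Expand (value, count) pairs back into the original sequence."""
    return [value for value, count in runs for _ in range(count)]

# Hypothetical log column: long runs of the same HTTP status code.
status_codes = [200] * 40 + [404] * 5 + [200] * 50 + [500] * 5

runs = run_length_encode(status_codes)

# Each run stores two numbers (value and count), so compare against 2 * len(runs).
ratio = len(status_codes) / (2 * len(runs))
print(runs, ratio)
```

The more repetitive and ordered the column, the fewer runs are needed, which is why data such as logs and sensor readings can shrink so dramatically.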

For just such reasons, column databases (which store data column-by-column rather than in the traditional row-by-row format) have now emerged as a compelling choice for high-volume analytics.

As most analytic queries only involve a subset of the columns in a table, a column database retrieves just the required data; this speeds up queries, while reducing disk I/O (input/output) and computing resources. These types of databases can also enable significant compression and accelerated query processing so users don’t need as many servers or as much storage.
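A toy comparison makes the difference concrete. The sketch below assumes a handful of call detail records with hypothetical field names, and counts how many stored values each layout has to touch in order to total a single column. Real databases work with disk pages rather than Python lists, so treat this as an illustration of the idea only.

```python
# Hypothetical call detail records, stored row by row.
rows = [
    {"caller": "A", "callee": "B", "duration_s": 60, "cell_id": 7},
    {"caller": "A", "callee": "C", "duration_s": 120, "cell_id": 7},
    {"caller": "B", "callee": "C", "duration_s": 30, "cell_id": 9},
]

def to_columns(rows):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns

def total_duration_row_store(rows):
    """Row layout: each whole row is read even though one field is needed."""
    values_touched = 0
    total = 0
    for row in rows:
        values_touched += len(row)
        total += row["duration_s"]
    return total, values_touched

def total_duration_column_store(columns):
    """Column layout: only the one required column is read."""
    column = columns["duration_s"]
    return sum(column), len(column)

columns = to_columns(rows)
row_total, row_touched = total_duration_row_store(rows)
col_total, col_touched = total_duration_column_store(columns)
print(row_total, row_touched, col_total, col_touched)
```

Both layouts give the same answer, but the column layout reads a quarter of the values here, and the gap widens as tables gain more columns.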

These benefits can be further extended by using analytic solutions that combine column orientation with the ability to use knowledge about the data itself, which intelligently isolates the relevant information and returns results more quickly.
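One form such knowledge could take is a small summary kept for each block of a column, for example its minimum and maximum value. The sketch below is an assumption made for illustration, not a description of any specific product's internals: it shows how those summaries let a query skip blocks that cannot contain a match, and answer for blocks that match entirely, without scanning either.

```python
def build_blocks(values, block_size):
    """Split a column into blocks and record min/max metadata for each."""
    blocks = []
    for start in range(0, len(values), block_size):
        chunk = values[start:start + block_size]
        blocks.append({"min": min(chunk), "max": max(chunk), "values": chunk})
    return blocks

def count_greater_than(blocks, threshold):
    """Count values above threshold, scanning only blocks that might be mixed."""
    matches = 0
    blocks_scanned = 0
    for block in blocks:
        if block["max"] <= threshold:
            continue  # no value in this block can match
        if block["min"] > threshold:
            matches += len(block["values"])  # every value matches, no scan needed
            continue
        blocks_scanned += 1
        matches += sum(1 for v in block["values"] if v > threshold)
    return matches, blocks_scanned

# Hypothetical column of increasing event timestamps, typical of log data.
timestamps = list(range(1000))
blocks = build_blocks(timestamps, 100)

matches, blocks_scanned = count_greater_than(blocks, 949)
print(matches, blocks_scanned)
```

Because machine-generated data often arrives roughly in order, block summaries like these tend to be tight, so most blocks can be ruled in or out from the metadata alone.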

These tools provide an instantly downloadable, easy-to-use, open source alternative to traditional database solutions. They make analysis of machine-generated data possible for companies without armies of database administrators, large budgets or huge amounts of time.

They also offer self-tuning technology that can reduce administrative effort by almost 90% whilst providing data compression ranging from 10:1 to more than 40:1 and substantially enhancing query performance. This offers users a fast, simple and affordable path to high-volume, high-performance analytics.

Posted by Don DeLoach, CEO, Infobright