Many people associate the open source data framework Hadoop with managing truly massive amounts of data.
And with good reason: Hadoop storage is used by Facebook and Yahoo, which many people (rightly) associate with massive data. Yahoo, an early adopter and contributor to Hadoop, has implemented a 50,000-node Hadoop network. Facebook has a Hadoop system with more than 10,000 nodes in place.
So, there's the big in "big data" for you.
But Arun Murthy, Hadoop VP at the Apache Software Foundation and architect at Hortonworks, paints a different picture of Hadoop and its use in the enterprise. For Murthy, Hadoop's use goes far beyond big data.
One of Hadoop's strongest capabilities is its ability to scale. Yahoo and Facebook are excellent examples of how Hadoop can scale up, but little is usually said about how Hadoop can scale the other way and provide analytic decision-making data for businesses of any size.
All data created equal
Data storage, Murthy explained, used to be expensive. As recently as five years ago, enterprises and SMBs found themselves having to keep track of an exploding array of datasets: emails, search results, sales data, inventory data, customer data, clickthroughs on websites... all of this and more might be coming in, and trying to manage it in a relational database management system (RDBMS) was a very expensive proposition.
With all of these events and signals coming in, an organization trying to keep costs down and data management sane would typically sample that data down to a smaller subset. This downsampled data, which Murthy calls "historical data", would automatically be classified based on certain assumptions, the number one assumption being that some data would always be more important than other data.
For example, the priorities for e-commerce data would be set on the (reasonable) assumption that credit card data would be more important than product data, which in turn would be more important than click-through data.
If you were trying to run a business model based on one given set of assumptions, then it wouldn't be hard to pull information out to make decisions for the business. But the information would always be predicated on those assumptions. What would happen when the assumptions changed? Because data was downsampled, any new business scenario would have to use the sanitised data still in storage; all the raw data would be long gone.
And, because of the expense of RDBMS-based storage, often this data would be siloed within an organisation. Sales would have their data, marketing would have theirs, accounting their own and so on. So business-model decisions would be limited to each part of the organisation examined, not the complete whole.
"With Hadoop," Murthy argued, "there are no assumptions, because you keep all of the data."
This is perhaps the biggest benefit of Hadoop, though it often lurks in the background, behind the notion of Hadoop's low financial costs. "Downsampling makes the assumption that some data is going to be bigger and more important than other data," Murthy explained. "In Hadoop, all data has equal value."
Because all data is equal, and equally available, business scenarios can be run with raw data at any time, without limitation. Moreover, formerly siloed data can be equally accessed and shared for more holistic analysis of an organisation's business.
This shift in how data can be perceived is huge, because now there is no such thing as historical data. Moreover, because data can be stored as is, much of the data management overhead associated with such things as extract, transform and load operations will be reduced.
There's no getting around the fact that Hadoop's most touted benefit is probably its lower cost. Because the entire framework is open source under the Apache License, there are no licensing fees for the base software.
When shouldn't you use Hadoop?
While Murthy is enthusiastic about the potential for deploying Hadoop, there are some limitations you should keep in mind if your organisation is considering moving to Hadoop.
First, if you are doing sub-second interactive reporting from your data, or using the data in multi-step, complex transactions, an RDBMS solution may still be your best bet, since Hadoop is not particularly strong in these areas. If your data is being updated and changed through insertions and deletions, that's another reason to shy away from Hadoop.
Cloudera, a commercial Hadoop vendor that employs Doug Cutting, one of the framework's inventors, uses an open core model, so the base Hadoop software is free but Cloudera's extensions have a licence fee. Hortonworks, which Murthy co-founded with several members of Yahoo's Hadoop team in early 2011, keeps all of the software free and open source, and builds revenue through training and support programs.
A source of additional savings: unlike an RDBMS, Hadoop does not require expensive hardware or high-end processors. Any commodity server hooked into the Hadoop network will do. That means that a Hadoop node only needs a processor, a network card and a few hard drives, and will cost around $4,000 (£2,500). An RDBMS might cost $10,000 to $14,000 per terabyte. Such a massive difference definitely explains why Hadoop is getting such strong attention, perhaps deservedly so.
Care must be taken, however, that all of those saved dollar signs don't create a siren effect and get businesses to rush willy-nilly into a Hadoop migration plan. The type of experience Hadoop system engineers and administrators need means that companies interested in building their own Hadoop deployment are likely to end up paying a big premium in personnel costs, whether the company deploys a commercial or free version of Hadoop.
In fact, the market for qualified Hadoop engineers has gotten so hot that two of the biggest Hadoop players, Google and Facebook, have gotten into multi-million dollar bidding wars over qualified Hadoop engineers.
No matter what your deployed software is, expect to pay big bucks for qualified Hadoop staff. Depending on your needs and location, that could be anywhere from $120,000 to $190,000 annually (not counting any stock and perks you may need to sweeten the deal). But is this enough to offset the savings in hardware and software?
To break down the cost of a completely free software deployment of Hadoop, presume 100 nodes at $4,000 each, amortised over three years, and one engineer paid $150,000 in salary. You get something like this:
- Hourly hardware cost (over three years): $15.21
- Hourly maintenance cost: $17.11
That comes out to an operational cost of about $32 per hour for the entire system, or about $283,320 total annually (excluding power).
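The arithmetic behind these figures can be reproduced with a short sketch. It assumes the hourly rates come from spreading costs over 8,766 hours per year (365.25 days), which is the hour count that matches the published figures to within rounding. The function and variable names are illustrative and not part of the original analysis.

```python
# Assumption: 365.25 days per year, i.e. 8,766 hours.
HOURS_PER_YEAR = 365.25 * 24

def hourly_costs(hardware_total, annual_salary, amortisation_years=3):
    """Return (hardware, staff, combined) cost per hour.

    Hardware is amortised evenly over `amortisation_years`;
    staff cost is one annual salary spread over a year.
    """
    hardware = hardware_total / (amortisation_years * HOURS_PER_YEAR)
    staff = annual_salary / HOURS_PER_YEAR
    return hardware, staff, hardware + staff

# Inputs quoted in the text
nodes = 100
node_price = 4_000
engineer_salary = 150_000

hw, staff, total = hourly_costs(nodes * node_price, engineer_salary)
annual = total * HOURS_PER_YEAR

print(f"hardware ${hw:.2f}/h, staff ${staff:.2f}/h, total ${total:.2f}/h")
print(f"annual operating cost: ${annual:,.0f}")
```

Computed without intermediate rounding, the annual total comes out slightly above the quoted $283,320, which appears to be derived from the rounded hourly rate.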
Now consider a similarly sized RDBMS system. In 2008, Oracle was pricing out a database machine with 168 TB of storage at $650,000 for the hardware and $1.68 million for the software, which puts this system right at the top of the $14,000/TB range. Presuming an annual Oracle database administrator's salary of $95,000, the operation costs break down to:
- Hourly hardware cost (over three years): $88.60
- Hourly maintenance cost: $10.27
Even though an Oracle administrator's salary is lower than the premium salary of a Hadoop engineer, the operational cost for such an Oracle system comes to $98.87 an hour, or $866,694 annually. That's a big difference: over three times the cost of a similar-sized Hadoop deployment.
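These Oracle figures can be checked with a short sketch, again assuming 8,766 hours per year and three-year amortisation of the upfront spend. Under that assumption, a $95,000 salary works out to about $10.84 an hour, while the quoted $10.27 corresponds to roughly $90,000, so the sketch treats the administrator's salary as a parameter and prints both cases. All names are illustrative.

```python
# Assumption: 365.25 days per year, i.e. 8,766 hours.
HOURS_PER_YEAR = 365.25 * 24

def annual_cost(upfront_total, annual_salary, amortisation_years=3):
    """Upfront spend amortised evenly per year, plus one salary."""
    return upfront_total / amortisation_years + annual_salary

# Figures quoted in the text
hadoop = annual_cost(100 * 4_000, 150_000)
oracle_upfront = 650_000 + 1_680_000  # hardware + software, 168 TB machine

for dba_salary in (90_000, 95_000):
    oracle = annual_cost(oracle_upfront, dba_salary)
    hourly = oracle / HOURS_PER_YEAR
    ratio = oracle / hadoop
    print(f"DBA ${dba_salary:,}: ${hourly:.2f}/h, "
          f"${oracle:,.0f}/yr, {ratio:.2f}x Hadoop")

print(f"price per TB: ${oracle_upfront / 168:,.0f}")
```

Either salary leaves the Oracle system at a little over three times the annual cost of the Hadoop deployment, so the conclusion does not hinge on which figure is used.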
Assuming the lower end of the RDBMS cost scale ($10,000/TB) doesn't improve things much. Plugging that number in gets you an annual cost of $644,827, still roughly 2.3 times the cost of a Hadoop distributed system.
These are operational costs, of course, and they don't factor in the migration costs, nor any costs for ongoing Hadoop support should you decide to use an outside vendor. But the dramatic difference in costs means that even when paying a Hadoop admin a premium, companies will still save a big chunk of change in the long run.
With lower hardware costs and such strong business advantages for any size organisation that wants to get the most out of their data, Hadoop's benefits are attracting a lot of attention in the enterprise and SMB spaces.