First off, it is important to state that Hadoop and Spark are broadly different technologies with different use cases. The Apache Software Foundation, from which both projects emerged, even places the two in different categories: Hadoop is listed as a database, Spark as a big data tool.
In Apache's own words Hadoop is "a distributed computing platform," or, "a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer."
Spark on the other hand is "a fast and general engine for large-scale data processing. It offers high-level APIs in Java, Scala and Python as well as a rich set of libraries including stream processing, machine learning, and graph analytics."
In a large majority of cases, when someone talks about Hadoop they mean the Hadoop Distributed File System (HDFS), which is "a distributed file system that provides high-throughput access to application data." The reason Spark often enters the conversation here is that it was built to read and write data from and to HDFS, as well as other storage systems such as HBase and AWS S3.
Then there is Hadoop YARN, a job scheduling and cluster resource management tool, and Hadoop MapReduce, a system for parallel processing of large data sets. YARN in particular makes it significantly easier to run Spark on top of HDFS than running Spark in standalone mode.
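To illustrate what that looks like in practice, the same Spark application can be pointed at either resource manager through spark-submit's --master flag. This is a minimal sketch; the standalone master address and the application file name are hypothetical.

```shell
# Run on a Hadoop cluster, letting YARN schedule the executors
# (YARN's addresses come from the Hadoop configuration on the machine):
spark-submit --master yarn --deploy-mode cluster my_job.py

# Run against a standalone Spark master instead (hypothetical host):
spark-submit --master spark://master-host:7077 my_job.py
```

With YARN, Spark shares the same cluster and resource pool as the rest of the Hadoop stack; standalone mode means operating a separate Spark cluster alongside it.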
So how do they come together?
Both are big data frameworks. Basically, if you're a company with a fast-growing pool of data, Hadoop is an open-source software that will allow you to store this data in a reliable and secure way. Spark is a tool for understanding that data. If Hadoop is the Bible written in a language you can't understand, Spark is a dictionary and phrasebook.
The choice really comes down to what you want to do with your data and the skill set of your IT staff. Once your data is in Hadoop there are lots of ways to extract value from it. You can go down the standard analytics route of plugging a tool into the data lake for data cleansing, querying and visualisation.
Big players in the analytics and business intelligence market like Splunk offer Hadoop integrations and solutions, and data-visualisation firms like Tableau will let you present this data back to non-data people.
Spark on the other hand is useful if you are looking to run high velocity analytics on top of real-time data, with the intention that someone, or an algorithm, is going to make decisions off the back of that data.
So if your data is simply a large amount of structured data, such as a database of medical records, then the streaming capabilities of Spark aren't strictly necessary.
As Perry Krug, principal architect for strategic accounts at Couchbase told Computerworld UK: "Hadoop alone has struggled to find real business value. Spark is a very useful tool that people are using in production and getting real value out of the box."
Pros and cons
Redundancy: One major benefit of Hadoop is that, being a distributed platform, it is resilient to individual machine failures, keeping your underlying data always available. This is why it is the storage platform of choice for many webscale companies, because the internet never sleeps.
Cost: Hadoop and Spark are both projects from the Apache Software Foundation, so they are free and open source. The price comes from how you implement them - the total cost of ownership - that is, the time and resources implementation requires, given the skills involved and the hardware it runs on.
The licence model of traditional database providers like Oracle and SAP has long been the bane of many CIOs' existence, so the Software-as-a-Service model provided by most of the Hadoop/Spark specialists gives greater flexibility while you figure out whether the technology is useful.
Speed: Spark is reported to run up to 100 times faster than Hadoop MapReduce, according to the Apache Software Foundation. This is because Spark works in-memory rather than reading and writing to and from disk. MapReduce reads data from the cluster, performs an operation and writes the results back to the cluster before the next step can begin, which takes time, whereas Spark keeps intermediate results in memory throughout.
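The MapReduce pattern itself is easy to sketch. The pure-Python toy below is not Hadoop code, just the programming model: a map step emits (word, 1) pairs from each input split, a shuffle step groups the pairs by word, and a reduce step sums each group. In a real MapReduce job, each of those stages is written to and read back from disk, which is exactly the overhead Spark avoids by holding intermediate results in memory.

```python
from collections import defaultdict

# Three input "splits", standing in for blocks of a file on HDFS.
splits = ["to be or not to be", "be quick", "or not"]

# Map: each split independently emits (word, 1) pairs.
mapped = [(word, 1) for split in splits for word in split.split()]

# Shuffle: group the pairs by key (the word). On a real cluster this
# step moves data across the network and, in MapReduce, through disk.
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce: sum the counts for each word.
counts = {word: sum(values) for word, values in groups.items()}

print(counts)  # {'to': 2, 'be': 3, 'or': 2, 'not': 2, 'quick': 1}
```

Chain several such jobs together - as real analytics pipelines do - and the repeated disk round-trips between stages are where MapReduce loses time to Spark.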
Generality: Spark can load data from various data sources, giving teams lots of flexibility, be it MySQL, Amazon S3, HDFS or Couchbase.
Skills: Whatever the vendors tell you, Spark is not an easy tool to use. It is intended for data analysts and experts and is generally applied to deeply complex and constantly changing streaming data sets.
Gartner's latest market guide for Hadoop distributions found that organisations were increasingly turning to cloud-based Hadoop deployments.
The three analysts - Nick Heudecker, Merv Adrian and Ankush Jain - found that inquiries about cloud deployments were up 25 percent year over year. This mainly landed in the laps of the major cloud IaaS vendors, with Microsoft Azure inquiries up 57 percent year over year, and AWS up 171 percent over 2015.
This shift to managed cloud services is being driven by the aforementioned skills gap for Hadoop, as it helps teams avoid having to manually upgrade on-premise clusters. By deploying in the cloud organisations can get up and running with Hadoop far quicker than on-premise.
This change has seen traditional pure-play Hadoop vendors like Cloudera and Hortonworks price their offerings on a more cloud-friendly consumption basis.
Despite all of this, the proportion of Hadoop deployments actually going into production remains stubbornly low. Gartner estimates that roughly just 14 percent of Hadoop deployments are in production. This is up from a measly 10 percent in February 2015.
The analyst house concludes: "Hadoop remains complementary to data warehousing environments, rather than a replacement for it. More current successes are as small-footprint clusters supporting line-of-business analytics applications with small user bases adept at data preparation and manipulation."
General use cases
Due to its ability to store ever-growing volumes of data, some classic Hadoop use cases include a 360-degree view of your customers, recommendation engines for retailers, and security and risk management.
Spark also has retailers and Internet of Things (IoT) companies interested because of its ability to conduct real-time, interactive data analytics to deliver greater personalisation on the fly.
According to MongoDB's former VP of strategy Kelly Stirman, Spark's growing popularity is due to its compatibility with one important use case: machine learning.
Stirman told Computerworld UK: "Ten years into Hadoop and the hallmarks are still promising, but most people have found it hard to use and not well suited to artificial intelligence and machine learning."
In a broad sense, a Hadoop vendor like Hortonworks claims to work with 55 of the top 100 financial services companies and 75 of the top 100 retailers. Actual use cases are harder to come by though, perhaps because the technology isn't as mature as the vendors would lead us to believe, or because customers still see the technology as part of their secret sauce.
DataStax customer British Gas Connected Homes spoke to Computerworld UK about how it is using Spark and Apache Cassandra to deliver real-time usage statistics to its customers from its smart home devices.
Former head of data and analytics at Connected Homes Jim Anning told Computerworld UK at the time: "We always knew we were doing IoT and we know that the number of connected devices is only going to rise. Those sensors are collecting data all the time. For example our temperature sensor is delivering data every couple of minutes. Scaling that process with a traditional, relational database just wasn't going to cut it."
Innovative electric car maker Tesla uses Hadoop for its connected car data, travel booking company Expedia has been moving its data into a Hadoop environment as it continues to scale, and British Airways is a big exponent of Hadoop for data storage and analytics.
Financial news company Bloomberg has also been shifting towards open source data solutions over the past few years, with head of data science Gideon Mann telling Computerworld UK: "With big data processing, over the past five to ten years the impact of Hadoop and now Spark has given us a whole new set of tools, and we are investing heavily in both of those.
"There was a time we were involved heavily with HBase but we are very aggressive with Spark right now. I don't know if we are an early adopter but we are certainly all in."
Hadoop implementation can be undertaken in-house (Apache provides all of the documentation required), or you can have one of the vendors below conduct an enterprise deployment for you, complete with support. Spark is a similar situation: either do it yourself or turn to one of the vendors to run it as-a-service, such as Hortonworks' Spark at Scale, Cloudera or MapR.
As of 2016, Gartner has the principal Hadoop distribution vendors as: Amazon Web Services (AWS), Hortonworks, Cloudera and MapR. They have been more recently joined by some big names in Oracle and Microsoft, as well as SAP, which has entered the market via its acquisition of Altiscale to offer Hadoop-as-a-Service. Gartner estimates that AWS is the largest Hadoop provider by both revenue and user base.
Other vendors offering commercial versions of Hadoop include: IBM, Pivotal and Transwarp. Vendors like Couchbase, MongoDB, DataStax and MemSQL offer Spark built upon competing data management platforms.
Despite its relative maturity, compared to Spark, Hadoop still isn't delivering the kind of transformative results many vendors will claim. According to Gartner: "Through 2018, 70 percent of Hadoop deployments will fail to meet cost savings and revenue generation objectives due to skills and integration challenges."
The answer? "Match projects to specific business requirements and identify the existence and readiness of supported technology components suitable for them," says Gartner.
Spark on the other hand has the potential to be truly transformative for the right kind of companies with the relevant expertise. As Gartner puts it: "Apache Spark emerged as a force as potentially disruptive to Hadoop as Hadoop was to traditional database management systems."
Despite the headline, this isn't a straight fight between the two open source solutions: they can be highly complementary technologies within the enterprise or, depending on the nature of your data, one may be better suited than the other. The advantage is that you don't have to lock into a single vendor, so with the right team, getting a proof of concept off the ground is easier than ever.