Apache Spark's marriage to Hadoop will be bigger than Kim and Kanye

Apache Spark is an open source cluster computing platform designed to process big data as efficiently as possible. Sound familiar? That's what Hadoop is designed to do. However, these are distinctly different, but complementary, platforms.

Hadoop is designed to process large volumes of data that live in the Hadoop Distributed File System (HDFS). Spark is also designed to process large volumes of data, but much more efficiently than MapReduce, in part by caching data in memory. But to say that Spark is just an in-memory data processing platform is a gross oversimplification and a common misconception. It also has a unique development framework that makes data processing jobs simpler to write and more efficient to run. You'll often hear Hadoop and Spark mentioned in the same breath.

That's because, although they are independent platforms in their own right, they have an evolving, symbiotic relationship. Application development and delivery (AD&D) professionals must understand the key differences and synergies within this next-generation cluster-computing power couple to make informed decisions about their big data strategy and investments. Forrester clients can read the full report explaining the differences and synergies here: Apache Spark Is Powerful And Promising
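
To make the in-memory caching point a little more concrete, here is a toy sketch in plain Python. It is not Spark's API: `ToyDataset`, `slow_storage_read`, and `run_two_jobs` are hypothetical names invented for illustration, and the "storage" is just a function that counts how often it is called. The idea it tries to show is that when two jobs share the same intermediate data, keeping that data in memory means the underlying storage is read once rather than once per job.

```python
# Toy illustration only: this is NOT Spark's API. ToyDataset and the helper
# functions below are hypothetical names invented for this sketch.
from typing import Callable, Iterable, List, Optional


class ToyDataset:
    """A lazily evaluated dataset that can optionally keep its rows in memory."""

    def __init__(self, compute: Callable[[], Iterable]):
        self._compute = compute          # recipe for (re)building the rows
        self._cache_enabled = False
        self._cached: Optional[List] = None

    def map(self, fn: Callable) -> "ToyDataset":
        # Lazy: nothing runs until collect() is called on a descendant.
        return ToyDataset(lambda: (fn(x) for x in self._rows()))

    def filter(self, pred: Callable) -> "ToyDataset":
        return ToyDataset(lambda: (x for x in self._rows() if pred(x)))

    def cache(self) -> "ToyDataset":
        # Ask for the rows to be kept in memory after the first computation.
        self._cache_enabled = True
        return self

    def _rows(self) -> Iterable:
        if self._cached is not None:
            return self._cached          # served from memory, no recomputation
        rows = self._compute()
        if self._cache_enabled:
            self._cached = list(rows)
            return self._cached
        return rows

    def collect(self) -> List:
        return list(self._rows())


reads = {"count": 0}


def slow_storage_read() -> List[str]:
    """Stand-in for an expensive read from distributed storage."""
    reads["count"] += 1
    return ["spark hadoop", "hadoop hdfs", "spark cache"]


def run_two_jobs(use_cache: bool):
    """Run two jobs over the same tokenized lines; report the storage reads."""
    reads["count"] = 0
    words = ToyDataset(slow_storage_read).map(str.split)
    if use_cache:
        words.cache()
    spark_lines = words.filter(lambda ws: "spark" in ws).collect()
    hadoop_lines = words.filter(lambda ws: "hadoop" in ws).collect()
    return len(spark_lines), len(hadoop_lines), reads["count"]
```

Calling `run_two_jobs(False)` and `run_two_jobs(True)` gives the same line counts, but the last element of the result (the number of storage reads) drops when caching is on. A real cluster adds partitioning, fault tolerance, and scheduling on top of this, so treat the sketch as intuition, not as a description of how Spark is implemented.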

Spark and Hadoop - a marriage of celebrities

Don't believe the technology charlatans who tell you Spark will replace Hadoop. That's poppycock, for now. Today, Spark and Hadoop are meant to be together.

The last thing an enterprise needs is yet another cluster to manage. Hadoop and Spark can coexist on the same cluster to provide that killer big data combination of volume and speed for data processing. As in many marriages, some habits will have to be negotiated. Many data processing jobs that were originally written in Hadoop MapReduce will be rewritten for Spark. In the near future, Spark will become the primary API against which data stored in HDFS will be processed.

Will it last?

Many marriages don’t, especially celebrity ones. There are two possibilities that could cause an irretrievable breakdown between Hadoop and Spark:

  • The Spark community builds its own Hadoop-less ecosystem. Remember that Spark does not require Hadoop to run. The Spark community, led by Databricks, the commercial company formed by the founders of Spark, could develop and push its own filesystem and other technologies that make it an independent ecosystem.
  • The Hadoop community creates its own Spark-like features. There is no reason that the Hadoop open source community and formidable commercial vendors such as Hortonworks, Cloudera, and MapR could not develop technology that competes with the Spark benefits described in the Forrester report.

Judging from the lofty amounts of venture capital that have been invested in big data technologies, the profit motive certainly exists for both of these possibilities to happen. Stay tuned. In the meantime, enterprises can’t wait. It will be full steam ahead for both Hadoop and Spark.


Posted by Mike Gualtieri
