Open Enterprise Interview: Mike Olson, Cloudera


Yesterday, I wrote about the launch of the open source company Cloudera.

It's always hard to tell whether startups will flourish, but among the most critical factors for survival are the skills of the management team. I sent some questions about Cloudera to Mike Olson, one of the company's founders, and had the answers back less than three hours later, which would seem to augur well in this respect.

Olson explains the background to the company and to Hadoop, the software it is based on: what it does, and why businesses might want to use it. He also talks about his company's services and business model, and why he thinks cloud computing is neither a threat nor an opportunity for open source.

GM: What was your background before starting Cloudera?

MO: I started college at Berkeley in 1979. In 1980, I got a work-study job for the Computer Systems Research Group, which produced the Berkeley Software Distribution -- what most people call "Berkeley Unix". I worked for Bill Joy and Bob Fabry on that project (and the guy in the next office was a kind of geeky grad student named Eric Schmidt...).

I dropped out for a few years in 1982 and returned to Berkeley as an undergrad in 1988. I got a job working for Mike Stonebraker on the Postgres project, where I was one of the key developers. I got my BA in 1991, MSc in 1992 and left grad school in 1993 to join Stonebraker's Postgres startup, Illustra. One of the last things I did at Berkeley was to write, with Margo Seltzer, an embedded database engine called Berkeley DB.

After Illustra, I spent a year in biotech. I joined Margo's startup, Sleepycat, on Jan 1, 1998, to do sales, and became CEO in 2001. Grew that company - profitable, open source model, no outside investment - through 2005 and sold in early 2006 to Oracle.

GM: What's the history of the Hadoop project?

MO: Google developed its MapReduce technology beginning in the very early part of this decade in order to process the huge amounts of data it collected in its web crawls. Google, like other web properties that operate at scale - Amazon, Facebook, Yahoo and others - builds its infrastructure from cheap servers with lots of local storage. The MapReduce software was designed to run well and to scale easily on unreliable hardware.

MapReduce is able to survive disk and server failures without interrupting jobs and without losing data.

Google published a widely read paper on MapReduce in 2004. Doug Cutting, the engineer behind the Lucene and Nutch open source projects, read the paper and decided he would like to build an open source implementation of MapReduce.

He was joined by Mike Cafarella at the University of Washington, and the two of them produced the very earliest versions of Hadoop. (Hadoop is named for the stuffed elephant that belongs to Doug's child.) Yahoo! hired Doug early and invested substantially in the development of Hadoop. Like Facebook and others, Yahoo! wanted the power of MapReduce to operate on the very large amounts of data it needed to process, and believed that an open source project would encourage participation and spread the power of Hadoop most broadly.