Back in 2003, a small group of developers started working to turn their vision of an open source web search engine into reality. With Apache Lucene as the core indexing component, they added web-crawling capabilities to allow automatic discovery of web pages.
A web database was added, capable of computing, storing, and analysing the graph of all discovered web pages. Plugins for parsing content in various file formats followed.
Soon thereafter the project, by then named Nutch, entered the Apache Incubator. With a web search engine as the project's major goal, the need for distributed storage emerged, leading to work on a "Nutch Distributed File System", or NDFS for short.
To enable distributed processing on top of this file system, a MapReduce engine was implemented. Both components raised enough interest to warrant being separated out into their own project, called Apache Hadoop, comprising the Hadoop Distributed File System (HDFS) and MapReduce.
What happened after that separation is most intriguing: at the time, the technology Hadoop provided filled a gap no one else could fill. The combination of distributed computations running on clusters of commodity hardware, glued together by software that incurs no licensing fees, proved compelling even to teams at Yahoo!.
They decided against an internal fork of the project and against a re-implementation of the ideas backing Hadoop. Instead, they leveraged the project's full potential by getting involved in the official project, using what was already there and contributing vital changes and improvements back.
Fast forward a few years and we find large corporations relying on Hadoop: Facebook uses the system both for hosting their online services (e.g. Facebook messaging) and for their backend analytics engine. Several additions and related projects have been released under open source licenses and contributed as Apache projects.
Smaller companies benefit as well: the German targeted-advertising firm nugg.ad successfully uses the technology for its system.
Start-ups like Datameer, Cloudera, MapR, Hortonworks and others were quickly founded, offering not only consultancy and training but also additional components that make Hadoop easier to use for analysts, simpler to deploy, or faster. Recently even Microsoft announced plans to integrate Hadoop and to contribute changes back to the project.
In addition to avoiding vendor lock-in by using software available under an open source license, getting active provides added opportunities: not only can one learn more about the direction a project is taking, but by providing patches and infrastructure work developers can earn the merit to drive the project.
Projects at Apache usually exhibit a diverse committer base. By working together in teams that cross corporate boundaries, nations and continents, they provide a unique chance to bring together the best talent in one technological area to solve a specific technical problem: in the case of Hadoop, providing a platform for distributed data analysis.
One of the greatest advantages in contributing to open source is the ability to collaborate with a diverse, enthusiastic community. While it may be difficult at times to find the right opportunity to jump in and develop code, contributing in other ways such as writing documentation, blogs, FAQs, and articles on a project provides important value by deepening understanding and interest.
Posted by Isabel Drost
Isabel will be presenting several sessions, including "Talking People into Creating Patches", at ApacheCon, 7-11 November 2011, in Vancouver, Canada. To register, visit http://apachecon.com/
Biography: Isabel Drost co-founded the Lucene sub-project Apache Mahout. She is employed at Nokia in Berlin. After an internship at Google in 2005/06, Isabel joined Neofonie GmbH, where she worked as a research assistant in Berlin. Isabel holds a master's degree in computer science from the University of Applied Sciences Mittweida.