The age of intelligence - for software at least

Despite the seemingly never-ending noise coming out of reality show contestants and politicians these days, we are, in fact, in a new era of intelligence - for software.

That’s right: in case you haven’t noticed, the machines are getting smarter. Not Arnold the Cyborg from the future smarter, thankfully, but smarter as in they often know what we are looking for, what we want to buy and when we are doing bad things.

Don’t believe me? Take a look around at the top players on the web these days, sites and services like Google, Amazon, Facebook, Yahoo! and Twitter, to name a few. They all incorporate software that learns and adapts into their services in order to provide you, the customer, with a better experience. In fact, the beauty (or horror) of these sites is that you often don’t even know they are doing it.

If consumer-driven sites aren’t enough for you, spend a day reading a few Requests For Proposals put out by high-tech government organisations or talk with graduate students doing cutting-edge research in the space.

It will show that the era of practical, everyday intelligent software sifting through massive amounts of data on massive clusters of computers is here. Are you ready?

Designing, building and incorporating intelligent software isn’t just for PhDs in Artificial Intelligence anymore; it’s for you and me. More importantly, it’s for the masses. Why? Intelligent software isn’t just about buzzwords like machine learning, big data and collective intelligence; it’s about making the user experience better.

It’s about software that can adapt to meet the needs of both the technocrats and the technophobes. It’s software that takes in the knowledge of the masses and mashes it up with the personal and produces an experience just for you.

OK, OK, I know the software isn’t really intelligent. After all, the machines are just doing what we tell them to do and they still can’t pass the Turing Test. That being said, they can do many of the things that we associate with intelligence such as find patterns in a sea of noise and predict future occurrences based on past behaviours.

They can parse human language, answer questions, organise your email (spam anyone?) and provide you with recommendations for your spouse’s birthday. Not only that, but they can do these things at scales not obtainable by humans.

“That’s all well and good”, you might say, “but what’s this got to do with me and my organisation?” For good measure, you might also ask “What’s this got to do with the Apache Software Foundation (ASF) and Open Source?”

After all, you can’t afford to build intelligent software from the ground up and the last time you checked, the ASF was building web servers, right? On the first count, the answer is you don’t have to build it from the ground up. You don’t have to because, to spin a common phrase around, this isn’t your predecessor’s ASF, and open source isn’t just about saving you money on operating system licenses. (And, yes, the ASF still does produce the web server that powers the web, so it is your predecessor’s ASF too!)

Just as open source built the modern-day web via things like Linux, MySQL, Apache HTTP Server, Apache Tomcat and the like, the ASF is laying the foundations for intelligent software through projects like Apache Hadoop (large-scale distributed computing APIs and file systems), Lucene and Solr (scalable search), UIMA (unstructured information analysis) and Mahout (scalable machine learning libraries for recommendations, clustering, classification and transaction analysis).

Throw in other open source projects, just to mention a few, like the R Project for Statistical Computing, GATE, Pentaho, JasperSoft and OpenNLP and you have the foundation to build intelligent applications without the need to go off and hire a whole lab full of PhDs and buy a whole bunch of really expensive hardware. (You still will want at least a few people with significant experience in the area to lead the team, but there are great resources and open source businesses available to help. You also are going to need some hardware, but luckily the cheap commodity stuff will do.)

Digging in a bit, what does this intelligent stack look like? It starts with two modes, one online and one offline. The online mode is responsible for interacting with the user, their profile, their history and their likes and dislikes and mashes it up with the results of the offline mode to produce relevant information for the user.
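To make that mash-up a little more concrete, here is a minimal sketch in plain Python. It assumes the offline mode has already produced a table of related items; the `RELATED` table, the item names and the `recommend` function are all made up for illustration, and none of this is the Mahout or Solr API. The online side simply scores unseen items by how strongly they relate to what the user liked.

```python
from collections import defaultdict

# Hypothetical output of the offline mode: for each item, related items
# with a strength score. Names and numbers are invented for illustration.
RELATED = {
    "hadoop-guide": {"mahout-intro": 0.9, "solr-cookbook": 0.4},
    "solr-cookbook": {"lucene-internals": 0.8, "hadoop-guide": 0.4},
}


def recommend(liked, disliked, related=RELATED, limit=3):
    """Rank unseen items by summed relatedness to the items a user liked."""
    # Anything the user already reacted to is never recommended again.
    seen = set(liked) | set(disliked)
    scores = defaultdict(float)
    for item in liked:
        for other, strength in related.get(item, {}).items():
            if other not in seen:
                scores[other] += strength
    # Highest score first; ties broken alphabetically for stable output.
    ranked = sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))
    return [item for item, _ in ranked[:limit]]


if __name__ == "__main__":
    print(recommend(liked=["hadoop-guide"], disliked=[]))
```

A real system would likely blend these scores with search relevance and refresh the table whenever the offline side publishes new results; this sketch only shows the shape of the hand-off.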

With my Apache and search hats on, the online pieces are powerful search capabilities like Solr (which uses Lucene) combined with Mahout’s recommendation engine (sometimes called collaborative filtering), combined with sophisticated business intelligence analysis and reports that are based on the results from dynamic user interactions (i.e. search and discovery) with the application. The offline mode, meanwhile, does the data crunching.

It takes in the massive amounts of data your organisation or your customers produce on a daily, hourly and minute-by-minute basis and feeds them into the likes of Mahout, R and UIMA, using Hadoop and related technologies where necessary to scale. For instance, an offline workflow might take all of your content along with all of your users’ interactions with that content and do things like:

  1. Organise the content according to your company’s knowledge centres and workflows as well as the concepts in the content.
  2. Calculate relationships between users, between content and between users and content to dynamically engage people with content across networks in ways never before possible.
  3. Feed all of this information into the online engines on a regular basis so that people can not only find the content they are looking for but can also create synergies with others that move whole organisations and communities forward.
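The three steps above can be sketched on a single machine in a few lines of Python. This is only a toy under stated assumptions: content arrives already tagged with concepts, interactions are simple (user, item) pairs, and "relationship" means plain co-occurrence counting. The function names are hypothetical, and a production job would hand this work to the likes of Mahout and Hadoop rather than run it in memory.

```python
from collections import defaultdict
from itertools import combinations


def group_by_concept(tags_by_item):
    """Step 1: invert item -> tags into concept -> sorted list of items."""
    groups = defaultdict(list)
    for item, tags in tags_by_item.items():
        for tag in tags:
            groups[tag].append(item)
    return {tag: sorted(items) for tag, items in groups.items()}


def cooccurrence(interactions):
    """Step 2: count how often two items were touched by the same user."""
    items_by_user = defaultdict(set)
    for user, item in interactions:
        items_by_user[user].add(item)
    counts = defaultdict(lambda: defaultdict(int))
    for items in items_by_user.values():
        # Every unordered pair of items a user touched counts once.
        for a, b in combinations(sorted(items), 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return {item: dict(others) for item, others in counts.items()}


def build_snapshot(tags_by_item, interactions):
    """Step 3: bundle both results into one structure for the online side."""
    return {
        "concepts": group_by_concept(tags_by_item),
        "related": cooccurrence(interactions),
    }


if __name__ == "__main__":
    tags = {"a": ["search"], "b": ["search", "ml"], "c": ["ml"]}
    log = [("ann", "a"), ("ann", "b"), ("bob", "a"), ("bob", "b"), ("bob", "c")]
    print(build_snapshot(tags, log))
```

Rerunning the job on a schedule with the latest interaction log is one simple way to keep the snapshot current.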

The offline mode also adapts to new interactions and new trends because it is constantly fed data on how people are interacting with the application.

That particular view of the stack is, of course, just one view of the landscape. Many of these tools can also be used to produce a deeper understanding of manufacturing processes, power systems and a whole host of other information.

Finally, the point of all this isn’t to replace humans, but to augment them. The best systems are “human-in-the-loop”. They do the heavy lifting and the mundane and we combine it with our qualities to make it worthwhile. Now, to get to work on writing that “Intelligent Politician” software…

Blog post by Grant Ingersoll, Project Management Committee (PMC) Chair of Apache Lucene, committer on Apache Lucene, Solr and Mahout, and PMC member on the Mahout project.

Grant is co-founder of Lucid Imagination, and works in the areas of search, machine learning, and natural language processing. He is the co-author of the upcoming "Taming Text", as well as several articles on search and machine learning. Grant is the Track Chair for Lucene, Mahout + Friends/Search and will be speaking on how to build intelligent applications with Lucene and Mahout at ApacheCon NA, 1-5 November 2010.