As real-time and batch analytics evolve using big data processing engines such as Hadoop, corporations will be able to track our activities, habits and locations with greater precision than ever before.
"It will change our existing notions of privacy. A surveillance society is not only inevitable, it's worse. It's irresistible," said Jeff Jonas, a distinguished engineer with IBM. Jonas spoke to a packed house of several hundred people Wednesday at the Structure Big Data 2011 conference.
For businesses, knowing where people are by using geo-locational data will help them personalise advertising and marketing materials over the web. For example, if a company knows a customer is in Aruba, it won't bother offering him or her advertising for restaurants in New York. Instead it may market sun-tanning lotion or scuba-diving excursions.
Knowing where people are can also help determine which potential customer is which. For example, if five people living in the US share the same name and date of birth but live in different cities, knowing their locations at a given time can verify their identities.
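The location check described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Jonas's or IBM's method: the `Record` type, its fields and the two helper functions are hypothetical names, and residence is simplified to one city per year.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    """A hypothetical customer record (all field names are assumptions)."""
    name: str
    birth_date: str
    # Maps a year to the city recorded for that person in that year.
    location_history: dict = field(default_factory=dict)


def conflicting_years(a, b):
    """Return the years in which the two records place a person in different cities."""
    shared = a.location_history.keys() & b.location_history.keys()
    return sorted(y for y in shared if a.location_history[y] != b.location_history[y])


def could_be_same_person(a, b):
    """Records sharing a name and birth date may match only if their histories never conflict."""
    if (a.name, a.birth_date) != (b.name, b.birth_date):
        return False
    return not conflicting_years(a, b)
```

Two records for the same name and birth date that show different cities in the same year are treated as different people; records whose histories never disagree remain candidates for a match, which a real system would then weigh against other evidence.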
"Just look at the last 10 years of address histories ... it is very telling if this is the same person or not," Jonas said. "Two different things cannot occupy the same space at the same time."
Jonas said 600 billion electronic transactions are created in the US every day, much of it geo-locational data generated by cell phones, whose signals, triangulated through cellular towers, pinpoint a person's location at any time. Wireless providers have that data in real time.
By looking at data over years, corporations can know how you spend your time, where you work and whom you're typically with.
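As a rough illustration of how years of location data turn into knowledge of someone's habits, the Python sketch below counts where a person has been seen at a given weekday and hour and reports the most frequent place along with its share of the sightings. It is a deliberately naive frequency count under assumed inputs, not the technique any company quoted here uses; `predict_location` and the shape of `observations` are hypothetical.

```python
from collections import Counter
from datetime import datetime


def predict_location(observations, weekday, hour):
    """Guess where someone will be at a given weekday (0 = Monday) and hour.

    observations is an iterable of (datetime, place) pairs, an assumed format.
    Returns (place, share_of_matching_sightings), or None if there is no history.
    """
    counts = Counter(
        place
        for seen_at, place in observations
        if seen_at.weekday() == weekday and seen_at.hour == hour
    )
    if not counts:
        return None
    place, hits = counts.most_common(1)[0]
    return place, hits / sum(counts.values())
```

The returned share is only the fraction of past sightings in that time slot, so it becomes meaningful only with a long history.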
"This is super food [for big data analytics]," Jonas said. "With 87 percent certainty, I can tell you where you'll be next Thursday at 5:35pm."
Big data, an industry term that refers to large data warehouses, includes machine- and human-generated data such as computer system log files, financial services electronic transactions, web search streams, email metadata, search engine queries and social networking activity. In 2010 alone, 1.5 zettabytes of this data was created, most of which was machine-generated. Corporations filled their data centre storage systems with about 16 exabytes of that data last year, according to Jason Hoffman, founder and chief scientist at cloud software provider Joyent.
Bill McColl, CEO of analytics engine vendor Cloudscale, said up until now, big data analytics has been about off-line queries or "MapReduce" algorithms, which were developed by Google. But 90 percent of corporate data warehouse users say they want to move forward into a world with real-time analytics.
"Companies know if they can extract more insight from data faster than their competitors, they're going to win," McColl said.
Jim Baum, founder and CEO of Netezza, the maker of a massively parallel processing (MPP) data warehouse appliance, agreed with McColl. Baum argued that if a corporate user has to wait even three days to get an answer to an analytics query, the user won't bother asking a follow-on question that could mean gaining the real value of the information.
"If I can get an answer in real time, I will ask the next question and next question and that'll be followed by another. Getting answers in near real time is critical. It's the enabler of what we can do with big data," said Baum, whose company was purchased by IBM last year. IBM's Netezza buyout was among a flurry of big data analytics vendor acquisitions over the past year, including EMC's purchase of Greenplum, Hewlett-Packard's purchase of Vertica and Teradata's planned purchase of Aster Data Systems.
Todd Papaioannou, vice-president of cloud architecture at Yahoo, said instead of thinking about big data analytics as the empowerment of corporate Big Brother, consumers should consider it as an enabler of a more personalised web experience.
"If someone can deliver a more compelling, relevant experience for me as a consumer, then I don't mind it so much," he said.
Yahoo on Wednesday launched an upgraded search engine called Search Direct. Similar to Google's Instant, Search Direct delivers richer content to users based on search history. For example, when a user types the words "New York" into the search window, the most popular searches that include those two words instantly come to the top of the list, even before the user finishes entering the term; in this case, "New York Times".
Marc Parrish, vice-president of retention and loyalty marketing for bookseller Barnes & Noble, said machine-generated data growth has "exploded" since electronic book sales have taken off. For example, ebook sales outstripped hardback book sales on Amazon.com last year.
"Our web logs on how customers are using e-readers and ebooks ... has produced 35TB of data and will load us up with another 20TB this year," Parrish said.
With that data, the store can determine buyer behaviour, such as what percentage of customers purchase books based on their favourite author.
"We have to decide with analytics on hand how we capture the customer's imagination and how we move forward," he said.
Other companies are using big data analytics to track the use of content on their websites in order to better tailor it to users' tastes.
Sondra Russell, a metrics analyst with National Public Radio (NPR), said she needed a way to track website audience use trends in near real time. NPR offers podcasts, live streams, on-demand streams and other radio content on its website. Her organisation had been using web analytics engine Omniture, but it felt like she was trying to jam log-based data into a client-side tracking system that couldn't handle the volume.
Russell said NPR experienced query delays that at best were six to 12 hours long and at worst, weeks long. The organisation finally switched to Splunk's reporting tool, which crawls logs, metrics and other application, server and network data and indexes it in a searchable repository.
"I just want to know how many times someone listened to a program during a certain period of time," she said. "With Splunk I had no delays between data appearing in a query folder and data appearing in reports. I can get any number of graphs without weeks of prep time."
IBM's Jonas compared big data to puzzle pieces, saying until you take them to the table top and begin assembling them, you don't know what you have. That's where Hadoop, Cassandra and other analytics engines come in. Hadoop is an open source framework that pairs a distributed file system with an implementation of MapReduce, the programming model developed at Google, which allows large-scale computations (batch processing) to be performed across large server clusters in parallel. The computations can be performed on user or machine-generated data, whether structured or unstructured. But Hadoop works best on unstructured random data sets, allowing analytics engines to increase the speed at which they gather information from queries.
MapReduce systems differ from traditional databases in that they can quickly presort data in a batch process, regardless of the type of data: file or block. They can also interface with any number of languages, including C++, C#, Java, Perl, Python and Ruby. Once the data is sorted, a more specific analytical application is required to perform specific queries. Traditional databases can be considerably slower, requiring table-by-table analysis. They also do not scale nearly as well.
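For readers unfamiliar with the MapReduce model, the Python sketch below walks through its three stages (map, shuffle and reduce) on a word-count job. It runs in a single process and does not use Hadoop or any Hadoop API; the function names are illustrative assumptions. A real cluster would spread the map and reduce calls across many servers.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would between stages."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values collected for one key into a single total."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle and reduce in sequence and return word counts as a dict."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
```

Because each map call looks at only one line and each reduce call at only one key, the work can be split across machines without the pieces needing to coordinate, which is what lets the batch jobs described above scale across large clusters.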
For example, Alfred Spector, vice-president of research and special initiatives at Google, said it's not inconceivable that a cluster of servers could someday include 16 million processors creating one MPP data warehouse.
"It doesn't seem there are any limitations other than good engineering required to get there," he said. "Moore's law or not, we have essentially unlimited computation available."
Spector sees a day when distributed computing systems will offer web developers what he calls "totally transparent processing," where the big data analytics engines learn over time to, say, translate file or block data into whatever language is preferred based on a user's profile history, or act as a moderator on a website, identifying spam and culling it out. "We want to make these capabilities available to users through a prediction API. You can provide data sets and train machine algorithms on those data sets," he said.
Yahoo has been the largest contributor to Hadoop to date, creating about 70 percent of the code. The search site uses it across all lines of its businesses and has standardised on Apache Hadoop, favouring its open source attributes.
Papaioannou said Yahoo has 43,000 servers, many of which are configured in Hadoop clusters. By the end of the year, he expects his server farms to have grown to 60,000 because the site is generating 50TB of data per day and has stored more than 200 petabytes.
"Our approach is not to throw any data away," he said.
That is exactly what other corporations are hoping to accomplish: to use every piece of data to their businesses' advantage so that nothing goes to waste.