I recently had the chance to learn more about Entity Analytics and Big Data from one of the top industry thought leaders, Jeff Jonas of IBM.
This opportunity came when Marcel Jemio, Chair of the Fiscal Service Data Stewards at the US Treasury Dept. (and a Forrester client), invited me to a presentation Jeff gave at a special internal event at the Fiscal Service in Washington, D.C. So of course I leapt at the opportunity! Marcel opened the session with an overview of why Treasury is interested in data and analytics: Treasury is charged with helping the nation guard against the kind of national or global financial collapse that triggered the 2007-2009 recession. Therefore it's crucial that the stewards of the nation's financial data, like Marcel and his colleagues, continuously improve the insights we gain from this data.
This data is more connected and interoperable all the time, across multiple public and private sector organisations with common goals. Making key insights from this data available more openly, but securely, increases transparency and visibility of potential issues to key decision makers in government and commercial enterprises. But linking all this related data to gain deep insights into integrated information requires the Fiscal Service to leverage global industry data standards. If you can't link and reuse data, it's much less valuable!
A Lifetime Spent Linking Data Together
Jeff invented a really cool way to link information called NORA (Non-Obvious Relationship Awareness), and he's implemented it multiple times on different platforms in different eras since the first version in the mid-80s, through the course of his career as an innovator, scientist, and entrepreneur. Jeff also found time to start Systems Research & Development (SRD), which IBM acquired in 2005; that is how Jeff ended up as IBM Fellow and Chief Scientist of the IBM Entity Analytics Group at Watson labs.
Vegas casinos used a version of NORA to bring down that MIT card-counting team led by Bill Kaplan, through surveillance of live Blackjack play. Casinos routinely aim to connect eighteen or more different lists of people who are known to have defrauded or otherwise attempted to take advantage of the casinos, and they do this through the kind of entity matching that NORA enables. For example, I may connect information about a person who was arrested for a crime under an alias, with a person applying for a job as a croupier under her real name, because they both have the same phone number. Jeff also helped intelligence agencies deal with the "connect the dots" problems they faced after 9/11.
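To make the phone-number example concrete, here is a minimal Python sketch of linking records that share an attribute value. It is only an illustration of the general idea, not NORA itself: the record fields, the sample values, and the `link_records` helper are all hypothetical, and a real entity-matching system would weigh many attributes and their trustworthiness rather than merging on a single exact match.

```python
from collections import defaultdict

def link_records(records, keys=("phone", "address")):
    """Group record ids that share any value for the given attribute keys.

    Hypothetical helper: exact-match linking with a small union-find.
    """
    parent = {r["id"]: r["id"] for r in records}

    def find(x):
        # Follow parent pointers to the group root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    seen = {}  # (key, value) -> first record id seen with that value
    for r in records:
        for k in keys:
            v = r.get(k)
            if not v:
                continue
            if (k, v) in seen:
                union(seen[(k, v)], r["id"])
            else:
                seen[(k, v)] = r["id"]

    groups = defaultdict(set)
    for r in records:
        groups[find(r["id"])].add(r["id"])
    # Only groups with more than one record represent a discovered link.
    return [g for g in groups.values() if len(g) > 1]

# Made-up sample data: an arrest record under an alias, a job application
# under a real name, and an unrelated guest record.
records = [
    {"id": "arrest-1", "name": "J. Doe (alias)", "phone": "555-0100"},
    {"id": "job-app-7", "name": "Jane Smith", "phone": "555-0100"},
    {"id": "guest-3", "name": "Alex Roe", "phone": "555-0199"},
]
linked = link_records(records)
print(linked)
```

In this toy data the arrest record and the job application end up in the same group because they share a phone number, while the unrelated guest record stays on its own.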
Making these connections requires an integrated view across observations from many different sources, including structured, semi-structured, and unstructured data, and even advanced sources such as video with facial recognition. Jeff pointed out that one flaw in conventional business intelligence tools is that they require smart people to ask smart questions, and only then can these tools give answers. There's no way your organisation has enough smart people to ask all the right questions all the time, so you need analytics that find relevant connections and bring them to your attention, telling you things you would otherwise never have known, such as the connection between the arrest record and the croupier's job application. Entity Analytics are also quite valuable for developing richer "views of the customer" as well as for householding and other techniques crucial to success in the era of Digital Disruption.
Jeff used a story about jigsaw puzzle pieces to convey a powerful metaphor for linking information and observations. He has used groups of people assembling jigsaw puzzles to conduct experiments that reveal important insights about the way humans' analytical thinking enables them to link pieces together to make a picture, just as analysts want to link disparate observations together to form a cohesive picture of an intelligence threat, to find a perpetrator after a bombing, or even to learn enough about you to make you offers that you just can't refuse. But Jeff's presentation happened the week before the Boston Marathon bombing, and when that happened I wondered what role NORA's descendants might have played in analysing video feeds and finding the bombers.
Unfortunately, Jeff sees many organisations getting dumber about their data - the algorithms they have developed to help them make sense of their data are not growing and innovating fast enough to keep up with the flood of new data from new sources, such as location data, which is a potential source of deep new insights. He calls this gap "enterprise amnesia," and told the story of retailers that have been known to hire associates who were previously arrested for shoplifting - from the same store location.
Lessons Jeff Learned From A Lifetime Of Linking Data Together
- Data is often imperfect - and that's usually a good thing! You don't need perfect information to find interesting relationships in the data - in fact, counter-intuitively, "dirty" data is sometimes better for finding relationships, because cleansing may remove the very attribute that enables matching. On the other hand, some information is a lie, as "bad guys" will intentionally try to fool you, or to separate their interactions with your firm into different channels (web, mobile, store) to avoid detection. You should assign a trust level to "known" information, and it rarely approaches 100%.
- Your data can make you smarter as time passes. As new observations continue to accumulate, they enable you to refine your understanding, and even to reverse earlier assertions of your analysis based on what you knew at the time. Therefore, be sure to rerun earlier analyses over the full dataset, and don't assume the conclusions of your previous analysis were correct.
- Partial information is often enough. It's surprising how soon you can start to see a picture emerge - with puzzles, the picture can often be identified with only 50% of the pieces, and this aspect of human cognition often applies to machine learning, too. Once the picture starts to emerge, you can more quickly understand each new puzzle piece (observation) by seeing it in the context it occupies among those around it.
- More data is always good. The case for accumulating more data - Big Data - is strong: not only does it bring deeper insights, it also can reduce your compute workload - Jeff's experience shows that the length of time it takes to link a new observation into a large information network actually goes down as the total number of observations goes up, beyond a certain threshold.
- Can you count? Good! Accurate counting of entities (people, cases, transactions), a.k.a. Entity Resolution, is critical to deeper analysis - if you can't count, you can't determine a vector or velocity, and without those, you can't make predictions. Many interesting analyses in fraud detection involve detecting identities - accurately counting people, knowing when two identities are the same person, or when one identity is actually being used by more than one person, or even when an identity is not a real person at all. Identity matching is also the source of analyses that identify dead people voting and other such fraud.
- Privacy matters, but it's not an obstacle. Once identity comes into play, then privacy concerns (and regulations) must of course be taken into consideration. There are advanced techniques such as one-way hashes that can be used to anonymise a dataset without reducing its usefulness for analytical purposes.
- Bad guys can be smart, too. Skilled adversaries present unique problems, but they can be overcome: to catch them, you must collect observations the adversary doesn't know you have (e.g. from a camera on a route they use), or compute over your observations in a way the adversary can't imagine (e.g. recognising faces or license plates, and correlating that with other location information).
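The one-way hash technique mentioned in the privacy lesson above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a description of how Jeff's systems do it: it assumes the parties comparing data agree on a secret key and on a normalisation rule, and the `SECRET_KEY`, `normalise`, and `pseudonymise` names and the sample email addresses are all hypothetical. Note that hashed values still support exact matching only, so values must be normalised identically before hashing.

```python
import hashlib
import hmac

# Hypothetical shared secret; in practice it would be generated and
# stored securely, never hard-coded.
SECRET_KEY = b"example-shared-secret"

def normalise(value):
    """Lower-case and strip all whitespace so equal values hash equally."""
    return "".join(value.lower().split())

def pseudonymise(value, key=SECRET_KEY):
    """Return a keyed one-way hash (HMAC-SHA256) of the normalised value."""
    return hmac.new(key, normalise(value).encode("utf-8"), hashlib.sha256).hexdigest()

# Two made-up lists held by different parties, hashed before sharing.
list_a = {pseudonymise(v) for v in ["Jane.Smith@example.com", "bob@example.com"]}
list_b = {pseudonymise(v) for v in ["jane.smith@example.com ", "carol@example.com"]}

# The overlap can be computed without exchanging the raw identifiers.
overlap = list_a & list_b
print(len(overlap))
```

A keyed hash is used here rather than a plain one because identifiers such as phone numbers or email addresses are easy to guess and hash in bulk; that choice is a design assumption of this sketch.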
How To Stay Ahead Of The Game
Jeff pointed out that location data presents tantalising new possibilities for insight. There are 600 billion location records created every day in the US alone! This data is being routinely de-identified and shared with multiple third parties, in volume and in real time, and it's amazing what you can figure out from it. Consider the example of Malte Spitz, who as an act of political protest over his privacy concerns sued Deutsche Telekom for release of his location records. They revealed that over six months, he "hung out" 2400 times at 130 unique places. Know three of those locations - home (sleeps at night), work (goes in the daytime), and pub (goes to meet friends - links to other trails of location data) - and I can tell you who the person is, despite the anonymised data - and who his friends are.
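The reasoning behind "know three of those locations" can be shown with a toy Python sketch: filter de-identified trails down to the ones that contain all of the places you already know about. Everything here is made up for illustration, including the trail ids, the cell labels, and the `match_trails` helper; real location data is far noisier and would need time-of-day and frequency information as well.

```python
def match_trails(trails, known_places):
    """Return ids of anonymised trails that visit every known place."""
    known = set(known_places)
    return [tid for tid, places in trails.items() if known <= set(places)]

# Hypothetical de-identified trails: pseudonymous id -> places visited.
trails = {
    "user-a91f": {"cell-12", "cell-40", "cell-77", "cell-03"},
    "user-77c2": {"cell-12", "cell-40", "cell-55"},
    "user-0be4": {"cell-12", "cell-18", "cell-77"},
}

# Suppose home, work, and pub are known to be these three cells.
candidates = match_trails(trails, ["cell-12", "cell-40", "cell-77"])
print(candidates)
```

Each extra known place narrows the candidate set, which is why a handful of locations can be enough to single out one trail even though no name appears in the data.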
Although there's a strong trend toward analysing data in memory and delivering insights in real time - to inform "sense and respond" systems - don't imagine that the world is going all real-time. Instead, Jeff advises that you should view batch approaches to analysis as an important complement, as delivering "periods of reflection" that can deliver insights that you can then use to improve the accuracy and usefulness of the model that drives your "sense and respond" systems. Jeff labels these two sides of the analytical world with catch phrases: "sense and respond" (relevance finds the sensor) vs. "explore and reflect" (relevance finds you). Jeff advises we use both sides together, which should inform future architectures for doing advanced analytics.
In contrast, today we do analytics in stovepipes - we have one set of algorithms to analyse structured data, different algorithms for unstructured data, and still more (different) algorithms for social data! Jeff believes that in the future we must take a more integrated approach to analytics, with algorithms that reason over datasets that mix all types of data, and link them all. It's only through this broader view that we can do what casinos do, and catch the bad guys while they are still playing Blackjack.
What This All Means For You
Below find my take on how you should act upon Jeff Jonas' insights, but I also urge you to engage with Forrester's analysts who spend every waking moment thinking about business intelligence, Big Data, and the potential for deeper business insights that these and other innovations can bring: Boris Evelson, Martha Bennett, Mike Gualtieri, Noel Yuhanna, Michele Goetz, Brian Hopkins, and others. In my view:
- Integrate your analytics stovepipes. Gaining deep insight requires a more integrated approach to analytics, bringing together all sources of information, whether structured, semi-structured, or unstructured (including media) into one pool of observations for analysis. This runs counter to the current practice in many organisations of more stove-piped approaches to analytics, so will require a major upheaval to accomplish, but it will be worth it for those that most require this kind of intelligence. The implications impact organisation, staffing/skills, choices of technology, and architecture.
- Integrate real-time and batch analytics for deeper insight. Both real-time and batch approaches are critical, and are also more complementary than many people realise. Although the need to act quickly on information that develops in real time (sense and respond) is the primary driver of the need to increase investments in real-time, the opportunity to inform batch analyses/models with new insights that are constantly emerging from real-time channels is an under-recognised source of added value that can help support the business case for real-time, just as insights from "deep reflection" via batch methods can inform and improve "sense and respond."
- Don't be afraid of real-time. I was struck by Jeff's view that real-time may not cost more, as many expect it does. My own research, talking to people who are doing new work in-memory and using new technology like SQLStream or Streambase, or CEP, suggests that Jeff is right, that these innovative new ways of gaining insight often develop those insights much more efficiently than through other approaches that require cranking through the whole haystack, instead of reaching in and picking out just the needle you care about.
- You need the right people to gain these insights. Transforming your approach to analytics will depend mainly on having the right people - as Jeff put it, you should hire "curious" people. In the future it will be more important for an analyst to be curious, even driven, than for the analyst to know SQL. These curious people will be seeing the emerging picture uncovered as data finds data - algorithms discover connections among many different observations - and using those insights to continually refine their analytical models and augment their sources with additional observations.
- Beware the privacy and regulatory implications of integrating analytics. The value of combining information from multiple sources will motivate organisations that urgently require better insights from this data to consider how to obtain insights from the datasets they need without violating policies and regulations designed to protect the interests of citizens, while staying away from the legal jeopardy of a "fishing expedition."
This opportunity to integrate multiple sources of insight is too important to our business success, good governance, and security, to let it go by. Be sure you enhance your strategy for analytics and business intelligence to exploit the opportunities that Jeff Jonas' research and innovation shows us are real and compelling.
Posted by Mike Gilpin