Never before have advances in technology coincided with a change in business requirements as radical as the one we are seeing today with the growth in unstructured data. This combination is enabling early adopters of the new technology to achieve significant competitive advantage and is driving very fast growth for the vendors in this hot new space.
In recent years the volume of information that most organisations need to manage has grown exponentially and this data often contains many new formats and structures that didn’t exist a few years ago.
To make things even more complicated, the majority of this new information comes from outside the organisation, leaving it with less control over what arrives. Furthermore, the management requirements of this new breed of information are very different.
Most businesses need to interrogate this information at a granular level, identify the important and often related elements within it, and then either trigger alerts based on those elements or re-purpose the information into another application or product.
New breeds of systems are starting to make significant headway here, addressing these requirements with radically different technology. In much the same way that a relational database was designed from the bottom up to ingest, manage and analyse structured information, relying on fixed schemas to maximise its effectiveness, a database for unstructured information is designed from the bottom up to manage the information it is intended for. It requires no predefined structure, and it analyses and pre-indexes what is inside the content in depth as it arrives in the system. To provide some insight into this, here are some of the original design assumptions:
- The unstructured information to be managed will often be messy
- There will often be millions to billions of items
- The information often has an important textual component
- Rarely do you get to control the inputs, so they are likely to change
- Information can be sparse, with each item having different attributes
- Users want answers in under one second, regardless of volume
Clearly, to have embarked on the design of a new database in the early 2000s, you needed to be convinced you could deliver significant advantage over established technologies. Some of the key differences in approach can be defined as follows:
- No enforced schemas, can use none or many
- Uses inverted indexes to index what you see (not what you expect)
- Uses in-memory indexes to go from billions to hundreds before you hit disk
- Able to load and query at massive scale
- Uses a query language that helps you deal with ambiguity (XQuery)
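To make the idea of indexing "what you see" a little more concrete, here is a minimal Python sketch of an inverted index built over items that share no schema. It is an illustration only, not how any particular product implements its indexes; the function names (`tokenize`, `build_index`, `search`) and the sample items are assumptions made up for this example.

```python
from collections import defaultdict
import re


def tokenize(value):
    """Lower-case a value and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", str(value).lower())


def build_index(docs):
    """Map every (field, token) pair actually seen to the ids of the items containing it."""
    index = defaultdict(set)
    for doc_id, doc in docs.items():
        for field, value in doc.items():
            for token in tokenize(value):
                index[(field, token)].add(doc_id)
                # Field-agnostic entry, so a term can be found wherever it occurs.
                index[(None, token)].add(doc_id)
    return index


def search(index, *terms, field=None):
    """Return ids of items containing all terms, optionally restricted to one field."""
    postings = [index.get((field, term.lower()), set()) for term in terms]
    return set.intersection(*postings) if postings else set()


# Hypothetical items: each has different attributes and none follows a schema.
docs = {
    1: {"title": "Quarterly report", "body": "Revenue grew in Europe"},
    2: {"headline": "Europe markets rally", "source": "newswire"},
    3: {"body": "Report on server outage", "severity": 2},
}
index = build_index(docs)

print(search(index, "europe"))                 # items 1 and 2
print(search(index, "report", field="title"))  # item 1 only
```

The index is keyed by whatever fields and words turn up in the content, so a new attribute such as `severity` becomes searchable without any schema change. Intersecting the small posting sets in memory is what narrows a large collection down to a handful of candidates before any item has to be fetched.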
It’s a fundamentally different approach, both in how the problem is solved technically and in how you go about implementing it. As there is no enforced schema, there is no need for extensive up-front analysis of the information and user requirements to determine a schema, or to massage the data to fit it. Typically, step one is to load the information “as is” and see what you can do with it; you then refine the search, enrich the data and try again, repeating in iterations.
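The load, inspect, enrich cycle described above can be sketched in a few lines of Python. This is a hedged illustration rather than a recommended implementation: the in-memory `store` list, the `profile` and `enrich` helpers, and the derived `word_count` attribute are all hypothetical names invented for the example.

```python
from collections import Counter


def profile(items):
    """Count how often each attribute appears across the loaded items."""
    return Counter(key for item in items for key in item)


def enrich(item):
    """One hypothetical enrichment pass: derive a word count where a body exists."""
    enriched = dict(item)
    if "body" in item:
        enriched["word_count"] = len(item["body"].split())
    return enriched


# Step one: load the information "as is", with no schema agreed in advance.
store = [
    {"body": "Revenue grew in Europe", "title": "Quarterly report"},
    {"headline": "Europe markets rally"},
    {"body": "Server outage", "severity": 2},
]

# Step two: see what actually arrived before deciding what to do with it.
before = profile(store)

# Step three: enrich, then look again; repeat as requirements become clearer.
store = [enrich(item) for item in store]
after = profile(store)

print(before)
print(after)
```

The structure is discovered from the loaded content rather than designed beforehand, and each enrichment pass simply adds attributes to the items it applies to, leaving the rest untouched.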
Compared with the technologies that have come before, the speed of implementation and, in particular, the reduced administration and management overheads make the value proposition compelling, particularly when set against the benefits that can be achieved in tackling this exponentially growing challenge.
As with most major inflection points in the information landscape, the companies that really appreciate the degree of change, and consider the implications for their business, can achieve significant business advantage.
This applies equally to businesses and to vendors. Businesses that have embraced this change have adopted revolutionary technology that has, in most cases, enabled them to change their business model to maximise the impact of the new information on their organisation and their customers, whilst reducing costs and driving significant additional revenue.
Vendors that have embraced the change and developed solutions specifically to manage it are also thriving, not only through the provision of software but, more importantly, through developing significant expertise in the best way to get the most from this new information and so help more customers achieve their goals. This is a time of great opportunity. Don’t wait for the existing mega-vendors to solve this problem, as they may not even recognise it.
Posted by John Pomeroy, vice president, MarkLogic Europe