Originally published October 2, 2007
As users of the Internet, we have become accustomed to having information readily available at our fingertips. We usually get it from an Internet search engine, which quickly returns an exhaustive list of results for almost any query we can make. Companies have recognized the tremendous power of search engines, and many have deployed them against their own internal data. Not long ago, some suggested that search engines would replace traditional data analysis tools, including business intelligence (BI) tools.
This turns out to be an overly simplistic view; we have learned that search tools and BI tools have very different – and complementary – capabilities. BI tools excel at allowing us to explore and interrogate structured data, while the forte of search tools is the traversing of unstructured data. The amount of structured data that companies generate and store has been growing steadily over the last four decades, as applications have been deployed to run more and more of our business processes and as those applications get richer in terms of the data they capture.
But the true explosion in data volumes has been on the side of unstructured data. We are capturing and storing astounding amounts of unstructured electronic bits and bytes: e-mails, electronic documents, images of scanned paper documents, voice mails, diagrams, photographs and video files, instant messaging transcripts, and other data types. Part of this dramatic increase is because these electronic media types are more and more important to our daily business, and we rely on them more heavily to document our business decisions and relationships. This growth is also fueled by changes in the regulatory climate; recent SEC and civil code regulations have mandated that companies retain very detailed electronic records far beyond the scope of what many people previously considered to be “official” company data.
Because company data lives in both structured databases and unstructured data files, we need both business intelligence and search tools in order to have access to all of our data. Business intelligence tools are tremendously powerful and flexible and have revolutionized the way that knowledge workers analyze data, but they are limited to querying structured data. We store transaction data in databases so that it can be used to drive business processes and support reporting and analysis. However, a company’s structured data only tells part of the story. Exceptions to business processes, for example, are rarely captured in structured databases; instead, those exceptions are documented in some form of electronic “paper trail.” Those paper trails are invisible to BI tools today.
Conversely, search tools can find a word or a phrase in almost any document with tremendous speed and accuracy, but they are no substitute for BI tools. Search tools typically don’t cover the data in transaction systems. They may technically be able to access data in a relational database, but without the context of the database schema, that search is meaningless. If one searches on a part number, for instance, how useful will it be when the search tool returns a list of every purchase order line item that includes that part number? Because the search engine doesn’t understand the context and structure of the database, it won’t return useful business information from complex relational databases. Nor do search tools provide a facility for executing complex queries. Even a moderately simple SQL query can be impossible to recreate using the query language of most search tools.
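To make the part-number example concrete, here is a minimal sketch in Python using the standard-library sqlite3 module. The tables, columns, and sample rows are hypothetical, invented only for illustration, and the "search" is simulated with a plain substring match rather than a real search engine.

```python
import sqlite3


def build_demo_db():
    """Create a tiny in-memory purchasing database (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE po_header (po_id INTEGER PRIMARY KEY, supplier TEXT);
        CREATE TABLE po_line (po_id INTEGER, part_no TEXT,
                              qty INTEGER, unit_price REAL);
        INSERT INTO po_header VALUES (1, 'Acme'), (2, 'Acme'), (3, 'Globex');
        INSERT INTO po_line VALUES
            (1, 'P-100', 10, 2.50),
            (2, 'P-100', 5, 2.50),
            (3, 'P-100', 20, 2.00),
            (3, 'P-200', 1, 9.00);
    """)
    return conn


def keyword_hits(conn, term):
    """Simulated keyword search: every line item whose text contains the term."""
    rows = conn.execute(
        "SELECT po_id, part_no, qty, unit_price FROM po_line").fetchall()
    return [row for row in rows if term in " ".join(str(v) for v in row)]


def spend_by_supplier(conn, part_no):
    """The kind of question BI tools answer: total spend per supplier for a part."""
    return conn.execute("""
        SELECT h.supplier, SUM(l.qty * l.unit_price) AS spend
        FROM po_line AS l
        JOIN po_header AS h ON h.po_id = l.po_id
        WHERE l.part_no = ?
        GROUP BY h.supplier
        ORDER BY h.supplier
    """, (part_no,)).fetchall()
```

With this sample data, `keyword_hits(conn, "P-100")` hands back three raw line items, while `spend_by_supplier(conn, "P-100")` joins and aggregates them into one row per supplier. The join and the aggregation are exactly the steps a keyword query has no way to express.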
The bottom line is that companies need access to both structured and unstructured data, and no single data retrieval paradigm is capable of handling both. What companies need is a way to combine BI (SQL-based retrieval) against their structured data with search-based retrieval of their unstructured data.
As companies have explored this problem, they have typically adopted one of two approaches: adding structure to their unstructured data, or unstructuring their structured data. Let’s look at each of these approaches more closely.
Many companies elect to unstructure their structured data so that they can more easily include it in search results. In practice, this means extracting data from relational databases and dumping it into flat files. Along the way, the data can be denormalized so that it retains some amount of context. Imagine creating a single electronic purchase-order document by combining the order header, order details, order line items, and purchaser information into a single record. The challenge with this approach is that the search results are only as good as the denormalization of the data. Ad hoc queries may very well return meaningless results if they were not envisioned when the data was denormalized. In addition, it’s not possible to execute complex query logic, and a query risks returning an overwhelmingly large number of results.
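The denormalization step can be sketched as follows. This is an illustration under assumptions, not a description of any particular product: the schema, the sample rows, and the flat-record layout are all hypothetical, and Python's standard-library sqlite3 module stands in for the relational source.

```python
import sqlite3


def build_demo_db():
    """Create a tiny in-memory purchasing database (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE po_header (po_id INTEGER PRIMARY KEY,
                                supplier TEXT, purchaser TEXT);
        CREATE TABLE po_line (po_id INTEGER, part_no TEXT,
                              qty INTEGER, unit_price REAL);
        INSERT INTO po_header VALUES (1, 'Acme', 'J. Smith');
        INSERT INTO po_line VALUES (1, 'P-100', 10, 2.50), (1, 'P-200', 2, 9.00);
    """)
    return conn


def build_po_documents(conn):
    """Flatten each purchase order into one text record for a search index."""
    docs = {}
    headers = conn.execute(
        "SELECT po_id, supplier, purchaser FROM po_header ORDER BY po_id"
    ).fetchall()
    for po_id, supplier, purchaser in headers:
        lines = conn.execute(
            "SELECT part_no, qty FROM po_line WHERE po_id = ? ORDER BY part_no",
            (po_id,),
        ).fetchall()
        items = "; ".join(f"{qty} x {part_no}" for part_no, qty in lines)
        # Only the fields chosen here survive; unit_price is deliberately
        # left out to show how a later price question becomes unanswerable.
        docs[po_id] = (f"PO {po_id} | supplier: {supplier} | "
                       f"purchaser: {purchaser} | items: {items}")
    return docs
```

Because the flat record omits the unit price, a later search for orders above a given value cannot be answered from these documents. That is the sense in which search results are only as good as the denormalization behind them.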
Making unstructured data available for search via a relational database is an even trickier proposition. The most common approach is to load the unstructured data items (e.g., Word documents) into a database as BLOBs, or binary large objects. These are then described by metadata extracted from them, such as keywords that describe the contents, the creation and revision dates, the author, etc. But this approach is limited by our ability to extract meaningful metadata from these documents. In addition, it can result in very large and inefficient databases, and maintaining or updating the metadata that describes each document is very cumbersome.
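A minimal sketch of the BLOB-plus-metadata pattern, again in Python with the standard-library sqlite3 module. The table layout and function names are hypothetical, and the keywords are supplied by the caller, since extracting them automatically is the hard part the paragraph describes.

```python
import sqlite3


def create_store():
    """Create an in-memory document store (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE document (
            doc_id  INTEGER PRIMARY KEY,
            author  TEXT,
            created TEXT,
            body    BLOB
        );
        CREATE TABLE doc_keyword (doc_id INTEGER, keyword TEXT);
    """)
    return conn


def add_document(conn, author, created, body, keywords):
    """Store the raw bytes as a BLOB plus descriptive metadata rows."""
    cur = conn.execute(
        "INSERT INTO document (author, created, body) VALUES (?, ?, ?)",
        (author, created, body),
    )
    doc_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO doc_keyword (doc_id, keyword) VALUES (?, ?)",
        [(doc_id, kw.lower()) for kw in keywords],
    )
    return doc_id


def find_by_keyword(conn, keyword):
    """Look up documents through their metadata only; the BLOB is never read."""
    rows = conn.execute(
        "SELECT doc_id FROM doc_keyword WHERE keyword = ? ORDER BY doc_id",
        (keyword.lower(),),
    ).fetchall()
    return [row[0] for row in rows]
```

A document is findable here only through the keywords someone chose to record. A word that appears in the body but not in the metadata is invisible to the query, and every revision of a document means revisiting its keyword rows.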
What is needed, then, is an architecture that does not rely on manipulating data so that it can be retrieved with a single tool. Instead, smart companies today are adopting a hybrid approach in which business intelligence and search tools work together to give business users access to all of their company data, regardless of whether it is structured or unstructured.
SOURCE: When Data is Not Enough
Recent articles by David Gleason