(Note: this article discusses techniques and processes that are intellectual property and patent pending. For licensing, please contact the author.)
It is widely recognized that the corporation holds far more unstructured data than structured data. Even so, this fact often comes as a shock to the database designer or database architect who has spent a career building data warehouses based on structured data. As an illustration of the scale involved, one large insurance company has been saving emails for a number of years now and has collected 175,000 terabytes of emails.
Volume presents a challenge in its own right, before any other factor is taken into consideration. The sheer volume of data presents challenges when it comes to –
- The cost of storage,
- The number and type of processors needed to shuffle the storage,
- The ability to access and analyze the storage,
- The longevity of the storage medium needed, and so forth.
Even with no other factors taken into consideration, the massive volume wrapped up in unstructured data is a challenge. Taken in combination with other factors, that volume becomes even more daunting.
Consider the case of marrying unstructured data with structured data inside a data warehouse. Looking at the problem simplistically, suppose a ten terabyte structured data warehouse is merged with a two hundred terabyte unstructured data warehouse. Once the merger takes place, several nasty things happen –
- Database scans start to take an exceedingly long amount of time,
- Database loads take a long amount of time,
- Indexing data takes a long amount of time, and so forth.
The very basic activities of data warehouse operation and usage suddenly start to take far more time than they ever used to.
Even the prospect of placing unstructured data in a data warehouse is therefore a serious concern because of the implications of volume.
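To put a rough number on the slowdown, the following sketch estimates how long one full scan takes before and after the merger. The 10 and 200 terabyte figures come from the example above; the scan rate of 1 GB per second is purely an assumption chosen for illustration, and real throughput depends on hardware and database design.

```python
# Rough full-scan time at a fixed throughput.
SCAN_RATE_GB_PER_SEC = 1.0   # assumed for illustration, not a figure from the article
GB_PER_TB = 1000             # decimal terabytes


def scan_hours(size_tb: float, rate_gb_per_sec: float = SCAN_RATE_GB_PER_SEC) -> float:
    """Hours needed to read size_tb terabytes once at the given rate."""
    seconds = size_tb * GB_PER_TB / rate_gb_per_sec
    return seconds / 3600


structured_tb = 10      # from the example in the text
unstructured_tb = 200   # from the example in the text

before = scan_hours(structured_tb)
after = scan_hours(structured_tb + unstructured_tb)
growth = after / before

print(f"before merge: {before:.1f} h, after merge: {after:.1f} h, {growth:.0f}x longer")
```

Whatever scan rate is assumed, the ratio stays the same: the merged warehouse is 21 times larger, so any operation whose cost grows with data size takes roughly 21 times longer.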
The good news is that, with unstructured data, it is often unwise to bury the structured portion of the data warehouse under textual data in the first place. There are a variety of opportunities to reduce the volume of data found in the unstructured environment without damaging the opportunity for textual analytical processing. The following describes some of those opportunities.
Removing irrelevant data. The world of unstructured data is free form. When an author sits down to write, there is no one telling the author what to write or how to write. Instead, the author can write anything desired. This goes for emails as well as other forms of unstructured data. When one looks at emails, how many of those emails are personal (not relating to the business of the organization)? When a young man writes an email to his girlfriend (“Let’s go out for a movie on Saturday night. I’ll pick you up at 7:00pm”), is there any business relevance to this email? Of course not. Only under the most far-fetched and contrived set of circumstances could this email have any relevance to the business. Therefore, personal emails need to be removed from the communications work stream. In doing so, the volume of the unstructured data can be reduced dramatically.
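One simple way to screen out personal emails is to keep only messages that mention at least one business term. The sketch below is a minimal illustration of that idea and not a description of any particular product: the `BUSINESS_TERMS` list, the function name and the second sample email are all hypothetical, and a real screen would need a vocabulary supplied by the organization.

```python
import re

# Hypothetical business vocabulary; a real list would come from the organization.
BUSINESS_TERMS = {"policy", "claim", "invoice", "contract", "premium", "customer"}


def is_business_relevant(email_body: str, terms: set[str] = BUSINESS_TERMS) -> bool:
    """Return True if the email mentions at least one business term."""
    words = set(re.findall(r"[a-z]+", email_body.lower()))
    return not words.isdisjoint(terms)


emails = [
    "Let's go out for a movie on Saturday night. I'll pick you up at 7:00pm",
    "The customer filed a claim against policy 123 last week.",  # made-up sample
]

# Only emails that pass the screen would move on toward the data warehouse.
kept = [e for e in emails if is_business_relevant(e)]
print(len(kept), "of", len(emails), "emails kept")
```

A keyword screen of this kind is crude and will misjudge some messages, so it is best read as a starting point for the idea of filtering before loading.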
Removing stop words. Text is full of stop words. Stop words are those words that are necessary for communication and grammar but are not relevant to the subject being discussed. In English, some stop words are: a, and, the, was, is, that, there and were. These words are necessary for communication, but they carry little meaning of their own. Removing them sharpens the content of the unstructured data and reduces the volume of textual data.
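Stop word removal can be sketched in a few lines. The stop list below contains only the eight words named above; real stop lists are much longer. The function name and the sample sentence are made up for illustration.

```python
import re

# Only the stop words named in the text; production lists are far longer.
STOP_WORDS = {"a", "and", "the", "was", "is", "that", "there", "were"}


def remove_stop_words(text: str) -> list[str]:
    """Lowercase the text, split it into words, and drop the stop words."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


sample = "The claim was filed and the adjuster is there"
tokens = remove_stop_words(sample)
print(tokens)
```

In this sample, nine words shrink to three, and the three that remain are the ones that say what the sentence is about.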
Creating named and patterned variables. Depending on the text, it may not be useful to index every word that is not a stop word. Instead, the analyst can be selective, choosing which words need to be indexed and indexing only those words. By using named indexes and patterned indexes, only the most relevant, most useful data is stored in the data warehouse. The vast volume of data remains stored in the source documents. The analyst still has some capabilities when it comes to textual analytical processing, but the sheer volume of data does not find its way into the data warehouse. Indexing only selected named words and patterned words gives the analyst a workable compromise between effective textual analytical indexes and the volume of data placed in the data warehouse.
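The sketch below shows one possible reading of named and patterned indexing: a named index records occurrences of words from a fixed list, and a patterned index records text that matches a regular expression. The term list, the two patterns, the `IndexEntry` layout and the sample document are all assumptions made for illustration; which names and patterns matter is for the analyst to decide.

```python
import re
from typing import NamedTuple


class IndexEntry(NamedTuple):
    doc_id: str
    kind: str     # "named" or "patterned"
    label: str    # the named term, or the name of the pattern
    value: str    # the text as it appears in the document
    offset: int   # character position in the source document


# Hypothetical choices made by the analyst.
NAMED_TERMS = {"liability", "deductible"}
PATTERNS = {
    "us_phone": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
    "dollar_amount": re.compile(r"\$\d[\d,]*(?:\.\d{2})?"),
}


def build_index(doc_id: str, text: str) -> list[IndexEntry]:
    """Index only the named terms and pattern matches; the full text stays at the source."""
    entries = []
    for m in re.finditer(r"[A-Za-z]+", text):
        word = m.group().lower()
        if word in NAMED_TERMS:
            entries.append(IndexEntry(doc_id, "named", word, m.group(), m.start()))
    for label, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            entries.append(IndexEntry(doc_id, "patterned", label, m.group(), m.start()))
    return sorted(entries, key=lambda e: e.offset)


doc = "The deductible is $1,500.00. Call 555-123-4567 about liability."
index = build_index("doc-1", doc)
for entry in index:
    print(entry)
```

Only four small entries reach the warehouse for this document, each carrying a document identifier and an offset that lead back to the source text.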
Operating at the stemmed level. Most languages (certainly all Romance languages) have word stems. Moving, mover, moved and moves are stemmed to “move.” Not only does this aid textual analytic processing, but stemming words also reduces the volume of data. (Note: the potential of stemming is far less than the potential of other techniques when it comes to managing the volume of data found in the unstructured data warehouse. However, it is a technique that should be employed and does reduce the volume of textual data found in the integrated data warehouse.)
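The effect of stemming on volume can be seen with a toy suffix stripper. This is a deliberately simplistic sketch, not a real stemming algorithm: the suffix list and the minimum stem length are assumptions, and the rule produces the truncated stem "mov" where a dictionary-based approach would give "move." What matters for volume is that four distinct index terms collapse into one.

```python
# Toy suffix list, checked in this order; real stemmers use far richer rules.
SUFFIXES = ("ing", "ed", "er", "es", "s")


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping a stem of at least three letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


words = ["moving", "mover", "moved", "moves"]
stems = {toy_stem(w) for w in words}
print(len(words), "words ->", len(stems), "stem:", stems)
```

A rule this crude will also merge unrelated words, which is one reason production systems rely on established stemming algorithms instead.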
Being selective as to which documents are entered into the data warehouse. A large savings in the volume of data comes from being careful about which documents need to be entered into the data warehouse at all. By referencing only the most obvious, most relevant documents and placing their data into the data warehouse, the analyst can greatly influence the volume of data that the data warehouse holds.
These then are some of the more important techniques that are used to manage the volumes of data that go into the data warehouse from the unstructured environment.
SOURCE: Managing the Unstructured Volume
Recent articles by Bill Inmon