Cleaning Structured Transaction-Based Data: It’s Time to Clean the Beach House
by Bill Inmon
Originally published March 25, 2010
When the last new shoots of spring have appeared, the pollen has run its course, and the days have grown long enough for a round of golf after work, it’s time to clean the beach house.
When we open up the beach house, we find a few Coke bottles left from last year’s parties, a few cobwebs, and the birds nesting in the eaves. So let’s get to work.
First, we have to sweep the sand out of the back yard. We take a broom and start sweeping. And we continue sweeping. The more we sweep, the more sand appears, because the back yard is the beach. We turn to shovels instead of brooms, but the sand keeps appearing. There is just no end to the sweeping and shoveling.
So when are we going to be finished with sweeping the beach? Why…probably never. In fact, it is a silly thing to do – trying to sweep the beach.
Almost as silly as waiting for all the structured, transaction-based data to be cleaned up in the corporation.
The other day I was at a customer site and I was discussing the possibility of gathering, cleaning up, and organizing their textual data. This included warranties, corporate contracts, email, chat log sessions, and many other forms of text. And the pushback I got was, “How can we think about cleaning up our textual data when we haven’t cleaned up our structured, transaction-based data?”
At first, this seems like a logical question. Nearly all corporations are struggling with their structured transaction-based data. In fact, nearly all corporations – in one way or the other – have been struggling with their structured transaction-based data for decades. This is not something new.
And is there any end in sight? Probably not. Trying to completely organize and master transaction-oriented structured data is sort of like trying to sweep the sand off of the beach. You are going to be waiting a LONG, LONG time before the job is ever finished. Indeed, that job may NEVER be finished.
So when someone suggests to you that you should get your unstructured textual data in order, it seems logical that you should say, "We will tackle the unstructured data when we finish the structured data." In other words, you are never going to tackle gathering, integrating, and organizing unstructured data.
And that’s a real shame because new tools exist in the marketplace that allow you to tackle unstructured data. Now there is textual ETL. With textual ETL, you can gather, integrate, and structure text into a relational database. Once you have done that, whole new worlds of analytical opportunity open up.
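To make the idea of structuring text into a relational database concrete, here is a minimal sketch in Python. It is not how any commercial textual ETL product works; the sample documents, the tiny stop-word list, the table name term_occurrence, and the function names are all assumptions made up for illustration. It simply gathers a few pieces of text, breaks them into terms, and loads one row per term into a SQLite table that can then be queried with ordinary SQL.

```python
import re
import sqlite3

# Hypothetical sample documents (id, type, text). Real input would come
# from contract files, mail stores, chat logs, and so on.
DOCUMENTS = [
    ("contract-001", "contract", "The supplier shall deliver the goods within 30 days."),
    ("warranty-017", "warranty", "This warranty covers defects in the goods for 12 months."),
    ("chat-442", "chat", "Customer asked about warranty coverage for damaged goods."),
]

# A deliberately tiny stop-word list (an assumption, not a standard list).
STOP_WORDS = {"the", "in", "for", "this", "about", "within", "shall"}


def extract_terms(text):
    """Lowercase the text, split it into words, and drop stop words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


def load_documents(conn, documents):
    """Write one row per term occurrence into a relational table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS term_occurrence ("
        "doc_id TEXT, doc_type TEXT, position INTEGER, term TEXT)"
    )
    for doc_id, doc_type, text in documents:
        rows = [
            (doc_id, doc_type, position, term)
            for position, term in enumerate(extract_terms(text))
        ]
        conn.executemany("INSERT INTO term_occurrence VALUES (?, ?, ?, ?)", rows)
    conn.commit()


def documents_mentioning(conn, term):
    """Return the ids of all documents that contain the given term."""
    cursor = conn.execute(
        "SELECT DISTINCT doc_id FROM term_occurrence WHERE term = ? ORDER BY doc_id",
        (term,),
    )
    return [row[0] for row in cursor]


conn = sqlite3.connect(":memory:")
load_documents(conn, DOCUMENTS)
print(documents_mentioning(conn, "warranty"))
```

Once the text sits in a table like this, a contract, a warranty, and a chat log can all be searched and joined with the same query, which is the kind of analytical opportunity described above. A real tool would do far more (handling synonyms, context, and document structure, for example); this sketch only shows the basic shape of the step.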
And not only is textual ETL a reality, it turns out that textual ETL takes a FRACTION of the time to operate compared to classical legacy systems ETL. There are some good reasons for the blistering efficiency versus the three-toed turtle approach of classical ETL. Consider the job that must be done by classical ETL. Classical ETL has the task of integrating data from older applications (often undocumented) in older technology that was never designed for integration. One old IMS system written in COBOL must be integrated with a spiffy new system written in Java, which in turn must be integrated with a CICS/VSAM program written in assembler decades ago. These systems are about as far apart as systems can get. They are undocumented, written in the technology of yesterday, and were never designed to be integrated. The definitions of data are incompatible, and similar calculations and algorithms are exactly that – similar, not identical. No wonder it takes classical ETL months to achieve results.
Now consider the world of textual ETL. In textual ETL, you have – at the basis of all text – a common language. Regardless of whether it is a contract, a life insurance agreement, a letter, a warranty, or a chat log, the contents of text all share a common basis of integration. While there is a lot to doing textual ETL properly, the work is nothing like trying to integrate older legacy applications.
Trying to clean up unstructured data is like trying to clean up the beach house. Yes, there is work to be done. There is dust that needs to be swept. There is garbage to be emptied. There are bottles to be thrown away. But it is a finite task. Trying to sweep the sand off of the beach is another matter entirely. It is possible that the sand can never be swept off of the beach. And trying to clean up the structured, transaction-based data is like trying to sweep the sand off of a beach.
Copyright 2004 — 2020. Powell Media, LLC. All rights reserved.
BeyeNETWORK™ is a trademark of Powell Media, LLC