

No Hits, No Runs, No Errors: Handling Errors that Occur During ETL Load

Originally published January 29, 2009

“No hits, no runs, no errors” is what the manager of a baseball team loves to hear at the end of the inning after his side has just been in the field playing defense. And it is pretty typical too, as many innings of baseball end up being three quick outs.

The database administrator also likes to hear those same words when loading ETL data into a data warehouse. He likes to hear that there were no errors in the execution of the load process. And, usually, that is the case.

But on occasion, there are errors at the point of loading data by means of the execution of an ETL process. And these errors become problematic for the database administrator whose job it is to make sure the data warehouse is up and running and healthy.

The question then becomes: What should be done about erroneous data that is detected at the point of an ETL load?

The first consideration is how much data is erroneous. In some cases, one record out of a thousand is in error. This usually indicates that some editing criterion has not been met or that someone has fat-fingered an entry. These errors must be taken care of, but often they are not a high priority. The other type of error occurs when whole masses of data are incorrectly loaded by ETL. In this case, something is drastically wrong and must be corrected as soon as possible. This type of problem is definitely a top priority.
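The two scales of error described above can be told apart mechanically. As a minimal sketch (the 5% threshold and the function name are illustrative assumptions, not anything prescribed here), a load process might classify each batch by its error rate and escalate accordingly:

```python
# Hypothetical sketch: classify a finished load batch by its error rate.
# The 5% threshold is an assumption chosen only for illustration.

MASS_ERROR_THRESHOLD = 0.05

def classify_load(total_rows: int, bad_rows: int,
                  threshold: float = MASS_ERROR_THRESHOLD) -> str:
    """Return 'clean', 'isolated', or 'mass' for a load batch."""
    if bad_rows == 0:
        return "clean"       # no errors at all
    if bad_rows / total_rows > threshold:
        return "mass"        # something is drastically wrong: top priority
    return "isolated"        # a few fat-fingered rows: lower priority
```

An "isolated" result can be queued for routine cleanup, while a "mass" result warrants stopping the load and investigating immediately.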

But once the error has been detected (and that’s the first step), there remains the issue of what should be done about the erroneous data bound for the data warehouse. There are several possibilities for mending the data warehouse, or at least ensuring that the damage caused by the erroneous data is minimal.

One possibility for managing errors encountered during an ETL load is to create an error file. Once an error is detected, the erroneous data is placed in the error file instead of in the data warehouse. Then, one at a time, the errors are corrected, and each record is removed from the error file and placed into the data warehouse once it has been fixed. The notion of an error file sounds good, but in practice there are many drawbacks. Some of the drawbacks of an error file are:

  • The error file grows faster than it can be repaired. In this case, a monster has been created because there will NEVER be an end to the errors.

  • Certain items in the file are difficult to correct and stay in the error file almost eternally. After enough time passes, the item of data and the error cease to be of interest – but the item still remains in the error file.

  • The error file represents manual work that is being generated by an automatic process. That is always a recipe for trouble.

Yet perhaps the worst drawback of an error file is that data is not placed into the data warehouse at the outset. Suppose there are 10 elements of data in a record and one of those elements is determined to be faulty. The entire record is sent to the error file. The problem is that the nine correct elements of data are being held hostage by the one faulty element. It is entirely possible that the error is trivial, and that important elements of data are being kept out of the data warehouse because of a single trivial erroneous value.
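For concreteness, the whole-record rejection just described can be sketched in a few lines. The dict-shaped records and the per-record validation rule are assumptions made only for illustration:

```python
# Minimal sketch of the error-file pattern: records that fail validation
# are diverted to an error file instead of the warehouse table.

def load_with_error_file(records, is_valid):
    """Route each record to the warehouse or to the error file."""
    warehouse, error_file = [], []
    for rec in records:
        if is_valid(rec):
            warehouse.append(rec)
        else:
            # One bad field rejects the whole record: its other,
            # correct fields are "held hostage" in the error file.
            error_file.append(rec)
    return warehouse, error_file
```

Note that the routing itself is trivial; the cost of the pattern lies in the manual work of draining the error file afterward.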

All in all, error files are not a good way to handle errors in ETL processing.

Another alternative is to forgo the error file (with all its limitations) and simply not include the erroneous data in the data warehouse at all. This approach keeps bad data out of the data warehouse, but it also prevents the data in the data warehouse from being complete.

Another alternative is to actually include the erroneous data in the data warehouse. This allows the data warehouse to have a complete set of data, but it introduces bad data into the data warehouse. That may be acceptable if the bad data is of the trivial variety. In other words, if the errors are unimportant, loading the records anyway avoids keeping good, important data out of the data warehouse.

Yet another alternative is to introduce flags when an error condition is detected. In this case, the record is placed in the data warehouse, but a flag is set to indicate that something is wrong with the record. This is a compromise that can be useful, but it is not perfect, because incorrect data is still placed in the data warehouse.
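A rough sketch of that compromise (the `suspect` column name and record shape are illustrative assumptions): every record is loaded, but the ones that fail validation carry a flag that queries can filter on.

```python
# Hypothetical sketch of the flag approach: load everything, but mark
# records that fail validation so downstream queries can exclude them.

def load_with_flags(records, is_valid):
    """Return all records, each with a 'suspect' flag set appropriately."""
    loaded = []
    for rec in records:
        flagged = dict(rec)                   # copy; don't mutate the input
        flagged["suspect"] = not is_valid(rec)
        loaded.append(flagged)
    return loaded
```

Downstream, a query can restrict itself to `suspect = false` when accuracy matters, while still having the complete set of records available.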

Yet another alternative is to insert default values when an incorrect value is found prior to insertion. For example, suppose that the age of a human is to be inserted into a data warehouse. Now, let’s suppose that we run across a person who is deemed to be over 1,000 years old (and his name is not Methuselah). Under all normal circumstances a human age greater than 1,000 years old is considered to be an error. So we decide to place a default value as an age into the data warehouse. What value should the default be? For lack of a better value, let’s assume that we show a value of zero as the age. At least we don’t have people whose age is over 1,000, but now we have introduced another problem. We have now introduced people who have an age of zero into our data warehouse. The question then becomes: Which is worse – having people over the age of 1,000 in the data warehouse or people whose age is zero? Depending on what you are doing, it may be just as bad to have one value as to have another. The determination of the default value is a nontrivial consideration.
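The age example can be made concrete. One design choice worth noting (a common practice, not something the discussion above prescribes) is to use NULL rather than zero as the default, so the substitute value cannot be mistaken for a real measurement. The cutoff below is an illustrative assumption:

```python
# Hypothetical sketch of default-value substitution for an implausible age.
# MAX_HUMAN_AGE is an assumed cutoff; an age of 1,000 is clearly an error.

MAX_HUMAN_AGE = 150

def clean_age(age):
    """Return the age if plausible, else None (loaded as SQL NULL)."""
    if age is not None and 0 <= age <= MAX_HUMAN_AGE:
        return age
    return None
```

Using NULL sidesteps the zero-versus-1,000 dilemma: aggregates such as averages ignore NULLs, whereas a default of zero would silently skew them.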

These, then, are some of the considerations of what to do about errors at the time of ETL load. The good news is that there are actually a lot of options. The bad news is that each option has its own definite downside. There simply are not any perfect solutions.

The fact that there are no perfect solutions reinforces another fact about data warehousing – the building of the data warehouse always involves compromises. In the case of correcting errors in the ETL process, compromises are simply inevitable. They are a fact of life.

SOURCE: No Hits, No Runs, No Errors

  • Bill Inmon

    Bill is universally recognized as the father of the data warehouse. He has more than 36 years of database technology management experience and data warehouse design expertise. He has published more than 40 books and 1,000 articles on data warehousing and data management, and his books have been translated into nine languages. He is known globally for his data warehouse development seminars and has been a keynote speaker for many major computing associations.

    Editor's Note: More articles, resources and events are available in Bill's BeyeNETWORK Expert Channel. Be sure to visit today!
