

Textual Discovery and Textual Analysis

Originally published September 24, 2009

There is a truly symbiotic relationship between textual discovery and textual analysis. Textual discovery is the survey of the totality of a textual universe. In textual discovery, you look for all the variables found in the text and how they relate. The point of textual discovery is to look across the landscape and see what all the components are and how those components fit with each other. The goal of textual discovery is to achieve a broad overview of the information found in the body of text. 

Textual analysis, on the other hand, is the detailed examination of a few interesting aspects of a body of text. Whereas textual discovery provides an overview, textual analysis inspects a small subset of the text in detail. Figure 1 shows the difference between textual discovery and textual analysis.


Stated differently, the goal of textual discovery is breadth of analysis and the goal of textual analysis is depth of analysis. Textual discovery and textual analysis are two different sides of the same coin. 

The primary tool for textual discovery is a self-organizing map (SOM). A self-organizing map is an organization and display of a body of text. There is no predefined design for an SOM; the SOM “creates” and arranges itself. Both the contents of the SOM and the relationships among those contents are derived from the body of text that feeds it, hence the name self-organizing map. Figure 2 depicts an SOM.
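The article does not say which algorithm its SOM tool uses. As a point of reference only, the classic Kohonen self-organizing map can be sketched in a few lines of Python. The names `train_som` and `best_matching_unit`, the 3x3 grid, the decay schedule, and the sample term-count vectors are all assumptions made for illustration.

```python
import math
import random

def best_matching_unit(grid, vec):
    """Return the grid cell whose weight vector is closest to vec."""
    return min(grid, key=lambda cell: math.dist(grid[cell], vec))

def train_som(vectors, rows=3, cols=3, epochs=40, seed=0):
    """Train a tiny Kohonen-style map; returns {(row, col): weights}."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    # No layout is designed up front: every cell starts with random weights.
    grid = {(r, c): [rng.random() for _ in range(dim)]
            for r in range(rows) for c in range(cols)}
    for epoch in range(epochs):
        shrink = 1.0 - epoch / epochs            # falls from 1.0 toward 0
        rate = 0.5 * shrink                      # learning rate decays
        radius = 0.5 + (max(rows, cols) / 2) * shrink
        for vec in vectors:
            winner = best_matching_unit(grid, vec)
            for cell, weights in grid.items():
                dist = math.dist(cell, winner)
                if dist > radius:
                    continue
                # Cells near the winner are pulled toward the input vector.
                pull = rate * math.exp(-dist ** 2 / (2 * radius ** 2))
                for i in range(dim):
                    weights[i] += pull * (vec[i] - weights[i])
    return grid

# Hypothetical term counts per document: [liver, cancer, lease, oil].
docs = [[3, 2, 0, 0], [2, 3, 0, 0], [0, 0, 3, 2], [0, 0, 2, 3]]
som = train_som(docs)
placement = [best_matching_unit(som, d) for d in docs]
```

Each entry in `placement` is the grid cell a document lands on. Because neighboring cells are updated together, documents with similar vocabulary tend to settle in nearby cells, which is the sense in which the map arranges itself.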


An SOM reflects many aspects of a body of text. Perhaps the simplest thing an SOM shows is which words have a high concentration (many occurrences) and which have a low concentration (few occurrences). A high concentration counts occurrences of both the word and its stems. Where a word and its stems occur often, there is usually a corresponding concentration of related (or clustered) words. These concentrations of words, stems, and clusters appear as a “darkness” in the SOM.

Where there is a “lightness” in the SOM, there is a sparsity of words, stems, and associated clustering words. The SOM then visually brings out the distribution of words, stems, and clusters. 
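The notion of word and stem concentration can be made concrete with a small counting sketch. The stop word list and the `crude_stem` suffix stripper below are hypothetical stand-ins for whatever a real tool would use; a production stemmer is considerably more careful.

```python
import re
from collections import Counter

# Illustrative stop word list, not a complete one.
STOP_WORDS = {"the", "a", "of", "and", "is", "for", "in", "to"}

def crude_stem(word):
    """Strip a few common suffixes; a rough approximation of stemming."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def stem_concentration(documents):
    """Count occurrences of each stem across all documents."""
    counts = Counter()
    for text in documents:
        for word in re.findall(r"[a-z]+", text.lower()):
            if word not in STOP_WORDS:
                counts[crude_stem(word)] += 1
    return counts

counts = stem_concentration([
    "Treating liver tumors",
    "The treatment treated the liver",
])
```

Stems with high counts correspond to the dark regions described above, and stems with low counts to the light regions.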


One immediate use of an SOM is identifying correlations among the occurrences of words in a body of text. Where words share a dark region of the same SOM, those words occur in a correlative manner. With one look at an SOM, you can see where the major correlations of words are. (Note: SOMs are very good at finding correlations. SOMs are not good at depicting cause-and-effect relationships. For a further explanation of the difference between correlation and cause and effect, see a good book on statistics.) Figure 4 shows an SOM and the depiction of correlations.
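One simple way to approximate the correlation of word occurrences, without an SOM, is to count how often two words appear in the same document. The sketch below assumes whitespace tokenization and a hypothetical three-document corpus; it measures co-occurrence only and says nothing about cause and effect.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(documents):
    """Count, for every pair of words, how many documents contain both."""
    pairs = Counter()
    for text in documents:
        # Sorting makes each pair appear in one canonical order.
        words = sorted(set(text.lower().split()))
        pairs.update(combinations(words, 2))
    return pairs

pairs = cooccurrence_counts([
    "liver cancer treatment",
    "liver cancer symptoms",
    "liver transplant",
])
```

Pairs with high counts, such as `("cancer", "liver")` here, are the candidates an analyst would expect to see close together in the map.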


The other immediate use of an SOM is the ability to look at the complete picture formed by the visualization of the body of text. ALL of the text in the body contributes to the creation of the SOM (with the exception of the stop words and other filtered words). The SOM shows ALL words in the body of text. (Note: If a word occurs infrequently, it may not appear in the high levels of categorization at the top of the SOM. But the word can always be found by drill down processing. In other words, just because a word is not one of the words that occurs the most in a body of text does not mean that the word is not in the SOM.) 

It is very useful for the analyst to survey all of the words in the body of text and to see which words occur the most. This property of completeness allows interesting words to “jump out” at the analyst. Stated differently, without a visualization of the complete body of text, the analyst may well miss whole vistas of important data that he/she does not realize are there. An SOM brings out ALL of the text, in its proper proportion. Figure 5 shows that an SOM shows all of the text.


The SOM is created from a body of text. In principle, ANY text can be used to populate an SOM. In practice, only a homogeneous set of text should be used. For example, if data about the Dallas Cowboys, liver cancer survivors, Lego projects, and the Dow Jones industrial average is mixed together, the results can be used to create an SOM; however, it will be a nonsensical SOM. In order for an SOM to be meaningful, its content needs a certain homogeneity. A much more productive collection of text entering into an SOM might be:

  • Liver cancer survivor medical records

  • Treatments for liver cancer

  • Symptoms of liver cancer

  • Therapies for liver cancer

There is a common theme running through all the text and the resulting SOM may show some startling and interesting findings. Figure 6 shows the need for homogeneity of sources of text entering the SOM.


Another consideration for the text entering the SOM is the number of documents in the body of text. As a general rule of thumb, at least 500 documents should be used. Within reason, the more documents there are, the better. In addition, the meatier the documents, the better. SOMs created from email are usually nonsensical because emails simply lack meaty material. Likewise, SOMs created from contracts are usually not useful because so much text is repeated from one contract to the next. The nature of the input therefore makes a difference in the value of the resulting SOM. Yet another important difference among SOMs is the organization of the text. Simple unstructured data produces one type of SOM. Semi-structured data produces another type of SOM.

Another consideration is that of the integration algorithms that are used to treat the text. At a minimum, stop words are removed and other useless data can be filtered out. But the integration of text can go much deeper. External categorizations can be included, alternate spellings can be included, homographic analysis can be done, and so forth. The amount and the degree of integration that can be done to the text entering the SOM varies considerably. Figure 7 shows that the input into the SOM can be manipulated significantly, and that this manipulation profoundly affects the end results. 
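As a minimal sketch of two of the integration steps just mentioned, the function below removes stop words and resolves alternate spellings before the text goes any further. The word lists and the `integrate` name are illustrative assumptions; real integration would use far larger vocabularies and more steps.

```python
import re

# Illustrative lists only; a real system would load much larger ones.
STOP_WORDS = {"the", "a", "an", "of", "and", "for", "was", "with"}
ALTERNATE_SPELLINGS = {"tumour": "tumor", "haemorrhage": "hemorrhage"}

def integrate(text):
    """Lowercase, tokenize, drop stop words, and normalize spellings."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [ALTERNATE_SPELLINGS.get(token, token)
            for token in tokens
            if token not in STOP_WORDS]

cleaned = integrate("The tumour was treated with a haemorrhage risk")
```

Because the output of this step is what feeds the map, changing either list changes which words can cluster together, which is one reason the manipulation of input affects the end result so strongly.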


Once the SOM is created, it can be analyzed in many ways. The simple ways to analyze an SOM are to look for correlative analysis and for completeness analysis. But many other types of analysis can be done. 

One of the types of analysis that can be done with an SOM is drill down analysis. In drill down analysis, the analyst starts with one word or set of words, and then examines what stems there are and what other words have correlated with the words. Or there is drill across analysis. In drill across analysis, the analyst picks a word and determines the many places where the word has been used in a clustered fashion. 
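The article does not define these operations precisely, so the following is only one possible reading. Here drill down returns the words that share documents with a starting word, and drill across returns every document in which the word appears. The function names and the sample documents are hypothetical.

```python
from collections import Counter, defaultdict

def build_index(documents):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def drill_down(index, documents, word):
    """Words that share documents with `word`, most frequent first."""
    related = Counter()
    for doc_id in index.get(word, ()):
        others = set(documents[doc_id].lower().split()) - {word}
        related.update(others)
    return related.most_common()

def drill_across(index, word):
    """Every document in which `word` appears, in sorted order."""
    return sorted(index.get(word, ()))

documents = {
    "d1": "liver cancer treatment",
    "d2": "liver cancer symptoms",
    "d3": "lego projects",
}
index = build_index(documents)
related_to_liver = drill_down(index, documents, "liver")
liver_documents = drill_across(index, "liver")
```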

And these are merely some of the many ways the SOM can be used for analysis. Figure 8 shows this further analysis. 


Once the analyst has surveyed the entirety of the body of text through an SOM, the next step is to look for in-depth relationships among the body of text. This is where the unstructured database comes into play. The unstructured database is created from the text but is significantly different from the SOM. The unstructured database is made up of text placed into a relational environment. The text is organized for business intelligence processing. 

Figure 9 shows an unstructured database. An unstructured database is a relational database that has been created from unstructured data. Raw text is passed through a textual ETL process (such as that offered by Forest Rim Technology) and the resulting data is placed inside a relational database. 
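To illustrate the general idea of placing text in a relational environment, the sketch below tokenizes documents and stores one row per word occurrence in an in-memory SQLite database. This is not Forest Rim Technology's process; the `term` table, its columns, and the `load_text` function are assumptions chosen for brevity.

```python
import re
import sqlite3

def load_text(conn, documents):
    """Tokenize each document and store one row per word occurrence."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS term "
        "(doc_id TEXT, position INTEGER, word TEXT)"
    )
    for doc_id, text in documents.items():
        words = re.findall(r"[a-z]+", text.lower())
        conn.executemany(
            "INSERT INTO term VALUES (?, ?, ?)",
            [(doc_id, position, word) for position, word in enumerate(words)],
        )
    conn.commit()

conn = sqlite3.connect(":memory:")
load_text(conn, {
    "lease-001": "Paid up lease for oil",
    "lease-002": "Annual lease for gas",
})
lease_rows = conn.execute(
    "SELECT COUNT(*) FROM term WHERE word = 'lease'"
).fetchone()[0]
```

Once the text is in rows and columns, ordinary SQL and business intelligence tools can query it like any other table.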


There are many kinds of analysis that can be made from the unstructured database. One kind of simple analysis that can be done is that of counting the number of occurrences of selected variables. Figure 10 shows this simple activity. 


As an example of counting simple variables in an unstructured database, the analyst may want to calculate how many contracts are of the type “PAID UP LEASE.” This is a simple calculation. 
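Assuming the contract type has already been extracted from the text into a column, the count is a single query. The `contract` table and its sample rows below are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE contract (contract_id TEXT, contract_type TEXT)")
conn.executemany(
    "INSERT INTO contract VALUES (?, ?)",
    [
        ("C-1", "PAID UP LEASE"),
        ("C-2", "ROYALTY LEASE"),
        ("C-3", "PAID UP LEASE"),
    ],
)

def count_contracts(conn, contract_type):
    """Return how many contracts have the given type."""
    row = conn.execute(
        "SELECT COUNT(*) FROM contract WHERE contract_type = ?",
        (contract_type,),
    ).fetchone()
    return row[0]

paid_up = count_contracts(conn, "PAID UP LEASE")
```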

A more sophisticated type of calculation is for counting categories of data found in the unstructured database. Figure 11 shows this calculation. 


As an example of counting categories of data, the analyst may want to count how many contracts there are for hydrocarbons. That would include contracts for oil, gas, benzene, casing head oil, and so forth. In order to count categories of information, the analyst must have first read the raw textual data and then recognize that the data belongs to a category. Then, such an analysis becomes easy to do. 
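Counting by category depends on a mapping from raw terms to categories. In the sketch below that mapping is a small hand-written dictionary, `CATEGORY_OF`, which stands in for an external categorization; the contract rows are invented for the example.

```python
from collections import Counter

# Hypothetical external categorization of products named in contracts.
CATEGORY_OF = {
    "oil": "hydrocarbon",
    "gas": "hydrocarbon",
    "benzene": "hydrocarbon",
    "casing head oil": "hydrocarbon",
    "water": "other",
}

# (contract id, product named in the contract text)
contracts = [
    ("C-1", "oil"),
    ("C-2", "gas"),
    ("C-3", "water"),
    ("C-4", "casing head oil"),
    ("C-5", "gravel"),
]

def count_by_category(contracts, category_of):
    """Count contracts per category; unknown products are flagged."""
    return Counter(
        category_of.get(product, "uncategorized") for _, product in contracts
    )

by_category = count_by_category(contracts, CATEGORY_OF)
```

The "uncategorized" bucket makes visible any term that the categorization has not yet recognized, which is the step the analyst must complete before the count can be trusted.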

Another simple analysis of an unstructured database is done when data representing the intersection of unstructured data and structured data is created. Figure 12 shows such an example.


An example of counting the joins of structured and unstructured data may be counting the number of PAID UP LEASES for wells that are overproduced this month. In order to determine which leases are PAID UP, the unstructured contract data must be examined. In order to tell which wells are overproduced this month, the analyst must look at the structured data. Then the join is made and the qualified leases can be counted.
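A sketch of that join, using SQLite and two hypothetical tables: `lease`, derived from contract text, and `production`, holding structured well data. Table names, columns, and rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE lease (lease_id TEXT, well_id TEXT, lease_type TEXT);
CREATE TABLE production (well_id TEXT, month TEXT, overproduced INTEGER);
""")
# Rows derived from unstructured contract text.
conn.executemany("INSERT INTO lease VALUES (?, ?, ?)", [
    ("L-1", "W-1", "PAID UP LEASE"),
    ("L-2", "W-2", "PAID UP LEASE"),
    ("L-3", "W-3", "ROYALTY LEASE"),
])
# Rows from the structured production system.
conn.executemany("INSERT INTO production VALUES (?, ?, ?)", [
    ("W-1", "2009-09", 1),
    ("W-2", "2009-09", 0),
    ("W-3", "2009-09", 1),
    ("W-1", "2009-08", 0),
])

def paid_up_overproduced(conn, month):
    """Count PAID UP leases whose well was overproduced in `month`."""
    row = conn.execute(
        """
        SELECT COUNT(*)
        FROM lease
        JOIN production ON production.well_id = lease.well_id
        WHERE lease.lease_type = 'PAID UP LEASE'
          AND production.month = ?
          AND production.overproduced = 1
        """,
        (month,),
    ).fetchone()
    return row[0]

qualified = paid_up_overproduced(conn, "2009-09")
```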

And of course the analyst can add more and more qualifications of data to the mix. Figure 13 shows this analytical activity. 


As an example of adding more and more qualifications of data to the analysis, first the analyst looks for all PAID UP LEASES. Upon finding those, the analyst then looks for all PAID UP LEASES in Texas. Upon finding those, the analyst looks for PAID UP LEASES in Texas where the lease is for oil only. The analyst continues to qualify the data further until he/she finds what is being looked for.
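The narrowing steps can be sketched as successive filters, each applied to the result of the previous one. The lease records and the `qualify` helper below are hypothetical.

```python
# Hypothetical lease records extracted from contract text.
leases = [
    {"id": "L-1", "type": "PAID UP LEASE", "state": "Texas", "products": {"oil"}},
    {"id": "L-2", "type": "PAID UP LEASE", "state": "Texas", "products": {"oil", "gas"}},
    {"id": "L-3", "type": "PAID UP LEASE", "state": "Oklahoma", "products": {"oil"}},
    {"id": "L-4", "type": "ROYALTY LEASE", "state": "Texas", "products": {"oil"}},
]

def qualify(rows, predicate):
    """Keep only the rows that satisfy the predicate."""
    return [row for row in rows if predicate(row)]

# Each step narrows the result of the step before it.
paid_up = qualify(leases, lambda row: row["type"] == "PAID UP LEASE")
in_texas = qualify(paid_up, lambda row: row["state"] == "Texas")
oil_only = qualify(in_texas, lambda row: row["products"] == {"oil"})
```

Keeping each intermediate result lets the analyst back up one step and try a different qualification without starting over.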

The unstructured database, like the SOM, is created from a base of textual data, as seen in Figure 14.


Raw textual data enters the textual database. As part of the entry process, the data is integrated. This includes many of the activities found in processing the SOM textual input. Typical of the textual data integration that occurs are the activities of:

  • Stop word removal

  • Stemming

  • Identification of patterned variables

  • Identification of named variables

  • Decomposition of semi-structured information

  • Homographic resolution

  • Identification of delimiters

  • External categorization

  • Alternate spelling resolution

In fact, many activities must be performed on textual data in order to make it fit for inclusion in a relational database. Forest Rim Technology has patent-pending technology to do the integration of text in preparation for inclusion in a database.
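As one small illustration of the activities listed above, the identification of patterned variables can be approximated with regular expressions. This is a generic sketch, not the vendor's technology; the pattern names and the specific formats (US-style dates, phone numbers, and dollar amounts) are assumptions.

```python
import re

# Hypothetical patterns; real text would need many more variants.
PATTERNS = {
    "date": re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"),
    "phone": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
    "dollar_amount": re.compile(r"\$\d+(?:,\d{3})*(?:\.\d{2})?"),
}

def find_patterned_variables(text):
    """Return (variable name, matched text) pairs found in the text."""
    found = []
    for name, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((name, match.group()))
    return found

variables = find_patterned_variables(
    "Lease signed 9/24/2009 for $1,500.00; call 555-123-4567."
)
```

Each pair can then be stored as a named column value alongside the document it came from, which is what makes the text queryable.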

Once the text has been prepared for an unstructured database, the unstructured data can be joined with the structured data to form a really robust database. Figure 15 shows this possibility. 


The processes of discovery work hand in hand with the processes of analysis. There is, in fact, a feedback loop between the two, as seen in Figure 16.


First, the analyst surveys the body of text using an SOM. The SOM can be adjusted as much as needed. Upon discovering a promising set of information, the analyst turns to the unstructured database for deeper, more focused analysis. If the analyst finds what he/she is looking for, the analyst goes on to enhance the decision-making process. If the analyst does not find satisfactory results, the analyst can return to the SOM and do another global search for information.

The analyst is able to coordinate the results of analyzing the SOM and the unstructured database because the analyst starts from the same body of text. Figure 17 shows that the body of text feeds both the SOM process and the unstructured database creation process.


SOURCE: Textual Discovery and Textual Analysis

  • Bill Inmon

    Bill is universally recognized as the father of the data warehouse. He has more than 36 years of database technology management experience and data warehouse design expertise. He has published more than 40 books and 1,000 articles on data warehousing and data management, and his books have been translated into nine languages. He is known globally for his data warehouse development seminars and has been a keynote speaker for many major computing associations.

