Data Investigation and Re-Engineering Implementation Techniques


Data Investigation and Re-Engineering Implementation Techniques focus on typical data problems that will be uncovered using a Data Profiling tool and those which can be resolved as part of the Data Re-Engineering process. It can be used in conjunction with the next section of this document, which gives some examples that are tool-specific and can act as an operational guide for the developer. More design patterns will be added to this section over time.


Column Profiling of Single Domain Fields

Column analysis typically examines the data found in a single column/field in either a table or a flat file. This analysis can either (1) examine the actual values found in the column (e.g. produce a complete frequency report listing each and every value found), (2) look for the various "data patterns" for the values in the column or (3) discover the underlying "parsed patterns" that might exist within more complex, free-form columns.

Typical Value Problems identified by Column Profiling

Data problems often occur at a very "atomic" level: specific values within specific columns/fields. To discover these problems a data investigation can either produce a complete frequency report, a partial frequency report, or conduct a more focused search for specific values within the column. A complete frequency report would be most useful on columns with a limited number of discrete values (e.g. gender code, employee status code, state of residence). The resulting table would then be small enough to visually examine for various data quality issues, such as missing or incorrect values.

A partial frequency report sets a frequency cut-off for the values to be displayed. For instance, an examination of a phone number column might set the cut-off to ten, so the report would only contain phone numbers with a frequency of ten or more. This would be useful if the vast majority of valid phone numbers occurred with a low frequency but "default" values (e.g. 9999-9999, 00000000, XX-XXXXXXXX) tended to occur with higher frequencies. Finally, a query approach could be used to search for specific values within the column. For instance, if the business identified values they knowingly use as "default" phone numbers, the analysis could look for those values within various phone number columns. However the analysis is conducted, the following value problems can be detected, and perhaps corrected:
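As an illustration of the three approaches just described, the following Python sketch profiles a single column held in memory; the column values, the cut-off of ten and the list of known defaults are assumptions made purely for the example.

  from collections import Counter

  # Illustrative values for a single "home_phone" column; in practice these
  # would come from the source extract rather than a hard-coded list.
  values = ["9999-9999"] * 40 + ["02-9205-1234", "03-8555-0101"] * 3 + [""] * 5

  freq = Counter(values)

  # (1) Complete frequency report: every value found and its count.
  complete_report = freq.most_common()

  # (2) Partial frequency report: only values at or above a cut-off, which
  #     tends to surface "default" values entered many times.
  CUTOFF = 10
  partial_report = [(v, n) for v, n in complete_report if n >= CUTOFF]

  # (3) Query approach: search for specific values supplied by the business.
  KNOWN_DEFAULTS = {"9999-9999", "00000000", "XX-XXXXXXXX"}
  default_hits = {v: freq[v] for v in KNOWN_DEFAULTS if v in freq}

  print(partial_report)   # [('9999-9999', 40)]
  print(default_hits)     # {'9999-9999': 40}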

Missing (Blank) Values

In any given column missing values could present a major problem or raise no concerns at all. If the business rules specify the column in the target database must never contain a blank value, then the presence of missing values in a potential source system file is a problem. The magnitude of the problem, of course, depends upon how many blank values were found in the source system. If the percentage of missing values is small, say much less than 1% of all records, then it's a minor issue. But if the percentage is substantial, say 20% of all records, then you have a major problem. With a very small percentage of missing values, it might be possible to decide to simply not load the affected records. When missing values are more widespread in a given file, the option of excluding such a large percentage of records is less viable.

Regardless of the magnitude of the problem, one possible solution is to send all the records with a missing value back to the source system. The source system would then investigate and supply a valid value before sending a record back. Another possible solution is for the ETL process to replace a blank missing value with a "null" value acceptable to the relevant target system.
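A minimal sketch of these two options, assuming each record is a Python dict, the column under review is "home_phone" and a 1% threshold separates "exclude or route back" from "substitute an agreed null"; all three assumptions are illustrative.

  def handle_missing(records, column="home_phone", reject_threshold=0.01):
      """Return (records to load, records routed back, blank percentage)."""
      is_blank = lambda r: not (r.get(column) or "").strip()
      blanks = [r for r in records if is_blank(r)]
      blank_pct = len(blanks) / len(records) if records else 0.0

      if blank_pct <= reject_threshold:
          # Minor issue: load the good records, send the blanks back to the
          # source system for investigation and correction.
          return [r for r in records if not is_blank(r)], blanks, blank_pct

      # Widespread issue: excluding that many records is not viable, so
      # substitute a "null" value acceptable to the target system.
      for r in blanks:
          r[column] = None
      return records, [], blank_pct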

Detecting the presence and relative extent of missing values is also important for two other reasons. One is to simply document in the metadata the presence and extent of missing values in the various source and target columns. The second reason has to do with matching. If any sort of de-duplication or matching process is to take place then analysts need to be aware of missing values within columns in order to design an efficient matching strategy. Within the QualityStage tool, for example, columns with a high percentage of missing values make poor candidates for inclusion within a matching strategy. Thus, analysts designing and coding a match need to be aware of how prevalent missing values are within each column of any file that could be used in a match.

Default Values

These values are essentially "missing values" with a different code. For instance, a "missing" home phone number could be either a record with all blanks in the column or the values "999999999" and "5555555555." In some cases the source system might be able to supply an explicit list of all the default values used in various columns. In other cases only a thorough data investigation will discover what values are probably being utilized as defaults in some table columns. Even when provided with a list of all purported default values a data investigation might reveal some "undocumented" defaults. For instance, some data entry personnel might be entering "0123456789" when they don't know a home phone number even though this is not an officially acceptable default value. When default values are discovered the potential solutions are similar to those discussed above for missing (blank) values (i.e., either send them back to the source for correction or substitute a valid default value for an invalid one).
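The sketch below combines the two situations described above: it counts occurrences of the documented defaults and also hunts for suspiciously frequent values that are not on the official list; the documented default list and the frequency threshold are illustrative assumptions.

  from collections import Counter

  DOCUMENTED_DEFAULTS = {"999999999", "5555555555"}   # supplied by the source system

  def find_defaults(values, min_count=50):
      freq = Counter(values)
      documented = {v: n for v, n in freq.items() if v in DOCUMENTED_DEFAULTS}
      # Undocumented defaults tend to show up as implausibly frequent values
      # that are not on the official list (e.g. "0123456789").
      suspects = {v: n for v, n in freq.most_common()
                  if n >= min_count and v not in DOCUMENTED_DEFAULTS}
      return documented, suspects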

'Out of Range' Values

In some cases the business will be able to supply a range of valid values for a given column. For instance, the business might clearly specify that any non-blank "employee status" value must be a number within the range of 10 to 30. Thus, any values appearing in a frequency table outside of this restricted range are invalid. When such invalid values are found one possible step is to flag the record and send it back to the source system for correction. Another possibility is to attempt to replace the invalid value with a valid value based upon data found in other fields. For example, the business might provide a detailed set of business rules such as "if the employment date is on or before December 31, 2000 and the compensation level is "5," then the employee status code should be 22." In such circumstances the relevant records should probably be flagged to indicate the process has replaced an invalid value provided by the source system.

Invalid Values

If the business supplies a clear-cut list of valid values for a given column, then a frequency table analysis can highlight any possible values found in the data but not in the list. For instance, if the "marital status" field can only take on the values "M," "S," "W," "D," or "U," then any value found outside this list is invalid (e.g. "P"). The correction steps for such problems are the same as for the "out of range" issue discussed above.
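The following sketch covers both the out-of-range and the invalid-value cases; the field names, the valid marital status list and the status-replacement rule simply mirror the examples in the text and would in practice be supplied by the business.

  from datetime import date

  VALID_MARITAL = {"M", "S", "W", "D", "U"}

  def check_employee_status(record):
      """Flag out-of-range status codes; apply the example replacement rule
      where it fits, otherwise route the record back to the source."""
      status = str(record.get("employee_status") or "").strip()
      if status and not (status.isdigit() and 10 <= int(status) <= 30):
          emp_date = record.get("employment_date")
          # Example rule: employed on or before 31 Dec 2000 at compensation
          # level "5" implies status 22; flag the record as repaired.
          if (emp_date is not None and emp_date <= date(2000, 12, 31)
                  and record.get("compensation_level") == "5"):
              record["employee_status"] = "22"
              record["status_repaired_flag"] = "Y"
          else:
              record["send_back_to_source"] = "Y"
      return record

  def check_marital_status(record):
      # Any value outside the supplied list (e.g. "P") is invalid.
      if record.get("marital_status") not in VALID_MARITAL:
          record["send_back_to_source"] = "Y"
      return record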

Codes needing to be transformed/translated

When integrating data across disparate systems the codes used by one system will often differ somewhat from the codes used by other systems. This might also hold true within a given source system if it contains historical data and a code change was instituted at some point in time (e.g. some records might contain the "old" code values rather than the updated values). For example, a frequency table on one system might contain the values "M", "F" and "U" for the gender code column whereas the frequency table from another source would contain the gender values of "1," "2," and "3."

As part of the ETL "cleansing and conforming" process it might be specified that the codes from some source systems, or perhaps all the source systems, are to be transformed to a "load file standard." With the gender code, for instance, there might be a specification that "1" should be converted to "M," "2" to "F," and "3" to "U." When transformations are required, special provision would of course need to be made for missing values and invalid values.
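A minimal sketch of such a transformation for the gender code example; the mapping to the assumed "load file standard" and the treatment of missing or unexpected codes are illustrative and would need business sign-off.

  # Source codes ("1"/"2"/"3") and already-conformed codes both map to the
  # assumed load-file standard of "M"/"F"/"U".
  GENDER_MAP = {"1": "M", "2": "F", "3": "U", "M": "M", "F": "F", "U": "U"}

  def conform_gender(value):
      if value is None or not str(value).strip():
          return "U"                                          # missing: agreed default
      return GENDER_MAP.get(str(value).strip().upper(), "U")  # invalid: agreed default

  print(conform_gender("2"))   # F
  print(conform_gender(""))    # U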

Truncated values

Potential problems with data truncation are among the most difficult to identify. Often only careful inspection of a complete frequency table will reveal that some data values are truncated, either in the source system, the final "load table," or some intermediary ETL staging table.

Data abbreviations

A thorough examination of a data investigation frequency table occasionally will reveal that the same entity is being stored in multiple ways, often by using a variety of abbreviations. For instance, in a company name column the following might all represent the same entity:

  • International Business Machines Corporation
  • International Business Machines Corp
  • Int Business Machines Co
  • I.B.M. Corporation
  • I.B.M. Corp
  • IBM
  • I-B-M
  • I/B/M

As with truncated values the detection of such a problem is not easy. Often such issues can only be discovered by a careful inspection of a frequency table, usually by a business expert very familiar with the data.

This issue needs to be addressed or query results within the final data warehouse could be wildly inaccurate. As part of the ETL "cleansing and conforming" process special care will need to be taken to fix this problem. Within a tool such as QualityStage this would be handled by a "standardization" routine designed to detect specific patterns, or even specific data values, and produce a single unified output. For instance, all the examples above might be converted to "IBM" in the company name field in the final load table for the data warehouse.
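The sketch below shows the principle of such a standardization routine for the IBM example only; a real QualityStage rule-set is far richer, and the regular expression here is purely illustrative.

  import re

  # Collapse the common ways of writing the IBM entity to a single value.
  IBM_PATTERN = re.compile(
      r"^(international\s+business\s+machines|i[\s./-]*b[\s./-]*m)\b.*",
      re.IGNORECASE)

  def standardise_company(name):
      cleaned = name.strip()
      if IBM_PATTERN.match(cleaned):
          return "IBM"
      return cleaned.upper()

  # All of the variants listed above collapse to the single value "IBM".
  for n in ["International Business Machines Corp", "I.B.M. Corp", "I-B-M", "I/B/M", "IBM"]:
      assert standardise_company(n) == "IBM"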

Typical "Data Type" Problems identified by Column Profiling

Rather than examine the actual data values found within a specific column, a data investigation could also look for "data patterns." This investigation looks at each byte in the column and assigns a "data type" as follows (a short sketch of this classification follows the list):

  • a: An alpha character (a-z, A-Z)
  • n: A number (0-9)
  • b: A blank space
  • Special characters: appear as they do in the data (e.g. $, ^, &, !, @)
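A minimal sketch of this character-by-character classification, using the classes listed above; the function name and the sample values are illustrative.

  def data_type_pattern(value):
      """Return the "data type" pattern for a value: a = alpha, n = number,
      b = blank, special characters appear as themselves."""
      out = []
      for ch in value:
          if ch.isalpha():
              out.append("a")
          elif ch.isdigit():
              out.append("n")
          elif ch == " ":
              out.append("b")
          else:
              out.append(ch)
      return "".join(out)

  print(data_type_pattern("(02) 5032-9876"))   # (nn)bnnnn-nnnn
  print(data_type_pattern("1-800-CALLNRMA"))   # n-nnn-aaaaaaaa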

This type of investigation comes in very handy for fields that could contain a large number of discrete values, but a relatively small number of "data type patterns." An example would be a phone number field. Although a file of 100,000 records could contain a huge number of different phone number values, the "data type" investigation might produce the following output:

A Typical Data Type Investigation

Frequency | Pattern        | Data Example
95,000    | bbnn-nnnn-nnnn | 02-9205-1234
4,000     | (nn)bnnnn-nnnn | (02) 5032-9876
400       | nnnn-nnnnbbbbb | 8042-1057
300       | bbbbbbbbbbbbb  |
200       | +nnbnnnnnnnnnn | +25 78165005432
50        | n-nnn-aaaaaaaa | 1-800-CALLNRMA
25        | nn/nn/nnnnbbbb | 11/11/2001
20        | nnnn           | 5555
5         | aaaa@aaaaa.aaa | joed@yahoo.com


Whereas a full frequency report on the phone number field might yield a report hundreds of pages long, the "All T" option yields a report that's much easier to read and interpret.

The "data type" report does an excellent job of providing a quick assessment of the validity of the values found within a given column/field. In the example above, it appears the phone number field is fairly "clean" in that 99,000 records have a valid phone number "data type pattern." The "data type" analysis is particularly valuable in providing the following information regarding data values:

  • Alpha characters appearing within what should be a numeric field. Columns defined to be numeric should never contain an "a" in any of the resulting data type patterns. In the phone example above, we can see that 50 records contain alpha characters. These 50 records could either be reported back to the source system for correction or the ETL process could "clean" the records by substituting the correct number for each alpha character (e.g. substitute "2" for "C", "6" for "N" and so on); a sketch of this substitution follows the list.
  • Numbers appearing within what should be all-alpha fields. An investigation of the given name field, for instance, might yield the following pattern: anaaa. This could occur when data entry personnel mistakenly enter a number where they meant to type a letter (e.g., "C0lin" instead of "Colin"). This type of investigation is especially valuable for "code" fields where the business specifies that valid codes within a column should only contain specific all-alpha patterns, such as "aa-aaa" or "aa-a."
  • Special characters appearing in text strings or codes. In the phone example above, the final pattern contains the special character "@" and appears to indicate that some email addresses have mistakenly been entered in the phone number field. In other columns a single special character or a series of special characters might indicate that a default value has been entered in place of a valid value (e.g., ####-#### in a phone field, ********* in the given name field or ? in a date of birth field).
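As a sketch of the "clean" option mentioned in the first bullet, the routine below substitutes the telephone-keypad digit for each letter found in a phone number; the mapping and function name are illustrative.

  # Standard telephone keypad: ABC -> 2, DEF -> 3, ..., WXYZ -> 9.
  KEYPAD = {**{c: "2" for c in "ABC"}, **{c: "3" for c in "DEF"},
            **{c: "4" for c in "GHI"}, **{c: "5" for c in "JKL"},
            **{c: "6" for c in "MNO"}, **{c: "7" for c in "PQRS"},
            **{c: "8" for c in "TUV"}, **{c: "9" for c in "WXYZ"}}

  def letters_to_digits(phone):
      return "".join(KEYPAD.get(ch.upper(), ch) for ch in phone)

  print(letters_to_digits("1-800-CALLNRMA"))   # 1-800-22556762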


Column Profiling of Complex/Multi-Domain Fields

Single domain columns, discussed above, contain just one isolated conceptual variable (e.g., tax file ID number, branch number, home address post code). Multi-domain or complex columns explicitly, or sometimes implicitly, contain two or more conceptual variables. For instance, a column labeled "Home Address" might appear to capture a single variable, but it actually contains a number of smaller, more "atomic" level subcomponents: unit type and number, floor type and number, house number, street name, street type, and perhaps even suburb, state name or code, post code and country name.

Multi-domain fields are often "free-form" columns in which data entry personnel can enter virtually any text they desire. Such columns present special problems for both data investigation and data cleansing. Typical examples of such fields would be names (personal and organizational), addresses, part descriptions, and notes taken from interviews or phone conversations. Such fields produce (1) a large number of discrete values and (2) a wide variety of "data errors."

Although a standard frequency report could be run on such "free-form" fields, the output for even files of moderate size would be voluminous and difficult to interpret. For instance, a frequency report on a "customer name" column could yield thousands of pages of output, and the majority of the records would only have a frequency of one. A superior alternative is to utilize a variation of the "data pattern" approach discussed above called "Lexical Analysis."

What is Lexical Analysis?

One formal definition of lexical analysis is "the process of taking an input string of characters and producing a sequence of symbols called 'lexical tokens,' or just 'tokens'" (Dictionary.LaborLawTalk.com). Within the context of a data investigation project, this sequence of symbols/tokens yields a great deal of insight regarding the quality of data. A data investigation tool, such as QualityStage (QS), implements lexical analysis with rule-sets and "Classification Tables." Rule-sets typically focus on a very specific type of "domain" and are often very country-specific. For instance, one rule-set specializes in parsing and tokenizing US addresses whereas another is specially designed for French addresses. A key component of each rule-set is a "Classification Table" which creates abstract tokens out of input data. Within QS a classification table typically consists of three columns: (1) the input data value, (2) the "standardized" version of the value (i.e. the value you want the input converted to), and (3) the abstract "token" value to be assigned in the pattern, which must be a single letter within QS. An excerpt taken from the classification table for the US address domain is presented in Table 1.

Table 1: Lexical Analysis on US Address Fields

Data Value | Standardized Value | Token
Apartment  | Apt                | U
Apt        | Apt                | U
Flat       | Apt                | U
Unit       | Apt                | U
Avenue     | Ave                | T
Ave        | Ave                | T
Boulevard  | Blvd               | T
Blvd       | Blvd               | T
Blvard     | Blvd               | T
Road       | Rd                 | T
Rd         | Rd                 | T
PO         | PO Box             | B
POB        | PO Box             | B
POBX       | PO Box             | B
North      | N                  | D
N          | N                  | D
South      | S                  | D
S          | S                  | D
SouthEast  | SE                 | D
SE         | SE                 | D
Floor      | FL                 | F
Flr        | FL                 | F

In this table we can see that the data value "Apartment" has a standardized value of "Apt" and is assigned the "token" classification of "U." Similarly, the value "Avenue" will be standardized to the abbreviation "Ave" and is assigned the token value "T." In addition to the tokens assigned within a given classification table, there are a set of "default tokens" for a wide range of data values falling outside the table. A sample of these tokens is presented in Table 2.

Table 2: Default Token Tags

Data Value                                           | Token | Examples
Any number                                           | ^     | 2, 125, 10834
Leading alpha                                        | <     | A5, BA515
Leading numeric                                      | >     | 5A, 125CAB
Unclassified alpha "word" or string of alpha "words" | ?     | Sydney, George, Bayswater, Prince Albert, Old Harbour Pass

In this table we can see that any whole number appearing in the input text string will be assigned the abstract token tag of "^" and a leading numeric such as "12B" will be represented by the token ">." Special characters, such as "@", "%", "&", are often used as "token tags" (i.e. the "@" sign in any input string of data is itself a token). Unclassified words are those that are NOT listed in the classification table.

Lexical Token Patterns

When presented with an input string, any given rule-set in conjunction with a specific Classification Table must perform three key tasks: (1) parsing the input into separate, discrete "words," (2) removing any unwanted characters and/or "words" and (3) assigning an abstract "token tag" to each "word." Within the QS tool the first two tasks are accomplished by use of a "Separator List" and a "Strip List." The "Separator List" contains a complete list of all the special characters that should be used to separate one "word" from another within a string of text. A blank space is almost always included within this list as a separator, but others are very specific to the nature of the input data. For instance, if the column contains dates you might want to use the forward slash, "/," as a separator. The "Strip List" contains a list of special characters to be ignored when constructing a pattern from the input data. For instance, if the left and right parentheses were included in the "Strip List" they would be removed from the pattern even though they existed in the actual input data string. (When conducting an initial data investigation this "Strip List" should probably be reduced to just a blank space so any special characters would appear within the output patterns.) With the following examples, let's assume the "Separator List" and "Strip List" consist of just a blank space.

Example #1
Input String: Floor 17, 99 Walker Street
Token Pattern: F ^, ^ ? T
Example #2
Input String: 8 Flr, 18-20B Prince Albert St
Token Pattern: ^ F, ^ - > ? T
Example #3
Input String: POB 159A
Token Pattern: B >
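As a rough illustration of these three tasks, the sketch below uses a deliberately tiny classification table together with the blank-space Separator List and an empty Strip List; it reproduces the pattern of Example #3, whereas a real rule-set would hold thousands of classified values and handle punctuation far more carefully.

  import re

  # Tiny illustrative classification table: value -> (standard form, token).
  CLASS_TABLE = {
      "FLOOR": ("FL", "F"), "FLR": ("FL", "F"),
      "STREET": ("ST", "T"), "ST": ("ST", "T"),
      "POB": ("PO BOX", "B"),
  }
  SEPARATORS = " "      # characters that split the input into "words"
  STRIP_LIST = ""       # characters removed before patterning (none here)

  def tokenize(text):
      for ch in STRIP_LIST:
          text = text.replace(ch, "")
      tokens = []
      for word in re.split("[" + re.escape(SEPARATORS) + "]+", text.strip()):
          if not word:
              continue
          if word.upper() in CLASS_TABLE:
              tokens.append(CLASS_TABLE[word.upper()][1])
          elif word.isdigit():
              tokens.append("^")            # any whole number
          elif word[0].isdigit():
              tokens.append(">")            # leading numeric, e.g. "159A"
          elif any(c.isdigit() for c in word):
              tokens.append("<")            # leading alpha, e.g. "A5"
          elif word.isalpha():
              tokens.append("?")            # unclassified alpha "word"
          else:
              tokens.append(word)           # specials stand for themselves
      return " ".join(tokens)

  print(tokenize("POB 159A"))               # B >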

A major advantage of using Lexical Analysis relates to performing data investigation on any complex, freeform field. Although complex fields can contain thousands, and perhaps even millions, of different discrete values, the number of underlying patterns is usually far smaller.

For instance, consider the following four addresses:

  • Apt 19 18 Eastbourne Rd
  • Unit 5 9583 George St
  • Flat 298 5 Liverpool Road
  • Suite 17 99 Walker Avenue

Although seemingly quite different from one another in content, they all share the same underlying "token" pattern: U ^ ^ ? T

In the table below we see the output produced by a "word investigation" using the US Address Rule-Set on the address column within a small data file.

Table 3: Word Investigation on US Address Fields

Frequency | Freq %  | Pattern | Data Example
59        | 25.322% | ^?T     | 101 COLONY DR
36        | 15.451% | ^D>T    | 427 EAST 55TH STREET
33        | 14.163% | ^?      | 6339 WOODSTOCK
14        | 6.009%  | ^D?T    | 524 NW BRIARCLIFF RD
9         | 3.863%  | ^D?     | 5137 S CASTLEWOOD
8         | 3.433%  | ^H^     | 3323 IOWA #133
7         | 3.004%  | BB^     | PO BOX 275
6         | 2.575%  | ^D>     | 4412 W 50TH
5         | 2.146%  | ^H      | 2329 KENTUCKY
4         | 1.717%  | ^?^     | 10308 MITCALF #345
4         | 1.717%  | R^B^    | RR 4 BOX 152
3         | 1.288%  | ^?T^    | 4200 DRURY LANE #218
3         | 1.288%  | ^HT     | 807 ARIZONA CT
3         | 1.288%  | ^LT     | 5340 COLLEGE BLVD
3         | 1.288%  | ^TT     | 5710 VALLEY RD
3         | 1.288%  | IIB^    | P O BOX 412916
2         | 0.858%  | ^?TT    | 11312 HIGH RIDGE DR
2         | 0.858%  | ^D>>F   | 4400 WEST 109TH 3RD FLOOR
2         | 0.858%  | ^DD?    | 5617 N.E. WHITESIDE
2         | 0.858%  | ^DT     | 312 N UNION
2         | 0.858%  | ^L      | 1116 COLLEGE
2         | 0.858%  | I^      | R 17
1         | 0.429%  | ?^B^    | HRC 69 BOX 445
1         | 0.429%  | ^?H     | 2024 WILSON FARM
1         | 0.429%  | ^?HT    | 2024 WILSON FARM COURT
1         | 0.429%  | ^?TU^   | 11905 WILLOW LN APT 416
1         | 0.429%  | ^?TUI   | 1980 ARLINGTON BLVD APT B
1         | 0.429%  | ^D?^    | 1325 SW FILLMORE #16
1         | 0.429%  | ^D?T^^  | 8701 SOUTH KOLB RD #5-300
1         | 0.429%  | ^D>TT   | 1710 E 24TH ST TERR
1         | 0.429%  | ^DD>T   | 5225 S W 25TH ST
1         | 0.429%  | ^DDHT   | 8545 N E INDIAN CREEK
1         | 0.429%  | ^DH     | 3923 N KANSAS
1         | 0.429%  | ^DL     | 512 W CENTER
1         | 0.429%  | ^DTTTU> | 5020 S LAKE SHORE DR APT 2903N
1         | 0.429%  | ^H?U^   | 7600 STATE LINE STE 106
1         | 0.429%  | ^H>     | 3333 IOWA #13W
1         | 0.429%  | ^HTU^   | 534 KANSAS AVENUE SUITE 1035
1         | 0.429%  | ^LTT    | 11131 VILLAGE BEND LN
1         | 0.429%  | ^TI     | 5943 HWY V
1         | 0.429%  | R^      | ROUTE 3
1         | 0.429%  | R^B^^   | RT 6 BOX 265-20
1         | 0.429%  | R^B^I   | RR #3 BOX 135 D

In this table notice how 25% of all records share the same pattern: ^?T (a number followed by an "unknown word" and then a street type token). Also note that the top ten patterns account for approximately 78% of all records. At the bottom of the table there is a wide variety of "straggler" patterns that each account for only one or two actual data records. This table is a fairly typical result of what you find within many complex columns: a relatively small set of patterns accounts for a large percentage of the data and a large number of patterns describe a small percentage of the data.

A quick examination of this table reveals what may be a data quality issue: a number of address records are missing a street type. For instance, the third most frequent pattern (^?) indicates many records have a house number and street name, but no accompanying street type. Several other patterns also appear to contain street names but are missing a street type. If the information from this source file is going to be loaded into a data warehouse and subsequently utilized to generate mailing lists, the missing street types could yield an unacceptable level of "undeliverable addresses." Thus, this result should be reported to the source system in an attempt to discover why so many address records are missing a street type.
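A minimal sketch of this kind of check, assuming the pattern frequency report has been reduced to (pattern, count) pairs in the token notation used above; the figures shown are illustrative and not the full table.

  pattern_report = [("^?T", 59), ("^D>T", 36), ("^?", 33), ("^D?T", 14), ("^D?", 9)]

  # Patterns that contain a street name token ("?") but no street type ("T").
  missing_street_type = [(p, n) for p, n in pattern_report
                         if "?" in p and "T" not in p]
  affected = sum(n for _, n in missing_street_type)
  total = sum(n for _, n in pattern_report)
  print(missing_street_type)
  print(f"{affected / total:.1%} of these records lack a street type")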

Detecting Data Quality Problems with Token Patterns

An examination of a complete or partial frequency table displaying the "token patterns" found in a complex field can give some insight into potential data quality problems. As was mentioned, in Table 3 we could see from the various patterns that a large number of address records appeared to be missing a street type (e.g., "Rd," "Avenue"). In other cases the pattern might reveal obvious problems with the contents of a given column. For instance, if an address field contained just a single number (i.e. a pattern of "^"), a box type token ("B" alone) or an "unknown" word (a pattern of "+") then the address is obviously incomplete and defective. Such "defective patterns" are not always immediately obvious in a frequency table. Detecting them is a manual process best undertaken by individuals familiar with the data and possessing some level of expertise with the subject matter found within the column(s) (e.g. knowledge regarding US or Australian addresses). In many cases data quality issues are more easily detected when lexical analysis is used for parsing and then "bucketing" key tokens into separate fields, which is discussed next.

Specificity of "Rule-Sets When you use a rule-set to analyze any data you have to make certain assumptions at the outset. These assumptions can either be based upon available metadata, information received from someone very familiar with the source data or upon a visual inspection of the data. These assumptions need to be made because rule-sets are specific to the data potentially within the column or columns (e.g. name data, address data, area data or a mix of all three) and the country of origin of the data (e.g. Australian, American, German, French). If your assumptions are incorrect the data investigation results could be very misleading.

The reason rule-sets are very specific to a given set of data relates to the "token classification tables" found within each rule-set. The "tokens" within a rule-set designed to handle personal or organizational names are very different than the "tokens" associated with an address rule-set. The former consists of "tokens" associated with name titles, common first and middle names, generational titles and so on. The latter contains "tokens" for various address-specific components such as street types, directional indicators, unit and floor designations and so on. Thus, if one ran a name rule-set on a field containing address data the resulting "patterns" would be virtually impossible to interpret in any meaningful way.

As a specific example, take the following very typical address: Floor 17 Suite 150 99 Walker Street

If you used an address-specific rule-set the resulting pattern might be the following: F ^ U ^ ^ + T. Verbally, you would interpret this pattern to represent "a floor type followed by the floor number, a unit type followed by the unit number, and then the house number, street name and street type."

If you mistakenly used a name-specific rule-set it would not contain tokens for data entities such as floor, unit and street types, and the resulting pattern might be as follows: + ^ + ^ ^ ++. This pattern is difficult to interpret and moreover it presents problems for "bucketing" information into discrete subcomponents such as unit type, unit number, street name and type and so on.

Similarly the classification tables often contain quite different contents depending upon the national origins of the data. As anyone with some international travel experience can attest, the format and content of addresses differ from one country to another. A list of "Street Types" applicable to Australia, for example, would be inappropriate for Spanish addresses. Table X below displays the street types beginning with the letter "A" for Australia and then the street types for Spain beginning with "A." As you can see, there's almost no overlap and attempting to analyze Australian addresses with the Spanish rule-set and vice versa would yield very misleading results.

Australian Address Classification Table

Data Value | Standardized Version | Token Tag
ACCESS     | ACCS                 | T
ACCS       | ACCS                 | T
ALLEY      | ALLY                 | T
ALLY       | ALLY                 | T
ALLEYWAY   | ALWY                 | T
ALWY       | ALWY                 | T
AMBL       | AMBL                 | T
AMBLE      | AMBL                 | T
ANCG       | ANCG                 | T
ANCHORAGE  | ANCG                 | T
APP        | APP                  | T
APPROACH   | APP                  | T
ARC        | ARC                  | T
ARCADE     | ARC                  | T
ART        | ART                  | T
ARTERY     | ART                  | T
AVE        | AVE                  | T
AVENUE     | AVE                  | T

Spanish Address Classification Table

Data Value    | Standardized Version | Token Tag
ALAMEDA       | ALAMEDA              | T
AMPLE         | AMPLE                | T
AUTOBIDE      | AUTOBIDE             | T
AUTOBIDEZEBAL | AUTOBIDEZEBAL        | T
AUTOPISTA     | AUTOPISTA            | T
AV            | AV                   | T
AVDA          | AV                   | T
AVENIDA       | AV                   | T
AVINGUDA      | AVINGUDA             | T


Because of these cross-national differences, a precursor to conducting any type of name and address data investigation is to split the file up based upon the national origin of the data. If a field within the data file, such as the ISO country code of the record, is available then this should be used to separate records (e.g., by putting all "AU" records into the "Australian file" and having all "non-AU" records go into an "Overseas records file"). If no such field exists, and the file is known to contain a mix of records from different countries, then you might have to use a special rule-set to help identify the country of origin. QualityStage provides an "out-of-the-box" rule-set, the COUNTRY rule-set, which attempts to derive the ISO country code by examining data found in the address lines. Although the country code assignments from the COUNTRY rule-set are not perfect, particularly if the address data is incomplete or of poor quality, they are often quite accurate and at least allow you to divide up the file before using country-specific rule-sets.
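A simple sketch of this split, assuming a delimited extract that carries an "iso_country_code" column; all file and field names here are illustrative.

  import csv

  with open("customer_addresses.csv", newline="") as src, \
       open("au_records.csv", "w", newline="") as au, \
       open("overseas_records.csv", "w", newline="") as ovs:
      reader = csv.DictReader(src)
      au_writer = csv.DictWriter(au, fieldnames=reader.fieldnames)
      ovs_writer = csv.DictWriter(ovs, fieldnames=reader.fieldnames)
      au_writer.writeheader()
      ovs_writer.writeheader()
      for row in reader:
          # "AU" records go to the Australian file; everything else is
          # treated as overseas and later profiled with other rule-sets.
          if (row.get("iso_country_code") or "").strip().upper() == "AU":
              au_writer.writerow(row)
          else:
              ovs_writer.writerow(row)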

Lexical Analysis, Pattern-Matching and Populating Data Dictionary Fields

Complex, freeform columns often contain data that can be further subdivided into several explicit or implicit "sub-columns." A column containing personal names, for instance, contains a number of conceptual "sub-columns": name title, first given name, other given name(s), surname and generation. An address is similarly composed of separate and distinct sub-elements: house number, street name, pre or post direction, street type, building or estate name, box type, box value, and so on. With product data a freeform column might contain a product code (e.g. "MPNTS"), a product number (e.g. "505-3634"), and a product description (e.g. "Men's Levi's 505 jeans").

Parsing a complex, freeform field into its constituent parts might be necessary in two circumstances. The first is when the target database contains fields at this "sub-column" level whereas the source system field is a complex column. For instance, if the target contains first name, middle name, surname and generation whereas the source system contains the complete name, then parsing this name becomes necessary. The second is when it's necessary to match records either within a source system or across different source systems. In these situations effectively parsing and standardizing complex fields generates superior matching performance. Rather than having to compare one complex, freeform string to another, parsing enables individual elements to be compared with similar elements (e.g., first name to first name). Standardization also allows the process to "clean up" distinct elements to allow matching 'apples to apples.' In name matching, for instance, nicknames such as Bill and Chuck can be converted to their more formal counterparts prior to matching. That way, you are matching "William" on one record to "William" on another rather than a "Bill" to a "William."

In general, this approach (1) attempts to match the input to a very specific "token pattern" and (2) assigns the data value represented by each "token symbol" (or the standardized version of the data value) to a specific output field. Within the QualityStage tool this is accomplished by use of a proprietary coding system called "Pattern Action Language." The details behind "Pattern Action Language" are beyond the scope of this paper, but the following presents some examples of how this process would generate parsed and standardized output for some input names and addresses:

Illustrative Naming Token Patterns

Name Example #1

Input: Mister William Joseph Collins Junior
Resulting "token" pattern: PFF+G

Example 1 Output
Token | Dictionary (Output) Field | Content of Dictionary Field
P     | Name Prefix               | MR
F     | First Name                | WILLIAM
F     | Middle Name               | JOSEPH
+     | Surname                   | COLLINS
G     | Generation                | JR
      | Name Type                 | I


Name Example #2

Input: Mrs. Ronnie G Coleman-Stahl
Resulting "token" pattern: PFI+-+

Example 2 Output
Token | Dictionary (Output) Field | Content of Dictionary Field
P     | Name Prefix               | MRS
F     | First Name                | VERONICA
I     | Middle Name               | G
+-+   | Surname                   | COLEMAN STAHL
      | Name Type                 | I


Name Example #3

Input: Ziggy K Charles
Resulting "token" pattern: +IF

Example 3 Output
Token | Dictionary (Output) Field | Content of Dictionary Field
+     | First Name                | ZIGGY
I     | Middle Name               | K
F     | Surname                   | CHARLES
      | Name Type                 | I


Name Example #4

Input: Jenny Craig Proprietary Limited
Resulting "token" pattern: F+OO

Example 4 Output
Token | Dictionary (Output) Field | Content of Dictionary Field
F     | Key Word #1               | JENNY
+     | Key Word #2               | CRAIG
OO    | Name Suffix               | PTY LTD
      | Name Type                 | O

Illustrative Address Token Patterns

Address Example #1

Input: 10 Waverly Avenue
Resulting "token" pattern: ^+T

Example 1 Output
Token | Dictionary (Output) Field | Content of Dictionary Field
^     | House #                   | 10
+     | Street Name               | WAVERLY
T     | Street Type               | AVE
      | Address Type              | S


Address Example #2

Input: Unit 15, 754 Queen Victoria Avenue
Resulting "token" pattern: U^,^++T

Example 2 Output
Token | Dictionary (Output) Field | Content of Dictionary Field
U     | Apartment Type            | APT
^     | Apartment Value           | 15
^     | House #                   | 754
++    | Street Name               | QUEEN VICTORIA
T     | Street Type               | AVE
      | Address Type              | S


Address Example #3

Input: PO BOX 87C
Resulting "token" pattern: BB>

Example 3 Output
Token | Dictionary (Output) Field | Content of Dictionary Field
BB    | Box Type                  | BOX
>     | Box Value                 | 87C
      | Address Type              | B

In the examples above the final field is a "business intelligence" field designed to provide feedback regarding the nature of the output. The "Name Type" field tells you whether the rule-set classified the input name as either an individual ("I") or an organization ("O"). The "Address Type" field indicates what key address fields the rule-set was able to populate (e.g. "S" when street name is populated, "B" when just a box type and value are populated, "U" when just an apartment/unit type and value are populated). Although usually not involved in the final load process the "business intelligence" fields are very valuable when analyzing output and can also play a role in match processing.
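The sketch below illustrates the general "match a specific token pattern, then bucket each value into a dictionary field" idea in plain Python. It covers only the pattern of Address Example #1 and merely stands in for QualityStage's Pattern Action Language, which works quite differently in detail; the function and field names are illustrative.

  STREET_TYPES = {"AVENUE": "AVE", "AVE": "AVE", "STREET": "ST", "ST": "ST",
                  "ROAD": "RD", "RD": "RD"}

  def bucket_simple_street_address(words):
      """words: the parsed input, e.g. ["10", "Waverly", "Avenue"]."""
      if (len(words) == 3 and words[0].isdigit()
              and words[2].upper() in STREET_TYPES):       # pattern ^ + T
          return {"House #": words[0],
                  "Street Name": words[1].upper(),
                  "Street Type": STREET_TYPES[words[2].upper()],
                  "Address Type": "S"}
      return {"Unhandled Data": " ".join(words)}

  print(bucket_simple_street_address("10 Waverly Avenue".split()))
  # {'House #': '10', 'Street Name': 'WAVERLY', 'Street Type': 'AVE', 'Address Type': 'S'}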

When any given rule-set parses and subsequently "buckets" input data into output dictionary fields, the output could be either (1) completely handled correctly, (2) completely handled but the output is not perfectly correct, (3) partially unhandled, or (4) totally unhandled.

Output is completely handled when all the key tokens identified in the "token pattern" are placed into dictionary fields. All the examples above illustrate cases where the input is completely handled and the output appears to be correct. The output is "correct" in that each input element has been placed into the proper output dictionary field (e.g. an individual's first name has ended up in the "First Name" field, the middle name or initial in the "Middle Name" field and so on). Verifying that the input has been handled correctly by visually inspecting all of the output is rarely feasible; usually a level of comfort that the process is working as designed is reached by inspecting a sample of output records.

Output could also be completely handled, but not necessarily with 100% accuracy. Consider the following example:

Using Token Patterns to Build Business Metadata (a Data Dictionary)

Input Name: Mr. Leo Frederick Junior
Resulting "token" pattern: PFFG

Example 1 Output
Token | Dictionary (Output) Field | Content of Dictionary Field
P     | Name Prefix               | MR
F     | First Name                | LEO
F     | Surname                   | FREDERICK
G     | Generation                | JR
      | Name Type                 | I

In this specific case let's say the "Junior" is actually the person's surname and not a generational indicator. Although the output generated above might be correct for 99% of the cases where the input pattern is "PFFG," in this case the rule-set has made a mistake. In such instances a very customized fix to the rule-set is often required.

Output data can also be a mix of handled data and "unhandled data." This occurs when the input token pattern partially matches a pattern within a given rule-set but the input data also contains additional information the rule-set can't interpret. Consider the following example of a partially handled pattern:

Input: 10 William Street (Near Double Bay Liquor Store)
Resulting "token" pattern: ^+T (+++++)

Example 2 Output
Token | Dictionary (Output) Field | Content of Dictionary Field
^     | House #                   | 10
+     | Street Name               | WILLIAM
T     | Street Type               | ST
      | Unhandled Pattern         | (+++++)
      | Unhandled Data            | (NEAR DOUBLE BAY LIQUOR STORE)
      | Address Type              | S

In this case the address rule-set was able to identify an address ("10 WILLIAM STREET") and managed to put the constituent address elements into the correct output fields. Some "extra" address information, in this example a parenthetical comment regarding the location of this address, was not handled (i.e. it was not assigned to any dictionary output field).

The best approach for dealing with potential cases of "unhandled patterns" and "unhandled data" is the following (a sketch of the first few steps appears after the list):

  • Sort the pattern frequency report to find those records containing an unhandled pattern.
  • Further sort the file so that the most frequently occurring unhandled patterns appear at the top of the report.
  • Inspect the content of the "unhandled data" field for the various records containing a given "unhandled pattern."
  • A business analyst then needs to decide whether the "unhandled data" is worthless information that should be ignored or valuable data that should not be dropped.
  • If the latter, the rule-set needs to be modified. This modification could incorporate the entire unhandled pattern or just a portion of it.
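A sketch of the first three steps, assuming each output record carries "unhandled_pattern" and "unhandled_data" fields (the field names are illustrative):

  from collections import Counter

  def triage_unhandled(records):
      """Return (pattern, count, sample unhandled data) tuples, most
      frequent unhandled patterns first, for a business analyst to review."""
      with_unhandled = [r for r in records if r.get("unhandled_pattern")]
      by_pattern = Counter(r["unhandled_pattern"] for r in with_unhandled)
      report = []
      for pattern, count in by_pattern.most_common():
          samples = [r.get("unhandled_data", "") for r in with_unhandled
                     if r["unhandled_pattern"] == pattern][:5]
          report.append((pattern, count, samples))
      return report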