From MIKE2 Methodology
Data Investigation Prioritization Techniques are used to determine which data should be investigated first on a project. For many projects, the data to be profiled is obvious – it is defined by a specific project scope. The prioritisation principles listed below are for when this scope is vague or when conducting a large-scale profiling exercise and a starting point is required.
The techniques outlined in this article support the content in the Overall Implementation Guide when specific detail is required for prioritizing the Data Investigation process.
Key Objectives of Preparation and Sourcing
- To help prioritise systems for Data Profiling
- To prepare for the execution phase of Data Profiling
- Key Steps for Data Sourcing (these can be applied to either Data Profiling or Re-Engineering)
Key Deliverables of Preparation and Sourcing
- Action Plan for Data Profiling
General Principles for Prioritization
The prioritization exercise should identify the “high priority” systems for investigation. High priority may be linked to the usability of data, the urgency of the operational/reporting need that utilises that data, or the business importance of that data. The criteria used to determine this priority may vary from organisation to organisation, as well as between business units within a single organisation.
Based on the type of project, a number of guidelines can be followed with regards to prioritisation. These are listed as follows:
Business Requirements must drive Prioritization
- An existing data warehouse or data mart may have quality issues and users have stopped using the data. The data in the source systems feeding the warehouse needs to be assessed to identify the source of quality issues. In the case of a new warehouse or mart, only the source data that will be used to feed it needs to be examined.
- In the case of an existing warehouse or mart, two levels of analysis may be warranted. First, examining the data from the source systems. Second, examining the data transformation and integration that occurs as part of the load process. This second case will identify data quality issues that are introduced as part of the software that transforms and integrates source data into target form. If the project includes both source system data analysis and examination of transform/integration issues, sufficient time should be scoped for both activities.
- In the case of CRM or ERP systems, as well as ODS systems that supply data to other operational systems, there is a high rate of quality issues being identified once the system has been initially deployed. For example, marketing engines may generate calls to customers that are no longer actual customers or offer products to customers that they already have. Systems are generally shut down once these problems are identified. Whether data is migrated to these new systems for inclusion in their database or these systems retrieve data on demand from external sources, the source data needs to be analysed.
- It is ideal to pull extracts directly from the source databases. However, this may not be possible in some cases. For example, the source may be in mainframe format only and cannot be directly loaded into the profiling and analysis tools. In this case, the data may need to be pulled from the staging and/or target tables. If this is the only option, then that will have to do; however, the examination process needs to be cognizant of any transformation and integration that occurs in populating these tables.
- For any complex business, data sources will include a combination of both internal and external systems. Data for both these systems should be examined in the same manner.
- In some cases, a client may purchase enrichment source files, such as demographic or logistical information, that are combined with existing client data to provide more robust data. It is generally expected that this is quality information; however, this should not be assumed. Enrichment data should be examined along with internally generated data to provide a complete view into information quality.
- Source files do not always correlate with each other in terms of timeframes of the extract. One of the key requirements for effective data analysis is for the full data set to apply over a logically consistent timeframe. Examining data that was produced for a full month is usually sufficient for analysis. However, in some environments, the data sets are not all accessible in the same timeframe. Transaction extracts may only exist for a few days, whereas customer account data will be available for the full month. A useful test for customer identification when suspect duplicates exist is to examine the transaction logs to determine if the customer transactions apply to common account or service information. If this data does not correspond to a common timeframe, then it compromises the usefulness of this test.
- Extracts from source systems should be a full extract of data as opposed to just incremental changes used to feed a target source.
- One of the greatest challenges that we face in conducting data quality analysis is actually receiving the source data. Due to the challenges listed above, it often takes some degree of investigation, including trial and error, to get the extracts right. It is recommended that as much advance time as possible be given to getting extracts so that the data analysis can start when planned.
- Key parameters that will drive the amount of effort required to perform a TBA include the number of source systems, record set size, and key data elements. This information should be gathered during the prioritisation effort to refine the effort estimates to complete the TBA. If the TBA is fixed fee/fixed duration, this is particularly important. Recommendations regarding the number of source systems and number of records to be examined should also be included.
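The requirement above that all extracts apply over a logically consistent timeframe can be checked before analysis starts. The following is a minimal sketch, assuming each extract's coverage is known as a start and end date; the function name `common_window`, the extract names, and the dates are hypothetical and purely illustrative.

```python
from datetime import date


def common_window(extracts):
    """Return the (start, end) date range covered by every extract, or None.

    `extracts` maps an extract name to a (start, end) tuple of dates.
    """
    latest_start = max(start for start, _ in extracts.values())
    earliest_end = min(end for _, end in extracts.values())
    if latest_start > earliest_end:
        return None  # the extracts share no common period
    return latest_start, earliest_end


# Hypothetical example: account data for a full month, transactions for a few days.
extracts = {
    "customer_accounts": (date(2024, 3, 1), date(2024, 3, 31)),
    "transactions": (date(2024, 3, 27), date(2024, 3, 31)),
}

window = common_window(extracts)
if window is None:
    print("Extracts do not overlap; cross-file tests are not meaningful.")
else:
    days = (window[1] - window[0]).days + 1
    print(f"Common window: {window[0]} to {window[1]} ({days} days)")
```

A short common window, as in this example, indicates that tests relying on both files (such as matching suspect duplicate customers against transaction logs) can only be trusted for that period.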
The relationship of the business requirements to the source systems is shown in the model to the right. A model like this is most essential for conducting prioritisation for a large-scale project; for smaller projects, the priorities tend to be obvious.
Factors Impacting Prioritization
As can be seen above, there are many variables that can impact prioritisation around planning. Whereas business priorities should provide the key driver, prioritisation of systems is often a balancing act against a number of factors:
The complexity of these factors provides some of the rationale behind the Action Plan for profiling. It is critical that project sponsors understand how systems have been prioritised and the impact this will have on long-term programmes.
This initial data source investigation is not intended to be exhaustive. It is intended only to provide initial understanding of the source system. Techniques to assist in completing this in a short timeframe include:
- Clear project sponsorship supporting the initiative
- Early confirmation of appropriate source system contact people
- Early communication of “showstopper” issues to project manager; this may speed up decision to remove a candidate from the list
Whereas the IM QuickScan survey applies to information management maturity across the enterprise, the Tools-based Assessment should be focused around a data subject area (e.g. customer data) or business process area (e.g. customer credit management) in order to gain maximum traction in any area. This same process can be repeated for other data subject areas or business process areas. The results of the QuickScan survey may be used to help prioritise the areas to focus on in the initial stages.
Steps that can be followed for Prioritization
This section provides some of the key steps in the prioritisation process. Whereas the MIKE2 Overall Implementation Guide provides the overall approach, these steps provide some additional detail around the key steps required for prioritisation and data sourcing.
Ensuring the availability of an appropriate environment is also an important part of the preparation process for profiling. The MIKE2 Overall Implementation Guide should be the point of reference for steps in this process.
These steps do not need to be applied serially and some steps may not be required. The objective should be to prioritise the data required, get a base understanding of the systems and key contacts, and identify the key data extracts. If this can be done whilst skipping some of the steps for prioritisation (which is often the case), that is fine.
Collect information for source system analysis
- Objective
This step is to customise the standard questionnaire to suit the client/project specific objectives and to distribute it to the source system contact people. These questions should aim to provide:
- A general understanding of what data is captured in the system in the context of the data required to fulfil the high level business requirements
- Identification of issues which may impact the investigation or loading of data from the source system, e.g. other projects, resource availability, complexity of data structures, etc.
This questionnaire is sent out prior to the focused business and technical interviews to help prepare the team for the more detailed interview process.
- Process
- The source system questionnaire should be supplemented with relevant questions specific to the client/project. For example, if the data investigation is focused on customer data, it may be appropriate to ask questions about the types of customers held in the system, customer-account structures, etc. These questions will need to cover data that will fulfil the business requirements.
- The questionnaire should then be distributed along with a deadline for response and a request for the initial interview
- Review responses in preparation for follow-on meetings
- Output
- Source system contact person identified and notified
- At the completion of this step of the Analysis Methodology, the team will have collated and analysed information pertaining to the current and future state of the source systems data, processes and business rules that should be housed in a central metadata repository/library
Initial Briefing on Tools Based Profiling
- Objective
This step is to work with the source system owners to prepare for data profiling and ensure an effective start to the exercise. It is recommended that this occur at least one month prior to the initiation of the tools based assessment. This is done in parallel with the interviewing based assessment. The objective of this session is to provide an overview of the project to the source system owners and to communicate what will be required from them during the project.
This session may either be in a group or in a series of one-on-one interviews. It can also be conducted during the interviewing-based assessment.
- Input
- Process
Key Steps in the process include:
- Prior to the interview, prepare a project overview to distribute to the source system owners
- Walkthrough overview of the project and expectations
- Walkthrough questionnaire and request additional information noted in the questionnaire, e.g. data dictionary, data model, etc.
- Determine candidate entities in the source system that contain the data to fulfil the high level business requirements.
- Output
Initial information gathered about source system to use in the prioritisation process
Prioritise Data Investigation Requirements
- Objective
This step is to analyse the information gathered to determine the priority of source systems to be investigated further in the tools based assessment.
- Input
- Initial Interviews
- High Level Information Requirements
- Process
Key Steps in the process include:
- Collate results of the questionnaires
- Map source system candidate entities to the high level requirements, requesting additional information from source system owners where required
- Make initial assessment of source system complexity based on responses
- Based on complexity and usefulness, recommend and negotiate the priority of systems to investigate with the project team and project sponsors.
- Output
- Identification of source systems to proceed with in the tools based assessment. Note that depending on the number of systems to investigate this may include several phases.
- Initial assessment of data available to fulfil the high level business requirements
- Initial indication of factors that may impact the effort/timeline for the tools based assessment and data load
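The mapping and ranking steps above can be sketched in code. This is only an illustration under stated assumptions: the `SourceSystem` structure, the 1 to 5 complexity scale, the requirement identifiers, and the ranking rule (more requirements covered first, then lower complexity) are all hypothetical and are not prescribed by the methodology. The result is intended as a starting point for negotiation with the project team and sponsors, not as a decision.

```python
from dataclasses import dataclass, field


@dataclass
class SourceSystem:
    name: str
    complexity: int  # hypothetical scale: 1 (simple) to 5 (complex)
    # Maps each candidate entity to the requirement ids it helps fulfil.
    entity_requirements: dict = field(default_factory=dict)

    def requirements_covered(self):
        covered = set()
        for requirement_ids in self.entity_requirements.values():
            covered.update(requirement_ids)
        return covered


def rank_systems(systems):
    """Order systems by requirements covered (descending), then complexity (ascending)."""
    return sorted(
        systems,
        key=lambda s: (-len(s.requirements_covered()), s.complexity, s.name),
    )


def uncovered_requirements(requirements, systems):
    """List requirements that no candidate entity in any system maps to."""
    covered = set()
    for system in systems:
        covered |= system.requirements_covered()
    return sorted(set(requirements) - covered)


# Hypothetical questionnaire results.
systems = [
    SourceSystem("CRM", 4, {"customer": {"R1", "R2"}}),
    SourceSystem("Billing", 2, {"account": {"R2"}, "invoice": {"R3"}}),
    SourceSystem("Legacy", 5, {"contract": {"R3"}}),
]

ranked = [s.name for s in rank_systems(systems)]
gaps = uncovered_requirements(["R1", "R2", "R3", "R4"], systems)
print("Suggested order:", ranked)
print("Requirements with no mapped entity:", gaps)
```

Requirements that appear in the gap list are candidates for follow-up questions to the source system owners, in line with the mapping step above.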
Source System Functional Review
- Objective
The next step is to gather information about the current and future state of the source systems in terms of:
- Available Data
- Supported Business Processes
- Known Business rules
This will involve interviewing the Source System Subject Matter Expert (SME), gathering relevant documentation, and analysing the information provided. The analyst will need to drive expertise out from the SME through formalized interviewing techniques conducted in an efficient fashion.
- Input
- Source System engagement has been completed
- Process
The following activities should be conducted during this step:
- Detailed discussion/presentation on source system business by Source System Business SME.
- Site visit with the business users who input data into the system
- Bring set of interviewing questions to be used in meeting
- SMEs present a list of known data quality issues
Interviews may involve multiple iterations.
- Output
- Understanding of the business capabilities of the system.
Source System Technical Review
- Objective
This is a follow-up step to gain a better understanding of the technical environment for the source system. This interview is focused on the following:
- Gathering sufficient information to understand the technical aspects of accessing the data for profiling, e.g. confirmation of data formats, access to production data, data volumes, etc.
- Confirm the key technical points of contact for this source system
- Understanding what metadata is available for this source system
- Understanding what other initiatives are in progress/planned which may impact this source system
- Gather information about key data gaps and perceived data quality issues
- Understanding the key data concepts in the system in the context of the data to be investigated, e.g. a data profiling exercise focusing on customer data may include questions about the structure of customer hierarchies/customer-account structures
- Input
- List of candidate source systems to investigate
- Contact person for each candidate system identified and notified by project sponsor or project manager
- Prioritised high-level business requirements for how the data will be utilised. Note that it is possible to perform the prioritisation of these in parallel with the data source investigation. It is also recommended that these requirements be clearly related to expected business benefits/business drivers
- Process
Key Steps in the process include:
- Perform interview with source system contact
- Analyse results, document issues
- Initiate preparation for profiling for confirmed source systems
- Output
At the completion of this step, the team will have agreed source system involvement during the analysis phase and will have communicated expectations to the source system owner for the commitments during the course of the project.
Outcomes include:
- Agreed list of candidate tables to be provided for analysis with agreed format, logistics and timing for provision of these
- Confirmation of resource commitments from the source system team and identification of other resources/project teams which will need to be contacted during the data profiling exercise.
- Schedule regular business and technical SME meetings for the duration of the tools based assessment
Obtain and prepare source system data
- Objective
Based on the discussions with the source system owner in the initiate step, access to one of the following data sources will be made available:
- The source system database
- A copy of the database
To ensure the completeness and accuracy of the data quality assessment this data source must include full production volumes and include all columns in the candidate tables. Finally, connectivity is established and tested from the data source to profiling tools in order to profile the data. This may involve lodging source system permission forms.
- Input
- Profiling Environment in place
- System scope is identified
- Process
- Determine whether a copy of the database exists with the source system
- Determine how the project implementation team will be able to access data from source systems (e.g. will it be a copy, direct access, etc.)
- If the project team cannot access a copy of the source system database, determine whether an extract already exists and whether it contains the data required for profiling
- Establish connection from profiling tool to data source
- Output
The results of this step will be the availability of source system data for further analysis.
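Where a copy of the database is used, the requirement for full production volumes and all columns in the candidate tables can be verified once connectivity is established. The sketch below assumes both the source and the copy are reachable through Python's standard `sqlite3` module purely for illustration; a real source system would use its own database driver, and the table name, columns, and helper functions shown are hypothetical.

```python
import sqlite3


def table_shape(conn, table):
    """Return (column names, row count) for a table."""
    if not table.isidentifier():
        raise ValueError(f"unexpected table name: {table!r}")
    # PRAGMA table_info yields one row per column; index 1 is the column name.
    columns = [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]
    (row_count,) = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()
    return columns, row_count


def compare_copy(source, copy, tables):
    """Report missing columns and row count differences between source and copy."""
    problems = []
    for table in tables:
        src_cols, src_rows = table_shape(source, table)
        cp_cols, cp_rows = table_shape(copy, table)
        missing = [c for c in src_cols if c not in cp_cols]
        if missing:
            problems.append(f"{table}: missing columns {missing}")
        if cp_rows != src_rows:
            problems.append(f"{table}: {cp_rows} rows in copy vs {src_rows} in source")
    return problems


# Hypothetical in-memory databases standing in for the source system and its copy.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE customer (id INTEGER, name TEXT, postcode TEXT)")
source.executemany(
    "INSERT INTO customer VALUES (?, ?, ?)",
    [(1, "Ann", "2000"), (2, "Bob", "3000"), (3, "Cy", "4000")],
)

copy = sqlite3.connect(":memory:")
copy.execute("CREATE TABLE customer (id INTEGER, name TEXT)")
copy.executemany("INSERT INTO customer VALUES (?, ?)", [(1, "Ann"), (2, "Bob")])

problems = compare_copy(source, copy, ["customer"])
for problem in problems:
    print(problem)
```

An empty result suggests the copy is structurally complete for the candidate tables; it does not by itself confirm that the values in the copy match production.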
Create Action Plan for Tools Based Analysis
- Objective
Based on the priorities identified in the IM QuickScan and Initial Data Source Investigation, an action plan is developed for the next phases of work.
- Input
- Prioritised Data Requirements
- Evaluation of required capabilities vs. vendor profiling tools
- Process
Key Steps in the process include:
- Tailor the standard project plan for tool-based data profiling as necessary
- Confirm task estimates for tool-based data profiling based on the information gathered to date
- Gain confirmation to proceed with next steps
- Output
Action Plan for Tools Based Analysis