From MIKE2 Methodology
This section lists a number of High Level Solution Architecture options for Data Investigation and Re-Engineering. The architecture options are part of the overall SAFE Architecture framework and can be used as Supporting Assets for the Activity covering definition of the Future State Vision for Information Management.
Data Profiling as part of the SDLC process
Data Profiling can be done as an early step in the SDLC process to gain a quantitative assessment of the data that will flow between producers and consumers. It helps to remove the uncertainty and assumptions related to Producer Systems. Profiling also continues as an ongoing monitoring process that is put in place once the solution has been implemented. A model for the Data Profiling process is as follows:
The Design-Time Data Profiling Process
- Flat File Source System Data Extract. May be Iteratively Done
- Metadata, Including Transformation Rules, for Use in Analysis
- Includes the Tests and Reference Data for Analysis
- Review Metadata for Completeness, Unique Ids, Descriptions, etc
- If Appropriate, External Data can be Used for Matching / Enrichment
- Profile for Timeliness, Duplication, Accessibility, Completeness, Integrity, and Validity
- Results of Metadata Profiling
- Formal Quantification of Data Quality Profile Results
- Formal Quantification of Metadata Quality Profile Results
- Based on Profiling Issues and Gaps Identified, Recommend Areas for Improvement
This approach to Data Profiling is fairly fundamental and can be incorporated into the other High Level Solution Architecture options listed below.
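As a rough illustration of the profiling step, the sketch below computes three of the measures named above (completeness, duplication and validity) for a single column of an extract. The function name `profile_column`, the sample rows and the four-digit postcode pattern are assumptions made for this example; they do not come from MIKE2 or from any particular profiling tool.

```python
import re
from collections import Counter


def profile_column(values, pattern=None):
    """Return simple quality measures for one column of extracted values."""
    total = len(values)
    # Treat None and the empty string as missing values.
    present = [v for v in values if v not in (None, "")]
    counts = Counter(present)
    # Every occurrence beyond the first counts as a duplicate.
    duplicates = sum(c - 1 for c in counts.values() if c > 1)
    result = {
        "completeness": len(present) / total if total else 0.0,
        "duplicate_count": duplicates,
    }
    if pattern is not None:
        # Validity is measured only over the values that are present.
        valid = sum(1 for v in present if re.fullmatch(pattern, v))
        result["validity"] = valid / len(present) if present else 0.0
    return result


# Hypothetical rows standing in for a flat file source system extract.
rows = [
    {"customer_id": "C001", "postcode": "2000"},
    {"customer_id": "C002", "postcode": ""},
    {"customer_id": "C002", "postcode": "20A0"},
    {"customer_id": "C004", "postcode": "3141"},
]

ids = profile_column([r["customer_id"] for r in rows])
postcodes = profile_column([r["postcode"] for r in rows], pattern=r"\d{4}")
```

Here `ids` reports one duplicate identifier, and `postcodes` reports a completeness of 0.75 with two of the three present values matching the pattern. A real profiling tool would cover the remaining dimensions (timeliness, accessibility, integrity) and would read its tests from reference data rather than from hard-coded patterns.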
A Framework for Ongoing Data Quality
An architecture framework for ongoing data quality management provides an approach where data quality is quantitatively understood during the design process, metadata artifacts are used to drive ETL design, and data quality issues are addressed both in batch and in an ongoing fashion. It is an approach that can be re-used for varying data sets and is therefore particularly important for significant efforts such as IT Transformation. The key aspects of this approach are highlighted below.
High Level Solution Architecture for Ongoing Data Quality Management
Data Profiling and Definition of Metadata Model
The data is extracted from the source and staged on a profiling platform. Using a profiling tool set, the data is then examined and assessed, and its current state is documented. Relevant metadata is assembled and the information is rationalized with an emerging enterprise attribute standard and a current understanding of the key business rules. The current state of data quality (DQ) is documented and a DQ plan is created for the data. Some attributes must be fixed before movement to the target environment while others may be fixed after the move. These judgments are made by looking at the profiling results. The goal is to assess the capability of the current data values and granularity to support the functions proposed for the target environment. Transformations are designed within an ETL tool to migrate the attributes into the Metadata Standards for movement to the Data Quality platform. Many of these same transformations will be reused by the Legacy Interfaces to transform information to the Enterprise Standard for use in ongoing data synchronization.
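One way to picture how profiling results feed the DQ plan is sketched below. The decision rule is an assumption made for illustration only: an attribute whose completeness falls below a threshold is fixed before the move if the target environment requires it, and after the move otherwise. The function name `build_dq_plan`, the 0.95 threshold and the sample attributes are all hypothetical; in practice these judgments also depend on business rules and analyst review.

```python
def build_dq_plan(profile_results, required_attributes, min_completeness=0.95):
    """Classify each attribute as no action, fix before move, or fix after move."""
    plan = {}
    for attribute, measures in profile_results.items():
        if measures["completeness"] >= min_completeness:
            plan[attribute] = "no action"
        elif attribute in required_attributes:
            # The target functions depend on this attribute, so fix it first.
            plan[attribute] = "fix before move"
        else:
            plan[attribute] = "fix after move"
    return plan


# Hypothetical profiling output for three attributes.
profile_results = {
    "customer_id": {"completeness": 1.0},
    "postcode": {"completeness": 0.75},
    "middle_name": {"completeness": 0.40},
}

plan = build_dq_plan(
    profile_results,
    required_attributes={"customer_id", "postcode"},
)
```

With these inputs, `postcode` is scheduled for repair before migration and `middle_name` is deferred until after it.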
Producer to Consumer Mapping using a Metadata-Driven Approach
At this point the attributes have been mapped to a common Enterprise standard. An ETL-like tool is used to migrate the data to the enterprise standard implemented on the Data Quality platform. Those aspects of the Data Quality program that are to be implemented before migration to the production environments are executed. All data quality functions (e.g., validity, missing values, de-duping) are performed after the data has been transformed to the attribute standard. Transformations and migration capabilities are constructed to migrate the data to the production data structures. All the business rules and data rules are captured as metadata because this knowledge and these capabilities are needed to maintain data synchronization and quality on an ongoing basis in the production environment (i.e., many conversion requirements are ongoing). Also at this point, any stored procedures in the source database are targeted for new database procedures in the target system, functions in the target application, or a common service available to all.
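A minimal sketch of the metadata-driven idea follows: the producer-to-consumer mapping and its transformation rules are held as data, and one generic function applies them. The source field names (`CUST_NM`, `PH_NO`), the target attribute names and the rule names are invented for this example and are not part of any enterprise standard described here.

```python
# Named transformation rules; the names are what the mapping metadata refers to.
TRANSFORMS = {
    "strip": str.strip,
    "upper": str.upper,
    "digits_only": lambda s: "".join(ch for ch in s if ch.isdigit()),
}

# Mapping metadata: source attribute, target attribute, ordered rule names.
MAPPING = [
    {"source": "CUST_NM", "target": "customer_name", "rules": ["strip", "upper"]},
    {"source": "PH_NO", "target": "phone_number", "rules": ["strip", "digits_only"]},
]


def to_enterprise_standard(record, mapping=MAPPING):
    """Apply the mapping metadata to one source record."""
    out = {}
    for entry in mapping:
        value = record.get(entry["source"], "")
        for rule_name in entry["rules"]:
            value = TRANSFORMS[rule_name](value)
        out[entry["target"]] = value
    return out


legacy = {"CUST_NM": "  Ada Lovelace ", "PH_NO": "(02) 5550-1234"}
standard = to_enterprise_standard(legacy)
```

Because the mapping is data rather than code, the same entries could later drive the Legacy Interfaces that keep systems synchronized, which is the reuse the text describes.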
Testing the Conversion Capabilities
Step 3 focuses on the messaging, object and interface standards. The data is migrated to a staging area which has the same structure as the target production environments. The interfaces, messages and objects are all validated for correctness, performance and compliance with the DQ plan. The knowledge of the business rules, objects and XML definitions is positioned for re-use in the production environments. Much of the metadata repository (or repositories) is populated during these activities. The formulation of an ongoing production data mediation platform is a key outcome of this step.
A Re-Usable Platform for Data Quality Management
Nearly all of the knowledge and functionality acquired in the conversion process becomes reusable in the production environment on an ongoing basis. In this environment, rules and standards may be invoked on a record-at-a-time basis (compared to batch migration) as a data service for use by any application.
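The reuse described above can be sketched as a single rule set with two entry points: one that validates a whole batch during migration and one that validates a single record, as a data service would. The rule names and the functions `validate_record` and `validate_batch` are assumptions for illustration, not part of the MIKE2 platform.

```python
# Rules captured once; each takes a record and returns True when it passes.
RULES = {
    "customer_id_present": lambda r: bool(r.get("customer_id")),
    "postcode_is_4_digits": lambda r: (
        r.get("postcode", "").isdigit() and len(r.get("postcode", "")) == 4
    ),
}


def validate_record(record, rules=RULES):
    """Record-at-a-time entry point: return the names of the failed rules."""
    return [name for name, check in rules.items() if not check(record)]


def validate_batch(records, rules=RULES):
    """Batch entry point: map the index of each failing record to its failures."""
    report = {}
    for index, record in enumerate(records):
        failures = validate_record(record, rules)
        if failures:
            report[index] = failures
    return report


# Batch use during migration.
batch_report = validate_batch([
    {"customer_id": "C001", "postcode": "2000"},
    {"customer_id": "", "postcode": "20A0"},
])

# Record-at-a-time use in production, for example behind a service call.
single_result = validate_record({"customer_id": "C009", "postcode": "3141"})
```

The batch function is only a loop around the per-record function, so a rule added to `RULES` takes effect in both modes without further change.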
Targets of Opportunity for New Services
The data is migrated to the actual target environments. This step may also include a set of activities for ‘user acceptance testing’, with subsequent migration to production. Some of the DQ and data mediation work will become ongoing processes rather than a ‘one time’ move. Because aspects of the new and old systems are used concurrently, there is an ongoing requirement for data synchronization, which is met by the ongoing data mediation platform. Concurrent use of aspects of the new and old systems is addressed in the transformation plan. This provides an iterative capability to perform lower risk transformation projects.
New targets of opportunity then emerge as candidates for common services enabled by the Data Mediation platform, which was formulated during the conversion process. This becomes a ‘Forever Re-Useable’ platform.