Open Framework, Information Management Strategy & Collaborative Governance | Data & Social Methodology - MIKE2.0 Methodology

Archive for the ‘Information Governance’ Category

by: Robert.hillard
07  Jul  2010

Will the growth ever stop?

Some time ago I posted “To Pluto and Back”, drawing on Richard Wray’s piece in The Guardian about the constant growth in the generation and storage of data.  Anyone who has ever dealt with any system knows that the amount of data coming into the enterprise is simply staggering.  Many of my clients are worried that any information management strategy is doomed to obsolescence by infinite growth in the coming years.

In fact, while the amount of raw data is amazing, and would have been unimaginable just two decades ago, it is neither infinite nor really surprising.  When the price of storage dropped to about US$1 per megabyte, that triggered a change in attitude from “keep the minimum necessary” to “keep as much as possible”.  It has taken time for approaches to business to catch up with what is possible, and the rate at which organisations have learnt to use this data has generally lagged the technology by a small amount.  However, we are not seeing a trend towards endless growth in data, but rather a change in business that is as significant as the Industrial Revolution.

In the case of the Industrial Revolution it took a number of generations for business practices to mature, including labour practices and government planning.  As we move through the Information Revolution we are going through similar changes, including in the practices that affect people and in how governments manage economies that are now global.  At some point we will have a handle on these new practices, and at that time there will be a new norm for data, which will stabilise at some as yet unreached volume.

In my opinion, that point is not too far away (in the timeframes of history).  Assuming full motion video associated with every business activity (requiring maybe 1GB for everything that previously required just 1KB), then we could see that new business and data norm at some time between 2020 and 2030.  That’s well within my working life and today I am encouraging my clients to prepare data and information management strategies that are geared for this future rather than infinite growth.

Category: Information Governance
2 Comments »

by: Bsomich
16  Jun  2010

The Benefits of Information Governance.

Information Governance is defined by Gartner as “the effective and efficient use of information in enabling an organization to achieve its goals.” 

Organizations that practice a solid information governance strategy are likely to experience the following benefits:

  • Greater protection of sensitive data
  • Access to critical business intelligence in a timely manner
  • Better information sharing and decision making
  • Increased value of information throughout its lifecycle
  • Decreased information management costs and risks

Many IM professionals are aware that the success of information governance is largely dependent on how often information policies are evaluated and adapted as business priorities and market conditions evolve.

Still, many organizations do not have a formal information governance strategy in place.   An EIU research survey of senior executives from leading companies around the world found that nearly two-thirds (62%) of companies have no formal information governance program in place, a concerning trend that can leave many corporations underperforming and open to preventable risks to sensitive information.

In your experience, why is this?  Is it a lack of time and resources?  Or a failure to acknowledge the benefits of information governance and the risks of going without it?

Category: Information Governance
5 Comments »

by: Sean.mcclowry
19  Oct  2009

Collaborative Governance

Over the past few years I’ve talked quite a bit about collaborative governance (or governance 2.0). I’m starting to see this more often in organizations as ways to improve communications, measurement, solution techniques and accountability. The conceptual architecture of a system to support this approach is shown below:

[Figure: conceptual architecture for collaborative governance]

Illustrative User Story: Data management change control
The users in this case are reporting analysts responsible for developing and running crucial analysis for the capital markets group.
In the current environment, users of downstream analytical and reporting systems often face issues due to changes in upstream systems of record. In the new environment, better communications and use of the metadata repository will help prevent data change control issues.

1. Once an issue is found, data stewards communicate with the business and technical teams to resolve it. Communications are available in a discussion system for others to find.

2. Clear business reporting makes it easy for data stewards and business leaders to see when issues have occurred and what their impacts are.

3. The Data Quality system flags the error, which triggers a workflow request for a Data Stewardship process

4. Responsible data stewards receive the notification and address the issue

5. Issues are logged in a defect management system and are tracked by data stewards.
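
The flow described in the steps above (flag, notify, discuss, track) could be sketched roughly as follows. This is only an illustration: the class and function names (Issue, DefectLog, flag_error) and the steward assignments are hypothetical and do not come from MIKE2.0 or any particular product.

```python
from dataclasses import dataclass, field


@dataclass
class Issue:
    """A data quality issue tracked in a (hypothetical) defect log."""
    dataset: str
    description: str
    status: str = "open"
    discussion: list = field(default_factory=list)  # searchable conversation


class DefectLog:
    """Stands in for the defect management system."""

    def __init__(self):
        self.issues = []

    def log(self, issue):
        self.issues.append(issue)
        return issue


def flag_error(defect_log, stewards, dataset, description):
    """Log the issue and work out which stewards should be notified."""
    issue = defect_log.log(Issue(dataset, description))
    notified = [name for name, owned in stewards.items() if dataset in owned]
    return issue, notified


# Hypothetical stewardship assignments: steward name -> data sets owned.
stewards = {"alice": {"trades"}, "bob": {"counterparties"}}
log = DefectLog()

issue, notified = flag_error(log, stewards, "trades", "upstream column renamed")
issue.discussion.append("alice: confirmed with the source system team")
issue.status = "resolved"
```

Keeping the discussion on the issue itself is one simple way to make the communications findable by others later.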

Category: Enterprise2.0, Information Governance
No Comments »

by: Robert.hillard
22  Sep  2009

The power of the crowd can improve your data quality

Well thought through online strategies can do so much more than deliver high quality web sites for internal and external users. They can dramatically improve some of your business fundamentals. There are few things more fundamental than the quality of your data.

When people think of data quality they often focus first on customer data. One of the best ways to ensure that customer data is right is to provide a way for your own customers to update their details online. On its own, this is an important capability, but to be really effective it needs to be linked to something that the customer regularly does on the web, such as reviewing their accounts, orders or other interactions with your organisation. Truly effective businesses make updating customer details part of every interaction and available to all stakeholders in the customer, effectively building a Facebook-like facility for their customers identifying relationships (friends), preferences and activities.

Apart from enhanced customer service, it is worth remembering that it is much harder to maintain a fraudulent identity when you are connected through multiple relationships and you have to maintain an exponential number of fronts.

Business data includes much more than just customer details. Online collaboration both inside and outside the enterprise can enhance almost all data in some way. One of the most common problems businesses face is maintaining an accurate understanding of the definition of complex business terminology. Every organisation develops their own language and expects staff, customers and business partners to understand it. Worse, few maintain a dictionary of this language.

Consider creating such a dictionary, with some components visible internally, other parts visible to business partners and a relevant subset visible to the world in general. To really leverage the power of the web, make this dictionary readily updatable (even using a wiki). While open to misuse, it is unlikely that internal staff or business partners who are easily traced will deliberately abuse the privilege. Online communities have shown that complex topics attract genuinely interested contributors who can often provide a better explanation to their peers than you could hope to publish either from an insight or simple labour perspective.

Finally having learnt to use the web to better maintain customer data and your data dictionary, it rapidly becomes obvious that many datasets would be candidates to be open to a wider community for monitoring, comment or even enhancement. Consider lists of branches, community contacts and products. In the last case, suppliers sometimes make changes which flow through your supply chain without being updated in online catalogues.

If there is one thing we’ve learnt, it is that the fear we feel about opening our content up for collaboration is often disproportionate to the real risk of misuse. If you succumb to this fear without carefully considering what you are worried about, then you’ll miss out on the power that the crowd can bring to your business.

Readers interested in these concepts should read further about the intersection of Enterprise 2.0 and Information Management in MIKE2.0, in particular the MIKE2.0 Enterprise 2.0 Solution Offering.

Category: Data Quality, Enterprise Data Management, Enterprise2.0, Information Development, Information Governance, Information Management, Information Strategy, MIKE2.0
2 Comments »

by: Larry.dubov
14  Aug  2009

Quantifying Data Quality with Information Theory

Information Theory Approach to Data Quality for MDM

Introduction

Over the past decade data quality has been a major focus for data management professionals, data governance organizations, and other data quality stakeholders across the enterprise. Still the quality of data remains low for many organizations. To a considerable extent this is caused by a lack of scientifically or at least consistently defined data quality metrics. Data professionals still lack a common methodology that would enable them to measure data quality objectively in terms of scientifically defined metrics and compare data sets in terms of their quality across systems, departments and corporations.

 

Even though many data profiling metrics exist, their usage is not scientifically justified. Consequently enterprises and their departments apply their own standards or apply no standards at all.

 

As a result, regulatory agencies, executive management and data governance organizations lack a standard, objective and scientifically defined way to articulate data quality requirements and measure data quality improvement progress. Because data quality is so elusive, the job performance of the enterprise roles responsible for it lacks consistently defined criteria, which ultimately limits progress in data quality improvement.

 

A quantitative approach to data quality, if developed and adopted by the data management community, would enable data professionals to better prioritize data quality issues and take corrective actions proactively and efficiently.

 

In this article we will discuss a scientific approach to data quality for MDM based on Information Theory. This approach seems to be a good candidate to address the aforementioned problem.

 

Approaches to Data Quality

At a high level there are two well-known and broadly used approaches to data quality. Typically both of them are used to a certain degree by every enterprise.

 

The first approach is mostly application driven and is often referred to as a “fit-for-purpose” approach. Business users often determine that certain application queries or reports do not return the right data. For instance, if a query that is supposed to fetch the top 10 Q2 customers does not return some of the customers the business expects to see, in-depth data analysis follows. The data analysis may determine that some customer records are duplicated and some transaction records have incorrect or missing transaction dates. This type of finding can trigger activities aimed at understanding the data issues and taking corrective actions.

 

An advantage of this approach to data quality is that it is aligned with the tactical needs of business functions, groups and departments. A disadvantage is that it addresses data quality issues reactively, in response to a business request or even a complaint. Some data quality issues may not be easy to discover, and business users cannot always decide which report is right and which one is wrong. The organization may eventually conclude that its data is bad but would not be able to indicate what exactly needs to be fixed in the data, which limits IT’s ability to fix the issues. When multiple LOBs and functions across the enterprise struggle with their specific data quality issues separately, it is difficult to quantify the overall state of data quality and define the priorities with which data quality problems are to be addressed by the enterprise.

 

The second approach is based on data profiling. Data profiling tools are intended to make a data quality improvement process more proactive and measurable. A number of data profiling metrics are typically introduced to screen for missing and invalid attributes, duplicate records, duplicate attribute values that are supposed to be unique, frequency of attributes, cardinality of attributes and their allowed values, standardization and validation of certain data formats for simple and complex attribute types, violations of referential integrity, etc. A limitation of data profiling techniques is that additional analysis is required to understand which of the metrics are most important for the business and why. It may not be easy to come up with a definitive answer and translate it into a data quality improvement action plan. The variety of data profiling metrics is not based on science but rather driven by the variety of ways relational database technology can report on data quality issues.
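
To make a few of the profiling metrics above concrete, here is a minimal sketch in Python that counts missing values, cardinality, repeated values in a column that is supposed to be unique, and fully duplicated records. The sample rows and the function names (profile_column, duplicate_records) are invented for illustration; real profiling tools cover far more than this.

```python
from collections import Counter


def profile_column(values):
    """Return a few simple profiling metrics for one column of values."""
    total = len(values)
    present = [v for v in values if v not in (None, "")]  # treat "" as missing
    counts = Counter(present)
    return {
        "rows": total,
        "missing": total - len(present),
        "cardinality": len(counts),  # number of distinct non-missing values
        "duplicated_values": sorted(v for v, n in counts.items() if n > 1),
    }


def duplicate_records(rows):
    """Return each record that appears more than once, listed once."""
    counts = Counter(tuple(r) for r in rows)
    return [list(r) for r, n in counts.items() if n > 1]


# Hypothetical customer rows: (customer id, surname, state).
rows = [
    ("C001", "Smith", "NJ"),
    ("C002", "Jones", ""),
    ("C003", "Smith", "CA"),
    ("C002", "Jones", ""),
]

id_profile = profile_column([r[0] for r in rows])
state_profile = profile_column([r[2] for r in rows])
dupes = duplicate_records(rows)
```

Note that the numbers alone do not say which finding matters most to the business, which is exactly the limitation described above.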

 

Each of the two approaches above has its niche and significance. When the quality of master data is in question an alternative and more strategic approach can be considered by data governance organizations. This approach avoids detailed analysis of business applications while providing a solid scientific foundation for its metrics.

Information Theory Approach to Data Quality for MDM  

Master data are those data that are foundational to business processes, are usually widely distributed, contribute directly to the success of an organization when well managed, and pose the most risk when not well managed. Customer, Patient, Citizen, Member, Client, Broker, Product, Financial Instrument and Drug are entities often referred to as master data entities, while the company-specific selection of master entities is driven by the enterprise’s business and focus.

 

A Master Data Service (MDS) defines its primary function as the creation of the “golden view” of the master entities. We will assume that the MDS has successfully created and maintains the “golden view” of entity F in the data hub. This “golden record” can be dynamic or persistent. There exist a number of data sources across the enterprise with data corresponding to domain F. These include the source systems that feed the data hub and other data sources that may not be integrated with the data hub. We will define an external data set f whose data quality is to be quantified with respect to F. For the purpose of this discussion f can represent any data set, such as a single data source or multiple sources.

 

Our goal is to compare the source data set f with the entity data set F. The data quality of the data set f will be characterized by how well it represents the benchmark entity F defined as the “golden view” for the data in domain F. We are making an assumption here that the “golden view” was created algorithmically and then validated by the data stewards.

 

In Information Theory the information quantity associated with the entity F is expressed in terms of the entropy:

                                              

                    H(F) = – ∑ Pk log Pk,                                                                                            (1)   

                                              

where Pk are the probabilities of the attribute (token) values in the “golden” data set F. Index “k” runs over the attribute values found across all records and all attributes in F. The base of the log function is 2.

 

H(F) represents the quantity of information in the “golden” representation of entity F.
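
As a rough illustration of equation (1), the sketch below computes H(F) for the small four-record “golden” data set used in the illustrative examples later in this article. It assumes, following the probability tables in those examples, that each probability is the relative frequency of a value within its attribute and that the per-attribute entropies are added together. The function names are made up for this sketch.

```python
import math
from collections import Counter


def attribute_entropy(values):
    """Entropy (base 2) of one attribute, using relative frequencies."""
    counts = Counter(values)
    total = len(values)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def dataset_entropy(records, attributes):
    """Assumed reading of equation (1): sum the entropy of each attribute."""
    return sum(attribute_entropy([r[a] for r in records]) for a in attributes)


golden = [
    {"name": "Larry", "state": "NJ"},
    {"name": "Jim", "state": "GA"},
    {"name": "Scott", "state": "CA"},
    {"name": "Marty", "state": "CA"},
]

# Names contribute 2.0 bits and states 1.5 bits, giving the article's 3.5.
h_golden = dataset_entropy(golden, ["name", "state"])
```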

 

Similarly for the comparison data set f

 

                    H(f) = – ∑ pi log pi,                                                                                            (2)   

 

We will use small “p” for the probabilities associated with f while capital letter “P” is used for the probabilities characterizing the “golden” entity record.

 

Mutual entropy J(f,F), usually called mutual information in Information Theory, characterizes how well f represents F.

                   

J(f,F) = H(f) + H(F) – H(f,F)                                                                        (3)   

 

In (3) H(f,F) is the joint entropy of f and F. It is expressed in terms of probabilities of combined events, e.g. the probability that the name = “Smith” in the “golden record” F and name = “Schmidt” in the source record linked to the same entity. The behavior of J makes this function a good candidate for quantifying the data quality of f with respect to F. When the data quality is low, the correlation between f and F is low. In the extreme case of very low data quality, f doesn’t correlate with F and the two variables are independent. Then

 

                    H(f,F) = H(f) + H(F)                                                                                      (4)   

 

and

 

                    J(f,F) = 0                                                                                                       (5)   

 

If f represents F extremely well, e.g. f = F, then H(f) = H(F) = H(f,F) and

 

                    J(f,F) = H(F)                                                                                                  (6)   

 

We define Data Quality of f with respect to F by the following equation:

 

                    DQ(f,F) = J(f,F)/H(F)                                                                                      (7)   

 

With this definition, DQ ranges from 0 to 1. DQ = 0 indicates that the data quality of f is minimal: f does not represent F. DQ = 1 indicates that f represents F perfectly and the data quality of f with respect to F is 100%.
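
The two limiting cases can be checked numerically. The sketch below applies equations (3) and (7) to a single attribute, taking as input a list of (golden value, source value) pairs for linked records; the pairing format, the function names and the sample values are assumptions made for this illustration.

```python
import math
from collections import Counter


def entropy(events):
    """Entropy (base 2) of a list of hashable events."""
    counts = Counter(events)
    total = len(events)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def data_quality(pairs):
    """DQ(f,F) = J(f,F) / H(F) for one attribute.

    pairs holds (golden_value, source_value) for each linked source record.
    """
    golden = [g for g, _ in pairs]
    source = [s for _, s in pairs]
    h_golden = entropy(golden)
    if h_golden == 0:
        raise ValueError("H(F) is zero, so DQ is undefined")
    mutual = entropy(source) + h_golden - entropy(pairs)  # equation (3)
    return mutual / h_golden                              # equation (7)


# f identical to F: every source value matches its golden value.
identical = [("NJ", "NJ"), ("GA", "GA"), ("CA", "CA"), ("CA", "CA")]

# f independent of F: the source value says nothing about the golden value.
independent = [("NJ", "X"), ("NJ", "Y"), ("CA", "X"), ("CA", "Y")]

dq_identical = data_quality(identical)      # J = H(F), so DQ = 1
dq_independent = data_quality(independent)  # J = 0, so DQ = 0
```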

 

The approach can also be used to determine partial attribute/token level data quality. This will provide additional insights into what causes most significant data quality issues.

 

The data quality improvement should be done iteratively. Changes in the data source data may impact the “golden record”. Then equations (1) and (7) are applied again to recalculate the data quantity and data quality characteristics.

 

Conclusion

The article offers an Information Theory based method for quantifying Information Assets and the Data Quality of the Assets through equations (1) and (7). The proposed method leverages the notion of a “golden record” created and maintained in the data hub. The “golden record” is used as the benchmark against which the data quality of other sources is measured.

 

Organizations can leverage this approach to augment their data governance offerings for MDM and make their data governance approach truly unique. The quantitative approach to data quality will ultimately help data governance organizations develop policies based on scientifically defined data quality and quantity metrics.

 

By applying this approach consistently on a number of engagements, over time we will accumulate valuable insights into how metrics (1) and (7) apply to real world data characteristics and scenarios. We will develop good practices defining acceptable data quality thresholds, e.g. it might be a future industry policy for P&C insurance business to keep the quality of Customer data above the 92% mark, which would set a clearly articulated data governance policy based on a scientifically sound approach to data quality metrics.

 

The developed approach can be incorporated in future products to enable data governance and provide data governance organizations with new tooling. Data governance will be able to select information sources and assets to be measured, quantify them according to (1) and (7), set the target metrics for data stewards, measure the progress on an on-going basis and report on the data quality improvement progress.

 

Even though we are mainly focusing on data quality, the quantity of data in equation (1) characterizes the overall significance of a corporate data set from the Information Theory perspective. For M&A the method can be used to measure an additional amount of information that the joint enterprise will have compared to the information owned by the companies separately. The approach developed above will measure both the information acquired due to the difference in the customer bases and the information quantity increment due to better and more precise and useful information about the existing customers.

 

  Simple Illustrative Examples

In this Appendix we will apply the theory developed above to two simple illustrative cases. We will define the “golden” data set F as follows:

 

EID  Name   State
1    Larry  NJ
2    Jim    GA
3    Scott  CA
4    Marty  CA

 

The probabilities of attributes values in F are:

 

Value  Probability (P)  log P  P log P
Larry  0.25             -2     -0.5
Jim    0.25             -2     -0.5
Scott  0.25             -2     -0.5
Marty  0.25             -2     -0.5
NJ     0.25             -2     -0.5
GA     0.25             -2     -0.5
CA     0.5              -1     -0.5


Scenario 1

Dataset f is the same as the “golden” data set. Then

 

                                                f = F, H(f) = H(F) = 3.5.

 

The probability matrix for combined values:

 

Value         Probability (P)  log P  P log P
Larry, Larry  0.25             -2     -0.5
Jim, Jim      0.25             -2     -0.5
Scott, Scott  0.25             -2     -0.5
Marty, Marty  0.25             -2     -0.5
NJ, NJ        0.25             -2     -0.5
GA, GA        0.25             -2     -0.5
CA, CA        0.5              -1     -0.5

H(f,F) = – ∑ Pk log Pk = 3.5

 

and

 

                                                         H(F) = H(f) = H(f,F) = 3.5

 

 

                                             J(f,F) = H(F) + H(f) – H(f,F) = H(F) = 3.5

 

Equation (7) yields

 

                                                           DQ = J(f,F)/H(F) = 1

 

As expected the data quality of f when f = F yields 1 or 100%

 

Scenario 2

Dataset for the “golden record” F remains the same as in scenario 1.

 

                                                              H(F) = 3.5

 

We will change dataset f by adding a new record: “Larry, CA”. We will assume that the new record for “Larry” represent the same individual as “Larry, NJ”. Therefore records “Larry, NJ” and “Larry, CA” will have the same EID = 1. Data stewards determined that “NJ” is the right value for the attribute State.  Dataset f is as follows:

 

                                               

EID  Name   State
1    Larry  NJ
2    Jim    GA
3    Scott  CA
4    Marty  CA
1    Larry  CA

 

 The probability matrix for f is:

 

 

 

 

 

 

Value  Probability (p)  log p     p log p
Larry  0.4              -1.32193  -0.528771238
Jim    0.2              -2.32193  -0.464385619
Scott  0.2              -2.32193  -0.464385619
Marty  0.2              -2.32193  -0.464385619
NJ     0.2              -2.32193  -0.464385619
GA     0.2              -2.32193  -0.464385619
CA     0.6              -0.73697  -0.442179356

H(f) = 3.292878689

 

 

 

The probability matrix for combined values:

Value         Probability (P)  log P     P log P
Larry, Larry  0.4              -1.32193  -0.528771238
Jim, Jim      0.2              -2.32193  -0.464385619
Scott, Scott  0.2              -2.32193  -0.464385619
Marty, Marty  0.2              -2.32193  -0.464385619
NJ, NJ        0.2              -2.32193  -0.464385619
GA, GA        0.2              -2.32193  -0.464385619
CA, CA        0.4              -1.32193  -0.528771238
NJ, CA        0.2              -2.32193  -0.464385619

H(f,F) = 3.84385619

 

Substituting the values for H(F), H(f) and H(f,F) into equations (3) and (7) we obtain:

 

                                 J(f,F) = 3.5 + 3.292878689 – 3.84385619 = 2.949022499

 

                                  DQ = J(f,F)/H(F) = 2.949022499/3.5 = 0.842577857 or ~ 84%
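
The Scenario 2 arithmetic can be reproduced with a short script. This is a sketch under the same assumptions as the tables above: probabilities are relative frequencies within each attribute, source records are linked to the “golden” record by EID, and entropies are summed over the Name and State attributes. The per-attribute figures at the end are an extra breakdown that is not part of the original calculation.

```python
import math
from collections import Counter


def entropy(events):
    """Entropy (base 2) of a list of hashable events."""
    counts = Counter(events)
    total = len(events)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


# Golden records keyed by EID: (name, state).
golden = {1: ("Larry", "NJ"), 2: ("Jim", "GA"), 3: ("Scott", "CA"), 4: ("Marty", "CA")}

# Source data set f: (EID, name, state), including the extra "Larry, CA" record.
source = [
    (1, "Larry", "NJ"), (2, "Jim", "GA"), (3, "Scott", "CA"),
    (4, "Marty", "CA"), (1, "Larry", "CA"),
]

h_golden = h_source = h_joint = 0.0
dq_by_attribute = {}
for col, attribute in enumerate(["name", "state"]):
    golden_values = [rec[col] for rec in golden.values()]
    source_values = [rec[col + 1] for rec in source]
    # Pair each source value with the golden value of the entity it links to.
    pairs = [(golden[rec[0]][col], rec[col + 1]) for rec in source]
    hg, hs, hj = entropy(golden_values), entropy(source_values), entropy(pairs)
    h_golden, h_source, h_joint = h_golden + hg, h_source + hs, h_joint + hj
    dq_by_attribute[attribute] = (hs + hg - hj) / hg

mutual = h_source + h_golden - h_joint  # equation (3)
dq = mutual / h_golden                  # equation (7), about 0.84
```

In this breakdown the Name attribute scores 1 while State scores lower, which points to the conflicting state value as the source of the quality loss.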

Category: Data Quality, Enterprise Data Management, Information Development, Information Governance, Information Management, Information Value, Master Data Management
5 Comments »

by: Robert.hillard
05  Jul  2009

To Pluto and back

Richard Wray, writing recently in The Guardian, pointed out that the volume of data held is now estimated at 487 billion GB.  To put this in perspective he explained that in printed form this would form a pile that would stretch to Pluto 10 times over.  The really staggering statistic, however, was that if this data were printed then the stack would grow faster than NASA’s fastest rocket.  I haven’t checked the stats, but a quick back of the envelope calculation suggests he’s in the right order of magnitude.

What does this mean?  Apart from the staggering numbers, it tells us that the problem for organisations isn’t holding large amounts of information – they already do that.  Nor is the problem necessarily how to index that information – increasingly they have defined information standards to do that.  The real problem is its continual growth – very few taxonomies or models properly account for the rapid rate of growth.

MIKE2.0 hosts a new generation of Information Management techniques which are designed to deal less with the data you have now and more with the data that you are likely to gain in the future.   A great place to start is with the SAFE architecture.

Category: Information Governance, Information Management, Information Strategy, MIKE2.0
3 Comments »

by: Sean.mcclowry
10  Nov  2008

Managing Your Business Data

I had the pleasure of meeting Maria Villar at IBM’s Information on Demand Conference where she presented on her new book – Managing Your Business Data. It’s a great book for someone looking to make the “case for change” for treating information as an asset across the enterprise, building a culture of better information management and establishing roles and responsibilities for better data management across the organization.  There’s a wealth of information in there – and it’s something a business leader can understand just as well as a technologist. I particularly liked the blending of the conceptual (Maslow’s needs hierarchy for data management) with the pragmatic (lots of case studies).

Maria tells me the book will be on Amazon soon but in the meantime you can order direct from the publisher at the link above.

Category: Information Governance, Information Strategy
No Comments »

by: Sean.mcclowry
02  Oct  2008

How do you govern $700B?

In the current economic environment I expect we are going to hear a lot about the need for greater openness, transparency and better information.  Too much regulation has been shown to hamper innovation, but free markets have their issues too – we’ve had 6 global collapses in the past 15 years after all.

I’m not sure what the right balance is, but Governance 2.0, which I mentioned in an earlier post and is one of the solutions in MIKE2.0, might indeed become a hot topic.

Category: Information Governance
No Comments »

by: Sean.mcclowry
25  Jan  2008

Executable Data Standards

Organizations typically suffer from a lack of standards around information management. They develop standards on their own although they may use external reference materials. The issue is that most of the standards are definitional, but not validation-based. That is, the standards may say how a data warehouse model should be developed or provide a policy about how reference data should be synchronized.

What is missing is the validation step against these standards. What would be valuable are validation tools that test areas such as complexity while the solution is being developed. When we have simple tools like the W3C Markup Validation Service, it will be a big help to the industry.
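
As a rough idea of what an executable standard might look like, the sketch below expresses a few modelling rules as data and checks a table definition against them. The rules, thresholds and names (STANDARD, validate_table) are all hypothetical and chosen only to illustrate the idea of validating against a standard rather than just defining it.

```python
import re

# A hypothetical standard, expressed as data so that it can be executed.
STANDARD = {
    "column_name_pattern": r"^[a-z][a-z0-9_]*$",  # lower snake_case names
    "max_columns_per_table": 5,                   # crude complexity limit
    "require_primary_key": True,
}


def validate_table(table, standard=STANDARD):
    """Return a list of human-readable violations for one table definition."""
    violations = []
    pattern = re.compile(standard["column_name_pattern"])
    for column in table["columns"]:
        if not pattern.match(column):
            violations.append(f"column '{column}' violates the naming pattern")
    if len(table["columns"]) > standard["max_columns_per_table"]:
        violations.append("table has too many columns")
    if standard["require_primary_key"] and not table.get("primary_key"):
        violations.append("table has no primary key")
    return violations


good = {"name": "customer", "columns": ["customer_id", "full_name"],
        "primary_key": "customer_id"}
bad = {"name": "orders", "columns": ["OrderID", "amount"], "primary_key": None}

good_result = validate_table(good)
bad_result = validate_table(bad)
```

A check like this could run while the solution is being developed, in the same way markup is validated before a page is published.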

Category: Information Governance, Information Strategy
No Comments »

by: Sean.mcclowry
14  Dec  2007

Information Governance 2.0 (Continued)

I have been writing quite a bit lately about the topic of Governance 2.0/Networked Information Governance, including an earlier post in this blog. Networked Information Governance = Information Governance + Enterprise 2.0. The idea for the name came from an excellent article published by Paul Strassman in 2001. At that time, networked business models were continuing to grow in popularity, from the military to the most agile Fortune 2000 organizations. What the article pre-dated was the radical advances in collaborative technologies that would occur over the next few years to bring together the “informal networks” that are so relevant in the application of governance. When it comes to bringing the informal network together with a formal approach, technologies and techniques from Enterprise 2.0 are a great fit: collaboration, search, tagging and aggregation are the keys to bridging the gap.

I wrote a more detailed post on this subject on FastForward and recently posted a presentation on SlideShare.

Category: Enterprise2.0, Information Governance, Information Strategy
3 Comments »
