Small Worlds Data Transformation Measure
From MIKE2.0 Methodology
CIO Technique – “Small Worlds” Data Transformation Measure
The massive growth in raw data volume through the last two decades has created a new problem for the Chief Information Officer (CIO): data as an obstacle for transformation.
Any team implementing a new business function starts by considering the strategy, process, organization, stakeholders and technology. Traditionally, the team also considers the implications of infrastructure (technology, logistics or facilities) in implementing the new capability. In the past, however, implementation teams have not had to worry as much about the data aspects of a new business function.
The absence of a strategic approach to data in considering a new business function is not due to its lack of importance. Historically, computer systems stored only the data required to perform a transaction. The core data associated with an event is usually intuitively understood by any practitioner and has not required special analysis. For example, a retail sale might involve a product code, a date, and a sale amount; a bank withdrawal might include an account number, date, and amount; and a payroll change might involve an employee number, old salary, new salary and effective date.
As the price of storage dropped during the 1990s, new systems began also storing ancillary data about the parties involved in each transaction and substantially more context for the event. Context could, for example, include the whole sales relationship tracking leading up to a transaction, or the staff contract changes that led to a salary change. Adding further to the complexity, some of this context information is in the form of documents or other unstructured content.
With the context permanently on record, there are operational, regulatory and strategic reasons requiring that any new or transforming business function do nothing to corrupt the existing detail.
Managing the complexity
The accepted technique for representing data is the Entity-Relationship Model (or ER Diagram), proposed by Peter Chen in 1976 and built on the relational model that Ted Codd introduced in 1970. Since then, hundreds if not thousands of books have been written to assist system designers to build well-structured databases.
Despite the abundance of technique advice, it is very difficult to take a strategic view of an organization’s system portfolio and determine how easy it is to manage, extend or transform based on the growing content associated with, or shared across, business functions.
Usually, the first sign of trouble is an increasing bill for technology services related to data migration, data quality or business intelligence. These costs generally mask an underlying issue: accessing an entire dataset in the context in which it was created and then applying it to a new business purpose.
In 1967, a network theory of “small worlds” was developed, growing out of Stanley Milgram’s famous experiments, and it has since evolved. It shows that any network (be it technical, biological or social) is only stable if there is a logarithmic relationship between the number of nodes and the number of steps needed to navigate between any two points.
A useful example to consider is the telephone network. Two neighbours calling each other might require two steps to complete a call (the caller connects to the nearest exchange which then, in turn, connects the call to the neighbour). By comparison, a call made between Sydney and New York may require only three or four steps to complete (the caller connects to a local exchange, then to an international exchange, then to an exchange local to the receiving party, and finally to the target of the call).
The two transactions in this example demonstrate the extremes of telephony complexity: the first is the simplest that can be performed on the network, while the second is among the most complex. Despite this, there is little difference in the number of network steps required.
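The logarithmic relationship can be sketched numerically. In the toy calculation below, the ten-connections-per-node branching factor is an assumption for illustration, not a figure from the methodology:

```python
import math

# Illustrative only: assume each node in the network connects to roughly
# ten others, so the set of reachable nodes multiplies tenfold with every
# step.  The steps needed to span the whole network then grow with the
# base-10 logarithm of its size, not with the size itself.
for nodes in (100, 10_000, 100_000_000):
    steps = math.ceil(math.log10(nodes))
    print(f"{nodes:>11,} nodes -> about {steps} steps")
```

A million-fold increase in network size adds only six steps, which is why a call across the planet needs scarcely more exchanges than a call to a neighbour.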
This model holds true for most technology environments. Most programming tools are designed to make it easy to navigate between code units (through the use of objects or subroutines). Physical storage technologies are designed to make it easy to request the retrieval of data regardless of whether it is adjacent or distributed over a substantial distance. The Internet is the ultimate example of a distributed system with a logarithmic relationship between distance and complexity.
The model also holds true for successful business models. For example, sales teams rely on internal communications to mirror the large accounts against which they are applied. Good organization hierarchies support communication from any obscure part of an enterprise to any other with only a few managers required to complete the contact.
The one example that consistently breaks this principle is the network that is used to link all of the context information described earlier. Typical data models within a single function database require dozens of steps to join together even closely related concepts and hundreds or even thousands of steps to link across the enterprise in new ways.
Measuring the Problem and Solution
The value of data is as much in its relationships as in its content. Put another way, the value is in the network; yet while data exists on a network, it is not always appropriately networked.
Senior executives can direct technology staff to use appropriate data management techniques to improve the data network across the enterprise; however, it is difficult to promote good behaviour without a mechanism to measure its adoption. Technical staff know that their productivity is measured by the individual tasks they solve, and they also know that executives are unlikely ever to examine the way they store data in data models.
Executives need a broad-brush set of measures to ensure that new content is loaded onto the corporate network in a way that simplifies its application to new business functions rather than hindering new development.
Such a technique needs to simplify the data model to its constituent parts and require very little technical skill to apply.
The data model as a “graph”
In mathematics, the formal name for any network is a graph. Each node in the graph is called a vertex. The connections between vertices are called edges.
Given that a data model is nothing more than a set of connected nodes then the use of graphs is a very logical way to abstract the model and allow for the development of some general metrics.
A graph is described by its “order” (the number of vertices), its “size” (the number of edges), the “degree” of a vertex (the number of edges intersecting it) and the “geodesic distance” between two vertices (the length of the shortest path connecting them).
Using these new terms (order, size, degree and geodesic distance), executives should consider three key metrics: average degree, average geodesic distance and maximum geodesic distance.
Consider, as an example, a data model of five entities connected by four relationships: entity A linked to B, B linked to C, and C linked to both D and E. For this model:
The average geodesic distance is 1.8 (consider each pair of vertices and the number of edges that separate them).
The maximum geodesic distance is 3 (the number of edges that must be stepped between the two most distant vertices).
The average degree is 1.6 (the sum of each vertex’s degree, 1 + 2 + 3 + 1 + 1 = 8, divided by the order, 5).
Finally, a useful consideration for understanding the nature of the information being held is the ratio of size (4) to order (5). If the ratio is <1 then (generally) more of the information is held in content; if the ratio is >1 then the majority of the information is about the relationships.
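These measures are simple enough to compute directly. The Python sketch below derives all of them by breadth-first search; the specific five-entity layout (A–B, B–C, C–D, C–E) is an assumption chosen to be consistent with the figures above:

```python
from collections import deque
from itertools import combinations

# Hypothetical five-entity model consistent with the worked example:
# A - B - C, with C also linked to D and E (order 5, size 4).
edges = [("A", "B"), ("B", "C"), ("C", "D"), ("C", "E")]

adjacency = {}
for u, v in edges:
    adjacency.setdefault(u, set()).add(v)
    adjacency.setdefault(v, set()).add(u)

def geodesic(start, goal):
    """Shortest-path length between two vertices (breadth-first search)."""
    queue, seen = deque([(start, 0)]), {start}
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for nxt in adjacency[node] - seen:
            seen.add(nxt)
            queue.append((nxt, dist + 1))
    raise ValueError("vertices are not connected")

order = len(adjacency)                                   # vertices: 5
size = len(edges)                                        # edges: 4
avg_degree = sum(len(n) for n in adjacency.values()) / order
distances = [geodesic(u, v) for u, v in combinations(sorted(adjacency), 2)]
avg_geodesic = sum(distances) / len(distances)
max_geodesic = max(distances)

print(avg_degree, avg_geodesic, max_geodesic, size / order)
# -> 1.6 1.8 3 0.8
```

The same routine scales to a real data model: treat each entity as a vertex and each foreign-key relationship as an edge, and no specialist modelling skill is needed to read off the results.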
Interpreting the results
Like any benchmark, the key is to seek constant improvement. As a start, each critical database in the enterprise should be assessed and any future developments (either modifications to existing systems or the addition of new databases) should result in all three measures being lowered.
For any database requiring human access (such as through a query or reporting tool), it is important to remember that a single query requiring more than 4 steps is beyond an average user. Anything requiring 10 or more steps is beyond anyone but a trained programmer prepared to invest substantial time in testing. That means the average geodesic distance should be kept to around 4 or fewer, and organizations should aim for a maximum geodesic distance of no more than approximately 10.
Average degree reflects the options a user faces to navigate a database. Realistically, 3 or 4 direction options on average is manageable, but as the number approaches 10, nothing other than well-tested code can possibly manage the complexity.
These measures encourage good data management practice. Even systems that are not designed for direct human access on a daily basis should be measured in this way. Too often, core operational systems become an obstacle to data extraction and further business transformation.
Winning organizations in the 21st Century will streamline their data so that they can maximize its application and ongoing value to the business.
An open source tool on SourceForge automates the process of assigning “small world” measures to data models. It supports any RDBMS for which a JDBC driver is available, as well as ERwin models (via CSV output).