Open Framework, Information Management Strategy & Collaborative Governance | Data & Social Methodology - MIKE2.0 Methodology

Differences between centralised and distributed Data Warehouses

From MIKE2.0 Methodology


Two levels of data redundancy that enterprises should think about when considering their data warehouse options are:

  • Central Data Warehouses
  • Distributed Data Warehouses

The rationale behind the selection will vary with the requirements, cost considerations, technical limitations and dynamics of the organization. Given the broad-reaching implications and the complexity of most large organisations, further investigation and analysis will be required before an effective recommendation of the most appropriate Data Warehouse option can be made.

To complement the overview of the Data Warehouse options, an initial listing of the benefits and disadvantages of each Data Warehouse approach is provided later in this section.


Centralised Data Warehouse

Central Data Warehouses are what most people think of when they are first introduced to the concept of a data warehouse. The central data warehouse is a single physical database that contains all of the data for a specific functional area, department, division, or enterprise. Central Data Warehouses are often selected where there is a common need for informational data and there are large numbers of end-users already connected to a central computer or network. A Central Data Warehouse may contain data for any specific period of time. Usually, Central Data Warehouses contain data from multiple operational systems.

Central Data Warehouses are real, physical databases. The data stored in the data warehouse is accessible from one place and must be loaded and maintained on a regular basis. Normally, data warehouses are built around advanced RDBMSs or some form of multi-dimensional informational database server.

Most organizations build and maintain a single centralized data warehouse environment. There are many reasons why a single centralized data warehouse environment makes sense.

  • The data in the warehouse is integrated across the corporation, which allows an integrated and cross-functional view of the data to be made available.
  • The volume of data in the data warehouse is such that a single centralized repository of data makes sense.
  • Even if data could be integrated, if it were dispersed across multiple local sites, it would be cumbersome to access.

The politics, the economics, and the technology may greatly favor a single centralized data warehouse. But there may also be a need for a distributed data warehouse in a few special cases.

Traditional Centralised Data Warehouse Architecture

Traditional Data Warehouse Architecture

The key advantage of a centralized data warehouse is that it can be built and tailored more easily to the business requirements. For the purpose of this proposal, it is assumed that the architecture of an end-to-end centralized data warehouse solution will be as displayed in the figure below:

In this option, an evaluation of supporting tools such as ETL and BI tools should be performed to confirm the appropriate toolsets.

Centralised Organisational Processing Models

Centralised Operational Processing

The figure to the right shows an organisation that has a headquarters where all processing is done. If any processing is done at the local level, it is very basic. There might be a series of dumb terminals at the local level, but no significant processing is done locally. In this type of topology it is very unlikely that there will be a need for a distributed data warehouse.

One step up the ladder in terms of local processing sophistication is the case where basic capture activity occurs at the local level. This means there is some small amount of processing activity that occurs at the local level. Once the activity is captured, it is shipped to a central location for processing. Under this simple topology it is very unlikely that there will be a need for a distributed data warehouse.

Distributed Data Warehouse

Distributed Data Warehouse Model

A Distributed Data Warehouse (DW) is just what its name implies. The local DWs are autonomous, in that only selected elements of each are distributed to a Global DW as required by the organization. This implies that not all of the local DW elements are required in the Global DW. The reverse is also true: only select elements (e.g. common reference codes) need to be distributed by the Global DW to the local DWs, as depicted in Figure 3.

There may be a local DW that houses data unique to and of interest to the local operating site. There may also be a global, all-encompassing data warehouse, which is an integration or consolidation of all relevant aspects of the corporation’s data (e.g. for Group Risk Management reporting). While the structure and content of the global data warehouse are decided centrally, the mapping of the data into the global data warehouse is decided locally.

Increasingly, large organizations are pushing decision-making down to lower and lower levels of the organization and in turn pushing the data needed for decision making down (or out) to the LAN or local computer serving the local decision-maker.

While on the surface this may appear to be a series of independent data marts feeding into a Global DW, it is not. The data feeds to both the local DW and the Global DW come from the same source systems, rather than the Global DW being sourced from the local DW. Nor does it imply that data is being fed en masse from the Global DW to the Local DW (as with dependent data marts). That happens only on a select and conditional basis.

Distributed DW topology

To understand when a distributed data warehouse makes sense, consider some basic topologies of processing. The local DW contains data that is of interest only to the local level. Figure 3 shows a simple example of a series of Local DWs.

There is a local DW for different geographical regions or for different technical communities. The local DW serves the same function that any other DW serves, except that the scope of the DW is local. In other words, the local DW contains data that is historical in nature and is integrated within the local site. There is no coordination of data or structure of data from one local DW to another.

The global DW has as its scope the corporation or the enterprise, while each of the local DWs within it has as its scope the local site that it serves. The global DW contains historical data, as do the local DWs. Each local DW is fed by its own operational systems and has its own unique structure and content of data. Any intersection or commonality of data from one local DW to another is purely coincidental. There is no coordination whatsoever of data or processing between the local DWs.

But it is reasonable to assume that a corporation will have at least some natural intersections of data from one local site to another. If there is such an intersection, it is best contained in a global DW. The global DW is fed from existing local operational systems.

The global DW contains data that is common across the corporation and that is integrated. Central to the success of the distributed DW environment is the mapping of data from the local operational systems to the data structure of the global DW as shown in the diagram. There is a common structure of data for the global DW. The common data structure encompasses all common data across the corporation. But there is a mapping of data from each local site into the global data warehouse. In other words, the global DW is defined and designed centrally, but the mapping of the data from existing local operational systems is a choice made by the local designer and developer.
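The split between a centrally defined structure and locally defined mappings can be illustrated with a minimal Python sketch. All names here (GLOBAL_FIELDS, SITE_MAPPINGS, map_to_global and the site field names) are hypothetical and are not part of MIKE2.0. The sketch only assumes that each site supplies its own mapping from local field names to the common global fields.

```python
# Hypothetical illustration: the global structure is fixed centrally,
# while each local site owns the mapping from its own field names.

GLOBAL_FIELDS = ("customer_id", "amount", "currency")  # defined centrally

# Each mapping is written and maintained by the local designer (assumed names).
SITE_MAPPINGS = {
    "site_a": {"cust_no": "customer_id", "amt": "amount", "ccy": "currency"},
    "site_b": {"client": "customer_id", "value": "amount", "curr": "currency"},
}


def map_to_global(site, local_record):
    """Translate one local operational record into the global structure."""
    mapping = SITE_MAPPINGS[site]
    # Only mapped fields are carried over; purely local fields stay local.
    global_record = {
        global_name: local_record[local_name]
        for local_name, global_name in mapping.items()
    }
    missing = set(GLOBAL_FIELDS) - set(global_record)
    if missing:
        raise ValueError(f"{site} mapping does not cover: {sorted(missing)}")
    return global_record
```

Fields that a site does not map, such as a purely local note, never reach the global DW, which matches the idea that only common data is integrated centrally.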

A variation on the distributed DW is to allow global DW "staging" areas to be kept at the local level. The distributed DW in figure 3 shows that each local area stages global warehouse data before passing the data to the central location. In many circumstances, this approach may be technically mandatory. There is an important issue associated with this approach: should locally staged global DW data be deleted after it is transferred to the global level? If the data is not deleted locally, then there will be redundant data. However, under certain conditions, some amount of redundancy may be called for. This issue must be decided and policies and procedures put into place.
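The staging question can be made concrete with a small Python sketch. The LocalStagingArea class and its purge_after_transfer flag are hypothetical stand-ins for whatever policy an organisation agrees on. They are not drawn from MIKE2.0, and a real staging area would sit in a database rather than in memory.

```python
# Hypothetical sketch of a local staging area for global DW data.
# purge_after_transfer stands in for the agreed retention policy.


class LocalStagingArea:
    def __init__(self, purge_after_transfer=True):
        self.purge_after_transfer = purge_after_transfer
        self.staged = []       # global DW records held locally
        self._sent_count = 0   # how many staged records were already sent

    def stage(self, record):
        self.staged.append(record)

    def transfer(self, global_dw):
        """Send not-yet-sent records to the global DW; return how many."""
        pending = self.staged[self._sent_count:]
        global_dw.extend(pending)
        if self.purge_after_transfer:
            # No redundant copy remains at the local level.
            self.staged = []
            self._sent_count = 0
        else:
            # A redundant local copy is kept, but it is not sent twice.
            self._sent_count = len(self.staged)
        return len(pending)
```

With purging enabled the local level holds no copy after a transfer. With purging disabled the redundancy is deliberate and visible in one place.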

There are many subject areas that may be candidates for the first global DW development effort. The area that many corporations begin with is corporate finance. Finance is a good starting point because:

  • It is relatively stable
  • It enjoys high visibility

Anomalies and complexities

When building the global DW one must recognize that there are some anomalies. The global data warehouse does not fit the classical structure of a DW as far as the levels of data are concerned. One of the anomalies is that the detailed data resides at the local level, while the lightly summarized data may reside at the centralised global level. For example, suppose that the headquarters of a company is in Beijing and there are outlying regional offices in Site A, Site B, Site C and Site D. The details of sales and finance are managed independently and at a detailed level in Sites A, B, C and D. The data model is passed to the outlying regions and the needed corporate data is translated into the form that is necessary to achieve integration across the corporation. Once the translation has been made at the local level, the data is transmitted to Beijing. The raw, untranslated detail still resides at the local level. Only the transformed, lightly summarised data is passed to headquarters. This is a variation on the theme of the classical data warehouse structure.
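The local-detail, global-summary pattern described above can be sketched in Python as follows. The function name, the record fields and the choice of monthly totals as the "lightly summarised" level are all assumptions made for illustration. The source does not specify the granularity.

```python
from collections import defaultdict


def summarise_for_headquarters(site, detailed_sales):
    """Lightly summarise detailed local sales by month for the global DW.

    The detailed records are left untouched at the local level.
    """
    totals = defaultdict(float)
    for sale in detailed_sales:
        month = sale["date"][:7]  # assumes ISO dates such as "2024-03-15"
        totals[month] += sale["amount"]
    return [
        {"site": site, "month": month, "total_amount": total}
        for month, total in sorted(totals.items())
    ]
```

Only the returned monthly rows would be transmitted to headquarters. The input list, standing in for the raw local detail, is not modified.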

The coordination and administration of the distributed data warehouse environment is much more complex than that of the single-site data warehouse. Distributed Data Warehouses usually involve the most redundant data and, as a consequence, the most complex loading and updating processes, without the benefits of economies of scale.

Redundancy

Alongside the general issue of data redundancy arises the issue of redundancy between the local DWs and the global DW. The distributed DW in figure 3 shows that, as a policy, there is no redundant data between the local levels and the global levels of data (and in this regard, it does not matter whether global data is stored locally in a staging area or centrally). The moment redundant data exists between a local DW and the global DW, it is an indication that the scope of the different warehouses has not been defined properly. When there is a difference of opinion between the local and global scopes, it is only a matter of time before spider web systems start to appear. For this reason, it should be a matter of policy that global data and local data be mutually exclusive.
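A mutual-exclusivity policy of this kind can be checked mechanically. The Python sketch below is a minimal illustration that assumes each warehouse can list the data elements in its scope. The function name and the element names used with it are hypothetical.

```python
def find_scope_overlaps(global_elements, local_elements_by_site):
    """Return, per site, the data elements held both locally and globally.

    An empty result means local and global scopes are mutually exclusive.
    """
    global_set = set(global_elements)
    overlaps = {}
    for site, elements in local_elements_by_site.items():
        shared = global_set & set(elements)
        if shared:
            overlaps[site] = sorted(shared)
    return overlaps
```

Any site that appears in the result signals a scope definition that needs to be revisited before redundant copies start to drift apart.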

Data Access Policy

In line with the policies required to manage the different DWs, there is the issue of access to data. At first, this issue seems to be almost trivial. But there are some important ramifications. The diagram shows that some local sites are accessing global data. Depending upon what is being asked for, this may or may not be an appropriate usage of DW data. If the global data is being used informationally and on a one-time-only basis, then its access at the local level is probably acceptable. As a principle, local data should be used locally and global data should be used globally. The question must be raised: why is global analysis being done locally? As a rule, there is no good reason for it.

Distributed Data Warehouse – Pros and Cons

In the following section, the pros and cons of the distributed data warehouse are discussed.

Advantages of a Distributed Data Warehouse

  • It’s quick to accomplish. Each local group has control over its design and resources.
  • Benefit of the data warehouse can be proven throughout the corporation on a real-time basis.
  • The entry cost is much less than with a centralised solution. The cost of hardware and software for a DW initially loaded onto distributed technology is much less than if the data warehouse were initially loaded onto classical large, centralised hardware.
  • There is no theoretical limit to how much data can be placed in each Local or the Global DW. If the volume of data inside the warehouse begins to exceed the limit of a distributed processor, then another processor can be added to the network.
  • Data transfer and multiple table queries will not create as many major technological problems.

Disadvantages of a Distributed Data Warehouse

  • In the distributed environment, the issue of scope, coordination, metadata, responsibilities, transfer of data, local mapping, and so on, make the environment complex.
  • Managing multiple development efforts on local sites would be reasonably difficult for a data warehouse architect.
  • The different parts of the detailed level of the data warehouse are scattered across different technological platforms or different database vendors.
  • Interconnectivity between the different local sites of the distributed data warehouse could be an issue. If there is very heavy traffic being created by processing that occurs on either level of the DW, then the interface between the two environments can become a bottleneck.
  • Excessive network traffic starts to appear when the warehouse is spread over multiple servers.
  • Building a corporate data warehouse model in a distributed environment can be contrasted with the corporate financial DW of the completely unrelated locals.
  • Data transfer and multiple table queries may present special technological problems.
  • In a distributed environment, the roles and responsibilities are not straightforward.
  • Coordinating development across distributed locations won’t be very effective. The local development groups never collectively move at the same pace.

Centralised Data Warehouse – Pros and Cons

In the following section, the pros and cons of the centralized data warehouse are discussed.

Advantages of a Centralised Data Warehouse

  • Provides a single interface between users and the information they need, making it easier to get to the information and/or to build a departmental data mart based on the foundation established in the centralized data warehouse
  • A centralized data warehouse is best for the integrated business where organizations across the enterprise work together and require a cross-functional and integrated view of the data.
  • Developing new business applications becomes simpler. Attempting to execute customer relationship management solutions, which requires knowledge of lifetime customer value, is difficult with distributed DW.
  • A centralized data warehouse can be more cost effective in the longer term due to economies of scale. A data mart implementation may be on multiple hardware platforms. A centralized data warehouse only requires one. But consolidating the data is not just about saving disk space; it is about true savings in both labor and costs resulting from reduced data management effort.
  • Consolidates data in one foundation, providing a "single version of the truth" that all users can access.
  • Alleviates redundant data, maintaining all data in a central store.
  • Even if data could be integrated, if it were dispersed across multiple local sites, it would be cumbersome to access when an integrated view is needed for each local DW.
  • In the centralised environment, the issues of scope, coordination, metadata, responsibilities, transfer of data, local mapping, and so on would be reasonably straightforward.
  • Managing development effort would be much easier for a data warehouse architect.
  • The hardware and database vendor would be homogeneous across the environment.
  • In a simple centralised environment, roles and responsibilities are more straightforward.
  • The availability of the DW depends entirely on a single set of hardware, software and location, as opposed to multiple distributed sites.

Disadvantages of a Centralised Data Warehouse

  • If there is any commonality in the structure (not the content) of the data across the organization, this approach does nothing to recognize or rationalize that commonality.
  • The entry cost is very high. You need a classical larger hardware and associated software.
  • Benefit of the data warehouse can’t be easily proven throughout the corporation.
  • The limit on data volume depends on the extensibility of the hardware configuration.
  • Network traffic is not a major issue.