The Open Source Standard for Information Management

Personal tools
Refresh Collapse Expand Close
  • 38.107.191.118
  • Talk for this IP
Members
Refresh Collapse Expand Close

To join, please contact us.

Improve MIKE2.0
Refresh Collapse Expand Close
Need somewhere to start? How about the most wanted pages; or the pages we know need more work; or even the stub that somebody else has started, but hasn't been able to finish. Or create a ticket for any issues you have found.
Add Portlet Add Portlet

ETL High Level Solution Architecture Options

From MIKE2.0 Methodology

Share/Save/Bookmark
Jump to: navigation, search

This section describes the high level architecture options for data movement and consolidation covering the components of an architecture but not the usability and functionality of the software. These are the options to be considered as you select the architecture components for a more thorough evaluation.

This section covers related technologies that have similar acronyms:

  1. ETL (Extract, Transform and Load)
  2. ELT (Extract, Load and Transform)
  3. ELTL (Extract, Load, Transform, Load)

But not other data integration methods such as replication, EAI, enterprise search or federation.

Contents

Software Budget

The software budget earmarked for the project may limit the choice of ETL software putting it into one of these categories:

  • Open Source ETL
  • Enterprise ETL upgrade
  • Premium ETL product

Open Source ETL

Open source ETL products have matured and are a viable option for small to medium data integration implementations. These products are not as mainstream as high profile and successful open source applications like Linux and MySQL so they do not have as many developers or as many coding hours put into them. Open source ETL trail the premium ETL tools much like Wordpad trails MS Word.

Enterprise ETL Upgrade

Some products offer an upgrade that provides an ETL tool. For example an upgrade of an Oracle data warehouse license to an enterprise version comes with licenses for Oracle Warehouse Builder. An upgrade of SQL Server 2005 opens up additional data quality and volume functions for Microsoft SSIS.

Premium ETL

Most ETL products work on CPU licensing of the server. This ensures fair pricing based on how much data the product is moving and how many jobs it is being utilized for. CPU based licensing is most valuable if the server is utilized heavily and if the software is not competing for CPU resources with other applications .

Upfront versus Ongoing Budget

Premium ETL and ELT tools have a higher up front software fee but save money during the life of the project with superior usability and functionality. These benefits can be lost if the project is staffed by developers new to the project.

Timeliness

A key architecture decision for ETL is the timeliness of the data integration.

  • Real Time: data is moved as soon as it becomes available. The ETL processes that move the data are "always on". ETL services that operate in real time can be known as On Demand or Web Services or Java Beans or Java Services. Active Warehousing utilizes real time data loads to provide near real time data for reporting.
  • Micro Batch: also could be called mini batch or delta data loads. This architecture loads data throughout the day and waits for data to be collected in a micro batch for load. This can remove some of the complexity of a real time architecture in terms of utilizing efficient volume
  • Batch Load: the most common type of ETL work is the batch load where a data source is processed at a regular interval such as hourly, daily, weekly etc.

Hardware

This section covers the decisions on what hardware is required in the architecture.

Transformation Services

A key decision is where the processing load of ETL will occur:

  • An ELT or ELTL approach calls for transformation of data to occur on the hardware of existing database engines. This calls for sizing, optimization and ongoing support of the database to handle the load of this transformation. It also requires expert coding resources who are adept at using the ELT tool and optimizing the SQL generated by that tool.
  • An ETL approach calls for a transformation server to handle the load of transforming data. This server needs excellent connectivity to data sources and targets, it requires fast and efficient access to disk reads and disk writes for temporary or staged data.
  • SaaS Software as a Service for ETL data integration is emerging as a way of moving data between internal systems and hosted applications such as Salesforce.com. ETL and ELT vendors are providing hosted software for data integration.

Scalability

How very high volumes will be handled:

  • An ETL cluster is a set of servers that have the ETL software installed where ETL jobs are configured to run across parts of the cluster for parallel processing.
  • An ETL grid is a cluster that has smart software added to automatically run ETL jobs across available servers on the grid with load balancing and fault tolerance.
  • A MPP massive parallel processor is a server capable of high utilization on a single machine reducing the need to partition and repartition data across a cluster or grid for a potential bottleneck job.
  • A standard multiple CPU server can run multiple jobs at once but is prone to bottleneck on a single high volume ETL job. Most servers will run a single process on a single CPU leaving the other CPUs idle if this is the only process running.

Culture

The ETL architecture must fit in with the culture, resourcing and support capabilities of the organization.

  1. Can the architecture be supported by existing support teams.
  2. Does the organization have development resources who can use the product.
  3. Do the vendors have similar products already in the organization.
  4. Has the solution been used in the organization before.

Development Resources

The solution architecture needs to take into account the programming and development of ETL jobs and the resources available to do the coding. Most ETL tools provide a simple Designer driven GUI interface that allows most beginners to start coding jobs. What the Solution Architecture needs to take into account is extending basic ETL into advanced coding:

  1. What coding language is used to extend the transformation capabilities. ETL and ELT tools typically favor a particular coding language (C, Visual Basic, VbScript, Java) or a database development language (SQL, Pro C, stored procedures).
  2. What coding and development is required to control the ETL jobs, what compatibility does it have with existing enterprise scheduling tools and support notification tools.
  3. What compatibility does the architecture have with other parts of the software development life cycle such as data modeling, data mapping, requirements, testing, version control, production configuration and deployment. A typical failing of an ETL deployment is that the ETL development is fast but the steps before and after are slow due to lack of synergy between team members and the SDLC processes.

The other development resourcing element of ETL implementation is the standards and techniques of using the tool that can lead to robust and maintainable ETL jobs. This requires expertise on the project that can be supplied by acquired staff, vendor professional services or system integrators. The solution architecture needs to use software that has expert development resource support.

Product Compatibility

Most ETL and ELT products now come as part of a data integration suite. The solution architecture needs to define if any of the following functions need to be catered for in terms of sharing metadata and transferring processed data:

The architecture also needs to take into account compatibility with existing products around development and deployment:

Security and Compliance

The architecture needs to meet existing organization standards for security and legal compliance.

  1. Encryption: encryption of in flow data, data sources and target files, data fields such as visa card numbers.
  2. Data Quality: SOX compliance calls for robust movement of data and an acceptable quality of financial reporting data.
  3. Authorization: the authentication and authorization of user and login access to all parts of the ETL process from source files, staging files, in flow data and target files and tables.
  4. Connectivity: the security of data moving on or off the architecture by methods such as secure FTP or data encryption.
Add a portlet to your desktop
Close