Posts Tagged ‘ETL’
A few years ago, I did a little consulting for an organization involved in a legal dispute with its software vendor and integrator. For obvious reasons, I can’t divulge specifics here. To make a long story short, the company (let’s call it ABC) contracted the vendor (let’s call it XYZ Software) to customize and deploy a new supply chain and financial system. A six-month project quickly ballooned into 18 months, with still no guarantee that the system would work as promised.
And then the lawyers got involved.
As I listened to executives from ABC tell me what happened, a few things became clear to me. For starters, ABC recognized that it was partially at fault for the current state of affairs. While not accepting all of the blame, ABC realized that it made a critical mistake at the onset of the project: the company believed that it could successfully migrate its data from its legacy system to its new application. After all, how hard could ETL be, right?
ABC would soon find out.
It’s Never That Easy
ABC’s legacy system did not store data in a transactional and relational way. That is, its legacy system updated accounts, inventory, and the like to reflect current amounts (see second set of data below). Unfortunately for ABC, its new application needed transactional data to function (see first set of data below).
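The gap between the two formats can be sketched with made-up numbers. A legacy system of this kind stores a current balance per account per period; a transactional ERP wants the individual movements. One common conversion approach is to derive transactions as the deltas between successive snapshots. (The field names and figures below are illustrative, not from ABC’s actual systems.)

```python
# Illustrative only: derive transaction rows from balance snapshots.
# A legacy system stores current amounts; an ERP expects movements.

def snapshots_to_transactions(snapshots):
    """Given [(period, account, balance), ...] sorted by period,
    emit (period, account, amount) rows for each balance change."""
    last = {}  # account -> previous balance
    txns = []
    for period, account, balance in snapshots:
        prev = last.get(account, 0)
        if balance != prev:
            txns.append((period, account, balance - prev))
        last[account] = balance
    return txns

snapshots = [
    ("2010-01", "inventory", 500),
    ("2010-02", "inventory", 450),   # 50 units consumed
    ("2010-03", "inventory", 700),   # 250 units received
]
print(snapshots_to_transactions(snapshots))
# [('2010-01', 'inventory', 500), ('2010-02', 'inventory', -50), ('2010-03', 'inventory', 250)]
```

Even this toy version hints at the real-world traps: the opening balance becomes a synthetic transaction, and any missing snapshot silently distorts every delta after it.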
Because ABC employees had never worked with transactional data before, they struggled with the ETL process, no doubt contributing to the project’s delays. Sure, there were other issues with XYZ’s (lack of) project management and its personnel, but in reality this project never had a shot at hitting its six-month goal because of myriad data issues.
Interesting ending to the story: ABC kicked out XYZ and implemented the software of a new vendor relatively easily. The reason is obvious: its data had already been converted into a more ERP-friendly format.
Data storage today is a commodity, but that wasn’t always the case. (See Kryder’s Law.) Systems built in the 1980s tried to minimize then-expensive data storage. Remember, transactional tables can get very large.
When moving from an old system to a new one, don’t underestimate the amount of time and effort needed to convert data. I’ve seen far too many examples like ABC over the years.
What say you?
Yogi Berra famously once said, “When you come to a fork in the road, take it.” In this post, I’ll discuss a few different paths related to data quality during data migrations.
But let’s take a step back first. Extract-Transform-Load (ETL) represents a key process in any information management framework. At some point, at least some of an organization’s data will need to be taken from one system, data warehouse, or area, transformed or converted, and loaded into another data area.
The entire MIKE2.0 framework in large part hinges on data quality. DQ represents one of the methodology’s key offerings, if not its most important one. To this end, it’s hardly unique as an information management framework. But, as Gordon Hamilton and my friend Jim Harris pointed out recently on an episode of OCDQ Radio, not everyone is on the same page when it comes to when we should actually clean our data. Hamilton talks about EQTL (Extract-Quality-Transform-Load), a process in which data quality is improved in conjunction with an application’s business rules. (Note that there are those who believe that data should only be cleaned after it has been loaded into its ultimate destination–i.e., that ETL should give way to ELT.)
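One way to picture the difference between ETL and Hamilton’s EQTL is where the quality gate sits in the pipeline: in EQTL, records are screened between extraction and transformation, so bad data never reaches the business rules. A minimal sketch (the record shapes, rules, and function names here are my own invention, not from MIKE2.0 or OCDQ Radio):

```python
# Hypothetical EQTL pipeline: the quality step runs between extract and
# transform, rejecting bad records before business rules touch them.

def extract(rows):
    return list(rows)

def quality(rows):
    """Split rows into clean and rejected before transformation."""
    clean, rejected = [], []
    for r in rows:
        if r.get("amount") is None or not r.get("account"):
            rejected.append(r)
        else:
            clean.append(r)
    return clean, rejected

def transform(rows):
    # Example business rule: amounts arrive in cents, target wants dollars.
    return [{**r, "amount": r["amount"] / 100} for r in rows]

def load(rows, target):
    target.extend(rows)

target = []
raw = [
    {"account": "4000", "amount": 125000},
    {"account": "", "amount": 99},          # fails the quality gate
    {"account": "4100", "amount": None},    # fails the quality gate
]
clean, rejected = quality(extract(raw))
load(transform(clean), target)
print(len(target), len(rejected))  # 1 2
```

Under the ELT view, by contrast, the `quality` step would run after `load`, against the destination system itself.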
Why does this matter? Many reasons, but I’d argue that implicit in any decent information management framework is the notion of data quality. While many frameworks, models, and methodologies vary, it’s hard to imagine any one worth its salt that pooh-poohs accurate information. (For more on a different framework, see Robert Hillard’s recent post.)
Data Quality Questions
Different frameworks aside, consider the following questions:
- As data management professionals, should we be concerned about the quality of the data as we are transforming and loading it?
- Or can our concerns be suspended during this critical process?
- And the big question: Is DQ always important?
I would argue yes, although others may disagree with me. So, for the organization migrating data, when is the best time to conduct the cleansing process? Here are the usual suspects:
- Pre-Migration
- During Migration
- Post-Migration
Organizations can often save an enormous amount of time and money if they cleanse their data before moving it from point A to point B. To be sure, different applications call for different business rules, fields, and values. Regardless, it’s a safe bet that a new system will require different information than its predecessor. Why not get it right before mucking around in the new application?
Some contend that this is generally the ideal point to cleanse data. Retiring a legacy system typically means that organizations have decided to move on. (The term legacy system implies that it is no longer able to meet a business’ needs.) As such, why spend the time, money, and resources “fixing” the old data? In fact, I’ve seen many organizations specifically refuse to cleanse legacy data because their end users felt more secure with the knowledge that they could retrieve the old data if need be. So much for cutting the cord.
If your organization takes this approach, then it is left with two options: cleanse the data during or post-migration. If given the choice, I’d almost always opt for the former. It’s just easier to manipulate data in Excel, Access, ACL, or another tool than it is in a new system loaded with business rules.
The very ease of manipulating data outside of a system is the reason that many organizations prefer to clean their data in a system of record. Excel doesn’t provide a sufficient audit trail for many folks concerned about lost, compromised, or duplicated data. As such, it makes more sense to clean data in the new system. Many applications support this by allowing for text-based fields that let users indicate the source of the change. For instance, you might see a conversion program populate a field with “data change – conversion.”
Is this more time-consuming? Yes, but it provides greater audit capability.
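Stamping each converted row with its provenance is cheap to do during the load itself. The sketch below assumes a simple `source` column carrying the “data change – conversion” value mentioned above; the table and column names are illustrative, not from any particular ERP.

```python
import sqlite3

# Illustrative: load converted rows into the new system with an audit
# column recording that they came from the conversion, not from a user.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE customer (
    id INTEGER PRIMARY KEY, name TEXT, source TEXT)""")

legacy_rows = [("Acme Corp",), ("Globex",)]
conn.executemany(
    "INSERT INTO customer (name, source) "
    "VALUES (?, 'data change - conversion')",
    legacy_rows,
)

# Later, an auditor can isolate converted records with one query.
converted = conn.execute(
    "SELECT COUNT(*) FROM customer "
    "WHERE source = 'data change - conversion'"
).fetchone()[0]
print(converted)  # 2
```

That single extra column is what makes the in-system approach auditable in a way a spreadsheet never is.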
What say you?
ETL tools can be extremely involved, especially with complex data sets. At one time or another, many data management professionals have built tools that have done the following:
- Taken data from multiple places.
- Transformed it (often significantly) into formats that other systems can accept.
- Loaded said data into new systems.
In this post, I discuss how to add some basic checkpoints into tools to prevent things from breaking bad.
The Case for Checkpoints
Often, consultants like me are brought into organizations that need to solve urgent data-related problems. Rather than gather requirements and figure everything out, the client usually wants to just start building. Rarely will people listen when consultants advocate taking a step back before beginning development efforts in earnest. While this is a bit of a generalization, few non-technical folks understand:
- the building blocks required to create effective ETL tools
- the need to know what you need to do–before you actually have to do it
- the amount of rework required should developers have an incomplete or inaccurate understanding of what the client needs done
Clients with a need to get something done immediately don’t want to wade through requirements; they want action–and now. The consultant who doth protest too much runs the risk of irritating his/her clients, not to mention being replaced. While you’ll never hear me argue against understanding as much as possible before creating an ETL tool, I have ways to placate demanding clients while concurrently minimizing rework.
Enter the Checkpoint
Checkpoints are simply invaluable tools for preventing things from getting particularly messy. Even simple SQL SELECT statements identifying potentially errant records can be enormously useful. For example, on my current assignment, I need to manipulate a large number of financial transactions from disparate systems. Ultimately, these transactions need to precisely balance against each other. Should one transaction be missing or inaccurate, things can go awry. I might need to review the thirty or so queries that transform the data, looking for an error on my end. This can be time-consuming and even futile.
Enter the checkpoint. Before the client or I even run the balancing routine, my ETL tool spits out a number of audits that identify major issues before anything else happens. These include:
- Missing currencies
- Missing customer accounts
- Null values
- Duplicate records
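As a sketch of what such audits can look like, each checkpoint is just a query that should return zero rows before the balancing routine is allowed to run. (The table and column names below are invented for illustration, not from the actual engagement.)

```python
import sqlite3

# Hypothetical checkpoint audits: each query names a problem and should
# return no rows before any downstream balancing routine runs.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE txn (id INTEGER, account TEXT, currency TEXT, amount REAL);
INSERT INTO txn VALUES
    (1, 'A-100', 'USD', 50.0),
    (2, 'A-100', NULL,  25.0),   -- missing currency
    (3, NULL,    'USD', 10.0),   -- missing account
    (1, 'A-100', 'USD', 50.0);   -- duplicate id
""")

checkpoints = {
    "missing currencies": "SELECT id FROM txn WHERE currency IS NULL",
    "missing accounts":   "SELECT id FROM txn WHERE account IS NULL",
    "duplicate records":  ("SELECT id FROM txn GROUP BY id "
                           "HAVING COUNT(*) > 1"),
}

failures = {name: conn.execute(sql).fetchall()
            for name, sql in checkpoints.items()}
ok = all(not rows for rows in failures.values())
print(ok)  # False: don't run the balancing routine yet
```

Each audit is a few lines of SQL, yet together they replace hours of combing through thirty-odd transformation queries after the fact.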
While the absence of results on these audits guarantees nothing, both the client and I know not to proceed if we’re not ready. Consider starting a round of golf only to realize on the third hole that you forgot extra balls, your pitching wedge, and drinking water. You’re probably not going to have a great round.
Sure, agile methods are valuable. However, one of the chief limitations of iterative development is that you may well be building something incorrectly or sub-optimally. While checkpoints offer no guarantee, at least they can stop the bleeding before you waste a great deal of time analyzing problems that don’t exist. Use them liberally; should they produce no errors, you can always ignore them, armed with increased confidence that you’re on the right track.
What say you?
Here’s a dirty little secret about data management: it’s about art as well as science. In this post, I discuss how many people often mistakenly focus on the “science” of things while minimizing the art.
Any good developer knows that there are many ways to skin a cat. Even for something requiring a high degree of precision, there are often many options. Of course, some are better than others. When I develop data management tools, I often have a number of different alternatives with respect to moving and manipulating data, including the use of:
- Temp tables
- Batch processes
- Export/import routines
While the MIKE2.0 framework provides for extensive best practices at a general level, there is a blizzard of individual decisions to make. Major development questions often include:
- Should a data transfer process occur automatically or is there a need for someone to approve an action mid-stream?
- What’s the right application for end users to enter and maintain data?
- Are there any audit or regulatory concerns?
- How technical are those being asked to maintain the data?
- What type of safeguards exist so my clients won’t have to call me with minor questions?
- How can I lock down the data–and the magic behind the scenes–so people can’t break things, unintentionally or otherwise?
The answers to these questions drive my development efforts and basic philosophy for data management. For example, if I build an ETL tool for the IT department, it’s reasonable to assume that employees’ expertise will allow them to make some changes, especially if I document things well. I can probably automate many things and let SQL dance. However, if the same tool is built for generally less technical folks, there exists the very real danger that someone might break something. I typically err on the side of simplicity and more manual data management when it’s appropriate for that audience.
In the classic movie This is Spinal Tap, there’s an amazing line: There’s a fine line between clever and stupid. That quote need not be confined to the music industry. The same aphorism holds true for data management. Different folks in organizations have different levels of understanding of–and proficiency with–all things data. I have seen repeatedly throughout my career the perils of overreaching; sometimes, really neat methods of data management are lost on end users, resulting in confusion, frustration, and dysfunction.
What say you?
Agile software development is a funny thing. Most of what I know about formal Agile methodologies comes from my friend Roland Cuellar. He wrote the chapter on Agile in my second book. While I’m relatively new to formal Agile processes, I have in fact been using informal versions of them for the vast majority of my career. In this post, I’ll discuss some of the limitations of Agile within the context of data management.
On my current assignment, my client is a large financial institution in the midst of some M&A activity. I am working closely with a director named Tommy to build an ETL tool. (This is not his real name; I just happen to be listening to Styx right now). Tommy is getting pulled in a bunch of different directions and, using more tact than I have, is trying to successfully navigate his organization’s political waters. As soon as we make the changes to file formats for his internal clients (read: very senior folks), he comes back almost as quickly with more changes. Some are minor; many are not.
The Limitations of Agile
The benefits of Agile–and there are many–do not come without caveats. While my little ETL tool is nowhere near as complex as many enterprise systems, building anything in an iterative process introduces risks such as rework and related time and expense overruns.
Now, I wouldn’t consider myself a hard core software developer. I can conceptualize as well as the next guy but sometimes I just have to see the aftereffects of a change. No one can think of everything conceptually. While versioning can partially control rework, I’d be lying if I claimed that all of my work is linear. At least my client understands this.
Implementing an ostensibly simple change request in a production environment can have one or more of the following ramifications:
- Break something else down the line.
- Corrupt or change data.
- Introduce inefficiencies–e.g., batch processes that involve superfluous tables and views which, had the end result been known, could easily have been avoided.
- Lead to excessive dependence on a resource with intimate knowledge of the data and development process. As a result, it might be difficult for an organization to rid itself of an expensive external resource.
What’s more, all of these changes may not be immediately evident. Dealing with tens of thousands of records (or more) means that it’s easy to miss a few. For example, because of some data manipulation, I had to derive a number of fields in a translate table. It would have been easy for me to forget that those fields needed to be updated in several related tables. Fortunately, I’m pretty diligent. Because of the complexity of my task, however, it would have been understandable for me to have missed a few of these interdependencies.
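A cheap way to catch that class of miss is a consistency audit: after each change, verify that every key in the translate table still carries the same derived value in each table that depends on it. A minimal sketch (the table names, keys, and derived values here are hypothetical):

```python
# Hypothetical audit: every key in the translate table should appear,
# with the same derived value, in each dependent table.

translate = {"A-100": "4000", "B-200": "4100"}  # legacy key -> derived GL code

related_tables = {
    "orders":   {"A-100": "4000", "B-200": "4100"},
    "invoices": {"A-100": "4000", "B-200": "9999"},  # stale derived value
}

def audit_derived_fields(translate, related_tables):
    """Report (table, key) pairs whose derived value drifted
    out of sync with the translate table."""
    drift = []
    for table, rows in related_tables.items():
        for key, derived in translate.items():
            if rows.get(key) != derived:
                drift.append((table, key))
    return drift

print(audit_derived_fields(translate, related_tables))
# [('invoices', 'B-200')]
```

Run after every change request, an audit like this turns “fortunately, I’m pretty diligent” into something a less diligent successor can rely on too.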
The Great Divide
Agile challenges are exacerbated by the IT-business disconnect, something which most experienced data management professionals have seen over their careers. “The line” wants what they want when they want it, irrespective of the development and data ramifications of their requests. Most business folks couldn’t care less about data models, referential integrity, and normalized tables. Changes often come from the top down and IT just needs to make it happen. This breeds frustration in IT folks, put in the unfortunate position of “just doing their job” if they successfully make the changes and “screwing things up” if they don’t.
Bottom line: Agile isn’t a panacea to The Great Divide between IT and the business. It may be better for many projects than a Waterfall methodology and, to be sure, not understanding fundamental business requirements will pose major challenges for any software development process. Ensure that IT is at the table during major development efforts, especially if business end users are unfamiliar with software, technical, and data issues.
What say you?