The Evolution of ETL

Why We Move Data

In the history of data processing, every application — and sometimes every program within an application — had its own copy of the data. We counted on being able to change that data and then pass the copy on to the next program in the application sequence.

Databases made this “change-and-forward” routine easier, as all programs within an application, and different applications, could share the same data. In time, the “Enterprise View” of the data emerged, to support business analytics and decision making. The “business data warehouse” was born as a stable repository of business data that persists across long horizons of time.

With the need for a historical, unified, and stable data warehouse came the need to move — and improve — data. Moving data means more than making a copy of some files. It means extracting the fundamental data of the business out of the volatile application space and holding that data in a value-preserving state over time.

Data In Motion

Moving data is a two-headed problem:

  • What does data mean when it is removed from its original context? – a business problem
  • How does data get from one place to another? – a technical problem

The business problem is the much more important problem to solve, and the much more difficult. It requires business knowledge, not merely technical competence. A wide variety of data management practices and processes have grown up around this problem: master data management, business analytics, predictive and retrospective analysis, data governance, and meta data management.

All of these processes and practices, however, presume that the technical problem — the problem of Data in Motion — has been, or can be, solved. And it has been solved — over and over again.

Generation 1: Customized Code and Code Generators

In the earliest days, when the data warehouse was just being born, data was moved through custom-written application code. Since much of the data was resident on mainframe systems, the typical data movement application was a COBOL program, retrieving data from application data stores and moving the data to the data warehouse. Because application data was not always, or even often, stored in a  relational database supported by standard SQL, customized code — often built manually — was the only option.

Those methods seem laughable today, but at the time (the early 1990’s) we used the tools that were at hand to accomplish the task. Then we improved the tools, by creating code generators. As late as 1993, Prism Solutions was announcing its release of  the Prism Warehouse Manager 2.0 — the press report bragged that the new product

…generates Cobol programs to extract and transform data from IMS, Vsam and sequential file structures, then produces JCL and DLL statements with scripts to load the data into a DBMS.

These early ETL tools also resulted in customized code to put Data In Motion — the customization was just being done by a generator rather than by a skilled programmer. We didn’t change the paradigm — we just got better at it.

I refer to this period as ETL Generation 1. It continued for a relatively short period, before the tool vendors realized that this was a market opportunity. That market vacuum needed to be filled. ETL Generation 2 was ushered in as one tool after another pushed its way into the spotlight.

Generation 2: ETL Engines and the ETL Ecosystem

This second generation of ETL tools seeks to provide full, “cradle-to-grave” coverage of the ETL process. The result is a marketplace of fully-integrated tools which can read from any data source, perform any transformation, and populate any target system. These are the ETL Engines. Rather than generate COBOL code, the ETL Engines operate as self-contained execution systems. They use parameterized job definitions to connect to the source and target systems, identify the data, and move the data applying the changes along the way. Beyond that minimal capability, the ETL 2.0 tools incorporate their own, often proprietary, implementation of industry best practices: a metadata repository, an audit and balance facility, a version control system, an archiving system, a data quality and validation facility, a scheduling system, a resource-management system, and even a security system.

What is interesting to me about these products — and they are products, much more than mere solutions — is that they are seamless, tightly integrated, and highly cohesive, even though they are made up of many disparate parts with each part performing a single, separable function. That seamlessness is a source of pride for the tool vendor; it is probably a defined design objective.

We can review any of the leading ETL Engines today — IBM InfoSphere (“DataStage”), Informatica, Ab Initio — and quickly learn that the tool supports all of the processing facets of a full ETL ecosystem. We learn also that each of their parts works well  — seamlessly — with each of their other parts.

What is less obvious is whether the tool is a closed system. The parts work well with each other because they subscribe to interfaces that are defined to each other. But those interfaces are not open to other tools. While InfoSphere is integrated with itself (and, to some extent, its brethren products in the WebSphere “family”), it is not integrated with Informatica, Ab Initio or Talend. We don’t expect them to be integrated because they are, after all, competing products.

This leaves us with a choice of tools from this 2nd Generation of ETL tools — but it is a Hobson’s choice. We can use InfoSphere, or Informatica, or Talend, or the next tool that comes along — but we cannot use the best of each and mold them into a seamless solution of our own. The ETL tool marketplace is populated with whole-solution products, but not with solution components.

There is no problem with this whole-solution product approach. It offers a welcome degree of simplicity for the customer who — looking at all of the myriad aspects of an ETL solution — can’t decide where to begin.

But the whole-solution product does bring with it a subtle constraint. The customer is obligated to take the weak part of a product in order to enjoy the benefit of its strengths in another area.

Examining the choices

As more enterprises step into the data warehouse arena, and as we gain more experience with data warehousing in general and with ETL in particular, we need to reaffirm this whole-solution product direction. It is time to examine the choices being offered by the ETL industry and determine if they are suited to solving the problems the customer is facing on a daily basis.

That examination will determine if this is the time to advance to a Third Generation of ETL — to advance to ETL 3.0. In the next installment, we will begin the examination.