ETL 3.0 vs. The Need For Speed

The previous posts have taken us on a journey through the business side of ETL. Unlike most ETL discussions, the ETL 3.0 story is rooted first and firmly in the business impact of a technology decision. I showed how a technology choice, made purely on technical grounds, can handcuff a system and constrain the business.

ETL 3.0 is built on the premise that change is unavoidable. With that in mind, it is imperative that the ETL 3.0 architecture be built to accommodate change. This is the first of the flexible, extensible, autonomous architecture principles, and it sets the stage for all subsequent architectural choices.

I have found, to my genuine surprise, that being flexible and receptive to change is not a commonly held view among architects and designers. In fact, flexibility is often dismissed as an inhibitor to the preferred architecture, design and implementation. So, I wondered, what is the dominant principle?


As Fast As Possible

Whenever I ask a client what they are trying to accomplish with their ETL solution, the first answer is almost always: “to go as fast as possible”. There may be other drivers, but speed is always the primary one.

This emphasis on speed as the first principle leads to some very significant choices:

  • Networks are inhibitors to speed; therefore, eliminate or shorten the network path
  • Disk storage is an inhibitor to speed; therefore, eliminate as much disk I/O as possible
  • Data size is an inhibitor to speed; therefore, use binary (machine-dependent) formats rather than human-readable formats
  • Generic interfaces are inhibitors to speed; therefore, use specific interfaces that are tuned to the specific release level of the specific tool for the specific system
  • Novice programmers are inhibitors to speed; therefore, use highly-skilled experts who can apply fine-grained “tweaks” to the system for maximum speed

Now, not every “speed freak” will take all of these positions, and some of them seem a little extreme. But in our experience, we have heard every one of these cited as a speed inhibitor, and we have heard of, seen and recovered from every one of these solutions.

As Fast As Necessary

What I propose – and what ETL 3.0 embraces – as an alternative is to architect a system that is as fast as necessary. This gives us a goal that is more meaningful, more achievable and more affordable.

The purpose of ETL is to ensure that data is available when the user submits a query for that data. That might mean that the data must be available one minute after midnight, or it might mean that the data must be available no later than 6:00 a.m. In some cases, the data must be available within one hour or twelve hours. In real-time cases, the data must be available in seconds or less.

There is a speed dimension to each of these requirements. An ETL process that takes 13 hours won’t fit into a 6-hour overnight processing window. But the speed that is necessary is not the same as the speed that is as fast as possible.
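To make “as fast as necessary” concrete, here is a minimal sketch in Python of the kind of check I have in mind. The function names, the dates and the 4-hour runtime are all hypothetical and chosen only for illustration; the 13-hour job and the 6-hour window come from the example above.

```python
from datetime import datetime, timedelta


def fits_window(start: datetime, deadline: datetime, runtime: timedelta) -> bool:
    """Return True if a job started at `start` finishes by `deadline`."""
    return start + runtime <= deadline


def slack(start: datetime, deadline: datetime, runtime: timedelta) -> timedelta:
    """Time to spare before the deadline (negative if the job overruns)."""
    return deadline - (start + runtime)


# Hypothetical overnight window: midnight to 6:00 a.m. (dates are arbitrary).
window_start = datetime(2024, 1, 1, 0, 0)
deadline = datetime(2024, 1, 1, 6, 0)

# The 13-hour job from the text does not fit the 6-hour window.
print(fits_window(window_start, deadline, timedelta(hours=13)))

# An assumed 4-hour job fits, with two hours of slack to spend elsewhere.
print(fits_window(window_start, deadline, timedelta(hours=4)))
print(slack(window_start, deadline, timedelta(hours=4)))
```

The point of the sketch is the second function: once a job fits its window, any remaining slack is budget that can be spent on flexibility instead of on further tuning.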

This distinction between as fast as possible and as fast as necessary matters here because, in an ETL 3.0 architecture, it is inappropriate to build an ETL process that is as fast as possible.

In an ETL 3.0 architecture, the emphasis is on being flexible. There have been protests that this flexibility slows the process down. These protests are not without merit – history has shown us how an emphasis on flexibility can degrade speed and add “bulk” (consider XML’s flexibility vs the raw speed of fixed-format EDI).

However, this objection should be taken as a challenge to introduce flexibility while preserving speed. That combination is what ETL 3.0 achieves – the ability to construct a flexible solution that performs at the speed of a fixed, static solution.