ETL 3.0 vs. The Need For Speed

The previous posts have taken us on a journey through the business side of ETL. Unlike most ETL discussions, the ETL 3.0 story is rooted first and firmly in the business impact of a technology decision. I showed how a technology choice, made simply on technical grounds, can handcuff a system — and constrain the business.

ETL 3.0 is built on the premise that change happens, and that it happens unavoidably. With that in mind, it is imperative that the ETL 3.0 architecture be built to accommodate change. This is the first of the flexible, extensible, autonomous architectural principles, and it sets the stage for all subsequent architectural choices.

I have found, to my genuine surprise, that being flexible and receptive to change is not a commonly held view among architects and designers. In fact, flexibility is often dismissed as an inhibitor to the preferred architecture, design and implementation. So, I wondered, what is the dominating principle?

Speed.

As Fast As Possible

Whenever I ask a client what they are trying to accomplish with their ETL solution, the first answer seems to be: “to go as fast as possible”. There may be other drivers, but the underlying one is always speed.

This emphasis on speed as the first principle leads to some very significant choices:

  • Networks are inhibitors to speed; therefore, eliminate or shorten the network path
  • Disk storage is an inhibitor to speed; therefore, eliminate as much disk I/O as possible
  • Data size is an inhibitor to speed; therefore, use binary (machine-dependent) formats rather than human-readable formats
  • Generic interfaces are inhibitors to speed; therefore, use specific interfaces that are tuned to the specific release level of the specific tool for the specific system
  • Novice programmers are inhibitors to speed; therefore, use highly-skilled experts who can apply fine-grained “tweaks” to the system for maximum speed

Now, not every “speed freak” will take all of these positions, and some of them seem a little extreme. But in our experience, we have heard every one of these cited as a speed-inhibitor, and we have heard, seen and recovered from every one of these solutions.

As Fast As Necessary

What I propose – and what ETL 3.0 embraces – as an alternative is to architect a system that is as fast as necessary. This gives us a goal that is more meaningful, more achievable and more affordable.

The purpose of ETL is to ensure that data is available when the user submits a query for that data. That might mean that the data must be available one minute after midnight, or it might mean that the data must be available not later than 6:00 a.m. In some cases, the data must be available within one hour or twelve hours. In real-time cases, the data must be available in seconds or less.

There is a speed dimension to each of these requirements. An ETL process that takes 13 hours won’t fit into a 6-hour overnight processing window. But the speed that is necessary is not the same as the fastest speed that is technically possible.
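To put a number on “as fast as necessary”, consider a small worked example in Python. The volumes and times below are invented purely for illustration:

    # Invented figures, purely to illustrate "as fast as necessary".
    data_volume_gb = 600      # hypothetical nightly load volume
    window_hours = 6          # midnight to 6:00 a.m.

    required_mb_per_s = data_volume_gb * 1024 / (window_hours * 3600)
    print(f"necessary sustained throughput: {required_mb_per_s:.1f} MB/s")
    # roughly 28 MB/s; any design that sustains this meets the requirement,
    # and pushing far beyond it buys little for this particular job.

The window and the volume define the speed that is necessary; anything faster is a bonus, not a requirement.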

I draw this distinction between as fast as possible and as fast as necessary here because, in an ETL 3.0 architecture, it is inappropriate to build an ETL process that is merely as fast as possible.

In an ETL 3.0 architecture, the emphasis is on being flexible. There have been protests that this flexibility slows the process down. These protests are not without merit – history has shown us how an emphasis on flexibility can degrade speed and add “bulk” (consider XML’s flexibility vs the raw speed of fixed-format EDI).

However, this objection should be taken as a challenge to introduce flexibility while preserving speed. That combination is what ETL 3.0 achieves – the ability to construct a flexible solution that performs at the speed of a fixed, static solution.

ETL 3.0: Getting Down To Business (2)

In part 1 of “ETL 3.0: Getting Down To Business”, I described the core principles of an ETL architecture – flexibility, extensibility and autonomy – and the business benefits that flow from such an architecture. In particular, the business benefits because flexible, extensible, autonomous architectural principles empower IT to respond rapidly to changes in the business environment. An empowered IT serves the business by making better Product Selections and making better Staff Selections.

Beyond that, there are more ways to benefit the business through an architecture premised on flexibility, extensibility and autonomy.

Invention

Invention is the natural outcome of curiosity, need and opportunity. Invention begins when someone is faced with a problem (the need) that is systemic or chronic — a problem that goes on and on and demands attention. A problem that costs the business money. Someone wonders “why can’t we change something so that this problem goes away?” (the curiosity). And that person takes the time (the opportunity) to seek out a solution to the problem. That solution, when it is found, is the invention.

But an invention can shrivel on the drawing board unless there is a way to economically and effectively apply the change. That means there must be sufficient flexibility in the system to permit the invention to be tried, and – if successful – to be adopted.

An ETL application which is built around a single, all-encompassing and tightly-integrated product leaves little room for examining alternative solutions. Often, the time, effort and cost of testing the invention are too much to bear. Unless the invention is likely to produce significant value with minimal disruption, it can remain only an idea that never finds its way to implementation.

An ETL 3.0 environment welcomes invention. The hallmarks of the architecture – flexibility, extensibility and autonomy – encourage IT to introduce inventive solutions by minimizing the disruptive impact of change. This is an architectural view that accepts change, whether as remedies or as enhancements, as a natural way to grow and improve a system.

Bottom Line: The ETL 3.0 model empowers IT to better serve the business, by inviting and welcoming invention and minimizing the disruption of exploration, discovery and growth.

‘One-off Solutions’

A variation of the invention is the one-off solution – a single-purpose solution, somewhat out of the ordinary and not fitting into the pre-defined standard that has been established for the system. Every IT organization has worked to create an ETL framework that allows for somewhat rapid development of an ETL job. But what happens when a solution doesn’t fit into the framework? Or when the framework is just too rigid for this solution?

We have all found ourselves creating a quick-and-dirty solution. And we have all promised to throw that solution out once we’ve addressed the immediate problem. And we have all found ourselves hiding it away, to be used again later.

There really isn’t a need to discard a solution that is quick-and-dirty. More than likely, you will need it again, either as a reference or as a basis for another solution.

What is actually needed is a framework that allows you to build a quick-and-clean solution, without sacrificing all of the governance and business rules that apply to normal solutions. A solution that is a one-off is still manipulating data within your domain, and needs to be compliant with the same rules.

An ETL 3.0 environment accepts the one-off solution as a normal (though infrequent) method for addressing problems. In the ETL 3.0 environment, a one-off solution is formed by a unique re-combination of existing parts, rather than by “hacking together” work that is truly quick, but also truly dirty.
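As a sketch of what “a unique re-combination of existing parts” can look like in practice, here is a minimal Python illustration. The step names, the masking rule and the aggregation are invented for this example; the point is only that the one-off pipeline is a new composition of steps that already carry the standard governance behavior, rather than a fresh pile of throwaway code.

    from typing import Callable, Iterable

    Record = dict
    Step = Callable[[Iterable[Record]], Iterable[Record]]

    def compose(*steps: Step) -> Step:
        """Chain existing, already-governed steps into an ad-hoc pipeline."""
        def pipeline(records: Iterable[Record]) -> Iterable[Record]:
            for step in steps:
                records = step(records)
            return records
        return pipeline

    # Hypothetical steps that already exist in normal production:
    def mask_account_numbers(records: Iterable[Record]) -> Iterable[Record]:
        # an existing governance rule, reused unchanged
        for r in records:
            r["account"] = "***" + str(r.get("account", ""))[-4:]
            yield r

    def to_quarterly_totals(records: Iterable[Record]) -> Iterable[Record]:
        # the one-off twist: an aggregation no standard job produces
        totals: Record = {}
        for r in records:
            totals[r["quarter"]] = totals.get(r["quarter"], 0) + r["amount"]
        for quarter, amount in totals.items():
            yield {"quarter": quarter, "amount": amount}

    one_off = compose(mask_account_numbers, to_quarterly_totals)
    sample = [{"account": 12345678, "quarter": "Q1", "amount": 10.0},
              {"account": 87654321, "quarter": "Q1", "amount": 5.0}]
    for row in one_off(sample):
        print(row)   # {'quarter': 'Q1', 'amount': 15.0}

Because the building blocks are the production-grade steps themselves, the result is quick and clean rather than quick and dirty.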

Bottom Line: The ETL 3.0 model empowers IT to better serve the business, by offering ways to create a quick-and-clean solution, leveraging both the parts and the framework that already exist in normal production.


ETL 3.0: Getting Down to Business

ETL 3.0 is driven by the business cost of putting data in motion. ETL is “plumbing” — it offers no direct value, but it is a critical part of the support system for the data warehouses and data marts which do provide direct value. In this light, ETL is a necessary cost. ETL 3.0 is shaped to manage and control that cost.

The driving factor of the modern IT shop is to operate at the speed of business. More than anything else, this means being able to respond rapidly to changes in the business climate. These changes come from the business units within the enterprise, from trading partners outside the enterprise and – for IT – from the continuous advancements of technology itself.

Whenever the business makes a new demand, IT must be prepared to satisfy that demand. It must do so rapidly. It must do so without major disruption. And it must be able to move forward when it needs to move forward.

The approach to technology that underlies a system contributes to the responsiveness of IT. And the core principles underlying the system’s architecture drive that approach.

Core Principles for an ETL Architecture

It is important to give appropriate weight to the principles that must drive the architectural foundation of the Extract, Transform and Load (ETL) system:

  • Flexibility
  • Extensibility
  • Autonomy

These three principles – Flexibility, Extensibility and Autonomy – have the greatest impact on IT’s ability to respond to changing business demand.

Flexibility is the key principle to guide all design, adoption and implementation choices. Flexibility means being able to adapt to forces of change, easily, swiftly and with minimum risk. Technology, products, the marketplace, or – especially – the business may impose necessary and beneficial change. Flexibility is essential to avoid “tear-up” when the inevitable changes occur.

Extensibility ranks right behind Flexibility. It is especially important when Flexibility must be compromised because of a limitation in a product or design choice. Extensibility means being able to take a product beyond its intended capabilities. This is the enabler of discovery and invention, two key elements of a vibrant IT organization. Extensibility allows you to overcome limitations, not with “workarounds”, but with solutions that are well-designed and architecturally sound.

Autonomy is the IT organization’s capacity for moving forward at its own pace. Autonomy is enabled by Flexibility, by Extensibility and by a skilled workforce. If an architected solution supports Autonomy, then the IT organization can take an active role in creating what is needed, when it is needed, to respond to a specific business demand. The IT organization is not dependent on, or encumbered by, the ability or desire or timetable of vendors or markets.

The Inevitability of Change

The business community will change – organizations recombine and refocus; partners and vendors join and separate; people come and go; priorities rise and fall.

And certainly technology changes – in fact, “technology churn” has been with us for decades. Often, the churning just produces cosmetic changes to old favorites. But, on occasion, the churning signals a significant adjustment – the replacement of some tried-and-true technology with a “new kid on the block” that will make a real difference.

When architecting a solution, we must account for the fact that today’s innovations are tomorrow’s legacy systems and legacy products. Anticipate the change that will inevitably come, by adopting Flexibility and Extensibility as core architectural principles.

ETL 3.0 Begins With Flexibility

The core principle of ETL 3.0 is rooted in the recognition of change as a constant and driving force. ETL 3.0 strives to embrace change — the same external pressures that once prevented IT from servicing customer demands are anticipated by ETL 3.0, enabling IT to respond rapidly to the customer.

The benefits of an ETL 3.0 architecture address the costs of technical changes, but they go well beyond technology. An ETL 3.0 architecture enables direct business benefits as well.

Product Selection

In every IT shop, there comes the moment when the market unveils a new product, one which is better, faster and — most importantly — cheaper than what you have in place. Yet you cannot take advantage of these cost savings because the new product doesn’t work just like your legacy products. The parts don’t fit together, so the cost savings are offset by the costs and risks of “rip-up-and-replace”.

ETL 3.0 attacks this problem by dictating that the ETL solution must be made up of discrete parts, with standardized, open, and loosely-coupled interfaces between parts. When the transitions from source system to Extract, from Extract to Transform, from Transform to Load, and from Load to target system conform to the ETL 3.0 model, then any individual part of the process can be enhanced or replaced without disrupting the remaining parts.
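To make the idea of discrete parts behind standardized, loosely-coupled interfaces a little more concrete, here is a minimal sketch in Python. The class names, the hand-off format and the sample rule are invented for illustration; no real ETL product or vendor interface is implied.

    import csv
    from typing import Iterable, Protocol

    # Hand-off format between stages: a plain dictionary per record,
    # deliberately generic rather than tied to any one tool.
    Record = dict

    class Extractor(Protocol):
        def extract(self) -> Iterable[Record]: ...

    class Transformer(Protocol):
        def transform(self, records: Iterable[Record]) -> Iterable[Record]: ...

    class Loader(Protocol):
        def load(self, records: Iterable[Record]) -> None: ...

    class CsvExtract:
        """Hypothetical extract component that reads a delimited file."""
        def __init__(self, path: str) -> None:
            self.path = path

        def extract(self) -> Iterable[Record]:
            with open(self.path, newline="") as handle:
                yield from csv.DictReader(handle)

    class LegacyTransform:
        """Stand-in for the existing transform component."""
        def transform(self, records: Iterable[Record]) -> Iterable[Record]:
            for record in records:
                record["amount"] = float(record.get("amount", 0))  # hypothetical rule
                yield record

    class PrintLoad:
        """Stand-in for the load component; a real one would write to the target."""
        def load(self, records: Iterable[Record]) -> None:
            for record in records:
                print(record)

    def run_pipeline(extract: Extractor, transform: Transformer, load: Loader) -> None:
        # Each stage sees only the shared contract, never its neighbors'
        # internals, so any single stage can be swapped without disturbing the rest.
        load.load(transform.transform(extract.extract()))

    # Example wiring (the file name is illustrative only):
    # run_pipeline(CsvExtract("orders.csv"), LegacyTransform(), PrintLoad())

A candidate transform that honors the same contract can be wired in beside the legacy one and fed the same extract, which is the kind of low-disruption, side-by-side comparison described next.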

The business value lies in empowering IT to explore new product opportunities, by minimizing the disruptive cost of the exploration. A new transform component can be added in, co-existing with the existing legacy transform component, for a direct side-by-side comparison of features, functions and fit.

When a good fit is found, the ETL 3.0 model permits that same replacement model to ease the transition from the legacy tool to the updated tool. Because the interfaces are open, standardized and loosely-coupled, the old and new tools can run side-by-side. This allows the transition to be paced, driven not by the demands of the technology but by the capacity and capabilities and needs of the business.

Bottom Line: The ETL 3.0 model empowers IT to better serve the business, by investigating and adopting new products with minimal disruption and side-effects, leveraging the historic technology trend of better, faster, cheaper products.

Staff Selection

As a data warehouse grows to meet the needs of the business, it is likely that the number of systems and tools making up that warehouse — the “plumbing” — will expand. Today’s market is dominated by a few very large vendors and a smattering of smaller vendors. The technologies and tools offered by both large and small vendors are continuously in flux, and every tool and platform requires a different set of skills, experience and expertise. As this mix grows within a single IT shop, staffing that shop becomes more difficult — and more expensive.

ETL 3.0 attacks this problem by dictating that the ETL solution must be made up of discrete parts, with standardized, open, and loosely-coupled interfaces between parts. Because the parts are discrete and the interfaces are loosely-coupled, the ETL development staff can likewise be composed of discrete and loosely-coupled staff members.

Consider: if an ETL environment consists of moving data from an IBM mainframe to an Oracle database using Informatica, the ETL staff is typically made up of people who know Informatica and Oracle, or Informatica and mainframe. Candidates who are expert in Informatica but don’t know Oracle, or who are strong in Oracle but don’t know Informatica, are not as highly regarded as someone who is less expert in both.

Now consider what happens when change occurs: imagine that the ETL environment is expanded to include a Teradata system, co-existing with Oracle. And some of the ETL will be processed using Talend. What skills does the new ETL staff candidate need to have? The search for someone experienced with mainframe and Informatica and Talend and Oracle and Teradata will probably reduce to zero candidates.

In an ETL 3.0 environment, IT is freed from making such a choice. Each tool or platform can be staffed by the most affordable skilled candidate for that position. The staff members work as loosely-coupled teams to obtain the maximum value from each tool or from each platform.

Bottom Line: The ETL 3.0 model empowers IT to better serve the business, by drawing the best, most skilled, most affordable ETL staff members from a large pool of candidates.

What are other ways that the ETL 3.0 model empowers IT? In the next dispatch, I’ll be talking about the two essentials of advancing business value — invention and ‘one-off solutions’.


ETL 3.0™ is a trademark of BVWatson LLC, copyright 2010, all rights reserved.

In ETL, We Mean Business

In the last blog entry, “Heraclitus, Change and ETL”, I described three scenarios that, as IT professionals, we never want to encounter twice. All of these scenarios were rooted in an inability to adapt to change.

In the first scenario, the business loses the opportunity to reduce its cost because the legacy ETL applications are wedded to your existing technology. Years ago, the decision was made to tightly couple the applications to the technology to gain the best possible performance. It was a good decision then — performance has been stellar, even at one time extraordinary. Today, however, that decision is a constraint, and it costs the company a significant savings opportunity.

What do you say to the business? You could defend the tight-coupling decision, but that doesn’t move you forward to the next opportunity. Unless you take action, you may find yourself making the same defense in another 12 months. That won’t be a meeting you will enjoy.

In the second scenario, the business is not offered the choice, only an impending and unavoidable disruption. When you chose this software tool several years ago, you did your due diligence on the vendor’s financial stability and outlook. But the unthinkable has happened — the economy sank, the bubble burst, credit became more difficult to get and the vendor is now floundering. There is no problem with the tool — it works as well as ever. But it no longer has a future and your investment in that tool is now at risk. This is the problem of the whole-solution product. You have your whole application, indeed the life-blood of your data warehouse, tied to this soon-to-be-obsolete product.

There is no easy way out of this predicament, but that’s not the question. The question is, as you bite the bullet and move your application out from under this deprecated product, what can you do to avoid landing right back into the same trap? You could search for the vendor that you’re certain will never fail — but in these times there are no sure bets. The product you need most may be the product of a new and financially unproven vendor. Should you bet your job on the new vendor? Or bet your job on the proven track record? How about if you don’t bet your job at all?

The third scenario goes to the heart of the cost-saving pressures that we face every minute. Staffing an ETL project presents its own set of challenges; you need expertise in a variety of tools and platforms — just one skill won’t do. Every additional source or system type added to your ETL mix has a multiplier effect on the staffing needs. As the total warehouse becomes more complex, the number of outsourcing options diminishes geometrically, until — as in one case — the number of staffing options is reduced to zero.

What choices do you have? Of course, one option is to take the single-platform route — move all of your data warehouse platforms to a single family, and get by with one set of skills and tools. But this seems oddly reminiscent of other constraining problems.

What if you could use individuals who each add one expert skill to the team? And what if the rarest skills could be hired as temporary help, to fill the gap? And what if each expert could build their part of the solution knowing only how that part fit with other parts — but not knowing how the other parts worked? This has worked for years in other industries — why not in your multi-platform, multi-tool ETL shop?

I described three scenarios that are at the heart of what frustrates the business about IT.  ETL 3.0 will help you avoid these scenarios in the future, but it’s important, first, to understand that these situations are not an inevitable part of delivering high-performing or quick-to-market IT solutions.

To meet the pace of change in technology and business that is part of the current marketplace, IT must remove the threat of vendor breakdowns, and exploit the advantages of lower-cost solutions while still meeting — and exceeding — the demand for new capabilities.  As IT Professionals, our choices cannot simply be about leveraging the proprietary [and cool] features of a particular technology or having one technical part operating as fast as possible or focusing on the initial speed to implement or cost to deploy.  Our focus must be on how we remove business barriers, how we empower the business to respond to competitive threats and how we contribute to business value.

This focus on the business is at the heart of ETL 3.0. I’ll describe that in the next dispatch.

Heraclitus, Change and ETL

ποταμοῖσι τοῖσιν αὐτοῖσιν ἐμϐαίνουσιν, ἕτερα καὶ ἕτερα ὕδατα ἐπιρρεῖ.1

Heraclitus, the ancient Greek philosopher, said “You cannot step twice into the same river”. We have been living this truth in IT for decades now. Change is with us, undeniable and inexorable. Sometimes we welcome the change; sometimes we resist the change; sometimes we fear the change. But change wins out. What we assume to be obviously true today will be obviously false one day.

IT Needs to Change

Change as a constant, driving force is well-recognized in the IT world. As we look back over the last 20 years, we see the arrival and departure of new technologies, new vendors, and new tools. We see new development processes as Agile and Scrum press hard against Waterfall and SDLC. We see new alliances, as one large company acquires — and sometimes dismantles — another, as familiar tools and languages become obsolete and new ones take their moment in the spotlight. We see Open Source gain a foothold against the major tool sets, and the demise of “cradle-to-grave” CASE tools against the speed and agility of IDEs. And we see the emergence of the internet, email, smart phones and WiFi creating a demand for IT services that is always present, always connected, always on. Change is both constant and immediate.

Business Needs to Change

The change in IT is reflective of the change in the business that IT serves. Business is under increasing pressure to satisfy a marketplace that demands better, cheaper, faster goods and services. Cheaper and faster come from mass production, where competing companies offer the same product and try to distinguish themselves on price and on marketing — on “mass customization” and a “360-degree view of the customer”. To meet this need, the business needs to know more about each customer. To know more, the business needs more data. And — because of the changes in technology — there is more data available all the time about everything everybody does. Business can keep up with this explosion in data because advances in technology make it both possible and cost-effective.

Which brings us to the need for an ETL architecture that allows the data warehouse to move at the speed of the business. We have seen that technology changes frequently, and usually for the better. When technology changes, can IT embrace that change, fold it into its arsenal and exploit the improvement it offers? Or is IT shackled to a technology that is tried and true but no longer measures up to the demands of modern business?

What Would You Do?

As you look at your ETL architecture, how would you respond to these change scenarios?

  • A new database server promises to provide the same capabilities as your legacy servers, at a dramatic reduction in cost. But it uses standard SQL and has its own proprietary utilities, much different from your legacy systems (which use a proprietary SQL). You decide it will cost too much to change your legacy ETL applications to use a new set of tools and commands, so you skip this cost-reduction opportunity. How do you explain that to the business?

  • The vendor of your core ETL software product is at risk of going bankrupt. You’re afraid that you will lose any support or future enhancements to the product. You need to find an alternative product, but you have 15,000 ETL jobs and 20,000 business rules embedded in the product. It will take 12 months and several million dollars to convert to a new tool, just to do what you’re able to do today. How do you avoid getting into this situation again?

  • You can reduce staffing costs by outsourcing the development work. You are using a leading ETL software product to extract data from 4 types of source systems and to load 3 different types of database servers. You’re looking for ETL developers who know all 8 products, but it’s apparent that there is no one with that combination of skills. You end up having to split the work among 3 different outsource firms to get developers with the skills to implement your ETL jobs, and managing the distribution of work is a nightmare. You’re afraid you’ll have to give up the outsourcing savings and bring everything back in-house. Do you have any other options?

None of these are comfortable situations to be in – and a well-crafted architecture can go a long way toward avoiding these situations. This series continues with a look at the architectural principles behind ETL 3.0 — principles that drive such a well-crafted architecture.


1 Translation: “Ever-newer waters flow on those who step into the same rivers”

The Evolution of ETL

Why We Move Data

In the history of data processing, every application — and sometimes every program within an application — had its own copy of the data. We counted on being able to change that data and then pass the copy on to the next program in the application sequence.

Databases made this “change-and-forward” routine easier, as all programs within an application, and different applications, could share the same data. In time, the “Enterprise View” of the data emerged, to support business analytics and decision making. The “business data warehouse” was born as a stable repository of business data that persists across long horizons of time.

With the need for a historical, unified, and stable data warehouse came the need to move — and improve — data. Moving data means more than making a copy of some files. It means extracting the fundamental data of the business out of the volatile application space and holding that data in a value-preserving state over time.

Data In Motion

Moving data is a two-headed problem:

  • What does data mean when it is removed from its original context? – a business problem
  • How does data get from one place to another? – a technical problem

The business problem is the much more important problem to solve, and the much more difficult. It requires business knowledge, not merely technical competence. A wide variety of data management practices and processes have grown up around this problem: master data management, business analytics, predictive and retrospective analysis, data governance, and metadata management.

All of these processes and practices, however, presume that the technical problem — the problem of Data in Motion — has been, or can be, solved. And it has been solved — over and over again.

Generation 1: Customized Code and Code Generators

In the earliest days, when the data warehouse was just being born, data was moved through custom-written application code. Since much of the data was resident on mainframe systems, the typical data movement application was a COBOL program, retrieving data from application data stores and moving the data to the data warehouse. Because application data was not always, or even often, stored in a relational database supported by standard SQL, customized code — often built manually — was the only option.

Those methods seem laughable today, but at the time (the early 1990s) we used the tools that were at hand to accomplish the task. Then we improved the tools, by creating code generators. As late as 1993, Prism Solutions announced the release of the Prism Warehouse Manager 2.0 — the press report bragged that the new product

…generates Cobol programs to extract and transform data from IMS, Vsam and sequential file structures, then produces JCL and DLL statements with scripts to load the data into a DBMS.

These early ETL tools also resulted in customized code to put Data In Motion — the customization was just being done by a generator rather than by a skilled programmer. We didn’t change the paradigm — we just got better at it.

I refer to this period as ETL Generation 1. It continued for a relatively short period, before the tool vendors realized that this was a market opportunity. That market vacuum needed to be filled. ETL Generation 2 was ushered in as one tool after another pushed its way into the spotlight.

Generation 2: ETL Engines and the ETL Ecosystem

This second generation of ETL tools seeks to provide full, “cradle-to-grave” coverage of the ETL process. The result is a marketplace of fully-integrated tools which can read from any data source, perform any transformation, and populate any target system. These are the ETL Engines. Rather than generate COBOL code, the ETL Engines operate as self-contained execution systems. They use parameterized job definitions to connect to the source and target systems, identify the data, and move the data, applying the changes along the way. Beyond that minimal capability, the ETL 2.0 tools incorporate their own, often proprietary, implementation of industry best practices: a metadata repository, an audit and balance facility, a version control system, an archiving system, a data quality and validation facility, a scheduling system, a resource-management system, and even a security system.
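To ground the phrase “parameterized job definitions”, here is an invented, vendor-neutral sketch in Python of what such a definition might contain. No real engine uses this exact format or these exact keys; actual products keep the equivalent information in their own (often proprietary) repositories and interpret it at run time rather than generating code.

    # A hypothetical job definition; the keys, names and values are
    # illustrative only, not any vendor's actual format.
    job_definition = {
        "name": "load_customer_dim",
        "source": {"type": "db2", "connection": "MAINFRAME_PROD", "table": "CUSTOMER"},
        "target": {"type": "oracle", "connection": "DW_PROD", "table": "DIM_CUSTOMER"},
        "mappings": [
            {"from": "CUST_NO",   "to": "customer_key"},
            {"from": "CUST_NAME", "to": "customer_name", "transform": "trim_and_upper"},
        ],
        "schedule": "daily at 02:00",
        "on_error": "quarantine_and_continue",
    }

The engine itself, rather than any generated program, is responsible for honoring a definition like this from end to end.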

What is interesting to me about these products — and they are products, much more than mere solutions — is that they are seamless, tightly integrated, and highly cohesive, even though they are made up of many disparate parts with each part performing a single, separable function. That seamlessness is a source of pride for the tool vendor; it is probably a defined design objective.

We can review any of the leading ETL Engines today — IBM InfoSphere (“DataStage”), Informatica, Ab Initio — and quickly learn that the tool supports all of the processing facets of a full ETL ecosystem. We learn also that each of their parts works well — seamlessly — with each of their other parts.

What is less obvious is whether the tool is a closed system. The parts work well with each other because they subscribe to interfaces that are defined for one another. But those interfaces are not open to other tools. While InfoSphere is integrated with itself (and, to some extent, its brethren products in the WebSphere “family”), it is not integrated with Informatica, Ab Initio or Talend. We don’t expect them to be integrated because they are, after all, competing products.

This leaves us with a choice from this 2nd Generation of ETL tools — but it is a Hobson’s choice. We can use InfoSphere, or Informatica, or Talend, or the next tool that comes along — but we cannot use the best of each and mold them into a seamless solution of our own. The ETL tool marketplace is populated with whole-solution products, but not with solution components.

There is nothing inherently wrong with this whole-solution product approach. It offers a welcome degree of simplicity for the customer who — looking at all of the myriad aspects of an ETL solution — can’t decide where to begin.

But the whole-solution product does bring with it a subtle constraint. The customer is obligated to take the weak part of a product in order to enjoy the benefit of its strengths in another area.

Examining the choices

As more enterprises step into the data warehouse arena, and as we gain more experience with data warehousing in general and with ETL in particular, we need to ask whether this whole-solution product direction deserves to be reaffirmed. It is time to examine the choices being offered by the ETL industry and determine if they are suited to solving the problems the customer is facing on a daily basis.

That examination will determine if this is the time to advance to a Third Generation of ETL — to advance to ETL 3.0. In the next installment, we will begin the examination.

Announcing ETL 3.0

ETL 3.0™ is a trademark of BVWatson LLC, copyright 2010, all rights reserved.

It is almost appropriate that ETL 3.0 should be first pronounced on July 4th. This is the date celebrated as “Independence Day” in the United States, and ETL 3.0 offers independence as its primary objective.

In these pages, over the next few months, I will describe the evolutionary history of ETL through its first two generations, before describing the need for ETL to move on to a third generation — one rooted in independence, flexibility, extensibility and autonomy.

For now, this pronouncement on this date will mark the beginning of the next step in ETL’s growth. The rest will come in parts over the next few months.

Please stay tuned.