First, Philosophy, then Data Science

One of the most telling articles on “Data Science” appeared in the NYTimes in April[1]. We are facing a massive shortage of data scientists, it read. “There will be almost half a million jobs in five years, and a shortage of up to 190,000 qualified data scientists.”
Trouble is, the same article says, “Because data science is so new, universities are scrambling to define it and develop curriculums.”
So — we don’t know what they are, but we need 500,000 of them.


What’s the best answer?

In a recent dialog on a web forum, someone asked how to find the “best” way to construct an SQL query. They had posed a problem and received many solutions, each different in syntax, in approach, and in detail. Reasonably, they asked, “How do I find the ‘best’ solution?”
That’s a question we should always be asking when we craft a solution to any problem. In IT, and in SQL particularly, I use these criteria to evaluate an answer:
(1) it must return the correct answer on the target platform
(2) it should use standard syntax and constructs
(3) it should be easy to understand by someone who didn’t write it
(4) it should be as fast as necessary… it needn’t be as fast as possible
(5) it should play well with others in a multi-user environment
(6) it must be ready-to-run on time and under budget
The “best” answer is the one which blends all of these. Notice that only two of these are “must”, the others are “should”.
Let’s take these one by one:

Why “Metaphysics”?

A very short blog entry this time. Read this article:
Data Management Is Based on Philosophy, Not Science

My field of study at the university was philosophy (after a couple years of political theory and constitutional democracy). I still keep the works of Aristotle on my shelf. I’ve always thought philosophy was the perfect way to prepare for a career in information systems. I make that recommendation to parents asking what their computer-savvy child should major in. They always look at me like I’ve grown an extra eye!

But philosophy is the study of … well, of everything, and metaphysics is the precursor to philosophy.

Hence — software metaphysics — the fundamental thinking that prepares the ground for all software.

Worth a read.

Software for Idiots

Tonight, I tried to install an HP Photosmart C4700 printer at home.

Now, I’m a fan of HP. They’ve brought me a lot of business over the last 8 years. I have HP PCs and laptops. I’m one of the few people in the world, apparently, with an HP TV.

So, I want this HP printer to replace the worn-out Lexmark. It’s a nice printer, an all-in-one copy/scan/print machine. But the key (besides the low price) is that it’s wireless. I need a network printer, and this little baby fits the bill.

Until, that is, I tried to install it. Everything works — except the wireless. Hmmmm. I click on the "configure wireless connection" in the HP Solution Center, and my PC spins for a bit and then — nothing.

Nothing. No light flashing. No message box. Nothing. It just starts up like it’s going to do an install and then it quits.

Eventually, I figure out that, although I’ve installed all the software already, I need to have the original CD in the drive to configure the wireless. Huh? Try again.

This is better, but not by much. It starts the program and begins to scan looking for my Wi-Fi.

Okay, here’s where I tell you that I am paranoid about my network. I take lots of Wi-Fi security measures. But the first Wi-Fi security measure is — don’t broadcast your Wi-Fi. So I don’t.

HP seems to think that you would — and should. So they search, and search, then show me some networks from the neighbors’ houses.

I’ll stop there, because here’s my point — every software package I get seems to have adopted the most obnoxious behaviors of AOL, Microsoft and others. And that is to assume that you (the consumer) are too ignorant to know whether you need help or not.

So software will scan every hard drive attached to your machine — no matter how many terabytes — looking for some software or configuration file. And it will make you wait until it’s done. Sure, they could have just asked: What program do you want to use? or What is the name of your network? But that would assume an intelligent user. Software doesn’t assume that. On the contrary, they assume you are ignorant, too ignorant to even allow you the option of taking a shortcut.

When you write software like this — and more and more vendors do — you are making an architectural choice. You are choosing to treat your customer like he/she is an idiot.

The significance of that choice is profound. It seeps into all of your design decisions, all of your marketing decisions, all of your support decisions. And it certainly seeps into the friendliness or hostility of your product.

It is a simple thing to ask, when installing or configuring software — do you know how to install this? do you know where your programs/files are? do you know where your network is? Maybe 90% of the customers will say "No, please help me". But some of your customers will say "Yes, thank you, my time is valuable. I’m installing this software because I’m trying to get something done. My goal is to get back to real work, so let me speed this process up. Please."
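Here is a minimal sketch of that idea in Python. Nothing in it comes from HP's installer. The function name `choose_network`, its parameters and the `scan` callback are all hypothetical, invented only to show the shape of the choice: if the user already knows the answer, take it and skip the search.

```python
from typing import Callable, List, Optional


def choose_network(known_ssid: Optional[str],
                   scan: Callable[[], List[str]]) -> str:
    """Return the network to configure (hypothetical installer step).

    If the user supplied a network name, trust it and skip the scan.
    Only users who ask for help pay the cost of the search.
    """
    if known_ssid:
        return known_ssid          # the shortcut: no scan at all
    found = scan()                 # the slow, "help me" path
    if not found:
        raise LookupError("no networks found; ask the user for a name")
    return found[0]
```

The expensive step is still there for the customers who want it. It just stops being mandatory for the ones who don't.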

Part of Software Metaphysics’ mission is to examine some principles that form the basis of how we think about software — principles that we don’t even realize we hold.

How we envision our users defines how we build our software.

Do yourself a favor — think about users as pretty smart folks.

ETL Architecture – Core Principles

IT at the Speed of Business

The driving factor of the modern IT shop is to operate at the speed of business.

More than anything else, this means being able to respond rapidly to changes in the business climate. These changes come from the business units within the enterprise, from trading partners outside the enterprise and – for IT – from the continuous advancements of technology itself.

Whenever the business makes a new demand, IT must be prepared to satisfy that demand. It must do so rapidly. It must do so without major disruption. And it must be free to move forward when it needs to move forward.

The approach to technology that underlies a system contributes to the responsiveness of IT. And the core principles underlying the system’s architecture drive that approach.


Core Principles for ETL Architecture

It is important to give appropriate weight to the principles that drive the architectural foundation of the Extract, Transform and Load (ETL) system. These principles – listing the most important first – are:

·   Accuracy

·   Reliability

·   Flexibility

·   Extensibility

·   Autonomy

·   Cost of Ownership

·   Scalability

·   Speed

Three of these principles – Flexibility, Extensibility and Autonomy – have the greatest impact on IT’s ability to respond to changing business demand.

Flexibility is the key principle to guide all design, adoption and implementation choices. Flexibility means being able to adapt to forces of change, easily, swiftly and with minimum risk. Technology, products, the marketplace, or – especially – the business may impose necessary and beneficial change. Flexibility is essential to avoid “tear-up” when the inevitable changes occur.

Extensibility ranks right behind Flexibility. It is especially important when Flexibility must be compromised because of a limitation in a product or design choice. Extensibility means being able to take a product beyond its intended capabilities. This is the enabler of discovery and invention, two key elements of a vibrant IT organization. Extensibility allows you to overcome limitations, not with “workarounds”, but with solutions that are well-designed and architecturally sound.

Autonomy is the IT organization’s capacity for moving forward at its own pace. Autonomy is enabled by Flexibility, Extensibility and a skilled workforce. If an architected solution supports Autonomy, the IT organization can take an active role in creating what is needed, when it is needed, to support the specific business demand. The IT organization is not dependent on, or encumbered by, the ability or desire or timetable of vendors or markets.


The ETL Model

A model for this ETL architecture can be as simple and complete as the one shown in Figure 1.

 


Figure 1 ETL Model

 

EXTRACT is platform-specific. Its role is to optimally collect the data that needs to be shipped to the Transformer, including both the core data and its context. The Extractor may be consuming resources on a highly-active, highly-volatile system. It must be able to take advantage of platform-specific features, and deal with platform-specific limitations, in order to minimize disruption to the platform.

TRANSFORM is platform-agnostic. Its role is to mediate between two sometimes-conflicting players: it resolves the differences between them, preserves (or adds to) the value of the data from the source extractor, normalizes it for general usage and delivers it to the target loader.

LOAD is platform-specific. Its role is to optimally organize and store the data that has been pulled from the Application platform and mediated by the Transformer. Like the Extractor, the Loader must be able to leverage the platform-specific features and avoid the platform-specific limitations of its host system.
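The three roles can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Record` exchange format, the class names `CsvLinesExtractor` and `InMemoryLoader`, and the sample fields `name` and `amount` are all invented for the example and are not part of the model itself.

```python
from typing import Iterable, Iterator, List, Protocol

Record = dict  # assumed platform-neutral exchange format


class Extractor(Protocol):
    def extract(self) -> Iterator[Record]: ...


class Loader(Protocol):
    def load(self, records: Iterable[Record]) -> int: ...


class CsvLinesExtractor:
    """Platform-specific: knows how the (hypothetical) source stores data."""

    def __init__(self, lines: List[str]) -> None:
        self._lines = lines

    def extract(self) -> Iterator[Record]:
        header = self._lines[0].split(",")
        for line in self._lines[1:]:
            yield dict(zip(header, line.split(",")))


def transform(records: Iterable[Record]) -> Iterator[Record]:
    """Platform-agnostic: sees only records, never the source or target."""
    for rec in records:
        yield {
            "customer": rec["name"].strip().title(),
            "amount_cents": round(float(rec["amount"]) * 100),
        }


class InMemoryLoader:
    """Platform-specific: knows how the (hypothetical) target stores data."""

    def __init__(self) -> None:
        self.rows: List[Record] = []

    def load(self, records: Iterable[Record]) -> int:
        before = len(self.rows)
        self.rows.extend(records)
        return len(self.rows) - before


def run(extractor: Extractor, loader: Loader) -> int:
    # Three circles, two arrows: extract -> transform -> load.
    return loader.load(transform(extractor.extract()))
```

Calling `run(CsvLinesExtractor(lines), InMemoryLoader())` wires the pieces together. Swapping in a different extractor or loader does not touch `transform`.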


Essential Characteristics

The essential characteristics of the system depicted in the ETL model are:

Encapsulation – Each component of the ETL processing is functionally encapsulated. This provides the degree of isolation that is required to allow each component to incorporate whatever optimizations are most appropriate to achieve its objective.

This encapsulation limits the component’s scope of awareness – it “knows” only about its own environment and (literally) knows nothing about its partners. For example, Load knows the details of its physical database design and implementation, but knows nothing about the source system, the transform engine or the business rules that moved the data from one form to another. Likewise, Extract knows the details of the source data and perhaps knows the details of the application that produced the data, but knows nothing about the target system (or systems).

Because its scope of awareness is constrained, each component is also unaffected by change to either of the other components. This constraint offers more opportunities for adaptation and flexibility as new sources, new targets and new business rules emerge.

Loose Coupling, Standard Interfaces – The components are loosely coupled – that is, they communicate only through standard, open interfaces. The rules of encapsulation require that each component knows its partner only through the coupling interface. Using loose coupling respects that encapsulation.

Loose coupling promotes extensibility. A standard interface permits the insertion of additional components which can add functionality to the standard model. For example, a “fan-out” requirement – in which a single transform feeds multiple load targets simultaneously – can be implemented by inserting a “one-to-many” distribution component between the transform and the loads. Each load remains encapsulated, unaware of its sibling loads. The transform remains encapsulated, unaware of the “fan-out”.
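A rough Python sketch of the “fan-out” component follows. The `Loader` interface, `ListLoader` and `FanOut` are names assumed for this example. Note one simplification: this version delivers to each target in turn rather than truly simultaneously.

```python
from typing import Iterable, List, Protocol

Record = dict  # assumed platform-neutral exchange format


class Loader(Protocol):
    def load(self, records: Iterable[Record]) -> int: ...


class ListLoader:
    """Stand-in load target that keeps rows in a list (hypothetical)."""

    def __init__(self) -> None:
        self.rows: List[Record] = []

    def load(self, records: Iterable[Record]) -> int:
        rows = list(records)
        self.rows.extend(rows)
        return len(rows)


class FanOut:
    """One-to-many distributor that looks like a single Loader.

    The transform hands it records exactly as it would hand them to
    one load target, so the transform never learns about the fan-out.
    """

    def __init__(self, *targets: Loader) -> None:
        self._targets = targets

    def load(self, records: Iterable[Record]) -> int:
        batch = list(records)      # materialize once so every target sees all rows
        for target in self._targets:
            target.load(iter(batch))
        return len(batch)
```

Because `FanOut` exposes the same `load` interface as any other target, it can be inserted or removed without changing the transform or any individual load.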

Likewise, using a standard interface promotes autonomy. Since the interface is non-proprietary, the IT organization can add functionality without waiting for the product vendor to incorporate that functionality into the product. This capability is essential in allowing IT to respond rapidly at its own pace to changing business demands.

Platform Awareness – The ETL model allows for platform-awareness. Because the platform-related components – Extract and Load – are encapsulated, they can freely take advantage of those features specific to their respective platforms. This allows special utilities, known only to the platform, to be used to their best advantage, for performance or other purposes.

The Transform component is not platform-related – its domain is the data itself. The rules of encapsulation and loose coupling dictate that the Transform component is unaware of the specific nature of the (physical) source or target. In this respect, the Transform is platform-agnostic. It must work with the data, regardless of its physical origin, applying the transformation rules imposed by the business requirements. The Transform component is therefore free to receive from any source and deliver to any target, trusting that its Extract and Load partners know what is to be done with the data.

It is important to encapsulate platform-specific characteristics, capabilities and requirements within the platform-specific processes. This allows those processes to flex or expand as they must to leverage the platform.
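As one possible illustration, here is a load component that keeps its platform knowledge to itself. SQLite is chosen only because it ships with Python. The `SqliteLoader` class, the `sales` table and its columns are assumptions made for the example.

```python
import sqlite3
from typing import Iterable


class SqliteLoader:
    """Load component that hides SQLite-specific details from its partners."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self._conn = conn
        # Physical design lives here and nowhere else.
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS sales "
            "(customer TEXT, amount_cents INTEGER)"
        )

    def load(self, records: Iterable[dict]) -> int:
        rows = [(r["customer"], r["amount_cents"]) for r in records]
        # Platform-specific choice: one transaction and a batched insert.
        with self._conn:
            self._conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
        return len(rows)
```

The transform that feeds this loader sees only a `load` call. If the target later moves to another database with its own bulk-load utility, only this class changes.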

 

 

ETL Model

 


 Figure 1. ETL Model

This simple model of an ETL application – "3 circles, 2 arrows" – is all that is needed to highlight the key architectural principle of Flexibility. Each component is separated from the other processing components by a "wall". They communicate with one another through a loose coupling based on an exchange of messages. This is the long-understood requester-server model, which is too rarely adhered to.
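A small Python sketch of the message exchange, with two queues standing in for the two arrows. The `DONE` sentinel, the message shape and the function names are conventions assumed for this example, not part of the model.

```python
import queue
import threading

DONE = object()  # assumed end-of-stream marker


def extract(outbox: queue.Queue) -> None:
    for raw in ["3", "4", "5"]:          # stand-in for a platform-specific read
        outbox.put({"value": raw})
    outbox.put(DONE)


def transform(inbox: queue.Queue, outbox: queue.Queue) -> None:
    # Knows only the message format, not who sent it or who receives it.
    while (msg := inbox.get()) is not DONE:
        outbox.put({"value": int(msg["value"]) * 10})
    outbox.put(DONE)


def load(inbox: queue.Queue, store: list) -> None:
    while (msg := inbox.get()) is not DONE:
        store.append(msg["value"])       # stand-in for a platform-specific write


def run_pipeline() -> list:
    first_arrow, second_arrow = queue.Queue(), queue.Queue()
    store: list = []
    workers = [
        threading.Thread(target=extract, args=(first_arrow,)),
        threading.Thread(target=transform, args=(first_arrow, second_arrow)),
        threading.Thread(target=load, args=(second_arrow, store)),
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return store
```

Each function touches only its own queues. The queues are the "wall": replace any one component and the other two never notice.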


Architecture’s Business Stakeholder

Whenever an IT architecture discussion begins, all the participants line up on different sides.

There are, of course, those who just look around puzzled, wondering "What architecture?" — or, more frequently, "Why architecture?"

But those who call themselves "architects" quickly take a side:

  • Architecture is how all the machines are deployed, and what kind of network links the machines together
  • Architecture is separating machines by function: database server, web server, security server, application server, middle tier distributor
  • Architecture is how the vendor’s product is distributed, scaled, and sped up

I go, first, to the more fundamental question — the "technology-free" question — of "What are we trying to accomplish, and why?"

It’s the "Why?" that creates the greatest silence in a roomful of techies.

  • Why do we have a web site?
  • Why are we building a data warehouse?
  • Why are we creating a real-time EAI system?
  • Why is the data secured? Why isn’t it?

Unless the answer to "Why?" is "because we’re a technology research firm and this is what we study", then I would expect the people who write the check — the Business people — to offer up the only relevant answers. And they are rarely technical.

What would the business stakeholder say about the choice between EAI and real-time data warehouse?

How does the business stakeholder feel about Open Source?

As it turns out, even these geek debates have a business implication. It’s a business problem specifically because, at some point, one solution will cost more than the other. That’s dollars. And dollars are a business problem.

  • What is the cost of Open Source? How does that compare against a proprietary software suite, like Oracle, Microsoft or IBM?
  • What is the cost of a vendor-specific software solution? What is the "lost opportunity" cost of being unable to change from one vendor to another without a multi-million dollar rewrite of a working system?

There is — almost — always a business interest. All technical questions reduce, ultimately, to a business question, because they always come down to costs — and costs are measured in business dollars.