First, Philosophy, then Data Science

One of the most telling articles on “Data Science” appeared in the NYTimes in April[1]. We are facing a massive shortage of data scientists, it read. “There will be almost half a million jobs in five years, and a shortage of up to 190,000 qualified data scientists.”
Trouble is, the same article says, “Because data science is so new, universities are scrambling to define it and develop curriculums.”
So — we don’t know what they are, but we need 500,000 of them.

Immovable Objects

There are many things that we treat every day as immovable objects. Immovable? More correctly, they are things that we believe to exist in a particular context, and to always exist in that context. We see Mount Rushmore and quickly, unconsciously, put it in the context of South Dakota, the United States, or perhaps the movie ‘North By Northwest’. We see hula dancers, and unconsciously think of Hawaii. We see steamboats going along a river, and unconsciously think of the Mississippi, the Missouri or perhaps the Ohio. Maybe we think of Mark Twain or ‘Showboat’.
We make these cognitive leaps because we have become accustomed to certain signals in the data — in this case, visual data — that trigger assumptions about other related things we know.
When we are dealing with data, software and even business processes, we often make these same unconscious leaps. They are unconscious because we don’t recognize that we are making the connection. And because we don’t recognize it, we don’t question it, challenge it or demand proof of its validity. We make the leap and move on.

The Key is “Just Be Natural”

A questioner asked how to select a natural key in a business data model. In this case, the business is modeling client organizations across international boundaries. He wrote:

In a business data model I was wondering about something like National Registration Number (varies by country), Country, and National Registration Number Qualifier (if there could be multiple registering organizations – which raises other considerations).

A natural key (as opposed to, say, a surrogate key) has the two-fold goal of being (1) unique, and (2) “naturally” memorable. A memorable attribute of a person or an organization is their name, but — alas — that is almost never guaranteed to be unique. In a business setting, where the business has control of the organization’s name, the name is a much better candidate. However, it’s not clear from the question whether the business has control of the name.

It is important to know that the “National Registration Number” is a surrogate, and not a natural key. At least this is so in the United States, where this is the “Federal Employer Identification Number” (EIN) or the “Taxpayer Identification Number” (TIN), a number assigned by the Federal government but not indicative of the business itself. Like any surrogate key, the number is arbitrarily assigned and usually forgotten by the business itself. (Technically, the TIN is not guaranteed to be unique either, though in practice it is.)

My point, I guess, is that the purpose of having a natural key is to simplify look-up — and typically you are simplifying the look-up of a surrogate key. Within the data model, all relationships are by the surrogate key, a system-defined identifier that is guaranteed to be unique (though not memorable, meaningful or derivable).

So a suitable natural key could be the name of the organization or an organizational identifier, coupled with some geographic qualifier (in the US, the country is not specific enough; many business names are only unique within a state).

But it’s important, I think, to keep in mind the reason why a natural key is desired: it is memorable (naturally) and it is unique. That only matters where the user enters into the data model — it doesn’t matter, and shouldn’t be used, to navigate relationships within the model.
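To make the distinction concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names (organization, contract, org_id, region) and the sample data are hypothetical, chosen only for illustration; they are not from the questioner's model. The surrogate key carries every relationship, while the natural key (name plus a geographic qualifier) is used only at the point where a user enters the model.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE organization (
    org_id  INTEGER PRIMARY KEY,      -- surrogate key: system-assigned, not memorable
    name    TEXT NOT NULL,
    country TEXT NOT NULL,
    region  TEXT NOT NULL,            -- geographic qualifier, e.g. a US state
    UNIQUE (name, country, region)    -- natural key: memorable and unique
);
CREATE TABLE contract (
    contract_id INTEGER PRIMARY KEY,
    org_id      INTEGER NOT NULL REFERENCES organization(org_id),  -- relationship by surrogate
    description TEXT NOT NULL
);
""")

def find_org_id(name, country, region):
    """Look up the surrogate key from the natural key (the user's entry point)."""
    row = conn.execute(
        "SELECT org_id FROM organization "
        "WHERE name = ? AND country = ? AND region = ?",
        (name, country, region),
    ).fetchone()
    return row[0] if row else None

# The same business name can exist in two states; the qualifier keeps it unique.
insert_org = "INSERT INTO organization (name, country, region) VALUES (?, ?, ?)"
conn.execute(insert_org, ("Acme Supply", "US", "OH"))
conn.execute(insert_org, ("Acme Supply", "US", "TX"))

# Enter the model by natural key, then navigate by surrogate key only.
org_id = find_org_id("Acme Supply", "US", "TX")
conn.execute(
    "INSERT INTO contract (org_id, description) VALUES (?, ?)",
    (org_id, "Annual service agreement"),
)
```

Note that the natural key never appears in the contract table. If the organization is renamed, only one row changes and every relationship stays intact.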

Big Data in the age of punched cards

In 1973, when I entered the computer field, the standard format for data was an 80-column punched card and a 132-character line printer. Data was forced to fit into those boundaries, and many creative solutions were crafted to make that happen.
If we needed to store the data, we would put it on paper tape or magnetic tape — 800 bpi, then high density (1600 bpi, 6250 bpi) tape. One day, we installed a disk drive — 5 MB — and could update single records without copying whole files.
Today, we talk about “Big Data” and propose massive reconstruction of our technical mindset to deal with this novel problem. “Big Data” — in quotes and capitalized — is the raging problem of the day.
Does this mean that “Big Data” is more a statement of what limits our technology (and our skills) have, than a statement of what is actually going on in the real world?

“In today’s information-driven economy”

My e-mail graced me with an invitation — well, yes, a sales-pitch — to a webinar entitled:

I read on, despite the obvious sponsorship by CA Technologies advertising their ERwin product. And the opening sentence was:

In today’s information-driven economy, data drives your business.

Now, I have held the belief for the last 30 years that “data drives your business”. That, after all, is why I have been engaged in IT and focused on databases and data architecture, rather than on networking, server management, user interfaces, programming, security or any of the dozens of other disciplines within IT. At one time, before an audience in a small auditorium, I explained my interest in databases by saying simply “everyone has data”.

So, what is it about “today’s information-driven economy” that makes it different from the 1970s, the 1980s or any previous decade?

Identity Exposure is an Architecture Failure

Today’s software story is on the front page of the day’s news: 

Monday, May 22, 2006; Posted: 5:46 p.m. EDT (21:46 GMT)

WASHINGTON (CNN) — Personal information on 26.5 million veterans was stolen from the home of a data analyst in what appears to have been a random burglary, Veterans Affairs Secretary Jim Nicholson said Monday.

The computer records include names, Social Security numbers and dates of birth, Nicholson said. The Department of Veterans Affairs disclosed the theft Monday and said it has seen no indication that the information has been misused.

The analyst took the data home without authorization, Nicholson said. Department spokesman Matt Burns said the employee has been put on administrative leave while the investigation is conducted.

What makes this a story about software? Exactly this: Why did the software architecture permit this personal data to be available to anyone in the VA?
