RSS
热门关键字:  数据挖掘  人工智能  数据仓库  搜索引擎  数据挖掘导论

Modeling for Data Quality: Don't Try This at Home

来源: 作者:互联网作品 时间:2007-05-05 点击:

Do you work on your own car? I used to, a long time ago. One of the tasks is changing spark plugs. It's like changing lightbulbs: anybody can do it. 数据挖掘研究院

At some point, I was watching a real mechanic do it. He pulled off the first plug wire, then (strangely) brought over an air hose. "What's that for?" I asked. "To blast out the sand that accumulates around the base of the plug," he replied. "Don't want that falling into the engine and scratching up the cylinders."

数据挖掘研究院

Oh. I didn't know that. I actually had noticed that my car was slowly losing a little bit of its old power, and I actually did learn (later) that it was because it was losing compression. Could I have gotten sand in the cylinders? The evidence was sure there... 数据挖掘实验室

The moral is, just because something is easy doesn't mean I'm qualified to do it. I thought I was saving myself money, when in fact I was creating expensive problems. Worse, if I had not seen the mechanic do that, I might never have figured it out. ("Poor compression? Well, changing sparkplugs couldn't have caused that.") 数据挖掘研究院

These days, if it involves something important (like my car), I pay a specialist to do it, even if it is more expensive up front. They might make a mistake, too, but chances are good they've already made most of the mistakes and now know how to avoid them (saving me money in the long run). 数据挖掘研究院

As it turns out, data modeling is easy, too. Furthermore, it involves something even more important than your car - it involves your business. So today, I'm going to be your data mechanic friend, and show you how fun and easy it is to get into expensive trouble!

Got a Match?

数据挖掘研究院

Here's a data model you don't see every day! This is an explosives "tagging" scheme, for keeping track of explosives (after all, nothing good can come from fifty pounds of stolen dynamite). 数据挖掘实验室

One thing that makes tracking explosives difficult is that they're commodities - one lump of plastic explosive looks pretty much like the next one. So if police seize some big cache of stolen explosives, how are you going to prove it's yours? Well, that's where "taggants" come in! 数据挖掘研究院

Taggants are chemicals or tiny engineered particles designed to be mixed into explosives as a sort of "fingerprint." They can be detected and positively identified (even after the explosives are detonated). This lets forensic experts trace them back to their source. 数据挖掘研究院

Of course, for this to work, each taggant must be guaranteed by its manufacturer to be absolutely unique: once assigned by the manufacturer to a given company, the same type of taggant must never again be provided to anyone else - ever. 数据挖掘研究院

In this little example (wildly abstracted from the original problem that inspired this article), our explosives manufacturer has decided to use taggants. They want to be able to claim recovered explosives, as well as to know what plant manufactured them. Here is the plan: 数据挖掘研究院

  • PLANTs will be assigned many TAGGANTs. (Some types of taggants are incompatible with certain explosive types, so each plant needs a variety of different taggant types to fully identify all of the explosives they manufacture.)
  • Each TAGGANT belongs to only one PLANT, forever.
  • PLANTs can manufacture many EXPLOSIVE TYPEs.
  • Each EXPLOSIVE TYPE may be manufactured at multiple PLANTs.
  • Each PLANT identifies each EXPLOSIVE TYPE they make with one of their assigned TAGGANTs (which never changes, for the sake of simplicity).
  • A given TAGGANT may be used by a given PLANT to identify many different EXPLOSIVE TYPEs manufactured at that PLANT.

Interesting problem, no? Check out the model presented above. Looks very reasonable, doesn't it? It certainly seems to capture the rules. It's elegant, too - simple and compact. (And it was so easy!)

Let's check it for normalization, just to be sure: yes, it meets third normal form (3NF). Overall, it looks pretty good. But in truth, it has a very serious problem.

数据挖掘研究院

Model Go "BOOM!"

Do you see it? If not, don't feel bad; after all, neither do most of us mere mortals. But before I reveal the problem, remember that this article is not intended to make you a better mechanic! I'm trying to show that there's a lot more to this modeling business than meets the eye, and that we should call a professional to do it for us. But, I know you're curious, so let's continue (keep your hands away from the engine, please):

The main problem with this model is that we can get two different answers to the question "What plant does taggant "Z357" trace back to?" Considering that the whole point of this system was to be able to always get the one correct answer to that precise question, this flaw would fall into the category data architects call "a whopper."

Let's look at some sample data to illustrate:

数据挖掘研究院


(By the way, please forgive the meaningful keys and other lesser sins; I was trying to make this readable and simple): 数据挖掘研究院

In PLANT EXPLOSIVE TYPE, we see that "Nitro" is being manufactured in plant 2, and embedded with taggant "Z357." But the TAGGANT table says that taggant "Z357" is assigned to plant 1, not plant 2. The original model lets us store conflicting data.

Any time we can get two different answers to the same exact question, we've got a normalization violation - but we already said this model meets 3NF - and it does! (Check it again if you want!) So what's going on here? 数据挖掘实验室

Well, as it turns out, this model violates Boyce-Codd normal form (BCNF), not 3NF. BCNF is one of those higher-order normal forms that most people just wish would go away. But, as we can clearly see, it creates problems that are just as serious as others.

数据挖掘研究院

"So," the do-it-yourselfer asked, "can you fix it?" (Answer: It's going to cost you.) 数据挖掘研究院

The Fix

One approach is to patch up the holes with code. We could use triggers that fire on each change to PLANT EXPLOSIVE TYPE or TAGGANT to make sure the data in each table remains in agreement with the data in the other table. (We'd also have to clean up any contradictory data that was already in there, and probably run periodic audits just to stay sure.)

数据挖掘研究院

That approach works, but we might suffer some locking and performance issues if we're really hammering the tables (especially with certain RDBMS software and parallel updates). 数据挖掘研究院

Another approach is to just fix the model, but that would mean changing a lot of code. Obviously we would have been a lot better off if we'd simply found a better data structure in the first place. 数据挖掘研究院

So how could we find a better structure? The short answer would be we could hire a data architect, even though the job looks really simple! So while I demonstrate a design that defuses this bomb, remember that most of us won't even recognize the next bomb when we see it. 数据挖掘研究院

Okay, here's a model that fixes this particular problem (meaning it's fully normalized, through BCNF and fourth, fifth and sixth - and yes, there's a sixth): 数据挖掘研究院

数据挖掘研究院

In this arrangement, foreign keys (FK), alternate unique keys (AK1) and the joys of the IDEF1X unification principle ("Plant ID" from the two foreign-keys coming together into a single column in PLANT EXPLOSIVE TAGGANT) enforce the business rules correctly. (Try to put in the bad example data if you're not convinced; it won't go in.)

Notice that this corrected model is, however, more complex than the first one. It requires six tables, not four. It's not as elegant. It's even got a couple of "1" cardinalities, which is just weird - and which can only be enforced with extra unique keys. It's not exactly intuitive. In fact, it's really quite strange. Which is, of course, the whole point.

数据挖掘研究院

Have a Seat in the Waiting Room, and Help Yourself to Some Coffee.

Spotting these types of problems and their myriad cousins, and designing things properly, is really quite challenging - and critical. The data model is the hardest thing to change once we start coding, and the typical span between the start of modeling and finding out that developers are coding is typically about zero. It's a tricky job done in a limited timeframe with everyone waiting on you, where every mistake causes problems forever. 数据挖掘研究院

Making things even harder, the exact same pattern of tables used in this article might be absolutely fine or have a different problem entirely. (I came across examples of both while fixing this one; email me if you're curious.) So we can't even learn to spot this pattern and know whether we have problems, or what kind. It just doesn't work that way. 数据挖掘研究院

In light of the short working timeframe, the high cost of mistakes, and the deceptively difficult nature of the work, we really do ourselves a favor when we hire this work out to an experienced data architect - even for a few "simple" tables like these.

(Special thanks to Mark Horan and Lou Lendi, both of Experian, for reviewing and challenging the ideas in this article and bettering them!) 数据挖掘研究院

...............................................................................
最新评论共有 0 位网友发表了评论
发表评论
评论内容:不能超过250字,需审核,请自觉遵守互联网相关政策法规。
匿名?