December 23, 2003

XML and Relational Thinking

In this space, I've emphasized the advantages of XML as a human-readable data-oriented representation. XML is not the first technology to try to fill this role, and there is a lot to be learned from the classical relational database work started in the 1970s.

One of the dilemmas of XML that was tackled very thoroughly by the relational camp is "how do you normalize your data?" There is a tradeoff between fully normalized, application-neutral data and a human-readable but application-specific hierarchy. In my opinion, this remains an unsolved problem, and an interesting one.

In an ongoing email thread, my colleague Daniel Weinreb (of Symbolics and Object Design fame) has done an excellent job of distilling the problem. That thread is republished here. If you skim the messages, be sure to pause and think about the three "equivalent" XML examples in his second message.

From: Daniel Weinreb
Date: Thursday, December 18 2003
Subject: Representation vs. Encapsulation


Hi, this is Dan Weinreb, at home late at night reading your blog. I think we met briefly at eWorld.

It looks like I'm too late for the blog conversation, so I thought I'd just send you email directly.

I basically agree with what you're saying here. However, I'd like to point out that it seems to be a tacit assumption that "representation" means XML. As much as it pains me to say it, the theorists behind the "relation" concept and the theoretical concept of a "relational database" and the concept of "normal form" actually do have some validity. (The reason for the pain is that I was an object-oriented database system guy for so long and had to fight against wrongheaded criticisms from relational bigots!) The real motivation behind the whole "relational" concept was to try to represent data with as little as possible built-in about how it will be used. (I don't want to go on too much about this since perhaps you know as much or more than I do about it!)

Now, the original 1973 relational model falls short in many ways; E. F. Codd did a later paper in 1980 about a data model called "RM/T" with several good improvements, but that never got any traction in the real world. And SQL is a sort of weird hybrid of relational algebra and relational calculus. And then there are the serious problems: SQL is not nearly standard enough, and the products being sold under the name "relational database system" are very far away from the mathematical concept that inspired them; I could go on and on.

But XML "as she is spoke" (as most people use it) can have the kind of problem that relational theory was meant to head off. Highly simplified example:

<config>
  <cluster name="clust1">
    <server name="serv1" fast="true"/>
    <server name="serv2"/>
  </cluster>
  <cluster name="clust2">
    <server name="serv3"/>
    <server name="serv4" fast="true"/>
  </cluster>
  <server name="serv5" fast="true"/>
  <server name="serv6"/>
</config>

Problem: find all the fast servers. It's not as easy to express as it might be, because you have to know that servers can be under "config", or two layers down. (Yes, I do know about "//" but I think it's a kludge.)
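
To make the problem concrete, here is a minimal sketch of the query in Python, using the standard-library xml.etree.ElementTree module and its limited XPath support. The function names are made up for illustration. The first version has to spell out both places a server can live; the second leans on the "//" descendant shortcut.

```python
import xml.etree.ElementTree as ET

CONFIG = """
<config>
  <cluster name="clust1">
    <server name="serv1" fast="true"/>
    <server name="serv2"/>
  </cluster>
  <cluster name="clust2">
    <server name="serv3"/>
    <server name="serv4" fast="true"/>
  </cluster>
  <server name="serv5" fast="true"/>
  <server name="serv6"/>
</config>
"""


def fast_servers_explicit(root):
    """Spell out both places a server may live: under config, or under a cluster."""
    paths = ["server[@fast='true']", "cluster/server[@fast='true']"]
    names = [s.get("name") for path in paths for s in root.findall(path)]
    return sorted(names)


def fast_servers_descendant(root):
    """Use the descendant shortcut ('//') so the query ignores nesting depth."""
    return sorted(s.get("name") for s in root.findall(".//server[@fast='true']"))


root = ET.fromstring(CONFIG)
print(fast_servers_explicit(root))
print(fast_servers_descendant(root))
```

Both functions return the same names for this document, but the explicit version silently misses servers if a new level of nesting is ever introduced, which is the kind of dependence on document shape that normalization is meant to avoid.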

I guess this is not the world's most compelling example; better ones can be found in books on relational databases. It's hard to provide a small example; the problems emerge more seriously in what I would call relatively large data structures with interesting patterns of sharing. I saw this in some of the big ebXML schemas (whose names I can't remember any more). You can improve the sharing situation a lot with "id" stuff and possibly people need to learn more about where that's appropriate.

Of course, XML is a lot more than a data model for databases, and therein lies much of its value and strength. I'm certainly, by history and temperament, an XML fan and not a relational database fan! But intellectual honesty takes me by the throat and forces me to concede that the relational theorists are not entirely without merit when it comes to the question of storing your data in an application-neutral form.

My old company had a product that was explicitly designed on the theory that corporate data should be in relational databases but that applications should see it as object-oriented data, cached persistently on the middle tier. I wasn't too involved with the product so I don't know the details, but it ran inside J2EE app servers (I'm not sure which ones but I'm pretty sure WLS was one). Unfortunately, the direction of the EJB spec is oriented strongly towards relational data in a way that made our product not fit in well. (It was called "Javlin" (I think that's the spelling), and the company was "eXcelon, Corp", formerly "Object Design, Inc" back when we started it.) So in a way it was very much in line with what you're saying, only not oriented so much around XML. (A separate part of the company was making an XML database system; that product lives on, evidently, as Sonic XML Server from Sonic Software, a division/subsidiary/it's-hard-to-tell of Progress Software, which bought eXcelon.)

Wow, am I rambling or what? Anyway, I enjoyed your article.

- Dan

From: David Bau
Date: Friday, December 19 2003
Subject: Representation vs. Encapsulation


You write:

But intellectual honesty takes me by the throat and forces me to concede that the relational theorists are not entirely without merit when it comes to the question of storing your data in an application-neutral form.

That's an excellent comparison, and I wonder if the relational theorists would have solved the world's problems if in addition to providing a logical model they had emphasized a need for a standardized serialization of their application-neutral data.

I'm curious - I've been emphasizing the benefits of XML since it is a representation, but you're right that relational databases in many ways are a better and more neutral representation.

You mentioned that you're more of a fan of XML than relational databases. I'm curious what the reasons for that are, and if you think there is a way to get both the benefits of the fully-normalized neutrality of relational data and the benefits of XML?

Thanks for the interesting note,


From: Daniel Weinreb
Date: Saturday, December 20 2003
Subject: Representation vs. Encapsulation

Having had some more time to think about what you asked, consider these three ways to represent the same thing. Example 1 makes it sort of hard to ask questions about "all servers". Example 2 is intended to fix that; and example 3 has the advantage that if we ever wanted to allow one server to be in many clusters, we would not need to change the hierarchy.

Example 1:

<config>
  <cluster name="website">
    <server name="dumbo" fast="true"/>
    <server name="rudolph"/>
  </cluster>
  <cluster name="billing">
    <server name="prancer"/>
    <server name="dancer" fast="true"/>
  </cluster>
  <server name="donner" fast="true"/>
  <server name="blitzen"/>
</config>

Example 2:

<config>
  <servers>
    <server id="s1" name="dumbo" fast="true"/>
    <server id="s2" name="rudolph"/>
    <server id="s3" name="prancer"/>
    <server id="s4" name="dancer" fast="true"/>
    <server id="s5" name="donner" fast="true"/>
    <server id="s6" name="blitzen"/>
  </servers>
  <clusters>
    <cluster id="c1" name="website">
      <member ref="s1"/>
      <member ref="s2"/>
    </cluster>
    <cluster id="c2" name="billing">
      <member ref="s3"/>
      <member ref="s4"/>
    </cluster>
  </clusters>
</config>

Example 3:

<config>
  <servers>
    <server id="s1" name="dumbo" fast="true"/>
    <server id="s2" name="rudolph"/>
    <server id="s3" name="prancer"/>
    <server id="s4" name="dancer" fast="true"/>
    <server id="s5" name="donner" fast="true"/>
    <server id="s6" name="blitzen"/>
  </servers>
  <clusters>
    <cluster id="c1" name="website"/>
    <cluster id="c2" name="billing"/>
  </clusters>
  <cluster-membership>
    <member clus="c1" serv="s1"/>
    <member clus="c1" serv="s2"/>
    <member clus="c2" serv="s3"/>
    <member clus="c2" serv="s4"/>
  </cluster-membership>
</config>

However, the successive examples start to look less and less like what most of us are accustomed to when we think of (data-oriented) XML. And a serious problem with relational databases is that the whole concept of "flattened-out" normalized representations is hard for ordinary people to get a grasp on -- at least that has been my experience. Example 3 somehow, subjectively speaking, seems to be throwing the baby out with the bathwater.
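
As a rough illustration of what the normalized form costs and buys, here is a small Python sketch (standard-library xml.etree.ElementTree; the function names are invented for this example) that reads a well-formed rendering of Example 3. Recovering "which servers are in which cluster" now takes an explicit join over the id references, while "all fast servers" becomes a single flat query.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

EXAMPLE_3 = """
<config>
  <servers>
    <server id="s1" name="dumbo" fast="true"/>
    <server id="s2" name="rudolph"/>
    <server id="s3" name="prancer"/>
    <server id="s4" name="dancer" fast="true"/>
    <server id="s5" name="donner" fast="true"/>
    <server id="s6" name="blitzen"/>
  </servers>
  <clusters>
    <cluster id="c1" name="website"/>
    <cluster id="c2" name="billing"/>
  </clusters>
  <cluster-membership>
    <member clus="c1" serv="s1"/>
    <member clus="c1" serv="s2"/>
    <member clus="c2" serv="s3"/>
    <member clus="c2" serv="s4"/>
  </cluster-membership>
</config>
"""


def servers_by_cluster(root):
    """Join cluster-membership against servers and clusters, keyed by cluster name."""
    server_names = {s.get("id"): s.get("name") for s in root.findall("servers/server")}
    cluster_names = {c.get("id"): c.get("name") for c in root.findall("clusters/cluster")}
    grouped = defaultdict(list)
    for member in root.findall("cluster-membership/member"):
        cluster = cluster_names[member.get("clus")]
        grouped[cluster].append(server_names[member.get("serv")])
    return dict(grouped)


def fast_servers(root):
    """With every server in one place, 'all fast servers' is one flat query."""
    return [s.get("name") for s in root.findall("servers/server[@fast='true']")]


root = ET.fromstring(EXAMPLE_3)
print(servers_by_cluster(root))
print(fast_servers(root))
```

Putting a server into a second cluster would only mean adding one more member element; neither function nor the hierarchy would need to change. The price is that the nesting a person reads at a glance in Example 1 has to be reconstructed by code.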

Indeed, as you say, what if there had been a well-defined textual serialization for relational databases? Adam Bosworth likes to say that XML is fundamentally different because it's "self-describing". Now, if you look at real-world relational DBMS's, you generally find what they call "system tables", which are metadata saying "there are these tables" and "this table has these columns" and "this column is of this type" and so on. Imagine if the names and formats of the system tables had been a formal part of the relational model and had been standardized across vendors. Then would we feel that XML was more "self-describing" than these imaginary augmented relational databases?

Actually "self-describing" is less easy to pin down than all that. The phrase "self-describing" is vague as to whether we mean "a program can understand it" or "a person can understand it". Often, fragments of well-formed XML can be clear in meaning to a person, mainly because the element and attribute names are in a natural language that the person understands. It's certainly nice that if you are debugging, or peeking in at conversations between programs, you have a far greater chance of figuring out what's going on with XML data than with ASN.1 data. But as far as two communicating programs are concerned, it's not clear how important this is. "Self-describing" means somewhat more when a real XML schema is present, and we know whether a "123" is a number or a text string and so on. But would this be so different from a relational database with a standard set of system tables and well-chosen table names and column names?

Summarizing, we could say that some key differences between relational and XML representation are:

  1. XML has a standard serialization, whereas relational could have but doesn't.
  2. XML has a standard schema representation, whereas relational could have but doesn't.
  3. XML "encourages" the use of more nesting, although you can eliminate nesting and be more "normal form" if you want to, a la Example 3, whereas relational databases "encourage" the use of more "normal form" representations, and give you less ability to do "nesting", which has good and bad points.

I'm not sure how much patience Adam would have for my attempts to distinguish between "could have, but doesn't" versus "could not have" - in real life, they both mean "doesn't have". And I don't think we have to worry about the relational vendors trying to "compete" with XML by remedying these problems to create some kind of competing standard. But to me it seems important, just from the point of view of understanding, to tease out which differences are very heavily implicit or constrained by the underlying fundamental concepts, as opposed to differences that are just contingent and could have been different without upsetting the underlying concepts.

By the way, the best book I've ever read on the deep issues of how to represent the real world as "data" is "Data and Reality" by William Kent. I was just about to say "unfortunately it's been out of print for a long time", but I just checked and evidently it has been reissued.

- Dan

After receiving this last email, I asked Dan for permission to republish his email thread here. He then was careful to point out that his observation on the lack of standardization of relational metadata did not give credit to one effort to do so in SQL-92 - according to Dan:

SQL-92 actually does define a standard for metadata, called INFORMATION_SCHEMA. But it's not widely-supported; in particular, Oracle doesn't support it. So it's a "de non facto standard" or "de facto non standard" or something.
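
As a small illustration of how vendor-specific "system tables" are in practice, here is a sketch using SQLite through Python's standard sqlite3 module. SQLite is just a convenient example, not something Dan mentioned; it exposes its catalog through a sqlite_master table and PRAGMA statements rather than INFORMATION_SCHEMA. The three tables are a hypothetical relational rendering of Dan's Example 3, and the helper names are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema mirroring Example 3 (not from the original thread).
conn.executescript("""
    CREATE TABLE server (id TEXT PRIMARY KEY, name TEXT NOT NULL, fast INTEGER);
    CREATE TABLE cluster (id TEXT PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE cluster_membership (clus TEXT, serv TEXT);
""")


def table_names(conn):
    """List user tables from SQLite's own catalog table."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
    return [name for (name,) in rows]


def column_info(conn, table):
    """Return (column name, declared type) pairs for a trusted table name."""
    # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
    rows = conn.execute(f"PRAGMA table_info({table})")
    return [(row[1], row[2]) for row in rows]


def has_information_schema(conn):
    """Probe for the SQL-92 metadata views; SQLite does not provide them."""
    try:
        conn.execute("SELECT table_name FROM information_schema.tables")
        return True
    except sqlite3.OperationalError:
        return False


print(table_names(conn))
print(column_info(conn, "server"))
print(has_information_schema(conn))
```

The metadata is all there, so in that sense the database describes itself; the catch is that a program written against this catalog would have to be rewritten for another vendor's catalog.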

Dan also followed up with the following thoughts in another mail:

From: Daniel Weinreb
Date: Saturday, December 23 2003
Subject: Representation vs. Encapsulation

There was a discussion about a year ago on the mailing list of Symbolics alumni, in which someone asked something along the lines of: what's so great about XML? Isn't it just like what Lisp expressions were?

(There is a certain tendency among adherents of a beloved and lost technology to claim that their technology "already did that". It's sort of like the running joke in the original Star Trek where Chekov would claim that everything was a Russian invention. In most of these claims, the older technology "already did" the same core concept, but often the modern technology has many improvements that, while they may not change the core concept, are nevertheless significant. To look at a modern-day spreadsheet and say "VisiCalc already did that" is not realistic.)

Several people on the list replied that the main difference was that XML is ubiquitous, but Bob Kerns sent a long piece of mail explaining that it's much more than that. I'll paraphrase his comments and bring them up to date.

The main difference is the XML Schema, which not only provides for better error-checking but also can drive all kinds of useful tools, such as helpful XML editors, and helps a person induce the semantics better than he or she could by simply looking at the raw XML.

His second point I'll quote verbatim:

So it's possible to take an existing schema, add some attributes and an additional external schema and XML file, and annotate it with additional information not anticipated by the originators of the original XML. The original may be unchanged, or minimally changed with the addition of ID's, and still validate and operate against all uses. This allows clean and robust extension.

I think I agree with this but it must be qualified: just because you can add a new attribute, and programs written to the old schema will still work without getting any low-level parsing errors, does not necessarily mean that the old programs will do what they are intended to do! (I'm going to read your writings about compatibility as soon as I get a chance.)

His third point was that XML Namespaces let you join together pieces that come from different conceptual domains into one document; there isn't anything like this in plain old Lisp expressions.
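
A minimal sketch of that third point, in Python with the standard-library xml.etree.ElementTree module. The namespace URIs, element names, and attribute values are all invented for illustration. Two vocabularies that both define an "item" element sit side by side in one document, and a program can pick out just the pieces it understands.

```python
import xml.etree.ElementTree as ET

# Hypothetical document mixing two made-up vocabularies.
ORDER = """
<order xmlns:inv="http://example.com/inventory"
       xmlns:tax="http://example.com/sales-tax">
  <inv:item sku="A-100" name="widget"/>
  <tax:item zip="00000" rate="0.05"/>
</order>
"""

# Prefix-to-URI map used for lookups; the prefixes here are local to this code.
NS = {
    "inv": "http://example.com/inventory",
    "tax": "http://example.com/sales-tax",
}

root = ET.fromstring(ORDER)
inventory_items = root.findall("inv:item", NS)
tax_items = root.findall("tax:item", NS)

print([item.get("sku") for item in inventory_items])
print([item.get("zip") for item in tax_items])
print(inventory_items[0].tag)
```

The two "item" elements never collide because each is identified by its namespace URI plus its local name, which is what the last print statement shows.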

You put all this (and much more) together, and you have something quite different from a simple recursive-descent syntax for denotation of tree structure. What you have is a platform for integration - a common ground where diverse players can put together things that would be much harder without all this extra stuff. I can take some information from a web store, a credit card processing company, my inventory department, a KB (from a KB editor), a rule engine, the latest sales tax/zip-code correspondences, the USPS matching database, and make them all play together.

Also, high-quality software for parsing, manipulating, etc. XML is available, much of it free. XML has gotten into the positive-feedback loop of being a widely-used standard: since it's widely-used, people create these tools, and the existence of the tools causes XML to be used more widely, and so on.

So what does this have to do with the point about relational data representation?

Suppose someone comes to you and says (as I sort of did), well, when you talk about "representation", wouldn't it be better to use relational databases as the representation, because of all the benefits of normalization and so on? The answer that I think I like best is that XML has a lot of important advantages: a completely standard schema language, the serialized representation, the namespaces, and so on. It is possible to represent data in normal form and still have it be in XML, a la the "example 3", so if you really believe in relational data modelling, that's how you can have what is, to you, the best of both worlds.

-- Dan

Posted by David at December 23, 2003 03:52 AM