davidbau.com Wildcards and XML Versioning

In a comment on this weblog, George Datuashvili asked about usage of wildcards in Schema to allow for versioning, observing correctly that it is not easy to do right. My own recommendation? Don't do it. It doesn't work.

Perhaps my assessment is too pessimistic. My friend David Orchard has written a great article on the details of how to use wildcards to allow for certain kinds of schema evolution. I highly recommend reading it. But even though he presents what I believe to be the best possible approach for using Schema wildcards for extensibility, there remain significant problems with the approach when applied to versioning.

The "projection" model discussed in my Theory of Compatibility articles provides a viable alternative approach that can work well with W3C XML Schema. But wildcards are proposed for versioning often enough that it is worth discussing here why they don't work very well for that purpose.

What is a Wildcard?

"Wildcard" is the XML-Schema insider's term for an "xs:any" or an "xs:anyAttribute" declaration inside a schema. The idea is, wherever a wildcard appears in an XML Schema content model, "any" element or "any" attribute can be allowed to appear in an actual XML instance.

Here is an example:

    <xs:complexType name="extensible">
      <xs:sequence>
        <xs:element name="first" type="xs:string"/>
        <xs:element name="last" type="xs:string"/>
        <xs:any namespace="##targetNamespace" minOccurs="0" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
    <xs:element name="ex" type="extensible"/>

Valid instances include:

    <ex>
      <first>Edgar</first>
      <last>Codd</last>
    </ex>

    <ex>
      <first>Edgar</first>
      <last>Codd</last>
      <title>Dr.</title>
    </ex>

Notice that the "any" declaration permits zero or more elements with "any" name inside the "targetNamespace" to appear at the end of the content model. Although the <title> element was never explicitly declared in the schema, it is permitted because it matches the wildcard.
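For readers who want to try this in a validator, the fragment above can be completed into a standalone schema. The sketch below is one way to do so. The namespace URI http://example.com/names, the "tns" prefix, and elementFormDefault="qualified" are assumptions added for illustration and are not part of the original example. The sketch also sets processContents="lax" on the wildcard, because under the default ("strict") an element such as <title> would additionally need a top-level declaration (or an xsi:type) in order to validate.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Assumed wrapper: the namespace URI and the "tns" prefix are hypothetical. -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:tns="http://example.com/names"
           targetNamespace="http://example.com/names"
           elementFormDefault="qualified">

  <xs:complexType name="extensible">
    <xs:sequence>
      <xs:element name="first" type="xs:string"/>
      <xs:element name="last" type="xs:string"/>
      <!-- "lax" lets undeclared elements such as <title> pass;
           the default, "strict", would require a declaration for them. -->
      <xs:any namespace="##targetNamespace" processContents="lax"
              minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
  </xs:complexType>

  <xs:element name="ex" type="tns:extensible"/>
</xs:schema>
```

A matching instance, with every element placed in the assumed namespace:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<ex xmlns="http://example.com/names">
  <first>Edgar</first>
  <last>Codd</last>
  <!-- Not declared anywhere in the schema; accepted only via the wildcard. -->
  <title>Dr.</title>
</ex>
```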

Using Wildcards for Extensibility

The idea behind wildcards in schema is to decouple the development of parallel, related schema definitions. Wildcards do this by providing a way for a schema designer to specify "extensibility points".

For example, suppose we are defining a data management system for a zoo. There may be two different kinds of data experts designing the system:

Logistics experts may be knowledgeable about how to run a zoo facility as a corporation; how to allocate and manage the resources, the staff, the customers, and the animals.
Veterinary experts may be knowledgeable about animal medicine. They understand the data behind feeding, keeping, and caring for the animals.

Each of these two kinds of experts may define their own schema for their own data, and for the most part, they can work in their own worlds. But occasionally, they may need to define schemas for documents that cross between their worlds.

For example, imagine you are in logistics and you are designing the document describing a new animal arrival at the zoo. Although the main information in the document is about the date, place, contact information, and so on, the document might also have to include medical information about the new animal - probably a full veterinary file. How can this be done? Must the logistics data modeller become an expert in the data model for animal medicine? That is obviously a ridiculous idea, and it becomes even more untenable when you realize that the medical data model may also be evolving. In the real world, it may be impossible for logistics engineers to learn about the veterinary model, because the animal medicine people may not understand their own model yet.

And the reality is that the logistics expert does not need to know the medical data to do the job. A logistics expert may think of a new animal record as follows:

    <newAnimalArrival>
      <animalID>tiger34</animalID>
      <dept>Carnivores</dept>
      <date>2004-01-04</date>
      <caretaker>Jennifer Stegler</caretaker>
      <contact>888-555-1234</contact>
      <medical>
        blah blah blah blah blah blah blah
      </medical>
    </newAnimalArrival>

Logistical engineers do not need to know the medical data for most of their work, except that they may need to preserve it to be passed on to the veterinary systems that do need the medical data.

Wildcards at the Zoo

Here is how a wildcard might be used to model the schema type above:

    <xs:element name="newAnimalArrival" type="naa-type"/>

    <xs:complexType name="naa-type">
      <xs:sequence>
        <xs:element name="animalID" type="xs:NCName"/>
        <xs:element name="dept" type="xs:token"/>
        <xs:element name="date" type="xs:date"/>
        <xs:element name="caretaker" type="xs:token"/>
        <xs:element name="contact" type="tns:phone"/>
        <xs:element name="medical" type="openMedicalData-type"/>
      </xs:sequence>
    </xs:complexType>

    <xs:complexType name="openMedicalData-type">
      <xs:sequence>
        <xs:any namespace="http://veterinary.org/schema" minOccurs="0" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>

Notice that in the type for the <medical> element our schema has basically thrown up its hands and said "zero or more elements in the relevant veterinary.org namespace are allowed here". Exactly what kind of elements are defined in that namespace is left to another schema, which may not even be written yet. The contents of the <medical> element are what we would call an extensibility point.
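To make the extensibility point concrete, here is a sketch of an arrival record whose <medical> element carries veterinary data. The vet:weightKg and vet:vaccination elements and the logistics namespace URI are invented for illustration; the real veterinary vocabulary is, by design, unknown to the logistics schema. The sketch also assumes the logistics schema uses elementFormDefault="qualified".

```xml
<?xml version="1.0" encoding="UTF-8"?>
<newAnimalArrival xmlns="http://zoo.example.org/logistics"
                  xmlns:vet="http://veterinary.org/schema">
  <animalID>tiger34</animalID>
  <dept>Carnivores</dept>
  <date>2004-01-04</date>
  <caretaker>Jennifer Stegler</caretaker>
  <contact>888-555-1234</contact>
  <medical>
    <!-- Hypothetical veterinary elements; the wildcard fixes only their namespace. -->
    <vet:weightKg>182</vet:weightKg>
    <vet:vaccination date="2003-11-15">rabies</vet:vaccination>
  </medical>
</newAnimalArrival>
```

Because the wildcard as written uses the default strict processing, a validator would also need declarations for the vet elements (that is, the veterinary schema itself) before accepting this instance. Setting processContents to "lax" or "skip" on the wildcard would relax that requirement.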

Our example has the characteristics that:

We have a good idea where extensibility is needed.
We know that the extension is going to be done by somebody else.

Wildcards fit this case perfectly. They let us delineate and control exactly where somebody else can define data within our schema. The fact that a wildcard is explicit prevents extensibility from getting out of control; yet wildcards also let us provide the other schema designer with complete flexibility at that point. The veterinary data people can define their schema in any way they want; logistics provides no constraints whatsoever on the kind or number or purpose of elements or types they define. And despite all the flexibility, the veterinary data won't pollute the logistics schema except at the very specifically delineated wildcard within the openMedicalData-type.

Contemplating Wildcards for Versioning

Wildcards work well for extensibility, so some advocate using them for compatible versioning as well. Let us return to the first example in this article, the "extensible" type whose trailing wildcard is restricted to "##targetNamespace".

By defining the wildcard in our own "targetNamespace", we are implying (although strictly speaking, we are not requiring) that we ourselves will be the ones to define elements to fill in the wildcard. The extensibility elements will come in our own namespace, and out of politeness, other people should probably not define elements in our namespace - so the extensibility point is for ourselves.

Why would we ever define an extensibility point for ourselves? Why not just use the explicit element that we mean? One reason is that we may be saying "a future element can go here, and we haven't defined that future element yet." In effect, we can think of one use of extensibility as a way of coordinating with a future version of ourselves.

For example, in a future version of our specification, we may decide that a <middle> element should be added to the <first> and <last> elements which were in the original specification. Then we might want to evolve the definition of "extensible" to work as follows (warning: the schema below is invalid per the UPA rule, as we will discuss later):

    <xs:complexType name="extensible">
      <xs:sequence>
        <xs:element name="first" type="xs:string"/>
        <xs:element name="last" type="xs:string"/>
        <xs:element name="middle" type="xs:string" minOccurs="0"/>
        <xs:any namespace="##targetNamespace" minOccurs="0" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>

A new instance message might look like this:

    <ex>
      <first>Edgar</first>
      <last>Codd</last>
      <middle>F.</middle>
    </ex>

The idea is that while the old version of the schema only explicitly allows <first> and <last> elements, it is compatible with a new version of the schema that in addition permits an optional <middle> element because the old version of the schema also permits <middle> elements by virtue of permitting any element at the wildcard.

Unfortunately, this pleasant idea does not work.

The Four Problems

Wildcards provide an excellent extensibility mechanism for cooperating with other data models, and the general idea of the example above suggests that they might be good for cooperation between future and past versions of the same schema. But there are several reasons wildcards are not a good way to achieve compatibility between versions.

1. A targetNamespace wildcard permits too much freedom for current-version message producers to insert garbage that will be incompatible with future-version message consumers.
2. A wildcard in a specific place does not provide enough flexibility for natural data model evolution in the second version of the schema.
3. Technical limitations of wildcards - the element declarations consistent and unique particle attribution rules - prevent their use in "natural" ways and require awkward "wrapping" techniques.
4. And most seriously, wildcards put the onus on the designer of the original version of a schema to anticipate where and how evolution of the schema will occur. Experience shows that few people understand the future well enough to actually anticipate it.

In short, although it is possible to use wildcards to help with versioning, my advice is that it is the wrong approach, and that misuse of wildcards for versioning compatibility can actually be harmful. Wildcards are excellent for extensibility, especially where several schemas are being developed simultaneously and need to be stitched together in a flexible way. But wildcards are not very good for versioning, because compatibility with the future version of your own spec is different from compatibility with a different spec. For versioning, the "compatibility by projection" strategy discussed in my other articles should be used instead.

The remainder of this note critiques the use of wildcards from the point of view of their use for achieving forward version compatibility, and I will expand on each of the points 1-4 in detail. Please do not misinterpret the note as a critique of wildcards for extensibility; they are excellent for that purpose.

Wildcards Provide Too Much Freedom

Consider the "original version" of the example schema again.

While a wildcard may seem like a "good idea" because we may be anticipating that a future version of a schema may permit, say, a <middle> element that we want to allow in this version of the schema, there is actually a very serious problem with putting the wildcard in the original schema like this.

The problem is that it explicitly tells people who are writing messages that they may insert any data they like - so long as it is an element in the targetNamespace - at the wildcard location. For example, our schema defines an "ex" element, so the wildcard, even if it is put in "strict" mode, quite explicitly permits "ex" to appear at the given location:

    <ex>
      <first>Edgar</first>
      <last>Codd</last>
      <ex>
        <first>Charlie</first>
        <last>Kaufman</last>
        <ex>
          <first>John</first>
          <last>Malkovich</last>
          <ex>
            <first>Malkovich</first>
            <last>Malkovich</last>
          </ex>
        </ex>
      </ex>
      <middle>F.</middle>
    </ex>

As you can see, targetNamespace wildcards trivially allow us to achieve a kind of madness of recursive nesting! Wildcards in the targetNamespace open our data model more broadly than we want. They typically make the actual model completely recursive even if it wasn't previously recursive. Innocent-looking wildcards permit fully valid, strictly correct instances that resemble a house of mirrors.

My example may appear harmless. It simply shows extra unneeded degrees of freedom at the point at which the wildcard appears. But then take a second look at the instance above: there is a "middle" element in the instance. Perhaps there is a sliver of hope here, because the "middle" element denotes Edgar Codd's middle initial "F.", something that a future schema may be interested in. But the problem is that there is no hope that we can guarantee that this "middle" element actually conforms to the future version of the schema.

Indeed, in the future "schema" we proposed (I won't write the schema again here because as I mentioned before it wasn't actually a valid schema), the "middle" element, if present, was required to occur immediately after the "last" element. In the instance we have provided above, the "middle" element appears not directly after "last", but after "ex" instead.

The problem is that the old version of the schema permits instance authors to write instances which become invalid in a new version of the schema.

If you can find a message that is valid in an old schema, but not valid in a new schema, then your two schemas are not working together to provide backward compatibility. Your old schema is permitting old message producers to produce messages that new message consumers, under the new schema, are allowed to reject. For this reason, wildcards cannot be specialized in new versions of a schema without breaking backward compatibility.

What is needed, perhaps, is a "special" form of wildcard that says "any element is allowed here if you are consuming messages, but you're not allowed to put anything here if you are producing messages". And indeed, that comes very close to the real recommendation that I would make. However, as I discuss in the remainder of this note, there are several other reasons that the wildcard mechanism isn't the right technique for the effect we are trying to achieve. What is needed is a way to generally distinguish between message producers and consumers, not just for wildcards, but at every point within a content model.

Wildcards Provide Too Little Flexibility

While wildcards provide too much freedom for an old version of a schema to permit producers to insert garbage, wildcards also provide too little flexibility for a new version of a schema to expand the data model in a natural way.

As a trivial example, observe that when we put the wildcard at the end of the "extensible" data type in our example, we constrained the data model to grow only at the end.

For example, we were forced to add the "middle" element after the "last" element, while it would seem very natural to put the "middle" element between "first" and "last" instead.

While this might seem like an unimportant detail, perhaps just the complaint that a whiner might make, akin to decrying the choice of the spelling of class names in a library, there are actually three good reasons that constraining the element placement of new elements in future versions can be overly restrictive.

When XML data straddles the boundary between text and data, for example, if the "extensible" type were designed to be printable with the XML markup removed, then it can be important to order the elements in a text-meaningful order, for example, title first; then first name; then middle name; then last name; then suffix.
When there are repeated elements, for example, if the first element were defined in a way to permit "one or more first names", it can be important when expanding a data model to allow new elements to sit adjacent to a repeated element with which the new element is related, rather than at the end where it is difficult to correlate.
And finally, and this is the most important reason, XML is designed to be human-readable. If it is conventional for people to expect data elements to come in a specific order, it is important for the XML data to reflect that order. If the technicalities of your Schema prevent the natural order from being used, you will be inviting more bugs - i.e., more engineers creating invalid instances - and higher development costs - i.e., fewer engineers being able to recognize problems by eyeballing the data. The whole reason for XML is to lower development costs. And that means we should be putting "middle" between "first" and "last".
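In schema terms, the ordering a designer would naturally choose for the second version is something like the sketch below (a fragment meant to sit inside an xs:schema element, reusing the type name from the running example). It is a perfectly legal content model on its own, but no wildcard at the end of the original sequence can make the original schema accept its instances, because an old validator expects <last> immediately after <first>.

```xml
<xs:complexType name="extensible" xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:sequence>
    <xs:element name="first" type="xs:string"/>
    <!-- New in the hypothetical second version, placed where readers expect it. -->
    <xs:element name="middle" type="xs:string" minOccurs="0"/>
    <xs:element name="last" type="xs:string"/>
  </xs:sequence>
</xs:complexType>
```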

Wildcards Can Lead to Technical Problems

The problems cited so far are enough for me to suggest that wildcards should not be used for version compatibility. However, the issue that seems to turn most people off to the strategy of using wildcards for versioning is the technical trouble they run into when they try to use wildcards in a nontrivial content model with lots of optional or repeated data.

The problem is that XML Schema prohibits content models that permit any ambiguity as to which particle declaration matches with a specific element in the instance. This rule is called the "unique particle attribution rule", and it reads as follows:

Schema Component Constraint: Unique Particle Attribution
A content model must be formed such that during validation of an element information item sequence, the particle contained directly, indirectly or implicitly therein with which to attempt to validate each item in the sequence in turn can be uniquely determined without examining the content or attributes of that item, and without any information about the items in the remainder of the sequence.

Please be aware that the XML Schema specification is full of such legalistic language. It is hard to understand this language when you first see it, but reading the language is a skill that can be learned, and to those who have learned the skill, the schema spec is actually quite clear.

What the heck is the "Unique Particle Attribution" (UPA) rule? It tells us that certain schemas are illegal to write, because they leave too much ambiguity when validating an instance. Here is a very simple example of a schema that violates the UPA rule:
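A minimal reconstruction of such a schema, consistent with the description that follows, might look like this. The type name and the xs:string element type are assumptions; what matters is the pair of optional, identically named element declarations.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <!-- Illustrative type name; a conforming schema processor should reject this type. -->
  <xs:complexType name="ambiguousExample">
    <xs:sequence>
      <!-- Two optional particles with the same name:
           a lone <ambiguous> element could match either one. -->
      <xs:element name="ambiguous" type="xs:string" minOccurs="0"/>
      <xs:element name="ambiguous" type="xs:string" minOccurs="0"/>
    </xs:sequence>
  </xs:complexType>
</xs:schema>
```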

Certainly, you can informally understand when reading the type definition above that the content model permits zero, one, or two <ambiguous> elements. If there are two elements in the actual instance, it is clear how the actual elements in the instance line up with element declarations in the schema: the first lines up with the first, and the second with the second. Similarly, if there are zero elements, there is no ambiguity.

However, if there is exactly one element, the schema does not say which particle matches the <ambiguous> element. Is it the first? Is it the second? The schema spec could have defined that it is one of them, e.g., the spec could have designed a "greedy match" rule that says that the first declaration is the one that is matched. But it does not.

In practice, if you have a schema with this kind of ambiguity, it can be a sign of other problems. So instead of permitting ambiguous schemas and defining their semantics, the schema specification simply says that this kind of ambiguity is illegal. If it is possible to have a nonunique particle attribution, then your schema is defined to be wrong, and as a schema author, you must change it.

Why is the UPA rule relevant to the use of wildcards for versioning? Because extensibility wildcards are always optional, and if they are adjacent to any optional element data in a matching namespace, the resulting content model will always violate the UPA rule.

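Consider again the second-version "extensible" type proposed earlier, in which the optional "middle" element declaration is immediately followed by the targetNamespace wildcard.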

The schema is invalid, because a simple instance such as the following is ambiguous:

    <ex>
      <first>Edgar</first>
      <last>Codd</last>
      <middle>F.</middle>
    </ex>

The ambiguity comes because the <middle> element could match either the <xs:element> declaration or the <xs:any> declaration. Rather than just saying that the element matches the first declaration, the W3C schema specification says that the schema is illegal.

What to do? David Orchard has an excellent discussion as to how to avoid violation of the UPA rule when using wildcards. My oversimplified summary of his analysis is that wildcards should always be quarantined within their own elements, so that your instance should look like this:

    <ex>
      <first>Edgar</first>
      <last>Codd</last>
      <extensions>
        <middle>F.</middle>
      </extensions>
    </ex>
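A first-version schema that accepts the wrapped instance above could be sketched as follows. This is a simplified illustration rather than Orchard's exact schema: the extensionsType name, the namespace URI, the "tns" prefix, elementFormDefault="qualified", and processContents="lax" are all assumptions. Because the wildcard is the only particle inside <extensions>, it has no optional neighbor to collide with, and the UPA rule is satisfied.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Assumed wrapper: the namespace URI and the "tns" prefix are hypothetical. -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:tns="http://example.com/names"
           targetNamespace="http://example.com/names"
           elementFormDefault="qualified">

  <xs:complexType name="extensible">
    <xs:sequence>
      <xs:element name="first" type="xs:string"/>
      <xs:element name="last" type="xs:string"/>
      <!-- The wildcard is quarantined inside this optional wrapper element. -->
      <xs:element name="extensions" type="tns:extensionsType" minOccurs="0"/>
    </xs:sequence>
  </xs:complexType>

  <xs:complexType name="extensionsType">
    <xs:sequence>
      <xs:any namespace="##targetNamespace" processContents="lax"
              minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
  </xs:complexType>

  <xs:element name="ex" type="tns:extensible"/>
</xs:schema>
```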

This approach, of course, is awkward. Cordoning off a new "middle" element inside a special "extensions" element is hardly friendly for human-readability. And as discussed above, the wildcard still suffers from its other problems when you try to version the schema.

Should the UPA rule be eliminated? Maybe, or maybe not. However, I believe that the solution for version compatibility does not lie in changing the UPA rule. It lies in the realization that wildcards should not be used for version compatibility at all.

Wildcards Require Predicting the Future

The most serious problem with using wildcards for forward version compatibility is that they effectively require that engineers predict the future.

Where should we put wildcards? We have two choices:

We can put optional wildcards everywhere. But then as discussed before, our schemas will permit extra garbage everywhere; they will permit massively unreadable instances; and they will also suffer problems with the UPA rule.
Or we can make reasonable guesses about the future and put wildcards in the anticipated locations where they will be needed. The wildcards will still suffer from the problems above, but at least the problems will be localized. We will feel more in control of our destiny.

The second proposition may sound mild and sensible. However, making the assertion that "we can make reasonable guesses about the future" is an overstretched requirement. It requires us to predict the future. It requires us to bless some possible future designs as "accounted for" while indicating other possible future designs as impossible or unreasonable.

For example, it might seem eminently reasonable to propose, as we have done, that all sorts of extensions for our example data type be expressed as "extra elements" at the end of the data model.

However, our simple model of future designs as "adding extra new elements in our targetNamespace at the end" leaves out several possibilities, such as putting new elements in other places or incorporating elements in other namespaces, and these omissions can come back to burn us. Given an extensible schema of sufficient size and complexity, running into problems of omission is not just a possibility. It is an inevitability.

For example, a common desire is to internationalize data. So in a future version we may like to alter our schema to permit xml:lang attributes on the first and last elements.

    <ex>
      <first xml:lang="en-gb">Edgar</first>
      <last>Codd</last>
    </ex>

Do our original schema's wildcards permit this? No! We purposely omitted - or more likely we just didn't think of or bother to go to the trouble to add - "xs:anyAttribute" declarations within the data types for "first" and "last". And we haven't done any thinking about namespaces outside targetNamespace.
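For the record, permitting such attributes would have required giving "first" and "last" a type along the lines of the sketch below in the original schema. It is a fragment meant to sit inside an xs:schema element, and the type name openString is hypothetical. The namespace="##other" setting admits namespace-qualified attributes from any namespace other than the target namespace, which includes the XML namespace that xml:lang belongs to.

```xml
<xs:complexType name="openString" xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:simpleContent>
    <xs:extension base="xs:string">
      <!-- Attribute wildcard: "lax" validates an attribute only if a
           declaration for it happens to be available. -->
      <xs:anyAttribute namespace="##other" processContents="lax"/>
    </xs:extension>
  </xs:simpleContent>
</xs:complexType>
```

Of course, this only helps if somebody thought to write it into the first version, which is exactly the kind of foresight in question.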

As soon as the future is upon us, we realize that we have not prepared for it.

The roadside is littered with the discarded ruins of countless clever technologies based on failed guesses of the future. Basing a forward compatibility strategy on the idea of "making a reasonable guess about the future" is an act of conceit and hubris. Guessing the future is best confined to weblogs and flamewars; this kind of speculation should not provide the foundation for our mission-critical XML Schema compatibility strategy.

The proper way to account for the future is to admit that "new data can come anywhere; it can be called anything; and it can be for any purpose." But wildcards are not well-suited for expressing this idea.

What the Future Brings

The future promises new data anywhere, called anything, for any purpose. Can we know anything about the future? Just about the only things we can know about future compatibility are that:

Future data looks different from today's data.
The future wants to be backward-compatible with us.

These two ideas are not exploited well by wildcards. Instead, they are best matched to the "must ignore" and "cannot be assumed to be recognizable" rules that can be found in protocols such as HTTP. The mathematical foundation for these guidelines, and their application to XML and XML Schema, is what I am developing in the Theory of Compatibility articles.
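As an informal picture of the "must ignore" idea (an illustration only, not the formal definition developed in those articles), consider a newer message that contains data an older consumer has never heard of:

```xml
<ex>
  <first xml:lang="en-gb">Edgar</first>
  <!-- Unknown to the first version of the schema, and not at the end. -->
  <middle>F.</middle>
  <last>Codd</last>
</ex>
```

An older consumer that follows "must ignore" would act as though it had received only the parts it recognizes:

```xml
<ex>
  <first>Edgar</first>
  <last>Codd</last>
</ex>
```

Nothing in the old schema had to predict where the new element or attribute would appear.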

Posted by David at December 22, 2003 09:34 PM