January 15, 2004

Theory of Compatibility (Part 3)

In the first two articles of this series, we introduced a theoretical approach to versioning compatibility:
The discussion so far has been theoretical, but our careful theorems and definitions have given us a solid foundation on which we can build permanent full compatibility between versions. This third article in the series is a practical one. We apply these theoretical ideas to versioning XML and XML Schema.

The Idea of XML Projection

In discussions with my colleague David Orchard, we have been calling the Versioned Contract Language approach to XML Schema "Validation By Projection". What does this mean?

A "projection" of an XML document is a partial copy of the document that selects out a subset of the elements and attributes, but doesn't otherwise modify the order or the content of the data. For example, we can project an XML document by removing all elements other than the specific set of element names we want to pay attention to.

Suppose we were interested only in the "first" and "last" and "age" elements within "customer" elements. Then our projection could begin with the following XML document:

And then our projection would end up producing the following stripped-down document:

In practice, you may not want to select the same element names throughout the entire document. For example, while we might be interested in <age> within <customer>, we might not be interested in <age> if it appears within <wine>. So we can certainly define projections that select elements and attributes based on their context as well as their name.

That's all there is to the idea. Projection is simple enough.

Schema-Aware XML Projection

Why is XML projection interesting? Because it gives us a nice clean way to apply the "must-ignore" rule to XML. (In Part 2, we examined the theory behind the must-ignore rule.) In other words, projection is a way to ignore the "unrecognized" parts of an XML document.

Any schema defines a set of elements and attributes, and in particular, it defines the specific schema types within which those elements and attributes are recognized.
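The context-sensitive projection described in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not code from the article: the sample document, its values, the `RECOGNIZED` catalog and the `project` helper are all hypothetical. For brevity it filters elements only and copies attributes unchanged. The catalog is keyed by parent element name, which is one simple way to keep <age> inside <customer> while dropping <age> inside <wine>.

```python
import xml.etree.ElementTree as ET

# Hypothetical input document; element values are invented for illustration.
SAMPLE = """<order>
  <customer>
    <id>1234</id>
    <first>Ann</first>
    <middle>Q</middle>
    <last>Smith</last>
    <age>34</age>
    <since>2001-05-01</since>
  </customer>
  <wine>
    <name>Merlot</name>
    <age>5</age>
  </wine>
</order>"""

# Hypothetical catalog: parent element name -> child element names to keep.
# Keying on the parent makes the selection depend on context, not just name.
RECOGNIZED = {
    "order": {"customer", "wine"},
    "customer": {"first", "last", "age"},
    "wine": {"name"},
}


def project(elem, recognized):
    """Copy elem, keeping only the child elements recognized in its context.

    Order and content of the kept elements are not modified.
    Attributes are copied as-is in this sketch (no attribute filtering).
    """
    copy = ET.Element(elem.tag, dict(elem.attrib))
    copy.text = elem.text
    allowed = recognized.get(elem.tag, set())
    for child in elem:  # iteration follows document order
        if child.tag in allowed:
            kept = project(child, recognized)
            kept.tail = child.tail
            copy.append(kept)
    return copy


projected = project(ET.fromstring(SAMPLE), RECOGNIZED)
```

Calling `ET.tostring(projected, encoding="unicode")` serializes the stripped-down copy. The original tree is left untouched, which matches the idea of projection as a partial copy rather than an in-place edit.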
So there is an interesting projection for every schema that simply removes every unrecognized element and attribute. For example, here is a simple schema that defines the <first> and <last> and <age> elements within a customer-type:

From this schema, we can make a catalog of recognized elements:
A projection that selects out these specific elements and removes others works as we would expect - it projects exactly like the first projection example above.

Projection and Validation

Notice that, since projection removes "unrecognized" elements, it helps make a document "more valid" in a way. For example, the original document (reproduced below) is invalid according to our schema, because it contains several illegal elements such as <id> and <middle> and <since>:

But the projected document has become valid according to our schema:

However, a projection isn't meant to guarantee validity by any means. For example, if our original document was the following:

Then the projected document would look like this:

This document is invalid according to our schema on two counts:
So although projection can help make some instances valid that were previously not valid, it clearly does not make all instances valid. And it is not supposed to: the purpose of projection is to ignore the parts of the document that you do not recognize. If a part of a document is recognizable, it is certainly fair to apply strict validation rules to the recognized part.

The idea that projection only eliminates unrecognized parts of a document is similar to the way HTTP requires you to ignore completely unrecognized HTTP headers, but still requires that if you use a recognized header, you must use it correctly.

Pinning down Recognizability

To pin down projection to "recognized elements and attributes", it is necessary to pin down exactly what it means for an element or attribute to be "recognized". I propose that the most reasonable thing to do is to follow the lead of the XML Schema Specification's "element declarations consistent" rule and say that:
There may be other reasonable ways to define recognizability. These seem like the most reasonable rules to me, but the same idea of validation by projection will still work even with a different definition of recognizability.

Using Validation By Projection

We will call a document "Valid By Projection" according to a specific schema if the document is valid after projecting to the recognized elements and attributes in the schema.

When would you want to use validation by projection as opposed to ordinary, plain old validation? Validation by projection is more lax than plain validation, and it is a way of applying a must-ignore rule. Both experience and theory tell us that:
Using the notation of the Theorem of Compatible Extensions, this is the same as saying we should define, for a specific schema S,

    c(S) = { documents valid by projection according to S }

The key is that the set accepted by consumers is larger than the set allowed for producers. This allows new versions of a schema to "wedge into" the gap between p(S) and c(S) while maintaining full compatibility. Part 1 in this series described this "wedging between" idea in theoretical detail. But what does it mean in practice?

A Simple Example of Full Compatibility

Here is an example. We need to take a look at two versions of a schema. Version one has a first and last name, and an age:

Version two introduces some additional new, optional information - a middle name and a "since" date:

Now, to test compatibility, we have two different interesting questions to ask:
Backward Compatibility

    +----------------------+                    +----------------------+
    | Old Message Producer | --(old message)--> | New Message Consumer |
    +----------------------+                    +----------------------+

Backward compatibility is achieved because our V2 schema is carefully designed so that every XML document that was valid according to the V1 schema is still valid in V2. This is a kind of design that is not hard to do within ordinary schema mechanisms. In our case, the main thing we have done is to make sure that both new elements <middle> and <since> are defined to be optional.

Since a producer using V1 is required to follow strict validity according to the V1 schema, it produces documents that look like this:

Documents like this are still valid according to the V2 schema - and when they are projected into the V2 schema, they are not changed at all - so our V2 consumers have no trouble accepting them.

Forward Compatibility

    +----------------------+                    +----------------------+
    | New Message Producer | --(new message)--> | Old Message Consumer |
    +----------------------+                    +----------------------+

Forward compatibility works because of projection. A V2 producer is required to conform to strict validation according to the V2 schema, so it is not allowed to do bad things such as put a non-number such as "n/a" within the "age" element. But the V2 schema permits the producer to include new elements that are not present in the V1 schema such as <middle> and <since>. Here is an example valid V2 message:

When the V1 consumer receives this message, it applies validation by projection. In other words, it first strips out unrecognized elements as follows:

Then it applies validation on the result. Validation by projection allows V2 to include new information that would not have been recognized by V1.
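The forward-compatibility scenario above can be sketched as a small Python program. Everything here is an assumption made for illustration: the message values are invented, and `is_valid_v1` is a hand-written stand-in for a real XML Schema validator, assuming the V1 schema requires <first>, <last> and <age> exactly once, in that order, with a non-negative integer age. The point is only the two-step shape of validation by projection: project first, then validate strictly.

```python
import xml.etree.ElementTree as ET

# Hypothetical V1 catalog: parent element name -> recognized child names.
V1_RECOGNIZED = {"customer": {"first", "last", "age"}}

# Hypothetical message from a V2 producer; the values are invented.
V2_MESSAGE = (
    "<customer><first>Ann</first><middle>Q</middle><last>Smith</last>"
    "<age>34</age><since>2001-05-01</since></customer>"
)


def project(elem, recognized):
    """Copy elem, keeping only the child elements recognized in its context."""
    copy = ET.Element(elem.tag, dict(elem.attrib))
    copy.text = elem.text
    allowed = recognized.get(elem.tag, set())
    for child in elem:  # document order is preserved
        if child.tag in allowed:
            copy.append(project(child, recognized))
    return copy


def is_valid_v1(customer):
    """Stand-in for strict V1 validation (assumed rules, not a real XSD check)."""
    if customer.tag != "customer":
        return False
    if [child.tag for child in customer] != ["first", "last", "age"]:
        return False
    age = (customer.findtext("age") or "").strip()
    return age.isascii() and age.isdigit()


def is_valid_by_projection_v1(xml_text):
    """What a V1 consumer does: strip unrecognized parts, then validate strictly."""
    return is_valid_v1(project(ET.fromstring(xml_text), V1_RECOGNIZED))


strict_ok = is_valid_v1(ET.fromstring(V2_MESSAGE))      # strict V1 check on raw message
projected_ok = is_valid_by_projection_v1(V2_MESSAGE)    # V1 consumer's check
bad_age_ok = is_valid_by_projection_v1(V2_MESSAGE.replace("34", "n/a"))
```

Under these assumptions the raw V2 message fails the strict V1 check because of <middle> and <since>, but passes once projected. A message with "n/a" in <age> is still rejected, since projection never relaxes the rules for recognized elements.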
Here, we can also see that it is important that the V2 schema has been carefully designed so that all elements that used to be recognized in V1 are constrained so that they are still only allowed to be used in ways that would have been valid in V1. This rule is what I call the "cannot be assumed to be recognizable" rule, and it is formally set down in a previous article. When stripped of all the new elements, an instance must still be valid according to the old schema.

Next Steps

This article has described the practical key to the ideas behind the theory of compatibility. By requiring consumers to ignore unrecognized parts of messages, you open the door to easy versioning and full compatibility. The key is to define a "projection" function for the schema language, and to require consumers to use "validation by projection" while still requiring producers to use traditional strict validation.

In our theoretical discussions, we have also established that there are a few other pieces of the puzzle. There is a "cannot be assumed to be recognizable" rule as well as a "specialization" relationship with consistency requirements that are needed to achieve rock-solid full compatibility. In future articles, we will dissect these issues in more detail.

Posted by David at January 15, 2004 08:37 AM

Comments
We are currently using DTDs and JAXB to process XML, but it means that if even a single attribute is added to the XML, it breaks our code. I have been investigating whether XMLBeans and XML schemas could help solve this problem, so it was very surprising to see the designer of XMLBeans writing an article which directly addresses it! Thank you very much for a clear explanation of the principles. However, due to my inexperience with XMLBeans and XML Schemas, I'm not sure whether I can do the "Validation By Projection" using XMLBeans. Does the XMLBeans software ignore elements and attributes not in the schema? Or do I need a judicious sprinkling of "any" elements and "anyAttribute" attributes in the XML Schema? Thanks for any help.

Posted by: Roger Beardsworth at April 13, 2004 11:57 AM

I've since read your "Wildcards and XML Versioning" which rules out the second option I'd considered. I've also tried out XMLBeans, revealing that they work exactly as I'd like - I can add/remove elements/attributes and it doesn't complain unless they're used by the code. So it looks like it does projection, in a slightly roundabout way... Thanks again for the interesting articles.

Posted by: Roger Beardsworth at April 14, 2004 06:49 AM
Copyright 2004 © David Bau. All Rights Reserved.