November 19, 2003
The Design of XMLBeans (Part 2)
This article continues my series on the architecture of XMLBeans, an open-source Java/XML binding tool.
In the first article, we introduced two basic architectural principles: type correspondence and node correspondence, and we looked at some of the ramifications of type correspondence. In this article, we begin to examine node correspondence.
Understanding Node Correspondence
The principle of node correspondence says that there is a one-to-one correspondence between Java instance objects and XML Infoset document, element, and attribute nodes.
For example, consider the following XML instance document:
No matter the details of the XML Schema, the node correspondence rule guarantees that this instance corresponds to an object instance containment hierarchy that appears just like the XML structure, as follows:
The fact that the Java object containment tree corresponds exactly and directly to the layout of the XML infoset tree means that a programmer can work with an XML instance document after just seeing an example of the XML, without needing detailed knowledge of the schema. It also means that the binding is very robust to schema development and evolution, as long as compatibility is maintained for the XML instance data itself. (The W3C TAG finding on versioning, http://www.w3.org/2001/tag/doc/versioning.html, points out very correctly that language evolution and versioning mean maintaining compatibility between specific agents and a corpus of instance messages rather than non-concrete metadata such as schema models.)
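As a rough illustration of the principle, the sketch below mirrors an XML element tree into Java objects, one object per element node. It is not the XMLBeans API: the BoundNode class and its methods are hypothetical names invented for this example, it uses only the JDK's built-in DOM parser, and it covers element nodes only (attributes and the document node are left out for brevity).

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical bound object: exactly one instance per XML element node.
class BoundNode {
    final String name;
    final List<BoundNode> children = new ArrayList<>();

    BoundNode(String name) { this.name = name; }

    // Mirror the element tree node for node, ignoring text and comments.
    static BoundNode bind(Element element) {
        BoundNode bound = new BoundNode(element.getTagName());
        for (Node n = element.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() == Node.ELEMENT_NODE) {
                bound.children.add(bind((Element) n));
            }
        }
        return bound;
    }

    // Parse a string with the JDK DOM parser and bind its root element.
    static BoundNode parse(String xml) {
        try {
            return bind(DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement());
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    // Render the containment tree as an indented outline.
    String outline(String indent) {
        StringBuilder sb = new StringBuilder(indent).append(name).append('\n');
        for (BoundNode child : children) {
            sb.append(child.outline(indent + "  "));
        }
        return sb.toString();
    }
}
```

Calling BoundNode.parse on a made-up document such as `<history><open/><buy/><sell/><close/></history>` yields a "history" object with four children in document order. No schema is consulted at any point, which is why the shape of the object tree cannot change when the schema does.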
One way to understand the node correspondence principle is to understand what it is not. For example, the JAXB 1.0 model group binding style does not adhere to the node correspondence principle. To see this, consider the following two schemas, both of which accept the document above.
The first example schema is the one you would write if you knew every "buy" is followed by a "sell". Using regular-expression-like notation, the content model described is:
Here is the schema:
If we were to impose the model group on the instance above, the document would be organized as follows:
The JAXB 1.0 "model group" binding style (which does not conform to the node correspondence principle) provides Java objects for the group constructs that appear when imposing the content model:
When using this model, the programmer must be aware that, to get to a "buy" transaction, they must first navigate through a "buyAndSell" object, even though "buyAndSell" does not correspond to any node in the XML infoset instance data. This is a little bit awkward to program with.
The "buyAndSell" object is problematic for several reasons besides clumsiness. For example, since it does not correspond to a DOM node, if DOM were used to manipulate the tree, "buyAndSell" objects would have to somehow appear and disappear at the "right" times. Also, the "buyAndSell" object is not robust to schema evolution, because it can change or go away if the schema is evolved in a backward-compatible way.
For example, suppose that after working with the schema above we realize that the schema was too restrictive for the actual business process at hand: every buy does not need to be followed by a sell, and not all account histories end with a "close".
The following is a rewrite of the "history" schema type to address both issues. In regular-expression-like notation, the content model here is (open (buy | sell)* close?)
The simplified binding style still produces the same Java object containment tree regardless of the schema. However, the JAXB 1.0 model group binding style provides quite a different tree for the same data:
Note that this binding describes exactly the same instance, but the logic and structure of how to navigate the same data is quite different. By relaxing the schema slightly and not changing the instance data at all, the topology of the tree has changed.
By tying containment to the shape of instance data rather than the shape of the schema description, the principle of node correspondence guarantees that even in the face of schema evolution, the binding results in object trees which are the same shape for the same data.
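To make the "same data, different schema" point concrete, the sketch below treats each content model as a regular expression over the sequence of child element names. The relaxed model is the one quoted above, (open (buy | sell)* close?). The strict model is assumed to be (open (buy sell)* close), reconstructed from the description "every buy is followed by a sell" and a required "close"; the ContentModels class and its encoding of names are hypothetical and only serve the illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class ContentModels {
    // Each child name is encoded as the name followed by one space, so that
    // a name can never match as a prefix of another name.

    // Assumed strict model: open (buy sell)* close
    static final Pattern STRICT = Pattern.compile("open (buy sell )*close ");

    // Relaxed model from the text: open (buy | sell)* close?
    static final Pattern RELAXED = Pattern.compile("open ((buy|sell) )*(close )?");

    static String encode(List<String> childNames) {
        StringBuilder sb = new StringBuilder();
        for (String name : childNames) {
            sb.append(name).append(' ');
        }
        return sb.toString();
    }

    // True when the whole child sequence is accepted by the content model.
    static boolean accepts(Pattern model, List<String> childNames) {
        return model.matcher(encode(childNames)).matches();
    }

    public static void main(String[] args) {
        List<String> instance =
                Arrays.asList("open", "buy", "sell", "buy", "sell", "close");
        System.out.println(accepts(STRICT, instance));   // true
        System.out.println(accepts(RELAXED, instance));  // true
    }
}
```

Both models accept the same instance, so a binding whose object tree follows the instance is unaffected by the schema change, while a binding whose object tree follows the grouping in the pattern is not. A sequence such as open, buy, buy is accepted only by the relaxed model, which is what makes the change backward compatible rather than equivalent.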
Element Order and Node Correspondence
There is a tension between Java and XML data models with respect to element order. In Java, named fields or methods do not have an inherent order in the instance data. However, in XML, named elements do have a specific order that is a significant aspect of the instance data.
Although a schema can certainly constrain the order of elements, the order in which elements actually appear is a property of the XML instance document, not of the schema which describes the instance. Therefore, for many XML applications, it can be important to access and manipulate the element order.
One solution to this problem is to bind every set of children to an ordered collection or array in Java, fully preserving ordering information in the Java object.
This binding model (known as the "generic content" binding model in JAXB 1.0) certainly maintains complete node correspondence, including ordering information. On the other hand, it is obviously inconvenient to use. The binding provides little extra value over an unbound API such as the W3C DOM's Node.getChildNodes() method.
Why is the above approach obviously missing something? Because in many situations in Java applications, it is the tag name, not the order, which is significant! In the example schema above, we can expect that a typical Java application developer would want to access the "<open>" transaction and the "<close>" transaction by name, without traversing through a list. In other words, programmers want to call getOpen() and getClose().
Sometimes it is possible to tell that element order is not significant for an application just by looking at the schema. Within model groups that constrain element order completely, applications cannot possibly extract any additional information from an instance by examining element order. For example:
The example constrains any <first-name> element to precede any <middle-name> element and in turn any <last-name> element. So valid instance data contains no interesting information about order at all. Because there are no degrees of freedom in element ordering, applications can be expected not to care that <first-name> comes before the other name elements.
[As an aside, it is interesting and important to recognize that the more a schema does to constrain element order, the less information a particular instance's element order provides for an application, and the less interesting it is for an application to be aware of the order. More ordering information in a schema can mean less ordering significance in the application. However, less ordering information does not necessarily mean more ordering significance in the application, as the following example shows.]
Other times, the schema does not constrain the order but the order is still not important to the application; in these situations the schema typically provides enough information to produce a "perfectly fine" order without forcing the Java developer to think about order all the time, as long as the application truly is insensitive to order. For example:
In the example above, XML element order is unconstrained, so the instance data does contain ordering information that can vary from instance to instance. But a typical application (e.g., one that is going to manipulate a specific user's configuration) may not care about the order at all, and the challenge for a binding model is to free the Java developer from having to worry about order when it does not matter.
However, the fact that the order is not significant to the particular application is not inherent to this kind of schema. Certainly you could write another application against messages conforming to the same "simple-config-set" schema where the order was very important. For example, suppose that the order of configuration elements defined the precedence order for applying configuration to, say, the action of a user opening a file with a program. In that case, the fact that a certain <user-config> might override a certain <file-config> by preceding it in the order could be essential to the application.
So the tension between order significance and order insignificance is not purely a tension between different kinds of schemas: it is a tension between different programs that can be written against the same schema.
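An order-sensitive application of this kind might look like the sketch below, in which the first configuration element that names a setting wins. The ConfigEntry and ConfigSet classes, the "editor" setting, and its values are all hypothetical; they only illustrate how document order can carry meaning that the schema does not constrain.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical configuration element: its tag name plus one setting.
class ConfigEntry {
    final String element;  // e.g. "user-config" or "file-config"
    final String setting;
    final String value;

    ConfigEntry(String element, String setting, String value) {
        this.element = element;
        this.setting = setting;
        this.value = value;
    }
}

class ConfigSet {
    // Earlier entries take precedence: the first entry that names the
    // setting wins. Returns null when no entry names it.
    static ConfigEntry resolve(List<ConfigEntry> documentOrder, String setting) {
        for (ConfigEntry entry : documentOrder) {
            if (entry.setting.equals(setting)) {
                return entry;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        List<ConfigEntry> config = Arrays.asList(
                new ConfigEntry("user-config", "editor", "vim"),
                new ConfigEntry("file-config", "editor", "emacs"));
        System.out.println(resolve(config, "editor").element);  // user-config
    }
}
```

Swapping the two entries changes the result even though both documents contain exactly the same elements, so an API that exposed only by-name access could not support this program.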
And as we saw in the buy-and-sell example in the previous section, the tension is also between different schemas that can be used to describe the same corpus of messages.
The solution to this tension is not to somehow figure out how to select either order significance or order insignificance, but to provide both ordered and by-name access all the time, so the programmer can choose between the two techniques when writing the program. In other words:
In concept, the bound interfaces always have methods that provide both forms of access to the same data model. It is left up to the implementation to ensure that the data is maintained in an efficient and consistent way.
Providing convenient setters while not interfering with element order is an interesting topic. The elegant solution provided by the simple binding model will be discussed in the "Order and Setters" section of a future article in this series.
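One way to picture this dual access is the minimal sketch below. It is not the XMLBeans implementation or its generated API: the History and Transaction classes are hypothetical, and only getOpen() and getClose() are names taken from the discussion above. A single ordered list is the only stored state, and the by-name getters are computed from it, so the two views cannot drift apart.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical child node: an element name plus simple text content.
class Transaction {
    final String name;
    final String text;

    Transaction(String name, String text) {
        this.name = name;
        this.text = text;
    }
}

// Hypothetical bound type for a <history> element.
class History {
    // Single source of truth: all children in document order.
    private final List<Transaction> children = new ArrayList<>();

    void append(String name, String text) {
        children.add(new Transaction(name, text));
    }

    // Ordered access: every child, in document order, read-only.
    List<Transaction> getChildren() {
        return Collections.unmodifiableList(children);
    }

    // By-name access to a repeated element, still in document order.
    List<Transaction> getAll(String name) {
        List<Transaction> matches = new ArrayList<>();
        for (Transaction child : children) {
            if (child.name.equals(name)) {
                matches.add(child);
            }
        }
        return matches;
    }

    // By-name access to a singular element; null when it is absent.
    private Transaction first(String name) {
        for (Transaction child : children) {
            if (child.name.equals(name)) {
                return child;
            }
        }
        return null;
    }

    Transaction getOpen() { return first("open"); }
    Transaction getClose() { return first("close"); }
    List<Transaction> getBuys() { return getAll("buy"); }
}
```

A program that cares only about names calls getOpen() or getBuys(), while a program that cares about sequence iterates getChildren(). The linear scans are chosen here for clarity only; a real implementation would be expected to index or cache.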
Achieve Convenience while Applying the Two Principles
Before going on to discuss the specific applications of these techniques to substitution, wildcards, and other idiosyncrasies of XML Schema, we should recap what we have covered so far.
When applying the two basic principles to the simple binding style, it is also important to make sure that the bound APIs are as convenient as possible while still being formally correct.
As we have seen, achieving convenience has two consequences:
So both metadata and instance data can be seen in two ways:
We have seen that the two basic principles of the XMLBeans binding architecture lead to a design which has both convenience methods and types, as well as formal methods and types.
In the next article in the series, we will discuss how XML Schema type and element substitution are handled with this "dual model" technique.
Posted by David at November 19, 2003 02:52 PM
Copyright 2003 © David Bau. All Rights Reserved.