davidbau.com The Design of XMLBeans (Part 2)

The Design of XMLBeans (Part 2)

This article continues my series on the architecture of XMLBeans, an open-source Java/XML binding tool.

In the first article, we introduced two basic architectural principles: type correspondence and node correspondence, and we looked at some of the ramifications of type correspondence. In this article, we begin to examine node correspondence.

Understanding Node Correspondence

The principle of node correspondence says that there is a one-to-one correspondence between Java instance objects and XML Infoset document, element, and attribute nodes.

For example, consider the following XML instance document:

    <account-history>
      <open>2003-01-01</open>
      <buy>2003-01-01</buy>
      <sell>2003-02-05</sell>
      <buy>2003-02-06</buy>
      <sell note="all assets" auth="43JK">2003-03-12</sell>
      <close>2003-03-12</close>
    </account-history>

No matter the details of the XML Schema, the node correspondence rule guarantees that this instance corresponds to an object instance containment hierarchy that appears just like the XML structure, as follows:

                   * document
                   |
                   * account-history
    +----+----+----+----+-------------+
    *    *    *    *    *             *
    open buy  sell buy  sell--* note  close
                              * auth

The fact that the Java object containment tree corresponds exactly and directly to the layout of the XML infoset tree means that it is possible for a programmer to work with an XML instance document after just seeing an example of the XML, rather than requiring detailed knowledge of the schema. It also means that the binding is very robust to schema development and evolution, as long as compatibility is maintained for the XML instance data itself. (The W3C TAG finding on versioning points out very correctly that language evolution and versioning means maintaining compatibility between specific agents and a corpus of instance messages rather than non-concrete metadata such as schema models http://www.w3.org/2001/tag/doc/versioning.html.)
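
To see what "one object per node" means for the sample document, here is a minimal sketch that uses the JDK's built-in W3C DOM parser rather than XMLBeans itself. The class name NodeTreeSketch and its outline method are made up for this illustration. It lists one line per document, element, and attribute node, which is exactly the set of nodes that node correspondence maps to Java objects.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration; not part of XMLBeans.
class NodeTreeSketch {
    static final String SAMPLE =
        "<account-history>"
        + "<open>2003-01-01</open>"
        + "<buy>2003-01-01</buy>"
        + "<sell>2003-02-05</sell>"
        + "<buy>2003-02-06</buy>"
        + "<sell note=\"all assets\" auth=\"43JK\">2003-03-12</sell>"
        + "<close>2003-03-12</close>"
        + "</account-history>";

    /** Returns one indented line per document, element, and attribute node. */
    static List<String> outline(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            List<String> lines = new ArrayList<>();
            walk(doc, 0, lines);
            return lines;
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    private static void walk(Node node, int depth, List<String> out) {
        short type = node.getNodeType();
        if (type != Node.DOCUMENT_NODE && type != Node.ELEMENT_NODE) {
            return; // text and comments do not get their own line here
        }
        String pad = indent(depth);
        out.add(pad + (type == Node.DOCUMENT_NODE ? "#document" : node.getNodeName()));
        NamedNodeMap attrs = node.getAttributes(); // null for the document node
        if (attrs != null) {
            for (int i = 0; i < attrs.getLength(); i++) {
                out.add(pad + "  @" + attrs.item(i).getNodeName());
            }
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            walk(c, depth + 1, out);
        }
    }

    private static String indent(int depth) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < depth; i++) {
            sb.append("  ");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        for (String line : outline(SAMPLE)) {
            System.out.println(line);
        }
    }
}
```

Note that no schema is consulted anywhere in this sketch: the outline depends only on the instance document.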

One way to understand the node correspondence principle is to understand what it is not. For example, the JAXB 1.0 model group binding style does not adhere to the node correspondence principle. To see this, consider the following two schemas, both of which accept the document above.

The first example schema is the one you would write if you knew every "buy" is followed by a "sell". Using regular-expression-like notation, the content model described is:

(open (buy sell)* close)

Here is the schema:

    <xs:element name="account-history" type="history"/>

    <xs:complexType name="transaction">
      <xs:simpleContent>
        <xs:extension base="xs:date">
          <xs:attribute name="note" type="xs:token"/>
          <xs:attribute name="auth" type="xs:token"/>
        </xs:extension>
      </xs:simpleContent>
    </xs:complexType>

    <xs:complexType name="history">
      <xs:sequence>
        <xs:element name="open" type="transaction"/>
        <xs:sequence minOccurs="0" maxOccurs="unbounded">
          <xs:element name="buy" type="transaction"/>
          <xs:element name="sell" type="transaction"/>
        </xs:sequence>
        <xs:element name="close" type="transaction"/>
      </xs:sequence>
    </xs:complexType>

If we were to impose the model group on the instance above, the document would be organized as follows:

    <account-history>
      (<open>2003-01-01</open>
       (<buy>2003-01-01</buy>
        <sell>2003-02-05</sell>)
       (<buy>2003-02-06</buy>
        <sell note="all assets" auth="43JK">2003-03-12</sell>)
       <close>2003-03-12</close>)
    </account-history>

The JAXB 1.0 "model group" binding style (which does not conform to the node correspondence principle) provides Java objects for the group constructs that appear when imposing the content model:

            account-history
                   *
    +---------+----+------+-----------+
    *         *           *           *
    open  buyAndSell  buyAndSell    close
            |    |      |    |
           buy  sell   buy  sell

When using this model, the programmer must be aware that, to get to a "buy" transaction, they must first navigate through a "buyAndSell" object, even though "buyAndSell" does not correspond to any node in the XML infoset instance data. This is a little bit awkward to program with.

The "buyAndSell" object is problematic for several reasons besides clumsiness. For example, since it does not correspond to a DOM node, if DOM were used to manipulate the tree, "buyAndSell" objects would have to somehow appear and disappear at the "right" times. Also, the "buyAndSell" object is not robust to schema evolution, because it can change or go away if the schema is evolved in a backward-compatible way.

For example, suppose that after working with the schema above we realize that the schema was too restrictive for the actual business process at hand: every buy does not need to be followed by a sell, and not all account histories end with a "close".

The following is a rewrite of the "history" schema type to address both issues. In regular-expression-like notation, the content model here is (open (buy | sell)* close?)

    <xs:complexType name="history">
      <xs:sequence>
        <xs:element name="open" type="transaction"/>
        <xs:choice minOccurs="0" maxOccurs="unbounded">
          <xs:element name="buy" type="transaction"/>
          <xs:element name="sell" type="transaction"/>
        </xs:choice>
        <xs:element name="close" type="transaction" minOccurs="0"/>
      </xs:sequence>
    </xs:complexType>

The simple binding style, which follows node correspondence, still produces the same Java object containment tree regardless of the schema. However, the JAXB 1.0 model group binding style provides quite a different tree for the same data:

                       account-history
                              *
    +---------+----------+----------+----------+---------+
    *         *          *          *          *         *
    open   buyOrSell  buyOrSell  buyOrSell  buyOrSell  close
              |          |          |          |
             buy        sell       buy        sell

Note that this binding describes exactly the same instance, but the logic and structure of how to navigate the same data is quite different. By relaxing the schema slightly and not changing the instance data at all, the topology of the tree has changed.

By tying containment to the shape of instance data rather than the shape of the schema description, the principle of node correspondence guarantees that even in the face of schema evolution, the binding results in object trees which are the same shape for the same data.
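
The change of shape can be made concrete with a small sketch in plain Java. This is not JAXB-generated code. The class ModelGroupShapeSketch and its methods are invented for illustration, and each inner list simply stands for one object sitting directly under account-history: a plain child such as "open", or a group object such as "buyAndSell" or "buyOrSell".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration; neither JAXB nor XMLBeans generates this class.
class ModelGroupShapeSketch {
    // Child element names of <account-history>, in document order.
    static final List<String> CHILDREN =
        Arrays.asList("open", "buy", "sell", "buy", "sell", "close");

    /** Shape under (open (buy sell)* close): one group per buy/sell pair. */
    static List<List<String>> groupAsPairs(List<String> children) {
        List<List<String>> shape = new ArrayList<>();
        int i = 0;
        while (i < children.size()) {
            boolean pair = i + 1 < children.size()
                && children.get(i).equals("buy")
                && children.get(i + 1).equals("sell");
            int end = pair ? i + 2 : i + 1;
            shape.add(new ArrayList<>(children.subList(i, end)));
            i = end;
        }
        return shape;
    }

    /** Shape under (open (buy | sell)* close?): one group per child. */
    static List<List<String>> groupAsChoices(List<String> children) {
        List<List<String>> shape = new ArrayList<>();
        for (String name : children) {
            shape.add(new ArrayList<>(Arrays.asList(name)));
        }
        return shape;
    }

    public static void main(String[] args) {
        System.out.println("node correspondence:    " + CHILDREN);
        System.out.println("model group, schema 1:  " + groupAsPairs(CHILDREN));
        System.out.println("model group, schema 2:  " + groupAsChoices(CHILDREN));
    }
}
```

The flat list of children is the same under both schemas, while the grouped shapes differ (four objects under the root in one case, six in the other) even though the instance data never changed.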

Element Order and Node Correspondence

There is a tension between Java and XML data models with respect to element order. In Java, named fields or methods do not have an inherent order in the instance data. However, in XML, named elements do have a specific order that is a significant aspect of the instance data.

Although a schema can certainly constrain the order of elements, the order in which elements actually appear is a property of the XML instance document, not of the schema which describes the instance. Therefore, for many XML applications, it can be important to access and manipulate the element order.

One solution to this problem is to bind every set of children to an ordered collection or array in Java, fully preserving ordering information in the Java object.

    import java.util.Collection;

    interface AccountHistory {
        // An ordered list of all transactions, including
        // "open", "buy", "sell", and "close"
        Collection getElementChildren();
    }

This binding model (known as the "generic content" binding model in JAXB 1.0) certainly maintains complete node correspondence, including ordering information. On the other hand, it is obviously inconvenient to use. The binding provides little extra value over an unbound API such as the W3C DOM's Node.getChildNodes() method.

Why is the above approach obviously missing something? Because in many situations in Java applications, it is the tag name, not the order, which is significant! In the example schema above, we can expect that a typical Java application developer would want to access the "<open>" transaction and the "<close>" transaction by name, without traversing through a list. In other words, programmers want to call getOpen() and getClose().

Sometimes it is possible to tell that element order is not significant for an application just by looking at the schema. Within model groups that constrain element order completely, applications cannot possibly extract any additional information from an instance by examining element order. For example:

<xs:complexType name="simple-sequence"> <xs:sequence> <xs:element name="first-name"/> <xs:element name="middle-name" minOccurs="0"/> <xs:element name="last-name"/> </xs:sequence> </xs:complexType>

The example constrains any <first-name> element to precede any <middle-name> element and in turn any <last-name> element. So valid instance data contains no interesting information about order at all. Because there are no degrees of freedom in element ordering, applications can be expected not to care that <first-name> comes before <last-name> in a particular instance document. Those elements must always come in that order.

[As an aside, it is interesting and important to recognize that the more a schema does to constrain element order, the less information a particular instance's element order provides for an application, and the less interesting it is for an application to be aware of the order. More ordering information in a schema can mean less ordering significance in the application. However, less ordering information does not necessarily mean more ordering significance in the application, as the following example shows.]

Other times, the schema does not constrain the order but the order is still not important to the application; in these situations the schema typically provides enough information to produce a "perfectly fine" order without forcing the Java developer to think about order all the time, as long as the application truly is insensitive to order. For example:

<xs:complexType name="simple-config-set"> <xs:choice minOccurs="0" maxOccurs="unbounded"> <xs:element name="file-config"/> <xs:element name="program-config"/> <xs:element name="user-config"/> </xs:choice> </xs:complexType>

In the example above, XML element order is unconstrained, so the instance data does contain ordering information that can vary from instance to instance. But a typical application (e.g., one that is going to manipulate a specific user's configuration) may not care about the order at all, and the challenge for a binding model is to free the Java developer from having to worry about order when it does not matter.

However, the fact that the order is not significant to the particular application is not inherent to this kind of schema. Certainly you could write another application against messages conforming to the same "simple-config-set" schema where the order was very important. For example, suppose that the order of configuration elements defined the precedence order for applying configuration to, say, the action of a user opening a file with a program. In that case, the fact that a certain <user-config> might override a certain <file-config> by preceding it in the order could be essential to the application.
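
Here is a rough sketch of those two kinds of applications reading the same list of child element names. Everything in it is hypothetical, including the ConfigOrderSketch class and the "earlier element wins" precedence rule, which is only one way such an application might interpret order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;

// Hypothetical illustration of two applications written against one schema.
class ConfigOrderSketch {
    /** Order-insensitive application: only asks whether a kind of config is present. */
    static boolean hasConfig(List<String> children, String name) {
        return children.contains(name);
    }

    /**
     * Order-sensitive application: earlier elements take precedence, so the
     * result lists each element name once, in order of first appearance.
     */
    static List<String> precedence(List<String> children) {
        // LinkedHashSet keeps the position of the first occurrence of each name.
        return new ArrayList<>(new LinkedHashSet<>(children));
    }

    public static void main(String[] args) {
        List<String> a = Arrays.asList("user-config", "file-config", "user-config");
        List<String> b = Arrays.asList("file-config", "user-config", "user-config");

        // Same answer for both documents.
        System.out.println(hasConfig(a, "user-config") + " " + hasConfig(b, "user-config"));

        // Different answers, because only the order differs.
        System.out.println(precedence(a));
        System.out.println(precedence(b));
    }
}
```

Both documents are valid against "simple-config-set" and contain the same elements. Only the second application can tell them apart.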

So the tension between order-significance and order-insignificance is not purely a tension between different kinds of schemas: it is a tension between different programs that can be written against the same schema.

And as we saw in the buy-and-sell example in the previous section, the tension is also between different schemas that can be used to describe the same corpus of messages.

The solution to this tension is not to somehow figure out how to select either order significance or order insignificance, but to provide both ordered and by-name access all the time, so the programmer can choose between the two techniques when writing the program. In other words:

The primary data model is of an ordered list of elements.
The bound API provides convenient manipulation of elements by name.

In concept, the bound interfaces always have methods that provide both forms of access to the same data model. It is left up to the implementation to ensure that the data is maintained in an efficient and consistent way.

    interface AccountHistory {
        // Select "*" for all element children in order,
        // or select "buy|sell" for all buy and sell children
        // in their interleaved order.
        XmlObject[] selectPath(String childSpecifier);

        // Strongly typed bound getters are provided for
        // all declared element names.
        Transaction getOpen();
        Transaction[] getBuyArray();
        Transaction[] getSellArray();
        Transaction getClose();
    }
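
One way to picture how both forms of access can share a single data model is the sketch below. It is not how XMLBeans is implemented. It is a hand-written wrapper over the JDK's W3C DOM, and the names DualAccessSketch, select, getBuys, and getSells are assumptions made for the example. The ordered list of element children is the only stored state, and every by-name getter is derived from it.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical wrapper for illustration; not generated by XMLBeans.
class DualAccessSketch {
    private final Element root;

    DualAccessSketch(Element root) {
        this.root = root;
    }

    static DualAccessSketch parse(String xml) {
        try {
            return new DualAccessSketch(DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement());
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    /**
     * Ordered access: element children with any of the given names, in
     * document order. Passing no names selects every element child.
     */
    List<Element> select(String... names) {
        List<String> wanted = Arrays.asList(names);
        List<Element> result = new ArrayList<>();
        for (Node c = root.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c instanceof Element
                    && (wanted.isEmpty() || wanted.contains(c.getNodeName()))) {
                result.add((Element) c);
            }
        }
        return result;
    }

    // By-name access, derived from the same ordered children.
    Element getOpen() { return first("open"); }
    Element getClose() { return first("close"); }
    List<Element> getBuys() { return select("buy"); }
    List<Element> getSells() { return select("sell"); }

    private Element first(String name) {
        List<Element> found = select(name);
        return found.isEmpty() ? null : found.get(0);
    }

    public static void main(String[] args) {
        DualAccessSketch history = parse(
            "<account-history><open>2003-01-01</open><buy>2003-01-01</buy>"
            + "<sell>2003-02-05</sell><buy>2003-02-06</buy>"
            + "<sell>2003-03-12</sell><close>2003-03-12</close></account-history>");
        for (Element e : history.select("buy", "sell")) {
            System.out.println(e.getNodeName() + " " + e.getTextContent());
        }
        System.out.println("opened " + history.getOpen().getTextContent());
    }
}
```

Because the getters are computed from the ordered children rather than stored separately, the two views cannot drift apart. A real implementation would want something more efficient than rescanning the children on every call.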

Providing convenient setters while not interfering with element order is an interesting topic. The elegant solution provided by the simple binding model will be discussed in the "Order and Setters" section of one of the future articles in this series.

Achieve Convenience while Applying the Two Principles

Before going on to discuss the specific applications of these techniques to substitution, wildcards, and other idiosyncrasies of XML Schema, we should recap what we have covered so far.

When applying the two basic principles to the simple binding style, it is also important to make sure that the bound APIs are as convenient as possible while still being formally correct.

As we have seen, achieving convenience has two consequences:

When preserving type correspondence, it is also important to be able to provide "convenience" types for certain simple types, so, for example, schema strings can be easily seen as Java Strings. So in addition to the "formal" Java class corresponding to each schema type, simple types also have a "convenience" Java type.
When preserving node correspondence, it is important to provide "convenience" access to named element children when element order does not matter as well as "formal" preservation of XML element order. So in addition to a "formal" accessor API that provides children as an ordered list of objects, there are also "convenience" methods that allow elements to be manipulated by name.

So both metadata and instance data can be seen in two ways:

	Formally	Conveniently
Schema type	"formal class"	"convenience type"
Child nodes	"child list"	"named property"

Next

We have seen that the two basic principles of the XMLBeans binding architecture lead to a design that has both convenience methods and types and formal methods and types.

In the next article in the series, we will discuss how XML Schema type and element substitution are handled with this "dual model" technique.

Posted by David at November 19, 2003 02:52 PM