davidbau.com The Design of XMLBeans (Part 3)

The Design of XMLBeans (Part 3)

We have a number of ongoing series of articles on this weblog. This week, I'm returning to write the third article in the series on the architecture of XMLBeans.

In the first article, we discussed the principle of type correspondence.
In the second article, we discussed the principle of node correspondence.

In this third article, we will examine the way binding works for the XML Schema features of type and element substitution. What do they mean for the Java code you need to write?

Type Substitution

What is type substitution? It is simple - both Java and XML Schema have it. This is what it looks like in Java:

    Product p = doc.getItem();
    System.out.println("Desc: " + p.getDescription());
    if (p instanceof ProductOnSale) {
        ProductOnSale s = (ProductOnSale) p;
        System.out.println("Price: " + s.getPrice());
    } else {
        System.out.println("Product is not on sale.");
    }

The idea is that, even though a variable "p" is declared with type "Product", it may (or may not) actually hold an instance whose actual type is more specific such as "ProductOnSale".

Java type substitution allows us to write generic code such as p.getDescription() in cases where "all Products behave the same" without rewriting the same line of code for all different kinds of Products. On the other hand, when we need to explicitly distinguish between substituted types, Java also provides an "instanceof" operator, which can be used to detect instances of specific subclasses.

In XML Schema, type substitution allows an XML element to contain a more specific type than its declared type. The xsi:type attribute gives the name of the more specific type contained in the element. For example, the following two documents are valid according to the schema below. The second document substitutes a "product-on-sale" instance for the default declared "product" type.

    <item>
      <description>Red Balloon</description>
    </item>

    <item xsi:type="product-on-sale" xmlns:xsi="...">
      <description>Blue Balloon</description>
      <price>0.75</price>
    </item>

For type substitution to be permitted, the "product-on-sale" type needs to be explicitly derived from the "product" type, as in the schema below. The rules of XML Schema inheritance require derived types to share enough common structure with a base type that a substituted type can be "treated as" its base type.

    <xs:element name="item" type="product"/>

    <xs:complexType name="product">
      <xs:sequence>
        <xs:element name="description" type="xs:string"/>
      </xs:sequence>
    </xs:complexType>

    <xs:complexType name="product-on-sale">
      <xs:complexContent>
        <xs:extension base="product">
          <xs:sequence>
            <xs:element name="price" type="xs:decimal"/>
          </xs:sequence>
        </xs:extension>
      </xs:complexContent>
    </xs:complexType>

The simple binding style aligns XML Schema type substitution with Java type substitution, so that Java programmers can use their ordinary Java type substitution techniques when manipulating XML.
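As a rough illustration of that alignment, the two schema types might correspond to a pair of Java interfaces like the ones below. This is a hand-written sketch, not generated XMLBeans output: the implementation classes, the `describe` helper, and the class name `TypeCorrespondenceSketch` are hypothetical names introduced only for this example.

```java
import java.math.BigDecimal;

class TypeCorrespondenceSketch {
    // Corresponds to the schema type "product".
    interface Product {
        String getDescription();
    }

    // Corresponds to "product-on-sale". The schema type extends "product",
    // so the Java interface extends Product as well.
    interface ProductOnSale extends Product {
        BigDecimal getPrice();
    }

    // Hypothetical hand-written implementations standing in for bound classes.
    static class ProductImpl implements Product {
        private final String description;
        ProductImpl(String description) { this.description = description; }
        public String getDescription() { return description; }
    }

    static class ProductOnSaleImpl extends ProductImpl implements ProductOnSale {
        private final BigDecimal price;
        ProductOnSaleImpl(String description, BigDecimal price) {
            super(description);
            this.price = price;
        }
        public BigDecimal getPrice() { return price; }
    }

    // Same logic as the Java fragment at the start of this section.
    static String describe(Product p) {
        String result = "Desc: " + p.getDescription();
        if (p instanceof ProductOnSale) {
            ProductOnSale s = (ProductOnSale) p;
            return result + ", Price: " + s.getPrice();
        }
        return result + ", not on sale";
    }

    public static void main(String[] args) {
        System.out.println(describe(new ProductImpl("Red Balloon")));
        System.out.println(describe(
                new ProductOnSaleImpl("Blue Balloon", new BigDecimal("0.75"))));
    }
}
```

The two objects built in `main` mirror the two XML documents shown earlier: one plain "product" and one substituted "product-on-sale".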

Polymorphism, instanceof, and Reflection

Java provides two main techniques for using type substitution:

1. Polymorphism. A method on a Java base class such as "getDescription()" is guaranteed to also be provided by derived classes. So, when working with base class methods on an instance that might have a substituted class, the programmer does not need to explicitly treat the derived class instances differently: the derived classes can be assumed to provide the same services as the base class.

2. Instanceof and casting. When it is necessary to explicitly detect and handle a more derived substituted class (or rule out an instance of a specific derived class as in the "else" clause above), the "instanceof" operator can detect a subclass, and a class cast operator can provide access to methods that are provided by that derived class.

In the Java toolbox for type substitution, polymorphism is the artful surgeon's scalpel and "instanceof" is a prosaic kitchen knife. There is a third technique in Java for working with type substitution, used less often, that is powerful but crude; it could be compared to a hacksaw:

3. Reflection and Object.getClass(). The final and least elegant technique for dealing with Java type substitution is to explicitly reflect on the class metadata for an instance object.

For type substitution to work as Java programmers expect, these techniques need to work when working with XML in Java. In particular:

1. Polymorphism. Any methods provided on a base class must of course be present on the derived class. Moreover, the base methods should provide the same service and have the same behavior on derived classes, so that programmers using polymorphism do not need to use "instanceof" when calling methods on the base class. This raises some questions about XML Schema inheritance by restriction, which we will discuss in this article.

2. Instanceof and casting. The "instanceof" operator must be able to be used not only to detect the presence of a needed subtype, but also to rule out the presence of an unwanted subtype. In other words, both "instanceof" and "!instanceof" must work. This has some implications for element substitution, which we will discuss.

3. Reflection. Schema type metadata differs from Java class metadata, so getClass() cannot be expected to return all the relevant reflective information for a schema type. However, there should be a runtime method that does return the schema type metadata. In the simple binding style, this is the XmlObject.schemaType() method. The SchemaType "XML reflection" API is a big topic which we will leave for a future article.
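The three techniques can be put side by side in a small stand-alone sketch. Everything here is hypothetical and hand-written: the `Bound` interface and its `schemaTypeName()` method are simplified stand-ins for the idea behind XmlObject.schemaType(), returning only a name rather than real schema metadata.

```java
class ReflectionSketch {
    // Hypothetical universal base: each bound object can name its schema type.
    interface Bound {
        String schemaTypeName();
    }

    interface Product extends Bound {
        String getDescription();
    }

    interface ProductOnSale extends Product {
        double getPrice();
    }

    static class ProductImpl implements Product {
        public String getDescription() { return "Red Balloon"; }
        public String schemaTypeName() { return "product"; }
    }

    static class ProductOnSaleImpl extends ProductImpl implements ProductOnSale {
        @Override public String getDescription() { return "Blue Balloon"; }
        public double getPrice() { return 0.75; }
        @Override public String schemaTypeName() { return "product-on-sale"; }
    }

    // 1. Polymorphism: no need to know the actual class.
    static String viaPolymorphism(Product p) { return p.getDescription(); }

    // 2. instanceof: detect (or rule out) a substituted subtype.
    static boolean isOnSale(Product p) { return p instanceof ProductOnSale; }

    // 3. Reflection: getClass() reports Java class metadata only...
    static String javaClassName(Product p) { return p.getClass().getSimpleName(); }

    // ...so schema metadata needs its own runtime accessor.
    static String schemaName(Product p) { return p.schemaTypeName(); }

    public static void main(String[] args) {
        Product p = new ProductOnSaleImpl();
        System.out.println(viaPolymorphism(p));
        System.out.println(isOnSale(p));
        System.out.println(javaClassName(p));
        System.out.println(schemaName(p));
    }
}
```

Note how the Java class name and the schema type name differ for the same instance, which is why getClass() alone cannot serve as schema reflection.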

Instanceof and the XmlObject base class

In the XMLBeans simplified binding style, the XML Schema anyType is bound to a special "universal base class" interface called XmlObject. It is often asked: why not bind anyType to java.lang.Object instead? Perhaps, for example, a java.lang.Object binding would allow you to return an org.w3c.dom.Node when you don't have a more specific type, or a different class when you do have an xsi:type that gives you more type information.

The problem with this approach is that it doesn't preserve the following invariant:

If the XML type derived inherits from type base, then in Java, "x instanceof Derived" must imply "x instanceof Base".

Why do we care? Because it is very hard to write correct code when the invariant isn't true, and that leads to terrible bugs in practice.

For example, suppose that instances of xs:anyType could be returned as implementations of org.w3c.dom.Node, and properties were typed as java.lang.Object so that more specific implementations could also be returned if xsi:type information were available. Then a schema type that had the following element would bind to a java.lang.Object property signature as follows:

    <xs:complexType name="docs">
      <xs:sequence>
        <xs:element name="howto" type="xs:anyType"/>
      </xs:sequence>
    </xs:complexType>
    <xs:element name="doc" type="docs"/>

    interface Docs {
        java.lang.Object getHowto(); // May return a DOM node etc
    }

Then you might write code as follows:

    Object howto = obj.getHowto();
    String flatText = "unknown";
    if (howto instanceof org.w3c.dom.Node) {
        org.w3c.dom.Node n = (org.w3c.dom.Node) howto;
        flatText = flatten(n); // some DOM code
    }
    System.out.println(flatText);

Perhaps the "flatten" function eliminates all markup from the given DOM tree and returns the result as flat text. For example, providing the following input to the program above would output "how to avoid trouble".

<doc> How to avoid trouble. </doc>

Great!

But the problem with this approach comes from the fact that we are using "Node" to stand in for the "anyType", and yet when we have a specific type that extends anyType, the corresponding Java instance may not implement "Node".

For example, suppose we had a bound type "htmlContent" tied to a Java class "HtmlContent" and we used it in our XML:

<doc xsi:type="h:htmlContent" xmlns:xsi="..." xmlns:h="..."> How to avoid trouble. </doc>

In this case our code would break. Our "howto" variable would hold an instance of "HtmlContent" which would not implement org.w3c.dom.Node, and we would get the result "unknown".

The key is that the design above has broken the invariant:

If the XML type derived inherits from type base, then in Java, "x instanceof Derived" must imply "x instanceof Base".

Here, "x instanceof HtmlContent" is true while "x instanceof Node" is false, which is the source of our bug.
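The failure can be reproduced in a few lines without any XML machinery. In this sketch `NodeLike` is a hypothetical stand-in for org.w3c.dom.Node, and `PlainNode` and `HtmlContent` are made-up classes playing the roles of the untyped and the xsi:type-substituted instances.

```java
class BrokenInvariantSketch {
    // Stand-in for org.w3c.dom.Node, used to represent "anyType".
    interface NodeLike {
        String text();
    }

    // What getHowto() returns when no more specific type is known.
    static class PlainNode implements NodeLike {
        private final String text;
        PlainNode(String text) { this.text = text; }
        public String text() { return text; }
    }

    // Bound class for the "htmlContent" type. It does NOT implement
    // NodeLike, which is exactly what breaks the invariant.
    static class HtmlContent {
        final String html;
        HtmlContent(String html) { this.html = html; }
    }

    // The client code from the text: it only knows how to handle nodes.
    static String flatText(Object howto) {
        String flatText = "unknown";
        if (howto instanceof NodeLike) {
            flatText = ((NodeLike) howto).text();
        }
        return flatText;
    }

    public static void main(String[] args) {
        System.out.println(flatText(new PlainNode("How to avoid trouble.")));
        System.out.println(flatText(new HtmlContent("How to avoid trouble.")));
    }
}
```

The second call silently falls through to "unknown" even though the data is just as usable, because the derived type is not an instance of the type standing in for its base.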

Using a specific XmlObject class solves this problem. The simplified binding style provides the following binding:

    interface Docs {
        XmlObject getHowto(); // Always returns an XmlObject
    }

Then the code is written as follows:

    XmlObject howto = obj.getHowto();
    org.w3c.dom.Node n = howto.domNode();
    String flatText = flatten(n);
    System.out.println(flatText);

Notice that this code has the following advantages over the previous design:

It is simpler and shorter. (There is no "if".)
It is fully typechecked and type-correct. (There is no downcast needed.)
Most importantly, it works on all data.

Since every type, including a bound type like HtmlContent, extends XmlObject, this code works all the time.
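A minimal model of this arrangement is sketched below. `XmlObjectLike` is a hypothetical interface playing the role of XmlObject, `NodeLike` stands in for org.w3c.dom.Node, and the implementation classes are invented for the example; none of this is the real XMLBeans API.

```java
class UniversalBaseSketch {
    // Stand-in for org.w3c.dom.Node.
    interface NodeLike {
        String text();
    }

    // Hypothetical universal base, playing the role of XmlObject.
    interface XmlObjectLike {
        NodeLike domNode();
    }

    // A more specific bound type; like every bound type, it extends the base.
    interface HtmlContent extends XmlObjectLike {
        String getHtml();
    }

    static class AnyTypeImpl implements XmlObjectLike {
        private final String text;
        AnyTypeImpl(String text) { this.text = text; }
        public NodeLike domNode() { return () -> text; }
    }

    static class HtmlContentImpl extends AnyTypeImpl implements HtmlContent {
        private final String html;
        HtmlContentImpl(String text, String html) {
            super(text);
            this.html = html;
        }
        public String getHtml() { return html; }
    }

    // No "if" and no downcast: every possible argument supports domNode().
    static String flatText(XmlObjectLike howto) {
        return howto.domNode().text();
    }

    public static void main(String[] args) {
        System.out.println(flatText(new AnyTypeImpl("How to avoid trouble.")));
        System.out.println(flatText(
                new HtmlContentImpl("How to avoid trouble.", "<b>How</b> to avoid trouble.")));
    }
}
```

Both calls take the same code path, so the substituted type no longer produces a different (wrong) answer.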

So as you can see, the requirement that "instanceof" work correctly when Derived derives from Base leads us to an XmlObject universal base type. For code to work correctly, you also need "!instanceof" to work correctly when NotDerived does not derive from Base, and that leads us to other constraints.

Instanceof Implies Direct Instance Classes

The simplified binding style binds to Java classes that are interfaces (not specific class implementations), so you might imagine that the binding style allows you to implement multiple bound interfaces with a single class (perhaps to reduce code size).

However, it turns out that the correct behavior of "instanceof" under type substitution does require that implementations supply distinct implementation classes for distinct types such as "Product" and "ProductOnSale".

Why can't an implementation "save" on code size and simply implement "Product" and "ProductOnSale" using a single class that can play both roles? What would go wrong? Suppose we had just a single shared implementation class for all Product and ProductOnSale instances:

class SharedImplClass implements Product, ProductOnSale {...}

If all Product or ProductOnSale instances were implemented by the same SharedImplClass, then the test "(obj instanceof ProductOnSale)" would always return true! The following lines of code would not work, because it would appear that all products were ProductOnSale:

    if (product instanceof ProductOnSale)
        System.out.println("Product is on sale.");
    else
        System.out.println("Product is not on sale.");

So the correct behavior of "instanceof" requires that there actually be a concrete Java class for each type that can be instantiated and tested via "instanceof". Ensuring the correct behavior of "instanceof" leads to the requirement that not only must there be one (abstract or interface) Java class for each schema type, but there must also be at least one concrete Java class for each nonabstract schema type.
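The difference is easy to observe directly. The sketch below contrasts a shared implementation class with distinct ones; all class names and the `saleStatus` helper are hypothetical and exist only for this illustration.

```java
class SharedImplSketch {
    interface Product {
        String getDescription();
    }

    interface ProductOnSale extends Product {
        double getPrice();
    }

    // One class playing both roles: every instance "is a" ProductOnSale.
    static class SharedImpl implements Product, ProductOnSale {
        private final String description;
        private final double price;
        SharedImpl(String description, double price) {
            this.description = description;
            this.price = price;
        }
        public String getDescription() { return description; }
        public double getPrice() { return price; }
    }

    // Distinct concrete classes, one per schema type.
    static class ProductImpl implements Product {
        private final String description;
        ProductImpl(String description) { this.description = description; }
        public String getDescription() { return description; }
    }

    static class ProductOnSaleImpl extends ProductImpl implements ProductOnSale {
        private final double price;
        ProductOnSaleImpl(String description, double price) {
            super(description);
            this.price = price;
        }
        public double getPrice() { return price; }
    }

    static String saleStatus(Product product) {
        if (product instanceof ProductOnSale)
            return "Product is on sale.";
        else
            return "Product is not on sale.";
    }

    public static void main(String[] args) {
        // An ordinary product, but the shared class cannot say so.
        System.out.println(saleStatus(new SharedImpl("Red Balloon", 0.0)));
        // With distinct classes the test answers correctly.
        System.out.println(saleStatus(new ProductImpl("Red Balloon")));
        System.out.println(saleStatus(new ProductOnSaleImpl("Blue Balloon", 0.75)));
    }
}
```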

Here are two axioms for type substitution:

If the XML type derived inherits from type base, then in Java, "x instanceof Derived" must imply "x instanceof Base".
If the XML type notderived does not inherit from base, then in Java, "x instanceof NotDerived" must imply "!(x instanceof Base)".

It is especially important to keep this second axiom in mind in the next section, when we analyze how element substitution should work.

Element Substitution as Distinct from Type Correspondence

Type correspondence works very well for type substitution, so it is tempting to apply the same technique for element substitution, assigning a Java class for each declared element and aligning element inheritance with class inheritance.

However, the element-class strategy does not work: if Schema element declarations are translated into Java classes, the number of classes that are required at runtime is the product of all schema types and substitutable element declarations. In other words, using Java classes for schema elements would result in a huge (nonlinear) number of classes.

Here is an example. Continuing our example from the previous section which defines a type "product-on-sale" that derives from "product", consider a substitution group of elements that can substitute for the <item> element, whose type is product:

    <xs:element name="item" type="product"/>
    <xs:element substitutionGroup="item" name="hot-item" type="product"/>
    <xs:element substitutionGroup="item" name="cool-item" type="product"/>

Any bound Java class that contains a reference to the "<item>" element declaration will have a getItem() method that is declared to return an object of type Product.

Here is how we would use Java classes to simulate element substitution, if we were to do so. First, each declared element such as "item" would correspond to a Java class "Item" that inherited from its declared type, in this case "product".

interface Item extends Product {}

We might even declare getItem() to return Items rather than merely Products.

Then since <hot-item> and <cool-item> can also substitute for <item> (and can be returned from getItem()), they would have to extend Item:

    interface HotItem extends Item {}
    interface CoolItem extends Item {}

We then must think about the classes of actual instance objects rather than just the declared classes of method signatures. When holding an instance of an <item> element, instanceof must correctly report that we have an Item and not a HotItem.

    if (obj instanceof Item && !(obj instanceof HotItem))
        System.out.println("It is an item but not a hot-item.");

So, just as we saw in the last section, for the correct instanceof behavior, any implementation must be able to supply at least one concrete instance class for each declared element:

    class ItemImpl implements Item {...}
    class HotItemImpl extends ItemImpl implements HotItem {...}
    class CoolItemImpl extends ItemImpl implements CoolItem {...}

So far, so good. But next we run into a problem. This scheme explodes when we superimpose it with the same requirement for distinct instance classes that appears for types. This is because, in addition to substituting <hot-item> for <item>, XML Schema also permits substitution of xsi:type="product-on-sale" for the declared type "product". In order for "instanceof" to be meaningful, we would now need six concrete classes:

	Product only	ProductOnSale
instanceof Item only	ItemProduct	ItemProductOnSale
instanceof HotItem	HotItemProduct	HotItemProductOnSale
instanceof CoolItem	CoolItemProduct	CoolItemProductOnSale

All six classes are needed, because "instanceof" code in different combinations such as the following must be able to return six possible answers when testing for Item or Product substitutions:

    if ((obj instanceof HotItem) && !(obj instanceof ProductOnSale))
        System.out.println("We have a waiting list for this item");

As you can see, the number of classes needed is at least the product of (size of substitution group) x (number of types that can be substituted). The first number is large whenever substitution groups are used extensively, and the second number can be very large, especially if the declared type of the base element is the default "anyType" (in which case it is "all globally declared types").

It is acceptable for a binding solution to produce a linear number of generated classes (i.e., for a schema with twice as many components, generate twice as many classes). However, it is unacceptable for a binding solution to be required to generate a quadratic number of classes.
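The growth is simple to tabulate. The helper below (a hypothetical illustration, not part of any binding tool) enumerates one concrete class name per element/type combination, reproducing the six names in the table above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ClassExplosionSketch {
    // One concrete class is needed per (element, type) pair, so the result
    // has elements.size() * types.size() entries.
    static List<String> requiredClasses(List<String> elements, List<String> types) {
        List<String> names = new ArrayList<>();
        for (String element : elements) {
            for (String type : types) {
                names.add(element + type);
            }
        }
        return names;
    }

    public static void main(String[] args) {
        List<String> elements = Arrays.asList("Item", "HotItem", "CoolItem");
        List<String> types = Arrays.asList("Product", "ProductOnSale");
        List<String> names = requiredClasses(elements, types);
        System.out.println(names.size() + " classes: " + names);
    }
}
```

Doubling both the substitution group and the set of substitutable types quadruples the count, which is the quadratic behavior the text rules out.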

It is possible to defer the type explosion from compile time to runtime in Java through the use of dynamic proxies. However, that technique would also impose a layer of required inefficiency on the design; it just shifts the burden from compile time to runtime.
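For readers unfamiliar with the technique, here is a minimal sketch of what the dynamic-proxy approach could look like using java.lang.reflect.Proxy from the standard library. The interfaces and the `newInstance` helper are hypothetical; the point is only that the element/type combination is assembled at runtime instead of being compiled as its own class.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;

class DynamicProxySketch {
    interface Product { String getDescription(); }
    interface ProductOnSale extends Product { double getPrice(); }
    interface Item extends Product {}
    interface HotItem extends Item {}

    // Builds an object implementing exactly the requested element interface
    // and type interface. No combined class such as HotItemProductOnSale
    // is ever written or generated at compile time.
    static Object newInstance(Class<?> elementInterface, Class<?> typeInterface,
                              String description, double price) {
        InvocationHandler handler = (proxy, method, args) -> {
            // Every call is dispatched reflectively by method name.
            switch (method.getName()) {
                case "getDescription": return description;
                case "getPrice": return price;
                default: throw new UnsupportedOperationException(method.getName());
            }
        };
        return Proxy.newProxyInstance(
                DynamicProxySketch.class.getClassLoader(),
                new Class<?>[] { elementInterface, typeInterface },
                handler);
    }

    public static void main(String[] args) {
        Object hotFullPrice = newInstance(HotItem.class, Product.class, "Kite", 0.0);
        System.out.println(hotFullPrice instanceof HotItem);
        System.out.println(hotFullPrice instanceof ProductOnSale);

        Object hotOnSale = newInstance(HotItem.class, ProductOnSale.class, "Kite", 2.5);
        System.out.println(((ProductOnSale) hotOnSale).getPrice());
    }
}
```

The "instanceof" answers come out right, but each method call now passes through the reflective handler, which is the runtime cost referred to above.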

Element Names as Data Rather than Types

The discussion so far has explained why it is not feasible to arrange Java class inheritance in a way that permits "instanceof" to be used to detect substitution of elements. What is the correct approach?

The simple binding style solution is straightforward: since substituted element names cannot be treated as Java type metadata, they must be treated as Java instance data.

1. In keeping with the intent of substitution groups as a way of "substituting" elements, getters corresponding to elements that are the head of a substitution group should return all elements in the substitution group. For example, getItemArray() will return an array that represents all the <item>, <hot-item>, and <cool-item> elements.

2. In light of the discussion in the last section, the instances that are returned should all implement Java classes that correspond to the schema types of the instance data. To avoid a class explosion, they are not required to implement additional classes that correspond to the elements. The schema spec guarantees that when an element is substituted, its type is also substitutable for the type of the head element, so the declared Java return type always holds.

3. Then, to make the element substitution detectable and accessible to the programmer, the element names used in the xml data are made available by a method on the associated instance in Java.

For example:

    Product item = container.getItem();
    if (item.nodeQName().equals(ItemDocument.QNAME_ITEM))
        System.out.println("An ordinary item");
    else if (item.nodeQName().equals(ItemDocument.QNAME_HOT_ITEM))
        System.out.println("A hot item");
    else if (item.nodeQName().equals(ItemDocument.QNAME_COOL_ITEM))
        System.out.println("A cool item");

For this to be convenient, constants (such as QNAME_ITEM) should be generated for the relevant QNames.

4. Similarly, setter methods must be available that are keyed off of specific element names, to permit construction of instances that use substitution groups.

For example:

    container.add(ItemDocument.QNAME_HOT_ITEM, hotProduct);
    container.add(ItemDocument.QNAME_ITEM, ordinaryProduct);
    // the second line above is perfectly equivalent to:
    // container.addItem(ordinaryProduct);

Although XMLBeans v1 fully supports substitution groups, it does not provide methods as easy as the ones illustrated above. For example, it provides neither "add" methods such as the ones above nor the "nodeQName" method; you must use XmlCursor to access that functionality. But XMLBeans v2 will probably add these kinds of methods.
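To make the four points above concrete without depending on any particular XMLBeans version, here is a small stand-alone model of the idea. It is a hypothetical sketch: `Container`, `Product`, `classify`, and the QName constants are hand-written for this example, and only javax.xml.namespace.QName comes from the standard library.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import javax.xml.namespace.QName;

class ElementNameAsDataSketch {
    static final QName QNAME_ITEM = new QName("item");
    static final QName QNAME_HOT_ITEM = new QName("hot-item");
    static final QName QNAME_COOL_ITEM = new QName("cool-item");

    // Members of the substitution group headed by <item>.
    static final List<QName> ITEM_GROUP =
            Arrays.asList(QNAME_ITEM, QNAME_HOT_ITEM, QNAME_COOL_ITEM);

    // One class per schema type; the element name is ordinary instance data.
    static class Product {
        private final String description;
        private QName nodeQName;
        Product(String description) { this.description = description; }
        String getDescription() { return description; }
        QName nodeQName() { return nodeQName; }
    }

    static class Container {
        private final List<Product> items = new ArrayList<>();

        // Setter keyed off a specific element name (point 4).
        void add(QName elementName, Product p) {
            if (!ITEM_GROUP.contains(elementName)) {
                throw new IllegalArgumentException(
                        "Not substitutable for item: " + elementName);
            }
            p.nodeQName = elementName;
            items.add(p);
        }

        void addItem(Product p) { add(QNAME_ITEM, p); }

        // Getter for the head element returns the whole group (point 1).
        Product[] getItemArray() { return items.toArray(new Product[0]); }
    }

    // Substitution is detected by comparing names, not by instanceof (point 3).
    static String classify(Product item) {
        if (QNAME_HOT_ITEM.equals(item.nodeQName())) return "A hot item";
        if (QNAME_COOL_ITEM.equals(item.nodeQName())) return "A cool item";
        return "An ordinary item";
    }

    public static void main(String[] args) {
        Container container = new Container();
        container.add(QNAME_HOT_ITEM, new Product("Kite"));
        container.addItem(new Product("Red Balloon"));
        for (Product p : container.getItemArray()) {
            System.out.println(p.getDescription() + ": " + classify(p));
        }
    }
}
```

Only one Java class exists for the type, no matter how many element names join the substitution group, so the class count stays linear.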

The idea of using element names as instance data to parameterize write-access to data is also relevant when discussing wildcards. We will discuss wildcards in the next article in the series.

Inheritance by Restriction

We finish this article with a short discussion of inheritance by restriction. In Java, the only kind of inheritance is inheritance by extension, and it always works by adding or overriding methods on a class. However, in schema, there are two forms of complex type inheritance:

Inheritance by extension, where additional data is added at the end of a type's content model.
Inheritance by restriction, where a new content model is defined that is guaranteed to be a subset of the base content model.

Here is an example of three types that are related to each other via inheritance by restriction:

    <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
               xmlns:tns="http://rest/"
               targetNamespace="http://rest/"
               elementFormDefault="qualified">

      <xs:complexType name="base">
        <xs:sequence>
          <xs:any namespace="##targetNamespace" minOccurs="0" maxOccurs="3"/>
        </xs:sequence>
      </xs:complexType>

      <xs:complexType name="derived1">
        <xs:complexContent>
          <xs:restriction base="tns:base">
            <xs:sequence>
              <xs:element name="first"/>
              <xs:element name="middle" minOccurs="0"/>
              <xs:element name="last"/>
            </xs:sequence>
          </xs:restriction>
        </xs:complexContent>
      </xs:complexType>

      <xs:complexType name="derived2">
        <xs:complexContent>
          <xs:restriction base="tns:derived1">
            <xs:sequence>
              <xs:element name="first"/>
              <xs:element name="last"/>
            </xs:sequence>
          </xs:restriction>
        </xs:complexContent>
      </xs:complexType>

    </xs:schema>

In words, the example above defines a type "derived2" that derives from "derived1", which in turn derives from "base".

Type	Definition
base	Permits ANY zero to 3 elements in the target NS
derived1	Requires <first> (<middle> optional) <last>
derived2	Requires <first> <last>, with no <middle> allowed

As you can see, derivation by restriction is a form of derivation by subsetting. The set of instances permitted by "derived1" is a subset of the set of instances permitted by "base", and the set of instances permitted by "derived2" is a subset of the set of instances permitted by "derived1".

How should the two derived types be bound in Java? Clearly for type correspondence to work, they must both be bound to classes that inherit from the corresponding classes for the base types. But in particular, there are two natural questions:

How should the <middle> element be bound, since it seems to disappear from derived2?
How should the wildcard (the <any>) in "base" be bound, since it seems to disappear from derived1?

Let us take a look at the <middle> element question first.

It is a common misconception that inheritance by restriction allows derived types to "remove" elements from a base type's content model. That is not a correct description of derivation by restriction. A derived restriction can only impose further restrictions on degrees of freedom that were already present in the base type. So, for example, the reason the <middle> tag is allowed to be "removed" when derived2 restricts derived1 is that it is already optional in derived1. In contrast, an element such as the <first> tag cannot be "removed" because it is required in the base type derived1. Notice that <middle> is allowed to be "added" when derived1 restricts "base", because it is specializing a wildcard which already permits <middle>.

What should the bound type Derived2 look like? The answer is that its signature should look exactly like that of Derived1, but implementations happen to always return "null" from the getMiddle() call whenever the data is valid.

An ordinary Java programmer working with a variable of type "Derived1" would want to write polymorphic code like the following:

    Derived1 derived = computeDerivedData();
    String first = derived.getFirst();
    String middle = derived.getMiddle();
    String last = derived.getLast();

In particular, polymorphism should guarantee that the "getMiddle()" method works correctly and returns the right value regardless of whether the instance is actually a "Derived1" as declared, or a substituted subclass such as "Derived2". The only special thing about the "missing" element is that it is always missing for valid instances of "derived2", so the method can always be expected to return null.
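A minimal sketch of this binding is shown below. The interfaces follow the shape described in the text; the implementation classes and the `fullName` helper are hypothetical, and String is assumed as the property type purely for illustration.

```java
class RestrictionSketch {
    // Bound type for "derived1": <first>, optional <middle>, <last>.
    interface Derived1 {
        String getFirst();
        String getMiddle(); // null when <middle> is absent
        String getLast();
    }

    // Bound type for "derived2": the same signature, nothing removed.
    interface Derived2 extends Derived1 {}

    static class Derived1Impl implements Derived1 {
        private final String first, middle, last;
        Derived1Impl(String first, String middle, String last) {
            this.first = first;
            this.middle = middle;
            this.last = last;
        }
        public String getFirst() { return first; }
        public String getMiddle() { return middle; }
        public String getLast() { return last; }
    }

    // Valid "derived2" data never has a <middle>, so getMiddle() is null.
    static class Derived2Impl extends Derived1Impl implements Derived2 {
        Derived2Impl(String first, String last) {
            super(first, null, last);
        }
    }

    // Polymorphic client code written against Derived1 only.
    static String fullName(Derived1 derived) {
        String middle = derived.getMiddle();
        if (middle == null) {
            return derived.getFirst() + " " + derived.getLast();
        }
        return derived.getFirst() + " " + middle + " " + derived.getLast();
    }

    public static void main(String[] args) {
        System.out.println(fullName(new Derived1Impl("John", "Quincy", "Adams")));
        System.out.println(fullName(new Derived2Impl("John", "Adams")));
    }
}
```

The client never needs "instanceof": code that already handles an absent optional element handles the restricted subtype for free.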

The second question when examining inheritance by restriction is, how should wildcards such as the <xs:any> found in the "base" type be bound?

The answer provided by the simple binding model is "wildcards are not bound to any generated method at all" - because open element and attribute content is always accessible by generic accessors. An additional method is unnecessary and not provided by the simple binding style.

The omission is not an oversight, but a result of careful design! But that is a topic for another day. The detailed reasons for this conscious omission will be discussed in the next article in this series.

The Principle of Type Correspondence

In this article, we've discussed a few of the implications of the principle of type correspondence, and how it impacts the binding of a few XML Schema features to Java. For Java programmers to be able to write correct code in the presence of type and element substitution, they need the following two invariants to hold true:

1. If the XML type derived inherits from type base, then in Java, "x instanceof Derived" must imply "x instanceof Base".
2. If the XML type notderived does not inherit from base, then in Java, "x instanceof NotDerived" must imply "!(x instanceof Base)".

In particular, we have seen that these invariants have the following ramifications:

The universal base type in schema (xs:anyType) needs to correspond to a universal base type for bound XML types in Java, which we call XmlObject.
Binding implementations must provide a concrete implementation class for each schema type, even though the binding style doesn't explicitly specify what those classes are.
Element substitution cannot be modelled using Element interfaces without a quadratic class explosion, so element names must be modelled as data rather than metadata.
Inheritance by restriction does not mean "removing properties" but simply means "constraining properties to have specific values", including possibly constraining them to be always empty in a specific subtype.

In the next article in the series, we will move on to discuss the problem of open content, wildcards, and raw XML infoset access.

Posted by David at December 18, 2003 02:42 PM

Thanks for another well-written article. By the way, it looks like some angle brackets need to be escaped:

"getItemArray() will return an array that represents all the , , and element"

Posted by: Brian Slesinsky at December 19, 2003 12:29 AM

It looks like some escapes are still missing. Here's one example but I think there were others too:

In contrast, an element such as the tag cannot be "removed" because it is required in the base type derived1. Notice that is allowed to be "added" when derived1 restricts , because it is specializing a wildcard which already permits .

Posted by: J. David Beutel at February 10, 2004 09:17 PM

have read a lot about xmlbeans and we have been using it in one of our major projects. I am wondering is there anyone out there that is using this technology.

we are facing some performance bottlenecks and on running the profiler we see it all in com.bea package

ashish

Posted by: ashish at March 18, 2004 04:46 PM

One of the best Xml binding design I have ever read!
Simply, elegant even through formal definitions and logics.
Keep up the good work, I have involved in a project management which last two years and framework like XmlBeans (superior to JAXB) lead me clarity.

About performance issue however i'm not sure it will be IN EVERY CASES a substitute to old Sax techniques :). Thanks a lot for the article anyway.

Posted by: valerio.pace at June 4, 2008 11:58 AM