December 18, 2003
The Design of XMLBeans (Part 3)
We have a number of ongoing series of articles on this weblog. This week, I'm returning to write the third article in the series on the architecture of XMLBeans.
What is type substitution? It is simple - both Java and XML Schema have it. This is what it looks like in Java:
The idea is that, even though a variable "p" is declared with type "Product", it may (or may not) actually hold an instance whose actual type is more specific such as "ProductOnSale".
Java type substitution allows us to write generic code such as p.getDescription() in cases where "all Products behave the same" without rewriting the same line of code for all different kinds of Products. On the other hand, when we need to explicitly distinguish between substituted types, Java also provides an "instanceof" operator which can be used to detect instances of specific subclasses.
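A minimal sketch of Java type substitution follows. Only the names Product, ProductOnSale, and getDescription() come from the discussion above; the method bodies, the getDiscountPercent() method, and the describe() helper are assumptions made for illustration.

```java
class TypeSubstitutionDemo {
    /** Base type; every product can describe itself. */
    static class Product {
        String getDescription() { return "widget"; }
    }

    /** More specific type that may appear wherever a Product is declared. */
    static class ProductOnSale extends Product {
        // Hypothetical extra method, available only on the subclass.
        int getDiscountPercent() { return 20; }
    }

    static String describe(Product p) {
        // Generic code: the same call works for every kind of Product.
        String text = p.getDescription();
        if (p instanceof ProductOnSale) {
            // Explicit detection of the substituted type, followed by a cast.
            ProductOnSale sale = (ProductOnSale) p;
            return text + " (" + sale.getDiscountPercent() + "% off)";
        } else {
            return text;
        }
    }

    public static void main(String[] args) {
        Product p = new ProductOnSale(); // declared Product, actually ProductOnSale
        System.out.println(describe(new Product())); // widget
        System.out.println(describe(p));             // widget (20% off)
    }
}
```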
In XML Schema, type substitution allows an XML element to contain a more specific type than its declared type. The xsi:type attribute gives the name of the more specific type contained in the element. For example, the following two documents are valid according to the schema below. The second document substitutes a "product-on-sale" instance for the default declared "product" type.
For type substitution to be permitted, the "product-on-sale" type needs to be explicitly derived from the "product" type, as in the schema below. The rules of XML Schema inheritance require derived types to share enough common structure with a base type that a substituted type can be "treated as" its base type.
The simple binding style aligns XML Schema type substitution with Java type substitution, so that Java programmers can use their ordinary Java type substitution techniques when manipulating XML.
Polymorphism, instanceof, and Reflection
Java provides two main techniques for using type substitution:
1. Polymorphism. A method on a Java base class such as "getDescription()" is guaranteed to also be provided by derived classes. So, when working with base class methods on an instance that might have a substituted class, the programmer does not need to explicitly treat the derived class instances differently: the derived classes can be assumed to provide the same services as the base class.
2. Instanceof and casting. When it is necessary to explicitly detect and handle a more derived substituted class (or rule out an instance of a specific derived class as in the "else" clause above), the "instanceof" operator can detect a subclass, and a class cast operator can provide access to methods that are provided by that derived class.
In the Java toolbox for type substitution, polymorphism is the artful surgeon's scalpel and "instanceof" is a prosaic kitchen knife. There is a third technique in Java for working with type substitution, used less often, that is powerful but crude; it could be compared to a hacksaw:
3. Reflection and Object.getClass(). The final and least elegant technique for dealing with Java type substitution is to explicitly reflect on the class metadata for an instance object.
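As a rough illustration of the third technique, the sketch below reflects on the runtime class of an instance. The Product and ProductOnSale classes are empty stand-ins, and runtimeTypeName() is a hypothetical helper.

```java
class ReflectionDemo {
    static class Product { }
    static class ProductOnSale extends Product { }

    /** Returns the simple name of the runtime class, not of the declared type. */
    static String runtimeTypeName(Product p) {
        return p.getClass().getSimpleName();
    }

    public static void main(String[] args) {
        Product p = new ProductOnSale();
        System.out.println(runtimeTypeName(p));                           // ProductOnSale
        System.out.println(p.getClass() == Product.class);                // false
        System.out.println(Product.class.isAssignableFrom(p.getClass())); // true
    }
}
```

Note how much more explicit this is than "instanceof": the programmer compares class metadata by hand instead of letting the language do it.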
For type substitution to work as Java programmers expect, these techniques need to work when working with XML in Java. In particular:
1. Polymorphism. Any methods provided on a base class of course must be present on the derived class. Moreover, the base methods should provide the same service and have the same behavior on derived classes, so that programmers using polymorphism do not need to use "instanceof" when calling methods on the base class. This raises some questions about XML Schema inheritance by restriction, which we will discuss in this article.
2. Instanceof and casting. The "instanceof" operator must be able to be used not only to detect the presence of a needed subtype, but also to rule out the presence of an unwanted subtype. In other words, both "instanceof" and "!instanceof" must work. This has some implications for element substitution, which we will discuss.
3. Reflection. Schema type metadata differs from Java class metadata, so getClass() cannot be expected to return all the relevant reflective information for a schema type. However, there should be a runtime method that does return the schema type metadata. In the simple binding style, this is the XmlObject.schemaType() method. The SchemaType "XML reflection" API is a big topic which we will leave for a future article.
Instanceof and the XmlObject base class
In the XMLBeans simplified binding style, the XML Schema anyType is bound to a special "universal base class" interface called XmlObject. It is often asked, why not bind anyType to java.lang.Object instead? Perhaps, for example, a java.lang.Object binding would allow you to return an org.w3c.dom.Node when you don't have a more specific type, or a different class when you do have an xsi:type that gives you more type information.
The problem with this approach is that it doesn't preserve the following invariant:
Why do we care? Because it is very hard to write correct code when the invariant isn't true, and that leads to terrible bugs in practice.
For example, suppose that instances of xs:anyType could be returned as implementations of org.w3c.dom.Node, and properties were typed as java.lang.Object so that more specific implementations could also be returned if xsi:type information were available. Then a schema type that had the following element would bind to a java.lang.Object property signature as follows:
Then you might write code as follows:
Perhaps the "flatten" function eliminates all markup from the given DOM tree and returns the result as flat text. For example, providing the following input to the program above would output "how to avoid trouble".
But the problem with this approach comes from the fact that we are using "Node" to stand in for the "anyType", and yet when we have a specific type that extends anyType, the corresponding Java instance may not implement "Node".
For example, suppose we had a bound type "htmlContent" tied to a Java class "HtmlContent" and we used it in our XML:
In this case our code would break. Our "howto" variable would hold an instance of "HtmlContent" which would not implement org.w3c.dom.Node, and we would get the result "unknown".
The key is that the design above has broken the invariant:
Here, "x instanceof HtmlContent" is true while "x instanceof Node" is false, which is the source of our bug.
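The failure can be reproduced with a small sketch. The HtmlContent class, the describe() and parse() helpers, the implementation of flatten() via getTextContent(), and the sample markup are all assumptions for illustration; only the Object-typed property, the "flatten" idea, and the "unknown" result come from the discussion above.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class BrokenInvariantDemo {
    /** Hypothetical bound class for "htmlContent"; note that it does not implement Node. */
    static class HtmlContent {
        final String text;
        HtmlContent(String text) { this.text = text; }
    }

    /** Strips markup and returns the flat text of a DOM subtree. */
    static String flatten(Node node) {
        return node.getTextContent();
    }

    /** The property is typed as Object, so the caller must guess what it holds. */
    static String describe(Object howto) {
        if (howto instanceof Node) {
            return flatten((Node) howto);
        }
        return "unknown";
    }

    /** Parses a small XML string and returns its root element as a Node. */
    static Node parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Object untyped = parse("<howto>how to <b>avoid</b> trouble</howto>");
        Object typed = new HtmlContent("how to avoid trouble");
        System.out.println(describe(untyped)); // how to avoid trouble
        System.out.println(describe(typed));   // unknown
    }
}
```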
Using a specific XmlObject class solves this problem. The simplified binding style provides the following binding:
Then the code is written as follows:
Notice that this code has the following advantages over the previous design:
Since every type, including a bound type like HtmlContent, extends XmlObject, this code works all the time.
So as you can see, the requirement that "instanceof" work correctly when Derived derives from Base leads us to an XmlObject universal base type. For code to work correctly, you also need "!instanceof" to work correctly when NotDerived does not derive from Base, and that leads us to other constraints.
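To make the contrast concrete, here is a minimal sketch of the universal-base-type idea. The XmlObject interface below is a stand-in written for this example, not the real XMLBeans interface, and flatText(), untyped(), and html() are hypothetical helpers.

```java
class UniversalBaseDemo {
    /** Stand-in for the universal base interface; NOT the real XMLBeans type. */
    interface XmlObject {
        String flatText(); // hypothetical accessor for the flattened text
    }

    /** Hypothetical bound interface for "htmlContent"; it extends the base. */
    interface HtmlContent extends XmlObject { }

    static XmlObject untyped(String text) { return () -> text; }
    static HtmlContent html(String text) { return () -> text; }

    /** The property is typed as XmlObject, so no instanceof guess is needed. */
    static String describe(XmlObject howto) {
        return howto.flatText();
    }

    public static void main(String[] args) {
        System.out.println(describe(untyped("how to avoid trouble"))); // how to avoid trouble
        System.out.println(describe(html("how to avoid trouble")));    // how to avoid trouble
        System.out.println(html("x") instanceof XmlObject);            // true
    }
}
```

Because every bound type extends the base, "x instanceof HtmlContent" now implies "x instanceof XmlObject", and the "unknown" branch disappears.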
Instanceof Implies Direct Instance Classes
The simplified binding style binds to Java classes that are interfaces (not specific class implementations), so you might imagine that the binding style allows you to implement multiple bound interfaces with a single class (perhaps to reduce code size).
However, it turns out that the correct behavior of "instanceof" under type substitution does require that implementations supply distinct implementation classes for distinct types such as "Product" and "ProductOnSale".
Why can't an implementation "save" on code size and simply implement "Product" and "ProductOnSale" using a single class that can play both roles? What would go wrong? Suppose we had just a single shared implementation class for all Product and ProductOnSale instances:
If all Product or ProductOnSale instances were implemented by the same SharedImplClass, then the test "(obj instanceof ProductOnSale)" would always return true! The following lines of code would not work, because it would appear that all products were ProductOnSale:
So the correct behavior of "instanceof" requires that there actually be a concrete Java class for each type that can be instantiated and tested via "instanceof". Ensuring the correct behavior of "instanceof" leads to the requirement that not only must there be one (abstract or interface) Java class for each schema type, but there must also be at least one concrete Java class for each nonabstract schema type.
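A small sketch of the problem, assuming empty Product and ProductOnSale interfaces. SharedImplClass is the name used above; ProductImpl, ProductOnSaleImpl, and isOnSale() are hypothetical names for this example.

```java
class SharedImplDemo {
    interface Product { }
    interface ProductOnSale extends Product { }

    /** Problematic: a single class plays both roles. */
    static class SharedImplClass implements Product, ProductOnSale { }

    /** One concrete class per nonabstract schema type. */
    static class ProductImpl implements Product { }
    static class ProductOnSaleImpl implements ProductOnSale { }

    static boolean isOnSale(Product p) {
        return p instanceof ProductOnSale;
    }

    public static void main(String[] args) {
        Product ordinary = new SharedImplClass(); // meant to be an ordinary product
        System.out.println(isOnSale(ordinary));                // true (wrong answer)
        System.out.println(isOnSale(new ProductImpl()));       // false
        System.out.println(isOnSale(new ProductOnSaleImpl())); // true
    }
}
```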
Here are two axioms for type substitution:
It is especially important to keep this second axiom in mind in the next section, when we analyze how element substitution should work.
Element Substitution as Distinct from Type Correspondence
Type correspondence works very well for type substitution, so it is tempting to apply the same technique for element substitution, assigning a Java class for each declared element and aligning element inheritance with class inheritance.
However, the element-class strategy does not work: if Schema element declarations are translated into Java classes, the number of classes that are required at runtime is the product of all schema types and substitutable element declarations. In other words, using Java classes for schema elements would result in a huge (nonlinear) number of classes.
Here is an example. Continuing our example from the previous section which defines a type "product-on-sale" that derives from "product", consider a substitution group of elements that can substitute for the <item> element, whose type is product:
Any bound Java class that contains a reference to the "<item>" element declaration will have a getItem() method that is declared to return an object of type Product.
Here is how we would use Java classes to simulate element substitution, if we were to do so. First, each declared element such as "item" would correspond to a Java class "Item" that inherited from its declared type, in this case "product".
We might even declare getItem() to return Items rather than merely Products.
Then since <hot-item> and <cool-item> can also substitute for <item> (and can be returned from getItem()), they would have to extend Item:
We then must think about the classes of actual instance objects rather than just the declared classes of method signatures. When holding an instance of an <item> element, instanceof must correctly report that we have an Item and not a HotItem.
So, just as we saw in the last section, for the correct instanceof behavior, any implementation must be able to supply at least one concrete instance class for each declared element:
So far, so good. But next we run into a problem. This scheme explodes when we superimpose it with the same requirement for distinct instance classes that appears for types. This is because, in addition to substituting <hot-item> for <item>, XML Schema also permits substitution of xsi:type="product-on-sale" for the declared type "product". In order for "instanceof" to be meaningful, we would now need six concrete classes:
All six classes are needed, because "instanceof" code in different combinations such as the following must be able to return six possible answers when testing for Item or Product:
As you can see, the number of classes needed is at least the product of (size of substitution group) x (number of types that can be substituted). The first number is large whenever substitution groups are used extensively, and the second number can be very large, especially if the declared type of the base element is the default "anyType" (in which case it is "all globally declared types").
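The arithmetic can be checked with a short sketch that enumerates one concrete class per (element, type) pair. The generated class names are purely hypothetical; only the three elements and two types come from the example above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ClassExplosionDemo {
    /** Lists one hypothetical concrete class name per (element, type) combination. */
    static List<String> requiredClasses(List<String> elements, List<String> types) {
        List<String> names = new ArrayList<>();
        for (String element : elements) {
            for (String type : types) {
                names.add(element + "As" + type + "Impl");
            }
        }
        return names;
    }

    public static void main(String[] args) {
        List<String> elements = Arrays.asList("Item", "HotItem", "CoolItem");
        List<String> types = Arrays.asList("Product", "ProductOnSale");
        List<String> names = requiredClasses(elements, types);
        System.out.println(names.size()); // 6
        for (String name : names) {
            System.out.println(name);
        }
    }
}
```

Doubling both lists would quadruple the output, which is the nonlinear growth described above.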
It is acceptable for a binding solution to produce a linear number of generated classes (i.e., for a schema with twice as many components, generate twice as many classes). However, it is unacceptable for a binding solution to be required to generate a quadratic number of classes.
It is possible to defer the type explosion from compile time to runtime in Java through the use of dynamic proxies. However, that technique would also impose a layer of required inefficiency on the design; it just shifts the burden from compile time to runtime.
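For readers unfamiliar with the technique, here is a rough sketch of what the dynamic-proxy approach might look like, using java.lang.reflect.Proxy. The interfaces, the combine() helper, and the handler behavior are all assumptions for illustration; this is not how XMLBeans is implemented.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;

class DynamicProxyDemo {
    interface Product { String getDescription(); }
    interface ProductOnSale extends Product { }
    interface Item extends Product { }
    interface HotItem extends Item { }

    /** Builds, at runtime, an instance implementing one element interface and one type interface. */
    static Object combine(Class<?> elementIface, Class<?> typeIface) {
        // Every call is routed through this handler, which is the runtime cost mentioned above.
        InvocationHandler handler = (proxy, method, args) ->
                method.getName().equals("getDescription") ? "widget" : null;
        return Proxy.newProxyInstance(
                DynamicProxyDemo.class.getClassLoader(),
                new Class<?>[] { elementIface, typeIface },
                handler);
    }

    public static void main(String[] args) {
        Object hotSale = combine(HotItem.class, ProductOnSale.class);
        Object plain = combine(Item.class, Product.class);
        System.out.println(hotSale instanceof HotItem);              // true
        System.out.println(hotSale instanceof ProductOnSale);        // true
        System.out.println(plain instanceof HotItem);                // false
        System.out.println(((Product) hotSale).getDescription());    // widget
    }
}
```

No combination class is generated ahead of time, but each distinct combination still gets its own proxy class at runtime, and every method call pays for reflective dispatch.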
Element Names as Data Rather than Types
The discussion so far has explained why it is not feasible to arrange Java class inheritance in a way that permits "instanceof" to be used to detect substitution of elements. What is the correct approach?
The simple binding style solution is straightforward: since substituted element names cannot be treated as Java type metadata, they must be treated as Java instance data.
1. In keeping with the intent of substitution groups as a way of "substituting" elements, getters corresponding to elements that are the head of a substitution group should return all elements in the substitution group. For example, getItemArray() will return an array that represents all the <item>, <hot-item>, and <cool-item> elements.
2. In light of the discussion in the last section, the instances that are returned should all implement Java classes that correspond to the schema types of the instance data. To avoid a class explosion, they are not required to implement additional classes that correspond to the elements. The schema spec guarantees that when an element is substituted, its type is also substitutable.
3. Then, to make the element substitution detectable and accessible to the programmer, the element names used in the xml data are made available by a method on the associated instance in Java.
For this to be convenient, constants (such as QNAME_ITEM) should be generated for the relevant QNames.
4. Similarly, setter methods must be available that are keyed off of specific element names, to permit construction of instances that use substitution groups.
Although XMLBeans v1 fully supports substitution groups, it does not provide methods as easy as the ones illustrated above. For example, it does not provide "add" methods such as the ones above or the "nodeQName" methods - you must use XmlCursor to access that functionality. But XMLBeans v2 will probably add these kinds of methods.
The idea of using element names as instance data to parameterize write-access to data is also relevant when discussing wildcards. We will discuss wildcards in the next article in the series.
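The four points above can be sketched in plain Java as follows. This is a hypothetical model, not the XMLBeans API: the Catalog container, the addItem() signature, the unqualified QNames, and the countHot() helper are assumptions, while getItemArray(), nodeQName, and QNAME_ITEM are the names mentioned above.

```java
import java.util.ArrayList;
import java.util.List;
import javax.xml.namespace.QName;

class ElementNameDataDemo {
    // Hypothetical generated constants (no namespace assumed).
    static final QName QNAME_ITEM = new QName("item");
    static final QName QNAME_HOT_ITEM = new QName("hot-item");

    /** Instance of schema type "product"; the element name is carried as data. */
    static class Product {
        private final QName name;
        Product(QName name) { this.name = name; }
        QName nodeQName() { return name; }
    }

    /** Hypothetical container whose content model references <item>. */
    static class Catalog {
        private final List<Product> items = new ArrayList<>();

        /** Adder keyed off a specific element name (point 4). */
        Product addItem(QName elementName) {
            Product p = new Product(elementName);
            items.add(p);
            return p;
        }

        /** Returns every member of the substitution group (point 1). */
        Product[] getItemArray() {
            return items.toArray(new Product[0]);
        }
    }

    /** Detects substitution by comparing names, not by instanceof (point 3). */
    static int countHot(Catalog catalog) {
        int count = 0;
        for (Product p : catalog.getItemArray()) {
            if (QNAME_HOT_ITEM.equals(p.nodeQName())) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        Catalog catalog = new Catalog();
        catalog.addItem(QNAME_ITEM);
        catalog.addItem(QNAME_HOT_ITEM);
        System.out.println(catalog.getItemArray().length); // 2
        System.out.println(countHot(catalog));             // 1
    }
}
```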
Inheritance by Restriction
We finish this article with a short discussion of inheritance by restriction. In Java, the only kind of inheritance is inheritance by extension, and it always works by adding or overriding methods on a class. However, in schema, there are two forms of complex type inheritance:
Here is an example of three types that are related to each other via inheritance by restriction:
In words, the example above defines a type "derived2" that
As you can see, derivation by restriction is a form of derivation by subsetting. The set of instances permitted by "derived1" is a subset of the set of instances permitted by "base", and the set of instances permitted by "derived2" is a subset of the set of instances permitted by "derived1".
How should the two derived types be bound in Java? Clearly for type correspondence to work, they must both be bound to classes that inherit from the corresponding classes for the base types. But in particular, there are two natural questions:
Let us take a look at the <middle> element question first.
It is a common misconception that inheritance by restriction allows derived types to "remove" elements from a base type's content model. That is not a correct description of derivation by restriction. A derived restriction can only impose further restrictions on degrees of freedom that were already present in the base type. So, for example, the reason the <middle> tag is allowed to be "removed" when derived2 restricts derived1 is that it is already optional in derived1. In contrast, an element such as the <first> tag cannot be "removed" because it is required in the base type derived1. Notice that <middle> is allowed to be "added" when derived1 restricts <base>, because it is specializing a wildcard which already permits <middle>.
What should the bound type Derived2 look like? The answer is, the signature should look exactly like Derived1, but implementations happen to always return "null" on the getMiddle
An ordinary Java programmer working with a variable of type "Derived1" would want to write polymorphic code like the following:
In particular, polymorphism should guarantee that the "getMiddle()" method should work correctly and return the right value regardless of whether the instance were actually a
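A minimal sketch of this binding, assuming that <first> and <middle> are bound as String properties. The implementation classes and the summarize() helper are hypothetical; only Derived1, Derived2, and getMiddle() come from the discussion above.

```java
class RestrictionDemo {
    /** Bound interface for "derived1": <first> is required, <middle> is optional. */
    interface Derived1 {
        String getFirst();
        String getMiddle(); // null when <middle> is absent
    }

    /** Bound interface for "derived2": same signature, nothing removed. */
    interface Derived2 extends Derived1 { }

    static class Derived1Impl implements Derived1 {
        public String getFirst() { return "a"; }
        public String getMiddle() { return "b"; }
    }

    static class Derived2Impl implements Derived2 {
        public String getFirst() { return "a"; }
        public String getMiddle() { return null; } // <middle> can never appear
    }

    /** Polymorphic code written against Derived1; no instanceof needed. */
    static String summarize(Derived1 d) {
        String middle = d.getMiddle();
        return middle == null ? d.getFirst() : d.getFirst() + " " + middle;
    }

    public static void main(String[] args) {
        System.out.println(summarize(new Derived1Impl())); // a b
        System.out.println(summarize(new Derived2Impl())); // a
    }
}
```

Code that already handles an absent optional <middle> in Derived1 handles Derived2 without change, which is exactly what restricting an existing degree of freedom should mean.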
The second question when examining inheritance by restriction is, how should wildcards such as the <xs:any> found in the "base" type be bound?
The answer provided by the simple binding model is "wildcards are not bound to any generated method at all" - because open element and attribute content is always accessible by generic accessors. An additional method is unnecessary and not provided by the simple binding style.
The omission is not an oversight, but a result of careful design! But that is a topic for another day. The detailed reasons for this conscious omission will be discussed in the next article in this series.
The Principle of Type Correspondence
In this article, we've discussed a few of the implications of the principle of type correspondence, and how it impacts the binding of a few XML Schema features to Java. For Java programmers to be able to write correct code in the presence of type and element substitution, they need the following two invariants to hold true:
1. If the XML type derived inherits from type base, then in Java, "x instanceof Derived" must imply "x instanceof Base".
In particular, we have seen that these invariants have the following ramifications:
In the next article in the series, we will move on to discuss the problem of open content, wildcards, and raw XML infoset access.
Posted by David at December 18, 2003 02:42 PM
Copyright 2003 © David Bau. All Rights Reserved.