November 14, 2003

The Design of XMLBeans (Part 1)

I am a contributor to a new Apache-project-in-incubation called XMLBeans, a powerful new Java/XML binding tool.

Although folks are starting to use it, and there is some reasonable documentation, nobody has explained the architecture of the tool, or explained the "why" behind the "how."

This is the first in a series of entries that will discuss the problem of Java/XML binding and explain the thinking behind the XMLBeans approach to solving it.

The Simplified XML Binding Style

The XML Schema/Java binding technology used by Apache XMLBeans is a carefully designed "simplified binding style" that has several desirable properties.

  1. The binding style is capable of supporting all of XML Schema (e.g., all types of content models, substitution, inheritance, and 100% of schema validation).
  2. The binding style is capable of supporting a model that can round-trip all of XML, as well as permitting consistent access to the same XML infoset data using other APIs (XPath, DOM, etc.).
  3. The binding style is simple and so permits very fast implementations.
  4. The binding style is robust to versioning of XML schemas as well as binding to invalid content.
  5. The binding style produces Java signatures that are easy to understand and convenient to use.

This note describes the underlying principles and the specifics of the simplified XML schema binding style.

Basic Principles

The simplified binding style is built on two architectural principles:

  1. Principle of Type Correspondence. There is a one-to-one correspondence between Java classes and schema types, and the inheritance trees in Java and schema correspond to each other.
  2. Principle of Node Correspondence. There is a one-to-one correspondence between Java instance objects and nodes for elements, attributes, and documents in the XML infoset, and the containment relationships in Java reflect the child and sibling relationships in the XML infoset.

These two principles provide a bedrock of invariants that guarantee that some basic programming mechanisms work. For example:

  1. Type correspondence guarantees that Java "instanceof" can be used to detect schema types even in the presence of substitution and inheritance, and that Java type substitution can be used wherever schema type substitution can be found.
  2. Node correspondence guarantees that XML information is preserved in the Java instance data, and that object identity can be maintained while accessing or manipulating the bound XML infoset using a variety of different idioms, such as DOM or XPath or XQuery.
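To make the first guarantee concrete, here is a minimal sketch of type correspondence for a user-defined type. It assumes a hypothetical schema in which a type "usAddress" extends a type "address". The interfaces and classes below are hand-written stand-ins invented for illustration; they are not the XMLBeans API or its generated code.

```java
// Hypothetical stand-ins for classes a binding tool might produce from a
// schema in which "usAddress" extends "address". Not the real XMLBeans API.
class TypeCorrespondenceSketch {

    /** Corresponds to the (assumed) schema type "address". */
    interface Address {
        String getCity();
    }

    /** Corresponds to the (assumed) schema type "usAddress", which extends "address". */
    interface UsAddress extends Address {
        String getZip();
    }

    static final class AddressImpl implements Address {
        private final String city;
        AddressImpl(String city) { this.city = city; }
        public String getCity() { return city; }
    }

    static final class UsAddressImpl implements UsAddress {
        private final String city;
        private final String zip;
        UsAddressImpl(String city, String zip) { this.city = city; this.zip = zip; }
        public String getCity() { return city; }
        public String getZip() { return zip; }
    }

    /**
     * Written against the base type, yet accepts a substituted derived type.
     * Because the Java tree mirrors the schema tree, instanceof detects the
     * schema type that is actually present.
     */
    static String describe(Address a) {
        if (a instanceof UsAddress) {
            return a.getCity() + " " + ((UsAddress) a).getZip();
        }
        return a.getCity();
    }

    public static void main(String[] args) {
        System.out.println(describe(new AddressImpl("Paris")));
        System.out.println(describe(new UsAddressImpl("Seattle", "98101")));
    }
}
```

The sketch covers only type correspondence; node correspondence is a separate guarantee about object identity.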

The two basic principles also have the advantage that they provide an easy and intuitive model for programmers to apply and understand. Yet preserving both principles while providing a useful binding model presents a couple of challenges.

Understanding Type Correspondence

The principle of type correspondence provides a Java class for every schema type. In particular, all the built-in Schema types must have corresponding Java classes.

At first blush, one might assume, for example, that the Java class formally corresponding to the schema type xs:string should be java.lang.String. However, since java.lang.String is a final class, that choice would not allow xs:token (or any other schema type which inherits from xs:string) to have a Java class that has the proper inheritance relationship, since no Java class can extend java.lang.String.

XML Schema type   Corresponding Java type?   Verdict
xs:string         java.lang.String?          Maybe, but...
xs:token          java.lang.String?          No, because it is not distinct from the type bound to xs:string, so instanceof cannot distinguish them.
xs:token          a custom XmlToken class?   No, because it is not an instanceof java.lang.String, so the inheritance trees do not line up.

On the other hand, any Java programmer would be right to demand the convenience of a java.lang.String for each xs:string, as well as a Java "int" for an xs:int and so on, even though the "instanceof" operator has no hope of working correctly. Faithful type correspondence, while very important for complex types, seems to be different from what you want in practice for simple types. And yet, since schema allows complex types to inherit from simple types (these are called complex types with simple content), if we do not establish type correspondence for simple types, we will not be able to establish full type correspondence for complex types.

The solution provided by the simplified style is to provide not one, but two Java classes for each simple type. There is a "formal" Java class which establishes full type correspondence, and there is a "convenience" Java type that does not need to play in the type correspondence world. The "convenience" Java type does not need to uniquely map to or from a schema type or have any particular inheritance relationship with other Java types, and it will be provided where convenience is important. But the "formal" type will always be available and will represent the "true" data model.
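As a rough illustration of the two-class idea, the sketch below pairs a "formal" type with a "convenience" type for xs:string and xs:token. The interface names are borrowed from the article, but everything here is a local stand-in written for this example. In particular, the accessor name getStringValue() and the node classes are assumptions, not a statement of what XMLBeans actually exposes.

```java
// Local stand-ins that borrow the article's names; not the real XMLBeans classes.
class FormalAndConvenienceSketch {

    /** "Formal" class for xs:string. */
    interface XmlString {
        /** Hands back the "convenience" type, a plain java.lang.String. */
        String getStringValue();
    }

    /** "Formal" class for xs:token; it extends the formal class for xs:string. */
    interface XmlToken extends XmlString {
    }

    static final class StringNode implements XmlString {
        private final String text;
        StringNode(String text) { this.text = text; }
        public String getStringValue() { return text; }
    }

    static final class TokenNode implements XmlToken {
        private final String text;
        TokenNode(String text) { this.text = text; }
        public String getStringValue() { return text; }
    }

    /** The formal types can be told apart with instanceof. */
    static String schemaTypeOf(XmlString value) {
        return (value instanceof XmlToken) ? "xs:token" : "xs:string";
    }

    public static void main(String[] args) {
        XmlString plain = new StringNode("hello world");
        XmlString token = new TokenNode("hello");

        // Formal side: distinct classes with the right inheritance relationship.
        System.out.println(schemaTypeOf(plain));
        System.out.println(schemaTypeOf(token));

        // Convenience side: both are ordinary Strings, with no type distinction.
        String a = plain.getStringValue();
        String b = token.getStringValue();
        System.out.println(a.length() + b.length());
    }
}
```

The design choice is that the formal interfaces carry the inheritance tree, while the convenience String carries no schema type information at all.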

A table of all the built-in schema types together with their "formal" and "convenience" Java types is listed below.

Schema type Formal class Convenience
xs:string XmlString String
xs:boolean XmlBoolean boolean
xs:decimal XmlDecimal BigDecimal
xs:float XmlFloat float
xs:double XmlDouble double
xs:duration XmlDuration GDuration*
xs:dateTime XmlDateTime Calendar
xs:time XmlTime Calendar
xs:date XmlDate Calendar
xs:gYearMonth XmlGYearMonth Calendar
xs:gYear XmlGYear Calendar
xs:gMonthDay XmlGMonthDay Calendar
xs:gDay XmlGDay Calendar
xs:gMonth XmlGMonth Calendar
xs:hexBinary XmlHexBinary byte[]
xs:base64Binary XmlBase64Binary byte[]
xs:anyURI XmlAnyURI String
xs:QName XmlQName QName
xs:normalizedString XmlNormalizedString String
xs:token XmlToken String
xs:language XmlLanguage String
xs:Name XmlName String
xs:NCName XmlNCName String
xs:ID XmlID String
xs:IDREF XmlIDREF String
xs:integer XmlInteger BigInteger
xs:negativeInteger XmlNegativeInteger BigInteger
xs:long XmlLong long
xs:int XmlInt int
xs:short XmlShort short
xs:byte XmlByte byte
xs:unsignedLong XmlUnsignedLong BigInteger
xs:unsignedInt XmlUnsignedInt long
xs:unsignedShort XmlUnsignedShort int
xs:unsignedByte XmlUnsignedByte short
xs:positiveInteger XmlPositiveInteger BigInteger
xs:anyType XmlObject XmlObject**
xs:anySimpleType XmlAnySimpleType String
* All convenience types are built into the JDK except for GDuration. The JDK does not have a built-in class that corresponds to XML Schema's Gregorian duration type.
** sometimes - for the non-simple types - the "convenience" type is just the same as the "formal" type.

The formal classes have the same inheritance relationships that the corresponding schema types do. For example, the XmlInt Java class has the following base types:

Java inheritance           XML Schema inheritance
XmlInt extends             xs:int restricts
XmlLong extends            xs:long restricts
XmlInteger extends         xs:integer restricts
XmlDecimal extends         xs:decimal restricts
XmlAnySimpleType extends   xs:anySimpleType restricts
XmlObject                  xs:anyType

The fact that the inheritance in Java follows the inheritance in schema has some utility. For example, if XmlDecimal has a method called "getBigDecimalValue()", then you can also call "getBigDecimalValue()" on any XmlInteger, XmlLong, or XmlInt. Even if somebody has substituted a restricted subclass in the XML instance such as an xs:int for an xs:decimal, the programmer can be assured that it is always possible to extract a BigDecimal value in the same way.
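The chain above can be sketched with plain Java interfaces. The names follow the article, but the interfaces and node classes are stand-ins written for this example, and getIntValue() is an assumed accessor name. The point is only that a method declared on the formal class for xs:decimal is callable on every formal class below it.

```java
import java.math.BigDecimal;

// Local stand-ins that mirror the inheritance chain in the article.
class FormalHierarchySketch {

    interface XmlObject { }                                   // xs:anyType
    interface XmlAnySimpleType extends XmlObject { }          // xs:anySimpleType
    interface XmlDecimal extends XmlAnySimpleType {           // xs:decimal
        BigDecimal getBigDecimalValue();
    }
    interface XmlInteger extends XmlDecimal { }               // xs:integer
    interface XmlLong extends XmlInteger { }                  // xs:long
    interface XmlInt extends XmlLong {                        // xs:int
        int getIntValue();                                    // assumed accessor name
    }

    static final class DecimalNode implements XmlDecimal {
        private final BigDecimal value;
        DecimalNode(BigDecimal value) { this.value = value; }
        public BigDecimal getBigDecimalValue() { return value; }
    }

    static final class IntNode implements XmlInt {
        private final int value;
        IntNode(int value) { this.value = value; }
        public int getIntValue() { return value; }
        // Inherited from XmlDecimal: an xs:int can always be read as a decimal.
        public BigDecimal getBigDecimalValue() { return BigDecimal.valueOf((long) value); }
    }

    /** Written against xs:decimal; works unchanged when an xs:int is substituted. */
    static BigDecimal total(XmlDecimal... amounts) {
        BigDecimal sum = BigDecimal.ZERO;
        for (XmlDecimal amount : amounts) {
            sum = sum.add(amount.getBigDecimalValue());
        }
        return sum;
    }

    public static void main(String[] args) {
        XmlDecimal price = new DecimalNode(new BigDecimal("1.50"));
        XmlDecimal count = new IntNode(2); // an xs:int where an xs:decimal is expected
        System.out.println(total(price, count));
    }
}
```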

Another consequence of the type correspondence is that every Java class that corresponds to a schema type inherits from the class that represents xs:anyType. Here we have called this universal base class "XmlObject".

Of course, the principle of type correspondence extends beyond the built-in types above to all user-defined types. Note that some user-defined types in XML Schema are anonymous. In the simplified binding model, nested anonymous schema types also have a corresponding nested Java class.
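One way to picture the nesting is sketched below, assuming a hypothetical schema in which an "order" element has an anonymous complex type that contains an "item" element with its own anonymous type. The Java names (Order, Item, getSku) are invented for illustration, and the exact shape of the classes a real binding tool generates may differ.

```java
// Assumed schema fragment, for illustration only:
//
//   <xs:element name="order">
//     <xs:complexType>                  <!-- anonymous type of "order" -->
//       <xs:sequence>
//         <xs:element name="item">
//           <xs:complexType>            <!-- anonymous type nested inside it -->
//             <xs:attribute name="sku" type="xs:string"/>
//           </xs:complexType>
//         </xs:element>
//       </xs:sequence>
//     </xs:complexType>
//   </xs:element>
class AnonymousTypeSketch {

    /** Stand-in for the Java type bound to the anonymous type of "order". */
    interface Order {
        Item getItem();

        /** The nested anonymous schema type becomes a nested Java type. */
        interface Item {
            String getSku();
        }
    }

    /** Hypothetical factory that builds a tiny in-memory instance. */
    static Order newOrder(String sku) {
        Order.Item item = () -> sku;
        return () -> item;
    }

    public static void main(String[] args) {
        Order order = newOrder("A-1");
        System.out.println(order.getItem().getSku());
        // The Java nesting mirrors the schema nesting.
        System.out.println(Order.Item.class.getDeclaringClass().getSimpleName());
    }
}
```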

Next Column

I've talked a bit about what "type correspondence" means when doing XML/Java binding. In the next article in this series I will discuss some of the details of "node correspondence" - what it is, and what it is not.

Posted by David at November 14, 2003 04:37 PM
