Practical XML: Output

How tough is it to generate XML output? After all, it's just a text format with angle-brackets as delimiters. As long as you don't have to fiddle with DTD or XSD, it should be no problem, right?

As it turns out, it's not that easy at all. Unless you're very familiar with the XML specification, chances are good that a simple text-based implementation will have flaws. In the best of cases, they will produce XML that is not well-formed and will be rejected by a parser. In the worst, they'll produce apparently well-formed and parseable XML that delivers incorrect data.

Naive String-based XML Generation

For some obsessive-compulsive reason, you've decided that you need a list of your music in XML format. Something that looks like this:

<music>
    <artist name="Anderson, Laurie">
        <album>Big Science</album>
        <album>Mister Heartbreak</album>
        <album>Strange Angels</album>
    </artist>
</music>

Since all of your files are in a single directory tree, organized by artist and album, with names and titles courtesy of CDDB, this sounds like a job for Java. So you write a program that reads the directory structure and calls the following method for each artist:

private static void appendArtist(StringBuffer buf, String artist, String[] albums)
{
    buf.append("  <artist name=\"").append(artist).append("\">\n");
    for (String album : albums)
    {
        buf.append("    <album>").append(album).append("</album>\n");
    }
    buf.append("  </artist>\n");
}

You run your program, look at the beautiful XML output, and are happy. Then a few weeks later you send this list to a friend, who writes an XSLT stylesheet to extract the data for her music database. And she calls you to complain about parsing errors: artist names like “Becker & Fagan” and album names like “The Raw & The Cooked”. What happened?

Well, ampersands aren't allowed in XML, except in very restricted places. You debate, for just a minute, changing your code to emit CDATA blocks, then realize that doesn't help with the artist name, which is stored as an attribute — and some day there will be an album name containing “]]>” as some sort of inside joke.
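
To see the failure for yourself, you can feed the naive output to the JDK's own parser. The following is a minimal sketch: the class name NaiveCheck and the helpers naiveMusic() and isWellFormed() are invented for this illustration, while appendArtist() is the method shown above.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class NaiveCheck
{
    // the naive generator from above, unchanged
    static void appendArtist(StringBuffer buf, String artist, String[] albums)
    {
        buf.append("  <artist name=\"").append(artist).append("\">\n");
        for (String album : albums)
        {
            buf.append("    <album>").append(album).append("</album>\n");
        }
        buf.append("  </artist>\n");
    }

    // hypothetical helper: wraps a single artist in a <music> element
    static String naiveMusic(String artist, String... albums)
    {
        StringBuffer buf = new StringBuffer("<music>\n");
        appendArtist(buf, artist, albums);
        return buf.append("</music>\n").toString();
    }

    // hypothetical helper: true if the JDK parser accepts the text
    static boolean isWellFormed(String xml)
    {
        try
        {
            DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them to stderr
            db.setErrorHandler(new DefaultHandler());
            db.parse(new InputSource(new StringReader(xml)));
            return true;
        }
        catch (SAXException e)
        {
            return false;   // the parser rejected the document
        }
        catch (IOException | ParserConfigurationException e)
        {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] argv)
    {
        System.out.println(isWellFormed(naiveMusic("Anderson, Laurie", "Big Science")));   // true
        System.out.println(isWellFormed(naiveMusic("Becker & Fagan", "The Collection")));  // false
    }
}
```

A check like this makes a reasonable unit test for any code that emits XML as text, but it only tells you that the output parses, not that it carries the right data.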

If you're familiar with Jakarta Commons you might think to use StringEscapeUtils.escapeXml(), but the whole experience has made you wonder: what else am I missing? And the language of the XML spec doesn't leave you with confidence that the answer is “nothing.”

Of course, in the real world you're not building a list of albums, you're building code to send an order message to a fulfillment system. And if that message can't be parsed, you're going to expose your company to legal action. Simple text isn't looking so simple anymore, is it?

Building and Writing a DOM Document

This is probably the point where you say “wait a minute, the JDK gives me everything I need to build XML documents!” And you're right, it does, and after a quick Google you discover there's even a tutorial — although you might wonder why XML processing is buried in a J2EE tutorial. But you read it, and look at the code samples that Sun provides, and you're ready to begin:

    DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
    Document doc = db.newDocument();
    Element root = doc.createElement("music");
    doc.appendChild(root);

That's, well, a little baroque. But it does show off plenty of design patterns: the first line has both Abstract Factory and Factory Method, and the Document interface is filled with factory methods! You press onward. By this point, you've probably been reading the org.w3c.dom package docs pretty heavily, so you've realized that an Element holds a reference to its Document, so you don't need to pass the latter around. Your appendArtist() method now looks like this:

public static void appendArtist(Element root, String artist, String[] albums)
{
    Document doc = root.getOwnerDocument();
    Element eArtist = doc.createElement("artist");
    root.appendChild(eArtist);
    eArtist.setAttribute("name", artist);
    for (String album : albums)
    {
        Element eAlbum = doc.createElement("album");
        eArtist.appendChild(eAlbum);
        Text tAlbum = doc.createTextNode(album);
        eAlbum.appendChild(tAlbum);
    }
}

Wow, that's a lot of work. And a lot of repetitive code. And it's not at all clear exactly what you're doing. But at least you've got an XML document. Well, no: actually, you've got an XML Document. Try to print that and all you get is:

[#document: null]

A little more reading, and you learn that you need to use the javax.xml.transform package to generate output. And if you think about that for a minute, you'll realize that what you're actually doing is applying XSLT's identity transformation to your DOM: a transformer created without a stylesheet simply copies its source to its result.

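public static String generateXML(Document doc)
throws TransformerException
{
    StringWriter out = new StringWriter();
    // with no stylesheet argument, the transformer copies source to result unchanged
    Transformer xform = TransformerFactory.newInstance().newTransformer();
    // ask for line breaks and indentation between elements
    xform.setOutputProperty(OutputKeys.INDENT, "yes");
    // leave off the <?xml ... ?> prologue
    xform.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
    xform.transform(new DOMSource(doc), new StreamResult(out));
    return out.toString();
}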

Well, that didn't seem too bad, did it? Your ampersands are escaped, and it didn't take that many additional lines of code. Of course, there are a few things that aren't immediately apparent, such as those properties that we're setting. And why, if we're telling it to indent, is it not indented?!? (the answer is here).
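
Putting the pieces of this section together gives something you can actually run. Treat it as a sketch rather than the definitive recipe: the class name DomOutput and the newMusic() helper are invented here, and the "{http://xml.apache.org/xslt}indent-amount" property is specific to the Xalan-derived transformer bundled with the JDK. On some JDK versions that transformer is reported to use an indent amount of zero unless told otherwise, which is one possible explanation for unindented output; a different transformer implementation may simply ignore the property.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class DomOutput
{
    // hypothetical helper: an empty document with a <music> root
    static Document newMusic()
    {
        try
        {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            doc.appendChild(doc.createElement("music"));
            return doc;
        }
        catch (ParserConfigurationException e)
        {
            throw new IllegalStateException(e);
        }
    }

    // same logic as the DOM version of appendArtist() above
    static void appendArtist(Element root, String artist, String... albums)
    {
        Document doc = root.getOwnerDocument();
        Element eArtist = doc.createElement("artist");
        root.appendChild(eArtist);
        eArtist.setAttribute("name", artist);
        for (String album : albums)
        {
            Element eAlbum = doc.createElement("album");
            eArtist.appendChild(eAlbum);
            eAlbum.appendChild(doc.createTextNode(album));
        }
    }

    static String generateXML(Document doc)
    {
        try
        {
            StringWriter out = new StringWriter();
            Transformer xform = TransformerFactory.newInstance().newTransformer();
            xform.setOutputProperty(OutputKeys.INDENT, "yes");
            xform.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            // implementation-specific (Xalan) property; an assumption about your transformer
            xform.setOutputProperty("{http://xml.apache.org/xslt}indent-amount", "2");
            xform.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        }
        catch (TransformerException e)
        {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] argv)
    {
        Document doc = newMusic();
        appendArtist(doc.getDocumentElement(), "Becker & Fagan", "The Collection");
        System.out.println(generateXML(doc));   // the ampersand comes out as &amp;
    }
}
```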

XmlBuilder

At this point, you're probably thinking “there's gotta be a better way!”

There are several. One popular approach is JDOM, which provides a simplified interface for building an in-memory representation of an XML document, and a configurable output generator. Another alternative is to write a set of utility methods to simplify access to the DOM that's built into the JDK.

I've used both, but ultimately I was driven toward an even simpler solution to support unit tests. One of the hallmarks of a good test is that it be easy to read. When you build XML programmatically, even if you have a simplified API like JDOM, it can be difficult to discern the structure in the data. So, with readability as a goal, I ended up with XmlBuilder.

ElementNode root =
    element("music",
        element("artist",
            attribute("name", "Anderson, Laurie"),
            element("album", text("Big Science")),
            element("album", text("Mister Heartbreak")),
            element("album", text("Strange Angels"))),
        element("artist",
            attribute("name", "Becker & Fagan"),
            element("album", text("The Collection"))),
        element("artist",
            attribute("name", "Fine Young Cannibals"),
            element("album", text("The Raw & The Cooked"))));

Your first thought might be that that's enough parentheses to keep a LISP programmer happy. It isn't, although it does highlight a former coworker's comment that XML is just a more verbose way to represent S expressions.

What makes this code possible are two features that arrived in JDK 1.5: static imports and varargs. The various builder methods, such as element(), are static methods on XmlBuilder. With JDK 1.5, we can use these methods without prefixing them with the classname, provided that we specify their use in a static import statement. Varargs come into play with the element() method, which can take any number of child nodes. In JDK 1.4 and before, you would have to explicitly create the array of child nodes. Together, these language features lead to a very clean way to build XML.
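
If you're curious how varargs and static methods combine to allow that nesting, here is a deliberately tiny sketch of the idea. It is not XmlBuilder's implementation: the class MiniXml, the Part interface, and the document() method are all invented for this illustration, and it builds a JDK DOM rather than XmlBuilder's own node classes.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class MiniXml
{
    // anything that knows how to attach itself to a parent element
    interface Part
    {
        void attachTo(Element parent);
    }

    static Part attribute(String name, String value)
    {
        return parent -> parent.setAttribute(name, value);
    }

    static Part text(String value)
    {
        return parent -> parent.appendChild(parent.getOwnerDocument().createTextNode(value));
    }

    // varargs: callers list the children directly instead of building a Part[] by hand
    static Part element(String name, Part... children)
    {
        return parent ->
        {
            Element elem = parent.getOwnerDocument().createElement(name);
            parent.appendChild(elem);
            for (Part child : children)
                child.attachTo(elem);
        };
    }

    // creates the document and its root, then lets each child attach itself
    static Document document(String rootName, Part... children)
    {
        try
        {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element root = doc.createElement(rootName);
            doc.appendChild(root);
            for (Part child : children)
                child.attachTo(root);
            return doc;
        }
        catch (ParserConfigurationException e)
        {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] argv)
    {
        // inside MiniXml the methods need no prefix; code in another class would
        // get the same effect from a static import of MiniXml's methods
        Document doc = document("music",
            element("artist",
                attribute("name", "Becker & Fagan"),
                element("album", text("The Collection"))));
        Element artist = (Element)doc.getDocumentElement().getFirstChild();
        System.out.println(artist.getAttribute("name"));   // Becker & Fagan
        System.out.println(artist.getTextContent());       // The Collection
    }
}
```

Because each call returns a value that the enclosing call accepts as an argument, the shape of the Java source mirrors the shape of the XML it produces.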

In addition to the declarative approach of static methods, the ElementNode.addChild() method provides a simple imperative mechanism:

public static void appendArtist(ElementNode root, String artist, String[] albums)
{
    ElementNode entry = element("artist", attribute("name", artist));
    for (String album : albums)
        entry.addChild(element("album", text(album)));
    root.addChild(entry);
}

One of the additional benefits of XmlBuilder is that it's a very compact representation: a few pointers wrapped around the actual data.

Pay Attention to Encodings

There's still one piece of the picture that needs to be examined: the XML prologue, and most importantly, the encoding specification therein. An XML document may use most (but not all!) of the characters defined by Unicode. However, Unicode identifies each character by a number that can need up to 21 bits, while disk files and network sockets deal with individual bytes, so something has to say how characters are turned into bytes.

XML parsers are required to accept the UTF-8 encoding as a default (also UTF-16, but we'll ignore that as it isn't a byte-oriented encoding). They may support other encodings, which can be specified via an optional prologue as the first line of an XML document:

<?xml version="1.0" encoding="ISO-8859-1" ?>

This prologue specifies that the file should be processed using the ISO-8859-1 (Latin-1) character encoding. This encoding uses one byte per character, and consists of the ASCII characters and a selection of accented characters and symbols. However, it doesn't have mathematical symbols, or Greek characters, or Chinese characters … or any of the other million-plus characters that can be represented by Unicode. You can still use such characters in your content, but they must be escaped as entities: a Euro symbol, for example, must appear as "&#8364;". But you can't use that solution for element names, just text and attributes.

If you write your output to a StreamResult that wraps an OutputStream, you'll have no problem. Where you will have problems is when you write your output using a StringWriter, then separately write that string to an OutputStream.
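
Here is a minimal sketch of the safe path, using the JDK transformer to write straight to a byte stream. The class name EncodedOutput and its helpers are invented for the example. The point to notice is that the transformer knows which encoding it is writing, so it can put that encoding in the prologue and replace anything the encoding can't hold (here, a Euro symbol in ISO-8859-1) with a numeric escape.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class EncodedOutput
{
    // hypothetical helper: a document holding one element with the given text
    static Document singleElement(String name, String text)
    {
        try
        {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element root = doc.createElement(name);
            root.appendChild(doc.createTextNode(text));
            doc.appendChild(root);
            return doc;
        }
        catch (Exception e)
        {
            throw new IllegalStateException(e);
        }
    }

    // serializes to bytes; the transformer both declares and applies the encoding
    static byte[] toBytes(Document doc, String encoding)
    {
        try
        {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            Transformer xform = TransformerFactory.newInstance().newTransformer();
            xform.setOutputProperty(OutputKeys.ENCODING, encoding);
            xform.transform(new DOMSource(doc), new StreamResult(out));
            return out.toByteArray();
        }
        catch (Exception e)
        {
            throw new IllegalStateException(e);
        }
    }

    // parses the bytes again and returns the root element's text
    static String readBack(byte[] xml)
    {
        try
        {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                           .parse(new ByteArrayInputStream(xml));
            return doc.getDocumentElement().getTextContent();
        }
        catch (Exception e)
        {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] argv)
    {
        byte[] bytes = toBytes(singleElement("price", "\u20ac10"), "ISO-8859-1");
        // decode with the same encoding we asked for, just to look at the result
        System.out.println(new String(bytes, StandardCharsets.ISO_8859_1));
        System.out.println(readBack(bytes).equals("\u20ac10"));   // true
    }
}
```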

This is because Java strings use Unicode characters, but Java streams use bytes. And unfortunately, it's very easy to wrap a stream with an object that accepts strings, without having any idea what encoding is being used: if you don't specify an encoding, Java provides a default. More unfortunately, this default encoding is meant to provide output to the user's console, so it differs by platform, and even by how the user has configured his or her environment. My home Linux system, for example, uses UTF-8, the Linux system I use at work goes with ISO-8859-1, and my Windows XP system uses “windows-1252” — an 8-bit encoding that's almost, but not quite the same as ISO-8859-1.

There are several solutions to this problem. The first, of course, is to never use anything but ASCII in your XML files. This works, but is not very useful in a global context: sooner or later you will need to use data that is outside of the 95 printable characters provided by ASCII.

The second solution is to always ensure that, if you're writing a string containing XML, you write it using UTF-8 encoding. Seems simple enough, but you don't always have control over the stream (particularly if it's handed to you as a Writer). And all it takes is one mistake, one forgotten or incorrect prologue, to send bad data.
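
A short sketch shows both the discipline and what happens when it slips. The class name StringEncoding, its helper methods, and the sample artist name are invented for the example. The XML string has no prologue, so a parser will assume UTF-8; the same string written as ISO-8859-1 bytes is no longer readable.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class StringEncoding
{
    // writes the string as UTF-8, never relying on the platform default
    static void writeUtf8(String xml, OutputStream out)
    {
        try
        {
            Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8);
            writer.write(xml);
            writer.flush();
        }
        catch (IOException e)
        {
            throw new UncheckedIOException(e);
        }
    }

    // hypothetical helper: true if the JDK parser can read these bytes
    static boolean parses(byte[] xml)
    {
        try
        {
            DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            db.setErrorHandler(new DefaultHandler());   // keep parser complaints off stderr
            db.parse(new ByteArrayInputStream(xml));
            return true;
        }
        catch (SAXException | IOException e)
        {
            return false;   // malformed bytes can surface as either exception type
        }
        catch (ParserConfigurationException e)
        {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] argv)
    {
        String xml = "<artist name=\"Bj\u00f6rk\"/>";   // no prologue: parser assumes UTF-8

        ByteArrayOutputStream good = new ByteArrayOutputStream();
        writeUtf8(xml, good);
        System.out.println(parses(good.toByteArray()));                          // true

        // what you get if the stream's encoding happens to be Latin-1
        System.out.println(parses(xml.getBytes(StandardCharsets.ISO_8859_1)));   // false
    }
}
```

The failing case here is the lucky one, because the parser notices. Bytes that are wrong but still valid in the assumed encoding will parse without complaint and deliver the wrong characters.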

The third, and in my opinion best, solution is to never store XML data in a Java String other than for logging or debugging. Instead, keep it in a DOM or other internal representation, and use the standard output code to translate into a stream-appropriate encoding. This, of course, brings us back to XmlBuilder, which supports writing directly to a stream with whatever encoding you choose:

    ElementNode root = element( // ...
    OutputStream out = // ...
    // ...
    root.toStream(out, "ISO-8859-1");
    out.close();

One More Thing

What happens if your XML content happens to contain a backspace character (“\b”)? Well, every transformer that I've used will happily encode that character as an entity: &#8;. And the Xerces parser used by the Sun JDK will reject that entity as an illegal character!

That's right, backspaces aren't legal in XML version 1.0. In fact, most of the ASCII control characters aren't legal: the only legal characters are tab (“\t”), newline (“\n”), and carriage return (“\r”). All other characters in the range \u0000 to \u001f are illegal. What makes this particularly annoying is that the Latin-1 control codes (characters \u0080 to \u009f) are legal!
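
If you'd rather catch such characters before they reach a transformer, the rule is short enough to write down. This sketch follows the Char production of the XML 1.0 specification; the class name XmlChars and both method names are invented for the example.

```java
class XmlChars
{
    // Char production from XML 1.0:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static boolean isLegalXml10(int codePoint)
    {
        return codePoint == 0x9 || codePoint == 0xA || codePoint == 0xD
            || (codePoint >= 0x20 && codePoint <= 0xD7FF)
            || (codePoint >= 0xE000 && codePoint <= 0xFFFD)
            || (codePoint >= 0x10000 && codePoint <= 0x10FFFF);
    }

    // index of the first character that XML 1.0 does not allow, or -1 if there is none
    static int firstIllegal(String s)
    {
        for (int i = 0; i < s.length(); )
        {
            int codePoint = s.codePointAt(i);
            if (!isLegalXml10(codePoint))
                return i;
            i += Character.charCount(codePoint);   // step over surrogate pairs as one character
        }
        return -1;
    }

    public static void main(String[] argv)
    {
        System.out.println(firstIllegal("Big Science"));     // -1
        System.out.println(firstIllegal("Big\bScience"));    // 3
        System.out.println(isLegalXml10(0x85));              // true
    }
}
```

What to do once you find an illegal character (reject the data, strip the character, or encode the value some other way) depends on what the data means to the recipient.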

This is a rather gaping hole in the spec, and it was resolved with XML 1.1, so one solution is to specify that version in a prologue and hope that the recipients of your document use a parser that pays attention to the prologue (the Xerces parser does). Or, recognizing that such characters don't often appear in normal text, you could encode them using a non-XML mechanism such as Base64.
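
The Base64 option is only a few lines with java.util.Base64 (available since Java 8). In this sketch the class name Base64Text is invented, and the choice of UTF-8 for converting the string to bytes is an assumption that sender and recipient would have to agree on, along with which elements hold encoded values.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

class Base64Text
{
    // the result uses only letters, digits, '+', '/' and '=', all legal in XML text
    static String encode(String raw)
    {
        return Base64.getEncoder().encodeToString(raw.getBytes(StandardCharsets.UTF_8));
    }

    static String decode(String encoded)
    {
        return new String(Base64.getDecoder().decode(encoded), StandardCharsets.UTF_8);
    }

    public static void main(String[] argv)
    {
        String raw = "ring\bbell";           // contains a backspace
        String safe = encode(raw);           // safe to use as element or attribute content
        System.out.println(safe);
        System.out.println(decode(safe).equals(raw));   // true
    }
}
```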

Conclusion

Contrary to appearances, this article was not intended to scare you into downloading my XmlBuilder, although that certainly provides solutions to many of the problems of XML output. Instead, it was driven by some of the questions that I've seen on Java XML forums, and the mistakes that I've seen in production code. The take-away should be that XML isn't as easy as it looks, but that the proper tools make it much easier.

For More Information

Compilable examples for this article:

The last example requires the Practical XML library. In addition to XmlBuilder, this library provides a set of utility classes designed to simplify life with the JDK's XML API. I've developed it over the past dozen or so years, based on actual need in professional projects (actually, I developed several predecessor libraries that were owned by the companies I worked for, then decided that I didn't want to develop the same code ever again).

The XML Specification is, to say the least, dense. Tim Bray, one of the original editors, produced the Annotated XML Specification, which provides internal hyperlinks and commentary. It is my first source for questions about legal XML.

Copyright © Keith D Gregory, all rights reserved