Practical XML: Output
Originally published: 2008-08-03
Last updated: 2015-04-27
How tough is it to generate XML output? After all, it's just a text format with angle-brackets as delimiters. As long as you don't have to fiddle with DTD or XSD, it should be no problem, right?
As it turns out, it's not that easy at all. Unless you're very familiar with the XML specification, chances are good that a simple text-based implementation will have flaws. In the best of cases, it will produce XML that is not well-formed and will be rejected by a parser. In the worst, it will produce apparently well-formed and parseable XML that delivers incorrect data.
Naive String-based XML Generation
For some obsessive-compulsive reason, you've decided that you need a list of your music in XML format. Something that looks like this:
<music>
    <artist name="Anderson, Laurie">
        <album>Big Science</album>
        <album>Mister Heartbreak</album>
        <album>Strange Angels</album>
    </artist>
</music>
Since all of your files are in a single directory tree, organized by artist and album, with names and titles courtesy of CDDB, this sounds like a job for Java. So you write a program that reads the directory structure and calls the following method for each artist:
private static void appendArtist(StringBuffer buf, String artist, String[] albums) {
    buf.append(" <artist name=\"").append(artist).append("\">\n");
    for (String album : albums) {
        buf.append(" <album>").append(album).append("</album>\n");
    }
    buf.append(" </artist>\n");
}
You run your program, look at the beautiful XML output, and are happy. Then a few weeks later you send this list to a friend, who writes an XSLT stylesheet to extract the data for her music database. And she calls you to complain about parsing errors: artist names like “Becker & Fagan” and album names like “The Raw & The Cooked”. What happened?
Well, ampersands aren't allowed in XML, except in very restricted places. You debate, for just a minute, changing your code to emit CDATA blocks, then realize that doesn't help with the artist name, which is stored as an attribute, and that some day there will be an album name containing “]]>” as some sort of inside joke. If you're familiar with Jakarta Commons you might think to use StringEscapeUtils.escapeXml(), but the whole experience has made you wonder: what else am I missing? And the language of the XML spec doesn't leave you with confidence that the answer is “nothing.”
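To make the "what else am I missing?" question concrete, here is a minimal sketch of hand-rolled escaping. The class and method names are hypothetical, and the code is deliberately incomplete: it handles the markup-significant characters in text and double-quoted attributes, but does nothing about illegal characters or encodings, both of which come up later.

```java
final class XmlEscapeSketch {
    private XmlEscapeSketch() {}

    /** Escapes the characters that are markup-significant in element content. */
    static String escapeText(String s) {
        StringBuilder sb = new StringBuilder(s.length() + 16);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;  // also defuses "]]>" in content
                default:  sb.append(c);
            }
        }
        return sb.toString();
    }

    /**
     * Double-quoted attribute values also need the quote escaped. Tabs and
     * line breaks are written as character references, because a parser
     * normalizes literal whitespace in attribute values to spaces.
     */
    static String escapeAttribute(String s) {
        return escapeText(s)
                .replace("\"", "&quot;")
                .replace("\t", "&#9;")
                .replace("\n", "&#10;")
                .replace("\r", "&#13;");
    }

    public static void main(String[] args) {
        System.out.println("<artist name=\"" + escapeAttribute("Becker & Fagan") + "\">");
        System.out.println("<album>" + escapeText("The Raw & The Cooked") + "</album>");
    }
}
```

Even this small helper has to treat text and attributes differently, which is a hint at how much a string-based generator has to get right.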
Of course, in the real world you're not building a list of albums, you're building code to send an order message to a fulfillment system. And if that message can't be parsed, you're going to expose your company to legal action. Simple text isn't looking so simple anymore, is it?
Building and Writing a DOM Document
This is probably the point where you say “wait a minute, the JDK gives me everything I need to build XML documents!” And you're right, it does, and after a quick Google you discover there's even a tutorial — although you might wonder why XML processing is buried in a J2EE tutorial. But you read it, and look at the code samples that Sun provides, and you're ready to begin:
DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
Document doc = db.newDocument();
Element root = doc.createElement("music");
doc.appendChild(root);
That's, well, a little baroque. But it does show off plenty of design patterns: the first line has both Abstract Factory and Factory Method, and the Document interface is filled with factory methods!
You press onward. By this point, you've probably been reading the org.w3c.dom package docs pretty heavily, so you've realized that an Element holds a reference to its Document, so you don't need to pass the latter around. Your appendArtist() method now looks like this:
public static void appendArtist(Element root, String artist, String[] albums) {
    Document doc = root.getOwnerDocument();
    Element eArtist = doc.createElement("artist");
    root.appendChild(eArtist);
    eArtist.setAttribute("name", artist);
    for (String album : albums) {
        Element eAlbum = doc.createElement("album");
        eArtist.appendChild(eAlbum);
        Text tAlbum = doc.createTextNode(album);
        eAlbum.appendChild(tAlbum);
    }
}
Wow, that's a lot of work. And a lot of repetitive code. And it's not at all clear exactly what you're doing. But at least you've got an XML document.
Well, no: actually, you've got an XML Document. Try to print that and all you get is:
[#document: null]
A little more reading, and you learn that you need to use the javax.xml.transform package to generate output. And if you think about that for a minute, you'll realize that what you're actually doing is running your DOM through an XSLT transformation with no stylesheet, which simply copies the input to the output:
public static String generateXML(Document doc) throws Exception {
    StringWriter out = new StringWriter();
    Transformer xform = TransformerFactory.newInstance().newTransformer();
    xform.setOutputProperty(OutputKeys.INDENT, "yes");
    xform.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
    xform.transform(new DOMSource(doc), new StreamResult(out));
    return out.toString();
}
Well, that didn't seem too bad, did it? Your ampersands are escaped, and it didn't take that many additional lines of code. Of course, there are a few things that aren't immediately apparent, such as those properties that we're setting. And why, if we're telling it to indent, is it not indented?!? (the answer is here).
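For reference, the pieces above can be combined into one self-contained program. This is a sketch rather than the article's downloadable example: the class name is made up, the document is reduced to one artist and one album, and indentation is left off so the output does not depend on the transformer implementation.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

final class DomOutputSketch {
    private DomOutputSketch() {}

    /** Builds a one-artist, one-album document and serializes it to a string. */
    static String buildAndSerialize(String artist, String album) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element root = doc.createElement("music");
            doc.appendChild(root);

            Element eArtist = doc.createElement("artist");
            eArtist.setAttribute("name", artist);   // raw value; no manual escaping
            root.appendChild(eArtist);

            Element eAlbum = doc.createElement("album");
            eAlbum.setTextContent(album);            // raw value; no manual escaping
            eArtist.appendChild(eAlbum);

            // A Transformer created without a stylesheet copies input to output.
            Transformer xform = TransformerFactory.newInstance().newTransformer();
            xform.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            xform.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("XML generation failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(buildAndSerialize("Becker & Fagan", "The Raw & The Cooked"));
    }
}
```

The point to notice is that the ampersands go in as plain Java strings and the serializer takes responsibility for escaping them.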
XmlBuilder
At this point, you're probably thinking “there's gotta be a better way!”
There are several. One popular approach is JDOM, which provides a simplified interface for building an in-memory representation of an XML document, and a configurable output generator. Another alternative is to write a set of utility methods to simplify access to the DOM that's built into the JDK.
I've used both, but ultimately I was driven toward an even simpler solution to support unit tests. One of the hallmarks of a good test is that it be easy to read. When you build XML programmatically, even if you have a simplified API like JDOM, it can be difficult to discern the structure in the data. So, with readability as a goal, I ended up with XmlBuilder.
ElementNode root =
    element("music",
        element("artist",
            attribute("name", "Anderson, Laurie"),
            element("album", text("Big Science")),
            element("album", text("Mister Heartbreak")),
            element("album", text("Strange Angels"))),
        element("artist",
            attribute("name", "Becker & Fagan"),
            element("album", text("The Collection"))),
        element("artist",
            attribute("name", "Fine Young Cannibals"),
            element("album", text("The Raw & The Cooked"))));
Your first thought might be that that's enough parentheses to keep a LISP programmer happy. It isn't, although it does highlight a former coworker's comment that XML is just a more verbose way to represent S expressions.
What makes this code possible are two features that arrived in JDK 1.5: static imports and varargs. The various builder methods, such as element(), are static methods on XmlBuilder. With JDK 1.5, we can use these methods without prefixing them with the classname, provided that we specify their use in a static import statement. Varargs come into play with the element() method, which can take any number of child nodes. In JDK 1.4 and before, you would have to explicitly create the array of child nodes. Together, these language features lead to a very clean way to build XML.
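To show how static methods and varargs combine into this style, here is a toy builder layered on the JDK's DOM. It is not the XmlBuilder implementation: MiniBuilder, Part, and document() are hypothetical names invented for this sketch, and it uses lambdas (Java 8) for brevity where a JDK 1.5 version would need small classes.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

final class MiniBuilder {
    private MiniBuilder() {}

    /** Anything that knows how to attach itself to a parent element. */
    interface Part {
        void attachTo(Element parent);
    }

    static Part attribute(String name, String value) {
        return parent -> parent.setAttribute(name, value);
    }

    static Part text(String content) {
        return parent -> parent.appendChild(
                parent.getOwnerDocument().createTextNode(content));
    }

    /** Varargs let the caller list any number of children inline. */
    static Part element(String name, Part... children) {
        return parent -> {
            Element self = parent.getOwnerDocument().createElement(name);
            parent.appendChild(self);
            for (Part child : children) {
                child.attachTo(self);
            }
        };
    }

    /** Creates the DOM document and its root, then attaches the children. */
    static Document document(String rootName, Part... children) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element root = doc.createElement(rootName);
            doc.appendChild(root);
            for (Part child : children) {
                child.attachTo(root);
            }
            return doc;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("no DOM implementation available", e);
        }
    }

    public static void main(String[] args) {
        // Inside the class no prefix is needed; other classes would get the
        // same effect with a static import of these methods.
        Document doc = document("music",
                element("artist",
                        attribute("name", "Becker & Fagan"),
                        element("album", text("The Collection"))));
        System.out.println(doc.getDocumentElement().getTagName());
    }
}
```

The nesting of the calls mirrors the nesting of the XML, which is the readability benefit described above.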
In addition to the declarative approach of static methods, the ElementNode.addChild() method provides a simple imperative mechanism:
public static void appendArtist(ElementNode root, String artist, String[] albums) {
    ElementNode entry = element("artist", attribute("name", artist));
    for (String album : albums)
        entry.addChild(element("album", text(album)));
    root.addChild(entry);
}
One of the additional benefits of XmlBuilder is that it's a very compact representation: a few pointers wrapped around the actual data.
Pay Attention to Encodings
There's still one piece of the picture that needs to be examined: the XML prologue, and most importantly, the encoding specification therein. An XML document may use most (but not all!) of the characters defined by Unicode. However, Unicode defines far more characters than fit in a single byte (a Java char alone is 16 bits), while disk files and network sockets deal with individual bytes. Something has to translate between the two, and that something is the encoding.
XML parsers are required to accept the UTF-8 encoding as a default (also UTF-16, but we'll ignore that as it isn't a byte-oriented encoding). They may support other encodings, which can be specified via an optional prologue as the first line of an XML document:
<?xml version="1.0" encoding="ISO-8859-1" ?>
This prologue specifies that the file should be processed using the ISO-8859-1 (Latin-1) character encoding. This encoding uses one byte per character, and consists of the ASCII characters and a selection of accented characters and symbols. However, it doesn't have mathematical symbols, or Greek characters, or Chinese characters … or any of the other million-plus characters that can be represented by Unicode. You can still use such characters in your content, but they must be escaped as numeric character references: a Euro symbol, for example, must appear as "&#8364;". But you can't use that solution for element names, just text and attributes.
If you write your output to a StreamResult that wraps an OutputStream, you'll have no problem. Where you will have problems is when you write your output using a StringWriter, then separately write that string to an OutputStream.
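The safe path might be sketched as follows, using the JDK transformer. The class name, helper methods, and the "price" element are made up for illustration; the encoding is requested with OutputKeys.ENCODING, and the result is parsed back to confirm that a character missing from ISO-8859-1 survives the trip.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

final class EncodedOutputSketch {
    private EncodedOutputSketch() {}

    /** Serializes straight to bytes; the transformer handles the encoding. */
    static byte[] toBytes(Document doc, String encoding) {
        try {
            Transformer xform = TransformerFactory.newInstance().newTransformer();
            xform.setOutputProperty(OutputKeys.ENCODING, encoding);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            xform.transform(new DOMSource(doc), new StreamResult(out));
            return out.toByteArray();
        } catch (Exception e) {
            throw new IllegalStateException("serialization failed", e);
        }
    }

    /** Builds a tiny document: a hypothetical "price" element holding text. */
    static Document priceDocument(String price) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element root = doc.createElement("price");
            root.setTextContent(price);
            doc.appendChild(root);
            return doc;
        } catch (Exception e) {
            throw new IllegalStateException("document creation failed", e);
        }
    }

    /** Parses the bytes back and returns the root element's text. */
    static String rootText(byte[] xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().parse(new ByteArrayInputStream(xml));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
    }

    public static void main(String[] args) {
        // \u20AC is the Euro sign, which ISO-8859-1 cannot represent directly.
        byte[] xml = toBytes(priceDocument("\u20AC 10"), "ISO-8859-1");
        System.out.println(rootText(xml).equals("\u20AC 10"));
    }
}
```

Because the serializer knows the target encoding, it can write the declaration to match and fall back to a character reference for anything the encoding lacks.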
This is because Java strings use Unicode characters, but Java streams use bytes. And unfortunately, it's very easy to wrap a stream with an object that accepts strings, without having any idea what encoding is being used: if you don't specify an encoding, Java provides a default. More unfortunately, this default encoding is meant to provide output to the user's console, so it differs by platform, and even by how the user has configured his or her environment. My home Linux system, for example, uses UTF-8, the Linux system I use at work goes with ISO-8859-1, and my Windows XP system uses “windows-1252” — an 8-bit encoding that's almost, but not quite the same as ISO-8859-1.
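A small sketch of the difference, with a hypothetical class and method name: the same one-character string becomes different bytes depending on the charset, so anything that relies on the platform default is writing bytes the prologue may not describe.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class CharsetSketch {
    private CharsetSketch() {}

    /** Writes the string through a Writer that uses an explicit charset. */
    static byte[] encode(String s, Charset charset) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(bytes, charset)) {
            w.write(s);
        } catch (IOException e) {
            throw new IllegalStateException("in-memory write failed", e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        String eAcute = "\u00E9";  // e with acute accent

        // What "new OutputStreamWriter(stream)" would silently use here:
        System.out.println("platform default: " + Charset.defaultCharset());

        System.out.println("UTF-8 bytes:      " + encode(eAcute, StandardCharsets.UTF_8).length);
        System.out.println("ISO-8859-1 bytes: " + encode(eAcute, StandardCharsets.ISO_8859_1).length);
    }
}
```

Passing the charset explicitly, as encode() does, is what removes the dependence on the environment.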
There are several solutions to this problem. The first, of course, is to never use anything but ASCII in your XML files. This works, but is not very useful in a global context: sooner or later you will need to use data that is outside of the 95 printable characters provided by ASCII.
The second solution is to always ensure that, if you're writing a string containing XML, you write it using UTF-8 encoding. Seems simple enough, but you don't always have control over the stream (particularly if it's handed to you as a Writer). And all it takes is one mistake, one forgotten or incorrect prologue, to send bad data.
The third, and in my opinion best, solution is to never store XML data in a Java String other than for logging or debugging. Instead, keep it in a DOM or other internal representation, and use the standard output code to translate into a stream-appropriate encoding. This, of course, brings us back to XmlBuilder, which supports writing directly to a stream with whatever encoding you choose:
ElementNode root = element( // ...
OutputStream out = // ...
// ...
root.toStream(out, "ISO-8859-1");
out.close();
One More Thing
What happens if your XML content happens to contain a backspace character (“\b”)? Well, every transformer that I've used will happily encode that character as an entity: &#8;. And the Xerces parser used by the Sun JDK will reject that entity as an illegal character!
That's right, backspaces aren't legal in XML version 1.0. In fact, most of the ASCII control characters aren't legal: the only legal characters are tab (“\t”), newline (“\n”), and carriage return (“\r”). All other characters in the range \u0000 to \u001f are illegal. What makes this particularly annoying is that the Latin-1 control codes (characters \u0080 to \u009f) are legal!
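The rule comes from the Char production in the XML 1.0 specification, and it is short enough to check in code before the data reaches a serializer. The class and method names below are hypothetical, and silently dropping illegal characters is only one possible policy; rejecting the input may be more appropriate for your data.

```java
final class XmlCharsSketch {
    private XmlCharsSketch() {}

    /** True if the code point is allowed by the XML 1.0 Char production. */
    static boolean isLegalXml10(int codePoint) {
        return codePoint == 0x9 || codePoint == 0xA || codePoint == 0xD
                || (codePoint >= 0x20 && codePoint <= 0xD7FF)
                || (codePoint >= 0xE000 && codePoint <= 0xFFFD)
                || (codePoint >= 0x10000 && codePoint <= 0x10FFFF);
    }

    /** Returns the input with every illegal code point removed. */
    static String stripIllegal(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        s.codePoints()
                .filter(XmlCharsSketch::isLegalXml10)
                .forEach(sb::appendCodePoint);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(isLegalXml10('\b'));    // backspace
        System.out.println(isLegalXml10('\t'));    // tab
        System.out.println(isLegalXml10(0x85));    // a Latin-1 control code
        System.out.println(stripIllegal("Big\bScience"));
    }
}
```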
This is a rather gaping hole in the spec, and it was resolved with XML 1.1, so one solution is to specify that version in a prologue and hope that the recipients of your document use a parser that pays attention to the prologue (the Xerces parser does). Or, recognizing that such characters don't often appear in normal text, you could encode them using a non-XML mechanism such as Base64.
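The Base64 option could look like the sketch below, which uses java.util.Base64 from Java 8. The class and method names are hypothetical, and the approach only works if sender and receiver agree out of band on which elements are encoded and on the character encoding used underneath (UTF-8 here).

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

final class Base64FieldSketch {
    private Base64FieldSketch() {}

    /** Encodes arbitrary text so that only XML-safe ASCII characters remain. */
    static String encode(String raw) {
        return Base64.getEncoder()
                .encodeToString(raw.getBytes(StandardCharsets.UTF_8));
    }

    /** Reverses encode(); the receiver must know the field is Base64. */
    static String decode(String encoded) {
        return new String(Base64.getDecoder().decode(encoded), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String withBackspace = "a\bb";
        String safe = encode(withBackspace);
        System.out.println(safe);
        System.out.println(decode(safe).equals(withBackspace));
    }
}
```

The cost is that the content is no longer readable in the raw XML, so this is best reserved for fields where control characters are genuinely expected.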
Conclusion
Contrary to appearances, this article was not intended to scare you into downloading my XmlBuilder, although that certainly provides solutions to many of the problems of XML output. Instead, it was driven by some of the questions that I've seen on Java XML forums, and the mistakes that I've seen in production code. The take-away should be that XML isn't as easy as it looks, but that the proper tools make it much easier.
For More Information
Compilable examples for this article:
The last example requires the Practical XML library. In addition to the XML Builder, this library provides a set of utility classes designed to simplify life with the JDK's XML API. I've developed it over the past dozen or so years, based on actual need in professional projects (actually, I developed several predecessor libraries that were owned by the companies I worked for, then decided that I didn't want to develop the same code ever again).
The XML Specification is, to say the least, dense. Tim Bray, one of the original editors, produced the Annotated XML Specification, which provides internal hyperlinks and commentary. It is my first source for questions about legal XML.
Copyright © Keith D Gregory, all rights reserved