Java Serialization
Originally published: 2014-02-17
Last updated: 2015-12-10
Java's built-in serialization mechanism has a bad reputation; everybody
seems to have a story of serialization gone wrong. But I don't think it
deserves that reputation. Yes, it's verbose, but that only matters in
high-volume applications. Yes, it's finicky about versioning, but if you take
care you can add or remove fields without pain. And yes, there are quirks in
ObjectOutputStream
that can cause memory leaks and
incorrect data, but that's a matter of knowing how the stream works.
Serialization Basics
For objects built from primitives and other serializable objects (which includes
most of the “data” objects from the JDK), you enable serialization
simply by implementing the Serializable
marker interface.
public class BasicSerializableClass implements Serializable {
    private static final long serialVersionUID = 1L;

    private int ival;
    private String sval;

    public BasicSerializableClass(int i, String s) {
        ival = i;
        sval = s;
    }

    public int getIval() { return ival; }
    public String getSval() { return sval; }
}
The second piece of serialization is the pair of streams:
ObjectOutputStream
to write your objects,
and ObjectInputStream
to read them. Object
streams are decorators for an underlying input or output stream. In this example
I used a file, because I wanted to preserve the serialized data for the next
section.
public static void main(String[] argv) throws Exception {
    File tmpFile = File.createTempFile("example", ".ser");
    tmpFile.deleteOnExit();

    BasicSerializableClass orig = new BasicSerializableClass(123, "Hello, World");

    ObjectOutputStream oos = new ObjectOutputStream(new FileOutputStream(tmpFile));
    oos.writeObject(orig);
    oos.close();

    ObjectInputStream ois = new ObjectInputStream(new FileInputStream(tmpFile));
    BasicSerializableClass rslt = (BasicSerializableClass)ois.readObject();
    ois.close();

    System.out.println("result.ival = " + rslt.getIval());
    System.out.println("result.sval = " + rslt.getSval());
}
I think it's instructive to look at what is actually written to the stream. For details, see the protocol spec. However, you can get a sense from the following dump: after a prologue comes the classname of the serialized object, followed by the name, type, and value for each of its fields.
00000000  AC ED 00 05 73 72 00 3A 63 6F 6D 2E 6B 64 67 72      sr :com.kdgr
00000010  65 67 6F 72 79 2E 65 78 61 6D 70 6C 65 2E 73 65  egory.example.se
00000020  72 69 61 6C 69 7A 61 74 69 6F 6E 2E 42 61 73 69  rialization.Basi
00000030  63 53 65 72 69 61 6C 69 7A 61 62 6C 65 43 6C 61  cSerializableCla
00000040  73 73 00 00 00 00 00 00 00 01 02 00 02 49 00 04  ss           I
00000050  69 76 61 6C 4C 00 04 73 76 61 6C 74 00 12 4C 6A  ivalL  svalt  Lj
00000060  61 76 61 2F 6C 61 6E 67 2F 53 74 72 69 6E 67 3B  ava/lang/String;
00000070  78 70 00 00 00 7B 74 00 0C 48 65 6C 6C 6F 2C 20  xp   {t  Hello,
00000080  57 6F 72 6C 64                                     World
As I said at the start of this article, the serialization format is verbose: our
sample object contained 16 bytes of actual data (4 for the int,
12 for the UTF-8 encoded string) yet the serialized version takes 133 bytes.
Evolving a Serializable Object
Incompatible changes are one of the first things that people stumble over with regard to serialization: they write an object, then make some seemingly minor change to the object's class, and find they can't reload the serialized data. Which is unfortunate, because the serialization protocol is actually quite resilient to change, provided that you follow a few rules.
The first of those rules is that you must always define
serialVersionUID
. If you don't, then the serialization
mechanism creates its own value by hashing class metadata, including not just
member variables, but all of the method names and their access modifiers. If the
value written to the stream differs from the value of the destination class, you
won't be able to deserialize the object.
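If you're curious what value the mechanism would generate, you can ask ObjectStreamClass, which computes the same default hash that the serialver tool reports. Here's a minimal sketch (the class names are mine, not from the examples above) that prints the hash for a class that omits serialVersionUID:

import java.io.ObjectStreamClass;
import java.io.Serializable;

public class DefaultUidExample {
    // deliberately omits serialVersionUID, so the JVM computes one from the class metadata
    private static class NoExplicitUid implements Serializable {
        private int ival;
        private String sval;

        public String describe() { return ival + "/" + sval; }
    }

    public static void main(String[] argv) {
        ObjectStreamClass osc = ObjectStreamClass.lookup(NoExplicitUid.class);
        System.out.println("computed serialVersionUID = " + osc.getSerialVersionUID() + "L");
        // rename describe() or change its modifiers, recompile, and this value changes,
        // even though the serialized data itself would be identical
    }
}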
But if you do pay attention to versioning, you can make an extraordinary number of changes to your serializable classes, and they'll still be readable. For example, here's a new version of the class that started this article.
public class BasicSerializableClass implements Serializable {
    private static final long serialVersionUID = 1L;

    private int intval;
    private String sval;
    private BigDecimal newVal;

    public BasicSerializableClass(int i, String s, BigDecimal bd) {
        intval = i;
        sval = s;
        newVal = bd;
    }

    public int getIval() { return intval; }
    public String getString() { return sval; }
    public BigDecimal getNewVal() { return newVal; }
}
So that you don't have to page back and forth, here are the changes to this class:
- The instance variable ival was renamed to intval.
- The accessor method getSval() was renamed to getString().
- An entirely new instance variable, newVal, was added along with its accessor method.
While these changes seem extensive, they are compatible:
- ival is still in the serialized data, but since there's no instance variable with that name in the new object definition, it's ignored. The serialized value is not recoverable.
- intval is a new instance variable, so it gets a default value (0).
- newVal is also new, so it also gets a default value (null).
- getSval() and getString() are methods, so they don't affect the serialized data one way or another. You can make as many method name changes as you like.
So what constitutes an incompatible change? In practice, the only changes
that you have to worry about are changes to types: either of the object itself or
of any of its fields. For example, changing ival from int to long.
That said, your program might also consider some changes incompatible, even if
serialization doesn't. For example, if you were expecting getIval()
to return something other than zero, or getNewVal() to return
something other than null (a more likely situation). However,
if you're the person writing the program, then you have control over how it handles
incompatible data.
I'm going to finish this section with a comment on serialVersionUID
values: you'll note that I used 1L, to indicate the first version of
the class. If I make an incompatible change, I'll increment it to 2L,
and increment again for future changes. I believe that this is the easiest way to keep
track of class evolution. There are tools that will give you a hashed value, but a hash
tells you nothing about the class's history; you, the programmer, will still have to pore
through source control to see the various changes.
On the other hand, if you already wrote serialized data without an explicit value, then
you need to use the same value to deserialize the data. Before modifying the class, use
a tool such as serialver
to generate the hashed value. But for
subsequent incompatible changes, I still recommend using simple incremented numbers.
Reading and Writing Non-Serializable Components
As I said at the start of this article, many of the classes in the JDK are
serializable. So your objects can reference a BigInteger
,
or a Calendar
, or even a Class
, and
have no problem. But what if your class holds a JarFile
?
public class UnserializableObject implements Serializable {
    private static final long serialVersionUID = 1L;

    private String id;
    private JarFile jar;
Yes, this is a contrived example: there aren't a lot of real-world cases
where you'd need to do this, but it was hard to find a simple class in the
JDK that was not serializable. You're more likely to need custom
serialization when using classes from a third-party library. That said, I
have an application that uses a HashMap<String,JarFile>
as a lookup table for the dependencies in a Maven project; it's expensive to
build, and therefore I might want to serialize it as a performance optimization
for multiple runs with the same project.
As written, UnserializableObject claims to be serializable.
The compiler believes that claim; it can't verify that all potentially referenced
objects are in fact serializable. But when you try to write an instance of this
object to a stream, you'll get a NotSerializableException.
Object streams allow objects to define their own serialization methods, so if
we can find a way to preserve and reconstruct the non-serializable field, we
can prevent this exception. In the case of JarFile, this
is easy: you can retrieve the name of the file, and construct a new instance
with the same name. Of course, if the file doesn't exist when you're ready to
deserialize, then you can't read the object back; this might happen if you try
to load the serialized representation on a different machine, or as a different
user.
To make an unserializable object serializable, add the writeObject()
and readObject()
methods. Pay careful attention to the method
signatures: if you don't follow the exact signature (including making the
methods private), the streams won't call them.
private void writeObject(ObjectOutputStream out) throws IOException {
    out.writeObject(id);
    if (jar == null)
        out.writeObject(null);
    else
        out.writeObject(jar.getName());
}

private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
    id = (String)in.readObject();
    String jarName = (String)in.readObject();
    if (jarName != null) {
        jar = new JarFile(jarName);
    }
}
As I said, JarFile
provides a getName()
function, which is enough information to reconstruct the object. Note that I had to
account for the possibility that jar
was null. Also note that,
if the file doesn't exist when deserializing, the JarFile
constructor throws FileNotFoundException
, which is a subclass
of IOException
and therefore meets the signature requirements.
You will need to trap any other checked exceptions and convert them as appropriate.
I'll wrap up this section by noting that there's another way to handle custom
serialization: implement the Externalizable
interface. This gives you complete control over the process; it could be a way to
avoid the overheads introduced by the normal serialization protocol. But, frankly,
it's a pain to implement for anything other than simple data holders; you have to
handle your entire superclass hierarchy. If overhead is your concern, I suggest
switching to an alternative serialization mechanism, such as
Avro or
Protocol Buffers.
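For completeness, here's a minimal sketch of what Externalizable involves (the class name is mine): you write and read every field yourself, and you must provide a public no-argument constructor so that the stream can create the instance before calling readExternal().

import java.io.Externalizable;
import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectOutput;

public class ExternalizableExample implements Externalizable {
    private int ival;
    private String sval;

    // required: the stream instantiates the object via this constructor, then calls readExternal()
    public ExternalizableExample() { }

    public ExternalizableExample(int ival, String sval) {
        this.ival = ival;
        this.sval = sval;
    }

    @Override
    public void writeExternal(ObjectOutput out) throws IOException {
        // nothing is written for you, not even superclass state
        out.writeInt(ival);
        out.writeUTF(sval);   // note: writeUTF() rejects null, unlike writeObject()
    }

    @Override
    public void readExternal(ObjectInput in) throws IOException, ClassNotFoundException {
        ival = in.readInt();
        sval = in.readUTF();
    }
}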
When to use Transient Fields
While readObject()
and writeObject()
are useful for handling data objects that weren't designed with serialization
in mind, there are some classes that make no sense to serialize. A
MappedByteBuffer
, for example, represents a segment of
memory within the current process' address space. That mapping won't exist in
another process; you'll need to create it anew, given the raw materials of filename,
offset, and length.
You could write custom serialization and deserialization code that automatically
creates the buffer on the destination. But unlike JarFile
,
there's no way to retrieve the necessary information from the buffer itself; you
need to explicitly store the name/offset/size as instance variables. Given that,
it makes more sense to mark the buffer transient
, let the
stream handle serialization, and lazily recreate the buffer on use:
public class TransientExample implements Serializable {
    private static final long serialVersionUID = 1L;

    private File mappedFile;

    private transient MappedByteBuffer buffer;

    public TransientExample(File file) {
        this.mappedFile = file;
    }

    public MappedByteBuffer getBuffer() throws IOException {
        if (buffer == null) {
            // map buffer on first use; note that opening the file may throw
            // (one reasonable default: map the entire file read-only)
            RandomAccessFile raf = new RandomAccessFile(mappedFile, "r");
            try {
                buffer = raf.getChannel().map(FileChannel.MapMode.READ_ONLY, 0, raf.length());
            } finally {
                raf.close();
            }
        }
        return buffer;
    }
The same principle holds for any derived object: if you already store enough information to reconstruct the object on demand, mark it transient and reconstruct it as needed, rather than writing custom serialization code. It's more maintainable to let the stream ensure that all the basic fields arrive at the destination, and for you to focus on the things the stream can't.
ObjectOutputStream Object Sharing
Once you get past version errors and unserializable objects, the next biggest source of problems with Java serialization is that the object streams retain a reference to every object written. If the same object is written to the stream multiple times, the second and subsequent writes use a unique ID, rather than the actual object data. When reading, the input stream recognizes these IDs and uses the first instance.
In most use cases, this is a great feature. It reduces the amount of data sent over the stream, and it ensures that the “shape” of the data will be preserved: if your application depends on the fact that the same object exists at two places, it won't break because the serialization code reconstituted a second instance. Plus — and to me, more important — it prevents infinite recursion:
public class GraphNode implements Serializable {
    private static final long serialVersionUID = 1L;

    private List<GraphNode> incoming = new ArrayList<GraphNode>();
    private List<GraphNode> outgoing = new ArrayList<GraphNode>();
This object represents a node in a directed graph; it can have zero or more incoming connections from other nodes, and zero or more outgoing connections. If one of those connections happened to form a loop, and the output stream didn't keep track of objects already written, it would keep following those links until it ran out of stack. Clearly not optimal.
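If you want to see the shape-preservation behavior for yourself, here's a short sketch (the class name is mine): an array written twice comes back as a single shared instance, not two copies.

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;

public class SharedIdentityExample {
    public static void main(String[] argv) throws Exception {
        String[] shared = new String[] { "only one copy" };

        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        ObjectOutputStream oos = new ObjectOutputStream(bos);
        oos.writeObject(shared);
        oos.writeObject(shared);   // written as a back-reference, not a second copy
        oos.close();

        ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(bos.toByteArray()));
        Object first = ois.readObject();
        Object second = ois.readObject();
        ois.close();

        System.out.println("same instance after deserialization: " + (first == second));   // prints true
    }
}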
On the other hand, if you're using serialization as a simple way to implement a message protocol, this behavior can lead to bugs. The first happens when you have mutable objects.
public class SharedMutableObjectExample {
    private static class MyMutableObject implements Serializable {
        private static final long serialVersionUID = 1L;

        private int value;

        public void setValue(int value) {
            this.value = value;
        }

        public int getValue() {
            return value;
        }
    }

    public static void main(String[] argv) throws Exception {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        ObjectOutputStream oos = new ObjectOutputStream(bos);

        MyMutableObject obj = new MyMutableObject();
        for (int ii = 0 ; ii < 5 ; ii++) {
            obj.setValue(ii);
            oos.writeObject(obj);
        }
        oos.close();

        ByteArrayInputStream bis = new ByteArrayInputStream(bos.toByteArray());
        ObjectInputStream ois = new ObjectInputStream(bis);
        for (int ii = 0 ; ii < 5 ; ii++) {
            MyMutableObject ret = (MyMutableObject)ois.readObject();
            System.out.println("read #" + ii + " value = " + ret.getValue());
        }
    }
}
If you run this, you'll see the same value, 0, on every line of the output. This is because the stream saw that you were writing the same object over and over, so it only wrote the object's identifier, not its actual value.
One solution to the problem is to replace the call to writeObject()
with a call to writeUnshared(). This method instructs the stream to
fully serialize objects, without attempting to replace “known” objects by
their references. However, it has one critical limitation: it only writes an unshared
copy of the passed object; any referenced objects are written as
shared. So, if your base object happens to have a byte[] as one
of its instance variables, that array is only serialized once, and you'll see the same
data over and over again, as the sketch below shows.
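Here's a minimal sketch of that pitfall (the Message class and its names are mine): each writeUnshared() call writes a fresh copy of the Message, but the byte array it references is written once and back-referenced afterward, so every deserialized copy reports the array's first value.

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class UnsharedPitfallExample {
    private static class Message implements Serializable {
        private static final long serialVersionUID = 1L;

        private byte[] payload;

        public Message(byte[] payload) { this.payload = payload; }
        public byte[] getPayload() { return payload; }
    }

    public static void main(String[] argv) throws Exception {
        byte[] buf = new byte[1];
        Message msg = new Message(buf);

        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        ObjectOutputStream oos = new ObjectOutputStream(bos);
        for (int ii = 0 ; ii < 3 ; ii++) {
            buf[0] = (byte)ii;          // mutate the array between writes
            oos.writeUnshared(msg);     // the Message is unshared, but its payload is not
        }
        oos.close();

        ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(bos.toByteArray()));
        for (int ii = 0 ; ii < 3 ; ii++) {
            Message ret = (Message)ois.readObject();
            System.out.println("read #" + ii + " payload[0] = " + ret.getPayload()[0]);   // always 0
        }
        ois.close();
    }
}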
I'm sure that the JDK developers had a reason for this behavior, but to me it's just a bug waiting to happen. Rather than use mutable objects, I much prefer to create a new message for each write.
However, that highlights another issue with object sharing. Since the stream keeps a hard reference to all objects written, the garbage collector never tries to reclaim their memory. And we all know where that leads:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
Whether or not you actually see this error depends on how many messages you send over
the stream, and how much heap you've allocated. But even if you never see the exception,
the leak is still there: jconsole
will show an ever-growing heap.
And a heap dump should have
your message object(s) at the top of the instance count.
Rather than rely on writeUnshared()
, I recommend calling
reset()
after writing each message:
oos.writeObject(myUnsharableObject);
oos.reset();
Calling reset()
does two things: it clears the output stream's
table of object references, and it writes a single-byte control code onto the stream.
On the other end, the input stream sees that control code and clears its own table of
saved object references. Yes, this will add a small amount of overhead to each object
sent over the stream, but in my opinion that's a small price to pay for bug-free
communication.
Closing Thoughts
I originally titled this section “When Not To Use Serialization,” but decided that would give the wrong impression. The decision of when to use or not use serialization depends on your need for exchanging objects with non-Java systems, as well as the longevity of your objects and expected level of structural change.
It may surprise you that I don't think of performance as one of the criteria for (not) choosing Java serialization. While the stream protocol is verbose when compared to a raw binary protocol like Protocol Buffers, it just doesn't matter for most purposes: network and CPU are cheap. I've used serialization to support message rates on the order of 10,000 msg/sec/node without a problem. Plus, alternative formats require you to describe your data in an external file, violating the DRY (“don't repeat yourself”) principle.
Interoperability is a far more important reason to forego Java serialization. If you have to exchange messages between Java and non-Java applications, Java serialization will just get in your way. Yes, the stream format is published, but you'll have to write or find a tool to parse it. If you have to share data objects between Java and non-Java programs, I think the best approach is to create a project that consists of just those data objects, and to use a tool like Protocol Buffers to manage serialization.
Related to compatibility: don't store serialized data as a BLOB
in
a relational database. The whole reason for using a relational database is the
ability to relate data in different tables. If you store opaque data in the
database, you lose this ability; the database becomes little more than a filesystem
(albeit one with transactions). If you're thinking of doing this, you'll probably find
that an alternative storage mechanism (maybe a key-value store) is a better choice.
And if you do decide to store serialized data in a database, even in a key-value store, you might discover (too late) that your classes have evolved in incompatible ways and that data is no longer usable.
In my opinion, longevity is the biggest reason not to use Java serialization.
Data does change over time, and not all of those changes will be compatible. If your data
model is evolving, you should take the time to evolve your data with the model. Using
Java serialization prevents you from doing that: there's no easy way to load an instance
of MyClass
where a particular field contains an int
and write it back out with that field marked as a long
(the difficult
way involves multiple classloaders and some glue code). If you're thinking of long-term data
storage, use a real database.
With all that said, for simple messaging and preservation of data between executions of the same program, I think serialization is hard to beat.
Security
Since this article was first written, Java object serialization has been exploited as an attack vector by hackers. If you'd like details on how this works, read my blog post and the linked slide deck (which describes similar attacks using other languages). Here I will limit myself to a short description of the problem, and some steps that you can take to prevent becoming a victim.
The root causes of the attack are simple:
- The program deserializes untrusted data.
- Somewhere on the classpath is a class that allows execution of code specified as data.
Of these, you can only reasonably control the first, but it's important to understand the second to avoid uninformed decisions.
To deserialize an object, your program must be able to load that class' bytecode from somewhere on the classpath. This requirement may give you a false sense of security: after all, you're not going to write a class that sends sensitive data to a hacker. However, most applications today don't consist solely of classes that you've written; they include dozens — and, after transitive dependencies, perhaps hundreds — of external libraries. So the real question becomes: do any of those libraries have exploitable classes?
As it turns out, many libraries do. In the case that I examined,
Apache Commons Collections provided two classes that could be used for an exploit: LazyMap
,
which is a Map
that uses factories to retrieve objects, and
InvokerTransformer
, which is a factory that uses reflection
to create objects. This allowed arbitrary code to be executed just by calling
get()
on the map.
It's important to understand that commons-collections is not a bad library because it has these classes. The classes are very useful, as is the library as a whole. Which is why it's present on the classpath of many applications. And there are many other good libraries that have similar exploitable classes, just waiting for a hacker with incentive to find them.
The real problem — and fortunately, the one you can prevent — is deserializing untrusted data. The definition of untrusted data is simple: any data that you didn't create and control throughout its lifetime. Here are some examples:
- If you want to preserve your application state by serializing it, writing to a file on your server, and later deserializing, you're OK. Unless, of course, someone untrusted has gained access to your server, in which case deserialization is the least of your worries.
- If you're writing base64-encoded serialized data into a form in a web page, or a cookie, and then deserializing it when the form is submitted, you're vulnerable. Fortunately, you can close this vulnerability by using base64-encoded, encrypted, serialized data; as long as encryption and decryption happen on the server, with a secret key, there's no way to modify the payload (a sketch of this approach appears after the list).
- If you expose an RMI server to the open Internet, you need to shut it down, right now. RMI is entirely built on serialization, and there's no way to prevent untrusted clients from submitting whatever they want. Fortunately, not many people use RMI for client-server communication. Unfortunately, tools such as JMX do. There's value in JMX, but you need to control access to that server.
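Here is one way to implement the encrypted-payload suggestion from the second bullet. It's a sketch under my own assumptions (the class and method names are mine), using AES in GCM mode so the payload is authenticated as well as encrypted: a tampered payload fails to decrypt rather than reaching the deserializer.

import java.io.ByteArrayOutputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.security.SecureRandom;
import java.util.Base64;
import javax.crypto.Cipher;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;

public class SealedPayloadExample {
    // serialize the object, encrypt it with a server-side key, and base64-encode the result
    // (requires Java 8 for java.util.Base64 and AES/GCM support)
    public static String seal(Serializable obj, SecretKey key) throws Exception {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        ObjectOutputStream oos = new ObjectOutputStream(bos);
        oos.writeObject(obj);
        oos.close();

        byte[] iv = new byte[12];                   // GCM needs a unique IV per message
        new SecureRandom().nextBytes(iv);

        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(128, iv));
        byte[] ciphertext = cipher.doFinal(bos.toByteArray());

        // prepend the IV so that the server can initialize the cipher when decrypting
        byte[] payload = new byte[iv.length + ciphertext.length];
        System.arraycopy(iv, 0, payload, 0, iv.length);
        System.arraycopy(ciphertext, 0, payload, iv.length, ciphertext.length);
        return Base64.getEncoder().encodeToString(payload);
    }
}

The matching unseal operation would base64-decode the payload, split off the IV, decrypt with the same key, and only then hand the plaintext to an ObjectInputStream.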
In my blog post, I also talked about how the vulnerability was exploited using
a class that defined a member variable as a Map
, rather
than using a concrete class. To the extent that you can, I recommend sticking
to concrete members in serializable classes. However, doing so does not guarantee
that you'll be safe, because you have no control over the variables that class or
its dependencies define.
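If you can't avoid reading serialized data from a source you don't fully control, one partial defense is to restrict which classes the stream will resolve. Here's a minimal sketch (the class name is mine) that overrides ObjectInputStream's resolveClass() to enforce a whitelist:

import java.io.IOException;
import java.io.InputStream;
import java.io.InvalidClassException;
import java.io.ObjectInputStream;
import java.io.ObjectStreamClass;
import java.util.Set;

public class WhitelistObjectInputStream extends ObjectInputStream {
    private final Set<String> allowedClasses;

    public WhitelistObjectInputStream(InputStream in, Set<String> allowedClasses) throws IOException {
        super(in);
        this.allowedClasses = allowedClasses;
    }

    @Override
    protected Class<?> resolveClass(ObjectStreamClass desc) throws IOException, ClassNotFoundException {
        // the whitelist must include every class in the object graph, including array
        // types (which have names like "[B") and the classes of all referenced fields
        if (!allowedClasses.contains(desc.getName())) {
            throw new InvalidClassException(desc.getName(), "not on the deserialization whitelist");
        }
        return super.resolveClass(desc);
    }
}

This limits which classes can be instantiated from the stream, but it doesn't make untrusted data safe by itself.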
Bottom line: don't deserialize untrusted data.
For More Information
If you want more information about the mechanics of serialization, I recommend reading the Object Serialization Stream Protocol specification. It's a relatively simple protocol, and is useful for answering questions of the form “what should I expect if…”
The examples from this article are available as compilable programs. Note that you might need external libraries (eg, Apache Commons IO).
- BasicExample: writes an object to a file and reads it back in.
- BasicSerializableClass: the object written by that example.
- BasicSerializableClass-Revised: a variant of BasicSerializableClass with multiple modifications. Note that the filename does not correspond to the classname; you'll need to rename it before use. Run BasicExample with the original example class, then recompile with the changed class.
- CustomSerializationExample: opens your installation's rt.jar and attempts to serialize a reference to it.
- UnserializableObject: used by the previous example to hold a reference to a JarFile. As provided, it causes an exception when written to the output stream. Uncomment the custom serialization code to make it work.
- SharedMutableObjectExample: demonstrates how, by default, only one copy of each object instance is written to the stream.
- SharedObjectMemoryLeakExample: demonstrates an out-of-memory condition caused by the object streams' retention of instances. To compile and run, you'll need Apache Commons-IO in your classpath.
Copyright © Keith D Gregory, all rights reserved