
Java Reference Objects

or
How I Learned to Stop Worrying and Love OutOfMemoryError

Introduction

I started programming with Java in 2000, after fifteen years with C and C++. I thought myself fairly competent at C-style memory management, using coding practices such as pointer handoffs, and tools such as Purify. I couldn't remember the last time I had a memory leak. So it was with some measure of disdain that I approached Java's automatic memory management … and quickly fell in love. I hadn't realized just how much mental effort was expended in memory management, until I didn't have to do it any more.

And then I met my first OutOfMemoryError. Just sitting there on the console, with no accompanying stack trace … because stack traces require memory! Debugging that error was tough, because the usual tools just weren't available, not even a malloc logger. And the state of Java debuggers in 2000 was, to say the least, primitive.

I can't remember what caused that first error, and I certainly didn't resolve it using reference objects. They didn't enter my toolbox until about a year later, when I was writing a server-side database cache and tried using soft references to limit the cache size. Turned out they weren't too useful there, either, for reasons that I'll discuss below. But once reference objects were in my toolbox, I found plenty of other uses for them, and gained a better understanding of the JVM as well.

The Java Heap and Object Life Cycle

For a C++ programmer new to Java, the relationship between stack and heap can be hard to grasp. In C++, objects may be created on the heap using the new operator, or on the stack using "automatic" allocation. The following is legal C++: it creates a new Integer object on the stack. A Java compiler, however, will reject it as a syntax error.

Integer foo = Integer(1);

Java, unlike C++, stores all objects on the heap, and requires the new operator to create the object. Local variables are still stored on the stack, but they hold a pointer to the object, not the object itself (and of course, to confuse C++ programmers more, these pointers are called “references”). Consider the following Java method, which has an Integer variable that references a value parsed from a String:

public static void foo(String bar)
{
    Integer baz = new Integer(bar);
}

The diagram below shows the relationship between the heap and stack for this method. The stack is divided into “frames,” which contain the parameters and local variables for each method in the call tree. Those variables that point to objects — in this case, the parameter bar and the local variable baz — point at objects living in the heap.

[Diagram: relationship between stack and heap]

Now look more closely at the first line of foo(), which allocates a new Integer object. Behind the scenes, the JVM first attempts to find enough heap space for this object — approximately 12 bytes. If able to allocate the space, it then calls the constructor, which parses the passed string and initializes the newly-allocated object. Finally, the JVM stores a pointer to that object in the variable baz.

That's the “happy path.” There are several not-so-happy paths, and the one that we care about is when the new operator can't find those 12 bytes for the object. In that case, before giving up and throwing an OutOfMemoryError, it invokes the garbage collector in an attempt to make room.

Garbage Collection

While Java gives you a new operator to allocate objects on the heap, it doesn't give you a corresponding delete operator to remove them. When method foo() returns, the variable baz goes out of scope but the object it pointed to still exists on the heap. If this were the end of the story, all programs would quickly run out of memory. Java, however, provides a garbage collector to clean up these objects once they're no longer referenced.

The garbage collector goes to work when the program tries to create a new object and there isn't enough space for it in the heap. The requesting thread is suspended while the collector looks through the heap, trying to find objects that are no longer actively used by the program, and reclaiming their space. If the collector is unable to free up enough space, and the JVM is unable to expand the heap, the new operator fails with an OutOfMemoryError. This is normally followed by your application shutting down.

Mark-Sweep

One of the enduring myths of Java revolves around the garbage collector. Many people believe that the JVM keeps a reference count for each object, and the collector only looks at objects whose reference count is zero. In reality, the JVM uses a technique known as “mark-sweep.” The idea behind mark-sweep garbage collection is simple: every object that can't be reached by the program is garbage, and is eligible for collection.

Mark-sweep collection has the following phases:

Phase 1: Mark

The garbage collector starts from “root” references, and walks through the object graph marking all objects that it reaches.

[Diagram: heap with live objects marked]
Phase 2: Sweep

Anything that hasn't been marked in the first phase is unreachable, and therefore, garbage. If a garbage object has a finalizer defined, it's added to the finalization queue (more about that later). If not, its space is made available for re-allocation (exactly what that means depends on the specific GC implementation, and there are many implementations).

[Diagram: heap with dead objects removed]
Phase 3: Compact (optional)

Some collectors have a third step: compaction. In this step, the GC moves objects to coalesce free space left behind by the collected garbage. This prevents the heap from becoming fragmented, which can cause large contiguous allocations to fail.

The Hotspot JVM, for example, uses a compacting collector for its “young” generation, but a non-compacting collector (at least in the 1.6 and 1.7 “server” JVMs) for its “tenured” generation. For more information, see the references at the end of this article.

[Diagram: heap after compaction]

So what are the "roots"? In a simple Java application, they're method arguments and local variables (stored on the stack), the operands of the currently executing expression (also stored on the stack), and static class member variables.
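
To make the static-variable case concrete, consider the following minimal sketch (the class and field names are hypothetical). Any object reachable from a static field remains strongly referenced for as long as the class stays loaded.

public class GlobalRegistry
{
    // a static field is a root reference; everything reachable from it
    // remains strongly referenced until the class is unloaded
    private static final List<String> ENTRIES = new ArrayList<String>();

    public static void remember(String value)
    {
        ENTRIES.add(value);   // value can no longer be collected, even after the caller forgets it
    }
}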

In programs that use their own classloaders, such as app-servers, the picture gets muddy: only classes loaded by the system classloader (the loader used by the JVM when it starts) contain root references. Any classloaders that the application creates are themselves subject to collection, once there are no more references to them. This is what allows app-servers to hot-deploy: they create a separate classloader for each deployed application, and let go of the classloader reference when the application is undeployed or redeployed.

It's important to understand root references, because they define what a "strong" reference is: if you can follow a chain of references from a root to a particular object, then that object is "strongly" referenced. It will not be collected.

So, returning to method foo(), the parameter bar and local variable baz are strong references only while the method is executing. Once it finishes, they both go out of scope, and the objects they referenced are eligible for collection. In the real world, foo() would probably return a reference to the object, meaning that it would remain strongly referenced by foo()'s caller.

Now consider the following:

LinkedList foo = new LinkedList();
foo.add(new Integer(123));

Variable foo is a root reference, which points to the LinkedList object. Inside the linked list are zero or more list elements, each of which points to its successor. When we call add(), we add a new list element, and that list element points to an Integer instance with the value 123. This is a chain of strong references, meaning that the Integer is not eligible for collection. As soon as foo goes out of scope, however, the LinkedList and everything in it are eligible for collection — provided, of course, that there are no other strong references to it.

You may be wondering what happens if you have a circular reference: object A contains a reference to object B, which contains a reference back to A. The answer is that a mark-sweep collector isn't fooled: if neither A nor B can be reached by a chain of strong references, then they're eligible for collection.
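
To make that concrete, here's a minimal sketch (the Node class is hypothetical):

class Node
{
    Node peer;
}

public static void cycleExample()
{
    Node a = new Node();
    Node b = new Node();
    a.peer = b;        // A references B
    b.peer = a;        // B references A: a reference cycle
    a = null;
    b = null;
    // no chain of strong references from a root reaches either object, so a
    // mark-sweep collector reclaims both; a reference-counting collector would leak them
}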

Finalizers

C++ allows objects to define a destructor method: when the object goes out of scope or is explicitly deleted, its destructor is called to clean up the resources it used. For most objects, this means explicitly releasing the memory that the object allocated with new or malloc. In Java, the garbage collector handles memory cleanup for you, so there's no need for an explicit destructor to do this.

However, memory isn't the only resource that might need to be cleaned up. Consider FileOutputStream: when you create an instance of this object, it allocates a file handle from the operating system. If you let all references to the stream go out of scope before closing it, what happens to that file handle? The answer is that the stream has a finalizer method: a method that's called by the JVM just before the garbage collector reclaims the object. In the case of FileOutputStream, the finalizer closes the stream, which releases the file handle back to the operating system — and also flushes any buffers, ensuring that all data is properly written to disk.

Any object can have a finalizer; all you have to do is declare the finalize() method:

protected void finalize() throws Throwable
{
    // cleanup your object here
}

While finalizers seem like an easy way to clean up after yourself, they do have some serious limitations. First, you should never rely on them for anything important, since an object's finalizer may never be called — the application might exit before the object is eligible for garbage collection. There are some other, more subtle problems with finalizers, but I'll hold off on these until we get to phantom references.
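
The practical consequence is that you should release non-memory resources explicitly, and treat the finalizer as (at best) a backstop. Here's a sketch using FileOutputStream; the file name is just an example.

public static void writeData(byte[] data) throws IOException
{
    FileOutputStream out = new FileOutputStream("example.dat");
    try
    {
        out.write(data);
    }
    finally
    {
        out.close();   // releases the file handle now, rather than whenever
                       // the collector and finalizer thread get around to it
    }
}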

Object Life Cycle (without Reference Objects)

Putting it all together, an object's life can be summed up by the simple picture below: it's created, it's used, it becomes eligible for collection, and eventually it's collected. The shaded area represents the time during which the object is "strongly reachable," a term that becomes important by comparison with the reachability provided by reference objects.

[Diagram: object life cycle, without reference objects]

Enter Reference Objects

JDK 1.2 introduced the java.lang.ref package, and three new stages in the object life cycle: softly-reachable, weakly-reachable, and phantom-reachable. These states only apply to objects eligible for collection — in other words, those with no strong references — and the object in question must be the referent of a reference object:

softly reachable
The object is the referent of a SoftReference, and there are no strong references to it. The garbage collector will attempt to preserve the object as long as possible, but will collect it before throwing an OutOfMemoryError.
weakly reachable
The object is the referent of a WeakReference, and there are no strong or soft references to it. The garbage collector is free to collect the object at any time, with no attempt to preserve it. In practice, the object will be collected during a major collection, but may survive a minor collection.
phantom reachable
The object is the referent of a PhantomReference, and it has already been selected for collection and its finalizer (if any) has run. The term “reachable” is really a misnomer in this case, as there's no way for you to access the actual object, but it's the terminology that the API docs use.

As you might guess, adding three new optional states to the object life-cycle diagram makes for a mess. Although the documentation indicates a logical progression from strongly reachable through soft, weak, and phantom, to reclaimed, the actual progression depends on what reference objects your program creates. If you create a WeakReference but don't create a SoftReference, then an object progresses directly from strongly-reachable to weakly-reachable to finalized to collected.

[Diagram: object life cycle, with reference objects]

It's also important to understand that not all objects are attached to reference objects — in fact, very few of them should be. A reference object is a layer of indirection: you go through the reference object to reach the referent, and clearly you don't want that layer of indirection throughout your code. Most programs, in fact, will use reference objects to access a relatively small number of the objects that the program creates.

References and Referents

A reference object is a layer of indirection between your program code and some other object, called a referent. Each reference object is constructed around its referent, and the referent cannot be changed.

[Diagram: relationships between application code, soft/weak reference, and referent]

The reference object provides the get() method to retrieve a strong reference to its referent. The garbage collector may reclaim the referent at any point; once this happens, get() returns null. The following code shows this in action:

SoftReference<List<Foo>> ref = new SoftReference<List<Foo>>(new LinkedList<Foo>());

// create some Foos, probably in a loop
List<Foo> list = ref.get();
if (list == null)
    throw new RuntimeException("ran out of memory");
list.add(foo);

There are a few important things to note about this code:

  1. You must always check to see if the referent is null
    The garbage collector can clear the reference at any time, and if you blithely use the reference, sooner or later you'll get a NullPointerException.
  2. You must hold a strong reference to the referent to use it
    Again, the garbage collector can clear the reference at any time, even in the middle of a single expression. If you call get() once to check for null, and then call get() again to use the reference, it might be cleared between those calls.
  3. You must hold a strong reference to the reference object
    If you create a reference object, but allow it to go out of scope, then the reference object itself will be garbage-collected. Seems obvious, but it's easy to forget, particularly when you're using reference queues to track when the reference objects get cleared.

Also remember that soft, weak, and phantom references only come into play when there are no more strong references to the referent. They exist to let you hold onto objects past the point where they'd normally become food for the garbage collector. This may seem like a strange thing — if you no longer hold a strong reference, why would you care about the object? The reason depends on the specific reference type.

Soft References

We'll start to answer that question with soft references. If there are no strong references to an object but it is the referent of a SoftReference, then the garbage collector is free to reclaim the object but will try not to. You can tune the garbage collector to be more or less aggressive at reclaiming softly-referenced objects.

The JDK documentation says that this is appropriate for a memory-sensitive cache: each of the cached objects is accessed through a SoftReference, and if the JVM decides that it needs space, then it will clear some or all of the references and reclaim their referents. If it doesn't need space, then the referents remain in the heap and can be accessed by program code. In this scenario, the referents are strongly referenced when they're being actively used, softly referenced otherwise. If a soft reference gets cleared, you need to refresh the cache.
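
A minimal sketch of such a cache follows; the key and value types, and the loader callback, are placeholders for whatever your application actually caches.

public class SoftCache<K,V>
{
    private final Map<K,SoftReference<V>> _map = new HashMap<K,SoftReference<V>>();

    public synchronized V get(K key, Callable<V> loader) throws Exception
    {
        SoftReference<V> ref = _map.get(key);
        V value = (ref != null) ? ref.get() : null;     // null if never cached, or cleared by the GC
        if (value == null)
        {
            value = loader.call();                       // rebuild the expensive object
            _map.put(key, new SoftReference<V>(value));  // note: stale reference objects linger until overwritten
        }
        return value;                                    // the caller's strong reference keeps it alive while in use
    }
}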

To be useful in this role, however, the cached objects need to be pretty large — on the order of several kilobytes each. Useful, perhaps, if you're implementing a fileserver that expects the same files to be retrieved on a regular basis, or have large object graphs that need to be cached. But if your objects are small, then you'll have to clear a lot of them to make a difference, and the reference objects will add overhead to the whole process.

Soft Reference as Circuit Breaker

A better use of soft references is to provide a "circuit breaker" for memory allocation: put a soft reference between your code and the memory it allocates, and you avoid the dreaded OutOfMemoryError. This technique works because memory allocation tends to be localized within the application: you're reading rows from a database, or processing data from a file.

For example, if you write a lot of JDBC code, you might have a method like the following to process query results in a generic way and ensure that the ResultSet is properly closed. It only has one small flaw: what happens if the query returns a million rows?

public static List<List<Object>> processResults(ResultSet rslt)
throws SQLException
{
    try
    {
        List<List<Object>> results = new LinkedList<List<Object>>();
        ResultSetMetaData meta = rslt.getMetaData();
        int colCount = meta.getColumnCount();

        while (rslt.next())
        {
            List<Object> row = new ArrayList<Object>(colCount);
            for (int ii = 1 ; ii <= colCount ; ii++)
                row.add(rslt.getObject(ii));

            results.add(row);
        }

        return results;
    }
    finally
    {
        closeQuietly(rslt);
    }
}

The answer, of course, is an OutOfMemoryError, unless you have a gigantic heap or tiny rows. It's the perfect place for a circuit breaker: if the JVM runs out of memory while processing the query, release all the memory that it's already used, and throw an application-specific exception.

At this point, you may wonder: who cares? The query is going to abort in either case, why not just let the out-of-memory error do the job? The answer is that your application may not be the only thing affected. If you're running on an application server, your memory usage could take down other applications. Even in an unshared environment, a circuit-breaker improves the robustness of your application, because it confines the problem and gives you a chance to recover and continue.

To create the circuit breaker, the first thing you need to do is wrap the results list in a SoftReference (you've seen this code before):

    SoftReference<List<List<Object>>> ref
        = new SoftReference<List<List<Object>>>(new LinkedList<List<Object>>());

And then, as you iterate through the results, create a strong reference to the list only when you need to update it:

while (rslt.next())
{
    rowCount++;
    List<Object> row = new ArrayList<Object>(colCount);
    for (int ii = 1 ; ii <= colCount ; ii++)
        row.add(rslt.getObject(ii));

    List<List<Object>> results = ref.get();
    if (results == null)
        throw new TooManyResultsException(rowCount);
    else
        results.add(row);
    results = null;
}

This works because almost all of the method's memory allocation happens in two places: the call to next(), and the loop that calls getObject(). In the first case, there's a lot that happens when you call next(): the ResultSet typically retrieves a large block of binary data, containing multiple rows. Then, when you call getObject(), it extracts a piece of that data and wraps it in a Java object.

While those expensive operations happen, the only reference to the list is via the SoftReference. If you run out of memory, the reference will be cleared and the list will become garbage. The method still throws an exception, but the effect of that exception is confined, and perhaps the calling code can retry the query with a retrieval limit.

Once the expensive operations complete, you can hold a strong reference to the list with relative impunity. Note that I use a LinkedList: I know that linked lists grow in increments of a few dozen bytes, which is unlikely to trigger OutOfMemoryError. By comparison, if an ArrayList needs to increase its capacity, it must create a new array to do so. In a large list, this could mean megabytes.

Also note that I set the results variable to null after adding the new element; this is one of the few cases where doing so is justified. Although the variable goes out of scope at the end of the loop, the garbage collector does not know that (because there's no reason for the JVM to clear the variable's slot in the call stack). So, if I didn't clear the variable, it would be an unintended strong reference during the subsequent pass through the loop.

Soft References Aren't A Silver Bullet

While soft references can prevent many out-of-memory conditions, they can't prevent all of them. The problem is this: in order to actually use a soft reference, you have to create a strong reference to the referent; to add a row to the results, we need a reference to the actual list. During the time we hold that strong reference, we are at risk for an out-of-memory condition. In this example we store the pointer to the list in a local variable, but even if we just used the value directly in an expression, it would be a strong reference for the duration of that expression.

The goal with a circuit breaker is to minimize the window during which it's useless: the time that you hold a strong reference to the object, and perhaps more important, the amount of allocation that happens during this time. In our case, we confine the strong reference to adding a row to the results, and we use a LinkedList rather than an ArrayList because the former grows in much smaller increments.

Also note that in the example, we hold the strong reference in a variable that quickly goes out of scope. However, the language spec says nothing about the JVM being required to clear variables that go out of scope, and in fact the Sun JVM does not do so. If we didn't explicitly clear the results variable, it would remain a strong reference throughout the loop, acting like a penny in a fuse box, and preventing the soft reference from doing its job.

There are some cases where you just can't make the window small enough. For example, let's say that you wanted to process a ResultSet into a DOM Document. You would have to dereference the document after every call to getObject(), and you might just find that the memory usage to create a new Element is large enough to push you into an OutOfMemoryError (although there are techniques, such as pre-allocating a sacrificial buffer, which may help).

Finally, think carefully about the strong references that you hold. For example, a DOM is typically processed recursively, and you might think of a recursive solution to adding rows, passing in the parent node. However, method arguments are strong references. And in a DOM, a reference to any node is the start of a chain of references to every other node — so if you pass a node into a method, you have just created a long-lasting strong reference to the entire DOM tree.

Weak References

A weak reference is, as its name suggests, a reference that doesn't even try to put up a fight to prevent its referent from being collected. If there are no strong or soft references to the referent, it's pretty much guaranteed to be collected.
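
You can see this behavior with a few lines of code. The result isn't strictly guaranteed by the spec, but on most JVMs an explicit System.gc() is enough to clear the reference.

WeakReference<byte[]> ref = new WeakReference<byte[]>(new byte[1024]);
System.out.println("before GC: " + ref.get());   // almost certainly non-null
System.gc();                                     // only a hint, but it usually triggers a collection
System.out.println("after GC:  " + ref.get());   // usually null: nothing else referenced the array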

So what's the use? There are two main uses: associating objects that have no inherent relationship, and reducing duplication via a canonicalizing map. The first case is best illustrated with a counter-example: ObjectOutputStream.

The Problem With ObjectOutputStream

When you write objects to an ObjectOutputStream, it maintains a strong reference to the object, associated with a unique ID, and writes that ID to the stream along with the object's data. This has two benefits if you later write the same object to the same stream: you save bandwidth, because the output stream only needs to send the ID, and you preserve object identity on the other end.

Unfortunately, it's also a form of memory leak, since the stream holds onto the source object forever — or at least until you close the stream or call reset() on it. If you're using object streams simply as a means to move objects, and aren't concerned about preserving identity or reducing bandwidth, then you quickly learn to call reset() on a regular basis.
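
For example, a sender that streams a long series of messages might reset after every batch. In this sketch, the Message type, the source of the output stream, and the batch size are all arbitrary.

ObjectOutputStream out = new ObjectOutputStream(socket.getOutputStream());
int count = 0;
for (Message msg : messages)
{
    out.writeObject(msg);
    if (++count % 100 == 0)
        out.reset();      // discards the stream's strong references to previously written objects
}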

If the ObjectOutputStream instead held the source object via a WeakReference, the problem wouldn't happen: when the object went out of scope in the program code, the collector could reclaim the object. Since there would be no way that it could ever be written to the stream again, there's no reason for the stream to hold onto it. Better, the ObjectOutputStream could notify the ObjectInputStream that the object is no longer valid, eliminating memory leaks on the receiving side.

Unfortunately, although the object stream protocol was updated with the 1.2 JDK, and weak references were added at the same time, the JDK developers didn't think to combine them.

Using WeakHashMap to Associate Objects

To be honest, I don't believe there are many cases where you should associate two objects that don't have an inherent relationship. Either the objects should have a composition relationship, and be collected together, or they should have an aggregation relationship and be collected separately.

However, this rule breaks down if you have no ability to change the objects to reflect their relationship — for example, if you need to form a composition relationship between a third-party class and an application class. It also breaks down in cases like ObjectOutputStream, in which the relationship is ad hoc and the objects have differing lifetimes.

Should you find the need to create such an association, the JDK provides WeakHashMap, which holds its keys via weak references. When the key is no longer referenced anywhere else within the application, the map entry is no longer accessible. In practice, the entry remains in the map until the next time the map is accessed, so you may find your related objects sitting in the heap far longer than they should.

Rather than give an example here, we'll look at WeakHashMap in the context of a canonicalizing map.

Eliminating Duplicate Data with Canonicalizing Maps

In my opinion, a far better use of weak references is for canonicalizing maps. And the best example of how a canonicalizing map works — even though it's written as a native method — is String.intern(). When you intern a string, you get a single, canonical instance of that string back. If you're processing some input source with a lot of duplicated strings, such as an XML or HTML document, interning strings can save an enormous amount of memory. However, that memory comes at a cost, at least on the Hotspot JVM: prior to JDK 8, interned strings were stored in the permanent generation, a scarce resource. You can get the benefit of interning without the drawback by implementing your own canonicalizing map.

A simple canonicalizing map works by using the same object as key and value: you probe the map with an arbitrary instance, and if there's already a value in the map, you return it. If there's no value in the map, you store the instance that was passed in (and return it). Of course, this only works for objects that can be used as map keys. Here's how we might implement String.intern():

private Map<String,String> _map = new HashMap<String,String>();

public String intern(String str)
{
    if (_map.containsKey(str))
        return _map.get(str);
    _map.put(str, str);
    return str;
}

This implementation is fine if you have a small number of strings to intern, perhaps within a single method that processes a file. However, let's say that you're writing a long-running application that has to process input from multiple sources, which contain a wide range of strings but still have a high level of duplication. For example, a server that processes uploaded files of postal address data: there will be lots of entries for New York, not so many for Temperanceville. If you can avoid keeping a separate 16-byte string for every occurrence of the former, while not holding onto the 30 bytes needed for the latter once it's no longer used, you will (maybe) reduce long-term memory consumption.

This is where weak references come in: they allow you to create a canonical instance only so long as some code in the program is using it. After the last strong reference disappears, the canonical string will be collected.

To improve our canonicalizer, we can replace HashMap with a WeakHashMap:

private Map<String,WeakReference<String>> _map
    = new WeakHashMap<String,WeakReference<String>>();

public synchronized String intern(String str)
{
    WeakReference<String> ref = _map.get(str);
    String s2 = (ref != null) ? ref.get() : null;
    if (s2 != null)
        return s2;

    _map.put(str, new WeakReference<String>(str));
    return str;
}

First thing to notice is that, while the map's key is a String, its value is a WeakReference<String>. This is because WeakHashMap only uses WeakReference for its keys; the Map.Entry holds a strong reference to the value. If we did not wrap the value in its own WeakReference, that strong reference would never allow the string to be collected.

Second, note the process for returning a string: first we retrieve the weak reference. If it exists, then we retrieve the referent. But we have to check that object as well. It's possible that the reference is sitting in the map but is already cleared. Only if the referent is not null do we return it; otherwise we consider the passed-in string to be the new canonical version.

Thirdly, note that I've synchronized the intern() method. The most likely use for a canonicalizing map is in a multi-threaded environment such as an app-server, and WeakHashMap isn't synchronized internally. The synchronization in this example is actually rather naive, and the intern() method can become a point of contention. In a real-world implementation, I might use ConcurrentHashMap, but the naive approach works better for a tutorial.

One final thing to know about WeakHashMap: its documentation is somewhat misleading about when things get removed from the map. It states that “a WeakHashMap may behave as though an unknown thread is silently removing entries.” In reality there is no other thread. Instead, the map cleans up whenever it's accessed. To keep track of which entries are no longer valid, it uses a reference queue.

Reference Queues

While testing a reference for null lets you know whether its referent has been collected, doing so requires that you interrogate the reference. If you have a lot of references, it would be a waste of time to interrogate all of them to discover which have been cleared. The alternative is a reference queue: when you associate a reference with a queue, the reference will be put on the queue after it has been cleared.

You associate a reference object with a queue at the time you create the reference. Thereafter, you can poll the queue to determine when the reference has been cleared, and take appropriate action (WeakHashMap, for example, removes the map entries associated with those references). Depending on your needs, you might want to set up a background thread that periodically polls the queue, blocking until references become available.
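
Here's a sketch of such a thread; it assumes a (final) queue created elsewhere and a hypothetical handleCleared() method that does the actual cleanup.

Thread cleaner = new Thread(new Runnable()
{
    public void run()
    {
        try
        {
            while (true)
            {
                Reference<?> ref = queue.remove();   // blocks until a reference is enqueued
                handleCleared(ref);                  // hypothetical: remove bookkeeping, release resources
            }
        }
        catch (InterruptedException ex)
        {
            // interruption is the signal to stop cleaning up
        }
    }
});
cleaner.setDaemon(true);
cleaner.start();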

Reference queues are most often used with phantom references, described below, but can be used with any reference type. The following code is an example with soft references: it creates a bunch of buffers, accessed via a SoftReference, and after every creation looks to see what references have been cleared. If you run this code, you'll see long runs of create messages, interspersed with an occasional run of clear messages (each run of the garbage collector will clear multiple references).

public static void main(String[] argv) throws Exception
{
    List<SoftReference<byte[]>> refs = new ArrayList<SoftReference<byte[]>>();
    ReferenceQueue<byte[]> queue = new ReferenceQueue<byte[]>();

    for (int ii = 0 ; ii < 10000 ; ii++)
    {
        SoftReference<byte[]> ref
            = new SoftReference<byte[]>(new byte[10000], queue);
        System.err.println(ii + ": created " + ref);
        refs.add(ref);

        Reference<? extends byte[]> r2;
        while ((r2 = queue.poll()) != null)
        {
            System.err.println("cleared " + r2);
        }
    }
}

As always, there are things to note about this code. First, although we're creating SoftReference instances, we get Reference instances back from the queue. This serves to remind you that, once they're enqueued, it no longer matters what type of reference you're using: the referent has already been cleared.

Second is that we must keep track of the reference objects via strong references. The reference object knows about the queue; the queue doesn't know about the reference until it's enqueued. If we didn't maintain the strong reference to the reference object, it would itself be collected, and we'd never be the wiser. We use a List in this example, but in practice, a Set is a better choice because it's easier to remove those references once they're cleared.

Phantom References

Phantom references differ from soft and weak references in that they're not used to access their referents. Instead, their sole purpose is to tell you when their referent has already been collected. While this seems rather pointless, it actually allows you to perform resource cleanup with more flexibility than you get from finalizers.

The Trouble With Finalizers

Back in the description of object life cycle, I mentioned that finalizers have subtle problems that make them unsuitable for cleaning up non-memory resources. There are also a couple of non-subtle problems, that I'll cover here for completeness and then promptly ignore.

In my opinion, the real problem with finalizers is that they introduce a gap between the time that the object is identified for collection and the time that its memory is reclaimed. The JVM is guaranteed to perform a full collection before it throws an OutOfMemoryError, but if the only objects eligible for collection have a finalizer, then the collection will have little effect. Throw in the fact that a typical JVM only has a single thread to handle finalization of all objects, and you can see where problems arise.

The following program demonstrates this: each object has a finalizer that sleeps for half a second. Not much time at all, unless you've got thousands of objects to clean up. Every object goes out of scope immediately after it's created, yet at some point you'll run out of memory (this will happen faster if you reduce the maximum heap size with -Xmx).

public class SlowFinalizer
{
    public static void main(String[] argv) throws Exception
    {
        while (true)
        {
            Object foo = new SlowFinalizer();
        }
    }

    // some member variables to take up space -- approx 200 bytes
    double a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z;

    // and the finalizer, which does nothing but take time
    protected void finalize() throws Throwable
    {
        try { Thread.sleep(500L); }
        catch (InterruptedException ignored) {}
        super.finalize();
    }
}

The Phantom Knows

Phantom references allow the application to know when an object is no longer used, so that the application can clean up the object's non-memory resources. Unlike finalizers, cleanup is controlled by the application. If the application creates objects using a factory method, that method can be written to block until some number of outstanding objects have been collected. No matter how long it takes to do cleanup, it won't affect any thread other than the one calling the factory.

Phantom references differ from soft and weak references in that your program does not access the actual object through the reference. In fact, if you call get(), it always returns null, even if the referent is still strongly referenced. Instead, you use the phantom reference to hold a second strong reference to the resources used by the referent:

[Diagram: relationships between application code, phantom reference, and referent]

While this seems strange, the purpose of the phantom reference is simply to let you know when the referent has been reclaimed. Your program still needs to be able to access the resources in order to reclaim them, so the referent can't be the sole path to those resources. The application must rely on a reference queue to report when the referent has been collected.

Implementing a Connection Pool with Phantom References

Database connections are one of the most precious resources in any application: they take time to establish, and database servers place strict limits on the number of simultaneous open connections that they'll accept. For all that, programmers are remarkably careless with them, sometimes opening a new connection for every query and either forgetting to close it or forgetting to close it in a finally block.

Phantom references provide a clean way to implement a connection pool: the PooledConnection instance handed to application code is a thin wrapper over the actual database connection, and the pool uses a phantom reference to detect when that wrapper object has been collected. The main benefit over a finalizer-based approach is separation of concerns: the pool is responsible for managing connections, rather than the connections being responsible for managing themselves.

Here's the PooledConnection object, showing a single example method from the Connection interface.

public class PooledConnection
implements Connection
{
    private ConnectionPool _pool;
    private Connection _cxt;

    // ...

    public void commit() throws SQLException
    {
        getConnection().commit();
    }

You'll note that the commit() method doesn't access the delegate _cxt directly. I use the internal method getConnection() for two reasons: first, so I can apply some checks for the validity of the connection, and second, so that I can override it in a test case.

For now, that's enough said about the PooledConnection object. Now let's look at the ConnectionPool, which is a factory for new pooled connections via the getConnection() method. This is a blocking method: if there isn't a connection available in the pool, it will wait until one becomes available (signified by its reference being enqueued).

public synchronized Connection getConnection()
throws SQLException
{
    while (true)
    {
        if (_pool.size() > 0)
            return wrapConnection(_pool.remove());
        else
        {
            try
            {
                Reference<?> ref = _refQueue.remove(100);
                if (ref != null)
                    releaseConnection(ref);
            }
            catch (InterruptedException ignored)
            {
                // this could be used to shut down pool
            }
        }
    }
}

From this method, you should be able to infer that we have a queue of some sort containing our actual connections, and also a reference queue that we'll use to track when our PooledConnection objects get collected. The call to releaseConnection() should give you a hint that we keep track of the pooled connections via their references, so let's take a look at the internal data structures:

private Queue<Connection> _pool
    = new LinkedList<Connection>();

private ReferenceQueue<Object> _refQueue
    = new ReferenceQueue<Object>();

private IdentityHashMap<Object,Connection> _ref2Cxt
    = new IdentityHashMap<Object,Connection>();

private IdentityHashMap<Connection,Object> _cxt2Ref
    = new IdentityHashMap<Connection,Object>();

What's happening here is that the pool maintains two lookup tables: one from the reference object to the actual connection, and one from the actual connection to the reference object. Both tables use IdentityHashMap, because we care about the actual object, and don't want a potential override of equals() to get in our way. Note that the two lookup tables also serve as our strong references to the phantom reference instances, so that they won't get collected.

Assuming that there are connections in the pool, the wrapConnection() method handles the bookkeeping needed to track that connection. It creates a PooledConnection instance, which is handed to the caller, and a PhantomReference to refer to that instance. It then inserts these objects in the lookup tables.

private synchronized Connection wrapConnection(Connection cxt)
{
    Connection wrapped = new PooledConnection(this, cxt);
    PhantomReference<Connection> ref = new PhantomReference<Connection>(wrapped, _refQueue);
    _cxt2Ref.put(cxt, ref);
    _ref2Cxt.put(ref, cxt);
    return wrapped;
}

Its counterpart is releaseConnection(), which comes in two flavors. The first is meant to be called from within the pool, when the phantom reference is enqueued. It uses the reference to find the actual connection.

synchronized void releaseConnection(Reference<?> ref)
{
    Connection cxt = _ref2Cxt.remove(ref);
    if (cxt != null)
        releaseConnection(cxt);
}

The second version is meant to be called from the PooledConnection itself, when the application explicitly closes that connection (it's also called from the first version). It clears out the bookkeeping objects, and puts the actual connection back into the pool.

synchronized void releaseConnection(Connection cxt)
{
    Object ref = _cxt2Ref.remove(cxt);
    _ref2Cxt.remove(ref);
    _pool.offer(cxt);
    System.err.println("Released connection " + cxt);
}

To go full circle, we'll look at the PooledConnection's close() method, which not only returns the connection to the pool, but also ensures that it won't be used again. Remember: this method will only be called by application code, to explicitly close the connection. If the pool decides to close the connection, the PooledConnection instance will be long gone.

public void close() throws SQLException
{
    if (_cxt != null)
    {
        _pool.releaseConnection(_cxt);
        _cxt = null;
    }
}

The Trouble with Phantom References

Several pages back, I noted that finalizers are not guaranteed to be called. Neither are phantom references. If the collector doesn't run, it will never collect unreachable objects, and any phantom references won't be enqueued. Consider what would happen if your program used the connection pool above, and threw an uncaught exception immediately after calling getConnection().

The answer is that it would quickly exhaust the pool, and all further requests would block. If your program didn't do anything else that would cause a garbage collection, pretty soon every thread would be blocked, waiting for connections that will never return to the pool.

However, even in this situation, phantom references have an advantage over finalizers: cleanup is under your control. True, with finalizers you could call System.gc() in the hope that it will cause the collector to get to work, but there's no guarantee.

By comparison, the connection pool could run through its list of outstanding connections, and force them to close, without relying on the finalizer (to be fair, you could make the same thing happen with a finalizer, but at that point you're already more than halfway to an implementation using references).
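
A sketch of what that might look like, as a hypothetical method added to the pool shown earlier:

public synchronized void reclaimAllOutstanding()
{
    // copy the key set, because releaseConnection() modifies _cxt2Ref
    List<Connection> outstanding = new ArrayList<Connection>(_cxt2Ref.keySet());
    for (Connection cxt : outstanding)
        releaseConnection(cxt);   // returns the underlying connection to the pool
}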

A Final Thought: Sometimes You Just Need More Memory

While reference objects are a tremendously useful tool to manage your memory consumption, sometimes they're not sufficient and sometimes they're overkill. For example, let's say that you're building a large object graph, containing data that you read from the database. While you could use soft references as a circuit breaker for the read, and weak references to canonicalize that data, ultimately your program requires a certain amount of memory to run. If you can't actually accomplish any work, it doesn't matter how robust your program is.

Your first response to OutOfMemoryError should be to figure out why it's happening. Maybe you have a leak, maybe your memory settings are simply too low.

During development, you should specify a large heap size (1 GB or more, if you have the physical memory) and pay careful attention to how much memory is actually used (jconsole is a useful tool here). Most applications will reach a steady state under simulated load, and that should guide your production heap settings. If your memory usage climbs over time, it's quite probable that you are holding strong references to objects after they're no longer in use. Reference objects may help here, but it's more likely that you've got a bug that should be fixed.

The bottom line is that you need to understand your applications. A canonicalizing map won't help you if you don't have duplication. Soft references won't help if you expect to execute multi-million row queries on a regular basis. But in the situations where they can be used, reference objects are often life savers.

Additional Information

You can download the sample code for this article. This JAR contains both source and executables, with “runner” classes.

The “string canonicalizer” class is available from SourceForge, licensed under Apache 2.0.

Sun has many articles on tuning their JVM's memory management. This article is an excellent introduction, and provides links to additional documentation.

Brian Goetz has a great column on the IBM developerWorks site, "Java Theory and Practice." A few years ago, he wrote columns on using both soft and weak references. These articles go into depth on some of the topics that I simply skimmed over, such as using WeakHashMap to associate objects with different lifetimes.

Copyright © Keith D Gregory, all rights reserved