Java Reference Objects
or How I Learned to Stop Worrying and Love OutOfMemoryError
Introduction
I started programming with Java in 2000, after fifteen years with C and
C++. I thought myself fairly competent at C-style memory management, using
coding practices such as pointer handoffs, and tools such as Purify. I
couldn't remember the last time I had a memory leak. So it was with some measure
of disdain that I approached Java's automatic memory management … and
quickly fell in love. I hadn't realized just how much mental effort was
expended in memory management, until I didn't have to do it any more.
And then I met my first OutOfMemoryError. Just sitting there
on the console, with no accompanying stack trace … because stack
traces require memory! Debugging that error was tough, because the usual
tools just weren't available, not even a malloc logger. And
the state of Java debuggers in 2000 was, to say the least, primitive.
I can't remember what caused that first error, and I certainly didn't
resolve it using reference objects. They didn't enter my toolbox until
about a year later, when I was writing a server-side database cache and
tried using soft references to limit the cache size. Turned out they
weren't too useful there, either, for reasons that I'll discuss below. But
once reference objects were in my toolbox, I found plenty of other uses for
them, and gained a better understanding of the JVM as well.
The Java Heap and Object Life Cycle
For a C++ programmer new to Java, the relationship between stack and heap
can be hard to grasp. In C++, objects may be created on the heap using the
new operator, or on the stack using "automatic" allocation.
The following is legal C++: it creates a new Integer object on
the stack. A Java compiler, however, will reject it as a syntax error.
Integer foo = Integer(1);
Java, unlike C++, stores all objects on the heap, and requires the
new operator to create the object. Local variables are still
stored on the stack, but they hold a pointer to the object, not the object
itself (and of course, to confuse C++ programmers more, these pointers are
called "references"). Consider the following Java method, which allocates
an Integer, giving it a value parsed from a String:
public static void foo(String bar)
{
    Integer baz = new Integer(bar);
}
The diagram below shows the relationship between the heap and stack for
this method. The stack is divided into "frames," which contain the
parameters and local variables for each method in the call tree. Those
variables that point to objects — in this case, the parameter
bar and the local variable baz — point at
objects living in the heap.
Now look more closely at the first line of foo(), which
allocates a new Integer object. Behind the scenes, the JVM
first attempts to find enough heap space for this object — approximately
12 bytes. If able to allocate the space, it then calls the specified
constructor to initialize the object and stores a pointer to the object in
variable baz. If the JVM is unable to allocate the space, it
calls the garbage collector in an attempt to make room.
Garbage Collection
While Java gives you a new operator to allocate objects on the
heap, it doesn't give you a corresponding delete operator to
remove them. When method foo() returns, the variable
baz goes out of scope but the object it pointed to still
exists on the heap. If this were the end of the story, all programs would
quickly run out of memory. Java, however, provides a garbage collector to
clean up these objects once they're no longer referenced.
The garbage collector goes to work when the program tries to create a new
object and there isn't enough space for it in the heap. The requesting
thread is suspended while the collector looks through the heap, trying to
find objects that are no longer actively used by the program, and
reclaiming their space. If the collector is unable to free up enough space,
and the JVM is unable to expand the heap, the new operator
fails with an OutOfMemoryError. This is normally followed by
your application shutting down.
There are many excellent references as to how the Java garbage collector
works, and some of them are listed at the end of this article. While they
make great reading, and will teach you how to tune your JVM appropriately
for the programs that you're running, for now all you need to know is that
Java uses a form of mark-sweep-compact garbage collection, based
on strong references.
Mark-Sweep-Compact
The idea behind mark-sweep-compact garbage collection is simple: every
object that can't be reached by the program is garbage, and can be
collected. This is a three-part process:
- The garbage collector starts from “root” references, and
  walks through the object graph marking all objects that it reaches.
- Then, it goes through all objects on the heap, and discards those
  that aren't marked.
- Finally, it compacts the heap, moving objects around to coalesce the
  free space left behind by the collected garbage.
So what are these "roots"? In a simple Java application, they're method
parameters and local variables stored on the stack, the operands of the
currently executing expression (also stored on the stack), and static class
member variables.
In programs that use their own classloaders, such as app-servers, the
picture gets muddy: only classes loaded by the system classloader (the
loader used by the JVM when it starts) contain root references. Any
classloaders that the application creates are themselves subject to
collection, once there are no more references to them. This is what allows
app-servers to hot-deploy: they create a separate classloader for each
deployed application, and let go of the classloader reference when the
application is undeployed or redeployed.
It's important to understand root references, because they define what a
"strong" reference is: if you can follow a chain of references from a root
to a particular object, then that object is "strongly" referenced. It will
not be collected.
So, returning to method foo(), the parameter bar
and local variable baz are strong references only while the
method is executing. Once it finishes, they both go out of scope, and the
objects they referenced are eligible for collection. In the real world,
foo() would probably return the reference held in
baz, meaning that it remains strongly referenced by
foo()'s caller.
Now consider the following:
LinkedList foo = new LinkedList();
foo.add(new Integer(123));
Variable foo is a root reference, which points to the
LinkedList object. Inside the linked list are zero or more
list elements, each of which points to its successor. When we call
add(), one of these elements will point to an
Integer instance with the value 123. This is a chain of strong
references, from a root reference, meaning that the Integer is
not eligible for garbage collection. As soon as foo goes out
of scope, however, the LinkedList and everything in it are
eligible for collection — provided, of course, that there are no other
strong references to it or its contents.
You may be wondering what happens if you have a circular reference: object
A contains a reference to object B, which contains a reference back to A.
The answer is that a mark-sweep collector isn't fooled: if neither A nor B
can be reached by a chain of strong references, then they're eligible for
collection.
Finalizers
C++ allows objects to define a destructor method: when the object goes out
of scope or is explicitly deleted, its destructor is called to clean up the
resources it used. For most objects, this means explicitly releasing the
memory that the object allocated with new or
malloc. In Java, the garbage collector handles memory cleanup
for you, so there's no need for an explicit destructor to do this.
However, memory isn't the only resource that might need to be cleaned up.
Consider FileOutputStream: when you create an instance of this
object, it allocates a file handle from the operating system. If you let
all references to the stream go out of scope before closing it, what
happens to that file handle? The answer is that the stream has a
finalizer method: a method that's called by the JVM just before
the garbage collector reclaims the object. In the case of
FileOutputStream, the finalizer closes the stream, which
releases the file handle back to the operating system — and also
flushes any buffers, ensuring that all data is properly written to disk.
Any object can have a finalizer; all you have to do is declare the
finalize() method:
protected void finalize() throws Throwable
{
    // clean up your object here
}
While finalizers seem like an easy way to clean up after yourself, they do
have some serious limitations. First, you should never rely on them for
anything important, since an object's finalizer may never be called —
the application might exit before the object is eligible for garbage
collection. There are some other, more subtle problems with finalizers, but
I'll hold off on these until we get to phantom references.
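Since a finalizer may never run, the usual alternative is to release the resource explicitly. The following is a minimal sketch of that idea, assuming Java 7 or later for try-with-resources; TrackedResource is a hypothetical class invented for this illustration, and simply records the order in which things happen.

```java
import java.util.ArrayList;
import java.util.List;

class ExplicitCleanupDemo
{
    // hypothetical resource that records its life cycle, so the order of events is visible
    static class TrackedResource implements AutoCloseable
    {
        private final List<String> events;
        private final String name;

        TrackedResource(String name, List<String> events)
        {
            this.name = name;
            this.events = events;
            events.add("open " + name);
        }

        void use()
        {
            events.add("use " + name);
        }

        // invoked by try-with-resources when the block exits, normally or via exception
        @Override
        public void close()
        {
            events.add("close " + name);
        }
    }

    static List<String> run()
    {
        List<String> events = new ArrayList<String>();
        try (TrackedResource res = new TrackedResource("handle", events))
        {
            res.use();
        }
        return events;
    }

    public static void main(String[] argv)
    {
        System.out.println(run());
    }
}
```

The cleanup happens at a known point in the program, whether or not the garbage collector ever looks at the object.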
Object Life Cycle without Reference Objects
Putting it all together, an object's life can be summed up by the simple
picture below: it's created, it's used, it becomes eligible for collection,
and eventually it's collected. The shaded area represents the time during
which the object is "strongly reachable," a term that becomes important by
comparison with the reachability provided by reference objects.
Enter Reference Objects
JDK 1.2 introduced the java.lang.ref package, and three new stages in the object life cycle: softly-reachable,
weakly-reachable, and phantom-reachable. These states only apply to objects
eligible for collection — in other words, those with no strong
references — and the object in question must be the referent
of a reference object:
- softly reachable: The object is the referent of a SoftReference. The
  garbage collector will attempt to preserve the object as long as
  possible, but will collect it before throwing an OutOfMemoryError.
- weakly reachable: The object is the referent of a WeakReference, and there
  are no strong or soft references to it. The garbage collector is free
  to collect the object at any time, with no attempt to preserve it. In
  practice, the object will be collected during a major collection, but
  may survive a minor collection.
- phantom reachable: The object is the referent of a PhantomReference, and
  there are no strong, soft, or weak references to it. This reference
  type differs from the other two in that it isn't meant to be used to
  access the object, but as a signal that the object has already been
  finalized, and the garbage collector is ready to reclaim its memory.
As you might guess, adding three new optional states to the object
life-cycle diagram makes for a mess. Although the documentation indicates a
logical progression from strongly reachable through soft, weak, and
phantom, to reclaimed, the actual progression depends on what reference
objects your program creates. If you create a WeakReference
but don't create a SoftReference, then an object progresses
directly from strongly-reachable to weakly-reachable to finalized to
collected.
It's also important to remember that not all objects are attached to
reference objects — in fact, very few of them should be. A reference
object is a layer of indirection: you go through the reference object to
reach the referred object, and clearly you don't want that layer of indirection
throughout your code. Most programs, in fact, will use reference objects to
access a relatively small number of the objects that the program creates.
References and Referents
A reference object is a layer of indirection between your program code
and some other object, called a referent. Each reference object
is constructed around its referent, and the referent cannot be changed.
The reference object provides the get() method to retrieve a
strong reference to its referent. The garbage collector may reclaim the
referent at any point; once this happens, get() returns
null. The following code shows this in action:
SoftReference<List<Foo>> ref = new SoftReference<List<Foo>>(new LinkedList<Foo>());

// create some Foos, probably in a loop
List<Foo> list = ref.get();
if (list == null)
    throw new RuntimeException("ran out of memory");
list.add(foo);
There are a few important things to note about this code:
- You must always check to see if the referent is null.
  The garbage collector can clear the reference at any time, and
  if you blithely use the reference, sooner or later you'll get a
  NullPointerException.
- You must hold a strong reference to the referent to use it.
  Again, the garbage collector can clear the reference at any time, even
  between two statements in your code. If you simply call
  get() once to check for null, and then call
  get() again to use the reference, it might be cleared
  between those calls.
- You must hold a strong reference to the reference object.
  If you create a reference object, but allow it to go out of scope,
  then the reference object itself will be garbage-collected. Seems
  obvious, but it's easy to forget, particularly when you're using
  reference queues to track when the reference objects get cleared.
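The first two notes boil down to one habit: call get() exactly once, keep the result in a local variable, and test that variable. Here is a small sketch of the habit. The class and method names are my own inventions for illustration, and clear() stands in for the garbage collector so that the result is repeatable.

```java
import java.lang.ref.SoftReference;
import java.util.ArrayList;
import java.util.List;

class SafeDereference
{
    // Returns the size of the list, or -1 if the referent has been cleared.
    static int sizeOf(SoftReference<List<String>> ref)
    {
        // one call to get(): the local variable is a strong reference, so the
        // list cannot disappear between the null check and the use
        List<String> list = ref.get();
        if (list == null)
            return -1;
        return list.size();
    }

    public static void main(String[] argv)
    {
        List<String> data = new ArrayList<String>();
        data.add("example");
        SoftReference<List<String>> ref = new SoftReference<List<String>>(data);
        System.out.println(sizeOf(ref));

        // clear() simulates what the collector may do at any time
        ref.clear();
        System.out.println(sizeOf(ref));
    }
}
```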
Also remember that soft, weak, and phantom references only come into play
when there are no more strong references to the referent. They exist to let
you hold onto objects past the point where they'd normally become food for
the garbage collector. At first, this may seem like a strange thing —
if you no longer have anything that points to the object, why would you care
about it ever again?
Soft References
We'll start to answer that question with soft references. If there are no
strong references to an object but it is the referent of a
SoftReference, then the garbage collector is free to reclaim
the object but will try not to. You can tune the garbage collector to be
more or less aggressive at reclaiming softly-referenced objects.
The JDK documentation says that this is appropriate for a memory-sensitive
cache: each of the cached objects is accessed through a
SoftReference, and if the JVM decides that it needs space,
then it will clear some or all of the references and reclaim their
referents. If it doesn't need space, then the referents remain in the heap
and can be accessed by program code. In this scenario, the referents are
strongly referenced when they're being actively used, softly referenced
otherwise. If a soft reference gets cleared, you'll need to refresh the cache.
To be useful in this role, however, the cached objects need to be pretty
large — on the order of several kilobytes each. Useful, perhaps, if
you're implementing a fileserver that expects the same files to be retrieved
on a regular basis, or have large object graphs that need to be cached. But
if your objects are small, then you'll have to clear a lot of them to make a
difference, and the reference objects will add overhead to the whole process.
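To make the cache scenario concrete, here is a minimal sketch of a map whose values are held through soft references and reloaded on demand. It is an illustration under assumptions, not a production cache: SoftCache, its loader function, and simulateClear() are names I made up for this example, and the loader uses the Java 8 Function interface.

```java
import java.lang.ref.SoftReference;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

class SoftCache<K,V>
{
    private final Map<K,SoftReference<V>> map = new HashMap<K,SoftReference<V>>();
    private final Function<K,V> loader;
    private int loadCount;

    SoftCache(Function<K,V> loader)
    {
        this.loader = loader;
    }

    synchronized V get(K key)
    {
        SoftReference<V> ref = map.get(key);
        V value = (ref != null) ? ref.get() : null;
        if (value == null)
        {
            // either never cached, or the collector cleared the reference
            value = loader.apply(key);
            loadCount++;
            map.put(key, new SoftReference<V>(value));
        }
        return value;
    }

    // stands in for the collector clearing an entry; for demonstration only
    synchronized void simulateClear(K key)
    {
        SoftReference<V> ref = map.get(key);
        if (ref != null)
            ref.clear();
    }

    synchronized int getLoadCount()
    {
        return loadCount;
    }

    public static void main(String[] argv)
    {
        SoftCache<String,byte[]> cache
            = new SoftCache<String,byte[]>(name -> new byte[64 * 1024]);
        byte[] first = cache.get("report.pdf");
        byte[] second = cache.get("report.pdf");
        System.out.println("same instance: " + (first == second));

        cache.simulateClear("report.pdf");
        cache.get("report.pdf");
        System.out.println("loads: " + cache.getLoadCount());
    }
}
```

Note that the map keeps its keys and the cleared SoftReference objects until they are overwritten, which is part of the overhead mentioned above.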
Soft Reference as Circuit Breaker
A better use of soft references is to provide a "circuit breaker" for
memory allocation: put a soft reference between your code and the memory it
allocates, and you avoid the dreaded OutOfMemoryError. This
technique works because memory allocation tends to be localized within the
application: you're reading rows from a database, or processing data from a
file.
For example, if you write a lot of JDBC code, you might have a method like
the following to process query results in a generic way and ensure that the
ResultSet is properly closed. It only has one small flaw: what
happens if the query returns a million rows?
public static List<List<Object>> processResults(ResultSet rslt)
throws SQLException
{
    try
    {
        List<List<Object>> results = new LinkedList<List<Object>>();
        ResultSetMetaData meta = rslt.getMetaData();
        int colCount = meta.getColumnCount();
        while (rslt.next())
        {
            List<Object> row = new ArrayList<Object>(colCount);
            for (int ii = 1 ; ii <= colCount ; ii++)
                row.add(rslt.getObject(ii));
            results.add(row);
        }
        return results;
    }
    finally
    {
        closeQuietly(rslt);
    }
}
The answer, of course, is an OutOfMemoryError, unless you have
a gigantic heap or tiny rows. It's the perfect place for a circuit breaker:
if the JVM runs out of memory while processing the query, release all the
memory that it's already used, and throw an application-specific exception.
At this point, you may wonder: who cares? The query is going to abort in
either case, why not just let the out-of-memory error do the job? The
answer is that your application may not be the only thing affected. If
you're running on an application server, your memory usage could take down
other applications. Even in an unshared environment, a circuit-breaker
improves the robustness of your application, because it confines the
problem and gives you a chance to recover and continue.
To create the circuit breaker, the first thing you need to do is wrap the
results list in a SoftReference (you've seen this code before):
SoftReference<List<List<Object>>> ref
= new SoftReference<List<List<Object>>>(new LinkedList<List<Object>>());
And then, as you iterate through the results, create a strong reference to
the list only when you need to update it:
while (rslt.next())
{
    rowCount++;
    List<Object> row = new ArrayList<Object>(colCount);
    for (int ii = 1 ; ii <= colCount ; ii++)
        row.add(rslt.getObject(ii));
    List<List<Object>> results = ref.get();
    if (results == null)
        throw new TooManyResultsException(rowCount);
    else
        results.add(row);
    results = null;
}
This works because almost all of the method's memory allocation happens in
two places: the call to next(), and the loop that calls
getObject(). In the first case, there's a lot that happens
when you call next(): the ResultSet typically
retrieves a large block of binary data, containing multiple rows. Then,
when you call getObject(), it extracts a piece of that data
and wraps it in a Java object.
While those expensive operations happen, the only reference to the list
is via the SoftReference. If you run out of memory the reference
will be cleared, and the list will become garbage. It means that the method
throws, but the effect of that throw can be confined. And perhaps the calling
code can recreate the query with a retrieval limit.
Once the expensive operations complete, you can hold a strong reference to the
list with relative impunity. However, note that it's a LinkedList:
I know that linked lists grow in increments of a few dozen bytes, which is
unlikely to trigger OutOfMemoryError. By comparison, if an
ArrayList needs to increase its capacity, it must create a new
array to do so. In a large list, this could mean megabytes.
Also note that I set the results variable to null
after adding the new element — this is one of the few cases where doing
so is justified. Although the variable goes out of scope at the end of the loop,
the garbage collector does not know that (because there's no reason for the JVM
to clear the variable's slot in the call stack). So, if I didn't clear the
variable, it would be an unintended strong reference during the subsequent
pass through the loop.
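Putting the pieces together, the following sketch shows the whole circuit breaker in a form that runs without a database. It rests on a few assumptions of mine: an Iterator stands in for the ResultSet, the collect() method name is hypothetical, and TooManyResultsException is defined here as a simple stand-in for the application-specific exception.

```java
import java.lang.ref.SoftReference;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.LinkedList;
import java.util.List;

class CircuitBreakerDemo
{
    // stand-in for the application-specific exception
    static class TooManyResultsException extends RuntimeException
    {
        TooManyResultsException(int rowCount)
        {
            super("ran out of memory after " + rowCount + " rows");
        }
    }

    // the Iterator plays the role of the ResultSet
    static List<List<Object>> collect(Iterator<List<Object>> source)
    {
        SoftReference<List<List<Object>>> ref
            = new SoftReference<List<List<Object>>>(new LinkedList<List<Object>>());
        int rowCount = 0;
        while (source.hasNext())
        {
            // the expensive allocation happens while the list is only softly reachable
            List<Object> row = new ArrayList<Object>(source.next());
            rowCount++;

            List<List<Object>> results = ref.get();
            if (results == null)
                throw new TooManyResultsException(rowCount);
            results.add(row);
            results = null;     // don't carry a strong reference into the next pass
        }

        // the caller needs a strong reference, so check one last time
        List<List<Object>> results = ref.get();
        if (results == null)
            throw new TooManyResultsException(rowCount);
        return results;
    }

    public static void main(String[] argv)
    {
        List<List<Object>> rows = new ArrayList<List<Object>>();
        rows.add(Arrays.<Object>asList(1, "first"));
        rows.add(Arrays.<Object>asList(2, "second"));
        System.out.println(collect(rows.iterator()).size() + " rows collected");
    }
}
```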
Soft References Aren't A Silver Bullet
While soft references can prevent many out-of-memory conditions, they can't
prevent all of them. The problem is this: in order to actually use a soft
reference, you have to create a strong reference to the referent: to add a
row to the results, we need to have a reference to the actual list. During
the time we hold that strong reference, we are at risk for an out-of-memory
condition. In this example we store the pointer to the list in a local
variable, but even if we just used the value directly in an expression, it
would be a strong reference for the duration of that expression.
The goal with a circuit breaker is to minimize the window during which it's
useless: the time that you hold a strong reference to the object, and
perhaps more important, the amount of allocation that happens during this
time. In our case, we confine the strong reference to adding a row to the
results, and we use a LinkedList rather than an
ArrayList because the former grows in much smaller increments.
Also note that in the example, we hold the strong reference in a variable
that quickly goes out of scope. However, the language spec says nothing
about the JVM being required to clear variables that go out of scope, and
in fact the Sun JVM does not do so. If we didn't explicitly clear the
results variable, it would remain a strong reference
throughout the loop, acting like a penny in a fuse box, and preventing the
soft reference from doing its job.
There are some cases where you just can't make the window small enough. For
example, let's say that you wanted to process a ResultSet into
a DOM Document. You would have to dereference the document
after every call to getObject(), and you might just find that
the memory usage to create a new Element is large enough to
push you into an OutOfMemoryError (although there are
techniques, such as pre-allocating a sacrificial buffer, which may help).
Finally, think carefully about the strong references that you hold. For
example, a DOM is typically processed recursively, and you might think of a
recursive solution to adding rows, passing in the parent node. However,
method arguments are strong references. And in a DOM, a reference to any
node is the start of a chain of references to every other node — so if
you pass a node into a method, you have just created a long-lasting strong
reference to the entire DOM tree.
Weak References
A weak reference is, as its name suggests, a reference that doesn't even
try to put up a fight to prevent its referent from being collected. If
there are no strong or soft references to the referent, it's pretty much
guaranteed to be collected.
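A weak reference also makes a handy probe for watching reachability. The sketch below is my own illustration, with a hypothetical WeakProbe class. Only the first half is guaranteed: a strongly referenced object survives any collection. The second half usually reports that the reference was cleared on the JVMs I've used, but System.gc() is only a request, so treat that result as typical rather than promised.

```java
import java.lang.ref.WeakReference;
import java.util.LinkedList;
import java.util.List;

class WeakProbe
{
    // Returns true if the referent survives a collection request while a
    // strong reference to it is still in use.
    static boolean survivesWhileStronglyReferenced()
    {
        List<Integer> list = new LinkedList<Integer>();
        list.add(Integer.valueOf(123));
        WeakReference<List<Integer>> probe = new WeakReference<List<Integer>>(list);

        System.gc();    // a request, not a command
        boolean alive = (probe.get() != null);

        // using the list after the check keeps the strong reference live
        return alive && (list.size() == 1);
    }

    public static void main(String[] argv)
    {
        System.out.println("survived while referenced: " + survivesWhileStronglyReferenced());

        // no strong reference is kept to this referent
        WeakReference<Object> probe = new WeakReference<Object>(new Object());
        System.gc();
        System.out.println("cleared after gc request: " + (probe.get() == null));
    }
}
```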
So what's the use? There are two main uses: associating objects that have
no inherent relationship, and reducing duplication via a canonicalizing
map. The first case is best illustrated with a counter-example:
ObjectOutputStream.
The Problem With ObjectOutputStream
When you write objects to an ObjectOutputStream, it maintains
a strong reference to the object, associated with a unique ID, and writes
that ID to the stream along with the object's data. This has two benefits
if you later write the same object to the same stream: you save bandwidth,
because the output stream only needs to send the ID, and you preserve
object identity on the other end.
Unfortunately, it's also a form of memory leak, since the stream holds onto
the source object forever — or at least until you close the stream or
call reset() on it. If you're using object streams simply as a
means to move objects, and aren't concerned about preserving identity or
reducing bandwidth, then you quickly learn to call reset() on
a regular basis.
If the ObjectOutputStream instead held the source object via a
WeakReference, the problem wouldn't happen: when the object
went out of scope in the program code, the collector could reclaim the
object. Since there would be no way that it could ever be written to the
stream again, there's no reason for the stream to hold onto it. Better, the
ObjectOutputStream could notify the
ObjectInputStream that the object is no longer valid,
eliminating memory leaks on the receiving side.
Unfortunately, although the object stream protocol was updated with the 1.2
JDK, and weak references were added with 1.2, the JDK developers didn't
think to combine them.
Using WeakHashMap to Associate Objects
To be honest, I don't believe there are many cases where you should
associate two objects that don't have an inherent relationship. Either the
objects should have a composition relationship, and be collected together,
or they should have an aggregation relationship and be collected separately.
However this rule breaks down if you have no ability to change the objects
to reflect their relationship — for example, if you need to form a
composition relationship between a third-party class and an application
class. It also breaks down in cases like ObjectOutputStream,
in which the relationship is ad hoc and the objects have differing lifetimes.
Should you find the need to create such an association, the JDK provides
WeakHashMap, which holds its
keys via weak references. When the key is no longer referenced
anywhere else within the application, the map entry is no longer accessible.
In practice, the entry remains in the map until the next time the map
is accessed, so you may find your related objects sitting in the heap
far longer than they should.
Rather than give an example here, we'll look at WeakHashMap
in the context of a canonicalizing map.
Eliminating Duplicate Data with Canonicalizing Maps
In my opinion, a far better use of weak references is for canonicalizing
maps. And the best example of how a canonicalizing map works — even
though it's written as a native method — is String.intern().
When you intern a string, you get a single, canonical instance of that string
back. If you're processing some input source with a lot of duplicated strings,
such as an XML or HTML document, interning strings can save an enormous
amount of memory.
A simple canonicalizing map works by using the same object as key and value:
you probe the map with an arbitrary instance, and if there's already a value
in the map, you return it. If there's no value in the map, you store the
instance that was passed in (and return it). Of course, this only works for
objects that can be used as map keys. Here's how we might implement
String.intern():
private Map<String,String> _map = new HashMap<String,String>();

public synchronized String intern(String str)
{
    if (_map.containsKey(str))
        return _map.get(str);
    _map.put(str, str);
    return str;
}
This implementation is fine if you have a small number of strings to
intern. However, let's say that you're writing a long-running application
that has to process input that contains a wide range of strings that still
have a high level of duplication. For example, an HTTP server that
canonicalizes the headers in its requests. There are only about a dozen
values that you'll see in a "User-Agent" header, yet some of those values
occur more frequently than others — the Googlebot only visits once a
week.
In this case, you can reduce long-term memory consumption by holding the
canonical instance only so long as some code in the program is using
it. And this is where weak references come in: if the map holds its
entries via weak references, they become eligible for collection after
the last strong reference disappears. Once the Googlebot has finished
indexing your site, its user agent string will be collected.
To improve our canonicalizer, we can replace HashMap with a
WeakHashMap:
private Map<String,WeakReference<String>> _map
    = new WeakHashMap<String,WeakReference<String>>();

public synchronized String intern(String str)
{
    WeakReference<String> ref = _map.get(str);
    String s2 = (ref != null) ? ref.get() : null;
    if (s2 != null)
        return s2;
    // as-of 1.5, still possible for a string to reference a much larger
    // shared buffer; creating a new string will trim the buffer
    str = new String(str);
    _map.put(str, new WeakReference<String>(str));
    return str;
}
First thing to notice is that, while the map's key is a
String, its value is a WeakReference<String>.
This is because WeakHashMap only uses
WeakReference for its keys; the Map.Entry holds a
strong reference to the value. If we did not wrap the value in its own
WeakReference, that strong reference would never allow the
string to be collected.
Second, since we're holding the values via a reference object, we have to
ensure that we establish a strong reference before returning. We can't
simply return ref.get(), because it's possible that the
reference will be cleared between the time we verify its contents and the
time we return. So we create the strong reference s2, verify
it, and then return if it's not null.
Third, note that I've synchronized the intern() method. The
most likely use for a canonicalizing map is in a multi-threaded environment
such as an app-server, and WeakHashMap isn't synchronized
internally. The synchronization in this example is actually rather naive,
and the intern() method can become a point of contention.
Realistically, you could wrap the map with
Collections.synchronizedMap(), understanding that two
concurrent calls with the same string may return different instances.
However, only one instance will go into the map, and since our goal is to
reduce duplication, that should be acceptable. The naive approach is better
for a tutorial.
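For comparison, here is a sketch of the Collections.synchronizedMap() variant. The Canonicalizer class name is my own. The individual map operations are thread-safe, but the lookup and the store are not atomic as a pair, which is exactly why two concurrent callers can each walk away with their own copy.

```java
import java.lang.ref.WeakReference;
import java.util.Collections;
import java.util.Map;
import java.util.WeakHashMap;

class Canonicalizer
{
    private final Map<String,WeakReference<String>> map
        = Collections.synchronizedMap(new WeakHashMap<String,WeakReference<String>>());

    String intern(String str)
    {
        WeakReference<String> ref = map.get(str);
        String s2 = (ref != null) ? ref.get() : null;
        if (s2 != null)
            return s2;

        // get() and put() are each synchronized, but not as a pair: two threads
        // can both arrive here with equal strings, and each returns its own
        // copy; the later put() replaces the value stored by the earlier one
        str = new String(str);
        map.put(str, new WeakReference<String>(str));
        return str;
    }

    public static void main(String[] argv)
    {
        Canonicalizer canon = new Canonicalizer();
        String first = canon.intern(new String("Mozilla/5.0"));
        String second = canon.intern(new String("Mozilla/5.0"));
        System.out.println("same instance: " + (first == second));
    }
}
```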
One final thing to know about WeakHashMap is that its
documentation is somewhat misleading. Above, I noted that it's not
synchronized internally, but the documentation states "a
WeakHashMap may behave as though an unknown thread is silently
removing entries." While that may be how it appears, in reality there is no
other thread; instead, the map cleans up whenever it's accessed. To keep
track of which entries are no longer valid, it uses a reference queue.
Reference Queues
While testing a reference for null lets you know whether its
referent has been collected, doing so requires that you interrogate the
reference. If you have a lot of references, it would be a waste of time to
interrogate all of them to discover which have been cleared. The
alternative is a reference queue: when you associate a reference with a
queue, the reference will be put on the queue after it has been
cleared.
You associate a reference object with a queue at the time you create the
reference. Thereafter, you can poll the queue to determine when the
reference has been cleared, and take appropriate action
(WeakHashMap, for example, removes the map entries associated
with those references). Depending on your needs, you might want to set up a
background thread that waits on the queue by calling remove(), which blocks
until a reference becomes available.
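ReferenceQueue offers two ways to take references off the queue: poll() returns immediately, while remove() blocks. The sketch below shows one pass of the loop that a cleanup thread might run. QueueDrain and drain() are hypothetical names, and I call enqueue() by hand to stand in for the collector so that the demo is repeatable.

```java
import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;
import java.lang.ref.WeakReference;

class QueueDrain
{
    // Waits up to timeoutMillis for a first reference, then takes whatever else
    // is already on the queue. Returns the number of references removed.
    // The timeout should be positive: remove(0) waits indefinitely.
    static int drain(ReferenceQueue<Object> queue, long timeoutMillis)
    {
        int count = 0;
        try
        {
            Reference<?> ref = queue.remove(timeoutMillis);     // blocks
            while (ref != null)
            {
                count++;
                // a real cleanup thread would release resources tied to ref here
                ref = queue.poll();                             // never blocks
            }
        }
        catch (InterruptedException ex)
        {
            // restore the flag so that the caller can see the interruption
            Thread.currentThread().interrupt();
        }
        return count;
    }

    public static void main(String[] argv)
    {
        ReferenceQueue<Object> queue = new ReferenceQueue<Object>();
        Object referent = new Object();
        WeakReference<Object> ref = new WeakReference<Object>(referent, queue);

        // enqueue() simulates the collector's action
        ref.enqueue();
        System.out.println("removed " + drain(queue, 1000));
    }
}
```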
Reference queues are most often used with phantom references, described
below, but can be used with any reference type. The following code is an
example with soft references: it creates a bunch of buffers, accessed via a
SoftReference, and after every creation looks to see what
references have been cleared. If you run this code, you'll see long runs of
create messages, interspersed with an occasional run of clear messages
(each run of the garbage collector will clear multiple references).
public static void main(String[] argv) throws Exception
{
    List<SoftReference<byte[]>> refs = new ArrayList<SoftReference<byte[]>>();
    ReferenceQueue<byte[]> queue = new ReferenceQueue<byte[]>();
    for (int ii = 0 ; ii < 10000 ; ii++)
    {
        SoftReference<byte[]> ref
            = new SoftReference<byte[]>(new byte[10000], queue);
        System.err.println(ii + ": created " + ref);
        refs.add(ref);
        Reference<? extends byte[]> r2;
        while ((r2 = queue.poll()) != null)
        {
            System.err.println("cleared " + r2);
        }
    }
}
As always, there are things to note about this code. First, although we're
creating SoftReference instances, we get
Reference instances back from the queue. This serves to remind
you that, once they're enqueued, it no longer matters what type of a
reference you're using: the referent has already been cleared.
Second is that we must keep track of the reference objects via strong
references. The reference object knows about the queue; the queue doesn't
know about the reference until it's enqueued. If we didn't maintain the
strong reference to the reference object, it would itself be collected, and
we'd never be the wiser. We use a List in this example, but in practice, a
Set is a better choice because it's easier to remove those
references once they're cleared.
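Here is what the Set-based bookkeeping might look like. This is a sketch under assumptions: BufferTracker and its methods are names I invented, and it relies on the fact that reference objects don't override equals(), so a HashSet tracks them by identity.

```java
import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;
import java.lang.ref.SoftReference;
import java.util.HashSet;
import java.util.Set;

class BufferTracker
{
    private final ReferenceQueue<byte[]> queue = new ReferenceQueue<byte[]>();

    // strong references to the reference objects themselves
    private final Set<Reference<? extends byte[]>> refs
        = new HashSet<Reference<? extends byte[]>>();

    SoftReference<byte[]> allocate(int size)
    {
        SoftReference<byte[]> ref = new SoftReference<byte[]>(new byte[size], queue);
        refs.add(ref);
        return ref;
    }

    // Drops every reference that has reached the queue; returns how many were dropped.
    int expunge()
    {
        int count = 0;
        Reference<? extends byte[]> ref;
        while ((ref = queue.poll()) != null)
        {
            if (refs.remove(ref))
                count++;
        }
        return count;
    }

    int size()
    {
        return refs.size();
    }

    public static void main(String[] argv)
    {
        BufferTracker tracker = new BufferTracker();
        SoftReference<byte[]> first = tracker.allocate(10000);
        tracker.allocate(10000);

        // enqueue() stands in for the collector clearing the first buffer
        first.enqueue();
        System.out.println("expunged " + tracker.expunge() + ", tracking " + tracker.size());
    }
}
```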
Phantom References
Phantom references differ from soft and weak references in that they're not
used to access their referents. Instead, their sole purpose is to tell you
when their referent has already been collected. While this seems rather
pointless, it actually allows you to perform resource cleanup with more
flexibility than you get from finalizers.
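The difference from the other reference types is easy to see in code: a PhantomReference never hands back its referent, even while the referent is alive and well. The following sketch uses a hypothetical PhantomDemo class, and once again calls enqueue() by hand in place of the collector.

```java
import java.lang.ref.PhantomReference;
import java.lang.ref.ReferenceQueue;
import java.util.ArrayList;
import java.util.List;

class PhantomDemo
{
    static List<String> run()
    {
        List<String> log = new ArrayList<String>();
        ReferenceQueue<Object> queue = new ReferenceQueue<Object>();
        StringBuilder resource = new StringBuilder("handle #1");
        PhantomReference<Object> ref = new PhantomReference<Object>(resource, queue);

        // get() is null by design, although the referent is strongly reachable
        log.add("get() returned " + ref.get());

        // nothing has been enqueued, because the referent is still in use
        log.add("queue empty: " + (queue.poll() == null));

        // simulate the collector: the queue hands back the reference object,
        // which is all that cleanup code ever gets to see
        ref.enqueue();
        log.add("dequeued same reference: " + (queue.poll() == ref));

        log.add("strong reference still works: " + resource);
        return log;
    }

    public static void main(String[] argv)
    {
        for (String line : run())
            System.out.println(line);
    }
}
```

Because the referent is out of reach, any information needed for cleanup has to live somewhere else, typically in a subclass of the reference object.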
The Trouble With Finalizers
Back in the description of object life cycle, I mentioned that finalizers
have subtle problems. These problems come about because object finalization
happens after the object has been marked as garbage, but before its memory
is reclaimed. With a little ingenuity, you can use finalizers to cause
out-of-memory conditions even when there are no strongly-referenced objects
in your program.
The first of the problems with finalizers is that they may not be invoked.
If your program never runs out of available memory, the garbage collector
will never identify objects that need to be finalized, and their finalizers
will never be called. This is a particular concern when the finalizer
exists to clean up JNI-allocated resources: if the Java-side objects are
small, you could easily run out of C heap space before any of them get
collected. The only way around this problem is to manually clean up your
objects.
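As a rough sketch of what manual cleanup can look like, the class below exposes an explicit close() method and is used from a try-with-resources block (available from Java 7 onward). NativeBuffer is a hypothetical stand-in for an object that owns JNI-allocated memory; the counter only simulates native allocation.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for an object that owns memory allocated through JNI.
class NativeBuffer implements AutoCloseable
{
    // Simulates the number of outstanding native allocations.
    static final AtomicInteger OPEN = new AtomicInteger();

    private boolean closed;

    NativeBuffer()
    {
        OPEN.incrementAndGet();     // a real class would call native code here
    }

    @Override
    public void close()
    {
        if (!closed)                // make repeated close() calls harmless
        {
            closed = true;
            OPEN.decrementAndGet(); // a real class would free the native memory
        }
    }

    // Uses a buffer and releases it without any help from the collector.
    // Returns the number of open buffers observed inside the block.
    static int useAndRelease()
    {
        try (NativeBuffer buf = new NativeBuffer())
        {
            return OPEN.get();
        }
    }
}
```

The release happens when the try block exits, whether normally or through an exception, so it does not depend on when (or whether) the garbage collector runs.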
A second problem with finalize() is that it's allowed to
create a strong reference to the object — to resurrect the object. If
you've ever read Stephen King, you will probably guess that the resurrected
object "isn't quite right": in particular, its finalizer will never be
executed again. While this seems scary on the surface, most people who
write finalizers aren't trying to resurrect their objects. And even if they
do, ultimately they can't change the order of the universe, and once the
object becomes unreachable again it will be eligible for collection.
The real problem with finalizers is that they introduce a discontinuity
between the time that the object is identified for collection and the time
that its memory is reclaimed. The JVM is guaranteed to perform a full
collection before it returns OutOfMemoryError, but if the only
objects eligible for collection happen to have a finalizer, then the
collection will have little effect. Throw in the fact that a JVM may only
have a single thread responsible for finalization of all objects, and you
start to see the problems.
The following program demonstrates this: each object has a finalizer that
sleeps for half a second. Not much time at all, unless you've got thousands
of objects to clean up. Every object goes out of scope immediately after
it's created, yet at some point you'll run out of memory (this will happen
faster if you reduce the maximum heap size with -Xmx).
public class SlowFinalizer
{
    public static void main(String[] argv) throws Exception
    {
        while (true)
        {
            Object foo = new SlowFinalizer();
        }
    }

    // some member variables to take up space -- approx 200 bytes
    double a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z;

    // and the finalizer, which does nothing but take time
    protected void finalize() throws Throwable
    {
        try { Thread.sleep(500L); }
        catch (InterruptedException ignored) {}
        super.finalize();
    }
}
The Phantom Knows
Phantom references allow the application to know when an object is no
longer used, so that the application can clean up the object's non-memory
resources. Unlike finalizers, cleanup is controlled by the
application. If the application creates objects using a factory
method, that method can be written to block until some number of
outstanding objects have been collected. No matter how long it takes to do
cleanup, it won't affect any thread other than the one calling the factory.
Phantom references differ from soft and weak references in that your
program does not access the actual object through the reference. In fact,
if you call get(), it always returns null,
even if the referent is still strongly referenced. Instead, you
use the phantom reference to hold a second strong reference to the resources
used by the referent.
While this seems strange, the purpose of the phantom reference is simply to
let you know when the referent has been reclaimed. Your program still needs
to be able to access the resources in order to reclaim them, so the
referent can't be the sole path to those resources. The application must
rely on a reference queue to report when the referent has been collected.
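One way to arrange that second path is to subclass PhantomReference so that the reference object itself carries the resource. The sketch below assumes this approach; the name ResourceReference and its accessor are hypothetical and do not come from the article's sample code.

```java
import java.lang.ref.PhantomReference;
import java.lang.ref.ReferenceQueue;

// Hypothetical reference subclass: it carries the resource that must be
// released once the referent is gone.
class ResourceReference<T, R> extends PhantomReference<T>
{
    private final R resource;

    ResourceReference(T referent, R resource, ReferenceQueue<? super T> queue)
    {
        super(referent, queue);
        this.resource = resource;
    }

    // The second strong path to the resource, independent of the referent.
    R getResource()
    {
        return resource;
    }
}
```

When the queue hands back a ResourceReference, the application can call getResource() and release whatever it returns. Calling get() on the same object returns null at every point in its life, which is why the resource has to be held separately.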
Implementing a Connection Pool with Phantom References
Database connections are one of the most precious resources in any
application: they take time to establish, and database servers place strict
limits on the number of simultaneous open connections that they'll accept.
For all that, programmers are remarkably careless with them, sometimes
opening a new connection for every query and either forgetting to close it
or forgetting to close it in a finally block. While the
Connection object itself could use a finalizer to release
actual resources, that's still dependent on the whim of the garbage
collector, and it doesn't limit the number of connections that can be open
at any time.
By using phantom references, we gain control over the number of open
connections, and can block until one becomes available. In the example, we
don't go further than that, but it would be a simple matter to add in a
reaper that reclaims connections that have been open/unused too long.
The first part of the connection pool is the PooledConnection
object. This is the object that is given to the application to use —
the referent of our phantom reference. It implements the JDBC
Connection interface, and delegates all operations to the
actual connection handed out by the DriverManager.
public class PooledConnection
implements Connection
{
    private ConnectionPool _pool;
    private Connection _cxt;
Since our PooledConnection implements the
Connection interface, it must expose all of the methods of
that interface. When called, these methods simply delegate to the embedded
connection object. I use the internal method getConnection()
rather than directly accessing the _cxt field for two reasons:
first, so I can apply some checks for the validity of the connection, and
second, so that I can override it in a test case.
public void commit() throws SQLException
{
    getConnection().commit();
}
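The internal getConnection() method is not shown above, so the following is only a guess at its shape. GuardedDelegate is a hypothetical, cut-down stand-in for PooledConnection: it does not implement the full Connection interface, and the validity check (rejecting use after close) is an assumption about what such a check might be.

```java
import java.sql.Connection;
import java.sql.SQLException;

// Simplified stand-in for PooledConnection: it shows only the guarded
// accessor and one delegating method.
class GuardedDelegate
{
    private Connection _cxt;

    GuardedDelegate(Connection cxt)
    {
        _cxt = cxt;
    }

    // Assumed validity check: refuse to hand out a connection that is gone.
    // Package-private rather than private so a test subclass can override it.
    Connection getConnection() throws SQLException
    {
        if (_cxt == null)
            throw new SQLException("connection has been closed");
        return _cxt;
    }

    public void commit() throws SQLException
    {
        getConnection().commit();
    }

    public void close()
    {
        _cxt = null;
    }

    // Demonstration helper: reports whether commit() is rejected
    // when there is no underlying connection.
    static boolean rejectsUseWithoutConnection()
    {
        GuardedDelegate wrapper = new GuardedDelegate(null);
        try
        {
            wrapper.commit();
            return false;
        }
        catch (SQLException expected)
        {
            return true;
        }
    }
}
```

Routing every delegating method through one accessor means the check lives in a single place, and a test can substitute a fake connection by overriding that one method.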
For now, that's enough said about the PooledConnection object.
Now let's look at the ConnectionPool, which is a factory for
new pooled connections via the getConnection() method. This is
a blocking method: if there isn't a connection available in the pool, it
will wait until one becomes available (signified by its reference being
enqueued).
public synchronized Connection getConnection()
throws SQLException
{
    while (true)
    {
        if (_pool.size() > 0)
            return wrapConnection(_pool.remove());
        else
        {
            try
            {
                Reference<?> ref = _refQueue.remove(100);
                if (ref != null)
                    releaseConnection(ref);
            }
            catch (InterruptedException ignored)
            {
                // this could be used to shut down pool
            }
        }
    }
}
From this method, you should be able to infer that we have a queue of some
sort containing our actual connections, and also a reference queue that
we'll use to track when our PooledConnection objects get
collected. The call to releaseConnection() should give you a
hint that we keep track of the pooled connections via their references, so
let's take a look at the internal data structures:
private Queue<Connection> _pool
    = new LinkedList<Connection>();
private ReferenceQueue<Object> _refQueue
    = new ReferenceQueue<Object>();
private IdentityHashMap<Object,Connection> _ref2Cxt
    = new IdentityHashMap<Object,Connection>();
private IdentityHashMap<Connection,Object> _cxt2Ref
    = new IdentityHashMap<Connection,Object>();
What's happening here is that the pool maintains two lookup tables: one
from the reference object to the actual connection, and one from the actual
connection to the reference object. Both tables use
IdentityHashMap, because we care about the actual object, and
don't want a potential override of equals() to get in our way.
Note that the two lookup tables also serve as our strong references to the
phantom reference instances, so that they won't get collected.
Assuming that there are connections in the pool, the
wrapConnection() method handles the bookkeeping needed to
track that connection. It creates a PooledConnection instance,
which is handed to the caller, and a PhantomReference to refer
to that instance. It then inserts these objects in the lookup tables.
private synchronized Connection wrapConnection(Connection cxt)
{
    Connection wrapped = new PooledConnection(this, cxt);
    PhantomReference<Connection> ref
        = new PhantomReference<Connection>(wrapped, _refQueue);
    _cxt2Ref.put(cxt, ref);
    _ref2Cxt.put(ref, cxt);
    return wrapped;
}
Its counterpart is releaseConnection(), which comes in two
flavors. The first is meant to be called from within the pool, when the
phantom reference is enqueued. It uses the reference to find the actual
connection.
synchronized void releaseConnection(Reference<?> ref)
{
    Connection cxt = _ref2Cxt.remove(ref);
    if (cxt != null)
        releaseConnection(cxt);
}
The second version is meant to be called from the
PooledConnection itself, when the application explicitly
closes that connection (it's also called from the first version). It clears
out the bookkeeping objects, and puts the actual connection back into the
pool.
synchronized void releaseConnection(Connection cxt)
{
    Object ref = _cxt2Ref.remove(cxt);
    _ref2Cxt.remove(ref);
    _pool.offer(cxt);
    System.err.println("Released connection " + cxt);
}
To go full circle, we'll look at the PooledConnection's
close() method, which not only returns the connection to the
pool, but also ensures that it won't be used again. Remember: this method
will only be called by application code, to explicitly close the
connection. If the pool decides to close the connection, the
PooledConnection instance will be long gone.
public void close() throws SQLException
{
    if (_cxt != null)
    {
        _pool.releaseConnection(_cxt);
        _cxt = null;
    }
}
The Trouble with Phantom References
Several pages back, I noted that finalizers are not guaranteed to be
called. Neither are phantom references. If the collector doesn't run, it
will never collect unreachable objects, and any phantom references won't be
enqueued. Consider what would happen if your program used the connection
pool above, and threw an uncaught exception immediately after calling
getConnection().
The answer is that it would quickly exhaust the pool, and all further
requests would block. If your program didn't do anything else that would
cause a garbage collection, pretty soon every thread would be blocked,
waiting for connections that will never return to the pool.
However, even in this situation, phantom references have an advantage over
finalizers: cleanup is under your control. True, with finalizers you could
call System.gc() in the hope that it will cause the collector to
get to work, but there's no guarantee: per the documentation, the call merely
"suggests that the Java Virtual Machine expend effort toward
recycling unused objects."
By comparison, the connection pool could run through its list of
outstanding connections, and force them to close, without relying on the
finalizer (to be fair, you could make the same thing happen with a
finalizer, but at that point you're already more than halfway to an
implementation using references).
A Final Thought: Sometimes You Just Need More Memory
While reference objects are a tremendously useful tool to manage your
memory consumption, sometimes they're not sufficient and sometimes they're
overkill. For example, let's say that you're building a large object graph,
containing data that you read from the database. While you could use soft
references as a circuit breaker for the read, and weak references to
canonicalize that data, ultimately your program requires a certain amount
of memory to run. If you can't actually accomplish any work, it doesn't
matter how robust your program is.
Your first response to OutOfMemoryError should be to figure
out why it's happening. To give yourself room to do that, increase your heap size with
the -Xmx parameter. Personally, I find it strange that the JVM
expects you to set explicit limits on memory consumption: the rest of the
world seems to have no problem relying on virtual memory management
to do the right thing. For whatever reason, the JVM is different, and
it doesn't give you a very large default allotment: 64 MB in the Sun
1.5 JVM.
During development, you should specify a large heap size, 1 GB or
more if you have the physical memory, and pay careful attention to how much
memory is actually used (most IDEs provide a way to monitor heap usage, or
you can turn to classes in the java.lang.management package or
even Runtime.totalMemory()). Most applications will reach a
steady state under simulated load, and that should guide your production heap
settings. If your memory usage climbs over time, it's quite probable that
you are holding strong references to objects after they're no longer in
use. Reference objects may help here, but it's more likely that you've got
a bug that should be fixed.
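For monitoring from inside the program, the java.lang.management package exposes heap statistics through MemoryMXBean. The sketch below is one possible way to log them; the class name HeapReport and the output format are illustrative choices.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

// Hypothetical helper that formats the JVM's current heap statistics.
class HeapReport
{
    // Returns a one-line summary of heap usage in megabytes.
    // getMax() returns -1 when the maximum is undefined.
    static String describe()
    {
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        MemoryUsage heap = memory.getHeapMemoryUsage();
        long mb = 1024L * 1024L;
        String max = (heap.getMax() < 0)
                   ? "undefined"
                   : (heap.getMax() / mb) + "MB";
        return "heap used=" + (heap.getUsed() / mb) + "MB"
             + " committed=" + (heap.getCommitted() / mb) + "MB"
             + " max=" + max;
    }

    public static void main(String[] argv)
    {
        System.err.println(describe());
    }
}
```

Calling describe() periodically under simulated load and watching whether the "used" figure levels off is a cheap way to see if the application reaches a steady state.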
The bottom line is that you need to understand your applications. A
canonicalizing map won't help you if you don't have duplication. Soft
references won't help if you expect to execute multi-million row queries on
a regular basis. But in the situations where they can be used, reference
objects are often life savers.
Additional Information
You can download the
sample code for this article. This JAR contains both source and executables, with
“runner” classes.
The “string canonicalizer” class is available from
SourceForge,
licensed under
Apache 2.0.
Sun has many articles on tuning their JVM's memory management.
This article is an excellent introduction, and provides links to
additional documentation.
Brian Goetz has a great column on the IBM developerWorks site, "Java Theory
and Practice." A few years ago, he wrote columns on using both
soft and
weak references. These articles go into depth on some of the topics that I simply
skimmed over, such as using WeakHashMap to associate objects
with different lifetimes.
Copyright © Keith D Gregory, all rights reserved