Tuesday, August 21, 2007

Garbage Collection in Java

Java is a garbage collected language. This means that programmers need not worry themselves with memory management issues such as when to free the memory used to store objects on the heap. On the other hand, this places the burden on Java’s runtime system to determine when allocated memory can be freed for reuse, and this is called garbage collection. In garbage collection, the memory management system attempts to locate portions of memory reserved for use by the running program but which are never going to be used, and mark those portions as available for reuse.

The traditional criticism of garbage collected languages is that this activity necessarily slows down the running of the program, and besides, the programmer is likely to know better than any automated garbage collector when memory can be freed up for reuse. While true that garbage collection takes a performance hit on the running program, the axiom that the programmer knows best when memory can be freed turns out to cause substantial problems. As the mystery of computer programming has escaped the strict control of the guilds and become available to the hoi polloi, every Tom, Dick, and Harry has considered himself able to program. Having passed the tests and trials of simulated programming problems, he is granted a bachelor of science and sets to work programming massive multi-year projects running mission-critical applications. But without the long apprenticeships of yore, he does not understand nor respect the heap, and the pieces he crafts have memory leaks causing undue headaches to his successors (since such Johnny-come-lately programmers inevitably move on before they finish the job) and unnecessary headaches and expenses to his patrons. Thus, the true masters among us have seen the need to write programs and design whole programming languages that prevent these pretenders from spoiling our reputation. And so the ever-downward spiral continues, as we construct languages that make it even more easy to program, attracting an even less-qualified group of people into the craft, who introduce even more stupid errors, requiring us to further simplify programming. At this rate, it will not be long before programming is in the hands of everyone. It will be as common as writing, and we have seen what a terribly democratizing influence that has been.

However, the stupidity of modern programmers has, in fact, made it easier for the few masters left to design a garbage collector that outperforms the programmer, for it turns out that the quality of programming has been so poor that even a performance-hogging garbage collector is better than leaving such details in the hands of the programmers.

There is no canonical garbage collection algorithm defined for Java, and different implementations utilize different algorithms, all of them likely better than leaving memory management in the programmers’ hands. One I looked at [Sun 2005] used a generational approach, the theory base behind which can be illustrated by Figure 1, with the author’s explanation in the caption. The figure almost makes sense, but not quite. Lucky for you, however, I can translate. The difficulty, of course, is in the measurement of object lifetimes in bytes allocated, since we normally think of lifetimes as being measured in units of time. Although the author could not explain this explicitly, for fear that the knowledge would fall into the hands of the unwashed, he is using bytes allocated as a measure of time, where each byte allocated is a tick. Thus, if a byte is allocated on the heap and then immediately freed, its lifetime was 0 bytes, regardless of whether a millisecond or a millennium passed in the meantime. On the other hand, if before it was freed, 100,000 other bytes on the heap were allocated, then its lifetime is 100,000 bytes allocated, even if the actual time that passed was under a second. This measurement unit actually does make sense, as garbage collection is a function of memory use, rather than raw time.

Figure 1. Infant Mortality. The blue area is a typical distribution for the lifetimes of objects. The x-axis is object lifetimes measured in bytes allocated. The byte count on the y-axis is the total bytes in objects with the corresponding lifetime. The sharp peak at the left represents objects that can be reclaimed (i.e., have “died”) shortly after being allocated. Iterator objects, for example, are often alive for the duration of a single loop.

So what the author notes is that most objects last only a very short amount of time. However, ones that do make it past a certain age, tend to stick around a lot longer. So the idea of the generational approach is that once some data has survived a certain amount of time, it is unlikely to be freed up in the near future, so it can be moved to an area that is garbage collected less often. Might as well spend resources collecting bytes that are likely to be garbage than looking for ones among those that are not. Kind of like the lioness going after the weakest of the herd. Or the bully in grade school.

So memory is managed in generations, or memory pools holding objects of similar ages. Garbage collection occurs in each generation when that generation’s pool has filled up. Objects are initially allocated in the generation for younger objects, called, oddly enough, the young generation, and because of infant mortality, most objects die there. Nice metaphor. When the young generation fills up, it is garbage collected, called a minor collection. Minor collections can be optimized assuming a high infant mortality rate. The costs of such collections are, to the first order, proportional to the number of live objects being collected: a young generation full of dead objects is collected very quickly. Some surviving objects are moved to a tenured generation. When the tenured generation needs to be collected there is a major collection, which is often much slower, because it involves all live objects. In theory, you could have several generations for the objects of various ages, but in practice, only two are necessary. (Well, there is also a third, permanent generation, which contains the reflective data of the virtual machine itself, such as class and method objects.)

There are several possible configurations for a generation:
  1. A generation can be held in a single (logical) area in memory. Objects are allocated somewhere in that area, and a standard mark-and-sweep algorithm is used to mark objects for collection. If some sort of compaction is required, it is added in as an additional time sink.
  2. The generation’s memory is divided into two equally-sized halves, only one of which is active at a time. When one half fills up, the other half becomes the active space, and live objects are copied to the newly activated space, leaving the garbage behind. Objects are copied between the two halves in this way until they are old enough to be tenured, or copied to the next generation. In theory (and generally in practice, too), the amount of memory taken up by the live objects after the copy is only a fraction of the space taken up by it in the old space. In this method, compaction comes for free.
  3. The generation’s memory consists of an eden space plus two survivor spaces. Objects are initially allocated in eden. One survivor space is empty at any time, and serves as a destination of the next, copying collection of any live objects in eden and the other survivor space. Objects are copied between survivor spaces in this way until they are old enough to be tenured. As with the two-half approach, compaction comes for free.
Generally, the tenured generation is held in a single-space configuration. Although this may make garbage collection a bit more expensive, major collections are performed infrequently at best. Young generations are often configured in the two-half configuration or the three-space eden-survivor configuration. The author did not express a preference for one over the other, except that he chose the latter for his JVM.

REFERENCES

    There was an error in this gadget