Tuesday, August 21, 2007

Garbage Collection in Java

Java is a garbage collected language. This means that programmers need not worry themselves with memory management issues such as when to free the memory used to store objects on the heap. On the other hand, it places the burden on Java’s runtime system to determine when allocated memory can be freed for reuse; this determination is called garbage collection. In garbage collection, the memory management system attempts to locate portions of memory that are reserved for use by the running program but will never be used again, and to mark those portions as available for reuse.
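A minimal sketch of what this means in practice (the class name is mine):

    class Scratch {
        public static void main(String[] args) {
            byte[] buffer = new byte[1024 * 1024];  // one megabyte on the heap
            buffer = null;  // no reference to the array remains, so the
                            // collector is free to reclaim it at its leisure
        }
    }

Nowhere does the programmer free the memory; the runtime decides when, and whether, to reclaim it.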

The traditional criticism of garbage collected languages is that this activity necessarily slows down the running program, and besides, the programmer is likely to know better than any automated garbage collector when memory can be freed. While it is true that garbage collection exacts a performance cost on the running program, the axiom that the programmer knows best when memory can be freed turns out to cause substantial problems. As the mystery of computer programming has escaped the strict control of the guilds and become available to the hoi polloi, every Tom, Dick, and Harry has considered himself able to program. Having passed the tests and trials of simulated programming problems, he is granted a bachelor of science and sets to work programming massive multi-year projects running mission-critical applications. But without the long apprenticeships of yore, he neither understands nor respects the heap, and the pieces he crafts have memory leaks causing undue headaches to his successors (since such Johnny-come-lately programmers inevitably move on before they finish the job) and unnecessary expense to his patrons. Thus, the true masters among us have seen the need to write programs and design whole programming languages that prevent these pretenders from spoiling our reputation. And so the ever-downward spiral continues, as we construct languages that make it even easier to program, attracting an even less-qualified group of people into the craft, who introduce even more stupid errors, requiring us to simplify programming further. At this rate, it will not be long before programming is in the hands of everyone. It will be as common as writing, and we have seen what a terribly democratizing influence that has been.

However, the stupidity of modern programmers has, in fact, made it easier for the few masters left to design a garbage collector that outperforms the programmer, for it turns out that the quality of programming has been so poor that even a performance-hogging garbage collector is better than leaving such details in the hands of the programmers.

There is no canonical garbage collection algorithm defined for Java, and different implementations utilize different algorithms, all of them likely better than leaving memory management in the programmers’ hands. One I looked at [Sun 2005] used a generational approach, the theory behind which is illustrated by Figure 1, with the author’s explanation in the caption. The figure almost makes sense, but not quite. Lucky for you, however, I can translate. The difficulty, of course, is in the measurement of object lifetimes in bytes allocated, since we normally think of lifetimes as being measured in units of time. Although the author would not explain this explicitly, for fear that the knowledge would fall into the hands of the unwashed, he is using bytes allocated as a measure of time, where each byte allocated is a tick. Thus, if a byte is allocated on the heap and then immediately freed, its lifetime was 0 bytes, regardless of whether a millisecond or a millennium passed in the meantime. On the other hand, if 100,000 other bytes were allocated on the heap before it was freed, then its lifetime is 100,000 bytes, even if the actual time that passed was under a second. This measurement unit actually does make sense, as garbage collection is a function of memory use rather than raw time.

Figure 1. Infant Mortality. The blue area is a typical distribution for the lifetimes of objects. The x-axis is object lifetimes measured in bytes allocated. The byte count on the y-axis is the total bytes in objects with the corresponding lifetime. The sharp peak at the left represents objects that can be reclaimed (i.e., have “died”) shortly after being allocated. Iterator objects, for example, are often alive for the duration of a single loop.

So what the author notes is that most objects last only a very short time. Objects that do make it past a certain age, however, tend to stick around much longer. The idea of the generational approach, then, is that once some data has survived a certain amount of time, it is unlikely to be freed in the near future, so it can be moved to an area that is garbage collected less often. Better to spend resources collecting among bytes that are likely to be garbage than searching among those that are not. Kind of like the lioness going after the weakest of the herd. Or the bully in grade school.
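The iterator example from the figure caption is easy to picture in Java (a minimal sketch):

    import java.util.Arrays;
    import java.util.Iterator;
    import java.util.List;

    class IteratorLifetime {
        public static void main(String[] args) {
            List<String> names = Arrays.asList("Tom", "Dick", "Harry");
            // The Iterator allocated here is garbage the moment the loop
            // ends: a textbook case of infant mortality.
            for (Iterator<String> it = names.iterator(); it.hasNext(); ) {
                System.out.println(it.next());
            }
        }
    }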

So memory is managed in generations, or memory pools holding objects of similar ages. Garbage collection occurs in each generation when that generation’s pool has filled up. Objects are initially allocated in the generation for younger objects, called, oddly enough, the young generation, and because of infant mortality, most objects die there. Nice metaphor. When the young generation fills up, it is garbage collected; this is called a minor collection. Minor collections can be optimized on the assumption of a high infant mortality rate. The cost of such a collection is, to first order, proportional to the number of live objects it must examine: a young generation full of dead objects is collected very quickly. Some surviving objects are moved to a tenured generation. When the tenured generation needs to be collected, there is a major collection, which is often much slower because it involves all live objects. In theory, you could have several generations for objects of various ages, but in practice only two are necessary. (Well, there is also a third, permanent generation, which contains the reflective data of the virtual machine itself, such as class and method objects.)
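You can watch the two kinds of collection happen on Sun’s JVM by running a program with the -verbose:gc flag; lines labeled “GC” are minor collections and lines labeled “Full GC” are major ones. A minimal sketch that generates suitable churn (the class name is mine):

    class Churn {
        static Object keeper;  // the rare survivor, a candidate for tenuring

        public static void main(String[] args) {
            for (int i = 0; i < 1000000; i++) {
                byte[] shortLived = new byte[1024];  // dead by the next iteration
                if (i % 100000 == 0) {
                    keeper = shortLived;             // hold on to a few
                }
            }
        }
    }

Run it as java -verbose:gc Churn, and the minor collections should be both frequent and fast.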

There are several possible configurations for a generation:
  1. A generation can be held in a single (logical) area in memory. Objects are allocated somewhere in that area, and a standard mark-and-sweep algorithm is used to mark objects for collection. If some sort of compaction is required, it is added in as an additional time sink.
  2. The generation’s memory is divided into two equally sized halves, only one of which is active at a time. When one half fills up, the other half becomes the active space, and live objects are copied to the newly activated space, leaving the garbage behind. Objects are copied between the two halves in this way until they are old enough to be tenured, or copied to the next generation. In theory (and generally in practice, too), the memory taken up by the live objects after the copy is only a fraction of the space that was in use in the old half. In this method, compaction comes for free.
  3. The generation’s memory consists of an eden space plus two survivor spaces. Objects are initially allocated in eden. One survivor space is empty at any given time and serves as the destination of the next copying collection of any live objects in eden and the other survivor space. Objects are copied between survivor spaces in this way until they are old enough to be tenured. As with the two-half approach, compaction comes for free.
Generally, the tenured generation is held in a single-space configuration. Although this may make its collection a bit more expensive, major collections are performed infrequently. Young generations are often configured in the two-half configuration or the three-space eden-survivor configuration. The author did not express a preference for one over the other, except that he chose the latter for his JVM.
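For what it is worth, Sun’s JVM exposes these configurations as command-line options. The flag names below are HotSpot’s; the values (and MyApp) are purely illustrative:

    java -Xmn16m -XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=15 MyApp

-Xmn fixes the size of the young generation, -XX:SurvivorRatio sets the ratio of eden to each survivor space, and -XX:MaxTenuringThreshold caps the number of collections an object may survive before being tenured.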

REFERENCES

[Sun 2005] Sun Microsystems. “Tuning Garbage Collection with the 5.0 Java Virtual Machine.” 2005. http://java.sun.com/docs/hotspot/gc5.0/gc_tuning_5.html

    Monday, July 23, 2007

    Javascript from the Command Line

    I have an interest in Javascript, as I find it a very easy language to quickly put something together. It is very flexible, and its prototyping of objects instead of using class definitions makes it very easy to quickly create, well, prototypes. It always bothers me to click on a link purporting to be about the Javascript language only to discover it is about using Javascript in a browser. Most of my Javascript usage is outside of the browser using Windows Script Host.

    Charles Lowell wrote in “Learning Javascript from the Command Line” about Javascript being a language in its own right, not attached to any browser. His directions, however, were for use on a Linux host rather than Windows. I have been using a WSF file to run Javascript from the command line for several years now, and since some of the comments on Reddit [http://programming.reddit.com/info/28gum/comments] indicate that this technique is not common knowledge, I share it with you here. Put the following file named jscript.wsf in your path (I put it in C:\WINDOWS):

    <?xml version="1.0" standalone="yes" ?>
    <package>
     <job id="jscript">
      <object id="FSO" progid="Scripting.FileSystemObject" events="true"/>
      <script language="JScript">
       <![CDATA[
    
    // a minimal JScript read-eval-print loop
    var __prompt = "> ";  // prompt string
    var __l;              // current input line
    var __e;              // caught exception
    var __f;              // file handle used by "include"
    
    while (true)
    {
     WScript.StdOut.Write(__prompt);
     __l = WScript.StdIn.ReadLine();
     __l = __l.replace(/^\s+/, "");
     try 
     {
      switch (__l.split(/\s+/)[0].toLowerCase())
      {
       case "?":
       case "help":
        WScript.StdOut.WriteLine('Type JScript statement to evaluate or "HELP" or "QUIT"');
        break;
       
       case "exit":
       case "quit":
        WScript.Quit();
       
       case "include":
        __f = __l.replace(/^include\s+/i, "");
        __f = FSO.OpenTextFile(__f, 1);
        __l = __f.ReadAll();
        __f.Close();
       
       default:
        WScript.StdOut.WriteLine(eval(__l));
      }
     }
     catch (__e)
     {
      WScript.StdErr.WriteLine(__e.name + " " + __e.number + " (" + __e.description + "): " + __e.message);
     }
    }
    
       ]]>
      </script>
     </job>
    </package>
    Now you can run Javascript from the command line:
    C:\Documents and Settings\user\Desktop>jscript
    > 1
    1
    > function f(x){return x+1}
    
    > f
    function f(x){return x+1}
    > f(2)
    3

    If you get an error when you first try to run it, it is likely because your default script host is WScript instead of CScript. To fix this, issue the command:

    C:\Documents and Settings\user\Desktop>cscript //H:CScript
    The default script host is now set to "cscript.exe".

    The script does not have much built-in help, but if you look at the code, you can see the only undocumented feature: the ability to load script from a file using the include command.
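    For example, assuming a file named add.js in the current directory containing function add(a, b) { return a + b; }, a session might look like this:

    C:\Documents and Settings\user\Desktop>jscript
    > include add.js

    > add(1, 2)
    3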

    Friday, July 20, 2007

    Discovering Erlang

    Distributed Systems

    In my first post, “Toward a Universal Data Description Language,” I mentioned that I thought that “any computer system of interest is distributed.” My post today describes a programming language I discovered only recently that is designed specifically for distributed systems. Remarkably, it is not even a new language (it has been around for well over a decade); a language designed specifically to solve such a common problem really ought to have come to my attention before now.

    In distributed systems, nodes of the system communicate by sending messages to each other. There are two common patterns of communication:

    • Pull. An information-receiving “client” node pulls from an information-source “server” node by sending a request to the server and receiving the information in its reply.
    • Push. An information-source “publisher” node pushes information to a “subscriber” node.

    If the information source and receiver use identical data formats and communication protocols, then implementation of the system is easy: source and receiver can send messages directly to each other. However, it is quite common for different nodes to have been developed by different organizations (e.g., the inventory system was developed by a company that specializes in MRP systems, while the order entry system was created by an e-commerce company), so the nodes do not speak the same language. This is where a real-time data linking application is required. A real-time data linking application (LINK) serves as a translator between the heterogeneous systems, speaking each node’s native language and translating between them in real time. These translations are not always as straightforward as they seem. In the following scenario, the LINK acts as subscriber, client, and publisher in the same transaction:

    1. Node A publishes information X to LINK.
    2. LINK sends request for additional information Y to node B.
    3. Node B replies to LINK with information Y.
    4. LINK combines X & Y into Z then sends Z to node C.
    5. Node C sends acknowledgment to LINK.
    6. LINK sends acknowledgment to node A.

    The Problem

    In this scenario, the LINK withholds its acknowledgment to node A until the last step. This delay blocks node A from sending more information until all processing of the original message X is complete. However, it need not be so, and, in fact, this ordering of events could be disastrous for the system as a whole. Suppose node A publishes information at a rate of 10 messages per second, node B is well able to handle 10 requests per second, and node C can certainly receive 10 messages per second. It would seem that we should not have a problem.

    And we should not, but we do: node B takes ten seconds to process each request. Because node B can handle 100 simultaneous connections, it can still keep up with the 10-requests-per-second pace required of it, but only if the LINK has multiple requests outstanding at any one time. If the LINK withholds its acknowledgment until step 6, it will never have multiple requests to send to node B. Therefore, the LINK must send its acknowledgment around step 2. However, once it does this, it needs to manage multiple instances of each piece of information in flight and match each Y with the correct X to form message Z. There may be no guarantee that node B responds to requests in the order received (some information may be cached and return very quickly, while other requests might require querying some other remote data source), so a simple queue, in which each Y is matched to the first X in the X-queue, will not work either.
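    To make the bookkeeping concrete, here is a minimal sketch in Java of the correlation such a LINK would have to do by hand (the class and method names are my own invention, and the messages are reduced to strings):

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Matches each reply Y from node B with the message X that caused the
    // request, without assuming node B replies in the order requests were sent.
    class Correlator {
        private final Map<Long, String> pendingX = new ConcurrentHashMap<Long, String>();
        private long nextRequestId = 0;

        // X arrives from node A: remember it, acknowledge A immediately
        // (around step 2), and return the ID to send with the request to B.
        synchronized long onPublish(String x) {
            long requestId = nextRequestId++;
            pendingX.put(requestId, x);
            return requestId;
        }

        // Y arrives from node B: find the matching X and combine them into Z.
        String onReply(long requestId, String y) {
            String x = pendingX.remove(requestId);
            return x + "+" + y;  // stands in for building message Z
        }
    }

    Every X, every Y, and every outstanding request to node B flows through this one shared table, and the code only grows from here.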

    The Solution: Erlang

    Data linking applications, therefore, usually require support for handling concurrent jobs. If we had the ability to spawn a new process for each message X received from node A, then each spawned process need only keep track of one X and one Y. The complexity of our program goes down significantly with this feature, since our program logic need only describe how to handle a single message X instead of having to describe how to manage all the message X’s, and Y’s, and the connections to node B, etc.

    Erlang is a programming language specifically designed to handle concurrency, and my initial dabbles in it seem to indicate it is well-suited for this sort of problem. It is available for a variety of platforms, including Windows.

    An Introduction to Erlang

    I am admittedly no expert in Erlang, so my introduction will necessarily be “gentle”. My introduction, however, does assume that you managed to install Erlang and have tried some typical hello-world-type modules. I have little interest in repeating the excellent documentation that is available on-line. Your first stop should probably be trapexit.org.

    Suppose node B requires us to assign a unique nine-digit ID to each of our requests. How do we generate such an ID without repeating ourselves? Our first thought might be to utilize the built-in now() function, which reports the time elapsed since 00:00 GMT, January 1, 1970, with microsecond resolution (strictly speaking, it returns a {MegaSecs, Secs, MicroSecs} tuple). The documentation guarantees that subsequent calls will return continuously increasing values, so that it can be used to generate unique time-stamps. This is precisely what we are looking for!

    Unfortunately, we are restricted to nine digits, and now() will produce well over nine digits. If we just take the least significant nine digits, we have a cycle that repeats every 10^9 microseconds, which is only about seventeen minutes, so there is no guarantee that the chosen numbers will be unique. We could play the odds, but the odds of a collision increase significantly if our system (like mine) has only millisecond resolution. We might then decide to drop the last three digits and use only the least significant nine digits of the milliseconds since the epoch, but even that repeats every 10^9 milliseconds, or about every eleven and a half days.

    The solution is a sequence: start at 000000000 and increment by one for each message. At our average rate of 10 transactions per second, 10^9 IDs amount to 10^8 seconds, so the sequence will last us a bit over three years. One might argue that this merely puts off the repetition and we run the risk of a Y2K-like problem a few years down the road, but with only nine digits to play with, this is the best we can do even in theory.

    So we decide to write a module sequence to encapsulate our ID-generating logic. If we approached this module the way we would in a more traditional language, we would use a variable to store the last ID generated and increment it by one on each call. Translating this approach into Erlang produces a module that might look something like this:

    -module(sequence).
    -export([next/0]).
    
    %% Return the next ID as a nine-character, zero-padded string.
    %% The last ID handed out lives in the process dictionary.
    next() ->
        Next_ID = case get(sequence) of
                      undefined -> 0;
                      ID        -> ID + 1
                  end,
        put(sequence, Next_ID),
        string:right(integer_to_list(Next_ID), 9, $0).
    Testing this module in Erlang seems to indicate that it works:
    (node1@host1)1> sequence:next(). 
    "000000000"
    (node1@host1)2> sequence:next(). 
    "000000001"
    (node1@host1)3> sequence:next(). 
    "000000002"

    But that is not a very good test. When we utilize the sequence in the LINK implementation, each call to sequence:next() will be in its own process with its very own process dictionary. This means that the get() and put() calls of our implementation are starting from scratch every time. To demonstrate, we spawn off processes to make the calls to sequence:next():

    (node1@host1)4> spawn(fun() -> io:format(sequence:next()) end). 
    000000000<0.63.0>
    (node1@host1)5> spawn(fun() -> io:format(sequence:next()) end). 
    000000000<0.65.0>
    (node1@host1)6> spawn(fun() -> io:format(sequence:next()) end). 
    000000000<0.67.0>

    As expected, our function returned “000000000” every single time. (The numbers between angle brackets, such as “<0.63.0>”, are the process identifiers returned by the spawn function, which the Erlang shell prints.)

    If we continue our traditional programming approach to this problem, we might come up with some work-arounds. For example, we might have the process that receives the message from node A generate the ID and pass it to the process it spawns to handle the message. This, however, leaks details of our implementation of the communication with node B into the code that handles node A communications. If node B changes its interface so as to require a different ID, or multiple ID's, or to no longer even require an ID, our code that handles communications with node A would have to change.

    In Erlang, processes are cheap. They are cheaper even than operating system threads, so the Erlang solution is to create a process that generates the ID’s, and use that process to generate all our ID’s. This solution now looks something like this:

    -module(sequence).
    -export([start/0, next/0, next/1, loop/1]).
    
    %% Spawn the server process and register it under the name 'sequence'.
    start() ->
        register(sequence, spawn(sequence, loop, [0])).
    
    %% Server loop: hand out the current ID on request, then recurse
    %% with the incremented counter.
    loop(Next_ID) ->
        receive
            {next, Requester_PID, Request_ID} ->
                Requester_PID ! {Request_ID, string:right(integer_to_list(Next_ID), 9, $0)}
        end,
        loop(Next_ID + 1).
    
    %% Ask a server for the next ID and wait for the matching reply.
    %% The unique reference guarantees we pick up our own reply and no other.
    get_next(Process) ->
        Request_ID = make_ref(),
        Process ! {next, self(), Request_ID},
        receive
            {Request_ID, ID} -> ID
        end.
    
    %% Next ID from the server on the local node.
    next() -> get_next(sequence).
    
    %% Next ID from the server registered on another node.
    next(Node) -> get_next({sequence, Node}).
    Running the same test as before gives us the results we want:
    (node1@host1)5> sequence:start().
    true
    (node1@host1)6> spawn(fun() -> io:format(sequence:next()) end).
    <0.71.0>000000000
    (node1@host1)7> spawn(fun() -> io:format(sequence:next()) end).
    <0.73.0>000000001
    (node1@host1)8> spawn(fun() -> io:format(sequence:next()) end).
    <0.75.0>000000002

    You may wonder what is so special about this program; after all, it is just a counter. The wonderful thing, however, is that we now have a sequence server that can be accessed from anywhere within the network. We can start Erlang on any machine on our network and get the next sequence number with one line of code! In our example, we set up our sequence generator on a node called “node1” on a host called “host1”. If we start Erlang on another host, say “host2”, naming our instance “node2” (note that you can have multiple nodes running on the same host), then the following single line, run on host2, fetches the next number in the sequence from host1 (assuming the two nodes share a cookie, and that the sequence module is also in node2’s code path, since next/1 runs locally and only the message crosses the network):

    (node2@host2)1> sequence:next(node1@host1).
    "000000003"

    Erlang appears to be a good language in which to write concurrent, distributed applications, and I shall investigate it further.

    Thursday, July 12, 2007

    Toward a Universal Data Description Language

    I have been working in the computer industry for almost twenty-five years. While that may seem like a long time to many people, I am humbled when I meet computer scientists who were programming back before there was even software. My late uncle, for instance, got his start during World War II programming rockets or artillery or somesuch. He said they didn't think of it as “programming” back then, and that it was only later, when what they were doing evolved into computers, that they could look back at what they had been doing in the war as “programming.” To a certain extent, I am jealous of those pioneers of the fifties and sixties, but I realize now that I myself am becoming an old fogey in our field.

    No, I don't remember vacuum tubes, and I never programmed with punch cards (though I do remember seeing stacks of punch cards at my school when I was young, and I am always amazed by the stories of people younger than I am who tell of having to use punch cards in college, since I have always used a keyboard and monitor). But I did once program in eight-bit assembly with three registers, I have wired a computer from scratch, with real wires and a soldering iron, and I used to know all about how the magnetic ones and zeroes were arranged on a disk drive to encode data in the sectors. Some of this knowledge is quite arcane nowadays, though I hesitate to say unnecessary. Indeed, I sure would like to wire a modern computer from scratch, though it would probably be quite painstaking wiring up many sixty-four-bit registers to a bus. Actually, the physical wiring is not as interesting to me as the microcode and nanocode controlling the signals on those wires, but my time is finite, and I have found a more interesting and more useful class of problems.

    Almost every project I have worked on commercially has involved data transformation. At the most basic level, almost any project that involves a database requires some sort of data transformation to translate the input data into the database schema (and into the commands that manipulate the data in the database). However, this is not really what I mean. When you have complete control over the formats of the input and output (i.e., you can design the schema to work with your data formats), this is mainly an exercise in mapping business entities to a schema.

    However, almost all business software must be built to work with software made by other vendors. This is a necessity if the software is to sell. If you want to sell your financial software, it must integrate with the client's existing general ledger system. Even if you also make a general ledger system, the client wants to use his existing one with your receivables application. You want to sell your order entry software? It must be able to feed data to the existing forecasting system. “No, we already have a good forecasting system and don’t want to buy yours, too. We just want your order entry system.” I have come to the conclusion that the function of nearly all business software is to take input in one format and output it in another. There is obviously a little value added in the middle, but a significant amount of effort goes into integrating heterogeneous systems. Indeed, I might even go so far as to say that any computer system of interest is distributed, so this is an unfortunate necessity.

    An unfortunate necessity, I say, because data interfaces are neither the most glamorous nor the most interesting of problems that a software engineer wants to work on.

    One of my goals has been to develop a universal data description language (UDDL): a single language that could be used to describe any data format. This would alleviate not only the problem of parsing inbound data formats; just as XSLT provides a transformation language for XML, it would open the door to a transformation language for the parse trees created by a UDDL parser.

    A universal data description language is a high bar to set, but it should not be impossible. Certainly all practical data languages are recursive, so any Turing-complete language should suffice. The challenge is to create such a language that permits description of data languages in an intuitive way. Simply defining a universal DDL is of little use if a UDDL parser cannot be efficiently implemented or if the syntax and semantics of the language are so arcane and complex that describing data formats in the UDDL is overly burdensome.

    If you are interested in my research in this area, my prospectus and presentation are available at http://cis.usouthal.edu/~dmercer. Theoretically, I should be working on my thesis right now, but I am constantly encountering even more interesting problems or approaches to this problem, and writing up something I have already solved does not seem to take precedence over the demands of work and family.

    Tuesday, July 3, 2007

    How Java Should Have Handled Multiple Inheritance

    C++ permits multiple inheritance. That is, a class can subclass more than one parent class. This can cause ambiguities when more than one parent class exposes identically named and prototyped methods or properties. Fortunately, C++ provides ways to disambiguate this situation.

    Java, on the other hand, does not permit multiple inheritance. Instead, it has interfaces, which are guarantees that a given class implements a predefined set of methods and properties. While Java does not permit multiple inheritance, it does permit a class to implement multiple interfaces, and interface names can be used in variable and parameter declarations in much the same way as class names, though an interface cannot be directly instantiated like a class.
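    A minimal sketch (the interface and class names are mine):

    interface Printable {
        void print();
    }

    interface Storable {
        byte[] serialize();
    }

    // One class may implement both interfaces...
    class Document implements Printable, Storable {
        public void print() { System.out.println("printing"); }
        public byte[] serialize() { return new byte[0]; }
    }

    class Demo {
        public static void main(String[] args) {
            // ...and an interface name serves as a declared type,
            // though only the class can be instantiated.
            Printable p = new Document();
            p.print();
        }
    }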

    As Java postdates C++, it is an attempt to clean up the messiness that comes with multiple inheritance. However, I think it comes up short, as it requires class designers to decide in advance whether to define a class with an interface or not. While nothing prevents designers from defining all classes with interfaces, nothing requires it, and even if a prudent designer did, there is nothing to force class users to use those interfaces instead of the classes directly. Java should:
    1. Require that the entire interface that a class exposes be defined by the set of interfaces it declares that it implements. If a class implements more than one interface with the same method name and prototype, then the class can designate different functions to correspond to the different interfaces. This will properly handle interfaces that attach different semantic meanings to the same method name.
    2. Require that variable and parameter types only be declared as the interface required. A class is not a valid type. A class, however, is the only way to instantiate an object that implements a particular type.
    3. Permit interfaces to require other interfaces. This does not include the required interface’s signature in the requiring interface, but it does mean that implementing the requiring interface implicitly implements the required interface. For example, suppose I1 is an interface with the method f1. An interface I2 may require I1. Now, any class that implements I2 must provide a method for I1’s method f1. This f1 method is specifically not part of I2. In fact, if I2 had a method f1, then the class definition would also need to provide a method for I2’s f1 (which may or may not be the same as I1’s). Note that this avoids any messiness with multiple inheritance. Let’s say we also have an interface I3 that requires I1, and an interface I4 that requires both I2 and I3. Although both I2 and I3 require I1, I1’s methods remain I1’s, so defining a method implementing I1’s method f1 satisfies the requirement from both I2 and I3 (see the sketch after this list).
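    The diamond in point 3 can at least be approximated in today's Java, where an interface may extend several others; the difference is that extends, unlike the proposed requires, does pull the inherited signature into the subinterface. A minimal sketch:

    interface I1 { void f1(); }
    interface I2 extends I1 { }
    interface I3 extends I1 { }
    interface I4 extends I2, I3 { }  // the "diamond" is harmless

    class C implements I4 {
        // One definition satisfies the requirement arriving through both
        // I2 and I3, because the method f1 remains I1's alone.
        public void f1() { System.out.println("f1"); }
    }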
    3. Permit interfaces to require other interfaces. This does not include the required interface’s signature in the requiring interface, but it does mean that implementing the requiring interface implicitly implements the required interface. For example, if I1 is an interface with the method f1. An interface I2 may require I1. Now, any class that implements I2 must provide a method for I1’s method f1. This f1 method is specifically not part of I2. In fact, if I2 had a method f1, then the class definition would also need to provide a method for I2’s f1 (which may or may not be the same as I1’s). Note that this avoids any messiness with multiple inheritance. Let’s say we also have an interface I3 that requires I1, and an interface I4 that requires both I2 and I3. Although both I2 and I3 require I1, I1’s methods remain I1’s, so defining a method implementing I1’s method f1 satisfies the requirement from both I2 and I3.