Monday, July 23, 2007

Javascript from the Command Line

I have an interest in Javascript, as I find it a very easy language to quickly put something together. It is very flexible, and its prototyping of objects instead of using class definitions makes it very easy to quickly create, well, prototypes. It always bothers me to click on a link purporting to be about the Javascript language only to discover it is about using Javascript in a browser. Most of my Javascript usage is outside of the browser using Windows Script Host.

Charles Lowell wrote in “Learning Javascript from the Command Line” about Javascript being a language in its own right, not attached to any browser. His directions, however, were for use on a Linux host rather than Windows. I have been using a WSF file to run Javascript from the command line for several years now, and since some of the comments on Reddit [http://programming.reddit.com/info/28gum/comments] indicate that this technique is not common knowledge, I share it with you here. Put the following file named jscript.wsf in your path (I put it in C:\WINDOWS:

<?XML version="1.0" standalone="yes" ?>
<package>
 <job id="jscript">
  <object id="FSO" progid="Scripting.FileSystemObject" events="true"/>
  <script language="JScript">
   <![CDATA[

var __prompt = "> ";
var __l;
var __e;
var __f;

while (true)
{
 WScript.StdOut.Write(__prompt);
 __l = WScript.StdIn.ReadLine();
 __l = __l.replace(/^\s+/, "");
 try 
 {
  switch (__l.split(/\s+/)[0].toLowerCase())
  {
   case "?":
   case "help":
    WScript.StdOut.WriteLine('Type JScript statement to evaluate or "HELP" or "QUIT"');
    break;
   
   case "exit":
   case "quit":
    WScript.Quit();
   
   case "include":
    __f = __l.replace(/^include\s+/i, "");
    __f = FSO.OpenTextFile(__f, 1);
    __l = __f.ReadAll();
    __f.Close();
   
   default:
    WScript.StdOut.WriteLine(eval(__l));
  }
 }
 catch (__e)
 {
  WScript.StdErr.WriteLine(__e.name + " " + __e.number + " (" + __e.description + "): " + __e.message);
 }
}

   ]]>
  </script>
 </job>
</package>

Now you can run Javascript from the command line:

C:\Documents and Settings\user\Desktop>jscript
> 1
1
> function f(x){return x+1}

> f
function f(x){return x+1}
> f(2)
3

If you get an error when you first try to run it, it is likely because your default script host is WScript instead of CScript. To fix this, issue the command:

C:\Documents and Settings\user\Desktop>cscript //H:CScript
The default script host is now set to "cscript.exe".

The script does not have much help, but if you look at the code, you can pretty much see the only undocumented feature, which is the ability to load script from a file using the include command.

Friday, July 20, 2007

Discovering Erlang

Distributed Systems

In my first post, “Toward a Universal Data Description Language,” I mentioned that I thought that “any computer system of interest is distributed.” My post today will describe a programming language I discovered only recently that is designed specifically for distributed systems. It is remarkable to me that it is not even a new language—having been around for well over a decade—since a language such as this, specifically designed to solve this common problem, should have been noticed by me before.

In distributed systems, nodes of the system communicate by sending messages between each other. There are two common patterns of the communication:

Pull. Information receiver “client” node pulls from an information source "server" node by sending a request to the server and receiving the information in its reply.
Push. Information source “publisher” node pushes information to a "subscriber" node.

If the information source and receiver use identical data formats and communication protocols, then implementation of the system is easy: information source and receiver can send messages directly between each other. However, it is quite common for different nodes to have been developed by different organization (e.g., the inventory system was developed by a company that specializes in MRP systems, the order entry system was created by an e-commerce company), so the different nodes don't speak the same language. This is where a real-time data linking application is required. A real-time data linking application (LINK) serves as a translator between the homogeneous systems, speaking each nodes native language and translating between them in real-time. These translations are not always as straightforward as they seem. In this scenario, the LINK acts as subscriber, client, and publisher in the same transaction.

Node A publishes information X to LINK.
LINK sends request for additional information Y to node B.
Node B replies to LINK with information Y.
LINK combines X & Y into Z then sends Z to node C.
Node C sends acknowledgment to LINK.
LINK sends acknowledgment to node A.

The Problem

In this scenario, the LINK withholds its acknowledgment to node A until the last step. This delay will block node A from sending more information until it is sure that all processing is complete with the original message X. However, it need not be so, and, in fact, this ordering of events could be disastrous to the system as a whole. Suppose node A publishes information at a rate of 10 messages per second, node B is well able to handle 10 requests per second, and node C can certainly receive 10 messages per second. It would seem that we should not have a problem.

And we should not, but we do: node B takes ten seconds to process each request. Because node B can handle 100 simultaneous connections, it can still keep up with the 10 requests per second pace required of it, but it requires the LINK to have multiple requests at any one time. If the LINK withholds its acknowledgment until step 6, then it will never have multiple requests to send to node B. Therefore, the LINK must send its acknowledgment around step 2. However, once it does this, it needs to be able to manage multiple instances of each piece of information and match up each Y with the correct X to form message Z. There may be no guarantee that node B will respond to requests in the order received (some information may be cached and return very quickly, while other requests might require querying some other remote data source), so a simple queue will not work either, where each Y is matched to the first X in the X-queue.

The Solution: Erlang

Data linking applications, therefore, usually require support for handling concurrent jobs. If we had the ability to spawn a new process for each message X received from node A, then each spawned process need only keep track of one X and one Y. The complexity of our program goes down significantly with this feature, since our program logic need only describe how to handle a single message X instead of having to describe how to manage all the message X’s, and Y’s, and the connections to node B, etc.

Erlang is a programming language specifically designed to handle concurrency, and my initial dabbles in it seem to indicate it is well-suited for this sort of problem. It is available for a variety of platforms, including Windows.

An Introduction to Erlang

I am admittedly no expert in Erlang, so my introduction will necessarily be “gentle”. My introduction, however, does assume that you managed to install Erlang and have tried some typical hello-world-type modules. I have little interest in repeating the excellent documentation that is available on-line. Your first stop should probably be trapexit.org.

Suppose node B requires us to assign a unique nine-digit ID to each of our requests. How do we generate such an ID without repeating ourselves? Our first thought might be to utilize the built-in now() function, which returns the number of microseconds since 00:00 GMT, January 1, 1970. The documentation for this function guarantees that subsequent calls to the function will return continuously increasing values so that it can be used to generate unique time-stamps. This is precisely what we are looking for!

Unfortunately, we are restricted to nine digits, and now() will produce well over nine digits. If we just take the least significant nine digits, then we have a cycle that repeats every three hours, so there is no guarantee that the chosen numbers will be unique. We could play the odds, but our odds of a collision are significantly increased if our system (like mine) has only millisecond resolution. We might then decide to drop the last three digits and use only the least significant nine of the milliseconds in the epoch, but even this repeats three times a year.

The solution is a sequence: start at 000000000 and increment it by one for each message. At our average rate of 10 transactions per second, the sequence will last us 31 years. One might argue that this merely puts off the repetition and we run the risk of having a Y2K-like problem in 2038, but with only nine digits to play with, this is the best we can do even in theory.

So we decide to write a module sequence to encapsulate our ID-generating logic. If we were to write this module using a more traditional language, our resulting Erlang module would use a variable to store the last ID generated, and we would increment this variable by one on each call. Translating this approach into Erlang produces a module that might look something like this:

-module(sequence).
-export([next/0]).

next() ->  Next_ID =
case get(sequence) of
undefined -> 0;
ID -> ID + 1
end,
put(sequence, Next_ID),
string:right(integer_to_list(Next_ID), 9, $0).

Testing this module in Erlang seems to indicate that it works:

(node1@host1)1> sequence:next(). 
"000000000"
(node1@host1)2> sequence:next(). 
"000000001"
(node1@host1)3> sequence:next(). 
"000000002"

But that is not a very good test. When we utilize the sequence in LINK implementation, each call to sequence:next() will be in its own process with its very own process dictionary. This means that the get() and put() calls of our implementation our initialized every time. To demonstrate, we spawn off processes to make the calls to sequence:next():

(node1@host1)4> spawn(fun() -> io:format(sequence:next()) end). 
000000000<0.63.0>
(node1@host1)5> spawn(fun() -> io:format(sequence:next()) end). 
000000000<0.65.0>
(node1@host1)6> spawn(fun() -> io:format(sequence:next()) end). 
000000000<0.67.0>

As expected, our function returned “000000000” every single time. (The numbers between angled brackets, such as “<0.63.0>”, are the return values of the spawn function, which is output by the Erlang interface.)

If we continue our traditional programming approach to this problem, we might come up with some work-arounds. For example, we might have the process that receives the message from node A generate the ID and pass it to the process it spawns to handle the message. This, however, leaks details of our implementation of the communication with node B into the code that handles node A communications. If node B changes its interface so as to require a different ID, or multiple ID's, or to no longer even require an ID, our code that handles communications with node A would have to change.

In Erlang, processes are cheap. They are cheaper even than operating system threads, so the Erlang solution is to create a process that generates the ID’s, and use that process to generate all our ID’s. This solution now looks something like this:

-module(sequence).
-export([start/0, next/0, next/1, loop/1]).

start() ->
register(sequence, spawn(sequence, loop, [0])).

loop(Next_ID) ->
receive
{next, Requester_PID, Request_ID} ->
Requester_PID ! {Request_ID, string:right(integer_to_list(Next_ID), 9, $0)}
end,
loop(Next_ID + 1).

get_next(Process) ->
Request_ID = make_ref(),
Process ! {next, self(), Request_ID},
receive
{Request_ID, ID} -> ID
end.

next() -> get_next(sequence).

next(Node) -> get_next({sequence, Node}).

Running the same test as before gives us the results we want:

(node1@host1)5> sequence:start().
true
(node1@host1)6> spawn(fun() -> io:format(sequence:next()) end).
<0.71.0>000000000
(node1@host1)7> spawn(fun() -> io:format(sequence:next()) end).
<0.73.0>000000001
(node1@host1)8> spawn(fun() -> io:format(sequence:next()) end).
<0.75.0>000000002

You may wonder what was so special about this program; after all, it is just a counter. The wonderful thing about this, however, is that we have a sequence server that can be accessed from anywhere within the network. We can start Erlang on any machine on our network and get the next sequence with one line of code! In our example, we set up our sequence-generator on a node called “node1” on a host called “host1”. If we start Erlang on another host, say “host2”, naming our instance “node2” (note that you can have multiple nodes running on the same host), then the following one line will fetch for us on host2 the next number in the sequence from host1:

(node2@host2)1> sequence:next(node1@host1).
"000000003"

Erlang appears to be an good language in which to write concurrent distributed applications, and I shall investigate this language further.

Thursday, July 12, 2007

Toward a Universal Data Description Language

I have been working in the computer industry for almost twenty-five years. While that may seem like a long time to many people, I am humbled when I meet computer scientists who were programming back before there was even software. My late uncle, for instance, got his start during World War II programming rockets or artillery or somesuch. He said they didn't think of it as “programming” back then, and that it was only later, when what they were doing evolved into computers that they could look back at what they were doing in the war as “programming.” To certain extent, I am jealous of those pioneers of the fifties and sixties, but I realize now that I myself am becoming an old fogey in our field. No, I don't remember vacuum tubes and I never programmed with punch cards (though I do remember when I was young seeing stacks of punch cards at my school, and I am always amazed by the stories of people younger than me who tell of having to use punch cards when they were in college, since I have always used a keyboard and monitor), but I did used to program in eight-bit assembly with three registers, I have wired a computer from scratch, with real wires and a soldering iron, and I used to know all about how the magnetic ones and zeroes were arranged on a disk drive to encode data in the sectors. Some of this knowledge is quite arcane nowadays, though I hesitate to say unnecessary. Indeed, I sure would like to wire a modern computer from scratch, though it would probably be quite painstaking wiring up many sixty-four-bit registers to a bus. Actually, the physical wiring is not as interesting to me as the microcode and nanocode controlling the signals on those wires, but my time is finite, and I have found a more interesting and more useful class of problems.

Almost every project I have worked on commercially has involved data transformation. If you look at it very basically, almost any project that involves a database requires some sort of data transformation to translate the input data to the database schema (and the commands to manipulate the data in the database). However, this is not really what I mean. When you have complete control over the formats of the input and output (i.e., you can design the schema to work with your data formats), this is mainly an exercise in mapping business entities to a schema.

However, almost all business software must be built to work with software made by other vendors. This is a necessity to sell the software. If you want to sell your financial software, it must integrate with the client's existing general ledger system. No matter if you also make a general ledger system, the client wants to use his existing one with your receivables application. You want to sell your order entry software? It must be able to feed data to the existing forecasting system. “No, we already have a good forecasting system and don’t want to buy yours, too. We just want your order entry system.” I have come to the conclusion that the function of nearly all business software is to take input in one format and output it in another. There is obviously a little value added in the middle there, but a significant amount of effort is involved in integrating heterogeneous systems. Indeed, I might even go so far as to say that any computer system of interest is distributed, so this is an unfortunate necessity.

An unfortunate necessity, I say, because data interfaces are neither the most glamorous nor the most interesting of problems that a software engineer wants to work on.

One of my goals has been to develop a universal data description language that would provide a language that could be used to describe any data format. This may alleviate not only the problem of parsing inbound data formats, but just like XSLT has provided a transformation language for XML, it opens the door to the possibility of also creating a transformation language for the parse trees created by a UDDL parser.

A universal data description language is a high bar to set, but it should not be impossible. Certainly all practical data languages are recursive, so any Turing-complete language should suffice. The challenge is to create such a language that permits description of data languages in an intuitive way. Simply defining a universal DDL is of little use if a UDDL parser cannot be efficiently implemented or if the syntax and semantics of the language are so arcane and complex that describing data formats in the UDDL is overly burdensome.

If you are interested in my research in this area, my prospectus and presentation are available at http://cis.usouthal.edu/~dmercer. Theoretically, I should be working on my thesis right now, but I am constantly encountering even more interesting problems or approaches to this problem, and writing up something I have already solved does not seem to take precedence over the demands of work and family.

Tuesday, July 3, 2007

How Java Should Have Handled Multiple Inheritence

C++ permits multiple inheritance. That is, a class can subclass more than one parent class. This can cause ambiguities when more than one parent class exposes identically named and prototyped methods or properties. Fortunately, C++ provides ways to disambiguate this situation.
Java, on the other hand, does not permit multiple inheritance. Instead, it has interfaces, which are guarantees that a given class implements a predefined set of methods and properties. While Java does not permit multiple inheritance, it does permit a class to implement multiple interfaces, and interface names can be used in variable and parameter declarations in much the same way as class names, though an interface cannot be directly instantiated like a class.

As Java postdates C++, it is an attempt to clean up the messiness that comes with multiple inheritance. However, I think it comes up short, as it requires class designers to decide in advance whether to define a class with and interface or not. While nothing prevents designers from defining all classes with interfaces, there is nothing that requires it, and even if a prudent designer did, there is nothing to force class users to use those interfaces instead of the classes directly. Java should:

Require that the entire interface that a class exposes be defined by the set of interfaces it declares that it implements. If a class implements more than one interface with the same method name and prototype, then the class can designate different functions to correspond to the different interfaces. This will properly handle interfaces that attach different semantic meanings to the same method name.
Require that variable and parameter types only be declared as the interface required. A class is not a valid type. A class, however, is the only way to instantiate an object that implements a particular type.
Permit interfaces to require other interfaces. This does not include the required interface’s signature in the requiring interface, but it does mean that implementing the requiring interface implicitly implements the required interface. For example, if I₁ is an interface with the method f₁. An interface I₂ may require I₁. Now, any class that implements I₂ must provide a method for I₁’s method f₁. This f₁ method is specifically not part of I₂. In fact, if I₂ had a method f₁, then the class definition would also need to provide a method for I₂’s f₁ (which may or may not be the same as I₁’s). Note that this avoids any messiness with multiple inheritance. Let’s say we also have an interface I₃ that requires I₁, and an interface I₄ that requires both I₂ and I₃. Although both I₂ and I₃ require I₁, I₁’s methods remain I₁’s, so defining a method implementing I₁’s method f₁ satisfies the requirement from both I₂ and I₃.