Thursday, July 12, 2007

Toward a Universal Data Description Language

I have been working in the computer industry for almost twenty-five years. While that may seem like a long time to many people, I am humbled when I meet computer scientists who were programming before there was even software. My late uncle, for instance, got his start during World War II programming rockets or artillery or some such. He said they didn't think of it as “programming” back then, and that it was only later, when what they were doing evolved into computers, that they could look back on their wartime work as “programming.” To a certain extent, I am jealous of those pioneers of the fifties and sixties, but I realize now that I myself am becoming an old fogey in our field. No, I don't remember vacuum tubes and I never programmed with punch cards (though I do remember seeing stacks of punch cards at my school when I was young, and I am always amazed by the stories of people younger than I am who tell of having to use punch cards in college, since I have always used a keyboard and monitor). But I did program in eight-bit assembly with three registers, I have wired a computer from scratch with real wires and a soldering iron, and I used to know all about how the magnetic ones and zeroes were arranged on a disk drive to encode data in the sectors. Some of this knowledge is quite arcane nowadays, though I hesitate to call it unnecessary. Indeed, I would love to wire a modern computer from scratch, though wiring up many sixty-four-bit registers to a bus would probably be quite painstaking. Actually, the physical wiring is not as interesting to me as the microcode and nanocode controlling the signals on those wires, but my time is finite, and I have found a more interesting and more useful class of problems.

Almost every project I have worked on commercially has involved data transformation. If you look at it very basically, almost any project that involves a database requires some sort of data transformation to translate the input data to the database schema (and the commands to manipulate the data in the database). However, this is not really what I mean. When you have complete control over the formats of the input and output (i.e., you can design the schema to work with your data formats), this is mainly an exercise in mapping business entities to a schema.

However, almost all business software must be built to work with software made by other vendors. This is a necessity if you want to sell the software. If you want to sell your financial software, it must integrate with the client's existing general ledger system. It does not matter that you also make a general ledger system; the client wants to use his existing one with your receivables application. You want to sell your order entry software? It must be able to feed data to the existing forecasting system. “No, we already have a good forecasting system and don’t want to buy yours, too. We just want your order entry system.” I have come to the conclusion that the function of nearly all business software is to take input in one format and output it in another. There is obviously a little value added in the middle, but a significant amount of effort goes into integrating heterogeneous systems. Indeed, I might even go so far as to say that any computer system of interest is distributed, so this is an unfortunate necessity.

An unfortunate necessity, I say, because data interfaces are neither the most glamorous nor the most interesting of problems that a software engineer wants to work on.

One of my goals has been to develop a universal data description language (UDDL): a single language that could be used to describe any data format. This could alleviate the problem of parsing inbound data formats, and, just as XSLT has provided a transformation language for XML, it also opens the door to a transformation language for the parse trees created by a UDDL parser.
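To make the idea concrete, here is a minimal sketch in Python. It is not the UDDL itself, only an illustration of the pattern: a data format is written down as data, a generic parser interprets that description, and a separate mapping transforms the parsed result for another system. The fixed-width order record, the field names, and the `ORDER_FORMAT` and `FORECAST_MAPPING` tables are all hypothetical, invented for this example.

```python
# Hypothetical description of a fixed-width order record:
# each entry is (field name, width in characters, converter).
ORDER_FORMAT = [
    ("order_id", 6, int),
    ("sku", 8, str.strip),
    ("quantity", 4, int),
]

# Hypothetical mapping from a forecasting system's field names
# to the field names in the parsed order record.
FORECAST_MAPPING = {"item": "sku", "units": "quantity"}


def parse_record(description, line):
    """Generic parser: slice the line as directed by the format description."""
    record = {}
    pos = 0
    for name, width, convert in description:
        record[name] = convert(line[pos:pos + width])
        pos += width
    return record


def transform(record, mapping):
    """Build a target-system record by looking up each mapped source field."""
    return {target: record[source] for target, source in mapping.items()}


line = "000042WIDGET-A0012"
parsed = parse_record(ORDER_FORMAT, line)
forecast_row = transform(parsed, FORECAST_MAPPING)
print(parsed)
print(forecast_row)
```

Supporting a new inbound format here means writing a new description table rather than a new parser. A real data description language would have to go far beyond flat fixed-width records, of course, and handle nesting, repetition, and context-dependent structure.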

A universal data description language is a high bar to set, but it should not be impossible. Certainly all practical data languages are recursive (that is, decidable), so any Turing-complete language should suffice to describe them. The challenge is to create a language that permits data languages to be described in an intuitive way. Simply defining a universal DDL is of little use if a UDDL parser cannot be implemented efficiently or if the syntax and semantics of the language are so arcane and complex that describing data formats in it is overly burdensome.

If you are interested in my research in this area, my prospectus and presentation are available at http://cis.usouthal.edu/~dmercer. Theoretically, I should be working on my thesis right now, but I am constantly encountering even more interesting problems or approaches to this problem, and writing up something I have already solved does not seem to take precedence over the demands of work and family.
