The following is a selection of publications and technical reports I've written, arranged by subject area. Most are available online, and links are provided. A comprehensive list of my publications, arranged more-or-less chronologically, is also available.
Note that, in some cases, the papers have been regenerated from the original LaTeX or Word source files, and so formatting may be slightly different from the published versions.
Object Identity and Recursive Data Structures
The Object-Protocol Model (OPM) and OPM Multidatabase Query System
GeneExpress and Gene Expression Data Management
My work on schema merging was done during my first year at Penn and was my first work in the area of databases. At the time, existing approaches to schema merging were based mainly on various heuristics rather than on any formal interpretation, and suffered from known problems such as not being associative (the order in which three or more schemas were merged affected the outcome). This work proposed a general model for database schemas, inspired by the Entity-Relationship model, and described merges in terms of an information ordering on schemas. It also defined an algorithm for generating merges and showed that the algorithm preserved meta-data models.
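The associativity point can be illustrated with a deliberately simplified sketch (not the model from the paper): if schemas are abstracted as sets of facts and the merge is defined as a least upper bound in an information ordering (here, plain set union under containment), then the order of merging cannot affect the outcome. The schema contents below are hypothetical.

```python
# Simplified illustration: schemas as sets of (entity, attribute) facts.
# Defining merge as a least upper bound in an information ordering
# (here, set union under the containment ordering) makes it associative
# and commutative by construction, unlike ad hoc pairwise heuristics.
SchemaA = {("Person", "name"), ("Person", "age")}
SchemaB = {("Person", "name"), ("Person", "email")}
SchemaC = {("Dept", "title")}

def merge(*schemas):
    """Least upper bound of the given schemas: the smallest schema
    containing all the information in each of them."""
    out = set()
    for s in schemas:
        out |= s
    return out

# The order in which three schemas are merged does not matter:
assert merge(merge(SchemaA, SchemaB), SchemaC) == \
       merge(SchemaA, merge(SchemaB, SchemaC))
```

Real schema merging is far subtler than set union, but the sketch shows why grounding the merge in an ordering, rather than in pairwise heuristics, sidesteps the non-associativity problem.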
I didn't pursue this further since, at the time, I felt the work wasn't addressing the real, practical problems of database integration and was too abstract to be useful. Ironically, it seems to have had the most influence, and to be the best known, of any of the work I've done so far.
My PhD thesis was on Database Transformations. It was inspired, in part, by interactions with the Computational Biology Department at Penn, and by the problems they encountered in transforming various heterogeneous data models in order to map them to local data warehouses. A Horn-clause-based language, WOL, was defined, which allowed the expression of both database transformations and constraints, and could be used to express transformations in a compact form and to reason about them. The language supported recursive data structures and the creation of complex objects, using Skolem functions to generate object identities. A prototype implementation was also developed, which would unfold a WOL transformation specification into a mapping in some lower-level query language.
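The Skolem-function idea can be sketched roughly as follows (this is an illustration of the general technique, not WOL's actual syntax or implementation; the data and names are hypothetical): object identities in the target database are built deterministically from source key values, so the same source value always maps to the same target object, and shared sub-objects are created exactly once.

```python
# Sketch of Skolem functions generating object identities: a Skolem
# term is the function name applied to its arguments, so equal source
# keys always yield the same target identity.
def skolem(fname, *args):
    return (fname, args)

# Hypothetical flat source data to be transformed into objects:
source = [{"name": "Ada", "dept": "Math"},
          {"name": "Alan", "dept": "Math"}]

people, depts = {}, {}
for row in source:
    did = skolem("Dept", row["dept"])        # one Dept object per dept name
    depts.setdefault(did, {"name": row["dept"]})
    pid = skolem("Person", row["name"])      # one Person object per name
    people[pid] = {"name": row["name"], "dept": did}

# Both people end up pointing at the same shared Dept object:
assert people[skolem("Person", "Ada")]["dept"] == \
       people[skolem("Person", "Alan")]["dept"]
```

The point is that object creation becomes a pure function of the source data, which is what makes transformations declarative and amenable to reasoning.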
Probably the most significant insight of this work is the importance of the interaction between database constraints and transformations, both in determining transformations and in reasoning about their correctness.
A side issue that I studied in my PhD thesis was the use of object identity in databases as a means of representing recursive data structures. In particular, I looked at the information content of a database, assuming that object identities were not directly visible, and showed that this depended on the query language and the operations available on the database. In particular, if equality tests on object identities were not available, then a sort of bisimulation relationship captured indistinguishability of data. A full version of this work is available as Part II of my thesis, above. The first paper below gives a condensed version of these results.
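The flavour of the bisimulation result can be conveyed with a small sketch (an illustration of the general idea, not the thesis formalism; graphs and names are hypothetical). Object graphs map each node to a dictionary from edge labels to successor nodes, with node names playing the role of object identities; without equality tests on identities, a cyclic list represented by one object is indistinguishable from the same list represented by two.

```python
# Naive check that two rooted, edge-labeled object graphs are bisimilar.
# Cycles are handled coinductively: a pair under examination is assumed
# bisimilar while its successors are checked, and the assumption is
# retracted if the check fails.
def bisimilar(g1, r1, g2, r2):
    assumed = set()

    def check(a, b):
        if (a, b) in assumed:
            return True                      # coinductive assumption
        if set(g1[a]) != set(g2[b]):         # must offer the same labels
            return False
        assumed.add((a, b))
        if all(check(g1[a][lab], g2[b][lab]) for lab in g1[a]):
            return True
        assumed.discard((a, b))
        return False

    return check(r1, r2)

# A one-node cyclic list and a two-node cyclic list: different object
# identities, identical observable structure.
g1 = {"o1": {"head": "v", "tail": "o1"}, "v": {}}
g2 = {"p1": {"head": "v", "tail": "p2"},
      "p2": {"head": "v", "tail": "p1"},
      "v": {}}
assert bisimilar(g1, "o1", g2, "p1")
```

No sequence of label-following observations can tell the two roots apart; only an equality test on the identities themselves would distinguish them.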
In addition, there were some interesting ideas about recursive functions, defined over databases with object identity, that I described in my PhD thesis proposal but was not able to pursue in the final thesis due to lack of time. The main premise was that databases are fundamentally different from normal object-oriented programming environments, in that each class or type has a finite extent associated with it, which can be traversed in finite time; this makes it possible to express certain recursive functions and queries which would not be meaningful or well-defined in a normal programming environment. There has been a recent resurgence of interest in this kind of data structure, and in languages for manipulating and querying it, because of the recent popularity of semi-structured data models such as XML, so perhaps it would be good to revisit these ideas. A copy of my original thesis proposal is below, just in case.
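A minimal sketch of the finite-extent premise (illustrative only; the extent and names are hypothetical): a reachability query over a cyclic object graph is not well-defined as naive structural recursion, but because the class extent is finite, it can be computed by iterating to a fixpoint over that extent.

```python
# Hypothetical finite extent: each object id maps to the ids it
# references. The graph is cyclic ("b" points back to "a").
extent = {"a": ["b"], "b": ["c", "a"], "c": []}

def reachable(start):
    """All objects reachable from `start`. Terminates because the set
    of objects is bounded by the finite extent, even though the object
    graph itself is cyclic."""
    seen = {start}
    frontier = [start]
    while frontier:
        node = frontier.pop()
        for nxt in extent[node]:
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return seen

assert reachable("a") == {"a", "b", "c"}
```

In an open-world programming environment with no extent to bound the iteration, the same recursion on a cyclic structure would simply fail to terminate.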
I joined the group working on the OPM project at Lawrence Berkeley Laboratory in 1995, and moved with the group to Gene Logic in 1997, with plans to commercialize the OPM tools. I was involved in a variety of parts of the project, including the specification of the OPM Query Language and Translator. My most significant contribution was the OPM Multidatabase Query System (MQS), for which I was the primary designer and implementer. This was a federated database system in which the component data sources maintained independent schemas but were viewed through a common data model (OPM). The data sources could include relational databases, either developed using OPM or retrofitted with OPM schemas, hybrid XML/relational databases, flat-file databases, and other applications, such as BLAST engines, for which mediators were written. MQS included some novel features not found in other federated database systems, such as the use of inter-database links for navigating between databases, Java-applet-based graphical user interfaces, and support for application-specific data types (ASDTs).
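The inter-database-link idea can be sketched in miniature (a toy illustration of the general pattern, not MQS's actual architecture or API; all source names, record shapes, and data below are hypothetical): each source is wrapped behind a common interface, and a record's links name both a target database and a target identity, so query results in one source can be navigated directly into another.

```python
# Toy federation: each wrapper exposes the same get() interface over a
# different backing store, simulating sources viewed through a common
# data model.
class Wrapper:
    def __init__(self, store):
        self.store = store

    def get(self, oid):
        return self.store[oid]

# Hypothetical gene and protein sources; the gene record carries an
# inter-database link (target database name, target object id).
genes = Wrapper({"g1": {"name": "BRCA1", "refs": [("prot_db", "p9")]}})
prots = Wrapper({"p9": {"seq": "MDLSA"}})
sources = {"gene_db": genes, "prot_db": prots}

def follow_links(source, oid):
    """Resolve a record's inter-database links into the linked records,
    whichever source they live in."""
    rec = sources[source].get(oid)
    return [sources[db].get(i) for db, i in rec.get("refs", [])]

assert follow_links("gene_db", "g1") == [{"seq": "MDLSA"}]
```

The design point is that links are first-class data: navigation across databases needs no joint schema, only agreement on how a link names its target.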
OPM and MQS were used by SmithKline Beecham for their TI-Net project, which resulted in a successful Journal of Bioinformatics paper. This system involved support for various major molecular biology databases, both public and proprietary, some accessed remotely and others mirrored locally using XML databases. It also included access to web servers, BLAST engines, and various ASDTs. The project was dropped after the merger with Glaxo, and, unfortunately, Gene Logic did not pursue additional sales of the OPM tools after SmithKline Beecham, since this did not fit with the business model it was developing. OPM and certain OPM tools are still used for internal database development at Gene Logic and in parts of the GeneExpress(TM) product line, and remain in use at Lawrence Berkeley National Laboratory and some academic sites.
Most recently, I've been involved in software development for Gene Logic's GeneExpress and Genesis product lines. GeneExpress consists of very large databases of proprietary gene expression data, together with related sample and gene annotation data, and tools for exploring and analysing the data. The gene expression data are generated primarily using the Affymetrix GeneChip platform, while the sample annotations are gathered from various collaborators using a custom data acquisition system, and the gene annotations are gathered from a large selection of public and proprietary data sources using data warehousing techniques. In addition, we have taken on a number of custom data integration projects, allowing Gene Logic customers to integrate their own data into GeneExpress, and, more recently, have extended GeneExpress with the Genesis product range, which includes tools for integrating customer gene expression and sample annotation data into GeneExpress using standard formats or data entry tools.
Though not a research project, GeneExpress does provide interesting case studies of many of the practical problems encountered when building a large data warehouse and tools supporting data from a wide variety of heterogeneous data sources, some of which are described in these papers.
These are various miscellaneous papers I've written that don't fit into any of the major categories of my work, but that I felt like including anyway...
The first paper is an essay that I wrote for a class during the first semester of my Masters program at Imperial. The lecturer teaching the class liked it and offered to publish it in a journal he edited, which gave me my first academic publication. It also taught me an important lesson: avoid using the lower-case letter 'l' as a symbol in technical documents.
The second paper is something I wrote in order to fulfil the second part of the Written Preliminary Exam for the PhD program at Penn. I felt it was a good opportunity to look at some subjects that I was interested in but wouldn't otherwise have a good excuse to study, so I looked at various formal models of communicating concurrent processes, including the pi-calculus.