darcusblog » 2005 » December - geek tools and the scholar

Archive for December, 2005

Ruby sort_by and citeproc-rb

Posted in Uncategorized on December 21st, 2005 by darcusb – Comments Off

There’s a new Ruby weblog over at O’Reilly. This post on sorting arrays reminds me of how to sort an array of references by creator name, then by year:

# sort by creator, then year, and print a simple listing
refs.sort_by { |ref| [ref[:creator], ref[:year]] }.each do |ref|
  puts "#{ref[:creator]}, #{ref[:title]}, (#{ref[:year]})"
end

This has me wondering again how much easier it’d be to rewrite CiteProc in Ruby (or Python). Consider how complicated this XSLT code is, for example. Its task is, for the most part, to sort a reference list by author-year, to track when there is more than one reference in an author group (because many styles replace duplicate author listings with em-dashes), and to assign proper suffixes to duplicate author-years (to get, say, Doe, 2001c).
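To make the suffixing step concrete, here is a rough Ruby sketch (not the CiteProc implementation, just an illustration). It assumes refs is an array of hashes with :creator and :year keys, as in the sorting example above:

```ruby
# Sketch: assign disambiguating suffixes ("2001a", "2001b", ...) to
# references that share the same creator and year. Assumes each ref
# is a Hash with :creator and :year keys.
def suffix_duplicates(refs)
  groups = refs.group_by { |ref| [ref[:creator], ref[:year]] }
  groups.each_value do |dupes|
    next if dupes.size < 2
    dupes.each_with_index do |ref, i|
      ref[:suffix] = ("a".ord + i).chr   # "a", "b", "c", ...
    end
  end
  refs
end

refs = [
  { creator: "Doe", year: 2001, title: "First" },
  { creator: "Doe", year: 2001, title: "Second" },
  { creator: "Roe", year: 1999, title: "Other" },
]
suffix_duplicates(refs)
# Doe's two 2001 items get suffixes "a" and "b"; Roe's gets none
```

Compare that to the equivalent XSLT, which has to thread the grouping state through recursive templates.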

My hunch is that this would be much easier to do in Ruby or Python, though I don’t yet have any real skill with either.

One caveat: a Java-based XSLT processor like Saxon does a good job handling Unicode sorting, which can be critical in bibliographic formatting. I’m not sure how well Ruby handles this; if the array data above includes extended Unicode characters, they don’t seem to get sorted correctly.
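The problem is that Ruby’s default String sort compares bytes, not collation order. A crude workaround, sketched below, is to sort on a key with accents decomposed and stripped; this assumes a Ruby with String#unicode_normalize, and a real solution would need a proper Unicode collation library:

```ruby
# Default String sorting is bytewise, so "Éluard" lands after "Zola".
# Crude workaround: sort on a key with accents decomposed (NFD) and
# the combining marks (\p{Mn}) stripped. Assumes String#unicode_normalize;
# a real fix needs an actual Unicode collation library.
names = ["Zola", "Éluard", "Adams"]

naive  = names.sort
sorted = names.sort_by { |n| n.unicode_normalize(:nfd).gsub(/\p{Mn}/, "") }

puts naive.inspect   # "Éluard" incorrectly sorts last, after "Zola"
puts sorted.inspect  # ["Adams", "Éluard", "Zola"]
```

This only papers over the issue; locale-sensitive collation (as Saxon provides) is the right answer.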

Ultimately, I think for the OpenOffice bibliographic project we’re going to have to rewrite CiteProc in C++ anyway, so at some point we’re going to need to figure out how to port it to a non-XSLT language.

Citation Metadata Workflow

Posted in Uncategorized on December 17th, 2005 by darcusb – 2 Comments

I got the following question in my inbox recently. I’ve edited it slightly to protect anonymity:

At some point, it would be useful to have a higher level “workflow” diagram of your vision. This would be especially helpful starting with the … research process and then ending in publication/distribution of a manuscript (analog (printed) and digital). You might even include the peer review cycle as well. What I’m trying to grasp is the various vectors or entry points for citations (how they are gathered/collected, organized and then deployed).

This is a great suggestion, but after sitting in front of OmniGraffle for a bit, I realize it’s actually quite difficult to clearly diagram just how much of a mess the current landscape is, and the nirvana that I’d like to get to.

A while back, Alf Eaton did a decent job capturing some of this. It’s worth noting, however, that Alf’s diagram models the workflow of a hard scientist, who tends to cite only secondary academic literature. In many other fields in the social sciences and humanities–where one often cites primary data–the universe of citable content is significantly broader. Every time I come across a news article on the New York Times or BBC websites, or across information in a Lexis-Nexis database, that is potentially citable content for me.

What I want to show in my diagram, then, is something like the following:

  1. scholarly data and its metadata can come from many sources
  2. it has to transfer between different kinds of applications, and across formats
  3. because the current software landscape in this area is so fragmented and without any real standards, the content/metadata link is incredibly fragile, and authors almost always—which is to say without significant exception—need to manually (and therefore incredibly awkwardly) maintain those links
  4. because applications and their file formats are similarly dumb, that link is again broken when authors release their work to the world, often through publishers

Leigh Dodds’ latest is on a similar theme of the metadata density of academic texts–he calls them “palimpsests”–and the need to break them open. As he writes:

I likened the process of authoring a scientific paper to that of the creation of a palimpsest. Starting from original research results and working through the synthesis of a cogent explanation of the results or discovery, at each step the content becomes more abstracted from the original results, the previous work being “lost” to the reader.

Data is presented in pre-analysed forms and is not amenable to reuse. Like the palimpsest, the raw data has not really been lost; it’s just not (easily) accessible to the reader.

If the scriptio inferior, the underlying data, were made available to the reader, then a lot of interesting possibilities arise.

Nice to have smart people in good places.

In his presentation, BTW, Leigh notes the utility of formats like OpenDocument for facilitating this sort of integration. Indeed, this is why the current metadata discussion is so important. So if I can’t quite work out a diagram that details all the links in the currently broken chain of the academic workflow, I can go back to this diagram I used to encapsulate the larger vision I presented at the Access 2005 conference:

OOoBib Plan

Posted in Uncategorized on December 17th, 2005 by darcusb – Comments Off

I recently posted a quick note that we need some C++ coders to help us get some things done. Since then, David Wilson and I have been working on putting together a clear description of what we need on the OpenOffice wiki, which is available here.

In related news, Sun developer Florian Reuter has posted an explanation on his blog of how to code one piece of what we need, for developers interested in contributing.

Intergenerational Democracy

Posted in Uncategorized on December 12th, 2005 by darcusb – Comments Off

The conclusion to the legal brief submitted by Linda Hamel (the General Counsel of the Technology Division) to the Senate post-audit committee:

Prior generations of the Commonwealth’s citizens have taken pains to ensure that current and historic government records are available to our citizens. Our children’s children will live in a world of information technology that we cannot now imagine. Long after today’s popular office applications have disappeared, future Massachusetts citizens will seek information about the past. When we create documents today in open standard formats, we engage in intergenerational democracy, reaching forward across time to future citizens of our Commonwealth to offer them unfettered access to the electronic record of their past and our future, and backward across time to honor the ideals of past citizens of the Commonwealth who fought for open access to public records.

ODF and XMP Comments

Posted in Uncategorized on December 9th, 2005 by darcusb – Comments Off

Leigh Dodds has posted the first of two pieces on his take on XMP and (to a lesser extent) OpenDocument.

I’d like to return to Alan Lillich’s post to the OASIS ODF TC list, and focus in particular on my own answers to a series of smart questions he raises:

How quickly to move on new metadata?

It is essential to get the metadata support right. This could be reasonably done for the next major release of ODF.

Will the new metadata allow open extension?


How are formal schema used? Must end users provide a formal schema in order to use new metadata elements? If not required, is it allowed/supported?

The OpenDocument schema should provide standardized vocabulary support for both Dublin Core and Qualified Dublin Core, which would support the vast majority of metadata needs in ODF. Even for the bibliographic metadata support that we at the OpenOffice Bibliographic Project need–which is quite demanding–that default support should cover roughly 80% of our needs.

The schema should also define the option to include content from other namespaces, based on certain constraints (more below).

If formal schemas are used, what is the schema language? RELAX NG is clearly a better schema language than XML Schema. Can XML Schema be used at all by those who insist on it?

OpenDocument is defined in RELAX NG, and XML Schema is not an appropriate language to define any kind of RDF model. It simply is not expressive enough. The most serious problems with XML Schema are its lack of support for unordered content models and its weak support for attribute-based validation.

While I expect it might be possible to define a somewhat looser XML Schema representation, RELAX NG ought to be the core language, as it already is in ODF.

What is the formal model for the metadata?

RDF, with some constraints.

Is the formal model based on RDF, or can it be expressed in RDF? If so, does it encompass all of RDF? If not all of RDF, what are the model constraints? Can any equivalent serialization of RDF be used?

The formal model is RDF, but removes support for reification.

Does the formal model have a specific notion of reference?

Yes. This is in fact central to the RDF model, and removing it would be a major limitation.

If so, does it work broadly for general local file system use, networked file use, Internet use?

At minimum, the model should enable local (in the file wrapper) linking, which suggests support for rdf:nodeID. This allows use of relative URIs.

What happens to references as files are moved into and out of asset management systems?

If all metadata is embedded in the file wrapper and reference is relative, then there is no problem.

The question of extra-file-wrapper linking would also be valuable to explore, though perhaps ought to be considered a separate question, because it adds complexity.

What kinds of “user standard” metadata features are layered on top of the formal model? Users want helpful visible features. They generally don’t care if things are part of a formal model or part of conventions at higher levels. For example, a UI can make use of standard metadata elements to provide a rich browsing, searching, and discovery experience.

This is an implementation question that is somewhat separate from the details discussed above. However, certainly a GUI editor ought to be able to support:

  • basic string literals
  • ordered sequences
  • option to link to full resources: person objects, controlled subject vocabularies, and so forth
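As a hypothetical sketch of what that means in data terms, an editor would need to model those three value kinds distinctly; all of the names here are invented for illustration:

```ruby
# Hypothetical model of the three kinds of metadata values a GUI
# editor would need to handle (names invented for illustration):
Literal  = Struct.new(:value)         # a basic string literal
Sequence = Struct.new(:items)         # an ordered sequence (think rdf:Seq)
Resource = Struct.new(:uri, :label)   # a link to a full resource

entry = {
  "dc:title"   => Literal.new("Citation Metadata Workflow"),
  "dc:creator" => Sequence.new([Resource.new("#contributor-1", "Doe, Jane")]),
  "dc:subject" => Resource.new("http://example.org/subjects/metadata", "Metadata"),
}

entry.each { |prop, val| puts "#{prop}: #{val.class}" }
```

The point is simply that literals, ordered sequences, and resource links are structurally different, and the UI should surface that difference rather than flatten everything to strings.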

How important is interaction with XMP? Is it important to create a document using OpenDocument then publish and distribute it as PDF?

It is important, but not at the expense of the unique needs of OpenDocument.

If so, how is the OpenDocument metadata mapped into XMP in the PDF?

Simple transform via, for example, XSLT.

Is it important to import illustrations or images that contain XMP into OpenDocument files?


If so, how is the XMP in those files mapped into the OpenDocument metadata? How does it return to XMP when published as PDF?

In general, OpenDocument metadata should be expressed as an XML-tool-friendly RDF/XML subset, embedded as separate files in the ODF wrapper (and therefore integrated into ODF’s existing packaging mechanism) so that metadata can be accessed in simple and consistent ways with a variety of tools. Indeed, one of the OASIS OpenDocument TC’s explicit charter goals is that the format must be friendly to transformations using XSLT or similar XML-based languages or tools, and this must be true of the metadata support as well.

Ideally, then, there should be some mechanism to extract and embed XMP, and to convert between it and the ODF metadata.

How important is interaction with other forms of metadata or other metadata systems? What other systems? How would the metadata be mapped?

It is very important that ODF metadata support be suitable for integration with other tools, such as external metadata servers. If those metadata servers happen to be RDF stores, there may be need to massage such RDF/XML metadata into the RDF constrained subset.

Are there things in XMP that are absolutely intolerable? Things that have no reasonable workaround? Does XMP place unacceptable limitations on possible future directions? Are there undesirable aspects of XMP that can reasonably be changed?

What XMP brings to the table is embedding RDF metadata in binary files. I don’t believe it provides, in its current form, a compelling model and format for OpenDocument. To quote Leigh:

… XMP is … an RDF profile of sorts, although it opts for some rather quirky restrictions on the allowed RDF/XML syntax. Syntactic profiles of RDF don’t scare (or surprise) me, but this one left me with raised eyebrows. Rather than constraining the syntax to a fixed XML format, one that could be validated against an XML schema but still retain an RDF interpretation, the restrictions are placed elsewhere…. I think there are some benefits being lost here. It wouldn’t take much to bring the XMP and RDF models closer together, and still gain the benefits of both predictable structures for applications and the RDF model itself.

I believe this is crucial. The restrictions XMP places on the RDF model–where it throws out much of what is useful in RDF–are not where the real problem with RDF lies. The problem lies mostly in the too-flexible syntax of RDF/XML, which is easy enough to constrain with RELAX NG.

update: Leigh just posted part two, which is more directly about OpenDocument.

OpenDocument and XMP

Posted in Uncategorized on December 7th, 2005 by darcusb – Comments Off

Alan Lillich has just joined the OASIS OpenDocument TC. Alan is an engineer at Adobe who works on their Extensible Metadata Platform (XMP), and he joined the TC to lend his expertise towards resolving the metadata discussion that has been taking place over the past few months at the TC.

To that end, Alan has posted a long explanation of Adobe’s perspective on the problem, entitled OpenDocument metadata and XMP, which is in part a response to some of the concerns I have raised. In a nutshell, we are left with some difficult choices. The TC can adopt XMP as is, and get instant metadata interoperability across a range of applications and file formats.

In doing so, however, one gives up a lot of important RDF features, and must play by the XMP rules. XMP has no support for RDF’s native linking facilities or typing. Likewise, it supports neither rdf:parseType=”Literal” nor rdf:parseType=”Resource”. Finally, in the current syntax, all duplicate properties must be placed in either rdf:Alt, rdf:Bag, or rdf:Seq.

OTOH, XMP has been widely deployed across a variety of commercial and open source applications for the past few years. Modifying XMP to address some of the above limitations could involve serious backward compatibility issues for Adobe and others.

For those with the technical background adequate to assess the trade-offs here, it would be good to comment on Alan’s piece somewhere. I suspect it will frame the discussion going forward, and we’ll all be living with the outcome for a long time to come.

update: Bob DuCharme reminded me of another big concern a number of people have raised: Adobe’s XMP toolkit only supports C++, which greatly restricts its real-world usability.