darcusblog » 2005 » July - geek tools and the scholar

Archive for July, 2005


Posted in Uncategorized on July 30th, 2005 by darcusb – Comments Off

Ian Davis (technical lead for the Silkworm project at Talis) and Richard Newman have announced the first public draft of an RDF OWL schema for the FRBR. It’s the first publicly available schema for the FRBR as far as I’m aware, and it’s RDF, which means it’s designed to be used alongside other, more domain-specific ontologies. It can begin to put in place the metadata plumbing that allows answers to queries like “give me all German-language expressions of Shakespeare’s Hamlet” or “give me all manifestations of Bob Marley’s Burnin’.”

This could be really interesting!

BTW, to see a pretty functional demo of what an FRBR-based search interface might look like for standard library data, see the FictionFinder.

Web Services and Distributed Citation Processing

Posted in Uncategorized on July 30th, 2005 by darcusb – Comments Off

One of the ideas I stumbled on when writing CiteProc, my XSLT-based citation processor, is that citation processing can be totally decoupled from metadata storage. A simple example of this is how I processed my recently completed book: by letting Saxon query an eXist XML DB over HTTP and using the returned MODS metadata to format the citations on-the-fly.

That was great because I didn’t have to write any code but the XSLT, and it worked! But things start to get more interesting when you think beyond this fairly simple model. Consider two examples that came out of collaborations with other projects:

In the first, Matthew Dovey at Oxford put together a simple web service that takes four parameters: document url, data store type (eXist’s XQuery-over-HTTP or SRU), data store url, and citation style. Here’s an example, where the document is on one server, and the bibliographic metadata is stored on another.
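To make the shape of such a service concrete, here is a minimal Python sketch of a client-side request. It is only an illustration: the endpoint, the example URLs, and the parameter names (`doc`, `store_type`, `store`, `style`) are all hypothetical, since the only thing known here is that the service takes these four inputs.

```python
from dataclasses import dataclass
from urllib.parse import urlencode

# The two data store types mentioned above.
STORE_TYPES = {"exist", "sru"}


@dataclass(frozen=True)
class FormatRequest:
    """The four inputs the service is described as taking."""

    document_url: str  # where the document with the citations lives
    store_type: str    # "exist" (XQuery over HTTP) or "sru"
    store_url: str     # where the bibliographic metadata lives
    style: str         # name of the citation style

    def to_url(self, endpoint: str) -> str:
        # The query parameter names are invented for illustration.
        if self.store_type not in STORE_TYPES:
            raise ValueError(f"unknown store type: {self.store_type!r}")
        params = {
            "doc": self.document_url,
            "store_type": self.store_type,
            "store": self.store_url,
            "style": self.style,
        }
        return f"{endpoint}?{urlencode(params)}"


# Document on one (made-up) server, metadata on another.
request = FormatRequest(
    document_url="http://docs.example.org/chapter1.xml",
    store_type="sru",
    store_url="http://biblio.example.net/sru",
    style="author-date",
)
url = request.to_url("http://citeproc.example.com/format")
```

The point of the design is that nothing in the request ties the document to the metadata store; either URL can change without touching the other.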

The second example is similar, and a demo is included in the CiteProc release archive. If you run the docbook-test-sru-refbase.xml example with the refbase-xhtml.xsl stylesheet, the processor (Saxon for now) will extract the citations, construct an SRU query, and issue it to a test server somewhere in Germany. The server returns the corresponding MODS records, which are then formatted, once again, on-the-fly.

OK, this is starting to look very cool, and very useful. All of a sudden we have an easy, standards-based path to interoperability!

But given that I’ve been thinking about RDF again lately, I’m imagining extending this further. A simple solution would be a web service that could take a list of references, query distributed RDF stores, and return a collection of MODS records for processing. A more radical solution might be to use, say, a SPARQL XSLT extension to work with the RDF directly from within XSLT.
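As a rough sketch of the first step of the simple solution, here is how a service might turn a list of references into a SPARQL query, written in Python. Everything specific is an assumption: that references are identified by URIs, that the stores describe them with Dublin Core properties (`dc:title`, `dc:creator`, `dc:date`), and that the endpoint supports the `VALUES` clause. Sending the query and converting the results to MODS are left out.

```python
# Characters that may not appear inside <...> in SPARQL.
_UNSAFE_IRI_CHARS = set('<>" {}|\\^`')


def build_reference_query(reference_uris: list[str]) -> str:
    """Build a SPARQL SELECT for basic metadata about each reference."""
    if not reference_uris:
        raise ValueError("need at least one reference URI")
    for uri in reference_uris:
        if _UNSAFE_IRI_CHARS & set(uri):
            raise ValueError(f"not a safe IRI: {uri!r}")
    values = " ".join(f"<{uri}>" for uri in reference_uris)
    return (
        "PREFIX dc: <http://purl.org/dc/elements/1.1/>\n"
        "SELECT ?ref ?title ?creator ?date WHERE {\n"
        f"  VALUES ?ref {{ {values} }}\n"
        "  ?ref dc:title ?title .\n"
        # Creator and date are optional so sparse records still match.
        "  OPTIONAL { ?ref dc:creator ?creator }\n"
        "  OPTIONAL { ?ref dc:date ?date }\n"
        "}"
    )


# Made-up reference URIs, for illustration only.
query = build_reference_query([
    "http://example.org/ref/doe2005",
    "http://example.org/ref/smith1999",
])
```

The same query could be sent to several stores in turn, which is where the "distributed" part would come in.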

In either case, my hunch is that there’s a lot of possibility in this idea, and that the old notion of every user having to store and manage their own citation metadata—or conversely that it all ought to be stored on a centralized server—is one that is seriously holding back innovation in this space. Why do I even have to maintain my own citation metadata anyway?

The Semantic Web, RDF and Scholarly Metadata

Posted in Uncategorized on July 24th, 2005 by darcusb – 4 Comments

Metadata is central to scholarly activity of all kinds. Whether it’s students working on term papers, or researchers writing books and articles, much of that work involves marshaling metadata towards a convincing argument.

And yet, as I have said before, I find working with metadata far more work than I wish it were. More importantly, it’s more work than it needs to be.

Consider the work I do just to be able to gather the metadata to format my citations:

  1. For some journals, go to site X and find articles I want to cite; download RIS data for each separate article, then use Bibutils to convert them to MODS.
  2. For other journals, go to site Y and find articles I want to cite; download Refer data (they don’t offer RIS) for each separate article, then use Bibutils to convert them to MODS.
  3. For most books, I can grab MODS data for them directly over SRU from the Library of Congress. Except, the data is often so bad for my purposes (missing name roles, spurious markup, etc.) that I often just create new MODS records by hand.
  4. For everything else (which is a lot in my case), hand create the MODS records, with a little help from emacs templates.
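Steps 1 and 2 above amount to picking the right Bibutils converter for each download. A minimal Python sketch of that dispatch follows. `ris2xml` and `end2xml` are the Bibutils tools that turn RIS and Refer/EndNote-tagged files into MODS; the file extensions and the helper names are my own assumptions for illustration.

```python
import subprocess
from pathlib import Path
from typing import Optional

# Bibutils converters that write MODS XML to stdout.
# The extension-to-format mapping is an assumption.
CONVERTERS = {
    ".ris": "ris2xml",  # RIS exports (site X)
    ".enw": "end2xml",  # Refer/EndNote-tagged exports (site Y)
}


def conversion_command(download: Path) -> Optional[list[str]]:
    """Return the Bibutils command for a download, or None if unknown."""
    tool = CONVERTERS.get(download.suffix.lower())
    if tool is None:
        return None  # steps 3 and 4: SRU or a hand-made record
    return [tool, str(download)]


def to_mods(download: Path) -> str:
    """Run the converter and return MODS XML (requires Bibutils on PATH)."""
    command = conversion_command(download)
    if command is None:
        raise ValueError(f"no converter for {download.name}")
    result = subprocess.run(command, capture_output=True,
                            text=True, check=True)
    return result.stdout
```

Even with this automated, each article is still a separate download from a separate site, which is the real cost.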

And in order to be able to use these data, I need to store them in a central location: a bibliographic collection in my eXist XML DB. Never mind the months I spent writing code to be able to put it all to good use.

I sometimes feel like it’s more work to use “time-saving” web gateways than to just walk over to the library, pull the journals off the shelves, and hand write some notes on what I’m reading. Does it really need to be like this?

Having started to write this, I came across a video that is purportedly an Apple promotional video from the mid-1990s. It lays out a vision for what personal computing ought to look like in the future. Interestingly enough, the video does this by dramatizing scholarly workflow. A Berkeley professor—a geographer no less—enters his office and opens his computer. A talking head begins telling him his schedule for the day, which includes an afternoon lecture on Amazon deforestation. It seems the absent-minded professor had forgotten about the lecture, and so asks (verbally) the computer to pull up last year’s lecture notes. Not satisfied the information is sufficiently current, he asks his talking head assistant to find all recent related work. The assistant responds “only journal articles?” “Yes,” the professor responds.

I won’t recount the whole video, but suffice it to say that this is exactly the sort of seamless access to information that I’d like to see sometime before my career is over. And yet, we’re so far from being there that I often find myself frustrated.

It’s within this context that I observe a rather old debate playing out with respect to the library world. I suppose I started it by asking an innocent question of Kevin Clarke’s post on metadata interoperability: “what about RDF??”

From my understanding, the origins of RDF lie with work done at Apple during roughly the same time period as this video was put together. Indeed, the video is essentially all about a vision of a semantic web. It’s telling to me that Apple chose to dramatize that vision using a professor.

Yet when the subject of RDF is raised in the library community, in general the response is either silence, or outright hostility. I’ve yet to hear a single convincing argument why not RDF, and it bothers me that the design of library standards like MODS and MADS suggests that there has been no attempt to make them RDF compliant.

And yet there is some RDF-related movement in the bibliographic world, though most of it is spurred by people coming from outside the community. There is the SIMILE project at MIT, of course. And Leigh Dodds, who had some interesting things to say in response to Kevin, is heading up Ingenta’s quite ambitious move towards RDF. That started a while back when they began serving up PRISM RSS feeds of their journal holdings, but will deepen significantly as they move to an RDF backend.

All of a sudden I can start to imagine a different way. Instead of me having to maintain my own normalized metadata (which takes a LOT of work), why can’t I just create citations that point to resources in disparate locations on the web? Why can’t I have elegant search applications that can find me the information I need, and access its metadata, without me having to visit 10 different sites, most of them poorly designed, and each with its own UI eccentricities? And for the RDF community, how about ditching BibTeX (with all of its significant problems) and adapting CiteProc to support an RDF-based approach, where one formats one’s LaTeX/DocBook/OpenOffice/Word documents using an elegant citation style language and distributed RDF metadata?

Really, we need a better way! Yeah, I know there are all kinds of institutional and financial and technical barriers to getting where we need to go, but we need to get there, and it seems to me RDF is a better solution than vanilla XML. As I said in a comment to Kevin’s post:

There are some serious metadata issues that the library world needs to grapple with over the coming years. I’m thinking not only about figuring out smooth ways to integrate disparate data, but also to begin to better put it in the framework of a larger view such as the FRBR. To do that well, I think the library world needs to do a much better job of interacting with other communities that are grappling with the same issues. I sadly don’t see even a hint that this is happening with RDF.

And conversely, I would add, the library world can still play an important role in revolutionizing the web more generally; in incrementally helping to realize at least some of the vision of the semantic web. What if, for example, the FRBR became broadly used and understood in the RDF world? What if library authority data became widely used and cited far beyond libraries?