darcusblog » 2005 » May - geek tools and the scholar

Archive for May, 2005

Rails, RefDB and a Wishlist

Posted in General on May 18th, 2005 by darcusb – 1 Comment

Diwaker Gupta announced on the refdb-users list that he will be porting the RefDB perl client module to Ruby in order to build a web application around it using Rails.

This is good news, and has prompted me to put together my bib web application wish list, much of which I’ve written about in scattered posts, but never in one place. So, here is what I think the world needs in a 21st-century bib app:

  1. Clean XHTML and CSS design.
  2. A configurable form system flexible enough to be configured for any resource type: everything from journal articles to books, to archival documents, to weblog posts. This presumes the form system should not be based on RIS or BibTeX, but rather on a more flexible standard like MODS. Either XML or YAML would be good bets for configuration languages in Ruby or Python.*
  3. Should exploit AJAX to make for an elegant and dynamic interface. The user should not have to reload pages just to see a record detail, nor to add a field to a form. Features like live search should be exploited where possible (not only in search, but maybe in auto-completing form fields for personal names).
  4. Powerful annotation support built in from the beginning, rather than as an afterthought. It should use a wiki language like Textile, suitably extended for citation support, for easy semantic markup and linking. The new Rails app Backpack has done some nice work in this area.
  5. Should integrate into the web, both by pulling in records from elsewhere (there is a ZOOM binding for Ruby that gives support for Z39.50 and SRU/W query and retrieval), and by distributing them with, for example, Atom.

Hmmm … am I missing something?

* Here’s a fragment of a possible YAML example config file:

academic journal:
  class: part-inSerial
  fields:
    title: 
      label: title
    creator:
      label: author
      repeat: yes
    yearIssued:
      label: year
    isPartOf:
      title: 
        label: ABC Journal
      creator:
        label: editor
        repeat: yes
book:
  class: part-inSerial
  fields:
    title:
      label: title
    creator:
      label: author
      repeat: yes
    yearIssued:
      label: year
    publisher:
      label: publisher
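
As a rough illustration of how a configuration like this might drive a form system, here is a minimal Python sketch. It assumes the YAML above has already been parsed into nested dictionaries (for example by a YAML library). The `flatten_fields` helper and the dotted field names are hypothetical and not taken from any existing application.

```python
# Hypothetical: the "academic journal" entry from the config above, as it
# might look after being parsed into plain Python dictionaries.
CONFIG = {
    "academic journal": {
        "class": "part-inSerial",
        "fields": {
            "title": {"label": "title"},
            "creator": {"label": "author", "repeat": True},
            "yearIssued": {"label": "year"},
            "isPartOf": {
                "title": {"label": "ABC Journal"},
                "creator": {"label": "editor", "repeat": True},
            },
        },
    },
}


def flatten_fields(fields, prefix=""):
    """Walk the nested field config and return a flat list of form inputs.

    A node with a "label" key is treated as a single input; any other
    node is treated as a group (such as isPartOf) and walked recursively.
    """
    inputs = []
    for name, spec in fields.items():
        path = f"{prefix}{name}"
        if "label" in spec:
            inputs.append(
                {
                    "name": path,
                    "label": spec["label"],
                    "repeat": bool(spec.get("repeat", False)),
                }
            )
        else:
            inputs.extend(flatten_fields(spec, prefix=path + "."))
    return inputs


form_inputs = flatten_fields(CONFIG["academic journal"]["fields"])
for form_input in form_inputs:
    marker = " (repeatable)" if form_input["repeat"] else ""
    print(f"{form_input['name']}: {form_input['label']}{marker}")
```

The point of the sketch is that adding a new resource type would mean adding a config entry, not writing a new form by hand.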

XSLT 2.0 and CSS Parsing

Posted in Uncategorized on May 12th, 2005 by darcusb – Comments Off

Here is a nice example of XSLT 2.0’s regular expression support for parsing CSS.

[sometimes I use this blog for note-taking!]

XSLT and Blogging

Posted in Uncategorized on May 11th, 2005 by darcusb – Comments Off

Dave Pawson’s call to the XSLT community to create an open source weblog tool based on XSLT seems to have drawn some interest (see here and here for example).

I like the idea, with one caveat: such a project ought to be an opportunity to show the unique power that XML and XSLT can bring to the blogging mix, not just another Movable Type clone.

Syncato seems like a nice place to start for a model, but the world has moved along since Syncato was released. Berkeley DB XML now has XQuery support, and AJAX has taken web design by storm, showing just how good web GUIs can be.

And I still would like to see someone, somewhere, try to implement a content publishing system that is not quite so constraining as a blog; something that blurs the boundaries between blogs and wikis.

Citation Formatting in Theory and Practice

Posted in Uncategorized on May 8th, 2005 by darcusb – Comments Off

I’ve been arguing for a long time that there are better ways to do bibliographic formatting than BibTeX or Endnote et al., and that relatively new standards like MODS help facilitate this.

Last week I put my ideas to the test with the deadline for my book manuscript. The entire thing is authored in DocBook, and all the citation records are stored in MODS in an XML DB. I also wrote all of the code to create the formatted output, so any problems were my responsibility!

The book itself includes a range of references: journal articles and books, legal cases and bills, archival documents, and media sources. It’s a pretty demanding test of a citation processing system.

So how did it do? Actually, really well! Most of the formatting problems I encountered could be traced to poorly coded MODS records. I did run into some other issues that required last-minute fixing, but the design of MODS and of my formatting system actually made this fairly easy.

For example, I had an issue at the end of a 19-hour day on Thursday with newspaper articles. I decided that instead of using the reporters as the sort key, I’d use the periodical title. The question was, how to do this?

The design of my system is such that default fallbacks like “article” tend to do most of the formatting, including for newspaper articles. I could have defined a new formatting definition specific to newspaper articles, but that would have introduced some awkwardness in the structure of the styling definition (long story, but the language is hierarchical, and so it is not designed to mix structures across levels; a periodical title is on one level, while the article reporter is on another).

My solution was to rely on configuring the mods:name “role” handling. I have two variables: primary-contributors and secondary-contributors. When the XSLT encounters a record with names, it first asks whether any of the names correspond to a primary-contributor. If yes, it uses them. If not, it uses the alternate sort key, which is defined like:

<creator alternate-sortkey="container-title">
  ...
</creator>

I thus defined the “reporter” role as a secondary-contributor. Problem solved; the system now used the alternate-sortkey.
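
The fallback logic can be sketched in a few lines of Python. This is not the XSLT from my system, only an illustration of the same decision: the role sets, the record layout and the `sort_key` function are all hypothetical, and the sample records are made up.

```python
# Assumption: roles are grouped into two sets, mirroring the
# primary-contributors and secondary-contributors variables.
PRIMARY_ROLES = {"author", "editor"}
# Listed for completeness; any role that is not primary falls through
# to the alternate sort key in sort_key() below.
SECONDARY_ROLES = {"reporter", "translator"}


def sort_key(record, alternate_sortkey="container-title"):
    """Return the string a bibliography entry would be sorted on.

    If any name on the record has a primary role, sort on those names.
    Otherwise fall back to the field named by alternate_sortkey.
    """
    primary = [
        entry["name"]
        for entry in record.get("names", [])
        if entry["role"] in PRIMARY_ROLES
    ]
    if primary:
        return "; ".join(primary)
    return record[alternate_sortkey]


# Made-up sample records for illustration only.
journal_article = {
    "names": [{"name": "Doe, Jane", "role": "author"}],
    "container-title": "ABC Journal",
}
newspaper_article = {
    "names": [{"name": "Smith, John", "role": "reporter"}],
    "container-title": "The Daily Example",
}

print(sort_key(journal_article))    # sorted on the author
print(sort_key(newspaper_article))  # sorted on the periodical title
```

Because the decision hangs on role configuration alone, no newspaper-specific formatting definition is needed.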

My theory has been that the greater flexibility and expressive power of MODS could actually make formatting easier, rather than more complex. This is because the metadata is, for the most part, neatly modularized. So far, practical tests suggest the theory is correct.

Of course, it has taken a lot of work to get the code in shape for this!

Version Control

Posted in Uncategorized on May 7th, 2005 by darcusb – Comments Off

When I started working on the book I just finished, I decided to put it under version control using CVS. I liked being able to get a history of the document and the option to roll back to previous versions. I also liked being able to easily synchronize work between my laptop and my office.

However, I also found CVS limiting. So I tried Subversion, since I’d read it does a better job with things like renaming and moving files and directories.

Again, though, I found limitations that didn’t fit well with my way of working. Beyond installation and administration problems, I also found I wasn’t actually using the change tracking a whole lot. This is in part because I often work off-line, where it’s impossible to commit changes under the CVS/SVN model.

Enter darcs. I read about this somewhere and decided to give it a try. I’m glad I did! The system is easy to install, easy to administer, easy to use, and also powerful! Like a number of other next-generation SCM systems, darcs is based on a distributed model. So, rather than having a single central repository, I can have one on each machine I work on. When I make a change somewhere, I record it whether I am online or not. When I am done, I “push” those changes to the other repository.

After one more permissions problem with Subversion, then, I decided to move my projects to darcs. So far so good!

Innovation and Problems of Metadata Modeling

Posted in Uncategorized on May 1st, 2005 by darcusb – 5 Comments

Someone pointed me to the really promising new Ruby on Rails application WEIRD.

The fantastically cool innovation in WEIRD is its annotation support. Rather than work with paper copies in which you write scribbles in the margin and underline key passages, and only later go back and type in notes in your bibliographic database, you do this directly in your browser. Highlight a passage, and the box magically appears in real-time on the screen, and gets stored in the database.

WEIRD makes use of a set of libraries that convert electronic formats like PDF into Flash, and then uses Flash to both display the pages in your browser, and to handle the user interaction that allows highlighting passages.

This is really cool stuff. I also like the attention to detail that allows one to mark a note as private (though I wish it had support for semantic wiki-like markup of annotation content, and the ability to export to LaTeX, DocBook, XHTML, etc.).

However, when dealing with applications for scholars, you need to start at the beginning and ask:

  1. who is your imagined user base?
  2. what kinds of documents do they want to store and annotate?
  3. what kinds of workflows make their work best, um, flow?

On these counts, I think WEIRD has a ways to go. As is far too typical in bibliographic application development, the imagined user base is a narrow one: for the most part people from hard science or technical fields. Nowhere does it make room for, say, law students wishing to work through case law, or historians who work with archival documents, or media studies people who want to annotate, say, photographs. Or, in my case, the scholar who wants to post chapters of a manuscript for comment.

To understand the issues, we again must look at the lowest levels of the data model. Consider this from the model/articles.rb file:

class Article < ActiveRecord::Base

  # Relationships
  belongs_to :user
  has_many :comments
  has_many :annotations

The database structure reflects this, with the following tables:

  1. annotations
  2. articles
  3. comments
  4. users

The bibliographic metadata itself is stored as a single string: as BibTeX! So once again we have YABA, with all of the baggage that goes along with relying on a horribly broken data model.

A few months ago I sat down and tried to figure out a minimal SQL schema I felt would be simple enough to implement in these sorts of applications, but general enough that I could actually use the application that resulted. Here are the tables I came up with:

  1. agents (person or organization)
  2. bibitem
  3. contributor (joins an agent to a bibitem)
  4. relations (joins bibitem-bibitem; article to journal, chapter to book, etc.)
  5. notes (perhaps distinguishing annotations from comments makes more sense in a WEIRD context)
  6. users
  7. topics

So a journal article would be a bibitem with a contributor with role of author and a genre of article, which has a relation of isPartOf to another bibitem with genre of academic journal. Most of the citation-specific metadata like volume, issue and page numbers would be stored in separate rows in the main bibitem level.
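
To make that concrete, here is a rough sketch of the schema using Python’s built-in sqlite3 module. Only the table names come from the list above; every column name is an assumption made for illustration, the topics table is left out for brevity, and the sample records are made up.

```python
import sqlite3

# Assumption: all column names below are illustrative guesses; only the
# table names come from the list above. The topics table is omitted.
SCHEMA = """
CREATE TABLE agents (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE bibitem (
    id INTEGER PRIMARY KEY,
    genre TEXT NOT NULL,
    title TEXT NOT NULL,
    year_issued TEXT
);
CREATE TABLE contributor (
    agent_id INTEGER NOT NULL REFERENCES agents(id),
    bibitem_id INTEGER NOT NULL REFERENCES bibitem(id),
    role TEXT NOT NULL
);
CREATE TABLE relations (
    subject_id INTEGER NOT NULL REFERENCES bibitem(id),
    relation TEXT NOT NULL,
    object_id INTEGER NOT NULL REFERENCES bibitem(id)
);
CREATE TABLE users (id INTEGER PRIMARY KEY, login TEXT NOT NULL);
CREATE TABLE notes (
    id INTEGER PRIMARY KEY,
    bibitem_id INTEGER NOT NULL REFERENCES bibitem(id),
    user_id INTEGER NOT NULL REFERENCES users(id),
    body TEXT NOT NULL
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# A journal article that isPartOf an academic journal (made-up sample data).
conn.execute("INSERT INTO agents (id, name) VALUES (1, 'Doe, Jane')")
conn.execute(
    "INSERT INTO bibitem (id, genre, title) "
    "VALUES (1, 'academic journal', 'ABC Journal')"
)
conn.execute(
    "INSERT INTO bibitem (id, genre, title, year_issued) "
    "VALUES (2, 'article', 'An Example Article', '2005')"
)
conn.execute(
    "INSERT INTO contributor (agent_id, bibitem_id, role) VALUES (1, 2, 'author')"
)
conn.execute(
    "INSERT INTO relations (subject_id, relation, object_id) "
    "VALUES (2, 'isPartOf', 1)"
)

# Reassemble author, article title and journal title with joins.
row = conn.execute(
    """
    SELECT a.name, art.title, j.title
    FROM bibitem AS art
    JOIN contributor AS c ON c.bibitem_id = art.id AND c.role = 'author'
    JOIN agents AS a ON a.id = c.agent_id
    JOIN relations AS r ON r.subject_id = art.id AND r.relation = 'isPartOf'
    JOIN bibitem AS j ON j.id = r.object_id
    WHERE art.genre = 'article'
    """
).fetchone()
print(row)
```

The journal is itself a bibitem, so the same tables would handle a chapter in a book or a document in an archival collection without any schema change.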

This is, mind you, a minimal acceptable schema from my standpoint. A more ambitious model would be based more closely on FRBR (as LibDB is), and would result in three additional tables, with what is bibitem above broken down into:

  1. work
  2. expression
  3. manifestation
  4. item
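
As a hedged illustration of how those four levels nest, here is a small Python sketch. The four class names follow the FRBR entities listed above; the attributes, the `all_items` helper and the sample data are assumptions of mine, not a rendering of LibDB or of the FRBR specification.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Item:
    """A single physical or digital copy."""
    location: str


@dataclass
class Manifestation:
    """A particular publication of an expression (publisher, year, format)."""
    publisher: str
    year: int
    items: List[Item] = field(default_factory=list)


@dataclass
class Expression:
    """A specific realization of a work, such as a translation or edition."""
    language: str
    manifestations: List[Manifestation] = field(default_factory=list)


@dataclass
class Work:
    """The abstract intellectual creation."""
    title: str
    expressions: List[Expression] = field(default_factory=list)


def all_items(work):
    """Collect every item reachable from a work through the hierarchy."""
    return [
        item
        for expression in work.expressions
        for manifestation in expression.manifestations
        for item in manifestation.items
    ]


# Made-up sample data for illustration only.
work = Work(
    title="An Example Book",
    expressions=[
        Expression(
            language="en",
            manifestations=[
                Manifestation(
                    publisher="Example Press",
                    year=2005,
                    items=[Item("office shelf"), Item("library copy")],
                )
            ],
        )
    ],
)
print([item.location for item in all_items(work)])
```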

I’m personally not convinced an XML DB like eXist or Berkeley DB XML isn’t a better approach to storing and querying bibliographic metadata and annotations, but if people must use an RDBMS, they really need to sit down and think carefully about the data model, and how best to exploit the unique strengths of the storage technology. Bibliographic data is NOT simple, and basing an application on BibTeX is a sure way to limit how broadly it might be used!