darcusblog » 2004 » November - geek tools and the scholar

Archive for November, 2004

Citation IDs

Posted in Uncategorized on November 27th, 2004 by darcusb – 16 Comments

It’s clear the future of bibliographic and citation management is greater interoperability and collaboration, not less. In the future, users will create less bibliographic data and consume more of it.

For individual document authors, this raises an issue: how to code citations so that they clearly and unambiguously point to the correct record? In a single-user context, this problem has generally been solved in one of two ways:

  1. a numeric database id
  2. a natural language citation key (e.g. Doe99a)

The first approach, used by default in EndNote, has the virtue of uniqueness within a single-user context. Beyond that, however, documents break. An ID of 2312 will point to one record on User A’s system, and another record entirely on User B’s.

I prefer the second approach myself, because the identifier is tied to the content, not the storage. I can look at the citation and deduce the record it refers to. However, the citekey approach has its own problems when you start to scale it from the desktop to the internet. How does one ensure that cite{Smith99} is unique?

So, a question: what standardized way of identifying citation records best balances these needs? There’s a discussion of this over on the BibDesk wiki. I tend to like the approach they note from CiteSeer, which concatenates the author name, two-digit year, and the title. Example: authorYEARtitle (mccracken03greatPaperAboutStuff). While it is a bit verbose, it seems to be the best approach to me. It is superior to the tradition (which I use!) of appending a suffix to disambiguate repeated author-year combinations (e.g. Doe1999b) because:

  • it is more portable
  • it is more information rich (you can see at a glance the specific record it points to)
  • it doesn’t add punctuation like colons (that could cause problems in some contexts?)
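As a rough illustration of the authorYEARtitle scheme, here is a minimal Python sketch. The function names are hypothetical, and the normalization choices (stripping accents and punctuation, lowercasing the author, camel-casing the title words) are my assumptions based on the one example above, not a description of what CiteSeer or BibDesk actually do.

```python
import re
import unicodedata


def _ascii_words(text):
    """Split text into ASCII alphanumeric words, dropping accents and punctuation."""
    normalized = unicodedata.normalize("NFKD", text)
    ascii_text = normalized.encode("ascii", "ignore").decode("ascii")
    return re.findall(r"[A-Za-z0-9]+", ascii_text)


def make_citekey(author_family_name, year, title):
    """Build an authorYEARtitle key (hypothetical helper, assumed normalization)."""
    # Author: family name, lowercased, with spaces and punctuation removed.
    author = "".join(_ascii_words(author_family_name)).lower()
    # Year: last two digits, zero padded.
    two_digit_year = f"{int(year) % 100:02d}"
    # Title: first word lowercased, later words capitalized (camel case).
    words = _ascii_words(title)
    if words:
        title_part = words[0].lower() + "".join(
            w[:1].upper() + w[1:].lower() for w in words[1:]
        )
    else:
        title_part = ""
    return f"{author}{two_digit_year}{title_part}"


print(make_citekey("McCracken", 2003, "Great Paper About Stuff"))
```

Because the key contains only letters and digits, it avoids the punctuation problem noted in the list, at the cost of length. Two-digit years remain ambiguous across centuries, and two papers with the same author, year, and title would still collide.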



Posted in Uncategorized on November 22nd, 2004 by darcusb – Comments Off

Going from being an MS Word user to coding my own software tools has often been a frustrating experience. Every so often, however, I have an “ah ha” moment in which the world opens up, where I solve a tricky problem in a way that has implications for solving other problems.

About a week ago I had one of those moments when trying to work out how to modify my citation formatting stylesheets to grab the necessary bibliographic data from outside the document. I first tackled the fairly simple problem of reading in flat files. So, I scan the document for citation references, and then use those to pull the matching MODS records into a temporary tree. In other words, I create an in-memory copy of the document in which only the necessary MODS records are inserted into the DocBook bibliography element.

This was a useful step forward. The “ah ha” moment, however, came when I asked Wolfgang Meier how I could get this basic strategy working so that I access the MODS data not from flat files, but from the eXist XML DB. Answer: send an HTTP query to eXist.

Ah ha; it actually worked! In the first approach, I just sent a new request for each unique citation reference. This is sort of slow and doesn’t scale well. So, Wolfgang and I instead worked out a way to grab all of the records with a single query by sending a simple XQuery request. Once Wolfgang fixed a bug in eXist that was slowing things down, we got it working. And it’s fast!
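To make the single-query idea concrete, here is a minimal Python sketch (not the XSLT I actually use). It rests on several assumptions: citations are marked up as biblioref elements with a linkend attribute, each MODS record carries a matching ID attribute, the records live in a hypothetical /db/mods collection, and eXist accepts an XQuery through a _query parameter on its REST endpoint. The base URL is a placeholder, and the code only builds the request; it does not send it.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

MODS_NS = "http://www.loc.gov/mods/v3"


def citation_ids(docbook_xml):
    """Return the unique citation keys in document order.

    Assumes citations look like <biblioref linkend="key"/>.
    """
    root = ET.fromstring(docbook_xml)
    seen = []
    for element in root.iter():
        # Strip any namespace prefix so namespaced and plain DocBook both match.
        local_name = element.tag.rsplit("}", 1)[-1]
        key = element.get("linkend")
        if local_name == "biblioref" and key and key not in seen:
            seen.append(key)
    return seen


def build_xquery(ids, collection="/db/mods"):
    """Build one XQuery that selects every needed MODS record at once."""
    # XQuery escapes a double quote inside a string literal by doubling it.
    quoted = ", ".join('"' + i.replace('"', '""') + '"' for i in ids)
    return (
        f'declare namespace mods="{MODS_NS}"; '
        f'collection("{collection}")//mods:mods[@ID = ({quoted})]'
    )


def build_request_url(ids, base_url="http://localhost:8080/exist/rest/db"):
    """Encode the query for an HTTP GET (endpoint and parameter are assumptions)."""
    return base_url + "?" + urlencode({"_query": build_xquery(ids)})


doc = (
    '<article><para><biblioref linkend="doe99a"/> and '
    '<biblioref linkend="smith04"/>, then <biblioref linkend="doe99a"/> '
    "again.</para></article>"
)
print(build_request_url(citation_ids(doc)))
```

The point of the design is that the number of HTTP round trips stays at one no matter how many distinct citations the document contains, which is what made the second approach so much faster than one request per reference.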

The implications of this are, to my mind, huge. In the case of my stylesheets, they can be integrated into any environment that supports queries over HTTP to return MODS records (like SRU/CQL). There’s no need for additional code. More broadly, just imagine the sort of dynamic, database-driven transformation processing this approach can enable.

Blogs and Wikis and Content Publishing

Posted in Uncategorized on November 7th, 2004 by darcusb – Comments Off

Tim Bray argues blogs and wikis have little in common. As he puts it:

they’re both about people placing content on the Web for other people, but in their essential nature, it seems like they couldn’t be more different. A wiki is a collaborative construction engine, with refactoring and edit-in-place being the dominant forms of activity, and many equal voices singing in a chorus. A blog is more like a content faucet, a source with one voice, always growing at one end; while updates to existing content are OK, the dominant activity is pouring new text and pictures and whatever in.

While recognizing Tim’s point about the differences, I think the more important point is their similarities: they are both content authoring and publishing media. So I start to think simply about content tied to metadata, about different ways to serialize (time-based vs. topic-linked) and publish that content (public vs. private), and different ways to author (collaborative vs. individual). Why shouldn’t I be able to author a note in wiki markup, mark it as private, and then later decide to publish it as a blog entry? Similarly, one increasingly sees collaborative blogs.
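A toy sketch of what “content tied to metadata” might look like, in Python. Every name here is hypothetical and this is not how any existing blog or wiki engine models its data; it only shows that one stored item can feed both a time-based view and a topic-linked view.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class ContentItem:
    title: str
    body: str                              # wiki markup source
    created: datetime
    tags: List[str] = field(default_factory=list)
    public: bool = False                   # private note until marked otherwise
    published: Optional[datetime] = None   # set when promoted to a blog entry


def publish_as_blog_entry(item, when):
    """Promote an existing note to a public, dated blog entry."""
    item.public = True
    item.published = when
    return item


def blog_view(items):
    """Time-based serialization: public, published items, newest first."""
    entries = [i for i in items if i.public and i.published is not None]
    return sorted(entries, key=lambda i: i.published, reverse=True)


def wiki_view(items, include_private=False):
    """Topic-linked serialization: item titles grouped by tag."""
    index = {}
    for item in items:
        if item.public or include_private:
            for tag in item.tags:
                index.setdefault(tag, []).append(item.title)
    return index


note = ContentItem("Citekey ideas", "h1. Notes", datetime(2004, 11, 1), tags=["citations"])
publish_as_blog_entry(note, datetime(2004, 11, 27))
print([entry.title for entry in blog_view([note])])
print(wiki_view([note]))
```

The note is authored once; whether it shows up as a blog entry, a wiki page, or stays private is just a matter of metadata and which view is asked for.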

I think developers need to think more creatively about exploiting the points of connection between blogs and wikis, as well as more comprehensive CMSs. My concrete thinking is that the Ruby community seems to be doing some interesting work with the new Rails web development framework, and they may be in a good position to push the state of the art here. Some people in that community are starting to see these connections. For example, in comments on using Rails for blog applications, one poster says: “I’ve thought about using Instiki as the basis for a blog. It really seems like there is a lot of common code between a wiki and a blog.”

This is right, but I’d really go much farther, and think about a sort of CMS framework built on Rails, and then various modules (wiki, blog, etc.) that can plug into it. So Instiki-like wiki functionality could simply be dropped into that system, as could blog functionality, all of it integrated into a more comprehensive CMS context, which could include RSS aggregators, bibliographic databases, etc. Indeed, Instiki already has CMS-like functionality with its excellent export and TeX-processing support. Instead of starting by recreating existing applications like MT or Moin Moin in Rails, then, perhaps the Rails community is in an excellent position to rethink the whole universe of content publishing on the web?

Wiki/CMS Software

Posted in Uncategorized on November 2nd, 2004 by darcusb – Comments Off

I’ve previously mentioned my interest in what I guess amounts to a comprehensive CMS and bibliographic database solution: what I called a BibBliki.

It seems I’m not the only one frustrated with existing CMS solutions. Leo Simmons discusses an interesting idea of using Ruby for an SVN-based CMS. Meanwhile, a web designer complains, rightly in my view, that existing wiki software sucks from the standpoint of a web designer who believes in standards-based semantic markup, CSS, etc. As they put it: “Even the ones that avoid nested tables and font tags (which are only a couple) spit out bad markup. Lots of classes, not enough semantics, and horribly written stylesheets.”

Here’s what I really want:

  1. clean XHTML and CSS (clean enough to easily process with XSLT if need be)
  2. flexible wiki markup support (I’d like to add support for citation coding)
  3. robust storage (I don’t want to worry about lost data)
  4. easy export of content in various formats, including XML (say DocBook)
  5. ability to integrate with a bibliographic database

I’ve not found a single piece of software that satisfies even my first four desires (the last is obviously rather non-standard). The Ruby-based Instiki offers many of them (in particular excellent export support), but I’ve lost some data to it. Also, the Textile support built into Instiki (using the RedCloth parser) is somewhat thin in its support for semantic HTML output (proper class attributes everywhere, for example).

Still, I get the feeling Ruby may be a good language to develop what I’m after, with its built-in RSS and XML support, and new web frameworks like Rails. Another possibility is the Drupal-based LibDB. Alas, I can’t work out how to get the Drupal Textile support properly working.