darcusblog » 2004 » August - geek tools and the scholar

Archive for August, 2004

Rendering Bibliographic Information in a Browser

Posted in Uncategorized on August 31st, 2004 by darcusb – Comments Off

Rendering bibliographic information in a browser is a rather tricky problem. Ideally, you want a visually simple overview of a large number of records, and the ability to easily find the ones you want. At the same time you want quick access to extended information like annotations and abstracts. I had always thought it necessary to see these as two different interfaces for the most part.

Well, a BibDesk user has come up with a clever way to handle both needs in the same interface; it’s even standards-compliant XHTML and CSS!

It seems there’s a lot of energy over at the BibDesk project these days. If only they would bite the bullet and ditch BibTeX for basic record storage.

New Look

Posted in General on August 23rd, 2004 by darcusb – Comments Off

Moved the blog to WordPress, hopefully with link redirects working correctly. OK, there are some CSS kinks to work out, and I’m short on time; bear with me.

Why I like RELAX NG

Posted in Uncategorized on August 21st, 2004 by darcusb – Comments Off

There are a lot of reasons why I really enjoy working with RELAX NG. Here’s one of them: how to customize DocBook to rip out all computer-related structures, but add a few (like better bibliographic metadata) that are very useful for social science and humanities scholars:

# customization for scholarly writing
namespace db = "http://docbook.org/docbook-ng"
namespace mods = "http://www.loc.gov/mods/v3"
default namespace = "http://docbook.org/docbook-ng"

include "mods-3-0.rnc" { extensionSchema = db._any }

include "docbook-ng.rnc" {

db.any = element * - (db:* | mods:*) { db.any.attribute, (text, db._any) }

# remove all the main software/technology related definitions
db.domain.inlines = notAllowed
db.product.inlines = notAllowed
db.technical.blocks = notAllowed
db.verbatim.blocks = notAllowed
db.synopsis.blocks = notAllowed

# add a new element to quote
db.quote = element quote {
  db.quote.attlist,
  (db.all.inlines | hb.nonquote)*
}

# add mods to bibliography element
db.bibliography = element bibliography {
  db.bibliography.attlist,
  db.bibliography.info,
  db.all.blocks*,
  (db.bibliodiv+ | (ModsSchema | db.biblioentry | db.bibliomixed)+)
}
}

# nonquote pattern allows for semantic rendering of split quotes like:
# "The world," the guy said, "is flat."
hb.nonquote = element nq { empty }

Thanks to Norm Walsh for solving a stupid problem (I had the wrong namespace declaration), and for contributing a clever addition (allowing any DocBook-namespaced content in the MODS extension element).

XSLT 2.0, DocBook, and MODS

Posted in Uncategorized on August 18th, 2004 by darcusb – Comments Off

Being in a frantic rush to finish and submit some documents, needing a way to format them with XSLT, and wanting to push things forward on the free software front, I decided to tackle some tricky problems using XSLT 2.0.

I spent the better part of a week solving problems specific to the author-year citation style typical in the social sciences.

How to handle:

Doe, John (1999a) ...
———. (1999b) ...
———. (1999c) ...
Doe, John and Jane Smith (2000a) ...
Doe, John and Jane Smith (2000b) ...

… where the suffix is generated as part of a temporary tree operation, so as to ensure it is the same in the bibliography and in the citation.

Note: no, it is not a mistake that multiple authors do not have the same handling as single authors!

Here’s an XSLT 2.0 stylesheet to render DocBook NG with MODS embedded in the db:bibliography element, ultimately (when finished!) according to the demands of the Chicago Manual of Style, and my discipline’s flagship journal.

The seemingly simple example above is a huge PITA, particularly with multiple namespaces and modes!
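To make the rule in that listing concrete, here is a minimal sketch in Python rather than XSLT. It is not the stylesheet's actual logic. It assumes the entries arrive already sorted and with their year suffixes assigned, and it assumes (as the listing suggests) that only a repeated single author is replaced by three em-dashes and a period. The function name `render_bibliography` and the tuple layout are hypothetical.

```python
DASH = "\u2014" * 3 + "."  # three em-dashes and a period


def render_bibliography(entries):
    """Render (authors, year) pairs, already sorted and suffixed.

    `authors` is a tuple of display names; `year` already carries its suffix.
    Assumption: only a repeated single author collapses to the dash form.
    """
    lines = []
    previous_authors = None
    for authors, year in entries:
        if authors == previous_authors and len(authors) == 1:
            label = DASH
        else:
            label = " and ".join(authors)
        lines.append(f"{label} ({year}) ...")
        previous_authors = authors
    return lines


entries = [
    (("Doe, John",), "1999a"),
    (("Doe, John",), "1999b"),
    (("Doe, John", "Jane Smith"), "2000a"),
    (("Doe, John", "Jane Smith"), "2000b"),
]
for line in render_bibliography(entries):
    print(line)
```

Even this toy version has to carry state from one entry to the next, which is where most of the pain comes from once namespaces and modes enter the picture.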

I’m more convinced than ever that the XSLT recursive processing model is ideally suited to doing bibliographic formatting; at least along with the grouping functions in v 2.0.

Alas, as I say in the archive README, some of this stuff needs to be imported back into BiblioX, which is too complex for me to modify. There is a really desperate need to solve the problems of bibliographic formatting in XML, which is ideally suited to this sort of stuff, yet sorely lacking on the implementation end. If you have an interest in this and XSLT skills, please consider contributing.

XML DBs as Research Tools

Posted in General on August 14th, 2004 by darcusb – Comments Off

Borrowing an idea from Jon Udell and with a lot of help from eXist XML DB author Wolfgang Meier, I worked out a quick-and-dirty way to exploit XML and the XQuery-based eXist to do content analysis.

The story: I had to analyze a bunch of news documents; roughly 150 to be more precise. So, I downloaded them, used a simple shell script to run Tidy on all of them and convert them to clean XHTML, then ran an XSLT on those to clean them up further. From that base, I then went through and highlighted important content with keywords by using the class attributes on the span, p and q tags. I then used an XQuery script (pretty much written by Wolfgang) to pull out all paragraphs that contain the highlighted chunks of content, and organize them by document title. Finally, I used CSS to render the content (though this is still very much a moving target; how to represent more than one keyword, for example?).
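The extraction step can be sketched roughly as follows. This is Python with the standard library, not the XQuery script that actually ran in eXist, and it is only an approximation of the idea: find every paragraph where a keyword appears in the class attribute of the p itself or of a span or q inside it, and file the paragraph text under the document title. The function name `highlighted_paragraphs` and the keyword "protest" are made up for illustration.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

XHTML = "{http://www.w3.org/1999/xhtml}"


def highlighted_paragraphs(documents, keyword):
    """Map each document title to the paragraphs tagged with `keyword`.

    `documents` is an iterable of XHTML strings (already cleaned by Tidy).
    """
    results = defaultdict(list)
    for source in documents:
        root = ET.fromstring(source)
        title = root.findtext(f".//{XHTML}title", default="(untitled)")
        for para in root.iter(f"{XHTML}p"):
            # The keyword may sit on the paragraph or on a span/q inside it.
            tagged = [para] + [
                el for el in para.iter()
                if el.tag in (f"{XHTML}span", f"{XHTML}q")
            ]
            if any(keyword in el.get("class", "").split() for el in tagged):
                text = " ".join("".join(para.itertext()).split())
                results[title].append(text)
    return dict(results)


doc = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    "<head><title>Story A</title></head><body>"
    "<p>Nothing here.</p>"
    '<p>The mayor said <q class="protest">we will march</q> today.</p>'
    "</body></html>"
)
print(highlighted_paragraphs([doc], "protest"))
```

Because the class attribute is split on whitespace, an element tagged with several keywords would be matched by each of them, which is one possible answer to the more-than-one-keyword question.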

I’ve never used a content analysis application, but this suggests to me an XML DB like eXist not only can serve as an excellent bibliographic application, but can also further tie together data (research content) and metadata (bib records). As an example, I added meta tags to XHTML headers, and wrote a stylesheet to generate MODS records from them automatically. While generating the metadata in the XHTML is time-consuming, it is less so than manually creating each MODS file, and also serves double-duty by enhancing future access to the records.
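As a rough illustration of the meta-tags-to-MODS step (in Python, not the stylesheet I wrote), the sketch below reads meta elements from an XHTML head and builds a skeletal MODS record. The meta names `DC.title`, `DC.creator` and `DC.date` are assumptions chosen for the example, not necessarily what the real files use, and the mapping covers only three MODS elements.

```python
import xml.etree.ElementTree as ET

XHTML = "{http://www.w3.org/1999/xhtml}"
MODS_NS = "http://www.loc.gov/mods/v3"


def q(local_name):
    """Qualify a local name with the MODS namespace."""
    return f"{{{MODS_NS}}}{local_name}"


def mods_from_xhtml(source):
    """Build a minimal MODS record from (assumed) Dublin Core style meta tags."""
    root = ET.fromstring(source)
    mods = ET.Element(q("mods"))
    for meta in root.iter(f"{XHTML}meta"):
        name, content = meta.get("name"), meta.get("content", "")
        if name == "DC.title":
            ET.SubElement(ET.SubElement(mods, q("titleInfo")), q("title")).text = content
        elif name == "DC.creator":
            ET.SubElement(ET.SubElement(mods, q("name")), q("namePart")).text = content
        elif name == "DC.date":
            ET.SubElement(ET.SubElement(mods, q("originInfo")), q("dateIssued")).text = content
    return mods


doc = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><head><title>t</title>'
    '<meta name="DC.title" content="City Council Votes"/>'
    '<meta name="DC.creator" content="Doe, Jane"/>'
    '<meta name="DC.date" content="2004-08-01"/>'
    "</head><body/></html>"
)
print(ET.tostring(mods_from_xhtml(doc), encoding="unicode"))
```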

update: what I really want is to be able to have a button on my web browser that allows me to highlight an excerpt in the XHTML file, click it, and have the DocBook citation code pasted to the clipboard. Maybe with Javascript?

update 2: there seems to be a bug in MT (or maybe a virus) that replaces a post’s content with deleted comment spam content. I had to go to the Google cache to recreate this entry. In other news, I got some help from Alf Eaton on the bookmarklet issue above.

Processing Citations

Posted in General on August 13th, 2004 by darcusb – Comments Off

In an effort to format my documents, and ultimately test my experiments in modifying the citation style language in BiblioX, I’ve decided to try to write a stylesheet drawing on new features in XSLT 2.0. A lot of help from the gurus shows both that citation and bibliographic formatting is a complex problem, and that XSLT 2.0 has features that make it easier.

Some quick notes on the general classes of citation styles, and the basic rules. Ultimately any program needs to be designed to easily switch between each of them.

  1. citekey: The simplest of the styles. Citation markers are natural-language keys like [doe99a], and the bibliography list is ordered by the order in which the citations appear. Software people often use this style, so it makes sense it’s easy to process!
  2. author-year: Dominant in the social sciences, this one is much more difficult to process. The citation marker is an author-year combination. Where more than one reference in a document shares the same author-year combination, the year gets an alphabetic suffix appended (e.g. 1999a, 1999b, etc.). To add an additional layer of complexity, if there is more than one author-year combination for the same author within a citation, the author should be dropped from all but the first (e.g. Doe, 1999a, 1999b). In the bibliography, by contrast, all entries within each author group (Doe, and Doe and Jones, are grouped separately) ought to be sorted by date (on which the year suffix may be based), and all entries after the first generally have the author(s) replaced with three em-dashes and a period. This is the sort of complex grouping problem that XSLT 2.0 is well-suited to. Finally, author-year citations, like note-based ones, often have captions for page numbers and so forth, as well as a variety of different forms. For example, if the author is named in the text preceding the citation, then the name is dropped from the citation proper. The traditional approach has been for authors to explicitly code how the citations ought to be rendered (full, year-only, etc.). It’s an open question at this point whether this can and should be automated. I tend to think it should, but this adds another layer of complexity best worried about later.
  3. note: Footnote and endnote styles are common in the humanities. Often there is no bibliography list here, and so the citation contains the full bibliographic information. Except, in many (most?) cases, note-based citations distinguish between first and subsequent rendering. The first occurrence in the text gets the full reference; all later ones get an abbreviated one. In addition, these styles also often require references to previous entries: ibid., op. cit., etc. I really despise this stuff myself!
  4. numbered: As I understand it, common in the hard sciences. Here the references are just a numbered list and the citations are like (1). The only wrinkle here, presumably, is collapsing multiple-reference citations, like (1, 3, 4-5). This ought to be the same processing problem as with the author-year citations.
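The collapsing wrinkle in the numbered style is easy to pin down in code. Here is a minimal sketch in Python, assuming that any run of two or more consecutive numbers is collapsed into a range; some styles may only collapse longer runs. The function name `collapse_numbers` is made up for the example.

```python
def collapse_numbers(numbers):
    """Collapse citation numbers into a single marker, e.g. (1-3, 5, 8-9).

    Assumption: any run of two or more consecutive numbers becomes a range.
    """
    runs = []
    for n in sorted(set(numbers)):
        if runs and n == runs[-1][1] + 1:
            runs[-1][1] = n  # extend the current run
        else:
            runs.append([n, n])  # start a new run
    parts = [str(a) if a == b else f"{a}-{b}" for a, b in runs]
    return "(" + ", ".join(parts) + ")"


print(collapse_numbers([5, 1, 2, 3, 9, 8]))
```

The author-year case has the same shape: walk a sorted sequence, and decide for each item whether it extends the previous group or starts a new one.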

As I said above, one functional requirement for any new citation coding and processing tools should be that one can fully switch between these styles without modifying the document source. A change from author-year to footnote ought to involve choosing a different style file from a command-line processor or GUI menu. I am unaware of any tools that can do this, but that doesn’t mean it can’t be done. Indeed, BiblioX has pretty well shown it’s possible.
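A toy version of that requirement, in Python: the document stores only a reference, and the style is chosen at processing time. This is a sketch of the idea, not how BiblioX does it. The `STYLES` table, the `render_citation` function and the dictionary fields are all hypothetical, and the note style is left out because it needs first-versus-subsequent state.

```python
# Each style is just a function from reference data to a citation marker.
STYLES = {
    "citekey": lambda ref: f"[{ref['key']}]",
    "author-year": lambda ref: f"({ref['author']}, {ref['year']})",
    "numbered": lambda ref: f"({ref['number']})",
}


def render_citation(ref, style):
    """Render one citation marker; the document source never changes."""
    return STYLES[style](ref)


ref = {"key": "doe99a", "author": "Doe", "year": "1999a", "number": 1}
for style in STYLES:
    print(style, render_citation(ref, style))
```

Swapping the style name plays the role of choosing a different style file; the reference data and the citing document stay untouched.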

The features of XSLT 2.0 that make this sort of processing easier? Temporary trees, and multi-level grouping. In essence, you create a virtual bibliography enhanced with the processed data you need to insert in the final document (for example, a year appended with its suffix).
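The two-pass idea can be sketched outside XSLT as well. The Python below is an analogy, not XSLT 2.0 code: the first pass groups records by author and year and builds an enhanced copy of the bibliography (the stand-in for the temporary tree), and the second pass renders citations by looking up that copy. Ordering the suffixes by title within a year is my assumption for the example, and the names `enhance` and `cite` are hypothetical.

```python
from itertools import groupby
from string import ascii_lowercase


def enhance(records):
    """First pass: return key -> record with a `display_year` added.

    A suffix is added only where an author-year combination repeats.
    Assumption: suffix order follows title order (fewer than 27 per group).
    """
    ordered = sorted(records, key=lambda r: (r["authors"], r["year"], r["title"]))
    enhanced = {}
    for (authors, year), group in groupby(ordered, key=lambda r: (r["authors"], r["year"])):
        group = list(group)
        for i, rec in enumerate(group):
            suffix = ascii_lowercase[i] if len(group) > 1 else ""
            enhanced[rec["key"]] = {**rec, "display_year": f"{year}{suffix}"}
    return enhanced


def cite(keys, enhanced):
    """Second pass: render one citation, dropping a repeated author."""
    parts, previous = [], None
    for key in keys:
        rec = enhanced[key]
        if rec["authors"] == previous:
            parts.append(rec["display_year"])
        else:
            parts.append(f"{rec['authors']}, {rec['display_year']}")
        previous = rec["authors"]
    return "(" + ", ".join(parts) + ")"


records = [
    {"key": "doe99x", "authors": "Doe", "year": 1999, "title": "Alpha"},
    {"key": "doe99y", "authors": "Doe", "year": 1999, "title": "Beta"},
    {"key": "doe01", "authors": "Doe", "year": 2001, "title": "Gamma"},
]
print(cite(["doe99x", "doe99y"], enhance(records)))
```

Because both the bibliography and the citations read from the same enhanced structure, the suffix cannot drift between them.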

XOBIS and UnaLog

Posted in General on August 9th, 2004 by darcusb – Comments Off

There are times when open source development is just … cool. Here’s one of them:

So Dan Chudnov’s unalog project is now serving up XOBIS records. XOBIS is a schema out of Stanford. Written in RELAX NG, XOBIS is one of the more ambitious attempts to wrestle with the problems of bibliographic metadata. For those in the free software world that think MODS is, um, abstract, try this:

<Record>
  <ControlData>
    <ID>
      <OrganizationRef>unalog</OrganizationRef>
      <Value>person/dchud/450</Value>
    </ID>
    <Actions>
      <Action>
        <Type set="Action Type">Created</Type>
        <Time>
          <Year>2004</Year>
          <Month>08</Month>
          <Day>09</Day>
          <Hour>19</Hour>
          <Minute>04</Minute>
          <Second>25</Second>
          <BeingRef id="dchud">Dan Chudnov</BeingRef>
        </Time>
      </Action>
    </Actions>
    <Types><Type set="Record Type">Original</Type></Types>
  </ControlData>
  <Work role="instance">
    <Entry><Title>Brief backported patch from qx-1.0 to qx-0.7a3</Title></Entry>
    <Description>
      <Notation class="annotation">
        <Value>Necessary for medusa server to work w/unalog.</Value>
      </Notation>
    </Description>
  </Work>
  <Relationships>
    <Relationship class="geographic" type="associative">
      <Name>Internet link</Name>
      <Place>
        <Entry scheme="RFC 2396">
          <Name>http://irref.mine.nu/user/dchud/qx-0.7a3-http_request.py.patch</Name>
        </Entry>
      </Place>
    </Relationship>
  </Relationships>
</Record>

Unalog is a promising project that, over time, I’d like to see blend more fully with bibliographic databases. Increasingly, online records are full citation resources.

SRW Record Create/Update/Delete

Posted in Uncategorized on August 8th, 2004 by darcusb – Comments Off

A draft proposal for adding support to SRW for record creation, updating and deletion is available here.

New DocBook-NG Release With Biblioref

Posted in Uncategorized on August 4th, 2004 by darcusb – Comments Off

Norm Walsh has issued another release of his DocBook-NG prototype. Previous releases included the new biblioref citation element, but it was incorrectly specified, such that it was not allowed in the citation element. This bug is now fixed.

With this release and the forthcoming interim 4.4 release, DocBook now supports citation coding arguably superior to that in LaTeX/BibTeX.