darcusblog - geek tools and the scholar

On the Inclusion of BibTeX in HTML5

As part of the HTML5 effort, editor Ian Hickson has proposed a new way to encode structured data in HTML. Ian has since added to the proposal encodings of several widely used standards for describing events, contacts and citations. These vocabularies have normative status within the proposed spec, and have a privileged place within the DOM.

On the last use case, he has chosen BibTeX, on the basis that it is widely used and simple to author and process. Ian and I have chatted about this via email. To summarize my thoughts, then, I would like to argue against the inclusion of BibTeX based on the following points:

  1. BibTeX was designed for the sciences, which typically cite only secondary academic literature. It is thus inadequate for, and not widely used in, many fields outside the sciences; the humanities and law are obvious examples. For this reason, BibTeX cannot by default adequately represent even the use cases Ian has identified. For example, many citations on Wikipedia can only be represented using effectively useless types such as “misc”, and require new properties to be invented.
  2. Relatedly, BibTeX cannot represent much of the data in widely used bibliographic applications such as EndNote, RefWorks and Zotero except in very general ways.
  3. The BibTeX extensibility model places a rather large burden on users, who must invent new properties to accommodate data not in the core model. For example, the core model has no way to represent a DOI (this is no surprise, as BibTeX was created before DOIs existed). As a consequence, people have gradually added DOIs to their BibTeX records and styles in an ad hoc way. This ad hoc approach to extensibility has one of two consequences: either the vocabulary terms are treated as completely uncontrolled strings, or they need to be standardized. In the first case, we introduce potential interoperability problems. In the second, we have an organizational and process problem: the WHATWG and/or the W3C (neither of which has expertise in this domain) become the gate-keepers for such extensions. In either case, we have a rather brittle and anachronistic approach to extension.
  4. The BibTeX model conflicts with Dublin Core and with vCard, both of which are quite sensibly used elsewhere in the microdata spec to encode information related to the document proper. There seems little justification for having two different ways to represent a document depending on whether it is THIS document or THAT document.
  5. Aspects of BibTeX’s core model are ambiguous or confusing. For example, what number does “number” refer to? Is it a document number, or an issue number? [note: it's actually both, depending on context; in a report it's the former, while in an article it's the latter]
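
To make the last point concrete, here is a minimal sketch in Python of what a consumer has to do with a flat, BibTeX-style record. The records, the `interpret_number` helper, and the labels it returns are all hypothetical and exist only for illustration; the key `bibtex-type` is borrowed from the proposal as discussed in this post. The only behavior taken from BibTeX itself is the one described in the note above: "number" is an issue number in an article and a document number in a report.

```python
# Hypothetical flat records in the style of BibTeX entries (made-up data).
article = {
    "bibtex-type": "article",
    "title": "An Example Article",
    "volume": "12",
    "number": "3",
}
report = {
    "bibtex-type": "techreport",
    "title": "An Example Report",
    "number": "TR-42",
}


def interpret_number(entry):
    """Return (meaning, value) for the 'number' field.

    The meaning cannot be read off the key itself; the consumer has to
    branch on the entry type to find out what kind of number it holds.
    """
    kind = entry["bibtex-type"]
    if kind == "article":
        return ("issue", entry["number"])
    if kind == "techreport":
        return ("document-number", entry["number"])
    # Any other type: this sketch makes no claim about what the number means.
    return ("unknown", entry["number"])


print(interpret_number(article))
print(interpret_number(report))
```

Every tool that reads such records needs its own copy of this type-dependent logic, which is the kind of hidden convention a shared vocabulary is meant to avoid.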

My suggestion instead?

  1. reuse Dublin Core and vCard for the generic data: titles, creators/contributors, publisher, dates, part/version relations, etc., and only add those properties (volume, issue, pages, editors, etc.) that they omit
  2. typing should NOT be handled by a bibtex-type property, but the same way everything else is typed in the microdata proposal: a global identifier
  3. make it possible for people to interweave other, richer, vocabularies such as bibo within such item descriptions. In other words, extension properties should be URIs.
  4. define the mapping to RDF of such an “item” description; can we say, for example, that it constitutes a dct:references link from the document to the described source?
The result would be something more consistent, general and extensible, while also still being easy to author and process. From a DOM perspective, we’re just talking about things like ref1.type returning a URI rather than doing ref1.bibtex-type that returns a string, and accessing a periodical title like ref1.isPartOf.title rather than ref1.journal (which of course doesn’t work for newspapers, or magazines, or court reporters, or weblogs, all of which have the exact same characteristics: they’re publications of sorts).
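
The difference in access patterns can be sketched with plain Python dictionaries standing in for the DOM items. This is an illustration under assumptions, not the microdata API: the dictionaries, the sample titles, and the `container_title` helper are hypothetical, and the bibo class URIs are used only as examples of global identifiers for types.

```python
# Flat, BibTeX-like description: the type is a bare string and the
# container is named by a type-specific key ("journal").
flat_ref = {
    "bibtex-type": "article",
    "title": "An Example Article",
    "journal": "Example Journal",
}

# Nested description: the type is a URI and the container is reached
# through a generic "isPartOf" relation, whatever kind of publication it is.
journal_ref = {
    "type": "http://purl.org/ontology/bibo/Article",
    "title": "An Example Article",
    "isPartOf": {
        "type": "http://purl.org/ontology/bibo/Journal",
        "title": "Example Journal",
    },
}
newspaper_ref = {
    "type": "http://purl.org/ontology/bibo/Article",
    "title": "An Example News Story",
    "isPartOf": {
        "type": "http://purl.org/ontology/bibo/Newspaper",
        "title": "Example Daily",
    },
}


def container_title(ref):
    """Return the title of the containing publication, or None if absent."""
    part_of = ref.get("isPartOf")
    return part_of["title"] if part_of else None


print(container_title(journal_ref))
print(container_title(newspaper_ref))
# The flat record offers no generic route to its container; a consumer
# has to know about "journal" (and every other type-specific key).
print(container_title(flat_ref))
```

The same helper handles the journal and the newspaper without change, while the flat record needs a new special case for each kind of publication.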


  1. Rick says:

    I wonder if some sort of guidance for encoding COinS in here would be helpful. Or even using the XML encoding rather than the KEV encoding? Just my .02 yoctocents, Rick

  2. darcusb says:

    Thanks for the comment, Rick.

    No offense, but I think COinS is a dead-end technology.

  3. Simon Spiegel says:

    Time and again, I’m amazed how often the same mistakes are repeated in the area of bibliographic software. It seems like a natural law that every new software solution dealing with bibliographies has to start with an extremely limited model which basically only covers the English-speaking sciences. It took nearly two decades until biblatex got rid of most of the basic shortcomings of BibTeX, but somehow other projects don’t seem to learn from this. I just say ‘bookauthor’. It’s really a basic need in the humanities, but neither Zotero nor Mendeley supports it at the moment, and now the newest and hottest in web technologies seems to be making the same mistake again by adopting a 20-year-old model, although its deficiencies are well known and have been discussed endlessly.

  4. darcusb says:

    Exactly. I’d be less bothered if they had looked at biblatex and incorporated some of its extended keys (though I still don’t favor this approach), but they just don’t seem to think this is important. See the whatwg list discussion.

  5. Jakob says:

    BibTeX has many drawbacks, but it is widely supported and there is much data in BibTeX. As long as it is not the only way to encode bibliographic information in HTML, you should not argue against it, because it helps people publish bibliographic data in at least some structured way. If the only easy-to-use alternative is plain text or your own citation style without any defined semantics, you should always ask for BibTeX. Yes, it is broken and limited and we should find better formats, but we cannot wait more years until library scientists and metadata experts have finally agreed upon a standard and implementations have been established.

  6. darcusb says:

    I proposed a much better solution. The whole point of microdata is that it’s supposed to be extensible.

  7. [...] would really love if the extension mechanism was rich enough to allow integration of citations (say a Zotero extension; though perhaps something more distributed), and flexible enough to do it right (which by definition means not based on bibtex) [...]

  8. As far as widely adopted existing schemas with lots of data already out there — what’s wrong with RIS?

  9. darcusb says:

    RIS is better, but has problems of its own. For one thing, a property like “T1” would be rather opaque. For another, it still has the same problems as the other flat key-value formats.