On the Inclusion of BibTeX in HTML5
As part of the HTML5 effort, editor Ian Hickson has proposed a new way to encode structured data in HTML. Ian has since included within the proposal encodings of various widely used standards to describe events, contacts and citations. These vocabularies have normative status within the proposed spec, and have a privileged place within the DOM.
On the last use case, he has chosen BibTeX, on the basis that it is widely used and simple to author and process. Ian and I have chatted about this via email. To summarize my thoughts, then, I would like to argue against the inclusion of BibTeX based on the following points:
- BibTeX is designed for the sciences, that typically only cite secondary academic literature. It is thus inadequate for, nor widely used, in many fields outside of the sciences: the humanities and law being quite obvious examples. For this reason, BibTeX cannot by default adequately represent even the use cases Ian has identified. For example, there are many citations on Wikipedia that can only be represented using effectively useless types such as “misc” and which require new properties to be invented.
- Related, BibTeX cannot represent much of the data in widely used bibliographic applications such as Endnote, RefWorks and Zotero except in very general ways.
- The BibTeX extensibility model puts a rather large burden on inventing new properties to accommodate data not in the core model. For example, the core model has no way to represent a DOI identifier (this is no surprise, as BibTeX was created before DOIs existed). As a consequence, people have gradually added this to their BibTeX records and styles in a more ad hoc way. This ad hoc approach to extensibility has one of two consequences: either the vocabulary terms are understood as completely uncontrolled strings, or one needs to standardize them. If we assume the first case, we introduce potential interoperability problems. If we assume the second, we have an organizational and process problem: that the WHATWG and/or the W3C—neither of which have expertise in this domain—become the gate-keepers for such extensions. In either case, we have a rather brittle and anachronistic approach to extension.
- The BibTeX model conflicts with Dublin Core and with vCard, both of which are quite sensibly used elsewhere in the microdata spec to encode information related to the document proper. There seems little justification in having two different ways to represent a document depending on whether on it is THIS document or THAT document.
- Aspects of BibTeX’s core model are ambiguous/confusing. For example, what number does “number” refer to? Is it a document number, or an issue number? [note: it's actually both, depending on context; in a report it's the former, while in an article it's the latter]
My suggestion instead?
- reuse Dublin Core and vCard for the generic data: titles, creators/contributors, publisher, dates, part/version relations, etc., and only add those properties (volume, issue, pages, editors, etc.) that they omit
- typing should NOT be handled a bibtex-type property, but the same way everything else is typed in the microdata proposal: a global identifier
- make it possible for people to interweave other, richer, vocabularies such as bibo within such item descriptions. In other words, extension properties should be URIs.
- define the mapping to RDF of such an “item” description; can we say, for example, that it constitutes a dct:references link from the document to the described source?