darcusblog » microdata - geek tools and the scholar

Posts Tagged ‘microdata’

HTML5 Process

Posted in Technology on June 9th, 2009 by darcusb – 2 Comments

Ben Adida on the microdata in HTML5 proposal:

So, I cannot live with something that throws away existing important implementations of the *exact* same use cases for no valid technical reason.

Ian’s response:

Indeed; I examined all the existing solutions that I could find closely as the first step (well, the second step, after collecting use cases). I didn’t go through all of them one by one in the e-mail, but I did explicitly examine Microformats and RDFa: http://lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2009-May/019681.html

If you go to that URI, here’s his explanation for why not RDFa:

- it uses prefixes, which most authors simply do not understand, and which many implementors end up getting wrong (e.g. SearchMonkey hard-coded certain prefixes in its first implementation, Google’s handling of RDF blocks for license declarations is all done with regular expressions instead of actually parsing the namespaces, etc). Even if implemented right, namespaces still lead to flaky copy-and-paste behaviour.

- it sometimes uses rel="" and sometimes uses property="" and it’s hard to know when to use one or the other.

- it introduces much more power than is necessary to solve this problem.

I think the first point is a reasonable one in the sense that prefixes have costs as well as benefits. But the same is true of unprefixed names. A balanced discussion of these tradeoffs seems warranted. Is it really (really!) worth it to invent an entirely new spec because of one fairly trivial issue? Is it really (really!) worth it to force tools developers and publishers to have to do double work?
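To make the tradeoff concrete, here is a minimal Python sketch of prefix expansion. It is only an illustration of the general idea, not how any RDFa parser actually works: the `expand` function and the two sample records are hypothetical, and the only real identifier is the Dublin Core terms namespace.

```python
# The Dublin Core terms namespace is real; everything else here is illustrative.
PREFIXES = {"dct": "http://purl.org/dc/terms/"}


def expand(name: str, prefixes: dict) -> str:
    """Expand a prefixed name such as 'dct:title' into a full URI."""
    prefix, sep, local = name.partition(":")
    if not sep:
        raise ValueError(f"unprefixed name {name!r} has no global meaning")
    if prefix not in prefixes:
        raise KeyError(f"prefix {prefix!r} is not declared in this context")
    return prefixes[prefix] + local


# Cost of prefixes: the name only works where the prefix is declared,
# so markup pasted into a page without the declaration breaks.
full_uri = expand("dct:title", PREFIXES)  # "http://purl.org/dc/terms/title"
try:
    expand("dct:title", {})
    pasted_ok = True
except KeyError:
    pasted_ok = False

# Cost of unprefixed names: two vocabularies can use the same short name
# for different things, and nothing in the data tells them apart.
document = {"title": "On Prefixes"}  # hypothetical: title of a work
person = {"title": "Dr."}            # hypothetical: honorific of a person
same_key = set(document) & set(person)  # {"title"}
```

Neither failure mode is exotic, which is why weighing one against the other seems more useful than treating prefixes as the only design with a cost.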

The other two points range from trivial to entirely ridiculous. Who really decides, for example, how much power is needed for extensible metadata in HTML? Surely the answer will depend a lot on particular use cases? For example, on the general citation case, Wikipedia may have less demanding needs than an academic or legal journal. Shouldn’t the understanding that one size does not fit all be at the center of any extensible metadata support in HTML5?

He then goes on to try to “fix” these problems by removing prefixing and the rel/property ambiguity. Recognizing that removing the prefixing introduces other problems (for readability, etc.), he concludes: “This, though, is quite ugly.”

OK, so aesthetics are now a requirement shaping the design; I have no clue where that came from. To solve this problem he introduces an equally ugly, and completely arbitrary, new way to indicate a global name: the reverse DNS. Where’s the analysis that justifies these conclusions? Do we just accept these claims about aesthetics and usability without any kind of evidence?

Is there no sanity at all in the HTML5 process?

On the Inclusion of BibTeX in HTML5

Posted in Technology on May 20th, 2009 by darcusb – 9 Comments

As part of the HTML5 effort, editor Ian Hickson has proposed a new way to encode structured data in HTML. Ian has since included within the proposal encodings of various widely used standards to describe events, contacts and citations. These vocabularies have normative status within the proposed spec, and have a privileged place within the DOM.

On the last use case, he has chosen BibTeX, on the basis that it is widely used and simple to author and process. Ian and I have chatted about this via email. To summarize my thoughts, then, I would like to argue against the inclusion of BibTeX based on the following points:

  1. BibTeX is designed for the sciences, which typically cite only secondary academic literature. It is thus neither adequate for, nor widely used in, many fields outside of the sciences: the humanities and law being quite obvious examples. For this reason, BibTeX cannot by default adequately represent even the use cases Ian has identified. For example, there are many citations on Wikipedia that can only be represented using effectively useless types such as “misc” and which require new properties to be invented.
  2. Relatedly, BibTeX cannot represent much of the data in widely used bibliographic applications such as EndNote, RefWorks and Zotero except in very general ways.
  3. The BibTeX extensibility model relies heavily on inventing new properties to accommodate data not in the core model. For example, the core model has no way to represent a DOI (this is no surprise, as BibTeX was created before DOIs existed). As a consequence, people have gradually added DOIs to their BibTeX records and styles in an ad hoc way. This ad hoc approach to extensibility has one of two consequences: either the vocabulary terms are understood as completely uncontrolled strings, or one needs to standardize them. If we assume the first case, we introduce potential interoperability problems. If we assume the second, we have an organizational and process problem: the WHATWG and/or the W3C, neither of which has expertise in this domain, become the gatekeepers for such extensions. In either case, we have a rather brittle and anachronistic approach to extension.
  4. The BibTeX model conflicts with Dublin Core and with vCard, both of which are quite sensibly used elsewhere in the microdata spec to encode information related to the document proper. There seems little justification in having two different ways to represent a document depending on whether it is THIS document or THAT document.
  5. Aspects of BibTeX’s core model are ambiguous/confusing. For example, what number does “number” refer to? Is it a document number, or an issue number? [note: it's actually both, depending on context; in a report it's the former, while in an article it's the latter]
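The ambiguity in the last point can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records are hypothetical plain dicts, not the output of any real BibTeX parser, and `interpret_number` is a made-up helper. The point is that a consumer cannot read “number” without first consulting the entry type.

```python
# Hypothetical records; the field names follow BibTeX, the values are invented.
article = {"type": "article", "journal": "Some Journal", "volume": "12", "number": "3"}
report = {"type": "techreport", "institution": "Some Institute", "number": "TR-42"}


def interpret_number(entry: dict) -> str:
    """Say what the 'number' field means, which depends on the entry type."""
    meanings = {
        "article": "issue number",        # the issue within a volume
        "techreport": "document number",  # the identifier of the report itself
    }
    # For any other type the consumer simply has to guess.
    return meanings.get(entry["type"], "unknown")


article_meaning = interpret_number(article)
report_meaning = interpret_number(report)
misc_meaning = interpret_number({"type": "misc", "number": "7"})
```

Every tool that reads such records has to carry a table like `meanings` around, and the catch-all “misc” type leaves the field with no defined reading at all.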

My suggestion instead?

  1. reuse Dublin Core and vCard for the generic data: titles, creators/contributors, publisher, dates, part/version relations, etc., and only add those properties (volume, issue, pages, editors, etc.) that they omit
  2. typing should NOT be handled by a bibtex-type property, but the same way everything else is typed in the microdata proposal: with a global identifier
  3. make it possible for people to interweave other, richer, vocabularies such as bibo within such item descriptions. In other words, extension properties should be URIs.
  4. define the mapping to RDF of such an “item” description; can we say, for example, that it constitutes a dct:references link from the document to the described source?

The result would be something more consistent, general and extensible, while also still being easy to author and process. From a DOM perspective, we’re just talking about things like ref1.type returning a URI rather than doing ref1.bibtex-type that returns a string, and accessing a periodical title like ref1.isPartOf.title rather than ref1.journal (which of course doesn’t work for newspapers, or magazines, or court reporters, or weblogs, all of which have the exact same characteristics: they’re publications of sorts).
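
As a rough sketch of that access pattern, here it is modeled in Python rather than in the DOM. Everything in it is an assumption made for illustration: the `Item` class, the `container_title` helper, and the example.org type URIs are placeholders, not part of the microdata proposal or of any real vocabulary.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Item:
    type: str                             # a global identifier (URI), not a bibtex-type string
    title: str
    is_part_of: Optional["Item"] = None   # stands in for ref1.isPartOf


def container_title(ref: Item) -> Optional[str]:
    """Title of the publication a reference appears in, whatever kind it is."""
    return ref.is_part_of.title if ref.is_part_of else None


# Placeholder type URIs under example.org; a real page would use a shared vocabulary.
journal = Item("http://example.org/types/Journal", "Some Journal")
newspaper = Item("http://example.org/types/Newspaper", "Some Daily")

ref1 = Item("http://example.org/types/Article", "A Journal Article", journal)
ref2 = Item("http://example.org/types/Article", "A News Story", newspaper)

titles = [container_title(ref1), container_title(ref2)]
```

The same accessor serves the journal article and the news story, whereas a journal-specific property would need a sibling property for every other kind of publication.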