darcusblog » 2006 » August - geek tools and the scholar

Archive for August, 2006

Time to Boycott Blackboard?

Posted in General on August 31st, 2006 by darcusb – Comments Off

I’ve never much liked Blackboard. Now might well be the time to give it up for good. If ever the absurdity of current patent practice was apparent, this is it.

NeoOffice 2

Posted in General on August 31st, 2006 by darcusb – Comments Off

Just as I was about to give up on OpenOffice on the Mac comes a new beta release of NeoOffice 2.0. I have to say, it’s really quite nice: a huge improvement on the X11 version. It’s quite fast in comparison, has very good font rendering (the X11 version is terrible), etc.

Now, if we could only fix the bibliographic support so it didn’t suck! If that problem were solved, I’d likely use NeoOffice for serious writing.

Microsoft Does RDF

Posted in Uncategorized on August 24th, 2006 by darcusb – Comments Off

From Danny Ayers, news that Microsoft will be supporting embedded RDF metadata in the Windows Vista Photo Gallery. They will do this using Adobe’s XMP.

As the project manager explains:

XMP is an extensible framework for embedding metadata in files that was developed by Adobe, and is the foundation for our “truth is in the file” goal. All metadata written to photos by Windows Vista will be written to XMP (always directly to the file itself, never to a ‘sidecar’ file). When reading metadata from photos on Windows Vista, we will first look for XMP metadata, but if we don’t find any, we’ll also look for legacy EXIF and IPTC metadata as well. If we find legacy metadata, we’ll write future changes back to both XMP and the legacy metadata blocks (to improve compatibility with legacy applications).
Elsewhere, in comments, the same person explains that this is part of a comprehensive metadata approach:

“Truth in the file” is a principle that applies to all document types in Vista, not just photos. For photos, metadata is written back to an XMP block in the file. XMP is an industry standard for imaging metadata that was developed by Adobe.

A few quick comments:

  1. Adobe did not “develop” XMP per se; they simply borrowed pieces of RDF. It’s surprising to me that MS of all companies is accepting this sort of marketing uncritically. Everything that is good about XMP is in fact directly a consequence of it being based on RDF, which is an open standard developed by the W3C (see the sketch after this list). Of course, at least some of what’s wrong with XMP (like the funky syntax) is also a consequence of RDF (or more precisely, the particular choices Adobe made in subsetting it back around 2000 or so).
  2. They seem to have a smart strategy for compatibility.
  3. Good to see them have a comprehensive metadata strategy based on file-level metadata (as opposed, say, to OS-level).
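
To make the first point concrete, here is roughly what an XMP packet embedded in a file looks like (the property values are invented, and I make no claim these are the exact properties Vista will write). Strip away the x:xmpmeta wrapper and what is left is plain RDF/XML, using ordinary vocabularies like Dublin Core:

  <x:xmpmeta xmlns:x="adobe:ns:meta/">
    <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
      <rdf:Description rdf:about=""
          xmlns:dc="http://purl.org/dc/elements/1.1/"
          xmlns:xmp="http://ns.adobe.com/xap/1.0/">
        <xmp:CreateDate>2006-08-24T10:00:00Z</xmp:CreateDate>
        <dc:title>
          <rdf:Alt>
            <rdf:li xml:lang="x-default">A sample photo</rdf:li>
          </rdf:Alt>
        </dc:title>
        <dc:creator>
          <rdf:Seq>
            <rdf:li>Jane Photographer</rdf:li>
          </rdf:Seq>
        </dc:creator>
      </rdf:Description>
    </rdf:RDF>
  </x:xmpmeta>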

XPath-ing an RDF Profile

Posted in Uncategorized on August 14th, 2006 by darcusb – 1 Comment

I’ve been working on some stuff for the OpenDocument metadata group, including an RDF profile amenable to XML processing, simply to show what might be possible. I was working on an XSLT to demonstrate how it could be processed using standard XML tools, and also on how I might model the constraints in Schematron (having already figured them out in RELAX NG), so I naturally had to work out how to write generic XPaths for basic structures like resources, properties, and so forth.

Here’s what I came up with as a start …

All resources:

//*[* and not(preceding-sibling::*/text()) and not(parent::*/@rdf:about)]

Am not too fond of this one, but it works with my example documents.

All properties:

//*[not(*)]

That’s more like it!

Would need more work to come up with a robust, generic RDF profile validator using only XPath (e.g. Schematron), but it seems not too hard. In any case, it’s certainly easier than writing a generic XML metadata validator!
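
As a taste of where that could go, a single Schematron rule hung off the property XPath might look something like this (the specific constraint and its wording are just illustrative, not part of any actual profile):

  <schema xmlns="http://purl.oclc.org/dsdl/schematron">
    <ns prefix="rdf" uri="http://www.w3.org/1999/02/22-rdf-syntax-ns#"/>
    <pattern>
      <!-- every property (a leaf element) needs a literal value
           or a reference to another resource -->
      <rule context="*[not(*)]">
        <assert test="normalize-space() or @rdf:resource">
          A property must have a literal value or point to a resource
          via rdf:resource.
        </assert>
      </rule>
    </pattern>
  </schema>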

Extensibility?

Posted in Uncategorized on August 12th, 2006 by darcusb – Comments Off

What does it mean to have extensible XML support?

This is a question that came up somewhat obliquely in the latest OpenDocument Metadata SC conference call, where I was presenting my draft requirements for the bibliographic use case, one of which was the need for extensibility. XML, after all, is an acronym for eXtensible Markup Language. Given my focus on metadata, I’ll restrict myself more to that realm.

It seems to me there are largely two views on this question. One perspective—I’ll call it the “document-based” view—says that extensibility is defined first through the simple ability to create new languages, and second within those languages to create strategic extension points.

Another view—I’ll call it the “module” view—sees metadata not fundamentally in terms of documents and complete schemas, but rather in terms of modules of descriptions that can be plugged together, mixed up, or otherwise interact, mostly independently.

This first view suggests to me an image of a book, complete with introduction and conclusion, index, and covers. It’s a more hermetic view of metadata.

The second view is, I think, the view of the web and hyperlinks, RDF, and more recently microformats. Why invent elaborate new schemas, this view says, when you can instead mix-and-match from a rich set of existing alternatives?

So when we at the Metadata SC talk about “extensibility” as a requirement, what do we mean?

I can only really speak for myself, but to me—a partisan of the second view—extensibility has to mean both that one can add custom XML markup and that the markup conforms to some rules such that ad hoc mixing and interaction is possible.
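
To make that concrete with two real vocabularies (the resource URI and the values below are invented): RDF lets a single description pull properties from Dublin Core and FOAF at once, and a processor that only understands one of them can still do something useful with the parts it knows.

  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
           xmlns:dc="http://purl.org/dc/elements/1.1/"
           xmlns:foaf="http://xmlns.com/foaf/0.1/">
    <!-- Dublin Core and FOAF mixed in one description -->
    <rdf:Description rdf:about="http://example.org/docs/report">
      <dc:title>A Sample Report</dc:title>
      <dc:creator>
        <foaf:Person>
          <foaf:name>Jane Author</foaf:name>
        </foaf:Person>
      </dc:creator>
    </rdf:Description>
  </rdf:RDF>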

Simply allowing anything-goes addition of arbitrary content achieves little that is useful. While there may well be use cases for this sort of thing—Microsoft’s custom schema functionality surely must be valuable in some contexts—it seems to me it would be counterproductive to not insist on some minimal expectations of interoperability across a document format’s metadata format.

This is not to say that all conforming applications must fully understand extension structures, but it is to insist on the need for at least minimal legibility (for example, the ability to display any foreign content).

Leopard Kittie

Posted in General on August 7th, 2006 by darcusb – Comments Off

Good “so what” analysis of Apple’s Leopard. Egads; not only am I not impressed by a lot of what Apple is now showing, I’m actually rather horrified by what they’re doing to Mail with its “templates” functionality. I already get far too much gratuitous HTML mail. And why would notes and todos have anything so directly to do with mail that they’d be included in the application? Finally, what’s with all the “look, we do RSS” hype? How about Atom?

URIs as Names

Posted in Uncategorized on August 5th, 2006 by darcusb – 8 Comments

One of the things that is confusing for those new to the semantic web—either in its full-blown RDF guise, or in other contexts like microformats—is URIs, which are often used simultaneously to identify things, and to locate them. I still find myself a bit confused by this, though the fog is lifting.

Norm Walsh has a nice, clear overview of the issue, and his conclusion is:

Time and again, we see individuals and organizations inventing new URI schemes in order to tackle the problem of “names” versus “addresses”. That is, they want to provide some sort of a globally unique identifier for “This Thing” independent of where representations of that thing might reside. Almost inevitably, these individuals and organizations fall into the trap of thinking that an “http” URI is somehow an address and not a name and is, therefore, inappropriate for their purpose. They are mistaken. I used to believe this too and I was wrong. A new URI scheme is not necessary, nor does it actually solve the problem.

This is an interesting issue for me, as I’ve found myself using URNs and INFO URIs to represent standard identifiers in my bibliographic data.

I consider the need for a common identifier infrastructure critical enough that I’ve argued it ought to be a requirement in the new OpenDocument metadata support that we use URIs for identification, always. I think the fact that Microsoft is not using URIs to identify bibliographic records in Word 2007 is a short-sighted mistake.

But we are then left with what is in effect a social question: which URIs to use?

For bibliographic metadata, this becomes somewhat complicated in the face of a myriad of different identifiers, controlling authorities, and URIs.

Take a book, for example, which I asked Norm about in comments. I typically use an ISBN as a widely adopted, reasonably robust identifier. ISBNs are far from perfect (sometimes an ISBN is in fact not unique, and it refers to a physical manifestation that is conceptually different from the more abstract thing that scholars cite), but they are much better than a lot of the alternatives.

But there are many ways to encode ISBNs as URIs. Let’s take my book as an example, with an ISBN of 0415948738. The following are all perfectly valid ways to represent this as a URI:

  1. urn:isbn:0415948738
  2. http://worldcat.org/wcpa/isbn/0415948738
  3. info:isbn/0415948738
  4. http://www.amazon.com/gp/product/0415948738
  5. http://isbn.nu/0415948738

As you can imagine, the list could be much longer, particularly if you include redirect URLs as options.

Aside: you would think the Library of Congress would have a smart URI system like WorldCat, but I’ve not managed to find it.

So how does one decide which to use? I need ids that are stable and unique, and which I have some confidence others out there in the world of bibliographic metadata—in particular applications developers—might also use or understand.

My solution has been to use URNs; the first option above. I really don’t care if it resolves to some URL or not, and it’s easy enough to parse and then use to grab records from different locations. I also use URNs for encoding links to periodicals for the same reason: they provide a convenient abstraction.
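
To show what I mean by parsing: a tiny XSLT template can peel the ISBN out of the URN and point it at whatever resolver you like. The bib:identifier element and its namespace are made up for the example, and the WorldCat URL is just one of the options listed above:

  <xsl:stylesheet version="1.0"
      xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
      xmlns:bib="http://example.org/ns/bib#">

    <!-- turn a urn:isbn identifier into a link to a resolver -->
    <xsl:template match="bib:identifier[starts-with(., 'urn:isbn:')]">
      <a href="http://worldcat.org/wcpa/isbn/{substring-after(., 'urn:isbn:')}">
        <xsl:value-of select="."/>
      </a>
    </xsl:template>

  </xsl:stylesheet>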

I still think this decision is right, though Norm’s post and the subsequent comments have me scratching my head again.

For example, DOIs are even better identifiers than ISBNs. But DOIs aren’t registered as URNs, so that’s no good. I’m then left with either using the info URI scheme (as above, but using the doi prefix), or the HTTP address for a resolver.

Mind you, I have no doubt that any kind of 21st century metadata interoperability depends on the use of URIs for identification. But there’s no denying there are social issues involved in getting agreement on which names to use.

Suffice it to say I’ll need to clarify this all, particularly as we move forward on the ODF metadata work.

Update: I mentioned in comments the possibility of using an OCLC number for books. On first glance, that might actually be better. The URI for my book, then, is http://www.worldcat.org/oclc/60500684. This has one obvious advantage over an ISBN in that the same identifier applies to both the hardcover and the softcover.

RELAX NG, XSD, Schematron

Posted in Uncategorized on August 3rd, 2006 by darcusb – Comments Off

Anyone writing a new XML language in 2006 is faced with a choice of schema languages. Despite all the marketing and engineering dollars thrown at XML Schema, it is a brain-dead specification: horribly complex where it doesn’t need to be, and incredibly dumb elsewhere. There are all sorts of practical XML constraints that simply cannot be modeled in XSD.

Want to condition the validation of child elements on an attribute? Sorry, you can’t do that.

Want to give users a choice between an empty element with attributes, or text content without attributes? Nope.

Want to define a content model where order is unimportant? Sorry, you can’t do that either.

Thankfully, there is a better alternative in RELAX NG. Here’s an example (using the compact non-xml syntax) from my Citation Style Language (CSL) schema, where I condition validation on a root class attribute:

  CitationStyle =
    element cs:style {
      AuthorDateStyle
      | NumberStyle
      | LabelStyle
      | NoteStyle
      | AnnotatedStyle
      | CustomStyle
    }

The AuthorDateStyle pattern is then defined like so:

  AuthorDateStyle =
    attribute class { "author-date" },
    Info,
    Terms?,
    Defaults,
    AuthorDateCitation,
    AuthorDateBibliography

So the AuthorDateStyle class requires a citation and a bibliography element, the sort element child of bibliography must be set to “author-date”, and so forth. The schema reflects the expectations tools developers should bring to the table in designing scripts, or GUIs, or whatever.
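
Incidentally, the other two limitations I mentioned above (the choice between an empty element with attributes or bare text, and the unordered content model) are just as easy to express in the compact syntax. A quick sketch, with made-up element names:

  # either an empty element carrying attributes, or text content alone
  note = element note {
    (attribute type { text }, empty)
    | text
  }

  # unordered content, using interleave
  entry = element entry {
    element title { text }
    & element author { text }
    & element date { text }?
  }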

But what happens if you need to provide your RNG schema for validation in XSD-oriented workflows? Here’s my own conclusion:

Define the schema in such a way that it is easy to create a customization that overrides the more complex restrictions: a simplified schema that Trang can automatically convert to valid XSD. It’s as simple as:

include "csl-alt.rnc" {
  cs-citationstyle =
    element cs:style {
      attribute class { cs-classes },
      Info,
      Terms?,
      Defaults,
      Citation?,
      Bibliography?
    }
}
cs-classes = "author-date" | "number" | "label" | "annotated" | "note"

… where all of the above patterns are simple ones without content restrictions that will make XSD choke.

Trang will then happily create a valid XSD file from this simplified schema.

However, you end up with a much looser schema, so now what? It’s hardly much use to be creating instances against such a loose schema, where they may be invalid against the normative spec and schema.

Answer: create some separate Schematron rules to model the constraints that XSD cannot. If you want to write it within your RNG customization schema (which can then be extracted using Trang + XSLT), then just do stuff like:

    s:rule [
      context = "/cs:style[@class='author-date']"
      s:assert [
        test = "cs:bibliography/cs:sort/@algorithm='author-date'"
        "Must use author-date sorting for the author-date class."
      ]
      s:assert [
        test = "name(cs:citation/cs:layout/cs:item/*[1]) = 'author'"
        "The citation item layout must include an author element first."
      ]
    ]

Finally, write a little shell script to run both validations.

Not nearly as elegant as the pure-RNG approach (and it certainly does little for any real-time validating IDEs I know of), but it can ensure that the instances match the expectations modeled in your RNG schema. And learning a little Schematron is probably good anyway, because it in turn can express things that RELAX NG cannot.

Am personally hoping not to have to do this with CSL, though; it’s enough for me to worry about one schema.