darcusblog » 2005 » October - geek tools and the scholar

Archive for October, 2005

OpenDocument and RDF: Storing What Metadata Where?

Posted in Uncategorized on October 30th, 2005 by darcusb – Comments Off

Earlier I discussed one way I envision expanding metadata support in OpenDocument. That approach used some RELAX NG magic to constrain the structure of the metadata representation while still allowing it to be easily extended.
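
For flavor, here is a minimal sketch of the kind of pattern I mean, in RELAX NG’s XML syntax (the namespace and element names are invented for illustration, not taken from any actual proposal):

```xml
<!-- A metadata wrapper with a constrained core vocabulary plus an
     open extension slot for elements from any foreign namespace. -->
<grammar xmlns="http://relaxng.org/ns/structure/1.0">
  <start>
    <element name="metadata" ns="urn:example:meta">
      <zeroOrMore>
        <choice>
          <element name="title" ns="urn:example:meta"><text/></element>
          <element name="creator" ns="urn:example:meta"><text/></element>
        </choice>
      </zeroOrMore>
      <zeroOrMore>
        <!-- extension point: anything not in the core namespace -->
        <element>
          <anyName>
            <except><nsName ns="urn:example:meta"/></except>
          </anyName>
          <text/>
        </element>
      </zeroOrMore>
    </element>
  </start>
</grammar>
```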

But this leaves two obvious questions: what sort of document objects might one want to add custom metadata to, and where might one store that metadata?

On the first question: the most obvious content would be the things that are summarized in lists of figures, captions, citations, and so forth. Relevant metadata may include source information: not only titles, creators, and such, but also rights information. All of this metadata can lead not only to smarter documents that can be more easily searched, but also to better user experiences. Imagine, for example, not just turbo-charged citation support, but also automatic figure captioning, including publisher-required rights information. Or perhaps information about where to access a data set summarized in a table.

On the question of where to store the metadata, since OD files are just zipped archives, the obvious place is in dedicated files in the file wrapper. Indeed, document metadata is already stored this way, in a “meta.xml” file.

Here I see two possibilities:

  1. Retain the single “meta.xml” file, and create elements to wrap the requisite metadata: meta:Document, meta:Figures, meta:Bibliography, etc.
  2. Create separate files for each kind of metadata: meta-document.xml, meta-figures.xml, meta-bibliography.xml.

I tend to favor the second approach myself.
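
To make the second option concrete, a hypothetical meta-bibliography.xml might look something like this (only the meta namespace URI itself is real; the meta:reference element and its attributes are invented for illustration):

```xml
<!-- hypothetical meta-bibliography.xml stored inside the OpenDocument
     zip archive alongside content.xml and meta.xml -->
<meta:bibliography
    xmlns:meta="urn:oasis:names:tc:opendocument:xmlns:meta:1.0"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <meta:reference meta:id="doe2004">
    <dc:title>An Example Article</dc:title>
    <dc:creator>Doe, Jane</dc:creator>
    <dc:date>2004</dc:date>
  </meta:reference>
</meta:bibliography>
```

Citations in content.xml could then point at entries like this by identifier, which is exactly the separation the citation proposal assumes.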

Incidentally, when Daniel Vogelheim and I wrote the proposal to improve citation coding in OpenDocument last year (which was based on previous work with DocBook, and approved by the OD TC), we always had in mind something like this model: moving the bibliographic metadata out of the main content.xml file and into its own file. Indeed, the citation proposal is virtually meaningless without at least standardizing that bibliographic metadata is stored outside the content file, if not actually formalizing the format for interoperability purposes.

The current RDF discussion simply allows the opportunity to do this in a comprehensive and consistent way. Let’s hope the TC is far-sighted in its deliberations on this matter.

RDF Tutorials

Posted in Uncategorized on October 29th, 2005 by darcusb – Comments Off

In trying to understand RDF, I’ve found this tutorial helpful. Now, here’s another from Shelley Powers in a similar vein.

Search

Posted in Uncategorized on October 27th, 2005 by darcusb – Comments Off

One theme I noted at the Access 2005 conference (at least for the brief time I was there) was federated search. This is a useful evolution of the current mess of a search landscape that library users are faced with. Why, after all, should I have to visit five or ten different portals just to get the information I need? In that sense, federated search interfaces can be a nice, simple access point.

Yet in the long term, I don’t think portal-based federated search interfaces are the way to go, at least not without ways to integrate search more directly into user workflows. It seems Lorcan Dempsey said much the same thing in his talk (which I missed; ppt here), where he demonstrated work being done with Microsoft’s Research Pane to integrate online content into the desktop workflow.

At the OpenOffice bibliographic project we long ago settled on the notion–with a lot of help from Rob Sanderson and Matthew Dovey–that we ought to adopt a unified approach to local and remote queries. Why use different APIs and code and interfaces to query a data store just because it happens to be on the network?

Along somewhat similar break-out-of-the-box lines, Peter Sefton posted a note to the OOoBib dev list with an idea:

As I write, I’ve been making hyperlinks to various web resources, lots of government pages, but a few refereed papers from various sources. This is fine while I’m in draft mode and the document is usable and sharable.

Later, when I want to publish it more formally it would be great if some software could find all my links and for each one look for a bibliographic reference in my local store (whatever that might be) and if there is none search other places that might have bibliographic data linked to that URL. If no data can be found it would add an entry to my database with as much data as it could pre-set (eg title may be able to be scraped from an HTML page) so I can fill out the rest. Once I have entered the data locally it should be aggregated ‘up’ to my workgroup / institution.

Everything I want to refer to in the report I’m working on now is available on the open web, but I could refer to books, say, by pointing at the local campus library system on the web (with some convention for page references) and that should be enough for smart software to automatically grab the details for my bibliography later.

This is like the idea of adding a citation by reference, but without needing a formal ID. It’s like EndNote’s ‘cite while you write’ somewhat in reverse, with citation coming long before ‘proper’ data capture.

I hadn’t thought of things quite this way, but it is indeed an interesting idea, particularly if you couple it with a more distributed vision of metadata. For example, in the long run, perhaps it’ll be possible to simply cite and have software pull in the metadata from the web, rather than having to store it locally.

Alas, this gets to the tricky subject of identifiers that I and others have struggled with. With what URI would one cite a resource? With a webpage, the answer is simple. But the question quickly gets more complex.

Richer RSS Feeds at Ingenta

Posted in Uncategorized on October 26th, 2005 by darcusb – Comments Off

Good news from Leigh Dodds at Ingenta:

I released a couple of tweaks to the IngentaConnect RSS feeds recently. The most notable addition being the inclusion of foaf:maker properties to associate authors with articles, and inclusion of authors as foaf:Person resources. I’ve added these alongside the existing dc:creator properties to ensure that Dublin Core aware aggregators can still do something useful with the extra metadata.

He does note that there’s room for feature requests for your favorite feed reader, though:

I’m not yet aware of any feed readers that process FOAF, or PRISM for that matter.

I’m more interested, however, in bringing this sort of feed reading into the citation management universe. Ingenta’s feeds are now rich enough to supply citation-ready metadata.
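
To picture what that richness looks like, an IngentaConnect-style RSS 1.0 item carrying both vocabularies might look roughly like this (a hypothetical sketch; the URL, title, and name are invented, and this is not Ingenta’s actual markup):

```xml
<item rdf:about="http://www.ingentaconnect.com/content/example/art00001"
      xmlns="http://purl.org/rss/1.0/"
      xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
      xmlns:dc="http://purl.org/dc/elements/1.1/"
      xmlns:foaf="http://xmlns.com/foaf/0.1/">
  <title>An Example Article</title>
  <link>http://www.ingentaconnect.com/content/example/art00001</link>
  <!-- kept so Dublin Core aware aggregators still work -->
  <dc:creator>Doe, Jane</dc:creator>
  <!-- the new part: the author as a full foaf:Person resource -->
  <foaf:maker>
    <foaf:Person>
      <foaf:name>Jane Doe</foaf:name>
    </foaf:Person>
  </foaf:maker>
</item>
```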

Thanks Leigh!

A Bet: Lightweight vs. Heavyweight

Posted in Uncategorized on October 25th, 2005 by darcusb – 11 Comments

Ernie Prabhakar (of OpenDarwin) and I have been going back and forth on the future of document standards. Ernie is of the belief that the future lies in lightweight solutions like XHTML + microformats. I am generally a proponent of heavyweight solutions like DocBook, OpenDocument, and RDF. While I believe the lightweight solutions have their place (primarily as output/display formats), I simply don’t think they have much hope of solving the deep problems I really care about (smarter documents, better interoperability, long-term viability, etc.). I certainly see no hope of authoring my academic documents in them.

So Ernie made the conversation interesting by proposing a bet. We’ve gone back and forth on the language, but I hope he’s comfortable with settling on this:

By January 1st, 2010, more technical documents will be authored in XHTML + microformats than in any mix of DocBook or OpenDocument and RDF.

Loser pays for dinner at Chevy’s in San Francisco.

This is just a friendly bet, of course (and there would no doubt be complications in actually defining what we mean by “technical documents,” etc.). The question is really whether the future of serious document production will be with the lightweight solutions, or the heavyweight solutions. Will the future be a world where microformats dominate and RDF is left in the dust, or vice versa? Or perhaps (as I think much more likely) there’ll be a draw, and both will have their roles?

I wonder how others would wager?

Apple’s Photoshop Killer and Standards

Posted in Uncategorized on October 23rd, 2005 by darcusb – Comments Off

In a past life, I once had the idea of being a professional photographer. I worked in a darkroom (where I did it all, including color printing), and later in a professional photo studio in Switzerland. For a period of time I was fairly into large-format view cameras, later a medium-format Pentax, and only more recently back to 35mm film, as I’ve grown less interested in the methodical precision of the larger formats, and more interested in street photography.

I’ve lost the passion I once had for photography, but I still occasionally take an image that rekindles it. While I have a Nikon scanner and Photoshop, I’ve yet to really jump into digital photography. It remains fairly primitive in comparison to analog. Cameras are improving really quickly, however, and printers are good enough these days to mostly best the results one can get in a traditional darkroom.

One area that has yet to be really solved to my satisfaction is editing and workflow applications. I was once a user of Live Picture, which had one feature I consider essential: non-destructive editing. That application died under the weight of corporate mismanagement. The notions of editing files directly, of creating full copies for new versions, etc., will be seen as positively archaic when we look back on this in ten years.

Now, Apple has resurrected that basic principle, and dramatically raised the bar on what we can expect from professional editing applications. Aperture looks to be a really stunning intervention in this market. And while Googling around makes clear that Apple is claiming this application is a comfortable complement to Photoshop rather than an outright competitor, I simply don’t believe it. This has all the makings of a Photoshop killer.

One issue I worry about, though: Apple understands the importance of metadata in the application, but it surprises me they didn’t adopt Adobe’s XMP for it, both for elegant extensibility and for seamless interoperability with Adobe tools (and perhaps, in the future, OpenDocument).

For the most part, Apple understands application design better than any other major competitor. They are willing to take chances to improve the end result that others simply won’t. Compare, for example, Pages and OpenOffice 2.0. Pages has a horrendous non-standard XML format that OpenDocument beats by a mile. However, while Apple made the bold move of totally de-emphasizing presentational styling by removing bold and italic from the GUI (which in fact fits the strengths of an XML document format much better), OO.o took the opposite tack by adding the brain-dead hack of Microsoft’s “format painter.” Why? Because users requested it. Ugh … perhaps users requested this because Writer needs a better and more intuitive interface for applying styles?

I frankly think both worlds have something to learn from each other. Apple needs to get much more serious about open standards and interoperability; NIH will kill the company if they don’t wake up. Conversely, OOo needs to be bolder, and less focused on always copying Microsoft. There is much in XML that should make us rethink what productivity applications can be, and what they offer to end users.

I am, of course, primarily an end-user, and I’m tired of uninspired and poorly interoperable productivity applications.

XML and RDF

Posted in Uncategorized on October 22nd, 2005 by darcusb – 11 Comments

Last week at the Access 2005 conference, I told a room full of mostly library people that their XML standards (I was talking about MODS and MADS in particular) are needlessly complex, inflexible, and awkward; that they are not hacker-friendly. I showed them an alternative schema I’ve been working on that is better, cleaner, and much more hacker-friendly XML. Modeled on DOAP, this schema also happens to be RDF, and I exploit the basic plumbing of RDF, like its linking support (which I explained is much more consistent than the use of xlink in MODS), to yield a data representation that I think may well be close to a perfect balance of simplicity and expressiveness for citation-related metadata.

As I explained to the audience, before I left I had gotten a comment on the schema from a Mac developer. He has been working on a MODS GUI editor interface, which I have always felt would be a difficult task without some pretty serious abstraction, and so I thought what I was working on might give him some ideas.

Now, this developer has no experience at all with RDF, nor much with XML. However, when he looked at an example instance, he immediately understood why the basic structure of RDF would be valuable. He didn’t use the word “triples”; he just recognized that it made sense to have authors be full resources: Person objects that one links to. He also concluded that it was much more readable than MODS. Exactly what I was after, in fact!
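
I can’t reproduce the schema here, but a hypothetical instance in the same spirit, with authors as linked Person resources rather than embedded strings, might look like this (the bib namespace and identifiers are placeholders, not my actual schema):

```xml
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/"
         xmlns:foaf="http://xmlns.com/foaf/0.1/"
         xmlns:bib="http://example.org/biblio#">
  <!-- the article is one resource ... -->
  <bib:Article rdf:about="http://example.org/refs/doe2004">
    <dc:title>An Example Article</dc:title>
    <dc:creator rdf:resource="http://example.org/people/jdoe"/>
  </bib:Article>
  <!-- ... and its author is another, linked by reference -->
  <foaf:Person rdf:about="http://example.org/people/jdoe">
    <foaf:name>Jane Doe</foaf:name>
  </foaf:Person>
</rdf:RDF>
```

Each statement here is just a triple: the article has a title, the article has a creator, the creator has a name.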

My point goes against the grain of common wisdom, which says that XML is easy and RDF hard. I simply believe both statements are wrong. What, after all, is more basic than the notion of making statements about things as a list of triples, and using URIs and namespaces to disambiguate names? My point also goes against a lot of the discussion about RDF that seems to be getting quite heated. Only this time the heat is in the RDF community itself, as people argue about the future of the technology in the face of outside pushback that is mostly about the XML syntax.

Let me say a few words on this as someone fairly new to RDF, and rather more experienced with XML. Yes, the RDF/XML syntax needs work to make it more friendly to XML tools. If people like Edd Dumbill feel the need to rely on RELAX NG to constrain the syntax, then there’s something wrong (though I feel the need to do the same with the non-RDF MODS schema, so I’m not sure that’s such a big deal). I’m not one who believes, however, that one must throw the baby out with the bath water. Fixing a single problem in the spec — that one can use attributes to denote properties — would go a long way towards rationalizing it.

What else? I don’t understand the purpose of Alt and Bag, and I bet I’m not alone. Likewise, while coming from an XML background leaves me predisposed to wanting to use reification, I tend to think it complicates the triples model without a clear pay-off.

I’m also still not sure about the need for datatyping.

But the bigger picture, which all manner of critics sometimes forget, is this: RDF is trying to solve some hard and important problems. Metadata is hard, particularly in a distributed context, and I see nothing out there that offers a reasonable alternative.

Certainly the RDF world could look at simplifying the XML syntax, but I agree with Dan Brickley that an even more important goal is continuing the evolution of RDF tools. If hot application environments like Ruby on Rails had RDF support that mirrored their current SQL-based ActiveRecord, that would do more to encourage uptake of RDF than anything done with standards documents.

But let’s keep in mind my bigger point again: metadata is hard, and technology is not an ideal world of black & white options, but a messy one of grounded compromises. One can very easily do very bad XML (witness OPML), while writing very clean RDF/XML. And I believe designing an XML schema based on RDF would tend to lead to better XML design. Large companies like Adobe are putting RDF to practical use. Perhaps, then, we need less talk at the level of Platonic technology ideals, and more practical discussion of what problems we need to solve, and how to do just that?

Modeling References Relationally

Posted in Uncategorized on October 22nd, 2005 by darcusb – Comments Off

Given that I’ve been doing a lot of work on reference metadata modeling over the past year, I’ve been trying to put that knowledge down in a formalized way wherever possible. For the most part, this has been in RELAX NG for XML representations like CSL, and the new RDF/XML representation I’ve recently been working on.

However, I’ve long been calling for better SQL models for this sort of metadata. So I thought it would make sense to tackle this now and, if I actually get anything useful done, release it to the world for anyone to use. Indeed, it might well fit with how I’ve been thinking about this from an RDF standpoint.

I ran into DB guru Josh Berkus on the OpenOffice DB dev list a while back; he challenged my contention that SQL DBs are awkward ways to store reference metadata. Josh has thus been graciously trying to help me work out the structure of the model.

The basic question, as Josh notes, is how to handle the basic modeling of parts (articles), containers (books and journals), and collections (series, archival collections, etc.). I have recently leaned toward the view that the basics here can be handled in a single table (title, date, description, etc.), where each level would be a separate row. Contributors and notes and such would be handled in separate tables.

But not so fast, Josh says! A good RDBMS designer takes a rather different, more methodical approach to design than people like me, who come from an XML background. He doesn’t want me to worry about abstractions like parts and containers and collections. “Just tell me exactly what you need to store,” he says.

This is a problem. There is only one spec I am aware of that has this information in a comprehensive way. The problem is that the spec is not easy to read, and its heavy focus on the level-based abstraction may lead DB designers down the wrong path. It is true that the levels and relations are crucial, but it is also true that users do not care about them. Citations are about the reference; from that standpoint there is no distinction between an analytical (article) title and a monographic (say, report) title. They are both names for citable resources. If one searches on a title, one should get results from either level.

As Josh notes, there are time constraints here. While I can do this myself, I’m overextended already (as is he!), and the task of compiling a table for each of 40+ reference types with their specific attributes will take more time than I think I have. I did start a table in OmniOutliner a couple weeks ago; just haven’t found time to get back to it. If someone wants to help, let me know.

I can say that some of the more difficult areas of bibliographic metadata, if you really want to do it right (e.g., not just use natural-language strings), are the following:

  • dates: they come in different kinds (publication dates, event dates for hearings and conferences, decision dates for legal cases) and different forms (1999, “August 2000”, “September 21, 2001”, “Spring/Summer, 2002”, “June 1, 5-9, 1971”)
  • names: again, different types (personal vs. organizational) and forms (think about the international problem of sort order and transliteration, but also even Western names like “J. Edgar Hoover” or “Alexander von Humboldt”)

These problems of course apply across the spectrum of XML, RDF, and SQL.
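
In XML terms, a representation that takes these seriously has to give dates and names explicit structure rather than flat strings. A hypothetical sketch (all element names are invented for illustration):

```xml
<reference xmlns="http://example.org/biblio">
  <!-- a partial date: a season plus a year, not a full calendar date -->
  <date type="publication">
    <year>2002</year>
    <season>Spring/Summer</season>
  </date>
  <!-- a personal name whose sort form differs from its display form -->
  <name type="personal">
    <given>Alexander</given>
    <family>von Humboldt</family>
    <sort-key>Humboldt, Alexander von</sort-key>
  </name>
</reference>
```

The same distinctions would have to be carried over into the RDF properties or the SQL columns.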

A Complete Metadata Cycle

Posted in Uncategorized on October 18th, 2005 by darcusb – 2 Comments

I gave a talk at Access 2005 yesterday that lays out what I think are some really exciting possibilities in recent discussions around OpenDocument and RDF metadata, and how we at the bibliographic project hope to contribute to realizing those possibilities. The slides are here for those who are interested.

Feel free to post comments or questions, whether you were there or not.

Access is an interesting conference, and really showcases how the library world is pushing technology frontiers as a way to solve problems for end-users. The conference has a blog aggregator.

One thing I forgot to mention at the end of my talk: anyone out there that wants to help us realize this vision, let us know!

MODS GUI

Posted in Uncategorized on October 7th, 2005 by darcusb – 1 Comment

Mr. Kool is busy. From comments on the previous post:

BTW, there is also a screenshot for a MODS Tight GUI. It’s a quite complete approach, as in that you can edit all of the MODS Tight xml in it (eventually), but it’s also taking an enormous amount of screen real estate for it…

The result is: [screenshot of the MODS Tight GUI]

I’ve long thought it not a good idea to try to model MODS too literally in a GUI. The trick with a bibliographic data model, and the GUI that represents it to the user, is that it needs to offer users flexibility, but also clarity. The two exist in some tension.

So what kind of flexibility do I want as a user? The most important is flexible typing and relations. If I want to enter a paper presented at a conference, and later republished in a proceedings, I want my GUI to make doing this both possible and intuitive. No GUI I am aware of does either.

I still think a GUI can offer more constraint — and thus clarity — for users, rather than following MODS too closely. For example, let’s say I create a record and start by telling the GUI that I am entering a “paper”. I have fields for author/creator and title and maybe subjects. I then can optionally add the metadata associated with a conference presentation. Likewise, I can then add some sort of “published in” relation and choose a proceedings, which gives me title and publisher choices for it, and so forth. Finally, the GUI knows I need to add page numbers.
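
The record behind that workflow might serialize to something like this (a hypothetical sketch; the element names are invented, and this is neither MODS nor my actual schema):

```xml
<paper xmlns="http://example.org/biblio" id="smith2004">
  <title>An Example Paper</title>
  <creator>Smith, John</creator>
  <!-- the optional conference-presentation metadata -->
  <presented-at>
    <conference>
      <title>Some Annual Meeting</title>
      <date>2004</date>
    </conference>
  </presented-at>
  <!-- the "published in" relation to a proceedings, with pages -->
  <published-in pages="45-60">
    <proceedings>
      <title>Proceedings of the Some Annual Meeting</title>
      <publisher>Example Press</publisher>
      <date>2005</date>
    </proceedings>
  </published-in>
</paper>
```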

Personal names are tricky. In some ways it’s easier to just have a name field, and do “Doe, John” and have simple code that splits the name. But what if you need to indicate that the name is a transliterated Asian name, with a different sort order?

BTW, with respect to the data model, ideally much of this would be normalized so that, among other things, one could use auto-completion to deal with authors, publishers, and so forth.

So I want a data model and GUI that are flexible, but also clear. It should also be possible to configure them to determine the mix of metadata fields/properties associated with different kinds of named resources.