darcusblog » 2006 » June - geek tools and the scholar

Archive for June, 2006

The Chronicle Does Citation Software

Posted in Uncategorized on June 24th, 2006 by darcusb – Comments Off

The Chronicle has an article (which I stumbled on here) on citation management software. A couple of interesting excerpts:

And a few faculty members have tried the software for their own research and then gone back to tried-and-true manual methods. One is Lowell Turner, a professor of international and comparative labor and collective bargaining at Cornell. He says his graduate research assistants urged him to use RefWorks, but he found that the program couldn’t quickly or easily import a career’s worth of bibliographic material, in a variety of formats.

Another similar point, though this one hitting on the data model theme I’ve focused on extensively here:

He’s not alone. Even though legal scholarship follows exceedingly detailed citation rules that seemingly would be well suited to a computer program, legal scholars as a whole avoid citation software, says Kevin M. Clermont, a law professor at Cornell. Legal scholars often cite arcane documents from around the world, which citation software has difficulty handling, he said.

“It’s by light years not sophisticated enough to handle our problems,” he says.

Yup, I feel his pain, and unless MS fixes their data modelling approach, I’m afraid their new support won’t work for him either. Am hoping we can get it right at OpenOffice though.

Finally, on the costs:

Since November 2003, almost 11,000 people at the University of Minnesota-Twin Cities have registered for RefWorks, and they have stored a total of 570,000 references on RefWorks’ servers …

How much do they pay for this? $12,500 per year.

Sigh … so how much would it take to build a better open source solution using PostgreSQL and Ruby on Rails? If each institution that had such a site license put, say, $500 in a pot? No, that doesn’t include all the support issues involved in such an enterprise, but how hard can that really be?

Regrettably, the article focuses solely on RefWorks and Endnote. There’s no mention at all of the forthcoming support in Word, nor of the OpenOffice work I’m involved in. Both efforts will bring better-integrated citation formatting support to word processors.

Likewise, there’s no mention of interesting developments in the world of free services and software like Connotea and CiteULike. Admittedly, neither of these is general enough to serve as a real substitute, but I think it’s only a matter of time before they are.

So nice to see the article, though it seems strangely dated.

ODF and OpenRaster

Posted in Uncategorized on June 20th, 2006 by darcusb – Comments Off

There’s some discussion of creating a new open raster image file format using OpenDocument as the technical base. Great idea! It shows some of what the ODF infrastructure (both technical and otherwise) can make possible. And the new metadata work ought to fit well with such an effort.

Wither Apple?

Posted in General on June 16th, 2006 by darcusb – Comments Off

I’ve not written about Apple in a while. Mark Pilgrim’s announcement that he’s switching to Linux after 22 years on the Mac, and the absolutely absurd comments from the Mac zealots, just remind me that I’ll also likely be following Mark’s lead next time I buy hardware, and for most of the same reasons. I don’t have the interest to go into it in depth, but in short, their software is uninspiring, the company is more closed and standards-unfriendly than even Microsoft, and their hardware is expensive. I don’t buy music from iTunes, and I don’t ever plan to. The only software I care about that is not on Linux is the advanced photo-editing applications like Photoshop and Lightroom.

Of course, there’s still this little issue I’ve been obsessing about regarding citation support, but for now the stuff being added to Word 2007 is nowhere in sight on the Mac.

Flat vs. Relational

Posted in Uncategorized on June 16th, 2006 by darcusb – Comments Off

Now that I’ve covered most of the details of citations and bibliographies in Word 2007, let me return to the subject of the source format. The team that designed the schema made a number of design decisions. In designing the equivalent for use in OpenOffice and OpenDocument, I have made some different decisions. Similar debates have accompanied the effort to put together an hCite micro-format. Let’s compare.

So the structure of the bib schema in Office 2007 (and Brian Jones tells me, Open XML; this will be documented in the ECMA format) is a flat model, with a root of b:Sources, and primary child elements of b:Source. Typing is provided by a b:SourceType element. All properties of the bibliographic item are then described with child elements of b:Source; there is no hierarchy. So, for example, to encode titles:

  • for a Book, you use b:Title
  • for a BookSection, you use b:Title for the chapter title, but b:BookTitle for the container
  • for the journal article title, as above you use b:Title
  • for the journal title, you use b:JournalTitle
  • etc., etc.

The problem with this approach is you end up with an explosion of elements to describe the range of resources. I count 9 elements that are used to describe the same thing: titles (though currently they incorrectly assume a Case “Reporter” is a contributor; rather, it’s a periodical title). And they are missing a few: CollectionTitle and SeriesTitle are the obvious ones. Essentially, every new resource type (particularly if it has some part-container relation) needs a new title structure! And every time you add a new title structure, you have to update code elsewhere (in, for example, every single XSLT file that implements your citation styles!).
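To make that maintenance cost concrete, here is a minimal Python sketch. It is not Microsoft's code: the element names Title, BookTitle and JournalTitle come from the schema as described above, but the dictionary layout, the "JournalArticle" and "ArchivalDocument" type names, and the function itself are assumptions made purely for illustration.

```python
# Hypothetical flat records: every property is a sibling key, as in b:Source.
# Which key holds the container title depends on the source type, so the
# formatter needs a per-type lookup table.
CONTAINER_TITLE_FIELD = {
    "BookSection": "BookTitle",
    "JournalArticle": "JournalTitle",
}


def container_title(source):
    """Return the title of the containing work, or None if unknown."""
    field = CONTAINER_TITLE_FIELD.get(source["SourceType"])
    return source.get(field) if field else None


chapter = {"SourceType": "BookSection", "Title": "A Chapter", "BookTitle": "A Book"}
article = {"SourceType": "JournalArticle", "Title": "An Article", "JournalTitle": "A Journal"}
# An invented new type with a part-container relation. Its container title is
# silently dropped until someone extends the table above (and every style).
letter = {"SourceType": "ArchivalDocument", "Title": "A Letter", "CollectionTitle": "Some Papers"}
```

Here `container_title(letter)` returns `None` even though the record plainly has a container title, which is the kind of breakage the flat model invites each time a type is added.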

Also, the modeling is inconsistent, both internally, and with respect to the document-level metadata description in OXML. On the former, a simple example: the title of a book is b:Title, except when you are describing a section within the book, at which point it is a b:BookTitle. On the latter, OXML now uses DC to describe documents, but here we see no evidence of DC.

There’s another problem, incidentally, with the structure of the MS schema, which is more a limitation of the validation technology they are using (XML Schema) than anything. Because they use the same element for all types, they cannot validate the content by type. So it will be possible, for example, to include a b:BookTitle element within a journal article record. RELAX NG has no such limitations, but the schema isn’t expressed in RELAX NG.

My approach, by contrast, is not flat, but relational. I use RDF for the relational modeling and linking. In the XML, I use typed nodes to encode the important information, which means one need only have two title structures: title and shortTitle. Conceptually, then, you end up with:


And the majority of critical properties can be represented with standard DC and Extended DC; the same ones, incidentally, OXML already supports for the document!

Finally, an XML schema (expressed in RELAX NG) tightly controls the structure of the content by type.

More broadly, using a relational structure in which you keep the number of properties to a minimum has further benefits. The formatting system, for example, can be made much more robust.
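A rough Python sketch of why the formatter gets simpler, under stated assumptions: this is not my RDF or RELAX NG schema, just an illustration of the relational idea. The class name `Item` and the attribute `is_part_of` are invented for the example; only the title/shortTitle pair comes from the description above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Item:
    """A bibliographic resource; a container is just another Item."""
    type: str
    title: str
    short_title: Optional[str] = None
    is_part_of: Optional["Item"] = None


def container_title(item):
    """One rule for every type: follow the link, read the same property."""
    return item.is_part_of.title if item.is_part_of else None


book = Item("Book", "A Book")
chapter = Item("Chapter", "A Chapter", is_part_of=book)
collection = Item("Collection", "Some Papers")
letter = Item("Letter", "A Letter", is_part_of=collection)
```

Adding the letter-in-a-collection case needed no new title property and no change to `container_title`; the type is new, the structure is not.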

(X)Forms in Bibliographic Apps

Posted in General on June 15th, 2006 by darcusb – Comments Off

A while back I wrote that a new bibliographic web application ought to include:

A configurable form system flexible enough to be configured for any resource type: everything from journal articles to books, to archival documents, to weblog posts. This presumes the form system should not be based on RIS or BibTeX, but rather around a more flexible standard like MODS. Either XML or YAML would be good bets for configuration languages in Ruby or Python.

I probably mentioned the idea of using a simple XML language to configure the GUI elsewhere too. In any case, MS has done just that in Word 2007:

So it seems the entire editing forms are configured with this XML file. In fact, I bet (though cannot now test) that one could add custom types by simply editing this file.
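As a rough illustration of the general idea (this is not the Word 2007 file, and every element and attribute name below is invented), an XML-configured form system can be as small as this Python sketch:

```python
import xml.etree.ElementTree as ET

# Invented configuration format, for illustration only.
CONFIG = """
<forms>
  <form type="Book">
    <field name="title" label="Title"/>
    <field name="publisher" label="Publisher"/>
  </form>
  <form type="WeblogPost">
    <field name="title" label="Title"/>
    <field name="url" label="URL"/>
  </form>
</forms>
"""


def load_forms(xml_text):
    """Return {resource type: [(field name, label), ...]} from the config."""
    root = ET.fromstring(xml_text)
    return {
        form.get("type"): [
            (field.get("name"), field.get("label"))
            for field in form.findall("field")
        ]
        for form in root.findall("form")
    }


forms = load_forms(CONFIG)
```

Supporting a new resource type then means adding one `<form>` element to the file rather than touching application code.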

Interestingly, the author definition includes an associated XSLT that converts a simple string to properly-structured XML, and another to convert the other way (though I still hate that it all, including the XML, presumes standard Western name forms; what if I am a scholar of Chinese history?). I wonder, can you do this in XForms?

I’ve been saying for a while that OOo needs to deepen XForms support to open it up to developers for these sorts of uses. This would be particularly interesting when coupled with the idea that a couple of the Sun engineers were discussing at the ODF metadata SC of creating a standard RDF XForms binding for our metadata work. That could allow GUIs to be essentially auto-configured for custom content.

Opening Up the Market

Posted in Uncategorized on June 13th, 2006 by darcusb – Comments Off

I said in my last post that:

I cannot emphasize enough how important it is that this stuff be standardized within document formats and included within editing applications. It’s critical, and the sad state of the current market is a direct consequence of the fact that it is not.

What I am saying may seem paradoxical: that including standard support commonly found in third-party plug-ins will actually open up the market, rather than close it. This is so only, however, if one can use alternate data sources. I should, put simply, be able to have Word access RefWorks, or Endnote, or whatever reference management software I want.

Thankfully, there’s a fairly easy way for Microsoft to allow this: tweak their Research Pane a bit.

Right now, the “insert citation” button on the Word ribbon includes an option to “search libraries.” When you click it, it brings the Research Pane up. Good!

Sadly, it doesn’t do anything useful (yet). What it should do is give default access to the Library of Congress SRU/W gateway, and to MS’s Academic search service. Further, it should be trivial to add any new data source to this.

Also, a user ought to be able to drag-and-drop the search results onto the document to cite them. I think this does suggest some enhancements to the Research Pane, including removing the requirement to use SOAP. RESTful web services are winning the day, and MS ought to support them.
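To show how little a RESTful data source asks of the client, here is a minimal Python sketch that builds an SRU search request as a plain URL. The parameter names (operation, version, query, maximumRecords) are the standard SRU ones; the endpoint is a placeholder, not a real service, and the function name is made up for the example.

```python
from urllib.parse import urlencode

# Placeholder endpoint, not a real service.
BASE_URL = "https://sru.example.org/catalog"


def sru_search_url(cql_query, max_records=10, base_url=BASE_URL):
    """Build an SRU searchRetrieve URL; fetching it is a plain HTTP GET."""
    params = {
        "operation": "searchRetrieve",
        "version": "1.1",
        "query": cql_query,  # a CQL query string
        "maximumRecords": max_records,
    }
    return f"{base_url}?{urlencode(params)}"


url = sru_search_url('dc.title = "citation"')
```

No envelope, no WSDL: any tool that can issue a GET and parse the XML response can act as a client, which is what makes adding a new source trivial.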

Problem solved … mostly. We now have good standard base support, but open up options for different kinds of users and user communities, as well as developers.

One problem with this approach, however, is that it puts a lot of burden on the source data format for interoperability, and right now, it is rather more limited than it should be to fulfill that requirement.

Incidentally, everything I’ve been saying is pretty much what we’ve been advocating at the OpenOffice bibliographic project. While it could be coincidence, I can’t help but wonder if people at MS have been paying attention, and if we haven’t unintentionally done a bit of design work for them!

Update …

From MS’s Chris Pratley, on some forum, more info:

Word 2007 comes with a citation library capability, and by the time we ship it will have connections to on-line reference libraries so you can search for citations and download them to your local library. In beta 2 you have to manually enter citations, but you can keep them in your library and re-use them in different docs.

Word 2007 beta 2 has a set of the most common citation formats (MLA, APA, etc.), and this can be expanded either by end users (need to edit an XML file), or by third parties or Microsoft in the future. We expect a lot of people to add more formats you can download so you don’t have to make them yourself. We’re just two weeks into public beta so that hasn’t had a chance to happen yet.

So seems like good news, though his explanation on citation styling is cryptic.

Plug-in vs. Standard, XSLT vs. CSL

Posted in Uncategorized on June 13th, 2006 by darcusb – 1 Comment

Peter again on citations in Word. Two issues he raises; first about my argument that MS ought to use a CSL (or CSL-like) abstraction on top of a generic XSLT:

Bruce has some concerns about the complexity and size of the XSLT involved, but I don’t think that matters so much -what matters is that XSLT is involved. All that’s required is an CSL to XSLT compiler. Feed CSL in one end and get a Word 2007 compatible stylesheet out the other. This could be done with a stand alone tool.

That would be possible, but not very realistic. It adds further steps to setting up a new style, and as I mentioned, each style file would be very large. We need to start thinking about open citation style repositories, where a user (or even just a processing tool) can grab a new definition as needed. That is only convenient where the files are:

  • self-contained
  • small

The questions Peter asks near the end (about adding and creating new styles, repositories, etc.) will all have fairly uninspiring answers with the current approach. With CSL, not only do we have a feature-rich language that satisfies the above requirements, but one that is both language and document-format agnostic. One can use the same style files with ANY document format: Open XML, OpenDocument, DocBook, XHTML, RTF; even TeX.

The second big question Peter asks is whether citations support ought to be standard in Word (or OpenOffice for that matter).

And I’m still dubious about the value of having the bibliographic software built into Word 2007; Microsoft’s site clearly states that if you load a file with citations in it into an earlier version of Word they will be converted to plain text. This means that the feature will not be usable in a real-world context for several years. People have to collaborate with others, work from home and in internet cafes; we can’t mandate Word 2007 in all those places.

First, I think MS can do better than convert the citations to text. I suggest that with their patch to add OXML support to previous versions of Office, they include at least basic support to preserve the new citation logic, and perhaps a separate plug-in that provides basic GUI support that would allow compatibility with Word 2007.

I cannot emphasize enough how important it is that this stuff be standardized within document formats and included within editing applications. It’s critical, and the sad state of the current market is a direct consequence of the fact that it is not. So I’d emphasize again that I think there’s tremendous promise in this approach, and that it is just in need of some refinement.

Multi-Reference Citations in Word 2007

Posted in Uncategorized on June 13th, 2006 by darcusb – Comments Off

As I was looking at the way Word 2007 now implements citation coding, I started to worry. Using a token to represent the reference information is awkward with a single reference, but what happens if you need to include multiple; e.g. (Doe, 1999; Smith, 1998)? Did they not include this option?

My tech guy got me access to the Word 2007 beta, so I just checked for myself. It turns out Word 2007 does allow multiple citations. However, it is quite limited at the moment, not allowing you to order the fields, or in any way modify the formatting. This would be a deal-breaker for many scholars, including me.

Moreover, the formatting system does not properly implement author-year styles, including APA. If you read this excellent overview of APA, for example, it says:

If you are citing more than one work from the same year, use the suffixes “a,” “b,” “c” etc., so that your reader can differentiate between them (these suffixes will correspond to the order of entries in your references page)

Now, observe how Word’s XSLT-based formatting system addresses this issue:

Instead of properly adding suffixes, it instead adds the titles to disambiguate. No journal would accept this sort of gross error.
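For comparison, the suffix rule quoted above is easy to state procedurally. This is a minimal Python sketch, not taken from Citeproc or from Word: the reference tuples and the function name are invented, and it assumes the bibliography is sorted by author, then year, then title.

```python
from collections import defaultdict
from string import ascii_lowercase


def year_suffixes(references):
    """Map each reference key to its year label, e.g. '1999a'.

    `references` is a list of (key, author, year, title) tuples.
    Suffixes follow bibliography order, assumed here to be
    author, then year, then title. Handles up to 26 works per
    author-year group.
    """
    ordered = sorted(references, key=lambda r: (r[1], r[2], r[3]))
    groups = defaultdict(list)
    for key, author, year, _title in ordered:
        groups[(author, year)].append(key)

    labels = {}
    for (_author, year), keys in groups.items():
        if len(keys) == 1:
            labels[keys[0]] = str(year)  # unique author-year: no suffix
        else:
            for suffix, key in zip(ascii_lowercase, keys):
                labels[key] = f"{year}{suffix}"
    return labels


refs = [
    ("doe-b", "Doe", 1999, "Zebras"),
    ("doe-a", "Doe", 1999, "Aardvarks"),
    ("smith", "Smith", 1998, "Citations"),
]
labels = year_suffixes(refs)
```

The catch is that the suffix for one citation depends on every other reference in the document, which is easy with mutable dictionaries like these and much more awkward to express in XSLT 1.0.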

Hmm, this is one of the reasons I was saying that using XSLT 1.0 to format citations properly is difficult. Citeproc handles this all correctly, incidentally, and it took me a long time to figure it out, with a lot of help from XSLT experts like Michael Kay and Jeni Tennison. I still haven’t implemented another little wrinkle, which is to disambiguate multi-reference citations where two different authors share the same last name.

Now, what about the XML?

So, yes, it supports multi-reference citations, but only in the most awkward of ways. Everything is crammed into a single, largely opaque, attribute value.

I just discovered that Word does allow local style modification. Oops; more below.

… so wait, I was wrong; Word does allow editing of local citations. Here’s the contextual menu:

… and here is the actual dialog.

In general, this is well-done. However, it makes a problematic assumption that a user will only ever be using page numbers to identify a location within a document. In law and history, they often use paragraph, and even line, numbers. And sometimes they are combined. This is why in ODF we have separate elements (cite:detail) to encode this. They also need to add an option to suppress automatic ordering.
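A small Python sketch of what typed, combinable locators buy you. It is only an illustration of the idea behind cite:detail, not the ODF markup itself; the `Detail` class, the `kind` values and the abbreviations are all assumptions.

```python
from dataclasses import dataclass

# Illustrative labels only; a real citation style would supply its own.
LABELS = {"page": "p.", "paragraph": "para.", "line": "l."}


@dataclass
class Detail:
    kind: str   # "page", "paragraph", or "line"
    value: str  # kept as text so ranges like "12-15" work


def format_details(details):
    """Render one or more locators, in the order given."""
    return ", ".join(f"{LABELS[d.kind]} {d.value}" for d in details)


cited = [Detail("page", "23"), Detail("paragraph", "4")]
```

Because each locator carries its own type, `format_details(cited)` can render a combined page-and-paragraph reference, which a single page-number field cannot represent at all.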

Here is what happens when you save the file:

I can only assume that the \m and \s flags are used to suppress output.

Wither Endnote?

Posted in Uncategorized on June 10th, 2006 by darcusb – Comments Off

One obvious question is, what does the new citation and bibliographic support in Word 2007 mean for Endnote? Short answer: the final end to a slow death that began when ISI bought the company that originally produced it.

Somewhat longer explanation:

By standardizing the details of citation and bibliographic coding in Open XML, Microsoft takes away any technical advantage Endnote had from lock-in. In theory, any application should now be able to serve as a database for Word. So the playing field is leveled; ISI will now have to compete on the merits of their product. Will see if they use the new support in Word 2007 in their latest version (out this month), or if for some reason they simply continue to implement their legacy non-standard approach.

The problem for them, I think, is that reference management is increasingly moving to the web. My library catalog can now automatically load a bibliographic record into a RefWorks account, which I can then access from any web browser. This is really useful to me, and it’s a trend that will only accelerate over time.

Endnote users often trumpet the ability to search catalogs from within the app, but a) I really don’t think this is that hard to do (there’s open source code to do it from Index Data), and b) it’s not where the future is. As a user, I want to load reference data from my web browser. Also, Microsoft has the Research Pane tool, which should be able to be configured to plug into any number of library web services.

So why would a user spend $100/year to stay up-to-date with a limited, buggy product that is bound to the desktop?

Of course, it seems ISI recognizes the need for a response, and this press release states it is coming in the form of a web-based version of Endnote. I won’t be using it, instead sticking with RefWorks for the short-term, and hoping to contribute to better open source solutions long-term. Ruby on Rails + AJAX + a few smart designers and coders really ought to constitute a killer alternative.

It’s not (just) about file formats

Posted in Uncategorized on June 9th, 2006 by darcusb – Comments Off

Regarding the presumed forthcoming Google Office, Ken Fisher makes the following argument:

  1. free and open source productivity suites have been a failure; witness OpenOffice
  2. Microsoft’s monopoly is a function of its file formats
  3. Google is not investing in OOo because of this, and is instead investing in a web-based approach whose prime advantage is its OpenDocument file format
  4. success in the future = the company that best exploits ODF

I think all of this is only half-true.

First, the problem with OOo is organizational. It is an open source effort that is not fully open. It is like Mozilla before AOL spun it off; benefitting from the support of its sponsor, but also stifled by it. OOo is not more successful because, quite frankly, it’s not good enough, and it lacks the organizational structure to encourage investment, both of cash from large corporations and of time from developers. My guess is that this is why you see IBM and Google going their own way. It’s a rather pathetic state of affairs when Microsoft will end up implementing the vision of the OpenOffice bibliographic project (of which I am co-project lead) before OOo!

So those that are thinking that ODF will save the day are deluding themselves. Open standards are important, even critical, but they are no magic bullet. And Microsoft’s ECMA submission fundamentally changes the game here, and I have every belief that ISO will also approve it.

What will save the day is for companies and free software communities to produce better software. I’m frankly skeptical this is coming any time soon, as every single would-be Office competitor I have seen—from OOo to Writely—seems to be chasing MS’s taillights. Nowhere do I see people rethinking what productivity should be in the 21st century. MS has excelled by always being just good enough, and they’re continuing that even while giving up their file format lock-in.

What I’d really like to see is for Sun to spin-off OOo, and announce—along with IBM, Google and Novell—a large infusion of cash for the independent organization. I’d like to see that organization in turn work closely with Mozilla, first to figure out what they did right, and second how to transition advanced ODF-based productivity to the web.

As for Google Office: it won’t be enough, Google, to just put the same boring productivity applications we’ve seen for the past two decades on the web.