Mike Kay on XSLT 2.0 performance tuning. Gotta come back to this when I have more time to tweak citeproc-xsl. Quite a few of the solutions to some really difficult problems there, BTW, were courtesy of Mike by way of the XSL list.
Archive for July, 2006
I just stumbled on this analysis of citeproc. Alas, it requires I subscribe to Passport to leave a comment, so I thought I’d instead post a quick update on progress related to citeproc and CSL, since public documentation is rather dated.
My focus has always been on CSL itself, with the XSLT 2.0 implementation as a solid proof-of-concept. To my mind the promise of CSL is really language- and document-format-independent citation styles.
I have been struggling my way to the 1.0 finish line with the schema, trying to finish some tricky features and, wherever possible, to simplify and rationalize the logic so it is easy for style authors and developers to work with.
When they release the extension sometime in the next few months, expect it to support CSL out of the box for citation style configuration.
Alongside that, Johan Kool jumped in and decided to work out a Python version. With the three of us working on design details of CSL, each using a different language, we’ve managed to make a lot of progress in resolving some of the more difficult problems. We are targeting a pre-1.0 test release sometime in the next couple of weeks, and then a final 1.0 freeze early September.
At that point, hopefully things get more interesting, and I can sit back and watch how others make use of CSL.
- the “info” metadata element uses the same content modeling as that in Atom
- the whole thing is designed on a consistent inheritance model that makes it simple to do the common stuff, and possible to do more complicated customization
- in part as a result, the data field and template models are simpler; constructing GUI editors ought to be easier
- I finally figured out how to support complex internationalization options without making things more complicated for those who don’t need them
- at the request of Matthias Steffens from the RefBase project, we figured out a fairly elegant localization approach
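To illustrate that first point about the “info” element, here is a rough sketch of Atom-style metadata built with Python’s ElementTree. The element names follow Atom (RFC 4287); using them inside a CSL info block is illustrative only, not the final schema:

```python
import xml.etree.ElementTree as ET

# A hypothetical sketch: CSL "info" metadata reusing Atom's content model.
# Element names come from Atom (RFC 4287); the "info" wrapper is illustrative.
ATOM = "http://www.w3.org/2005/Atom"
ET.register_namespace("atom", ATOM)

info = ET.Element(f"{{{ATOM}}}info")
ET.SubElement(info, f"{{{ATOM}}}title").text = "Chicago Manual of Style (Author-Date)"
author = ET.SubElement(info, f"{{{ATOM}}}author")
ET.SubElement(author, f"{{{ATOM}}}name").text = "Jane Doe"
ET.SubElement(info, f"{{{ATOM}}}updated").text = "2006-07-15T12:00:00Z"

xml = ET.tostring(info, encoding="unicode")
print(xml)
```

The payoff of reusing Atom’s model is that any tool that already understands Atom metadata (titles, authors, update timestamps) can read a style’s metadata without learning anything new.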
It seems Microsoft is gearing up for yet another new anti-ODF FUD offensive, and Brian Jones is leading it. I find responding to every little detail tiresome, so will just address this point about how the standards are developed:
I think the key here is for everyone to just be clear on the goals. The ODF format is based on Sun’s StarOffice, and Open XML was based on the Microsoft Office formats. Both have the goals of being open, both have been submitted to standards bodies, and both have a commitment from the donating companies (Sun and Microsoft) that there will be no licensing restrictions and anyone is allowed to freely use the formats.
This is classic FUD: factually true enough, but false by way of omission.
If you want to understand the goals of ODF, just read the TC Charter. A few goals are conspicuously absent from Microsoft’s: friendliness to processing with standard XML tools, and the reuse of existing standards. I happen to think those matter to developers and ultimately users.
FWIW, I am on the ODF TC. But I have also given MS plenty of constructive comments on the way they are implementing citation support in OXML, because my interest is in promoting better solutions in general. I’d rather have two excellent open XML formats, than two weak ones.
Perhaps this will be a good test of how well the two standards processes work? My guess is none of my comments will have any effect on OXML.
Rob Weir, with yet more smart commentary on the MS ECMA spec and standards:
Now take a look at Chapter 23, VML, pages 3571-3795 (PDF pages 3669-3893). We see here 224 pages of “VML Reference Material”, which appears to be a rehash of the 1999 VML Reference from MSDN, and in this form it hides itself in a 4,081-page OOXML specification, racing through Ecma and then straight into ISO. Is this right? Should a rejected standard from 1998, be fast-tracked to ISO over a successful, widely implemented alternative like SVG?
Rob makes some good points about why using standards matters in a really practical sense (they are often technically superior because they’ve gone through extensive review, they have knowledge and tools built around them, etc.). I wonder how these issues relate to Rick Jelliffe’s discussion of the developer-friendliness of the two formats?
A colleague of mine, Palestinian-born geographer Ghazi-Walid Falah, has been detained by the Israeli state since July 9 without charge, and without access to a lawyer. His apparent alleged crime: taking pictures in a restricted area (a tourist beach).
Ghazi-Walid gave a lecture to my beginning students last Fall about the intersections of culture, globalization, and the Palestinian conflict. Not surprisingly for a geographer, he showed a lot of pictures, and the students seemed to get a lot out of the exchange.
Update: the court lifted the gag order on the case, and the story is covered here. It’s a sad state of affairs when people can be detained without apparent evidence, and without charge, on the mere suspicion of some connection to terrorism.
The ODF metadata subcommittee has wrapped up work on a draft [ODF, PDF] of the use cases document we will be submitting for approval to the OASIS ODF TC. The document lays out our vision of what this new support ought to make possible, authored as it was by a group that represents various areas: academia and research, medicine, law, real estate, and of course the software engineers who make it all happen. In many ways, we believe this will go beyond what MS offers in their custom schema support.
If we missed something, please send us your comments.
We will use this document to derive a set of requirements, and, once the ODF TC approves it, then move on to actual implementation details.
We deal with millions of Web masters who can’t configure a server, can’t write HTML. It’s hard for them to go to the next step. The second problem is competition. Some commercial providers say, ‘I’m the leader. Why should I standardize?’ The third problem is one of deception. We deal every day with people who try to rank higher in the results and then try to sell someone Viagra when that’s not what they are looking for. With less human oversight with the Semantic Web, we are worried about it being easier to be deceptive.
Ahem, in other words, “Google wants to be the center of the metadata and search universe, and there’s little need for Google-as-we-know-it if the semantic web is realized.”
The notion of a web-scale, decentralized, and distributed database is a challenging one, both for conventional thinking, as well as for the market position of those (like Google) who rose to the top based on a different approach.
I personally would rather have thousands of distributed nodes serving up RDF scholarly metadata, than rely on Google for pulling it all together for me. I have more confidence in that vision than I do, for example, in Google Scholar or MS Academic Live.
Two comments from people at Microsoft on the suggestion (from me and others) that they join the OpenDocument Technical Committee to help ease interoperability gaps in the two formats going forward; first Brian Jones:
I think there are still plenty of ways we can help out the OASIS folks with the ODF format. The entire translator project is open source, so the conversion will be completely transparent and everyone will have the ability to benefit from what we discover as the transformations are built. In addition to that, as I’ve looked through our Ecma documentation, I’ve also been looking at the ODF spec as a point of comparison. As I come across areas that are either missing, or just not fully specified, I’ll be sure to point them out on my blog. That should help them in creating a list of areas to improve.
On one hand, this sounds quite generous. To this I say, sure Brian, that’d be great.
But if you parse the language (and my career is doing just that) it reflects the arrogance of a company that has for too long gotten by on the weight of its own monopoly position. Note: he does not acknowledge that MS might learn something from the experience (see below), and that OXML might be better for it. Likewise, he doesn’t acknowledge that OXML has already borrowed from ODF; for example, in its zipped package file structure.
Now, here’s Dare commenting on Brian’s post:
Unfortunately, the ODF discussion has seemed to be more political than technical which often obscures the truth. Microsoft is making moves to ensure that Microsoft Office not only provides the best features for its customers but ensures that they can exchange documents in a variety of document formats from those owned by Microsoft to PDF and ODF.
Make no mistake: there is something “political” in this position that MS is staking out, which seems to be:
- see, we are just as open as ODF?
- but ODF is a weak spec that pales in comparison to the technical excellence of Open XML
- MS is giving the people what they really want, which is file format support; witness the new BSD licensed ODF plug-in for Office
IBM’s Rob Weir is starting to pay some careful technical attention to these sorts of details. In his latest, he argues the heavy weight of OXML is going to introduce serious implementation, and thus interoperability, problems.
He makes the case via the 50+ pages the spec devotes to an obscure feature: page art borders. Yes, the spec actually includes these details! And as Rob points out, this sort of functionality is quite culturally-specific.
The images are heavily weighted to Western even Anglo-American celebratory icons, things like gingerbreadmen for Christmas or slices of Birthday cake, pumpkins for Halloween, or images of Cupid for St. Valentines day, or globes which are neatly centered on the United States.
Rob argues this is a perfect example of over-the-top spec bloat that will make implementation awkward for anyone but MS. Moreover, Rob actually provides an elegant alternative suggestion.
All of these problems (spec bloat, cultural bias, non-extensibility, copyright concerns) can be solved by one simple mechanism. Instead of having ST_Border be a fixed enumerated set of values, have it include only a small number of trivial values like the basic line styles, and have everything else (all of the Art Borders) be stored as a separate image file in the document archive.
Brian, you listening?
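Rob’s mechanism is simple enough to sketch in a few lines: keep the enumeration down to basic line styles, and resolve any other value to an image stored in the document package. The style names and package path here are illustrative, not taken from either spec:

```python
# A sketch of Rob Weir's suggested mechanism: ST_Border holds only a small
# set of basic line styles, and every art border is just an image part in
# the document archive. Names and the package path are illustrative.
BASIC_BORDERS = {"none", "single", "double", "dashed", "dotted"}

def resolve_border(value):
    """Return ("style", name) for a basic line style, or ("image", path)
    for an art border shipped as an image in the package."""
    if value in BASIC_BORDERS:
        return ("style", value)
    # Anything else is looked up in the archive rather than being
    # hard-coded into the spec's enumeration.
    return ("image", f"borders/{value}.png")

print(resolve_border("single"))
print(resolve_border("gingerbread"))
```

The design point: the spec stays small, implementations stay simple, and new (or less Anglo-American) art borders become a content problem rather than a standards problem.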
Meanwhile, I have extensively pointed out where MS has fallen down in their new citation support. They have invented their own source format, have ignored library communications standards, and appear to be using critical citation coding that will be impossible for standard XPath-based XML tools to process. Some of this has implications for the file format, and I’ve yet to see any serious concern about the issues out of Redmond.
Despite what it might seem, my position on these matters isn’t blindly political. I believe in open standards because I think in the end they yield better results for end users. I expect to prove that with the citation use case, but I really do want to raise the bar for academic end users all around. Enhancing interoperability between ODF and OXML is an important part of that, and both groups can learn from each other.
Jennifer Michelstein, the Microsoft program manager for academic features, has posted the first of a series of blog entries on the new citation support in Word 2007. In comments, we have been going back-and-forth on a few issues of concern.
Out of that conversation, I conclude:
- version 1 will not support the footnote/endnote style citations common in the humanities
- it seems (though I’ve not confirmed it) they don’t support first/subsequent citations in author-year styles
- they will provide an SDK to connect remote databases, though no evidence that they are even aware that there are well-deployed existing standards (Z39.50, and the more modern SRU and SRW equivalents) in this space
- they think it quite fine to leave it to third-parties to provide different styles for users, in XSLT
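On that standards point: SRU in particular is about as simple as web protocols get. A searchRetrieve request is just an HTTP GET carrying a CQL query, which any client can construct. A sketch, assuming a hypothetical library endpoint:

```python
from urllib.parse import urlencode

# A sketch of an SRU 1.1 searchRetrieve request: plain HTTP GET with a CQL
# query string. The endpoint URL is hypothetical; real library servers
# expose the same parameters.
base = "http://catalog.example.edu/sru"
params = {
    "version": "1.1",
    "operation": "searchRetrieve",
    "query": 'dc.title="citation" and dc.creator="Doe"',
    "maximumRecords": "10",
}
url = base + "?" + urlencode(params)
print(url)
```

The response is XML, so the same standard tools that process the rest of a document workflow can handle the records; that is precisely why ignoring these standards in favor of a private SDK seems like a missed opportunity.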
As I said in the comments, I think the last point is particularly short-sighted. It will be really hard for anybody but XSLT experts to write good styles, and they will be specific to Word. Moreover, the style files will be huge, and difficult for users to install (impossible in some cases, if they don’t have appropriate installation rights).
Nevertheless, it does mean there’s plenty of room for someone to swap out the existing raw XSLT approach and replace it with my (I think much better) citeproc alternative. And I’ve been talking to M. David Peterson about just that.
Mark had an idea that I think may well be brilliant: use Atom to do much of the metadata and distribution work. It turns out that the current metadata element in CSL is almost exactly (in fact, by design) the same as the Atom metadata content. Moreover, the rest of a CSL file could be easily embedded in the atom:content element, and then individual entries linked together into feeds.
So what if, then, users never had to worry about installing style files? They would just subscribe to one or more feeds in their areas. If a new style they wanted appeared, they’d click a link and it would be automatically installed. If an updated version of an already-installed style showed up, the local version would be automatically updated.
Finally, because CSL is designed to be document-format agnostic, the same files could be used by users of any authoring solution: Word, OpenOffice, Writely, LaTeX, DocBook, web applications, etc.
There’d still be details to work out (I’d really like to allow distributed repositories), of course, but doesn’t this seem much better than the current MS approach?
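The distribution idea above is easy to sketch: a CSL style rides inside an Atom entry, with the style’s metadata doing double duty as the entry metadata. The CSL namespace URI and element names here are hypothetical placeholders:

```python
import xml.etree.ElementTree as ET

# A sketch of CSL-over-Atom distribution: the style travels inside an
# atom:entry, and its metadata doubles as the entry metadata.
# The csl namespace URI and element names are hypothetical.
ATOM = "http://www.w3.org/2005/Atom"
CSL = "http://example.org/ns/csl"  # placeholder, not a real namespace
ET.register_namespace("", ATOM)
ET.register_namespace("csl", CSL)

entry = ET.Element(f"{{{ATOM}}}entry")
ET.SubElement(entry, f"{{{ATOM}}}title").text = "APA 5th Edition"
ET.SubElement(entry, f"{{{ATOM}}}updated").text = "2006-07-20T09:00:00Z"
content = ET.SubElement(entry, f"{{{ATOM}}}content", {"type": "application/xml"})
ET.SubElement(content, f"{{{CSL}}}style")  # the full CSL body would go here

feed_entry = ET.tostring(entry, encoding="unicode")
print(feed_entry)
```

A client then needs nothing beyond an ordinary feed reader’s machinery: poll the feed, compare updated timestamps against installed styles, and pull the new atom:content when something changes.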
CiteULike’s Richard Cameron has posted an interesting outline of a plan to rewrite the code in Python, called for now PyULike. Meanwhile, last week I heard from one of the developers of a really interesting new in-development Firefox plug-in called Firefox Scholar (we are talking about integrating my CSL language for citation processing, as well as import/export formats). Each attempts to solve very real problems for scholars, researchers, and students, but in quite different ways.
The problems are:
- How can you best integrate reference management seamlessly into modern web-focused research workflows? As a user, I spend a lot of time working with documents sourced from the web, so why should I then have to open a desktop application and manually enter reference data?
- How can one exploit the web and its network effects to allow users to benefit from the social aspects of reference management? It’s really hard to keep up with new work in my own field, let alone affiliated ones, so why can’t my reference management solution give me hints once in a while based on what others with like interests are reading?
Now, how do they solve these problems?
PyULike, like its predecessor, is based on a fully-centralized model. To quote Richard:
Previously I’ve resisted releasing or “open sourcing” code for the site for reasons which I outline on the site’s FAQ. Briefly, these are that I wish to prevent fragmentation of the userbase among a thousand private installations of the CiteULike software…. The benefits of keeping things centralised is that we keep the community effects. Users find others who are reading the same material, and they find papers serendipitously which they wouldn’t otherwise.
Firefox Scholar (aka SmartFox) is based on a slightly different, more distributed, model. Reference data will be stored locally, within Firefox 2.0’s embedded SQLite database. One will be able to extract references from pages one is browsing, or manually enter and edit references within Firefox.
They then plan to add the ability to sync that data with a centralized server to provide similar sorts of social networking support. Moreover, it will be fully open-sourced, under a GPL license.
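The local-storage half of that model is straightforward; here is a sketch using Python’s sqlite3 module. The schema is entirely hypothetical, purely to illustrate what “references in an embedded SQLite database” looks like:

```python
import sqlite3

# A sketch of local reference storage along the lines SmartFox describes:
# an embedded SQLite database holding extracted references. The schema
# here is hypothetical, not SmartFox's actual one.
conn = sqlite3.connect(":memory:")  # the browser would use a file in the profile
conn.execute("""
    CREATE TABLE refs (
        id INTEGER PRIMARY KEY,
        title TEXT NOT NULL,
        creator TEXT,
        year INTEGER,
        source_url TEXT
    )
""")
conn.execute(
    "INSERT INTO refs (title, creator, year, source_url) VALUES (?, ?, ?, ?)",
    ("A Hypothetical Paper", "Doe, Jane", 2006, "http://example.org/paper"),
)
conn.commit()

rows = conn.execute("SELECT title, year FROM refs WHERE year >= 2000").fetchall()
print(rows)
```

Syncing then becomes a matter of diffing rows against a server copy, which is exactly where the centralized/distributed design choice discussed below starts to bite.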
OK, but as a user and developer, I’m not so sure I want to be left with such discrete, all-or-nothing choices. Why couldn’t I, for example, use SmartFox locally in my browser, but have it sync with PyULike’s server? More importantly, I don’t accept the notion that a centralized server and social networking are mutual requirements. Can we not realize the same vision for these tools in a more distributed fashion, with RDF and SPARQL, or Atom?
Finally, I’ll reiterate the point I’ve repeatedly made: we need to get this stuff integrated within the desktop and publishing workflow. If I’m using PyULike or SmartFox (or both) I really need to be able to easily integrate my citations into Word or OpenOffice. MS is already adding the infrastructure to allow this in Word 2007, and we are trying hard to make the same happen at OpenOffice. Until that happens, only part of the puzzle is in place.
So how about some collaborative discussion among these projects so that we can have real interoperability, not only between these projects, but also between them and OpenOffice and Word? Maybe we could even settle on compatible licenses so that we can share code where appropriate.