darcusblog » 2004 » September - geek tools and the scholar

Archive for September, 2004

Download News Script

Posted in Uncategorized on September 28th, 2004 by darcusb – Comments Off

With help from various places (the new Ruby Forum, the author of the new Tidy package for Ruby, Matthias; thanks all!), I’ve now got the following script for my download issue.

So, I have an AppleScript that I invoke from NetNewsWire to append the selected article's title, URL, and publication date to a YAML index file.

tell application "NetNewsWire"
    set article_url to (URL of selectedHeadline)
    set article_title to (title of selectedHeadline)
    set article_pubdate to ((date published of selectedHeadline) as string)
end tell

-- Save the current delimiters so they can be restored afterwards
set delim to AppleScript's text item delimiters

-- Get the cleaned URL: keep only the part before the query string
set AppleScript's text item delimiters to "?"
set theUrl to text item 1 of article_url

set AppleScript's text item delimiters to delim

set newline to ASCII character 10

-- Build one YAML list entry for the selected article
set theData to (newline & "-" & newline & " title: " & article_title & newline & " url: " & theUrl & newline & " pubdate: " & article_pubdate) & newline

set theFilePath to ((path to documents folder) & "News:downloads:downloadIndex.txt") as string

-- Append the entry to the end of the index file
set fileRef to open for access theFilePath with write permission
set fileEOF to get eof fileRef
write theData to fileRef starting at (fileEOF + 1)
close access fileRef

The following Ruby script then reads that file, downloads the HTML file from each URL, runs Tidy on it to convert it to XHTML, and then runs an XSLT stylesheet on that to create the final (very clean) output (including passing the URL as a parameter to saxon for insertion in the document header).

It could no doubt use much more work (error handling, support for different sites, etc.), but it’s a good start. It uses standard Ruby libraries, except for the Tidy package, which is new.

$TIDYLIB = '/usr/local/lib/libtidy.dylib'
require 'tidy'
require 'yaml' 
require 'open-uri' 


datafile = 'downloadIndex.txt'
tidyconfigfile = 'tidyconfig.txt'

# Fetch pages
Tidy.open do |tidy|
  tidy.load_config(tidyconfigfile)
  YAML.load(File.open(datafile)).each do |doc|
    title, url = doc.values_at('title', 'url')
    # Build file names from the title with all whitespace removed
    name = title.gsub(/\s/, '')
    file = name + '.html'
    file_xhtml = name + '.xhtml'
    # Adjust the URL per site to reach the printer-friendly page
    uri = url.dup
    uri = '' + url if url =~ /bbc/
    uri << '?hp=&pagewanted=print' if url =~ /nytimes/
    File.open(file, 'w') do |article|
      puts "\nbegin processing \"#{title}\" ..."
      page = open(uri, 'referer' => '') { |io| io.read }
      article.write(tidy.clean(page))
    end
    # Transform the tidied file with saxon, passing the original URL as a parameter
    `saxon -o #{file_xhtml} #{file} ../clean.xsl url="#{url}"`
    puts "... finished processing \"#{title}\""
  end
  puts "\ndone.\n\n"
end
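For reference, here is a minimal sketch of what the index file looks like once the AppleScript has appended a couple of entries, and how YAML.load turns it into the list of hashes the script loops over. The titles, URLs, and dates are invented for illustration; only the title/url/pubdate keys come from the script above.

```ruby
require 'yaml'

# Two entries in the shape the AppleScript appends (sample values are invented).
index = <<EOL

-
 title: Example Story One
 url: http://example.com/2004/09/28/one.html
 pubdate: Tuesday, September 28, 2004 9:00:00 AM

-
 title: Example Story Two
 url: http://example.org/news/two.html
 pubdate: Tuesday, September 28, 2004 10:30:00 AM
EOL

entries = YAML.load(index)

entries.each do |doc|
  title, url = doc.values_at('title', 'url')
  # Same file-naming rule as the download script: strip whitespace from the title.
  name = title.gsub(/\s/, '')
  puts "#{name}.html <- #{url}"
end
```

Each "-" starts a new list item, so appending to the file is all it takes to queue another article.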

OpenOffice and ISO

Posted in Uncategorized on September 28th, 2004 by darcusb – Comments Off

Tim Bray outlines the Sun response to recent conclusions from the European Commission on document standards. In short, the OOo OASIS TC is now on its way towards ISO submission, and the next version of both the commercial StarOffice package and the open source OpenOffice will include filters for WordML and ExcelML. That the OASIS TC recently approved the inclusion of a new citation schema makes this all the more interesting.

Tim offers the following quote from the commission, which I quite like:

Transparency and accessibility requirements dictate that public information and government transactions avoid depending on technologies that imply or impose a specific product or platform on businesses or citizens.

Alas, I can’t tell you how often people send me—or insist I send them—.doc files. It’s really discouraging. I just deleted an email from someone wanting me to submit information in a .doc file after telling them four or five times I don’t accept them. Sigh …


Posted in Uncategorized on September 21st, 2004 by darcusb – 8 Comments

Sometimes people misunderstand my objection to bibtex and advocacy of XML. While I do believe XML and associated tools have tremendous advantages for the sort of data and workflow I’m interested in, my objection to bibtex goes beyond simply thinking XML is a better markup language. The problem with bibtex is really the data model.

To illustrate, here’s what a compact non-XML bibliographic representation could look like:

> bib = <<EOL
" -
"   id: doe2000
"   title: A Book Title
"   creator: Doe, Jane; Smith, John
"   genre: book
"   origins:
"      year: 2000
"      publisher: ABC Books
"      place: New York
" -
"   id: smith2001
"   title: Article Title
"   creator: Smith, John
"   genre: article
"   container:
"     title: Journal of ABC
"     origins:
"       date: 2001-11
"     parts:
"       volume: 21
"       issue: 3
"       pages: 23-34
"     genre: academic journal
> YAML.load bib
=> [{"creator"=>"Doe, Jane; Smith, John", "title"=>"A Book Title", "id"=>"doe2000", 
"origins"=>{"place"=>"New York", "publisher"=>"ABC Books", "year"=>2000}, "genre"=>"book"}, 
{"container"=>{"title"=>"Journal of ABC", "parts"=>{"issue"=>3, "pages"=>"23-34", "volume"=>21},
 "origins"=>{"date"=>"2001-11"}, "genre"=>"academic journal"}, "creator"=>"Smith, John",
 "title"=>"Article Title", "id"=>"smith2001", "genre"=>"article"}]

Note: this is not a serious attempt; it took me a few minutes. Still, it shows you can represent a MODS/DC-like structure in a compact syntax (YAML) which languages like Ruby (which I’m using above) support out of the box.
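To make the data-model point a bit more concrete, here is a rough sketch of code that consumes the nested structure. The published_in helper and its output format are my own invention for illustration, not part of any citation tool; the point is only that an article's journal is a record of its own (the container) rather than a handful of flat fields on the article.

```ruby
require 'yaml'

bib = YAML.load(<<EOL)
-
  id: doe2000
  title: A Book Title
  creator: Doe, Jane; Smith, John
  genre: book
  origins:
    year: 2000
    publisher: ABC Books
    place: New York
-
  id: smith2001
  title: Article Title
  creator: Smith, John
  genre: article
  container:
    title: Journal of ABC
    origins:
      date: 2001-11
    parts:
      volume: 21
      issue: 3
      pages: 23-34
    genre: academic journal
EOL

# Hypothetical helper: describe where a record was published.
# Records with a container report the container's title and parts;
# standalone records fall back to their own place and publisher.
def published_in(record)
  container = record['container']
  if container
    parts = container['parts'] || {}
    "#{container['title']} #{parts['volume']}(#{parts['issue']}): #{parts['pages']}"
  else
    origins = record['origins'] || {}
    "#{origins['place']}: #{origins['publisher']}"
  end
end

bib.each { |record| puts "#{record['id']}: #{published_in(record)}" }
```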

Citation Style Language and GUIs

Posted in Uncategorized on September 19th, 2004 by darcusb – Comments Off

As I’ve been working on a citation style schema for use with my XSLT stylesheets, I’ve had a few primary goals. Among them, it should be:

  1. much easier to use than other citation style languages like bibtex’s .bst files
  2. able to support the most demanding of citation needs
  3. designed in such a way that a simple GUI could be built around it

With respect to the third goal, I strongly believe the success of much of what I’m involved in will depend on a really good citation style language, and on an easy-to-use web repository that makes it simple to access, create, and edit such style files.

To test my design a bit, I put together a simple mockup that shows how the style language might work in the context of a GUI. This is not a complete interface, and is completely non-functional; it just shows what I am envisioning an interface might look like.

In terms of the logic on which the interface is based, I split reference “types” into “class” and “type.” Within each (abstract) class, I require a definition. Therefore, for any given style, one must create definitions for the following types: article, chapter, and book. There are then extensible lists that allow one to have a definition that optionally can inherit from the required types within the class.

For example, say I have a style that for some bizarre reason (yes, citation styles are often bizarre) requires magazine articles to be formatted slightly differently than generic articles. I can then create a definition for “article-magazine”, inherit from the base-type within the class (article), and only indicate the specific piece that needs to be changed.
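The inheritance idea can be sketched in a few lines of Ruby. This is only an illustration of the logic, not the schema itself: the setting names (title_format, date_position, and so on) and the "inherits" key are made up for the example.

```ruby
# Hypothetical style definitions; the keys and values are invented for
# illustration and are not taken from the actual schema.
definitions = {
  'article' => {
    'title_format'     => 'quoted',
    'container_format' => 'italic',
    'date_position'    => 'after-creator'
  },
  'article-magazine' => {
    'inherits'      => 'article',
    'date_position' => 'end'
  }
}

# Resolve a definition by layering its own settings over those of the
# base type it inherits from (if any).
def resolve(definitions, name)
  definition = definitions.fetch(name)
  base_name = definition['inherits']
  base = base_name ? resolve(definitions, base_name) : {}
  base.merge(definition.reject { |key, _| key == 'inherits' })
end

magazine = resolve(definitions, 'article-magazine')
puts magazine.inspect
```

The magazine definition only states the one setting that differs; everything else comes from the required base type.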

One issue I struggled with was internationalization. I decided to have a root-level xml:lang attribute, and to simply assume all styles are language-specific. This may mean large numbers of instances for a given style, but that seems a better approach than introducing additional complexity into the spec. On the other hand, I have changed the structure a bit to allow me to internationalize within the individual styles if need be. Any opinions?

The Mac Community and Free Software

Posted in General on September 16th, 2004 by darcusb – Comments Off

There are times when I think the old-time Mac community is utterly clueless about the benefits and opportunities of free software. While there are a lot of Mac users in the free software community these days, it seems most of them migrated over from Linux.

Witness a thread at Macintouch on database solutions for a guy interested in doing scientific development. Let’s see: out of roughly 25 recommendations, I count three mentions of MySQL or PostgreSQL. Everything else is one commercial product or another.

I don’t get it.

OpenOffice TC Approves Citation Proposal

Posted in Uncategorized on September 16th, 2004 by darcusb – Comments Off

The OASIS OpenOffice Technical Committee has approved a proposal, written by Daniel Vogelheim and me, to dramatically improve citation support in the XML file format. The proposal builds on the new biblioref element in DocBook, but extends it both to be more general and to work within the context of WYSIWYG applications like word processors. It allows coding that will support the following features (if applications provide them):

  1. semantic coding of complex point citation details like (Doe, 1999: 1, 3-4), or (Doe, 1999: part 3, paragraph 2).
  2. ability to seamlessly switch between, for example, footnote and in-text citation styles

Because the code is in the form of a (tiny) standalone namespaced schema, it could be used really in any XML format, including WordML. It will take time to see all of the benefits reflected in workable software products, but it’s an important step forward.

Draft version here.

Downloading News?

Posted in Uncategorized on September 14th, 2004 by darcusb – 7 Comments

Anyone out there know of any simple scripts (preferably Python or Ruby, or maybe Bash, or even Applescript) that I can use to download articles from a password-protected site like the New York Times?

I want to download the “printer” version of the files, run Tidy on them to convert to XHTML, and then run an XSLT stylesheet I’ve written to clean them up further (for, say, creating MODS records, or dumping in an XML DB and querying them).

I’m not exactly sure of my ideal workflow. Perhaps invoking a script from a menu within my newsreader? I have feeds for the New York Times; so that seems the simplest and quickest route. Still, a commandline interface is totally fine as well.

update: I found this on using wget to grab stuff from sites like the New York Times.

update 2: With a little help from Brent Simmons (author of my newsreader), I’ve got an Applescript that takes the URL for the highlighted article, and then passes it to wget for download. OK, good. The problem is that the URL from the feed has a bunch of cruft added to the end that doesn’t allow me to get the specific (printer-friendly) page I want. Does anyone have any idea how to take the URL as variable, but to remove everything after the “.html” in Applescript?
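For comparison, the same trimming is easy to sketch in Ruby (one of the languages I mentioned above). The strip_after_html name and the sample URL are invented for illustration.

```ruby
# Keep everything up to and including the first ".html"; return the URL
# unchanged if it contains no ".html" at all.
def strip_after_html(url)
  url[/\A.*?\.html/] || url
end

# Sample feed URL, invented for illustration.
feed_url = 'http://www.example.com/2004/09/14/story.html?ex=12345&partner=rssnyt'
puts strip_after_html(feed_url)
```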

WordML and XSLT

Posted in Uncategorized on September 10th, 2004 by darcusb – Comments Off

On the xsl list, David Carlisle posted a link to a useful new article on using WordML and XSLT.

New nXML Release

Posted in Uncategorized on September 8th, 2004 by darcusb – Comments Off

James Clark took a break from his software projects for a while, but is now back with a new release of nXML, his excellent RELAX NG-based schema-validating XML editor for Emacs. This one has some low-level changes to indenting support, as well as a nice new XML menu. A look in the TODO file shows an ambitious roadmap that includes support for XML Schema.

Understanding Metadata

Posted in Uncategorized on September 6th, 2004 by darcusb – Comments Off

A new document is available from NISO called Understanding Metadata. Covers MODS, TEI, METS and Dublin Core, among others. It’s a good overview (though the typesetting is pretty horrid; looks like it was created in MS Word).