Re: [BOAI] Formats for electronic dissemination

From: "Michael J. O'Donnell" <michael_odonnell AT>
Date: Mon, 27 Oct 2003 12:15:10 -0600

Dario Taraborelli wrote:

> I would like to point out that a much more fundamental
> issue has been so far underestimated (at least to my knowledge): the
> question of the overall accessibility and interoperability of FORMATS for
> archived documents.
> This seems to be - at least prima facie - much more urgent than the
> problem of which formats allow full-text data mining
> (I confess that I don't thoroughly understand the problem with pdf's,
> since pdf documents can be indexed by search engines as easily as html
> documents: it doesn't look like an insuperable technical problem).

Of course, getting things archived is more important than the choice of 
format. But the format is important, too, as long as we don't let the 
choice delay archiving. I wrote about this in 1993:

Well-crafted PDFs are searchable for words and phrases, but PDF is 
inherently a format for describing page layouts, rather than a format 
for describing texts. E.g., textual structure (section headings, etc.) 
is much harder to pull from PDF than from HTML.
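
To make the contrast concrete, here is a minimal sketch in Python (standard library only) of pulling section headings out of HTML. The names HeadingCollector, extract_headings, and SAMPLE are hypothetical, and the sample document is invented for illustration; it is not taken from any real archive.

```python
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Collect (level, text) pairs for h1..h6 elements (illustrative only)."""

    HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.headings = []   # finished (level, text) pairs
        self._level = None   # level of the heading currently open, if any
        self._parts = []     # text fragments seen inside that heading

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADING_TAGS:
            self._level = int(tag[1])
            self._parts = []

    def handle_data(self, data):
        # Keep text only while a heading element is open.
        if self._level is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag in self.HEADING_TAGS and self._level is not None:
            # Join fragments and collapse runs of whitespace.
            text = " ".join("".join(self._parts).split())
            self.headings.append((self._level, text))
            self._level = None


def extract_headings(html_text):
    """Return the document outline as a list of (level, text) pairs."""
    collector = HeadingCollector()
    collector.feed(html_text)
    collector.close()
    return collector.headings


# Invented sample document, assumed for this sketch.
SAMPLE = """
<html><body>
<h1>Formats for Dissemination</h1>
<p>Introductory text.</p>
<h2>Page layout versus <em>text</em> structure</h2>
<p>More text.</p>
</body></html>
"""

print(extract_headings(SAMPLE))
```

The outline falls out directly because the markup names each heading. A PDF that lacks structure tagging carries no comparable markup, so a tool usually has to infer headings from font size and position on the page.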

My article emphasizes the data-structural qualities of different 
formats, because we can evaluate these relatively objectively and 
definitely. Many of the crucial long-term issues, such as indefinite 
readability, cannot be attacked directly, because they depend on unknown 
future developments. But a format that is well structured technically 
will be easier to convert in the future, and has an advantage in 
attracting a large enough user community to ensure that it will be 
converted when necessary.

Mike O'Donnell
The University of Chicago

