Budapest Open Access Initiative: BOAI Forum Archive

[BOAI] Formats for electronic dissemination

From: Dario Taraborelli <tarabore AT clipper.ens.fr>
Date: Mon, 27 Oct 2003 17:55:55 +0100 (MET)


Threading:      • This Message
             Re: [BOAI] Formats for electronic dissemination from harnad AT ecs.soton.ac.uk
             Re: [BOAI] Formats for electronic dissemination from michael_odonnell AT acm.org
             Re: [BOAI] Formats for electronic dissemination from radu AT monicsoft.net

[Apologies: slightly OT]

As a contribution to the discussion about appropriate formats for
self-archiving, I would like to point out that a much more fundamental
issue has so far been underestimated (at least to my knowledge): the
question of the overall accessibility and interoperability of FORMATS for
archived documents.
This seems to be - at least prima facie - much more urgent than the
problem of which formats allow full-text data mining
(I confess that I don't thoroughly understand the problem with pdf's,
since pdf documents can be indexed by search engines as easily as html
documents: it doesn't look like an insuperable technical problem).

As far as I know, the current version of Eprints allows users to deposit a
document in either open standards (like HTML, PDF, PS, DVI etc.) or
semi-open (like RTF) and proprietary formats (like DOC, PPT, XLS etc.).
Correct me if I'm wrong: I've never found in the OAI discussion lists any
clear statement about which formats are appropriate for electronic
archiving and which formats should be avoided. Still, there is a huge
debate in the digital library community about how to grant accessibility
and perennity of electronic content: one of the main recommendations is
that institutions involved in dissemination of electronic documents should
begin to strongly discourage the use of proprietary standards and
promote the use of accessible and public standards.
[See for instance UNESCO's Preliminary Draft Charter on the Preservation
of the Digital Heritage - http://www.knaw.nl/ecpa/PUBL/unesco.html]

Consider the following scenario: a growing number of electronic documents
are deposited in open archives in proprietary formats and one day the
software/plugin required for displaying such formats is suddenly no longer
available (this is already the case with older versions of existing document
formats). The result is that a considerable part of the online papers made
available through Open Access Archives will simply become *no longer
accessible* for technical reasons.

To put it another way, does it make sense to promote *toll-free access* to
electronic papers without considering the crucial but often ignored issue
of granting *format accessibility* to this content?

Do we have any statistics about formats used in open access archives?
Has this general issue ever been raised within the OAI community (if it
has, my apologies: could someone please give me some pointers)?  If not,
don't you find it urgent to think through this kind of problem?
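
A minimal sketch of how such format statistics might be gathered, assuming an archive can export a plain list of deposited file names. The grouping into open, semi-open and proprietary follows the split used in this message and is illustrative only; the function name `tally_formats` and the sample listing are hypothetical.

```python
from collections import Counter
from pathlib import PurePosixPath

# Illustrative grouping only, following the open / semi-open / proprietary
# split used in the message above; it is not an official classification.
FORMAT_CLASS = {
    "html": "open", "pdf": "open", "ps": "open", "dvi": "open",
    "rtf": "semi-open",
    "doc": "proprietary", "ppt": "proprietary", "xls": "proprietary",
}


def tally_formats(filenames):
    """Count deposited files per format class, keyed on file extension."""
    counts = Counter()
    for name in filenames:
        # Normalise ".PPT" and ".ppt" to the same key.
        ext = PurePosixPath(name).suffix.lower().lstrip(".")
        counts[FORMAT_CLASS.get(ext, "unknown")] += 1
    return counts


# Hypothetical deposit listing, invented for the example.
sample = ["paper1.pdf", "talk.PPT", "thesis.doc", "notes.html", "draft.rtf", "data.bin"]
for format_class, count in sorted(tally_formats(sample).items()):
    print(format_class, count)
```

A file extension is only a rough proxy for the actual format, so a real survey would probably need to inspect file contents as well.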


Best,

Dario



--
Dario Taraborelli

Institut Jean-Nicod
1bis, avenue de Lowendal
F-75007 Paris
+33 (0)1 53593294
www.institutnicod.org

taraborelli AT ens.fr



Re: [BOAI] Formats for electronic dissemination

From: Stevan Harnad <harnad AT ecs.soton.ac.uk>
Date: Mon, 27 Oct 2003 18:30:42 +0000 (GMT)


Threading: [BOAI] Formats for electronic dissemination from tarabore AT clipper.ens.fr
      • This Message
             Re: [BOAI] Formats for electronic dissemination from holl AT konkoly.hu

On Mon, 27 Oct 2003, Dario Taraborelli wrote:

> As far as I know, the current version of Eprints allows users to deposit a
> document in either open standards (like HTML, PDF, PS, DVI etc.) or
> semi-open (like RTF) and proprietary formats (like DOC, PPT, XLS etc.).
> Correct me if I'm wrong: 

This is not quite correct. If you mean Eprints.org, then you *must*
deposit a document in screen-readable form; this mostly means in one
of the open standards (including also TeX, and XML -- but I believe
PDF, which is acceptable, is proprietary). Then *in addition* you may
also attach a version in a non-screen-readable format. A document in
*only* a proprietary format like WORD is not accepted by the software.

> there is a huge
> debate in the digital library community about how to grant accessibility
> and perennity of electronic content

But self-archiving is not about perennity, just about accessibility,
because the self-archived versions of toll-access articles are merely
*supplements* to and not *substitutes* for the publisher's proprietary
toll-access version. Preservation concerns should accordingly be focussed
on the proprietary version until and unless the self-archived version
begins to be the *only* version (because of a transition to open-access
publishing).

Having said that: even this parallel form of supplementary access for
the have-nots needs to last long enough to keep providing the open
access until and unless there is a global transition to open-access
publishing. And that it does. Self-archived papers (in TeX, PDF, PS,
HTML, etc.) are still as accessible today as when they were first
self-archived, some over a decade and a half ago.

Please do not conflate digital preservation with open-access provision
-- at least not while the open-access cupboards are still bare and there
is next to nothing to preserve!

http://www.ecs.soton.ac.uk/~harnad/Hypermail/Amsci/0413.html
http://www.ecs.soton.ac.uk/~harnad/Hypermail/Amsci/2837.html

Stevan Harnad


Re: [BOAI] Formats for electronic dissemination

From: "Michael J. O'Donnell" <michael_odonnell AT acm.org>
Date: Mon, 27 Oct 2003 12:15:10 -0600


Threading: [BOAI] Formats for electronic dissemination from tarabore AT clipper.ens.fr
      • This Message

Dario Taraborelli wrote:

> I would like to point out that a much more fundamental
> issue has been so far underestimated (at least to my knowledge) : the
> question of the overall accessibility and interoperability of FORMATS for
> archived documents.
> This seems to be - at least prima facie - much more urgent than the
> problem of which formats allow full-text data mining
> (I confess that I don't thoroughly understand the problem with pdf's,
> since pdf documents can be indexed by search engines as easily as html
> documents: it doesn't look like an insuperable technical problem).

Of course, getting things archived is more important than the choice of 
format. But the format is important, too, as long as we don't let the 
choice delay archiving. I wrote about this in 1993:

http://people.cs.uchicago.edu/~odonnell/Scholar/Technical_papers/Electronic_Journal/description.html

Well-crafted PDFs are searchable for words and phrases, but PDF is 
inherently a format for describing page layouts, rather than a format 
for describing texts. E.g., textual structure (section headings, etc.) 
is much harder to pull from PDF than from HTML.
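
To illustrate the point about textual structure, here is a minimal sketch, using only Python's standard `html.parser` module, of pulling section headings out of an HTML document. The class and function names are hypothetical. No comparably short routine exists for a page-layout PDF, where headings are generally just text drawn in a larger font.

```python
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Collect (level, text) pairs for h1..h6 elements."""

    def __init__(self):
        super().__init__()
        self.headings = []
        self._level = None  # heading level currently open, if any
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if len(tag) == 2 and tag[0] == "h" and tag[1] in "123456":
            self._level = int(tag[1])
            self._buf = []

    def handle_data(self, data):
        if self._level is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if self._level is not None and tag == f"h{self._level}":
            self.headings.append((self._level, "".join(self._buf).strip()))
            self._level = None


def extract_headings(html_text):
    """Return the document outline as a list of (level, heading text)."""
    parser = HeadingCollector()
    parser.feed(html_text)
    parser.close()
    return parser.headings


page = "<h1>Formats</h1><p>Body text.</p><h2>Page layout <em>vs</em> structure</h2>"
print(extract_headings(page))
```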

My article emphasizes the data-structural qualities of different 
formats, because we can evaluate these relatively objectively and 
definitely. Many of the crucial long-term issues, such as indefinite 
readability, cannot be attacked directly, because they depend on unknown 
future developments. But a format that is well structured technically 
will be easier to convert in the future, and has one advantage in 
attracting a large enough user community to insure that it will be 
converted when necessary.

Mike O'Donnell
The University of Chicago


Re: [BOAI] Formats for electronic dissemination

From: holl AT konkoly.hu (Andras Holl)
Date: Tue, 28 Oct 2003 10:33:20 +0100 (MET)


Threading: Re: [BOAI] Formats for electronic dissemination from harnad AT ecs.soton.ac.uk
      • This Message


> Please, let's not lose sight of the problem, which is still there,
> as pressing as ever, but now being kept at a distance by yet
> *another* groundless, confusion-generating, and -- most important --
> *inaction-encouraging* reservation.

Isn't there room for discussing the question of format and other
"technical" details like accessibility of document elements (figures,
tables)?

I agree it is not the best strategy to inflate a problem with the
inclusion of other related problems - but there are ways to handle
situations like this. Open Access is about ACCESS, I accept. But we might
discuss Open Formats as well.

> But self-archiving is not about perennity, just about accessibility,
> because the self-archived versions of toll-access articles are merely
> *supplements* to and not *substitutes* for the publisher's proprietary
> toll-access version.

I'm afraid publishers' proprietary toll-access versions are not
about perennity either. Some of the publishers do have good solutions for
the format problem - I believe Univ. of Chicago Press is an example,
but others might not. It might not be the best strategy to put too
much pressure on them, indeed. I think, though, we should pay 
attention to document formats and related questions in the case
of self-archiving and institutional repositories.


Andras Holl
-----------------------------------------------------------------------------
Andras Holl / Holl Andras                            e-mail: holl AT konkoly.hu
Konkoly Observatory / MTA CsKI                       Tel.: +36 1 3754-122
IT manager / Szamitastechn. rendszervezeto           Mail: H-1525 P.O.Box 67, 
                                                           Budapest, Hungary
-----------------------------------------------------------------------------


Re: [BOAI] Formats for electronic dissemination

From: Radu <radu AT monicsoft.net>
Date: Tue, 28 Oct 2003 13:37:25 -0500


Threading: [BOAI] Formats for electronic dissemination from tarabore AT clipper.ens.fr
      • This Message
             Re: [BOAI] Formats for electronic dissemination from m.brown AT liverpool.ac.uk
             Re: [BOAI] Formats for electronic dissemination from drvinod AT hotpop.com
             Re: [BOAI] Formats for electronic dissemination from M.Brown AT liverpool.ac.uk

At 11:55 AM 10/27/03, Dario Taraborelli wrote:
>(I confess that I don't thoroughly understand the problem with pdf's,
>since pdf documents can be indexed by search engines as easily as html
>documents: it doesn't look like an insuperable technical problem).

There's something else about archived pdfs, much worse than the relative 
inaccessibility of the semantics for their content, and that's image-based 
text.

I have seen many journal archives which simply dump page scans into pdf 
format. The resulting documents are huge and totally impenetrable by 
current classification/data mining tools. It's even impossible to 
copy/paste text out of these 'archives'.


Yours,
Radu
--
Eastcree.org project
Carleton University
www.monicsoft.net/proj/creeTime.html
(613) 520-2600x2174 


Re: [BOAI] Formats for electronic dissemination

From: Michael Brown <m.brown AT liverpool.ac.uk>
Date: Wed, 29 Oct 2003 16:30:30 +0000


Threading: Re: [BOAI] Formats for electronic dissemination from radu AT monicsoft.net
      • This Message
             RE: [BOAI] Formats for electronic dissemination from eds AT library.caltech.edu

> I have seen many journal archives which simply dump page scans into 
> pdf format. The resulting documents are huge and totally impenetrable 
> by current classification/data mining tools. It's even impossible to 
> copy/paste text out of these 'archives'.

I think most organisations opt just to use images because of the large 
amount of time & resources (read $$$$$) it would take to OCR the 
articles and then proof-read them. At least with an image you know you 
are getting a mirror image without having to employ any extra 
resources.

I know this is not the ideal solution - but metadata can be added to 
the PDF, and as OCR matures the current "image" PDFs could be 
converted to full text.

At the moment Acrobat 6 Professional will let you OCR "image" PDFs and

1) either replace the image with "full" text - which takes time to 
convert and then time to proof-read against the original copy...

or

2) leave the image in place and the text behind the image - so that it 
is selectable and you can mine it - for an example see the two reports 
from:

http://www.filariasis.net/library/reports/index.html

(however if you don't proof read it the text could contain some errors 
[- this is mainly a concern if the text is older]).

At the moment option 2 is the only one available to me - as I'm working 
alone converting hundreds of paper-based journal articles (from 1870's 
to date) and simply don't have the resources (read $$$$$ ;-)) to create 
full text versions. If I get funding, though, I already have the 
"image" PDFs which I can then convert to full text.

In the meantime using option 2 allows both my search engine and PDF 
readers to search within the PDFs - and it is very accurate - but 
consumes less of my resources.

Mike

==============================
Michael Brown
Lymphatic Filariasis Support Centre
Liverpool School of Tropical Medicine
Pembroke Place
Liverpool
Merseyside
L3 5QA
United Kingdom

t: +44-151-705-3243
f: +44-151-705-3243
e: m.brown AT liverpool.ac.uk
w: http://www.filariasis.net/
==============================


Re: [BOAI] Formats for electronic dissemination

From: "Dr.Vinod Scaria" <drvinod AT hotpop.com>
Date: Mon, 20 Oct 2003 13:46:17 +0530


Threading: Re: [BOAI] Formats for electronic dissemination from radu AT monicsoft.net
      • This Message

I agree with Radu's views.
I have always wondered why they convert page scans to PDFs.
They can always use them as GIFs or JPEGs, which are much handier and easily
downloadable, and there is no special advantage in being a PDF by itself; as
Radu notes, these are virtually impenetrable by data mining tools.
Moreover, the print quality of many of these scanned PDFs is equally poor.

kind regards
Vinod


Dr.Vinod Scaria
WEB: www.drvinod.netfirms.com
MAIL: vinodscaria AT yahoo.co.in
Mobile: +91 98474 65452






RE: [BOAI] Formats for electronic dissemination

From: eds AT library.caltech.edu
Date: Wed, 29 Oct 2003 11:49:36 -0800


Threading: Re: [BOAI] Formats for electronic dissemination from m.brown AT liverpool.ac.uk
      • This Message
             RE: [BOAI] Formats for electronic dissemination from joatp2000 AT yahoo.com

The Paper Capture feature in Acrobat 6 Professional can help. Using it with
the "Formatted text and graphics" option will convert an image (tiff) based
PDF document (originating from a scan) into a mostly text based PDF via an
OCR step.

The quality is significantly better, because you are now looking at rendered
fonts, not bit mapped images, and it becomes indexable. 

All the words, equations, diagrams, etc., that Acrobat can't convert to text
are left as mini images within the document.

--
Ed Sponsler
Caltech Library System
Pasadena, CA USA





RE: [BOAI] Formats for electronic dissemination

From: JOATO JOATP <joatp2000 AT yahoo.com>
Date: Thu, 30 Oct 2003 05:03:08 -0800 (PST)


Threading: RE: [BOAI] Formats for electronic dissemination from eds AT library.caltech.edu
      • This Message
             RE: [BOAI] Formats for electronic dissemination from remoran AT digcns.com
             Re: [BOAI] Formats for electronic dissemination from lqthede AT apk.net
             RE: [BOAI] Formats for electronic dissemination from remoran AT digcns.com
             Re: [BOAI] Formats for electronic dissemination from M.Brown AT liverpool.ac.uk
             Re: [BOAI] Formats for electronic dissemination from michael_odonnell AT acm.org
             Re: [BOAI] Formats for electronic dissemination from lqthede AT apk.net
             Re: [BOAI] Formats for electronic dissemination from m.brown AT liverpool.ac.uk



I kind of agree on this issue.  As someone who writes different articles from 
time to time, I have often noticed how certain graphics, especially math 
printouts, can look great in, say, MS Word format.   But after conversion 
to PDF format you at times have to enlarge the text considerably to even read the 
math.   Regular prints of these suffer the same problem.   In fact, one of the 
Editors of our own Journal once mentioned to me he was having problems along 
this line in something I had converted to PDF.   When I sent him the original 
in MS Word he had no problem at all reading it.   I then suggested he enlarge 
the PDF version by a bit over 200% and again he could read it properly.
 
I believe it is the University of Texas that allows upload and display of 
files in PDF, LaTeX, and MS Word formats as well as HTML, etc.   I have often 
thought that feature would be good on all the systems in general.   Another 
feature often absent from websites in general is the ability to 
upload PDF files directly.  Now websites are not archive systems.   But they do 
enter into the general spread of information worldwide.   Since most Institutes 
and most researchers tend to use the PDF format, having websites that allow 
direct upload of PDF files as well as other types of files would be an 
improvement on the regular hyperlink to another site.
 
Just a thought on that last point.




RE: [BOAI] Formats for electronic dissemination

From: "Robert E. Moran" <remoran AT digcns.com>
Date: Thu, 30 Oct 2003 09:10:18 -0700


Threading: RE: [BOAI] Formats for electronic dissemination from joatp2000 AT yahoo.com
      • This Message

ATTACHMENT: message.html!


Re: [BOAI] Formats for electronic dissemination

From: Mike Brown <M.Brown AT liverpool.ac.uk>
Date: Thu, 30 Oct 2003 15:11:45 +0000 (GMT)


Threading: Re: [BOAI] Formats for electronic dissemination from radu AT monicsoft.net
      • This Message

And I have to strongly disagree ;-)

>They can always use them as GIFs or JPEGs which is much handy and
>easily downloadable and there is no special advantage being a PDF by
>itself,

Have you ever tried to assemble a journal article for dissemination
using just GIFs or JPEGs? Take a look at this article:

An investigation on filariasis in the Berau region (Inanwatan District,
North-West New Guinea) (1.4MB pdf)
http://filariasis.net/library/media/report_pdfs/spc/1957/spc_tpn_105.pdf

and tell me it would be easier for non-IT literate people (99% of the
audience - unless it's an IT one :-)) to view it as a series of
individual GIFs/JPEGs - rather than click on the link and for it to open
in Acrobat Reader - just like a facsimile of the original printed
report - across *all* platforms (UNIX, Mac, WinPC).  Furthermore, if the
article is still protected by copyright, using PDF allows you (should you
desire) to control access to the document - something you cannot do with
GIFs/JPEGs.

>there is no special advantage being a PDF

I would suggest spending some time with the PDF specification and an
application such as Adobe Acrobat to see what you can do with PDF - did
you know, for example, that the windowing system in Mac OS X uses
PDF as the basis of its imaging model?

http://www.apple.com/macosx/features/quartz/

>these are virtually impenetrable by data mining tools.

This is not the fault of PDF - but the person who applies the
technology - see my last e-mail on this subject.

>Moreover, the print quality of many of these scanned PDFs are equally
>poor.

Again, this is not the fault of PDF - but the person who applies the
technology.

The report I link to above (and all of the articles I have converted on
filariasis.net) prints out at high quality - why? Because it has a
resolution of 300 DPI - and not the screen resolution of 72 DPI.

Many Journals when converting their archives to PDF simply choose a low
resolution either at the scan stage or conversion to PDF stage as it
lowers the final size of the PDF - why?

I think because:

1)  smaller footprint (less cost to produce and store)
2)  less bandwidth to transmit across a network (less cost to
transmit)

It's down to economics.
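
The size trade-off behind that choice can be made concrete with a little arithmetic. The sketch below assumes a US Letter page (8.5 x 11 in) stored as an uncompressed 8-bit greyscale bitmap; both are assumptions made for illustration. Real PDFs compress their images, so absolute sizes will be smaller, but the ratio between the two resolutions is what matters here.

```python
def scan_pixels(width_in, height_in, dpi):
    """Pixel dimensions of a page scanned at the given resolution."""
    return round(width_in * dpi), round(height_in * dpi)


def raw_size_bytes(width_in, height_in, dpi, bits_per_pixel=8):
    """Uncompressed bitmap size in bytes for one scanned page."""
    width_px, height_px = scan_pixels(width_in, height_in, dpi)
    return width_px * height_px * bits_per_pixel // 8


# Compare screen resolution with print resolution for one page.
for dpi in (72, 300):
    width_px, height_px = scan_pixels(8.5, 11, dpi)
    size = raw_size_bytes(8.5, 11, dpi)
    print(f"{dpi} DPI: {width_px} x {height_px} px, {size} bytes raw")

# Pixel count grows with the square of the resolution.
ratio = raw_size_bytes(8.5, 11, 300) / raw_size_bytes(8.5, 11, 72)
print(f"300 DPI needs about {ratio:.1f} times the raw data of 72 DPI")
```

Under these assumptions a 300 DPI page carries roughly seventeen times as much raw image data as a 72 DPI page, which is the cost being weighed against print quality.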

The result, often, is poor quality printing. A further problem is that
image quality in these pdfs is often very poor - making the images
unreadable - and for us in the world of medicine - useless.

So in short, it's not often the fault of a technology - it is mostly the
fault of a human using the technology inappropriately or being constrained by
economic factors.

Best wishes,

Mike




Re: [BOAI] Formats for electronic dissemination

From: "Linda Q. Thede" <lqthede AT apk.net>
Date: Thu, 30 Oct 2003 09:59:01 -0500


Threading: RE: [BOAI] Formats for electronic dissemination from joatp2000 AT yahoo.com
      • This Message

I have in on the pdf discussion. Personally I detest them for two 
reasons. One, they take longer to download, and two, they are too often 
in columns making them difficult to read online. Not all of us want to 
print everything we see online.

-- 
Linda Q. Thede
435-4 Chandler Drive
Aurora, OH 44202
lqthede AT apk.net
330-562-3281




RE: [BOAI] Formats for electronic dissemination

From: "Robert E. Moran" <remoran AT digcns.com>
Date: Thu, 30 Oct 2003 07:39:16 -0700


Threading: RE: [BOAI] Formats for electronic dissemination from joatp2000 AT yahoo.com
      • This Message
             RE: [BOAI] Formats for electronic dissemination from joatp2000 AT yahoo.com

ATTACHMENT: message.html!


Re: [BOAI] Formats for electronic dissemination

From: Mike Brown <M.Brown AT liverpool.ac.uk>
Date: Thu, 30 Oct 2003 18:55:53 +0000 (GMT)


Threading: RE: [BOAI] Formats for electronic dissemination from joatp2000 AT yahoo.com
      • This Message

>One, they take longer to download

Take longer to download compared to what?

>and two, they are too often in columns making them difficult to read
>online.

This is not the technology at fault - this is the preference of the
producer of the document.



RE: [BOAI] Formats for electronic dissemination

From: JOATO JOATP <joatp2000 AT yahoo.com>
Date: Thu, 30 Oct 2003 16:00:18 -0800 (PST)


Threading: RE: [BOAI] Formats for electronic dissemination from remoran AT digcns.com
      • This Message



Over time, having adapted from the MS WORD/HTML formats, I have learned to work 
fine with PDF, and in fact, almost all the articles our Journal does are now 
PDF simply because it's an almost universally accepted format.   I've also found it 
uses up less space in most cases.   However, there are aspects of 
its format I've yet to learn, especially in the area of graphics.   But I 
actually don't fault PDF in general for this since I've seen excellently done 
articles in this format with great graphics.




Re: [BOAI] Formats for electronic dissemination

From: "Michael J. O'Donnell" <michael_odonnell AT acm.org>
Date: Thu, 30 Oct 2003 14:57:30 -0600


Threading: RE: [BOAI] Formats for electronic dissemination from joatp2000 AT yahoo.com
      • This Message

Well, first I agree with Stevan Harnad that achieving open archiving is 
much more important than the choice of format. Can we coin a proverb: 
"PDF in the archive is better than perfect format deleted"? But, since I 
am an expert on formats, I have to respond:

Mike Brown wrote:

>>and two, they are too often in columns making them difficult to read
>>online.
>
>This is not the technology at fault - this is the preference of the
>producer of the document.

It's partly the fault of the choice of "technology" (that is, format). 
PDF is a page layout format, not a structured text format. It requires 
the producer to make choices that are much better left to readers, in 
particular because different readers have different needs. It should be 
the reader, rather than the producer, who is making layout choices.

Supposing that we can choose which format to put in the archive, we 
should choose formats that provide maximum flexible utility to readers. 
We may enumerate some things that readers might like to do with 
documents: display them, search them, perform statistical analyses on 
them, display them in huge fonts because of low visual acuity 
(magnifying a 10-point layout does a very bad job of this), browse them 
audibly (which is much more than just having them read), ... But the 
point is not to support particular anticipated uses. Rather, the point 
is to support maximum flexibility, and avoid foreclosing uses that will 
be invented in the future.
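
As a small illustration of leaving layout to the reader, the sketch below stores a document as a list of paragraph strings (a stand-in for a structured text format, assumed here for the example) and lays it out at whatever line width the reader chooses. The function name `render` and the sample text are hypothetical; the wrapping is done by Python's standard `textwrap` module.

```python
import textwrap


def render(paragraphs, width):
    """Lay out structured text (a list of paragraphs) at a reader-chosen width."""
    return "\n\n".join(textwrap.fill(p, width=width) for p in paragraphs)


# The stored document carries no line breaks or column decisions.
document = [
    "PDF is a page layout format, not a structured text format.",
    "Different readers have different needs, so layout is better left to them.",
]

# One reader wants short lines for a small screen or a very large font.
print(render(document, width=30))
print("-" * 30)
# Another reader wants long lines for a wide window.
print(render(document, width=72))
```

The same stored text serves both readers, whereas a fixed page layout would have committed to one line width at production time.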

Paradoxically, it is quite possible to evaluate the inherent flexibility 
of a format without knowing the precise use to which that flexibility 
will contribute. I wrote up some analysis of known types of formats, and 
the article is openly (but not very well organizedly) archived at 
http://people.cs.uchicago.edu/~odonnell/Scholar/Technical_papers/Electronic_Journal/description.html

Mike O'Donnell
The University of Chicago



Re: [BOAI] Formats for electronic dissemination

From: "Linda Q. Thede" <lqthede AT apk.net>
Date: Thu, 30 Oct 2003 16:20:35 -0500


Threading: RE: [BOAI] Formats for electronic dissemination from joatp2000 AT yahoo.com
      • This Message

Mike

PDF files take longer to download than an ordinary file because they are larger 
than an html file of the same information.

Yes, being in columns is a preference of the producer, but most choose this. I 
love it when Google offers an html version of the pdf file...

-- 

Linda Q. Thede
435-4 Chandler Drive
Aurora, OH 44202
lqthede AT apk.net
330-562-3281




Re: [BOAI] Formats for electronic dissemination

From: Michael Brown <m.brown AT liverpool.ac.uk>
Date: Fri, 31 Oct 2003 16:59:20 +0000


Threading: RE: [BOAI] Formats for electronic dissemination from joatp2000 AT yahoo.com
      • This Message


I was responding to the comments about the PDF format made on this  
forum which are clearly incorrect - not about its suitability as a  
format for self-archiving - for example:

> and two, they are too often in columns making them difficult to read
> online.

This is *not* the fault of the format - clearly you can make one column  
PDFs.

> There's something else about archived pdfs, much worse than the relative
> inaccessibility of the semantics for their content, and that's image-based
> text.

again this is *not* the fault of the format.

I don't believe that the PDF format is the only (or best) way of  
archiving material - that is what SGML and XML are for - allowing  
information to be extracted and re-used in a variety of dissemination  
strategies - giving the user the choice.
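
The point about extraction and re-use can be sketched in a few lines of
Python. This is only an illustration under stated assumptions: the element
names (article, title, para) and the sample text are hypothetical and do not
come from any real archive schema. It shows one structured source feeding
three uses: plain text at a reader-chosen width, HTML at a reader-chosen
font size, and a bare word list for indexing or statistics.

```python
import textwrap
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical structured source; the element names are illustrative only.
ARTICLE_XML = """<article>
  <title>Formats for electronic dissemination</title>
  <para>PDF fixes the layout at production time.</para>
  <para>Structured text leaves layout to the reader.</para>
</article>"""


def to_plain_text(root, width=40):
    """Render as plain text, wrapping paragraphs to a reader-chosen width."""
    lines = [root.findtext("title", default="").upper(), ""]
    for para in root.iter("para"):
        lines.extend(textwrap.wrap(para.text or "", width=width))
        lines.append("")
    return "\n".join(lines).rstrip()


def to_html(root, font_size_pt=12):
    """Render as HTML with a reader-chosen font size."""
    title = escape(root.findtext("title", default=""))
    body = "".join(f"<p>{escape(p.text or '')}</p>" for p in root.iter("para"))
    return (f'<html><body style="font-size:{font_size_pt}pt">'
            f"<h1>{title}</h1>{body}</body></html>")


def extract_words(root):
    """Pull out the text alone, e.g. for indexing or statistical analysis."""
    return " ".join(root.itertext()).split()


root = ET.fromstring(ARTICLE_XML)
print(to_plain_text(root, width=30))
print(to_html(root, font_size_pt=24))
print(len(extract_words(root)), "words")
```

The layout decisions (line width, font size) are arguments supplied at
reading time, which is the choice a page layout format makes once, at
production time.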

Cheers,

Mike

On 30 Oct 2003, at 20:57, Michael J. O'Donnell wrote:

> Well, first I agree with Steven Harnad that achieving open archiving  
> is much more important than the choice of format. Can we coin a  
> proverb: "PDF in the archive is better than perfect format deleted"?  
> But, since I am expert on formats, I have to respond:
>
> Mike Brown wrote:
>
>>> and two, they are too often in columns making them difficult to read
>>> online.
>>>
>>
>> This is not the technology at fault - this is the preference of the
>> producer of the document.
>>
> It's partly the fault of the choice of "technology" (that is, format).  
> PDF is a page layout format, not a structured text format. It requires  
> the producer to make choices that are much better left to readers, in  
> particular because different readers have different needs. It should  
> be the reader, rather than the producer, who is making layout choices.
>
> Supposing that we can choose which format to put in the archive, we  
> should choose formats that provide maximum flexible utility to  
> readers. We may enumerate some things that readers might like to do  
> with documents: display them, search them, perform statistical  
> analyses on them, display them in huge fonts because of low visual  
> acuity (magnifying a 10-point layout does a very bad job of this),  
> browse them audibly (which is much more than just having them read),  
> ... But the point is not to support particular anticipated uses.  
> Rather, the point is to support maximum flexibility, and avoid  
> foreclosing uses that will be invented in the future.
>
> Paradoxically, it is quite possible to evaluate the inherent  
> flexibility of a format without knowing the precise use to which that  
> flexibility will contribute. I wrote up some analysis of known types  
> of formats, and the article is openly (but not very well organizedly)  
> archived at  
> http://people.cs.uchicago.edu/~odonnell/Scholar/Technical_papers/Electronic_Journal/description.html
>
> Mike O'Donnell
> The University of Chicago
>
>



[BOAI] [Forum Home] [index] [options] [help]

 E-mail:  openaccess@soros.org .