Budapest Open Access Initiative      

Budapest Open Access Initiative: BOAI Forum Archive

Re: [BOAI] Formats for electronic dissemination

From: Michael Brown <m.brown AT liverpool.ac.uk>
Date: Wed, 29 Oct 2003 16:30:30 +0000



> I have seen many journal archives which simply dump page scans into 
> pdf format. The resulting documents are huge and totally impenetrable 
> by current classification/data mining tools. It's even impossible to 
> copy/paste text out of these 'archives'.

I think most organisations opt just to use images because of the large 
amount of time & resources (read $$$$$) it would take to OCR the 
articles and then proof-read them. At least with an image you know you 
are getting a faithful copy of the original without having to employ 
any extra resources.

I know this is not the ideal solution - but metadata can be added to 
the PDF, and as OCR matures the current "image" PDFs could be 
converted to full text.

At the moment Acrobat 6 Professional will let you OCR "image" PDFs and

1) either replace the image with "full" text - which takes time to 
convert and then time to proof-read against the original copy...

or

2) leave the image in place and the text behind the image - so that it 
is selectable and you can mine it - for an example see the two reports 
from:

http://www.filariasis.net/library/reports/index.html

(however, if you don't proof-read it the text could contain some 
errors - this is mainly a concern if the text is older).
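
As a rough illustration of option 2 (a sketch only, not Acrobat's actual internals), a page can be modelled as the scanned image plus an invisible list of OCR'd words with their positions. Every name below (OcrWord, ScannedPage, find, the file name and the coordinates) is made up for the example:

```python
from dataclasses import dataclass, field


@dataclass
class OcrWord:
    text: str   # word as recognised by OCR (may contain errors)
    box: tuple  # (x0, y0, x1, y1) position of the word on the scan


@dataclass
class ScannedPage:
    image_file: str  # the page scan the reader actually sees
    hidden_words: list = field(default_factory=list)  # invisible text layer

    def text(self):
        """Return the hidden text, as copy/paste or an indexer would see it."""
        return " ".join(w.text for w in self.hidden_words)

    def find(self, term):
        """Return the boxes of words equal to term, ignoring case."""
        term = term.lower()
        return [w.box for w in self.hidden_words if w.text.lower() == term]


# Hypothetical page: file name and coordinates are illustrative only.
page = ScannedPage(
    image_file="page-001.tif",
    hidden_words=[
        OcrWord("Lymphatic", (72, 90, 150, 104)),
        OcrWord("filariasis", (155, 90, 230, 104)),
    ],
)

print(page.text())
print(page.find("FILARIASIS"))
```

The reader still sees the untouched scan, while selection and search work against the hidden words, and the boxes tell a viewer where to draw the highlight.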

At the moment option 2 is the only one available to me - as I'm working 
alone converting hundreds of paper-based journal articles (from the 
1870s to date) and simply don't have the resources (read $$$$$ ;-)) to 
create full text versions. If I get funding, though, I already have the 
"image" PDFs which I can then convert to full text.

In the meantime using option 2 allows both my search engine and PDF 
readers to search within the PDFs - and it is very accurate - but 
consumes less of my resources.
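
Because the hidden text in option 2 is not proof-read, a search tool could be made tolerant of small OCR slips. A minimal sketch using Python's standard difflib module follows; the function name fuzzy_hits, the word list and the 0.8 cutoff are assumptions for illustration, not a description of any particular search engine or PDF reader:

```python
import difflib


def fuzzy_hits(term, ocr_words, cutoff=0.8):
    """Return OCR'd words that closely resemble term, best match first."""
    lowered = [w.lower() for w in ocr_words]
    return difflib.get_close_matches(term.lower(), lowered, n=5, cutoff=cutoff)


# Hypothetical OCR output in which one word was misread ("fllariasis").
ocr_words = ["Lymphatic", "filariasis", "fllariasis", "report"]
hits = fuzzy_hits("filariasis", ocr_words)
print(hits)
```

Lowering the cutoff catches more misreadings but also more false matches, so the value would need tuning against real scans.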

Mike

==============================
Michael Brown
Lymphatic Filariasis Support Centre
Liverpool School of Tropical Medicine
Pembroke Place
Liverpool
Merseyside
L3 5QA
United Kingdom

t: +44-151-705-3243
f: +44-151-705-3243
e: m.brown AT liverpool.ac.uk
w: http://www.filariasis.net/
==============================

On 28 Oct 2003, at 18:37, Radu wrote:

> At 11:55 AM 10/27/03, Dario Taraborelli wrote:
>> (I confess that I don't thoroughly understand the problem with pdf's,
>> since pdf documents can be indexed by search engines as easily as html
>> documents: it doesn't look like an insuperable technical problem).
>
> There's something else about archived pdfs, much worse than the 
> relative inaccessibility of the semantics for their content, and 
> that's image-based text.
>
> I have seen many journal archives which simply dump page scans into 
> pdf format. The resulting documents are huge and totally impenetrable 
> by current classification/data mining tools. It's even impossible to 
> copy/paste text out of these 'archives'.
>
>
> Yours,
> Radu
> --
> Eastcree.org project
> Carleton University
> www.monicsoft.net/proj/creeTime.html
> (613) 520-2600x2174

