Budapest Open Access Initiative      

Budapest Open Access Initiative: BOAI Forum Archive

RE: [BOAI] Formats for electronic dissemination

From: eds AT library.caltech.edu
Date: Wed, 29 Oct 2003 11:49:36 -0800


Threading: Re: [BOAI] Formats for electronic dissemination from m.brown AT liverpool.ac.uk
      • This Message
             RE: [BOAI] Formats for electronic dissemination from joatp2000 AT yahoo.com

The Paper Capture feature in Acrobat 6 Professional can help. Using it with
the "Formatted text and graphics" option will convert an image (TIFF) based
PDF document (originating from a scan) into a mostly text-based PDF via an
OCR step.

The quality is significantly better, because you are now looking at rendered
fonts rather than bitmapped images, and the text becomes indexable.

All the words, equations, diagrams, etc., that Acrobat can't convert to text
are left as mini images within the document.
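
As a rough illustration of the difference between a scan-only PDF and one
that has been through an OCR step, here is a minimal Python sketch. It is a
crude byte-level heuristic, not a PDF parser and not anything Acrobat does:
it only looks for font dictionaries and image XObjects in the raw file. The
function name and labels are my own, and files that store their objects in
compressed object streams may be reported as "unknown".

```python
import re

# Hypothetical helper: a crude byte-level heuristic, not a real PDF parser.
# Font dictionaries normally carry "/Type /Font"; scanned page images are
# image XObjects carrying "/Subtype /Image".
FONT_RE = re.compile(rb"/Type\s*/Font\b")
IMAGE_RE = re.compile(rb"/Subtype\s*/Image\b")


def classify_pdf(data: bytes) -> str:
    """Guess whether raw PDF bytes look scan-only, text-based, or mixed."""
    has_fonts = FONT_RE.search(data) is not None
    has_images = IMAGE_RE.search(data) is not None
    if has_images and not has_fonts:
        # Only page images: nothing to index or copy/paste.
        return "image-only"
    if has_fonts and has_images:
        # Typical of an OCR'd scan: text plus leftover image fragments.
        return "text with images"
    if has_fonts:
        return "text"
    # Nothing recognisable, e.g. objects hidden in compressed streams.
    return "unknown"
```

To try it on a file, read the file in binary mode and pass the bytes in, for
example classify_pdf(open("article.pdf", "rb").read()), where article.pdf is
a placeholder name.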

--
Ed Sponsler
Caltech Library System
Pasadena, CA USA



> -----Original Message-----
> From: Dr.Vinod Scaria [mailto:drvinod AT hotpop.com] 
> Sent: Monday, October 20, 2003 1:16 AM
> To: BOAI Forum
> Subject: Re: [BOAI] Formats for electronic dissemination
> 
> 
> I agree with Radu's views.
> I have always wondered why they convert page scans to PDFs. 
> They could just as well serve them as GIFs or JPEGs, which are 
> much handier and easier to download. There is no special 
> advantage in the PDF format by itself and, as Radu notes, these 
> files are virtually impenetrable by data mining tools. Moreover, 
> the print quality of many of these scanned PDFs is equally poor.
> 
> kind regards
> Vinod
> 
> 
> Dr.Vinod Scaria
> WEB: www.drvinod.netfirms.com
> MAIL: vinodscaria AT yahoo.co.in
> Mobile: +91 98474 65452
> 
> 
> 
> ----- Original Message -----
> From: Radu
> To: BOAI Forum
> Sent: Wednesday, October 29, 2003 12:07 AM
> Subject: Re: [BOAI] Formats for electronic dissemination
> 
> 
> At 11:55 AM 10/27/03, Dario Taraborelli wrote:
> >(I confess that I don't thoroughly understand the problem with pdf's,
> >since pdf documents can be indexed by search engines as easily as html
> >documents: it doesn't look like an insuperable technical problem).
> 
> There's something else about archived pdfs, much worse than 
> the relative inaccessibility of the semantics for their 
> content, and that's image-based text.
> 
> I have seen many journal archives which simply dump page 
> scans into pdf format. The resulting documents are huge and 
> totally impenetrable by current classification/data mining 
> tools. It's even impossible to copy/paste text out of these 
> 'archives'.
> 
> 
> Yours,
> Radu
> --
> Eastcree.org project
> Carleton University
> www.monicsoft.net/proj/creeTime.html
> (613) 520-2600x2174
> 
> 



 E-mail:  openaccess@soros.org .