Budapest Open Access Initiative: BOAI Forum Archive
RE: [BOAI] Formats for electronic dissemination
From: eds AT library.caltech.edu
The Paper Capture feature in Acrobat 6 Professional can help. Using it with the "Formatted text and graphics" option will convert an image (tiff) based PDF document (originating from a scan) into a mostly text based PDF via an OCR step. The quality is significantly better, because you are now looking at rendered fonts, not bit mapped images, and it becomes indexable. All the words, equations, diagrams, etc., that Acrobat can't convert to text are left as mini images within the document.

--
Ed Sponsler
Caltech Library System
Pasadena, CA USA

> -----Original Message-----
> From: Dr.Vinod Scaria [mailto:drvinod AT hotpop.com]
> Sent: Monday, October 20, 2003 1:16 AM
> To: BOAI Forum
> Subject: Re: [BOAI] Formats for electronic dissemination
>
>
> I agree to Radu's views.
> I have always wondered why they convert page scans to PDFs.
> They can always use them as GIFs or JPEGs which is much handy
> and easily downloadable and there is no special advantage
> being a PDF by itself, as Radu notes, these are virtually
> impenetrable by data mining tools. Moreover, the print
> quality of many of these scanned PDFs are equally poor.
>
> kind regards
> Vinod
>
>
> Dr.Vinod Scaria
> WEB: www.drvinod.netfirms.com
> MAIL: vinodscaria AT yahoo.co.in
> Mobile: +91 98474 65452
>
>
>
> ----- Original Message -----
> From: Radu
> To: BOAI Forum
> Sent: Wednesday, October 29, 2003 12:07 AM
> Subject: Re: [BOAI] Formats for electronic dissemination
>
>
> At 11:55 AM 10/27/03, Dario Taraborelli wrote:
> >(I confess that I don't thoroughly understand the problem with pdf's,
> >since pdf documents can be indexed by search engines as easily as html
> >documents: it doesn't look like an insuperable technical problem).
>
> There's something else about archived pdfs, much worse than
> the relative inaccessibility of the semantics for their
> content, and that's image-based text.
>
> I have seen many journal archives which simply dump page
> scans into pdf format.
> The resulting documents are huge and
> totally impenetrable by current classification/data mining
> tools. It's even impossible to copy/paste text out of these
> 'archives'.
>
>
> Yours,
> Radu
> --
> Eastcree.org project
> Carleton University
> www.monicsoft.net/proj/creeTime.html
> (613) 520-2600x2174