Budapest Open Access Initiative: BOAI Forum Archive
Re: [BOAI] Formats for electronic dissemination
From: Michael Brown <m.brown AT liverpool.ac.uk>
> I have seen many journal archives which simply dump page scans into
> pdf format. The resulting documents are huge and totally impenetrable
> by current classification/data mining tools. It's even impossible to
> copy/paste text out of these 'archives'.

I think most organisations opt just to use images because of the large amount of time & resources (read $$$$$) it would take to OCR the articles and then proof-read them. At least with an image you know you are getting an exact copy of the original page without having to employ any extra resources.

I know this is not the ideal solution - but metadata can be added to the PDF, and as OCR matures the current "image" PDFs could be converted to full text.

At the moment Acrobat 6 Professional will let you OCR "image" PDFs and either:

1) replace the image with "full" text - which takes time to convert and then time to proof-read against the original copy, or

2) leave the image in place and put the text behind the image - so that it is selectable and you can mine it. For an example see the two reports from: http://www.filariasis.net/library/reports/index.html (however, if you don't proof-read it the text could contain some errors - this is mainly a concern if the text is older).

At the moment option 2 is the only one available to me - as I'm working alone converting hundreds of paper-based journal articles (from the 1870s to date) and simply don't have the resources (read $$$$$ ;-)) to create full text versions. If I get funding, though, I already have the "image" PDFs, which I can then convert to full text.

In the meantime, using option 2 allows both my search engine and PDF readers to search within the PDFs - and it is very accurate - but consumes less of my resources.
Mike

==============================
Michael Brown
Lymphatic Filariasis Support Centre
Liverpool School of Tropical Medicine
Pembroke Place
Liverpool
Merseyside
L3 5QA
United Kingdom

t: +44-151-705-3243
f: +44-151-705-3243
e: m.brown AT liverpool.ac.uk
w: http://www.filariasis.net/
==============================

On 28 Oct 2003, at 18:37, Radu wrote:

> At 11:55 AM 10/27/03, Dario Taraborelli wrote:
>> (I confess that I don't thoroughly understand the problem with pdf's,
>> since pdf documents can be indexed by search engines as easily as html
>> documents: it doesn't look like an insuperable technical problem).
>
> There's something else about archived pdfs, much worse than the
> relative inaccessibility of the semantics for their content, and
> that's image-based text.
>
> I have seen many journal archives which simply dump page scans into
> pdf format. The resulting documents are huge and totally impenetrable
> by current classification/data mining tools. It's even impossible to
> copy/paste text out of these 'archives'.
>
>
> Yours,
> Radu
> --
> Eastcree.org project
> Carleton University
> www.monicsoft.net/proj/creeTime.html
> (613) 520-2600x2174
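
With option 2 in Mike's reply, the OCR text sits unproofed behind the page image, so one cheap compromise might be to spot-check only the tokens that look wrong rather than proof-read every article. The following is a minimal sketch of that idea, not anything described in the thread: the function name `suspicious_tokens`, the tiny `KNOWN_WORDS` list, and the two heuristics (letters mixed with digits, or a purely alphabetic word missing from the list) are all assumptions made for illustration. It assumes the OCR text has already been extracted from the PDF as a plain string.

```python
import re

# Illustrative word list only (an assumption); a real one would be a full
# dictionary plus the domain vocabulary of the journal being scanned.
KNOWN_WORDS = {"the", "report", "of", "filariasis", "in", "tropics"}


def suspicious_tokens(ocr_text, known_words):
    """Return OCR tokens worth a human glance (hypothetical helper).

    A token is flagged when it mixes letters and digits (e.g. "rep0rt"),
    or when it is purely alphabetic but absent from known_words.
    """
    flagged = []
    for raw in ocr_text.split():
        # Drop surrounding punctuation and normalise case before checking.
        word = raw.strip(".,;:!?()[]\"'").lower()
        if not word:
            continue
        has_letter = re.search(r"[a-z]", word) is not None
        has_digit = re.search(r"\d", word) is not None
        unknown = word.isalpha() and word not in known_words
        if (has_letter and has_digit) or unknown:
            flagged.append(word)
    return flagged


if __name__ == "__main__":
    sample = "The rep0rt of filariasis in tbe tropics."
    print(suspicious_tokens(sample, KNOWN_WORDS))
```

Older texts, which Mike notes are the main concern, would likely need a period-appropriate word list, since archaic spellings would otherwise be flagged as OCR errors.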