I agree completely about the scanned image problem: images are dumb and cannot be searched by any system. Another problem is creating meaningful metadata tags for documents before putting them into a viable archive. To me, this is the single most important thing you can do when developing a good DMS, because properly configured metadata tags attached to files let users find the right documents in the easiest way possible. Good places to learn how to create metadata tags properly are vendors like Documentum, Verity and Q-Docs, and search engines like Google and All The Web: see what they recommend to users when using or setting up a system. The W3C and Dublin Core are great metadata resources as well.
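
To make the tagging idea concrete, here is a rough sketch in Python of how a document's metadata could be turned into Dublin Core style HTML meta tags before it goes into the archive. The function name, the choice of elements and the sample record are all made up for illustration; a real DMS will have its own way of storing and exposing these fields.

```python
from html import escape

# A subset of the Dublin Core elements; which ones to use is an assumption here.
DC_ELEMENTS = ("title", "creator", "subject", "description", "date", "format", "language")


def dublin_core_meta(record):
    """Return HTML <meta> lines for the Dublin Core fields present in record.

    record maps an element name to a string or a list of strings.
    (Hypothetical helper, not part of any vendor's product.)
    """
    lines = ['<link rel="schema.DC" href="http://purl.org/dc/elements/1.1/">']
    for element in DC_ELEMENTS:
        values = record.get(element, [])
        if isinstance(values, str):
            values = [values]  # allow a single value as well as a list
        for value in values:
            # Escape quotes and angle brackets so the attribute stays valid HTML.
            lines.append('<meta name="DC.%s" content="%s">'
                         % (element, escape(value, quote=True)))
    return lines


if __name__ == "__main__":
    # Sample record with invented values, for illustration only.
    sample = {
        "title": "Example scanned journal issue",
        "creator": "Example Author",
        "subject": ["archiving", "metadata"],
        "format": "application/pdf",
    }
    print("\n".join(dublin_core_meta(sample)))
```

Tags like these travel with the file, so even a scanned, image-only document can still be found by title, author or subject.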
 
Bob Moran


-------- Original Message --------
Subject: Re: [BOAI] Formats for electronic dissemination
From: "Radu" <radu@monicsoft.net>
Date: Tue, October 28, 2003 11:37 am
To: "BOAI Forum" <boai-forum@ecs.soton.ac.uk>

At 11:55 AM 10/27/03, Dario Taraborelli wrote:
>(I confess that I don't thoroughly understand the problem with pdf's,
>since pdf documents can be indexed by search engines as easily as html
>documents: it doesn't look like an insuperable technical problem).

There's something else about archived pdfs, much worse than the relative
inaccessibility of the semantics for their content, and that's image-based
text.

I have seen many journal archives which simply dump page scans into pdf
format. The resulting documents are huge and totally impenetrable by
current classification/data mining tools. It's even impossible to
copy/paste text out of these 'archives'.


Yours,
Radu
--
Eastcree.org project
Carleton University
www.monicsoft.net/proj/creeTime.html
(613) 520-2600x2174