I would guess this is related to their work with Tesseract. It works
quite well, though as I pointed out on the Code4Lib thread mentioned in
this thread, it isn't presently possible to add the OCR'd text back into
the PDF.
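
On automating a monthly batch: here is a minimal sketch, assuming the PDF
pages have already been rendered to TIFF images by some other tool and
that the tesseract command-line program is on the PATH. The function
names and directory layout are hypothetical, made up for illustration;
this says nothing about how Google runs it.

```python
import subprocess
from pathlib import Path


def tesseract_command(image, out_dir, lang="eng"):
    """Build the tesseract command line for one page image.

    Tesseract takes an output *base* name and appends ".txt" itself.
    """
    out_base = Path(out_dir) / Path(image).stem
    return ["tesseract", str(image), str(out_base), "-l", lang]


def ocr_directory(in_dir, out_dir, lang="eng", dry_run=False):
    """Run tesseract over every .tif file in in_dir.

    Returns the list of commands, in sorted filename order. With
    dry_run=True the commands are built but not executed.
    """
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    commands = []
    for image in sorted(Path(in_dir).glob("*.tif")):
        cmd = tesseract_command(image, out_dir, lang)
        commands.append(cmd)
        if not dry_run:
            # check=True raises CalledProcessError if tesseract fails
            subprocess.run(cmd, check=True)
    return commands
```

Calling ocr_directory("scans", "text") from a scheduled job would leave
one .txt file per page image in the output directory. The text ends up
beside the PDF rather than inside it, which is the limitation noted
above.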
James
Richard Rankin wrote:
> Does anyone know what OCR software they are using? We currently have a
> project to OCR a large number of PDFs per month and would like to
> automate the process.
>
> Ricky
> ______________________
> Principal Analyst
> Information Services
> Queen's University Belfast
>
> tel: 02890 974824
> fax: 02890 976586
> email: [log in to unmask]
>
> ------------------------------------------------------------------------
>
> *From:* Repositories discussion list
> [mailto:[log in to unmask]] *On Behalf Of *Peter Millington
> *Sent:* 04 November 2008 10:14
> *To:* [log in to unmask]
> *Subject:* Google Indexes Non-searchable PDFs
>
>
>
> The following development from Google could have a big impact on
> institutional repositories. PDFs from scanned documents and/or from
> low-end software are often just images that can be read by humans, but
> cannot be searched by keyword or indexed by search engines. I am sure
> that most if not all repositories hold such PDFs. This Google initiative
> will unfurl the cloak of invisibility from them.
>
>
>
> Peter Millington
>
> SHERPA, University of Nottingham
>
>
>
> * Google sheds light on 'Dark Web' by searching scanned documents
> http://cwflyris.computerworld.com/t/3821061/247711/148332/2/
>
> Using optical character recognition (OCR) technology, Google's search
> engine now can convert scanned PDF documents into text that can be
> searched and indexed, the company said. Thus, government reports,
> academic papers and other scanned documents can now show up in search
> results. Search engines generally interpret PDF documents as images of
> text rather than text.
>
--
-------------------------------
James Tuttle
Digital Repository Librarian
NCSU Libraries, Box 7111
North Carolina State University
Raleigh, NC 27695-7111
[log in to unmask]
(919)513-0651 Phone
(919)515-3031 Fax