Free Software Tools

Friday, August 31, 2007

Hi, I updated the ebuild for gscan2pdf in bugzilla. I thought I'd put some information about the app here, since this thread is the only result when searching the forums for gscan2pdf. Also, anyone who wants to try gscan2pdf will probably want to know about the issues with unpaper, which is the topic of this thread.

The ebuild for gscan2pdf-0.9.16 includes support for tesseract-ocr. This means that with gscan2pdf it's now possible to scan in a document, have OCR run automatically, and then, when the scan is exported to PDF, have the OCR text attached as a comment or annotation. This text can be viewed using the Acrobat PDF reader but, perhaps more importantly, desktop search engines like Beagle will index it. That makes PDFs of scanned paper much easier to find, especially if one has a lot of them. Unlike other FOSS OCR apps, tesseract actually works reasonably well, certainly well enough for indexing purposes. I don't know of another GUI frontend for tesseract.
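For anyone curious what the OCR step looks like outside the GUI, here is a minimal Python sketch that runs the tesseract command-line tool on a single page image. It is not how gscan2pdf does it internally, only an illustration of the basic `tesseract IMAGE OUTPUT_BASE -l LANG` invocation. The function names and the file names `page1.tif` and `page1` are made up for the example.

```python
import shutil
import subprocess
from pathlib import Path


def tesseract_command(image, output_base, lang="eng"):
    """Build the argument list for: tesseract IMAGE OUTPUT_BASE -l LANG."""
    return ["tesseract", str(image), str(output_base), "-l", lang]


def ocr_page(image, output_base, lang="eng"):
    """Run tesseract on one page image and return the recognised text.

    Returns None when no tesseract binary can be found on PATH.
    """
    if shutil.which("tesseract") is None:
        return None
    subprocess.run(tesseract_command(image, output_base, lang), check=True)
    # tesseract appends ".txt" to the output base name it is given.
    return Path(str(output_base) + ".txt").read_text(encoding="utf-8")


if __name__ == "__main__":
    # Hypothetical file names; substitute a real scanned page to try it.
    print(" ".join(tesseract_command("page1.tif", "page1")))
```

Calling `ocr_page("page1.tif", "page1")` on a real scan would leave the recognised text in `page1.txt`, which is the kind of text gscan2pdf attaches to the exported PDF.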

Of course, gscan2pdf has a lot of other features, including ADF (automatic document feeder) support, creation of multi-page PDFs, thumbnail previews for easy page reordering, export to TIFF and DjVu, and so on. It uses libsane, gtk2-perl and PDF-API2. More info here: http://gscan2pdf.sourceforge.net/
I think it's definitely worth a try for anyone who wants to scan books, bills, or other documents (and then be able to find them easily with desktop search).

To get back on topic, gscan2pdf can also make use of unpaper to clean up pages after scanning (it's the only frontend I know of for unpaper, too). The issue is that unpaper.c won't compile without some specific, and perhaps broken, compiler options (-ftree-vectorize, in particular). In fact, I can't get it to compile at all on my system, even when I use the same compiler options that were used to prepare the binary distributed along with the source code. Has anyone been able to compile an unpaper binary from unpaper.c? If someone wants to try, the tarball with the source file can be found here: http://unpaper.berlios.de/

Also, there doesn't seem to have been any response at all to the bug filed with unpaper upstream last April. I'm not sure of the best way to approach the situation at this point. Because I can't make a binary from the .c file, it doesn't seem like a packaging or ebuild problem to me.

As a workaround, I've been using the precompiled binary for unpaper (I copied it to /usr/bin). I do realize that this is not a very good solution. But unpaper really does improve the look of my scans and it also makes the OCR work even better, so I've just settled on this unsatisfying compromise for the moment.
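If you go the same route, a quick way to confirm that the copied binary is actually visible on your PATH is a small check like the following Python sketch. The helper name `missing_tools` is made up for the example, and this is not a claim about how gscan2pdf itself looks for unpaper.

```python
import shutil


def missing_tools(names, which=shutil.which):
    """Return the entries of `names` that cannot be found on PATH."""
    return [name for name in names if which(name) is None]


if __name__ == "__main__":
    # Helper programs mentioned in this post; the list is only an example.
    missing = missing_tools(["unpaper", "tesseract"])
    if missing:
        print("not found on PATH:", ", ".join(missing))
    else:
        print("unpaper and tesseract are both on PATH")
```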

Anyway, although I suppose that the unpaper issues will keep gscan2pdf out of portage until they are resolved, I'd be interested in hearing any feedback on all of this stuff, including the gscan2pdf ebuild I linked earlier. Thanks.

An OCR Success Story: tesseract, gscan2pdf, and unpaper.
