Friday, March 21, 2008

Followup on gscan2pdf

I have spent quite a bit of time over the past 24 hours working with gscan2pdf. My goal was to OCR two documents; the ability to save a PDF is a bonus. One document was unworkable, and the other came out with an estimated 95+% accuracy. I could have typed all the pages in the time it took me to get through this, but as it is, I will be able to clean up the OCR output in minutes. See the note at the end of this post about Ubuntu.

Document 1: 0% OCR success.
An ancient printout from a 9-pin dot matrix printer, faint due to a worn-out ribbon. I tried various settings for scanning (the gscan2pdf interface does not offer many), for unpaper (whose options I understood but little, if at all), and for the OCR, for which I specified tesseract.

0% isn't good. I will attempt the methods described online, which use a TIFF file and some tweaks. Much too much work.

Document 2: a dark, inkjet-printed copy, about 9 pages, with handwritten edits on the pages.

Discussion and Results:
After spending some hours working with Document 1, with NO effect observed, I was pleased that Document 2 ran through gscan2pdf with almost perfect OCR results. I was displeased that I had to select the text and paste it into a file using an editor. Did I miss something?

I left the options mostly at their defaults this time around, except for setting the language to English and the scan resolution to 500 dpi. A lower resolution would perhaps work, but the scans didn't take too long.
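One way to avoid the select-and-paste step might be to skip the GUI and drive the scanner and tesseract from a script, so the text lands in a file directly. What follows is only a sketch under assumptions, not something tested against the scanner or documents described here: it assumes SANE's scanimage and tesseract are both on the PATH, the function names are made up for illustration, and the --resolution option depends on the scanner backend, so it may need adjusting.

```python
import shutil
import subprocess


def build_scan_command(dpi=500):
    """Command for SANE's scanimage, which writes the image to stdout.

    --resolution is a backend-specific option (assumed to be supported
    by the scanner in use).
    """
    return ["scanimage", "--format=tiff", "--resolution", str(dpi)]


def build_ocr_command(image_path, output_base, lang="eng"):
    """Command for tesseract; it appends .txt to output_base itself."""
    return ["tesseract", image_path, output_base, "-l", lang]


def scan_and_ocr(page_name, dpi=500, lang="eng"):
    """Scan one page to <page_name>.tif and OCR it to <page_name>.txt."""
    for tool in ("scanimage", "tesseract"):
        if shutil.which(tool) is None:
            raise RuntimeError(f"{tool} not found on PATH")

    image_path = f"{page_name}.tif"
    # scanimage emits the image on stdout, so redirect it into the file.
    with open(image_path, "wb") as image_file:
        subprocess.run(build_scan_command(dpi), stdout=image_file, check=True)

    subprocess.run(build_ocr_command(image_path, page_name, lang), check=True)
    return f"{page_name}.txt"
```

Calling scan_and_ocr("page1") would, if both tools behave as assumed, leave page1.tif and page1.txt in the current directory, ready for cleanup in an editor. The settings mirror the ones above: English and 500 dpi.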

This is more work than one would like to have to do to get editable copies of a stack of pages. Still, it is not bad at all, and the next time around, I won't even try with faint dot matrix copies.


This is another instance where Ubuntu has it right, or at least right enough to make my life easier. Ubuntu does share the Debian approach to compiling kernels and packages, which gives me fits once in a while, but productivity is improved, at least over the short haul. I need to reflect on this a bit.
