Friday, August 31, 2007

An OCR Success Story: tesseract, gscan2pdf, and unpaper.

Wherein I opine that OCR is ready for prime time on GNU/Linux.

For some years I've been carrying around a manila file holding a printout of a digital file from a project of mine in the early 90s: a list of observed seasonal events over a two- or three-year period on Tol Island, in Chuuk Lagoon, E. Caroline Islands. It seemed to be the litmus test of OCR. I didn't believe it would ever be possible to scan it and reduce it back to characters. It was printed on an old HP Portable Deskjet printer.

At the time I was using a Toshiba Satellite laptop. A friend had given me two 10 Watt solar panels, really ancient ones. I had scrounged a 100 AHr deep-cycle battery from the local Air Force Civic Action Team (CAT) when they were servicing the batteries on their heavy equipment. Every three weeks or so I would carry the battery into town, and the CAT would charge it. In between, I had the 10 Watt panel to trickle charge the beast. It kept my computer working for about three weeks! Not bad. And I had built an adaptor for my Zeiss microscope, to run it from 12VDC, with a rheostat dimmer, so it was just as good as mains power. So there was a bit of load on the old deep-cycle battery after all: I spent many hours a day either on the computer or with the microscope. (The stereomicroscope that I'd gotten from Jerry Bakus at USC was illuminated by a kerosene pressure lantern.)

I jury-rigged the printer to run off 12VDC also. The beauty of this was that by bypassing the power adaptor plug and wiring directly in parallel with the batteries, I saved a considerable amount of power, because the printer had a built-in sleep mode that only worked when it was running off battery power.

This printer wasn't a great one, and I am sure I harmed the quality by refilling the cartridges. In those days, refilling technology, not to mention inkjet technology, was in a primitive state, and a refilled cartridge, if it worked at all, would not print at anywhere near the quality of a new one. But I could not possibly afford a new cartridge every month or so, and I was lucky to have gotten ahold of a pint of ink, so I made do.

Making do was the byword of Terry Frohm, the technician at the Chuuk branch of the Coral Reef Research Laboratory, in Neauwo. He was a bit of a bricoleur (see The Savage Mind, by Claude Levi-Strauss), and had cobbled together many of the systems at the lab from scrounged articles: the water tank feed was a prime example. I resolved to pull together a book about "making do", beginning with ideas from Terry.

Back to OCR: tesseract, gscan2pdf, and unpaper.

Last night one page of the printouts of my seasonality files surfaced on top of the pile on my desk. I had recently played around with tesseract, and I had read a couple of recent articles. OCR was beginning to surface. Some articles and links:

Hi, I updated the ebuild for gscan2pdf in bugzilla. I thought I'd put some information about the app here since this thread is the only result when searching for gscan2pdf on the forums. Also, I think anyone who wants to try gscan2pdf might want to know about the issues with unpaper, which is the topic of this thread.

The ebuild for gscan2pdf-0.9.16 includes support for tesseract-ocr. This means that with gscan2pdf, it's now possible to scan in a document, have OCR run automatically, and then when the scan is exported to pdf the OCR text will be attached as a comment or annotation. This text can be viewed using the Acrobat pdf reader, but, more importantly perhaps, desktop search engines like beagle will index it. So this makes pdfs of scanned paper much easier to find, especially if one has a lot of them. Unlike other FOSS OCR apps, tesseract actually works reasonably well, certainly well-enough for indexing purposes. I don't know of another GUI frontend for tesseract.

Of course, gscan2pdf has a lot of other features, including ADF (automatic document feeder) support, creation of multi-page pdfs, thumbnail previews for easy page reordering, export to tiff and djvu, and so on. It uses libsane, gtk2-perl and PDF-API2. More info here:
I think it's definitely worth a try for anyone who wants to scan books, bills, or other documents (and then be able to find them easily with desktop search).

To get back on topic, gscan2pdf also can make use of unpaper to clean up pages after scanning (it's the only frontend I know of for unpaper, too). The issue is that unpaper.c won't compile without some specific, and perhaps broken, compiler options (-ftree-vectorize, in particular). In fact, I can't get it to compile at all on my system, even when I use the same compiler options as used in preparing the binary that is distributed along with the source code. Has anyone been able to compile an unpaper binary from unpaper.c? If someone wants to try, the tarball with the source file can be found here:

Also, there doesn't seem to have been any response at all to the bug filed with unpaper upstream last April. I'm not sure of the best way to approach the situation at this point. Because I can't make a binary from the .c file, it doesn't seem like a packaging or ebuild problem to me.

As a workaround, I've been using the precompiled binary for unpaper (I copied it to /usr/bin). I do realize that this is not a very good solution. But unpaper really does improve the look of my scans and it also makes the OCR work even better, so I've just settled on this unsatisfying compromise for the moment.

Anyway, although I suppose that the unpaper issues will keep gscan2pdf out of portage until they are resolved, I'd be interested in hearing any feedback on all of this stuff, including the gscan2pdf ebuild I linked earlier. Thanks. :)

I don't know how you are going to install gscan2pdf on another system. Unpaper is required for the use of gscan2pdf. And gscan2pdf uses tesseract or gocr. I have both installed, but I don't know whether gocr works. It used to be a PITA.

Here are the links for these two bits:
Gscan2pdf is a GUI. It can do the scan and then the adjustments using unpaper (another way of doing the adjustments is explained in the Linux Journal article; it worked for me also). One more click and it does the scan.
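For anyone who would rather script the clean-up and OCR steps than click through the GUI, here is a minimal Python sketch of the same sequence. It is not how gscan2pdf does it internally; it only assembles the two command lines in their basic forms, `unpaper input output` and `tesseract image outputbase`. The function names and file-naming scheme are my own inventions for illustration, and the sketch assumes the scan is already a PNM-family image (the format unpaper reads) and that the installed tesseract can read that format too; older tesseract releases wanted TIFF, in which case a conversion step would be needed in between.

```python
import shutil
import subprocess
from pathlib import Path


def build_pipeline(scan: Path, workdir: Path) -> list[list[str]]:
    """Return the commands that clean one scanned page and then OCR it.

    Hypothetical helper: `scan` is assumed to be a .pbm/.pgm/.ppm image.
    """
    cleaned = workdir / f"{scan.stem}-clean{scan.suffix}"
    text_base = workdir / scan.stem  # tesseract appends ".txt" to this base name
    return [
        ["unpaper", str(scan), str(cleaned)],         # deskew and remove scan noise
        ["tesseract", str(cleaned), str(text_base)],  # write recognized text
    ]


def run_pipeline(scan: Path, workdir: Path) -> Path:
    """Run both steps and return the path of the resulting text file."""
    workdir.mkdir(parents=True, exist_ok=True)
    for cmd in build_pipeline(scan, workdir):
        if shutil.which(cmd[0]) is None:
            raise FileNotFoundError(f"{cmd[0]} is not installed or not on PATH")
        subprocess.run(cmd, check=True)  # stop at the first failing step
    return workdir / f"{scan.stem}.txt"
```

Calling `run_pipeline(Path("page1.pgm"), Path("out"))` would, under those assumptions, leave the cleaned image and `page1.txt` in the `out` directory. Keeping the command construction separate from the execution makes it easy to print the commands and check them by hand first.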

Pages with multiple columns required me to fire up the GIMP and save each column as a separate file.
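The column splitting could probably be scripted as well. The sketch below is not what I did (I used the GIMP by hand); it assumes the columns are of equal width and run the full height of the page, which will not hold for every layout. The function names are hypothetical, and the second function relies on the Pillow imaging library being installed.

```python
from pathlib import Path


def column_boxes(width: int, height: int, n_columns: int) -> list[tuple[int, int, int, int]]:
    """Return one (left, upper, right, lower) crop box per equal-width column."""
    if n_columns < 1:
        raise ValueError("n_columns must be at least 1")
    # Column edges in pixels, rounded so the boxes tile the page with no gaps.
    edges = [round(i * width / n_columns) for i in range(n_columns + 1)]
    return [(edges[i], 0, edges[i + 1], height) for i in range(n_columns)]


def split_page(image_path: str, n_columns: int) -> list[str]:
    """Save each column of the page as its own TIFF and return the file names."""
    from PIL import Image  # Pillow; imported here so column_boxes works without it

    page = Image.open(image_path)
    stem = Path(image_path).with_suffix("")
    outputs = []
    for i, box in enumerate(column_boxes(page.width, page.height, n_columns), start=1):
        out = f"{stem}-col{i}.tif"
        page.crop(box).save(out)
        outputs.append(out)
    return outputs
```

Each resulting file can then be fed to the OCR step on its own, so the text of one column does not get interleaved with the next.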

I now have a copy of my page of phenology observations in an editable text file. I hardly had to edit it at all: it was remarkably clean.
