Less than two weeks ago, I got a new laptop from work. In the period since, I have installed three different distributions of GNU/Linux (four if you count two different versions of Ubuntu): Ubuntu Alpha (Gutsy Gibbon, early version); Sabayon, from a recent DVD; the current stable version of Ubuntu; and Gentoo. Gentoo won again, hands down, although the Ubuntu stable version did fine for the less than a day it was on my machine.
A brief summary of the four installs will be pertinent.
Ubuntu unstable was indeed unstable. I couldn't connect to the WPA encrypted wireless network at work, the main reason for abandonment; but in reality, the installation became very, very buggy as I upgraded.
Sabayon was another top contender: Gentoo-derived, with a very high glitz level, cool if it worked. In all fairness, I had trouble with the DVD itself, but the install was a mess, with many inexplicable instabilities. I eventually threw up my hands in frustration and started downloading a Gentoo 2007.0 Live CD and also a Minimal Install CD.
While I was downloading those, I found a CD of Ubuntu stable, and I installed it. I was able to connect to WPA wireless, and I even considered bailing on the Gentoo install. I knew the Gentoo installation would take time, and I was afraid of the WPA issues (even now I haven't solved that problem, but I feel a little better about it). Ubuntu stable was better, but the die had been cast.
I have to say that Ubuntu did give me serious troubles partitioning. I have had that trouble before with Ubuntu, and I do not like the utility that is bundled with the Live CD---the Disk Druid. I have botched several partitioning jobs over the past couple of years, due to, I think, the non-intuitive command set for that GUI utility. It's easy, I suppose, if one has made a decision to install in a particular way, but there are too many things that can go wrong.
And did go wrong in my case: the Windoze partition was blown off. I admit I was relieved, and went ahead and repartitioned for Gentoo, leaving a 5 GB partition for Ubuntu in case I decide I need a short-term bailout, or some extra storage. (This machine has only 40 GB of HDD storage!)
Finally, I tried Gentoo with a Minimal Install CD. This is a CD that requires a working internet connection, the faster the better. I did have a connection at home, but I gave this up when the Live CD was finally burned.
The Live CD worked well. Better than the last time, when the Gentoo Live CD bombed. This time it went well, although I had to study the installation docs carefully a couple of times. I still believe, as I have for a couple of years, that Gentoo docs are the best of all; but the installation docs are so numerous it's hard to keep them straight. I did find what I needed, although I had to watch carefully along the way.
The Live CD has taken several days to get cherried out. As of tonight, about four or five days out, it's working extremely well.
Why do I stay with Gentoo? It takes days to install, a discouraging prospect, and I'm not about to go through it a second time. It's the only distro that is solid enough not to require several installation attempts, and it had better be.
This year's Gentoo is better still. Tonight I installed the most unstable version of Avidemux, and it works perfectly. That is probably a litmus test. VLC seems fine, and I have been working on MPlayer. Gentoo has taken everything I've thrown at it, and that's a lot.
Sunday, September 23, 2007
Friday, August 31, 2007
An OCR Success Story: tesseract, gscan2pdf, and unpaper.
Wherein I opine that OCR is ready for prime time on GNU/Linux.
For some years I've been carrying around a manila file of a printout of a digital file for a project of mine from the early 90s, a list of observed seasonal events over a two or three year period on Tol Island, in Chuuk Lagoon, E. Caroline Islands. It seemed to be the litmus test of OCR. I didn't believe it would ever be possible to scan and reduce it back to characters. It was printed on an old HP Portable Deskjet printer.
At the time I was using a Toshiba Satellite laptop. A friend had given me two 10 Watt solar panels, really ancient ones. I had scrounged a 100 AHr deep cycle battery from the local Air Force Civic Action Team (CAT Team) when they had to maintain their batteries on their heavy equipment. Every three weeks or so I would carry the battery into town, and the CAT team would charge it. In between, I had the 10 Watt panel to trickle charge the beast. It kept my computer working for about three weeks! Not bad. And I had built an adaptor for my Zeiss microscope, to run it from 12VDC, with a rheostat dimmer, so it was just as good as power mains. So there was a bit of load on the old Deep Cycle battery after all: I spent many hours a day either on the computer, or with the microscope. (The Stereomicroscope that I'd gotten from Jerry Bakus at USC was illuminated by a kerosene pressure lantern).
I jury rigged the printer to run off of 12VDC also. The beauty of this was that by bypassing the power adaptor plug, and wiring directly in parallel with the batteries, I saved a considerable amount of power, because the printer had a built in sleep mode that only worked when it was running off battery power.
This printer wasn't a great one, and I am sure I harmed the quality by refilling the cartridges. In those days, refilling technology (not to mention inkjet technology) was in a primitive state, and a refilled cartridge, if it worked at all, would not print at anywhere near the quality of a new one. But I could not possibly afford a new cartridge every month or so, and it was only by luck that I had gotten ahold of a pint of ink, so I made do.
Making do was the byword of Terry Frohm, the technician at the Chuuk branch of the Coral Reef Research Laboratory, in Neauwo. He was a bit of a bricoleur (see The Savage Mind, by Claude Lévi-Strauss), and had cobbled together many of the systems at the lab from scrounged articles: the water tank feed was a prime example. I resolved to pull together a book about "making do", beginning with ideas from Terry.
Back to OCR: tesseract, gscan2pdf, and unpaper.
Last night one page of the printouts of my seasonality files surfaced on top of the pile on my desk. I had recently played around with tesseract, and I had read a couple of recent articles. OCR was beginning to surface. Some articles and links:
- An article at Linux Journal on Tesseract. This one works!
- A Gentoo wiki howto on OCR, including some material on Tesseract
- A howto for Ubuntu and Tesseract
- Ocropus
- Tesseract (Google Summer of Code)
- And the one that makes it all work together: gscan2pdf and unpaper. Here's an excerpt from the gentoo forum about the installation on Gentoo:
I don't know how you are going to install gscan2pdf on another system. Unpaper is required for the use of gscan2pdf. And gscan2pdf uses tesseract or gocr. I have both installed, but I don't know whether gocr works. It used to be a PITA.
Hi, I updated the ebuild for gscan2pdf in bugzilla. I thought I'd put some information about the app here since this thread is the only result when searching for gscan2pdf on the forums. Also, I think anyone who wants to try gscan2pdf might want to know about the issues with unpaper, which is the topic of this thread.
The ebuild for gscan2pdf-0.9.16 includes support for tesseract-ocr. This means that with gscan2pdf, it's now possible to scan in a document, have OCR run automatically, and then when the scan is exported to pdf the OCR text will be attached as a comment or annotation. This text can be viewed using the Acrobat pdf reader, but, more importantly perhaps, desktop search engines like beagle will index it. So this makes pdfs of scanned paper much easier to find, especially if one has a lot of them. Unlike other FOSS OCR apps, tesseract actually works reasonably well, certainly well-enough for indexing purposes. I don't know of another GUI frontend for tesseract.
Of course, gscan2pdf has a lot of other features, including ADF (automatic document feeder) support, creation of multi-page pdfs, thumbnail previews for easy page reordering, export to tiff and djvu, and so on. It uses libsane, gtk2-perl and PDF-API2. More info here: http://gscan2pdf.sourceforge.net/
I think it's definitely worth a try for anyone who wants to scan books, bills, or other documents (and then be able to find them easily with desktop search).
To get back on topic, gscan2pdf also can make use of unpaper to clean up pages after scanning (it's the only frontend I know of for unpaper, too). The issue is that unpaper.c won't compile without some specific, and perhaps broken, compiler options (-ftree-vectorize, in particular). In fact, I can't get it to compile at all on my system, even when I use the same compiler options as used in preparing the binary that is distributed along with the source code. Has anyone been able to compile an unpaper binary from unpaper.c? If someone wants to try, the tarball with the source file can be found here: http://unpaper.berlios.de/
Also, there doesn't seem to have been any response at all to the bug filed with unpaper upstream last April. I'm not sure of the best way to approach the situation at this point. Because I can't make a binary from the .c file, it doesn't seem like a packaging or ebuild problem to me.
As a workaround, I've been using the precompiled binary for unpaper (I copied it to /usr/bin). I do realize that this is not a very good solution. But unpaper really does improve the look of my scans and it also makes the OCR work even better, so I've just settled on this unsatisfying compromise for the moment.
Anyway, although I suppose that the unpaper issues will keep gscan2pdf out of portage until they are resolved, I'd be interested in hearing any feedback on all of this stuff, including the gscan2pdf ebuild I linked earlier. Thanks.
Here are the links for these two bits:
Gscan2pdf is a GUI: it can do the scan and the adjustments using unpaper (another way of doing the adjustments is explained in the Linux Journal article; it worked for me also). One more click and it does the scan.
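For anyone who would rather script the same steps than click through the GUI, here is a minimal sketch of the clean-then-recognize sequence. It is not what gscan2pdf does internally, only an illustration. The file names and the helper names (`unpaper_command`, `tesseract_command`, `ocr_page`) are hypothetical, and it assumes that `unpaper` and `tesseract` are on the PATH and that the installed tesseract can read the PNM file unpaper writes (older releases may need a conversion to TIFF first).

```python
import subprocess
from pathlib import Path


def unpaper_command(scan: Path, cleaned: Path) -> list[str]:
    """Build the cleanup call: unpaper <input> <output>."""
    return ["unpaper", str(scan), str(cleaned)]


def tesseract_command(image: Path, output_base: Path, lang: str = "eng") -> list[str]:
    """Build the OCR call: tesseract <image> <outputbase> -l <lang>.

    Tesseract adds the ".txt" extension to the output base itself.
    """
    return ["tesseract", str(image), str(output_base), "-l", lang]


def ocr_page(scan: Path, workdir: Path, run=subprocess.run) -> Path:
    """Clean one scanned page with unpaper, then OCR it with tesseract.

    `run` is injectable so the sequence can be tried without the tools installed.
    Returns the path of the text file tesseract is expected to write.
    """
    cleaned = workdir / f"{scan.stem}-clean{scan.suffix}"
    output_base = workdir / scan.stem
    run(unpaper_command(scan, cleaned), check=True)
    run(tesseract_command(cleaned, output_base), check=True)
    return workdir / f"{scan.stem}.txt"
```

Calling `ocr_page(Path("page1.pgm"), Path("out"))` would run the two tools in order, provided the hypothetical `out` directory already exists.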
Multiple columns required me to fire up the GIMP and save the columns into separate files.
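The column split was done by hand in the GIMP, but the arithmetic behind it is simple enough to sketch. The helper below is hypothetical and assumes equal-width columns separated by a fixed gutter, which a real page may not have. It returns one box per column in (left, upper, right, lower) order, the order an image library such as Pillow expects for cropping.

```python
def column_boxes(width: int, height: int, columns: int, gutter: int = 0) -> list[tuple[int, int, int, int]]:
    """Return one (left, upper, right, lower) crop box per column.

    Assumes equal-width columns; half of the gutter is trimmed from each
    side of every interior column boundary.
    """
    if columns < 1:
        raise ValueError("columns must be at least 1")
    column_width = width // columns
    half_gutter = gutter // 2
    boxes = []
    for i in range(columns):
        # The first column starts at the page edge; later ones skip half the gutter.
        left = i * column_width + (half_gutter if i > 0 else 0)
        # The last column runs to the page edge and absorbs any rounding remainder.
        if i == columns - 1:
            right = width
        else:
            right = (i + 1) * column_width - half_gutter
        boxes.append((left, 0, right, height))
    return boxes
```

Each box can then be cropped out and saved as its own file, so the OCR sees one column at a time.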
I now have a copy of my page of phenology observations in an editable text file. I hardly had to edit it at all: it was remarkably clean.
Discussion of LaTeX flipbook idea
On the newsgroup comp.text.tex there is a discussion of how to include a flip book in a book.
The winning post, in my opinion so far, is a suggestion to look at the documentation for the package fancyhdr.
Saturday, August 4, 2007
In the beginning ...
I am starting a blog of notes about GNU/Linux.
GNU/Linux is an important part of my life, but I am not a developer. This blog will be a means of organizing my thoughts about GNU/Linux and its place in my life, and of thinking about what I can do to give something back to the community.
As a Gentoo user, I noticed a need for user feedback and participation, mentioned recently by Daniel Robbins, original architect of Gentoo, in his blog.
I also need to keep track of my own meanderings, on the continuing learning journey of using GNU/Linux and Free Software generally.