Gentoo Archives: gentoo-user

From: Landis Blackwell <blackwelllandis@×××××.com>
To: gentoo-user@l.g.o
Subject: Re: [gentoo-user] multi-region OCR
Date: Wed, 30 Nov 2016 19:48:35
Message-Id: 1a2dfbf4-e061-3162-58c2-1da289430568@gmail.com
In Reply to: Re: [gentoo-user] multi-region OCR by Michael Mol
Did you train tesseract, by any chance? And could I get some sample images?

Landis


On 11/30/2016 12:28 PM, Michael Mol wrote:
> On Wednesday, November 30, 2016 05:34:25 PM J. Roeleveld wrote:
>> On November 30, 2016 6:03:36 PM GMT+01:00, Michael Mol <mikemol@×××××.com>
>> wrote:
>>> On Wednesday, November 30, 2016 10:43:13 AM J. Roeleveld wrote:
>>>> On Tuesday, November 29, 2016 11:18:36 PM karl@××××××××.se wrote:
>>>>> Michael Mol:
>>>>> ...
>>>>>
>>>>>> xsane would have let me do it during the scan process if I'd
>>>>>> thought of it then, but the scans are done, drives aren't there
>>>>>> any more. Something
>>>>>
>>>>> ...
>>>>>
>>>>> If xsane solves your need why don't you just print your scans so
>>>>> xsane can do its job?
>>>> There has to be a way to do this without killing an entire forest...
>>> And big chunks of ink cartridges. The scans stretched the contrast so I
>>> can clearly read the drive labels through the translucent anti-static
>>> bags, which means a huge chunk of the image (what's outside the labels)
>>> is pure black.
>>>
>>> Which I could get around by spending fifteen minutes munging things in
>>> GIMP before printing, but at that point I may as well just transcribe
>>> things manually.
>>>
>>> Looking for something reasonably simple to improve the general
>>> workflow. I'd have hoped something would have already been available
>>> on Linux; it'd be easy enough to copy the scans to my phone and feed
>>> them through Google Goggles for the desired output, but then I'm
>>> deliberately filtering company data through an outside entity.
>> Did you manage to use that link I sent?
> I did. tesseract almost worked, even separating the regions cleanly in its
> output, but it seems, sadly, that the 300dpi scans were insufficient to get a
> good read; lots of clear corruption of the text, so things like serial
> numbers, model numbers, version numbers--everything you'd care about--would be
> highly suspect.
>
> The next tool that looked like it might work, gscan2pdf, wasn't in portage,
> and with the semi-garbled output from tesseract suggesting the scans were of
> too poor quality, I didn't pursue it further.
>
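
Since the quoted description says everything outside the labels is pure
black, one thing that might be worth trying is to locate the bright label
regions first and OCR each one on its own. The following is only a rough
sketch of that idea, not something tested against the actual scans. It
assumes the scan has already been loaded as rows of 8-bit grayscale values;
the function name find_label_regions and the threshold and min_area defaults
are made up for illustration and would need tuning.

```python
from collections import deque


def find_label_regions(pixels, threshold=32, min_area=4):
    """Return bounding boxes of bright regions on a dark background.

    pixels is a list of rows of grayscale values (0 = black, 255 = white).
    Each box is (left, top, right, bottom), with right and bottom exclusive.
    threshold and min_area are illustrative guesses, not tuned values.
    """
    height = len(pixels)
    width = len(pixels[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    boxes = []

    for y in range(height):
        for x in range(width):
            if seen[y][x] or pixels[y][x] <= threshold:
                continue

            # Breadth-first flood fill over 4-connected bright pixels.
            queue = deque([(x, y)])
            seen[y][x] = True
            left = right = x
            top = bottom = y
            area = 0

            while queue:
                cx, cy = queue.popleft()
                area += 1
                left, right = min(left, cx), max(right, cx)
                top, bottom = min(top, cy), max(bottom, cy)
                for nx, ny in ((cx + 1, cy), (cx - 1, cy),
                               (cx, cy + 1), (cx, cy - 1)):
                    if (0 <= nx < width and 0 <= ny < height
                            and not seen[ny][nx]
                            and pixels[ny][nx] > threshold):
                        seen[ny][nx] = True
                        queue.append((nx, ny))

            # Skip specks that are too small to be a label.
            if area >= min_area:
                boxes.append((left, top, right + 1, bottom + 1))

    return boxes
```

The boxes use the same (left, upper, right, lower) ordering that Pillow's
Image.crop accepts, so each region could be cropped out and enlarged before
being handed to tesseract. Whether that helps with 300dpi source material is
an open question.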