Gentoo Archives: gentoo-user

From: Francisco Ares <frares@×××××.com>
To: gentoo-user <gentoo-user@l.g.o>
Subject: Re: [gentoo-user] multi-region OCR
Date: Wed, 30 Nov 2016 18:42:19
Message-Id: CAHH9eM58yai8=VLPmaO2LOij_H5BxkohVbDRz3ndo-OLitr0AQ@mail.gmail.com
In Reply to: Re: [gentoo-user] multi-region OCR by Michael Mol
1 2016-11-30 16:28 GMT-02:00 Michael Mol <mikemol@×××××.com>:
2
3 > On Wednesday, November 30, 2016 05:34:25 PM J. Roeleveld wrote:
4 > > On November 30, 2016 6:03:36 PM GMT+01:00, Michael Mol <
5 > mikemol@×××××.com>
6 > wrote:
7 > > >On Wednesday, November 30, 2016 10:43:13 AM J. Roeleveld wrote:
8 > > >> On Tuesday, November 29, 2016 11:18:36 PM karl@××××××××.se wrote:
9 > > >> > Michael Mol:
10 > > >> > ...
11 > > >> >
12 > > >> > > xsane would have let me do it during the scan process if I'd
13 > > >
14 > > >thought of
15 > > >
16 > > >> > > it
17 > > >> > > then, but the scans are done, drives aren't there any more.
18 > > >
19 > > >Something
20 > > >
21 > > >> > ...
22 > > >> >
23 > > >> > If xsane solves your need why don't you just print your scans so
24 > > >
25 > > >xsane
26 > > >
27 > > >> > can do its job ?
28 > > >>
29 > > >> There has to be a way to do this without killing an entire forest...
30 > > >
31 > > >And big chunks of ink cartridges. The scans stretched the contrast so I
32 > > >can
33 > > >clearly read the drive labels through the translucent anti-static bags,
34 > > >which
35 > > >means a huge chunk of the image (what's outside the labels) is pure
36 > > >black.
37 > > >
38 > > >Which I could get around by spending fifteen minutes munging things in
39 > > >the Gimp
40 > > >before printing, but at that point, I may as well just transcribe
41 > > >things
42 > > >manually at that point.
43 > > >
44 > > >Looking for something reasonably simple to improve the general
45 > > >workflow. I'd
46 > > >have hoped something would have already been available on Linux; it'd
47 > > >be easy
48 > > >enough to copy the scans to my phone and feed them through Google
49 > > >Goggles for
50 > > >the desired output, but then I'm deliberately filtering company data
51 > > >through an
52 > > >outside entity.
53 > >
54 > > Did you manage to use that link I sent?
55 >
56 > I did. tesseract almost worked, even separating the regions cleanly in its
57 > output, but it seems, sadly, that the 300dpi scans were insufficient to
58 > get a
59 > good read; lots of clear corruption of the text, so things like serial
60 > numbers, model numbers, version numbers--everything you'd care
61 > about--would be
62 > highly suspect.
63 >
64 > The next tool that looked like it might work, gscan2pdf, wasn't in portage,
65 > and with the semi-garbled output from tesseract suggesting the scans were
66 > too
67 > poor quality, I didn't pursue further.
68 >
69 > --
70 > :wq
71
72
73 Well, I've had a similar issue. I used GIMP to scale the image to double
74 its size (width and height, of course), filtered it a bit (edge enhancement),
75 and split it into several images, one per region of interest.
76
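If doing this by hand in GIMP gets tedious, the same three steps (double the size, edge-enhance, split per region) could probably be scripted. Below is only a rough sketch, assuming Python with the Pillow library is installed. The function name `prepare_regions` and the box coordinates are made up for illustration, and I have not checked whether this actually improves the OCR result on scans like these.

```python
from PIL import Image, ImageFilter


def prepare_regions(image, boxes, scale=2):
    """Upscale, edge-enhance, and crop one image per region of interest.

    `boxes` holds (left, upper, right, lower) tuples given in the pixel
    coordinates of the ORIGINAL scan; they are multiplied by `scale`
    so they line up with the enlarged image.
    """
    # Work in grayscale; colour is assumed to add nothing for label text.
    gray = image.convert("L")

    # Double width and height (scale=2) with bicubic resampling.
    big = gray.resize((gray.width * scale, gray.height * scale), Image.BICUBIC)

    # Pillow's built-in edge-enhancement filter, a stand-in for the GIMP one.
    sharp = big.filter(ImageFilter.EDGE_ENHANCE)

    # One cropped image per region of interest.
    return [sharp.crop(tuple(v * scale for v in box)) for box in boxes]


if __name__ == "__main__":
    # Hypothetical file name and region coordinates, for illustration only.
    scan = Image.open("scan.png")
    regions = prepare_regions(scan, [(100, 80, 900, 400), (100, 500, 900, 820)])
    for i, region in enumerate(regions):
        region.save("region_%d.png" % i)
```

Each saved file can then be fed to tesseract on its own, for example `tesseract region_0.png region_0`, which writes the recognised text to `region_0.txt`.
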
77 Of course, there might be an easier way ;-)
78
79 Francisco