Gentoo Archives: gentoo-user

From: "Urs Schütz" <u.schutz@×××××××.ch>
To: gentoo-user@l.g.o
Subject: Re: [gentoo-user] extracting text, numbers from screencasts
Date: Sat, 09 Apr 2016 00:54:42
Message-Id: 570852C6.1070601@bluewin.ch
In Reply to: Re: [gentoo-user] extracting text, numbers from screencasts by Helmut Jarausch
On 04/08/16 11:30, Helmut Jarausch wrote:
> On 04/08/2016 03:26:53 PM, hw wrote:
>>
>> Hi,
>>
>> what would be the best approach to extract data
>> from a screencast?
>>
>> The task is to acquire some data from the display of
>> a GUI program used interactively by a user. There are
>> a couple 'fields' (as in "designated areas of the display")
>> in which the relevant data is being displayed while the
>> program is being used. The acquired data needs to be
>> entered into a mysql database, preferably as soon as
>> possible. (The program needs windoze, and the sources
>> are unavailable :( )
>>
>>
>> The idea is to make a screen recording and postprocess
>> the recording with some sort of OCR software. This might
>> require using ffmpeg (or the like) to create a single
>> image from each frame of the recording; then treat each
>> image with an OCR software to get the interesting data
>> which can then be entered into the database.
>>
>> Data to extract is mostly numbers. The relevant fields
>> can be expected to be either filled or empty. The FPS rate
>> of the recording can be kept reasonably low, like 1 FPS,
>> or perhaps even less, depending on how frequent the relevant
>> fields change.
>>
>> Using tesseract comes to mind, but after reading that
>>
>> "Tesseract's output will be very poor quality if the input
>> images are not preprocessed to suit it: Images (especially
>> screenshots) must be scaled up such that the text x-height
>> is at least 20 pixels,[12] any rotation or skew must be
>> corrected or no text will be recognized, low-frequency
>> changes in brightness must be high-pass filtered, or
>> Tesseract's binarization stage will destroy much of the
>> page, and dark borders must be manually removed, or they
>> will be misinterpreted as characters."[1]
>>
>> I'm even more doubtful that this would produce usable
>> results with sufficient reliability.
>>
>> So what might be the best way to get text/numbers out of
>> what a program displays?
>>
>>
>> [1]: https://en.wikipedia.org/wiki/Tesseract_(software)
>>
>
> I can't help with Gentoo.
> Try to find an old (free) version of FineReader which runs under wine.
> If you do it only occasionally, transfer the image to an Android phone
> where there a good and cheap OCR apps, even FineReader.
>
>
>

I recently had surprisingly good results with tesseract when digitizing
photographed pages of an old book, so today I gave it a try with a
cropped screenshot of Thunderbird.

$ convert scrsht.png -type Grayscale -filter point -resize 300% \
    -normalize upscaled.png
$ tesseract -l eng upscaled.png out
$ less out.txt

convert is from media-gfx/imagemagick-6.9.0.3
tesseract is app-text/tesseract-3.04.00-r2

Here are my findings:
Graphical elements of about the same size as a character show up as strange letters.
Serif fonts were recognized better than sans-serif fonts, even at smaller font sizes.
Text that can be spell-checked was recognized almost perfectly.
Gentoo-specific words like "GLSA" and "NVMe" were not recognized correctly.
Selected text (white on a blue background) was poorly recognized.
Dates were not recognized correctly.
Times were read correctly.
"convert" took 0.4 seconds for an initial screenshot size of 956 x 639 pixels.
"tesseract" took a little more than 6 s on an Intel(R) Core(TM) i7-4710MQ CPU @ 2.50GHz, without OpenCL.
The image conversion and the tesseract OCR could easily be scripted.

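As a rough illustration of how the two steps might be scripted, here is a minimal sketch that wraps the convert and tesseract invocations shown earlier in shell functions. The names OUT_DIR, DRY_RUN, ocr_frame and ocr_all are made up for this example, and the script has not been tuned for any particular GUI.

```shell
#!/bin/sh
# Sketch only: batch version of the convert + tesseract steps.
# OUT_DIR and DRY_RUN are hypothetical names, not part of either tool.
OUT_DIR=${OUT_DIR:-ocr}

# Run a command, or just print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Upscale one screenshot and OCR it; the text ends up in $OUT_DIR/<name>.txt.
ocr_frame() {
    src=$1
    name=$(basename "$src" .png)
    run mkdir -p "$OUT_DIR" &&
    run convert "$src" -type Grayscale -filter point -resize 300% \
        -normalize "$OUT_DIR/$name.png" &&
    run tesseract "$OUT_DIR/$name.png" "$OUT_DIR/$name" -l eng
}

# Process every file given as an argument and report failures on stderr.
ocr_all() {
    for f in "$@"; do
        ocr_frame "$f" || echo "OCR failed for $f" >&2
    done
}
```

After sourcing the file, something like "ocr_all frames/*.png" would process a directory of screenshots, and "DRY_RUN=1 ocr_all frames/*.png" would only print the commands. Writing the upscaled images to a separate directory keeps them from being picked up again on a second run.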
In short, I would say the following steps help with tesseract:
Avoid GUIs with a lot of graphics.
Try to screenshot just the relevant areas.
Increase the GUI font size.
Configure the GUI to use a well-known serif font, or train tesseract for the specific font used.
Configure the GUI to use high contrast; avoid colors that turn into gray when converted to grayscale.
Tesseract's run time could be improved by enabling OpenCL.

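The advice to capture only the relevant areas could be sketched as follows for a single numeric field: crop the field out of a full screenshot, upscale it as before, and restrict tesseract to digits. The geometry, the file names, and the DRY_RUN switch are placeholders; tessedit_char_whitelist is a tesseract configuration variable, but how well it is honored may depend on the tesseract version and engine, so treat this as an assumption to verify.

```shell
#!/bin/sh
# Sketch only: OCR one numeric field cut out of a screenshot.
# Geometry, file names and DRY_RUN are made-up placeholders.

# Run a command, or just print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# ocr_field IMAGE GEOMETRY OUTBASE
# GEOMETRY uses ImageMagick's WxH+X+Y form, e.g. 200x40+100+300.
# The recognized text is written to OUTBASE.txt.
ocr_field() {
    img=$1 geom=$2 out=$3
    # Crop first so only the field is upscaled, then drop the page offset.
    run convert "$img" -crop "$geom" +repage -type Grayscale \
        -filter point -resize 300% -normalize "$out.png" &&
    # Limit recognition to digits and the decimal point.
    run tesseract "$out.png" "$out" -l eng \
        -c tessedit_char_whitelist=0123456789.
}
```

A call such as "ocr_field shot.png 200x40+100+300 field1" would then leave the number in field1.txt. Since each field is a single line of text, a page segmentation mode for single lines (-psm 7 in tesseract 3.x) might help further, though that has not been tested here.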
I would be interested to hear about your findings with numerical data,
and which approach finally works for you.

Urs
