Gentoo Archives: gentoo-user

From: Helmut Jarausch <jarausch@××××××.be>
To: gentoo-user@l.g.o
Subject: Re: [gentoo-user] extracting text, numbers from screencasts
Date: Fri, 08 Apr 2016 14:30:32
Message-Id: m/+pKjbYIbfzWo/QhKO4vz@Ml3T0rT5WgVXnfMZucxeg
In Reply to: [gentoo-user] extracting text, numbers from screencasts by hw
On 04/08/2016 03:26:53 PM, hw wrote:
>
> Hi,
>
> what would be the best approach to extract data
> from a screencast?
>
> The task is to acquire some data from the display of
> a GUI program used interactively by a user. There are
> a couple 'fields' (as in "designated areas of the display")
> in which the relevant data is being displayed while the
> program is being used. The acquired data needs to be
> entered into a mysql database, preferably as soon as
> possible. (The program needs windoze, and the sources
> are unavailable :( )
>
>
> The idea is to make a screen recording and postprocess
> the recording with some sort of OCR software. This might
> require using ffmpeg (or the like) to create a single
> image from each frame of the recording; then treat each
> image with an OCR software to get the interesting data
> which can then be entered into the database.
>
> Data to extract is mostly numbers. The relevant fields
> can be expected to be either filled or empty. The FPS rate
> of the recording can be kept reasonably low, like 1 FPS,
> or perhaps even less, depending on how frequent the relevant
> fields change.
>
> Using tesseract comes to mind, but after reading that
>
> "Tesseract's output will be very poor quality if the input
> images are not preprocessed to suit it: Images (especially
> screenshots) must be scaled up such that the text x-height
> is at least 20 pixels,[12] any rotation or skew must be
> corrected or no text will be recognized, low-frequency
> changes in brightness must be high-pass filtered, or
> Tesseract's binarization stage will destroy much of the
> page, and dark borders must be manually removed, or they
> will be misinterpreted as characters."[1]
>
> I'm even more doubtful that this would produce usable
> results with sufficient reliability.
>
> So what might be the best way to get text/numbers out of
> what a program displays?
>
>
> [1]: https://en.wikipedia.org/wiki/Tesseract_(software)
>

I can't help with Gentoo.
Try to find an old (free) version of FineReader that runs under Wine.
If you only need this occasionally, transfer the images to an Android phone,
where there are good, cheap OCR apps, including FineReader.
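
If you stay with the ffmpeg-plus-OCR route from the quoted message, the preprocessing step could look roughly like the sketch below. It is only an illustration under assumptions: the field rectangle, the measured x-height, the file names and all three function names are hypothetical, and only the 20-pixel x-height target and the 1 FPS rate come from the quoted text. The ffmpeg filters used (fps, crop, scale) are standard ones; the OCR call itself is left out because it depends on which program you end up with.

```python
import math


def scale_factor(x_height_px, target_px=20):
    """Smallest integer factor that brings the text x-height up to target_px."""
    return max(1, math.ceil(target_px / x_height_px))


def ffmpeg_args(video, field, factor, out_pattern, fps=1):
    """Build an ffmpeg command line that samples frames at `fps`,
    crops one display field and enlarges it by `factor`.

    `field` is (x, y, width, height) in pixels; these coordinates are
    an assumption and must be measured from the actual recording.
    """
    x, y, w, h = field
    filters = (
        f"fps={fps},"
        f"crop={w}:{h}:{x}:{y},"
        f"scale=iw*{factor}:ih*{factor}:flags=lanczos"
    )
    return ["ffmpeg", "-i", video, "-vf", filters, out_pattern]


def changed_values(readings):
    """Yield a reading only when it differs from the previous one, so a
    field that stays the same over many frames gives one database row."""
    last = object()  # sentinel that never equals a real reading
    for reading in readings:
        if reading != last:
            yield reading
            last = reading
```

The list returned by ffmpeg_args() can be passed to subprocess.run(..., check=True). Each resulting PNG then goes to the OCR program, and the recognized strings go through changed_values() before anything is written to the database. An empty field is kept as an empty string here, so a change from filled to empty still counts as a change.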

Replies

Subject Author
Re: [gentoo-user] extracting text, numbers from screencasts "Urs Schütz" <u.schutz@×××××××.ch>
Re: [gentoo-user] extracting text, numbers from screencasts hw <hw@×××××.de>