Gentoo Archives: gentoo-user

From: hw <hw@×××××.de>
To: gentoo-user@l.g.o
Subject: [gentoo-user] extracting text, numbers from screencasts
Date: Fri, 08 Apr 2016 13:27:02
Message-Id: 5707B19D.6030608@gc-24.de
Hi,

what would be the best approach to extract data
from a screencast?

The task is to acquire some data from the display of
a GUI program used interactively by a user. There are
a couple of 'fields' (as in "designated areas of the display")
in which the relevant data is displayed while the
program is being used. The acquired data needs to be
entered into a MySQL database, preferably as soon as
possible. (The program needs windoze, and the sources
are unavailable :( )

The idea is to make a screen recording and postprocess
the recording with some sort of OCR software. This might
require using ffmpeg (or the like) to create a single
image from each frame of the recording, then run each
image through OCR software to get the interesting data,
which can then be entered into the database.
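
A minimal sketch of those two steps in Python. It only builds the
command lines and leaves running them to subprocess. The file names
(session.mkv, frames/), the helper names and the character whitelist
are made up for illustration. "--psm 7" asks tesseract to treat the
image as a single line of text; older 3.x releases may expect "-psm"
instead, and whether the whitelist is honoured depends on the
tesseract version and engine.

```python
def ffmpeg_frames_cmd(video, out_dir, fps=1.0):
    """Build an ffmpeg command that writes one PNG per sampled frame."""
    # The fps filter resamples the recording to the given frame rate.
    pattern = f"{out_dir}/frame_%06d.png"
    return ["ffmpeg", "-i", str(video), "-vf", f"fps={fps:g}", pattern]


def tesseract_cmd(image, whitelist="0123456789.,-"):
    """Build a tesseract command that prints the recognized text to stdout."""
    return [
        "tesseract", str(image), "stdout",
        "--psm", "7",  # page segmentation mode 7: a single text line
        "-c", f"tessedit_char_whitelist={whitelist}",  # restrict to numeric characters
    ]


if __name__ == "__main__":
    # Hypothetical file names, shown only to illustrate the resulting commands.
    print(" ".join(ffmpeg_frames_cmd("session.mkv", "frames")))
    print(" ".join(tesseract_cmd("frames/frame_000001.png")))
```

Each list can be passed to subprocess.run(cmd, capture_output=True,
text=True); for the tesseract call the recognized text then ends up
in the result's stdout attribute.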

The data to extract is mostly numbers. The relevant fields
can be expected to be either filled or empty. The frame rate
of the recording can be kept reasonably low, like 1 FPS,
or perhaps even less, depending on how frequently the relevant
fields change.
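
Since most frames at a low frame rate would show the same values,
one option is to store a reading only when it changes. The sketch
below assumes the OCR step yields one text string per frame, treats
a decimal comma as a decimal point (no thousands separators), and
uses an in-memory SQLite database purely as a stand-in for MySQL.
The table layout and all names are hypothetical.

```python
import re
import sqlite3

NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def parse_field(ocr_text):
    """Return the first number in the OCR text as a float, or None for an empty field."""
    match = NUMBER.search(ocr_text.replace(",", "."))
    return float(match.group()) if match else None


def changes(readings, fps=1.0):
    """Yield (seconds, value) only for frames whose value differs from the previous frame."""
    previous = object()  # sentinel that never equals a real reading
    for index, value in enumerate(readings):
        if value != previous:
            yield index / fps, value
            previous = value


def store(conn, field, rows):
    """Insert (seconds, value) rows for one field; None is stored as NULL."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS readings (field TEXT, seconds REAL, value REAL)"
    )
    conn.executemany(
        "INSERT INTO readings VALUES (?, ?, ?)",
        [(field, seconds, value) for seconds, value in rows],
    )
    conn.commit()


if __name__ == "__main__":
    # Made-up OCR output for four consecutive frames of one field.
    texts = ["12,5\n", "12,5\n", "", "13\n"]
    rows = list(changes(parse_field(t) for t in texts))
    conn = sqlite3.connect(":memory:")
    store(conn, "weight", rows)
    print(rows)
```

With a MySQL driver the same pattern should carry over, though the
placeholder style in the INSERT statement differs between drivers.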

Using tesseract comes to mind, but after reading that

"Tesseract's output will be very poor quality if the input
images are not preprocessed to suit it: Images (especially
screenshots) must be scaled up such that the text x-height
is at least 20 pixels,[12] any rotation or skew must be
corrected or no text will be recognized, low-frequency
changes in brightness must be high-pass filtered, or
Tesseract's binarization stage will destroy much of the
page, and dark borders must be manually removed, or they
will be misinterpreted as characters."[1]

I'm even more doubtful that this would produce usable
results with sufficient reliability.
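
For a screen recording, the scaling and border requirements from
that quote could probably be handled in the ffmpeg step itself, by
cropping each field and enlarging it before OCR. The sketch below
computes an integer scale factor from an assumed x-height and builds
a matching filter chain; the field coordinates and the 7-pixel
x-height are invented values, and whether this is enough for
reliable recognition would have to be tested.

```python
import math


def scale_factor(x_height_px, target_px=20):
    """Smallest integer factor that brings the text x-height up to the target."""
    if x_height_px <= 0:
        raise ValueError("x-height must be positive")
    return max(1, math.ceil(target_px / x_height_px))


def field_filter(x, y, width, height, x_height_px):
    """Build an ffmpeg -vf chain: crop one field, upscale it, convert to grayscale."""
    factor = scale_factor(x_height_px)
    return (
        f"crop={width}:{height}:{x}:{y},"  # crop takes width:height:x:y
        f"scale=iw*{factor}:ih*{factor}:flags=lanczos,"
        "format=gray"
    )


if __name__ == "__main__":
    # Hypothetical field at (100, 200), 80x24 pixels, text x-height of 7 pixels.
    print(field_filter(100, 200, 80, 24, 7))
```

The resulting string would replace or extend the -vf argument of the
frame extraction command, one crop per field of interest.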

So what might be the best way to get text/numbers out of
what a program displays?


[1]: https://en.wikipedia.org/wiki/Tesseract_(software)
