Gentoo Archives: gentoo-user

From: Alan McKinnon <alan.mckinnon@×××××.com>
To: gentoo-user@l.g.o
Subject: Re: [gentoo-user] extracting text, numbers from screencasts
Date: Sat, 07 May 2016 23:31:08
Message-Id: 9e2b826a-9030-c03c-0dc0-dce8b9460a76@gmail.com
In Reply to: Re: [gentoo-user] extracting text, numbers from screencasts by hw
On 07/05/2016 16:31, hw wrote:
> Helmut Jarausch wrote:
>> On 04/08/2016 03:26:53 PM, hw wrote:
>>>
>>> Hi,
>>>
>>> what would be the best approach to extract data
>>> from a screencast?
>>>
>>> The task is to acquire some data from the display of
>>> a GUI program used interactively by a user. There are
>>> a couple of 'fields' (as in "designated areas of the display")
>>> in which the relevant data is being displayed while the
>>> program is being used. The acquired data needs to be
>>> entered into a mysql database, preferably as soon as
>>> possible. (The program needs windoze, and the sources
>>> are unavailable :( )
>>>
>>>
>>> The idea is to make a screen recording and postprocess
>>> the recording with some sort of OCR software. This might
>>> require using ffmpeg (or the like) to create a single
>>> image from each frame of the recording; then process each
>>> image with OCR software to get the interesting data
>>> which can then be entered into the database.
>>>
>>> Data to extract is mostly numbers. The relevant fields
>>> can be expected to be either filled or empty. The FPS rate
>>> of the recording can be kept reasonably low, like 1 FPS,
>>> or perhaps even less, depending on how frequently the relevant
>>> fields change.
>>>
>>> Using tesseract comes to mind, but after reading that
>>>
>>> "Tesseract's output will be very poor quality if the input
>>> images are not preprocessed to suit it: Images (especially
>>> screenshots) must be scaled up such that the text x-height
>>> is at least 20 pixels,[12] any rotation or skew must be
>>> corrected or no text will be recognized, low-frequency
>>> changes in brightness must be high-pass filtered, or
>>> Tesseract's binarization stage will destroy much of the
>>> page, and dark borders must be manually removed, or they
>>> will be misinterpreted as characters."[1]
>>>
>>> I'm even more doubtful that this would produce usable
>>> results with sufficient reliability.
>>>
>>> So what might be the best way to get text/numbers out of
>>> what a program displays?
>>>
>>>
>>> [1]: https://en.wikipedia.org/wiki/Tesseract_(software)
>>>
>>
>> I can't help with Gentoo.
>> Try to find an old (free) version of FineReader which runs under wine.
>> If you do it only occasionally, transfer the image to an Android phone
>> where there are good and cheap OCR apps, even FineReader.
>
> It would be too much video to process. Besides, phones are
> ok for making phone calls and entirely incompatible with
> computers, which makes them useless for anything else but
> making phone calls.

Huh? da fuck you talkin' 'bout?

My trusty collection of Android devices would be very surprised to hear
they now don't have real CPUs, wifi chips, RAM and storage. Or can't run
a web browser, do email, instant chat, play x264 video with less CPU
load than my 8-core laptop, share with smb on the network, do bluetooth,
video calls or any of the other bazillion things computers have always
done with each other.

How odd. I really thought my Android phones could do all of that. I must
have imagined it .... that means my delusions are worse than I thought
and maybe I need different and more pills from the nice lady who's my GP.

--
Alan McKinnon
alan.mckinnon@×××××.com
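
For reference, the pipeline hw describes in the quoted message (sample the
recording at about 1 FPS, cut out each field, scale it up so the x-height
reaches roughly 20 pixels, OCR it, store the values) could be sketched
roughly as below. This is only a sketch under assumptions: the field
coordinates, file names, helper names and table layout are all made up for
illustration, sqlite3 stands in for the MySQL database, and the functions
only build the ffmpeg and tesseract command lines rather than running them.
The digit whitelist may not be honoured by every Tesseract version or
engine, so treat it as optional.

```python
import math
import sqlite3


def upscale_factor(x_height_px, target_px=20):
    """Smallest integer factor that brings the text x-height up to target_px."""
    return max(1, math.ceil(target_px / x_height_px))


def ffmpeg_field_args(video, field, factor, out_pattern, fps=1):
    """Build an ffmpeg command: sample frames, crop one field, upscale it.

    field is a hypothetical (x, y, width, height) box in recording pixels.
    """
    x, y, w, h = field
    filters = f"fps={fps},crop={w}:{h}:{x}:{y},scale=iw*{factor}:ih*{factor}"
    return ["ffmpeg", "-i", video, "-vf", filters, out_pattern]


def tesseract_args(image):
    """Build a tesseract command that reads one line of digits to stdout."""
    return ["tesseract", image, "stdout", "--psm", "7",
            "-c", "tessedit_char_whitelist=0123456789."]


def changed_readings(readings):
    """Yield only the (frame, text) pairs whose text differs from the previous one.

    An empty string counts as a reading, since a field may be empty.
    """
    previous = None
    for frame, text in readings:
        text = text.strip()
        if text != previous:
            yield frame, text
            previous = text


def store(conn, field_name, readings):
    """Insert readings into an illustrative table (sqlite3 instead of MySQL)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS reading "
        "(field TEXT, frame INTEGER, value TEXT)"
    )
    conn.executemany(
        "INSERT INTO reading VALUES (?, ?, ?)",
        [(field_name, frame, value) for frame, value in readings],
    )
    conn.commit()
```

The command lists can be passed to subprocess.run() once ffmpeg and
tesseract are installed. Dropping unchanged readings before the insert
keeps the database small when a field stays the same for many frames.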