Hi,

what would be the best approach to extract data
from a screencast?

The task is to acquire some data from the display of
a GUI program used interactively by a user. There are
a couple of 'fields' (as in "designated areas of the display")
in which the relevant data is displayed while the
program is being used. The acquired data needs to be
entered into a MySQL database, preferably as soon as
possible. (The program needs Windows, and the sources
are unavailable :( )

The idea is to make a screen recording and postprocess
the recording with some sort of OCR software. This might
require using ffmpeg (or the like) to create a single
image from each frame of the recording, then running
OCR on each image to get the interesting data,
which can then be entered into the database.

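The frame-extraction and OCR steps described above could be sketched
roughly as follows. This is only a sketch under assumptions: the Field
class, the file names and the pixel coordinates are made up for
illustration, the "--psm" spelling assumes Tesseract 4 or newer, and the
character whitelist may or may not be honoured depending on the Tesseract
version and engine in use.

```python
import subprocess
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    """A designated display area, in pixels of the recording (hypothetical values)."""
    name: str
    x: int
    y: int
    width: int
    height: int


def ffmpeg_frames_cmd(video: str, field: Field, fps: float = 1.0) -> list[str]:
    """Build an ffmpeg command that writes one cropped PNG per sampled frame."""
    # fps=... samples the recording; crop=w:h:x:y keeps only the field's area.
    filters = f"fps={fps:g},crop={field.width}:{field.height}:{field.x}:{field.y}"
    return ["ffmpeg", "-i", video, "-vf", filters, f"{field.name}_%05d.png"]


def tesseract_cmd(image: str) -> list[str]:
    """Build a tesseract command that prints the recognized text to stdout."""
    return [
        "tesseract", image, "stdout",
        "--psm", "7",  # treat the image as a single line of text
        "-c", "tessedit_char_whitelist=0123456789.-",  # assumption: numeric fields
    ]


def ocr_image(image: str) -> str:
    """Run tesseract on one image; returns '' when nothing was recognized."""
    result = subprocess.run(
        tesseract_cmd(image), capture_output=True, text=True, check=True
    )
    return result.stdout.strip()


if __name__ == "__main__":
    # Only prints the commands; nothing is executed here.
    total = Field(name="total", x=640, y=120, width=200, height=40)
    print(" ".join(ffmpeg_frames_cmd("recording.mkv", total)))
    print(" ".join(tesseract_cmd("total_00001.png")))
```

Cropping to one field per image keeps the rest of the GUI out of the
OCR input, which is presumably where most misreads would come from.
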
The data to extract is mostly numbers. The relevant fields
can be expected to be either filled or empty. The frame rate
of the recording can be kept reasonably low, like 1 FPS,
or perhaps even less, depending on how frequently the relevant
fields change.

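Since the fields are either filled or empty and change rarely, it may
be enough to store a row only when a field's value changes. A minimal
sketch of that idea follows. Assumptions: the OCR text arrives as
(frame number, text) pairs, the table and column names are invented, and
sqlite3 stands in for MySQL so that the example runs without a server
(a MySQL driver would typically use a cursor and "%s" placeholders
instead of "?").

```python
import sqlite3
from typing import Iterable, Iterator, Optional


def changes(
    readings: Iterable[tuple[int, str]]
) -> Iterator[tuple[int, Optional[float]]]:
    """Yield (frame, value) only when the value differs from the previous one.

    An empty field becomes None. Text that is neither empty nor a valid
    number is treated as an OCR misread and the frame is skipped.
    """
    first = True
    previous: Optional[float] = None
    for frame, text in readings:
        text = text.strip()
        if not text:
            value = None  # field is empty
        else:
            try:
                value = float(text)
            except ValueError:
                continue  # unreadable frame
        if first or value != previous:
            yield frame, value
            first, previous = False, value


def store(conn: sqlite3.Connection, field: str, rows) -> int:
    """Insert the changed values; returns the number of rows written."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS reading (field TEXT, frame INTEGER, value REAL)"
    )
    data = [(field, frame, value) for frame, value in rows]
    conn.executemany(
        "INSERT INTO reading (field, frame, value) VALUES (?, ?, ?)", data
    )
    conn.commit()
    return len(data)


if __name__ == "__main__":
    ocr_output = [(1, ""), (2, "12.5"), (3, "12.5"), (4, "l2.5"), (5, "13"), (6, "")]
    conn = sqlite3.connect(":memory:")
    print(store(conn, "total", changes(ocr_output)), "rows stored")
```

At 1 FPS the frame number doubles as a timestamp in seconds from the
start of the recording.
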
Using Tesseract comes to mind, but after reading that

"Tesseract's output will be very poor quality if the input
images are not preprocessed to suit it: Images (especially
screenshots) must be scaled up such that the text x-height
is at least 20 pixels, any rotation or skew must be
corrected or no text will be recognized, low-frequency
changes in brightness must be high-pass filtered, or
Tesseract's binarization stage will destroy much of the
page, and dark borders must be manually removed, or they
will be misinterpreted as characters."[1]

I'm even more doubtful that this would produce usable
results with sufficient reliability.

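Of the preprocessing steps in that quote, the upscaling is probably the
one that matters most for a screen recording, since a captured screen
should not be rotated or unevenly lit. A rough sketch of how the scale
factor could be derived from the "x-height of at least 20 pixels" rule
and turned into an ffmpeg filter fragment is below. The function names
are made up, the x-height has to be measured by hand from a sample
frame, and the choice of Lanczos resampling plus grayscale conversion is
an assumption rather than something the quoted text prescribes.

```python
import math


def upscale_factor(x_height_px: float, target_px: float = 20.0) -> int:
    """Smallest integer factor that brings the x-height up to target_px."""
    if x_height_px <= 0:
        raise ValueError("x-height must be positive")
    return max(1, math.ceil(target_px / x_height_px))


def preprocess_filter(x_height_px: float) -> str:
    """ffmpeg -vf fragment: upscale with Lanczos, then convert to grayscale."""
    n = upscale_factor(x_height_px)
    return f"scale=iw*{n}:ih*{n}:flags=lanczos,format=gray"


if __name__ == "__main__":
    # Example: GUI text whose lowercase letters are about 7 pixels tall.
    print(preprocess_filter(7))
```

The resulting string could be appended to whatever filter chain already
samples and crops the frames.
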
So what might be the best way to get text/numbers out of
what a program displays?

[1]: https://en.wikipedia.org/wiki/Tesseract_(software)