On 04/08/16 11:30, Helmut Jarausch wrote:
> On 04/08/2016 03:26:53 PM, hw wrote:
>>
>> Hi,
>>
>> what would be the best approach to extract data
>> from a screencast?
>>
>> The task is to acquire some data from the display of
>> a GUI program used interactively by a user. There are
>> a couple 'fields' (as in "designated areas of the display")
>> in which the relevant data is being displayed while the
>> program is being used. The acquired data needs to be
>> entered into a mysql database, preferably as soon as
>> possible. (The program needs windoze, and the sources
>> are unavailable :( )
>>
>>
>> The idea is to make a screen recording and postprocess
>> the recording with some sort of OCR software. This might
>> require using ffmpeg (or the like) to create a single
>> image from each frame of the recording; then treat each
>> image with an OCR software to get the interesting data
>> which can then be entered into the database.
>>
>> Data to extract is mostly numbers. The relevant fields
>> can be expected to be either filled or empty. The FPS rate
>> of the recording can be kept reasonably low, like 1 FPS,
>> or perhaps even less, depending on how frequent the relevant
>> fields change.
>>
>> Using tesseract comes to mind, but after reading that
>>
>> "Tesseract's output will be very poor quality if the input
>> images are not preprocessed to suit it: Images (especially
>> screenshots) must be scaled up such that the text x-height
>> is at least 20 pixels,[12] any rotation or skew must be
>> corrected or no text will be recognized, low-frequency
>> changes in brightness must be high-pass filtered, or
>> Tesseract's binarization stage will destroy much of the
>> page, and dark borders must be manually removed, or they
>> will be misinterpreted as characters."[1]
>>
>> I'm even more doubtful that this would produce usable
>> results with sufficient reliability.
>>
>> So what might be the best way to get text/numbers out of
>> what a program displays?
>>
>>
>> [1]: https://en.wikipedia.org/wiki/Tesseract_(software)
>>
>
> I can't help with Gentoo.
> Try to find an old (free) version of FineReader which runs under wine.
> If you do it only occasionally, transfer the image to an Android phone
> where there a good and cheap OCR apps, even FineReader.
>
>
>

I had some surprisingly good experience with tesseract in digitizing
photographed pages of an old book recently. So I gave it a try today
with a cropped screenshot of thunderbird.

$ convert scrsht.png -type Grayscale -filter point -resize 300% \
    -normalize upscaled.png
$ tesseract -l eng upscaled.png out
$ less out.txt

convert is from media-gfx/imagemagick-6.9.0.3
tesseract is app-text/tesseract-3.04.00-r2

Here are my findings:
Any graphical elements similar in size to a character appear as strange
letters.
Recognition of serif fonts was better than of sans-serif fonts, even at
smaller font sizes.
Text which can be spell-checked was nearly perfectly recognized.
Gentoo-specific words like "GLSA" and "NVMe" were not correctly recognized.
Selected text (white on blue background) was poorly recognized.
Dates were not recognized correctly.
Times were correctly read.
"convert" time for an initial screenshot size of 956 x 639 pixels was 0.4
seconds.
"tesseract" time was a little more than 6s on an Intel(R) Core(TM)
i7-4710MQ CPU @ 2.50GHz, without OpenCL.
The image conversion and tesseract OCR could easily be scripted.
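
A minimal sketch of such a script, wrapping the two commands from above.
The function names and the "-upscaled" file suffix are my own choices for
illustration, not anything the tools require, and I have only reasoned
through it rather than run it on real data:

```shell
#!/bin/sh
# Sketch: upscale a screenshot with ImageMagick, then OCR it with tesseract.
# Function names and the "-upscaled" suffix are hypothetical.

# Derive the name of the intermediate upscaled image from the input name,
# e.g. scrsht.png -> scrsht-upscaled.png
upscaled_name() {
    printf '%s\n' "${1%.*}-upscaled.png"
}

# Upscale and normalize the image, run tesseract on the result and print
# the recognized text on stdout.
ocr_screenshot() {
    in=$1
    up=$(upscaled_name "$in")
    convert "$in" -type Grayscale -filter point -resize 300% \
        -normalize "$up" || return 1
    # tesseract writes its result to <outputbase>.txt
    tesseract -l eng "$up" "${up%.png}" >/dev/null 2>&1 || return 1
    cat "${up%.png}.txt"
}

# Run only when a file name is given, so the functions can also be sourced.
if [ "$#" -gt 0 ]; then
    ocr_screenshot "$1"
fi
```

Saved as, say, ocr_screenshot.sh, it would be called as
"sh ocr_screenshot.sh scrsht.png" and print the recognized text.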

In short I would say that the following steps would help with tesseract:
Avoid GUIs with a lot of graphics.
Try to screenshot just the relevant areas.
Increase the GUI font size.
Configure the GUI to use a well-known serif font, or train tesseract for
the specific font used.
Configure the GUI to use high contrast, and avoid colors which get
converted to gray.
Tesseract time could be improved by enabling OpenCL.
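
If the screenshot cannot be limited to the relevant area when it is taken,
the field could be cropped out afterwards. The sketch below does that and,
since the data is mostly numbers, also restricts tesseract to digits. The
coordinates, file names and function names are made up for illustration.
"-psm 7" (treat the image as a single text line) is the tesseract 3.x
spelling; newer versions spell it "--psm 7". I have not measured whether
the character whitelist actually improves recognition:

```shell
# Sketch: crop one field out of a full screenshot and OCR it as a number.
# All names and coordinates here are hypothetical.

# Build an ImageMagick crop geometry (WxH+X+Y) from width, height, x, y.
crop_geometry() {
    printf '%sx%s+%s+%s\n' "$1" "$2" "$3" "$4"
}

# Keep only digits and decimal points from a piece of OCR output.
keep_number() {
    printf '%s' "$1" | tr -cd '0-9.'
}

# Crop the field, upscale it as before, OCR it as a single line of digits
# and print the cleaned-up number.
ocr_field() {
    img=$1; geom=$2; base=${3:-field}
    convert "$img" -crop "$geom" +repage -type Grayscale -filter point \
        -resize 300% -normalize "$base.png" || return 1
    # -psm 7 is the 3.x option name; use --psm 7 with newer tesseract.
    tesseract "$base.png" "$base" -psm 7 \
        -c tessedit_char_whitelist=0123456789. >/dev/null 2>&1 || return 1
    keep_number "$(cat "$base.txt")"
}

# Example call (coordinates are made up):
# ocr_field scrsht.png "$(crop_geometry 120 24 300 200)" amount
```

An empty field should simply come back as an empty string, which makes the
"filled or empty" case easy to tell apart before anything goes into the
database.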

I would be interested to hear about your findings with numerical data,
and which approach finally works for you.

Urs