Urs Schütz wrote:
> On 04/08/16 11:30, Helmut Jarausch wrote:
>> On 04/08/2016 03:26:53 PM, hw wrote:
>>>
>>> Hi,
>>>
>>> what would be the best approach to extract data
>>> from a screencast?
>>>
>>> The task is to acquire some data from the display of
>>> a GUI program used interactively by a user. There are
>>> a couple 'fields' (as in "designated areas of the display")
>>> in which the relevant data is being displayed while the
>>> program is being used. The acquired data needs to be
>>> entered into a mysql database, preferably as soon as
>>> possible. (The program needs windoze, and the sources
>>> are unavailable :( )
>>>
>>>
>>> The idea is to make a screen recording and postprocess
>>> the recording with some sort of OCR software. This might
>>> require using ffmpeg (or the like) to create a single
>>> image from each frame of the recording; then treat each
>>> image with OCR software to get the interesting data
>>> which can then be entered into the database.
>>>
>>> Data to extract is mostly numbers. The relevant fields
>>> can be expected to be either filled or empty. The FPS rate
>>> of the recording can be kept reasonably low, like 1 FPS,
>>> or perhaps even less, depending on how frequently the relevant
>>> fields change.
>>>
>>> Using tesseract comes to mind, but after reading that
>>>
>>> "Tesseract's output will be very poor quality if the input
>>> images are not preprocessed to suit it: Images (especially
>>> screenshots) must be scaled up such that the text x-height
>>> is at least 20 pixels, any rotation or skew must be
>>> corrected or no text will be recognized, low-frequency
>>> changes in brightness must be high-pass filtered, or
>>> Tesseract's binarization stage will destroy much of the
>>> page, and dark borders must be manually removed, or they
>>> will be misinterpreted as characters."[1]
>>>
>>> I'm even more doubtful that this would produce usable
>>> results with sufficient reliability.
>>>
>>> So what might be the best way to get text/numbers out of
>>> what a program displays?
>>>
>>>
>>> [1]: https://en.wikipedia.org/wiki/Tesseract_(software)
>>>
>>
>> I can't help with Gentoo.
>> Try to find an old (free) version of FineReader which runs under wine.
>> If you do it only occasionally, transfer the image to an Android phone
>> where there are good and cheap OCR apps, even FineReader.
>>
>
> I had some surprisingly good experience with tesseract in digitizing photographed pages of an old book recently. So I gave it a try today with a cropped screenshot of thunderbird.
>
> $ convert scrsht.png -type Grayscale -filter point -resize 300% -normalize upscaled.png
> $ tesseract -l eng upscaled.png out
> $ less out.txt
>
> convert is from media-gfx/imagemagick-6.9.0.3
> tesseract is app-text/tesseract-3.04.00-r2
>
> Here are my findings:
> Any graphical elements similar in size to a character appear as strange letters.
> Recognition of serif fonts was better than of sans-serif fonts, even at smaller font sizes.
> Text which can be spell-checked was nearly perfectly recognized.
> Gentoo-specific words like "GLSA" and "NVMe" were not correctly recognized.
> Selected text (white on blue background) was poorly recognized.
> Dates were not recognized correctly.
> Times were correctly read.
> "convert" time for an initial screenshot size of 956 x 639 pixels was 0.4 seconds.
> "tesseract" time was a little more than 6s on an Intel(R) Core(TM) i7-4710MQ CPU @ 2.50GHz, without opencl.
> The image conversion and tesseract OCR could easily be scripted.
|
Considering the amount of video, 6s per frame would be too long.
The application is time-critical such that I have a window of about
10s to extract and to process the data from at least 8 video streams.
Recording at only 1 FPS and taking 8 seconds to extract and to
process the data of each frame would require 640s of processing per
10s window, and I don't have about 70 CPUs available to do the work.
To make things worse, it's an ongoing process, i.e. dividing it into
10s windows is too artificial to keep things running as smoothly as
they should.
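
To spell out the estimate above: the 640s figure follows if each of the 8 streams contributes one frame per second (the 1 FPS suggested in the original message) and every frame takes 8s to handle. A minimal sketch of that arithmetic in shell, with all variable names made up for illustration:

```shell
#!/bin/sh
# Back-of-the-envelope estimate for the numbers in the paragraph above.
streams=8      # video streams to watch
fps=1          # frames per second per stream (assumed, see lead-in)
window=10      # seconds available per batch
per_frame=8    # seconds to extract and process one frame

frames=$((streams * fps * window))
cpu_seconds=$((frames * per_frame))
# Round up: cores needed to finish one batch within the window.
cores=$(( (cpu_seconds + window - 1) / window ))

echo "$frames frames, $cpu_seconds CPU-seconds, $cores cores"
```

This prints 80 frames, 640 CPU-seconds and 64 cores, which is roughly the "about 70 CPUs" mentioned above.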
|
> In short I would say that the following steps would help with tesseract:
> Avoid GUIs with a lot of graphics.
> Try to screenshot just the relevant areas.
> Increase GUI font size.
> Configure GUI to use a well known serif font, or train tesseract for the specific font used.
> Configure GUI to use high contrast, avoid colors which get converted to gray.
> Tesseract time could be improved by enabling opencl.
>
> I would be interested to hear about your findings with numerical data, and which approach finally works for you.
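
For reference, the "just the relevant areas" and "could easily be scripted" suggestions might be combined roughly as follows. This is only a sketch under assumptions: `ocr_field` is a made-up helper name, the geometry in the usage note is an invented example, and the character whitelist is a guess at what numeric fields contain. The `convert` preprocessing is the one quoted earlier in the thread; `-psm 7` (treat the image as a single text line) is the tesseract 3.x spelling, and newer releases use `--psm` instead.

```shell
#!/bin/sh
# Sketch only: ocr_field is a hypothetical helper, not part of any tool.
# Usage: ocr_field IMAGE GEOMETRY   (GEOMETRY in WxH+X+Y form)
ocr_field() {
    field_src=$1
    field_geom=$2
    field_tmp=$(mktemp -d) || return 1

    # Cut out one field, drop the leftover canvas offset (+repage),
    # then upscale and normalize as in the commands quoted earlier.
    # tesseract then reads the crop as one line of numeric characters
    # and writes its result to <outputbase>.txt.
    if convert "$field_src" -crop "$field_geom" +repage \
            -type Grayscale -filter point -resize 300% -normalize \
            "$field_tmp/field.png" &&
       tesseract "$field_tmp/field.png" "$field_tmp/out" -l eng -psm 7 \
            -c tessedit_char_whitelist=0123456789.,- >/dev/null 2>&1
    then
        cat "$field_tmp/out.txt"
        field_status=$?
    else
        field_status=1
    fi

    rm -rf "$field_tmp"
    return "$field_status"
}
```

A call such as `ocr_field frame.png 200x40+10+20` would then print whatever tesseract reads from a 200x40 pixel area whose top-left corner is at (10,20). Cropping first should also cut the per-frame time, since far fewer pixels are upscaled and scanned, though how much would have to be measured.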
|
Thank you very much for giving me a better idea of what I'm looking at!
Considering it, I have resorted to using autohotkey, which has the ability
to actually read data from GUI elements. It can also make requests to
web servers. With that, things become a hell of a lot simpler than
trying to process video streams, for I can simply read the data and send
it over to the web server which puts it into the database where it needs
to end up anyway.
|
Unfortunately, the application the data is being read from has a bad
habit of renaming the GUI elements I need to read. This makes things
difficult again.
|
Autohotkey is a really nice tool, though. I wonder if there is an
equivalent for X11.