Re: [gentoo-user] extracting text, numbers from screencasts - gentoo-user

From:	hw <hw@×××××.de>
To:	gentoo-user@l.g.o
Subject:	Re: [gentoo-user] extracting text, numbers from screencasts
Date:	Sat, 07 May 2016 14:57:24
Message-Id:	`572E0247.9030600@gc-24.de`
In Reply to:	Re: [gentoo-user] extracting text, numbers from screencasts by "Urs Schütz"

Urs Schütz wrote:
> On 04/08/16 11:30, Helmut Jarausch wrote:
>> On 04/08/2016 03:26:53 PM, hw wrote:
>>>
>>> Hi,
>>>
>>> what would be the best approach to extract data
>>> from a screencast?
>>>
>>> The task is to acquire some data from the display of
>>> a GUI program used interactively by a user. There are
>>> a couple of 'fields' (as in "designated areas of the display")
>>> in which the relevant data is being displayed while the
>>> program is being used. The acquired data needs to be
>>> entered into a mysql database, preferably as soon as
>>> possible. (The program needs windoze, and the sources
>>> are unavailable :( )
>>>
>>>
>>> The idea is to make a screen recording and postprocess
>>> the recording with some sort of OCR software. This might
>>> require using ffmpeg (or the like) to create a single
>>> image from each frame of the recording; then treat each
>>> image with OCR software to get the interesting data
>>> which can then be entered into the database.
>>>
>>> Data to extract is mostly numbers. The relevant fields
>>> can be expected to be either filled or empty. The FPS rate
>>> of the recording can be kept reasonably low, like 1 FPS,
>>> or perhaps even less, depending on how frequently the relevant
>>> fields change.
>>>
>>> Using tesseract comes to mind, but after reading that
>>>
>>> "Tesseract's output will be very poor quality if the input
>>> images are not preprocessed to suit it: Images (especially
>>> screenshots) must be scaled up such that the text x-height
>>> is at least 20 pixels,[12] any rotation or skew must be
>>> corrected or no text will be recognized, low-frequency
>>> changes in brightness must be high-pass filtered, or
>>> Tesseract's binarization stage will destroy much of the
>>> page, and dark borders must be manually removed, or they
>>> will be misinterpreted as characters."[1]
>>>
>>> I'm even more doubtful that this would produce usable
>>> results with sufficient reliability.
>>>
>>> So what might be the best way to get text/numbers out of
>>> what a program displays?
>>>
>>>
>>> [1]: https://en.wikipedia.org/wiki/Tesseract_(software)
>>>
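
The frame-extraction step described in the original question could be sketched roughly as follows. This is only an illustration and assumes ffmpeg is installed; the function name extract_frames, the file names and the crop geometry are made up for the example and are not taken from the actual setup.

```shell
#!/bin/sh
# Sketch: dump one cropped frame per second from a screen recording.
# extract_frames, the paths and the crop values are hypothetical.

extract_frames() {
    video=$1      # input recording, e.g. recording.mkv
    outdir=$2     # directory that receives the PNG frames
    crop=$3       # ffmpeg crop filter arguments: width:height:x:y
    mkdir -p "$outdir" || return 1
    # fps=1 reduces the stream to one frame per second;
    # crop cuts out only the region that holds the field.
    ffmpeg -loglevel error -i "$video" -vf "fps=1,crop=$crop" \
        "$outdir/frame_%05d.png"
}

# Usage (hypothetical values):
#   extract_frames recording.mkv frames 200:40:100:300
```

Cropping inside ffmpeg keeps the images small, so whatever OCR step follows has less to look at per frame.
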
>>
>> I can't help with Gentoo.
>> Try to find an old (free) version of FineReader which runs under wine.
>> If you do it only occasionally, transfer the image to an Android phone
>> where there are good and cheap OCR apps, even FineReader.
>>
>>
>>
>
> I had some surprisingly good experience with tesseract in digitizing photographed pages of an old book recently. So I gave it a try today with a cropped screenshot of thunderbird.
>
> $ convert scrsht.png -type Grayscale -filter point -resize 300% -normalize upscaled.png
> $ tesseract -l eng upscaled.png out
> $ less out.txt
>
> convert is from media-gfx/imagemagick-6.9.0.3
> tesseract is app-text/tesseract-3.04.00-r2
>
> Here are my findings:
> Any graphical elements sized similarly to a character appear as strange letters.
> Recognition of serif fonts was better than that of sans-serif fonts, even at smaller font size.
> Text which can be spell-checked was nearly perfectly recognized.
> Gentoo-specific words like "GLSA" and "NVMe" were not correctly recognized.
> Selected text (white on blue background) was poorly recognized.
> Dates were not recognized correctly.
> Times were correctly read.
> "convert" time for an initial screenshot size of 956 x 639 pixels was 0.4 seconds.
> "tesseract" time was a little more than 6s on an Intel(R) Core(TM) i7-4710MQ CPU @ 2.50GHz, without opencl.
> The image conversion and tesseract ocr could easily be scripted.
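
Scripting the two quoted commands might look like the minimal sketch below. It assumes ImageMagick's convert and tesseract are on the PATH; the function name ocr_image and the _upscaled.png suffix are hypothetical, and the options are simply the ones quoted above.

```shell
#!/bin/sh
# Sketch: run the quoted convert and tesseract steps on one screenshot.
# ocr_image and the "_upscaled" suffix are hypothetical names.

ocr_image() {
    src=$1                        # input screenshot, e.g. scrsht.png
    base=${src%.*}                # path without the file extension
    up="${base}_upscaled.png"
    # Same preprocessing as quoted: grayscale, 3x point resize, normalize.
    convert "$src" -type Grayscale -filter point -resize 300% \
        -normalize "$up" || return 1
    # tesseract appends ".txt" to the output base name by itself.
    tesseract -l eng "$up" "$base" >/dev/null 2>&1 || return 1
    cat "$base.txt"
}

# Usage (hypothetical file name):
#   ocr_image scrsht.png
```

The recognized text goes to standard output, so the function can be used in a loop over many screenshots or piped into further processing.
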

Considering the amount of video, 6s per frame would be too long.
The application is time-critical such that I have a window of about
10s to extract and to process the data from at least 8 video streams.
Recording at only 1 FPS and taking 8 seconds per frame to extract and
to process the data would require 640s per 10s window, and I don't have
about 70 CPUs available to do the work. To make things worse, it's
an ongoing process, i.e. dividing it into 10s windows is too artificial
to keep things running as smoothly as they should.
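
The estimate can be checked with plain shell arithmetic. In this sketch the function name ocr_budget is hypothetical. The 640s figure corresponds to 8 streams at one frame per second with 8 seconds of work per frame; at 10 frames per second the same calculation gives ten times as much.

```shell
#!/bin/sh
# Sketch: CPU time needed per processing window (ocr_budget is a
# hypothetical name; all inputs are whole numbers).

ocr_budget() {
    streams=$1 fps=$2 window=$3 per_frame=$4
    frames=$((streams * fps * window))          # frames arriving per window
    work=$((frames * per_frame))                # seconds of processing needed
    cores=$(( (work + window - 1) / window ))   # round up to whole cores
    echo "$frames frames, $work s of work, $cores cores"
}

ocr_budget 8 1 10 8     # prints: 80 frames, 640 s of work, 64 cores
ocr_budget 8 10 10 8    # prints: 800 frames, 6400 s of work, 640 cores
```

Even at one frame per second, 64 cores would have to run flat out just to keep up, which is in line with the "about 70 CPUs" mentioned above.
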

> In short I would say that the following steps would help with tesseract:
> Avoid GUIs with a lot of graphics.
> Try to screenshot just the relevant areas.
> Increase GUI font size.
> Configure GUI to use a well-known serif font, or train tesseract for the specific font used.
> Configure GUI to use high contrast, avoid colors which get converted to gray.
> Tesseract time could be improved by enabling opencl.
>
> I would be interested to hear about your findings with numerical data, and which approach finally works for you.
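
For purely numerical fields, two of these suggestions (screenshot just the relevant area, and narrow down what tesseract has to recognize) could be combined as in the sketch below. It was not tested against the application in question. The function name read_field and the geometry are hypothetical, and it assumes the "digits" config file that ships with tesseract is installed. The -psm spelling matches the 3.04 release mentioned above; later releases spell it --psm.

```shell
#!/bin/sh
# Sketch: OCR a single numeric field cut out of a larger screenshot.
# read_field, the "_field" suffix and the geometry are hypothetical.

read_field() {
    frame=$1        # full screenshot or extracted video frame
    geometry=$2     # ImageMagick crop geometry: WIDTHxHEIGHT+X+Y
    base="${frame%.*}_field"
    # Cut out only the field, then upscale it like the quoted command.
    convert "$frame" -crop "$geometry" +repage -type Grayscale \
        -filter point -resize 300% -normalize "$base.png" || return 1
    # -psm 7 treats the image as one line of text (3.04 syntax);
    # the "digits" config restricts recognition to numeric characters.
    tesseract "$base.png" "$base" -psm 7 digits >/dev/null 2>&1 || return 1
    cat "$base.txt"
}

# Usage (hypothetical values):
#   read_field frame_00001.png 200x40+100+300
```

A small crop should also shorten the tesseract run compared with a full 956 x 639 screenshot, though by how much would have to be measured.
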

Thank you very much for giving me a better idea of what I'm looking at!
Considering it, I have resorted to using autohotkey, which has the ability
to actually read data from GUI-elements. It also can make requests to
web servers. With that, things become a hell of a lot simpler than
trying to process video streams, for I can simply read the data and send
it over to the web server which puts it into the database where it needs
to end up anyway.
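
To illustrate the kind of request involved, here is a rough shell stand-in using curl. It is not the autohotkey script itself. The URL, the function name post_value and the parameter names "field" and "value" are all hypothetical; the real server-side script and its parameters are not described in this thread.

```shell
#!/bin/sh
# Sketch: send one field value to a web server as a form POST.
# post_value, the URL and the parameter names are hypothetical.

post_value() {
    url=$1 field=$2 value=$3
    # --data-urlencode escapes the values and makes curl send a POST;
    # --fail turns HTTP error responses into a non-zero exit status.
    curl --silent --show-error --fail \
        --data-urlencode "field=$field" \
        --data-urlencode "value=$value" \
        "$url"
}

# Usage (hypothetical endpoint):
#   post_value http://localhost/collect weight 12.5
```

The server side would then only need to validate the two parameters and insert them into the database.
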

Unfortunately, the application the data is being read from has a bad
habit of renaming the GUI-elements I need to read. This makes things
difficult again.

Autohotkey is a really nice tool, though. I wonder if there is an
equivalent for X11.

Gentoo Archives: gentoo-user