Jani Averbach <jaa@×××××××.fi> posted 20060607010816.GA31588@×××××××.fi,
excerpted below, on Tue, 06 Jun 2006 19:08:16 -0600:

> Inspired by your comment, I installed 4.1.1 and did a very un-scientific
> test: dcraw compiled [1] with gcc 3.4.5 and 4.1.1, then converted one
> raw picture with each:
>
> time dcraw-3 -w test.CR2
> real 0m10.338s
> user 0m9.969s
> sys 0m0.332s
>
> time dcraw-4 -w test.CR2
> real 0m9.141s
> user 0m8.849s
> sys 0m0.292s
>
> This is pretty good, and that was only dcraw itself; all the libraries
> are still built with gcc 3.4.x.
>
> BR, Jani
>
> P.S. gcc -march=k8 -o dcraw -O3 dcraw.c -lm -ljpeg -llcms

Very interesting. I hadn't done any similar direct comparisons, but had
just been amazed at how much more responsive things seem with 4.1.x as
compared to 3.4.x. Given the generally agreed rule of thumb that users
won't definitively notice a performance difference of less than about
15%, I'd estimate a difference of at least 20%, with everything compiled
with 4.1.x as compared to 3.4.x.

One test I've always been interested in but have never done is the effect
of -Os vs. -O2 vs. -O3. I think it's generally agreed from testing that
-O2 makes a BIG difference as compared to unoptimized or -O (-O1), but
the differences between -O2, -O3, and -Os are less clearly defined and,
it would appear, the best choice depends on what one is compiling. I know
-O3 is actually supposed to be worse than -O2 in many cases, because loop
unrolling and similar optimizations tend to markedly increase code size,
and the cost of the resulting cache misses -- the CPU idling while it
waits for memory -- is often worse than the cycles saved by the
additional optimization.

For that reason, I've always tended toward what could be argued to be the
other extreme, favoring -Os over -O2, figuring that in a multitasking
environment, even with today's increased cache sizes, and with main
memory increasingly falling behind CPU speeds (well, until CPU speeds
started leveling off recently as the focus moved to multi-core instead),
-Os should in general be faster than -O2 for the same reason that -O2 is
so often faster than -O3.

OTOH, there are a couple of specific optimizations that can increase
overall code size while increasing cache hit ratios as well, negating the
general cache benefits of -Os. Perhaps the most significant of these,
where it can be used, is -freorder-blocks-and-partition. The effect of
this flag is to cause gcc to try to regroup routines into "hot" and
"cold", with each group in its own "partition". Hot routines are those
called most frequently, cold ones the least, so the effect is that
despite a bit of overall increase in code size, the most frequently used
routines will be in cache a much higher percentage of the time as
compared to generally un-reordered routines/blocks. In theory, that could
dramatically affect performance, as the CPU will far more frequently find
the stuff it needs in cache and not have to wait for it to be retrieved
from much slower main memory.

The biggest problem with this flag is that there's a LOT of code that
can't use it, including the exception-handling code so common in C++.
Now, gcc does spot that and turn the flag off, so no harm done, but it
spits out warnings in the process, saying that it turned it off, and this
breaks a lot of ebuilds in the configure step, as they often abort on
those warnings when they shouldn't. As a result, I've split my CFLAGS
from my CXXFLAGS and only include -freorder-blocks-and-partition in my
CFLAGS, omitting it from CXXFLAGS. I also have the weaker form
-freorder-blocks (without the -and-partition) in both CFLAGS and
CXXFLAGS, so it gets used where the stronger partitioning form is turned
off. I've not done actual performance tests on this either way, but I do
know the occasional problem with an aborted ebuild I'd have with the
partition version in CXXFLAGS is no longer a problem with it only in
CFLAGS, and it hasn't seemed to cause me any /problems/ since then, quite
apart from its not-verified-by-me effect on performance.

Likewise with the flags -frename-registers and -fweb. Since registers
are the fastest of all memory, operating at full CPU speed, and these
flags increase the efficiency of register allocation, it is IMO worth
invoking them even at the expense of slightly increased code size. This
is likely to be particularly true on amd64/x86_64, with its increased
number of registers in comparison to x86. Our arch has those extra
registers; we might as well make the most of them!
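
Putting the last two paragraphs together, here's a sketch of how such a
CFLAGS/CXXFLAGS split might look in /etc/make.conf -- these exact values
are my illustration of the scheme described above, not a tested
recommendation:

```shell
# /etc/make.conf (sketch)
# Shared baseline: -Os for overall cache friendliness, plus the
# register-allocation helpers discussed above.
COMMON_FLAGS="-march=k8 -Os -pipe -frename-registers -fweb"

# C gets the strong hot/cold partitioning; C++ only gets the weaker
# -freorder-blocks, since the partition form trips over exception
# code and its warnings can abort configure scripts.
CFLAGS="${COMMON_FLAGS} -freorder-blocks-and-partition -freorder-blocks"
CXXFLAGS="${COMMON_FLAGS} -freorder-blocks"
```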

Conversely, I don't like the unroll-loops flags at all. It's my
(unverified, no argument there) belief that these will be a BIG drag on
performance, because they blow up single loop structures to multiple
times their original size, all in an (IMO misguided) effort to inline the
loops and avoid a few jump instructions. Jumps are far less costly on
x86_64, and even on full 32-bit i586+ x86, than they were on the original
16-bit 8088-80486 generations. That's particularly true in the case of
tight loops, where the entire loop will be in L1 cache. With proper
prefetching, it's /possible/ inline unrolling of the loops could keep the
registers full from L1 and the code running at full CPU speed, as opposed
to the slight waits possibly necessary at the loopback jump for a fetch
from L1 instead of continuing full-speed register operations, but I
believe it's much more likely that unrolling the loops will either force
code out to L2, or that the prefetching couldn't keep the registers
sufficiently full even with the unrolling, so there'd be the wait in any
case. -O2 does a bit of simple loop unrolling, which -Os should
discourage, but -O3 REALLY turns on the unrolling (de)optimizations, if
I'm reading the gcc manpages correctly, anyway. It's that size-intensive
loop unrolling that I most want to discourage, which is why I'd seldom
consider -O3 at all, and the big reason why I favor -Os over -O2, even
given -O2's limited loop unrolling.

However, arguably -O2's limited loop unrolling is more optimal than
discouraging it with -Os. I believe it actually comes down to the code in
question, and which is "better" as an overall system CFLAGS choice very
likely depends on exactly which applications one actually chooses to
merge. It's also very likely dependent on how much multitasking an
individual installation routinely sees, whether that's single-core or
multi-core/multi-CPU based multitasking, plus the specifics of the
sub-arch caching implementation. (Intel's memory management, particularly
as the number of cores and CPUs increases, isn't at this point as
efficient as AMD's, tho with Conroe, Intel is likely to brute-force its
way back into the leadership position for the single- and dual-core
models normally found on desktops/laptops and low-end workstations,
anyway, despite AMD's memory management being more elegant currently and
scaling better as the number of cores and CPUs increases to 4 and above.)

...

I **HAVE** come across a single **VERY** convincing demonstration of the
problems with gcc 3.x on amd64/x86_64, however. This one blew me away --
it was TOTALLY unexpected, and another guy and I spent quite some
troubleshooting time finding it as a result.

Those of you using pan as your news client of choice may already be aware
that there's a newer 0.90+ beta series available. Portage has a couple of
masked ebuilds for the series, but hasn't been keeping up, as a new one
has been coming out every weekend since April first (with this past
weekend an exception; Charles, the main developer, took a few days'
vacation). Therefore, one can either build from source, or do what I've
been doing and rename the ebuild (in my overlay) for each successive
weekly release. (My overlay ebuild is slightly modified as well, but no
biggie for this discussion.)

Well, back before gcc-4.1.x was unmasked to ~amd64, one guy on the PAN
groups was having a /terrible/ time compiling the new PAN series with the
then-latest ~amd64 gcc-3.4.x. With a gigabyte of memory, plus swap, he
kept running into insufficient-memory errors.

I wondered how that could be, as I'm running a generally ~amd64 system
myself and had experienced no issues, and while I'm running 8 gig of
memory now, that's a fairly recent upgrade, and I had neither experienced
problems compiling pan before nor noticed it using a lot of memory after
the upgrade. I run ulimit set to a gig of virtual memory (ulimit -v) by
default, and certainly would have expected to run into issues compiling
pan with that if it required that sort of memory. I routinely /do/ run
into such problems merging kmail, and always have to boost my ulimit
settings to compile it, so I knew it would happen if pan really required
that sort of memory to compile.
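
For anyone wanting to reproduce that cap, the limit is per shell and
inherited by its children, with the value given in KiB, so a gig is
1048576:

```shell
# Cap virtual memory for this shell and its children at 1 GiB.
# ulimit -v takes its value in KiB, so 1 GiB = 1048576.
ulimit -S -v 1048576
ulimit -S -v
# A compile started from this shell that needs more virtual memory
# than that per process will now fail with out-of-memory errors
# instead of pushing the whole box into swap.
```

The second ulimit call just echoes back the soft limit now in force.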

As it happened, while he was running ~amd64, he wasn't routinely using
--deep with his emerge --update world runs, so he had a number of
packages that were less than the very latest, and that's what he and I
focused on first as the difference between his system and mine, figuring
a newer version of /something/ I had must explain why I had no problem
compiling it while he did.

After he upgraded a few packages with no change in the problem, someone
else mentioned that it might be gcc. It turned out he was right: it WAS
gcc. With gcc-3.4.x, compiling the new pan on amd64 at one point requires
an incredible 1.3 gigabytes of usable virtual memory for a single compile
job! (That's a single job as counted by the MAKEOPTS=-jX setting.) He
apparently had enough memory and swap to do it -- if he shut down X and
virtually everything else he was running -- but was experiencing errors
due to lack of one or the other with everything he normally had running
continuing to run while he compiled pan.
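
The practical upshot for make.conf is that the -j count has to be
budgeted against that per-job peak. A sketch, taking the 1.3 GB figure
from this thread as the working assumption:

```shell
# /etc/make.conf (sketch): if a single compile job can peak near
# 1.3 GB of virtual memory under gcc-3.4.x, the -j count must be
# budgeted against physical RAM plus swap.  On a 1 GB box even -j1
# is marginal for a package like this; with 8 GB, a higher count
# still leaves headroom.
MAKEOPTS="-j2"
```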

When out of curiosity I checked how much memory it took with gcc-4.1.0,
the version I was running at the time (tho it was masked for Gentoo
users), I quickly saw why I hadn't noticed a problem -- less than 300 MB
usage at any point. I /think/ it was actually less than 200, but I didn't
verify that. In any case, even against 300 MB, gcc 3.4.x was using OVER
FOUR TIMES that, at JUST UNDER the 1.3 GB required. No WONDER I hadn't
noticed anything unusual compiling it with gcc-4.1.x, while he had all
sorts of problems with gcc-3.4.x!

I haven't verified this on x86, but I suspect the reason it didn't come
up with anyone else is that it's not a problem on x86. gcc 3.4.x is
apparently fairly efficient at dealing with 32-bit memory addresses and
is already reasonably optimized for x86. The same cannot be said for its
treatment of amd64. While this pan case is certainly an extreme corner
case, it does serve to emphasize the fact that gcc-3.x was simply not
designed for amd64/x86_64, and its x86_64 capabilities are and will
remain "bolted on" and, as such, far more cumbersome and less efficient
than they /could/ be. The 4.x rewrite provided the opportunity to change
that, and it was taken. As I've said, however, 4.0 was /just/ the rewrite
and didn't really do much else but try to keep regressions to a minimum.
With the 4.1 series, gcc support for amd64/x86_64 is FINALLY coming into
its own, and the performance improvements dramatically demonstrate that.
The jump from 3.4.x to 4.1.x is truly the most significant thing to
happen to gcc support for amd64 since support was originally added, and
it's probably the biggest jump we'll ever see: while improvements will
continue to be made, from this point on they will be incremental --
significant, yes, but not the blow-me-away improvements of 4.1, much as
improvements have been only incremental on x86 for some time.

...

Anyway... thanks for that little test. The results are certainly
enlightening. I'd /love/ to see some tests of the above -Os vs. -O2 vs.
-O3, and of the register and reorder flags vs. the standard -Ox alone, if
you're up to it, but I haven't bothered to run them myself, and just this
little test alone was quite informative and definitely more concrete than
the "feel" I've been basing my comments on to date. Hopefully my comments
above prove useful to someone as well, and, if I'm lucky, motivation for
some tests (by you or someone else) to prove or disprove them. =8^)

--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman

--
gentoo-amd64@g.o mailing list