[gentoo-amd64] Re: Re: Re: Giving up 64 platform - gentoo-amd64

From:	Duncan <1i5t5.duncan@×××.net>
To:	gentoo-amd64@l.g.o
Subject:	[gentoo-amd64] Re: Re: Re: Giving up 64 platform
Date:	Wed, 26 Apr 2006 04:31:17
Message-Id:	`pan.2006.04.26.04.28.46.27923@cox.net`
In Reply to:	Re: [gentoo-amd64] Re: Re: Giving up 64 platform by Brandon Edens

Brandon Edens posted <20060425221729.GA8185@××××××××××××.edu>, excerpted
below, on Tue, 25 Apr 2006 18:17:29 -0400:

> Thats interesting stuff Duncan. I've been looking for details regarding gcc's
> possible optimizations for AMD64 along with low-level details of what they do.
> Are you using gcc 3 or 4? Is there much work to be done, porting AMD64
> optimizations from 3 to 4's tree stuff? I'd like to read more or know where to
> find additional info.

I used 4.0.0 as my default compiler from the time it was released, altho
at first there were quite a number of packages I had to switch back to
3.4.x for.  The Gentoo devs that came up with Gentoo's slotting idea and
then implemented it for gcc get major kudos for making that so easy! =8^)

Then for a period I was using the 4.0.1-pre snapshots, as they almost
immediately fixed a couple of regressions in 4.0.0.  When the 4.0.1
release came, I stuck with it until 4.1.0 came out.  4.1.0 then became my
default compiler, while I have 4.0.3 and 3.4.6-r1 slotted in as well, for
when I need to use them.

While Stelling (Gentoo AMD64 Op Lead, seems quite active in all sorts of
things, from AMD64 to GCC to portage =8^) says 4.1.0 has a few regressions
that are now fixed, with the fixes to be released in 4.1.1 and of course
4.2.x, I've not seen them.  4.1.0 has been a /very/ smooth ride here,
smoother than any version of GCC since 3.2 or whatever it was that first
supported AMD64, before it even got the -march=k8 stuff.  I believe
there's just one case where I had to revert to 3.4.x, and the fact that
4.0.x didn't cut it either in that case almost certainly means the package
in question simply hadn't been patched to take 4.x's additional strictness
into account yet.  Of course, part of that smoothness is because the much
rougher transition, with its much stricter code requirements and early
regressions, came with 4.0.  That's to be expected for what amounted to a
huge rewrite, however, and the change still wasn't as rough as the
incompatibilities introduced early in the 3.x cycle (at which point I was
just switching to Linux and using Mandrake, so I pretty much only read
about them).

IMO, 4.1 is the most significant improvement for AMD64 since we got
-march=k8 support in 3.3 (IIRC).  GCC's AMD64 support is finally
coming into its own! =8^)  There are several reasons for this.

One, the 3.x series simply wasn't designed with AMD64 as a major arch; the
support for AMD64 in 3.x was in many ways simply bolted on, and it showed.
GCC3 wasn't built to take full advantage of the optimizations possible
for AMD64, as opposed to x86.  The rewrite for 4.0 was done with AMD64 in
mind, and much better optimizations became possible.

Two, the reorganization for 4.x gave GCC a much better organized and more
modular hierarchy in general.  With the spaghetti code that 3.x had become
now gone, it's far easier to make dependent optimizations up and down the
hierarchical chain without risking a serious miscompile regression.
That's of course across all archs, but it made optimizing for AMD64 that
much easier, as it is no longer treated as a special case of x86 branching
off that spaghetti code.  (IOW, there will be improvements for x86 as
well, but they won't be quite as dramatic, both because x86 was already
quite optimized, and because it was designed in as a major target, while
AMD64 was bolted on, from the GCC3 perspective.)

Of course (and this is point three) the goal for 4.0 was simply to clean
out the spaghetti code and get the rewrite and new framework in place with
as few regressions (both in optimization and in downright miscompiles) as
possible.  As such, it didn't advance optimization much anyway, because
that wasn't the goal; any such changes introduced for 4.0 would just have
complicated the verification process of ensuring there were no serious
regressions, which /was/ the goal.  In that regard, 4.1 is the 4.x series
finally coming into its own.  The improvements made possible by the
overall rearchitecting in 4.0 finally begin to appear in 4.1.  The promise
of 4.x is now delivered.

Together, those three points mean a HUGE step for GCC's AMD64 support, 3.x
to 4.1.x.  It's the first time such a step has been possible, and the
differences really /are/ noticeable.

...

(Recall my earlier posting to the effect that xorg's composite rendering,
with xorg-7.0 (modular-X), as compiled by gcc-4.1, is actually practical
now -- it doesn't slow down the system to the point of unusability.  BTW,
while I'm not running xorg-7.1 due to stability issues this early in the
release cycle, I played with it a bit, and the improvements to EXA, to the
point that it can replace XAA, are dramatic!  With 2D rendering configured
to use EXA on xorg-7.1, there is now virtually /zero/, that's right,
/zero/ additional CPU cost to turning composite on!  I was literally
ASTOUNDED!  I couldn't have imagined it possible!  The significance in
terms of bringing transparency etc. to the X desktop is tremendous!  I had
thought that there'd always be an additional cost, and that only those
with the latest video cards (and slaveryware drivers) and just-introduced
CPUs would be able to run with the bells and whistles turned on, and that
we'd have to grow into it, but I was apparently and happily very very
wrong!  At least for those with Radeon 92xx series cards -- I've a 9250 --
even running merged framebuffer across dual 1600x1200 monitors, the thing
had such a low CPU cost that I literally couldn't tell the difference,
either in responsiveness or in the CPU activity graphs, between composite
with all the goodies on, and composite toggled off altogether.  As I said,
I couldn't have dreamed that was technically possible!  Of course, that's
compiling with gcc-4.1.0.  How it works when compiled with 3.4.6, I really
don't know, nor am I eager to personally find out, tho I'm certainly open
to reading the experiences of others.)

...

Back to GCC.  Looking forward, I see a number of additional significant
improvements marked out for gcc 4.2 and 4.3.  With the now clean code and
modular framework of 4.x, its promise of making additional optimizations
(and compiling speed improvements, let's not forget them) possible
continues to be delivered.  However, from 4.1 on, the improvements for
AMD64 will probably be incremental once again, because 4.1 is where a
reasonably optimized gcc for amd64 was finally delivered.  It's the giant
step.  Beyond that, improvements will continue, but should be much smaller
in comparison.

...

As for specific CFLAGS/CXXFLAGS, I posted mine with a fairly detailed
explanation of why I chose them, probably about a month to six weeks ago
(as a follow-up to that xorg 7.0 post mentioned above).  I'd suggest
looking it up in the archives if you want the details, and the bit of
further discussion that followed.  I'll repeat them here briefly.

CFLAGS="-march=k8 -Os -pipe -fomit-frame-pointer -frename-registers
-funit-at-a-time -fweb -freorder-blocks -freorder-blocks-and-partition
-ftree-pre -fmerge-all-constants"

The -march and -pipe things are the usual.  -fomit-frame-pointer is
actually part of -Os (and -O2/3) on amd64.  I include it specifically
however, because some ebuilds use replace-flags or similar from
flag-o-matic to change -Os into something else.  Since I haven't examined
all the ebuilds I use to be sure what the replacement would be, including
-fomit-frame-pointer specifically ensures it gets used, even if -O1 or
similar is used by the ebuild (unless of course -fomit-frame-pointer is
specifically deleted/replaced as well).  Also, for 32-bit compiling,
-fomit-frame-pointer kills certain debugging, so it's not default for any
-Ox.  Again, just include it so it gets used.  Similarly, -funit-at-a-time
is invoked by -O(s|2|3), from at least 4.0.  I'm not sure of its status
for 3.4, but it was only introduced with 3.3 (well, 3.2 Hammer editions,
IIRC), and had to be invoked specifically at that time.
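
To illustrate why the explicitly listed flag survives, here is a minimal
sketch in plain shell of what happens when an ebuild swaps -Os for another
-O level.  The replace_flag function is a hypothetical stand-in written for
this example, not flag-o-matic's actual implementation; it only shows that
a flag named separately in CFLAGS is untouched by the swap.

```shell
#!/bin/sh
# Hypothetical helper for illustration only (NOT the real flag-o-matic
# code): print flag string $3 with every word equal to $1 replaced by $2.
replace_flag() {
    old=$1 new=$2 out=
    for f in $3; do
        [ "$f" = "$old" ] && f=$new
        out="${out:+$out }$f"
    done
    printf '%s\n' "$out"
}

CFLAGS="-march=k8 -Os -pipe -fomit-frame-pointer"
# An ebuild that distrusts -Os might do the equivalent of this:
CFLAGS=$(replace_flag -Os -O1 "$CFLAGS")
printf '%s\n' "$CFLAGS"
```

Only the -Os word changes; -fomit-frame-pointer stays in the string no
matter which -O level the swap leaves behind.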

-frename-registers and -fweb sort of go together.  Note that -fweb is NOT
recommended for gcc 4.0, where it behaved somewhat strangely.  AFAIK it's
fine for 4.1 again.  The effect of both of these is to make more efficient
use of registers.  Note that -frename-registers is invoked by -O3 but not
-O2 (if memory serves).  That implies it might increase code size (I
haven't tested to verify and haven't seen an explicit statement to that
effect), undoing part of what -Os does, but the tradeoff should still be
worth it.

The -freorder-blocks flags go together as well.  With -and-partition,
-freorder-blocks is redundant, but -and-partition is automatically
disabled in many cases where it can't work, so the weaker form is included
to cover that case.  The idea here is the hot/cold separation mentioned
upthread.  Code used frequently is grouped together such that it has a
better chance of staying in-cache.  Code used infrequently is likewise
grouped.  From what I've read, this /does/ increase code size some, but
the tradeoff should be worth it because for most code, it'll increase the
cache hit ratio, which is why we are targeting size in the first place.

**IMPORTANT**  C++ makes heavy use of exceptions, where -and-partition
won't work, causing a warning to be emitted.  THIS WARNING BREAKS CERTAIN
CONFIGURE SCRIPTS.  Thus, my CXXFLAGS are equivalent to CFLAGS minus
-freorder-blocks-and-partition.  I've had far less trouble with broken
emerges since I did that, and eliminating all those warnings is nice, too.
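
One way to keep the two flag sets in sync is to derive CXXFLAGS from
CFLAGS mechanically.  The sketch below does that in plain shell; drop_flag
is a hypothetical helper name invented here.  Since make.conf is not
guaranteed to be run through a shell, the assumption is that you run this
separately and paste the line it prints, rather than putting the function
into make.conf itself.

```shell
#!/bin/sh
# Hypothetical helper for illustration: print flag string $2 with every
# word that exactly equals $1 removed.
drop_flag() {
    unwanted=$1 out=
    for f in $2; do
        [ "$f" = "$unwanted" ] && continue
        out="${out:+$out }$f"
    done
    printf '%s\n' "$out"
}

CFLAGS="-march=k8 -Os -pipe -fomit-frame-pointer -frename-registers -funit-at-a-time -fweb -freorder-blocks -freorder-blocks-and-partition -ftree-pre -fmerge-all-constants"
CXXFLAGS=$(drop_flag -freorder-blocks-and-partition "$CFLAGS")
printf 'CXXFLAGS="%s"\n' "$CXXFLAGS"
```

Matching whole words matters here: a substring match would also mangle the
plain -freorder-blocks flag, which is meant to stay in CXXFLAGS.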

-ftree-pre is new to 4.x (so you'll want to eliminate it for 3.x
compiles, but the amd64 profiles have filtered out invalid flags
automatically for some time now =8^).  A weaker form of it is -ftree-fre
(partial vs. full redundancy elimination; full redundancy is faster to
check for but doesn't find as many cases, so it's weaker).  The 4.1
manpage says the -fre form is enabled by default at -O(1), the -pre form
by -O2/3.  One would guess it'd be logical to include it with -Os as well,
but the manpage doesn't say it is, so...  In any case, the same rule
applies here as above -- since I can't be sure an ebuild won't kill my -Ox
setting, if I really want the flag, it's best to include it specifically.
If it really doesn't work for a particular package, the ebuild should
disable the flag specifically anyway.
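
For anyone who would rather check than guess: later GCC releases can
report which optimizer flags a given -O level turns on, via
`gcc -Q --help=optimizers`.  That option may not be available in 4.1
itself, so the sketch below assumes a newer toolchain, and flag_state is a
helper name made up for this example.  It just picks the state column for
one flag out of that report.

```shell
#!/bin/sh
# Hypothetical helper: read a `gcc -Q --help=optimizers` style report on
# stdin and print the second column (e.g. [enabled]) for the flag in $1.
flag_state() {
    awk -v flag="$1" '$1 == flag { print $2 }'
}

# Only query the compiler if one is installed; a GCC that lacks
# --help=optimizers simply yields no matching line, so the state is empty.
if command -v gcc >/dev/null 2>&1; then
    for f in -ftree-pre -ftree-fre; do
        printf '%s at -Os: %s\n' "$f" \
            "$(gcc -Q -Os --help=optimizers 2>/dev/null | flag_state "$f")"
    done
fi
```

Whatever it reports, listing the flag explicitly in CFLAGS still covers the
case where an ebuild changes the -O level behind your back.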

-fmerge-all-constants breaks the C specifications, so is never enabled by
default.  The weaker -fmerge-constants is C spec compliant, and is enabled
with any -O.  See the manpage for the details of the distinction and why
it should (in theory) be safe even if it breaks the spec.  In any case,
I've had no trouble with it, tho I was prepared to eliminate it if I did.
YMMV of course.  This one should contribute significantly toward the goals
of -Os.

As with 4.1, I've had surprisingly few problems with this set of CFLAGS,
once I eliminated -freorder-blocks-and-partition from my CXXFLAGS,
anyway.  They seem pretty solid, and I haven't verified whether it's the
CFLAGS or gcc-4.1 or both, but together, they make some rather
impressively fast code!  (Again, see the previous thread on xorg-7.0.
Yes, the effect /was/ that impressive!  It felt like a good 50%
difference, which is truly astounding in an area where eking out a
hard-fought 1-2% improvement is far more common.  Again, xorg 7.1 with EXA
rendering in place of XAA looks set to repeat that, at least on my
hardware, as hard to believe as it may seem, this time due to xorg, not
the compiler, as I'm using 4.1 for both xorg-7.0 and 7.1.)

--
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman in
http://www.linuxdevcenter.com/pub/a/linux/2004/12/22/rms_interview.html

--
gentoo-amd64@g.o mailing list
