Gentoo Archives: gentoo-amd64

From: Duncan <1i5t5.duncan@×××.net>
To: gentoo-amd64@l.g.o
Subject: [gentoo-amd64] Re: Re: Re: Giving up 64 platform
Date: Wed, 26 Apr 2006 04:31:17
Message-Id: pan.2006.04.26.04.28.46.27923@cox.net
In Reply to: Re: [gentoo-amd64] Re: Re: Giving up 64 platform by Brandon Edens
Brandon Edens posted <20060425221729.GA8185@××××××××××××.edu>, excerpted
below, on Tue, 25 Apr 2006 18:17:29 -0400:

> That's interesting stuff Duncan. I've been looking for details regarding gcc's
> possible optimizations for AMD64 along with low-level details of what they do.
> Are you using gcc 3 or 4? Is there much work to be done, porting AMD64
> optimizations from 3 to 4's tree stuff? I'd like to read more or know where to
> find additional info.

I used 4.0.0 as my default compiler from the time it was released, altho
at first there were quite a number of packages I had to switch back to
3.4.x for. The Gentoo devs that came up with Gentoo's slotting idea and
then implemented it for gcc get major kudos for making that so easy! =8^)

Then for a period I was using the 4.0.1-pre snapshots, as they almost
immediately fixed a couple of regressions in 4.0.0. When the 4.0.1 release
came, I stuck with it until 4.1.0 came out. 4.1.0 then became my default
compiler, while I have 4.0.3 and 3.4.6-r1 slotted in as well, for when I
need to use them.

While Stelling (Gentoo AMD64 Op Lead, seems quite active in all sorts of
things, from AMD64 to GCC to portage =8^) says 4.1.0 has a few regressions
that are now fixed, with the fixes to be released in 4.1.1 and of course
4.2.x, I've not seen them. 4.1.0 has been a /very/ smooth ride here.
Smoother than any version of GCC since 3.2 or whatever it was that first
supported AMD64, before it even got the -march=k8 stuff. I believe there's
just one case where I had to revert to 3.4.x, and the fact that 4.0.x
didn't cut it either in that case almost certainly means the package in
question simply hadn't been patched to take 4.x's additional strictness
into account yet. Of course, part of the reason 4.1 seems so smooth is the
much rougher transition to 4.0, with its much stricter code requirements
and early regressions. That's to be expected for what amounted to a huge
rewrite in 4.0, however, and the change still wasn't as rough as the
incompatibilities introduced early in the 3.x cycle (at which point I was
just switching to Linux and using Mandrake, so I pretty much only read
about them).

IMO, 4.1 is the most significant improvement for AMD64 since we got
-march=k8 support in 3.3 (IIRC). GCC's AMD64 support is finally
coming into its own! =8^) There are several reasons for this.

One, the 3.x series simply wasn't designed with AMD64 as a major arch; the
support for AMD64 in 3.x was in many ways simply bolted on, and it showed.
GCC3 couldn't take full advantage of the optimizations possible for AMD64,
as opposed to x86. The rewrite for 4.0 was done with AMD64 in mind, and
much better optimizations became possible.

Two, the reorganization for 4.x gave GCC a much better organized and more
modular hierarchy in general -- one where it is possible to optimize to
greater efficiency because the spaghetti code that 3.x had become is gone,
and it's now far easier for dependent optimizations to be made up and down
the hierarchical chain without risking a serious miscompile regression.
That's of course across all archs, but it made optimizing for AMD64 that
much easier, as it is no longer treated as a special case of x86 in terms
of branches off that spaghetti code. (IOW, there will be improvements for
x86 as well, but they won't be quite as dramatic, both because it was
already quite optimized, and because it was designed in as a major target,
while AMD64 was bolted on, from the GCC3 perspective.)

Of course (and this is point three) the goal for 4.0 was simply to clean
out the spaghetti code and get the rewrite and new framework in place with
as few regressions (both in optimization and in downright miscompiles) as
possible. As such, it didn't advance optimization itself much, because
that wasn't the goal, and any such changes introduced for 4.0 just
complicated the verification process, in terms of ensuring there were no
serious regressions, which /was/ the goal. In that regard, 4.1 is the 4.x
series finally coming into its own. The improvements made possible by the
overall rearchitecting in 4.0 finally begin to appear in 4.1. The promise
of 4.x is now delivered.

Together, those three points mean a HUGE step for GCC's AMD64 support, 3.x
to 4.1.x. It's the first time a step that size has been possible, and the
differences really /are/ noticeable.

...

(Recall my earlier posting to the effect that xorg's composite rendering,
with xorg-7.0 (modular-X), as compiled by gcc-4.1, is actually practical
now -- it doesn't slow down the system to the point of unusability. BTW,
while I'm not running xorg-7.1 due to stability issues this early in the
release cycle, I played with it a bit, and the improvements to EXA, to the
point that it can replace XAA, are dramatic! Configuring 2D rendering to
use EXA on xorg-7.1, there is now virtually /zero/, that's right, /zero/
additional CPU cost to turning composite on! I was literally ASTOUNDED!
I couldn't have imagined it possible! The significance in terms of
bringing transparency etc. to the X desktop is tremendous! I had thought
that there'd always be an additional cost, and that only those with the
latest video cards (and slaveryware drivers) and just-introduced CPUs
would be able to run with the bells and whistles turned on, and that we'd
have to grow into it, but I was apparently and happily very very wrong!
At least for those with Radeon 92xx series cards -- I've a 9250 -- even
running merged framebuffer across dual 1600x1200 monitors, the thing had
such a low CPU cost that I literally couldn't tell the difference, either
in responsiveness or in the CPU activity graphs, between composite with
all the goodies on, and composite toggled off altogether. As I said, I
couldn't have dreamed that was technically possible! Of course, that's
compiling with gcc-4.1.0. How it works when compiled with 3.4.6, I really
don't know, nor am I eager to personally find out, tho I'm certainly open
to reading the experiences of others.)

...

Back to GCC. Looking forward, I see a number of additional significant
improvements marked out for gcc 4.2 and 4.3. With the now clean code and
modular framework of 4.x, its promise of making additional optimizations
(and compiling speed improvements, let's not forget them) possible
continues to be delivered. However, from 4.1, the improvements for AMD64
will probably simply be incremental once again, because 4.1 is where a
reasonably optimized gcc for amd64 was finally delivered. It's the giant
step. Beyond that, improvements will continue, but should be much smaller
in comparison.

...

As for specific CFLAGS/CXXFLAGS, I posted mine with a fairly detailed
explanation of why I chose them, probably about a month to six weeks ago
(as a follow-on to that xorg 7.0 post mentioned above). I'd suggest looking
it up in the archives if you want the details, and the bit of further
discussion that followed. I'll repeat here briefly.

CFLAGS="-march=k8 -Os -pipe -fomit-frame-pointer -frename-registers
-funit-at-a-time -fweb -freorder-blocks -freorder-blocks-and-partition
-ftree-pre -fmerge-all-constants"

The -march and -pipe things are the usual. -fomit-frame-pointer is
actually part of -Os (and -O2/3) on amd64. I include it specifically,
however, because some ebuilds use replace-flags or similar from
flag-o-matic to change -Os into something else. Since I haven't examined
all the ebuilds I use to be sure what the replacement would be, including
-fomit-frame-pointer specifically ensures it gets used, even if -O1 or
similar is used by the ebuild (unless of course -fomit-frame-pointer is
specifically deleted/replaced as well). Also, for 32-bit compiling,
-fomit-frame-pointer kills certain debugging, so it's not default for any
-Ox there. Again, I just include it so it gets used. Similarly,
-funit-at-a-time is invoked by -O(s|2|3), from at least 4.0. I'm not sure
of its status for 3.4, but it was only introduced with 3.3 (well, 3.2
Hammer editions, IIRC), and had to be invoked specifically at that time.

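To illustrate why listing a flag explicitly helps, here is a minimal bash sketch. The swap_flag function is a hypothetical stand-in for the kind of rewriting an ebuild might do when it replaces an -O level; it is not the real eclass code, only word-by-word substitution on a shortened CFLAGS string.

```bash
#!/usr/bin/env bash
# Hypothetical stand-in for an ebuild swapping one flag for another.
# This is NOT the real eclass implementation, just an illustration.
swap_flag() {
    local old=$1 new=$2 flag out=()
    shift 2
    for flag in "$@"; do
        if [[ $flag == "$old" ]]; then
            out+=("$new")      # replace the matching flag
        else
            out+=("$flag")     # keep every other flag as-is
        fi
    done
    printf '%s\n' "${out[*]}"
}

CFLAGS="-march=k8 -Os -pipe -fomit-frame-pointer"
# Word splitting of $CFLAGS is intentional: one argument per flag.
REWRITTEN=$(swap_flag -Os -O1 $CFLAGS)
printf '%s\n' "$REWRITTEN"
```

The -O level changes, but the explicitly listed -fomit-frame-pointer is still on the command line afterwards, so it no longer matters whether the new -O level implies it.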
-frename-registers and -fweb sort of go together. Note that -fweb is NOT
recommended for gcc 4.0, where it behaved somewhat strangely. AFAIK it's
fine for 4.1 again. The effect of both of these is to make more efficient
use of registers. Note that -frename-registers is invoked by -O3 but not
-O2 (if memory serves). That implies it might (haven't tested to
verify and haven't seen an explicit statement to that effect) increase
code size, undoing part of what -Os does, but the tradeoff should still be
worth it.

The -freorder-blocks flags go together as well. With -and-partition,
reorder-blocks is redundant, but -and-partition is automatically disabled
in many cases where it can't work, so the weaker form is included to cover
that case. The idea here is the hot/cold separation mentioned upthread.
Code that runs frequently is grouped together such that it has a better
chance of staying in-cache. Code that runs infrequently is likewise
grouped. From what I've read, this /does/ increase code size some, but the
tradeoff should be worth it because for most code, it'll increase the
cache hit ratio, which is why we are targeting size in the first place.

**IMPORTANT** C++ makes heavy use of exceptions where -and-partition
won't work, causing a warning to be emitted. THIS WARNING BREAKS CERTAIN
CONFIGURE SCRIPTS. Thus, my CXXFLAGS are equivalent to CFLAGS minus
-freorder-blocks-and-partition. I've had far less trouble with broken
emerges since I did that, and eliminating all those warnings is nice, too.

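One way to keep the two flag sets in sync is to derive CXXFLAGS from CFLAGS rather than maintain both by hand. The following is a small bash sketch of that idea. It assumes a full shell; as far as I know make.conf itself only understands simple variable references, so the printed line would be pasted in rather than the substitution used there directly.

```bash
#!/usr/bin/env bash
# CFLAGS as listed above, joined onto one line.
CFLAGS="-march=k8 -Os -pipe -fomit-frame-pointer -frename-registers -funit-at-a-time -fweb -freorder-blocks -freorder-blocks-and-partition -ftree-pre -fmerge-all-constants"

# Drop the one flag that triggers the C++ warning. The pattern is the
# full flag name with its leading space, so the shorter -freorder-blocks
# flag is left untouched.
CXXFLAGS="${CFLAGS/ -freorder-blocks-and-partition/}"

printf 'CXXFLAGS="%s"\n' "$CXXFLAGS"
```

Deriving one from the other means a later change to CFLAGS carries over to CXXFLAGS automatically the next time the script is run.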
-ftree-pre is new to 4.x (so you'll want to eliminate it for 3.x
compiles, but the amd64 profiles have filtered out invalid flags
automatically for some time, now =8^). A weaker form of it is -ftree-fre
(partial vs. full redundancy elimination; full redundancy is faster to
check for but doesn't find as many cases, so it's weaker). The 4.1 manpage
says the -fre form is enabled by default at -O1, the -pre form by -O2/3.
One would guess it'd be logical to include it with -Os as well, but the
manpage doesn't say it is, so... In any case, the same rule applies here
as above -- since I can't be sure an ebuild won't kill my -Ox setting, if
I really want the flag, it's best to include it specifically. If it really
doesn't work for a particular package, the ebuild should disable the flag
specifically anyway.

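Rather than guessing from the manpage, later GCC releases can report which optimizations a given -O level switches on. The sketch below assumes a gcc new enough to accept -Q --help=optimizers (added in 4.3, I believe, so not the 4.1 discussed here); show_flags is a hypothetical helper name that just filters that listing for the flags of interest.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read an optimizer listing on stdin and print only
# the lines that mention one of the flags given as arguments.
show_flags() {
    local args=() flag
    for flag in "$@"; do
        args+=(-e "$flag")     # one fixed-string pattern per flag
    done
    grep -F "${args[@]}"
}

# Ask the compiler what -Os enables, if a suitable gcc is installed.
# Older compilers that lack --help=optimizers simply print nothing here.
if command -v gcc >/dev/null 2>&1; then
    gcc -Os -Q --help=optimizers 2>/dev/null \
        | show_flags -ftree-pre -ftree-fre || true
fi
```

Each matching line of the listing shows the flag name followed by its state at that optimization level, which settles the question for the compiler actually installed.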
-fmerge-all-constants breaks the C specifications, so is never enabled by
default. The weaker -fmerge-constants is C spec compliant, and is enabled
with any -O. See the manpage for the details of the distinction and why
it should (in theory) be safe even if it breaks the spec. In any case,
I've had no trouble with it, tho I was prepared to eliminate it if I did.
YMMV of course. This one should contribute significantly toward the goals
of -Os.

As with 4.1, I've had surprisingly few problems with this set of CFLAGS,
once I eliminated -freorder-blocks-and-partition from my CXXFLAGS,
anyway. They seem pretty solid, and I haven't verified whether it's the
CFLAGS or gcc-4.1 or both, but together, they make some rather
impressively fast code! (Again, see the previous thread on xorg-7.0.
Yes, the effect /was/ that impressive! It felt like a good 50%
difference, which is truly astounding in an area where eking out a
hard-fought 1-2% improvement is far more common. Again, xorg 7.1 with EXA
rendering in place of XAA looks set to repeat that, at least on my
hardware, as hard to believe as it may seem, this time due to xorg, not
the compiler, as I'm using 4.1 for both xorg-7.0 and 7.1.)

--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman in
http://www.linuxdevcenter.com/pub/a/linux/2004/12/22/rms_interview.html

--
gentoo-amd64@g.o mailing list
