Brandon Edens posted <20060425221729.GA8185@××××××××××××.edu>, excerpted
below, on Tue, 25 Apr 2006 18:17:29 -0400:

> Thats interesting stuff Duncan. I've been looking for details regarding gcc's
> possible optimizations for AMD64 along with low-level details of what they do.
> Are you using gcc 3 or 4? Is there much work to be done, porting AMD64
> optimizations from 3 to 4's tree stuff? I'd like to read more or know where to
> find additional info.

I used 4.0.0 as my default compiler from the time it was released, altho
at first there were quite a number of packages I had to switch back to
3.4.x for. The Gentoo devs that came up with Gentoo's slotting idea and
then implemented it for gcc get major kudos for making that so easy! =8^)

Then for a period I was using the 4.0.1-pre snapshots, as they almost
immediately fixed a couple regressions in 4.0.0. When the 4.0.1 release
came, I stuck with it until 4.1.0 came out. 4.1.0 then became my default
compiler, while I have 4.0.3 and 3.4.6-r1 slotted in as well, for when I
need to use them.

While Stelling (Gentoo AMD64 Op Lead, seems quite active in all sorts of
things, from AMD64 to GCC to portage =8^) says 4.1.0 has a few regressions
that are now fixed, with the fixes to be released in 4.1.1 and of course
4.2.x, I've not seen them. 4.1.0 has been a /very/ smooth ride here.
Smoother than any version of GCC since 3.2 or whatever it was that first
supported AMD64, before it even got the -march=k8 stuff. I believe there's
just one case I've had where I had to revert to 3.4.x, and the fact that
4.0.x didn't cut it either in that case almost certainly means the package
in question simply hadn't been patched to take 4.x's additional strictness
into account yet. Of course, part of why 4.1.0 seems so smooth is the
contrast with the much rougher transition to 4.0, with its much stricter
code requirements and early regressions. That roughness was to be expected
for what amounted to a huge rewrite in 4.0, however, and the change still
wasn't as rough as the incompatibilities introduced early in the 3.x cycle
(at which point I was just switching to Linux and using Mandrake, so I
pretty much only read about them).

IMO, 4.1 is the most significant improvement for AMD64 since we got
-march=k8 support in 3.3 (IIRC). GCC's AMD64 support is finally
coming into its own! =8^) There are several reasons for this.

One, the 3.x series simply wasn't designed with AMD64 as a major arch; the
support for AMD64 in 3.x was in many ways simply bolted on, and it showed.
GCC3 simply wasn't designed to be able to take full advantage of the
optimizations possible for AMD64, as opposed to x86. The rewrite for 4.0
was done with AMD64 in mind, and much better optimizations became possible.

Two, the reorganization for 4.x gave GCC a much better organized and more
modular hierarchy in general -- one where it is possible to optimize to
greater efficiency because all that spaghetti code that was 3.x is gone,
and it's now far easier for dependent optimizations to be made up and down
the hierarchical chain without risking a serious miscompile regression.
That's of course across all archs, but it made optimizing for AMD64 that
much easier, as it was no longer treated as a special case of x86 in terms
of branches off that spaghetti code. (IOW, there will be improvements for
x86 as well, but they won't be quite as dramatic, both because it was
already quite optimized, and because it was designed in as a major target,
while AMD64 was bolted on, from the GCC3 perspective.)

Of course (and this is point three) the goal for 4.0 was simply to clean
out the spaghetti code and get the rewrite and new framework in place with
as few regressions (both in optimization and in downright miscompiles) as
possible. As such, it didn't advance optimization itself much, because
that wasn't the goal, and any such changes introduced for 4.0 just
complicated the verification process, in terms of ensuring there were no
serious regressions, which /was/ the goal. In that regard, 4.1 is the 4.x
series finally coming into its own. The improvements made possible by the
overall rearchitecting in 4.0 finally begin to appear in 4.1. The promise
of 4.x is now delivered.

Together, those three points mean a HUGE step for GCC's AMD64 support, 3.x
to 4.1.x. It's the first time a step like this has been possible, and the
differences really /are/ noticeable.

...

(Recall my earlier posting to the effect that xorg's composite rendering,
with xorg-7.0 (modular-X), as compiled by gcc-4.1, is actually practical
now -- it doesn't slow down the system to the point of unusability. BTW,
while I'm not running xorg-7.1 due to stability issues this early in the
release cycle, I played with it a bit, and the improvements to EXA to the
point that it can replace XAA are dramatic! Configuring 2D rendering to
use EXA on xorg-7.1, there is now virtually /zero/, that's right, /zero/
additional CPU cost to turning composite on! I was literally ASTOUNDED!
I couldn't have imagined it possible! The significance in terms of
bringing transparency etc. to the X desktop is tremendous! I had
thought that there'd always be an additional cost, and that only those
with the latest video cards (and slaveryware drivers) and just-introduced
CPUs would be able to run with the bells and whistles turned on, and that
we'd have to grow into it, but I was apparently and happily very very
wrong! At least for those with Radeon 92xx series cards -- I've a 9250 --
even running merged framebuffer with dual 1600x1200 monitors, the thing
had such a low CPU cost that I literally couldn't tell the difference,
either in responsiveness or in the CPU activity graphs, between composite
with all the goodies on, and composite toggled off altogether. As I said,
I couldn't have dreamed that was technically possible! Of course, that's
compiling with gcc-4.1.0. How it works when compiled with 3.4.6, I really
don't know, nor am I eager to personally find out, tho I'm certainly open
to reading the experiences of others.)

...

Back to GCC. Looking forward, I see a number of additional significant
improvements marked out for gcc 4.2 and 4.3. With the now clean code and
modular framework of 4.x, its promise of making additional optimizations
(and compiling speed improvements, let's not forget them) possible
continues to be delivered. However, from 4.1 on, the improvements for
AMD64 will probably simply be incremental once again, because 4.1 is where
a reasonably optimized gcc for amd64 was finally delivered. It's the giant
step. Beyond that, improvements will continue, but should be much smaller
in comparison.

...

As for specific CFLAGS/CXXFLAGS, I posted mine with a fairly detailed
explanation of why I chose them, probably about a month to six weeks ago
(as a follow-on to that xorg 7.0 post mentioned above). I'd suggest looking
it up in the archives if you want the details, and the bit of further
discussion that followed. I'll repeat here briefly.

CFLAGS="-march=k8 -Os -pipe -fomit-frame-pointer -frename-registers
-funit-at-a-time -fweb -freorder-blocks -freorder-blocks-and-partition
-ftree-pre -fmerge-all-constants"

The -march and -pipe things are the usual. -fomit-frame-pointer is
actually part of -Os (and -O2/3) on amd64. I include it specifically
however, because some ebuilds use replace-flags or similar from
flag-o-matic to change -Os into something else. Since I haven't examined
every ebuild I use to be sure what the replacement would be, including
-fomit-frame-pointer specifically ensures it gets used, even if -O1 or
similar is used by the ebuild (unless of course -fomit-frame-pointer is
specifically deleted/replaced as well). Also, for 32-bit compiling,
-fomit-frame-pointer kills certain debugging, so it's not default for any
-Ox. Again, just include it so it gets used. Similarly, -funit-at-a-time
is invoked by -O(s|2|3), from at least 4.0. I'm not sure of its status
for 3.4, but it was only introduced with 3.3 (well, 3.2 Hammer editions,
IIRC), and had to be invoked specifically at that time.
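A minimal sketch of the reasoning above, assuming an ebuild does nothing more than swap one -O level for another. The swap_flag helper is a made-up stand-in written for this illustration, not flag-o-matic's actual code; the only point is that a flag named explicitly is still present after the -O level changes.

```shell
# Hypothetical stand-in for a flag-replacing helper (NOT flag-o-matic itself).
# Usage: swap_flag OLD NEW FLAG...   prints the flags with OLD replaced by NEW.
swap_flag() {
    old=$1
    new=$2
    shift 2
    out=
    for flag in "$@"; do
        if [ "$flag" = "$old" ]; then
            flag=$new
        fi
        out="${out:+$out }$flag"
    done
    printf '%s\n' "$out"
}

CFLAGS="-march=k8 -Os -pipe -fomit-frame-pointer"

# Unquoted on purpose, so that CFLAGS is split into one word per flag.
ebuild_cflags=$(swap_flag -Os -O1 $CFLAGS)
printf '%s\n' "$ebuild_cflags"
# prints: -march=k8 -O1 -pipe -fomit-frame-pointer
```

Had -fomit-frame-pointer only been implied by -Os, nothing in the resulting string would ask for it any more.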

-frename-registers and -fweb sort of go together. Note that -fweb is NOT
recommended for gcc 4.0 where it behaved somewhat strangely. AFAIK it's
fine for 4.1 again. The effect of both of these is to make more efficient
use of registers. Note that -frename-registers is invoked by -O3 but not
-O2 (if memory serves). That implies it might (haven't tested to
verify and haven't seen an explicit statement to that effect) increase
code size, undoing part of what -Os does, but the tradeoff should still be
worth it.

The -freorder-blocks flags go together as well. With -and-partition,
reorder-blocks is redundant, but -and-partition is automatically disabled
in many cases where it can't work, so the weaker form is included to cover
that case. The idea here is the hot/cold function separation mentioned
upthread. Functions used frequently are grouped together such that they
have a better chance of staying in-cache. Functions used infrequently are
likewise grouped. From what I've read, this /does/ increase code size
some, but the tradeoff should be worth it because for most code, it'll
increase the cache hit ratio, which is why we are targeting size in the
first place.

**IMPORTANT** C++ makes heavy use of exceptions where -and-partition
won't work, causing a warning to be emitted. THIS WARNING BREAKS CERTAIN
CONFIGURE SCRIPTS. Thus, my CXXFLAGS are equivalent to CFLAGS minus
-freorder-blocks-and-partition. I've had far less trouble with broken
emerges since I did that, and eliminating all those warnings is nice, too.
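One way to keep the two variables in step is to hold the shared flags in a helper variable and add the C-only flag on top. This is only a sketch for make.conf, assuming plain ${VAR} expansion is available there; COMMON_FLAGS is a name picked for this example, and the partition flag ends up last rather than in its original position, which as far as I know makes no difference for these flags.

```shell
# COMMON_FLAGS is a hypothetical helper name, not a variable portage knows about.
COMMON_FLAGS="-march=k8 -Os -pipe -fomit-frame-pointer -frename-registers
-funit-at-a-time -fweb -freorder-blocks -ftree-pre -fmerge-all-constants"

# C gets the hot/cold partitioning flag on top of the shared set.
CFLAGS="${COMMON_FLAGS} -freorder-blocks-and-partition"

# C++ uses the shared set only, avoiding the warning described above.
CXXFLAGS="${COMMON_FLAGS}"
```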

-ftree-pre is new to 4.x (so you'll want to eliminate it for 3.x
compiles, but the amd64 profiles have filtered out invalid flags
automatically for some time, now =8^). A weaker form of it is -ftree-fre
(-pre is partial redundancy elimination, -fre is full redundancy
elimination; full redundancy is faster to check for but doesn't find as
many cases, so it's weaker). The 4.1 manpage says the -fre form is enabled
by default at -O1, the -pre form by -O2/3. One would guess it'd be logical
to include it with -Os as well, but the manpage doesn't say it is, so...
In any case, the same rule applies here as above -- since I can't be sure
an ebuild won't kill my -Ox setting, if I really want the flag, it's best
to include it specifically. If it really doesn't work for a particular
package, the ebuild should disable the flag specifically anyway.
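The idea of dropping flags that a given compiler rejects can also be sketched by hand. The supported_flags function below is a hypothetical helper, not what the amd64 profiles actually do. It assumes the compiler named in CC (gcc by default) exits non-zero when handed an option it doesn't recognize, and it uses -fsyntax-only on an empty input so that nothing is written to disk.

```shell
# Hypothetical helper: print only the flags that the compiler in $CC accepts.
# Assumption: the compiler exits non-zero for an option it does not recognize.
supported_flags() {
    kept=
    for flag in "$@"; do
        # -fsyntax-only on an empty C input: parse the options, write no output.
        if "${CC:-gcc}" -fsyntax-only "$flag" -x c /dev/null 2>/dev/null; then
            kept="${kept:+$kept }$flag"
        fi
    done
    printf '%s\n' "$kept"
}

# Example (commented out so that merely sourcing this file runs no compiler):
# CFLAGS=$(supported_flags -march=k8 -Os -pipe -ftree-pre)
```

Run against a 3.x compiler, -ftree-pre would presumably be the flag that drops out of the list.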

-fmerge-all-constants breaks the C specifications, so is never enabled by
default. The weaker -fmerge-constants is C spec compliant, and is enabled
with any -O. See the manpage for the details of the distinction and why
it should (in theory) be safe even if it breaks the spec. In any case,
I've had no trouble with it, tho I was prepared to eliminate it if I did.
YMMV of course. This one should contribute significantly toward the goals
of -Os.

As with 4.1, I've had surprisingly few problems with this set of CFLAGS,
once I eliminated -freorder-blocks-and-partition from my CXX flags,
anyway. They seem pretty solid, and I haven't verified whether it's the
CFLAGS or gcc-4.1 or both, but together, they make some rather
impressively fast code! (Again, see the previous thread on xorg-7.0.
Yes, the effect /was/ that impressive! It felt like a good 50%
difference, which is truly astounding in an area where eking out a
hard-fought 1-2% improvement is far more common. Again, xorg 7.1 with EXA
rendering in place of XAA looks set to repeat that, at least on my
hardware, as hard to believe as it may seem, this time due to xorg, not
the compiler, as I'm using 4.1 for both xorg-7.0 and 7.1.)

--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman in
http://www.linuxdevcenter.com/pub/a/linux/2004/12/22/rms_interview.html

--
gentoo-amd64@g.o mailing list