Gentoo Archives: gentoo-amd64

From: Duncan <1i5t5.duncan@×××.net>
To: gentoo-amd64@l.g.o
Subject: [gentoo-amd64] Re: gcc 4.1.1
Date: Wed, 07 Jun 2006 08:24:15
Message-Id: e66277$1kt$1@sea.gmane.org
In Reply to: [gentoo-amd64] Re: gcc 4.1.1 by Jani Averbach
Jani Averbach <jaa@×××××××.fi> posted 20060607010816.GA31588@×××××××.fi,
excerpted below, on Tue, 06 Jun 2006 19:08:16 -0600:

> Inspired by your comment, I installed 4.1.1 and did a very un-scientific
> test: dcraw compiled [1] with gcc 3.4.5 and 4.1.1. Then converted one raw
> picture with it:
>
> time dcraw-3 -w test.CR2
> real 0m10.338s
> user 0m9.969s
> sys 0m0.332s
>
> time dcraw-4 -w test.CR2
> real 0m9.141s
> user 0m8.849s
> sys 0m0.292s
>
> This is pretty good, and that was only dcraw itself; all the libraries
> are still built with gcc 3.4.x.
>
> BR, Jani
>
> P.S. gcc -march=k8 -o dcraw -O3 dcraw.c -lm -ljpeg -llcms

Very interesting. I hadn't done any similar direct comparisons, but had
just been amazed at how much more responsive things seem to be with 4.1.x
as compared to 3.4.x. Given the generally agreed rule of thumb that users
won't definitively notice a performance difference of less than about 15%,
I've estimated at minimum a 20% difference, with everything compiled with
4.1.x as compared to 3.4.x.

One test I've always been interested in but have never done is the effect
of -Os vs -O2 vs -O3. I think it's generally agreed from testing that -O2
makes a BIG difference as opposed to unoptimized or -O (-O1), but the
differences between -O2, -O3, and -Os are less clearly defined and, it
would appear, the best choice depends on what one is compiling. I know -O3
is actually supposed to be worse than -O2 in many cases, because loop
unrolling and similar optimizations generally increase code size markedly,
and the cost of the resulting cache misses, with the CPU idling while it
waits for memory, is often worse than the cycles saved by the additional
optimization.

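Jani's dcraw test would actually make a decent quick-and-dirty vehicle for
checking exactly that, since it's a single self-contained .c file.
Something along these lines, varying only the -O level (paths and -march
as appropriate), would at least give a data point, tho of course one small
program says little about a whole system:

  gcc -march=k8 -Os -o dcraw-Os dcraw.c -lm -ljpeg -llcms
  gcc -march=k8 -O2 -o dcraw-O2 dcraw.c -lm -ljpeg -llcms
  gcc -march=k8 -O3 -o dcraw-O3 dcraw.c -lm -ljpeg -llcms

  time ./dcraw-Os -w test.CR2
  time ./dcraw-O2 -w test.CR2
  time ./dcraw-O3 -w test.CR2
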
For that reason, I've always tended to go to what could be argued to be
the other extreme, favoring -Os over -O2, figuring that in a multitasking
environment, even with today's increased cache sizes, and with main memory
increasingly falling behind CPU speeds (well, until CPU speeds started
leveling off recently as they moved to multi-core instead), -Os should in
general be faster than -O2 for the same reason that -O2 is so often faster
than -O3.

OTOH, there are a couple of specific optimizations that can increase
overall code size while increasing cache hit ratios as well, thereby
offsetting the general cache-hit advantage of -Os. Perhaps the most
significant of these, where it can be used, is
-freorder-blocks-and-partition. The effect of this flag is to cause gcc to
try to regroup routines into "hot" and "cold", with each group in its own
"partition". Hot routines are those called most frequently, cold the least
frequently, so the effect is that despite a bit of overall increase in
code size, the most frequently used routines will be in cache a much
higher percentage of the time as compared to generally un-reordered
routines/blocks. In theory, that could dramatically affect performance, as
the CPU will far more frequently find the stuff it needs in cache and not
have to wait for it to be retrieved from much slower main memory. The
biggest problem with this flag is that there's a LOT of code that can't
use it, including the exception-handling code so common in C++. Now, gcc
does spot that and turn the flag off, so no harm done, but it spits out
warnings in the process saying that it turned it off, and this breaks a
lot of ebuilds in the configure step, as they often abort on those
warnings when they shouldn't. As a result, I've split my CFLAGS from my
CXXFLAGS and only include -freorder-blocks-and-partition in my CFLAGS,
omitting it from CXXFLAGS. I also have the weaker form -freorder-blocks
(without the -and-partition) in both CFLAGS and CXXFLAGS, so it gets used
where the stronger partitioning form is turned off. I've not done actual
performance tests on this either way, but I do know the occasional problem
with an aborted ebuild I'd have with the partition version in CXXFLAGS is
no longer a problem with it only in CFLAGS, and it hasn't seemed to cause
me any /problems/ since then, quite apart from its not-verified-by-me
effect on performance.

Likewise with the flags -frename-registers and -fweb. Since registers are
the fastest of all memory, operating at full CPU speed, and these flags
increase the efficiency of register allocation, it is IMO worth invoking
them even at the expense of slightly increased code size. This is likely
to be particularly true on amd64/x86_64, with its increased number of
registers in comparison to x86. Our arch has those extra registers; we
might as well make as much use of them as we possibly can!

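To make the split concrete, in make.conf terms it amounts to something
like the following (the base -march/-Os/-pipe values here are purely
illustrative; the reorder and register flags are the point):

  CFLAGS="-march=k8 -Os -pipe -freorder-blocks -freorder-blocks-and-partition -frename-registers -fweb"
  CXXFLAGS="-march=k8 -Os -pipe -freorder-blocks -frename-registers -fweb"
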
Conversely, I don't like the loop-unrolling flags at all. It's my
(unverified, no argument there) belief that these will be BIG drags on
performance, because they blow up single loop structures to multiple times
their original size, all in an (IMO misguided) effort to inline the loops
and save a few jump instructions. Jumps are far less costly on x86_64, and
even on full 32-bit i586+ x86, than they were on the original 8088 through
80486 generations. That's particularly true in the case of tight loops
where the entire loop will be in L1 cache. With proper pre-fetching, it's
/possible/ that inline unrolling of the loops could keep the registers
full from L1 and the code running at full CPU speed, as opposed to the
slight waits possibly necessary at the loopback jump for a fetch from L1
instead of being able to continue full-speed register operations, but I
believe it's much more likely that inlining the unrolled loops will either
force code out to L2, or that the prefetching couldn't keep the registers
sufficiently full even with inlining, so there'd be the wait in any case.
-O2 does a bit of simple loop unrolling, which -Os should discourage, but
-O3 REALLY turns on the unrolling (de)optimizations, if I'm correctly
reading the gcc manpages, anyway. It's that size-intensive loop unrolling
that I most want to discourage, which is why I'd seldom consider -O3 at
all, and the big reason why I favor -Os over -O2, limited as -O2's loop
unrolling is.

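For anyone who wants to see the size effect itself, one rough check is to
build the same source with and without explicit unrolling and compare the
text segments, again using dcraw as a convenient one-file guinea pig
(flags illustrative):

  gcc -march=k8 -O2 -o dcraw-plain dcraw.c -lm -ljpeg -llcms
  gcc -march=k8 -O2 -funroll-loops -o dcraw-unrolled dcraw.c -lm -ljpeg -llcms
  size dcraw-plain dcraw-unrolled

The text column from size shows how much the unrolled build grows; timing
both as above would show whether that extra size actually buys anything on
a given CPU/cache combination.
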
However, arguably, -O2's limited loop unrolling is more optimal than
discouraging it with -Os. I believe it would actually come down to the
code in question, and which is "better" as an overall system CFLAGS choice
very likely depends on exactly what applications one actually chooses to
merge. It's also very likely dependent on how much multitasking an
individual installation routinely gets, and whether that's single-core or
multi-core/multi-CPU based multitasking, plus the specifics of the
sub-arch caching implementation. (Intel's memory management, particularly
as the number of cores and CPUs increases, isn't at this point as
efficient as AMD's, tho with Conroe Intel is likely to brute-force its way
back into the leadership position, for the single and dual-core models
normally found on desktops/laptops and low-end workstations anyway,
despite AMD's currently more elegant memory management and better scaling
as the number of cores and CPUs climbs to 4 and above.)

...

I *HAVE* come across a single *VERY* convincing demonstration of the
problems with gcc 3.x on amd64/x86_64, however. This one blew me away --
it was TOTALLY unexpected, and another guy and I spent quite some
troubleshooting time finding it as a result.

Those of you using pan as your news client of choice may already be aware
that there's a newer 0.90+ beta series available. Portage has a couple of
masked ebuilds for the series, but hasn't been keeping up, as a new one
has been coming out every weekend since April first (with this past
weekend an exception; Charles, the main developer, took a few days'
vacation). Therefore, one can either build from source, or do what I've
been doing and rename the ebuild (in my overlay) for each successive
weekly release. (My overlay ebuild is slightly modified as well, but no
biggie for this discussion.)

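The renaming itself is nothing fancy, for anyone wanting to do the same;
assuming the usual /usr/local/portage overlay location, and with the
version numbers below standing in for whatever the current weekly release
happens to be, it's just:

  cp /usr/local/portage/net-nntp/pan/pan-0.XX.ebuild \
     /usr/local/portage/net-nntp/pan/pan-0.YY.ebuild
  ebuild /usr/local/portage/net-nntp/pan/pan-0.YY.ebuild digest
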
Well, back before gcc-4.1.x was unmasked to ~amd64, one guy on the pan
groups was having a /terrible/ time compiling the new pan series with the
then-latest ~amd64 gcc-3.4.x. With a gigabyte of memory, plus swap, he
kept running into insufficient-memory errors.

I wondered how that could be, as I'm running a generally ~amd64 system
myself and had experienced no issues, and while I'm running 8 gig of
memory now, that's a fairly recent upgrade, and I had neither experienced
problems compiling pan before nor noticed it using a lot of memory after
the upgrade. I run ulimit set to a gig of virtual memory (ulimit -v) by
default, and would certainly have expected to run into issues compiling
pan with that limit if it required that sort of memory. I routinely /do/
run into such problems merging kmail, and always have to boost my ulimit
settings to compile it, so I knew it would have shown up if pan really
required that sort of memory to compile.

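For reference, that's just the shell's per-process virtual memory limit;
the default and the boost for a kmail merge amount to nothing more than,
for example (values in KB, and the boosted figure is only an example):

  ulimit -v 1048576    # the 1 GB default I run
  ulimit -v 4194304    # the sort of boost a kmail merge needs
  emerge kmail
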
As it happened, while he was running ~amd64, he wasn't routinely using
--deep with his emerge --update world runs, so he had a number of packages
that were less than the very latest, and that's what he and I focused on
first as the difference between his system and mine, figuring that a newer
version of /something/ I had must explain why I had no problem compiling
it while he did.

After he upgraded a few packages with no change in the problem, someone
else mentioned that it might be gcc. Turned out he was right: it WAS gcc.
With gcc-3.4.x, compiling the new pan on amd64 at one point requires an
incredible 1.3 gigabytes of usable virtual memory for a single compile
job! (That's the MAKEOPTS=-jX setting.) He apparently had enough memory
and swap to do it -- if he shut down X and virtually everything else he
was running -- but was experiencing errors due to lack of one or the
other when everything he normally had running was left running while he
compiled pan.

When out of curiosity I checked how much memory it took with gcc-4.1.0,
the version I was running at the time (tho it was masked for Gentoo
users), I quickly saw why I hadn't noticed a problem -- less than 300 MB
usage at any point. I /think/ it was actually less than 200, but I didn't
verify that. In any case, even at 300 MB, gcc 3.4.x was using OVER FOUR
TIMES that, at JUST UNDER the 1.3 GB required. No WONDER I hadn't noticed
anything unusual compiling it with gcc-4.1.x, while he had all sorts of
problems with gcc-3.4.x!

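For anyone wanting to check a build's memory appetite themselves, watching
the compiler processes from another terminal while the merge runs is
enough to catch the peak; something like the following, with the interval
being arbitrary:

  watch -n 2 'ps -o pid,vsz,rss,args -C cc1plus'

The vsz column there is the virtual size in KB, so the 1.3 GB figure above
shows up as a number in the 1.3-million range.
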
I haven't verified this on x86, but I suspect the reason it didn't come up
with anyone else is that it's not a problem on x86. gcc 3.4.x is
apparently fairly efficient at dealing with 32-bit memory addresses and is
already reasonably optimized for x86. The same cannot be said for its
treatment of amd64. While this pan case is certainly an extreme corner
case, it does serve to emphasize the fact that gcc-3.x was simply not
designed for amd64/x86_64, and its x86_64 capabilities are and will remain
"bolted on", and as such far more cumbersome and less efficient than they
/could/ be. The 4.x rewrite provided the opportunity to change that, and
it was taken. As I've said, however, 4.0 was /just/ the rewrite, and
didn't really do much else but try to keep regressions to a minimum. With
the 4.1 series, gcc support for amd64/x86_64 is FINALLY coming into its
own, and the performance improvements dramatically demonstrate that. The
jump from 3.4.x to 4.1.x is truly the most significant thing to happen to
gcc support for amd64 since support was originally added, and it's
probably the biggest jump we'll ever see. While improvements will continue
to be made, from this point on they will be incremental improvements:
significant, yes, but not the blow-me-away improvements of 4.1, much as
improvements have been only incremental on x86 for some time.

...

Anyway... thanks for that little test. The results are certainly
enlightening. I'd /love/ to see some tests of the above -Os vs -O2 vs -O3
question, and of the register and reorder flags vs the standard -Ox alone,
if you are up to it, but I haven't bothered to run them myself, and just
this little test alone was quite informative and definitely more concrete
than the "feel" I've been basing my comments on to date. Hopefully my
above comments prove useful to someone as well, and, if I'm lucky, serve
as motivation for some tests (by you or someone else) to prove or disprove
them. =8^)

--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman

--
gentoo-amd64@g.o mailing list