Jani Averbach <jaa@×××××××.fi> posted 20060607010816.GA31588@×××××××.fi,
excerpted below, on Tue, 06 Jun 2006 19:08:16 -0600:

> Inspired by your comment, I installed 4.1.1 and did a very un-scientific
> test: dcraw compiled [1] with gcc 3.4.5 and 4.1.1, then converted one
> raw picture with each:
>
> time dcraw-3 -w test.CR2
> real 0m10.338s
> user 0m9.969s
> sys 0m0.332s
>
> time dcraw-4 -w test.CR2
> real 0m9.141s
> user 0m8.849s
> sys 0m0.292s
>
> This is pretty good, and that was only dcraw itself; all the libraries
> are still built with gcc 3.4.x.
>
> BR, Jani
>
> P.S. gcc -march=k8 -o dcraw -O3 dcraw.c -lm -ljpeg -llcms

Very interesting. I hadn't done any similar direct comparisons, but had
just been amazed at how much more responsive things seem with 4.1.x as
compared to 3.4.x. Given the generally agreed rule of thumb that users
won't definitively notice a performance difference of less than about
15%, I'd estimate a difference of at least 20%, with everything compiled
with 4.1.x as compared to 3.4.x.

One test I've always been interested in but have never done is the effect
of -Os vs. -O2 vs. -O3. I think it's generally agreed from testing that
-O2 makes a BIG difference as compared to unoptimized or -O (-O1), but
the differences between -O2, -O3, and -Os are less clearly defined and,
it would appear, the best choice depends on what one is compiling. I know
-O3 is actually supposed to be worse than -O2 in many cases, because loop
unrolling and similar optimizations tend to markedly increase code size,
and the cost of the resulting cache misses -- the CPU idling while it
waits for memory -- is often worse than the cycles saved by the
additional optimization.

For that reason, I've always tended toward what could be argued to be the
other extreme, favoring -Os over -O2, figuring that in a multitasking
environment, even with today's increased cache sizes, and with main
memory increasingly falling behind CPU speeds (well, until CPU speeds
started leveling off recently as the focus moved to multi-core instead),
-Os should in general be faster than -O2 for the same reason that -O2 is
so often faster than -O3.

OTOH, there are a couple of specific optimizations that can increase
overall code size while increasing cache hit ratios as well, negating the
general cache benefits of -Os. Perhaps the most significant of these,
where it can be used, is -freorder-blocks-and-partition. The effect of
this flag is to cause gcc to try to regroup routines into "hot" and
"cold", with each group in its own "partition". Hot routines are those
called most frequently, cold ones the least, so the effect is that
despite a bit of overall increase in code size, the most frequently used
routines will be in cache a much higher percentage of the time as
compared to generally un-reordered routines/blocks. In theory, that could
dramatically affect performance, as the CPU will far more frequently find
the stuff it needs in cache and not have to wait for it to be retrieved
from much slower main memory.

The biggest problem with this flag is that there's a LOT of code that
can't use it, including the exception-handling code so common in C++.
Now, gcc does spot that and turn the flag off, so no harm done, but it
spits out warnings in the process, saying that it turned it off, and this
breaks a lot of ebuilds in the configure step, as they often abort on
those warnings when they shouldn't. As a result, I've split my CFLAGS
from my CXXFLAGS and only include -freorder-blocks-and-partition in my
CFLAGS, omitting it from CXXFLAGS. I also have the weaker form
-freorder-blocks (without the -and-partition) in both CFLAGS and
CXXFLAGS, so it gets used where the stronger partitioning form is turned
off. I've not done actual performance tests on this either way, but I do
know the occasional problem with an aborted ebuild I'd have with the
partition version in CXXFLAGS is no longer a problem with it only in
CFLAGS, and it hasn't seemed to cause me any /problems/ since then, quite
apart from its not-verified-by-me effect on performance.

Likewise with the flags -frename-registers and -fweb. Since registers
are the fastest of all memory, operating at full CPU speed, and these
flags increase the efficiency of register allocation, it is IMO worth
invoking them even at the expense of slightly increased code size. This
is likely to be particularly true on amd64/x86_64, with its increased
number of registers in comparison to x86. Our arch has those extra
registers; we might as well make the most of them!
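
Putting the last two paragraphs together, here's a sketch of how such a
CFLAGS/CXXFLAGS split might look in /etc/make.conf -- these exact values
are my illustration of the scheme described above, not a tested
recommendation:

```shell
# /etc/make.conf (sketch)
# Shared baseline: -Os for overall cache friendliness, plus the
# register-allocation helpers discussed above.
COMMON_FLAGS="-march=k8 -Os -pipe -frename-registers -fweb"

# C gets the strong hot/cold partitioning; C++ only gets the weaker
# -freorder-blocks, since the partition form trips over exception
# code and its warnings can abort configure scripts.
CFLAGS="${COMMON_FLAGS} -freorder-blocks-and-partition -freorder-blocks"
CXXFLAGS="${COMMON_FLAGS} -freorder-blocks"
```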

Conversely, I don't like the unroll-loops flags at all. It's my
(unverified, no argument there) belief that these will be a BIG drag on
performance, because they blow up single loop structures to multiple
times their original size, all in an (IMO misguided) effort to inline the
loops and avoid a few jump instructions. Jumps are far less costly on
x86_64, and even on full 32-bit i586+ x86, than they were on the original
16-bit 8088-80486 generations. That's particularly true in the case of
tight loops, where the entire loop will be in L1 cache. With proper
prefetching, it's /possible/ inline unrolling of the loops could keep the
registers full from L1 and the code running at full CPU speed, as opposed
to the slight waits possibly necessary at the loopback jump for a fetch
from L1 instead of continuing full-speed register operations, but I
believe it's much more likely that unrolling the loops will either force
code out to L2, or that the prefetching couldn't keep the registers
sufficiently full even with the unrolling, so there'd be the wait in any
case. -O2 does a bit of simple loop unrolling, which -Os should
discourage, but -O3 REALLY turns on the unrolling (de)optimizations, if
I'm reading the gcc manpages correctly, anyway. It's that size-intensive
loop unrolling that I most want to discourage, which is why I'd seldom
consider -O3 at all, and the big reason why I favor -Os over -O2, even
given -O2's limited loop unrolling.

However, arguably -O2's limited loop unrolling is more optimal than
discouraging it with -Os. I believe it actually comes down to the code in
question, and which is "better" as an overall system CFLAGS choice very
likely depends on exactly which applications one actually chooses to
merge. It's also very likely dependent on how much multitasking an
individual installation routinely sees, whether that's single-core or
multi-core/multi-CPU based multitasking, plus the specifics of the
sub-arch caching implementation. (Intel's memory management, particularly
as the number of cores and CPUs increases, isn't at this point as
efficient as AMD's, tho with Conroe, Intel is likely to brute-force its
way back into the leadership position for the single- and dual-core
models normally found on desktops/laptops and low-end workstations,
anyway, despite AMD's memory management being more elegant currently and
scaling better as the number of cores and CPUs increases to 4 and above.)

...

I **HAVE** come across a single **VERY** convincing demonstration of the
problems with gcc 3.x on amd64/x86_64, however. This one blew me away --
it was TOTALLY unexpected, and another guy and I spent quite some
troubleshooting time finding it as a result.

Those of you using pan as your news client of choice may already be aware
that there's a newer 0.90+ beta series available. Portage has a couple of
masked ebuilds for the series, but hasn't been keeping up, as a new one
has been coming out every weekend since April first (with this past
weekend an exception; Charles, the main developer, took a few days'
vacation). Therefore, one can either build from source, or do what I've
been doing and rename the ebuild (in my overlay) for each successive
weekly release. (My overlay ebuild is slightly modified as well, but no
biggie for this discussion.)

Well, back before gcc-4.1.x was unmasked to ~amd64, one guy on the PAN
groups was having a /terrible/ time compiling the new PAN series with the
then-latest ~amd64 gcc-3.4.x. With a gigabyte of memory, plus swap, he
kept running into insufficient-memory errors.

I wondered how that could be, as I'm running a generally ~amd64 system
myself and had experienced no issues, and while I'm running 8 gig of
memory now, that's a fairly recent upgrade, and I had neither experienced
problems compiling pan before nor noticed it using a lot of memory after
the upgrade. I run ulimit set to a gig of virtual memory (ulimit -v) by
default, and certainly would have expected to run into issues compiling
pan with that if it required that sort of memory. I routinely /do/ run
into such problems merging kmail, and always have to boost my ulimit
settings to compile it, so I knew it would happen if pan really required
that sort of memory to compile.
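
For anyone wanting to reproduce that cap, the limit is per shell and
inherited by its children, with the value given in KiB, so a gig is
1048576:

```shell
# Cap virtual memory for this shell and its children at 1 GiB.
# ulimit -v takes its value in KiB, so 1 GiB = 1048576.
ulimit -S -v 1048576
ulimit -S -v
# A compile started from this shell that needs more virtual memory
# than that per process will now fail with out-of-memory errors
# instead of pushing the whole box into swap.
```

The second ulimit call just echoes back the soft limit now in force.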

As it happened, while he was running ~amd64, he wasn't routinely using
--deep with his emerge --update world runs, so he had a number of
packages that were less than the very latest, and that's what he and I
focused on first as the difference between his system and mine, figuring
a newer version of /something/ I had must explain why I had no problem
compiling it while he did.

After he upgraded a few packages with no change in the problem, someone
else mentioned that it might be gcc. It turned out he was right: it WAS
gcc. With gcc-3.4.x, compiling the new pan on amd64 at one point requires
an incredible 1.3 gigabytes of usable virtual memory for a single compile
job! (That's a single job as counted by the MAKEOPTS=-jX setting.) He
apparently had enough memory and swap to do it -- if he shut down X and
virtually everything else he was running -- but was experiencing errors
due to lack of one or the other with everything he normally had running
continuing to run while he compiled pan.
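
The practical upshot for make.conf is that the -j count has to be
budgeted against that per-job peak. A sketch, taking the 1.3 GB figure
from this thread as the working assumption:

```shell
# /etc/make.conf (sketch): if a single compile job can peak near
# 1.3 GB of virtual memory under gcc-3.4.x, the -j count must be
# budgeted against physical RAM plus swap.  On a 1 GB box even -j1
# is marginal for a package like this; with 8 GB, a higher count
# still leaves headroom.
MAKEOPTS="-j2"
```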

When out of curiosity I checked how much memory it took with gcc-4.1.0,
the version I was running at the time (tho it was masked for Gentoo
users), I quickly saw why I hadn't noticed a problem -- less than 300 MB
usage at any point. I /think/ it was actually less than 200, but I didn't
verify that. In any case, even against 300 MB, gcc 3.4.x was using OVER
FOUR TIMES that, at JUST UNDER the 1.3 GB required. No WONDER I hadn't
noticed anything unusual compiling it with gcc-4.1.x, while he had all
sorts of problems with gcc-3.4.x!

I haven't verified this on x86, but I suspect the reason it didn't come
up with anyone else is that it's not a problem on x86. gcc 3.4.x is
apparently fairly efficient at dealing with 32-bit memory addresses and
is already reasonably optimized for x86. The same cannot be said for its
treatment of amd64. While this pan case is certainly an extreme corner
case, it does serve to emphasize the fact that gcc-3.x was simply not
designed for amd64/x86_64, and its x86_64 capabilities are and will
remain "bolted on" and, as such, far more cumbersome and less efficient
than they /could/ be. The 4.x rewrite provided the opportunity to change
that, and it was taken. As I've said, however, 4.0 was /just/ the rewrite
and didn't really do much else but try to keep regressions to a minimum.
With the 4.1 series, gcc support for amd64/x86_64 is FINALLY coming into
its own, and the performance improvements dramatically demonstrate that.
The jump from 3.4.x to 4.1.x is truly the most significant thing to
happen to gcc support for amd64 since support was originally added, and
it's probably the biggest jump we'll ever see: while improvements will
continue to be made, from this point on they will be incremental --
significant, yes, but not the blow-me-away improvements of 4.1, much as
improvements have been only incremental on x86 for some time.

...

Anyway... thanks for that little test. The results are certainly
enlightening. I'd /love/ to see some tests of the above -Os vs. -O2 vs.
-O3, and of the register and reorder flags vs. the standard -Ox alone, if
you're up to it, but I haven't bothered to run them myself, and just this
little test alone was quite informative and definitely more concrete than
the "feel" I've been basing my comments on to date. Hopefully my comments
above prove useful to someone as well, and, if I'm lucky, motivation for
some tests (by you or someone else) to prove or disprove them. =8^)

--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman

--
gentoo-amd64@g.o mailing list