"Mark Knecht" <markknecht@×××××.com> posted
5bdc1c8b0609140715t2c135c9q50bf7ff8cbf1d82e@××××××××××.com, excerpted
below, on Thu, 14 Sep 2006 07:15:42 -0700:

> I'm just curious whether anyone besides me is noticing their machine
> feeling somewhat sluggish since doing the gcc-4.1 upgrade? Mine seems to
> be using a lot of memory. Alt-tabbing between windows seems slow.
> Ethernet traffic in my browser is causing pretty noticeable
> interruptions in things like MythTV.
>
> The machine is still quite usable, but it doesn't feel as snappy as it
> did last week.
>
> I made no changes in /etc/make.conf for the upgrade. Everything is
> pretty basic as far as I can tell:
>
> CFLAGS="-march=k8 -O2 -pipe"
> CXXFLAGS="${CFLAGS}"

I've noticed rather the opposite, here. gcc-4.1.1 compiled binaries are
/dramatically/ faster and more efficient than 3.x. However, I'm using a
rather more elaborate CFLAGS/CXXFLAGS, and it's my conviction that gcc-4.1
does better at optimizing exactly the way you've told it to. That is, if
you've given it inefficient optimizations, I'm convinced it makes a bad
thing worse, while if you've chosen your optimizations well, it makes a
good thing dramatically better.

Here's my CFLAGS/CXXFLAGS:

CFLAGS="-march=k8 -Os -pipe -frename-registers -fweb -freorder-blocks
-freorder-blocks-and-partition -combine -funit-at-a-time -ftree-pre
-fgcse-sm -fgcse-las -fgcse-after-reload -fmerge-all-constants"

CXXFLAGS="-march=k8 -Os -pipe -frename-registers -fweb -freorder-blocks
-funit-at-a-time -ftree-pre -fgcse-sm -fgcse-las -fgcse-after-reload
-fmerge-all-constants"

The general strategy here is to take advantage of size optimization -- on
modern CPUs, L1 and L2 cache are FAR FAR faster than main memory, and
raw CPU speed runs circles around even cache speeds. Thus, optimizing
for CPU speed at the expense of size makes little sense, because all those
saved cycles and more are likely to be spent waiting for memory to return
code that /would/ have fit in the cache were it size optimized.

For example, traditional speed optimizations unroll loops into flat code
where possible, to avoid the expense of the jump back to the top of the
loop. That spreads the loop out to several times its original code size,
taking far more room in fast cache and forcing the CPU to wait far more
often for code to be fetched from main memory. I prefer to keep the
loops, making the code smaller and thus allowing more of it to fit in
faster cache. I believe that for most code, this technique will result in
faster execution in the real world, despite the theoretical loss of a CPU
cycle here or there due to jumping back to the top of the loop.
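
To make the size trade-off concrete, here is a minimal sketch of the same
loop written rolled and hand-unrolled by four. The function names are made
up for illustration and are not from any real package; the compiler does
the equivalent transformation internally when unrolling is enabled. Both
versions return the same sum, but the unrolled one carries roughly four
copies of the loop body plus a cleanup loop, and that extra code is what
competes for instruction cache.

```c
#include <assert.h>
#include <stddef.h>

/* Rolled loop: one add and one backward branch per element; small code. */
long sum_rolled(const int *v, size_t n)
{
    long total = 0;
    size_t i;

    assert(v != NULL || n == 0);
    for (i = 0; i < n; i++)
        total += v[i];
    return total;
}

/* Hand-unrolled by 4: one backward branch per four elements, but about
 * four copies of the loop body, plus a cleanup loop for the leftovers. */
long sum_unrolled4(const int *v, size_t n)
{
    long total = 0;
    size_t i = 0;

    assert(v != NULL || n == 0);
    for (; i + 4 <= n; i += 4) {
        total += v[i];
        total += v[i + 1];
        total += v[i + 2];
        total += v[i + 3];
    }
    for (; i < n; i++)   /* leftover elements when n is not a multiple of 4 */
        total += v[i];
    return total;
}
```

Compiling a file like this with -Os and then with -O2 -funroll-loops and
comparing the output of size(1) on the two object files is one way to see
the difference in code size for yourself.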

The -freorder-blocks-and-partition flag, OTOH, can make code slightly
larger, but the effect is the same as the above: increased execution
speed. What this optimization does is separate code that is used often
from that which is seldom used, so the "hot" code is smaller and fits
better in high speed cache, while the "cold" code ends up in slower main
memory most of the time. While a lower percentage of the code may be in
cache due to the larger size, cache will be used far more effectively, as
more "hot" code will be retained therein, with the cold code that's not
used so often allowed to drop out of cache into main memory. This
particular optimization doesn't work well with C++, however, so it's in
my CFLAGS but not my CXXFLAGS.
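
The hot/cold idea can be sketched by hand at the source level. The
following is only an analogy, not what the flag does internally: the flag
works automatically on basic blocks, while this example uses two GCC
extensions, __builtin_expect and __attribute__((noinline)), to mark a
rarely taken branch and keep its code out of the loop. The function names
and the "count it and contribute zero" policy are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Cold path: rarely executed. noinline (a GCC extension) keeps its body
 * out of the hot loop below. The policy here is purely hypothetical. */
__attribute__((noinline))
static long reject_negative(size_t *bad_count)
{
    (*bad_count)++;
    return 0;
}

/* Hot path: sums the non-negative entries of v[0..n-1]. */
long sum_nonnegative(const int *v, size_t n, size_t *bad_count)
{
    long total = 0;
    size_t i;

    assert(bad_count != NULL);
    assert(v != NULL || n == 0);
    for (i = 0; i < n; i++) {
        /* __builtin_expect(cond, 0) hints that cond is usually false. */
        if (__builtin_expect(v[i] < 0, 0))
            total += reject_negative(bad_count);
        else
            total += v[i];
    }
    return total;
}
```

As far as I recall from the gcc documentation, the partitioning is switched
off automatically in the presence of exception handling, which would fit
with leaving it out of CXXFLAGS.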

Likewise with -combine, which allows the compiler to optimize across
multiple source files at a time. It's only implemented for C at this time
(according to the gcc manpage), so it's in my CFLAGS but omitted from my
CXXFLAGS.

The other strategy here is to make as full a use as possible of the extra
registers available to amd64 in 64-bit mode (as opposed to 32-bit x86
mode). Registers operate at the speed of the CPU, with no wait at all,
unlike even L1 cache, so it pays to use them as efficiently as possible.
Several of the flags (-frename-registers of course, -fweb, etc) in my
CFLAGS are therefore designed to encourage gcc to do this.
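
As a rough source-level picture of what -fweb is after (the gcc manpage
describes it as giving each "web" its own pseudo register), consider one
variable reused for two unrelated values. This is only an illustration
with made-up function names; the real pass works on the compiler's
internal representation, not on C source. Splitting the two independent
uses lets each value be allocated to a register on its own.

```c
#include <assert.h>

/* One variable, t, reused for two unrelated values. */
int scale_reused(int a, int b)
{
    int t;
    int result;

    t = a * 2;
    result = t + 1;
    t = b * 3;          /* unrelated to the first use of t */
    result += t - 1;
    return result;
}

/* The same computation with each independent value given its own name,
 * roughly what splitting t into two webs amounts to. */
int scale_split(int a, int b)
{
    int t1 = a * 2;
    int t2 = b * 3;

    return (t1 + 1) + (t2 - 1);
}

/* Sanity check: both forms compute the same value. */
void check_scale(void)
{
    assert(scale_reused(5, 7) == scale_split(5, 7));
}
```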

All the flags I've not mentioned specifically are designed to further the
three common goals mentioned above: making as efficient a use as possible
of the speed of (1) registers and (2) cache memory, and (3) allowing gcc
to optimize over as wide a scope as possible (whole units with
-funit-at-a-time, or even multiple units with -combine). Of course, see
the gcc manpage for additional details.

As I said, with the above, there's a /dramatic/ improvement in
performance between gcc-3.x and gcc-4.1.x.

--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman

--
gentoo-amd64@g.o mailing list