Gentoo Archives: gentoo-amd64

From: Duncan <1i5t5.duncan@×××.net>
To: gentoo-amd64@l.g.o
Subject: [gentoo-amd64] Re: gcc 4.1 upgrade - bad desktop interactivity anyone?
Date: Thu, 14 Sep 2006 20:10:46
Message-Id: eeccsd$99s$2@sea.gmane.org
In Reply to: [gentoo-amd64] gcc 4.1 upgrade - bad desktop interactivity anyone? by Mark Knecht
1 "Mark Knecht" <markknecht@×××××.com> posted
2 5bdc1c8b0609140715t2c135c9q50bf7ff8cbf1d82e@××××××××××.com, excerpted
3 below, on Thu, 14 Sep 2006 07:15:42 -0700:
4
5 > I'm just curious whether anyone besides me is noticing their machine
6 > feeling somewhat sluggish since doing the gcc-4.1 upgrade? Mine seems to
7 > be using a lot of memory. Alt-tabbing between windows seems slow.
8 > Ethernet traffic in my browser is causing pretty noticeable
9 > interruptions in things like MythTV.
10
11 > The machine is still quite usable, but it doesn't feel as snappy as it
12 > did last week.
13 >
14 > I made no changes in /etc/make.conf for the upgrade. Everything is
15 > pretty basic as far as I can tell:
16 >
17 > CFLAGS="-march=k8 -O2 -pipe"
18
19 > CXXFLAGS="${CFLAGS}"
20
21 I've noticed rather the opposite, here. gcc-4.1.1 compiled binaries are
22 /dramatically/ faster and more efficient than 3.x. However, I'm using a
23 rather more elaborate CFLAGS/CXXFLAGS, and it's my conviction that gcc-4.1
24 does better at optimizing exactly the way you've told it to. That is, if
25 you've given it inefficient optimizations, I'm convinced it makes a bad
26 thing worse, while if you've chosen your optimizations well, it makes a
27 good thing dramatically better.
28
29 Here's my CFLAGS/CXXFLAGS:
30
31 CFLAGS="-march=k8 -Os -pipe -frename-registers -fweb -freorder-blocks
32 -freorder-blocks-and-partition -combine -funit-at-a-time -ftree-pre
33 -fgcse-sm -fgcse-las -fgcse-after-reload -fmerge-all-constants"
34
35 CXXFLAGS="-march=k8 -Os -pipe -frename-registers -fweb -freorder-blocks
36 -funit-at-a-time -ftree-pre -fgcse-sm -fgcse-las -fgcse-after-reload
37 -fmerge-all-constants"
38
39 The general strategy here is to take advantage of size optimization -- on
40 modern CPUs, L1 and L2 cache are FAR FAR faster than main memory, and
41 raw CPU cycles run circles around even cache speeds. Thus, optimizing
42 for CPU speed at the expense of size makes little sense, because all those
43 saved cycles and more are likely to be spent waiting for memory to return
44 code that /would/ have fit in the cache were it size optimized.
45
46 For example, traditional speed optimizations unroll loops into flat code
47 where possible, to avoid the expense of the jump back to the top of the
48 loop. That spreads the loop out to several times its original code size,
49 so it takes far more room in fast cache and forces the CPU to wait far
50 more often for code to be fetched from main memory. I prefer to keep
51 the loops, making the code smaller and thus allowing more of it to fit in
52 faster cache. I believe that for most code, this technique will result in
53 faster execution in the real world, despite the theoretical loss of a CPU
54 cycle here or there due to jumping back to the top of the loop.
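
To make the size trade-off concrete, here is a minimal C sketch of the same sum written rolled and hand-unrolled by four. The function names are hypothetical, and the unrolled form is only roughly the shape an unroller such as gcc's -funroll-loops might produce, not actual compiler output. Both return the same result; the second simply has a loop body about four times as large.

```c
#include <assert.h>
#include <stddef.h>

/* Rolled form: one add and one loop branch per element. Smallest code. */
long sum_rolled(const int *v, size_t n)
{
    long total = 0;
    assert(v != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        total += v[i];
    return total;
}

/* Hand-unrolled by four (illustrative only): one loop branch per four
 * elements, at the cost of a larger body plus a remainder loop. */
long sum_unrolled4(const int *v, size_t n)
{
    long total = 0;
    size_t i = 0;
    assert(v != NULL || n == 0);
    for (; i + 4 <= n; i += 4) {
        total += v[i];
        total += v[i + 1];
        total += v[i + 2];
        total += v[i + 3];
    }
    for (; i < n; i++)      /* leftover elements when n is not a multiple of 4 */
        total += v[i];
    return total;
}
```

Which form wins in practice depends on the workload and the cache sizes involved, so measuring on the real program is the only reliable test.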
55
56 The -freorder-blocks-and-partition flag, OTOH, can make code slightly
57 larger, but the goal is the same as above: faster execution. What
58 this optimization does is separate code that is used often from that which
59 is seldom used, so the "hot" code is smaller and fits better in high speed
60 cache, while the "cold" code ends up in slower main memory most of the
61 time. While a lower percentage of the code may be in cache due to the
62 larger size, cache will be used far more effectively, as more "hot" code
63 will be retained therein, with the cold code that's not used so often
64 allowed to drop out of cache into main memory. This particular
65 optimization doesn't work well with C++, however, so it's in my CFLAGS but
66 not my CXXFLAGS.
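
The hot/cold idea can also be expressed by hand. The sketch below is an assumption-laden illustration, not what the flag itself emits: it uses gcc's __builtin_expect and the noinline attribute to mark a rarely taken branch and keep its handler out of the hot loop. The names UNLIKELY, NOINLINE, handle_bad_sample and sum_samples are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__GNUC__)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)   /* hint: condition is rarely true */
#define NOINLINE    __attribute__((noinline))    /* keep the cold code out of line */
#else
#define UNLIKELY(x) (x)
#define NOINLINE
#endif

/* Cold path: count the bad sample and substitute zero for it. */
static NOINLINE int handle_bad_sample(int sample, unsigned *bad_count)
{
    (void)sample;
    *bad_count += 1;
    return 0;
}

/* Hot path: sum the samples. Negative samples are assumed to be rare,
 * so their handling lives in a separate function. */
long sum_samples(const int *v, size_t n, unsigned *bad_count)
{
    long total = 0;
    assert(bad_count != NULL);
    *bad_count = 0;
    for (size_t i = 0; i < n; i++) {
        int s = v[i];
        if (UNLIKELY(s < 0))
            s = handle_bad_sample(s, bad_count);
        total += s;
    }
    return total;
}
```

With the flag, the compiler makes this split itself, based on its own estimates or profile data, rather than on annotations like these.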
67
68 Likewise with -combine, which allows the compiler to optimize across
69 multiple source files at a time. It's only implemented for C at this time
70 (according to the gcc manpage), so it's in my CFLAGS but omitted from my
71 CXXFLAGS.
72
73 The other strategy here is to make as full a use of the extra registers
74 available to amd64 in 64-bit mode (as opposed to 32-bit x86 mode) as
75 possible. Registers operate at the speed of the CPU, no wait at all, as
76 there is for even L1 cache, so it pays to use them as efficiently as
77 possible. Several of the flags (-frename-registers of course, -fweb, etc)
78 in my CFLAGS are therefore designed to encourage gcc to do this.
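
As a rough illustration of what -fweb is after, consider a temporary that is reused for two unrelated jobs. The gcc manual describes -fweb as building "webs" and giving each its own pseudo register; the sketch below shows the same split done by hand at source level. The function names are hypothetical, and whether the generated code actually differs depends on the gcc version and the other flags in use.

```c
#include <assert.h>

/* One temporary, t, carries two independent values one after the other. */
int scale_and_offset_reused(int a, int b)
{
    int t;
    int result;
    t = a * 3;          /* first live range of t */
    result = t + 1;
    t = b * 5;          /* second, unrelated live range of t */
    result += t - 2;
    return result;
}

/* The same computation with each live range given its own name, which is
 * roughly the view of the code that splitting into webs gives the
 * register allocator. */
int scale_and_offset_split(int a, int b)
{
    const int scaled_a = a * 3;
    const int scaled_b = b * 5;
    return (scaled_a + 1) + (scaled_b - 2);
}

/* Tiny self-check that the two forms agree on a few inputs. */
void check_equivalence(void)
{
    for (int a = -3; a <= 3; a++)
        for (int b = -3; b <= 3; b++)
            assert(scale_and_offset_reused(a, b) == scale_and_offset_split(a, b));
}
```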
79
80 All the flags I've not mentioned specifically are designed to further the
81 three common goals mentioned above: making as efficient a use as possible
82 of the speed of (1) registers and (2) cache memory, and (3) letting gcc
83 optimize over as wide a scope as possible (whole units with
84 -funit-at-a-time, or even multiple units with -combine). Of course, see the
85 gcc manpage for additional details.
86
87 As I said, with the above, there's a /dramatic/ improvement in
88 performance between gcc-3.x and gcc-4.1.x.
89
90 --
91 Duncan - List replies preferred. No HTML msgs.
92 "Every nonfree program has a lord, a master --
93 and if you use the program, he is your master." Richard Stallman
94
95 --
96 gentoo-amd64@g.o mailing list
