Gentoo Archives: gentoo-amd64

From:	Duncan <1i5t5.duncan@×××.net>
To:	gentoo-amd64@l.g.o
Subject:	[gentoo-amd64] Re: gcc 4.1 upgrade - bad desktop interactivity anyone?
Date:	Thu, 14 Sep 2006 20:10:46
Message-Id:	`eeccsd$99s$2@sea.gmane.org`
In Reply to:	[gentoo-amd64] gcc 4.1 upgrade - bad desktop interactivity anyone? by Mark Knecht

1	"Mark Knecht" <markknecht@×××××.com> posted
2	5bdc1c8b0609140715t2c135c9q50bf7ff8cbf1d82e@××××××××××.com, excerpted
3	below, on Thu, 14 Sep 2006 07:15:42 -0700:
4
5	> I'm just curious whether anyone besides me is noticing their machine
6	> feeling somewhat sluggish since doing the gcc-4.1 upgrade? Mine seems to
7	> be using a lot of memory. Alt-tabbing between windows seems slow.
8	> Ethernet traffic in my browser is causing pretty noticeable
9	> interruptions in things like MythTV.
10
11	> The machine is still quite usable, but it doesn't feel as snappy as it
12	> did last week.
13	>
14	> I made no changes in /etc/make.conf for the upgrade. Everything is
15	> pretty basic as far as I can tell:
16	>
17	> CFLAGS="-march=k8 -O2 -pipe"
18
19	> CXXFLAGS="${CFLAGS}"
20
21	I've noticed rather the opposite, here. gcc-4.1.1 compiled binaries are
22	/dramatically/ faster and more efficient than 3.x. However, I'm using a
23	rather more elaborate CFLAGS/CXXFLAGS, and it's my conviction that gcc-4.1
24	does better at optimizing exactly the way you've told it to. That is, if
25	you've given it inefficient optimizations, I'm convinced it makes a bad
26	thing worse, while if you've chosen your optimizations well, it makes a
27	good thing dramatically better.
28
29	Here's my CFLAGS/CXXFLAGS:
30
31	CFLAGS="-march=k8 -Os -pipe -frename-registers -fweb -freorder-blocks
32	-freorder-blocks-and-partition -combine -funit-at-a-time -ftree-pre
33	-fgcse-sm -fgcse-las -fgcse-after-reload -fmerge-all-constants"
34
35	CXXFLAGS="-march=k8 -Os -pipe -frename-registers -fweb -freorder-blocks
36	-funit-at-a-time -ftree-pre -fgcse-sm -fgcse-las -fgcse-after-reload
37	-fmerge-all-constants"
38
39	The general strategy here is to take advantage of size optimization -- on
40	modern CPUs, L1 and L2 cache are FAR FAR faster than main memory, and
41	raw CPU cycles run circles around even cache speeds. Thus, optimizing
42	for CPU speed at the expense of size makes little sense, because all those
43	saved cycles and more are likely to be spent waiting for memory to return
44	code that /would/ have fit in the cache were it size optimized.
45
46	Thus, for example, where traditional optimizations unroll loops into
47	flat code where possible, to avoid the expense of the jump back to the top
48	of the loop, that spreads out the loop to several times its original code
49	size, thus taking far more room in fast cache and forcing the CPU to wait
50	far more often for code to be fetched from main memory. I prefer to keep
51	the loops, making the code smaller and thus allowing more of it to fit in
52	faster cache. I believe that for most code, this technique will result in
53	faster execution in the real world, despite the theoretical loss of a CPU
54	cycle here or there due to jumping back to the top of the loop.
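
To make the size trade-off concrete, here is a minimal sketch of the same summation written as a plain loop and as a loop unrolled by four. The function names are invented for illustration, and the unrolling is done by hand only to show the shape of the code an unrolling optimizer would tend to produce; it is not output from gcc.

```c
#include <stddef.h>

/* Rolled loop: one add and one compare-and-branch per element.
 * The loop body is small, so it occupies little instruction cache. */
long sum_rolled(const int *a, size_t n)
{
    long total = 0;
    size_t i;

    for (i = 0; i < n; i++)
        total += a[i];
    return total;
}

/* Hand-unrolled by four: one compare-and-branch per four elements,
 * but the main body is about four times larger, and a second loop
 * is needed to handle the leftover elements. */
long sum_unrolled4(const int *a, size_t n)
{
    long total = 0;
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        total += a[i];
        total += a[i + 1];
        total += a[i + 2];
        total += a[i + 3];
    }
    for (; i < n; i++)      /* tail: fewer than four elements remain */
        total += a[i];
    return total;
}
```

Both functions return the same result for any input. The only difference is how much code has to be fetched to compute it, which is the cost being weighed above.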
55
56	The -freorder-blocks-and-partition, OTOH, can make code slightly larger,
57	but the effect is the same as the above, increasing execution speed. What
58	this optimization does is separate code that is used often from that which
59	is seldom used, so the "hot" code is smaller and fits better in high speed
60	cache, while the "cold" code ends up in slower main memory most of the
61	time. While a lower percentage of the code may be in cache due to the
62	larger size, cache will be used far more effectively, as more "hot" code
63	will be retained therein, with the cold code that's not used so often
64	allowed to drop out of cache into main memory. This particular
65	optimization doesn't work well with C++, however, so it's in my CFLAGS but
66	not my CXXFLAGS.
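
The hot/cold idea can also be expressed by hand in source. The following is a rough sketch, assuming GCC's `__builtin_expect` builtin and `noinline` function attribute. The function names and the error-code convention are made up for illustration, and this shows manual hinting only, not what -freorder-blocks-and-partition itself does internally.

```c
#include <stddef.h>

/* GCC builtin: tell the compiler this condition is rarely true. */
#define UNLIKELY(x) __builtin_expect(!!(x), 0)

/* Rarely executed error path. Marking it noinline keeps its body
 * out of the caller, so the caller's hot loop stays compact. */
__attribute__((noinline))
int report_bad_byte(size_t index)
{
    /* Hypothetical convention: encode the position as a negative code. */
    return -(int)index - 1;
}

/* Returns len if no byte is zero, or a negative code identifying
 * the first zero byte. */
int count_nonzero_bytes(const unsigned char *buf, size_t len)
{
    size_t i;

    for (i = 0; i < len; i++) {
        if (UNLIKELY(buf[i] == 0))
            return report_bad_byte(i);  /* cold path */
    }
    return (int)len;                    /* hot path falls through */
}
```

The common case runs straight through the loop, while the seldom-used handler lives elsewhere, which is the same separation described above, applied manually to one function.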
67
68	Likewise with -combine, which allows the compiler to optimize across
69	multiple source files at a time. It's only implemented for C at this time
70	(according to the gcc manpage), so it's in my CFLAGS but omitted from my
71	CXXFLAGS.
72
73	The other strategy here is to make as full a use of the extra registers
74	available to amd64 in 64-bit mode (as opposed to 32-bit x86 mode) as
75	possible. Registers operate at the speed of the CPU, no wait at all, as
76	there is for even L1 cache, so it pays to use them as efficiently as
77	possible. Several of the flags (-frename-registers of course, -fweb, etc)
78	in my CFLAGS are therefore designed to encourage gcc to do this.
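
As a loose source-level picture of what this kind of flag is after (going by the gcc manual's description of -fweb as giving each "web" its own pseudo register), consider one temporary reused for two unrelated values versus the same computation with each use given its own name. The function names are invented for illustration, and the real transformation happens on gcc's internal representation, not on C source.

```c
/* One variable 't' carries two unrelated values in turn. */
int reuse_one_temp(int a, int b)
{
    int t, r;

    t = a * 2;      /* first live range of t */
    r = t + 1;
    t = b * 3;      /* second, unrelated live range of t */
    return r + t;
}

/* Same computation with each live range named separately, which is
 * roughly the view the register allocator gets once the two uses
 * are no longer tied to one variable. */
int split_temps(int a, int b)
{
    int t1 = a * 2;
    int r  = t1 + 1;
    int t2 = b * 3;

    return r + t2;
}
```

With the two values untied, the allocator is free to place them in different registers, and the 16 general-purpose registers of 64-bit mode give it more room to do so than the 8 of 32-bit x86.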
79
80	All the flags I've not mentioned specifically are designed to further the
81	three common goals mentioned above, making as efficient a use as possible
82	of the speed of (1) registers and (2) cache memory, by allowing gcc to
83	optimize over as wide a scope (3, whole units with unit-at-a-time, or
84	even multiple units with -combine) as possible. Of course, see the gcc
85	manpage for additional details.
86
87	As I said, with the above, there's a /dramatic/ improvement in
88	performance between gcc-3.x and gcc-4.1.x.
89
90	--
91	Duncan - List replies preferred. No HTML msgs.
92	"Every nonfree program has a lord, a master --
93	and if you use the program, he is your master." Richard Stallman
94
95	--
96	gentoo-amd64@g.o mailing list

Replies

Subject	Author
Re: [gentoo-amd64] Re: gcc 4.1 upgrade - bad desktop interactivity anyone?	Richard Freeman <rich@××××××××××××××.net>
Re: [gentoo-amd64] Re: gcc 4.1 upgrade - bad desktop interactivity anyone?	Mark Knecht <markknecht@×××××.com>
Re: [gentoo-amd64] Re: gcc 4.1 upgrade - bad desktop interactivity anyone?	Peter Humphrey <prh@××××××××××.uk>
