Gentoo Archives: gentoo-amd64

From:	"Sami Näätänen" <sn.ml@××××××××××××.fi>
To:	gentoo-amd64@l.g.o
Subject:	Re: [gentoo-amd64] CFLAGS question from a AMD64 newbie
Date:	Tue, 09 Dec 2008 19:59:17
Message-Id:	`200812092159.13048.sn.ml@keijukammari.fi`
In Reply to:	Re: [gentoo-amd64] CFLAGS question from a AMD64 newbie by Volker Armin Hemmann

On Tuesday 09 December 2008 15:28:21 Volker Armin Hemmann wrote:
> On Tuesday 09 December 2008, Sami Näätänen wrote:
> > So hi from an amd64 newbie. Not so newbie with Gentoo though. :)

Sorry for not giving more details. I'm not a newbie with Gentoo, just on the
amd64 side of things, i.e. I have no experience of the bugs or of how things
break in the tree when using one or the other, etc.

So here it is again with a little more detail.

I have been around Gentoo since before the 1.4 days, i.e. long before the
yearly tagged releases/profiles. I have used paludis since somewhere around
0.2x; I can't remember exactly which version it was. A breakage now and then
in the build stage is nothing new to me. Stability in my eyes means stability
of the binaries on my system, not so much of the builds themselves.

> > My system is an Intel quad core core2 with a 2.4 GHz clock speed coupled
> > with 4GB of memory. No overclocking etc. Want this to be stable. :)
> >
> > I'm just curious what people use as their stable CFLAGS in amd64 Gentoo?
> > (Sorry if this has been up lately, but I just switched to 64bit env
> > so...)
> >
> >
> > Here is mine and some explanation of why (And I use ~arch system with gcc
> > 4.3)
> >
> > The flags are in the order they are used in my CFLAGS and CXXFLAGS.
> >
> > Gives stable base
> > -O2
>
> yes
>
> > Want to optimize for my system, but don't want "native"
> > -march=core2
>
> ok
>
> > If some ebuilds filter march this will still cache optimize etc for my
> > system
> > -mtune=core2
>
> I would scrap that.
>
> > Faster floating point math and better chance of vectorization
> > -mfpmath=sse
>
> superfluous. On amd64, sse is used by default.

So it's set even if the arch filter drops the arch to the lowest amd64 arch.
I wasn't sure, so I stuck it in, as I want to be sure there is no x87 FPU
code around making life harder.

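A quick way to see which floating point unit a given set of flags ends up with is to ask the compiler itself. The following is only a minimal sketch: the helper names are made up for illustration, while `__SSE2_MATH__` is the macro GCC predefines when SSE2 is used for floating point math, and `FLT_EVAL_METHOD` is the standard C99 macro from `<float.h>`.

```c
#include <assert.h>
#include <float.h>

/* Hypothetical helper: 1 if this translation unit was compiled with
 * floating point math done in SSE registers, 0 otherwise.
 * GCC predefines __SSE2_MATH__ when -mfpmath=sse is in effect and
 * SSE2 is available. */
int fp_math_uses_sse(void)
{
#ifdef __SSE2_MATH__
    return 1;
#else
    return 0;
#endif
}

/* Hypothetical helper: the C99 FLT_EVAL_METHOD value.
 * 0 means expressions are evaluated in their declared type,
 * 2 means they are evaluated in long double precision, which is
 * what x87 code typically reports. */
int fp_eval_method(void)
{
    return (int)FLT_EVAL_METHOD;
}

/* Hypothetical start-up guard for programs that rely on SSE math. */
void require_sse_math(void)
{
    assert(fp_math_uses_sse() && "rebuild with -mfpmath=sse");
}
```

Compiling this once with the full CFLAGS and once with only the flags that survive filtering, and printing both values from a small main(), should show whether -mfpmath=sse changes anything on a given setup.
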
> > These because the march might get filtered
> > -mmmx -msse -msse2 -msse3 -mssse3
>
> if march gets filtered, these might be one of the reasons, I would remove
> them.

From my experience, all the bugs that needed arch filtering had something
wrong in the generic optimizations that are enabled only with certain -Ox and
-march combinations, not in the use of the instruction sets. (A couple of
beta gcc's excluded, but I'm not touching those anymore.)

So I could scrap the older ones, as march will already cover those, except
for -msse3, which allows the compiler to use more SIMD instructions in loop
vectorization.

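Which instruction sets are really enabled for a given flag combination can be checked from inside a C file, since GCC predefines one macro per enabled extension. This is a minimal sketch under that assumption: the `ISA_*` names and both functions are invented for the example, only the `__MMX__` and `__SSE*__` macros come from the compiler.

```c
#include <assert.h>

/* Bit names are made up for this sketch. */
enum {
    ISA_MMX   = 1 << 0,
    ISA_SSE   = 1 << 1,
    ISA_SSE2  = 1 << 2,
    ISA_SSE3  = 1 << 3,
    ISA_SSSE3 = 1 << 4,
    ISA_ALL   = 0x1f
};

/* Collect the instruction sets the compiler was allowed to use for
 * this translation unit into one bitmask. */
unsigned isa_mask(void)
{
    unsigned mask = 0;
#ifdef __MMX__
    mask |= ISA_MMX;
#endif
#ifdef __SSE__
    mask |= ISA_SSE;
#endif
#ifdef __SSE2__
    mask |= ISA_SSE2;
#endif
#ifdef __SSE3__
    mask |= ISA_SSE3;
#endif
#ifdef __SSSE3__
    mask |= ISA_SSSE3;
#endif
    return mask;
}

/* 1 if every bit in 'bits' is set in 'mask'. */
int isa_has(unsigned mask, unsigned bits)
{
    assert((bits & ~(unsigned)ISA_ALL) == 0); /* only known bits */
    return (mask & bits) == bits;
}
```

Building it with and without -march=core2, and with and without the explicit -m flags, shows what each variant leaves enabled after filtering.
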
> > For loop vectorization
> > -ftree-vectorize
>
> scrap that.

Why?
I read that there have been problems with it earlier, but in my experience
those were on the 32bit arch, and on this system there have been none
whatsoever. And for isolated packages I can always easily disable it, being
a paludis user. By the way, most of those tree-vectorizer problems come from
the other optimizations that run before the tree-vectorizer, like loop
peeling, loop unrolling etc.

> > -pipe
>
> once upon a time I used these flags:
>
> #CFLAGS="-march=k8 -O2 -pipe -fweb -ftracer -fpeel-loops -msse3"
> and even
> #CFLAGS="-march=k8 -O2 -fweb -ftracer -fpeel-loops -ftree-vectorize
> -frename-registers -floop-optimize2 -msse3 -pipe"
>
> to hunt down a java bug, I recompiled the whole system with:
>
> CFLAGS="-march=k8 -O2 -msse3 -pipe"
>
> and surprise - it was as fast as before - and compiling was faster too!

Was this a 64bit system?
I wouldn't use the tree-vectorizer on a 32bit system, as the alignment issues
are a serious problem until gcc gets proper stack alignment handling.

I wouldn't touch the other flags you used, but I also know what code
reductions regular code can get from the loop vectorizer. Although to get the
best out of vectorization one really has to write compact, loopy and maybe
odd-looking code. The vectorizer also still needs a lot of improvement, as
can be seen from the code generated for the joo2 function in my example.

For example:
float a[4];
float b[4];

/* hand-unrolled: four separate multiplies */
float
joo(void) {
    a[0] = b[0]*b[0];
    a[1] = b[1]*b[1];
    a[2] = b[2]*b[2];
    a[3] = b[3]*b[3];
    return a[0]+a[1]+a[2]+a[3];
}

/* the same work written as a loop with a constant trip count */
float
joo2(void) {
    int i;
    for( i=0; i<4; i++)
        a[i] = b[i]*b[i];
    return a[0]+a[1]+a[2]+a[3];
}

joo() will be slower with CFLAGS="-O2 -march=core2 -ftree-vectorize" than
joo2(), because the tree vectorizer can vectorize the constant loop away.
Copy the code to a C source file like joo.c and execute:
gcc -O2 -march=core2 -ftree-vectorize -S joo.c && less joo.s

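Before comparing the generated assembly it is worth confirming that both versions really compute the same thing. The following is a minimal sketch that repeats the two functions from the example and adds two helpers, `set_b` and `joo_versions_agree`, which are made up here and are not part of the original code.

```c
#include <assert.h>
#include <stddef.h>

float a[4];
float b[4];

/* hand-unrolled version from the example */
float joo(void)
{
    a[0] = b[0] * b[0];
    a[1] = b[1] * b[1];
    a[2] = b[2] * b[2];
    a[3] = b[3] * b[3];
    return a[0] + a[1] + a[2] + a[3];
}

/* loop version from the example */
float joo2(void)
{
    int i;
    for (i = 0; i < 4; i++)
        a[i] = b[i] * b[i];
    return a[0] + a[1] + a[2] + a[3];
}

/* Hypothetical helper: copy four input values into b. */
void set_b(const float v[4])
{
    int i;
    assert(v != NULL);
    for (i = 0; i < 4; i++)
        b[i] = v[i];
}

/* Hypothetical helper: 1 if both versions return the same sum for the
 * current contents of b. Both add the squares in the same order, so
 * small integer inputs give an exact match. */
int joo_versions_agree(void)
{
    return joo() == joo2();
}
```

With b set to 1, 2, 3, 4 both functions return 30. This only checks the results; any speed difference still has to be read from joo.s or measured.
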
PS. For those who are interested: there are many cases of vectorizable loops
that can't be vectorized because gcc lacks proper stack alignment for
parameters, which is the reason I wrote the example the way I did. :)

On 32bit systems gcc can't provide nearly as many optimizations as on 64bit
systems, because of the alignment issue. The tree-vectorizer generates a lot
of those two-version vectorizations when it needs to determine the memory
alignment at runtime. That's why I took a closer look at the vectorizations.
There were really few of those two-version vectorizations when I compiled my
"system".
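The difference between known and unknown alignment can be sketched in C as follows. This is an illustration under assumptions, not a statement of what any particular gcc version emits: `is_aligned16`, `square_all`, `square_bufs` and the two buffers are names invented for the example, and `__attribute__((aligned(16)))` is a GCC extension.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 1 if p sits on a 16-byte boundary, which is what aligned SSE loads
 * and stores need. */
int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* Global buffers: alignment and size are visible at compile time. */
float src_buf[8] __attribute__((aligned(16)));
float dst_buf[8] __attribute__((aligned(16)));

/* Fixed arrays and a constant trip count, as in the joo2() example. */
void square_bufs(void)
{
    int i;
    for (i = 0; i < 8; i++)
        dst_buf[i] = src_buf[i] * src_buf[i];
}

/* Pointer parameters: inside this function the compiler cannot see how
 * dst and src are aligned (or whether they overlap), so a vectorizer
 * has to check at runtime or keep a scalar version around. */
void square_all(float *dst, const float *src, size_t n)
{
    size_t i;
    assert(dst != NULL && src != NULL);
    for (i = 0; i < n; i++)
        dst[i] = src[i] * src[i];
}
```

Calling square_all() with src_buf + 1 passes a pointer that is valid but not 16-byte aligned, which is the case the runtime check has to catch.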

Replies

Subject	Author
Re: [gentoo-amd64] CFLAGS question from a AMD64 newbie	Branko Badrljica <brankob@××××××××××.com>
Re: [gentoo-amd64] CFLAGS question from a AMD64 newbie	Volker Armin Hemmann <volker.armin.hemmann@××××××××××××.de>
