On Tuesday 09 December 2008 15:28:21 Volker Armin Hemmann wrote:
> On Tuesday 09 December 2008, Sami Näätänen wrote:
> > So hi from an amd64 newbie. Not so newbie with Gentoo though. :)

Sorry for not giving more details. I'm not a newbie with Gentoo, just on the
amd64 side of things, i.e. I have no experience of the bugs or of how things
break in the tree if using one or the other etc.

So here it is again with a little more detail.

I have been using Gentoo since before the 1.4 days, i.e. long before the yearly
tagged releases/profiles. I have used paludis since somewhere around 0.2x;
I can't remember exactly which version it was. A breakage now and then in the
build stage is nothing new to me. Stability, in my eyes, means stability of
the binaries on my system, not so much of the builds themselves.

> > My system is an Intel quad core core2 with a 2.4 GHz clock speed coupled
> > with 4GB of memory. No overclocking etc. Want this to be stable. :)
> >
> > I'm just curious what people use as their stable CFLAGS in amd64 Gentoo?
> > (Sorry if this has been up lately, but I just switched to 64bit env
> > so...)
> >
> >
> > Here is mine and some explanation of why (And I use ~arch system with gcc
> > 4.3)
> >
> > The flags are in the order they are used in my CFLAGS and CXXFLAGS.
> >
> > Gives stable base
> > -O2
>
> yes
>
> > Want to optimize for my system, but don't want "native"
> > -march=core2
>
> ok
>
> > If some ebuilds filter march this will still cache optimize etc for my
> > system -mtune=core2
>
> I would scrap that.
>
> > Faster floating point math and better chance of vectorization
> > -mfpmath=sse
>
> superfluous. March with amd64 sse is used by default.

So it's set even if the arch filter drops -march to the lowest amd64 arch.
I wasn't sure, so I stuck it in, as I want to be sure there is no x87 FPU code
around making life harder.
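One way to check this instead of guessing is to ask the compiler which unit it targets for floating point math. The following is only a sketch: the function names are made up for illustration, and it relies on the __SSE_MATH__ macro that GCC predefines when single precision math is done in SSE registers, plus the C99 FLT_EVAL_METHOD macro from <float.h>.

```c
#include <float.h>

/* Hypothetical helper: returns 1 if this file was compiled with single
 * precision math done in SSE registers (GCC predefines __SSE_MATH__ in
 * that case), 0 otherwise. */
int float_math_uses_sse(void)
{
#ifdef __SSE_MATH__
    return 1;
#else
    return 0;
#endif
}

/* Hypothetical helper: returns the C99 evaluation method. 0 means that
 * operations are evaluated in the precision of their own type, 2 means
 * everything is evaluated as long double (typical for x87 code), and
 * -1 means indeterminable or not reported by this compiler mode. */
int fp_eval_method(void)
{
#ifdef FLT_EVAL_METHOD
    return FLT_EVAL_METHOD;
#else
    return -1;
#endif
}
```

Compiling this once with the full CFLAGS and once with only -O2 should show whether dropping -mfpmath=sse changes anything on amd64. Running "gcc -dM -E - < /dev/null | grep SSE" with the flags of interest lists the same macros without writing any code.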

> > These because of the march might get filtered
> > -mmmx -msse -msse2 -msse3 -mssse3
>
> if march gets filtered, these might be one of the reasons, I would remove them.

In my experience, all the bugs that needed arch filtering came from something
wrong in the generic optimizations that are enabled only with certain -Ox and
-march combinations, not from the use of the instruction sets. (A couple of
beta gccs excluded, but I'm not touching those anymore.)

So I could scrap the older ones, as march will already cover those, except for
-msse3, which allows the compiler to use more SIMD instructions in loop
vectorization.
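To see which of these instruction sets are still enabled after an ebuild has filtered -march, a small probe along these lines could be compiled with the flags in question. It is a sketch only: the enum and function names are invented for this example, while the __MMX__, __SSE__, __SSE2__, __SSE3__ and __SSSE3__ macros are the ones GCC predefines for the corresponding -m options.

```c
/* Bit flags for the instruction set extensions from the CFLAGS above.
 * The names are made up for this sketch. */
enum isa_bit {
    ISA_MMX   = 1 << 0,
    ISA_SSE   = 1 << 1,
    ISA_SSE2  = 1 << 2,
    ISA_SSE3  = 1 << 3,
    ISA_SSSE3 = 1 << 4
};

/* Returns the set of extensions this file was compiled for, based on
 * the macros the compiler predefines for each enabled extension. */
unsigned enabled_isa(void)
{
    unsigned mask = 0;
#ifdef __MMX__
    mask |= ISA_MMX;
#endif
#ifdef __SSE__
    mask |= ISA_SSE;
#endif
#ifdef __SSE2__
    mask |= ISA_SSE2;
#endif
#ifdef __SSE3__
    mask |= ISA_SSE3;
#endif
#ifdef __SSSE3__
    mask |= ISA_SSSE3;
#endif
    return mask;
}
```

With plain -O2 on amd64 I would expect the SSE3 and SSSE3 bits to be missing, and with -march=core2 or an explicit -msse3 -mssse3 to be set, but that is worth verifying on the actual toolchain.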

> > For loop vectorization
> > -ftree-vectorize
>
> scrap that.

Why?
I have read that there were problems with it earlier, but in my experience they
were on the 32bit arch, and on this system I have had none whatsoever.
And for isolated packages I can always easily disable it, being a paludis
user. By the way, most of those tree-vectorizer problems come from the other
optimizations that run before the tree-vectorizer, like loop peeling, loop
unrolling etc.

> > -pipe
>
> once upon a time I used these flags:
>
> #CFLAGS="-march=k8 -O2 -pipe -fweb -ftracer -fpeel-loops -msse3"
> and even
> #CFLAGS="-march=k8 -O2 -fweb -ftracer -fpeel-loops -ftree-vectorize
> -frename-registers -floop-optimize2 -msse3 -pipe"
>
> to hunt down a java bug, I recompiled the whole system with:
>
> CFLAGS="-march=k8 -O2 -msse3 -pipe"
>
> and surprise - it was as fast as before - and compiling was faster too!

Was this a 64bit system?
I wouldn't use the tree-vectorizer on a 32bit system, as the alignment issues
are a serious problem until gcc gets proper stack alignment handling.

I wouldn't touch the other flags you used, but I also know what code
reductions regular code can get from the loop vectorizer. Although to get the
best out of vectorization one really has to write compact, loopy and maybe
odd looking code. There is also a need for a lot of improvement in the
vectorizer, as can be seen from the code generated for the joo2 function in
my example.

For example:
float a[4];
float b[4];

/* Hand-unrolled version: four separate multiplies. */
float
joo(void) {
    a[0] = b[0]*b[0];
    a[1] = b[1]*b[1];
    a[2] = b[2]*b[2];
    a[3] = b[3]*b[3];
    return a[0]+a[1]+a[2]+a[3];
}

/* Same computation written as a loop with a constant trip count. */
float
joo2(void) {
    int i;
    for( i=0; i<4; i++)
        a[i] = b[i]*b[i];
    return a[0]+a[1]+a[2]+a[3];
}

joo() will be slower than joo2() with CFLAGS="-O2 -march=core2
-ftree-vectorize", because the tree vectorizer can vectorize the constant
loop away. Copy the code to a C source file like joo.c and execute:
gcc -O2 -march=core2 -ftree-vectorize -S joo.c && less joo.s
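For anyone who would rather measure than read assembly, a rough timing harness could look like the sketch below. It repeats the two functions so the file stands on its own; time_calls() is a made-up helper, not something from gcc or Gentoo, and clock() based timing is crude, so treat the numbers as a hint only.

```c
#include <time.h>

float a[4];
float b[4];

/* Hand-unrolled version from the example above. */
float joo(void)
{
    a[0] = b[0]*b[0];
    a[1] = b[1]*b[1];
    a[2] = b[2]*b[2];
    a[3] = b[3]*b[3];
    return a[0]+a[1]+a[2]+a[3];
}

/* Loop version from the example above. */
float joo2(void)
{
    int i;
    for (i = 0; i < 4; i++)
        a[i] = b[i]*b[i];
    return a[0]+a[1]+a[2]+a[3];
}

/* Hypothetical helper: calls fn() `iterations` times and returns the
 * CPU time used, in seconds. b[0] changes on every pass and the results
 * are accumulated in a volatile variable so that the compiler cannot
 * drop the calls or hoist them out of the loop. */
double time_calls(float (*fn)(void), long iterations)
{
    volatile float sink = 0.0f;
    clock_t start, end;
    long i;

    start = clock();
    for (i = 0; i < iterations; i++) {
        b[0] = (float)(i & 3);
        sink = sink + fn();
    }
    end = clock();
    (void)sink;
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Calling time_calls(joo, 100000000L) and time_calls(joo2, 100000000L) from a small main() and printing both values, once with and once without -ftree-vectorize, should show whether the difference is visible on a given gcc version. The iteration count is an arbitrary choice.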

PS. For those who are interested: there are many vectorizable loops that
can't be vectorized because gcc lacks proper parameter stack alignment.
That is the reason I wrote the example the way I did. :)
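To illustrate the contrast, here is a sketch of the same computation on caller-supplied buffers instead of global arrays. joo_ptr() and is_aligned16() are hypothetical names for this example. The assumption is that with globals the compiler decides the alignment itself, while with pointers it knows nothing about alignment at compile time, so it either has to check at runtime or use unaligned accesses.

```c
#include <stdint.h>

/* Hypothetical helper: returns 1 if p sits on a 16 byte boundary, which
 * is what aligned SSE loads and stores require, 0 otherwise. */
int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* Same computation as joo2(), but on buffers passed in by the caller.
 * The compiler cannot know at compile time how dst and src are aligned,
 * or whether they overlap. */
float joo_ptr(float *dst, const float *src)
{
    int i;
    for (i = 0; i < 4; i++)
        dst[i] = src[i]*src[i];
    return dst[0]+dst[1]+dst[2]+dst[3];
}
```

Comparing the -S output of joo_ptr() with that of joo2() should show what the vectorizer does differently when it cannot assume alignment.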

On 32bit, it can't provide nearly as many optimizations as on 64bit systems,
because of the alignment issue. The tree-vectorizer produces a lot of those
two-version vectorizations when it needs to determine the memory alignment at
runtime. That's why I take a closer look at the vectorizations. There were
really few of those two-version vectorizations when I compiled my "system"