Duncan wrote:

>>Nice. Now let us know your CFLAGS, and what toolchain versions you're
>>running :D
>
>
> You probably didn't notice, as I had it commented out on the main index
> page as I don't have the page created to actually list them yet, but if
> you viewed source, you'd have seen I have a techspecs page link commented
> out, that'll get that sort of info, when/if I actually get it created.
>
> However, since you asked, your answer, and a bit more, by way of
> explanation...
>
> I should really create a page listing all the little Gentoo admin scripts
> I've come up with and how I use them. I'm sure a few folks anyway would
> likely find them useful.
>
> The idea behind most of them is to create shortcuts that save typing in
> long emerge lines, with all sorts of arbitrary command line parameters.
> The majority of these fall into two categories, ea* and ep*, short for
> emerge --ask <additional parameters> and emerge --pretend ... . Thus, I
> have epworld and eaworld, the pretend and ask versions of emerge -NuDv
> world, epsys and easys, the same for system, eplog <package>, emerge
> --pretend --log --verbose (the package name to be added to the command
> line, so eplog gcc, for instance, to see the changes between my current
> and the new version of gcc), eptree <package>, to use the tree output,
> etc.
Interesting. But why do you use scripts and not simple aliases? Every time you
launch your script the HD performs a seek (which is very expensive in time),
copies the script into memory and then forks a whole bash process to execute a
one-liner. Using an alias, which is a bash built-in, wouldn't fork a process
and would therefore be much faster.

(see man alias for examples)
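For illustration, hypothetical ~/.bashrc aliases mirroring the ep*/ea* names from the post — the long-option spellings of -NuDv are my own expansion, not the author's actual definitions:

```shell
# Alias versions of the ep*/ea* scriptlets: expanded in-shell, so no disk
# seek and no forked bash process per invocation.
alias epworld='emerge --pretend --newuse --update --deep --verbose world'
alias eaworld='emerge --ask --newuse --update --deep --verbose world'
alias epsys='emerge --pretend --newuse --update --deep --verbose system'
alias easys='emerge --ask --newuse --update --deep --verbose system'
```

The tradeoff: aliases only exist in interactive shells that source them, while scripts in PATH also work from cron jobs and other programs.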
> One thing I've found is that I'll often epworld or eptreeworld, then
> emerge the individual packages, rather than use eaworld to do it. That
> way, I can do them in the order I want or do several at a time if I want
> to make use of both CPUs. Because I always use --deep, as I want to keep
> my dependencies updated as well, I'm very often merging specific
> dependencies. There's a small problem with that, however: --oneshot, which
> I'll always want to use with dependencies to help keep my world file
> uncluttered, has no short form, but I use it as the default! OTOH, the

man emerge:

    --oneshot (-1)

IIRC --oneshot has had a short form since 2.0.52 was released.
> normal portage mode of adding stuff listed on the command line to the
> world file, I don't want very often, as most of the time I'm simply
> updating what I have, so it's all in the world file if it needs to be
> there already anyway. Not a problem! All my regular ea* scriptlets use
> --oneshot, so it /is/ my default. If I *AM* merging something new that I
> want added to my world file, I have another family of ea* scriptlets that
> do that -- all ending in "2", as in, "NOT --oneshot". Thus, I have a
> family of ea*2 scriptlets.
>
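A minimal sketch of what such a pair might look like, written as shell functions for brevity — the names are from the post, but the exact option strings are my assumption:

```shell
# eaworld: --oneshot by default, so merged dependencies stay OUT of the
# world file. eaworld2 is the "NOT --oneshot" variant, for new packages
# that SHOULD be recorded in world.
eaworld()  { emerge --ask --oneshot --newuse --update --deep --verbose world; }
eaworld2() { emerge --ask --newuse --update --deep --verbose world; }
```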
> The regulars here already know one of my favorite portage features is
> FEATURES=buildpkg, which I have set in make.conf. That of course gives me
> a collection of binary versions of packages I've already emerged, so I
> can quickly revert to an old version for testing something, if I want,
> then remerge the new version once I've tested the old version to see if it
> has the same bug I'm working on or not. To aid in this, I have a
> collection of eppak and eapak scriptlets. Again, the portage default of
> --usepkg (-k) doesn't fit my default needs, as if I'm using a binpkg,
> I usually want to ONLY use a binpkg, NOT merge from source if the package
> isn't available. That happens to be -K (--usepkgonly) in short form.
> However, it's my default, so eapak invokes the -K version. I therefore
> have eapaK to invoke the -k version if I don't really care whether it
> goes from binpkg or source.
>
> Of course, there are various permutations of the above as well, so I have
> eapak2 and eapaK2, as well as eapak and eapaK. For the ep* versions, of
> course the --oneshot doesn't make a difference, so I only have eppak and
> eppaK, no eppa?2 scriptlets.
>
> ... Deep breath... <g>
>
> All that as a preliminary explanation to this: Along with the above, I
> have a set of efetch functions, that invoke the -f form, so just do the
> fetch, not the actual compile and merge, and esyn (there's already an
> esync function in something or other I have merged, so I just call it
> esyn), which does emerge sync, then updates the esearch db, then
> automatically fetches all the packages that an eaworld would want to
> update, so they are ready for me to merge at my leisure.
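Reconstructed from that description (not the author's actual code), esyn might be roughly:

```shell
# esyn: sync the tree, refresh the esearch database (eupdatedb comes from
# app-portage/esearch), then pre-fetch everything an eaworld would merge
# (-f is --fetchonly), so it's ready to merge at leisure.
esyn() {
    emerge --sync &&
    eupdatedb &&
    emerge --fetchonly --newuse --update --deep --verbose world
}
```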
I'm a bit confused now. You use *functions* to do that? Or do you mean scripts?
By the way: with an alias you could name your custom "script" esync, because
an alias doesn't place a file on the hard disk.
> Likewise, and the real reason for this whole explanation, I /had/ an
> "einfo" scriptlet that simply ran "emerge info". This can be very handy
> to run if, like me, you have several slotted versions of gcc merged, and
> you sometimes forget which one you have eselected or gcc-configed as the
> one portage will use. Likewise, it's useful for checking on CFLAGS (or
> CXXFLAGS or LDFLAGS or...), if you modified them from the normal ones
> because a particular package wasn't cooperating, and you want to see if
> you remembered to switch them back or not.
>
> However, I ran into a problem. The output of einfo was too long to
> quickly find the most useful info -- the stuff I most often change and
> therefore most often am looking for.
>
> No sweat! I shortened my original "einfo" to simply "ei", and added a
> second script, "eis" (for einfo short), that simply piped the output of
> the usual emerge info into a grep that only returned the lines I most
> often need -- the big title one with gcc and similar info, CFLAGS,
> CXXFLAGS, LDFLAGS, and FEATURES. USE would also be useful, but it's too
> long even by itself to be searched at a glance, so if I want it, I simply
> run ei and look for what I want in the longer output.
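The described pipeline might look like this — a sketch; the grep pattern is my guess at matching exactly the lines listed above:

```shell
# eis: "einfo short" -- keep only the header line and the most-changed
# variables from emerge info; stderr is dropped, as explained later in
# the post.
eis() {
    emerge --info 2>/dev/null |
        grep -E '^(Portage |CFLAGS=|CXXFLAGS=|LDFLAGS=|FEATURES=)'
}
```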
Impressive.

> ... Another deep breath... <g>
>
> OK, with that as a preliminary, you should be able to understand the
> following:
>
> $ eis
>
> Portage 2.1_pre4-r1 (default-linux/amd64/2006.0, gcc-4.1.0-beta20060127,
> glibc-2.3.6-r2, 2.6.15 x86_64)
>
> CFLAGS="-march=k8 -Os -pipe -fomit-frame-pointer -frename-registers
> -funit-at-a-time -fweb -freorder-blocks-and-partition
> -fmerge-all-constants"
>
> CXXFLAGS="-march=k8 -Os -pipe -fomit-frame-pointer -frename-registers
> -funit-at-a-time -fweb -freorder-blocks-and-partition
> -fmerge-all-constants"
>
> FEATURES="autoconfig buildpkg candy ccache confcache distlocks
> multilib-strict parallel-fetch sandbox sfperms strict userfetch"
>
> LDFLAGS="-Wl,-z,now"
>
> MAKEOPTS="-j4"
>
> To make sense of that...
>
> * The portage and glibc versions are ~amd64, as set in make.conf for the
>   system in general.
>
> * CFLAGS:
>
> I choose -Os, optimize for size, because a modern CPU and the various
> cache levels are FAR faster than main memory. This difference is
> frequently severe enough that it's actually more efficient to optimize for
> size than for CPU performance, because the result is smaller code that
> maintains cache locality (stays in fast cache) far better, and the CPU
> saves more time, which it would otherwise spend idle waiting for data to
> come in from slower, more distant memory, than it loses to the cycle
> inefficiency that's often the tradeoff for small code.
Given the fact that two CPUs differing only in L2 cache size have nearly the
same performance, I doubt that the performance increase is very big. Some
interesting figures:

An Athlon64 (forgot which model, but it shouldn't matter anyway) with 1 MB
L2 cache is 4% faster than an Athlon64 of the same frequency but with only
512 kB L2 cache. The bigger the cache sizes you compare get, the smaller the
performance increase. Since you run a dual Opteron system with 1 MB L2 cache
per CPU, I tend to say that the actual performance increase you experience
is about 3%. But then I didn't take into account that -Os leaves out a few
optimizations which would be included by -O2, the default optimization
level, which actually makes the code a bit slower when compared to -O2. So,
the performance increase you really experience shrinks to about 0-2%. I'd
tend to proclaim that -O2 is even faster for most of the code, but that's
only my feeling.

Besides that, I should mention that -Os sometimes still has problems with
huge packages like glibc.
> Back when memory operated at or near the speed of the CPU, avoiding the
> loop, even at the expense of three times the code, was often faster.
> Today, when CPUs do several calculations in the time it takes to fetch
> data from main memory, it's generally faster to go for the smaller code,
> as it will be far more likely to still be in fast cache, avoiding that
> long wait for main memory, even if it /does/ mean wasting a couple
> additional cycles doing the expensive jump back to the top of the loop.
Not only have CPUs gotten faster, caches have also gotten bigger. Comparing
my old P4 at 1.7 GHz with 256 kB L2 cache to a P4 at 3.4 GHz (frequency
doubled) which has 1 MB L2 cache (cache quadrupled) shows that the
proportions changed. A bigger cache of course means that you can keep larger
chunks of code there, so unrolling loops with fixed iterations actually
might perform better.
> Of course, this is theory, and the practical case can and will differ
> depending on the instructions actually being compiled. In particular,
> streaming media apps and media encoding/decoding are likely to still
> benefit from the traditional loop elimination style optimizations, because
> they run thru so much data already, that cache is routinely trashed
> anyway, regardless of the size of your instructions. As well, that type
> of application tends to have a LOT of looping instructions to optimize!
>
> By contrast, something like the kernel will benefit more than usual from
> size optimization. First, it's always memory locked and as such
> can't be swapped, and even "slow" main memory is still **MANY** **MANY**
> times faster than swap, so a smaller kernel means more other stuff fits
> into main memory with it, and isn't swapped as much. Second, parts of the
Funny to hear this from somebody with 4 GB RAM in his system. I don't know
how bloated your kernel is, but even if -Os reduced the size of my kernel
by **half**, which is totally impossible, the memory saved wouldn't even
hold the mail I am answering right now. So, basically, this reasoning is
just ridiculous.
> kernel such as task scheduling are executed VERY often, either because
> they are frequently executed by most processes, or because they /control/
> those processes. The smaller these are, the more likely they are to still
> be in cache when next used. Likewise, the smaller they are, the less
> potentially still useful other data gets flushed out of cache to make room
> for the kernel code executing at the moment. Third, while there's a lot
> of kernel code that will loop, and a lot that's essentially streaming, the
> kernel as a whole is a pretty good mix of code and thus won't benefit as
> much from loop optimizations and the like, as compared to special purpose
> code like the media codec and streaming applications above.
>
> The differences are marked enough and now demonstrated enough that a
> kernel config option to optimize for size was added, I believe, about a
> year ago. Evidently, that led to even MORE demonstration, as the option
> was originally in the obscure embedded optimizations corner of the config,
> where few would notice or use it, and they upgraded it into a main option.
> In fact, where a year or two ago the option didn't even exist, now I
> believe it defaults to yes/on/do-optimize-for-size (altho it's possible
> I'm incorrect on the last and it's not yet the default).
It is not. The option you are talking about is called
CONFIG_CC_OPTIMIZE_FOR_SIZE and is not defined anywhere by default, so the
'ifdef CONFIG_CC_OPTIMIZE_FOR_SIZE' in the kernel Makefile evaluates to no,
and -O2 remains the default.
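For reference, the relevant fragment of the 2.6-era top-level kernel Makefile looks roughly like this (paraphrased from memory, not an exact quote):

```make
ifdef CONFIG_CC_OPTIMIZE_FOR_SIZE
CFLAGS += -Os
else
CFLAGS += -O2
endif
```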
> According to the gcc manpage, -frename-registers causes gcc to attempt to
> make use of registers left over after normal register allocation. This is
> particularly beneficial on archs that have many registers (keeping in
> mind that "registers" are what amounts to L0 cache, the fastest possible
> memory, because the CPU accesses registers directly and they operate at
> full CPU speed). Unfortunately, registers are also very limited, making
> them an EXCEEDINGLY valuable resource! Note that while x86-32 is noted
> for its relative /lack/ of registers, AMD basically doubled the number of
> registers available to 64-bit code in its x86-64 aka AMD64 spec. Thus,
> while this option wouldn't be of particular benefit on x86, on amd64 it
> can, depending on the code of course, provide some rather serious
> optimization!
>
> -fweb is a register use optimizer function as well. It tells gcc to
> create a /web/ of dependencies and assign each individual dependency web
> to its own pseudo-register. Thus, when it comes time for gcc to allocate
> registers, it already has a list of the best candidates lined up and ready
> to go. Combined with -frename-registers to tell gcc to efficiently make
> use of any registers left over after the first pass, and due to the
> number of registers available in 64-bit mode on our arch, this can allow
> some seriously powerful optimizations. Still, a couple of things to note
> about it. One, -fweb (and -frename-registers as well) can cause data to
> move out of its "home" register, which seriously complicates debugging, if
> you are a programmer or power-user enough to worry about such things.
> Two, the rewrite for gcc 4.0 significantly modified the functionality of
> -fweb, and it wasn't recommended for 4.0 as it didn't yet work as well as
> expected or as it did with gcc 3.x. For gcc 4.1, -fweb is apparently back
> to its traditional strength. Those Gentoo users having gcc 3.4, 4.0, and
> 4.1, all three in separate slots, will want to note this as they change
> gcc-configurations, and modify it accordingly. Yes, this *IS* one of the
> reasons my CFLAGS change so frequently!
>
> -funit-at-a-time tells gcc to consider a full logical unit, perhaps
> consisting of several source files rather than just one, as a whole when
> it does its compiling. Of course, this allows gcc to make
> optimizations it couldn't see if it wasn't looking at the larger picture
> as a whole, but it requires rather more memory, to hold the entire unit
> so it can consider it at once. This is a fairly new flag, introduced with
> gcc 3.3 IIRC. While the idea is simple enough and shouldn't lead to any
> bugs on its own, there WERE a number of previously dormant bugs in
> various code that this flag exposed, when GCC made optimizations on the
> entire unit that it wouldn't otherwise make, thereby triggering bugs that
> had never been triggered before. I /believe/ this was the root reason why
> the Gentoo amd64 technotes originally discouraged use of -Os, back with
> the first introduction of this flag in gcc 3.2 hammer (amd64) edition, as
> -funit-at-a-time was activated by -Os at that time, and -Os was known to
> produce bad code at the time, on amd64, with packages like portions of
> KDE. The gcc 4.1.0 manpage now says it's enabled by default at -O2 and
> -O3, but doesn't mention -Os. Whether that's an omission, or whether they
> decided it shouldn't be enabled by -Os for some reason, I'm not sure, but
> I use them both to be sure and haven't had any issues I can trace to this
> (not even back when the technotes recommended against -Os, and said KDE
> was supposed to have trouble with it -- maybe it was parts of KDE I never
> merged, or maybe I was just lucky, but I've simply never had an issue with
> it).
>
> -freorder-blocks-and-partition is new for gcc 4.0, I believe, altho I
> didn't discover it until I was reading the 4.1-beta manpage. I KNOW gcc
> 3.4.4 fails out with it, saying unrecognized flag or some such, so it's
> another of those flags that cause my CFLAGS to be constantly changing, as
> I switch between gcc versions. This flag won't work under all conditions,
> according to the manpage, so is automatically disabled in the presence of
> exception handling, and a few other situations named in the manpage. It
> causes a lot of warnings too, to the effect that it's being disabled due
> to X reason. There's a similar -freorder-blocks flag, which optimizes by
> reordering blocks in a function to "reduce number of taken branches and
> improve code locality." In English, what that means is that it breaks
> caching less often. Again, caching is *EXTREMELY* performance critical,
> so anything that breaks it less often is CERTAINLY welcome! The
> -and-partition variant increases the effect by separating the code into
> frequently used and less frequently used partitions. This keeps the most
> frequently used code all together, therefore keeping it in cache far more
> efficiently, since the less used code won't be constantly pulled in,
> forcing out frequently used code in the process.
>
> Hmm... As I'm writing and thinking about this, it occurs to me that
> sticking the regular -freorder-blocks option in CFLAGS as well would
> probably be wise. The non-partition version isn't as efficient as
> the partition version, and would be redundant if the partitioned version
> is in effect. However, the non-partitioned version doesn't have the same
> sorts of no-exception-handler and similar restrictions, so having it in
> the list first, so the partitioned version overrides it where it can be
> used, should be a good idea. That way, where the partitioned version can
> be used, it will be, but where it can't, gcc will still use the
> non-partitioned version of the option, so I'll still get /some/ of the
> optimizations! I (re)compiled major portions of xorg (modular), qt, and
> the new kde 3.5.1 with the partitioned option, however, and it works, but
> I haven't tested having both options in there yet, so I'm not sure it'll
> work as the theory suggests it should, so some caution might be advised.
>
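Since flags like this come and go between gcc versions, a quick way to check whether the currently selected gcc accepts a given flag is to try it on a trivial translation unit. This helper is my own sketch, not from the post:

```shell
# Succeed if the active gcc accepts the given flag; fail (gcc prints
# "unrecognized command line option" to stderr, which we discard) otherwise.
cflag_ok() {
    echo 'int main(void){return 0;}' |
        gcc -x c "$1" -o /dev/null - 2>/dev/null
}

cflag_ok -freorder-blocks && echo "supported" || echo "not supported"
```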
> -fmerge-all-constants COULD be dangerous with SOME code, as it breaks part
> of the C/C++ specification. However, it should be fine for most code
> written to be compiled with gcc, and I've seen no problems /yet/, tho both
> this and the reorder-and-partition flag above are fairly new to my CFLAGS,
> so haven't been as extensively personally tested as the others have been.
> If something seems to be breaking when this is in your CFLAGS, it's
> certainly the first thing I'd try pulling out. What it actually does is
> merge all constants with the same value into the same one. gcc has a
> weaker -fmerge-constants version that's enabled with any -O option at all
> (thus at -O, -O2, -O3, AND -Os), that merges all declared constants of the
> same value, which is safe and doesn't conflict with the C/C++ spec. What
> the /all/ specifier in there does, however, is cause gcc to merge declared
> variables where the value actually never changes, so they are in effect
> constants, altho they are declared as variables, with other constants of
> the same value. This /should/ be safe, /provided/ gcc isn't failing to
> detect a variable change somewhere, but it conflicts with the C/C++ spec,
> according to the gcc manpage, and thus /could/ cause issues, if the
> developer pulls certain tricks that gcc wouldn't detect, or possibly more
> likely, if used with code compiled by a different compiler (say
> binary-only applications you may run, which may not have been compiled
> with gcc). There are two reasons why I choose to use it despite the
> possible risks. One, I want /small/ code, again, because small code fits
> in that all-important cache better and therefore runs faster, and
> obviously, two or more merged constants aren't going to take the space
> they would if gcc stored them separately. Two, the risks aren't as bad if
> you aren't running non-gcc compiled code anyway, and since I'm a strong
> believer in Software Libre, if it's binary-only, there's very little
> chance I'll want or risk it on my box, and everything I do run is gcc
> compiled anyway, so should be generally safe. Still, I know there may be
> instances where I'll have to recompile with the flag turned off, and am
> prepared to deal with them when they happen, or I'd not have the flag in
> my CFLAGS.
You are referring a lot to the gcc manpage, but obviously you missed this part:

    -fomit-frame-pointer
        Don't keep the frame pointer in a register for functions that don't
        need one. This avoids the instructions to save, set up and restore
        frame pointers; it also makes an extra register available in many
        functions. It also makes debugging impossible on some machines.

        On some machines, such as the VAX, this flag has no effect, because
        the standard calling sequence automatically handles the frame
        pointer and nothing is saved by pretending it doesn't exist. The
        machine-description macro "FRAME_POINTER_REQUIRED" controls whether
        a target machine supports this flag.

        Enabled at levels -O, -O2, -O3, -Os.

I have to say that I am a bit disappointed now. You seemed to be one of those
people who actually inform themselves before sticking new flags into their
CFLAGS.
> And, here's some selected output from ei, interspersed with explanations,
> since I'm editing the output anyway:
>
> $ ei
> !!! Failed to change nice value to '-2'
> !!! [Errno 13] Permission denied
>
> This is stderr output. It's not in the eis output above because I
> redirect stderr to /dev/null for it, as I know the reason for the error
> and am trying to be brief.
>
> The warning is because I'm using PORTAGE_NICENESS=-2 in make.conf. It has
> a negative nice set there to encourage portage to make fuller use of the
> dual CPUs under-X/from-a-konsole-session, as X and the kernel do some
> dynamic scheduling magic to keep X more responsive without having to up
> /its/ priority. The practical effect of that "magic" is to lower the
> priorities of everything besides X slightly, when X is running. This
> /does/ have the intended effect of keeping X more responsive, but the cost
> as observed here is that emerges take longer than they should when X is
> running, because the scheduler is leaving a bit of extra idle CPU time to
> keep X responsive. In many cases, I'd rather be using maximum CPU and get
> the merges done faster, even if X drags a bit in the meantime, and the
> slightly negative niceness for portage accomplishes exactly that.
>
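Pulling the settings mentioned so far together, the relevant make.conf fragment would look something like this — the values are taken from the eis output earlier in the post, not invented:

```shell
# /etc/make.conf (fragment)
PORTAGE_NICENESS="-2"   # slightly negative nice; only takes effect as root
FEATURES="autoconfig buildpkg candy ccache confcache distlocks
multilib-strict parallel-fetch sandbox sfperms strict userfetch"
MAKEOPTS="-j4"
```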
> It's reporting a warning (to stderr) here, as I ran the command as a
> regular non-root user, and non-root can't set negative priorities for
> obvious system security reasons. I get the same warning with my ep*
> commands, which I normally run as a regular user, as well. The ea*
> commands which actually do the merging get run as root, naturally, so the
> niceness /can/ be set negative when it counts, during a real emerge.
>
> So... nothing of any real matter, then.
>
> !!! Relying on the shell to locate gcc, this may break
> !!! DISTCC, installing gcc-config and setting your current gcc
> !!! profile will fix this
>
> Another warning, likewise to stderr and thus not in the eis output. This
> one is due to the fact that eselect, the eventual systemwide replacement
> for gcc-config and a number of other commands, uses a different method to
> set the compiler than gcc-config did, and portage hasn't been adjusted to
> full compatibility just yet. Portage finds the proper gcc just fine for
> itself, but there'd be problems if distcc were involved, thus the warning.

Didn't know about this. Have you filed a bug on the topic yet? Or is there
already one?
> Again, I'm aware of the situation and the cause, but don't use distcc, so
> it's nothing I have to worry about, and I can safely ignore the warning.
>
> I kept the warnings here, as I find them and the explanation behind them
> interesting elements of my Gentoo environment, thus worth posting, for
> others who seem interested in my Gentoo environment as well. If nothing
> else, the explanations should help some in my audience understand that bit
> more about how their system operates, even if they don't get these
> warnings.

Indeed.
> Portage 2.1_pre4-r1 (default-linux/amd64/2006.0, gcc-4.1.0-beta20060127,
> glibc-2.3.6-r2, 2.6.15 x86_64)
> =================================================================
> System uname: 2.6.15 x86_64 AMD Opteron(tm) Processor 242
> Gentoo Base System version 1.12.0_pre15
>
> Those of you running stable amd64, but wondering where baselayout is for
> unstable, there you have it!
>
> ccache version 2.4 [enabled]
> dev-lang/python: 2.4.2
> sys-apps/sandbox: 1.2.17
> sys-devel/autoconf: 2.13, 2.59-r7
> sys-devel/automake: 1.4_p6, 1.5, 1.6.3, 1.7.9-r1, 1.8.5-r3, 1.9.6-r1
> sys-devel/binutils: 2.16.91.0.1
> sys-devel/libtool: 1.5.22
> virtual/os-headers: 2.6.11-r3
>
> ACCEPT_KEYWORDS="amd64 ~amd64"
>
> Same for the above portions of my toolchain. AFAIR, it's all ~amd64,
> altho I was running a still-masked binutils for awhile shortly after
> gcc-4.0 was released (still-masked on Gentoo as well), as it required the
> newer binutils.
>
> LANG="en_US"
> LDFLAGS="-Wl,-z,now"
>
> Some of you may have noticed the occasional Portage warning about SETUID
> executables using lazy bindings, and the potential security issue that
> causes. This setting for LDFLAGS forces early bindings with all
> dynamically linked libraries. Normally it'd only be necessary or
> recommended for SETUID executables, and set in the ebuild where it's safe
> to do so, but I use it by default, for several reasons. The effect is
> that a program takes a bit longer to load initially, but won't have to
> pause to resolve late bindings as they are needed. You're trading waiting
> at executable initialization for waiting at some other point. With a gig

Note that, depending on how many functions of a library/application you
actually use when running it, the drawback may be bigger or smaller.

> of memory, I find most stuff I run more than once is at least partially
> still in cache on the second and later launches, and with my system, I
> don't normally find the initial wait irritating, and sometimes find a
> pause after I'm working with a program especially so, so I prefer to have
> everything resolved and loaded at executable launch. Additionally, with
> lazy bindings, I've had programs start just fine, then fail later when
> they need to resolve some function that for some reason won't resolve in
> whatever library it's supposed to be coming from. I don't like having the
> thing fail and interrupt me in the middle of a task, and find it far less
> frustrating, if it's going to fail when it tries to load something, to
> have it do so at launch. Because early binding forces resolution of
> functions at launch, if it's going to fail loading one, it'll fail at
> launch, rather than after I've started working with the program. That's
> /exactly/ how I want it, so that's why I run the above LDFLAGS setting.
> It's nice not to have to worry about the security issue, but SETUID type
> security isn't as critical on my single-human-user system, where that
> single user is me and I already have root when I want it anyway, as it'd
> be in a multi-user system, particularly a public server, so the other
> reasons are more important than security, for me, on this. They just
> happen to coincide, so I'm a happy camper. =8^)
>
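Whether a binary was actually linked with early binding can be checked from its dynamic section — a sketch, assuming gcc and binutils are installed; `-Wl,-z,now` sets the `BIND_NOW`/`NOW` dynamic flags that readelf reports:

```shell
# Compile a trivial program with -Wl,-z,now, then confirm the dynamic
# section carries the flag that forces early (non-lazy) binding.
echo 'int main(void){return 0;}' > now_test.c
gcc now_test.c -Wl,-z,now -o now_test
readelf -d now_test | grep -E 'BIND_NOW|FLAGS_1.*NOW'
```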
> The caveat with these LDFLAGS, however, is the rare case where there's a
> circular functional dependency that's normally self-resolving. Modular
> xorg triggers one such case, where the monolithic xorg didn't. There are
> three individual ebuilds related to modular xorg that I have to remove
> these LDFLAGS for or they won't work. xorg-server is one.
> xf86-video-ati, my video driver, is another. libdri was the third, IIRC.
> There's a specific order they have to be compiled in, as well. If they are
> compiled with this enabled, they, and consequently X, refuse to load (tho
> X will load without DRI, if that's the only one, it'll just protest in the
> log and DRI and glx aren't available). Evidently there's a non-critical
> fourth module somewhere, that still won't load properly due to an
> unresolved symbol, that I need to track down and remerge without these
> LDFLAGS, and that's what's keeping GLX from loading on my current system,
> as mentioned in an earlier post.
>
511 |
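(A side note on the per-package LDFLAGS juggling: newer Portage versions can automate it with /etc/portage/package.env, so the override survives future remerges of those ebuilds. A minimal sketch — the env-file name and the sample LDFLAGS value are my own illustration, not from the post, and the path root is parameterized only so the sketch can run anywhere; on a real system it resolves to /etc/portage:)

```shell
#!/bin/sh
# Sketch: strip the early-binding LDFLAGS for the problem ebuilds via
# Portage's package.env mechanism (available in newer Portage releases).
root="${PORTAGE_CONFIGROOT:-/tmp/portage-demo}/etc/portage"
mkdir -p "$root/env"

# Hypothetical env file: LDFLAGS *without* the early-binding flags.
# The value here is only an example, not Duncan's actual setting.
cat > "$root/env/no-early-binding.conf" <<'EOF'
LDFLAGS="-Wl,-O1"
EOF

# Map the affected ebuilds to that env file.
# (Plus the libdri ebuild mentioned above; exact atom omitted.)
cat > "$root/package.env" <<'EOF'
x11-base/xorg-server        no-early-binding.conf
x11-drivers/xf86-video-ati  no-early-binding.conf
EOF
```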
> LINGUAS="en"
> MAKEOPTS="-j4"
>
> Four jobs are nice for a dual-CPU system -- when it works.
> Unfortunately, the unpack and configure steps are serialized, so the
> jobs option does little good there. To make the most efficient use of
> the available cycles when I have a lot to merge, therefore, I'll run as
> many as five merges in parallel. I do this quite regularly with KDE
> upgrades like the one to 3.5.1, where I use the split KDE ebuilds and
> have something north of 100 packages to merge before KDE is fully
> upgraded.

I really wonder how you would parallelize unpacking and configuring a package.

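(For what it's worth, later Portage releases grew a built-in answer to this: `emerge --jobs` builds several *packages* at once — each package's own unpack/configure/compile pipeline stays serialized — and `--load-average` caps the parallelism. A hedged sketch of the Portage-native equivalent of the manual five-way juggling; the echo keeps it side-effect free, and these flags exist only in later Portage versions, not the one under discussion:)

```shell
#!/bin/sh
# Sketch: let Portage itself run up to 5 package builds in parallel,
# backing off when the load average climbs past 8.
parallel_world() {
    # echo instead of executing, so the sketch is side-effect free
    echo emerge --ask --update --deep --newuse --verbose \
         --jobs=5 --load-average=8 world
}
parallel_world
```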
> I mentioned above that I often run eptree, then ea individual packages
> from the list. This is how I accomplish the five merges in parallel.
> I'll take a look at the tree output to check the dependencies, and
> merge first the packages that have several dependencies, but only where
> those dependencies aren't stepping on each other, thus keeping the
> parallel emerges from interfering with each other, because each one is
> doing its own dependencies, which aren't dependencies of any of the
> others. After I get as many of those going as I can, I'll start
> listing 3-5 individual packages without deps on the same ea command
> line. By the time I've gotten the fifth one started, one of the other
> sessions has usually finished or is close to it, so I can start it
> merging the next set of packages. With five merge sessions in
> parallel, I'm normally running an average load of 5 to 9, meaning that
> many applications are ready for CPU scheduling time at any instant, on
> average. If the load drops below four, there are probably idle CPU
> cycles being wasted that could otherwise be compiling stuff, as each
> CPU needs at least one load-point to stay busy, plus can usually
> schedule a second one for some cycles as well, while the first is
> waiting for the hard drive or whatever.
>
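(The ea*/ep* helpers themselves aren't shown anywhere in the thread, but a minimal reconstruction from the descriptions above is straightforward — shell functions wrapping the long emerge lines. The exact flag spellings are my guess from the post's "emerge -NuDv world" and "--pretend --log/--tree" descriptions; the echo keeps the sketch side-effect free — drop it to run the real commands:)

```shell
#!/bin/sh
# Sketch: reconstructed ea*/ep* emerge shortcuts.
epworld() { echo emerge --pretend --newuse --update --deep --verbose world; }
eaworld() { echo emerge --ask --newuse --update --deep --verbose world; }
epsys()   { echo emerge --pretend --newuse --update --deep --verbose system; }
easys()   { echo emerge --ask --newuse --update --deep --verbose system; }
# Per-package variants: eplog gcc, eptree kde-base/kdelibs, ...
eplog()   { echo emerge --pretend --log --verbose "$@"; }
eptree()  { echo emerge --pretend --tree --verbose "$@"; }
```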
> (Note that I'm running a four-drive RAID setup: RAID-6, so two-way
> striped, for my main system, and RAID-0, so 4-way striped, for
> $PORTAGE_TMPDIR, so hard drive latency isn't /nearly/ as high as it
> would be on a single-hard-drive system. Of course, running five merges
> in parallel /does/ increase disk latency some as well, but it /does/
> seem to keep my load average in the target zone and my idle cycles to a
> minimum during the merge period. Also note that I've only recently
> added the PORTAGE_NICENESS value above, and haven't gotten it fully
> tweaked to the best balance between interactivity and emerge speed just
> yet, but from observations so far, with the niceness value set, I'll be
> able to keep the system busy with "only" 3-4 parallel merges, rather
> than the 5 I had been having to run to keep the system most efficiently
> occupied when I had a lot to merge.)
>
> PKGDIR="/pkg"
> PORTAGE_TMPDIR="/tmp"
> PORTDIR="/p"
> PORTDIR_OVERLAY="/l/p"
>
> Here you can see some of my path customization.
>
> USE="amd64 7zip X a52
> aac acpi alsa apm arts asf audiofile avi bash-completion berkdb
> bitmap-fonts bzip2 caps cdparanoia cdr crypt css cups curl dga divx4linux
> dlloader dri dts dv dvd dvdr dvdread eds emboss encode extrafilters fam
> fame ffmpeg flac font-server foomaticdb gdbm gif glibc-omitfp gpm
> gstreamer gtk2 idn imagemagick imlib ithreads jp2 jpeg jpeg2k kde
> kdeenablefinal lcms libwww linuxthreads-tls lm_sensors logitech-mouse
> logrotate lzo lzw lzw-tiff mad maildir mikmod mjpeg mng motif mozilla mp3
> mpeg ncurses network no-old-linux nolvm1 nomirrors nptl nptlonly offensive
> ogg opengl oss pam pcre pdflib perl pic png ppds python qt quicktime
> radeon readline scanner slang speex spell ssl tcltk theora threads tiff
> truetype truetype-fonts type1 type1-fonts usb userlocales vcd vorbis
> xcomposite xine xinerama xml2 xmms xosd xpm xrandr xv xvid yv12 zlib
> elibc_glibc input_devices_keyboard input_devices_mouse kernel_linux
> linguas_en userland_GNU video_cards_ati"
>
> My USE flags, FWTAW (for what they are worth). Of particular interest
> are input_devices_mouse, input_devices_keyboard, and video_cards_ati.
> These come from variables (INPUT_DEVICES and VIDEO_CARDS) set in
> make.conf, and used in the new xorg-modular ebuilds. These and the
> others listed after zlib are referred to by Gentoo devs as USE_EXPAND.
> Effectively, they are USE flags in the form of variables, set up that
> way because there are rather many possible values for those variables,
> too many to work as plain USE flags. The LINGUAS and LANG USE_EXPAND
> variables are prime examples. Consider how many different languages
> there are; if each were used and documented as a regular USE flag, it
> would have to be in use.local.desc, because few supporting packages
> would offer the same choices, so each would have to be listed
> separately for each package. Talk about the number of USE flags
> quickly getting out of control!
>
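(The USE_EXPAND mechanics are simple to spell out: each value of an expanded variable becomes a USE flag of the form <lowercased variable name>_<value>. A toy illustration of that mapping — the function is mine, not Portage code:)

```shell
#!/bin/sh
# Sketch: how a USE_EXPAND variable maps onto plain USE flags.
# Usage: use_expand VARNAME value...  ->  one varname_value flag per line.
use_expand() {
    var=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    shift
    for val in "$@"; do
        printf '%s_%s\n' "$var" "$val"
    done
}
# e.g. VIDEO_CARDS="ati" and INPUT_DEVICES="keyboard mouse" in make.conf:
use_expand VIDEO_CARDS ati
use_expand INPUT_DEVICES keyboard mouse
```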
> Unset: ASFLAGS, CTARGET, EMERGE_DEFAULT_OPTS, LC_ALL
>
> OK, some loose ends to wrap up, and I'm done.
>
> re: gcc versions: The plan is for gcc-4.0 to go ~arch fairly soon now.
> The devs are actively asking for bug reports involving it, so as many
> as possible can be resolved before it goes ~arch. (Formerly, they were
> recommending that bugs be filed upstream, and not with Gentoo unless
> there was a patch attached, as it was considered entirely unsupported,
> just there for those that wanted it anyway.) At this point, nearly
> everything should compile just fine with 4.0.
>
> That said, Gentoo has slotted gcc for a reason. It's possible to have
> multiple minor versions (3.3, 3.4, 4.0, 4.1) merged at the same time.
> With USE=multislot, that's actually micro versions (4.0.0, 4.0.1,
> 4.0.2...). Using either gcc-config or eselect compiler, and
> discounting any CFLAG switching you may have to do, it's a simple
> matter to switch between merged versions. This made it easy to
> experiment with gcc-4.0 even tho Gentoo wasn't supporting it and
> certain packages wouldn't compile with 4.x, because it was always
> possible to switch to a 3.x version if necessary, and compile the
> package there. I did this quite regularly, using gcc-4.0 as my normal
> version, but reverting for individual packages as necessary, when they
> wouldn't compile with 4.0.
>
> The same now applies to the 4.1.0-beta-snapshot series. Other than the
> compile time necessary to compile a new gcc when the snapshot comes out
> each week, it's easy to run the 4.1-beta as the main system compiler
> for as wide testing as possible, while reverting to 4.0 or 3.4 (I don't
> have a 3.3 slot merged) if needed.
>
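(The slot-switching workflow is worth spelling out. The profile name below is illustrative only — it depends on CHOST and on which versions are actually merged; `gcc-config -l` lists the real ones. The sketch just echoes the commands rather than running them, since they only make sense on a Gentoo box:)

```shell
#!/bin/sh
# Sketch: switching between slotted gcc versions with gcc-config.
switch_gcc() {
    echo "gcc-config -l"          # list installed compiler profiles
    echo "gcc-config $1"          # select one by name or number
    echo "source /etc/profile"    # pick up the new environment in this shell
}
# Hypothetical profile name for a 3.4 slot on amd64:
switch_gcc x86_64-pc-linux-gnu-3.4.5
```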
> re: the performance improvements I saw that started this whole thing:
> These trace to several things, I believe. #1, with gcc-4.0, there's
> now support for -fvisibility -- marking certain functions as exported
> and visible externally, and others not. That can easily cut exported
> symbols by a factor of 10. Exported symbols of course affect dynamic
> load time, which gets magnified dramatically by my LDFLAGS
> early-binding settings. When I first compiled KDE with that (there
> were several missteps early on in terms of KDE and Gentoo's support,
> but that aside), KDE app load times went down VERY NOTICEABLY! Again,
> due to my LDFLAGS, the effect was multiplied dramatically, but the
> effect is VERY real!
>
> Of course, that's mainly load-time performance. The run-time
> performance that we are actually talking about here has other
> explanations. A big one is that gcc-4 was a HUGE rewrite, with a BIG
> potential to DRAMATICALLY improve gcc's performance. With 4.0, the
> theory is there, but in practice, it wasn't all that optimized just
> yet. In some ways it reverted behavior to below that of the fairly
> mature 3.x series, altho the rewrite made things much simpler and less
> error-prone for such a young codebase. 4.1, however, is the first 4.x
> release to REALLY be hitting the potential of the 4.x series, and it
> appears the difference is very noticeable. Of course, there's a reason
> 4.1.0 is still in beta upstream and not supported by Gentoo either, as
> there are still known regressions. However, where it works, which it
> seems to do /most/ of the time, it **REALLY** works, or at least that's
> been my observation. 3.3 was a MAJOR improvement in gcc for amd64
> users, because it was the first version where amd64 wasn't simply an
> add-on hack, as it had been with 3.2. The 3.4 upgrade was minor in
> comparison, and 4.0, while it's going ~arch shortly and sets the stage
> for a lot of future improvement, will be pretty minor in terms of
> actual improved performance as well. 4.1, however, when it is finally
> fully released, has the potential to be as big an improvement as 3.3
> was -- that is, a HUGE one. I'm certainly looking forward to it, and
> meanwhile I'm running the snapshots, because Gentoo makes it easy to do
> so while maintaining the ability to switch very simply between multiple
> versions on the system.
>
> Both -freorder-blocks-and-partition and -fmerge-all-constants are new
> to me within the last few days, new with kde 3.5.1. Normally,
> individual flags won't make /that/ much of a difference, but it's
> possible I hit it lucky with these. Actually, because they both match
> very well with and reinforce my strategy of targeting size, it's
> possible I'm only now unlocking the real potential behind size
> optimization. -- I **KNOW** there's a **HUGE** difference in the
> resulting file sizes. I compared 4.0.2 and 4.1.0-beta-snapshot file
> sizes for several modular-X files in the course of researching the
> missing-symbols problem, and the difference was often a shrinkage of
> near 33 percent with 4.1 and my current CFLAGS, as opposed to 4.0.2
> without the new ones. Going the other way, that's a 50% larger file
> with 4.0.2 as compared to 4.1 -- 100KB vs. 150KB, by way of example.
> That's a *HUGE* difference, one big enough to initially think I'd
> found the reason for the missing symbols right there, as the new files
> were simply too much smaller to look workable! Still, I traced the
> problem to LDFLAGS, so that wasn't it, and the files DO work,
> confirming things. I'm guessing -fmerge-all-constants plays a
> significant part in that. In any case, with that difference in size,
> and knowing how /much/ cache hit vs. miss affects performance, it's
> quite possible the size is the big performance factor. Of course, even
> if that's so, I'm not sure whether it is the CFLAGS or the 4.0 vs. 4.1
> that should get the credit.
>
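(The two percentages quoted here are the same measurement read in both directions, which is worth making explicit: a drop from 150KB to 100KB is a 33% shrink relative to the larger file, but a 50% growth relative to the smaller one. In shell arithmetic:)

```shell
#!/bin/sh
# The 100KB-vs-150KB example, read both ways.
old=150 new=100   # sizes in KB: the 4.0.2 build vs. the 4.1 build

shrink=$(( (old - new) * 100 / old ))   # relative to the larger file
growth=$(( (old - new) * 100 / new ))   # relative to the smaller file

echo "4.1 output is ${shrink}% smaller"    # 33% smaller
echo "4.0.2 output is ${growth}% larger"   # 50% larger
```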
> In any case, I'm a happy camper right now! =8^)


--
Simon Stelling
Gentoo/AMD64 Operational Co-Lead
blubb@g.o
--
gentoo-amd64@g.o mailing list