Gentoo Archives: gentoo-amd64

From: Simon Stelling <blubb@g.o>
To: gentoo-amd64@l.g.o
Subject: Re: [gentoo-amd64] Re: Re: Wow! KDE 3.5.1 & Xorg 7.0 w/ Composite
Date: Wed, 08 Feb 2006 20:39:13
Message-Id: 43EA568D.6020307@gentoo.org
In Reply to: [gentoo-amd64] Re: Re: Wow! KDE 3.5.1 & Xorg 7.0 w/ Composite by Duncan <1i5t5.duncan@cox.net>
Duncan wrote:
>>Nice. Now let us know your CFLAGS, and what toolchain versions you're
>>running :D
>
>
> You probably didn't notice, as I had it commented out on the main index
> page as I don't have the page created to actually list them yet, but if
> you viewed source, you'd have seen I have a techspecs page link commented
> out, that'll get that sort of info, when/if I actually get it created.
>
> However, since you asked, your answer, and a bit more, by way of
> explanation...
>
> I should really create a page listing all the little Gentoo admin scripts
> I've come up with and how I use them. I'm sure a few folks anyway would
> likely find them useful.
>
> The idea behind most of them is to create shortcuts to having to type in
> long emerge lines, with all sorts of arbitrary command line parameters.
> The majority of these fall into two categories, ea* and ep*, short for
> emerge --ask <additional parameters> and emerge --pretend ... . Thus, I
> have epworld and eaworld, the pretend and ask versions of emerge -NuDv
> world, epsys and easys, the same for system, eplog <package>, emerge
> --pretend --log --verbose (package name to be added to the command line so
> eplog gcc, for instance, to see the changes between my current and the new
> version of gcc), eptree <package>, to use the tree output, etc.

Interesting. But why do you use scripts and not simple aliases? Every time you
launch your script the HD performs a seek (which is very expensive in time),
copies the script into memory and then forks a whole bash process to execute a
one-liner. An alias, which is a bash built-in, wouldn't fork a process and
would therefore be much faster.

(see man alias for examples)
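For illustration, the first few of those scriptlets could be written as aliases like this. The option letters come from your description (emerge -NuDv); the names are yours, the exact bodies are my guess:

```shell
# Hypothetical aliases mirroring the ep*/ea* scriptlets described above.
alias epworld='emerge --pretend -NuDv world'
alias eaworld='emerge --ask -NuDv world'
alias epsys='emerge --pretend -NuDv system'
alias easys='emerge --ask -NuDv system'
```

Stick them in ~/.bashrc and the shell expands them in place, with no extra fork and no file to seek for.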

> One thing I've found is that I'll often epworld or eptreeworld, then
> emerge the individual packages, rather than use eaworld to do it. That
> way, I can do them in the order I want or do several at a time if I want
> to make use of both CPUs. Because I always use --deep, as I want to keep
> my dependencies updated as well, I'm very often merging specific
> dependencies. There's a small problem with that, however: --oneshot, which
> I'll always want to use with dependencies to help keep my world file
> uncluttered, has no short form, but I use it as the default! OTOH, the

man emerge:
    --oneshot (-1)

IIRC, --oneshot has had a short form since 2.0.52 was released.
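So the ea/ea2 pair could be reduced to two short functions using -1. The function names are yours; the bodies are my guess at what they do:

```shell
# Sketch of the ea/ea2 pair: -1 is the short form of --oneshot.
ea()  { emerge --ask -1 "$@"; }   # default: don't record in the world file
ea2() { emerge --ask "$@"; }      # "2" variant: do record in the world file
```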

> normal portage mode of adding stuff listed on the command line to the
> world file, I don't want very often, as most of the time I'm simply
> updating what I have, so it's all in the world file if it needs to be
> there already anyway. Not a problem! All my regular ea* scriptlets use
> --oneshot, so it /is/ my default. If I *AM* merging something new that I
> want added to my world file, I have another family of ea* scriptlets that
> do that -- all ending in "2", as in, "NOT --oneshot". Thus, I have a
> family of ea*2 scriptlets.
>
> The regulars here already know one of my favorite portage features is
> FEATURES=buildpkg, which I have set in make.conf. That of course gives me
> a collection of binary versions of packages I've already emerged, so I
> can quickly revert to an old version for testing something, if I want,
> then remerge the new version once I've tested the old version to see if it
> has the same bug I'm working on or not. To aid in this, I have a
> collection of eppak and eapak scriptlets. Again, the portage default of
> --usepkg (-k) doesn't fit my default needs, as if I'm using a binpkg,
> I usually want to ONLY use a binpkg, NOT merge from source if the package
> isn't available. That happens to be -K in short-form. However, it's my
> default, so eapak invokes the -K version. I therefore have eapaK to
> invoke the -k version if I don't really care whether it goes from binpkg
> or source.
>
> Of course, there are various permutations of the above as well, so I have
> eapak2 and eapaK2, as well as eapak and eapaK. For the ep* versions, of
> course the --oneshot doesn't make a difference, so I only have eppak and
> eppaK, no eppa?2 scriptlets.
>
> ... Deep breath... <g>
>
> All that as a preliminary explanation to this: Along with the above, I
> have a set of efetch functions, that invoke the -f form, so just do the
> fetch, not the actual compile and merge, and esyn (there's already an
> esync function in something or other I have merged so I just call it
> esyn), which does emerge sync, then updates the esearch db, then
> automatically fetches all the packages that an eaworld would want to
> update, so they are ready for me to merge at my leisure.

I'm a bit confused now. You use *functions* to do that? Or do you mean scripts?
By the way: with an alias you could name your custom "script" esync, because it
doesn't place a file on the hard disk.

> Likewise, and the real reason for this whole explanation, I /had/ an
> "einfo" scriptlet that simply ran "emerge info". This can be very handy
> to run if, like me, you have several slotted versions of gcc merged, and
> you sometimes forget which one you have eselected or gcc-configed as the
> one portage will use. Likewise, it's useful for checking on CFLAGS (or
> CXXFLAGS or LDFLAGS or...), if you modified them from the normal ones
> because a particular package wasn't cooperating, and you want to see if
> you remembered to switch them back or not.
>
> However, I ran into a problem. The output of einfo was too long to
> quickly find the most useful info -- the stuff I most often change and
> therefore most often am looking for.
>
> No sweat! I shortened my original "einfo" to simply "ei", and added a
> second script, "eis" (for einfo short), that simply piped the output of
> the usual emerge info into a grep that only returned the lines I most
> often need -- the big title one with gcc and similar info, CFLAGS,
> CXXFLAGS, LDFLAGS, and FEATURES. USE would also be useful, but it's too
> long even by itself to be searched at a glance, so if I want it, I simply
> run ei and look for what I want in the longer output.

Impressive.
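For anybody wanting the same thing, a minimal "eis" is just a grep over emerge info. The exact pattern list is my assumption from the lines named above:

```shell
# A minimal "eis": filter emerge info down to the handful of lines
# mentioned above (Portage banner, CFLAGS, CXXFLAGS, LDFLAGS, FEATURES).
eis() {
    emerge info 2>/dev/null |
        grep -E '^(Portage|CFLAGS|CXXFLAGS|LDFLAGS|FEATURES)'
}
```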

> ... Another deep breath... <g>
>
> OK, with that as a preliminary, you should be able to understand the
> following:
>
> $eis
>
> Portage 2.1_pre4-r1 (default-linux/amd64/2006.0, gcc-4.1.0-beta20060127,
> glibc-2.3.6-r2, 2.6.15 x86_64)
>
> CFLAGS="-march=k8 -Os -pipe -fomit-frame-pointer -frename-registers
> -funit-at-a-time -fweb -freorder-blocks-and-partition
> -fmerge-all-constants"
>
> CXXFLAGS="-march=k8 -Os -pipe -fomit-frame-pointer -frename-registers
> -funit-at-a-time -fweb -freorder-blocks-and-partition
> -fmerge-all-constants"
>
> FEATURES="autoconfig buildpkg candy ccache confcache distlocks
> multilib-strict parallel-fetch sandbox sfperms strict userfetch"
>
> LDFLAGS="-Wl,-z,now"
>
> MAKEOPTS="-j4"
>
> To make sense of that...
>
> * The portage and glibc versions are ~amd64, as set in make.conf for the
> system in general.
>
> * CFLAGS:
>
> I choose -Os, optimize for size, because a modern CPU and the various
> cache levels are FAR faster than main memory. This difference is
> frequently severe enough that it's actually more efficient to optimize for
> size than for CPU performance, because the result is smaller code that
> maintains cache locality (stays in fast cache) far better, and the CPU
> saves more time than it would otherwise be spending idle, waiting for data
> to come in from slower, more distant memory, than the actual cost of the
> loss of cycle efficiency that's often the tradeoff for small code.

Given that two CPUs differing only in L2 cache size have nearly the same
performance, I doubt that the performance increase is very big. Some
interesting figures:

An Athlon64 (I forgot which model, but it shouldn't matter anyway) with 1 MB
L2 cache is 4% faster than an Athlon64 of the same frequency but with only
512 kB L2 cache. The bigger the cache sizes you compare, the smaller the
performance increase. Since you run a dual Opteron system with 1 MB L2 cache
per CPU, I tend to say that the actual performance increase you experience is
about 3%. But then I didn't take into account that -Os leaves out a few
optimizations which would be included by -O2, the default optimization level,
which makes the code a bit slower compared to -O2. So the performance increase
you really experience shrinks to about 0-2%. I'd tend to say that -O2 is even
faster for most code, but that's only my feeling.

Besides that, I should mention that -Os sometimes still has problems with huge
packages like glibc.

> Back when memory operated at or near the speed of the CPU, avoiding the
> loop, even at the expense of three times the code, was often faster.
> Today, where CPUs do several calculations in the time it takes to fetch
> data from main memory, it's generally faster to go for the smaller code,
> as it will be far more likely to still be in fast cache, avoiding that
> long wait for main memory, even if it /does/ mean wasting a couple
> additional cycles doing the expensive jump back to the top of the loop.

Not only have CPUs gotten faster; caches have also grown. Comparing my old P4
with 1.7 GHz and 256 kB L2 cache to a P4 with 3.4 GHz (frequency doubled)
which has 1 MB L2 cache (cache quadrupled) shows that the proportions have
changed. A bigger cache of course means that you can keep larger chunks of
code there, so unrolling loops with fixed iteration counts actually might
perform better.

> Of course, this is theory, and the practical case can and will differ
> depending on the instructions actually being compiled. In particular,
> streaming media apps and media encoding/decoding are likely to still
> benefit from the traditional loop elimination style optimizations, because
> they run thru so much data already, that cache is routinely trashed
> anyway, regardless of the size of your instructions. As well, that type
> of application tends to have a LOT of looping instructions to optimize!
>
> By contrast, something like the kernel will benefit more than usual from
> size optimization. First, it's always memory locked and as such
> can't be swapped, and even "slow" main memory is still **MANY** **MANY**
> times faster than swap, so a smaller kernel means more other stuff fits
> into main memory with it, and isn't swapped as much. Second, parts of the

Funny to hear this from somebody with 4 GB of RAM in his system. I don't know
how bloated your kernel is, but even if -Os reduced the size of my kernel to
**half**, which is totally impossible, the savings wouldn't even be enough to
hold the mail I am answering right now in RAM. So, basically, this reasoning
is just ridiculous.

> kernel such as task scheduling are executed VERY often, either because
> they are frequently executed by most processes, or because they /control/
> those processes. The smaller these are, the more likely they are to still
> be in cache when next used. Likewise, the smaller they are, the less
> potentially still useful other data gets flushed out of cache to make room
> for the kernel code executing at the moment. Third, while there's a lot
> of kernel code that will loop, and a lot that's essentially streaming, the
> kernel as a whole is a pretty good mix of code and thus won't benefit as
> much from loop optimizations and the like, as compared to special purpose
> code like the media codec and streaming applications above.
>
> The differences are marked enough and now demonstrated enough that a
> kernel config option to optimize for size was added I believe about a year
> ago. Evidently, that led to even MORE demonstration, as the option was
> originally in the obscure embedded optimizations corner of the config,
> where few would notice or use it, and they upgraded it into a main option.
> In fact, where a year or two ago, the option didn't even exist, now I
> believe it defaults to yes/on/do-optimize-for-size (altho it's possible
> I'm incorrect on the last and it's not yet the default).

It is not. The option you are talking about is called
CONFIG_CC_OPTIMIZE_FOR_SIZE and is not set by default, so the 'ifdef
CONFIG_CC_OPTIMIZE_FOR_SIZE' in the kernel Makefile evaluates to false and the
build falls back to -O2, the default.
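From memory, the relevant part of the 2.6.15 top-level Makefile looks roughly like this (paraphrased; check your own tree):

```make
# linux-2.6.15 top-level Makefile, paraphrased from memory
ifdef CONFIG_CC_OPTIMIZE_FOR_SIZE
CFLAGS		+= -Os
else
CFLAGS		+= -O2
endif
```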

> According to the gcc manpage, -frename-registers causes gcc to attempt to
> make use of registers left over after normal register allocation. This is
> particularly beneficial on archs that have many registers (keeping in
> mind that "registers" are what amounts to L0 cache, the fastest possible
> memory, because the CPU accesses registers directly and they operate at
> full CPU speed). Unfortunately, registers are also very limited, making
> them an EXCEEDINGLY valuable resource! Note that while x86-32 is noted
> for its relative /lack/ of registers, AMD basically doubled the number of
> registers available to 64-bit code in its x86-64 aka AMD64 spec. Thus,
> while this option wouldn't be of particular benefit on x86, on amd64, it
> can, depending on the code of course, provide some rather serious
> optimization!
>
> -fweb is a register use optimizer function as well. It tells gcc to
> create a /web/ of dependencies and assign each individual dependency web
> to its own pseudo-register. Thus, when it comes time for gcc to allocate
> registers, it already has a list of the best candidates lined up and ready
> to go. Combined with -frename-registers to tell gcc to efficiently make
> use of any registers left over after the first pass, and due to the
> number of registers available in 64-bit mode on our arch, this can allow
> some seriously powerful optimizations. Still, a couple of things to note
> about it. One, -fweb (and -frename-registers as well) can cause data to
> move out of its "home" register, which seriously complicates debugging, if
> you are a programmer or power-user enough to worry about such things.
> Two, the rewrite for gcc 4.0 significantly modified the functionality of
> -fweb, and it wasn't recommended for 4.0 as it didn't yet work as well as
> expected or as it did with gcc 3.x. For gcc 4.1, -fweb is apparently back
> to its traditional strength. Those Gentoo users having gcc 3.4, 4.0, and
> 4.1, all three in separate slots, will want to note this as they change
> gcc configurations, and modify it accordingly. Yes, this *IS* one of the
> reasons my CFLAGS change so frequently!

> -funit-at-a-time tells gcc to consider a full logical unit, perhaps
> consisting of several source files rather than just one, as a whole, when
> it does its compiling. Of course, this allows gcc to make
> optimizations it couldn't see if it wasn't looking at the larger picture
> as a whole, but it requires rather more memory, to hold the entire unit
> so it can consider it at once. This is a fairly new flag, introduced with
> gcc 3.3 IIRC. While the idea is simple enough and shouldn't lead to any
> bugs on its own, there WERE a number of previously never-encountered bugs
> in various code that this flag exposed, when GCC made optimizations on the
> entire unit that it wouldn't otherwise make, thereby triggering bugs that
> had never been triggered before. I /believe/ this was the root reason why
> the Gentoo amd64 technotes originally discouraged use of -Os, back with
> the first introduction of this flag in gcc 3.2 hammer (amd64) edition, as
> -funit-at-a-time was activated by -Os at that time, and -Os was known to
> produce bad code at the time, on amd64, with packages like portions of
> KDE. The gcc 4.1.0 manpage now says it's enabled by default at -O2 and
> -O3, but doesn't mention -Os. Whether that's an omission, or whether they
> decided it shouldn't be enabled by -Os for some reason, I'm not sure, but
> I use them both to be sure and haven't had any issues I can trace to this
> (not even back when the technotes recommended against -Os, and said KDE
> was supposed to have trouble with it -- maybe it was parts of KDE I never
> merged, or maybe I was just lucky, but I've simply never had an issue with
> it).

> -freorder-blocks-and-partition is new for gcc 4.0, I believe, altho I
> didn't discover it until I was reading the 4.1-beta manpage. I KNOW gcc
> 3.4.4 fails out with it, saying unrecognized flag or some such, so it's
> another of those flags that cause my CFLAGS to be constantly changing, as
> I switch between gcc versions. This flag won't work under all conditions,
> according to the manpage, so is automatically disabled in the presence of
> exception handling, and a few other situations named in the manpage. It
> causes a lot of warnings too, to the effect that it's being disabled due
> to X reason. There's a similar -freorder-blocks flag, which optimizes by
> reordering blocks in a function to "reduce number of taken branches and
> improve code locality." In English, what that means is that it breaks
> caching less often. Again, caching is *EXTREMELY* performance critical,
> so anything that breaks it less often is CERTAINLY welcome! The
> -and-partition increases the effect, by separating the code into
> frequently used and less frequently used partitions. This keeps the most
> frequently used code all together, therefore keeping it in cache far more
> efficiently, since the less used code won't be constantly pulled in,
> forcing out frequently used code in the process.

> Hmm... As I'm writing and thinking about this, it occurs to me that
> sticking the regular -freorder-blocks option in CFLAGS as well would
> probably be wise. The non-partition version isn't as efficient as
> the partition version, and would be redundant if the partitioned version
> is in effect. However, the non-partitioned version doesn't have the same
> sorts of no-exception-handler and similar restrictions, so having it in
> the list first, so the partitioned version overrides it where it can be
> used, should be a good idea. That way, where the partitioned version can
> be used, it will be, but where it can't, gcc will still use the
> non-partitioned version of the option, so I'll still get /some/ of the
> optimizations! I (re)compiled major portions of xorg (modular), qt, and
> the new kde 3.5.1 with the partitioned option, however, and it works, but
> I haven't tested having both options in there yet, so I'm not sure it'll
> work as the theory suggests it should, so some caution might be advised.

> -fmerge-all-constants COULD be dangerous with SOME code, as it breaks part
> of the C/C++ specification. However, it should be fine for most code
> written to be compiled with gcc, and I've seen no problems /yet/, tho both
> this and the reorder-and-partition flag above are fairly new to my CFLAGS,
> so haven't been as extensively personally tested as the others have been.
> If something seems to be breaking when this is in your CFLAGS, certainly
> it's the first thing I'd try pulling out. What it actually does is merge
> all constants with the same value into the same one. gcc has a weaker
> -fmerge-constants version that's enabled with any -O option at all (thus
> at -O, -O2, -O3, AND -Os), that merges all declared constants of the same
> value, which is safe and doesn't conflict with the C/C++ spec. What the
> /all/ specifier in there does, however, is cause gcc to merge declared
> variables where the value actually never changes, so they are in effect
> constants, altho they are declared as variables, with other constants of
> the same value. This /should/ be safe, /provided/ gcc isn't failing to
> detect a variable change somewhere, but it conflicts with the C/C++ spec,
> according to the gcc manpage, and thus /could/ cause issues, if the
> developer pulls certain tricks that gcc wouldn't detect, or possibly more
> likely, if used with code compiled by a different compiler (say
> binary-only applications you may run, which may not have been compiled
> with gcc). There are two reasons why I choose to use it despite the
> possible risks. One, I want /small/ code, again, because small code fits
> in that all-important cache better and therefore runs faster, and
> obviously, two or more merged constants aren't going to take the space
> they would if gcc stored them separately. Two, the risks aren't as bad if
> you aren't running non-gcc compiled code anyway, and since I'm a strong
> believer in Software Libre, if it's binary-only, there's very little
> chance I'll want or risk it on my box, and everything I do run is gcc
> compiled anyway, so should be generally safe. Still, I know there may be
> instances where I'll have to recompile with the flag turned off, and am
> prepared to deal with them when they happen, or I'd not have the flag in
> my CFLAGS.

You are referring a lot to the gcc manpage, but obviously you missed this part:

    -fomit-frame-pointer
        Don't keep the frame pointer in a register for functions that don't
        need one. This avoids the instructions to save, set up and restore
        frame pointers; it also makes an extra register available in many
        functions. It also makes debugging impossible on some machines.

        On some machines, such as the VAX, this flag has no effect, because
        the standard calling sequence automatically handles the frame
        pointer and nothing is saved by pretending it doesn't exist. The
        machine-description macro "FRAME_POINTER_REQUIRED" controls whether
        a target machine supports this flag.

        Enabled at levels -O, -O2, -O3, -Os.

I have to say that I am a bit disappointed now. You seemed to be one of those
people who actually inform themselves before sticking new flags into their
CFLAGS.

> And, here's some selected output from ei, interspersed with explanations,
> since I'm editing the output anyway:
>
> $ei
> !!! Failed to change nice value to '-2'
> !!! [Errno 13] Permission denied
>
> This is stderr output. It's not in the eis output above because I
> redirect stderr to /dev/null for it, as I know the reason for the error
> and am trying to be brief.
>
> The warning is because I'm using PORTAGE_NICENESS=-2 in make.conf. It has
> a negative nice set there to encourage portage to make fuller use of the
> dual CPUs under-X/from-a-konsole-session, as X and the kernel do some
> dynamic scheduling magic to keep X more responsive without having to up
> /its/ priority. The practical effect of that "magic" is to lower the
> priorities of everything besides X slightly, when X is running. This
> /does/ have the intended effect of keeping X more responsive, but the cost
> as observed here is that emerges take longer than they should when X is
> running, because the scheduler is leaving a bit of extra idle CPU time to
> keep X responsive. In many cases, I'd rather be using maximum CPU and get
> the merges done faster, even if X drags a bit in the mean time, and the
> slightly negative niceness for portage accomplishes exactly that.
>
> It's reporting a warning (to stderr) here, as I ran the command as a
> regular non-root user, and non-root can't set negative priorities for
> obvious system security reasons. I get the same warning with my ep*
> commands, which I normally run as a regular user, as well. The ea*
> commands which actually do the merging get run as root, naturally, so the
> niceness /can/ be set negative when it counts, during a real emerge.
>
> So... nothing of any real matter, then.
>
>
> !!! Relying on the shell to locate gcc, this may break
> !!! DISTCC, installing gcc-config and setting your current gcc
> !!! profile will fix this
>
> Another warning, likewise to stderr and thus not in the eis output. This
> one is due to the fact that eselect, the eventual systemwide replacement
> for gcc-config and a number of other commands, uses a different method to
> set the compiler than gcc-config did, and portage hasn't been adjusted to
> full compatibility just yet. Portage finds the proper gcc just fine for
> itself, but there'd be problems if distcc was involved, thus the warning.
I didn't know about this. Have you filed a bug on this topic yet? Or is there
one already?

> Again, I'm aware of the situation and the cause, but don't use distcc, so
> it's nothing I have to worry about, and I can safely ignore the warning.
>
> I kept the warnings here, as I find them and the explanation behind them
> interesting elements of my Gentoo environment, thus worth posting, for
> others who seem interested in my Gentoo environment as well. If nothing
> else, the explanations should help some in my audience understand that bit
> more about how their system operates, even if they don't get these
> warnings.

Indeed.

> Portage 2.1_pre4-r1 (default-linux/amd64/2006.0, gcc-4.1.0-beta20060127,
> glibc-2.3.6-r2, 2.6.15 x86_64)
> =================================================================
> System uname: 2.6.15 x86_64 AMD Opteron(tm) Processor 242
> Gentoo Base System version 1.12.0_pre15
>
> Those of you running stable amd64, but wondering where baselayout is for
> unstable, there you have it!
>
> ccache version 2.4 [enabled]
> dev-lang/python: 2.4.2
> sys-apps/sandbox: 1.2.17
> sys-devel/autoconf: 2.13, 2.59-r7
> sys-devel/automake: 1.4_p6, 1.5, 1.6.3, 1.7.9-r1, 1.8.5-r3, 1.9.6-r1
> sys-devel/binutils: 2.16.91.0.1
> sys-devel/libtool: 1.5.22
> virtual/os-headers: 2.6.11-r3
>
> ACCEPT_KEYWORDS="amd64 ~amd64"
>
> Same for the above portions of my toolchain. AFAIR, it's all ~amd64,
> altho I was running a still-masked binutils for awhile shortly after
> gcc-4.0 was released (still-masked on Gentoo as well), as it required the
> newer binutils.
>
> LANG="en_US"
> LDFLAGS="-Wl,-z,now"
>
> Some of you may have noticed the occasional Portage warning about SETUID
> executables using lazy bindings, and the potential security issue that
> causes. This setting for LDFLAGS forces early bindings with all
> dynamically linked libraries. Normally it'd only be necessary or
> recommended for SETUID executables, and set in the ebuild where it's safe
> to do so, but I use it by default, for several reasons. The effect is
> that a program takes a bit longer to load initially, but won't have to
> pause to resolve late bindings as they are needed. You're trading waiting
> at executable initialization for waiting at some other point. With a gig
Note that, depending on how many functions of a library/application you
actually use at run time, the drawback may be bigger or smaller.
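For what it's worth, you can check whether something was linked with early binding, or force the same behaviour at run time without relinking (readelf comes with binutils; /bin/ls is just an example binary):

```shell
# A binary linked with -Wl,-z,now carries a BIND_NOW (or FLAGS: NOW)
# entry in its dynamic section:
readelf -d /bin/ls | grep -E 'BIND_NOW|FLAGS' || true

# The dynamic linker can also be told to resolve everything at startup
# for a single run, regardless of how the binary was linked:
LD_BIND_NOW=1 /bin/ls >/dev/null
```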

> of memory, I find most stuff I run more than once is at least partially
> still in cache on the second and later launches, and with my system, I
> don't normally find the initial wait irritating, and sometimes find a
> pause after I'm working with a program especially so, so I prefer to have
> everything resolved and loaded at executable launch. Additionally, with
> lazy bindings, I've had programs start just fine, then fail later when
> they need to resolve some function that for some reason won't resolve in
> whatever library it's supposed to be coming from. I don't like having the
> thing fail and interrupt me in the middle of a task, and find it far less
> frustrating, if it's going to fail when it tries to load something, to
> have it do so at launch. Because early bindings force resolution of
> functions at launch, if it's going to fail loading one, it'll fail at
> launch, rather than after I've started working with the program. That's
> /exactly/ how I want it, so that's why I run the above LDFLAGS setting.
> It's nice not to have to worry about the security issue, but SETUID type
> security isn't as critical on my single-human-user system, where that
> single user is me and I already have root when I want it anyway, as it'd
> be in a multi-user system, particularly a public server, so the other
> reasons are more important than security, for me, on this. They just
> happen to coincide, so I'm a happy camper. =8^)
>
> The caveat with these LDFLAGS, however, is the rare case where there's a
> circular functional dependency that's normally self-resolving. Modular
> xorg triggers one such case, where the monolithic xorg didn't. There are
> three individual ebuilds related to modular xorg that I have to remove
> these LDFLAGS for or they won't work. xorg-server is one.
> xf86-video-ati, my video driver, is another. libdri was the third, IIRC.
> There's a specific order they have to be compiled in, as well. If they are
> compiled with this enabled, they, and consequently X, refuse to load (tho
> X will load without DRI, if that's the only one, it'll just protest in the
> log and DRI and glx aren't available). Evidently there's a non-critical
> fourth module somewhere, that still won't load properly due to an
> unresolved symbol, that I need to track down and remerge without these
> LDFLAGS, and that's what's keeping GLX from loading on my current system,
> as mentioned in an earlier post.
>
> LINGUAS="en"
> MAKEOPTS="-j4"
>
> The four jobs is nice for a dual-CPU system -- when it works.
> Unfortunately, the unpack and configure steps are serialized, so the jobs
> option does little good there. To make most efficient use of the
> available cycles when I have a lot to merge, therefore, I'll run as many
> as five merges in parallel. I do this quite regularly with KDE upgrades
> like the one to 3.5.1, where I use the split KDE ebuilds and have
> something north of 100 packages to merge before KDE is fully upgraded.

I really wonder how you would parallelize unpacking and configuring a package.

524 > I mentioned above that I often run eptree, then ea individual packages
525 > from the list. This is how I accomplish the five merges in parallel.
526 > I'll take a look at the tree output to check the dependencies, and merge
527 > the packages first that have several dependencies, but only where those
528 > dependencies aren't stepping on each other, thus keeping the parallel
529 > emerges from interfering with each other, because each one is doing its
530 > own dependencies, that aren't dependencies of any of the others. After I
531 > get as many of those going as I can, I'll start listing 3-5 individual
532 > packages without deps on the same ea command line. By the time I've
533 > gotten the fifth one started, one of the other sessions has usually
534 > finished or is close to it, so I can start it merging the next set of
535 > packages. With five merge sessions in parallel, I'm normally running an
536 > average load of 5 to 9, meaning that many applications are ready for CPU
537 > scheduling time at any instant, on average. If the load drops below four,
538 > there are probably idle CPU cycles being wasted that could otherwise be
539 > compiling stuff, as each CPU needs at least one load-point to stay busy,
540 > plus usually can schedule a second one for some cycles as well, while the
541 > first is waiting for the hard drive or whatever.
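The batching strategy described above -- start several independent merge sets, then start the next batch once they drain -- can be sketched with plain shell job control. This is a generic illustration, not Duncan's actual script; the `echo` commands are placeholders standing in for real ea/emerge invocations of non-overlapping dependency sets:

```shell
#!/bin/sh
# Sketch: run independent merge sets in parallel batches.
# Placeholder 'echo' commands stand in for real ea/emerge lines.
: > done.log

# Batch 1: three merge sets whose dependency trees don't overlap,
# so the parallel emerges can't step on each other.
sh -c 'echo set1 >> done.log' &
sh -c 'echo set2 >> done.log' &
sh -c 'echo set3 >> done.log' &
wait   # block until the whole batch has finished

# Batch 2: the next packages, started once the first batch drains.
sh -c 'echo set4 >> done.log' &
wait
```

On a dual-CPU box this keeps both processors loaded through the serialized unpack/configure phases that `MAKEOPTS="-j4"` alone can't parallelize.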
542 >
543 > (Note that I'm running a four-drive RAID, RAID-6, so two-way striped, for
544 > my main system, RAID-0, so 4-way striped, for $PORTAGE_TMPDIR, so hard
545 > drive latency isn't /nearly/ as high as it would be on a single-hard-drive
546 > system. Of course, running five merges in parallel /does/ increase disk
547 > latency some as well, but it /does/ seem to keep my load-average in the
548 > target zone and my idle cycles to a minimum, during the merge period.
549 > Also note that I've only recently added the PORTAGE_NICENESS value above,
550 > and haven't gotten it fully tweaked to the best balance between
551 > interactivity and emerge speed just yet, but from observations so far,
552 > with the niceness value set, I'll be able to keep the system busy with
553 > "only" 3-4 parallel merges, rather than the 5 I had been having to run to
554 > keep the system most efficiently occupied when I had a lot to merge.)
555 >
556 > PKGDIR="/pkg"
557 > PORTAGE_TMPDIR="/tmp"
558 > PORTDIR="/p"
559 > PORTDIR_OVERLAY="/l/p"
560 >
561 > Here you can see some of my path customization.
562 >
563 > USE="amd64 7zip X a52
564 > aac acpi alsa apm arts asf audiofile avi bash-completion berkdb
565 > bitmap-fonts bzip2 caps cdparanoia cdr crypt css cups curl dga divx4linux
566 > dlloader dri dts dv dvd dvdr dvdread eds emboss encode extrafilters fam
567 > fame ffmpeg flac font-server foomaticdb gdbm gif glibc-omitfp gpm
568 > gstreamer gtk2 idn imagemagick imlib ithreads jp2 jpeg jpeg2k kde
569 > kdeenablefinal lcms libwww linuxthreads-tls lm_sensors logitech-mouse
570 > logrotate lzo lzw lzw-tiff mad maildir mikmod mjpeg mng motif mozilla mp3
571 > mpeg ncurses network no-old-linux nolvm1 nomirrors nptl nptlonly offensive
572 > ogg opengl oss pam pcre pdflib perl pic png ppds python qt quicktime
573 > radeon readline scanner slang speex spell ssl tcltk theora threads tiff
574 > truetype truetype-fonts type1 type1-fonts usb userlocales vcd vorbis
575 > xcomposite xine xinerama xml2 xmms xosd xpm xrandr xv xvid yv12 zlib
576 > elibc_glibc input_devices_keyboard input_devices_mouse kernel_linux
577 > linguas_en userland_GNU video_cards_ati"
578 >
579 > My USE flags, FWTAR (for what they are worth). Of particular interest are
580 > the input_devices_mouse and keyboard, and video_cards_ati. These come
581 > from variables (INPUT_DEVICES and VIDEO_CARDS) set in make.conf, and used
582 > in the new xorg-modular ebuilds. These and the others listed after zlib
583 > are referred to by Gentoo devs as USE_EXPAND. Effectively, they are USE
584 > flags in the form of variables, setup that way because there are rather
585 > many possible values for those variables, too many to work as USE flags.
586 > The LINGUAS and LANG USE_EXPAND variables are prime examples. Consider
587 > how many different languages there are; if each were used and documented
588 > as a regular USE flag, it would have to go in use.local.desc, because few
589 > supporting packages would offer the same choices, so each flag would have
590 > to be listed separately for each package. Talk about the number of USE
591 > flags quickly getting out of control!
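To make the USE_EXPAND mechanism concrete: each value of an expanded variable becomes a prefixed USE flag. A make.conf fragment matching the flags visible in the USE list above would look like this (a sketch of the relevant lines only, not a full make.conf):

```shell
# /etc/make.conf (fragment) -- USE_EXPAND variables.
# Portage expands each value into a prefixed USE flag, so these lines...
VIDEO_CARDS="ati"
INPUT_DEVICES="keyboard mouse"
LINGUAS="en"
# ...surface in 'emerge --info' as:
#   video_cards_ati input_devices_keyboard input_devices_mouse linguas_en
```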
592 >
593 > Unset: ASFLAGS, CTARGET, EMERGE_DEFAULT_OPTS, LC_ALL
594 >
595 > OK, some loose ends to wrap up, and I'm done.
596 >
597 > re: gcc versions: The plan is for gcc-4.0 to go ~arch fairly soon, now.
598 > The devs are actively asking for bug reports involving it, now, so as many
599 > as possible can be resolved before it goes ~arch. (Formerly, they were
600 > recommending that bugs be filed upstream, and not with Gentoo unless there
601 > was a patch attached, as it was considered entirely unsupported, just
602 > there for those that wanted it anyway.) At this point, nearly everything
603 > should compile just fine with 4.0.
604 >
605 > That said, Gentoo has slotted gcc for a reason. It's possible to have
606 > multiple minor versions (3.3, 3.4, 4.0, 4.1) merged at the same time.
607 > With USE=multislot, that's actually microversion (4.0.0, 4.0.1, 4.0.2...).
608 > Using either gcc-config or eselect compiler, and discounting any CFLAG
609 > switching you may have to do, it's a simple matter to switch between
610 > merged versions. This made it easy to experiment with gcc-4.0 even tho
611 > Gentoo wasn't supporting it and certain packages wouldn't compile with
612 > 4.x, because it was always possible to switch to a 3.x version if
613 > necessary, and compile the package there. I did this quite regularly,
614 > using gcc-4.0 as my normal version, but reverting for individual packages
615 > as necessary, when they wouldn't compile with 4.0.
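The switch between slotted compilers is a two-command affair with gcc-config. A session sketch -- the profile names and package are illustrative, not a transcript from Duncan's box:

```shell
# List installed compiler profiles; '*' marks the active one.
gcc-config -l
#  [1] x86_64-pc-linux-gnu-3.4.5
#  [2] x86_64-pc-linux-gnu-4.0.2 *

# Revert to the 3.4 slot for a package that won't build with 4.x...
gcc-config 1
source /etc/profile   # pick up the new PATH in the current shell

# ...merge the stubborn package, then switch back.
emerge -v1 some-package
gcc-config 2
source /etc/profile
```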
616 >
617 > The same now applies to the 4.1.0-beta-snapshot series. Other than the
618 > compile time necessary to compile a new gcc when the snapshot comes out
619 > each week, it's easy to run the 4.1-beta as the main system compiler for
620 > as wide testing as possible, while reverting to 4.0 or 3.4 (I don't have a
621 > 3.3 slot merged) if needed.
622 >
623 > re: the performance improvements I saw that started this whole thing:
624 > These trace to several things, I believe. #1, with gcc-4.0, there's now
625 > support for -fvisibility -- setting certain functions as exported and
626 > visible externally, others not. That can easily cut exported symbols by a
627 > factor of 10. Exported symbols of course affect dynamic load-time, which
628 > of course gets magnified dramatically by my LDFLAGS early binding
629 > settings. When I first compiled KDE with that (there were several
630 > missteps early on in terms of KDE and Gentoo's support, but that aside),
631 > KDE appload times went down VERY NOTICEABLY! Again, due to my LDFLAGS,
632 > the effect was multiplied dramatically, but the effect is VERY real!
633 >
634 > Of course, that's mainly load-time performance. The run-time performance
635 > that we are actually talking here has other explanations. A big one is
636 > that gcc-4 was a HUGE rewrite, with a BIG potential to DRAMATICALLY
637 > improve gcc's performance. With 4.0, the theory is there, but in
638 > practice, it wasn't all that optimized just yet. In some ways it reverted
639 > behavior below that of the fairly mature 3.x series, altho the rewrite
640 > made things much simpler and less prone to error given its maturity. 4.1,
641 > however, is the first 4.x release to REALLY be hitting the potential of
642 > the 4.x series, and it appears the difference is very noticeable. Of
643 > course, there's a reason 4.1.0 is still in beta upstream and not supported
644 > by Gentoo either, as there are still known regressions. However, where it
645 > works, which it seems to do /most/ of the time, it **REALLY** works, or at
646 > least that's been my observation. 3.3 was a MAJOR improvement in gcc for
647 > amd64 users, because it was the first version where amd64 wasn't simply an
648 > add-on hack, as it had been with 3.2. The 3.4 upgrade was minor in
649 > comparison, and 4.0, while it's going ~arch shortly and sets the stage
650 > for a lot of future improvement, will be pretty minor in terms of actual
651 > improved performance as well. 4.1, however, when it is finally fully
652 > released, has the potential to be as big an improvement as 3.3 was -- that
653 > is, a HUGE one. I'm certainly looking forward to it, and meanwhile,
654 > running the snapshots, because Gentoo makes it easy to do so while
655 > maintaining the ability to switch very simply between multiple versions
656 > on the system.
657 >
658 > Both -freorder-blocks-and-partition and -fmerge-all-constants are new to
659 > me within a few days, now, and new to me with kde 3.5.1. Normally,
660 > individual flags won't make /that/ much of a difference, but it's possible
661 > I hit it lucky, with these. Actually, because they both match very well
662 > with and reinforce my strategy of targeting size, it's possible I'm only
663 > now unlocking the real potential behind size optimization. -- I **KNOW**
664 > there's a **HUGE** difference in sizes between resulting file-sizes. I
665 > compared 4.0.2 and 4.1.0-beta-snapshot file sizes for several modular-X
666 > files in the course of researching the missing symbols problem, and the
667 > difference was often a shrinkage of near 33 percent with 4.1 and my
668 > current CFLAGS as opposed to 4.0.2 without the new ones. Going the other
669 > way, that's a 50% larger file with 4.0.2 as compared to 4.1, 100KB vs
670 > 150KB, by way of example. That's a *HUGE* difference, one big enough to
671 > initially think I'd found the reason for the missing symbols right there,
672 > as the new files were simply too much smaller to look workable! Still, I
673 > traced the problem to LDFLAGS, so that wasn't it, and the files DO work,
674 > confirming things. I'm guessing -fmerge-all-constants plays a significant
675 > part in that. In any case, with that difference in size, and knowing how
676 > /much/ cache hit vs. miss affects performance, it's quite possible the
677 > size is the big performance factor. Of course, even if that's so, I'm not
678 > sure whether it is the CFLAGS or the 4.0 vs 4.1 that should get the credit.
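For reference, a size-targeting CFLAGS line of the sort described would look roughly like this -- the post never quotes the exact value, so this is an illustrative reconstruction built from the flags it does mention:

```shell
# /etc/make.conf (fragment) -- illustrative, not Duncan's exact line.
# -Os stands in for the size-targeting strategy; the two named flags
# are the ones credited above with the ~33% size shrinkage.
CFLAGS="-Os -fmerge-all-constants -freorder-blocks-and-partition"
CXXFLAGS="${CFLAGS}"
```

Smaller binaries mean more of the working set stays in cache, which is the mechanism proposed above for the run-time gains.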
679 >
680 > In any case, I'm a happy camper right now! =8^)
681
682
683 --
684 Simon Stelling
685 Gentoo/AMD64 Operational Co-Lead
686 blubb@g.o
687 --
688 gentoo-amd64@g.o mailing list

Replies

Subject Author
[gentoo-amd64] Re: Re: Re: Wow! KDE 3.5.1 & Xorg 7.0 w/ Composite Duncan <1i5t5.duncan@×××.net>
Re: [gentoo-amd64] Re: Re: Wow! KDE 3.5.1 & Xorg 7.0 w/ Composite "Kevin F. Quinn (Gentoo)" <kevquinn@g.o>