Gentoo Archives: gentoo-amd64

From: Duncan <1i5t5.duncan@×××.net>
To: gentoo-amd64@l.g.o
Subject: [gentoo-amd64] Re: Re: Wow! KDE 3.5.1 & Xorg 7.0 w/ Composite
Date: Fri, 03 Feb 2006 16:32:48
Message-Id: pan.2006.02.03.16.28.28.536378@cox.net
In Reply to: Re: [gentoo-amd64] Re: Wow! KDE 3.5.1 & Xorg 7.0 w/ Composite by Mike Owen
1 Mike Owen posted
2 <8f5ca2210602021712s53d33de5w6794fa384bbf93a5@××××××××××.com>, excerpted
3 below, on Thu, 02 Feb 2006 17:12:04 -0800:
4
5 > On 2/2/06, Duncan <1i5t5.duncan@×××.net> wrote:
6 >>
7 >> http://members.cox.net/pu61ic.1inux.dunc4n/
8 >
9 > Nice. Now let us know your CFLAGS, and what toolchain versions you're
10 > running :D
11
12 You probably didn't notice, as I had it commented out on the main index
13 page since I haven't yet created the page that actually lists them, but if
14 you viewed the source, you'd have seen a commented-out link to a techspecs
15 page that'll carry that sort of info, when/if I actually get it created.
16
17 However, since you asked, your answer, and a bit more, by way of
18 explanation...
19
20 I should really create a page listing all the little Gentoo admin scripts
21 I've come up with and how I use them. I'm sure at least a few folks would
22 find them useful.
23
24 The idea behind most of them is to create shortcuts so I don't have to type
25 in long emerge lines, with all sorts of arbitrary command line parameters.
26 The majority of these fall into two categories, ea* and ep*, short for
27 emerge --ask <additional parameters> and emerge --pretend ... . Thus, I
28 have epworld and eaworld, the pretend and ask versions of emerge -NuDv
29 world, epsys and easys, the same for system, eplog <package>, emerge
30 --pretend --log --verbose (package name to be added to the command line so
31 eplog gcc, for instance, to see the changes between my current and the new
32 version of gcc), eptree <package>, to use the tree output, etc.
33
34 One thing I've found is that I'll often epworld or eptreeworld, then
35 emerge the individual packages, rather than use eaworld to do it. That
36 way, I can do them in the order I want or do several at a time if I want
37 to make use of both CPUs. Because I always use --deep, as I want to keep
38 my dependencies updated as well, I'm very often merging specific
39 dependencies. There's a small problem with that, however: --oneshot, which
40 I'll always want to use with dependencies to help keep my world file
41 uncluttered, has no short form, yet I use it as my default! OTOH, the
42 normal portage mode of adding stuff listed on the command line to the
43 world file, I don't want very often, as most of the time I'm simply
44 updating what I have, so it's all in the world file if it needs to be
45 there already anyway. Not a problem! All my regular ea* scriptlets use
46 --oneshot, so it /is/ my default. If I *AM* merging something new that I
47 want added to my world file, I have another family of ea* scriptlets that
48 do that -- all ending in "2", as in, "NOT --oneshot". Thus, I have a
49 family of ea*2 scriptlets.
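
To make that concrete, the core of these scriptlets amounts to little more
than the following -- an illustrative sketch of the idea, not necessarily the
exact files; whether they live as tiny scripts in ~/bin or as shell functions,
the effect is the same:

# ea* / ep* scriptlets -- illustrative sketch only
epworld()     { emerge --pretend -NuDv world "$@"; }
eaworld()     { emerge --ask --oneshot -NuDv world "$@"; }
epsys()       { emerge --pretend -NuDv system "$@"; }
easys()       { emerge --ask --oneshot -NuDv system "$@"; }
eptree()      { emerge --pretend --tree --verbose "$@"; }
eptreeworld() { eptree -NuD world; }
ea()          { emerge --ask --oneshot "$@"; }  # my default: does NOT touch world
ea2()         { emerge --ask "$@"; }            # the "2" family: adds to world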
50
51 The regulars here already know one of my favorite portage features is
52 FEATURES=buildpkg, which I have set in make.conf. That of course gives me
53 a collection of binary versions of packages I've already emerged, so I
54 can quickly revert to an old version for testing something, if I want,
55 then remerge the new version once I've tested the old version to see if it
56 has the same bug I'm working on or not. To aid in this, I have a
57 collection of eppak and eapak scriptlets. Again, the portage default of
58 --usepkg (-k) doesn't fit my default needs, as if I'm using a binpkg,
59 I usually want to ONLY use a binpkg, NOT merge from source if the package
60 isn't available. That happens to be -K in short-form. However, it's my
61 default, so eapak invokes the -K version. I therefore have eapaK to
62 invoke the -k version if I don't really care whether it goes from binpkg
63 or source.
64
65 Of course, there are various permutations of the above as well, so I have
66 eapak2 and eapaK2, as well as eapak and eapaK. For the ep* versions, of
67 course the --oneshot doesn't make a difference, so I only have eppak and
68 eppaK, no eppa?2 scriptlets.
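
Again by way of a sketch (same caveat as above -- illustrative, not verbatim):

# binpkg-oriented scriptlets, built on the FEATURES=buildpkg collection
eapak()  { emerge --ask --oneshot --usepkgonly "$@"; }  # -K: binpkg ONLY, my default
eapaK()  { emerge --ask --oneshot --usepkg "$@"; }      # -k: binpkg if present, else source
eapak2() { emerge --ask --usepkgonly "$@"; }            # ditto, but added to world
eapaK2() { emerge --ask --usepkg "$@"; }
eppak()  { emerge --pretend --usepkgonly "$@"; }
eppaK()  { emerge --pretend --usepkg "$@"; }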
69
70 ... Deep breath... <g>
71
72 All that as a preliminary explanation to this: Along with the above, I
73 have a set of efetch functions, that invoke the -f form, so just do the
74 fetch, not the actual compile and merge, and esyn (there's already an
75 esync function in something or other I have merged so I just call it
76 esyn), which does emerge sync, then updates the esearch db, then
77 automatically fetches all the packages that an eaworld would want to
78 update, so they are ready for me to merge at my leisure.
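
A sketch of that one too (eupdatedb being the index updater that ships with
app-portage/esearch; the rest is plain emerge):

# esyn: sync, refresh the esearch index, pre-fetch the pending world update
esyn() {
    emerge --sync && \
    eupdatedb && \
    emerge --fetchonly -NuDv world
}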
79
80 Likewise, and the real reason for this whole explanation, I /had/ an
81 "einfo" scriptlet that simply ran "emerge info". This can be very handy
82 to run, if like me, you have several slotted versions of gcc merged, and
83 you sometimes forget which one you have eselected or gcc-configed as the
84 one portage will use. Likewise, it's useful for checking on CFLAGS (or
85 CXXFLAGS OR LDFLAGS or...), if you modified them from the normal ones
86 because a particular package wasn't cooperating, and you want to see if
87 you remembered to switch them back or not.
88
89 However, I ran into a problem. The output of einfo was too long to
90 quickly find the most useful info -- the stuff I most often change and
91 therefore most often am looking for.
92
93 No sweat! I shortened my original "einfo" to simply "ei", and added a
94 second script, "eis" (for einfo short), that simply piped the output of
95 the usual emerge info into a grep that only returned the lines I most
96 often need -- the big title one with gcc and similar info, CFLAGS,
97 CXXFLAGS, LDFLAGS, and FEATURES. USE would also be useful, but it's too
98 long even by itself to be searched at a glance, so if I want it, I simply
99 run ei and look for what I want in the longer output.
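
Both are trivial; roughly (same sort of sketch as the scriptlets above --
tweak the grep pattern to whatever lines you care about):

ei()  { emerge --info; }
eis() { emerge --info 2>/dev/null | \
        grep -E '^(Portage |CFLAGS|CXXFLAGS|LDFLAGS|FEATURES|MAKEOPTS)'; }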
100
101 ... Another deep breath... <g>
102
103 OK, with that as a preliminary, you should be able to understand the
104 following:
105
106 $ eis
107
108 Portage 2.1_pre4-r1 (default-linux/amd64/2006.0, gcc-4.1.0-beta20060127,
109 glibc-2.3.6-r2, 2.6.15 x86_64)
110
111 CFLAGS="-march=k8 -Os -pipe -fomit-frame-pointer -frename-registers
112 -funit-at-a-time -fweb -freorder-blocks-and-partition
113 -fmerge-all-constants"
114
115 CXXFLAGS="-march=k8 -Os -pipe -fomit-frame-pointer -frename-registers
116 -funit-at-a-time -fweb -freorder-blocks-and-partition
117 -fmerge-all-constants"
118
119 FEATURES="autoconfig buildpkg candy ccache confcache distlocks
120 multilib-strict parallel-fetch sandbox sfperms strict userfetch"
121
122 LDFLAGS="-Wl,-z,now"
123
124 MAKEOPTS="-j4"
125
126 To make sense of that...
127
128 * The portage and glibc versions are ~amd64, as set in make.conf for the
129 system in general.
130
131 * CFLAGS:
132
133 I choose -Os, optimize for size, because a modern CPU and the various
134 cache levels are FAR faster than main memory. This difference is
135 frequently severe enough that it's actually more efficient to optimize for
136 size than for CPU performance, because the result is smaller code that
137 maintains cache locality (stays in fast cache) far better, and the CPU
138 saves more time than it would otherwise spend idle, waiting for data
139 to come in from slower, more distant memory, than it loses to the
140 reduced cycle efficiency that's often the tradeoff for small code.
141
142 -O3, and to a lesser extent -O2, do things like turn a loop that executes
143 a fixed number of times, say 3, into "faster" code: they avoid the jump at
144 the end of each iteration back to the top of the loop by writing it out as
145 inline code, copying the loop instructions three times. In our example of
146 a 3-iteration fixed loop, this unrolling saves the expensive
147 jump back to the top of the loop two times -- but at the SAME time it
148 expands that section of code to three times its looped size.
149
150 Back when memory operated at or near the speed of the CPU, avoiding the
151 loop, even at the expense of three times the code, was often faster.
152 Today, where CPUs do several calculations in the time it takes to fetch
153 data from main memory, it's generally faster to go for the smaller code,
154 as it will be far more likely to still be in fast cache, avoiding that
155 long wait for main memory, even if it /does/ mean wasting a couple
156 additional cycles doing the expensive jump back to the top of the loop.
157
158 Of course, this is theory, and the practical case can and will differ
159 depending on the instructions actually being compiled. In particular,
160 streaming media apps and media encoding/decoding are likely to still
161 benefit from the traditional loop elimination style optimizations, because
162 they run thru so much data already, that cache is routinely trashed
163 anyway, regardless of the size of your instructions. As well, that type
164 of application tends to have a LOT of looping instructions to optimize!
165
166 By contrast, something like the kernel will benefit more than usual from
167 size optimization. First, it's always memory locked and as such
168 can't be swapped, and even "slow" main memory is still **MANY** **MANY**
169 times faster than swap, so a smaller kernel means more other stuff fits
170 into main memory with it, and isn't swapped as much. Second, parts of the
171 kernel such as task scheduling are executed VERY often, either because
172 they are frequently executed by most processes, or because they /control/
173 those processes. The smaller these are, the more likely they are to still
174 be in cache when next used. Likewise, the smaller they are, the less
175 potentially still useful other data gets flushed out of cache to make room
176 for the kernel code executing at the moment. Third, while there's a lot
177 of kernel code that will loop, and a lot that's essentially streaming, the
178 kernel as a whole is a pretty good mix of code and thus won't benefit as
179 much from loop optimizations and the like, as compared to special purpose
180 code like the media codec and streaming applications above.
181
182 The differences are marked enough and now demonstrated enough that a
183 kernel config option to optimize for size was added I believe about a year
184 ago. Evidently, that led to even MORE demonstration, as the option was
185 originally in the obscure embedded optimizations corner of the config,
186 where few would notice or use it, and they upgraded it into a main option.
187 In fact, where a year or two ago, the option didn't even exist, now I
188 believe it defaults to yes/on/do-optimize-for-size (altho it's possible
189 I'm incorrect on the last and it's not yet the default).
190
191 According to the gcc manpage, -frename-registers causes gcc to attempt to
192 make use of registers left over after normal register allocation. This is
193 particularly beneficial on archs that have many registers (keeping in
194 mind that "registers" are what amounts to L0 cache, the fastest possible
195 memory because the CPU accesses registers directly and they operate at
196 full CPU speed; unfortunately, registers are also very limited, making
197 them an EXCEEDINGLY valuable resource). Note that while x86-32 is noted
198 for its relative /lack/ of registers, AMD basically doubled the number of
199 registers available to 64-bit code in its x86-64 aka AMD64 spec. Thus,
200 while this option wouldn't be of particular benefit on x86, on amd64, it
201 can, depending on the code of course, provide some rather serious
202 optimization!
203
204 -fweb is a register use optimizer function as well. It tells gcc to
205 create a /web/ of dependencies and assign each individual dependency web
206 to its own pseudo-register. Thus, when it comes time for gcc to allocate
207 registers, it already has a list of the best candidates lined up and ready
208 to go. Combined with -frename-registers, which tells gcc to efficiently make
209 use of any registers left over after the first pass, and due to the
210 number of registers available in 64-bit mode on our arch, this can allow
211 some seriously powerful optimizations. Still, a couple of things to note
212 about it. One, -fweb (and -frename-registers as well) can cause data to
213 move out of its "home" register, which seriously complicates debugging, if
214 you are a programmer or power-user enough to worry about such things.
215 Two, the rewrite for gcc 4.0 significantly modified the functionality of
216 -fweb, and it wasn't recommended for 4.0 as it didn't yet work as well as
217 expected or as it did with gcc 3.x. For gcc 4.1, -fweb is apparently back
218 to its traditional strength. Those Gentoo users having gcc 3.4, 4.0, and
219 4.1, all three in separate slots, will want to note this as they change
220 gcc configurations, and modify it accordingly. Yes, this *IS* one of the
221 reasons my CFLAGS change so frequently!
222
223 -funit-at-a-time tells gcc to consider a full logical unit, perhaps
224 consisting of several source files rather than just one, as a whole, when
225 it does its compiling. Of course, this allows gcc to make
226 optimizations it couldn't see if it wasn't looking at the larger picture
227 as a whole, but it requires rather more memory, to hold the entire unit
228 so it can consider it at once. This is a fairly new flag, introduced with
229 gcc 3.3 IIRC. While the idea is simple enough and shouldn't lead to any
230 bugs on its own, there WERE a number of previously unencountered bugs
231 in various code that this flag exposed, when GCC made optimizations on the
232 entire unit that it wouldn't otherwise make, thereby triggering bugs that
233 had never been triggered before. I /believe/ this was the root reason why
234 the Gentoo amd64 technotes originally discouraged use of -Os, back with
235 the first introduction of this flag in gcc 3.2 hammer (amd64) edition, as
236 -funit-at-a-time was activated by -Os at that time, and -Os was known to
237 produce bad code at the time, on amd64, with packages like portions of
238 KDE. The gcc 4.1.0 manpage now says it's enabled by default at -O2 and
239 -O3, but doesn't mention -Os. Whether that's an omission, or whether they
240 decided it shouldn't be enabled by -Os for some reason, I'm not sure, but
241 I use them both to be sure and haven't had any issues I can trace to this
242 (not even back when the technotes recommended against -Os, and said KDE
243 was supposed to have trouble with it -- maybe it was parts of KDE I never
244 merged, or maybe I was just lucky, but I've simply never had an issue with
245 it).
246
247 -freorder-blocks-and-partition is new for gcc 4.0, I believe, altho I
248 didn't discover it until I was reading the 4.1-beta manpage. I KNOW gcc
249 3.4.4 fails out with it, saying unrecognized flag or some such, so it's
250 another of those flags that cause my CFLAGS to be constantly changing, as
251 I switch between gcc versions. This flag won't work under all conditions,
252 according to the manpage, so is automatically disabled in the presence of
253 exception handling, and a few other situations named in the manpage. It
254 causes a lot of warnings too, to the effect that it's being disabled due
255 to X reason. There's a similar -freorder-blocks flag, which optimizes by
256 reordering blocks in a function to "reduce number of taken branches and
257 improve code locality." In English, what that means is that it breaks
258 caching less often. Again, caching is *EXTREMELY* performance critical,
259 so anything that breaks it less often is CERTAINLY welcome! The
260 -and-partition increases the effect, by separating the code into
261 frequently used and less frequently used partitions. This keeps the most
262 frequently used code all together, therefore keeping it in cache far more
263 efficiently, since the less used code won't be constantly pulled in,
264 forcing out frequently used code in the process.
265
266 Hmm... As I'm writing and thinking about this, it occurs to me that
267 sticking the regular -freorder-blocks option in CFLAGS as well would
268 probably be a wise thing. The non-partition version isn't as efficient as
269 the partition version, and would be redundant if the partitioned version
270 is in effect. However, the non-partitioned version doesn't have the same
271 sorts of no-exceptions-handler and similar restrictions, so having it in
272 the list, first, so the partitioned version overrides it where it can be
273 used, should be a good idea. That way, where the partitioned version can
274 be used, it will be, but where it can't, gcc will still use the
275 non-partitioned version of the option, so I'll still get /some/ of the
276 optimizations! I (re)compiled major portions of xorg (modular), qt, and
277 the new kde 3.5.1 with the partitioned option, however, and it works. I
278 haven't tested having both options in there yet, tho, so I'm not sure it'll
279 work as the theory suggests it should; some caution might be advised.
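
For reference, the ordering idea would look something like this in make.conf
(a sketch I haven't tested yet, as just noted):

# Plain -freorder-blocks first; the partitioned variant takes over wherever it
# can be used, while plain block reordering still applies where the
# partitioned version gets auto-disabled (exception handling, etc.).
CFLAGS="-march=k8 -Os -pipe -fomit-frame-pointer -frename-registers \
        -funit-at-a-time -fweb -freorder-blocks \
        -freorder-blocks-and-partition -fmerge-all-constants"
CXXFLAGS="${CFLAGS}"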
280
281 -fmerge-all-constants COULD be dangerous with SOME code, as it breaks part
282 of the C/C++ specification. However, it should be fine for most code
283 written to be compiled with gcc, and I've seen no problems /yet/ tho both
284 this and the reorder-and-partition flag above are fairly new to my CFLAGS,
285 so haven't been as extensively personally tested as the others have been.
286 If something seems to be breaking when this is in your CFLAGS, certainly
287 it's the first thing I'd try pulling out. What it actually does is merge
288 all constants with the same value into the same one. gcc has a weaker
289 -fmerge-constants version that's enabled with any -O option at all (thus
290 at -O, -O2, -O3, AND -Os), that merges all declared constants of the same
291 value, which is safe and doesn't conflict with the C/C++ spec. What the
292 /all/ specifier in there does, however, is cause gcc to merge declared
293 variables where the value actually never changes, so they are in effect
294 constants, altho they are declared as variables, with other constants of
295 the same value. This /should/ be safe, /provided/ gcc isn't failing to
296 detect a variable change somewhere, but it conflicts with the C/C++ spec,
297 according to the gcc manpage, and thus /could/ cause issues, if the
298 developer pulls certain tricks that gcc wouldn't detect, or possibly more
299 likely, if used with code compiled by a different compiler (say
300 binary-only applications you may run, which may not have been compiled
301 with gcc). There are two reasons why I choose to use it despite the
302 possible risks. One, I want /small/ code, again, because small code fits
303 in that all-important cache better and therefore runs faster, and
304 obviously, two or more merged constants aren't going to take the space
305 they would if gcc stored them separately. Two, the risks aren't as bad if
306 you aren't running non-gcc compiled code anyway, and since I'm a strong
307 believer in Software Libre, if it's binary-only, there's very little
308 chance I'll want or risk it on my box, and everything I do run is gcc
309 compiled anyway, so should be generally safe. Still, I know there may be
310 instances where I'll have to recompile with the flag turned off, and am
311 prepared to deal with them when they happen, or I'd not have the flag in
312 my CFLAGS.
313
314
315 And, here's some selected output from ei, interspersed with explanations,
316 since I'm editing the output anyway:
317
318 $ ei
319 !!! Failed to change nice value to '-2'
320 !!! [Errno 13] Permission denied
321
322 This is stderr output. It's not in the eis output above because I
323 redirect stderr to /dev/null for it, as I know the reason for the error
324 and am trying to be brief.
325
326 The warning is because I'm using PORTAGE_NICENESS=-2 in make.conf. It has
327 a negative nice set there to encourage portage to make fuller use of the
328 dual CPUs under-X/from-a-konsole-session, as X and the kernel do some
329 dynamic scheduling magic to keep X more responsive without having to up
330 /its/ priority. The practical effect of that "magic" is to lower the
331 priorities of everything besides X slightly, when X is running. This
332 /does/ have the intended effect of keeping X more responsive, but the cost
333 as observed here is that emerges take longer than they should when X is
334 running, because the scheduler is leaving a bit of extra idle CPU time to
335 keep X responsive. In many cases, I'd rather be using maximum CPU and get
336 the merges done faster, even if X drags a bit in the meantime, and the
337 slightly negative niceness for portage accomplishes exactly that.
338
339 It's reporting a warning (to stderr) here, as I ran the command as a
340 regular non-root user, and non-root can't set negative priorities for
341 obvious system security reasons. I get the same warning with my ep*
342 commands, which I normally run as a regular user, as well. The ea*
343 commands which actually do the merging get run as root, naturally, so the
344 niceness /can/ be set negative when it counts, during a real emerge.
345
346 So... nothing of any real matter, then.
347
348
349 !!! Relying on the shell to locate gcc, this may break
350 !!! DISTCC, installing gcc-config and setting your current gcc
351 !!! profile will fix this
352
353 Another warning, likewise to stderr and thus not in the eis output. This
354 one is due to the fact that eselect, the eventual systemwide replacement
355 for gcc-config and a number of other commands, uses a different method to
356 set the compiler than gcc-config did, and portage hasn't been adjusted to
357 full compatibility just yet. Portage finds the proper gcc just fine for
358 itself, but there'd be problems if distcc was involved, thus the warning.
359
360 Again, I'm aware of the situation and the cause, but don't use distcc, so
361 it's nothing I have to worry about, and I can safely ignore the warning.
362
363 I kept the warnings here, as I find them and the explanation behind them
364 interesting elements of my Gentoo environment, thus worth posting for
365 others who seem interested as well. If nothing
366 else, the explanations should help some in my audience understand that bit
367 more about how their system operates, even if they don't get these
368 warnings.
369
370
371 Portage 2.1_pre4-r1 (default-linux/amd64/2006.0, gcc-4.1.0-beta20060127,
372 glibc-2.3.6-r2, 2.6.15 x86_64)
373 =================================================================
374 System uname: 2.6.15 x86_64 AMD Opteron(tm) Processor 242
375 Gentoo Base System version 1.12.0_pre15
376
377 Those of you running stable amd64, but wondering where baselayout is for
378 unstable, there you have it!
379
380 ccache version 2.4 [enabled]
381 dev-lang/python: 2.4.2
382 sys-apps/sandbox: 1.2.17
383 sys-devel/autoconf: 2.13, 2.59-r7
384 sys-devel/automake: 1.4_p6, 1.5, 1.6.3, 1.7.9-r1, 1.8.5-r3, 1.9.6-r1
385 sys-devel/binutils: 2.16.91.0.1
386 sys-devel/libtool: 1.5.22
387 virtual/os-headers: 2.6.11-r3
388
389 ACCEPT_KEYWORDS="amd64 ~amd64"
390
391 Same for the above portions of my toolchain. AFAIR, it's all ~amd64,
392 altho I was running a still-masked binutils for a while shortly after
393 gcc-4.0 was released (still-masked on Gentoo as well), as it required the
394 newer binutils.
395
396 LANG="en_US"
397 LDFLAGS="-Wl,-z,now"
398
399 Some of you may have noticed the occasional Portage warning about SETUID
400 executables using lazy bindings, and the potential security issue that
401 causes. This setting for LDFLAGS forces early bindings with all
402 dynamically linked libraries. Normally it'd only be necessary or
403 recommended for SETUID executables, and set in the ebuild where it's safe
404 to do so, but I use it by default, for several reasons. The effect is
405 that a program takes a bit longer to load initially, but won't have to
406 pause to resolve late bindings as they are needed. You're trading waiting
407 at executable initialization for waiting at some other point. With a gig
408 of memory, I find most stuff I run more than once is at least partially
409 still in cache on the second and later launches, and with my system, I
410 don't normally find the initial wait irritating, and sometimes find a
411 pause after I'm working with a program especially so, so I prefer to have
412 everything resolved and loaded at executable launch. Additionally, with
413 lazy bindings, I've had programs start just fine, then fail later when
414 they need to resolve some function that for some reason won't resolve in
415 whatever library it's supposed to be coming from. I don't like having the
416 thing fail and interrupt me in the middle of a task, and find it far less
417 frustrating, if it's going to fail when it tries to load something, to
418 have it do so at launch. Because early binding forces resolution of
419 functions at launch, if it's going to fail loading one, it'll fail at
420 launch, rather than after I've started working with the program. That's
421 /exactly/ how I want it, so that's why I run the above LDFLAGS setting.
422 It's nice not to have to worry about the security issue, but SETUID type
423 security isn't as critical on my single-human-user system, where that
424 single user is me and I already have root when I want it anyway, as it'd
425 be in a multi-user system, particularly a public server, so the other
426 reasons are more important than security, for me, on this. They just
427 happen to coincide, so I'm a happy camper. =8^)
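
If you want to check whether a given binary actually got the early-binding
treatment, binutils' readelf will tell you. A quick sketch, with
/usr/bin/someprogram standing in for whatever you're checking:

readelf -d /usr/bin/someprogram | grep BIND_NOW

A binary linked with -Wl,-z,now should show a BIND_NOW entry in its dynamic
section; a lazily-bound one won't.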
428
429 The caveat with these LDFLAGS, however, is the rare case where there's a
430 circular functional dependency that's normally self-resolving. Modular
431 xorg triggers one such case, where the monolithic xorg didn't. There are
432 three individual ebuilds related to modular xorg that I have to remove
433 these LDFLAGS for, or they won't work. xorg-server is one.
434 xf86-video-ati, my video driver, is another. libdri was the third, IIRC.
435 There's a specific order they have to be compiled in, as well. If they are
436 compiled with this enabled, they, and consequently X, refuse to load (tho
437 X will load without DRI, if that's the only one, it'll just protest in the
438 log and DRI and glx aren't available). Evidently there's a non-critical
439 fourth module somewhere, that still won't load properly due to an
440 unresolved symbol, that I need to track down and remerge without these
441 LDFLAGS, and that's what's keeping GLX from loading on my current system,
442 as mentioned in an earlier post.
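
One simple way to drop the flags for just those merges is overriding the
variable on the emerge command line -- a sketch, relying on the usual behavior
that environment settings override make.conf for that one invocation:

LDFLAGS="" emerge --ask --oneshot xorg-server
LDFLAGS="" emerge --ask --oneshot xf86-video-ati
LDFLAGS="" emerge --ask --oneshot libdri    # the third one, IIRC, as noted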
443
444 LINGUAS="en"
445 MAKEOPTS="-j4"
446
447 The four jobs are nice for a dual-CPU system -- when they work.
448 Unfortunately, the unpack and configure steps are serialized, so the jobs
449 option does little good there. To make the most efficient use of the
450 available cycles when I have a lot to merge, therefore, I'll run as many
451 as five merges in parallel. I do this quite regularly with KDE upgrades
452 like the one to 3.5.1, where I use the split KDE ebuilds and have
453 something north of 100 packages to merge before KDE is fully upgraded.
454
455 I mentioned above that I often run eptree, then ea individual packages
456 from the list. This is how I accomplish the five merges in parallel.
457 I'll take a look at the tree output to check the dependencies, and merge
458 the packages first that have several dependencies, but only where those
459 dependencies aren't stepping on each other, thus keeping the parallel
460 emerges from interfering with each other, because each one is doing its
461 own dependencies, that aren't dependencies of any of the others. After I
462 get as many of those going as I can, I'll start listing 3-5 individual
463 packages without deps on the same ea command line. By the time I've
464 gotten the fifth one started, one of the other sessions has usually
465 finished or is close to it, so I can start it merging the next set of
466 packages. With five merge sessions in parallel, I'm normally running an
467 average load of 5 to 9, meaning that many applications are ready for CPU
468 scheduling time at any instant, on average. If the load drops below four,
469 there are probably idle CPU cycles being wasted that could otherwise be
470 compiling stuff, as each CPU needs at least one load-point to stay busy,
471 plus usually can schedule a second one for some cycles as well, while the
472 first is waiting for the hard drive or whatever.
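
In shell terms the workflow is roughly this, spread across several konsole
sessions (the package names are just placeholders for whatever the tree
output suggests):

eptreeworld                           # survey the dependency tree first
# session 1: a branch whose deps don't overlap the others
ea kde-base/kdelibs
# session 2: another non-overlapping branch
ea kde-base/kdebase
# later, batch up a few leaf packages with no remaining deps:
ea kde-base/kwin kde-base/konsole kde-base/kate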
473
474 (Note that I'm running a four-drive RAID: RAID-6, so two-way striped, for
475 my main system, and RAID-0, so 4-way striped, for $PORTAGE_TMPDIR, so hard
476 drive latency isn't /nearly/ as high as it would be on a single-hard-drive
477 system. Of course, running five merges in parallel /does/ increase disk
478 latency some as well, but it /does/ seem to keep my load-average in the
479 target zone and my idle cycles to a minimum, during the merge period.
480 Also note that I've only recently added the PORTAGE_NICENESS value above,
481 and haven't gotten it fully tweaked to the best balance between
482 interactivity and emerge speed just yet, but from observations so far,
483 with the niceness value set, I'll be able to keep the system busy with
484 "only" 3-4 parallel merges, rather than the 5 I had been having to run to
485 keep the system most efficiently occupied when I had a lot to merge.)
486
487 PKGDIR="/pkg"
488 PORTAGE_TMPDIR="/tmp"
489 PORTDIR="/p"
490 PORTDIR_OVERLAY="/l/p"
491
492 Here you can see some of my path customization.
493
494 USE="amd64 7zip X a52
495 aac acpi alsa apm arts asf audiofile avi bash-completion berkdb
496 bitmap-fonts bzip2 caps cdparanoia cdr crypt css cups curl dga divx4linux
497 dlloader dri dts dv dvd dvdr dvdread eds emboss encode extrafilters fam
498 fame ffmpeg flac font-server foomaticdb gdbm gif glibc-omitfp gpm
499 gstreamer gtk2 idn imagemagick imlib ithreads jp2 jpeg jpeg2k kde
500 kdeenablefinal lcms libwww linuxthreads-tls lm_sensors logitech-mouse
501 logrotate lzo lzw lzw-tiff mad maildir mikmod mjpeg mng motif mozilla mp3
502 mpeg ncurses network no-old-linux nolvm1 nomirrors nptl nptlonly offensive
503 ogg opengl oss pam pcre pdflib perl pic png ppds python qt quicktime
504 radeon readline scanner slang speex spell ssl tcltk theora threads tiff
505 truetype truetype-fonts type1 type1-fonts usb userlocales vcd vorbis
506 xcomposite xine xinerama xml2 xmms xosd xpm xrandr xv xvid yv12 zlib
507 elibc_glibc input_devices_keyboard input_devices_mouse kernel_linux
508 linguas_en userland_GNU video_cards_ati"
509
510 My USE flags, FWTAR (for what they are worth). Of particular interest are
511 the input_devices_mouse and keyboard, and video_cards_ati. These come
512 from variables (INPUT_DEVICES and VIDEO_CARDS) set in make.conf, and used
513 in the new xorg-modular ebuilds. These and the others listed after zlib
514 are referred to by Gentoo devs as USE_EXPAND. Effectively, they are USE
515 flags in the form of variables, setup that way because there are rather
516 many possible values for those variables, too many to work as USE flags.
517 The LINGUAS and LANG USE_EXPAND variables are prime examples. Consider
518 how many different languages there are: if each were used and documented
519 as a regular USE flag, it would have to go in use.local.desc, because few
520 supporting packages would offer the same choices, so each would have to be
521 listed separately for each package. Talk about the number of USE flags
522 quickly getting out of control!
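
In make.conf, those expanded variables are simply set directly; matching the
values visible in the USE line above, the relevant lines look like this:

VIDEO_CARDS="ati"
INPUT_DEVICES="keyboard mouse"
LINGUAS="en"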
523
524 Unset: ASFLAGS, CTARGET, EMERGE_DEFAULT_OPTS, LC_ALL
525
526 OK, some loose ends to wrapup, and I'm done.
527
528 re: gcc versions: The plan is for gcc-4.0 to go ~arch fairly soon now.
529 The devs are actively asking for bug reports involving it, so as many
530 as possible can be resolved before it goes ~arch. (Formerly, they were
531 recommending that bugs be filed upstream, and not with Gentoo unless there
532 was a patch attached, as it was considered entirely unsupported, just
533 there for those that wanted it anyway.) At this point, nearly everything
534 should compile just fine with 4.0.
535
536 That said, Gentoo has slotted gcc for a reason. It's possible to have
537 multiple minor versions (3.3, 3.4, 4.0, 4.1) merged at the same time.
538 With USE=multislot, that's actually microversion (4.0.0, 4.0.1, 4.0.2...).
539 Using either gcc-config or eselect compiler, and discounting any CFLAGS
540 switching you may have to do, it's a simple matter to switch between
541 merged versions. This made it easy to experiment with gcc-4.0 even tho
542 Gentoo wasn't supporting it and certain packages wouldn't compile with
543 4.x, because it was always possible to switch to a 3.x version if
544 necessary, and compile the package there. I did this quite regularly,
545 using gcc-4.0 as my normal version, but reverting for individual packages
546 as necessary, when they wouldn't compile with 4.0.
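
The switch itself is only a couple of commands; roughly (the profile name is
whatever gcc-config -l reports on your own box, so the one below is only an
example, and the eselect module's exact syntax may differ):

gcc-config -l                           # list installed compiler profiles
gcc-config x86_64-pc-linux-gnu-3.4.4    # pick one (example name only)
source /etc/profile                     # pick up the new environment
# or, via the newer eselect interface:
eselect compiler list
eselect compiler set <profile>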
547
548 The same now applies to the 4.1.0-beta-snapshot series. Other than the
549 compile time necessary to compile a new gcc when the snapshot comes out
550 each week, it's easy to run the 4.1-beta as the main system compiler for
551 as wide testing as possible, while reverting to 4.0 or 3.4 (I don't have a
552 3.3 slot merged) if needed.
553
554 re: the performance improvements I saw that started this whole thing:
555 These trace to several things, I believe. #1, with gcc-4.0, there's now
556 support for -fvisibility -- setting certain functions as exported and
557 visible externally, others not. That can easily cut exported symbols by a
558 factor of 10. Exported symbols of course affect dynamic load-time, which
559 of course gets magnified dramatically by my LDFLAGS early binding
560 settings. When I first compiled KDE with that (there were several
561 missteps early on in terms of KDE and Gentoo's support, but that aside),
562 KDE appload times went down VERY NOTICEABLY! Again, due to my LDFLAGS,
563 the effect was multiplied dramatically, but the effect is VERY real!
564
565 Of course, that's mainly load-time performance. The run-time performance
566 that we are actually talking about here has other explanations. A big one is
567 that gcc-4 was a HUGE rewrite, with a BIG potential to DRAMATICALLY
568 improve gcc's performance. With 4.0, the theory is there, but in
569 practice, it wasn't all that optimized just yet. In some ways it reverted
570 behavior below that of the fairly mature 3.x series, altho the rewrite
571 made things much simpler and less prone to error given its maturity. 4.1,
572 however, is the first 4.x release to REALLY be hitting the potential of
573 the 4.x series, and it appears the difference is very noticeable. Of
574 course, there's a reason 4.1.0 is still in beta upstream and not supported
575 by Gentoo either, as there are still known regressions. However, where it
576 works, which it seems to do /most/ of the time, it **REALLY** works, or at
577 least that's been my observation. 3.3 was a MAJOR improvement in gcc for
578 amd64 users, because it was the first version where amd64 wasn't simply an
579 add-on hack, as it had been with 3.2. The 3.4 upgrade was minor in
580 comparison, and 4.0, while it's going ~arch shortly and sets the stage for
581 a lot of future improvement, will be pretty minor in terms of actual
582 performance improvement as well. 4.1, however, when it is finally fully
583 released, has the potential to be as big an improvement as 3.3 was -- that
584 is, a HUGE one. I'm certainly looking forward to it, and meanwhile,
585 running the snapshots, because Gentoo makes it easy to do so while
586 maintaining the ability to switch very simply between multiple versions
587 on the system.
588
589 Both -freorder-blocks-and-partition and -fmerge-all-constants are new to
590 me within a few days, now, and new to me with kde 3.5.1. Normally,
591 individual flags won't make /that/ much of a difference, but it's possible
592 I hit it lucky, with these. Actually, because they both match very well
593 with and reinforce my strategy of targeting size, it's possible I'm only
594 now unlocking the real potential behind size optimization. -- I **KNOW**
595 there's a **HUGE** difference in sizes between resulting file-sizes. I
596 compared 4.0.2 and 4.1.0-beta-snapshot file sizes for several modular-X
597 files in the course of researching the missing symbols problem, and the
598 difference was often a shrinkage of near 33 percent with 4.1 and my
599 current CFLAGS as opposed to 4.0.2 without the new ones. Going the other
600 way, that's a 50% larger file with 4.0.2 as compared to 4.1, 100KB vs
601 150KB, by way of example. That's a *HUGE* difference, one big enough to
602 make me initially think I'd found the reason for the missing symbols right
603 there, as the new files were simply too much smaller to look workable!
604 Still, I traced the problem to LDFLAGS, so that wasn't it, and the files
605 DO work, confirming things. I'm guessing -fmerge-all-constants plays a significant
606 part in that. In any case, with that difference in size, and knowing how
607 /much/ cache hit vs. miss affects performance, it's quite possible the
608 size is the big performance factor. Of course, even if that's so, I'm not
609 sure whether it is the CFLAGS or the 4.0 vs 4.1 that should get the credit.
610
611 In any case, I'm a happy camper right now! =8^)
612
613
614 --
615 Duncan - List replies preferred. No HTML msgs.
616 "Every nonfree program has a lord, a master --
617 and if you use the program, he is your master." Richard Stallman in
618 http://www.linuxdevcenter.com/pub/a/linux/2004/12/22/rms_interview.html
619
620
621 --
622 gentoo-amd64@g.o mailing list
