Mike Owen posted
<8f5ca2210602021712s53d33de5w6794fa384bbf93a5@××××××××××.com>, excerpted
below, on Thu, 02 Feb 2006 17:12:04 -0800:

> On 2/2/06, Duncan <1i5t5.duncan@×××.net> wrote:
>>
>> http://members.cox.net/pu61ic.1inux.dunc4n/
>
> Nice. Now let us know your CFLAGS, and what toolchain versions you're
> running :D

You probably didn't notice, as I had it commented out on the main index
page as I don't have the page created to actually list them yet, but if
you viewed source, you'd have seen I have a techspecs page link commented
out, that'll get that sort of info, when/if I actually get it created.

However, since you asked, your answer, and a bit more, by way of
explanation...

I should really create a page listing all the little Gentoo admin scripts
I've come up with and how I use them. I'm sure a few folks anyway would
likely find them useful.

The idea behind most of them is to create shortcuts, so I don't have to
type in long emerge lines with all sorts of arbitrary command line
parameters. The majority of these fall into two categories, ea* and ep*,
short for emerge --ask <additional parameters> and emerge --pretend ... .
Thus, I have epworld and eaworld, the pretend and ask versions of emerge
-NuDv world; epsys and easys, the same for system; eplog <package>, for
emerge --pretend --changelog --verbose (with the package name added to the
command line, so eplog gcc, for instance, to see the changes between my
current and the new version of gcc); eptree <package>, to use the tree
output; etc.

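As a rough sketch, the ep*/ea* pairs described above could be written as
shell functions like the following. The function bodies are assumptions
reconstructed from the description, not the actual scriptlets, and the
EMERGE variable is a hypothetical hook added here so the commands can be
dry-run with EMERGE=echo.

```shell
# Hypothetical hook: set EMERGE=echo to print the command line instead of
# running it. Not part of the original scriptlets.
EMERGE=${EMERGE:-emerge}

# -NuDv spelled out: --newuse --update --deep --verbose
epworld() { "$EMERGE" --pretend --newuse --update --deep --verbose world "$@"; }
eaworld() { "$EMERGE" --ask --newuse --update --deep --verbose world "$@"; }
epsys()   { "$EMERGE" --pretend --newuse --update --deep --verbose system "$@"; }
easys()   { "$EMERGE" --ask --newuse --update --deep --verbose system "$@"; }

# Package names are passed through, e.g. "eplog gcc" or "eptree gcc".
eplog()   { "$EMERGE" --pretend --changelog --verbose "$@"; }
eptree()  { "$EMERGE" --pretend --tree --verbose "$@"; }
```

Running "EMERGE=echo; eplog gcc" prints the emerge arguments without
touching portage, which is a cheap way to check a new wrapper before
trusting it.
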
One thing I've found is that I'll often epworld or eptreeworld, then
emerge the individual packages, rather than use eaworld to do it. That
way, I can do them in the order I want or do several at a time if I want
to make use of both CPUs. Because I always use --deep, as I want to keep
my dependencies updated as well, I'm very often merging specific
dependencies. There's a small problem with that, however: --oneshot, which
I'll always want to use with dependencies to help keep my world file
uncluttered, has no short form, but I use it as the default! OTOH, the
normal portage mode of adding stuff listed on the command line to the
world file, I don't want very often, as most of the time I'm simply
updating what I have, so it's all already in the world file if it needs
to be there anyway. Not a problem! All my regular ea* scriptlets use
--oneshot, so it /is/ my default. If I *AM* merging something new that I
want added to my world file, I have another family of ea* scriptlets that
do that -- all ending in "2", as in, "NOT --oneshot". Thus, I have a
family of ea*2 scriptlets.

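The --oneshot default and its "2" counterpart might look roughly like
this. It is a sketch only: the bodies are guesses at what an ea/ea2 pair
would contain, and EMERGE is a hypothetical dry-run hook.

```shell
# Hypothetical hook: EMERGE=echo prints the arguments instead of merging.
EMERGE=${EMERGE:-emerge}

# Default: --oneshot, so the package is NOT recorded in the world file.
ea()  { "$EMERGE" --ask --oneshot --verbose "$@"; }

# "2" variant: no --oneshot, so the package IS added to the world file.
ea2() { "$EMERGE" --ask --verbose "$@"; }
```

With this split, the short name covers the common case (updating a
dependency) and the longer name is reserved for the rarer case of
deliberately adding something new.
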
The regulars here already know one of my favorite portage features is
FEATURES=buildpkg, which I have set in make.conf. That of course gives me
a collection of binary versions of packages I've already emerged, so I
can quickly revert to an old version for testing something, if I want,
then remerge the new version once I've tested the old version to see if it
has the same bug I'm working on or not. To aid in this, I have a
collection of eppak and eapak scriptlets. Again, the portage default of
--usepkg (-k) doesn't fit my default needs, as if I'm using a binpkg,
I usually want to ONLY use a binpkg, NOT merge from source if the package
isn't available. That happens to be -K (--usepkgonly) in short-form.
However, it's my default, so eapak invokes the -K version. I therefore
have eapaK to invoke the -k version if I don't really care whether it goes
from binpkg or source.

Of course, there are various permutations of the above as well, so I have
eapak2 and eapaK2, as well as eapak and eapaK. For the ep* versions, of
course the --oneshot doesn't make a difference, so I only have eppak and
eppaK, no eppa?2 scriptlets.

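A possible shape for the binpkg family, again as an assumption-laden
sketch rather than the real scriptlets. The one thing to watch is the
deliberate case swap: lowercase "pak" maps to emerge's uppercase -K
(--usepkgonly), and "paK" maps to lowercase -k (--usepkg). EMERGE is a
hypothetical dry-run hook.

```shell
# Hypothetical hook: EMERGE=echo prints the arguments instead of merging.
EMERGE=${EMERGE:-emerge}

# Pretend versions: --oneshot is irrelevant, so there are no "2" variants.
eppak()  { "$EMERGE" --pretend --usepkgonly --verbose "$@"; }  # binpkg only (-K)
eppaK()  { "$EMERGE" --pretend --usepkg --verbose "$@"; }      # binpkg or source (-k)

# Ask versions, --oneshot by default.
eapak()  { "$EMERGE" --ask --oneshot --usepkgonly --verbose "$@"; }
eapaK()  { "$EMERGE" --ask --oneshot --usepkg --verbose "$@"; }

# "2" variants record the package in the world file.
eapak2() { "$EMERGE" --ask --usepkgonly --verbose "$@"; }
eapaK2() { "$EMERGE" --ask --usepkg --verbose "$@"; }
```
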
... Deep breath... <g>

All that as a preliminary explanation to this: Along with the above, I
have a set of efetch functions, that invoke the -f form, so just do the
fetch, not the actual compile and merge, and esyn (there's already an
esync function in something or other I have merged so I just call it
esyn), which does emerge sync, then updates the esearch db, then
automatically fetches all the packages that an eaworld would want to
update, so they are ready for me to merge at my leisure.

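The three esyn steps could be chained as below. This is a sketch under
assumptions: eupdatedb is the database updater shipped with esearch, the
exact fetch flags are a guess at "what eaworld would want to update", and
EMERGE/EUPDATEDB are hypothetical hooks for dry runs. The "sync" action
is written as it appears in the text.

```shell
# Hypothetical hooks so the sequence can be dry-run without portage.
EMERGE=${EMERGE:-emerge}
EUPDATEDB=${EUPDATEDB:-eupdatedb}

esyn() {
    # 1. sync the tree, 2. refresh the esearch db, 3. prefetch world updates.
    # Each step runs only if the previous one succeeded.
    "$EMERGE" sync &&
    "$EUPDATEDB" &&
    "$EMERGE" --fetchonly --newuse --update --deep world
}
```

Chaining with && means a failed sync does not go on to fetch against a
stale tree.
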
Likewise, and the real reason for this whole explanation, I /had/ an
"einfo" scriptlet that simply ran "emerge info". This can be very handy
to run, if like me, you have several slotted versions of gcc merged, and
you sometimes forget which one you have eselected or gcc-configed as the
one portage will use. Likewise, it's useful for checking on CFLAGS (or
CXXFLAGS or LDFLAGS or...), if you modified them from the normal ones
because a particular package wasn't cooperating, and you want to see if
you remembered to switch them back or not.

However, I ran into a problem. The output of einfo was too long to
quickly find the most useful info -- the stuff I most often change and
therefore most often am looking for.

No sweat! I shortened my original "einfo" to simply "ei", and added a
second script, "eis" (for einfo short), that simply piped the output of
the usual emerge info into a grep that only returned the lines I most
often need -- the big title one with gcc and similar info, CFLAGS,
CXXFLAGS, LDFLAGS, and FEATURES. USE would also be useful, but it's too
long even by itself to be searched at a glance, so if I want it, I simply
run ei and look for what I want in the longer output.

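One way the ei/eis pair might be put together, as a sketch only. The
grep pattern is an assumption about which line prefixes to keep, the
eis_filter helper is a hypothetical name introduced so the filter can be
tried on sample text, and EMERGE is a hypothetical override.

```shell
# Keep only the title line and the handful of variables worth a glance.
# eis_filter is a hypothetical helper name.
eis_filter() {
    grep -E '^(Portage |CFLAGS=|CXXFLAGS=|FEATURES=|LDFLAGS=|MAKEOPTS=)'
}

# Full output.
ei()  { "${EMERGE:-emerge}" info; }

# Short output; stderr is dropped so known warnings don't clutter it.
eis() { "${EMERGE:-emerge}" info 2>/dev/null | eis_filter; }
```

Anchoring the pattern at the start of the line keeps USE= (which contains
many of the same words) out of the short listing.
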
... Another deep breath... <g>

OK, with that as a preliminary, you should be able to understand the
following:

$ eis

Portage 2.1_pre4-r1 (default-linux/amd64/2006.0, gcc-4.1.0-beta20060127,
glibc-2.3.6-r2, 2.6.15 x86_64)

CFLAGS="-march=k8 -Os -pipe -fomit-frame-pointer -frename-registers
-funit-at-a-time -fweb -freorder-blocks-and-partition
-fmerge-all-constants"

CXXFLAGS="-march=k8 -Os -pipe -fomit-frame-pointer -frename-registers
-funit-at-a-time -fweb -freorder-blocks-and-partition
-fmerge-all-constants"

FEATURES="autoconfig buildpkg candy ccache confcache distlocks
multilib-strict parallel-fetch sandbox sfperms strict userfetch"

LDFLAGS="-Wl,-z,now"

MAKEOPTS="-j4"

To make sense of that...

* The portage and glibc versions are ~amd64, as set in make.conf for the
system in general.

* CFLAGS:

I choose -Os, optimize for size, because a modern CPU and the various
cache levels are FAR faster than main memory. This difference is
frequently severe enough that it's actually more efficient to optimize for
size than for CPU performance, because the result is smaller code that
maintains cache locality (stays in fast cache) far better, and the CPU
saves more time that it would otherwise be spending idle, waiting for data
to come in from slower more distant memory, than the actual cost of the
loss of cycle efficiency that's often the tradeoff for small code.

-O3, and to a lesser extent, -O2, do things like turn a loop that executes
a fixed number of say 3 times, into "faster" code, by avoiding the jump at
the end of each loop back to the top of the loop by writing it out as
inline code, copying the loop instructions three times. This process
would in our example of a 3-time fixed execution loop, save the expensive
jump back to the top of the loop two times -- but at the SAME time would
expand that section of code to three times its looped size.

Back when memory operated at or near the speed of the CPU, avoiding the
loop, even at the expense of three times the code, was often faster.
Today, where CPUs do several calculations in the time it takes to fetch
data from main memory, it's generally faster to go for the smaller code,
as it will be far more likely to still be in fast cache, avoiding that
long wait for main memory, even if it /does/ mean wasting a couple
additional cycles doing the expensive jump back to the top of the loop.

Of course, this is theory, and the practical case can and will differ
depending on the instructions actually being compiled. In particular,
streaming media apps and media encoding/decoding are likely to still
benefit from the traditional loop elimination style optimizations, because
they run thru so much data already, that cache is routinely trashed
anyway, regardless of the size of your instructions. As well, that type
of application tends to have a LOT of looping instructions to optimize!

By contrast, something like the kernel will benefit more than usual from
size optimization. First, it's always memory locked and as such
can't be swapped, and even "slow" main memory is still **MANY** **MANY**
times faster than swap, so a smaller kernel means more other stuff fits
into main memory with it, and isn't swapped as much. Second, parts of the
kernel such as task scheduling are executed VERY often, either because
they are frequently executed by most processes, or because they /control/
those processes. The smaller these are, the more likely they are to still
be in cache when next used. Likewise, the smaller they are, the less
potentially still useful other data gets flushed out of cache to make room
for the kernel code executing at the moment. Third, while there's a lot
of kernel code that will loop, and a lot that's essentially streaming, the
kernel as a whole is a pretty good mix of code and thus won't benefit as
much from loop optimizations and the like, as compared to special purpose
code like the media codec and streaming applications above.

The differences are marked enough and now demonstrated enough that a
kernel config option to optimize for size was added I believe about a year
ago. Evidently, that led to even MORE demonstration, as the option was
originally in the obscure embedded optimizations corner of the config,
where few would notice or use it, and they upgraded it into a main option.
In fact, where a year or two ago, the option didn't even exist, now I
believe it defaults to yes/on/do-optimize-for-size (altho it's possible
I'm incorrect on the last and it's not yet the default).

According to the gcc manpage, -frename-registers causes gcc to attempt to
make use of registers left over after normal register allocation. This is
particularly beneficial on archs that have many registers. (Keep in
mind that "registers" are what amounts to L0 cache, the fastest possible
memory because the CPU accesses registers directly and they operate at
full CPU speed. Unfortunately, registers are also very limited, making
them an EXCEEDINGLY valuable resource!) Note that while x86-32 is noted
for its relative /lack/ of registers, AMD basically doubled the number of
registers available to 64-bit code in its x86-64 aka AMD64 spec. Thus,
while this option wouldn't be of particular benefit on x86, on amd64, it
can, depending on the code of course, provide some rather serious
optimization!

-fweb is a register use optimizer function as well. It tells gcc to
create a /web/ of dependencies and assign each individual dependency web
to its own pseudo-register. Thus, when it comes time for gcc to allocate
registers, it already has a list of the best candidates lined up and ready
to go. Combined with -frename-registers to tell gcc to efficiently make
use of any registers left over after the first pass, and due to the
number of registers available in 64-bit mode on our arch, this can allow
some seriously powerful optimizations. Still, a couple of things to note
about it. One, -fweb (and -frename-registers as well) can cause data to
move out of its "home" register, which seriously complicates debugging, if
you are a programmer or power-user enough to worry about such things.
Two, the rewrite for gcc 4.0 significantly modified the functionality of
-fweb, and it wasn't recommended for 4.0 as it didn't yet work as well as
expected or as it did with gcc 3.x. For gcc 4.1, -fweb is apparently back
to its traditional strength. Those Gentoo users having gcc 3.4, 4.0, and
4.1, all three in separate slots, will want to note this as they change
gcc configurations, and modify it accordingly. Yes, this *IS* one of the
reasons my CFLAGS change so frequently!

-funit-at-a-time tells gcc to consider a full logical unit, perhaps
consisting of several source files rather than just one, as a whole, when
it does its compiling. Of course, this allows gcc to make
optimizations it couldn't see if it wasn't looking at the larger picture
as a whole, but it requires rather more memory, to hold the entire unit
so it can consider it at once. This is a fairly new flag, introduced with
gcc 3.3 IIRC. While the idea is simple enough and shouldn't lead to any
bugs on its own, there WERE a number of previously never encountered bugs
in various code that this flag exposed, when GCC made optimizations on the
entire unit that it wouldn't otherwise make, thereby triggering bugs that
had never been triggered before. I /believe/ this was the root reason why
the Gentoo amd64 technotes originally discouraged use of -Os, back with
the first introduction of this flag in gcc 3.2 hammer (amd64) edition, as
-funit-at-a-time was activated by -Os at that time, and -Os was known to
produce bad code at the time, on amd64, with packages like portions of
KDE. The gcc 4.1.0 manpage now says it's enabled by default at -O2 and
-O3, but doesn't mention -Os. Whether that's an omission, or whether they
decided it shouldn't be enabled by -Os for some reason, I'm not sure, but
I use them both to be sure and haven't had any issues I can trace to this
(not even back when the technotes recommended against -Os, and said KDE
was supposed to have trouble with it -- maybe it was parts of KDE I never
merged, or maybe I was just lucky, but I've simply never had an issue with
it).

-freorder-blocks-and-partition is new for gcc 4.0, I believe, altho I
didn't discover it until I was reading the 4.1-beta manpage. I KNOW gcc
3.4.4 fails out with it, saying unrecognized flag or some such, so it's
another of those flags that cause my CFLAGS to be constantly changing, as
I switch between gcc versions. This flag won't work under all conditions,
according to the manpage, so is automatically disabled in the presence of
exception handling, and a few other situations named in the manpage. It
causes a lot of warnings too, to the effect that it's being disabled due
to X reason. There's a similar -freorder-blocks flag, which optimizes by
reordering blocks in a function to "reduce number of taken branches and
improve code locality." In English, what that means is that it breaks
caching less often. Again, caching is *EXTREMELY* performance critical,
so anything that breaks it less often is CERTAINLY welcome! The
-and-partition increases the effect, by separating the code into
frequently used and less frequently used partitions. This keeps the most
frequently used code all together, therefore keeping it in cache far more
efficiently, since the less used code won't be constantly pulled in,
forcing out frequently used code in the process.

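Since the flag list has to change with the selected compiler, a small
helper could pick it per gcc version. The sketch below encodes only the
two version constraints stated above (no -fweb on 4.0, no
-freorder-blocks-and-partition on 3.x); the function name and the version
matching are assumptions for illustration.

```shell
# cflags_for_gcc is a hypothetical helper: print a CFLAGS string suited to
# the given gcc version string (e.g. "3.4.4", "4.0.2", "4.1.0").
cflags_for_gcc() {
    common="-march=k8 -Os -pipe -fomit-frame-pointer -frename-registers -funit-at-a-time"
    case "$1" in
        # 3.x: -freorder-blocks-and-partition is rejected as unrecognized.
        3.*)  printf '%s\n' "$common -fweb -fmerge-all-constants" ;;
        # 4.0: -fweb was not recommended after the rewrite.
        4.0*) printf '%s\n' "$common -freorder-blocks-and-partition -fmerge-all-constants" ;;
        # 4.1 and later: the full set from the eis output.
        *)    printf '%s\n' "$common -fweb -freorder-blocks-and-partition -fmerge-all-constants" ;;
    esac
}
```

It could be fed from the active compiler with something like
cflags_for_gcc "$(gcc -dumpversion)", with the result pasted into
make.conf by hand.
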
Hmm... As I'm writing and thinking about this, it occurs to me that
sticking the regular -freorder-blocks option in CFLAGS as well would
probably be a wise thing. The non-partition version isn't as efficient as
the partition version, and would be redundant if the partitioned version
is in effect. However, the non-partitioned version doesn't have the same
sorts of no-exceptions-handler and similar restrictions, so having it in
the list, first, so the partitioned version overrides it where it can be
used, should be a good idea. That way, where the partitioned version can
be used, it will be, but where it can't, gcc will still use the
non-partitioned version of the option, so I'll still get /some/ of the
optimizations! I (re)compiled major portions of xorg (modular), qt, and
the new kde 3.5.1 with the partitioned option, however, and it works, and
I haven't tested having both options in there yet, so I'm not sure it'll
work as the theory suggests it should, so some caution might be advised.

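For anyone wanting to try the idea, the make.conf lines would look
something like this. It is only the untested combination described
above written out, with -freorder-blocks placed ahead of the partitioned
variant; nothing here has been verified to behave as the theory
suggests.

```shell
# make.conf fragment (untested combination, see the caution above).
# -freorder-blocks comes first, -freorder-blocks-and-partition after it.
CFLAGS="-march=k8 -Os -pipe -fomit-frame-pointer -frename-registers -funit-at-a-time -fweb -freorder-blocks -freorder-blocks-and-partition -fmerge-all-constants"
CXXFLAGS="${CFLAGS}"
```
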
-fmerge-all-constants COULD be dangerous with SOME code, as it breaks part
of the C/C++ specification. However, it should be fine for most code
written to be compiled with gcc, and I've seen no problems /yet/ tho both
this and the reorder-and-partition flag above are fairly new to my CFLAGS,
so haven't been as extensively personally tested as the others have been.
If something seems to be breaking when this is in your CFLAGS, certainly
it's the first thing I'd try pulling out. What it actually does is merge
all constants with the same value into the same one. gcc has a weaker
-fmerge-constants version that's enabled with any -O option at all (thus
at -O, -O2, -O3, AND -Os), that merges all declared constants of the same
value, which is safe and doesn't conflict with the C/C++ spec. What the
/all/ specifier in there does, however, is cause gcc to merge declared
variables where the value actually never changes, so they are in effect
constants, altho they are declared as variables, with other constants of
the same value. This /should/ be safe, /provided/ gcc isn't failing to
detect a variable change somewhere, but it conflicts with the C/C++ spec,
according to the gcc manpage, and thus /could/ cause issues, if the
developer pulls certain tricks that gcc wouldn't detect, or possibly more
likely, if used with code compiled by a different compiler (say
binary-only applications you may run, which may not have been compiled
with gcc). There are two reasons why I choose to use it despite the
possible risks. One, I want /small/ code, again, because small code fits
in that all-important cache better and therefore runs faster, and
obviously, two or more merged constants aren't going to take the space
they would if gcc stored them separately. Two, the risks aren't as bad if
you aren't running non-gcc compiled code anyway, and since I'm a strong
believer in Software Libre, if it's binary-only, there's very little
chance I'll want or risk it on my box, and everything I do run is gcc
compiled anyway, so should be generally safe. Still, I know there may be
instances where I'll have to recompile with the flag turned off, and am
prepared to deal with them when they happen, or I'd not have the flag in
my CFLAGS.


And, here's some selected output from ei, interspersed with explanations,
since I'm editing the output anyway:

$ ei
!!! Failed to change nice value to '-2'
!!! [Errno 13] Permission denied

This is stderr output. It's not in the eis output above because I
redirect stderr to /dev/null for it, as I know the reason for the error
and am trying to be brief.

The warning is because I'm using PORTAGE_NICENESS=-2 in make.conf. It has
a negative nice set there to encourage portage to make fuller use of the
dual CPUs under-X/from-a-konsole-session, as X and the kernel do some
dynamic scheduling magic to keep X more responsive without having to up
/its/ priority. The practical effect of that "magic" is to lower the
priorities of everything besides X slightly, when X is running. This
/does/ have the intended effect of keeping X more responsive, but the cost
as observed here is that emerges take longer than they should when X is
running, because the scheduler is leaving a bit of extra idle CPU time to
keep X responsive. In many cases, I'd rather be using maximum CPU and get
the merges done faster, even if X drags a bit in the mean time, and the
slightly negative niceness for portage accomplishes exactly that.

It's reporting a warning (to stderr) here, as I ran the command as a
regular non-root user, and non-root can't set negative priorities for
obvious system security reasons. I get the same warning with my ep*
commands, which I normally run as a regular user, as well. The ea*
commands which actually do the merging get run as root, naturally, so the
niceness /can/ be set negative when it counts, during a real emerge.

So... nothing of any real matter, then.


!!! Relying on the shell to locate gcc, this may break
!!! DISTCC, installing gcc-config and setting your current gcc
!!! profile will fix this

Another warning, likewise to stderr and thus not in the eis output. This
one is due to the fact that eselect, the eventual systemwide replacement
for gcc-config and a number of other commands, uses a different method to
set the compiler than gcc-config did, and portage hasn't been adjusted to
full compatibility just yet. Portage finds the proper gcc just fine for
itself, but there'd be problems if distcc was involved, thus the warning.

Again, I'm aware of the situation and the cause, but don't use distcc, so
it's nothing I have to worry about, and I can safely ignore the warning.

I kept the warnings here, as I find them and the explanation behind them
interesting elements of my Gentoo environment, thus worth posting, for
others who seem interested in my Gentoo environment as well. If nothing
else, the explanations should help some in my audience understand that bit
more about how their system operates, even if they don't get these
warnings.


Portage 2.1_pre4-r1 (default-linux/amd64/2006.0, gcc-4.1.0-beta20060127,
glibc-2.3.6-r2, 2.6.15 x86_64)
=================================================================
System uname: 2.6.15 x86_64 AMD Opteron(tm) Processor 242
Gentoo Base System version 1.12.0_pre15

Those of you running stable amd64, but wondering where baselayout is for
unstable, there you have it!

ccache version 2.4 [enabled]
dev-lang/python: 2.4.2
sys-apps/sandbox: 1.2.17
sys-devel/autoconf: 2.13, 2.59-r7
sys-devel/automake: 1.4_p6, 1.5, 1.6.3, 1.7.9-r1, 1.8.5-r3, 1.9.6-r1
sys-devel/binutils: 2.16.91.0.1
sys-devel/libtool: 1.5.22
virtual/os-headers: 2.6.11-r3

ACCEPT_KEYWORDS="amd64 ~amd64"

Same for the above portions of my toolchain. AFAIR, it's all ~amd64,
altho I was running a still-masked binutils for awhile shortly after
gcc-4.0 was released (still-masked on Gentoo as well), as it required the
newer binutils.

LANG="en_US"
LDFLAGS="-Wl,-z,now"

Some of you may have noticed the occasional Portage warning about SETUID
executables using lazy bindings, and the potential security issue that
causes. This setting for LDFLAGS forces early bindings with all
dynamically linked libraries. Normally it'd only be necessary or
recommended for SETUID executables, and set in the ebuild where it's safe
to do so, but I use it by default, for several reasons. The effect is
that a program takes a bit longer to load initially, but won't have to
pause to resolve late bindings as they are needed. You're trading waiting
at executable initialization for waiting at some other point. With a gig
of memory, I find most stuff I run more than once is at least partially
still in cache on the second and later launches, and with my system, I
don't normally find the initial wait irritating, and sometimes find a
pause after I'm working with a program especially so, so I prefer to have
everything resolved and loaded at executable launch. Additionally, with
lazy bindings, I've had programs start just fine, then fail later when
they need to resolve some function that for some reason won't resolve in
whatever library it's supposed to be coming from. I don't like having the
thing fail and interrupt me in the middle of a task, and find it far less
frustrating, if it's going to fail when it tries to load something, to
have it do so at launch. Because early binding forces resolution of
functions at launch, if it's going to fail loading one, it'll fail at
launch, rather than after I've started working with the program. That's
/exactly/ how I want it, so that's why I run the above LDFLAGS setting.
It's nice not to have to worry about the security issue, but SETUID type
security isn't as critical on my single-human-user system, where that
single user is me and I already have root when I want it anyway, as it'd
be in a multi-user system, particularly a public server, so the other
reasons are more important than security, for me, on this. They just
happen to coincide, so I'm a happy camper. =8^)

The caveat with these LDFLAGS, however, is the rare case where there's a
circular functional dependency that's normally self-resolving. Modular
xorg triggers one such case, where the monolithic xorg didn't. There are
three individual ebuilds related to modular xorg that I have to remove
these LDFLAGS for or they won't work. xorg-server is one.
xf86-video-ati, my video driver, is another. libdri was the third, IIRC.
There's a specific order they have to be compiled in, as well. If they are
compiled with this enabled, they, and consequently X, refuse to load (tho
X will load without DRI, if that's the only one, it'll just protest in the
log and DRI and glx aren't available). Evidently there's a non-critical
fourth module somewhere, that still won't load properly due to an
unresolved symbol, that I need to track down and remerge without these
LDFLAGS, and that's what's keeping GLX from loading on my current system,
as mentioned in an earlier post.

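One possible way to drop the LDFLAGS for just those few packages is to
override the variable in the environment of a single command. The
wrapper below is a hypothetical helper, and it assumes that a value in
emerge's environment takes precedence over the make.conf setting, which
is worth confirming on your own portage version before relying on it.

```shell
# without_ldflags is a hypothetical helper: run the given command with
# LDFLAGS set to the empty string for that command only.
without_ldflags() {
    LDFLAGS="" "$@"
}
```

Usage would be along the lines of "without_ldflags emerge --oneshot
xorg-server", leaving the LDFLAGS of the calling shell untouched.
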
LINGUAS="en"
MAKEOPTS="-j4"

The four jobs is nice for a dual-CPU system -- when it works.
Unfortunately, the unpack and configure steps are serialized, so the jobs
option does little good, there. To make most efficient use of the
available cycles when I have a lot to merge, therefore, I'll run as many
as five merges in parallel. I do this quite regularly with KDE upgrades
like the one to 3.5.1, where I use the split KDE ebuilds and have
something north of 100 packages to merge before KDE is fully upgraded.

I mentioned above that I often run eptree, then ea individual packages
from the list. This is how I accomplish the five merges in parallel.
I'll take a look at the tree output to check the dependencies, and merge
the packages first that have several dependencies, but only where those
dependencies aren't stepping on each other, thus keeping the parallel
emerges from interfering with each other, because each one is doing its
own dependencies, that aren't dependencies of any of the others. After I
get as many of those going as I can, I'll start listing 3-5 individual
packages without deps on the same ea command line. By the time I've
gotten the fifth one started, one of the other sessions has usually
finished or is close to it, so I can start it merging the next set of
packages. With five merge sessions in parallel, I'm normally running an
average load of 5 to 9, meaning that many applications are ready for CPU
scheduling time at any instant, on average. If the load drops below four,
there's probably idle CPU cycles being wasted that could otherwise be
compiling stuff, as each CPU needs at least one load-point to stay busy,
plus usually can schedule a second one for some cycles as well, while the
first is waiting for the hard drive or whatever.

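The "load below four" rule of thumb is easy to check from a shell. The
helper below is a sketch: the function name is made up, it assumes a
Linux /proc/loadavg whose first field is the 1-minute load average, and
the optional second argument exists only so it can be tried against a
sample file.

```shell
# load_below is a hypothetical helper: succeed if the 1-minute load
# average is below the given threshold.
# Usage: load_below THRESHOLD [FILE]   (FILE defaults to /proc/loadavg)
load_below() {
    awk -v limit="$1" '{ exit !($1 < limit) }' "${2:-/proc/loadavg}"
}
```

Something like "load_below 4 && echo 'room for another merge'" then
gives a quick hint on whether to start one more session.
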
(Note that I'm running a four-drive RAID, RAID-6, so two-way striped, for
my main system, and RAID-0, so 4-way striped, for $PORTAGE_TMPDIR, so hard
drive latency isn't /nearly/ as high as it would be on a single-hard-drive
system. Of course, running five merges in parallel /does/ increase disk
latency some as well, but it /does/ seem to keep my load-average in the
target zone and my idle cycles to a minimum, during the merge period.
Also note that I've only recently added the PORTAGE_NICENESS value above,
and haven't gotten it fully tweaked to the best balance between
interactivity and emerge speed just yet, but from observations so far,
with the niceness value set, I'll be able to keep the system busy with
"only" 3-4 parallel merges, rather than the 5 I had been having to run to
keep the system most efficiently occupied when I had a lot to merge.)

487 |
PKGDIR="/pkg" |
488 |
PORTAGE_TMPDIR="/tmp" |
489 |
PORTDIR="/p" |
490 |
PORTDIR_OVERLAY="/l/p" |
491 |
|
492 |
Here you can see some of my path customization. |

USE="amd64 7zip X a52
aac acpi alsa apm arts asf audiofile avi bash-completion berkdb
bitmap-fonts bzip2 caps cdparanoia cdr crypt css cups curl dga divx4linux
dlloader dri dts dv dvd dvdr dvdread eds emboss encode extrafilters fam
fame ffmpeg flac font-server foomaticdb gdbm gif glibc-omitfp gpm
gstreamer gtk2 idn imagemagick imlib ithreads jp2 jpeg jpeg2k kde
kdeenablefinal lcms libwww linuxthreads-tls lm_sensors logitech-mouse
logrotate lzo lzw lzw-tiff mad maildir mikmod mjpeg mng motif mozilla mp3
mpeg ncurses network no-old-linux nolvm1 nomirrors nptl nptlonly offensive
ogg opengl oss pam pcre pdflib perl pic png ppds python qt quicktime
radeon readline scanner slang speex spell ssl tcltk theora threads tiff
truetype truetype-fonts type1 type1-fonts usb userlocales vcd vorbis
xcomposite xine xinerama xml2 xmms xosd xpm xrandr xv xvid yv12 zlib
elibc_glibc input_devices_keyboard input_devices_mouse kernel_linux
linguas_en userland_GNU video_cards_ati"

My USE flags, FWTAR (for what they are worth). Of particular interest are
input_devices_keyboard, input_devices_mouse, and video_cards_ati. These
come from variables (INPUT_DEVICES and VIDEO_CARDS) set in make.conf, and
are used in the new xorg-modular ebuilds. These and the others listed
after zlib are referred to by Gentoo devs as USE_EXPAND. Effectively,
they are USE flags in the form of variables, set up that way because there
are rather many possible values for those variables, too many to work as
regular USE flags. The LINGUAS and LANG USE_EXPAND variables are prime
examples. Consider how many different languages there are. If each were
used and documented as a regular USE flag, it would have to be in
use.local.desc, because few supporting packages would offer the same
choices, so each would have to be listed separately for each package.
Talk about the number of USE flags quickly getting out of control!
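
To make the naming scheme concrete, here is a minimal sketch of how a
USE_EXPAND variable maps onto the flags seen at the end of the USE list
above. expand_use is a hypothetical helper written for illustration only;
it is not a Portage function, and Portage's own handling is more involved.

```shell
# Sketch of the USE_EXPAND naming scheme: each value of a variable such as
# VIDEO_CARDS becomes a flag named <lowercased variable name>_<value>.
# expand_use is hypothetical. Arguments: <VARIABLE_NAME> <value>...
expand_use() {
    var=$1; shift
    prefix=$(printf '%s' "$var" | tr '[:upper:]' '[:lower:]')
    out=""
    for value in "$@"; do
        # Append with a separating space, except before the first flag.
        out="${out:+$out }${prefix}_${value}"
    done
    printf '%s\n' "$out"
}

expand_use INPUT_DEVICES keyboard mouse
expand_use VIDEO_CARDS ati
```

This prints "input_devices_keyboard input_devices_mouse" and then
"video_cards_ati", matching the entries in the USE line.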

Unset:  ASFLAGS, CTARGET, EMERGE_DEFAULT_OPTS, LC_ALL

OK, some loose ends to wrap up, and I'm done.

re: gcc versions: The plan is for gcc-4.0 to go ~arch fairly soon now.
The devs are actively asking for bug reports involving it, so that as many
as possible can be resolved before it goes ~arch. (Formerly, they were
recommending that bugs be filed upstream, and not with Gentoo unless there
was a patch attached, as it was considered entirely unsupported, just
there for those that wanted it anyway.) At this point, nearly everything
should compile just fine with 4.0.

That said, Gentoo has slotted gcc for a reason. It's possible to have
multiple minor versions (3.3, 3.4, 4.0, 4.1) merged at the same time.
With USE=multislot, that extends to micro versions (4.0.0, 4.0.1,
4.0.2...). Using either gcc-config or eselect compiler, and discounting
any CFLAGS switching you may have to do, it's a simple matter to switch
between merged versions. This made it easy to experiment with gcc-4.0
even tho Gentoo wasn't supporting it and certain packages wouldn't compile
with 4.x, because it was always possible to switch to a 3.x version if
necessary, and compile the package there. I did this quite regularly,
using gcc-4.0 as my normal version, but reverting for individual packages
as necessary, when they wouldn't compile with 4.0.
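
The switch-build-switch-back routine described above could be wrapped
along these lines. This is only a sketch: build_with_gcc and the DRY_RUN
variable are hypothetical, and the profile and package names in the
example are placeholders (gcc-config -l lists the profiles actually
installed). By default it only prints the commands it would run.

```shell
# Hypothetical wrapper: build one package with a fallback gcc profile,
# then switch back to the normal one. DRY_RUN=1 (the default) prints the
# commands instead of running them.
# Arguments: <fallback-profile> <normal-profile> <package>
build_with_gcc() {
    fallback=$1 normal=$2 pkg=$3
    if [ "${DRY_RUN:-1}" = 1 ]; then
        run() { echo "$*"; }
    else
        run() { "$@"; }
    fi
    run gcc-config "$fallback" || return 1
    run emerge --oneshot "$pkg"
    status=$?
    # Always switch back, even if the build failed.
    run gcc-config "$normal"
    return "$status"
}

# Placeholder names; substitute real profiles and a real package.
build_with_gcc x86_64-pc-linux-gnu-3.4.5 x86_64-pc-linux-gnu-4.0.2 app-misc/example
```

Setting DRY_RUN=0 would run the commands for real, which needs root and an
actual Gentoo system with both slots merged.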

The same now applies to the 4.1.0-beta-snapshot series. Other than the
time necessary to compile a new gcc when the snapshot comes out each week,
it's easy to run the 4.1-beta as the main system compiler for as wide
testing as possible, while reverting to 4.0 or 3.4 (I don't have a 3.3
slot merged) if needed.

re: the performance improvements I saw that started this whole thing:
These trace to several things, I believe. #1, with gcc-4.0, there's now
support for -fvisibility -- setting certain functions as exported and
visible externally, others not. That can easily cut exported symbols by a
factor of 10. Exported symbols of course affect dynamic load-time, which
of course gets magnified dramatically by my LDFLAGS early binding
settings. When I first compiled KDE with that (there were several
missteps early on in terms of KDE and Gentoo's support, but that aside),
KDE app load times went down VERY NOTICEABLY! Again, due to my LDFLAGS,
the effect was multiplied dramatically, but the effect is VERY real!

Of course, that's mainly load-time performance. The run-time performance
that we are actually talking about here has other explanations. A big one
is that gcc-4 was a HUGE rewrite, with a BIG potential to DRAMATICALLY
improve gcc's performance. With 4.0, the theory was there, but in
practice, it wasn't all that optimized just yet. In some ways it
regressed below the fairly mature 3.x series, altho the rewrite made
things much simpler and less prone to error given its maturity. 4.1,
however, is the first 4.x release to REALLY be hitting the potential of
the 4.x series, and it appears the difference is very noticeable. Of
course, there's a reason 4.1.0 is still in beta upstream and not supported
by Gentoo either, as there are still known regressions. However, where it
works, which it seems to do /most/ of the time, it **REALLY** works, or at
least that's been my observation.

3.3 was a MAJOR improvement in gcc for amd64 users, because it was the
first version where amd64 wasn't simply an add-on hack, as it had been
with 3.2. The 3.4 upgrade was minor in comparison, and 4.0, while it's
going ~arch shortly and sets the stage for a lot of future improvement,
will be pretty minor in terms of actual improved performance as well.
4.1, however, when it is finally fully released, has the potential to be
as big an improvement as 3.3 was -- that is, a HUGE one. I'm certainly
looking forward to it, and meanwhile, running the snapshots, because
Gentoo makes it easy to do so while maintaining the ability to switch very
simply between multiple versions on the system.

Both -freorder-blocks-and-partition and -fmerge-all-constants are new to
me within the last few days, first used with kde 3.5.1. Normally,
individual flags won't make /that/ much of a difference, but it's possible
I hit it lucky with these. Actually, because they both match very well
with and reinforce my strategy of targeting size, it's possible I'm only
now unlocking the real potential behind size optimization. I **KNOW**
there's a **HUGE** difference in the resulting file sizes. I compared
4.0.2 and 4.1.0-beta-snapshot file sizes for several modular-X files in
the course of researching the missing symbols problem, and the difference
was often a shrinkage of near 33 percent with 4.1 and my current CFLAGS as
opposed to 4.0.2 without the new ones. Going the other way, that's a 50%
larger file with 4.0.2 as compared to 4.1, 150KB vs 100KB, by way of
example. That's a *HUGE* difference, one big enough that I initially
thought I'd found the reason for the missing symbols right there, as the
new files simply looked too much smaller to be workable! Still, I traced
the problem to LDFLAGS, so that wasn't it, and the files DO work,
confirming things. I'm guessing -fmerge-all-constants plays a significant
part in that. In any case, with that difference in size, and knowing how
/much/ cache hit vs. miss affects performance, it's quite possible the
size is the big performance factor. Of course, even if that's so, I'm not
sure whether it is the CFLAGS or the 4.0 vs 4.1 that should get the credit.
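
The 33 percent and 50 percent figures are the same difference measured
against different baselines, which a couple of lines of shell arithmetic
make plain. pct_smaller and pct_larger are hypothetical helper names, and
the 150 and 100 are just the example sizes in KB from above.

```shell
# Both helpers take <old-size> <new-size> and use integer arithmetic,
# so results are rounded down.
# Shrinkage relative to the old (larger) file:
pct_smaller() { echo $(( ($1 - $2) * 100 / $1 )); }
# Growth relative to the new (smaller) file:
pct_larger()  { echo $(( ($1 - $2) * 100 / $2 )); }

pct_smaller 150 100
pct_larger 150 100
```

This prints 33 and then 50: going from 150KB to 100KB is a shrinkage of
about a third, while 150KB is half again as large as 100KB.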

In any case, I'm a happy camper right now! =8^)


--
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman in
http://www.linuxdevcenter.com/pub/a/linux/2004/12/22/rms_interview.html


--
gentoo-amd64@g.o mailing list