May I put my oar into your optimisation discussion?

It's funny, Duncan. On the one hand you are saving every byte of CPU
cache; on the other, you are happy to have forked bashes in your main
memory. But how do you control that? I mean, how do you get the code of
your forked bashes out of your CPU cache so it is free for kernel code?

A long time ago... I was testing some CFLAGS on my own programs. I wrote
a fast Fourier transform myself, only to see the "impressive" difference
between -Os, -O3 and some other optimisation flags. I fed my FFT with a
large amount of input, but no matter how hard I tried to make it faster
by changing the flags, it didn't work. The difference was marginal, and
not every flag brings an improvement for every program. The only thing
that changed a lot was the time gcc needed to perform those
optimisations.

Bernhard

On Thursday 09 February 2006 01:17, Duncan wrote:
> Simon Stelling posted <43EA568D.6020307@g.o>, excerpted below, on
> Wed, 08 Feb 2006 21:37:33 +0100:
>
> > Duncan wrote:
> >> I should really create a page listing all the little Gentoo admin
> >> scripts I've come up with and how I use them. I'm sure a few folks
> >> would likely find them useful.
> >>
> >> The idea behind most of them is to create shortcuts for long emerge
> >> lines with all sorts of arbitrary command line parameters. The
> >> majority of these fall into two categories, ea* and ep*, short for
> >> emerge --ask <additional parameters> and emerge --pretend ... . Thus,
> >> I have epworld and eaworld, the pretend and ask versions of emerge
> >> -NuDv world; epsys and easys, the same for system; eplog <package>,
> >> emerge --pretend --log --verbose (package name to be added to the
> >> command line, so eplog gcc, for instance, to see the changes between
> >> my current and the new version of gcc); eptree <package>, to use the
> >> tree output; etc.
> >
> > Interesting. But why do you use scripts and not simple aliases? Every
> > time you launch your script the HD performs a seek (which is very
> > expensive in time), copies the script into memory and then forks a
> > whole bash process to execute a one-liner. Using an alias, which is a
> > bash built-in, wouldn't fork a process and would therefore be much
> > faster.
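For illustration, the two approaches being compared might look like this
(a minimal sketch; the emerge flags are the -NuDv world set described
above, while the ~/bin path is an assumption):

```shell
# Script approach: a tiny one-liner file somewhere on $PATH.
# The ~/bin location is an assumption for illustration; the emerge
# flags are the -NuDv world set described above.
mkdir -p "$HOME/bin"
cat > "$HOME/bin/epworld" <<'EOF'
#!/bin/bash
exec emerge --pretend --newuse --update --deep --verbose world
EOF
chmod +x "$HOME/bin/epworld"

# Alias approach: one line in ~/.bashrc instead; no fork and no file
# read once the shell has started:
alias epworld='emerge --pretend --newuse --update --deep --verbose world'
```

The trade-off debated below is exactly this: the script is a file that
lives in the page cache, the alias is state inside the running bash.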
>
> My thinking, which is possibly incorrect (your input appreciated), is
> that file-based scripts get pulled into cache the first time they are
> executed, and will remain there (with a gig of memory) pretty much
> until I'm done doing my upgrades. At the same time, they are simply in
> cache, not something in bash's memory, so if the memory is needed, it
> will be reclaimed. As well, after I'm done and on to other tasks, the
> cached commands will eventually be replaced by other data, if need be.
>
> Aliases (and bash functions) are held in memory. That's not as flexible
> as cache in terms of being knocked out of memory if the memory is
> needed by other things. Sure, that memory may be flushed to disk-based
> swap, but that's disk-based the same as the actual script files I'm
> using, so reading it back into main memory if it's faulted out will
> take something comparable to the time it'd take to read in the script
> file again anyway. That's little gain, with the additional overhead,
> and therefore loss, of having to manage the temp copy in swapped
> memory, if it comes to that.
>
> Actually, there are some details here that may affect things. I don't
> know enough about the following factors to be able to evaluate how they
> balance out, but the real reason I chose individual scripts is below.
>
> One, here anyway, tho not on most systems, I'm running four SATA disks
> in RAID. The swap is actually not on the RAID, as the kernel manages it
> like RAID on its own, provided all four swap areas are set to the same
> priority (they are), which means swap is running on the equivalent of
> four-way-striped RAID-0. Meanwhile, the scripts, as part of my main
> system, are on RAID-6 for redundancy, so with the same four disks
> backing the RAID-6 as the swap, I've only effectively two-way-striped
> storage there, the other two disk stripes being parity. Thus, retrieval
> from the four-way-striped swap should in theory be more efficient than
> from the two-way-striped regular storage. OTOH, the granularity of the
> stripe in either case, against the size of a one- or two-line script,
> likely means that it'll be pulled from a single stripe (at the speed of
> reading from a single disk, tho there are parallelizing opportunities
> not available on a single disk). It's also likely that the swap will be
> more optimally managed for fast retrieval than the location on the
> regular filesystem is. Balanced against that, we have the overhead of
> maintaining the swap tracking.
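The equal-priority swap setup described here is a standard kernel
mechanism: swap areas sharing a priority are used round-robin, spreading
pages across all of them. A hypothetical /etc/fstab fragment (the device
names are assumptions, not taken from the post):

```
# /etc/fstab (illustrative; device names assumed)
/dev/sda2   none   swap   sw,pri=1   0 0
/dev/sdb2   none   swap   sw,pri=1   0 0
/dev/sdc2   none   swap   sw,pri=1   0 0
/dev/sdd2   none   swap   sw,pri=1   0 0
```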
>
> That's assuming it would swap that out to the dedicated swap in the
> first place. I'm not familiar with Linux's VM, but given that the
> aliases and functions would be file-based in either case, it's possible
> it would simply drop the data from main memory, relying on the fact
> that the data is clean file-backed data and could be read in directly
> from the files again, if necessary, rather than bothering with actually
> creating a temporary copy of the /same/ data in swap, taking time to do
> so when it could just read it back in from the file.
>
> Another aspect is the effect of data vs. metadata caching. Again, I'm
> not familiar with how Linux manages this, and indeed, it may differ
> between filesystems, but the idea is that if the file metadata is still
> cached, even if the file itself isn't, it's a single disk seek and read
> to read the data back in, as opposed to multiple seeks and reads,
> following the logical directory structure to fetch each directory table
> in the hierarchy until it reaches the entry that actually has the file
> location, before it can read the file itself, as happens when the file
> is read initially, or if the location metadata has been flushed as
> well. (Back several years ago on MSWormOS, one of the first things I
> always did after a reinstall was set the system to server profile,
> which kept a far larger metadata cache, on the theory that the metadata
> was usually smaller than the data and, for dirs, sharable among many
> data files, so I'd rather spend cache memory on metadata than data. The
> other choices were the default desktop profile, and laptop, with a much
> smaller metadata cache. I originally learned about these as a result of
> reading about a bug in the original 95 as shipped, that swapped some
> entries in the registry and therefore cached FAR less metadata than it
> should have. I don't know where these tweaks are located on Linux, or
> how to go about adjusting them safely.)
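For what it's worth, the nearest Linux knob for that data-vs-metadata
balance appears to be the vm.vfs_cache_pressure sysctl, which biases
reclaim of the dentry/inode (metadata) caches; a quick look at it:

```shell
# vm.vfs_cache_pressure biases reclaim of the dentry/inode (metadata)
# caches: 100 is the default; lower values favour keeping metadata
# cached at the expense of data cache.
cat /proc/sys/vm/vfs_cache_pressure
# Changing it needs root; the value below is only an example:
#   sysctl -w vm.vfs_cache_pressure=50
```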
>
> Basically, therefore, I don't believe aliases to be a big positive, and
> possibly somewhat of a negative, as opposed to scripts, because the
> scripts will be cached in most cases after initial use anyway, yet they
> have the advantage of not having to be maintained or tracked in memory
> when I'm doing other tasks and the system needs that cache.
>
> Given that I don't believe it's a big positive, I prefer the
> administrative convenience and maintainability of separate scripts.
>
> There /is/ a third alternative, which I came across recently, that I
> think is a good idea. If you'd comment, perhaps it would help me sort
> out the implications.
>
> The idea, simply put, is "bash command theming": single scripts that
> can be invoked to "theme" a command prompt for the tasks at hand. I
> didn't read the entire article I saw covering this, but skimmed it
> enough to get the gist. A single invokable script for each set of
> tasks, say perl programming, bash programming, working with portage,
> etc., that would set up a specific set of aliases and functions for
> that task. Invoking the script with the "off" parameter would erase
> that set of aliases and bash functions, thereby recovering the memory,
> and do any related cleanup, like resetting the path if necessary to
> exclude any task-specific commands. Taking this a step further, a
> variable could be set up that would list the theme or themes that were
> active, which the theme-setup script could read to automatically
> deactivate the previous theme while switching to the new one. One could
> even share functionality between themes, sourcing common files which
> would check the active theme and adjust their behavior accordingly.
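A minimal sketch of such a theme script (the file name, the ACTIVE_THEME
variable, and the function names are all invented to make the idea
concrete; it would be sourced rather than executed, so the definitions
land in the current shell):

```shell
# Hypothetical theme script, sourced as:  . portage-theme [off]
# All names here (functions, ACTIVE_THEME) are illustrative.
portage_theme_on() {
    ACTIVE_THEME="portage"
    alias epworld='emerge --pretend --newuse --update --deep --verbose world'
    alias eaworld='emerge --ask --newuse --update --deep --verbose world'
}
portage_theme_off() {
    unalias epworld eaworld 2>/dev/null
    unset ACTIVE_THEME
}
if [ "${1:-}" = "off" ]; then
    portage_theme_off
else
    # deactivate whatever theme was active before switching themes
    if [ -n "${ACTIVE_THEME:-}" ]; then
        "${ACTIVE_THEME}_theme_off"
    fi
    portage_theme_on
fi
```

A shared dispatcher could read ACTIVE_THEME and call the matching
*_theme_off function, which is the automatic-deactivation step described
above.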
>
> This alias and function theming wouldn't be quite as modular (tho with
> sourcing it could be) as the individual scripts, but would maintain the
> performance advantages (if any) of the alias/function idea, while at
> the same time allowing the memory reclamation of the cached-script
> option. It sounds really good, but I'm not yet convinced the benefits
> would be worth the additional effort of setting up those themes, since
> the solution I have works.
>
> One VERY NICE benefit of the themes idea is that it would directly
> address any namespace pollution concerns. It has a direct appeal to
> programmers, and anyone else who's ever had to deal with such issues,
> for that reason alone. One single command on the path to invoke the
> theme, possibly even an eselect-like command shared among themes, with
> everything else off-path and out of the namespace unless that theme is
> invoked! /VERY/ appealing indeed. OTOH, there are those who'll never
> remember which theme they have active at the moment, and be constantly
> confused. For those folks, it'd be a nightmare!
>
> > man emerge:
> > --oneshot (-1)
> >
> > IIRC --oneshot has had a short form since 2.0.52 was released.
>
> Learn new things every day. Thanks! I remember how pleased I was to
> have --newuse, and even more so when I discovered -N, so very nice!
>
> >> ... Deep breath... <g>
> >>
> >> All that as a preliminary explanation to this: Along with the above,
> >> I have a set of efetch functions, which invoke the -f form, so they
> >> just do the fetch, not the actual compile and merge, and esyn
> >> (there's already an esync function in something or other I have
> >> merged, so I just call it esyn), which does emerge sync, then updates
> >> the esearch db, then automatically fetches all the packages that an
> >> eaworld would want to update, so they are ready for me to merge at my
> >> leisure.
> >
> > I'm a bit confused now. You use *functions* to do that? Or do you mean
> > scripts? By the way: with alias you could name your custom "script"
> > esync, because it doesn't place a file on the hard disk.
>
> Scripts. I was using "functions" in the generic sense here. I did
> realize before I sent it that the word had a dual meaning, but figured
> it wasn't an important enough distinction to go back and correct, or
> explain. Unfortunately, every time I decide to skip something like
> that, I get called on it, which doesn't help my posts get any shorter!
> =8^)
>
> >> I choose -Os, optimize for size, because a modern CPU and the
> >> various cache levels are FAR faster than main memory.
> >
> > Given that two CPUs differing only in L2 cache size have nearly the
> > same performance, I doubt that the performance increase is very big.
> > Some interesting figures:
> >
> > An Athlon64 (forgot which model, but it shouldn't matter anyway) with
> > 1 MB L2 cache is 4% faster than an Athlon64 of the same frequency
> > with only 512 kB L2 cache. The bigger the cache sizes you compare,
> > the smaller the performance increase. Since you run a dual Opteron
> > system with 1 MB L2 cache per CPU, I tend to say that the actual
> > performance increase you experience is about 3%. But then I didn't
> > take into account that -Os leaves out a few optimizations which would
> > be included by -O2, the default optimization level, which actually
> > makes the code a bit slower compared to -O2. So the performance
> > increase you really experience shrinks to about 0-2%. I'd tend to
> > proclaim that -O2 is even faster for most code, but that's only my
> > feeling.
>
> Interesting, indeed. I'd counter that it likely has to do with how many
> tasks are being juggled as well, plus the number of kernel/user context
> switches, of course. I wonder under what load, and with what task type,
> the above 4% difference was measured.
>
> Of course, the definitive way to end the argument would be to do some
> profiling and get some hard numbers, but I don't think either you or I
> consider it an important enough factor in our lives to go to /that/
> sort of trouble. <g>
>
> > Besides that, I should mention that -Os sometimes still has problems
> > with huge packages like glibc.
>
> Interestingly enough, while Gentoo's glibc ebuilds stripflags to -O2, I
> did try it with all that stripflags logic disabled. For glibc, -Os
> /does/ seem to slow things down, or did back with gcc-3.3 (IIRC)
> anyway. I tried the same glibc both ways. I would have tried tinkering
> further, but decided it wasn't worth complicating debugging and the
> like, since glibc is loaded by virtually everything, and I'd never be
> able to tell whether it was my funny tweaks to glibc or some actual
> issue with whatever package. Besides, that's an awful costly package,
> in terms of recompile time, not to mention system stability, to be
> experimenting with. I /can/ say, however, that it didn't crash or cause
> any other issues I could see or attribute to it.
>
> OTOH, I haven't tried it with xorg-modular yet, but the monolithic xorg
> builds seemed to perform better with -Os. I tried one of them (6.8??)
> both ways too. I ended up routinely killing the stripflags logic, but I
> was modifying other portions of the ebuild as well (so it compiled only
> the ATI video driver, and only installed the 100-dpi fonts, not 75-dpi,
> among other things), so that was just one of several modifications I
> was making, tho the only real performance-affecting one. Performance in
> X was better, but it DID take longer to switch to a VT when I tried
> that. In fact, at one point the switch-to-VT functionality broke, but
> someone mentioned it was broken in general at that point for certain
> drivers anyway, so I'm not sure my optimizations had anything to do
> with it.
>
> >> Of course, this is theory, and the practical case can and will
> >> differ depending on the instructions actually being compiled. In
> >> particular, streaming media apps and media encoding/decoding are
> >> likely to still benefit from the traditional loop-elimination style
> >> optimizations, because they run thru so much data already that cache
> >> is routinely trashed anyway, regardless of the size of your
> >> instructions. As well, that type of application tends to have a LOT
> >> of looping instructions to optimize!
> >>
> >> By contrast, something like the kernel will benefit more than usual
> >> from size optimization. First, it's always memory-locked and as such
> >> can't be swapped, and even "slow" main memory is still **MANY**
> >> **MANY** times faster than swap, so a smaller kernel means more
> >> other stuff fits into main memory with it, and isn't swapped as
> >> much. Second, parts of the
> >
> > Funny to hear this from somebody with 4 GB RAM in his system. I don't
> > know how bloated your kernel is, but even if -Os reduced the size of
> > my kernel to **half**, which is totally impossible, the savings
> > wouldn't be enough to load the mail I am just answering into RAM. So,
> > basically, this reasoning is just ridiculous.
>
> I won't argue with that. BTW, I'm still at a gig, much to my
> frustration! I put off upgrading memory when I decided my disk was in
> danger of going bad and I ended up deciding to go 4-disk SATA-based
> RAID. Then I upgraded my stereo near Christmas... Now the CC is almost
> paid off again, so I'm looking at that memory upgrade again.
>
> Much to my frustration, memory prices don't seem to be dropping much
> lately!
>
> > You are referring a lot to the gcc manpage, but obviously you missed
> > this part:
> >
> > -fomit-frame-pointer
> >     Don't keep the frame pointer in a register for functions that
> >     don't need one. This avoids the instructions to save, set up
> >     and restore frame pointers; it also makes an extra register
> >     available in many functions. It also makes debugging
> >     impossible on some machines.
> >
> >     On some machines, such as the VAX, this flag has no effect,
> >     because the standard calling sequence automatically handles
> >     the frame pointer and nothing is saved by pretending it
> >     doesn't exist. The machine-description macro
> >     "FRAME_POINTER_REQUIRED" controls whether a target machine
> >     supports this flag.
> >
> >     Enabled at levels -O, -O2, -O3, -Os.
> >
> > I have to say that I am a bit disappointed now. You seemed to be one
> > of those people who actually inform themselves before sticking new
> > flags into their CFLAGS.
>
> ??
>
> I'm not sure which way you mean this. It was in my CFLAGS list, but I
> didn't discuss it, as it's fairly common (from my observation, nearly
> as common as -pipe) and seems fairly non-controversial on Gentoo. Did
> you miss it in my CFLAGS and are saying I should be using it, or did
> you see it and are saying it's unnecessary and redundant because it's
> enabled by -Os?
>
> If the latter, yes, but as mentioned above in the context of glibc,
> -Os is sometimes stripped. In that case, the redundancy of having the
> basic -fomit-frame-pointer is useful, unless it's also stripped, but
> as I said, it seems much less controversial than some flags and is
> often specifically allowed where most are stripped.
>
> Or are you saying I should avoid it due to the debugging implications?
> I don't quite get it.
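For context, the kind of make.conf CFLAGS line under discussion would
look roughly like this (the values are illustrative assumptions, not
anyone's actual settings):

```
# /etc/make.conf (illustrative values only)
CFLAGS="-Os -pipe -fomit-frame-pointer"
CXXFLAGS="${CFLAGS}"
MAKEOPTS="-j4"
```

With -Os in place, the explicit -fomit-frame-pointer is redundant per
the manpage excerpt quoted above, but it survives if an ebuild strips
the -Os.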
>
> >> !!! Relying on the shell to locate gcc, this may break
> >> !!! DISTCC, installing gcc-config and setting your current gcc
> >> !!! profile will fix this
> >>
> >> Another warning, likewise to stderr and thus not in the eis output.
> >> This one is due to the fact that eselect, the eventual systemwide
> >> replacement for gcc-config and a number of other commands, uses a
> >> different method to set the compiler than gcc-config did, and
> >> portage hasn't been adjusted to full compatibility just yet. Portage
> >> finds the proper gcc just fine for itself, but there'd be problems
> >> if distcc were involved, thus the warning.
> >
> > Didn't know about this. Have you filed a bug on the topic yet? Or is
> > there already one?
>
> There is one. I don't recall whether I filed it or it was already
> there, but both JH and the portage folks know about the issue. IIRC,
> the portage folks decided it was their side that needed changing, but
> that required changes to the distcc package, and I don't know how that
> has gone, since I don't use distcc, except that I was slightly
> surprised to still see the warning in portage 2.1.
>
> >> MAKEOPTS="-j4"
> >>
> >> The four jobs is nice for a dual-CPU system -- when it works.
> >> Unfortunately, the unpack and configure steps are serialized, so the
> >> jobs option does little good there. To make the most efficient use
> >> of the available cycles when I have a lot to merge, therefore, I'll
> >> run as many as five merges in parallel. I do this quite regularly
> >> with KDE upgrades like the one to 3.5.1, where I use the split KDE
> >> ebuilds and have something north of 100 packages to merge before KDE
> >> is fully upgraded.
> >
> > I really wonder how you would parallelize unpacking and configuring a
> > package.
>
> That's what was nice about configcache, which was supposed to be in the
> next portage, but I haven't seen or heard anything about it for a
> while, and the next portage, 2.1, is what I'm using. configcache
> seriously shortened that stage of the build, leaving more of it
> parallelized, but...
>
> I was using it for a while, patching successive versions of portage,
> but it broke about the time sandbox split off; the dev said he wasn't
> maintaining the old version since it was going into the new portage,
> and I tried updating the patch but eventually ran into what I think
> were unrelated issues, dropped it in one of my troubleshooting steps,
> and never picked it up again.
>
> I'd certainly like to have it back, tho. If it's working in 2.1, I've
> not seen it documented or seen any hints in the emerge output, as there
> were before. Have you seen or heard anything?
>
> BTW, what is your opinion on -ftracer? Several devs I've noticed use
> it, but the manpage says it's not that useful without active profiling,
> which means compiling, profiling, and recompiling, AFAIK. It's possible
> the devs running it do that, but I doubt it, and otherwise, I don't see
> how it should be that useful. I don't know if you run it, but since
> I've got your attention, I thought I'd ask what you think about it. Is
> there something of significance I'm missing, or are they, or are they
> actually doing that compile/profile/recompile thing? It just doesn't
> make sense to me. I've seen it in several user-posted CFLAGS as well,
> but I'll bet a good portion of them are there simply because they saw
> it in a dev's CFLAGS and decided it looked useful, not because they
> understand any implications stated in the manpage. (Not that I always
> do either, but... <g>)
>
> --
> Duncan - List replies preferred. No HTML msgs.
> "Every nonfree program has a lord, a master --
> and if you use the program, he is your master." Richard Stallman in
> http://www.linuxdevcenter.com/pub/a/linux/2004/12/22/rms_interview.html
--
gentoo-amd64@g.o mailing list