Re: [gentoo-amd64] Re: Re: Wow! KDE 3.5.1 & Xorg 7.0 w/ Composite - gentoo-amd64

From:	Simon Stelling <blubb@g.o>
To:	gentoo-amd64@l.g.o
Subject:	Re: [gentoo-amd64] Re: Re: Wow! KDE 3.5.1 & Xorg 7.0 w/ Composite
Date:	Wed, 08 Feb 2006 20:39:13
Message-Id:	`43EA568D.6020307@gentoo.org`
In Reply to:	[gentoo-amd64] Re: Re: Wow! KDE 3.5.1 & Xorg 7.0 w/ Composite by Duncan <1i5t5.duncan@cox.net>

1

Duncan wrote:

2

>>Nice. Now let us know your CFLAGS, and what toolchain versions you're

3

>>running :D

4

>

5

>

6

> You probably didn't notice, as I had it commented out on the main index

7

> page as I don't have the page created to actually list them yet, but if

8

> you viewed source, you'd have seen I have a techspecs page link commented

9

> out, that'll get that sort of info, when/if I actually get it created.

10

>

11

> However, since you asked, your answer, and a bit more, by way of

12

> explanation...

13

>

14

> I should really create a page listing all the little Gentoo admin scripts

15

> I've come up with and how I use them.  I'm sure a few folks anyway would

16

> likely find them useful.

17

>

18

> The idea behind most of them is to create shortcuts to having to type in

19

> long emerge lines, with all sorts of arbitrary command line parameters.

20

> The majority of these fall into two categories, ea* and ep*, short for

21

> emerge --ask <additional parameters> and emerge --pretend ... .  Thus, I

22

> have epworld and eaworld, the pretend and ask versions of emerge -NuDv

23

> world, epsys and easys, the same for system, eplog <package>, emerge

24

> --pretend --log --verbose (package name to be added to the command line so

25

> eplog gcc, for instance, to see the changes between my current and the new

26

> version of gcc), eptree <package>, to use the tree output, etc.

27

28

Interesting. But why do you use scripts and not simple aliases? Every time you

29

launch your script the HD performs a seek (which is very expensive in time),

30

copies the script into memory and then forks a whole bash process to execute a

31

one-liner. Using alias, which is a bash built-in, wouldn't fork a process and

32

therefore be much faster.

33

34

(see man alias for examples)

35

36

> One thing I've found is that I'll often epworld or eptreeworld, then

37

> emerge the individual packages, rather than use eaworld to do it.  That

38

> way, I can do them in the order I want or do several at a time if I want

39

> to make use of both CPUs.  Because I always use --deep, as I want to keep

40

> my dependencies updated as well, I'm very often merging specific

41

> dependencies.  There's a small problem with that, however --oneshot, which

42

> I'll always want to use with dependencies to help keep my world file

43

> uncluttered, has no short form, but I use it as the default!  OTOH, the

44

45

man emerge:

46

       --oneshot (-1)

47

48

IIRC --oneshot has a short form since 2.0.52 was released.

49

50

> normal portage mode of adding stuff listed on the command line to the

51

> world file, I don't want very often, as most of the time I'm simply

52

> updating what I have, so it's all in the world file if it needs to be

53

> there already anyway.  Not a problem! All my regular ea* scriptlets use

54

> --oneshot, so it /is/ my default.  If I *AM* merging something new that I

55

> want added to my world file, I have another family of ea* scriptlets that

56

> do that -- all ending in "2", as in, "NOT --oneshot".  Thus, I have a

57

> family of ea*2 scriptlets.

58

>

59

> The regulars here already know one of my favorite portage features is

60

> FEATURES=buildpkg, which I have set in make.conf.  That of course gives me

61

> a collection of binary versions  of packages I've already emerged, so I

62

> can quickly revert to an old version for testing something, if I want,

63

> then remerge the new version once I've tested the old version to see if it

64

> has the same bug I'm working on or not.  To aid in this, I have a

65

> collection of eppak and eapak scriptlets.  Again, the portage default of

66

> --usepackage (-k) doesn't fit my default needs, as  if I'm using a binpkg,

67

> I usually want to ONLY use a binpkg, NOT merge from source if the package

68

> isn't available.  That happens to be -K in short-form. However, it's my

69

> default, so eapak invokes the -K version.  I therefore have eapaK to

70

> invoke the -k version if I don't really care whether it goes from binpkg

71

> or source.

72

>

73

> Of course, there are various permutations of the above as well, so I have

74

> eapak2 and eapaK2, as well as eapak and eapaK.  For the ep* versions, of

75

> course the --oneshot doesn't make a difference, so I only have eppak and

76

> eppaK, no eppa?2 scriptlets.

77

>

78

> ...  Deep breath... <g>

79

>

80

> All that as a preliminary explanation to this:  Along with the above, I

81

> have a set of efetch functions, that invoke the -f form, so just do the

82

> fetch, not the actual compile and merge, and esyn (there's already an

83

> esync function in something or other I have merged so I just call it

84

> esyn), which does emerge sync, then updates the esearch db, then

85

> automatically fetches all the packages that an eaworld would want to

86

> update, so they are ready for me to merge at my leisure.

87

88

I'm a bit confused now. You use *functions* to do that? Or do you mean scripts?

89

By the way: with alias you could name your custom "script" esync because it

90

doesn't place a file on the harddisk.

91

92

> Likewise, and the real reason for this whole explanation, I /had/ an

93

> "einfo" scriptlet that simply ran "emerge info".  This can be very handy

94

> to run, if like me, you have several slotted versions of gcc merged, and

95

> you sometimes forget which one you have eselected or gcc-configed as the

96

> one portage will use.  Likewise, it's useful for checking on CFLAGS (or

97

> CXXFLAGS OR LDFLAGS or...), if you modified them from the normal ones

98

> because a particular package wasn't cooperating, and you want to see if

99

> you remembered to switch them back or not.

100

>

101

> However, I ran into a problem.  The output of einfo was too long to

102

> quickly find the most useful info -- the stuff I most often change and

103

> therefore most often am looking for.

104

>

105

> No sweat!  I shortened my original "einfo" to simply "ei", and added a

106

> second script, "eis" (for einfo short), that simply piped the output of

107

> the usual emerge info into a grep that only returned the lines I most

108

> often need -- the big title one with gcc and similar info, CFLAGS,

109

> CXXFLAGS, LDFLAGS, and FEATURES.  USE would also be useful, but it's too

110

> long even by itself to be searched at a glance, so if I want it, I simply

111

> run ei and look for what I want in the longer output.

112

113

Impressive.

114

115

> ...  Another deep breath... <g>

116

>

117

> OK, with that as a preliminary, you should be able to understand the

118

> following:

119

>

120

> $eis

121

>

122

> Portage 2.1_pre4-r1 (default-linux/amd64/2006.0, gcc-4.1.0-beta20060127,

123

> glibc-2.3.6-r2, 2.6.15 x86_64)

124

>

125

> CFLAGS="-march=k8 -Os -pipe -fomit-frame-pointer -frename-registers

126

> -funit-at-a-time -fweb -freorder-blocks-and-partition

127

> -fmerge-all-constants"

128

>

129

> CXXFLAGS="-march=k8 -Os -pipe -fomit-frame-pointer -frename-registers

130

> -funit-at-a-time -fweb -freorder-blocks-and-partition

131

> -fmerge-all-constants"

132

>

133

> FEATURES="autoconfig buildpkg candy ccache confcache distlocks

134

> multilib-strict parallel-fetch sandbox sfperms strict userfetch"

135

>

136

> LDFLAGS="-Wl,-z,now"

137

>

138

> MAKEOPTS="-j4"

139

>

140

> To make sense of that...

141

>

142

> * The portage and glibc versions are ~amd64, as set in make.conf for the

143

> system in general.

144

>

145

> * CFLAGS:

146

>

147

> I choose -Os, optimize for size, because a modern CPU and the various

148

> cache levels are FAR faster than main memory.  This difference is

149

> frequently severe enough that it's actually more efficient to optimize for

150

> size than for CPU performance, because the result is smaller code that

151

> maintains cache locality (stays in fast cache) far better, and the CPU

152

> saves more time that it would otherwise be spending idle, waiting for data

153

> to come in from slower more distant memory, than the actual cost of the

154

> loss of cycle efficiency that's often the tradeoff for small code.

155

156

Given the fact that two CPUs, only differing in L2 Cache size, have nearly the

157

same performance, I doubt that the performance increase is very big. Some

158

interesting figures:

159

160

Athlon64 something (forgot what, but shouldn't matter anyway) with 1 MB L2-cache

161

 is 4% faster than an Athlon64 of the same frequency but with only 512kB

162

L2-cache. The bigger the cache sizes you compare get, the smaller the

163

performance increase. Since you run a dual Opteron system with 1 MB L2 cache per

164

CPU I tend to say that the actual performance increase you experience is about

165

3%. But then I didn't take into account that -Os leaves out a few optimizations

166

which would be included by -O2, the default optimization level, which actually

167

makes the code a bit slower when compared to -O2. So, the performance increase

168

you really experience shrinks to about 0-2%. I'd tend to proclaim that -O2 is

169

even faster for most of the code, but that's only my feeling.

170

171

Beside that I should mention that -Os sometimes still has problems with huge

172

packages like glibc.

173

174

> Back when memory operated at or near the speed of the CPU, avoiding the

175

> loop, even at the expense of three-times the code, was  often faster.

176

> Today, where CPUs do several calculations in the time it takes to fetch

177

> data from main memory, it's generally faster to go for the smaller code,

178

> as it will be far more likely to still be in fast cache, avoiding that

179

> long wait for main memory, even if it /does/ mean wasting a couple

180

> additional cycles doing the expensive jump back to the top of the loop.

181

182

Not only CPUs got faster, but also caches got bigger. Comparing my old P4 with

183

1.7 GHz and 256kb L2 cache to a P4 with 3.4 GHz (frequency doubled) which has 1

184

MB L2 cache (cache quadrupled) shows that the proportions changed. Bigger cache

185

of course means that you can have larger chunks of code there, so unrolling

186

loops with fixed iterations actually might perform better.

187

188

> Of course, this is theory, and the practical case can and will differ

189

> depending on the instructions actually being compiled.  In particular,

190

> streaming media apps and media encoding/decoding are likely to still

191

> benefit from the traditional loop elimination style optimizations, because

192

> they run thru so much data already, that cache is routinely trashed

193

> anyway, regardless of the size of your instructions.  As well, that type

194

> of application tends to have a LOT of looping instructions to optimize!

195

>

196

> By contrast, something like the kernel will benefit more than usual from

197

> size optimization.  First, it's always memory locked and as such

198

> can't be swapped, and even "slow" main memory is still **MANY** **MANY**

199

> times faster than swap, so a smaller kernel means more other stuff fits

200

> into main memory with it, and isn't swapped as much.  Second, parts of the

201

202

Funny to hear this from somebody with 4 GB RAM in his system. I don't know how

203

bloated your kernel is, but even if -Os would reduce the size of my kernel to

204

**the half**, which is totally impossible, it wouldn't be enough to load the

205

mail I am just answering into RAM. So, basically, this reasoning is just ridiculous.

206

207

> kernel such as task scheduling are executed VERY often, either because

208

> they are frequently executed by most processes, or because they /control/

209

> those processes.  The smaller these are, the more likely they are to still

210

> be in cache when next used.  Likewise, the smaller they are, the less

211

> potentially still useful other data gets flushed out of cache to make room

212

> for the kernel code executing at the moment.  Third, while there's a lot

213

> of kernel code that will loop, and a lot that's essentially streaming, the

214

> kernel as a whole is a pretty good mix of code and thus won't benefit as

215

> much from loop optimizations and the like, as compared to special purpose

216

> code like the media codec and streaming applications above.

217

>

218

> The differences are marked enough and now demonstrated enough that a

219

> kernel config option to optimize for size was added I believe about a year

220

> ago.  Evidently, that lead to even MORE demonstration, as the option was

221

> originally in the obscure embedded optimizations corner of the config,

222

> where few would notice or use it, and they upgraded it into a main option.

223

> In fact, where a year or two ago, the option didn't even exist, now I

224

> believe it defaults to yes/on/do-optimize-for-size (altho it's possible

225

> I'm incorrect on the last and it's not yet the default).

226

227

It is not. The option you are talking about is called

228

CONFIG_CC_OPTIMIZE_FOR_SIZE and is defined nowhere, so that the 'ifdef

229

CONFIG_CC_OPTIMIZE_FOR_SIZE' will result in no by default and therefore set -O2

230

as default.

231

232

> According to the gcc manpage, -frename-registers causes gcc to attempt to

233

> make use of registers left over after normal register allocation.  This is

234

> particularly beneficial on archs that have many registers (keeping in

235

> mind that "registers" are what amounts to L0 cache, the fastest possible

236

> memory because the CPU accesses registers directly and they operate at

237

> full CPU speed.  Unfortunately, registers are also very limited, making

238

> them an EXCEEDINGLY valuable resource! Note that while x86-32 is noted

239

> for its relative /lack/ of registers, AMD basically doubled the number of

240

> registers available to 64-bit code in its x86-64 aka AMD64 spec.  Thus,

241

> while this option wouldn't be of particular benefit on x86, on amd64, it

242

> can, depending on the code of course, provide some rather serious

243

> optimization!

244

>

245

> -fweb is a register use optimizer function as well.  It tells gcc to

246

> create a /web/ of dependencies and assign each individual dependency web

247

> to its own pseudo-register.  Thus, when it comes time for gcc to allocate

248

> registers, it already has a list of the best candidates lined up and ready

249

> to go.  Combined with -frename register to tell gcc to efficiently make

250

> use of any registers left over after the the first pass, and due to the

251

> number of registers available in 64-bit mode on our arch, this can allow

252

> some seriously powerful optimizations.  Still, a couple of things to note

253

> about it.  One, -fweb (and -frename-registers as well) can cause data to

254

> move out of its "home" register, which seriously complicates debugging, if

255

> you are a programmer or power-user enough to worry about such things.

256

> Two, the rewrite for gcc 4.0 significantly modified the functionality of

257

> -fweb, and it wasn't recommended for 4.0 as it didn't yet work as well as

258

> expected or as it did with gcc 3.x.  For gcc 4.1, -fweb is apparently back

259

> to its traditional strength.  Those Gentoo users having gcc 3.4, 4.0, and

260

> 4.1, all three in separate slots, will want to note this as they change

261

> gcc-configuratiions, and modify it accordingly.  Yes, this *IS* one of the

262

> reasons my CFLAGS change so frequently!

263

>

264

> -funit-at-a-time tells gcc to consider a full logical unit, perhaps

265

> consisting of several source files rather than just one, as a whole, when

266

> it does its compiling.  Of course, this allows gcc to make

267

> optimizations it couldn't see if it wasn't looking at the larger picture

268

> as a whole, but it requires rather more memory, to hold the entire unit

269

> so it can consider it at once. This is a fairly new flag, introduced with

270

> gcc 3.3 IIRC.  While the idea is simple enough and shouldn't lead to any

271

> bugs on its own, there WERE a number of initially never encountered bugs

272

> in various code that this flag exposed, when GCC made optimizations on the

273

> entire unit that it wouldn't otherwise make, thereby triggering bugs that

274

> had never been triggered before.  I /believe/ this was the root reason why

275

> the Gentoo amd64 technotes originally discouraged use of -Os, back with

276

> the first introduction of this flag in gcc 3.2 hammer (amd64) edition, as

277

> -funit-at-a-time was activated by -Os at that time, and -Os was known to

278

> produce bad code at the time, on amd64, with packages like portions of

279

> KDE.  The gcc 4.1.0 manpage now says it's enabled by default at -O2 and

280

> -O3, but doesn't mention -Os.  Whether that's an omission, or whether they

281

> decided it shouldn't be enabled by -Os for some reason, I'm not sure, but

282

> I use them both to be sure and haven't had any issues I can trace to this

283

> (not even back when the technotes recommended against -Os, and said KDE

284

> was supposed to have trouble with it -- maybe it was parts of KDE I never

285

> merged, or maybe I was just lucky, but I've simply never had an issue with

286

> it).

287

>

288

> -freorder-blocks-and-partition is new for gcc 4.0, I believe, alto I

289

> didn't discover it until I was reading the 4.1-beta manpage.  I KNOW gcc

290

> 3.4.4 fails out with it, saying unrecognized flag or some such, so it's

291

> another of those flags that cause my CFLAGS to be constantly changing, as

292

> I switch between gcc versions.  This flag won't work under all conditions,

293

> according to the manpage, so is automatically disabled in the presence of

294

> exception handling, and a few other situations named in the manpage.  It

295

> causes a lot of warnings too, to the effect that it's being disabled due

296

> to X reason.  There's a similar -freorder-blocks flag, which optimizes by

297

> reordering blocks in a function to "reduce number of taken branches and

298

> improve code locality."  In English, what that means is that it breaks

299

> caching less often.  Again, caching is *EXTREMELY* performance critical,

300

> so anything that breaks it less often is CERTAINLY welcome!  The

301

> -and-partition increases the effect, by separating the code into

302

> frequently used and less frequently used partitions.  This keeps the most

303

> frequently used code all together, therefore keeping it in cache far more

304

> efficiently, since the less used code won't be constantly pulled in,

305

> forcing out frequently used code in the process.

306

>

307

> Hmm... As I'm writing and thinking about this, the probability that

308

> sticking the regular -freorder-blocks option in CFLAGS as well would be a

309

> wise thing, occurs to me.  The non-partition version isn't as efficient as

310

> the partition version, and would be redundant if the partitioned version

311

> is in effect.  However, the non-partitioned version doesn't have the same

312

> sorts of no-exceptions-handler and similar restrictions, so having it in

313

> the list, first, so the partitioned version overrides it where it can be

314

> used, should be a good idea.  That way, where the partitioned version can

315

> be used, it will be, but where it can't, gcc will still use the

316

> non-partitioned version of the option, so I'll still get /some/ of the

317

> optimizations!  I (re)compiled major portions of xorg (modular), qt, and

318

> the new kde 3.5.1 with the partitioned option, however, and it works, and

319

> I haven't tested having both options in there yet, so I'm not sure it'll

320

> work as the theory suggests it should, so some caution might be advised.

321

>

322

> -fmerge-all-constants COULD be dangerous with SOME code, as it breaks part

323

> of the C/C++ specification.  However, it should be fine for most code

324

> written to be compiled with gcc, and I've seen no problems /yet/ tho both

325

> this and the reorder-and-partition flag above are fairly new to my CFLAGS,

326

> so haven't been as extensively personally tested as the others have been.

327

> If something seems to be breaking when this is in your CFLAGS, certainly

328

> it's the first thing I'd try pulling out.  What it actually does is merge

329

> all constants with the same value into the same one.  gcc has a weaker

330

> -fmerge-constants version that's enabled with any -O option at all (thus

331

> at -O, -O2, -O3, AND -Os), that merges all declared constants of the same

332

> value, which is safe and doesn't conflict with the C/C++ spec.  What the

333

> /all/ specifier in there does, however, is cause gcc to merge declared

334

> variables where the value actually never changes, so they are in effect

335

> constants, altho they are declared as variables, with other constants of

336

> the same value.  This /should/ be safe, /provided/ gcc isn't failing to

337

> detect a variable chance somewhere, but it conflicts with the C/C++ spec,

338

> according to the gcc manpage, and thus /could/ cause issues, if the

339

> developer pulls certain tricks that gcc wouldn't detect, or possibly more

340

> likely, if used with code compiled by a different compiler (say

341

> binary-only applications you may run, which may not have been compiled

342

> with gcc).  There are two reasons why I choose to use it despite the

343

> possible risks.  One, I want /small/ code, again, because small code fits

344

> in that all-important cache better and therefore runs faster, and

345

> obviously, two or more merged constants aren't going to take the space

346

> they would if gcc stored them separately.  Two, the risks aren't as bad if

347

> you aren't running non-gcc compiled code anyway, and since I'm a strong

348

> believer in Software Libre, if it's binary-only, there's very little

349

> chance I'll want or risk it on my box, and everything I do run is gcc

350

> compiled anyway, so should be generally safe.  Still, I know there may be

351

> instances where I'll have to recompile with the flag turned off, and am

352

> prepared to deal with them when they happen, or I'd not have the flag in

353

> my CFLAGS.

354

355

You are referring a lot to the gcc manpage, but obviously you missed this part:

356

357

       -fomit-frame-pointer

358

           Don't keep the frame pointer in a register for functions that don't

359

           need one.  This avoids the instructions to save, set up and restore

360

           frame pointers; it also makes an extra register available in many

361

           functions.  It also makes debugging impossible on some machines.

362

363

           On some machines, such as the VAX, this flag has no effect, because

364

           the standard calling sequence automatically handles the frame

365

           pointer and nothing is saved by pretending it doesn't exist.  The

366

           machine-description macro "FRAME_POINTER_REQUIRED" controls whether

367

           a target machine supports this flag.

368

369

           Enabled at levels -O, -O2, -O3, -Os.

370

371

I have to say that I am a bit disappointed now. You seemed to be one of those

372

people who actually inform themselves before sticking new flags into their CFLAGS.

373

374

> And, here's some selected output from ei, interspersed with explanations,

375

> since I'm editing the output anyway:

376

>

377

> $ei

378

> !!! Failed to change nice value to '-2'

379

> !!! [Errno 13] Permission denied

380

>

381

> This is stderr output.  It's not in the eis output above because I

382

> redirect stderr to /dev/null for it, as I know the reason for the error

383

> and am trying to be brief.

384

>

385

> The warning is because I'm using PORTAGE_NICENESS=-2 in make.conf.  It has

386

> a negative nice set there to encourage portage to make fuller use of the

387

> dual CPUs under-X/from-a-konsole-session, as X and the kernel do some

388

> dynamic scheduling magic to keep X more responsive without having to up

389

> /its/ priority.  The practical effect of that "magic" is to lower the

390

> priorities of everything besides X slightly, when X is running.  This

391

> /does/ have the intended effect of keeping X more responsive, but the cost

392

> as observed here is that emerges take longer than they should when X is

393

> running, because the scheduler is leaving a bit of extra idle CPU time to

394

> keep X responsive.  In many cases, I'd rather be using maximum CPU and get

395

> the merges done faster, even if X drags a bit in the mean time, and the

396

> slightly negative niceness for portage accomplishes exactly that.

397

>

398

> It's reporting a warning (to stderr) here, as I ran the command as a

399

> regular non-root user, and non-root can't set negative priorities for

400

> obvious system security reasons.  I get the same warning with my ep*

401

> commands, which I normally run as a regular user, as well.  The ea*

402

> commands which actually do the merging get run as root, naturally, so the

403

> niceness /can/ be set negative when it counts, during a real emerge.

404

>

405

> So... nothing of any real matter, then.

406

>

407

>

408

> !!! Relying on the shell to locate gcc, this may break

409

> !!! DISTCC, installing gcc-config and setting your current gcc

410

> !!! profile will fix this

411

>

412

> Another warning, likewise to stderr and thus not in the eis output.  This

413

> one is due to the fact that eselect, the eventual systemwide replacement

414

> for gcc-config and a number of other commands, uses a different method to

415

> set the compiler than gcc-config did, and portage hasn't been adjusted to

416

> full compatibility just yet.  Portage finds the proper gcc just fine for

417

> itself, but there'd be problems if distcc was involved, thus the warning.

418

419

Didn't know about this. Have you filed a bug yet on the topic? Or is there

420

already one?

421

422

> Again, I'm aware of the situation and the cause, but don't use distcc, so

423

> it's nothing I have to worry about, and I can safely ignore the warning.

424

>

425

> I kept the warnings here, as I find them and the explanation behind them

426

> interesting elements of my Gentoo environment, thus worth posting, for

427

> others who seem interested in my Gentoo environment as well.  If nothing

428

> else, the explanations should help some in my audience understand that bit

429

> more about how their system operates, even if they don't get these

430

> warnings.

431

432

Indeed.

433

434

> Portage 2.1_pre4-r1 (default-linux/amd64/2006.0, gcc-4.1.0-beta20060127,

435

> glibc-2.3.6-r2, 2.6.15 x86_64)

436

> =================================================================

437

> System uname: 2.6.15 x86_64 AMD Opteron(tm) Processor 242

438

> Gentoo Base System version 1.12.0_pre15

439

>

440

> Those of you running stable amd64, but wondering where baselayout is for

441

> unstable, there you have it!

442

>

443

> ccache version 2.4 [enabled]

444

> dev-lang/python:   2.4.2

445

> sys-apps/sandbox:    1.2.17

446

> sys-devel/autoconf:  2.13, 2.59-r7

447

> sys-devel/automake:  1.4_p6, 1.5, 1.6.3, 1.7.9-r1, 1.8.5-r3, 1.9.6-r1

448

> sys-devel/binutils:  2.16.91.0.1

449

> sys-devel/libtool:   1.5.22

450

> virtual/os-headers:  2.6.11-r3

451

>

452

> ACCEPT_KEYWORDS="amd64 ~amd64"

453

>

454

> Same for the above portions of my toolchain.  AFAIR, it's all ~amd64,

455

> altho I was running a still-masked binutils for awhile shortly after

456

> gcc-4.0 was released (still-masked on Gentoo as well), as it required the

457

> newer binutils.

458

>

459

> LANG="en_US"

460

> LDFLAGS="-Wl,-z,now"

461

>

462

> Some of you may have noticed the occasional Portage warning about a SETUID

463

> executables using lazy bindings, and the potential security issue that

464

> causes. This setting for LDFLAGS forces early bindings with all

465

> dynamically linked libraries.  Normally it'd only be necessary or

466

> recommended for SETUID executables, and set in the ebuild where it's safe

467

> to do so, but I use it by default, for several reasons.  The effect is

468

> that a program takes a bit longer to load initially, but won't have to

469

> pause to resolve late bindings as they are needed.  You're trading waiting

470

> at executable initialization for waiting at some other point.  With a gig

471

472

Note that depending on how many functions of a library/application you really

473

use when running it it might give you a bigger or smaller drawback.

474

475

> of memory, I find most stuff I run more than once is at least partially

476

> still in cache on the second and later launches, and with my system, I

477

> don't normally find the initial wait irritating, and sometimes find a

478

> pause after I'm working with a program especially so, so I prefer to have

479

> everything resolved and loaded at executable launch.  Additionally, with

480

> lazy bindings, I've had programs start just fine, then fail later when

481

> they need to resolve some function that for some reason won't resolve in

482

> whatever library it's supposed to be coming from.  I don't like have the

483

> thing fail and interrupt me in the middle of a task, and find it far less

484

> frustrating, if it's going to fail when it tries to load something, to

485

> have it do so at launch.  Because early bindings forces resolution of

486

> functions at launch, if it's going to fail loading one, it'll fail at

487

> launch, rather than after I've started working with the program.  That's

488

> /exactly/ how I want it, so that's why I run the above LDFLAGS setting.

489

> It's nice not to have to worry about the security issue, but SETUID type

490

> security isn't as critical on my single-human-user system, where that

491

> single-user-is me and  I already have root when I want it anyway, as it'd

492

> be in a multi-user system, particularly a public server, so the other

493

> reasons are more important than security, for me, on this.  They just

494

> happen to coincide, so I'm a happy camper. =8^)

495

>

496

> The caveat with these LDFLAGS, however, is the rare case where there's a

497

> circular functional dependency that's normally self-resolving,   Modular

498

> xorg triggers one such case, where the monolithic xorg didn't.  There are

499

> three individual ebuilds related to modular xorg that I have to remove

500

> these LDFLAGS for or they won't work.  xorg-server is one.

501

> xf86-vidio-ati, my video driver, is another.  libdri was the third, IIRC.

502

> There's a specific order they have to be compiled in, as well. If they are

503

> compiled with this enabled, they, and consequently X, refuses to load (tho

504

> X will load without DRI, if that's the only one, it'll just protest in the

505

> log and DRI and glx aren't available).  Evidently there's a non-critical

506

> fourth module somewhere, that still won't load properly due to an

507

> unresolved symbol, that I need to track down and remerge without these

508

> LDFLAGS, and that's what's keeping GLX from loading on my current system,

509

> as mentioned in an earlier post.

510

>

511

> LINGUAS="en"

512

> MAKEOPTS="-j4"

513

>

514

> The four jobs is nice for a dual-CPU system -- when it works.

515

> Unfortunately, the unpack and configure steps are serialized, so the jobs

516

> option does little good, there.  To make most efficient use of the

517

> available cycles when I have a lot to merge, therefore, I'll run as many

518

> as five merges in parallel.  I do this quite regularly with KDE upgrades

519

> like the one to 3.5.1, where I use the split KDE ebuilds and have

520

> something north of 100 packages to merge before KDE is fully upgraded.

521

522

I really wonder how you would paralellize unpacking and configuring a package.

523

524

> I mentioned above that I often run eptree, then ea individual packages

525

> from the list.  This is how I accomplish the five merges in parallel.

526

> I'll take a look at the tree output to check the dependencies, and merge

527

> the packages first that have several dependencies, but only where those

528

> dependencies aren't stepping on each other, thus keeping the parallel

529

> emerges from interfering with each other, because each one is doing its

530

> own dependencies, that aren't dependencies of any of the others.  After I

531

> get as many of those going as I can, I'll start listing 3-5 individual

532

> packages without deps on the same ea command line.  By the time I've

533

> gotten the fifth one started, one of the other sessions has usually

534

> finished or is close to it, so I can start it merging the next set of

535

> packages.  With five merge sessions in parallel, I'm normally running an

536

> average load of 5 to 9, meaning that many applications are ready for CPU

537

> scheduling time at any instant, on average.  If the load drops below four,

538

> there's proobably idle CPU cycles being wasted that could otherwise be

539

> compiling stuff, as each CPU needs at least one load-point to stay busy,

540

> plus usually can schedule a second one for some cycles as well, while the

541

> first is waiting for the hard drive or whatever.

542

>

543

> (Note that I'm running a four-drive RAID, RAID-6, so two-way striped, for

544

> my main system, Raid-0, so 4-way striped, for $PORTAGE_TMPDIR, so hard

545

> drive latency isn't /nearly/ as high as it would be on a single-hard-drive

546

> system.  Of course, running five merges in parallel /does/ increase disk

547

> latency some as well, but it /does/ seem to keep my load-average in the

548

> target zone and my idle cycles to a minimum, during the merge period.

549

> Also note that I've only recently added the PORTAGE_NICENESS value above,

550

> and haven't gotten it fully tweaked to the best balance between

551

> interactivity and emerge speed just yet, but from observations so far,

552

> with the niceness value set, I'll be able to keep the system busy with

553

> "only" 3-4 parallel merges, rather than the 5 I had been having to run to

554

> keep the system most efficiently occupied when I had a lot to merge.)

555

>

556

> PKGDIR="/pkg"

557

> PORTAGE_TMPDIR="/tmp"

558

> PORTDIR="/p"

559

> PORTDIR_OVERLAY="/l/p"

560

>

561

> Here you can see some of my path customization.

562

>

563

> USE="amd64 7zip X a52

564

> aac acpi alsa apm arts asf audiofile avi bash-completion berkdb

565

> bitmap-fonts bzip2 caps cdparanoia cdr crypt css cups curl dga divx4linux

566

> dlloader dri dts dv dvd dvdr dvdread eds emboss encode extrafilters fam

567

> fame ffmpeg flac font-server foomaticdb gdbm gif glibc-omitfp gpm

568

> gstreamer gtk2 idn imagemagick imlib ithreads jp2 jpeg jpeg2k kde

569

> kdeenablefinal lcms libwww linuxthreads-tls lm_sensors logitech-mouse

570

> logrotate lzo lzw lzw-tiff mad maildir mikmod mjpeg mng motif mozilla mp3

571

> mpeg ncurses network no-old-linux nolvm1 nomirrors nptl nptlonly offensive

572

> ogg opengl oss pam pcre pdflib perl pic png ppds python qt quicktime

573

> radeon readline scanner slang speex spell ssl tcltk theora threads tiff

574

> truetype truetype-fonts type1 type1-fonts usb userlocales vcd vorbis

575

> xcomposite xine xinerama xml2 xmms xosd xpm xrandr xv xvid yv12 zlib

576

> elibc_glibc input_devices_keyboard input_devices_mouse kernel_linux

577

> linguas_en userland_GNU video_cards_ati"

578

>

579

> My USE flags, FWTAR (for what they are worth).  Of particular interest are

580

> the input_devices_mouse and keyboard, and video_cards_ati.  These come

581

> from variables (INPUT_DEVICES and VIDEO_CARDS) set in make.conf, and used

582

> in the new xorg-modular ebuilds.  These and the others listed after zlib

583

> are referred to by Gentoo devs as USE_EXPAND.  Effectively, they are USE

584

> flags in the form of variables, setup that way because there are rather

585

> many possible values for those variables, too many to work as USE flags.

586

> The LINGUAS and LANG USE_EXPAND variables are prime examples.  Consider

587

> how many different languages there are and that were used and documented

588

> as regular USE flags, it would have to be in use.local.desc, because few

589

> supporting packages would offer the same choices, so each would have to be

590

> listed separately for each package.  Talk about the number of USE flags

591

> quickly getting out of control!

592

>

593

> Unset:  ASFLAGS, CTARGET, EMERGE_DEFAULT_OPTS, LC_ALL

594

>

595

> OK, some loose ends to wrapup, and I'm done.

596

>

597

> re: gcc versions:  The plan is for gcc-4.0 to go ~arch fairly soon, now.

598

> The devs are actively asking for bug reports involving it, now, so as many

599

> as possible can be resolved before it goes ~arch.  (Formerly, they were

600

> recommending that bugs be filed upstream, and not with Gentoo unless there

601

> was a patch attached, as it was considered entirely unsupported, just

602

> there for those that wanted it anyway.)  At this point, nearly everything

603

> should compile just fine with 4.0.

604

>

605

> That said, Gentoo has slotted gcc for a reason.  It's possible to have

606

> multiple minor versions (3.3, 3.4, 4.0, 4.1) merged at the same time.

607

> With USE=multislot, that's actually microversion (4.0.0, 4.0.1, 4.0.2...).

608

> Using either gcc-config or eselect compiler, and discounting any CFLAG

609

> switching you may have to do, it's a simple matter to switch between

610

> merged versions.  This made it easy to experiment with gcc-4.0 even tho

611

> Gentoo wasn't supporting it and certain packages wouldn't compile with

612

> 4.x, because it was always possible to switch to a 3.x version if

613

> necessary, and compile the package there.  I did this quite regularly,

614

> using gcc-4.0 as my normal version, but reverting for individual packages

615

> as necessary, when they wouldn't compile with 4.0.

616

>

617

> The same now applies to the 4.1.0-beta-snapshot series.  Other than the

618

> compile time necessary to compile a new gcc when the snapshot comes out

619

> each week, it's easy to run the 4.1-beta as the main system compiler for

620

> as wide testing as possible, while reverting to 4.0 or 3.4 (I don't have a

621

> 3.3 slot merged) if needed.

622

>

623

> re: the performance improvements I saw that started this whole thing:

624

> These trace to several things, I believe.  #1, with gcc-4.0, there's now

625

> support for -fvisibility -- setting certain functions as exported and

626

> visible externally, others not.  That can easily cut exported symbols by a

627

> factor of 10.  Exported symbols of course affect dynamic load-time, which

628

> of course gets magnified dramatically by my LDFLAGS early binding

629

> settings.  When I first compiled KDE with that (there were several

630

> missteps early on in terms of KDE and Gentoo's support, but that aside),

631

> KDE appload times went down VERY NOTICEABLY!  Again, due to my LDFLAGS,

632

> the effect was multiplied dramatically, but the effect is VERY real!

633

>

634

> Of course, that's mainly load-time performance.  The run-time performance

635

> that we are actually talking here has other explanations.  A big one is

636

> that gcc-4 was a HUGE rewrite, with a BIG potential to DRAMATICALLY

637

> improve gcc's performance.  With 4.0, the theory is there, but in

638

> practice, it wasn't all that optimized just yet.  In some ways it reverted

639

> behavior below that of the fairly mature 3.x series, altho the rewrite

640

> made things much simpler and less prone to error given its maturity.  4.1,

641

> however, is the first 4.x release to REALLY be hitting the potential of

642

> the 4.x series, and it appears the difference is very noticeable.  Of

643

> course, there's a reason 4.1.0 is still in beta upstream and not supported

644

> by Gentoo either, as there are still known regressions.  However, where it

645

> works, which it seems to do /most/ of the time, it **REALLY** works, or at

646

> least that's been my observation.  3.3 was a MAJOR improvement in gcc for

647

> amd64 users, because it was the first version where amd64 wasn't simply an

648

> add-on hack, as it had been with 3.2.  The 3.4 upgrade was minor in

649

> comparison, and 4.0 while it's going ~arch shortly, and sets the stage for

650

> a lot of future improvement, will be pretty minor in terms of actual

651

> improved performance as well.  4.1, however, when it is finally fully

652

> released, has the potential to be as big an improvement as 3.3 was -- that

653

> is, a HUGE one.  I'm certainly looking forward to it, and meanwhile,

654

> running the snapshots, because Gentoo makes it easy to do so while

655

> maintaining the ability to switch very simply between multiiple versions

656

> on the system.

657

>

658

> Both -freorder-blocks-and-partition and -fmerge-all-constants are new to

659

> me within a few days, now, and new to me with kde 3.5.1.  Normally,

660

> individual flags won't make /that/ much of a difference, but it's possible

661

> I hit it lucky, with these.  Actually, because they both match very well

662

> with and reinforce my strategy of targeting size, it's possible I'm only

663

> now unlocking the real potential behind size optimization.  -- I **KNOW**

664

> there's a **HUGE** difference in sizes between resulting file-sizes.  I

665

> compared 4.0.2 and 4.1.0-beta-snapshot file sizes for several modular-X

666

> files in the course of researching the missing symbols problem, and the

667

> difference was often a shrinkage of near 33 percent with 4.1 and my

668

> current CFLAGS as opposed to 4.0.1 without the new ones.  Going the other

669

> way, that's a 50% larger file with 4.0.2 as compared to 4.1, 100KB vs

670

> 150KB, by way of example.  That's a *HUGE* difference, one big enough to

671

> initially think I'd found the reason for the missing symbols right there,

672

> as the new files were simply too much smaller to look workable!  Still, I

673

> traced the problem too LDFLAGS, so that wasn't it, and the files DO work,

674

> confirming things.  I'm guessing -fmerge-all-constants plays a significant

675

> part in that.  In any case, with that difference in size, and knowing how

676

> /much/ cache hit vs. miss affects performance, it's quite possible the

677

> size is the big performance factor.  Of course, even if that's so, I'm not

678

> sure whether it is the CFLAGS or the 4.0 vs 4.1 that should get the credit.

679

>

680

> In any case, I'm a happy camper right now! =8^)

681

682

683

--

684

Simon Stelling

685

Gentoo/AMD64 Operational Co-Lead

686

blubb@g.o

687

--

688

gentoo-amd64@g.o mailing list

Subject	Author
[gentoo-amd64] Re: Re: Re: Wow! KDE 3.5.1 & Xorg 7.0 w/ Composite	Duncan <1i5t5.duncan@×××.net>
Re: [gentoo-amd64] Re: Re: Wow! KDE 3.5.1 & Xorg 7.0 w/ Composite	"Kevin F. Quinn (Gentoo)" <kevquinn@g.o>

Gentoo Archives: gentoo-amd64

Replies