Alex Efros posted on Sun, 21 Oct 2012 16:24:32 +0300 as excerpted:

> Hi!
>
> On Sun, Oct 21, 2012 at 08:02:47AM +0000, Duncan wrote:
>> Bottom line, an empty @system set really does make a noticeable
>> difference in parallel merge handling, speeding up especially
>> --emptytree @world rebuilds but also any general update that has a
>> significant number of otherwise @system packages and deps,
>> dramatically. I'm happy. =:^)
>
> I think the "@system first" and "@system not merged in parallel" rules
> are safe to break when you're just doing "--emptytree @world" on an
> already-updated OS, because it only rebuilds existing packages, and
> while compiling, all packages will see the same set of other packages
> (including the same versions). But when upgrading multiple packages
> (including some from the original @system and some from @world), this
> may well result in bugs.

In theory, you're right. In practice, I've not seen it yet, tho being
cautious I'd say it needs at least six months of testing (I've only been
testing it about a month, maybe six weeks) before I can say for sure.
It /was/ something I was a bit concerned about, however.

That was in fact one of the reasons I decided to try it on the netbook's
chroot as well, which hadn't been upgraded in a year and a half. I
figured if it could work reasonably well there, the chances of an
undiscovered real problem were much lower.

However, it /is/ worth noting that as a matter of course, I already often
choose to do some system-critical upgrades (portage, gcc, glibc, openrc,
udev) on their own, before doing the general upgrades, in part so I can
deal with their config file changes and note any problems right away,
with a relatively small changeset to deal with. The alternative, a whole
slew of updates including critical system package updates happening all
at once, makes it far more difficult to trace which update actually
broke things.
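
For anyone wanting to try the same routine, here's a minimal sketch of
that critical-first workflow. The package atoms are the ones named
above, and the emerge/dispatch-conf invocations are standard portage
tooling, tho the exact flags are just one reasonable choice, not the
only one:

```shell
# Preview the full update first, to spot any system-critical packages
# (portage, gcc, glibc, openrc, udev) in the pending changeset.
emerge --pretend --update --deep --newuse @world

# Upgrade the critical packages by themselves, one small changeset at a
# time, so any breakage is easy to trace to the package that caused it.
emerge --ask --oneshot sys-apps/portage
emerge --ask --oneshot sys-devel/gcc sys-libs/glibc

# Deal with their config file changes right away, while the changeset
# is still small.
dispatch-conf

# Only then go on to the general @world upgrade.
emerge --ask --update --deep --newuse @world
```

The point of --oneshot is that these packages are (or were) pulled in as
deps anyway, so there's no need to add them to the world file just to
upgrade them early.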

That's where the years of gentoo experience I originally mentioned come
in. This isn't going to be as easy for a gentoo newbie, for at least two
reasons. First, they're less likely to know which packages really /are/
system-critical, and thus are more likely to unmerge them without the
extra unmerge warning a package in the system set gets. (I mentioned
that one in the first post.) Second, spotting critical updates in the
initial --pretend run -- knowing which packages it's a good idea to
upgrade first, by themselves, dealing with config file updates, etc, for
just that critical package (and any dependency updates it might pull
in), before going on to the general @world upgrade -- probably makes a
good bit of difference in practice, and gentoo newbies are rather less
likely to be able to make that differentiation. (I didn't specifically
mention that one until now.)

> As for the "--emptytree @world" speedup, can you provide benchmarked
> values? I mean, only a few packages are forced to use only one CPU
> core while compiling. So, merging packages in parallel may save some
> time, mostly on the unpack/prepare/configure/install/merge phases. All
> of them except configure actually do a lot of I/O, which most likely
> loses a lot of speed instead of gaining when done in parallel
> (especially keeping in mind kernel bug 12309). So, at a glance, the
> time you may win on configure you'll mostly lose on I/O, and most of
> the time all your CPU cores will be loaded anyway while compiling, so
> doing configure in parallel with compiling is unlikely to save much
> time. This is why I think that without actual benchmarking we can't be
> sure how much faster it became (if it became faster at all, which is
> questionable).

Good points, and no, I can't easily provide benchmarks, both because of
the recent hardware upgrade here, and because portage itself has been
gradually improving its parallel merging abilities -- a recent update
changed the scheduling algorithm so it starts additional merges much
sooner than it did previously. (See gentoo bug 438650, fixed in portage
2.1.11.29 and 2.2.0_alpha140, both released on Oct 17. The fact that I
know about that hints at another thing I do routinely as an experienced
gentooer: I always read portage's changelog and check out any referenced
bugs that look interesting, before I upgrade portage. To the extent
practical without actually reading the individual git commits, I want to
know about package manager changes that might affect me BEFORE I do that
upgrade!)

But I believe that as core-counts rise, you're underestimating the
effects of portage's parallel merging abilities. In particular, a lot of
packages normally in @system (or deps thereof) are relatively small
packages such as grep, patch, sed... where the single-threaded configure
step takes a MUCH larger share of the total package merge time than it
does with larger packages. Similarly, the unpack and prepare phases,
plus the package phase for folks using FEATURES=buildpkg, tend to be
single-threaded.[1]

Thus, instead of serializing several dozen small, mostly single-threaded
package merges for packages like grep/sed/patch/util-linux/etc,
depending on the --jobs and --load-average numbers you feed to portage,
several of these end up getting done in parallel, with the portage
multi-job output bumping a line every few seconds because it's doing
them in parallel, instead of every minute or so because it's doing one
at a time.

Meanwhile, it should be obvious, but it's worth stating anyway: the
effect gets *MUCH* bigger as the number of cores increases. For a
dual-core, bah, not worth the trouble, as it could cause more problems
than it solves, especially if people are trying to work on other things
while portage is doing its thing in the background. I suspect the
break-over point is either triple-core or quad-core. One of the reasons
portage is getting better lately is that someone with a 32-core machine,
and a corresponding amount of memory (64 or 128 gig IIRC), has taken an
interest.

It's worth noting, as I mentioned, that I now have a 6-core, recently
upgraded from a dual-dual-core (4 cores), with a corresponding memory
upgrade, to 16 gigs.

One of the first things I noticed doing emerges was how much more
difficult it was to keep the 6-core actually peaked out at 100% CPU than
it had been with the 4-core. While I suspect there would have been a
difference on the quad-core too (as I said, I believe the break-over's
probably at 3-4 cores), it wasn't a big deal there. Staring at that
6-core running 1-2 cores at 100%, CPU-freq maxed at 3.6 GHz, while the
other 4-5 cores remained near idle at <20% utilization at the 1.4 GHz
CPU-freq minimum... was VERY frustrating. So began my drive to empty
@system and get portage properly scheduling parallel merges for former
@system packages and their deps as well!

For the quad-core plus hyperthreading (thus 8 threads, I take it?) you
mention below (4.6 GHz OC, nice! I see stock is 3.4 GHz), the boost from
killing @system's forced serialization should definitely make a
difference (unless the hyperthreading doesn't do much for that workload,
making it effectively no better than a non-hyperthreaded quad-core).
For my 6-core, it made a rather big difference, and I guarantee that if
you had the 32-core that one of the devs working on improving portage's
parallelization has, you'd be hot on the trail to improve it as well!

> As for me, I found a very effective way to speed up emerge: upgrading
> from a Core2Duo E6600 to an i7-2600K overclocked to 4.6GHz. This sped
> up compilation on my system 6x (the kernel now compiles in just 1
> minute). And to speed up most other (non-compilation) portage
> operations I use a 4GB tmpfs mount on /var/tmp/portage/.

I remember reading about the 1-minute kernel compiles on i7s. Very
impressive.

FWIW, there are a lot of variables to fill in before we can be sure
kernel build time comparisons are apples to apples (I had several more
paragraphs written on that, but decided it was a digression too far for
this post, so deleted 'em). But AFAIK, when I read about it (on phoronix
I believe), he was doing an all-yes config, so building rather more than
a typical customized-config gentooer, but was using a rather fast SSD,
which probably improved his times quite a bit compared to "spinning
rust".

But I don't know if his timings included the actual compress (and if so
with what CONFIG_KERNEL_XXX compression option), and I don't believe
they included the actual install, only the build.

That said, a 1-minute all-yes-config kernel build time is impressive
indeed, the envy of many, including me. (OTOH, my fx6100 was on sale for
$100, $109 post-tax. That's lower than pricewatch's $118 lowest quote
(shipped, no tax), and only about 40% of the $273 low quote for an
i7-2600k.)

My build, compress (CONFIG_KERNEL_XZ) and install runs ~2 minutes
(1:58-2:07 over 10+ runs, warm-cache), so yes, even if your build time
doesn't include compress and install (which it might), 1 minute is still
VERY impressive. Tho as I said, my CPU cost ~40% of the going price on
yours, so...

Meanwhile...

I too use and DEFINITELY recommend a tmpfs $PORTAGE_TMPDIR. I'm running
16 gig RAM here, and didn't want to run out of room with parallel
builds, so set a nice roomy 12G tmpfs size.
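
For reference, that's just an fstab entry; portage's default
PORTAGE_TMPDIR is /var/tmp, so mounting the tmpfs at /var/tmp/portage is
enough. The 12G size is what I use here with 16 gig RAM; scale it to
your own RAM, and the uid/gid/mode options shown are one reasonable
choice, not the only one:

```shell
# /etc/fstab: a roomy tmpfs for portage's build trees.
# size= should leave headroom for everything else running on the box.
tmpfs   /var/tmp/portage   tmpfs   size=12G,uid=portage,gid=portage,mode=0775   0 0
```

With parallel builds, remember the size needs to hold several packages'
work dirs at once, so err on the roomy side if RAM allows.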

A $PORTAGE_TMPDIR on tmpfs also reduces the I/O. At least here, the only
time I've had problems, both on the old hardware and on the new, is when
I go into swap. (On the old hardware I had swap striped across four
disks via equal priority= entries, plus 4-way md/raid0, so the kernel
could schedule swap-out vs read-in much better, and I didn't see a
problem until I hit nearly a half-gig of swap loading at once; the new
hardware is only single-disk ATM, and I see issues starting @ 80 meg or
so of swap loading at once.) But with 16 gig RAM on the new system, the
only time I see it go into swap is when I run a kernel build with
uncapped -j, thus hitting 500+ jobs and getting close enough to 16 gigs
that whether I hit swap or not depends on what else I've been doing with
the system.

Basically, I/O is thus not a problem at all with portage here, up to the
--jobs=12 --load-average=12 along with MAKEOPTS="-j20 -l15" that I
normally run, anyway. On the old system with only six gigs of RAM, I
could get portage to hit swap if I tried hard enough, but I limited
--jobs and MAKEOPTS until that wasn't an issue, and had no additional
problems.

Tho I should mention I also run PORTAGE_NICENESS=19 (and my
kernel-build/install script similarly renices itself to 19 before
starting the kernel build), which puts it in batch-scheduling mode
(idle-only scheduling, but longer timeslices).
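
Pulling the settings from the last couple paragraphs together, the
relevant make.conf fragment here looks roughly like this. The values are
the ones quoted above (tune them to your own core-count and RAM), and
putting the emerge flags in EMERGE_DEFAULT_OPTS rather than on the
command line each time is just my assumption of the convenient spot:

```shell
# /etc/portage/make.conf (excerpt)
# Let portage merge several packages in parallel, capped by load.
EMERGE_DEFAULT_OPTS="--jobs=12 --load-average=12"
# Per-package make parallelism, again load-capped.
MAKEOPTS="-j20 -l15"
# Run portage at lowest priority so builds stay out of the way.
PORTAGE_NICENESS="19"
# Build under /var/tmp (portage appends /portage), tmpfs-mounted here.
PORTAGE_TMPDIR="/var/tmp"
```

Note that --jobs multiplies with MAKEOPTS -j, which is exactly why the
--load-average and -l caps matter: they keep the worst-case combination
from flattening the box.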

If it matters, filesystem is reiserfs, iosched is cfq, drive is
sata2/ahci (amd 990fx/sb950 chipset) 2.5" seagate "spinning rust".

But I definitely agree with $PORTAGE_TMPDIR on tmpfs. It makes a HUGE
difference!

---
[1] Compression parallelism: There are parallel-threaded alternatives to
bzip2, for instance, but they have certain downsides, like decompression
only being parallel where the tarball was compressed with the same
parallel tool, and certain compression buffer nul-fill handling
differences that make them not functionally perfect drop-in
replacements. See the recent discussion on the topic on the gentoo-dev
list, for instance.

--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman