[gentoo-dev] Re: Optimizing performance - gentoo-dev

From:	Duncan <1i5t5.duncan@×××.net>
To:	gentoo-dev@l.g.o
Subject:	[gentoo-dev] Re: Optimizing performance
Date:	Thu, 15 Dec 2005 14:53:47
Message-Id:	`pan.2005.12.15.14.43.40.649371@cox.net`
In Reply to:	[gentoo-dev] Optimizing performance by Patrick Lauer

Patrick Lauer posted <1134650885.4634.57.camel@localhost>, excerpted
below, on Thu, 15 Dec 2005 13:48:05 +0100:

> I was wondering if there are any sane ways to optimize the performance
> of a Gentoo system.

This really belongs on user, or perhaps on the appropriate special-purpose
list (desktop, hardened, or whatever), not on devel.  That said, some
comments...  (I can't resist. <g>)

> Overoptimization (the well known "-O9 -fomgomg" CFLAGS etc.) tends to
> make things unstable, which is of course not what we want. The "easy"
> way out would be buying faster hardware, but that is usually not an
> option ;-)
>
> So ... what can be done to get the stable maximum out of your hardware?
>
> In my experience (x86 centric - do other arches have different
> "problems"?) the following is stable, but not necessarily the optimum:

The general rules are the same, but architectural differences often
change the details.  I /think/ it was MIPS that has extremely slow I/O (I
saw that mentioned in the split-kde-ebuilds debate, where they said it
could cause compile times to double, a big deal for something as large
as KDE).  x86 (32-bit) has a relatively small number of CPU registers
compared to most other archs (amd64 in 64-bit mode increased the number
dramatically, though it's unchanged in 32-bit mode for compatibility
reasons), and this has a big effect on register use strategy.

That said, in the general case, the -march switch normally chooses pretty
good defaults for the target arch.  Straying far from those defaults,
other than to cover special cases or via the general -Ox optimization
switches, is therefore often counterproductive and/or problematic.

> - don't overtweak CFLAGS. "-O2 -march=$your_cpu_family" seems to be on
> average the best, -O3 is often slower and can cause bugs

A lot of folks don't realize the effect of cache memory on optimizations.
I'll be brief here, but particularly for things like the kernel that stay
in memory, -Os can at times work wonders, because smaller code means more
of the working set stays in a cache close to the CPU, and the speed gained
in fetching that code far outweighs the optimizations given up to shrink
it.  Conversely, media streaming or encoding apps are constantly throwing
out old data and fetching new data, and the speed optimizations are often
more effective for them, so they work better with -O2 or even -O3.

There have been occasional problems with -Os, generally because it isn't
used as much and so gets less testing, mostly early in a gcc release
series.  However, I run -Os here (amd64) by default, and over the couple
of years I've been running Gentoo I haven't seen any issues that went away
when I reverted to -O2.  (That has held even when I've edited ebuilds to
remove their strip-flags calls and the like.  Glibc and xorg both call
strip-flags, which removes -Os among other flags.  xorg seemed to benefit
from -Os here after I removed the strip-flags call, while glibc worked
but seemed slower.  Note that if you edit ebuilds and something breaks,
you get to keep the pieces!)

For gcc, -pipe doesn't improve program optimization, but will make
compiling faster.  -fomit-frame-pointer makes smaller applications if
you aren't debugging.  Both are common enough to be fairly safe.
-frename-registers and -fweb may also be useful.  (-fweb ceases to be so
on gcc4, however, because it is implemented differently.)
-funit-at-a-time (new in gcc-3.4, so don't try it with gcc-3.3) may also
be worth looking into, although it's already enabled by -Os.  These latter
flags are less commonly used, however, thus less well tested, and may
therefore cause very occasional problems.  (-funit-at-a-time was known to
do so early in the 3.4 cycle, but those issues should have been dealt
with long ago.)  I consider those /reasonably/ conservative, and it's
what I run.  If I were running a server, however, I'd probably only run
-O2 and the first two (-pipe and -fomit-frame-pointer).

Do some research on -Os, in any case.  It could be well worth your time.
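To make the two flag sets above concrete, here is a rough Python sketch that assembles a CFLAGS line of the kind you'd put in make.conf.  The helper name, the "desktop"/"server" profile names and the -march=k8 value are just illustrative assumptions, not anything portage provides; -fweb is dropped for gcc 4 per the note above, and -funit-at-a-time is left out since -Os is said to imply it.

```python
def build_cflags(profile, march, gcc_major=3):
    """Assemble a CFLAGS line from the two flag sets described in the post.

    Hypothetical helper: the profile names are made up for illustration.
    """
    # Flags described as common enough to be fairly safe.
    safe = ["-pipe", "-fomit-frame-pointer"]
    if profile == "server":
        # Conservative set: -O2 plus the two safe flags only.
        flags = ["-O2", "-march=" + march] + safe
    elif profile == "desktop":
        # Size-optimized set with the less widely tested extras.
        flags = ["-Os", "-march=" + march] + safe + ["-frename-registers"]
        if gcc_major < 4:
            # -fweb is only suggested for gcc 3.x.
            flags.append("-fweb")
    else:
        raise ValueError("unknown profile: " + profile)
    return 'CFLAGS="' + " ".join(flags) + '"'


print(build_cflags("desktop", "k8"))
print(build_cflags("server", "k8"))
```

Treat the output as a starting point to test against, not a recommendation for any particular box.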

> - check that all IDE disks use DMA mode, otherwise they are limited to
> ~16M/s with a huge CPU usage penalty. Sometimes (application-specific)
> increasing the readahead with hdparm gives a huge throughput boost.

This suggestion does involve hardware, but not at a really heavy cost, and
the performance boost may be worth it: consider running a RAID system.  I
recently switched to RAID on a four-disk setup: raid1/mirrored for /boot,
raid6 (for redundancy) for most of the system, and raid0/striped (for
speed) for /tmp, the portage dir, etc., stuff that was either temporary
anyway or could easily be redownloaded.  (Swap can also be striped: create
equal-size partitions on each disk and give them equal priority in
fstab.)  I was very pleasantly surprised at how much of a difference it
made!

Cost, as I said, is reasonable, particularly if you have disks lying
around or can buy them used.  Even buying, say, three 80-gig drives and
doing what I did, only with raid5 instead of raid6, is reasonable at the
price of hard drives these days.  Unfortunately, if your board is still
PATA, you can only run a single disk per IDE channel or it bogs down, so
you may need to buy a PCI IDE expansion board, which will add to the
cost.  If you have onboard SATA and are buying new disks, so can buy SATA
anyway (my case), that should do just fine, as SATA gives each drive a
dedicated channel.  SCSI is a higher-cost option, ruled out here, but
SATA works very nicely, certainly for me.
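If you're weighing the RAID levels mentioned above against the disks you have, the space tradeoff is simple arithmetic.  The following Python sketch assumes equal-size members and the usual RAID semantics; the function name is made up for illustration.

```python
def usable_members(level, members):
    """Return how many members' worth of space an array exposes.

    Assumes equal-size members and standard RAID semantics.
    """
    minimum = {0: 2, 1: 2, 5: 3, 6: 4}
    if level not in minimum:
        raise ValueError("unsupported RAID level: %d" % level)
    if members < minimum[level]:
        raise ValueError("raid%d needs at least %d members" % (level, minimum[level]))
    if level == 0:
        return members          # striping only, no redundancy
    if level == 1:
        return 1                # every member holds a full copy
    if level == 5:
        return members - 1      # one member's worth of parity
    return members - 2          # raid6: two members' worth of parity


for level in (0, 1, 5, 6):
    print("raid%d on 4 disks: %d disk(s) of usable space"
          % (level, usable_members(level, 4)))
```

So the three 80-gig drives in raid5 would give two drives' worth (160 gigs) of usable space, and raid6 on four disks likewise leaves two disks' worth while surviving any two failures.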

> - kernel tweaks like setting swappiness or using a different I/O
> scheduler (CFQ, deadline) should help, but I'm not aware of any "real"
> benchmarks

Again, a reasonable new-hardware suggestion.  When purchasing a new system
or considering an upgrade, more memory is often the most effective
optimization you can make (with the RAID suggestion above very close
behind).  A slower CPU with more memory, up to a gig or so, is almost
always better than the reverse, because hard drive access is WAYYY slower
than even cheap/slow memory.  At a gig of memory, running with swap
disabled is actually a practical option, although it might not be faster
and there are certain memory zone management considerations.  Usual X/KDE
desktop usage will run perhaps a third of a gig.  That leaves half to 2/3
of a gig for cache, which is "comfortable".  Naturally, if you take the
RAID suggestion above, this one isn't quite as critical, because drive
latency will be lower, so reliance on swap isn't as painful and a big
cache isn't nearly as critical to good performance.  Going from one gig to
two can still be useful, but the cost/performance tradeoff isn't as good,
and the money will likely be better spent elsewhere.
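The cache figure above is just subtraction, but it's worth plugging in your own numbers.  A rough Python sketch, ignoring the kernel's own memory use; the function name is hypothetical.

```python
def cache_headroom_mb(total_mb, desktop_mb):
    """Memory left over for disk cache once the desktop is resident."""
    if desktop_mb > total_mb:
        raise ValueError("desktop usage exceeds installed memory")
    return total_mb - desktop_mb


total = 1024            # "a gig"
desktop = total // 3    # "perhaps a third of a gig", about 341 MB
print(cache_headroom_mb(total, desktop))
```

With a gig that leaves about 683 MB for cache; with half a gig and the same desktop it's only about 171 MB.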

Note that with a gig of memory and striped swap, I have swappiness upped
to 100 to push the least-used app memory out to swap, and I literally
can't tell when it starts swapping at all, except by watching the used
swap graph in ksysguard.  I see none of the slowdowns I had previously
associated with swapping, back when I had a single drive and a half-gig
of memory.
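For anyone wanting to try the striped swap and swappiness setup, this Python sketch prints the sort of config lines involved: one fstab entry per swap partition with the same pri= value (equal priority is what makes the kernel spread pages across them), plus a sysctl.conf line for swappiness.  The device names are placeholders for four hypothetical disks, not my actual layout.

```python
def striped_swap_fstab(devices, priority=1):
    """Build fstab lines giving every swap partition the same priority."""
    return ["%s\tnone\tswap\tsw,pri=%d\t0 0" % (dev, priority)
            for dev in devices]


def swappiness_sysctl(value):
    """Build the sysctl.conf line for vm.swappiness (valid range 0-100 here)."""
    if not 0 <= value <= 100:
        raise ValueError("swappiness must be between 0 and 100")
    return "vm.swappiness = %d" % value


# Placeholder device names; substitute your own swap partitions.
for line in striped_swap_fstab(["/dev/sda2", "/dev/sdb2", "/dev/sdc2", "/dev/sdd2"]):
    print(line)
print(swappiness_sysctl(100))
```

The partitions should be the same size, as noted earlier, or the striping gets uneven once the smaller ones fill.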

> - using a "smarter" filesystem can dramatically improve performance at
> the potential cost of reliability. As data on FS reliability is hard to
> find from unbiased sources this becomes a religious issue ... migrating
> from ext3 to reiserfs makes "emerge sync" extremely much faster, but is
> reiserfs sustainable?

I run reiserfs here on everything.  However, some don't consider it
especially stable.  For that reason and others (fat-finger deleting,
anyone?), I keep second-copy partitions as backups of the stuff I want to
ensure is safe.  Bottom line, reiserfs is certainly safe "enough" if you
have a decent backup system in place and you follow it regularly, as you
should.  I can't see how anyone can reasonably disagree with that,
filesystem religious zealotry or not.

In any case, note that you can simply redownload your portage tree, and
with the speed and size benefits of reiserfs (size only if you don't have
notail in your mount options), even those least inclined to trust the
integrity of reiserfs should see the benefit of putting the portage tree
on it.  /tmp and/or /var/tmp may benefit equally, for the same reasons.
An exception might be if you regularly put huge files on the partition
(700 meg CD and multi-gig DVD images to burn would be one example).  In
that case, jfs or xfs (I don't remember which, but one is optimized for
large files) might be preferable.

As I said, I run reiserfs for everything here, but I also have backup
images of the stuff I know I want to keep.

> Are there any application-specific tweaks

As I mentioned, -O3 is often best for multimedia stuff
(encoders/decoders/streamers and the like), while -O2, or often -Os, is
better for most other things.

--
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman in
http://www.linuxdevcenter.com/pub/a/linux/2004/12/22/rms_interview.html

--
gentoo-dev@g.o mailing list
