Well, maybe it's SUMO, but when we switched on the NUMA option in the
kernel of our quad-processor, 16 GB Opteron, it did speed up the
OpenMP benchmarks by 20% to 30% (depending on the program considered).

Note: OpenMP is a directive-based extension of Fortran (and of C/C++)
in which you put parallelisation directives, without worrying about
the implementation details, using a single address space for all
instances of the user program.
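
For a flavour of it, here is a minimal sketch in C (the harmonic-sum
loop is only an illustration, not one of our benchmarks); the single
pragma carries the whole parallelisation:

    /* build with, e.g.: gcc -fopenmp omp_sketch.c */
    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        double sum = 0.0;
        int i;

        /* one directive parallelises the loop; all threads work in
           the same, single address space */
        #pragma omp parallel for reduction(+:sum)
        for (i = 1; i <= 1000000; i++)
            sum += 1.0 / i;

        printf("sum = %f (up to %d threads)\n",
               sum, omp_get_max_threads());
        return 0;
    }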

Jean Borsenberger
tel: +33 (0)1 45 07 76 29
Observatoire de Paris Meudon
5 place Jules Janssen
92195 Meudon France

On Wed, 27 Jul 2005, Duncan wrote:

> Drew Kirkpatrick posted <81469e8e0507270346445f4363@××××××××××.com>,
> excerpted below, on Wed, 27 Jul 2005 05:46:28 -0500:
>
> > Just to point out, AMD was calling the Opterons and such more of a SUMO
> > configuration (Sufficiently Uniform Memory Organization, not joking here),
> > instead of NUMA. Whereas technically it clearly is a NUMA system, the
> > differences in latency when accessing memory from a bank attached to
> > another processor's memory controller are very small. Small enough to be
> > largely ignored, and treated like uniform memory access latencies in an
> > SMP system. Sorta in between SMP unified-style memory access and NUMA.
> > This holds for up to 3 HyperTransport link hops, or up to 8
> > chips/sockets. If you add HyperTransport switches to scale over 8
> > chips/sockets, it'll most likely be a different story...
>
> I wasn't aware of the AMD "SUMO" moniker, but it /does/ make sense, given
> the design of the hardware. They have a very good point: while it's
> physically NUMA, the latency variances are so close to uniform that in
> many ways it's indistinguishable -- except for the fact that keeping it
> NUMA means allowing two different apps running on two different CPUs
> independent, parallel access to their own memory, rather than one having
> to wait for the other, as would happen if the memory were interleaved
> and unified (as it would be for quad-channel access, if that were
> enabled).
>
> > What I've always wondered is: the NUMA code in the Linux kernel, is it
> > for handling traditional NUMA, as in a large computer system (big iron)
> > where NUMA memory access latencies will vary greatly, or is it simply
> > for optimizing memory usage across the memory banks, keeping data in
> > the memory of the processor using it, etc., etc.? Of course none of
> > this matters for single-chip/socket AMD systems, as dual cores as well
> > as single cores share a memory controller. Hmm, maybe I should drink
> > some coffee and shut up until I'm awake...
>
> Well, yeah, for single-socket/dual-core, but what about dual socket
> (either single core or dual core)? Your questions make sense there, and
> that's what I'm running (single core, tho upgrading to dual core for a
> quad-core-total board sometime next year would be very nice, and just
> might be within the limits of my budget), so yes, I'm rather interested!
>
> The answer to your question on how the kernel deals with it, by my
> understanding, is this: the Linux kernel SMP/NUMA architecture allows
> for "CPU affinity grouping". In earlier kernels it was all automated,
> but things are now getting advanced enough to allow deliberate manual
> splitting of the various groups, and, combined with userspace control
> applications, that will ultimately make it possible to dynamically
> assign processes to one or more CPU groups of various sizes,
> controlling the CPU and memory resources available to individual
> processes. So, yes, I guess that means it's developing some pretty "big
> iron" qualities, altho many of them are still in flux and won't be
> stable, at least in mainline, for another six months or a year, at
> minimum.
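
(As a concrete, hedged sketch of the manual end of this: the
sched_setaffinity(2) syscall is how a userspace tool pins a process
to a chosen CPU group. The choice of CPUs 0 and 1 below is arbitrary:

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    int main(void)
    {
        cpu_set_t set;

        CPU_ZERO(&set);    /* start from an empty CPU mask */
        CPU_SET(0, &set);  /* allow CPU 0 */
        CPU_SET(1, &set);  /* allow CPU 1 */

        /* pid 0 means "the calling process" */
        if (sched_setaffinity(0, sizeof(set), &set) == -1) {
            perror("sched_setaffinity");
            return 1;
        }
        return 0;
    }

The taskset(1) utility wraps the same call for already-running
processes.)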
>
> Let's refocus now back on the implementation and the smaller picture once
> again, to examine these "CPU affinity zones" in a bit more detail. The
> following is according to the writeups I've seen, mostly on LWN's weekly
> kernel pages. (Jon Corbet, LWN editor, does a very good job of balancing
> the technical kernel hacker level stuff with the middle-ground
> not-too-technical kernel follower stuff, good enough that I find the site
> useful enough to subscribe, even tho I could get even the premium content
> a week later for free. Yes, that's an endorsement of the site, because
> it's where a lot of my info comes from, and I'm certainly not one to try
> to keep my knowledge exclusive!)
>
> Anyway... from mainly that source... CPU affinity zones work with sets
> and supersets of processors. An Intel hyperthreading pair of virtual
> processors on the same physical processor will be at the highest
> affinity level, the lowest (that is, strongest) grouping in the
> hierarchy, because they share the same cache memory all the way up to
> L1 itself. The Linux kernel can switch processes between the two
> virtual CPUs of a hyperthreaded CPU with essentially zero cost or loss
> in performance, therefore taking into account only the relative balance
> of processes on each of the hyperthread virtual CPUs.
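
(On Linux you can see which virtual CPUs are hyperthread siblings of
one physical core via sysfs; a small illustrative C reader, assuming
a kernel that exports the usual topology files:

    #include <stdio.h>

    int main(void)
    {
        char buf[64];
        FILE *f = fopen("/sys/devices/system/cpu/cpu0/topology/"
                        "thread_siblings_list", "r");

        if (f == NULL) {
            perror("fopen");
            return 1;
        }
        if (fgets(buf, sizeof(buf), f) != NULL)
            printf("cpu0 shares its core with: %s", buf);
        fclose(f);
        return 0;
    }

Output such as "0,1" means cpu0 and cpu1 form one hyperthreaded pair,
the strongest affinity grouping described above.)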
>
> At the next lowest affinity level, we'd have the dual-core AMDs: same
> chip, same memory controller, same local memory, same HyperTransport
> interfaces to the chipset, other CPUs, and the rest of the world, and
> very tightly cooperative, but with separate L2 and of course separate
> L1 caches. There's a slight performance penalty in switching processes
> between these CPUs, due to the cache flushing it entails, but it's only
> very slight and quite speedy, so thread imbalance between the two cores
> doesn't have to get bad at all before it's worth switching processes
> between them to maintain balance, even at the cost of that cache flush.
>
> At a slightly lower level of affinity would be the Intel dual cores,
> since they aren't quite so tightly coupled, and don't share all the
> same interfaces to the outside world. In practice, since only one of
> these, the Intel dual core or the AMD dual core, will normally be
> encountered in real life, they can be treated at the same level, with
> possibly a small internal tweak to the relative weighting of thread
> imbalance vs performance loss for switching CPUs, based on which one is
> actually in place.
>
> Here things get interesting, because of the different implementations
> available. AMD's 2-way thru 8-way Opterons configured for unified
> memory access would be first, because again, their dedicated inter-CPU
> HyperTransport links let them cooperate more closely than conventional
> multi-socket CPUs would. Beyond that, it's a tossup between Intel's
> unified-memory multi-processors and AMD's NUMA/SUMO-memory Opterons.
> I'd still say the Opterons cooperate more closely, even in NUMA/SUMO
> mode, than Intel chips will with unified memory, due to that SUMO
> aspect. At the same time, they have the parallel memory access
> advantages of NUMA.
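
(That parallel-access advantage is what NUMA-aware allocation
exploits: keep each process's data in the bank next to the CPU using
it. A hedged sketch with libnuma, if installed; the node number and
buffer size are arbitrary, and it links with -lnuma:

    #include <numa.h>
    #include <stdio.h>

    int main(void)
    {
        if (numa_available() == -1) {
            fprintf(stderr, "kernel has no NUMA support\n");
            return 1;
        }

        size_t len = 1 << 20;  /* 1 MiB, arbitrary */
        /* place the buffer in node 0's local memory bank */
        void *buf = numa_alloc_onnode(len, 0);
        if (buf == NULL) {
            perror("numa_alloc_onnode");
            return 1;
        }

        printf("nodes available: %d\n", numa_max_node() + 1);
        numa_free(buf, len);
        return 0;
    }

By default the kernel already prefers node-local pages for new
allocations; libnuma just makes the placement explicit.)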
>
> Beyond that, there are several levels of clustering: local/board;
> off-board but short-fat-pipe accessible (using technologies such as PCI
> interconnect, fibre-channel, and that SGI interconnect tech IDR the
> name of ATM); conventional (and Beowulf?) type clustering; and remote
> clustering. At each of these levels, as with the above, the cost to
> switch processes between peers at the same affinity level gets higher
> and higher, so the corresponding process imbalance necessary to trigger
> a switch likewise gets higher and higher, until at the extreme of
> remote clustering it's almost done manually only, or anyway at the
> level of a user-level application managing the transfers, rather than
> the kernel directly (since, after all, with remote clustering, each
> remote group is probably running its own kernel, if not individual
> machines within that group).
>
> So, the point of all that is that the kernel sees a hierarchical
> grouping of CPUs, and is designed with more flexibility to balance
> processes and memory use at the high-affinity extreme, and more
> hesitation to balance, due to the higher cost involved, at the
> low-affinity extreme. The main writeup I read on the subject dealt with
> thread/process CPU switching, not memory switching, but within the
> context of NUMA the principles become so intertwined it's impossible to
> separate them, and the writeup very clearly made the point that the
> memory issues involved in making a transfer were included in the cost
> accounting as well.
>
> I'm not sure whether this addressed the point you were trying to make,
> or hit beside it, but anyway, it was fun trying to put into text, for
> the first time since I read about it, the principles in that writeup,
> along with other facts I've merged along the way. My dad's a teacher,
> and I remember him many times making the point that the best way to
> learn something is to attempt to teach it. He used that principle in
> his own classes, having the students help each other, and I remember
> him making the point about himself as well, at one point, as he
> struggled to teach basic accounting principles based only on a textbook
> and the single college intro-level class he had himself taken years
> before, when he found himself teaching a high school class on the
> subject. The principle is certainly true, as by explaining the affinity
> clustering principles here, it has forced me to ensure they form a
> reasonable and self-consistent infrastructure in my own head, in order
> to be able to explain it in the post. So, anyway, thanks for the
> intellectual stimulation! <g>
>
> --
> Duncan - List replies preferred. No HTML msgs.
> "Every nonfree program has a lord, a master --
> and if you use the program, he is your master." Richard Stallman in
> http://www.linuxdevcenter.com/pub/a/linux/2004/12/22/rms_interview.html

--
gentoo-amd64@g.o mailing list