Drew Kirkpatrick posted <81469e8e0507270346445f4363@××××××××××.com>,
excerpted below, on Wed, 27 Jul 2005 05:46:28 -0500:
|
> Just to point out, AMD was calling the Opterons and such more of a SUMO
> configuration (Sufficiently Uniform Memory Organization, not joking here),
> instead of NUMA. Whereas technically it clearly is a NUMA system, the
> differences in latency when accessing memory from a bank attached to
> another processor's memory controller are very small. Small enough to be
> largely ignored, and treated like uniform memory access latencies in an SMP
> system. Sorta in between SMP unified style memory access and NUMA. This
> holds for up to 3 hypertransport link hops, or up to 8 chips/sockets. You
> add hypertransport switches to scale over 8 chips/sockets, it'll most
> likely be a different story...
|
I wasn't aware of the AMD "SUMO" moniker, but it /does/ make sense, given
the design of the hardware. They have a very good point: while it's
physically NUMA, the latency variance is so close to uniform that in many
ways it's indistinguishable -- except for the fact that keeping it NUMA
means two different apps running on two different CPUs can each access
their own memory in parallel, rather than one having to wait for the
other, as they would if the memory were interleaved and unified (as it
would be for quad-channel access, if that were enabled).
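
To make that local-vs-interleaved distinction concrete, here's a rough
sketch of my own (not from any of the writeups) using libnuma from the
numactl package; the node number and allocation size are made up, and it
only does anything interesting on a multi-node box:

/* Build with: gcc -o numa-demo numa-demo.c -lnuma */
#include <stdio.h>
#include <numa.h>

int main(void)
{
        size_t len = 64 * 1024 * 1024;

        if (numa_available() < 0) {
                fprintf(stderr, "no NUMA support on this kernel/machine\n");
                return 1;
        }

        /* Local allocation: all pages on node 0, so a process running
         * there never waits on another node's memory controller. */
        void *local = numa_alloc_onnode(len, 0);

        /* Interleaved allocation: pages spread round-robin across all
         * nodes, roughly what unified/interleaved access gives you. */
        void *spread = numa_alloc_interleaved(len);

        printf("nodes 0..%d, local=%p, interleaved=%p\n",
               numa_max_node(), local, spread);

        numa_free(local, len);
        numa_free(spread, len);
        return 0;
}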
|
> What I've always wondered is, the NUMA code in the linux kernel, is this
> for handling traditional NUMA, like in a large computer system (big iron)
> where NUMA memory access latencies will vary greatly, or is it simply for
> optimizing the memory usage across the memory banks, keeping data in the
> memory of the processor using it, etc, etc. Of course none of this matters
> for single chip/socket amd systems, as dual cores as well as single cores
> share a memory controller. Hmm, maybe I should drink some coffee and
> shut up until I'm awake...
|
Well, yeah, for single-socket/dual-core, but what about dual socket
(either single core or dual core)? Your questions make sense there, and
that's what I'm running (single core, tho upgrading to dual core for four
cores total on the board sometime next year would be very nice, and just
might be within the limits of my budget), so yes, I'm rather interested!
|
The answer to your question on how the kernel deals with it, by my
understanding, is this: the Linux kernel SMP/NUMA architecture allows for
"CPU affinity grouping". In earlier kernels it was all automated, but the
kernels are now getting advanced enough to allow deliberate manual
splitting of the various groups, and, combined with userspace control
applications, will ultimately be able to dynamically assign processes to
one or more CPU groups of various sizes, controlling the CPU and memory
resources available to individual processes. So, yes, I guess that means
it's developing some pretty "big iron" qualities, altho many of them are
still in flux and won't be stable, at least in mainline, for another six
months or a year.
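
The simplest userspace end of that is already there: a process can be
pinned to a chosen set of CPUs with sched_setaffinity(). A minimal sketch
of my own (the CPU numbers are made up, and error handling is minimal):

/* Build with: gcc -o pin pin.c */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
        cpu_set_t set;

        CPU_ZERO(&set);
        CPU_SET(0, &set);       /* allow CPU 0 */
        CPU_SET(1, &set);       /* allow CPU 1 */

        /* pid 0 means "the calling process" */
        if (sched_setaffinity(0, sizeof(set), &set) != 0) {
                perror("sched_setaffinity");
                return 1;
        }

        printf("now restricted to CPUs 0 and 1\n");
        return 0;
}

The kernel will then only schedule the process within that group, and
since memory is normally allocated node-locally, its memory tends to stay
local as well.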
|
Let's refocus now on the implementation and the smaller picture, to
examine these "CPU affinity zones" in a bit more detail. The following is
according to the writeups I've seen, mostly on LWN's weekly kernel pages.
(Jon Corbet, LWN editor, does a very good job of balancing the technical
kernel hacker level stuff with the middle-ground not-too-technical kernel
follower stuff, good enough that I find the site worth subscribing to,
even tho I could get even the premium content a week later for free. Yes,
that's an endorsement of the site, because it's where a lot of my info
comes from, and I'm certainly not one to try to keep my knowledge
exclusive!)
|
Anyway... from mainly that source... CPU affinity zones work with sets
and supersets of processors. An Intel hyperthreading pair of virtual
processors on the same physical processor will be at the highest
affinity -- the lowest, aka strongest, grouping in the hierarchy --
because they share the same cache memory all the way up to L1 itself.
The Linux kernel can therefore switch processes between the two virtual
CPUs of a hyperthreaded CPU at essentially zero cost or loss in
performance, taking into account only the relative balance of processes
on each of the hyperthread virtual CPUs.
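
As an aside, newer kernels expose that sibling grouping in sysfs (I'm not
sure exactly which version this appeared in, so treat the path as an
assumption on my part); a quick sketch of my own to print it:

/* Print which CPUs the kernel considers hyperthread siblings of CPU 0.
 * The sysfs path is an assumption about newer 2.6 kernels, and the file
 * only exists on SMT-capable hardware. */
#include <stdio.h>

int main(void)
{
        char buf[256];
        FILE *f = fopen(
            "/sys/devices/system/cpu/cpu0/topology/thread_siblings_list",
            "r");

        if (!f) {
                printf("no SMT topology info for cpu0\n");
                return 0;
        }
        if (fgets(buf, sizeof(buf), f))
                printf("cpu0 thread siblings: %s", buf);
        fclose(f);
        return 0;
}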
|
At the next lower affinity level, we'd have the dual-core AMDs: same
chip, same memory controller, same local memory, same hypertransport
interfaces to the chipset, other CPUs, and the rest of the world, very
tightly cooperative, but with separate L2 and of course separate L1
cache. There's a slight performance penalty to switching processes
between these CPUs, due to the cache flushing it entails, but it's only
very slight, so thread imbalance between the two cores doesn't have to
get bad at all before it's worth switching CPUs to maintain balance, even
at the cost of that cache flush.
|
At a slightly lower level of affinity would be the Intel dual cores,
since they aren't quite so tightly coupled and don't share all the same
interfaces to the outside world. In practice, since only one of these,
the Intel dual core or the AMD dual core, will normally be encountered in
real life, they can be treated at the same level, with possibly a small
internal tweak to the relative weighting of thread imbalance vs. the
performance cost of switching CPUs, based on which one is actually in
place.
|
Here things get interesting, because of the different implementations
available. AMD's 2-way thru 8-way Opterons configured for unified memory
access would come first, because again, their dedicated inter-CPU
hypertransport links let them cooperate more closely than conventional
multi-socket CPUs would. Beyond that, it's a tossup between Intel's
unified memory multi-processors and AMD's NUMA/SUMO memory Opterons. I'd
still say the Opterons cooperate more closely, even in NUMA/SUMO mode,
than Intel chips will with unified memory, due to that SUMO aspect. At
the same time, they have the parallel memory access advantages of NUMA.
|
Beyond that, there are several levels of clustering: local/board;
off-board but short-fat-pipe accessible (using technologies such as PCI
interconnect, fibre channel, and that SGI interconnect tech whose name I
don't recall at the moment); conventional (and Beowulf?) type clustering;
and remote clustering. At each of these levels, as with the above, the
cost to switch processes between peers at the same affinity level gets
higher and higher, so the corresponding process imbalance necessary to
trigger a switch likewise gets higher and higher, until at the extreme of
remote clustering it's almost done manually only, or anyway at the level
of a user-level application managing the transfers rather than the kernel
directly (since, after all, with remote clustering, each remote group is
probably running its own kernel, if not individual machines within that
group).
|
So, the point of all that is that the kernel sees a hierarchical grouping
of CPUs, and is designed with more flexibility to balance processes and
memory use at the high affinity end, and more hesitation to balance, due
to the higher cost involved, at the low affinity end. The main writeup I
read on the subject dealt with thread/process CPU switching, not memory
switching, but within the context of NUMA the principles become so
intertwined it's impossible to separate them, and the writeup very
clearly made the point that the memory issues involved in making the
transfer were included in the cost accounting as well.
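
Just to illustrate the shape of that cost accounting (this is my own toy
sketch, not kernel code -- the levels and numbers are made up), the idea
is simply that each affinity level carries a migration cost, and a
process only moves to a less loaded peer when the imbalance outweighs it:

#include <stdio.h>

struct level {
        const char *name;
        int migration_cost;     /* arbitrary units; higher = more reluctant */
};

static const struct level levels[] = {
        { "HT siblings (shared cache)",          1 },
        { "cores on one die",                    4 },
        { "sockets on one board (SUMO/NUMA)",   16 },
        { "nodes in a cluster",                256 },
};

/* Move only if the load difference more than pays for the migration. */
static int should_migrate(int imbalance, const struct level *lvl)
{
        return imbalance > lvl->migration_cost;
}

int main(void)
{
        int imbalance = 10;     /* pretend one peer has 10 more runnable tasks */
        unsigned i;

        for (i = 0; i < sizeof(levels) / sizeof(levels[0]); i++)
                printf("%-36s migrate? %s\n", levels[i].name,
                       should_migrate(imbalance, &levels[i]) ? "yes" : "no");
        return 0;
}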
|
I'm not sure whether this addressed the point you were trying to make, or
hit beside it, but anyway, it was fun trying to put the principles in
that writeup, along with other facts I've merged along the way, into text
for the first time since I read about it. My dad's a teacher, and I
remember him many times making the point that the best way to learn
something is to attempt to teach it. He used that principle in his own
classes, having the students help each other, and I remember him making
the point about himself as well, at one point, as he struggled to teach
basic accounting principles based only on a textbook and the single
college intro level class he had himself taken years before, when he
found himself teaching a high school class on the subject. The principle
is certainly true, as explaining the affinity clustering principles here
has forced me to ensure they form a reasonable and self-consistent
structure in my own head, in order to be able to explain them in this
post. So, anyway, thanks for the intellectual stimulation! <g>
|
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman in
http://www.linuxdevcenter.com/pub/a/linux/2004/12/22/rms_interview.html


--
gentoo-amd64@g.o mailing list |