Drew Kirkpatrick posted <81469e8e0507270346445f4363@××××××××××.com>,
excerpted below, on Wed, 27 Jul 2005 05:46:28 -0500:
|
> Just to point out, AMD was calling the Opterons and such more of a SUMO
> configuration (Sufficiently Uniform Memory Organization, not joking here),
> instead of NUMA. Whereas technically it clearly is a NUMA system, the
> differences in latency when accessing memory from a bank attached to
> another processor's memory controller are very small. Small enough to be
> largely ignored, and treated like uniform memory access latencies in an SMP
> system. Sorta in between SMP unified style memory access and NUMA. This
> holds for up to 3 hypertransport link hops, or up to 8 chips/sockets. You
> add hypertransport switches to scale over 8 chips/sockets, it'll most
> likely be a different story...
|
I wasn't aware of the AMD "SUMO" moniker, but it /does/ make sense, given
the design of the hardware. They have a very good point: while it's
physically NUMA, the latency variance is so close to uniform that in many
ways it's indistinguishable -- except for the fact that keeping it NUMA
means two different apps running on two different CPUs can each access
their own memory in parallel, rather than one having to wait for the
other, as they would if the memory were interleaved and unified (as it
would be for quad-channel access, if that were enabled).
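
To make that local-vs-interleaved distinction concrete, here's a rough
sketch of my own (not from any of the writeups) using libnuma from the
numactl package; the node number and allocation size are made up, and it
only does anything interesting on a multi-node box:

/* Build with: gcc -o numa-demo numa-demo.c -lnuma */
#include <stdio.h>
#include <numa.h>

int main(void)
{
        size_t len = 64 * 1024 * 1024;

        if (numa_available() < 0) {
                fprintf(stderr, "no NUMA support on this kernel/machine\n");
                return 1;
        }

        /* Local allocation: all pages on node 0, so a process running
         * there never waits on another node's memory controller. */
        void *local = numa_alloc_onnode(len, 0);

        /* Interleaved allocation: pages spread round-robin across all
         * nodes, roughly what unified/interleaved access gives you. */
        void *spread = numa_alloc_interleaved(len);

        printf("nodes 0..%d, local=%p, interleaved=%p\n",
               numa_max_node(), local, spread);

        numa_free(local, len);
        numa_free(spread, len);
        return 0;
}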
|
> What I've always wondered is, the NUMA code in the linux kernel, is this
> for handling traditional NUMA, like in a large computer system (big iron)
> where NUMA memory access latencies will vary greatly, or is it simply for
> optimizing the memory usage across the memory banks, keeping data in the
> memory of the processor using it, etc, etc. Of course none of this matters
> for single chip/socket amd systems, as dual cores as well as single cores
> share a memory controller. Hmm, maybe I should drink some coffee and
> shut up until I'm awake...
|
Well, yeah, for single-socket/dual-core, but what about dual socket
(either single core or dual core)? Your questions make sense there, and
that's what I'm running (single core, tho upgrading to dual core for four
cores total on the board sometime next year would be very nice, and just
might be within the limits of my budget), so yes, I'm rather interested!
|
The answer to your question on how the kernel deals with it, by my
understanding, is this: the Linux kernel SMP/NUMA architecture allows for
"CPU affinity grouping". In earlier kernels it was all automated, but the
kernels are now getting advanced enough to allow deliberate manual
splitting of the various groups, and, combined with userspace control
applications, will ultimately be able to dynamically assign processes to
one or more CPU groups of various sizes, controlling the CPU and memory
resources available to individual processes. So, yes, I guess that means
it's developing some pretty "big iron" qualities, altho many of them are
still in flux and won't be stable, at least in mainline, for another six
months or a year.
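
The simplest userspace end of that is already there: a process can be
pinned to a chosen set of CPUs with sched_setaffinity(). A minimal sketch
of my own (the CPU numbers are made up, and error handling is minimal):

/* Build with: gcc -o pin pin.c */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
        cpu_set_t set;

        CPU_ZERO(&set);
        CPU_SET(0, &set);       /* allow CPU 0 */
        CPU_SET(1, &set);       /* allow CPU 1 */

        /* pid 0 means "the calling process" */
        if (sched_setaffinity(0, sizeof(set), &set) != 0) {
                perror("sched_setaffinity");
                return 1;
        }

        printf("now restricted to CPUs 0 and 1\n");
        return 0;
}

The kernel will then only schedule the process within that group, and
since memory is normally allocated node-locally, its memory tends to stay
local as well.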
|
Let's refocus now on the implementation and the smaller picture, to
examine these "CPU affinity zones" in a bit more detail. The following is
according to the writeups I've seen, mostly on LWN's weekly kernel pages.
(Jon Corbet, LWN editor, does a very good job of balancing the technical
kernel hacker level stuff with the middle-ground not-too-technical kernel
follower stuff, good enough that I find the site worth subscribing to,
even tho I could get even the premium content a week later for free. Yes,
that's an endorsement of the site, because it's where a lot of my info
comes from, and I'm certainly not one to try to keep my knowledge
exclusive!)
|
Anyway... from mainly that source... CPU affinity zones work with sets
and supersets of processors. An Intel hyperthreading pair of virtual
processors on the same physical processor will be at the highest
affinity -- the lowest, aka strongest, grouping in the hierarchy --
because they share the same cache memory all the way up to L1 itself.
The Linux kernel can therefore switch processes between the two virtual
CPUs of a hyperthreaded CPU at essentially zero cost or loss in
performance, taking into account only the relative balance of processes
on each of the hyperthread virtual CPUs.
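
As an aside, newer kernels expose that sibling grouping in sysfs (I'm not
sure exactly which version this appeared in, so treat the path as an
assumption on my part); a quick sketch of my own to print it:

/* Print which CPUs the kernel considers hyperthread siblings of CPU 0.
 * The sysfs path is an assumption about newer 2.6 kernels, and the file
 * only exists on SMT-capable hardware. */
#include <stdio.h>

int main(void)
{
        char buf[256];
        FILE *f = fopen(
            "/sys/devices/system/cpu/cpu0/topology/thread_siblings_list",
            "r");

        if (!f) {
                printf("no SMT topology info for cpu0\n");
                return 0;
        }
        if (fgets(buf, sizeof(buf), f))
                printf("cpu0 thread siblings: %s", buf);
        fclose(f);
        return 0;
}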
|
At the next lower affinity level, we'd have the dual-core AMDs: same
chip, same memory controller, same local memory, same hypertransport
interfaces to the chipset, other CPUs, and the rest of the world, very
tightly cooperative, but with separate L2 and of course separate L1
cache. There's a slight performance penalty to switching processes
between these CPUs, due to the cache flushing it entails, but it's only
very slight, so thread imbalance between the two cores doesn't have to
get bad at all before it's worth switching CPUs to maintain balance, even
at the cost of that cache flush.
|
At a slightly lower level of affinity would be the Intel dual cores,
since they aren't quite so tightly coupled and don't share all the same
interfaces to the outside world. In practice, since only one of these,
the Intel dual core or the AMD dual core, will normally be encountered in
real life, they can be treated at the same level, with possibly a small
internal tweak to the relative weighting of thread imbalance vs. the
performance cost of switching CPUs, based on which one is actually in
place.
|
Here things get interesting, because of the different implementations
available. AMD's 2-way thru 8-way Opterons configured for unified memory
access would come first, because again, their dedicated inter-CPU
hypertransport links let them cooperate more closely than conventional
multi-socket CPUs would. Beyond that, it's a tossup between Intel's
unified memory multi-processors and AMD's NUMA/SUMO memory Opterons. I'd
still say the Opterons cooperate more closely, even in NUMA/SUMO mode,
than Intel chips will with unified memory, due to that SUMO aspect. At
the same time, they have the parallel memory access advantages of NUMA.
|
Beyond that, there are several levels of clustering: local/board;
off-board but short-fat-pipe accessible (using technologies such as PCI
interconnect, fibre channel, and that SGI interconnect tech whose name I
don't recall at the moment); conventional (and Beowulf?) type clustering;
and remote clustering. At each of these levels, as with the above, the
cost to switch processes between peers at the same affinity level gets
higher and higher, so the corresponding process imbalance necessary to
trigger a switch likewise gets higher and higher, until at the extreme of
remote clustering it's almost done manually only, or anyway at the level
of a user-level application managing the transfers rather than the kernel
directly (since, after all, with remote clustering, each remote group is
probably running its own kernel, if not individual machines within that
group).
|
So, the point of all that is that the kernel sees a hierarchical grouping
of CPUs, and is designed with more flexibility to balance processes and
memory use at the high affinity end, and more hesitation to balance, due
to the higher cost involved, at the low affinity end. The main writeup I
read on the subject dealt with thread/process CPU switching, not memory
switching, but within the context of NUMA the principles become so
intertwined it's impossible to separate them, and the writeup very
clearly made the point that the memory issues involved in making the
transfer were included in the cost accounting as well.
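
Just to illustrate the shape of that cost accounting (this is my own toy
sketch, not kernel code -- the levels and numbers are made up), the idea
is simply that each affinity level carries a migration cost, and a
process only moves to a less loaded peer when the imbalance outweighs it:

#include <stdio.h>

struct level {
        const char *name;
        int migration_cost;     /* arbitrary units; higher = more reluctant */
};

static const struct level levels[] = {
        { "HT siblings (shared cache)",          1 },
        { "cores on one die",                    4 },
        { "sockets on one board (SUMO/NUMA)",   16 },
        { "nodes in a cluster",                256 },
};

/* Move only if the load difference more than pays for the migration. */
static int should_migrate(int imbalance, const struct level *lvl)
{
        return imbalance > lvl->migration_cost;
}

int main(void)
{
        int imbalance = 10;     /* pretend one peer has 10 more runnable tasks */
        unsigned i;

        for (i = 0; i < sizeof(levels) / sizeof(levels[0]); i++)
                printf("%-36s migrate? %s\n", levels[i].name,
                       should_migrate(imbalance, &levels[i]) ? "yes" : "no");
        return 0;
}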
|
I'm not sure whether this addressed the point you were trying to make, or
hit beside it, but anyway, it was fun trying to put the principles in
that writeup, along with other facts I've merged along the way, into text
for the first time since I read about it. My dad's a teacher, and I
remember him many times making the point that the best way to learn
something is to attempt to teach it. He used that principle in his own
classes, having the students help each other, and I remember him making
the point about himself as well, at one point, as he struggled to teach
basic accounting principles based only on a textbook and the single
college intro level class he had himself taken years before, when he
found himself teaching a high school class on the subject. The principle
is certainly true, as explaining the affinity clustering principles here
has forced me to ensure they form a reasonable and self-consistent
structure in my own head, in order to be able to explain them in this
post. So, anyway, thanks for the intellectual stimulation! <g>
|
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman in
http://www.linuxdevcenter.com/pub/a/linux/2004/12/22/rms_interview.html


--
gentoo-amd64@g.o mailing list |