Gentoo Archives: gentoo-amd64

From: Duncan <1i5t5.duncan@×××.net>
To: gentoo-amd64@l.g.o
Subject: [gentoo-amd64] Re: Re: amd64 and kernel configuration
Date: Wed, 27 Jul 2005 15:45:29
Message-Id: pan.2005.07.27.15.42.13.414249@cox.net
In Reply to: Re: [gentoo-amd64] Re: amd64 and kernel configuration by Drew Kirkpatrick
Drew Kirkpatrick posted <81469e8e0507270346445f4363@××××××××××.com>,
excerpted below, on Wed, 27 Jul 2005 05:46:28 -0500:

> Just to point out, AMD was calling the Opterons and such more of a SUMO
> configuration (Sufficiently Uniform Memory Organization, not joking here),
> instead of NUMA. While technically it clearly is a NUMA system, the
> differences in latency when accessing memory from a bank attached to
> another processor's memory controller are very small. Small enough to be
> largely ignored, and treated like uniform memory access latencies in an SMP
> system. Sorta in between SMP unified-style memory access and NUMA. This
> holds for up to 3 hypertransport link hops, or up to 8 chips/sockets. If you
> add hypertransport switches to scale over 8 chips/sockets, it'll most
> likely be a different story...

I wasn't aware of the AMD "SUMO" moniker, but it /does/ make sense, given
the design of the hardware. They have a very good point: while it's
physically NUMA, the latency variances are so close to unified that in
many ways it's indistinguishable -- except for the fact that keeping it
NUMA means two different apps running on two different CPUs can each
access their own memory independently and in parallel, rather than one
having to wait for the other, as would happen if the memory were
interleaved and unified (as it would be for quad-channel access, if that
were enabled).

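(For anyone who wants to poke at that difference from userspace: libnuma,
from the numactl package, exposes both policies. What follows is only a
minimal sketch, assuming libnuma is installed and the box actually has
more than one node; the 64 MB buffer is an arbitrary example. Build with
gcc numa-policy.c -lnuma.)

  /* numa-policy.c: node-local vs. interleaved allocation via libnuma. */
  #include <numa.h>
  #include <stdio.h>

  int main(void)
  {
      size_t len = 64 * 1024 * 1024;   /* arbitrary example size */
      char *local, *spread;

      if (numa_available() < 0) {
          fprintf(stderr, "no NUMA support on this kernel/box\n");
          return 1;
      }

      /* allocated from the memory of the node this process is running on,
       * i.e. the parallel-access case described above */
      local = numa_alloc_local(len);

      /* spread page by page across all nodes, approximating the
       * interleaved/unified configuration */
      spread = numa_alloc_interleaved(len);

      if (!local || !spread) {
          fprintf(stderr, "allocation failed\n");
          return 1;
      }

      printf("nodes present: 0..%d\n", numa_max_node());

      numa_free(local, len);
      numa_free(spread, len);
      return 0;
  }
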
> What I've always wondered is, the NUMA code in the Linux kernel, is this
> for handling traditional NUMA, like in a large computer system (big iron)
> where NUMA memory access latencies will vary greatly, or is it simply for
> optimizing the memory usage across the memory banks? Keeping data in the
> memory of the processor using it, etc, etc. Of course none of this matters
> for single chip/socket AMD systems, as dual cores as well as single cores
> share a memory controller. Hmm, maybe I should drink some coffee and
> shut up until I'm awake...

Well, yeah, for single-socket/dual-core, but what about dual-socket
(either single core or dual core)? Your questions make sense there, and
that's what I'm running (single core, tho upgrading to dual core for a
quad-core-total board sometime next year would be very nice, and just
might be within the limits of my budget), so yes, I'm rather interested!

The answer to your question about how the kernel deals with it, as I
understand it, is this: the Linux kernel's SMP/NUMA architecture allows
for "CPU affinity grouping". In earlier kernels it was all automated, but
the kernels are now getting advanced enough to allow deliberate manual
splitting of the various groups, and, combined with userspace control
applications, will ultimately be able to dynamically assign processes to
one or more CPU groups of various sizes, controlling the CPU and memory
resources available to individual processes. So, yes, I guess that means
it's developing some pretty "big iron" qualities, altho many of them are
still in flux and won't be stable, at least in mainline, for another six
months or a year at minimum.

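The userspace end of that is already partially there today: a process can
be confined to an arbitrary set of CPUs with sched_setaffinity(), which
is what taskset(1) is built on. A small sketch, assuming Linux with
glibc; CPUs 0 and 1 are just an assumed example:

  /* pin-self.c: confine the calling process to CPUs 0 and 1 (assumed
   * example), the same mechanism taskset(1) uses. */
  #define _GNU_SOURCE
  #include <sched.h>
  #include <stdio.h>
  #include <unistd.h>

  int main(void)
  {
      cpu_set_t mask;

      CPU_ZERO(&mask);
      CPU_SET(0, &mask);   /* allow CPU 0 */
      CPU_SET(1, &mask);   /* allow CPU 1 */

      /* pid 0 means "the calling process" */
      if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
          perror("sched_setaffinity");
          return 1;
      }

      printf("pid %d is now restricted to CPUs 0-1\n", (int)getpid());
      /* real work would go here; the scheduler may still move the process
       * between CPU 0 and CPU 1, but never off of them */
      return 0;
  }
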
Let's refocus now on the implementation and the smaller picture, and
examine these "CPU affinity zones" in a bit more detail. The following is
according to the writeups I've seen, mostly on LWN's weekly kernel pages.
(Jon Corbet, LWN editor, does a very good job of balancing the technical
kernel-hacker-level stuff with middle-ground, not-too-technical
kernel-follower stuff, well enough that I find the site worth subscribing
to, even tho I could get even the premium content a week later for free.
Yes, that's an endorsement of the site, because it's where a lot of my
info comes from, and I'm certainly not one to try to keep my knowledge
exclusive!)

Anyway... from mainly that source... CPU affinity zones work with sets
and supersets of processors. An Intel hyperthreading pair of virtual
processors on the same physical processor will be at the highest affinity
level, the lowest and strongest grouping in the hierarchy, because they
share the same cache memory all the way up to L1 itself. The Linux kernel
can switch processes between the two virtual CPUs of a hyperthreaded CPU
with essentially zero cost or loss in performance, so it only has to take
into account the relative balance of processes on each of the
hyperthread's virtual CPUs.

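If your kernel exports the CPU topology through sysfs (the exact path
below is an assumption about your kernel version), you can see which
logical CPUs the kernel considers hyperthread siblings of CPU 0:

  /* ht-siblings.c: print which logical CPUs share a physical core with
   * CPU 0. Assumes the kernel provides /sys/devices/system/cpu/.../topology/. */
  #include <stdio.h>

  int main(void)
  {
      const char *path =
          "/sys/devices/system/cpu/cpu0/topology/thread_siblings_list";
      char buf[256];
      FILE *f = fopen(path, "r");

      if (!f) {
          perror(path);
          return 1;
      }
      if (fgets(buf, sizeof(buf), f))
          /* e.g. "0-1" on a hyperthreaded chip, just "0" otherwise */
          printf("CPU 0 shares a core with logical CPUs: %s", buf);
      fclose(f);
      return 0;
  }
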
At the next lower level of affinity, we'd have the dual-core AMDs: same
chip, same memory controller, same local memory, same hypertransport
interfaces to the chipset, other CPUs and the rest of the world, and very
tightly cooperative, but with separate L2 and of course separate L1
caches. There's a slight performance penalty for switching processes
between these CPUs, due to the cache flushing it entails, but it's only
very slight, so the thread imbalance between the two cores doesn't have
to get bad at all before it's worth switching CPUs to maintain balance,
even at the cost of that cache flush.

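You can actually watch that balancing act happen: sched_getcpu() (a
fairly recent glibc addition, so take that as an assumption about your
toolchain) reports which CPU the calling process is on at that instant,
so a dumb loop like this shows whether the scheduler leaves you put or
migrates you:

  /* watch-cpu.c: report which CPU this process is running on, once a
   * second, to observe the scheduler keeping it put or migrating it
   * between cores. sched_getcpu() needs a reasonably recent glibc/kernel. */
  #define _GNU_SOURCE
  #include <sched.h>
  #include <stdio.h>
  #include <unistd.h>

  int main(void)
  {
      int i, cpu;

      for (i = 0; i < 10; i++) {
          cpu = sched_getcpu();
          if (cpu < 0) {
              perror("sched_getcpu");
              return 1;
          }
          printf("tick %d: running on CPU %d\n", i, cpu);
          sleep(1);
      }
      return 0;
  }
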
At a slightly lower level of affinity would be the Intel dual cores,
since they aren't quite so tightly coupled and don't share all the same
interfaces to the outside world. In practice, since only one of the two,
the Intel dual core or the AMD dual core, will normally be encountered in
a given machine, they can be treated at the same level, with possibly a
small internal tweak to the relative weighting of thread imbalance vs.
performance loss for switching CPUs, based on which one is actually in
place.

Here things get interesting, because of the different implementations
available. AMD's 2-way thru 8-way Opterons configured for unified memory
access would come first, because again, their dedicated inter-CPU
hypertransport links let them cooperate more closely than conventional
multi-socket CPUs would. Beyond that, it's a tossup between Intel's
unified-memory multi-processors and AMD's NUMA/SUMO-memory Opterons. I'd
still say the Opterons cooperate more closely, even in NUMA/SUMO mode,
than Intel chips do with unified memory, due to that SUMO aspect. At the
same time, they have the parallel memory access advantages of NUMA.

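How "SUMO" the kernel thinks a given box is shows up in its node distance
table, which libnuma will hand you (again assuming libnuma is installed):
10 means local, and on Opterons the remote entries are typically not much
higher, which is exactly the point above.

  /* numa-dist.c: dump the node distance table the kernel uses in its
   * cost estimates (10 = local; remote nodes show up as larger values). */
  #include <numa.h>
  #include <stdio.h>

  int main(void)
  {
      int i, j, max;

      if (numa_available() < 0) {
          fprintf(stderr, "no NUMA support here\n");
          return 1;
      }

      max = numa_max_node();
      for (i = 0; i <= max; i++) {
          for (j = 0; j <= max; j++)
              printf("%4d", numa_distance(i, j));
          printf("\n");
      }
      return 0;
  }
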
Beyond that, there are several levels of clustering: local/board;
off-board but short-fat-pipe accessible (using technologies such as PCI
interconnect, fibre channel, and that SGI interconnect tech whose name I
don't recall at the moment); conventional (and Beowulf?) type clustering;
and remote clustering. At each of these levels, as with the above, the
cost to switch processes between peers at the same affinity level gets
higher and higher, so the corresponding process imbalance necessary to
trigger a switch likewise gets higher and higher, until at the extreme of
remote clustering it's almost manual only, or anyway handled by a
user-level application managing the transfers rather than by the kernel
directly (since, after all, with remote clustering, each remote group, if
not each individual machine within the group, is probably running its own
kernel).

So, the point of all that is that the kernel sees a hierarchical grouping
of CPUs, and is designed to balance processes and memory use freely at
the high-affinity end, and with more hesitation, due to the higher cost
involved, at the low-affinity end. The main writeup I read on the subject
dealt with thread/process CPU switching, not memory switching, but within
the context of NUMA the principles become so intertwined it's impossible
to separate them, and the writeup very clearly made the point that the
memory issues involved in making a transfer are included in the cost
accounting as well.

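And for completeness, here's what "keeping the memory with the process"
looks like when you do it by hand with libnuma instead of trusting the
scheduler's cost accounting; node 0 is just an assumed example:

  /* stay-local.c: tie the current process and its working buffer to the
   * same node by hand, which is what the scheduler's cost accounting
   * tries to preserve automatically. Node 0 is an arbitrary example. */
  #include <numa.h>
  #include <stdio.h>
  #include <string.h>

  int main(void)
  {
      const int node = 0;
      const size_t len = 16 * 1024 * 1024;
      char *buf;

      if (numa_available() < 0) {
          fprintf(stderr, "no NUMA support here\n");
          return 1;
      }

      /* run only on the CPUs attached to this node... */
      if (numa_run_on_node(node) != 0) {
          perror("numa_run_on_node");
          return 1;
      }

      /* ...and allocate the working set from the same node's memory */
      buf = numa_alloc_onnode(len, node);
      if (!buf) {
          fprintf(stderr, "numa_alloc_onnode failed\n");
          return 1;
      }
      memset(buf, 0, len);   /* touch the pages so they fault in on that node */

      numa_free(buf, len);
      return 0;
  }
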
I'm not sure whether this addressed the point you were trying to make, or
landed beside it, but anyway, it was fun trying to put the principles in
that writeup into text for the first time since I read about them, along
with other facts I've merged in along the way. My dad's a teacher, and I
remember him many times making the point that the best way to learn
something is to attempt to teach it. He used that principle in his own
classes, having the students help each other, and I remember him making
the point about himself as well, at one point, as he struggled to teach
basic accounting principles based only on a textbook and the single
college intro-level class he had himself taken years before, when he
found himself teaching a high school class on the subject. The principle
is certainly true, as explaining the affinity clustering principles here
has forced me to ensure they form a reasonable and self-consistent
structure in my own head, in order to be able to explain them in this
post. So, anyway, thanks for the intellectual stimulation! <g>

--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman in
http://www.linuxdevcenter.com/pub/a/linux/2004/12/22/rms_interview.html

--
gentoo-amd64@g.o mailing list
