Gentoo Archives: gentoo-amd64

From: Jean.Borsenberger@×××××.fr
To: gentoo-amd64@l.g.o
Subject: Re: [gentoo-amd64] Re: Re: amd64 and kernel configuration
Date: Wed, 27 Jul 2005 17:09:28
Message-Id: Pine.LNX.4.58.0507271900500.14425@siolinh.obspm.fr
In Reply to: [gentoo-amd64] Re: Re: amd64 and kernel configuration by Duncan <1i5t5.duncan@cox.net>
Well, maybe it's SUMO, but when we switched on the NUMA option in the
kernel of our quad-processor, 16 GB Opteron, it did speed up the OpenMP
benchmarks by 20% to 30% (depending on the program considered).

Note: OpenMP is a Fortran (and C/C++) extension in which you add
parallelisation directives, without worrying about the implementation
details, using a single address space for all instances of the user
program.

Jean Borsenberger
tel: +33 (0)1 45 07 76 29
Observatoire de Paris Meudon
5 place Jules Janssen
92195 Meudon France

On Wed, 27 Jul 2005, Duncan wrote:

> Drew Kirkpatrick posted <81469e8e0507270346445f4363@××××××××××.com>,
> excerpted below, on Wed, 27 Jul 2005 05:46:28 -0500:
>
> > Just to point out, AMD was calling the Opterons and such more of a SUMO
> > configuration (Sufficiently Uniform Memory Organization, not joking here),
> > instead of NUMA. Whereas technically it clearly is a NUMA system, the
> > difference in latency when accessing memory from a bank attached to
> > another processor's memory controller is very small. Small enough to be
> > largely ignored, and treated like uniform memory access latencies in an
> > SMP system. Sorta in between SMP unified-style memory access and NUMA.
> > This holds for up to 3 hypertransport link hops, or up to 8
> > chips/sockets. If you add hypertransport switches to scale over 8
> > chips/sockets, it'll most likely be a different story...
>
> I wasn't aware of the AMD "SUMO" moniker, but it /does/ make sense, given
> the design of the hardware. They have a very good point: while it's
> physically NUMA, the latency variances are so close to uniform that in
> many ways it's indistinguishable -- except for the fact that keeping it
> NUMA means allowing two different apps running on two different CPUs
> independent access to their own memory in parallel, rather than one
> having to wait for the other, as it would if the memory were interleaved
> and unified (as it would be for quad-channel access, if that were enabled).
>
> > What I've always wondered is, the NUMA code in the Linux kernel -- is
> > this for handling traditional NUMA, as in a large computer system (big
> > iron) where NUMA memory access latencies will vary greatly, or is it
> > simply for optimizing memory usage across the memory banks, keeping data
> > in the memory of the processor using it, etc.? Of course none of this
> > matters for single chip/socket AMD systems, as dual cores as well as
> > single cores share a memory controller. Hmm, maybe I should drink some
> > coffee and shut up until I'm awake...
>
> Well, yeah, for single-socket/dual-core, but what about dual socket
> (either single core or dual core)? Your questions make sense there, and
> that's what I'm running (single core, tho upgrading to dual core for a
> quad-core total board sometime next year would be very nice, and just
> might be within the limits of my budget), so yes, I'm rather interested!
>
> The answer to your question on how the kernel deals with it, by my
> understanding, is this: the Linux kernel SMP/NUMA architecture allows for
> "CPU affinity grouping". In earlier kernels it was all automated, but
> they are actually getting advanced enough now to allow deliberate manual
> splitting of various groups, and combined with userspace control
> applications, will ultimately be able to dynamically assign processes to
> one or more CPU groups of various sizes, controlling the CPU and memory
> resources available to individual processes. So, yes, I guess that means
> it's developing some pretty "big iron" qualities, altho many of them are
> still in flux and won't be stable, at least in mainline, for another six
> months or a year, at minimum.
>
> Let's refocus now back on the implementation and the smaller picture once
> again, to examine these "CPU affinity zones" in a bit more detail. The
> following is according to the writeups I've seen, mostly on LWN's weekly
> kernel pages. (Jon Corbet, LWN editor, does a very good job of balancing
> the technical kernel-hacker-level stuff with the middle-ground
> not-too-technical kernel-follower stuff, good enough that I find the site
> useful enough to subscribe, even tho I could get even the premium content
> a week later for free. Yes, that's an endorsement of the site, because
> it's where a lot of my info comes from, and I'm certainly not one to try
> to keep my knowledge exclusive!)
>
> Anyway... from mainly that source... CPU affinity zones work with sets
> and supersets of processors. An Intel hyperthreading pair of virtual
> processors on the same physical processor will be at the highest affinity
> level, the lowest level (aka strongest grouping) in the hierarchy, because
> they share the same cache memory all the way up to L1 itself, and the
> Linux kernel can switch processes between the two virtual CPUs of a
> hyperthreaded CPU with essentially zero cost or loss in performance,
> therefore taking into account only the relative balance of processes on
> each of the hyperthread virtual CPUs.
>
> At the next level of affinity, we'd have the dual-core AMDs: same chip,
> same memory controller, same local memory, same hypertransport interfaces
> to the chipset, the other CPUs, and the rest of the world, and very
> tightly cooperative, but with separate L2 and of course separate L1 cache.
> There's a slight performance penalty in switching processes between these
> CPUs, due to the cache flushing it entails, but it's only very slight, so
> the thread imbalance between the two processors doesn't have to get bad at
> all before it's worth switching CPUs to maintain balance, even at the cost
> of that cache flush.
>
> At a slightly lower level of affinity would be the Intel dual cores, since
> they aren't quite so tightly coupled, and don't share all the same
> interfaces to the outside world. In practice, since only one of these --
> the Intel dual core or the AMD dual core -- will normally be encountered
> in real life, they can be treated at the same level, with possibly a small
> internal tweak to the relative weighting of thread imbalance vs.
> performance loss for switching CPUs, based on which one is actually in
> place.
>
> Here things get interesting, because of the different implementations
> available. AMD's 2-way thru 8-way Opterons configured for unified memory
> access would be first, because again, their dedicated inter-CPU
> hypertransport links let them cooperate more closely than conventional
> multi-socket CPUs would. Beyond that, it's a tossup between Intel's
> unified-memory multi-processors and AMD's NUMA/SUMO-memory Opterons. I'd
> still say the Opterons cooperate more closely, even in NUMA/SUMO mode,
> than Intel chips do with unified memory, due to that SUMO aspect. At the
> same time, they have the parallel memory access advantages of NUMA.
>
> Beyond that, there are several levels of clustering: local/board;
> off-board but short-fat-pipe accessible (using technologies such as PCI
> interconnect, fibre-channel, and that SGI interconnect tech IDR the name
> of ATM); conventional (and Beowulf?) type clustering; and remote
> clustering. At each of these levels, as with the above, the cost to switch
> processes between peers at the same affinity level gets higher and higher,
> so the corresponding process imbalance necessary to trigger a switch
> likewise gets higher and higher, until at the extreme of remote clustering
> it's almost only done manually, or anyway at the level of a user-level
> application managing the transfers rather than the kernel directly (since,
> after all, with remote clustering, each remote group is probably running
> its own kernel, if not individual machines within that group).
>
> So, the point of all that is that the kernel sees a hierarchical grouping
> of CPUs, and is designed with more flexibility to balance processes and
> memory use at the extreme-affinity end, and more hesitation to balance
> them, due to the higher cost involved, at the extremely-low-affinity end.
> The main writeup I read on the subject dealt with thread/process CPU
> switching, not memory switching, but within the context of NUMA the
> principles become so intertwined it's impossible to separate them, and the
> writeup very clearly made the point that the memory issues involved in
> making the transfer were included in the cost accounting as well.
>
> I'm not sure whether this addressed the point you were trying to make, or
> hit beside it, but anyway, it was fun trying to put into text, for the
> first time since I read about it, the principles in that writeup, along
> with other facts I've merged along the way. My dad's a teacher, and I
> remember him many times making the point that the best way to learn
> something is to attempt to teach it. He used that principle in his own
> classes, having the students help each other, and I remember him making
> the point about himself as well, at one point, as he struggled to teach
> basic accounting principles based only on a textbook and the single
> college intro-level class he had himself taken years before, when he found
> himself teaching a high school class on the subject. The principle is
> certainly true, as by explaining the affinity clustering principles here,
> it has forced me to ensure they form a reasonable and self-consistent
> infrastructure in my own head, in order to be able to explain them in this
> post. So, anyway, thanks for the intellectual stimulation! <g>
>
> --
> Duncan - List replies preferred. No HTML msgs.
> "Every nonfree program has a lord, a master --
> and if you use the program, he is your master." Richard Stallman in
> http://www.linuxdevcenter.com/pub/a/linux/2004/12/22/rms_interview.html
--
gentoo-amd64@g.o mailing list