Well, maybe it's SUMO, but when we switched on the NUMA option in the
kernel of our quad-processor, 16 GB Opteron, it did speed up the
OpenMP benchmarks by 20% to 30% (depending on the program considered).

Note: OpenMP is a directive-based extension of Fortran (and of C/C++)
in which you put parallelisation directives, without worrying about
the implementation details, using a single address space for all
instances of the user program.
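
For a flavour of it, here is a minimal sketch in C (the harmonic-sum
loop is only an illustration, not one of our benchmarks); the single
pragma carries the whole parallelisation:

    /* build with, e.g.: gcc -fopenmp omp_sketch.c */
    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        double sum = 0.0;
        int i;

        /* one directive parallelises the loop; all threads work in
           the same, single address space */
        #pragma omp parallel for reduction(+:sum)
        for (i = 1; i <= 1000000; i++)
            sum += 1.0 / i;

        printf("sum = %f (up to %d threads)\n",
               sum, omp_get_max_threads());
        return 0;
    }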

Jean Borsenberger
tel: +33 (0)1 45 07 76 29
Observatoire de Paris Meudon
5 place Jules Janssen
92195 Meudon France

On Wed, 27 Jul 2005, Duncan wrote:

> Drew Kirkpatrick posted <81469e8e0507270346445f4363@××××××××××.com>,
> excerpted below, on Wed, 27 Jul 2005 05:46:28 -0500:
>
> > Just to point out, AMD was calling the Opterons and such more of a SUMO
> > configuration (Sufficiently Uniform Memory Organization, not joking here),
> > instead of NUMA. Whereas technically it clearly is a NUMA system, the
> > differences in latency when accessing memory from a bank attached to
> > another processor's memory controller are very small. Small enough to be
> > largely ignored, and treated like uniform memory access latencies in an
> > SMP system. Sorta in between SMP unified-style memory access and NUMA.
> > This holds for up to 3 HyperTransport link hops, or up to 8
> > chips/sockets. If you add HyperTransport switches to scale over 8
> > chips/sockets, it'll most likely be a different story...
>
> I wasn't aware of the AMD "SUMO" moniker, but it /does/ make sense, given
> the design of the hardware. They have a very good point: while it's
> physically NUMA, the latency variances are so close to uniform that in
> many ways it's indistinguishable -- except for the fact that keeping it
> NUMA means allowing two different apps running on two different CPUs
> independent, parallel access to their own memory, rather than one having
> to wait for the other, as would happen if the memory were interleaved
> and unified (as it would be for quad-channel access, if that were
> enabled).
>
> > What I've always wondered is: the NUMA code in the Linux kernel, is it
> > for handling traditional NUMA, as in a large computer system (big iron)
> > where NUMA memory access latencies will vary greatly, or is it simply
> > for optimizing memory usage across the memory banks, keeping data in
> > the memory of the processor using it, etc., etc.? Of course none of
> > this matters for single-chip/socket AMD systems, as dual cores as well
> > as single cores share a memory controller. Hmm, maybe I should drink
> > some coffee and shut up until I'm awake...
>
> Well, yeah, for single-socket/dual-core, but what about dual socket
> (either single core or dual core)? Your questions make sense there, and
> that's what I'm running (single core, tho upgrading to dual core for a
> quad-core-total board sometime next year would be very nice, and just
> might be within the limits of my budget), so yes, I'm rather interested!
>
> The answer to your question on how the kernel deals with it, by my
> understanding, is this: the Linux kernel SMP/NUMA architecture allows
> for "CPU affinity grouping". In earlier kernels it was all automated,
> but things are now getting advanced enough to allow deliberate manual
> splitting of the various groups, and, combined with userspace control
> applications, that will ultimately make it possible to dynamically
> assign processes to one or more CPU groups of various sizes,
> controlling the CPU and memory resources available to individual
> processes. So, yes, I guess that means it's developing some pretty "big
> iron" qualities, altho many of them are still in flux and won't be
> stable, at least in mainline, for another six months or a year, at
> minimum.
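
(As a concrete, hedged sketch of the manual end of this: the
sched_setaffinity(2) syscall is how a userspace tool pins a process
to a chosen CPU group. The choice of CPUs 0 and 1 below is arbitrary:

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    int main(void)
    {
        cpu_set_t set;

        CPU_ZERO(&set);    /* start from an empty CPU mask */
        CPU_SET(0, &set);  /* allow CPU 0 */
        CPU_SET(1, &set);  /* allow CPU 1 */

        /* pid 0 means "the calling process" */
        if (sched_setaffinity(0, sizeof(set), &set) == -1) {
            perror("sched_setaffinity");
            return 1;
        }
        return 0;
    }

The taskset(1) utility wraps the same call for already-running
processes.)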
>
> Let's refocus now back on the implementation and the smaller picture once
> again, to examine these "CPU affinity zones" in a bit more detail. The
> following is according to the writeups I've seen, mostly on LWN's weekly
> kernel pages. (Jon Corbet, LWN editor, does a very good job of balancing
> the technical kernel hacker level stuff with the middle-ground
> not-too-technical kernel follower stuff, good enough that I find the site
> useful enough to subscribe, even tho I could get even the premium content
> a week later for free. Yes, that's an endorsement of the site, because
> it's where a lot of my info comes from, and I'm certainly not one to try
> to keep my knowledge exclusive!)
>
> Anyway... from mainly that source... CPU affinity zones work with sets
> and supersets of processors. An Intel hyperthreading pair of virtual
> processors on the same physical processor will be at the highest
> affinity level, the lowest (that is, strongest) grouping in the
> hierarchy, because they share the same cache memory all the way up to
> L1 itself. The Linux kernel can switch processes between the two
> virtual CPUs of a hyperthreaded CPU with essentially zero cost or loss
> in performance, therefore taking into account only the relative balance
> of processes on each of the hyperthread virtual CPUs.
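
(On Linux you can see which virtual CPUs are hyperthread siblings of
one physical core via sysfs; a small illustrative C reader, assuming
a kernel that exports the usual topology files:

    #include <stdio.h>

    int main(void)
    {
        char buf[64];
        FILE *f = fopen("/sys/devices/system/cpu/cpu0/topology/"
                        "thread_siblings_list", "r");

        if (f == NULL) {
            perror("fopen");
            return 1;
        }
        if (fgets(buf, sizeof(buf), f) != NULL)
            printf("cpu0 shares its core with: %s", buf);
        fclose(f);
        return 0;
    }

Output such as "0,1" means cpu0 and cpu1 form one hyperthreaded pair,
the strongest affinity grouping described above.)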
>
> At the next lowest affinity level, we'd have the dual-core AMDs: same
> chip, same memory controller, same local memory, same HyperTransport
> interfaces to the chipset, other CPUs, and the rest of the world, and
> very tightly cooperative, but with separate L2 and of course separate
> L1 caches. There's a slight performance penalty in switching processes
> between these CPUs, due to the cache flushing it entails, but it's only
> very slight and quite speedy, so thread imbalance between the two cores
> doesn't have to get bad at all before it's worth switching processes
> between them to maintain balance, even at the cost of that cache flush.
>
> At a slightly lower level of affinity would be the Intel dual cores,
> since they aren't quite so tightly coupled, and don't share all the
> same interfaces to the outside world. In practice, since only one of
> these, the Intel dual core or the AMD dual core, will normally be
> encountered in real life, they can be treated at the same level, with
> possibly a small internal tweak to the relative weighting of thread
> imbalance vs performance loss for switching CPUs, based on which one is
> actually in place.
>
> Here things get interesting, because of the different implementations
> available. AMD's 2-way thru 8-way Opterons configured for unified
> memory access would be first, because again, their dedicated inter-CPU
> HyperTransport links let them cooperate more closely than conventional
> multi-socket CPUs would. Beyond that, it's a tossup between Intel's
> unified-memory multi-processors and AMD's NUMA/SUMO-memory Opterons.
> I'd still say the Opterons cooperate more closely, even in NUMA/SUMO
> mode, than Intel chips will with unified memory, due to that SUMO
> aspect. At the same time, they have the parallel memory access
> advantages of NUMA.
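
(That parallel-access advantage is what NUMA-aware allocation
exploits: keep each process's data in the bank next to the CPU using
it. A hedged sketch with libnuma, if installed; the node number and
buffer size are arbitrary, and it links with -lnuma:

    #include <numa.h>
    #include <stdio.h>

    int main(void)
    {
        if (numa_available() == -1) {
            fprintf(stderr, "kernel has no NUMA support\n");
            return 1;
        }

        size_t len = 1 << 20;  /* 1 MiB, arbitrary */
        /* place the buffer in node 0's local memory bank */
        void *buf = numa_alloc_onnode(len, 0);
        if (buf == NULL) {
            perror("numa_alloc_onnode");
            return 1;
        }

        printf("nodes available: %d\n", numa_max_node() + 1);
        numa_free(buf, len);
        return 0;
    }

By default the kernel already prefers node-local pages for new
allocations; libnuma just makes the placement explicit.)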
>
> Beyond that, there are several levels of clustering: local/board;
> off-board but short-fat-pipe accessible (using technologies such as PCI
> interconnect, fibre-channel, and that SGI interconnect tech IDR the
> name of ATM); conventional (and Beowulf?) type clustering; and remote
> clustering. At each of these levels, as with the above, the cost to
> switch processes between peers at the same affinity level gets higher
> and higher, so the corresponding process imbalance necessary to trigger
> a switch likewise gets higher and higher, until at the extreme of
> remote clustering it's almost done manually only, or anyway at the
> level of a user-level application managing the transfers, rather than
> the kernel directly (since, after all, with remote clustering, each
> remote group is probably running its own kernel, if not individual
> machines within that group).
>
> So, the point of all that is that the kernel sees a hierarchical
> grouping of CPUs, and is designed with more flexibility to balance
> processes and memory use at the high-affinity extreme, and more
> hesitation to balance, due to the higher cost involved, at the
> low-affinity extreme. The main writeup I read on the subject dealt with
> thread/process CPU switching, not memory switching, but within the
> context of NUMA the principles become so intertwined it's impossible to
> separate them, and the writeup very clearly made the point that the
> memory issues involved in making a transfer were included in the cost
> accounting as well.
>
> I'm not sure whether this addressed the point you were trying to make,
> or hit beside it, but anyway, it was fun trying to put into text, for
> the first time since I read about it, the principles in that writeup,
> along with other facts I've merged along the way. My dad's a teacher,
> and I remember him many times making the point that the best way to
> learn something is to attempt to teach it. He used that principle in
> his own classes, having the students help each other, and I remember
> him making the point about himself as well, at one point, as he
> struggled to teach basic accounting principles based only on a textbook
> and the single college intro-level class he had himself taken years
> before, when he found himself teaching a high school class on the
> subject. The principle is certainly true, as by explaining the affinity
> clustering principles here, it has forced me to ensure they form a
> reasonable and self-consistent infrastructure in my own head, in order
> to be able to explain it in the post. So, anyway, thanks for the
> intellectual stimulation! <g>
>
> --
> Duncan - List replies preferred. No HTML msgs.
> "Every nonfree program has a lord, a master --
> and if you use the program, he is your master." Richard Stallman in
> http://www.linuxdevcenter.com/pub/a/linux/2004/12/22/rms_interview.html

--
gentoo-amd64@g.o mailing list