Gentoo Archives: gentoo-dev

From: Kevin <gentoo-dev@××××××.biz>
To: gentoo-dev@l.g.o
Subject: Re: [gentoo-dev] SOLVED: Major MCE problem with SMP on Gentoo kernels
Date: Wed, 26 May 2004 05:18:13
Message-Id: 200405200819.09310.gentoo-dev@gnosys.biz
In Reply to: Re: [gentoo-dev] Major MCE problem with SMP on Gentoo kernels by Kevin
Hi All-

A final note to this thread.

Since replacing my stepping level 7 Xeon with a stepping level 9 Xeon (so
that I now have two identical CPUs, even in stepping level, which was not
true before), I have been unable to reproduce my MCE 0004 error, despite
many hours of high-intensity CPU activity (like emerging many packages,
which is what used to cause the MCE). I even tried this with the kernel
compiled with -march=pentium4 in CFLAGS, which used to cause an MCE after
5 or 10 minutes of emerging mysql when the stepping level 7 and stepping
level 9 CPUs were installed together.

Naturally, I'm delighted by this; however, the whole experience has been
somewhat confusing (although enlightening in many respects).

Memtest86 still behaves exactly as it did before the hardware replacement.
Any thoughts on why it behaves this way with this hardware (unable to set
address range limits, unable to force ECC testing on, program locks up
after 2 minutes of operation)? I suppose that's a question for another
thread in another forum.

I seem to have suffered from no hardware failures on the M/B, the CPUs
(one of the old CPUs is still present; I replaced the other), or the RAM
(although I suppose the stepping level 7 Xeon might have had some
incredibly subtle flaw that only showed up with another CPU present).

The replacement hardware seems to suffer no problems at all, in spite of
what Memtest86 does (fails at 1023.8MB 30 or 40 times and then freezes).

I really appreciate all of the suggestions here. You guys convinced me
that it was hardware, which is why I replaced everything, and that
ultimately solved the problem, although it's not clear that there was
really a hardware problem. The lesson I've learned (though I'm not sure
this is really the root issue) is that when doing multi-processor
computing, make sure that both processors are identical in every way,
including stepping level. Any thoughts on the accuracy of this rule?

But the bizarre thing is that I couldn't reproduce this MCE at all using
another distribution on the same (pre-replacement) hardware. Does Gentoo
push the hardware much harder than other distros? Perhaps because I'm
compiling the code for my particular hardware rather than running code
that was built to run on many different sets of hardware (less aggressive
CFLAGS et al.)? I'm at a loss to explain this.

Again, many thanks for all the help here.

-Kevin

--
gentoo-dev@g.o mailing list