Gentoo Archives: gentoo-dev

From: Kevin <gentoo-dev@××××××.biz>
To: gentoo-dev@l.g.o
Subject: Re: [gentoo-dev] Major MCE problem with SMP on Gentoo kernels
Date: Tue, 18 May 2004 08:29:18
Message-Id: 200405171951.32745.gentoo-dev@gnosys.biz
In Reply to: Re: [gentoo-dev] Major MCE problem with SMP on Gentoo kernels by Greg KH
Again, thanks to all who have commented on this thread. I've now done
some more testing and have some other interesting (though also confusing)
results to report.

On Thursday 13 May 2004 11:54, Greg KH wrote:
> On Thu, May 13, 2004 at 07:06:12AM -0400, Kevin wrote:
> > Greg KH thinks it's bad memory,
>
> It's not only me, it's memtest86 saying it :)

True. Although memtest86 itself is locking up after only 1-2 minutes of
operation. What conclusion should I draw from that?

>
> > but I'm skeptical of that because the main address that fails (some
> > 30 times in a row) is at 1023.8MB and the Dell Utilities only test up
> > to 1022MB, and because I haven't seen the problem with the liveCD
> > kernel.
>
> Maybe that's the fault of the Dell utilities. Seriously, I trust
> memtest86 over any other vendor specific test. If you don't want to
> believe it, that's fine, but I would really consider fixing that issue
> before trying to point the finger at the kernel or the Gentoo install.

You're right, Greg. I finally took your advice and did some serious
testing with the DIMM sticks. This box has 4 slots, DIMMA-DIMMD, and
here's what I've done:
1) swapped the two 512MB sticks between DIMMA and DIMMB (reversed their
positions)
2) removed the 512MB stick from DIMMB (the slots must be filled from
DIMMA up)
3) swapped in the other 512MB stick (so that now I've tried each stick in
DIMMA all by itself, with no sticks in any of the other slots)
4) completely replaced each 512MB stick with new ones from Dell and did
all of 1-3 above with the new sticks.

In every case, memtest86 v3.0, memtest86 v3.1a, and memtest86+ v1.0 all
behave very similarly. That is, they show 1023.8MB (or 511.8MB if only
one stick is installed) as repeatedly failing (some 30 or 40 times), and
then they either (a) show 304.5MB failing, followed by three more failed
tests at 1023.8MB (or 511.8MB), and then the program locks up; or (b)
show three failed tests at 64.0MB, then three more at 1023.8MB (or
511.8MB), then one more failed test at 64.0MB, then one more at 0.6MB,
then one more at 1023.8MB (or 511.8MB), and then the program locks up.

Since I had the extra sticks, I also tried testing with all 4 slots filled
and got very similar results to those described above, except the
repeatedly failing address was 2047.8MB (in all cases, 512MB, 1024MB, and
2048MB, the repeatedly failing address is 0.2MB below the max).

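To make that pattern explicit, here is a small Python sketch of the
arithmetic. The figures are the ones reported above; the helper name
gap_below_top is just something made up for illustration, and since
memtest86 displays addresses rounded to 0.1MB, the gap is only known to
that precision.

```python
# Installed memory (MB) mapped to the address (MB) that memtest86 kept
# flagging, as reported in this message.
observed = {512: 511.8, 1024: 1023.8, 2048: 2047.8}


def gap_below_top(installed_mb, failing_mb):
    """Distance in MB between the top of installed RAM and the failing address."""
    # Round to 0.1 MB, the precision of the addresses memtest86 displays.
    return round(installed_mb - failing_mb, 1)


gaps = {size: gap_below_top(size, addr) for size, addr in observed.items()}

for size, gap in gaps.items():
    print(f"{size} MB installed: failing address is {gap} MB below the top")
```

The gap is the same in all three configurations, so the failing address
moves with the amount of installed memory rather than staying with any
one stick or slot.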
There are no intermittent failing addresses---there are two very specific
patterns to the failures, and the program always locks up after following
one pattern or the other.

In all of the memory configurations I tried, the Dell utilities reported
no memory errors (or any other hardware errors).

Although I'm sure there are others here with more experience
troubleshooting such problems, I think the above is enough to base a
pretty sound conclusion upon, and the conclusion I would draw is that
hardware and memory are not the cause of these MCE problems. I welcome
anyone contradicting that conclusion, because I've never seen anything
like this before and I'm at a loss on how to resolve it. I'm tempted to
try replacing one of the CPUs to see if identical stepping levels (my
CPU0 is stepping level 7 and CPU1 is level 9, but they are otherwise
identical) will resolve the problem.

I also tried getting memtest86 (and variants) to let me turn on the ECC
portion of the tests, to no avail. When I tried sizing the memory, the
"probe" setting returned 1024MB, the "use bios std" setting returned
1024MB, and the "use bios all" setting locked the program up.

I also tried something else that had an enormous positive effect on the
situation---I changed -march=pentium4 to -march=pentium3 in my CFLAGS and
built another kernel with identical .config settings. With that kernel
running, I did some 2-4 hours of solid compiling work, emerging and
re-emerging packages like mysql, cyrus-sasl, cyrus-imapd, mit-krb5,
openafs, etc. Unfortunately, this kernel also eventually froze after more
of the same, and it did so with the same error message:
MCE 0000000000000004.

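For anyone who wants to read that number, here is a minimal Python
sketch that decodes it. It rests on an assumption I have not confirmed
against this kernel's source: that the value printed after "MCE" is the
contents of the IA32_MCG_STATUS register. The bit names are the ones in
the Intel manuals; decode_mcg_status is a hypothetical helper written
for this message.

```python
# Low-order flag bits of IA32_MCG_STATUS, per the Intel manuals.
# ASSUMPTION: the 16-hex-digit value in the kernel message is this register.
MCG_STATUS_BITS = {
    0: ("RIPV", "restart IP valid: execution can be resumed reliably"),
    1: ("EIPV", "error IP valid: saved IP is tied to the error"),
    2: ("MCIP", "machine check in progress"),
}


def decode_mcg_status(hex_value):
    """Return the names of the flag bits set in a hex MCG_STATUS string."""
    value = int(hex_value, 16)
    return [
        name
        for bit, (name, _meaning) in sorted(MCG_STATUS_BITS.items())
        if value & (1 << bit)
    ]


flags = decode_mcg_status("0000000000000004")
print(flags)  # -> ['MCIP']
```

If that assumption holds, the value says only that a machine check was
in progress and that the CPU did not consider execution restartable
(RIPV clear). It does not say which unit reported the error; that detail
would be in the per-bank status values, if the kernel printed any.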
I tried using parsemce.c from
http://www.codemonkey.org.uk/cruft/parsemce.c/. I built it and ran it,
but the output wasn't very helpful, and I'm not quite sure what I'm
supposed to do with it.

Chris, I'm going to try your kernel. Thanks for offering that. I'll
report whatever I learn from that test.

Again, I really appreciate all the thoughtful replies on what to try next
to resolve this problem. If anyone has other suggestions, I'd love to
hear them. Perhaps I could send my .config file to someone and they could
try cross-compiling a kernel for me to try running?

Thanks again.

--
-Kevin

--
gentoo-dev@g.o mailing list
