Again, thanks to all who have commented on this thread. I've now done
some more testing and have some other interesting (though also confusing)
results to report.

On Thursday 13 May 2004 11:54, Greg KH wrote:
> On Thu, May 13, 2004 at 07:06:12AM -0400, Kevin wrote:
> > Greg KH thinks it's bad memory,
>
> It's not only me, it's memtest86 saying it :)

True, although memtest86 itself is locking up after only 1-2 minutes of
operation. What conclusion should I draw from that?

>
> > but I'm skeptical of that because the main address that fails (some
> > 30 times in a row) is at 1023.8MB and the Dell Utilities only test up
> > to 1022MB, and because I haven't seen the problem with the liveCD
> > kernel.
>
> Maybe that's the fault of the Dell utilities. Seriously, I trust
> memtest86 over any other vendor specific test. If you don't want to
> believe it, that's fine, but I would really consider fixing that issue
> before trying to point the finger at the kernel or the Gentoo install.

You're right, Greg. I finally took your advice and did some serious
testing with the DIMM sticks. This box has 4 slots, DIMMA-DIMMD, and
here's what I've done:
1) swapped one 512MB stick for the other in DIMMA/DIMMB (reversed their
   positions)
2) removed one 512MB stick from DIMMB (configs require filling from DIMMA
   up)
3) removed the other 512MB stick (so that now I've tried each stick in
   DIMMA all by itself, and no sticks in any of the other slots)
4) completely replaced each 512MB stick with new ones from Dell and did
   all of 1-3 above with the new sticks.

In every case, memtest86 v3.0, memtest86 v3.1a, and memtest86+ v1.0 all
behave very similarly. That is, they show 1023.8MB (or 511.8MB if only one
stick is installed) as repeatedly failing (some 30 or 40 times), and then
they either (a) show 304.5MB failing and three more failed tests of
1023.8MB (or 511.8MB) and then the program locks up; or (b) show three
failed tests at 64.0MB, then three more at 1023.8MB (or 511.8MB), then
one more failed test at 64.0MB, then one more at 0.6MB, then one more at
1023.8MB (or 511.8MB), and then the program locks up.

Since I had the extra sticks, I also tried testing with all 4 slots filled
and got very similar results to those described above, except the
repeatedly failing address was 2047.8MB (in all cases, 512MB, 1024MB, and
2048MB, the repeatedly failing address is 0.2MB below the max).
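The 0.2MB offset can be checked with a quick sketch. This is only
illustrative arithmetic over the three totals listed above; the function
name is made up for the example, and the addresses themselves come from
memtest86, not from this code.

```python
# Illustrative only: checks the "0.2MB below the max" pattern for the
# three memory totals tested above. expected_failing_mb is a hypothetical
# helper, not part of memtest86.

def expected_failing_mb(total_mb):
    """Return the address (in MB) that sits 0.2MB below the installed total."""
    # Work in tenths of a MB so the subtraction is exact integer arithmetic.
    return (total_mb * 10 - 2) / 10

# Installed total (MB) -> repeatedly failing address reported by memtest86 (MB)
observed = {512: 511.8, 1024: 1023.8, 2048: 2047.8}

for total_mb, failing_mb in observed.items():
    match = expected_failing_mb(total_mb) == failing_mb
    print(f"{total_mb}MB installed: failing at {failing_mb}MB, matches pattern: {match}")
```

All three totals match, so the failing address tracks the top of installed
memory rather than any particular stick or slot.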

There are no intermittent failing addresses; there are two very specific
patterns to the failures, and the program always locks up after following
one pattern or the other.

In all of the memory configurations I tried, the Dell utilities reported
no memory errors (or any other hardware errors).

Although I'm sure there are others here with more experience
troubleshooting such problems, I'm thinking that the above is enough to
base a pretty sound conclusion upon, and the conclusion I would draw is
that hardware and memory are not the cause of these MCE problems. I
welcome anyone contradicting that conclusion because I've never seen
anything like this before and I'm at a loss on how to resolve it. I'm
tempted to try replacing one of the CPUs to see if identical stepping
levels (my CPU0 is stepping level 7 and CPU1 is level 9, but they are
otherwise identical) will resolve the problem.

I also tried getting memtest86 (and variants) to let me turn on the ECC
portion of the tests, to no avail. When I tried sizing the memory, the
"probe" setting returned 1024MB, the "use BIOS std" setting returned
1024MB, and the "use BIOS all" setting locked the program up.

I also tried something else that had an enormous positive effect on the
situation: I changed -march=pentium4 to -march=pentium3 in my CFLAGS and
built another kernel with identical .config settings. With that kernel
running, I did some 2-4 hours of solid compiling work, emerging and
re-emerging packages like mysql, cyrus-sasl, cyrus-imapd, mit-krb5,
openafs, etc. Unfortunately, this kernel also ended up freezing after
doing more of the same, and it did so with the same error message,
MCE 0000000000000004.
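As a rough sketch of how that number might be read: the code below assumes
the value printed after "MCE" is the raw IA32_MCG_STATUS register, which
nothing in this message confirms. The bit names (RIPV, EIPV, MCIP) are
taken from Intel's machine-check documentation; decode_mcg_status is a
hypothetical helper written for this example.

```python
# Sketch under an assumption: the 16-digit hex value in the kernel message
# is the IA32_MCG_STATUS register. Bit names follow Intel's documentation;
# decode_mcg_status is a hypothetical helper, not part of parsemce.c.

MCG_STATUS_FLAGS = (
    (0, "RIPV", "restart IP valid"),
    (1, "EIPV", "error IP valid"),
    (2, "MCIP", "machine check in progress"),
)

def decode_mcg_status(hex_value):
    """Return the names of the flags set in a hex MCG_STATUS string."""
    value = int(hex_value, 16)
    return [name for bit, name, _desc in MCG_STATUS_FLAGS if value & (1 << bit)]

print(decode_mcg_status("0000000000000004"))  # prints ['MCIP']
```

Under that assumption, the value only says that a machine check was in
progress, with neither RIPV nor EIPV set. The actual cause would be
recorded in the per-bank status registers, which this number does not
include.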

I tried using parsemce.c from
http://www.codemonkey.org.uk/cruft/parsemce.c/. I built it and ran it,
but it wasn't very helpful and I'm not quite sure what I'm supposed to do
with it.

Chris, I'm going to try your kernel. Thanks for offering that. I'll
relate whatever I learn from that test.

Again, I really appreciate all the thoughtful replies on what to try next
to resolve this problem. If there are any others, or if anyone has
suggestions on what to try next, I'd love to hear them. Perhaps I could
send my .config file to someone and they could try cross-compiling a
kernel for me to try running?

Thanks again.

--
-Kevin

--
gentoo-dev@g.o mailing list