Gentoo Archives: gentoo-dev

From: Kevin <gentoo-dev@××××××.biz>
To: gentoo-dev@l.g.o
Subject: Re: [gentoo-dev] Major MCE problem with SMP on Gentoo kernels
Date: Wed, 19 May 2004 17:50:48
Message-Id: 200405191348.12015.gentoo-dev@gnosys.biz
In Reply to: Re: [gentoo-dev] Major MCE problem with SMP on Gentoo kernels by Josh Glover
Thanks again for the replies, folks.

Well, I've now replaced the system motherboard, the CPU (first tried
removing one CPU and memtest86 behaved the exact same way, then replaced
the CPU with the new one), and the RAM. Results: memtest86 and friends
all behave the exact same way. Could this still be a hardware problem?
I'm hard-pressed to believe that I have two different motherboards that
just happen to suffer from the same flaw (they are not even the same
exact version: one is version B2 and the other is version C4). The only
things that are common between the system now and the system before are:
(1) the SCSI controller card (RAID card) (another SCSI controller was
replaced with the m/b), (2) 2 SCSI hard drives connected to the RAID
card, (3) a PCI hardware controller based modem, and (4) the SCSI
hot-plug backplane. Could one of these be causing the problem? I
haven't tried reproducing my MCE 0004 error again, but memtest86 shows no
difference. Can anyone buy into the notion now that memtest86 is doing
something that it shouldn't be doing when testing this system? Again,
the Dell Utilities are all turning up flawless. I've set the
configuration in memtest86 to limit the address range it tests to those
addresses below 1022MB of RAM (this is what the Dell utilities test with
1024MB RAM installed), but it ignores those limits and tests up to 1024MB
anyway and that's where it's still finding its errors (1023.8MB). I've
configured memtest86 to turn on ECC testing and it refuses to do so (when
I touch (8) for restart tests, the setting returns to off). What's going
on here?

Any thoughts are most welcome. I'll be trying to reproduce my MCE error
with this new hardware, and I'll post results when I have them.

Thanks again for all the replies.

On Tuesday 18 May 2004 08:02, Josh Glover wrote:
> Quoth Kevin (Tue 2004-05-18 04:29:58AM -0400):
[...]
> > True. Although it is locking up after only 1-2 minutes of operation.
> > What conclusion should I draw from that?
>
> Bad system board. :(

I just replaced it. Still does the same thing.

>
> > Although I'm sure there are others here with more experience
> > troubleshooting such problems, I'm thinking that the above is enough
> > to base a pretty sound conclusion upon, and the conclusion I would
> > draw is that hardware and memory are not the cause of these MCE
> > problems.
>
> Wrong. memtest86 giving you errors almost always indicates a hardware
> problem. You have changed the memory, but what remained consistent? The
> memory bus! Try a new system board.

New system board includes a new memory bus. Still get the same results.

>
> > I also tried something else that had an enormous positive effect on
> > the situation---I changed -march=pentium4 to -march=pentium3 in my
> > CFLAGS
>
> All you have done is turn off SSE2 instructions and possibly a few
> others that the P4s have and the P3s do not. If something is wrong with
> your system board or CPU, less stress on the CPU is likely not to show
> problems as often.

That's a good point. I'll try reproducing the MCE now with the new
hardware.

>
> You have bad hardware, Kevin. Try the compile test with one CPU at a
> time (i.e. take one out), and if that is not illuminating, replace the
> system board.

Thanks again gents!

-Kevin

--
gentoo-dev@g.o mailing list