Gentoo Archives: gentoo-amd64

From: Duncan <1i5t5.duncan@×××.net>
To: gentoo-amd64@l.g.o
Subject: [gentoo-amd64] Re: mce exception -- howto determine real problem?
Date: Sun, 27 Apr 2008 10:52:35
Message-Id: pan.2008.04.27.10.52.16@cox.net
In Reply to: [gentoo-amd64] mce exception -- howto determine real problem? by Alexander Puchmayr
1 Alexander Puchmayr <alexander.puchmayr@×××××××.at> posted
2 200804271523.25677.alexander.puchmayr@×××××××.at, excerpted below, on
3 Sun, 27 Apr 2008 15:23:25 +0200:
4
5 > After being abroad for a month I tried to boot my amd64 machine and it
6 > crashes all the time, doing spontaneous reboots, hangs in the bios, and
7 > occasionally it tells me something about mce exception when loading
8 > hal-daemon, and immediately after that kernel panic, nothing works.
9 >
10 > The MCE-exception also says I should run mcelog --ascii, but how should
11 > I do that if the machine is virtually dead in the kernel panic state?
12
13 If you can, get the MCE number, write it down or whatever. Then get the
14 program parsemce (written by Dave Jones, the kernel guy). The sources
15 are here:
16
17 http://www.kernel.org/pub/linux/kernel/people/davej/tools/parsemce.c
18
19 The google hits are here:
20
21 http://www.google.com/linux?lr=lang_en&hl=en&q=parsemce
22
23 The personal experience:
24
25 The (generic) memory I originally had in this machine was apparently not
26 quite up to its on-module and sold-as rating (PC3200). Unfortunately,
27 the original BIOS and early updates didn't have memory timing control
28 either; the board simply ran what the on-module rating said it was good
29 for (assuming the board and CPUs, Opterons with onboard memory
30 controllers, could run it as well), and that proved somewhat optimistic.
31
32 Memtest86 checked out clean, because it tests memory retention in various
33 patterns, not memory timing, which it doesn't stress. However, I'd get
34 crashes often enough I knew something was wrong, and they'd happen more
35 frequently under high activity, tho it didn't seem to be directly CPU
36 related. I did notice occasional bunzip2 errors claiming a corrupt
37 archive, but trying again would usually work. That seemed the only
38 common user-space symptom I had -- the only other one was the MCEs and
39 often (but not always) kernel panics as a result. (I had the option to
40 panic automatically on detected error turned off, setting it to try to
41 continue if it could -- the only way I was able to compile things,
42 sometimes. I also learned how to restart an in-progress merge after a
43 reboot...)
44
45 I suspected it might be memory but as I said memtest came up clean, and
46 it could have been the memory controller on the CPU, or the on-chip
47 cache, or...
48
49 Finally I found out that MCE stood for machine-check-exception, and that
50 the number it reports could be checked against a chart AMD and Intel both
51 provide to see what the machine check said was going wrong. Parsemce was
52 the way to do this at the time, tho it now seems there may be other
53 methods, based on your comments.
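
To give an idea of what that lookup involves, here is a minimal Python
sketch that splits a machine-check status value into the architectural
flag bits and the low 16-bit error code, then classifies the code by its
bit pattern. This is not parsemce and not how parsemce is written; the
names FLAGS, classify and decode are made up for illustration, the
example value is hypothetical (not one from my machine), and the
vendor-specific bits that the AMD and Intel charts cover are ignored.

```python
# Sketch: decode the architectural fields of an x86 MCi_STATUS value.
# Only the common flag bits and the simple error-code patterns are handled.
FLAGS = {
    63: "VAL (register holds a valid error)",
    62: "OVER (an earlier error was overwritten)",
    61: "UC (uncorrected error)",
    60: "EN (error reporting enabled)",
    59: "MISCV (MISC register valid)",
    58: "ADDRV (ADDR register valid)",
    57: "PCC (processor context corrupt)",
}

def classify(code):
    """Classify the 16-bit MCA error code by its bit pattern."""
    if code & 0xF800 == 0x0800:   # 0000 1PPT RRRR IILL
        return "bus/interconnect error"
    if code & 0xFF00 == 0x0100:   # 0000 0001 RRRR TTLL
        return "memory hierarchy (cache) error"
    if code & 0xFFF0 == 0x0010:   # 0000 0000 0001 TTLL
        return "TLB error"
    return "other (check the vendor's tables)"

def decode(status):
    """Return the set flags, the raw MCA error code, and its class."""
    flags = [name for bit, name in FLAGS.items() if (status >> bit) & 1]
    code = status & 0xFFFF
    return {"flags": flags, "mca_code": code, "kind": classify(code)}

# Hypothetical status value, chosen only to exercise the code.
result = decode(0xB400000000000833)
print(hex(result["mca_code"]), result["kind"])
for flag in result["flags"]:
    print("  ", flag)
```

A real decoder also needs the bank number and the model-specific bits,
which is exactly what the vendor charts are for.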
54
55 Looking it up I found it was indeed memory bus errors, but it still could
56 have been the board or something, not the memory itself. Still, knowing
57 it was generic memory, I figured it was. The problem was that the
58 socket-940 Opterons (which I have) take registered memory, which is
59 nowhere near the commodity that unregistered non-ecc memory is, and I
60 really couldn't afford it at the time.
61
62 Well, Tyan finally came out with a BIOS update that exposed the necessary
63 memory knobs for me to tweak, and I found that limiting speed just one
64 notch, from 400 MHz to 383 MHz (basically PC3000 speed instead of PC3200)
65 solved the problem ENTIRELY. At that speed I could even significantly
66 tighten down the individual memory timings and the system was STILL
67 stable as a rock -- it was the overall clockspeed that the memory just
68 couldn't quite handle, at least on my board. I suspect the memory was
69 sold as generic because it was just at tolerance at one end, and the
70 board is probably near tolerance at the other end. The board had
71 originally been rated only to PC2700 speeds, so I was surprised it
72 clocked the memory at PC3200 in the first place. I had figured generic
73 PC3200 memory should be plenty good to run at PC2700, and it would have
74 been, but by then the BIOS had been updated to take up to PC3200 memory
75 even tho it hadn't been updated to let me clock-limit it if necessary.
76 So the memory simply wasn't stable at those speeds in this board, but
77 would have worked just fine at lower speeds (as it did for me once a
78 BIOS update let me set them) or at rated speeds on many other
79 boards.
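
For anyone wondering how the MHz figures map to the PC ratings: the
PCxxxx number is the module's peak bandwidth in MB/s, which is the
effective data rate times the width of the memory bus. A quick Python
sketch of the arithmetic, assuming the usual 64-bit (8-byte) module bus
and treating the "MHz" figures above as effective data rates; the name
peak_mb_per_s is only for illustration:

```python
# Peak bandwidth of a DDR module: effective data rate x bus width in bytes.
BUS_BYTES = 8  # assumption: standard 64-bit module data bus

def peak_mb_per_s(mega_transfers_per_s):
    """Peak bandwidth in MB/s for a given effective data rate."""
    return mega_transfers_per_s * BUS_BYTES

# 400 is the PC3200 setting, 383 the one-notch-down setting,
# 333 is approximately the PC2700 rate.
for rate in (400, 383, 333):
    print(rate, "->", peak_mb_per_s(rate), "MB/s")
```

So the 383 setting works out to a bit over 3000 MB/s, which is why I
call it basically PC3000.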
80
81 So I ran the memory at PC3000 speeds but with many of the individual
82 timings tightened for a year or so... at which point I could finally
83 afford to upgrade memory to what I /really/ wanted all along -- the 8
84 gigs memory I'm now running, tho it cost me ~US$1100 to do it. It's
85 Super Talent brand, perhaps not top of the line, but at least it's stable
86 at rated speeds.
87
88 Anyway, the first moral of the story is that the MCE reports were accurate:
89 altho they didn't pin it down all the way, they certainly pointed to the
90 area, confirming my earlier suspicions. Parsemce can help with
91 the number to English description mapping. Second is that just because
92 memtest86 says it's fine doesn't necessarily mean the /timings/ are fine,
93 as it tests memory cell reliability and doesn't stress timings.
94
95 Other morals would be to avoid expensive registered memory requirements
96 if possible, avoid generic memory, avoid making assumptions about what a
97 board will try to run the memory at despite what it's rated to run it at,
98 and when shopping, confirm if possible that the BIOS contains memory
99 speed tweaking knobs.
100
101 ... The reason I got Opterons in the first place was because I wanted a
102 dual CPU system, and I wanted AMD64, and that's the way you got it at the
103 time. I've been very happy with the system in general; Tyan is quite
104 good with Linux support on many of their boards including this one -- it
105 was certified with several Linux distributions and they even had a
106 preconfigured lm_sensors.conf available for download! =8^) Additionally,
107 when the dual-cores came out, all I had to do was upgrade the BIOS and I
108 was ready to upgrade the CPUs (which I've now done, to dual dual-core
109 Opteron 290s, at 2.8 GHz, top of the socket 940 line). It sure would
110 have been nice to have the memory tweaking stuff in the BIOS rather
111 earlier, however.
112
113 Meanwhile, I plan on keeping this one for a while (thus the money sunk in
114 upgrades). When I do upgrade again, I'd like to make it a Tyan once
115 again, but 4 cores is really plenty and likely will still be plenty at
116 that time, and that's available in single-socket desktops now, so
117 I won't need to spend the extra $$ on server-class gear.
118
119 As for your problems, keep in mind that a failing or overloaded
120 power supply can produce similar symptoms as well, as can unstable
121 wall power, or overclocking, altho it doesn't sound like you're into that.
122 But since you are getting MCEs, I'd definitely check what they say is wrong
123 before anything else. The machine is providing the information, why not
124 use it?
125
126 --
127 Duncan - List replies preferred. No HTML msgs.
128 "Every nonfree program has a lord, a master --
129 and if you use the program, he is your master." Richard Stallman
130
131 --
132 gentoo-amd64@l.g.o mailing list
