Alexander Puchmayr <alexander.puchmayr@×××××××.at> posted
200804271523.25677.alexander.puchmayr@×××××××.at, excerpted below, on
Sun, 27 Apr 2008 15:23:25 +0200:

> After being abroad for a month I tried to boot my amd64 maschine and it
> crashes all the time, doing spontaneous reboots, hangs in the bios, and
> occasionally it tells me something about mce exception when loading
> hal-daemon, and immediately after that kernel panic, nothing works.
>
> The MCE-exception also says i should run mcelog --ascii, but how should
> I do that if the maschine is virtually dead in the kernel panic state?

If you can, get the MCE number and write it down. Then get the program
parsemce (written by Dave Jones, the kernel guy). The sources are here:

http://www.kernel.org/pub/linux/kernel/people/davej/tools/parsemce.c

The google hits are here:

http://www.google.com/linux?lr=lang_en&hl=en&q=parsemce

The personal experience:

The (generic) memory I originally had in this machine was apparently not
quite up to its on-module and sold-as rating (PC3200). Unfortunately,
the original BIOS and early updates didn't have memory timing controls
either; the board simply ran the memory at whatever the on-module rating
said it was good for (assuming the board and CPUs, Opterons with onboard
memory controllers, could run it as well), and that proved somewhat
optimistic.

Memtest86 checked out clean, because it tests memory retention in various
patterns, not memory timing, which it doesn't stress. However, I'd get
crashes often enough that I knew something was wrong, and they'd happen
more frequently under high activity, tho it didn't seem to be directly
CPU related. I did notice occasional bunzip2 errors claiming a corrupt
archive, but trying again would usually work. That seemed to be the only
common user-space symptom I had -- the only other signs were the MCEs and
often (but not always) kernel panics as a result. (I had the option to
panic automatically on a detected error turned off, setting it to try to
continue if it could -- sometimes that was the only way I was able to
compile things. I also learned how to restart an in-progress merge after
a reboot...)

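The "corrupt archive, but it works on retry" symptom can be provoked on
purpose. What follows is only a rough sketch of that idea, not a
substitute for a real memory tester: it compresses and decompresses
random data repeatedly and counts round trips that fail or come back
changed. The function name, round count and buffer size are all made up
for illustration. On healthy hardware it should report zero failures;
an occasional nonzero count would point at flaky hardware rather than at
bzip2.

```python
import bz2
import hashlib
import os


def roundtrip_failures(rounds=20, size=1 << 20):
    """Count bzip2 round trips that fail or return altered data.

    Hypothetical helper for illustration; `rounds` and `size` are
    arbitrary defaults, not tuned values.
    """
    failures = 0
    for _ in range(rounds):
        data = os.urandom(size)
        digest = hashlib.sha256(data).digest()
        try:
            restored = bz2.decompress(bz2.compress(data))
        except (OSError, ValueError):
            # The decompressor rejected the stream it was just handed.
            failures += 1
            continue
        if hashlib.sha256(restored).digest() != digest:
            # Round trip "succeeded" but the bytes changed on the way.
            failures += 1
    return failures


if __name__ == "__main__":
    print("failed round trips:", roundtrip_failures())
```

Running it while the machine is otherwise busy (a compile, say) is closer
to the conditions under which I saw the errors.
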
I suspected it might be memory, but as I said memtest came up clean, and
it could have been the memory controller on the CPU, or the on-chip
cache, or...

Finally I found out that MCE stands for machine-check-exception, and that
the number it reports can be checked against a chart AMD and Intel both
provide to see what the machine check says is going wrong. Parsemce was
the way to do this at the time, tho it now seems there may be other
methods, based on your comments.

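To give an idea of what that lookup involves, here is a minimal sketch
that pulls apart a 64-bit machine-check status value. It assumes the
generic x86 status register layout (flag bits at the top, a 16-bit error
code at the bottom) and recognizes only three of the basic error code
patterns; it is not how parsemce itself is written, and the function
names and the example value are invented for illustration. The vendor
manuals remain the authority for anything beyond this.

```python
# Flag bits in the upper part of the 64-bit status value.
FLAGS = {
    63: "VAL",    # register holds a valid error
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # uncorrected error
    60: "EN",     # error reporting was enabled
    59: "MISCV",  # MISC register contents valid
    58: "ADDRV",  # ADDR register contents valid
    57: "PCC",    # processor context corrupt
}


def classify(code):
    """Rough class of the 16-bit error code (simplified patterns)."""
    if code & 0xF800 == 0x0800:   # 0000 1PPT RRRR IILL
        return "bus/interconnect error"
    if code & 0xFF00 == 0x0100:   # 0000 0001 RRRR TTLL
        return "memory hierarchy (cache) error"
    if code & 0xFFF0 == 0x0010:   # 0000 0000 0001 TTLL
        return "TLB error"
    return "other (see vendor manual)"


def decode(status):
    """Split a status value into flags, error codes and a rough class."""
    code = status & 0xFFFF
    return {
        "flags": [name for bit, name in FLAGS.items() if (status >> bit) & 1],
        "error_code": code,
        "model_specific": (status >> 16) & 0xFFFF,
        "class": classify(code),
    }


if __name__ == "__main__":
    # Made-up value for illustration, NOT taken from a real crash.
    example = (1 << 63) | (1 << 61) | (1 << 60) | 0x0813
    print(decode(example))
```
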
Looking it up, I found it was indeed memory bus errors, but it still
could have been the board or something, not the memory itself. Still,
knowing it was generic memory, I figured that was it. The problem was
that the socket-940 Opterons (which I have) take registered memory, which
is nowhere near the commodity that unregistered non-ECC memory is, and I
really couldn't afford it at the time.

Well, Tyan finally came out with a BIOS update that exposed the necessary
memory knobs for me to tweak, and I found that limiting speed just one
notch, from 400 MHz to 383 MHz (basically PC3000 speed instead of
PC3200), solved the problem ENTIRELY. At that speed I could even
significantly tighten down the individual memory timings and the system
was STILL stable as a rock -- it was the overall clockspeed that the
memory just couldn't quite handle, at least on my board.

I suspect the reason it was sold as generic memory was that it was just
at tolerance at one end, and the board is probably near tolerance at the
other end. The board had originally only been rated to PC2700 speeds, so
I was surprised it actually clocked the memory at PC3200 in the first
place. I had figured generic PC3200 memory should be plenty good to run
at PC2700, and it would have been, only by then the board BIOS had been
updated to take up to PC3200 memory even tho it hadn't been updated to
clock-limit it if necessary. So the memory simply wasn't stable at those
speeds in this board, but would have worked just fine at lower speeds (as
it did for me once a BIOS update finally let me do it) or at rated speeds
on many other boards.

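For anyone wondering how the MHz figures relate to the PC ratings, the
arithmetic can be sketched as below. It assumes the MHz numbers above are
effective DDR transfer rates (PC3200 is DDR-400) on a standard 64-bit
module, so peak bandwidth is simply the rate times 8 bytes. The function
name is just for illustration.

```python
BUS_WIDTH_BYTES = 8  # assumed 64-bit module data bus


def peak_mb_per_s(transfers_mhz):
    """Peak module bandwidth in MB/s for an effective transfer rate."""
    return transfers_mhz * BUS_WIDTH_BYTES


if __name__ == "__main__":
    for rate in (400, 383, 333):
        print(rate, "MHz effective ->", peak_mb_per_s(rate), "MB/s")
```

So one notch down from 400 lands a little above the nominal PC3000
figure, which is why I call it "basically" PC3000.
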
So I ran the memory at PC3000 speeds but with many of the individual
timings tightened for a year or so... at which point I could finally
afford to upgrade memory to what I /really/ wanted all along -- the 8
gigs memory I'm now running, tho it cost me ~US$1100 to do it. It's
Super Talent brand, perhaps not top of the line, but at least it's stable
at rated speeds.

Anyway, first moral of the story is that the MCE reports were accurate.
Altho they didn't pin it down all the way, they certainly pointed to the
right area, confirming my earlier suspicions, and parsemce can help with
the number-to-English description mapping. Second is that just because
memtest86 says it's fine doesn't necessarily mean the /timings/ are fine,
as it tests memory cell reliability and doesn't stress timings.

Other morals would be to avoid expensive registered memory requirements
if possible, avoid generic memory, avoid making assumptions about what
speed a board will try to run the memory at regardless of what the board
itself is rated for, and when shopping, confirm if possible that the BIOS
contains memory speed tweaking knobs.

... The reason I got Opterons in the first place was because I wanted a
dual CPU system, and I wanted AMD64, and that's the way you got it at the
time. I've been very happy with the system in general. Tyan is quite
good with Linux support on many of their boards including this one -- it
was certified with several Linux distributions and they even had a
preconfigured lm_sensors.conf available for download! =8^) Additionally,
when the dual-cores came out, all I had to do was upgrade the BIOS and I
was ready to upgrade the CPUs (which I've now done, to dual dual-core
Opteron 290s, at 2.8 GHz, top of the socket 940 line). It sure would
have been nice to have the memory tweaking stuff in the BIOS rather
earlier, however.

Meanwhile, I plan on keeping this one for a while (thus the money sunk in
upgrades). When I do upgrade again, I'd like to make it a Tyan once
again, but 4 cores is really plenty and likely will still be plenty at
that time, and that's available in single-socket desktops now, so I won't
need to spend the extra $$ on server-class gear.

As for your problems, keep in mind that a failing or overloaded power
supply can produce similar symptoms as well, as can unstable wall power.
Or overclocking, altho it doesn't sound like you're into that. But since
you are getting MCEs, I'd definitely check what they say is wrong before
anything else. The machine is providing the information, why not use it?

--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman

--
gentoo-amd64@l.g.o mailing list