Gentoo Archives: gentoo-amd64

From: Duncan <1i5t5.duncan@×××.net>
To: gentoo-amd64@l.g.o
Subject: [gentoo-amd64] Re: mce log errors
Date: Wed, 07 Dec 2005 05:54:05
Message-Id: pan.2005.12.07.05.50.45.257174@cox.net
In Reply to: Re: [gentoo-amd64] mce log errors by Deedra Waters
1 Deedra Waters posted <Pine.LNX.4.64.0512062202210.6176@monster>, excerpted
2 below, on Tue, 06 Dec 2005 22:04:50 -0600:
3
4 > Is there a way to test that fact? I've tried to work with lm_sensors,
5 > but the readings for that are way way off. So, considering lm_sensors
6 > is useless, is there another way to tell if overheating is the problem?
7 >
8 > The case itself has a lot of fans, but it's also got 5 harddrives in it.
9
10 Don't know about ASUS, but Tyan has lm_sensors config files for many of
11 their boards on their site.
12
13 I had similar but not as severe (main memory only, no L2 cache errors)
14 here. For quite some time they drove me nearly up a wall, so I can
15 definitely identify with your situation!
16
17 In my case, it turned out to be over-rated generic memory. After a BIOS
18 update added memory timing control, I limited my so-called PC3200 memory
19 to PC3000 (downclocking from 200/400 MHz to 183/366 MHz), and now get to
20 actually enjoy that fabled Linux stability, with the only reboots being
21 when I do so purposefully! =8^)
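For anyone wanting to check those numbers: the PC rating is just the module's peak bandwidth in MB/s. A minimal sketch of the arithmetic, assuming a standard 64-bit (8-byte) DDR module and two transfers per clock; the function name is made up for illustration.

```python
# Peak DDR bandwidth = memory clock x transfers per clock x bus width.
# Assumptions: 64-bit (8-byte) module, double data rate (2 transfers/clock).
BUS_WIDTH_BYTES = 8
TRANSFERS_PER_CLOCK = 2


def ddr_peak_mb_per_s(clock_mhz):
    """Return theoretical peak bandwidth in MB/s for a given memory clock."""
    return clock_mhz * TRANSFERS_PER_CLOCK * BUS_WIDTH_BYTES


# The two clock settings mentioned in the post.
for label, clock_mhz in (("PC3200", 200), ("PC3000", 183)):
    effective_mhz = clock_mhz * TRANSFERS_PER_CLOCK
    print(f"{label}: {clock_mhz}/{effective_mhz} MHz, "
          f"{ddr_peak_mb_per_s(clock_mhz)} MB/s peak")
```

At 183 MHz the peak works out to 2928 MB/s, which is what gets rounded up to the PC3000 label, so the downclock gives up roughly 8.5% of theoretical bandwidth.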
22
23 One thing you can try, somewhat counter-intuitive, but it definitely
24 helped here until BIOS got timing limit functionality (it didn't seem to
25 cause any compile problems or the like, either, a good thing on Gentoo),
26 is to turn OFF ECC. The best I can figure, the additional ECC data put a
27 higher strain on already touchy timings, so turning it OFF increased
28 stability while not noticeably increasing undetected errors.
29
30 In any case, try declocking a bit. I only declocked memory, but if it's
31 really L2 cache issues for you, you'll likely have to declock the CPUs as
32 well. If it's overheating or general timing touchiness, that should
33 definitely improve stability.
34
35 It could also be slightly low voltage. Again, a properly configured
36 lm_sensors config would be a /great/ help here, but if it isn't
37 available... Turning the clocking down should help there as well, but
38 turning the voltage up at the regular clock rate, provided your cooling is
39 fine, may also help. If it's the cooling and NOT the voltage, that will
40 make things WORSE, of course, thereby giving you a way to tell the
41 difference, PROVIDED you want to risk 0v3r(10(kin9 methods even if not
42 actually overclocking, of course. =8^)
43
44 That leads to another possible solution, one some will certainly consider
45 more sane than resorting to upping voltages. Particularly with that many
46 drives and the number of fans you indicate you may have, plus everything
47 else in a normal computer, it could be your power supply isn't quite large
48 enough to handle the demand. This could easily be rehash for you, but
49 just in case... many power supplies are hopelessly overrated,
50 particularly if you don't see any UL or CE (or other appropriate
51 national testing organization) certifications on them. If
52 they are certified, you can safely assume the rated overall output, but
53 particular voltages may still be inadequate for your needs, certainly so
54 when running five drives and possibly a fully loaded RAM rack, plus
55 multiple fans. Put it this way, if you are using the power supply that
56 came with the case, and you bought a low-end case, it's a fair bet that
57 the rating on the power supply isn't worth the sticker it's printed on!
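A rough way to sanity-check a supply is to add up the current each device pulls on one rail and compare the total with what the label claims for that rail. The sketch below does that for the 12V rail. Every figure in it (rail rating, per-drive spin-up current, fan and CPU draw) is a made-up placeholder, not a measurement, and the function name is hypothetical; substitute the numbers from your own supply label and device datasheets.

```python
def rail_headroom(capacity_amps, loads_amps):
    """Return (spare amps, fraction of capacity used) for one supply rail."""
    total = sum(loads_amps.values())
    return capacity_amps - total, total / capacity_amps


# Hypothetical 12V loads in amps; all values are illustrative placeholders.
loads_12v = {
    "hard drives (5 x 2.0 A at spin-up)": 5 * 2.0,
    "case fans (4 x 0.3 A)": 4 * 0.3,
    "CPU and board": 6.0,
}

# The 18 A rail rating is likewise an assumed example value.
spare, used = rail_headroom(18.0, loads_12v)
print(f"12V rail: {used:.0%} of rating used, {spare:.1f} A spare")
```

With these example figures the rail has under 1 A to spare, and that is if the label is honest. An overrated supply in that position could sag under load, which is the scenario described above.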
58
59 Personally, I tend to prefer a pretty good power margin -- enough so it's
60 not even close to stressing, and I've never had the trouble with power
61 supplies I've seen others have. If I'm spending an extra $50 to $100 to
62 have the peace of mind that stable power means, so be it! It's well worth
63 it to me, considering the alternative of potential headaches and even
64 damage to a system worth conservatively a couple grand.
65
66 (I just need to learn to spend more on memory, as obviously, the generic
67 stuff I was buying didn't cut it. <g> Lesson learned!)
68
69 --
70 Duncan - List replies preferred. No HTML msgs.
71 "Every nonfree program has a lord, a master --
72 and if you use the program, he is your master." Richard Stallman in
73 http://www.linuxdevcenter.com/pub/a/linux/2004/12/22/rms_interview.html
74
75
76 --
77 gentoo-amd64@g.o mailing list