Mark Knecht wrote:
> On Sun, Feb 15, 2009 at 3:42 PM, Harry Putnam <reader@×××××××.com> wrote:
>
>> I've been experiencing spontaneous reboots on one Gentoo machine
>> lately. Looking thru /var/log/messages... I see the restarts but
>> looking above that... I'm not seeing anything I recognize as being a
>> culprit.
>>
>> It's been happening for a few weeks... but I've been busy and am only now
>> digging into it (the machine is no kind of server).
>>
>> It appears to only happen in X (I'm using xfce4) and I've only noticed
>> it since I started running 2.6.28 kernels. Although I couldn't say
>> that it seemed to be directly related.
>>
>> I mean I didn't boot into 2.6.28 and suddenly notice spontaneous
>> rebooting.
>>
>> It does not appear to be heat related... but I am only now using
>> lm_sensors to keep an accurate record and see if there appears to be a
>> relationship.
>>
>> I've had two today so either it's happening more often or I'm just
>> spending more time on that machine.
>>
>> It may also be only the first or second time it's happened while I was
>> actually right at the keyboard.
>>
>> I'm sorry to be so vague about it, but in truth, I've been pretty lazy
>> about it... since no real harm comes of an unexpected reboot on that
>> machine (so far anyway). But clearly it's something that has to be
>> figured out.
>>
>> The only things I've checked so far...
>> 1) browsing thru /var/log/messages (having trouble recognizing
>> anything that looks suspicious).
>>
>> I have noticed what appears to be a time/date anomaly where the
>> progression of time is suddenly irregular. That is, an earlier
>> time shows up amongst some later times.
>>
>> It appears to have been me sudoing to visudo, and apparently
>> having /etc/sudoers open long enough for the closing of it to be
>> earlier than other events taking place.
>>
>> Again... I'm not real sure exactly what happened there but it
>> does not appear to coincide with a reboot anyway.
>>
>> 2) checking how hot the CPU is getting (doesn't appear to be a
>> problem). But I'm now running a cron job recording temperatures every
>> 10 minutes, so that may turn up something.
>>
>> 3) checking for overfilled disks (none show in df -h).
>>
>
> Reseat memory and PCI cards, etc. Consider removing for a period of
> time any hardware not absolutely necessary to debug the problem (e.g.,
> a second video card, extra disk drives, extra network adapters, etc.).
> Run memtest86 for a few days if you can spare the machine. Run
> SpinRite, etc., to look for drive problems. Open the box up and place
> a fan blowing extra air for additional cooling.
>
> good luck,
> Mark
>
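
On the temperature logging mentioned in item 2 above: a minimal sketch of that kind of cron job might look like the following. It assumes the `sensors` command from lm_sensors is installed and configured; the function name `log_temps`, the log path, and the script path in the crontab line are made up for illustration.

```shell
#!/bin/sh
# Append one timestamped lm_sensors reading to a log file.
# The default log path is hypothetical; pass another as the first argument.
log_temps() {
    logfile=${1:-/var/log/temps.log}
    {
        date '+%Y-%m-%d %H:%M:%S'   # when the reading was taken
        sensors                     # assumes lm_sensors is set up
        echo                        # blank line between readings
    } >> "$logfile"
    # Flush buffers so the last reading is more likely to survive
    # a sudden reboot.
    sync
}

# To use from cron, save this in a script that ends by calling log_temps,
# then add a crontab entry such as (hypothetical script path):
# */10 * * * * /usr/local/bin/log-temps.sh
```

After the next unexpected reboot, the last few entries in the log show what the temperatures were shortly before it happened.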
|
To add another test: I had this issue once before, and it turned out to be a
faulty driver for my hard drives. I ran a command like this to test mine:
|
hdparm -Tt /dev/hda && hdparm -Tt /dev/hda && hdparm -Tt /dev/hda &&
hdparm -Tt /dev/hda && hdparm -Tt /dev/hda
|
If it can pass that, then it should be all right and you can look
elsewhere. Mine would only fail when the drives were very busy, and that
test should keep them pretty busy.
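
If you want more passes without retyping the command, a loop does the same job. This is only a rough sketch: `repeat_test` is a made-up helper name, and /dev/hda is just the device from my command, so adjust it for your system. Like the && chain, it stops at the first run that exits with an error.

```shell
#!/bin/sh
# Run a command a given number of times, stopping at the first failure
# (the same behaviour as chaining the runs with &&).
repeat_test() {
    count=$1
    shift
    i=1
    while [ "$i" -le "$count" ]; do
        echo "run $i of $count: $*"
        "$@" || { echo "run $i failed" >&2; return 1; }
        i=$((i + 1))
    done
}

# Example, as root (device name is an assumption, change it to yours):
# repeat_test 5 hdparm -Tt /dev/hda
```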
|
Hope that helps. |
|
Dale |
|
:-) :-) |