Hi there!

I have had our Gentoo server go down twice in under two days. I am
currently trying to figure out what is happening.

Facts:
- Dual PIII 933 MHz system (ServerWorks OSB4)
- 3.5GB RAM
- 2.6.11.2-grsec-20050614 kernel (self rolled)
- SCSI: Adaptec AIC-7892P, 32MB cache
+ Disks
  + For Operating System
    - 2x IBM DDYS-T09170N SCSI U160 10KRPM 9.1GB in a RAID1, 1x of the
      same for hotspare
  + For storage etc
    - 3x IBM IC35L036UWD210-0 SCSI U160 10KRPM
    - 1x IBM DDYS-T36950N SCSI U160 10KRPM
    - In a RAID5

Tuesday afternoon, I was informed that there might be problems with this
server. I had just been working on it via shell. I went back, and found
it unresponsive.

I went into the server room, only to catch it finishing a reboot and
almost completely back up. It behaved for the rest of the day. I was not
able to find any indication of problems in the logs.

Wednesday evening, I was again working on the system via ssh, and it
stopped responding. I got into the server room fast enough this time. I
tried to log in as root, and could not. I could type the username, but
upon hitting enter, nothing happened. That was true for any console.

I have syslogd output *.* to console 10, so flipping over there, I saw
nothing out of the ordinary. The last log entry, from the time I noticed
it stop responding, was a simple run-of-the-mill firewall log.

After a few more minutes, the system was completely unresponsive, save
for SysRq. I Synced, tErmed, Synced again, remounted everything
read-only and forced it to reboot.
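
For reference, the same key sequence can also be issued from a shell that
still responds, by writing the key letters to /proc/sysrq-trigger. This
is only a rough sketch, assuming a kernel built with CONFIG_MAGIC_SYSRQ;
the function name sysrq_sequence and the DRY_RUN variable are made up for
illustration. With DRY_RUN=1 (the default here) it only prints what it
would do.

```shell
#!/bin/sh
# Sketch: emergency sync / term / sync / remount-ro / reboot via SysRq.
# DRY_RUN and sysrq_sequence are illustrative names, not part of any tool.
DRY_RUN="${DRY_RUN:-1}"
SYSRQ_TRIGGER="${SYSRQ_TRIGGER:-/proc/sysrq-trigger}"

sysrq_sequence() {
    # s = sync, e = SIGTERM to all processes except init,
    # u = remount filesystems read-only, b = reboot immediately
    for key in s e s u b; do
        if [ "$DRY_RUN" = "1" ]; then
            echo "would write '$key' to $SYSRQ_TRIGGER"
        else
            echo "$key" > "$SYSRQ_TRIGGER"
            sleep 2   # give each step a moment before the next one
        fi
    done
}
```

Calling sysrq_sequence with DRY_RUN=0 as root would really reboot the
box, so it is worth running it once in dry-run mode first.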

Again, I was not able to find any logs indicating any errors at all.

I see only two possible causes. The first is that I was goofing with
samba at various points on both days. However, samba was not running
either time the system went down.

The other, more interesting one is that both times the system went
down, I was creating a tar.bz2 out of a kernel source tree. The problems
happened well after I had started the archive jobs.
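
Since both hangs came well into a long tar.bz2 run and left nothing in
the logs, a small heartbeat logger running alongside the next attempt
might at least show roughly when things went wrong and what the load
looked like. A minimal sketch, assuming /proc/loadavg is readable; the
function name heartbeat, its arguments and the log path in the usage
line are hypothetical:

```shell
#!/bin/sh
# heartbeat LOGFILE COUNT INTERVAL
# Appends COUNT lines of "timestamp loadavg" to LOGFILE, INTERVAL seconds
# apart, and calls sync after each so the line reaches disk before a hang.
heartbeat() {
    logfile="$1"
    count="$2"
    interval="$3"
    i=0
    while [ "$i" -lt "$count" ]; do
        if [ -r /proc/loadavg ]; then
            load="$(cat /proc/loadavg)"
        else
            load="loadavg-unavailable"
        fi
        echo "$(date '+%Y-%m-%d %H:%M:%S') $load" >> "$logfile"
        sync
        sleep "$interval"
        i=$((i + 1))
    done
}
```

Something like "heartbeat /var/log/heartbeat.log 3600 5 &" before
starting the tar job would leave one line every five seconds for up to
five hours.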

Wondering about disks, I threw smartctl -a at both of the arrays (sda,
sdb), which didn't show anything out of the ordinary.
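
To make later comparisons easier, the smartctl -a output for each device
could be saved with a timestamp, so that a run after the next crash can
be diffed against one from before. A rough sketch; the names SMARTCTL,
OUTDIR and smart_snapshot, and the output directory, are assumptions for
illustration:

```shell
#!/bin/sh
# Sketch: save "smartctl -a" output for each device so runs can be diffed.
# SMARTCTL, OUTDIR and smart_snapshot are illustrative names.
SMARTCTL="${SMARTCTL:-smartctl}"
OUTDIR="${OUTDIR:-/tmp/smart-snapshots}"

smart_snapshot() {
    mkdir -p "$OUTDIR" || return 1
    stamp="$(date '+%Y%m%d-%H%M%S')"
    for dev in "$@"; do
        out="$OUTDIR/$(basename "$dev")-$stamp.txt"
        # smartctl uses a bitmask exit status, so record it instead of
        # treating any non-zero value as fatal
        "$SMARTCTL" -a "$dev" > "$out" 2>&1
        echo "$dev: exit status $?, output in $out"
    done
}
```

Usage would be "smart_snapshot /dev/sda /dev/sdb", run as root.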

However, when I run smartctl -t offline, -t short or -t long on sda or
sdb, it fails immediately and reports the failure on STDOUT. I find this
odd, because I have run these tests in the past. Granted, that was on a
different kernel, which I no longer have around.

Here is an example:

# smartctl -t short /dev/sda
smartctl version 5.33 [i686-pc-linux-gnu] Copyright (C) 2002-4 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

Short Background Self Test Failed
67 |
Looking at logs, I don't see anything strange. Including dmesg. |
68 |
|
69 |
I am worried by the smartctl results, however I realize there is a small |
70 |
possibility that it's due to kernel changes. |
71 |
|
72 |
Any ideas out there? Thank you for reading this! I *LOVE* Gentoo in |
73 |
production. |
74 |
-- |
75 |
gentoo-server@g.o mailing list |