Mark Knecht posted on Tue, 30 Mar 2010 06:56:14 -0700 as excerpted:

> 3) I LOVE your idea of managing 3 /boot partitions by hand instead of
> using RAID. Easy to do, completely testable ahead of time. If I ensure
> that every disk can boot then no matter what disk goes down the machine
> still works, at least a little. Not that much work and even if I don't
> do it for awhile it doesn't matter as I can do repairs without a CD.
> (well....)

That's one of those things you only tend to realize after running a RAID
for a while... and possibly after having grub die, for some reason I don't
quite understand, on just a kernel update... and realizing that had I set
up multiple independent /boot and boot-backup partitions instead of a
single RAID-1 /boot, I'd have had backups to boot from when I needed
them.
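
A minimal Python sketch of keeping independent /boot copies in step with
the primary, under stated assumptions: the function names (sync_boot,
mismatches) and the backup mount points in the usage note are hypothetical,
and the backup partitions are assumed to be mounted already. It copies
regular files only, skips symlinks (such as a boot -> . link), never
deletes stale files, and does nothing about installing grub on each disk,
which still has to be done separately.

```python
import filecmp
import os
import shutil
from pathlib import Path


def sync_boot(primary, backups):
    """Copy every regular file under `primary` into each backup directory."""
    primary = Path(primary)
    for backup in map(Path, backups):
        # os.walk does not descend into symlinked directories by default.
        for root, _dirs, files in os.walk(primary):
            rel = Path(root).relative_to(primary)
            (backup / rel).mkdir(parents=True, exist_ok=True)
            for name in files:
                src = Path(root) / name
                if src.is_symlink():
                    continue  # assumption: symlinks are not needed in backups
                shutil.copy2(src, backup / rel / name)


def mismatches(primary, backup):
    """Return relative paths that are missing or differ in `backup`."""
    primary, backup = Path(primary), Path(backup)
    bad = []
    for root, _dirs, files in os.walk(primary):
        rel = Path(root).relative_to(primary)
        for name in sorted(files):
            src = Path(root) / name
            if src.is_symlink():
                continue
            dst = backup / rel / name
            # shallow=False forces a byte-by-byte comparison.
            if not dst.is_file() or not filecmp.cmp(src, dst, shallow=False):
                bad.append(str(rel / name))
    return bad
```

Something like sync_boot("/boot", ["/mnt/boot-b", "/mnt/boot-c"]) followed
by mismatches("/boot", "/mnt/boot-b") could be run after each kernel
update; an empty list means that copy matches the primary.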

So call it the voice of experience! =:^)

Meanwhile, glad you figured the problem out. A BIOS that requires the
boot flag... that'd explain the problem for both the RAID and no-RAID
versions!

100% waits for long periods... I've seen a number of reasons for this.
One key to remember is that backed-up I/O has a way of stalling many
other things at times. Among the reasons I've seen:

1a) Dying disk. I've had old disks that would sometimes take some time to
respond, especially if they had gone to sleep. If you hear several clicks
(aka "the click of death") as the drive resets and tries again... it's
time to think about either replacing the disk if it's an old one, or
sending it in for a replacement if it's new.

1b) Another form of that is hard-to-read data sectors. The drive will
typically retry a bad sector quite a number of times, often for several
minutes at a time, before either giving up or reading it correctly.
Again, if you're seeing this, get a new disk and get your data
transferred before it's too late!

2) I think this one was fixed, and I only read of it; I didn't experience
it myself. Some time ago, if a network interface was active using DHCP
but couldn't get a response from a DHCP server, pretty much the entire
system could hang for some time, every time the fake/random address
normally assigned from the zero-conf reserved netblock expired. The
system would try to find a DHCP server again, and if none answered, would
eventually assign a fake/random zero-conf address again, but only after
hanging the system for up to a minute (the default timeout, AFAIK).
Again, this /should/ have been fixed quite some time ago, but one can
never be sure what bug with similar symptoms may be lurking in some
hardware or other.

3) Back to disks, but not the harbinger of doom that #1 is: perhaps your
system is simply set to suspend the disks after a period of inactivity,
and it takes them some time to spin back up. I've had this happen to me,
but it was years ago and back on MS. But because of the issues I've had
more recently with (1), I'm sure it'd still be an issue in some
configurations. Fortunately, laptop mode on my netbook with its 120 GB
SATA hard drive seems to work very well and almost invisibly, to the
point I don't worry about disk sleep there at all, as the resume is
smooth enough I basically don't even notice -- save for the extra hour
and a half of runtime I normally get with laptop mode active! FWIW, the
thing "just works" in terms of both suspend2ram and
hibernate/suspend2disk, as well. =:^)

4) Kernels before... 2.6.30 I believe... could occasionally exhibit a
read/write I/O priority inversion on ext3. The problem had existed for
some time, but was attributed to the normal effects of the then-default
data=ordered (as opposed to data=writeback) journaling. Then some massive
stability issues with ext4 prompted a reexamination of a number of
related previous assumptions. (Ubuntu had just deployed ext4 as a
non-default option for their new installs; the problem came in combining
that with stuff like the unstable black-box nVidia drivers, which crashed
systems in the middle of writes on occasion!) 2.6.30 had a quick-fix.
2.6.31 had better fixes, and additionally and quite controversially,
switched the ext3 default to data=writeback, which with the new fixes
was judged sufficiently stable to be the default. (As a reiserfs user who
lived thru the period before it got proper data=ordered, I'll never trust
data=writeback again, so I disagree with Linus' decision to make it the
ext3 default, but at least I can change that on the systems I run.) So if
you're running a kernel older than 2.6.30 or .31, this could potentially
be an issue, tho it's unlikely to be /too/ bad under normal conditions.
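
Whether (4) could apply comes down to two facts: the running kernel
version and the data= mode of any ext3 mounts. A rough Python sketch of
checking both follows. The helper names are hypothetical, the 2.6.30
cutoff is simply the version given above, and it assumes the input is in
/proc/mounts format (device, mount point, type, options). A mount with no
data= option listed is reported as None, meaning the kernel default
applies.

```python
import re

# Assumption: first release credited above with a fix for the inversion.
FIXED_IN = (2, 6, 30)


def kernel_tuple(release):
    """Turn a release string such as '2.6.29-gentoo-r5' into (2, 6, 29)."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    # A missing third component (e.g. '3.2') counts as 0.
    return tuple(int(part or 0) for part in match.groups())


def may_be_affected(release):
    """True if the kernel predates the release named in FIXED_IN."""
    return kernel_tuple(release) < FIXED_IN


def ext3_data_modes(mounts_text):
    """Map each ext3 mount point to its data= option, or None if unlisted."""
    modes = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[2] == "ext3":
            # 'data=ordered' -> ('data', 'ordered'); 'rw' -> ('rw', '')
            options = dict(opt.partition("=")[::2]
                           for opt in fields[3].split(","))
            modes[fields[1]] = options.get("data")
    return modes
```

On a live system the inputs would be platform.release() and the contents
of /proc/mounts, e.g. ext3_data_modes(open("/proc/mounts").read()).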

Those are the possibilities I know of.
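
For possibility (1), the drive's own SMART counters are one place to
look. The Python sketch below parses the attribute table printed by
smartmontools' "smartctl -A". smartctl is not mentioned above, so its use
here is an assumption, as is the choice of which attributes to watch. The
function names are hypothetical and the sample table is made-up
illustration data, not output from a real drive.

```python
# Attributes commonly associated with failing or hard-to-read sectors.
WATCHED = (
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
)

# Made-up sample in the column layout that `smartctl -A` prints.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       12
  9 Power_On_Hours          0x0032   090   090   000    Old_age   Always       -       8760
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       3
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
"""


def parse_smart_attributes(text):
    """Map attribute name -> integer raw value from `smartctl -A` output."""
    attrs = {}
    for line in text.splitlines():
        parts = line.split()
        # Attribute rows start with a numeric ID and have ten columns or more.
        if len(parts) >= 10 and parts[0].isdigit():
            try:
                attrs[parts[1]] = int(parts[9])
            except ValueError:
                continue  # raw value is not a plain integer; skip the row
    return attrs


def suspicious(attrs):
    """Return the watched attributes whose raw count is above zero."""
    return {name: attrs[name] for name in WATCHED if attrs.get(name, 0) > 0}
```

On a real system the text would come from running smartctl -A against the
drive as root, for instance via subprocess.run(["smartctl", "-A",
"/dev/sda"], capture_output=True, text=True).stdout. Nonzero counts are a
hint to start copying data off, not proof either way.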

--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman