Re: [gentoo-amd64] Re: RAID1 boot - no bootable media found - gentoo-amd64

From:	Mark Knecht <markknecht@×××××.com>
To:	gentoo-amd64@l.g.o
Subject:	Re: [gentoo-amd64] Re: RAID1 boot - no bootable media found
Date:	Fri, 02 Apr 2010 18:02:35
Message-Id:	`t2u5bdc1c8b1004021018s2337369ck9094b2a358fe768c@mail.gmail.com`
In Reply to:	[gentoo-amd64] Re: RAID1 boot - no bootable media found by Duncan <1i5t5.duncan@cox.net>

Good stuff. I'll snip out the less important parts to keep the response
shorter but don't think for a second that I didn't appreciate it. I
do!

On Fri, Apr 2, 2010 at 2:43 AM, Duncan <1i5t5.duncan@×××.net> wrote:
> Mark Knecht posted on Thu, 01 Apr 2010 11:57:47 -0700 as excerpted:
<SNIP>
>
> Making the titles different is a very good idea.  It's what I ended up
> doing too, as otherwise, it can get confusing pretty fast.
>
> Something else you might want to do, for purposes of identifying the
> drives at the grub boot prompt if something goes wrong or you are
> otherwise trying to boot something on another drive, is create a (probably
> empty) differently named file on each one, say grub.sda, grub.sdb, etc.

I'll consider that, once I get the hard problems solved.
<SNIP>
>> Roughly speaking 1TB read at 100MB/S should take 10,000 seconds or 2.7
>> hours. I'm at 18% after 28 minutes so that seems about right. (With no
>> errors so far assuming I'm using the right command)
>
> I used the -w switch here, which actually goes over the disk a total of 8
> times, alternating writing and then reading back to verify the written
> pattern, for four different write patterns (0xaa, 0x55, 0xff, 0x00, which
> is alternating 10101010, alternating 01010101, all ones, all zeroes).

OK, makes sense then.

I ran one pass of badblocks on each of the drives. No problem found.
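
As a rough cross-check of the figures quoted above, here is a minimal Python sketch of the arithmetic. The 1 TB capacity, the 100 MB/s rate, the "18% after 28 minutes" progress and the four write patterns come from the thread; the function names are made up for illustration, and a real drive slows down toward the inner tracks, so treat the results as estimates only.

```python
# Back-of-the-envelope timing for a full-disk badblocks run.
# Assumptions (from the thread): 1 TB drive, ~100 MB/s sustained throughput.

CAPACITY_BYTES = 10**12          # 1 TB, decimal as drive vendors count it
RATE_BYTES_PER_SEC = 100 * 10**6 # 100 MB/s, assumed constant over the disk


def pass_seconds(capacity_bytes, rate_bytes_per_sec):
    """Time for one sequential pass over the whole disk."""
    return capacity_bytes / rate_bytes_per_sec


def projected_total_minutes(elapsed_minutes, fraction_done):
    """Extrapolate total run time from the progress seen so far."""
    return elapsed_minutes / fraction_done


# The four patterns of the write test, shown as bit strings.
WRITE_PATTERNS = (0xAA, 0x55, 0xFF, 0x00)
PATTERN_BITS = [format(p, "08b") for p in WRITE_PATTERNS]

one_pass = pass_seconds(CAPACITY_BYTES, RATE_BYTES_PER_SEC)
# Each pattern is written once and read back once: 2 passes per pattern.
write_test = one_pass * 2 * len(WRITE_PATTERNS)

print(f"one read pass : {one_pass:.0f} s = {one_pass / 3600:.2f} h")
print(f"write test    : {write_test / 3600:.1f} h for 8 passes")
print(f"from progress : {projected_total_minutes(28, 0.18) / 60:.2f} h total")
print("patterns      :", PATTERN_BITS)
```

The projection from the observed progress comes out slightly under the 10,000 second estimate, which fits "seems about right", while the 8-pass write test lands closer to a full day per drive.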

I know some Linux folks don't like Spinrite but I've had good luck
with it so that's running now. Problem is it cannot run the drives at
the same time and it looks like it wants at least 24 hours to do the
whole drive so using it would take 3 days. I will likely let it run
through the first drive (I'm busy today) and then tomorrow drop back
into Linux and possibly try your badblocks on all 3 drives. I'm not
overly concerned about losing the install.

<SNIP>
>
>
> [8 second spin-down timeouts]
>
>> Very true. Here is the same drive model I put in a new machine for my
>> dad. It's been powered up and running Gentoo as a typical desktop
>> machine for about 50 days. He doesn't use it more than about an hour a
>> day on average. It's already hit 31K load/unload cycles. At 10% of 300K
>> that about 1.5 years of life before I hit that spec. I've watched his
>> system a bit and his system seems to add 1 to the count almost exactly
>> every 2 minutes on average. Is that a common cron job maybe?
>
> It's unlikely to be a cron job.  But check your logging, and check what
> sort of atime you're using on your mounts (relatime is the new kernel
> default, but it was atime until relatively recently, say 2.6.30 or .31 or
> some such, and noatime is recommended unless you have something that
> actually depends on atime, alpine is known to need it for mail, and some
> backup software uses it, tho little else on a modern system will, I always
> use noatime on my real disk mounts, as opposed to say tmpfs, here).  If
> there's something writing to the log every two minutes or less, and the
> buffers are set to timeout dirty data and flush to disk every two
> minutes...  And simply accessing a file will change the atime on it if you
> have that turned on, thus necessitating a write to disk to update the
> atime, with those dirty buffers flushed every X minutes or seconds as well.

Here is fstab from my dad's machine which racks up 30
Load_Cycle_Counts an hour:

# NOTE: If your BOOT partition is ReiserFS, add the notail option to opts.
LABEL="myboot"          /boot           ext2            noauto,noatime  1 2
LABEL="myroot"          /               ext3            noatime         0 1
LABEL="myswap"          none            swap            sw              0 0
LABEL="homeherb"        /home/herb      ext3            noatime         0 1
/dev/cdrom              /mnt/cdrom      auto            noauto,ro       0 0
#/dev/fd0               /mnt/floppy     auto            noauto          0 0

# glibc 2.2 and above expects tmpfs to be mounted at /dev/shm for
# POSIX shared memory (shm_open, shm_unlink).
# (tmpfs is a dynamically expandable/shrinkable ramdisk, and will
#  use almost no memory if not populated with files)
shm                     /dev/shm        tmpfs           nodev,nosuid,noexec     0 0
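
To check the atime question against an fstab like the one above without eyeballing it, something along these lines could be used. It is only a sketch: the list of "real disk" filesystem types and the function name are assumptions made for the example, and it only looks at the options column, not at what is actually mounted at the moment.

```python
# Flag on-disk mounts in an fstab that do not carry the noatime option.
# DISK_FS is an assumed, non-exhaustive list of disk-backed filesystems.

FSTAB = """\
# NOTE: If your BOOT partition is ReiserFS, add the notail option to opts.
LABEL="myboot"    /boot       ext2   noauto,noatime       1 2
LABEL="myroot"    /           ext3   noatime              0 1
LABEL="myswap"    none        swap   sw                   0 0
LABEL="homeherb"  /home/herb  ext3   noatime              0 1
/dev/cdrom        /mnt/cdrom  auto   noauto,ro            0 0
shm               /dev/shm    tmpfs  nodev,nosuid,noexec  0 0
"""

DISK_FS = {"ext2", "ext3", "ext4", "reiserfs", "xfs", "jfs"}


def mounts_without_noatime(fstab_text):
    """Return mount points of disk filesystems lacking 'noatime'."""
    flagged = []
    for raw in fstab_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue                      # skip blanks and comments
        fields = line.split()
        if len(fields) < 4:
            continue                      # malformed or wrapped entry
        _device, mountpoint, fstype, options = fields[:4]
        if fstype in DISK_FS and "noatime" not in options.split(","):
            flagged.append(mountpoint)
    return flagged


print(mounts_without_noatime(FSTAB))
```

For this fstab the list comes back empty: every ext2/ext3 mount already has noatime, so atime updates are probably not what is waking the drive here.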

On the other hand there is some cron stuff going on every 10 minutes
or so. Possibly it's not 1 event every 2 minutes but maybe 5 events
every 10 minutes?

Apr  2 07:10:01 gandalf cron[6310]: (root) CMD (test -x /usr/sbin/run-crons && /usr/sbin/run-crons )
Apr  2 07:20:01 gandalf cron[6322]: (root) CMD (test -x /usr/sbin/run-crons && /usr/sbin/run-crons )
Apr  2 07:30:01 gandalf cron[6335]: (root) CMD (test -x /usr/sbin/run-crons && /usr/sbin/run-crons )
Apr  2 07:40:01 gandalf cron[6348]: (root) CMD (test -x /usr/sbin/run-crons && /usr/sbin/run-crons )
Apr  2 07:50:01 gandalf cron[6361]: (root) CMD (test -x /usr/sbin/run-crons && /usr/sbin/run-crons )
Apr  2 07:59:01 gandalf cron[6374]: (root) CMD (rm -f /var/spool/cron/lastrun/cron.hourly)
Apr  2 08:00:01 gandalf cron[6376]: (root) CMD (test -x /usr/sbin/run-crons && /usr/sbin/run-crons )
Apr  2 08:10:01 gandalf cron[6388]: (root) CMD (test -x /usr/sbin/run-crons && /usr/sbin/run-crons )
Apr  2 08:20:01 gandalf cron[6401]: (root) CMD (test -x /usr/sbin/run-crons && /usr/sbin/run-crons )
Apr  2 08:30:01 gandalf cron[6414]: (root) CMD (test -x /usr/sbin/run-crons && /usr/sbin/run-crons )
Apr  2 08:40:01 gandalf cron[6427]: (root) CMD (test -x /usr/sbin/run-crons && /usr/sbin/run-crons )
Apr  2 08:50:01 gandalf cron[6440]: (root) CMD (test -x /usr/sbin/run-crons && /usr/sbin/run-crons )
Apr  2 08:59:01 gandalf cron[6453]: (root) CMD (rm -f /var/spool/cron/lastrun/cron.hourly)
Apr  2 09:00:01 gandalf cron[6455]: (root) CMD (test -x /usr/sbin/run-crons && /usr/sbin/run-crons )
Apr  2 09:10:01 gandalf cron[6467]: (root) CMD (test -x /usr/sbin/run-crons && /usr/sbin/run-crons )
Apr  2 09:18:01 gandalf sshd[6479]: Accepted keyboard-interactive/pam for root from 67.188.27.80 port 51981 ssh2
Apr  2 09:18:01 gandalf sshd[6479]: pam_unix(sshd:session): session opened for user root by (uid=0)
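
One way to put a number on the "every 10 minutes" impression is to compute the gaps between the run-crons entries. The sketch below does that for a few of the log lines above. The year is not in the syslog timestamps, so 2010 is assumed from the date of this message, and the function names are made up for the example.

```python
from datetime import datetime

# A few entries copied from the log above (wrapped lines re-joined).
LOG_LINES = [
    "Apr  2 07:40:01 gandalf cron[6348]: (root) CMD (test -x /usr/sbin/run-crons && /usr/sbin/run-crons )",
    "Apr  2 07:50:01 gandalf cron[6361]: (root) CMD (test -x /usr/sbin/run-crons && /usr/sbin/run-crons )",
    "Apr  2 07:59:01 gandalf cron[6374]: (root) CMD (rm -f /var/spool/cron/lastrun/cron.hourly)",
    "Apr  2 08:00:01 gandalf cron[6376]: (root) CMD (test -x /usr/sbin/run-crons && /usr/sbin/run-crons )",
    "Apr  2 08:10:01 gandalf cron[6388]: (root) CMD (test -x /usr/sbin/run-crons && /usr/sbin/run-crons )",
]

ASSUMED_YEAR = 2010  # syslog lines carry no year; taken from the mail date


def parse_timestamp(line):
    """Parse the leading 'Mon DD HH:MM:SS' of a syslog line."""
    month, day, clock = line.split()[:3]
    return datetime.strptime(
        f"{ASSUMED_YEAR} {month} {day} {clock}", "%Y %b %d %H:%M:%S"
    )


def gaps_in_seconds(lines, needle):
    """Seconds between consecutive log lines that contain `needle`."""
    times = [parse_timestamp(line) for line in lines if needle in line]
    return [int((b - a).total_seconds()) for a, b in zip(times, times[1:])]


RUN_CRONS_GAPS = gaps_in_seconds(LOG_LINES, "run-crons")
print(RUN_CRONS_GAPS)
```

The run-crons entries sit a steady 600 seconds apart, so cron by itself does not produce a 2-minute rhythm. Whether the load cycles bunch up around those runs would need the SMART counter sampled alongside the log.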

<SNIP>
>
> Note that I don't have #193, the load-cycle counts.  There's a couple
> different technologies here.  The ramp-type load/unload yours uses is
> typical of the smaller 2.5" laptop drives.  These are designed for far
> shorter idle/standby timeouts and thus a far higher cycle count, load
> cycles, typical rating 300,000 to 600,000.  Standard desktop/server drives
> use a contact park method and a lower power cycle count, typically 50,000
> or so.  That's the difference.

I also purchased two Enterprise Edition drives - the 500GB size. They
are also spec'ed at 300K load/unload cycles.

http://www.wdc.com/en/products/products.asp?DriveID=489

My intention was to use them in a RAID0 and then back them up daily to
RAID1 for more safety. However I'm starting to think this TLER feature
may well be part of this problem. I don't want to start using them
however until I understand this 30-per-hour issue. No reason to wear
everything out!
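
For reference, the wear arithmetic behind that worry can be sketched as follows. The inputs (31K cycles in about 50 days, a 300K rating, one cycle every 2 minutes) are the figures from this thread; the function names are hypothetical, and the result is a time to reach the rated count, not a prediction of when a drive actually fails.

```python
# How long until the rated load/unload cycle count is reached?

RATED_CYCLES = 300_000   # spec figure discussed in the thread


def cycles_per_hour(observed_cycles, observed_days):
    """Average load-cycle rate over the observation period."""
    return observed_cycles / (observed_days * 24)


def years_to_rating(rate_per_hour, rated_cycles=RATED_CYCLES):
    """Years of continuous uptime before the rated count is reached."""
    hours = rated_cycles / rate_per_hour
    return hours / (24 * 365)


observed = cycles_per_hour(31_000, 50)   # from the SMART counter
steady = 60 / 2                          # one cycle every 2 minutes

print(f"observed rate : {observed:.1f} cycles/hour")
print(f"at that rate  : {years_to_rating(observed):.2f} years to 300K")
print(f"at 30 per hour: {years_to_rating(steady):.2f} years to 300K")
```

Both rates put the 300K mark a little over a year out for a machine that is left powered on, somewhat sooner than the rough 1.5 year estimate quoted earlier.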

<SNIP>
>
> One thing they recommend with RAID, which I did NOT do, BTW, and which I'm
> beginning to worry about since I'm approaching the end of my 5 year
> warranties, is buying either different brands or models, or at least
> ensuring you're getting different lot numbers of the same model.  The idea
> being, if they're all the same model and lot number, and they're all part
> of the same RAID so in similar operating conditions, they're all likely to
> go out pretty close to each other.  That's one reason to be glad I'm
> running 4-way RAID-1, I suppose, as one hopes that when they start going,
> even if they are the same model and lot number, at least one of the four
> can hang on long enough for me to buy replacements and transfer the
> critical data.

Exactly! My plan for this box is a 3 disk RAID1 as 3 disks is all it will hold.

Most folks don't understand that if 1 drive has a 1% chance of failing
then with 3 drives there is more like a 3% chance that at least one of
them fails, assuming they are truly independent. If they all come from
the same lot and 1 fails then it's logically more likely that the other
2 will fail in the next few days or weeks. Certainly much sooner than if
they came from different companies.
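
The independent-drive case is easy to work out exactly. A small sketch, using the 1% per-drive figure from above purely as an illustrative number (the function names are made up, and same-lot drives are precisely the case where the independence assumption breaks down):

```python
# Failure odds for n drives, assuming each fails independently
# with probability p over the period of interest.


def p_at_least_one_fails(p, n):
    """Chance that one or more of n independent drives fails."""
    return 1 - (1 - p) ** n


def p_all_fail(p, n):
    """Chance that every one of n independent drives fails,
    i.e. that an n-way RAID1 loses all of its mirrors."""
    return p ** n


P_DRIVE = 0.01   # illustrative 1% per-drive failure chance
N_DRIVES = 3

print(f"at least one of {N_DRIVES} fails: {p_at_least_one_fails(P_DRIVE, N_DRIVES):.4%}")
print(f"all {N_DRIVES} fail            : {p_all_fail(P_DRIVE, N_DRIVES):.6%}")
```

So "more like 3%" is right for seeing some failure (about 2.97%), while losing all three mirrors at once is a one-in-a-million event only as long as the failures really are independent.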

<SNIP>
>>
>> INFO: task kjournald:5064 blocked for more than 120 seconds. "echo 0 >
>> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>
> [snipped the trace]
>
> Ouch!  Blocked for 2 minutes...
>
> Yes, between the logs and the 2-minute hung-task, that does look like some
> serious issues, chipset or other...
>
> Talking about which...
>
> Can you try different SATA cables?  I'm assuming you and your dad aren't
> using the same cables.  Maybe it's the cables, not the chipset.

Now that's an interesting thought. In my other machines I used the
cables Intel shipped with the MB. However in this case I couldn't
because the SATA connectors don't point upward but come out
horizontally. Due to proximity to the drive container I had to get 90
degree cables and all 3 drives are using those right now. I can switch
two of the drives to the Intel cables.

That said Spinrite has been running for hours without any problem at
all and it will tell me if there are delays, sectors not found, etc.,
so if it was as blatant a problem as it appears to be when running
Linux then I really think I would have seen it by now. I would have
guessed I would have seen it running badblocks also, but possibly not.

>
> Also, consider slowing the data down.  Disable UDMA or reduce it to a
> lower speed, or check the pinouts and try jumpering OPT1 to force SATA-1
> speeds (150 MB/sec instead of 300 MB/sec) as detailed here (watch the
> wrap!):
>
> http://wdc.custhelp.com/cgi-bin/wdc.cfg/php/enduser/std_adp.php?
> p_faqid=1337
>
> If that solves the issue, then you know it's related to signal timing.

Will try it.

>
> Unfortunately, this can be mobo related.  I had very similar issues with
> memory at one point, and had to slow it down from the rated PC3200, to
> PC3000 speed (declock it from 200 MHz to 183 MHz), in the BIOS.
> Unfortunately, initially the BIOS didn't have a setting for that; it
> wasn't until a BIOS update that I got it.  Until I got the update and
> declocked it, it would work most of the time, but was borderline.  The
> thing was, the memory was solid and tested so in memtest86+, but that
> tests memory cells, not speed, and at the rated speed, that memory and
> that board just didn't like each other, and there'd be occasional issues
> (bunzip2 erroring out due to checksum mismatch was a common one, and
> occasional crashes). Ultimately, I fixed the problem when I upgraded
> memory.

OK, so I have 6 of these drives and multiple PCs. While not a perfect
test I can try putting a couple into another machine and building a 2
drive RAID1 just to see what happens.

>
> So having experienced the issue with memory, I know exactly how
> frustrating it can be.  But if you slow it down with the jumper and it
> works, then you can try different cables, or take off the jumper and try
> lower UDMA speeds (but still higher than SATA-1/150MB/sec), using hdparm
> or something.  Or exchange either the drives or the mobo, if you can, or
> buy an add-on SATA card and disable the onboard one.
>
> Oh, and double-check the kernel driver you are using for it as well.
> Maybe there's another that'll work better, or driver options you can feed
> to it, or something.

The kernel driver is ahci. Don't know that I have any alternatives
when booting AHCI from BIOS, but I can look at the other modes with
other drivers and see if the problem still occurs. That's a bit of
work but probably worth it.

This is all a big table of experiments that should eventually narrow
the problem down to a single location. (Hopefully!)

>
> Oh, and if you hadn't re-fdisked, re-created new md devices, remkfsed, and
> reloaded the system from backup, after you switched to AHCI, try that.
> AHCI and the kernel driver for it is almost certainly what you want, not
> compatibility mode, but that could potentially screw things up too, if you
> switched it and didn't redo the disk afterward.
>
> I do wish you luck!  Seeing those errors brought back BAD memories of the
> memory problems I had, so while yours is disk not memory, I can definitely
> sympathize!

As always, thanks for the help. I'm very interested, and yes, even a
little frustrated! ;-)

Cheers,
Mark

264	Mark