
From: Duncan <1i5t5.duncan@×××.net>
To: gentoo-amd64@l.g.o
Subject: [gentoo-amd64] Re: RAID1 boot - no bootable media found
Date: Fri, 02 Apr 2010 10:02:38
Message-Id: pan.2010.04.02.09.43.22@cox.net
In Reply to: Re: [gentoo-amd64] Re: RAID1 boot - no bootable media found by Mark Knecht
Mark Knecht posted on Thu, 01 Apr 2010 11:57:47 -0700 as excerpted:

> A bit long in response. Sorry.
>
> On Tue, Mar 30, 2010 at 11:56 PM, Duncan <1i5t5.duncan@×××.net> wrote:
>> Mark Knecht posted on Tue, 30 Mar 2010 13:26:59 -0700 as excerpted:
>>
>>> [W]hen I change the hard drive boot does the old sdb become the new
>>> sda because it's what got booted? Or is the order still as it was?

>> That depends on your BIOS.

> It seems to be constant mapping, meaning (I guess) that I need to
> change the drive specs in grub.conf on the second drive to actually
> use the second drive.
>
> I made the titles for booting different for each grub.conf file to
> ensure I was really getting grub from the second drive. My sda grub
> boot menu says "2.6.33-gentoo booting from sda" on the first drive,
> sdb on the second drive, etc.

Making the titles different is a very good idea. It's what I ended up
doing too, as otherwise it can get confusing pretty fast.

Something else you might want to do, for purposes of identifying the
drives at the grub boot prompt if something goes wrong or you're
otherwise trying to boot something on another drive, is create a
(probably empty) differently named file on each one, say grub.sda,
grub.sdb, etc.

That way, if you end up at the boot prompt, you can do a find
/grub.sda (or /grub/grub.sda, or whatever), and grub will return a
list of the drives carrying that file, in this case only one drive,
thus identifying your normal sda drive.

You can of course do something similar by cat-ing the grub.conf file
on each one, since you're keeping your titles different, but that's a
bit more complicated than simply doing a find on the appropriate file,
to get your bearings straight on which is which in the event something
screws up.
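
For example, at the grub shell (the device/partition grub reports here
is just illustrative; it depends on where the file actually lives):

  grub> find /grub.sda
   (hd0,0)
  grub> find /grub.sdb
   (hd1,0)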

>> The point being... [using badblocks] it /is/ actually possible to
>> verify that they're working well before you fdisk/mkfs and load data.
>> Tho it does take awhile... days... on drives of modern size.
>>
> I'm trying badblocks right now on sdc, using
>
> badblocks -v /dev/sdc
>
> Maybe I need to do something more strenuous? It looks like it will be
> done in an hour or two. (i7-920 with SATA drives, so it should be
> fast, as long as I'm not just reading the buffers or something like
> that.)
>
> Roughly speaking, 1TB read at 100MB/s should take 10,000 seconds, or
> 2.7 hours. I'm at 18% after 28 minutes, so that seems about right.
> (With no errors so far, assuming I'm using the right command.)

I used the -w switch here, which actually goes over the disk a total
of 8 times, alternating writing and then reading back to verify the
written pattern, for four different write patterns (0xaa, 0x55, 0xff,
0x00: alternating 10101010, alternating 01010101, all ones, all
zeroes).

But that's data-destructive. IOW, it effectively wipes the disk.
Doing it when the disks were new, before I fdisked them, let alone
mkfs-ed and started loading data, was fine, but it's not something you
do if you have unbacked-up data on them that you want to keep!
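
If you do want the full write-mode test on a disk with nothing on it
yet, this is the sort of invocation I mean (a sketch; -s just adds a
progress display, and again, it WILL wipe the disk):

  badblocks -wsv /dev/sdc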

Incidentally, that's not /quite/ the infamous US-DoD 7-pass wipe, as
it's only 4 passes, but it should reasonably ensure against ordinary
recovery, in any case, if you have reason to wipe your disks... Well,
except for any blocks the disk internals may have detected as bad and
rewritten elsewhere; you can get the SMART data on that. But a 4-pass
wipe, as badblocks -w does, should certainly be good for the normal
case, and is already way better than just an fdisk, which doesn't even
wipe anything but the allocation tables!

But back to the timing. Since the -w switch does a total of 8 passes
(4 each write and read, alternating) while you're doing just one with
-v alone, it'll obviously take 8 times as long. So 80,000 seconds...
22+ hours.

So I guess it's not days... just about a day. (Probably somewhat
more, as the first part of the disk, near the outside edge, should go
faster than the last part, so figure a bit over a day, maybe 30
hours...)

[8 second spin-down timeouts]

> Very true. Here is the same drive model I put in a new machine for my
> dad. It's been powered up and running Gentoo as a typical desktop
> machine for about 50 days. He doesn't use it more than about an hour
> a day on average. It's already hit 31K load/unload cycles. At 10% of
> 300K, that's about 1.5 years of life before I hit that spec. I've
> watched his system a bit and it seems to add 1 to the count almost
> exactly every 2 minutes on average. Is that a common cron job maybe?

It's unlikely to be a cron job. But check your logging, and check
what sort of atime you're using on your mounts. (relatime is the new
kernel default, but the default was atime until relatively recently,
2.6.30 or .31 or some such. noatime is recommended unless you have
something that actually depends on atime: alpine is known to need it
for mail, and some backup software uses it, tho little else on a
modern system will. I always use noatime on my real disk mounts here,
as opposed to, say, tmpfs.) If there's something writing to the log
every two minutes or less, and the buffers are set to time out dirty
data and flush to disk every two minutes... And simply accessing a
file will change its atime if you have that turned on, thus
necessitating a write to disk to update the atime, with those dirty
buffers flushed every X minutes or seconds as well.
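
For fstab purposes noatime is just another mount option; a minimal
sketch (the device and mount point here are hypothetical, not from
your setup):

  /dev/md3   /home   ext3   defaults,noatime   0 2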

> I looked up the spec on all three WD lines - Green, Blue and Black.
> All three were 300K cycles. This issue has come up on the RAID list.
> It seems that some other people are seeing this and aren't exactly
> sure what Linux is doing to cause this.

It's probably not just Linux, but a combination of Linux and the drive
defaults.
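
On the drive-defaults side, the drive's APM level is worth a look once
you can get at it, since an aggressive level is what typically
triggers those constant head unloads. A sketch with hdparm, assuming
the drive honors APM at all (some manage the unload timer purely in
firmware instead):

  hdparm -B /dev/sda        # query the current APM level, if supported
  hdparm -B 254 /dev/sda    # max performance short of disabling APM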

> I'll study hdparm and BIOS when I can reboot.
>
> My dad's current data:
>
> ID# ATTRIBUTE_NAME      FLAG   VALUE WORST THRESH TYPE    UPDATED WHEN_FAILED RAW_VALUE
>   4 Start_Stop_Count    0x0032 100   100   000    Old_age Always  -           21
>   9 Power_On_Hours      0x0032 099   099   000    Old_age Always  -           1183
>  12 Power_Cycle_Count   0x0032 100   100   000    Old_age Always  -           20
> 193 Load_Cycle_Count    0x0032 190   190   000    Old_age Always  -           31240

Here are my comparable numbers, from a several-years-old Seagate
7200.8 series:

  4 Start_Stop_Count    0x0032 100   100   020    Old_age Always  -           996
  9 Power_On_Hours      0x0032 066   066   000    Old_age Always  -           30040
 12 Power_Cycle_Count   0x0032 099   099   020    Old_age Always  -           1045
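
(Both tables are the standard vendor-attribute output from
smartmontools, for anyone following along; pulling it again later to
watch the counts move is just:

  smartctl -A /dev/sda

with sda being whichever drive you're checking.)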

Note that I don't have #193, the load-cycle count. There are a couple
of different technologies here. The ramp-type load/unload yours use
is typical of the smaller 2.5" laptop drives. These are designed for
far shorter idle/standby timeouts and thus a far higher load-cycle
count, typically rated 300,000 to 600,000. Standard desktop/server
drives use a contact-park method and a lower power-cycle rating,
typically 50,000 or so. That's the difference.

At a 300,000 load-cycle rating, your WDs are at the lower end of the
ramp-type ratings, but still far higher than comparable contact
power-cycle ratings. Even tho the ramp type they use is good for far
more cycles, as you mentioned, you're already at 10% after only 50
days.

My old Seagates, OTOH, are about 4.5 years old best I can figure (I
bought them around October; 30K operating hours is ~3.5 years, and I
run them most but not all of the time, so 4.5 years is a good
estimate). They're rated for only 50,000 contact start/stop cycles
(they're NOT the ramp type), but SMART says only about 1,000, or 2% of
the rating, is gone. (If you check the stats, they seem to recommend
replacing at 20%, assuming that's a percentage, which looks likely,
but either way, that's a metric I don't need to worry about any time
soon.)

OTOH, at 30,000+ operating hours (about 3.5 years if on constantly, as
I mentioned above), that attribute is running rather lower. Again
assuming it's a percentage metric, it would appear they rate them at
90,000 hours. (I looked up the spec sheets tho, and couldn't see
anything like that listed, only 5 years lifetime and warranty, which
would be about half that, 45,000 hours. But given the 0 threshold
there, it appears they expect the start-stop cycles to be more
critical, so they may indeed rate it 90,000 operating hours.) That'd
be three and a half years of use straight thru, so yeah, I've had them
probably four and a half years now, probably five in October -- I
don't have them spin down at all and often leave my system on for days
at a time, but not /all/ the time, so 3.5 years of use in 4.5 years
sounds reasonable.

> Yeah, that's important. Thanks. If I can solve all these RAID
> problems then maybe I'll look at adding RAID to his box with better
> drives or something.

One thing they recommend with RAID, which I did NOT do, BTW, and which
I'm beginning to worry about since I'm approaching the end of my
5-year warranties, is buying either different brands or models, or at
least ensuring you're getting different lot numbers of the same model.
The idea being, if they're all the same model and lot number, and
they're all part of the same RAID so in similar operating conditions,
they're all likely to go out pretty close to each other. That's one
reason to be glad I'm running 4-way RAID-1, I suppose, as one hopes
that when they start going, even if they are the same model and lot
number, at least one of the four can hang on long enough for me to buy
replacements and transfer the critical data. But I also have an
external 1 TB USB drive, kept off most of the time as opposed to the
RAID disks which are on most of the time, with an external backup on
it as well as the backups on the RAIDs, tho the external one isn't as
regularly synced. And in the event all four RAID drives die on me, I
HAVE test-booted from a USB thumb drive (the external 1TB isn't
bootable -- good thing I tested, eh!) to the external 1TB, and CAN
recover from it if I HAVE to.

> Note that on my system only I'm seeing real problems in
> /var/log/messages, non-RAID, like 1000's of these:
>
> Mar 29 14:06:33 keeper kernel: rsync(3368): READ block 45276264 on sda3
> Mar 29 14:06:33 keeper kernel: rsync(3368): READ block 46309336 on sda3
> Mar 29 14:06:33 keeper kernel: rsync(3368): READ block 46567488 on sda3
> Mar 29 14:06:33 keeper kernel: rsync(3368): READ block 46567680 on sda3
>
> or
>
> Mar 29 14:07:36 keeper kernel: flush-8:0(3365): WRITE block 33555752 on sda3
> Mar 29 14:07:36 keeper kernel: flush-8:0(3365): WRITE block 33555760 on sda3
> Mar 29 14:07:36 keeper kernel: flush-8:0(3365): WRITE block 33555768 on sda3
> Mar 29 14:07:36 keeper kernel: flush-8:0(3365): WRITE block 33555776 on sda3

That doesn't look so good...

> However I see NONE of that on my dad's machine using the same drive
> but different chipset.
>
> The above problems seem to result in this sort of problem when I try
> going with RAID, as I tried again this morning:
>
> INFO: task kjournald:5064 blocked for more than 120 seconds. "echo 0 >
> /proc/sys/kernel/hung_task_timeout_secs" disables this message.

[snipped the trace]

Ouch! Blocked for 2 minutes...

Yes, between the logs and the 2-minute hung task, that does look like
some serious issues, chipset or otherwise...

Speaking of which...

Can you try different SATA cables? I'm assuming you and your dad
aren't using the same cables. Maybe it's the cables, not the chipset.

Also, consider slowing the data down. Disable UDMA or reduce it to a
lower speed, or check the pinouts and try jumpering OPT1 to force
SATA-1 speed (150 MB/s instead of 300 MB/s), as detailed here:

http://wdc.custhelp.com/cgi-bin/wdc.cfg/php/enduser/std_adp.php?p_faqid=1337

If that solves the issue, then you know it's related to signal timing.

Unfortunately, this can be mobo-related. I had very similar issues
with memory at one point, and had to slow it down from the rated
PC3200 to PC3000 speed (declock it from 200 MHz to 183 MHz) in the
BIOS. Unfortunately, the BIOS initially didn't have a setting for
that; it wasn't until a BIOS update that I got it. Until I got the
update and declocked it, it would work most of the time, but was
borderline. The thing was, the memory was solid and tested fine in
memtest86+, but that tests memory cells, not speed, and at the rated
speed, that memory and that board just didn't like each other, so
there'd be occasional issues (bunzip2 erroring out due to checksum
mismatch was a common one, and occasional crashes). Ultimately, I
fixed the problem when I upgraded the memory.

So having experienced the issue with memory, I know exactly how
frustrating it can be. But if you slow it down with the jumper and it
works, then you can try different cables, or take off the jumper and
try lower UDMA speeds (but still higher than SATA-1's 150 MB/s), using
hdparm or something. Or exchange either the drives or the mobo, if
you can, or buy an add-on SATA card and disable the onboard one.
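
If you do try the hdparm route, this is roughly the shape of it (a
sketch only; whether libata actually honors a mode change on a SATA
link varies, so treat it as something to experiment with):

  hdparm -i /dev/sda           # shows the DMA modes the drive claims
  hdparm -X udma4 /dev/sda     # request a lower UDMA mode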

Oh, and double-check the kernel driver you are using for it as well.
Maybe there's another that'll work better, or driver options you can
feed to it, or something.
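
A quick way to see which driver has claimed the controller (the exact
output shape varies a bit between lspci versions):

  lspci -k | grep -iA3 sata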

Oh, and if you hadn't re-fdisked, re-created the md devices,
re-mkfsed, and reloaded the system from backup after you switched to
AHCI, try that. AHCI and the kernel driver for it are almost
certainly what you want, not compatibility mode, but the switch could
potentially screw things up too, if you flipped it and didn't redo the
disk afterward.
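
The redo itself is the usual sequence, something like this sketch (the
device names and RAID shape are hypothetical, so adjust for your
layout, and of course it destroys whatever is on those partitions):

  fdisk /dev/sda                # re-partition each drive
  mdadm --create /dev/md0 --level=1 --raid-devices=2 \
        /dev/sda3 /dev/sdb3     # re-create the array
  mkfs.ext3 /dev/md0            # re-mkfs, then reload from backup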

I do wish you luck! Seeing those errors brought back BAD memories of
the memory problems I had, so while yours is disk, not memory, I can
definitely sympathize!

--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
