Good stuff. I'll snip out the less important parts to keep the response
shorter, but don't think for a second that I didn't appreciate it. I
do!

On Fri, Apr 2, 2010 at 2:43 AM, Duncan <1i5t5.duncan@×××.net> wrote:
> Mark Knecht posted on Thu, 01 Apr 2010 11:57:47 -0700 as excerpted:
<SNIP>
>
> Making the titles different is a very good idea. It's what I ended up
> doing too, as otherwise, it can get confusing pretty fast.
>
> Something else you might want to do, for purposes of identifying the
> drives at the grub boot prompt if something goes wrong or you are
> otherwise trying to boot something on another drive, is create a (probably
> empty) differently named file on each one, say grub.sda, grub.sdb, etc.

I'll consider that, once I get the hard problems solved.
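
(If I do, it should just be a matter of mounting each drive's boot
partition and dropping an empty marker file on it, something like

  touch /mnt/sda1/grub.sda
  touch /mnt/sdb1/grub.sdb
  touch /mnt/sdc1/grub.sdc

with those mount points and partition numbers being assumptions on my
part, of course.)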
<SNIP>
>> Roughly speaking 1TB read at 100MB/S should take 10,000 seconds or 2.7
>> hours. I'm at 18% after 28 minutes so that seems about right. (With no
>> errors so far assuming I'm using the right command)
>
> I used the -w switch here, which actually goes over the disk a total of 8
> times, alternating writing and then reading back to verify the written
> pattern, for four different write patterns (0xaa, 0x55, 0xff, 0x00, which
> is alternating 10101010, alternating 01010101, all ones, all zeroes).

OK, makes sense then.

I ran one pass of badblocks on each of the drives. No problem found.
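
That was just the default read-only scan; if I decide to repeat your
full write/verify test I believe it's roughly this (device name is
just an example, and the -w run is destructive, so only on a drive I
can wipe):

  badblocks -sv /dev/sda     # read-only scan, show progress
  badblocks -wsv /dev/sda    # write-mode test, the four patterns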

I know some Linux folks don't like Spinrite but I've had good luck
with it, so that's running now. Problem is it cannot run on all the
drives at the same time, and it looks like it wants at least 24 hours
to do the whole drive, so using it would take 3 days. I will likely
let it run through the first drive (I'm busy today) and then tomorrow
drop back into Linux and possibly try your badblocks on all 3 drives.
I'm not overly concerned about losing the install.

<SNIP>
>
>
> [8 second spin-down timeouts]
>
>> Very true. Here is the same drive model I put in a new machine for my
>> dad. It's been powered up and running Gentoo as a typical desktop
>> machine for about 50 days. He doesn't use it more than about an hour a
>> day on average. It's already hit 31K load/unload cycles. At 10% of 300K
>> that about 1.5 years of life before I hit that spec. I've watched his
>> system a bit and his system seems to add 1 to the count almost exactly
>> every 2 minutes on average. Is that a common cron job maybe?
>
> It's unlikely to be a cron job. But check your logging, and check what
> sort of atime you're using on your mounts (relatime is the new kernel
> default, but it was atime until relatively recently, say 2.6.30 or .31 or
> some such, and noatime is recommended unless you have something that
> actually depends on atime, alpine is known to need it for mail, and some
> backup software uses it, tho little else on a modern system will, I always
> use noatime on my real disk mounts, as opposed to say tmpfs, here). If
> there's something writing to the log every two minutes or less, and the
> buffers are set to timeout dirty data and flush to disk every two
> minutes... And simply accessing a file will change the atime on it if you
> have that turned on, thus necessitating a write to disk to update the
> atime, with those dirty buffers flushed every X minutes or seconds as well.

Here is fstab from my dad's machine, which racks up 30
Load_Cycle_Counts an hour:

# NOTE: If your BOOT partition is ReiserFS, add the notail option to opts.
LABEL="myboot"    /boot        ext2    noauto,noatime       1 2
LABEL="myroot"    /            ext3    noatime              0 1
LABEL="myswap"    none         swap    sw                   0 0
LABEL="homeherb"  /home/herb   ext3    noatime              0 1
/dev/cdrom        /mnt/cdrom   auto    noauto,ro            0 0
#/dev/fd0         /mnt/floppy  auto    noauto               0 0

# glibc 2.2 and above expects tmpfs to be mounted at /dev/shm for
# POSIX shared memory (shm_open, shm_unlink).
# (tmpfs is a dynamically expandable/shrinkable ramdisk, and will
# use almost no memory if not populated with files)
shm               /dev/shm     tmpfs   nodev,nosuid,noexec  0 0

On the other hand there is some cron stuff going on every 10 minutes
or so. Possibly it's not 1 event every 2 minutes but maybe 5 events
every 10 minutes?

Apr 2 07:10:01 gandalf cron[6310]: (root) CMD (test -x /usr/sbin/run-crons && /usr/sbin/run-crons )
Apr 2 07:20:01 gandalf cron[6322]: (root) CMD (test -x /usr/sbin/run-crons && /usr/sbin/run-crons )
Apr 2 07:30:01 gandalf cron[6335]: (root) CMD (test -x /usr/sbin/run-crons && /usr/sbin/run-crons )
Apr 2 07:40:01 gandalf cron[6348]: (root) CMD (test -x /usr/sbin/run-crons && /usr/sbin/run-crons )
Apr 2 07:50:01 gandalf cron[6361]: (root) CMD (test -x /usr/sbin/run-crons && /usr/sbin/run-crons )
Apr 2 07:59:01 gandalf cron[6374]: (root) CMD (rm -f /var/spool/cron/lastrun/cron.hourly)
Apr 2 08:00:01 gandalf cron[6376]: (root) CMD (test -x /usr/sbin/run-crons && /usr/sbin/run-crons )
Apr 2 08:10:01 gandalf cron[6388]: (root) CMD (test -x /usr/sbin/run-crons && /usr/sbin/run-crons )
Apr 2 08:20:01 gandalf cron[6401]: (root) CMD (test -x /usr/sbin/run-crons && /usr/sbin/run-crons )
Apr 2 08:30:01 gandalf cron[6414]: (root) CMD (test -x /usr/sbin/run-crons && /usr/sbin/run-crons )
Apr 2 08:40:01 gandalf cron[6427]: (root) CMD (test -x /usr/sbin/run-crons && /usr/sbin/run-crons )
Apr 2 08:50:01 gandalf cron[6440]: (root) CMD (test -x /usr/sbin/run-crons && /usr/sbin/run-crons )
Apr 2 08:59:01 gandalf cron[6453]: (root) CMD (rm -f /var/spool/cron/lastrun/cron.hourly)
Apr 2 09:00:01 gandalf cron[6455]: (root) CMD (test -x /usr/sbin/run-crons && /usr/sbin/run-crons )
Apr 2 09:10:01 gandalf cron[6467]: (root) CMD (test -x /usr/sbin/run-crons && /usr/sbin/run-crons )
Apr 2 09:18:01 gandalf sshd[6479]: Accepted keyboard-interactive/pam for root from 67.188.27.80 port 51981 ssh2
Apr 2 09:18:01 gandalf sshd[6479]: pam_unix(sshd:session): session opened for user root by (uid=0)
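
One way I can think of to pin this down is to just watch the counter
against the clock for a while, something along these lines (drive name
assumed; the attribute shows up as Load_Cycle_Count, #193, in
smartctl -A output):

  while true; do
      date
      smartctl -A /dev/sda | grep -i load_cycle
      sleep 120
  done

If it ticks up once per pass that points at the 2-minute dirty-buffer
flush you mentioned; if it jumps by several right around :10, :20,
etc., then it really is the cron runs.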

<SNIP>
>
> Note that I don't have #193, the load-cycle counts. There's a couple
> different technologies here. The ramp-type load/unload yours uses is
> typical of the smaller 2.5" laptop drives. These are designed for far
> shorter idle/standby timeouts and thus a far higher cycle count, load
> cycles, typical rating 300,000 to 600,000. Standard desktop/server drives
> use a contact park method and a lower power cycle count, typically 50,000
> or so. That's the difference.

I also purchased two Enterprise Edition drives - the 500GB size. They
are also spec'ed at 300K:

http://www.wdc.com/en/products/products.asp?DriveID=489

My intention was to use them in a RAID0 and then back them up daily to
a RAID1 for more safety. However I'm starting to think this TLER
feature may well be part of this problem. I don't want to start using
them, however, until I understand this 30/hour issue. No reason to
wear everything out!

<SNIP>
>
> One thing they recommend with RAID, which I did NOT do, BTW, and which I'm
> beginning to worry about since I'm approaching the end of my 5 year
> warranties, is buying either different brands or models, or at least
> ensuring you're getting different lot numbers of the same model. The idea
> being, if they're all the same model and lot number, and they're all part
> of the same RAID so in similar operating conditions, they're all likely to
> go out pretty close to each other. That's one reason to be glad I'm
> running 4-way RAID-1, I suppose, as one hopes that when they start going,
> even if they are the same model and lot number, at least one of the four
> can hang on long enough for me to buy replacements and transfer the
> critical data.

Exactly! My plan for this box is a 3-disk RAID1, as 3 disks is all it
will hold.

Most folks don't understand that if 1 drive has a 1% chance of failing
then 3 drives have more like a 3% chance of at least one failing,
assuming they are truly independent. If they all come from the same
lot and 1 fails then it's logically more likely that the other 2 will
fail in the next few days or weeks. Certainly much faster than if they
came from different companies.
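
(Rough math, treating the drives as independent: the chance that at
least one of the 3 fails is 1 - 0.99^3, i.e. about 2.97%, so call it
3%. Quick check:

  echo '1 - 0.99^3' | bc -l

prints .029701.)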

<SNIP>
>>
>> INFO: task kjournald:5064 blocked for more than 120 seconds. "echo 0 >
>> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>
> [snipped the trace]
>
> Ouch! Blocked for 2 minutes...
>
> Yes, between the logs and the 2-minute hung-task, that does look like some
> serious issues, chipset or other...
>
> Talking about which...
>
> Can you try different SATA cables? I'm assuming you and your dad aren't
> using the same cables. Maybe it's the cables, not the chipset.

Now that's an interesting thought. In my other machines I used the
cables Intel shipped with the MB. However in this case I couldn't
because the SATA connectors don't point upward but come out
horizontally. Due to proximity to the drive container I had to get 90
degree cables and all 3 drives are using those right now. I can switch
two of the drives to the Intel cables.

That said, Spinrite has been running for hours without any problem at
all and it will tell me if there are delays, sectors not found, etc.,
so if it was as blatant a problem as it appears to be when running
Linux then I really think I would have seen it by now. I would have
guessed I would have seen it running badblocks also, but possibly not.

>
> Also, consider slowing the data down. Disable UDMA or reduce it to a
> lower speed, or check the pinouts and try jumpering OPT1 to force SATA-1
> speeds (150 MB/sec instead of 300 MB/sec) as detailed here (watch the
> wrap!):
>
> http://wdc.custhelp.com/cgi-bin/wdc.cfg/php/enduser/std_adp.php?
> p_faqid=1337
>
> If that solves the issue, then you know it's related to signal timing.

Will try it.
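
One easy check after jumpering, to see whether the link actually came
up at the slower rate, should be the libata messages, something like:

  dmesg | grep -i 'SATA link'

which should show "SATA link up 1.5 Gbps" instead of 3.0 Gbps if the
jumper took.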

>
> Unfortunately, this can be mobo related. I had very similar issues with
> memory at one point, and had to slow it down from the rated PC3200, to
> PC3000 speed (declock it from 200 MHz to 183 MHz), in the BIOS.
> Unfortunately, initially the BIOS didn't have a setting for that; it
> wasn't until a BIOS update that I got it. Until I got the update and
> declocked it, it would work most of the time, but was borderline. The
> thing was, the memory was solid and tested so in memtest86+, but that
> tests memory cells, not speed, and at the rated speed, that memory and
> that board just didn't like each other, and there'd be occasional issues
> (bunzip2 erroring out due to checksum mismatch was a common one, and
> occasional crashes). Ultimately, I fixed the problem when I upgraded
> memory.

OK, so I have 6 of these drives and multiple PCs. While not a perfect
test I can try putting a couple into another machine and building a 2
drive RAID1 just to see what happens.
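
If I do that, I'd put it together roughly like this (device names and
partition numbers here are just placeholders, not this box's layout):

  mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdb1 /dev/sdc1
  mkfs.ext3 /dev/md0
  mdadm --detail /dev/md0    # watch the initial resync

and then just let it resync and pound on it for a while to see whether
the same errors show up on a different chipset.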

>
> So having experienced the issue with memory, I know exactly how
> frustrating it can be. But if you slow it down with the jumper and it
> works, then you can try different cables, or take off the jumper and try
> lower UDMA speeds (but still higher than SATA-1/150MB/sec), using hdparm
> or something. Or exchange either the drives or the mobo, if you can, or
> buy an add-on SATA card and disable the onboard one.
>
> Oh, and double-check the kernel driver you are using for it as well.
> Maybe there's another that'll work better, or driver options you can feed
> to it, or something.

The kernel driver is ahci. Don't know that I have any alternatives
when booting AHCI from BIOS, but I can look at the other modes with
other drivers and see if the problem still occurs. That's a bit of
work but probably worth it.
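
(A quick way to double-check what's actually bound to the controller,
assuming it shows up under lspci the way I expect:

  lspci -k | grep -A 2 -i sata

should show "Kernel driver in use: ahci", or whichever driver grabs it
when the BIOS is switched to IDE/compatibility mode.)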

This is all a big table of experiments that will eventually narrow the
problem down to a single location. (Hopefully!)

>
> Oh, and if you hadn't re-fdisked, re-created new md devices, remkfsed, and
> reloaded the system from backup, after you switched to AHCI, try that.
> AHCI and the kernel driver for it is almost certainly what you want, not
> compatibility mode, but that could potentially screw things up too, if you
> switched it and didn't redo the disk afterward.
>
> I do wish you luck! Seeing those errors brought back BAD memories of the
> memory problems I had, so while yours is disk not memory, I can definitely
> sympathize!

As always, thanks for the help. I'm very interested, and yes, even a
little frustrated! ;-)

Cheers,
Mark