Hi,

Setting up and testing my new system (after wasting nearly 1 month
with bad RAM modules), I got this error today:

[48055.741389] ata3.00: exception Emask 0x0 SAct 0x2 SErr 0x0 action 0x6 frozen
[48055.741393] ata3.00: failed command: READ FPDMA QUEUED
[48055.741398] ata3.00: cmd 60/20:08:38:15:03/01:00:18:00:00/40 tag 1 ncq 147456 in
[48055.741400] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[48055.741402] ata3.00: status: { DRDY }
[48055.741405] ata3: hard resetting link
[48056.198746] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[48056.210514] ata3.00: configured for UDMA/133
[48056.210518] ata3.00: device reported invalid CHS sector 0
[48056.210523] ata3: EH complete

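To get a bit more out of the "cmd" line, here is a minimal Python sketch that decodes it. It assumes the field order libata uses when printing a taskfile (command/feature:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device) and the NCQ convention that the sector count sits in the feature registers and the tag in the upper bits of nsect. The function name decode_ncq_cmd and the 512-byte sector size are my own assumptions, not anything taken from the kernel.

```python
import re

# Assumed libata layout:
# cmd CMD/FEAT:NSECT:LBAL:LBAM:LBAH/HOB_FEAT:HOB_NSECT:HOB_LBAL:HOB_LBAM:HOB_LBAH/DEV
CMD_RE = re.compile(
    r"cmd\s+([0-9a-f]{2})"
    r"/((?:[0-9a-f]{2}:){4}[0-9a-f]{2})"
    r"/((?:[0-9a-f]{2}:){4}[0-9a-f]{2})"
    r"/([0-9a-f]{2})",
    re.IGNORECASE,
)


def decode_ncq_cmd(line, sector_size=512):
    """Decode an NCQ taskfile from a libata 'cmd' log line (hypothetical helper)."""
    m = CMD_RE.search(line)
    if m is None:
        raise ValueError("no libata 'cmd' taskfile found in line")
    command = int(m.group(1), 16)
    feature, nsect, lbal, lbam, lbah = (int(x, 16) for x in m.group(2).split(":"))
    hob_feature, _hob_nsect, hob_lbal, hob_lbam, hob_lbah = (
        int(x, 16) for x in m.group(3).split(":")
    )
    # NCQ: 16-bit sector count is split across the two feature registers.
    sectors = (hob_feature << 8) | feature
    # 48-bit LBA, most significant byte first.
    lba = (
        (hob_lbah << 40) | (hob_lbam << 32) | (hob_lbal << 24)
        | (lbah << 16) | (lbam << 8) | lbal
    )
    return {
        "command": command,
        "tag": nsect >> 3,          # NCQ tag lives in bits 7:3 of nsect
        "sectors": sectors,
        "bytes": sectors * sector_size,
        "lba": lba,
        "device": int(m.group(4), 16),
    }


if __name__ == "__main__":
    line = ("[48055.741398] ata3.00: cmd 60/20:08:38:15:03/01:00:18:00:00/40 "
            "tag 1 ncq 147456 in")
    print(decode_ncq_cmd(line))
```

Under those assumptions the line decodes to command 0x60 (READ FPDMA QUEUED), tag 1, 288 sectors starting at LBA 402855224. 288 sectors of 512 bytes is 147456 bytes, which matches the "ncq 147456 in" part of the log, so the read that timed out was an ordinary 144 KiB request well inside the disk.
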
I really don't understand what it means, but the "timeout", "hard
resetting link" and "invalid CHS sector 0" look scary to me...

Initial bootup messages for this device were:
Mar 25 22:02:32 [kernel] [ 4.496102] ata3: SATA max UDMA/133 abar m2048@0xfbffc000 port 0xfbffc200 irq 34
Mar 25 22:02:32 [kernel] [ 8.519169] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Mar 25 22:02:32 [kernel] [ 8.536681] ata3.00: ATA-8: SAMSUNG HD203WI, 1AN10002, max UDMA/133
Mar 25 22:02:32 [kernel] [ 8.548388] ata3.00: 3907029168 sectors, multi 0: LBA48 NCQ (depth 31/32), AA
Mar 25 22:02:32 [kernel] [ 8.566100] ata3.00: configured for UDMA/133

That disk is part of an md RAID5 array, but I was at work when the
error happened, so I didn't see whether the RAID repaired itself or
whatever would happen in this case (I don't have mdadm monitoring
configured yet). Right now all RAID disks are up and healthy.

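Until mdadm monitoring is set up, the array state can at least be checked by reading /proc/mdstat. The following is a rough Python sketch, assuming the usual status format where each array has a line ending in something like "[6/6] [UUUUUU]" (an underscore marks a missing member). The function name degraded_arrays is hypothetical, and this is only a stopgap, not a replacement for mdadm's own monitor mode.

```python
import os
import re

# Matches the "[expected/active] [UU_]" status found on an array's detail line.
STATUS_RE = re.compile(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]")


def degraded_arrays(mdstat_text):
    """Map each md array name to True if it looks degraded (hypothetical helper)."""
    result = {}
    current = None
    for line in mdstat_text.splitlines():
        header = re.match(r"(md\S*)\s*:", line)
        if header:
            current = header.group(1)   # e.g. "md0"
            continue
        status = STATUS_RE.search(line)
        if status and current:
            expected = int(status.group(1))
            active = int(status.group(2))
            flags = status.group(3)
            result[current] = active < expected or "_" in flags
    return result


if __name__ == "__main__":
    path = "/proc/mdstat"
    if os.path.exists(path):
        with open(path) as f:
            for name, degraded in degraded_arrays(f.read()).items():
                print(name, "DEGRADED" if degraded else "ok")
```

Run from cron, something like this would have shown whether the array dropped the disk while nobody was watching. It only sees the state at the moment it runs, so a member that was kicked and re-added in between would go unnoticed.
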
I googled it, but most of the results are pastebin snippets. I'm using
kernel 2.6.33 and the ahci driver for the SATA controllers.

The libata documentation, in the section about timeout errors, says:
"Most often this is due to an unrelated interrupt subsystem bug (try
booting with 'pci=nomsi' or 'acpi=off' or 'noapic'), which failed to
deliver an interrupt when we were expecting one from the hardware."

I really don't know the potential implications of disabling MSI or
APIC, but in /proc/interrupts I do see AHCI listed on both MSI and
APIC rows, so at least I know both are active right now.

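To make the /proc/interrupts check repeatable (for example before and after trying pci=nomsi), a small Python sketch along these lines could list only the rows that mention the ahci driver. It assumes the common layout of one header line with the CPU names, followed by rows of "IRQ: per-CPU counts, interrupt type, device names". The function name ahci_interrupt_rows is hypothetical.

```python
import os


def ahci_interrupt_rows(interrupts_text, driver="ahci"):
    """Return (irq, total_count, description) for rows naming the driver."""
    lines = interrupts_text.splitlines()
    if not lines:
        return []
    ncpus = len(lines[0].split())        # header: CPU0 CPU1 ...
    rows = []
    for line in lines[1:]:
        label, sep, rest = line.partition(":")
        if not sep:
            continue
        fields = rest.split()
        counts, tail = fields[:ncpus], fields[ncpus:]
        if not counts or not all(c.isdigit() for c in counts):
            continue
        if any(driver in item for item in tail):
            # tail holds the interrupt type (MSI or IO-APIC) and device names
            rows.append((label.strip(), sum(int(c) for c in counts), " ".join(tail)))
    return rows


if __name__ == "__main__":
    path = "/proc/interrupts"
    if os.path.exists(path):
        with open(path) as f:
            for irq, total, desc in ahci_interrupt_rows(f.read()):
                print(f"IRQ {irq}: {total} interrupts, {desc}")
```

The description column shows whether a given AHCI controller is on an MSI or an IO-APIC line, and comparing the totals over time shows whether interrupts are still arriving on that line.
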
Temperatures in my system are good; hddtemp says the drive in question
is at 21 °C right now.

Another possibility is that I need to increase a voltage on the
motherboard (X58), since it is running 6 HDDs and 1 DVD-ROM. I'll have
to research which voltage is relevant here.

Thanks in advance if anyone has any knowledge about this; otherwise
I'll go into trial-and-hopefully-no-error mode. :)

Paul