Gentoo Archives: gentoo-user

From: Rich Freeman <rich0@g.o>
To: gentoo-user@l.g.o
Subject: Re: [gentoo-user] Long boot time after kernel update
Date: Mon, 27 Dec 2021 14:15:31
Message-Id: CAGfcS_m_b-voarBnRfXYQsU3D1dGQSAm3qR57dcCSaE0X=ouVw@mail.gmail.com
In Reply to: Re: [gentoo-user] Long boot time after kernel update by Wols Lists
On Mon, Dec 27, 2021 at 8:46 AM Wols Lists <antlists@××××××××××××.uk> wrote:
>
> On 27/12/2021 13:40, Michael wrote:
> > On Monday, 27 December 2021 11:32:39 GMT Wols Lists wrote:
> >> On 27/12/2021 11:07, Jacques Montier wrote:
> >>> Well, i don't know if my partitions are aligned or mis-aligned... How
> >>> could i get it ?
> >>
> >> fdisk would have spewed a bunch of warnings. So you're okay.
> >>
> >> I'm not sure of the details, but it's the classic "off by one" problem -
> >> if there's a mismatch between the kernel block size and the disk block
> >> size any writes required doing a read-update-write cycle which of course
> >> knackered performance. I had that hit a while back.
> >>
> >> But seeing as fdisk isn't moaning, that isn't the problem ...
> >>
> >> Cheers,
> >> Wol
> >
> > I also thought of misaligned boundaries when I first saw the error, but the
> > mention of Seagate by the OP pointed me to another edge case which crept up
> > with zstd compression on ZFS. I'm mentioning it here in case it is relevant:
> >
> > https://livelace.ru/posts/2021/Jul/19/unaligned-write-command/
> >
> that might be of interest to me ... I'm getting system lockups but it's
> not an SSD. I've got two IronWolves and a Barracuda.
>
> But I notice the OP has a Barra*C*uda. Note the different spelling.
> That's a shingled drive I believe, which shouldn't make a lot of
> difference in light usage, but you don't want to hammer it!

I've run into this issue and I've seen rare reports of it online, but
no sign of a resolution. I'm pretty sure it is some sort of bug in the
kernel. I've tended to see it under load, and mostly when using ZFS.
I do not use zstd compression and do not have any zvols on the pools
that had this issue. So either there are multiple problems or, more
likely, that linked post did not correctly identify the root cause.
I'm guessing it is triggered under load, and perhaps using zstd
compression helps create that load.

I haven't seen it much lately, probably because I've shifted a lot of
my load to lizardfs and I'm using USB3 hard drives for the bulk of my
storage. Since these seem to be ATA errors, removing the SATA host and
its associated drivers from the path may bypass the problem.

I doubt this has anything to do with physical/logical sector size and
partition alignment. The disks should still work correctly if
partitions aren't aligned to physical sectors; they should just
perform worse. In any case, all my drives are aligned on physical
sector boundaries. I'm not familiar enough with ATA to understand
what the actual errors are referring to.

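On the earlier question of how to tell whether partitions are aligned:
one way, sketched below, is to compare each partition's start sector
with the physical block size the kernel reports in sysfs. This is only
a rough sketch. The helper names is_aligned and check_disk are made up
for illustration, and it assumes the usual Linux sysfs layout where
/sys/block/<disk>/<partition>/start is given in 512-byte units.

```python
from pathlib import Path

# sysfs reports partition start offsets in 512-byte units.
SYSFS_SECTOR = 512


def is_aligned(start_sector, physical_block_size):
    """True if a partition starting at start_sector (512-byte units)
    begins on a physical block boundary."""
    return (start_sector * SYSFS_SECTOR) % physical_block_size == 0


def check_disk(disk):
    """Map each partition of a disk such as 'sdc' to True/False.

    Hypothetical helper: reads the values from sysfs, so it only
    works on a Linux system where that disk exists.
    """
    base = Path("/sys/block") / disk
    phys = int((base / "queue" / "physical_block_size").read_text())
    result = {}
    for start_file in sorted(base.glob(f"{disk}*/start")):
        start = int(start_file.read_text())
        result[start_file.parent.name] = is_aligned(start, phys)
    return result
```

Calling check_disk("sdc") on the affected machine would list each
partition with True (aligned) or False. A partition starting at sector
2048 is aligned for a 4096-byte physical block, while the old default
start of sector 63 is not.
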
Here is an example of one of the errors I've had in the past from one
of these situations. A zpool scrub usually clears up any damage, and
then the drive works normally until the issue happens again (which
hasn't happened in quite a while for me now). Below is a dump of the
SMART logs and the kernel ring buffer:

ATA Error Count: 1
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 1 occurred at disk power-on lifetime: 12838 hours (534 days + 22 hours)
When the command that caused the error occurred, the device was
active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
84 51 e0 88 cc c3 06 Error: ICRC, ABRT at LBA = 0x06c3cc88 = 113495176

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
61 00 c0 68 cb c3 40 08 2d+00:45:18.962 WRITE FPDMA QUEUED
60 00 b8 98 67 00 40 08 2d+00:45:18.917 READ FPDMA QUEUED
60 00 b0 98 65 00 40 08 2d+00:45:18.916 READ FPDMA QUEUED
60 00 a8 98 66 00 40 08 2d+00:45:18.916 READ FPDMA QUEUED
61 00 a0 68 ca c3 40 08 2d+00:45:18.879 WRITE FPDMA QUEUED

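The register row in the SMART log can be unpacked mechanically. The
sketch below rebuilds the 28-bit LBA from SN, CL, CH and the low
nibble of DH, and names the bits set in ER and ST. The bit names are
my reading of the classic ATA register layout, so treat the tables as
an assumption rather than a reference. The function names are made up
for illustration.

```python
# Bit names for the ATA error and status registers (assumed from the
# classic ATA register layout; some bits were renamed across revisions).
ERROR_BITS = {7: "ICRC", 6: "UNC", 5: "MC", 4: "IDNF",
              3: "MCR", 2: "ABRT", 1: "NM", 0: "AMNF"}
STATUS_BITS = {7: "BSY", 6: "DRDY", 5: "DF", 4: "DSC",
               3: "DRQ", 2: "CORR", 1: "IDX", 0: "ERR"}


def flags(value, names):
    """List the names of the bits set in value, highest bit first."""
    return [names[bit] for bit in sorted(names, reverse=True)
            if value & (1 << bit)]


def lba28(sn, cl, ch, dh):
    """Combine the taskfile registers into a 28-bit LBA."""
    return ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn


# Register row copied from the SMART log above: ER ST SC SN CL CH DH
row = "84 51 e0 88 cc c3 06"
er, st, sc, sn, cl, ch, dh = (int(x, 16) for x in row.split())
lba = lba28(sn, cl, ch, dh)

print("error :", flags(er, ERROR_BITS))
print("status:", flags(st, STATUS_BITS))
print("LBA   :", hex(lba), lba)
```

This gives ICRC and ABRT for the error register and LBA 113495176,
matching the summary smartctl printed on the same line. ICRC is an
interface CRC error, which points at the link between host and drive
rather than at the data on the platters.
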
[354064.268896] ata6.00: exception Emask 0x11 SAct 0x1000000 SErr 0x480000 action 0x6 frozen
[354064.268907] ata6.00: irq_stat 0x48000008, interface fatal error
[354064.268910] ata6: SError: { 10B8B Handshk }
[354064.268915] ata6.00: failed command: WRITE FPDMA QUEUED
[354064.268919] ata6.00: cmd 61/00:c0:68:cb:c3/07:00:06:01:00/40 tag 24 ncq dma 917504 out
         res 50/00:00:68:cb:c3/00:07:06:01:00/40 Emask 0x10 (ATA bus error)
[354064.268922] ata6.00: status: { DRDY }
[354064.268926] ata6: hard resetting link
[354064.731093] ata6: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[354064.734739] ata6.00: configured for UDMA/133
[354064.734759] sd 5:0:0:0: [sdc] tag#24 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[354064.734764] sd 5:0:0:0: [sdc] tag#24 Sense Key : Illegal Request [current]
[354064.734767] sd 5:0:0:0: [sdc] tag#24 Add. Sense: Unaligned write command
[354064.734771] sd 5:0:0:0: [sdc] tag#24 CDB: Write(16) 8a 00 00 00 00 01 06 c3 cb 68 00 00 07 00 00 00
[354064.734774] print_req_error: I/O error, dev sdc, sector 4408462184
[354064.734791] ata6: EH complete

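The failed write itself can be checked for alignment from the CDB in
the log. The sketch below pulls the starting LBA and block count out
of the Write(16) CDB, tests both against a 4096-byte physical sector,
and decodes the SErr value. It assumes 512-byte logical and 4096-byte
physical sectors (the physical size is a guess for this drive), and
the SError bit names are my reading of what libata prints, so treat
both as assumptions. The function names are made up for illustration.

```python
LOGICAL = 512    # bytes per logical sector (assumption)
PHYSICAL = 4096  # bytes per physical sector (assumption)

# Upper SError bits with the short names libata appears to print.
SERROR_BITS = {16: "PHYRdyChg", 17: "PHYInt", 18: "CommWake",
               19: "10B8B", 20: "Dispar", 21: "BadCRC",
               22: "Handshk", 23: "LinkSeq", 24: "TrStaTrns",
               25: "UnrecFIS", 26: "DevExch"}


def decode_write16(cdb_hex):
    """Return (lba, blocks) from a SCSI Write(16) CDB given as hex."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 16 or cdb[0] != 0x8A:
        raise ValueError("not a Write(16) CDB")
    lba = int.from_bytes(cdb[2:10], "big")      # bytes 2..9
    blocks = int.from_bytes(cdb[10:14], "big")  # bytes 10..13
    return lba, blocks


def decode_serror(value):
    """List the names of the known bits set in an SError value."""
    return [name for bit, name in sorted(SERROR_BITS.items())
            if value & (1 << bit)]


# Values copied from the kernel log above.
lba, blocks = decode_write16(
    "8a 00 00 00 00 01 06 c3 cb 68 00 00 07 00 00 00")
ratio = PHYSICAL // LOGICAL
aligned = lba % ratio == 0 and blocks % ratio == 0

print("LBA", lba, "blocks", blocks, "bytes", blocks * LOGICAL)
print("4K aligned:", aligned)
print("SError:", decode_serror(0x480000))
```

The decoded LBA is 4408462184 and the length is 1792 blocks, or 917504
bytes, which agrees with the "sector" and "ncq dma" figures in the log.
Both are multiples of 8 logical sectors, so under these assumptions the
write was 4K aligned despite the "Unaligned write command" sense text.
The SErr value decodes to 10B8B and Handshk, both link-level errors.
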
--
Rich