Gentoo Archives: gentoo-user

From: antlists <antlists@××××××××××××.uk>
To: gentoo-user@l.g.o
Subject: Re: [gentoo-user] Seagate ST8000NM0065 PMR or SMR plus NAS SAS SATA question
Date: Fri, 22 May 2020 18:08:52
In Reply to: Re: [gentoo-user] Seagate ST8000NM0065 PMR or SMR plus NAS SAS SATA question by Rich Freeman
On 22/05/2020 18:20, Rich Freeman wrote:
> On Fri, May 22, 2020 at 12:47 PM antlists <antlists@××××××××××××.uk> wrote:
>>
>> What puzzles me (or rather, it doesn't, it's just cost cutting), is why
>> you need a *dedicated* cache zone anyway.
>>
>> Stick a left-shift register between the LBA track and the hard drive,
>> and by switching this on you write to tracks 2,4,6,8,10... and it's a
>> CMR zone. Switch the register off and it's an SMR zone writing to all
>> tracks.
>
> Disclaimer: I'm not a filesystem/DB design expert.
>
> Well, I'm sure the zones aren't just 2 tracks wide, but that is worked
> around easily enough. I don't see what this gets you though. If
> you're doing sequential writes you can do them anywhere as long as
> you're doing them sequentially within any particular SMR zone. If
> you're overwriting data then it doesn't matter how you've mapped them
> with a static mapping like this, you're still going to end up with
> writes landing in the middle of an SMR zone.
Let's assume each shingled track overwrites half the previous write.
Let's also assume a shingled zone is 2GB in size. My method converts
that into a 1GB CMR zone, because we're only writing to every second track.

I don't know how these drives cache their writes before re-organising,
but this means that ANY disk zone can be used as cache, rather than
having a (too small?) dedicated zone...
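A minimal sketch of the left-shift idea (the zone geometry is invented for illustration; the shift just spreads logical tracks onto every second physical track, so no shingled neighbour gets clobbered):

```python
ZONE_TRACKS = 1000  # hypothetical number of physical tracks per zone


def physical_track(logical_track: int, cmr_mode: bool) -> int:
    """Map a logical track within a zone to a physical track.

    In CMR mode the left shift writes only every second physical
    track, so overwriting a track never damages its neighbour; in
    SMR mode tracks are packed, doubling capacity.
    """
    return (logical_track << 1) if cmr_mode else logical_track


# Only half the tracks are addressable in CMR mode, halving capacity,
# which is how a 2GB SMR zone becomes a 1GB CMR zone:
cmr_capacity_tracks = ZONE_TRACKS // 2
```

The same register setting then works on any zone, which is what makes every zone a potential cache zone.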
So what you could do is allocate one zone of CMR to every four or five
zones of SMR, and just re-shingle each SMR zone as the CMR fills up. The
important point is that zones can switch roles: from CMR cache, to SMR
filling up, to a full SMR zone decaying as it is re-written.
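The zone lifecycle above could be sketched as a small state machine (the state names and transitions are my own labels for the scheme, not anything a real firmware exposes):

```python
from enum import Enum, auto


class ZoneState(Enum):
    FREE = auto()         # emptied, ready to take on either role
    CMR_CACHE = auto()    # every-second-track mode, absorbing random writes
    SMR_FILLING = auto()  # being rewritten sequentially at full density
    SMR_FULL = auto()     # packed shingled data, decaying as blocks are rewritten


# Hypothetical legal role changes for the interchangeable-zone scheme:
TRANSITIONS = {
    ZoneState.FREE: {ZoneState.CMR_CACHE, ZoneState.SMR_FILLING},
    ZoneState.CMR_CACHE: {ZoneState.SMR_FILLING},   # re-shingle the cached data
    ZoneState.SMR_FILLING: {ZoneState.SMR_FULL},
    ZoneState.SMR_FULL: {ZoneState.FREE},           # data migrated elsewhere
}


def can_switch(current: ZoneState, wanted: ZoneState) -> bool:
    return wanted in TRANSITIONS[current]
```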
>
>> The other thing is, why can't you just stream writes to a SMR zone,
>> especially if we try and localise writes so lets say all LBAs in Gig 1
>> go to the same zone ... okay - if we run out of zones to re-shingle to,
>> then the drive is going to grind to a halt, but it will be much less
>> likely to crash into that barrier in the first place.
>
> I'm not 100% following you, but if you're suggesting remapping all
> blocks so that all writes are always sequential, like some kind of
> log-based filesystem, your biggest problem here is going to be
> metadata. Blocks logically are only 512 bytes, so there are a LOT of
> them. You can't just freely remap them all because then you're going
> to end up with more metadata than data.
>
> I'm sure they are doing something like that within the cache area,
> which is fine for short bursts of writes, but at some point you need
> to restructure that data so that blocks are contiguous or otherwise
> following some kind of pattern so that you don't have to literally
> remap every single block.
Which is why I'd break it down to maybe 2GB zones. If the zone streams
as it fills, but is then re-organised and re-written properly when time
permits, you haven't got too large a chunk of metadata. You need a btree
to work out where each zone is stored, then each zone has a btree to say
where its blocks are stored. Oh - and these drives are probably 4K blocks
only - most new drives are.
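The two-level lookup could be sketched like this (plain dicts standing in for the on-disk btrees; the geometry numbers are assumptions, not real drive parameters):

```python
BLOCK_SIZE = 4096            # assume 4K-native blocks
ZONE_BYTES = 2 * 1024 ** 3   # 2GB logical zones
BLOCKS_PER_ZONE = ZONE_BYTES // BLOCK_SIZE

# First level: logical zone number -> (physical zone, per-zone block map).
zone_map = {}


def lookup(lba: int) -> tuple:
    """Resolve a logical block address through both mapping levels."""
    zone_no, offset = divmod(lba, BLOCKS_PER_ZONE)
    phys_zone, block_map = zone_map[zone_no]
    # Second level: blocks rewritten since the last re-shingle are
    # remapped; everything else sits at its natural offset.
    return phys_zone, block_map.get(offset, offset)
```

With 2GB zones the first-level tree stays tiny (a few thousand entries for an 8TB drive), and the per-zone trees only grow for blocks that have actually been rewritten since the zone was last streamed out.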
> Now, they could still reside in different
> locations, so maybe some sequential group of blocks are remapped, but
> if you have a write to one block in the middle of a group you need to
> still read/rewrite all those blocks somewhere. Maybe you could use a
> COW-like mechanism like zfs to reduce this somewhat, but you still
> need to manage blocks in larger groups so that you don't have a ton of
> metadata.
The problem with drives at the moment is they run out of CMR cache, so
they have to rewrite all those blocks WHILE THE USER IS STILL WRITING.
The point of my idea is that they can repurpose disk as SMR or CMR as
required, so they don't run out of cache at the wrong time ...

Yes, metadata may bloom under pressure, but give the drives a break and
they can grab a new zone, do an SMR ordered stream, and shrink the metadata.
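One way to picture "don't run out of cache at the wrong time": repurpose a free zone as extra cache before the destage deadline hits, rather than stalling foreground writes. A toy sketch (the threshold and zone-state strings are invented):

```python
def pick_cache_zone(zones: dict, cache_fill: float, threshold: float = 0.8):
    """Return a zone id to convert to CMR cache when the existing cache
    is nearly full, so destaging never has to block user writes.

    `zones` maps zone id -> state string ('free', 'cache' or 'smr');
    `cache_fill` is the current cache utilisation in [0, 1].
    """
    if cache_fill < threshold:
        return None                 # plenty of headroom, do nothing
    for zid, state in sorted(zones.items()):
        if state == "free":
            return zid              # repurpose this zone as extra cache
    return None                     # truly out of zones: must destage now
```

Only when every zone is already in use does the drive hit the grind-to-a-halt case, instead of hitting it whenever a fixed-size cache fills.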
>
> With host-managed SMR this is much less of a problem because the host
> can use extents/etc to reduce the metadata, because the host already
> needs to map all this stuff into larger structures like
> files/records/etc. The host is already trying to avoid having to
> track individual blocks, so it is counterproductive to re-introduce
> that problem at the block layer.
>
> Really the simplest host-managed SMR solution is something like f2fs
> or some other log-based filesystem that ensures all writes to the disk
> are sequential. Downside to flash-based filesystems is that they can
> disregard fragmentation on flash, but you can't disregard that for an
> SMR drive because random disk performance is terrible.
Which is why you have small(ish) zones, so logically close writes are
hopefully physically close as well ...
>
>> Even better, if we have two independent heads, we could presumably
>> stream updates using one head, and re-shingle with the other. But that's
>> more cost ...
>
> Well, sure, or if you're doing things host-managed then you stick the
> journal on an SSD and then do the writes to the SMR drive
> opportunistically. You're basically describing a system where you
> have independent drives for the journal and the data areas. Adding an
> extra head on a disk (or just having two disks) greatly improves
> performance, especially if you're alternating between two regions
> constantly.
>
Except I'm describing a system where journal and data areas are
interchangeable :-)

Cheers,
Wol