On 22/05/2020 18:20, Rich Freeman wrote:
> On Fri, May 22, 2020 at 12:47 PM antlists <antlists@××××××××××××.uk> wrote:
>>
>> What puzzles me (or rather, it doesn't, it's just cost cutting), is why
>> you need a *dedicated* cache zone anyway.
>>
>> Stick a left-shift register between the LBA track and the hard drive,
>> and by switching this on you write to tracks 2,4,6,8,10... and it's a
>> CMR zone. Switch the register off and it's an SMR zone writing to all
>> tracks.
>
> Disclaimer: I'm not a filesystem/DB design expert.
>
> Well, I'm sure the zones aren't just 2 tracks wide, but that is worked
> around easily enough. I don't see what this gets you though. If
> you're doing sequential writes you can do them anywhere as long as
> you're doing them sequentially within any particular SMR zone. If
> you're overwriting data then it doesn't matter how you've mapped them
> with a static mapping like this, you're still going to end up with
> writes landing in the middle of an SMR zone.
|
Let's assume each shingled track overwrites half the previous write.
Let's also assume a shingled zone is 2GB in size. My method converts
that into a 1GB CMR zone, because we're only writing to every second track.
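Something like this, purely as an illustration (the names, the mode flag
and the zone size in tracks are my assumptions, not anything a real drive
exposes):

/* Illustrative only: map a zone-relative logical track to a physical
 * track. In "CMR mode" we left-shift by one, so only every second
 * track is written and nothing gets shingled over; the zone's capacity
 * halves (2GB -> 1GB). In "SMR mode" the mapping is 1:1. */
#include <stdint.h>
#include <stdbool.h>

#define TRACKS_PER_ZONE 1024u   /* assumed zone size, in tracks */

static uint32_t map_track(uint32_t logical_track, bool cmr_mode)
{
    return cmr_mode ? (logical_track << 1) : logical_track;
}

static uint32_t usable_tracks(bool cmr_mode)
{
    /* half the tracks are usable while the zone is acting as CMR cache */
    return cmr_mode ? TRACKS_PER_ZONE / 2 : TRACKS_PER_ZONE;
}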
|
I don't know how these drives cache their writes before re-organising,
but this means that ANY disk zone can be used as cache, rather than
having a (too small?) dedicated zone...
|
So what you could do is allocate one zone of CMR to every four or five
zones of SMR, and just reshingle each SMR zone as the CMR fills up. The
important point is that zones can switch roles: from CMR cache, to SMR
filling up, to full SMR zones decaying as they are re-written.
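As a rough sketch of the lifecycle I'm imagining (all the type names and
the capacity juggling below are made up for illustration):

#include <stddef.h>

/* Hypothetical zone roles - a zone cycles through these over its life. */
enum zone_role {
    ZONE_CMR_CACHE,     /* every-second-track mode, absorbing random writes */
    ZONE_SMR_FILLING,   /* being streamed/reshingled sequentially */
    ZONE_SMR_FULL,      /* fully shingled, slowly decaying as blocks go stale */
};

struct zone {
    enum zone_role role;
    size_t used;        /* live data in the zone */
    size_t capacity;    /* roughly 1GB as CMR cache, 2GB as SMR */
};

/* When a CMR cache zone fills, reshingle it and press a decayed SMR
 * zone into service as the next cache zone. */
static void on_cache_full(struct zone *cache, struct zone *victim)
{
    cache->role = ZONE_SMR_FILLING;   /* rewrite its contents shingled */
    cache->capacity *= 2;             /* all tracks usable again */

    victim->role = ZONE_CMR_CACHE;    /* repurposed as cache */
    victim->used = 0;
    victim->capacity /= 2;            /* only every second track now */
}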
>
>> The other thing is, why can't you just stream writes to a SMR zone,
>> especially if we try and localise writes so lets say all LBAs in Gig 1
>> go to the same zone ... okay - if we run out of zones to re-shingle to,
>> then the drive is going to grind to a halt, but it will be much less
>> likely to crash into that barrier in the first place.
>
> I'm not 100% following you, but if you're suggesting remapping all
> blocks so that all writes are always sequential, like some kind of
> log-based filesystem, your biggest problem here is going to be
> metadata. Blocks logically are only 512 bytes, so there are a LOT of
> them. You can't just freely remap them all because then you're going
> to end up with more metadata than data.
>
> I'm sure they are doing something like that within the cache area,
> which is fine for short bursts of writes, but at some point you need
> to restructure that data so that blocks are contiguous or otherwise
> following some kind of pattern so that you don't have to literally
> remap every single block.
|
Which is why I'd break it down to maybe 2GB zones. If the zone streams
writes as it fills, but is then re-organised and re-written properly when
time permits, you haven't got too much metadata. You need a btree to work
out where each zone is stored, then each zone has a btree to say where
its blocks are stored. Oh - and these drives are probably 4K blocks
only - most new drives are.
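Roughly this shape, just to illustrate the two levels (plain arrays stand
in for the btrees here, and the sizes are my assumptions):

#include <stdint.h>

#define BLOCK_SIZE      4096u                        /* 4K blocks assumed */
#define ZONE_SIZE       (2ull * 1024 * 1024 * 1024)  /* 2GB zone assumed */
#define BLOCKS_PER_ZONE (ZONE_SIZE / BLOCK_SIZE)

/* Per-zone map: which slot inside the physical zone each logical block
 * occupies. In practice this would be the per-zone btree. */
struct zone_map {
    uint32_t physical_zone;
    uint32_t block_slot[BLOCKS_PER_ZONE];
};

/* Top-level map: where each logical zone currently lives.
 * In practice this would be the top-level btree. */
struct drive_map {
    struct zone_map *zones;
};

/* Translate a logical block address into a physical block address. */
static uint64_t lookup(const struct drive_map *m, uint64_t lba)
{
    uint64_t zone = lba / BLOCKS_PER_ZONE;
    uint64_t off  = lba % BLOCKS_PER_ZONE;
    const struct zone_map *z = &m->zones[zone];

    return (uint64_t)z->physical_zone * BLOCKS_PER_ZONE + z->block_slot[off];
}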
|
> Now, they could still reside in different
> locations, so maybe some sequential group of blocks are remapped, but
> if you have a write to one block in the middle of a group you need to
> still read/rewrite all those blocks somewhere. Maybe you could use a
> COW-like mechanism like zfs to reduce this somewhat, but you still
> need to manage blocks in larger groups so that you don't have a ton of
> metadata.
|
The problem with drives at the moment is that they run out of CMR cache,
so they have to rewrite all those blocks WHILE THE USER IS STILL WRITING.
The point of my idea is that they can repurpose disk space as SMR or CMR
as required, so they don't run out of cache at the wrong time ...
|
Yes, metadata may bloom under pressure, but give the drives a break and
they can grab a new zone, do an ordered SMR stream, and shrink the metadata.
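Something like this idle-time cleanup, again purely illustrative (the
function and constant are hypothetical, and real firmware would obviously
be moving the data as well):

#include <stdint.h>

#define BLOCKS_PER_ZONE 524288u   /* 2GB zone / 4K blocks, as above */

/* slot[i] records where logical block i currently sits in the old,
 * fragmented zone. Streaming the zone back out in logical order means
 * block i ends up at slot i, so the per-zone map collapses to an
 * identity mapping and its metadata can be dropped. */
static void reshingle_zone(uint32_t slot[])
{
    for (uint32_t i = 0; i < BLOCKS_PER_ZONE; i++) {
        /* real firmware would copy the data from slot[i] to i here */
        slot[i] = i;
    }
}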
>
> With host-managed SMR this is much less of a problem because the host
> can use extents/etc to reduce the metadata, because the host already
> needs to map all this stuff into larger structures like
> files/records/etc. The host is already trying to avoid having to
> track individual blocks, so it is counterproductive to re-introduce
> that problem at the block layer.
>
> Really the simplest host-managed SMR solution is something like f2fs
> or some other log-based filesystem that ensures all writes to the disk
> are sequential. Downside to flash-based filesystems is that they can
> disregard fragmentation on flash, but you can't disregard that for an
> SMR drive because random disk performance is terrible.
|
Which is why you have small(ish) zones so logically close writes are
hopefully physically close as well ...
>
>> Even better, if we have two independent heads, we could presumably
>> stream updates using one head, and re-shingle with the other. But that's
>> more cost ...
>
> Well, sure, or if you're doing things host-managed then you stick the
> journal on an SSD and then do the writes to the SMR drive
> opportunistically. You're basically describing a system where you
> have independent drives for the journal and the data areas. Adding an
> extra head on a disk (or just having two disks) greatly improves
> performance, especially if you're alternating between two regions
> constantly.
>
Except I'm describing a system where journal and data areas are
interchangeable :-)
|
Cheers,
Wol