On Sat, 16 Sep 2017 10:05:21 -0700, Rich Freeman <rich0@g.o> wrote:

> On Sat, Sep 16, 2017 at 9:43 AM, Kai Krakow <hurikhan77@×××××.com>
> wrote:
> >
> > Actually, I'm running across 3x 1TB here on my desktop, with mraid1
> > and draid 0. Combined with bcache it gives confident performance.
> >
>
> Not entirely sure I'd use the word "confident" to describe a
> filesystem where the loss of one disk guarantees that:
> 1. You will lose data (no data redundancy).
> 2. But the filesystem will be able to tell you exactly what data you
> lost (as metadata will be fine).

I take daily backups with borg backup. A run takes only 15 minutes,
and the backups have been successfully tested twice. The only
breakdowns I had were due to btrfs bugs, not hardware faults.

This is confident enough for my desktop system.


> > I was very happy a long time with XFS but switched to btrfs when it
> > became usable due to compression and stuff. But performance of
> > compression seems to get worse lately, IO performance drops due to
> > hogged CPUs even if my system really isn't that incapable.
> >
>
> Btrfs performance is pretty bad in general right now. The problem is
> that they just simply haven't gotten around to optimizing it fully,
> mainly because they're more focused on getting rid of the data
> corruption bugs (which is of course the right priority). For example,
> with raid1 mode btrfs picks the disk to use for raid based on whether
> the PID is even or odd, without any regard to disk utilization.
>
> When I moved to zfs I noticed a huge performance boost.

Interesting... While I never tried it, I always feared that it would
perform worse unless you throw lots of RAM and a ZIL/L2ARC at it.

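By the way, the even/odd read balancing mentioned above really is just
PID parity. A minimal sketch of the idea (not the actual kernel code):

def pick_mirror(pid, num_copies=2):
    # Roughly the current btrfs raid1 read policy: the copy to read
    # from follows the reader's PID, ignoring queue depth and load.
    return pid % num_copies

So a single-threaded reader keeps hitting the same disk while the
other copy sits idle, which explains part of the performance gap.
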
> Fundamentally I don't see why btrfs can't perform just as well as the
> others. It just isn't there yet.

And it will still take a long time, because the devs keep throwing
new features at it which then need to stabilize.


> > What's still cool is that I don't need to manage volumes since the
> > volume manager is built into btrfs. XFS on LVM was not that
> > flexible. If btrfs wouldn't have this feature, I probably would
> > have switched back to XFS already.
>
> My main concern with xfs/ext4 is that neither provides on-disk
> checksums or protection against the raid write hole.

Btrfs has suffered from the same RAID5 write hole problem for years. I
had always planned on moving to RAID5 later (which is why I have 3
disks), but I fear this won't be fixed any time soon due to design
decisions made too early.

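For anyone following along, the write hole in a nutshell (a toy
example with single-byte chunks, nothing btrfs-specific):

a, b = 0b1010, 0b0110        # two data chunks in one RAID5 stripe
parity = a ^ b               # parity written while the stripe was whole

a = 0b1111                   # chunk A gets rewritten...
# ...and the box crashes before the matching parity update hits disk.

# Later the disk holding B dies and B is rebuilt from A and parity:
b_rebuilt = a ^ parity
print(bin(b_rebuilt), bin(b))  # 0b11 vs 0b110: silently wrong data

Without checksums nothing ever notices that the rebuilt chunk is
garbage.
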
> I just switched motherboards a few weeks ago and either a connection
> or a SATA port was bad because one of my drives was getting a TON of
> checksum errors on zfs. I moved it to an LSI card and scrubbed, and
> while it took forever and the system degraded the array more than once
> due to the high error rate, eventually it patched up all the errors
> and now the array is working without issue. I didn't suffer more than
> a bit of inconvenience but with even mdadm raid1 I'd have had a HUGE
> headache trying to recover from that (doing who knows how much
> troubleshooting before realizing I had to do a slow full restore from
> backup with the system down).

I found md raid not very reliable in the past, but I haven't tried it
again in years, so this may have changed. I just remember that it
destroyed a file system after an unclean shutdown more than once, which
is not what I expect from RAID1. Other servers with file systems on
bare metal survived the same kind of shutdown just fine.


> I just don't see how a modern filesystem can get away without having
> full checksum support. It is a bit odd that it has taken so long for
> Ceph to introduce it, and I'm still not sure if it is truly
> end-to-end, or if at any point in its life the data isn't protected by
> checksums. If I were designing something like Ceph I'd checksum the
> data at the client the moment it enters storage, then independently
> store the checksum and data, and then retrieve both and check it at
> the client when the data leaves storage. Then you're protected
> against corruption at any layer below that. You could of course have
> additional protections to catch errors sooner before the client even
> sees them. I think that the issue is that Ceph was really designed
> for object storage originally and they just figured the application
> would be responsible for data integrity.

I'd at least pass the checksum down through all the layers and re-check
it at each one, so you could detect which transport or layer is broken.

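Something like this is what I mean, as a minimal sketch (the layer
names are made up):

import hashlib

def client_write(data):
    # The checksum is computed once at the client, before anything
    # else touches the data, and travels alongside it from then on.
    return data, hashlib.sha256(data).hexdigest()

def relay(layer, data, csum):
    # Every layer re-verifies the checksum it received; the first hop
    # that fails the check is where the corruption crept in.
    if hashlib.sha256(data).hexdigest() != csum:
        raise IOError("data corrupted arriving at layer: " + layer)
    return data, csum

payload, csum = client_write(b"some payload")
for layer in ("network", "block cache", "disk"):
    payload, csum = relay(layer, payload, csum)
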
> The other benefit of checksums is that if they're done right scrubs
> can go a lot faster, because you don't have to scrub all the
> redundancy data synchronously. You can just start an idle-priority
> read thread on every drive and then pause it anytime a drive is
> accessed, and an access on one drive won't slow down the others. With
> traditional RAID you have to read all the redundancy data
> synchronously because you can't check the integrity of any of it
> without the full set. I think even ZFS is stuck doing synchronous
> reads due to how it stores/computes the checksums. This is something
> btrfs got right.

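The difference boils down to roughly this (a very simplified sketch;
read_block, read_csum and read_chunk are made-up stand-ins, not real
APIs):

import hashlib
from functools import reduce

def scrub_device(read_block, read_csum, nblocks):
    # Checksummed FS: every block is self-describing, so each drive
    # can be scrubbed by its own idle-priority reader, independently.
    return [i for i in range(nblocks)
            if hashlib.sha256(read_block(i)).hexdigest() != read_csum(i)]

def scrub_stripe(read_chunk, nmembers, stripe):
    # Plain parity RAID: a stripe can only be verified by reading the
    # data and parity chunks from *all* members of the set together.
    chunks = [read_chunk(m, stripe) for m in range(nmembers)]
    data, parity = chunks[:-1], chunks[-1]
    xor = reduce(lambda x, y: bytes(i ^ j for i, j in zip(x, y)), data)
    return xor == parity
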
That's one more point that made me decide for btrfs, though I don't
make much use of it currently. I used to do regular scrubs a while ago,
but combined with bcache that is an SSD killer... I killed my old 128G
SSD within one year, although I used overprovisioning. Well, I didn't
actually kill it: I swapped it out at 99% of its rated lifetime
according to smartctl. It would probably still have worked for a long
time under normal workloads.


> >> For the moment I'm
> >> relying more on zfs.
> >
> > How does it perform memory-wise? Especially, I'm currently using
> > bees[1] for deduplication: It uses a 1G memory mapped file (you can
> > choose other sizes if you want), and it picks up new files really
> > fast, within a minute. I don't think zfs can do anything like that
> > within the same resources.
>
> I'm not using deduplication, but my understanding is that zfs
> deduplication:
> 1. Works just fine.

No doubt...

> 2. Uses a TON of RAM.

That's the problem. And I don't think there is any near-line dedup tool
available for it?


> So, it might not be your cup of tea. There is no way to do
> semi-offline dedup as with btrfs (not really offline in that the
> filesystem is fully running - just that you periodically scan for dups
> and fix them after the fact, vs detect them in realtime). With a
> semi-offline mode then the performance hits would only come at a time
> of my choosing, vs using gobs of RAM all the time to detect what are
> probably fairly rare dups.

I'm using bees, and I'd call it near-line. Changes to files are picked
up at commit time, when a new generation is made; it then walks the new
extents, maps them back to files, and deduplicates the blocks. I was
surprised how fast it detects new duplicate blocks. But it is still
working through the rest of the file system (it has been at it for
days), though with little impact on performance. Giving up 1G of RAM
for this is totally okay.

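In rough terms I picture a scan pass like this (a sketch of how I
understand it, with made-up helpers; as far as I know the real tool
walks new extents via btrfs tree-search ioctls and dedupes through the
extent-same ioctl):

import hashlib

def bees_pass(fs, last_gen, hash_table):
    new_gen = fs.current_generation()
    for extent in fs.extents_newer_than(last_gen):   # only new data
        for offset, block in extent.blocks():
            digest = hashlib.sha256(block).digest()
            match = hash_table.lookup(digest)        # the 1G mmap'd table
            if match is not None and match.data == block:
                fs.dedupe(src=match, dst=extent, offset=offset)
            else:
                hash_table.insert(digest, extent, offset)
    return new_gen                                   # resume point next time
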
Once it has finished its first full scan, I'm thinking about starting
it at timed intervals instead. But it looks like the impact will be so
low that I can just keep it running all the time. Using cgroups to
limit its CPU and IO shares works really well.

I still haven't evaluated how it interferes with defragmentation,
though, or how big the impact of bees fragmenting extents is.


> That aside, I find it works fine memory-wise (I don't use dedup). It
> has its own cache system not integrated fully into the kernel's native
> cache, so it tends to hold on to a lot more ram than other
> filesystems, but you can tune this behavior so that it stays fairly
> tame.

I think the reasoning behind using its own caching is that block
caching at the VFS layer just cannot be done efficiently for a CoW file
system with scrubbing and everything. You need good cache hinting
throughout the whole pipeline, and that is only slowly being integrated
into the kernel.

E.g., when btrfs does a CoW operation, bcache doesn't get notified that
it can discard the now-freed block from its cache. I don't know if this
is handled in the kernel cache layer...

--
Regards,
Kai

Replies to list-only preferred.