Rich Freeman <rich0@g.o> wrote:

> On Sun, Jun 22, 2014 at 7:44 AM, Kai Krakow <hurikhan77@×××××.com> wrote:
>> I don't see where you could lose the volume management features. You just
>> add device on top of the bcache device after you initialized the raw
>> device with a bcache superblock and attached it. The rest works the same,
>> just that you use bcacheX instead of sdX devices.
>
> Ah, didn't realize you could attach/remove devices to bcache later.
> Presumably it handles device failures gracefully, ie exposing them to
> the underlying filesystem so that it can properly recover?

I'm not sure whether multiple partitions can share the same cache device
partition, but more or less that's it: initialize bcache, attach your
backing devices, then add those bcache devices to your btrfs.
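
For the archives, a rough sketch of what that could look like (device names
are placeholders only, and the cache set UUID is the one reported by
bcache-super-show):

  # make a backing device from the HDD partition, a cache device from the SSD
  make-bcache -B /dev/sdb1        # shows up as /dev/bcache0 once registered
  make-bcache -C /dev/sdc1        # cache set on the SSD
  bcache-super-show /dev/sdc1     # note the cset.uuid

  # attach the backing device to the cache set
  echo <cset.uuid> > /sys/block/bcache0/bcache/attach

  # use the bcache device instead of the raw partition
  mkfs.btrfs /dev/bcache0             # new filesystem, or...
  btrfs device add /dev/bcache0 /mnt  # ...add it to an existing one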
|
I don't know how errors are handled, though. But as with every caching
technique (even in ZFS), your data is likely toast if the cache device dies
in the middle of action. Thus, you should put bcache on LVM RAID if you are
going to use it for write caching (i.e. write-back mode). Read caching
should be okay (write-through mode). Bcache is a little slower than other
flash-cache implementations because it only reports data as written back to
the FS once it has reached stable storage (which can be the cache device,
though, if you are using write-back mode). It was also designed with
unexpected reboots in mind: it will replay transactions from its log on
reboot. This means you can have inconsistent data on the raw backing
device, which is why you should never try to use that directly, e.g. from a
rescue disk. But since bcache wraps the partition in its own superblock,
this mistake should be impossible.
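
If I read the bcache documentation right, the mode can be switched at
runtime via sysfs, roughly like this (assuming the device registered as
bcache0):

  # write-through: writes only confirmed once they hit the backing device
  echo writethrough > /sys/block/bcache0/bcache/cache_mode
  # write-back: writes confirmed from the SSD, dirty data flushed later
  echo writeback > /sys/block/bcache0/bcache/cache_mode
  # reports "clean" or "dirty" depending on whether dirty data sits on the cache
  cat /sys/block/bcache0/bcache/state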
|
I'm not sure how gracefully device failures are handled. I suppose in
write-back mode you can get into trouble because it's too late for bcache
to tell the FS about a write error when it has already confirmed that
stable storage was hit. Maybe it will just keep the data around so you can
swap devices, or it will report the error the next time data is written to
that location. It probably interferes with btrfs RAID logic on that matter.
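
If you needed to swap the cache device out, my understanding (untested) is
that you would first flush the dirty data and then detach, along these
lines:

  # stop accumulating dirty data, then wait for the state to become "clean"
  echo writethrough > /sys/block/bcache0/bcache/cache_mode
  cat /sys/block/bcache0/bcache/state
  # detach the backing device from its cache set
  echo 1 > /sys/block/bcache0/bcache/detach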
|
> The only problem with doing stuff like this at a lower level (both
> write and read caching) is that it isn't RAID-aware. If you write
> 10GB of data, you use 20GB of cache to do it if you're mirrored,
> because the cache doesn't know about mirroring.

Yes, it will write double the data to the cache then - but only if btrfs
actually read both copies (which it probably does not, because it has
checksums and does not need to compare data; and let's just ignore the case
that another process reads the same data from the other RAID member later,
that should be optimized away by the OS page cache). Otherwise both caches
should work pretty much individually, each with its own set of data,
depending on how btrfs uses each device. Remember that btrfs RAID is not a
block-based RAID where block locations match 1:1 on each device. Btrfs RAID
can place the two mirrors of a piece of data in completely different
locations on each member device (which is actually a good thing in case
block errors accumulate in specific locations for a "faulty" disk model).
In the case of write caching it will of course cache double the data
(because both members are written to). But I think that's okay for the same
reasons, except it will wear your cache device faster. In that case I
suggest using individual SSDs for each btrfs member device anyway. It's not
optimal, I know. It would be useful to see some best practices and
pros/cons on that topic (an individual cache device per btrfs member vs.
bcache partitions on one LVM RAID shared by all members). I think the best
strategy depends on whether you are write-mostly or read-mostly.
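
The "one SSD per member" variant should be fairly simple, if I'm not
mistaken - make-bcache can create and attach backing and cache device in
one go (device names are again placeholders):

  # each btrfs member gets its own cache SSD
  make-bcache -B /dev/sda2 -C /dev/sdc1   # member 1 -> e.g. /dev/bcache0
  make-bcache -B /dev/sdb2 -C /dev/sdd1   # member 2 -> e.g. /dev/bcache1
  # btrfs RAID1 across the two cached devices
  mkfs.btrfs -d raid1 -m raid1 /dev/bcache0 /dev/bcache1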
|
Thanks for mentioning it. Interesting thoughts. ;-)
|
> Offhand I'm not sure
> if there are any performance penalties as well around the need for
> barriers/etc with the cache not being able to be relied on to do the
> right thing in terms of what gets written out - also, the data isn't
> redundant while it is on the cache, unless you mirror the cache.

This is partially what I outlined above. I think in the case of write
caching no barrier pass-through is needed. Bcache will confirm the barriers
and that's all the FS needs to know (because bcache sits underneath the FS,
all requests go through the bcache layer, with no direct access to the
backing device). Of course, it's then bcache's job to ensure everything
gets written out correctly in the background (whenever it feels like doing
so). But it can use its own write barriers to ensure that for the
underlying device - that's nothing the FS has to care about. Performance
should be faster anyway because, well, you are writing to a faster device -
that is what bcache is all about, isn't it? ;-)
|
I don't think write barriers are needed for read caching, at least not
from the point of view of the FS. The caching layer, though, will use them
internally for its own caching structures. Whether that has a bad effect on
performance probably depends on the implementation, but my intuition says:
no performance impact, because putting read data into the cache can be
deferred and written out in the background (write-behind).
|
> Granted, if you're using it for write intent logging then there isn't
> much getting around that.

Well, sure, for bcache. But I think in the case of FS-internal write
caching devices that case could be handled gracefully (the approach you'd
prefer). Since in the internal case the cache has knowledge of the FS bad
block handling, it can just retry writing the data to another
location/disk, or keep it around until the admin has fixed the problem with
the backing device.
|
BTW: SSD firmwares usually suffer from similar problems as outlined above,
because they do writes in the background after they have already confirmed
persistence to the OS layer. This is why SSD failures are usually much more
severe than HDD failures. Do some research and you should find tests about
that topic; especially consumer SSD firmwares have a big problem with that.
So I'm not sure it should really be bcache's job to fix that particular
problem. You should just ensure good firmware and proper failure protection
at the hardware level if you want to do fancy caching stuff - the FTL
should be able to hide those problems before the whole thing explodes, and
report errors before it can no longer ensure correct persistence. I suppose
that is also a detail where enterprise-grade SSDs behave differently. HDDs
have related issues (SATA vs. enterprise SCSI vs. SAS, keywords: IO
timeouts and bad blocks, and why you should not use consumer hardware for
RAID). I think all the same holds true for ZFS.
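
That's the ERC/TLER story: consumer drives may retry a bad sector for
minutes while the RAID layer gives up on the whole disk. If the drive
supports it, smartctl can bound the recovery time, e.g.:

  # query error recovery control support
  smartctl -l scterc /dev/sda
  # limit read/write error recovery to 7.0 seconds (values in tenths of a second)
  smartctl -l scterc,70,70 /dev/sda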
|
>> Having to prepare devices for bcache is kind of a show-stopper because it
>> is no drop-in component that way. But OTOH I like that approach better
>> than dm- cache because it protects from using the backing device without
>> going through the caching layer which could otherwise severely damage
>> your data, and you get along with fewer devices and don't need to size a
>> meta device (which probably needs to grow later if you add devices, I
>> don't know).
>
> And this is the main thing keeping me away from it. It is REALLY
> painful to migrate to/from. Having it integrated into the filesystem
> delivers all the same benefits of not being able to mount it without
> the cache present.

The migration pain is what currently keeps me away, too. Otherwise I would
just buy one of those fancy new cheap but still speedy Crucial SSDs and
"just enable" bcache... :-\
|
> Now excuse me while I go fix my btrfs (I tried re-enabling snapper and
> it again got the filesystem into a worked-up state after trying to
> clean up half a dozen snapshots at the same time - it works fine until
> you go and try to write a lot of data to it, then it stops syncing
> though you don't necessarily notice until a few hours later when the
> write cache exhausts RAM and on reboot your disk reverts back a few
> hours). I suspect that if I just treat it gently for a few hours
> btrfs will clean up the mess and it will work normally again, but the
> damage apparently persists after a reboot if you go heavy in the disk
> too quickly...

You should report that to the btrfs list. You could try "echo w >
/proc/sysrq-trigger" and look at the blocked-processes list in dmesg
afterwards. I'm sure one important btrfs thread will be in blocked state
then...
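
I.e. something like this (sysrq may need to be enabled first):

  echo 1 > /proc/sys/kernel/sysrq    # allow all sysrq functions
  echo w > /proc/sysrq-trigger       # dump tasks in uninterruptible (blocked) state
  dmesg | tail -n 100                # look for btrfs worker threads stuck in D state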

--
Replies to list only preferred.