Gentoo Archives: gentoo-user

From: Kai Krakow <hurikhan77@×××××.com>
To: gentoo-user@l.g.o
Subject: [gentoo-user] Re: {OT} Allow work from home?
Date: Sun, 06 Mar 2016 12:18:22
Message-Id: 20160305161610.022bd42e@jupiter.sol.kaishome.de
In Reply to: Re: [gentoo-user] Re: {OT} Allow work from home? by lee
On Sat, 05 Mar 2016 00:52:09 +0100,
lee <lee@××××××××.de> wrote:

> >> > It uses some very clever ideas to place files into groups and
> >> > into proper order - other than using file mod and access times
> >> > like other defrag tools do (which even make the problem worse by
> >> > doing so because this destroys locality of data even more).
> >>
> >> I've never heard of MyDefrag, I might try it out. Does it make
> >> updating any faster?
> >
> > Ah well, difficult question... Short answer: It uses countermeasures
> > against performance decreasing too fast after updates. It does this
> > by using a "gapped" on-disk file layout - leaving some gaps for
> > Windows to put temporary files. This way, files don't become as far
> > spread out as they usually do during updates. But yes, it improves
> > installation time.
>
> What difference would that make with an SSD?

Well, there's a good chance those gaps are trimmed erase blocks, so
they can be served fast by the SSD firmware. Of course, the same
applies if your OS uses discard commands to mark free blocks and you
still have enough free space in the FS. So, actually, for SSDs it
probably makes no difference.
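
If you want to check that discard is actually in use on a given
system, a quick sketch (the mount point is just an example, and
fstrim only works on filesystems that support it):

    # trim all free space of a mounted filesystem once, verbosely
    fstrim -v /

Running that periodically (e.g. from a cron job or the fstrim systemd
timer) is usually preferred over the "discard" mount option, which
sends a trim for every single deletion.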

> > Apparently it has been unmaintained for a few years but it still
> > does a good job. It was built upon a theory by a student about how
> > to properly reorganize file layout on a spinning disk to keep
> > performance as high as possible.
>
> For spinning disks, I can see how it can be beneficial.

My comment was targeted at spinning disks.

> >> > But even SSDs can use _proper_ defragmentation from time to time
> >> > for increased lifetime and performance (this is due to how the
> >> > FTL works and because erase blocks are huge, I won't get into
> >> > detail unless someone asks). This is why mydefrag also supports
> >> > flash optimization. It works by moving as few files as possible
> >> > while coalescing free space into big chunks which in turn relaxes
> >> > pressure on the FTL and allows more free and contiguous erase
> >> > blocks, which reduces early flash chip wear. A filled SSD with a
> >> > long usage history can certainly gain back some performance from
> >> > this.
> >>
> >> How does it improve performance? It seems to me that, for
> >> practical use, almost all of the better performance with SSDs is
> >> due to reduced latency. And IIUC, it doesn't matter for the
> >> latency where data is stored on an SSD. If its performance
> >> degrades over time when data is written to it, the SSD sucks, and
> >> the manufacturer should have done a better job. Why else would I
> >> buy an SSD? If it needs to reorganise the data stored on it, the
> >> firmware should do that.
> >
> > There are different factors which have an impact on performance,
> > not just seek times (which, as you write, are the worst performance
> > breaker):
> >
> > * management overhead: the OS has to do more housekeeping, which
> >   (a) introduces more IOPS (which is the only relevant limiting
> >   factor for SSDs) and (b) introduces more CPU cycles and data
> >   structure locking within the OS routines while performing IO,
> >   which comes down to more CPU cycles spent during IO
>
> How would that be reduced by defragmenting an SSD?

FS structures are coalesced back into simpler structures by
defragmenting, e.g. btrfs creates a huge overhead by splitting extents
due to its COW nature. Doing a defrag here combines them back into
fewer extents. It's reported on the btrfs list that this CAN make a big
difference even for SSDs, though usually you only see the performance
loss with heavily fragmented files like VM images - so the
recommendation there is to set those files nocow.
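
For what it's worth, a rough sketch of both measures (paths are only
examples, and chattr +C only takes effect for new, empty files - the
usual trick is to set it on the directory so new files inherit it):

    # recombine the fragmented extents of an existing image
    btrfs filesystem defragment -v /var/lib/libvirt/images/win7.img

    # new files created in this directory will be nocow
    chattr +C /var/lib/libvirt/images

Note that defragmenting on btrfs breaks reflinks/snapshot sharing, so
it can temporarily increase space usage.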

> > * erasing a block is where SSDs really suck performance-wise,
> >   plus blocks are essentially read-only once written - that's how
> >   flash works, a flash data block needs to be erased prior to being
> >   rewritten - and that is (compared to the rest of its
> >   performance) a really REALLY HUGE time factor
>
> So let the SSD do it when it's idle. For applications in which it
> isn't idle enough, an SSD won't be the best solution.

That's probably true - I hadn't thought of that.

> > * erase blocks are huge compared to common filesystem block sizes
> >   (erase block = 1 or 2 MB vs. file system blocks being 4-64k
> >   usually), which happens to result in this effect:
> >
> >   - the OS replaces a file by writing a new one and deleting the
> >     old one (common during updates), or the user deletes files
> >   - the OS marks some blocks as free in its FS structures; it
> >     depends on the file size and its fragmentation whether this
> >     gives you a contiguous area of free blocks or many small
> >     blocks scattered across the disk: it results in free space
> >     fragmentation
> >   - free space fragments tend to become small over time, much
> >     smaller than the erase block size
> >   - if your system has TRIM/discard support it will tell the SSD
> >     firmware: here, I no longer use those 4k blocks
> >   - as you already figured out: those small blocks marked as free
> >     do not properly align with the erase block size - so you may
> >     end up with a lot of free space but essentially no complete
> >     erase block marked as free
>
> Use smaller erase blocks.

It's a hardware limitation - and it's probably not going to change. I
think erase blocks will become even bigger as capacities increase.
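
To put rough numbers on it (assuming a 2 MiB erase block and 4 KiB
filesystem blocks): 2 MiB / 4 KiB = 512 filesystem blocks share one
erase block, so a single 4 KiB block that is still in use is enough
to keep the whole 2 MiB erase block from being reclaimed.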

> >   - this situation means: the SSD firmware cannot reclaim this
> >     free space to do "free block erasure" in advance, so if you
> >     write another small block of data you may end up with the SSD
> >     going into a direct "read/modify/erase/write" cycle instead of
> >     just "read/modify/write" and deferring the erasing until later
> >     - ah yes, that's probably becoming slow then
> >   - what do we learn: (a) defragment free space from time to time,
> >     (b) enable TRIM/discard to reclaim blocks in advance, (c) you
> >     may want to over-provision your SSD: just don't ever use
> >     10-15% of your SSD, trim that space, and leave it there for
> >     the firmware to shuffle erase blocks around
>
> Use better firmware for SSDs.

This is a technical limitation. I don't think there's anything a
firmware could improve here - except by using internal
overprovisioning and bigger caches to defer this to the idle
background - but see your comment above regarding idle time.

A problem that goes hand in hand with this: If your SSD firmware
falls back to a "read/erase/modify/write" cycle, it wears the flash
cells much faster. Thus, I'd recommend using bigger overprovisioning
depending on application and usage pattern.
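
If you want to over-provision manually, a sketch could look like this
(device name is just an example, and blkdiscard is destructive - only
do this on a fresh, empty drive):

    # trim the whole device once so the unused area starts out erased
    blkdiscard /dev/sdb

    # then only partition ~85-90% of it and never touch the rest
    parted -s /dev/sdb mklabel gpt mkpart primary 1MiB 85%

The never-written tail then acts as a private pool of pre-erased
blocks for the firmware.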

> >   - the latter point also increases lifetime for obvious reasons,
> >     as SSDs only support a limited count of write cycles per block
> >   - this "shuffling around" of blocks is called wear-levelling:
> >     the firmware chooses a block candidate with the least write
> >     cycles for doing "read/modify/write"
> >
> > So, SSDs actually do this "reorganization" as you call it - but
> > they are limited to doing it within the bounds of erase block
> > sizes - and the firmware knows nothing about the on-disk format
> > and its smaller blocks, so it can do nothing to go down to a finer
> > grained reorganization.
>
> Well, I can't help it. I'm going to need to use 2 SSDs on a hardware
> RAID controller in a RAID-1. I expect the SSDs to just work fine. If
> they don't, then there isn't much point in spending the extra money
> on them.
>
> The system needs to boot from them. So what choice do I have to make
> these SSDs happy?

Well, from the OS point of view they should just work the same with
hardware and software RAID. Your RAID controller should support
passing discard commands down to the SSDs - or you use bigger
overprovisioning by not assigning all space to the array
configuration.
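
Whether discard actually makes it through the controller is easy to
check on the resulting block device (device name is just an example):

    # non-zero DISC-GRAN / DISC-MAX means discard requests get through
    lsblk --discard /dev/sda

If both columns show 0, the controller swallows the discards and
manual overprovisioning is the safer route.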

But by all means: It is worth spending the money. We are using
mirrored SSDs for an LSI CacheCade configuration - the result is
lightning-fast systems. The SSD mirror just acts as a huge write-back
and random access cache for the bigger spinning RAID sets - like
l2arc does for ZFS, just at the RAID controller level. This way, you
can have your cake and eat it, too: best of both worlds - big storage
+ high IOPS.

> > These facts are apparently unknown to most people, that's why they
> > deny that an SSD could become slow or need some specialized form
> > of "defragmentation". The usual recommendation is to do a "secure
> > erase" of the disk if it becomes slow - which I consider pretty
> > harmful as it rewrites ALL blocks (reducing their write-cycle
> > counter/lifetime), plus it's time consuming and could be avoided.
>
> That isn't an option because it would be way too much hassle.

You mean secure erase? Yes, not an option - though for different
reasons.

> > BTW: OS makers (and FS designers) actually optimize their systems
> > for that kind of reorganization by the SSD firmware. NTFS may use
> > different allocation strategies on SSDs (just a guess), and in
> > Linux there is F2FS, which actually exploits this reorganization
> > for increased performance and lifetime. Ext4 and Btrfs use
> > different allocation strategies and prefer spreading file data
> > instead of free space (which is just the opposite of what's done
> > for HDDs). So, with a modern OS you are much less prone to the
> > effects described above.
>
> Does F2FS come with some sort of redundancy? Reliability and booting
> from these SSDs are requirements, so I can't really use btrfs because
> it's troublesome to boot from, and the reliability is questionable.
> Ext4 doesn't have RAID. Using ext4 on mdadm probably won't be any
> better than using the hardware RAID, so there's no point in doing
> that, and I'd rather spare myself the overhead.

Well, you can use F2FS with mdadm. Btrfs boots just fine if you are
not using multi-device btrfs - so you'd have to fall back to hardware
RAID or mdadm instead of using btrfs's native RAID pooling.
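
As a rough sketch of that route (untested here, partition names are
just examples):

    # mirror two SSD partitions with mdadm
    mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda2 /dev/sdb2

    # put F2FS (or ext4, or single-device btrfs) on top
    mkfs.f2fs /dev/md0

Whether your boot loader can read F2FS is a separate question, so a
small ext2/ext4 /boot partition outside of it may be the easier path.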

> After your explanation, I have to wonder even more than before what
> the point in using SSDs is, considering current hardware and
> software which doesn't properly use them. OTOH, so far they do seem
> to provide better performance than hard disks even when not used
> with all the special precautions I don't want to have to think
> about.

Yes, they do. But I think there's still a lot that can be done.
Developing file systems is a multi-year, if not multi-decade, process.
Historically, everything was designed around spinning disk
characteristics. Of course, much has been done already to make these
FS work better with SSDs: Ext4 has optimizations, btrfs was designed
with SSDs in mind, F2FS is a completely new filesystem specifically
targeted at simple flash storage (devices without an FTL, read:
embedded devices) but also works great on SSDs (which do use an FTL),
and most other systems have added some sort of caching to make use of
SSDs while still providing big storage - which brings us to:

> BTW, why would anyone use SSDs for ZFS's zil or l2arc? Does ZFS
> treat SSDs properly in this application?

ZFS's caches are properly designed around this, I think. Linux adds
its own l2arc/zil-like caches (usable for every FS), namely bcache,
flashcache, mdcache, maybe more... I'm very happy with bcache in
writeback mode on my home system. [1]
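
For reference, setting that up looks roughly like this (a sketch only,
device names are examples, and formatting the backing device is
destructive):

    # register a backing HDD and a caching SSD in one go
    make-bcache -B /dev/sdb -C /dev/sdc

    # switch the resulting bcache device to writeback caching
    echo writeback > /sys/block/bcache0/bcache/cache_mode

    # then create the filesystem on /dev/bcache0 as usual
    mkfs.btrfs /dev/bcache0

Writeback gives the biggest gain but means freshly written data lives
only on the SSD until it's flushed to the HDD - that's the trade-off
against the SSD wearing out.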

Hardware solutions like LSI CacheCade also work very well. So, if
you're using a RAID controller anyway, consider that.

But I think all of those caches just work around the design patterns
of today's common filesystems - those can still use improvements and
optimizations. Still, in itself I already see it as a huge
improvement.

[1]: Though, I must say that you can wear out your SSD with bcache in
around 2 years, at least the cheaper ones. But my Win7 VM can boot in
7 seconds at best with it (btrfs-raid/bcache), though usually it's
around 15-20 seconds - and its image is bigger than my SSD. And
working with it feels no different than using Win7 natively on an SSD
(read: no VM, drive C and everything on SSD). But actually, I feel
it's simpler to replace the caching SSD when it wears out than to
reinstall the system on a new SSD when used natively because its
space has become too small.

--
Regards,
Kai

Replies to list-only preferred.