On Sat, 05 Mar 2016 00:52:09 +0100,
lee <lee@××××××××.de> wrote:

> >> > It uses some very clever ideas to place files into groups and
> >> > into proper order - other than using file mod and access times
> >> > like other defrag tools do (which even make the problem worse by
> >> > doing so because this destroys locality of data even more).
> >>
> >> I've never heard of MyDefrag, I might try it out. Does it make
> >> updating any faster?
> >
> > Ah well, difficult question... Short answer: It uses countermeasures
> > against performance after updates decreasing too fast. It does this
> > by using a "gapped" on-disk file layout - leaving some gaps for
> > Windows to put temporary files. By this, files don't become a far
> > spread as usually during updates. But yes, it improves installation
> > time.
>
> What difference would that make with an SSD?

Well, those gaps are, with good chance, trimmed erase blocks, so they
can be served fast by the SSD firmware. Of course, the same applies if
your OS is using discard commands to mark free blocks and you still
have enough free space in the FS. So, actually, for SSDs it probably
makes no difference.
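Whether the OS can actually issue those discards depends on the device
(and any layers in between) advertising support for them. A quick way
to check on Linux - just a sketch, assuming the usual sysfs layout and
"sda" as an example device name:

#!/usr/bin/env python3
# Rough check whether a block device advertises discard (TRIM) support.
# Assumes Linux sysfs; "sda" is only an example device name.
from pathlib import Path

def discard_info(dev="sda"):
    q = Path("/sys/block") / dev / "queue"
    info = {}
    for attr in ("discard_granularity", "discard_max_bytes"):
        try:
            info[attr] = int((q / attr).read_text().strip())
        except (FileNotFoundError, ValueError):
            info[attr] = None
    return info

if __name__ == "__main__":
    info = discard_info("sda")
    print(info)
    if not info.get("discard_granularity"):
        print("No discard support reported (or the sysfs layout differs).")

If discard_granularity reads as 0, the kernel can't tell the firmware
about freed blocks and you're back to relying on the free-space layout.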

> > Apparently it's unmaintained since a few years but it still does a
> > good job. It was built upon a theory by a student about how to
> > properly reorganize file layout on a spinning disk to stay at high
> > performance as best as possible.
>
> For spinning disks, I can see how it can be beneficial.

My comment was targeted at this.

> >> > But even SSDs can use _proper_ defragmentation from time to time
> >> > for increased lifetime and performance (this is due to how the
> >> > FTL works and because erase blocks are huge, I won't get into
> >> > detail unless someone asks). This is why mydefrag also supports
> >> > flash optimization. It works by moving as few files as possible
> >> > while coalescing free space into big chunks which in turn relaxes
> >> > pressure on the FTL and allows to have more free and continuous
> >> > erase blocks which reduces early flash chip wear. A filled SSD
> >> > with long usage history can certainly gain back some performance
> >> > from this.
> >>
> >> How does it improve performance? It seems to me that, for
> >> practical use, almost all of the better performance with SSDs is
> >> due to reduced latency. And IIUC, it doesn't matter for the
> >> latency where data is stored on an SSD. If its performance
> >> degrades over time when data is written to it, the SSD sucks, and
> >> the manufacturer should have done a better job. Why else would I
> >> buy an SSD. If it needs to reorganise the data stored on it, the
> >> firmware should do that.
> >
> > There are different factors which have impact on performance, not
> > just seek times (which, as you write, is the worst performance
> > breaker):
> >
> > * management overhead: the OS has to do more house keeping, which
> > (a) introduces more IOPS (which is the only relevant limiting
> > factor for SSD) and (b) introduces more CPU cycles and data
> > structure locking within the OS routines during performing IO
> > which comes down to more CPU cycles spend during IO
>
> How would that be reduced by defragmenting an SSD?

FS structures are coalesced back into simpler structures by
defragmenting, e.g. btrfs creates a huge overhead by splitting extents
due to its COW nature. Doing a defrag here combines this back into
fewer extents. It's reported on the btrfs list that this CAN make a big
difference even for SSD, though usually you only see the performance
loss with heavily fragmented files like VM images - so the
recommendation here is to set those files nocow.
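If you want to see whether a file is actually affected, something along
these lines works for a quick look - purely illustrative, it just
parses the summary line of filefrag (from e2fsprogs), so that tool
needs to be installed:

#!/usr/bin/env python3
# Count how many extents a file is split into by parsing the summary
# line of "filefrag" (e2fsprogs). Illustrative sketch only.
import re
import subprocess
import sys

def extent_count(path):
    out = subprocess.run(["filefrag", path], capture_output=True,
                         text=True, check=True).stdout
    # filefrag prints e.g.: "disk.img: 1234 extents found"
    m = re.search(r":\s+(\d+) extents? found", out)
    if not m:
        raise RuntimeError("unexpected filefrag output: " + out)
    return int(m.group(1))

if __name__ == "__main__":
    for f in sys.argv[1:]:
        print(f, extent_count(f))

A CoW'ed VM image can easily show tens of thousands of extents there,
which is exactly the case where nocow or an occasional defrag pays off.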

> > * erasing a block is where SSDs really suck at performance wise,
> > plus blocks are essentially read-only once written - that's how
> > flash works, a flash data block needs to be erased prior to being
> > rewritten - and that is (compared to the rest of its
> > performance) a really REALLY HUGE time factor
>
> So let the SSD do it when it's idle. For applications in which it
> isn't idle enough, an SSD won't be the best solution.

That's probably true - I hadn't thought of that.

> > * erase blocks are huge compared to common filesystem block sizes
> > (erase block = 1 or 2 MB vs. file system block being 4-64k
> > usually) which happens to result in this effect:
> >
> > - OS replaces a file by writing a new, deleting the old
> > (common during updates), or the user deletes files
> > - OS marks some blocks as free in its FS structures, it depends
> > on the file size and its fragmentation if this gives you a
> > continuous area of free blocks or many small blocks scattered
> > across the disk: it results in free space fragmentation
> > - free space fragments happen to become small over time, much
> > smaller then the erase block size
> > - if your system has TRIM/discard support it will tell the SSD
> > firmware: here, I no longer use those 4k blocks
> > - as you already figured out: those small blocks marked as free
> > do not properly align with the erase block size - so actually, you
> > may end up with a lot of free space but essentially no
> > complete erase block is marked as free
>
> Use smaller erase blocks.

It's a hardware limitation - and it's probably not going to change. I
think erase blocks will become even bigger when capacities increase.
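Just to illustrate the effect from the list above with a toy model
(numbers made up, nothing measured): mark some percentage of 4k
filesystem blocks free at random positions and count how many 2 MB
erase blocks end up completely free:

#!/usr/bin/env python3
# Toy model of free-space fragmentation vs. erase blocks: mark some
# percentage of 4 KiB blocks free at random and count how many 2 MiB
# erase blocks are entirely free (i.e. reclaimable in advance).
import random

FS_BLOCK = 4 * 1024
ERASE_BLOCK = 2 * 1024 * 1024
PER_ERASE = ERASE_BLOCK // FS_BLOCK        # 512 FS blocks per erase block
ERASE_BLOCKS = 1000                        # ~2 GiB of flash, toy size
FS_BLOCKS = ERASE_BLOCKS * PER_ERASE

random.seed(0)
for free_pct in (10, 25, 50):
    free = set(random.sample(range(FS_BLOCKS), FS_BLOCKS * free_pct // 100))
    fully_free = sum(
        1 for e in range(ERASE_BLOCKS)
        if all(e * PER_ERASE + i in free for i in range(PER_ERASE))
    )
    print(f"{free_pct}% of FS blocks free -> "
          f"{fully_free}/{ERASE_BLOCKS} erase blocks completely free")

With randomly scattered frees, even a half-empty drive ends up with
zero completely free erase blocks in this model - which is exactly
what free-space defragmentation (or contiguous frees) is supposed to
avoid.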

> > - this situation means: the SSD firmware cannot reclaim this
> > free space to do "free block erasure" in advance so if you write
> > another block of small data you may end up with the SSD going
> > into a direct "read/modify/erase/write" cycle instead of just
> > "read/modify/write" and deferring the erasing until later - ah
> > yes, that's probably becoming slow then
> > - what do we learn: (a) defragment free space from time to time,
> > (b) enable TRIM/discard to reclaim blocks in advance, (c) you
> > may want to over-provision your SSD: just don't ever use 10-15% of
> > your SSD, trim that space, and leave it there for the
> > firmware to shuffle erase blocks around
>
> Use better firmware for SSDs.

This is a technical limitation. I don't think there's anything the
firmware could improve here - except using internal overprovisioning
and bigger caches to defer this work to idle time - but see your
comment above regarding idle time.

A problem that goes hand in hand with this: If your SSD firmware falls
back to a "read/modify/erase/write" cycle, this wears the flash cells
much faster. Thus, I'd recommend using bigger overprovisioning,
depending on application and usage pattern.
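To put rough numbers on the wear aspect (all figures made up for
illustration, plug in your drive's specs and workload): lifetime is
roughly capacity times rated P/E cycles divided by what actually hits
the flash, so write amplification from constant read/modify/erase/write
cycles eats into it directly:

#!/usr/bin/env python3
# Back-of-the-envelope SSD lifetime estimate. All numbers are
# illustrative assumptions, not data for any particular drive.
capacity_gb = 256          # usable capacity
pe_cycles = 3000           # rated program/erase cycles per cell
host_writes_gb_day = 50    # what the OS writes per day

for write_amp in (1.1, 3.0, 10.0):
    nand_writes_gb_day = host_writes_gb_day * write_amp
    lifetime_years = capacity_gb * pe_cycles / nand_writes_gb_day / 365
    print(f"write amplification {write_amp:4.1f} -> ~{lifetime_years:.1f} years")

More overprovisioning mainly helps by keeping the write amplification
near the low end of that range.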

> > - the latter point also increases life-time for obvious reasons
> > as SSDs only support a limited count of write-cycles per block
> > - this "shuffling around" blocks is called wear-levelling: the
> > firmware chooses a block candidate with the least write cycles
> > for doing "read/modify/write"
> >
> > So, SSDs actually do this "reorganization" as you call it - but they
> > are limited to it within the bounds of erase block sizes - and the
> > firmware knows nothing about the on-disk format and its smaller
> > blocks, so it can do nothing to go down to a finer grained
> > reorganization.
>
> Well, I can't help it. I'm going to need to use 2 SSDs on a hardware
> RAID controller in a RAID-1. I expect the SSDs to just work fine. If
> they don't, then there isn't much point in spending the extra money on
> them.
>
> The system needs to boot from them. So what choice do I have to make
> these SSDs happy?

Well, from the OS point of view they should just work the same with
hardware and software RAID. Your RAID controller should support passing
discard commands down to the SSDs - or you can use bigger
overprovisioning by not assigning all the space to the array
configuration.

But by all means: it is worth spending the money. We are using mirrored
SSDs for an LSI CacheCade configuration - the result is lightning-fast
systems. The SSD mirror just acts as a huge write-back and random
access cache for the bigger spinning RAID sets - like l2arc does for
ZFS, just at the RAID controller level. This way, you can have your
cake and eat it, too: the best of both worlds - big storage + high
IOPS.

> > These facts are apparently unknown to most people, that's why they
> > are denying a SSD could become slow or needs some specialized form
> > of "defragmentation". The usual recommendation is to do a "secure
> > erase" of the disk if it becomes slow - which I consider pretty
> > harmful as it rewrites ALL blocks (reducing their write-cycle
> > counter/lifetime), plus it's time consuming and could be avoided.
>
> That isn't an option because it would be way too much hassle.

You mean secure erase: yes, not an option - for different reasons.

> > BTW: OS makers (and FS designers) actually optimize their systems
> > for that kind of reorganization of the SSD firmware. NTFS may use
> > different allocation strategies on SSD (just a guess) and in Linux
> > there is F2FS which actually exploits this reorganization for
> > increased performance and lifetime, Ext4 and Btrfs use different
> > allocation strategies and prefer spreading file data instead of
> > free space (which is just the opposite of what's done for HDD). So,
> > with a modern OS you are much less prone to the effects described
> > above.
>
> Does F2FS come with some sort of redundancy? Reliability and booting
> from these SSDs are requirements, so I can't really use btrfs because
> it's troublesome to boot from, and the reliability is questionable.
> Ext4 doesn't have raid. Using ext4 on mdadm probably won't be any
> better than using the hardware RAID, so there's no point in doing
> that, and I rather spare me the overhead.

Well, you can use F2FS with mdadm. Btrfs boots just fine if you are not
using multi-device btrfs - so you would have to fall back to hardware
RAID or mdadm instead of btrfs's native RAID pooling.

> After your explanation, I have to wonder even more than before what
> the point in using SSDs is, considering current hard- and software
> which doesn't properly use them. OTOH, so far they do seem to
> provide better performance than hard disks even when not used with
> all the special precautions I don't want to have to think about.

Yes, they do. But I think there's still a lot that can be done.
Developing file systems is a multi-year, if not multi-decade, process.
Historically, everything was designed around spinning disk
characteristics. Of course, much has already been done to make these FS
work better with SSDs: Ext4 has optimizations, btrfs was designed with
SSDs in mind, F2FS is a completely new filesystem specifically targeted
at simple flash storage (devices without an FTL, read: embedded
devices) but also works great for SSDs (which use an FTL), and most
other systems have added some sort of caching to make use of SSDs while
still providing big storage, that is:

> BTW, why would anyone use SSDs for ZFS's zil or l2arc? Does ZFS treat
> SSDs properly in this application?

ZFS's caches are properly designed around this, I think. Linux adds its
own l2arc/zil-like caches (usable for every FS), namely bcache,
flashcache, mdcache, maybe more... I'm very happy with bcache in
writeback mode on my home system. [1]
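If you want to see what such a cache actually does for you, bcache
exports counters in sysfs; a rough sketch (it assumes a device
registered as bcache0, and the exact sysfs layout may differ between
kernel versions):

#!/usr/bin/env python3
# Print bcache hit/miss counters from sysfs. Assumes a device
# registered as bcache0; paths may vary with kernel version.
from pathlib import Path

STATS = Path("/sys/block/bcache0/bcache/stats_total")

def read_stat(name):
    p = STATS / name
    return p.read_text().strip() if p.exists() else "n/a"

for name in ("cache_hits", "cache_misses", "cache_hit_ratio", "bypassed"):
    print(name + ":", read_stat(name))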

Hardware solutions like LSI CacheCade also work very well. So, if
you're using a RAID controller anyway, consider that.

But I think all of those caches just work around the design patterns of
today's common filesystems - those can still use improvements and
optimizations. But in itself, I already see it as a huge improvement.

[1]: Though, I must say that you can wear out your SSD with bcache in
around 2 years, at least the cheaper ones. But my Win7 VM can boot in 7
seconds at best with it (btrfs-raid/bcache), though usually it's around
15-20 seconds - and its image is bigger than my SSD. And working with
it feels no different from using Win7 natively on an SSD (read: no VM,
drive C and everything on SSD). But actually, I feel it's simpler to
replace the caching SSD when it wears out than to reinstall the system
on a new SSD because, when used natively, its space just becomes too
small.

--
Regards,
Kai

Replies to list-only preferred.