On Thu, Sep 1, 2016 at 6:35 PM, Kai Krakow <hurikhan77@×××××.com> wrote:
> On Tue, 30 Aug 2016 17:59:02 -0400, Rich Freeman <rich0@g.o> wrote:
>
>>
>> That depends on the mode of operation. In journal=data I believe
>> everything gets written twice, which should make it fairly immune to
>> most forms of corruption.
>
> No, journal != data integrity. A journal only ensures that data is
> written transactionally. You won't end up with messed-up metadata,
> and from an API perspective, with journal=data, a partially written
> block of data will be rewritten after recovering from a crash - up
> to the last fsync. If that last fsync happened halfway into a file:
> well, then only your work up to that half of the file is there.

Well, sure, but all an application needs to do is make sure it calls
write on whole files, not half-files. It doesn't need to fsync as far
as I'm aware; it just needs to write consistent file contents in one
system call. Then that write either will or won't make it to disk,
but you won't get half of a write.

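Something like this is all I mean - just a sketch with made-up names,
not anything from a real codebase:

    /* Sketch of the "whole file in one write()" idea -- the path and
     * buffer are hypothetical. */
    #include <fcntl.h>
    #include <unistd.h>

    int save_whole_file(const char *path, const char *buf, size_t len)
    {
        int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0)
            return -1;
        /* One write() covering the complete, consistent contents;
         * with journal=data that data block either lands in full
         * after a crash or not at all, never halfway. */
        ssize_t n = write(fd, buf, len);
        close(fd);
        return (n == (ssize_t)len) ? 0 : -1;
    }

(Strictly, write() may return a short count, in which case you'd have
to retry the remainder, but the point stands: one consistent buffer,
one call.)
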
> Journals only ensure consistency at the API level, not integrity.

Correct, but this is way better than neither journaling nor ordering
data, which protects the metadata but doesn't ensure your files aren't
garbled even if the application is careful.

>
> If you need integrity, so that the file system can tell you whether
> your file is broken or not, you need checksums.
>

Btrfs and zfs fail in the exact same way in this particular regard.
If you call write with half of a file, btrfs/zfs will tell you that
half of that file was successfully written. But they can't vouch for
the other half of the file that the kernel was never told about.

The checksumming in these filesystems really only protects data from
modification after it is written. Sectors that were only half-written
during an outage, and thus have inconsistent checksums, probably won't
even be looked at during an fsck/mount, because the filesystem is just
going to replay the journal and write right over them (or to some new
block, still treating the half-written data as unallocated). These
filesystems don't go scrubbing the disk to figure out what happened;
they just replay the log back to the last checkpoint. The checksums
are only used during routine reads to ensure the data wasn't somehow
corrupted after it was written, in which case a good copy is used,
assuming one exists. If not, at least you'll know about the problem.

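The read path, in other words, looks roughly like this - a sketch of
the idea, not anyone's actual code, and crc32c() is a stand-in:

    #include <stdint.h>
    #include <stddef.h>

    struct block_copy {
        const uint8_t *data;   /* one stored copy of the block */
        size_t len;
        uint32_t stored_csum;  /* checksum recorded at write time */
    };

    /* Assumed helper; real filesystems use crc32c or stronger. */
    extern uint32_t crc32c(const uint8_t *buf, size_t len);

    /* Return the first copy whose contents still match its checksum,
     * or NULL if every copy is bad and the read must fail. */
    const struct block_copy *read_verified(const struct block_copy *c,
                                           int ncopies)
    {
        for (int i = 0; i < ncopies; i++)
            if (crc32c(c[i].data, c[i].len) == c[i].stored_csum)
                return &c[i];
        return NULL;
    }
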
> If you need a way to recover from a half-written file, you need a CoW
> file system where, with luck, you can go back a few generations.

Only if you've kept snapshots, or plan to hex-edit your disk/etc. The
real solution here is to use the system calls correctly.

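The conventional pattern for replacing a file safely is something like
this (again just a sketch; the fsync is the portable belt-and-braces
step, which per the above you may not even need with journal=data):

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int replace_file(const char *path, const char *tmp,
                     const char *buf, size_t len)
    {
        int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0)
            return -1;
        if (write(fd, buf, len) != (ssize_t)len || fsync(fd) != 0) {
            close(fd);
            unlink(tmp);
            return -1;
        }
        close(fd);
        /* rename() swaps the new file in atomically: readers see the
         * old contents or the new contents, never a mix. */
        return rename(tmp, path);
    }
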
>
>> f2fs would also have this benefit. Data is not overwritten in-place
>> in a log-based filesystem; they're essentially journaled by their
>> design (actually, they're basically what you get if you ditch the
>> regular part of the filesystem and keep nothing but the journal).
>
> This is log-structured, not journaled. You pointed that out, yes, but
> you weakened it by writing "basically the same". I think the
> difference is important, mostly because the journal is a fixed area on
> the disk, while a log-structured file system has no such journal.

My point was that they're equivalent from the standpoint that every
write either completes or fails and you don't get half-written data.
Yes, I know how f2fs actually works, and this wasn't intended to be a
primer on log-based filesystems. The COW filesystems have similar
benefits since they don't overwrite data in place, other than maybe
their superblocks (or whatever you call them). I don't know the
on-disk format of zfs, but btrfs keeps multiple copies of the tree
root, each with a generation number, so if something dies partway
through a commit it is really easy to figure out where it left off:
if none of the roots were updated, any partial tree structures laid
down are in unallocated space and just get rewritten on the next
commit, and if any were written, you have a fully consistent new tree
that is used to update the remaining roots.

One of these days I'll have to read up on the on-disk format of zfs, as
I suspect it would make an interesting contrast with btrfs.

>
> This point was raised because it supports checksums, not because it
> supports CoW.

Sure, but both provide benefits in these contexts. And the COW
filesystems are also the only ones I'm aware of (at least in popular
use) that have checksums.

>
> Log-structured file systems are, btw, interesting for write-mostly
> workloads on spinning disks because head movements are minimized.
> They do not automatically help dumb/simple flash translation layers.
> That takes a little more logic, exploiting the internal structure of
> flash (writing only sequentially in page-sized blocks, garbage
> collection and reuse only at the erase-block level). F2fs and bcache
> (as a caching layer) do this. Not sure about the others.

Sure. It is just really easy to do big block erases in a log-based
filesystem since everything tends to be written (and overwritten)
sequentially. You can of course build a log-based filesystem that
doesn't perform well on flash. They would still tend to have the
benefits of data journaling (for free; the cost is fragmentation,
which is of course a bigger issue on spinning disks).

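The allocation policy that makes this work is trivially simple - a toy
sketch, with a made-up segment size standing in for the erase block:

    #include <stdint.h>

    #define SEGMENT_SIZE (2 * 1024 * 1024)  /* pretend erase-block size */

    struct log_head {
        uint64_t segment;  /* segment currently being filled */
        uint32_t offset;   /* next free byte within it */
    };

    /* Every write goes to the log head, strictly sequentially.  Once
     * a segment's live data has been rewritten further down the log,
     * the whole segment can be erased in one shot. */
    uint64_t log_alloc(struct log_head *h, uint32_t len)
    {
        if (h->offset + len > SEGMENT_SIZE) {  /* segment full */
            h->segment++;
            h->offset = 0;
        }
        uint64_t addr = h->segment * SEGMENT_SIZE + h->offset;
        h->offset += len;
        return addr;
    }
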
-- |
Rich |