Gentoo Archives: gentoo-user

From: Laurence Perkins <lperkins@×××××××.net>
To: "gentoo-user@l.g.o" <gentoo-user@l.g.o>
Subject: RE: [gentoo-user] How to compress lots of tarballs
Date: Wed, 29 Sep 2021 20:28:06
Message-Id: MW2PR07MB4058BEB071D7FB140761B07CD2A99@MW2PR07MB4058.namprd07.prod.outlook.com
In Reply to: Re: [gentoo-user] How to compress lots of tarballs by Dale
> > On Wed, Sep 29, 2021 at 4:27 AM Peter Humphrey <peter@××××××××××××.uk> wrote:
> >> Thanks Laurence. I've looked at borg before, wondering whether I
> >> needed a more sophisticated tool than just tar, but it looked like
> >> too much work for little gain. I didn't know about duplicity, but I'm
> >> used to my weekly routine and it seems reliable, so I'll stick with
> >> it pro tem. I've been keeping a daily KMail archive since the bad old
> >> days, and five weekly backups of the whole system, together with 12
> >> monthly backups and, recently, an annual backup. That last may be overkill, I dare say.
> > I think Restic might be gaining some ground on duplicity. I use
> > duplicity and it is fine, so I haven't had much need to look at
> > anything else. Big advantages of duplicity over tar are:
> >
> > 1. It will do all the compression/encryption/etc stuff for you - all
> > controlled via options.
> > 2. It uses librsync, which means if one byte in the middle of a 10GB
> > file changes, you end up with a few bytes in your archive and not 10GB
> > (pre-compression).
> > 3. It has a ton of cloud/remote backends, so it is real easy to store
> > the data on AWS/Google/whatever. When operating this way it can keep
> > local copies of the metadata, and if for some reason those are lost it
> > can pull just that down from the cloud to resync without a huge
> > bill.
> > 4. It can do all the backup rotation logic (fulls, incrementals,
> > retention, etc).
> > 5. It can prefix files so that on something like AWS you can have the
> > big data archive files go to glacier (cheap to store, expensive to
> > restore), and the small metadata stays in a data class that is cheap
> > to access.
> > 6. By default local metadata is kept unencrypted, and anything on the
> > cloud is encrypted. This means that you can just keep a public key in
> > your keyring for completely unattended backups, without fear of access
> > to the private key. Obviously if you need to restore your metadata
> > from the cloud you'll need the private key for that.
> >
> > If you like the more tar-like process, another tool you might want to
> > look at is dar. It basically is a near-drop-in replacement for tar,
> > but it stores indexes at the end of every file, which means that you
> > can view archive contents/etc or restore individual files without
> > scanning the whole archive. tar was really designed for tape, where
> > random access is not possible.
> >
>
>
> Curious question here. As you may recall, I back up to an external hard drive. Would it make sense to use that software for an external hard drive? Right now, I'm just doing file updates with rsync and the drive is encrypted. Thing is, I'm going to have to split into three drives soon. So, compressing may help. Since it is video files, it may not help much, but I'm not sure about that. Just curious.
>
> Dale
>
> :-) :-)
>
>

If I understand correctly you're using rsync+tar and then keeping a set of copies of various ages.

If you lose a single file that you want to restore and have to go hunting for it, with tar you can only list the files in the archive by reading through the entire thing, and can only extract by reading from the beginning until you stumble across the matching filename. So with large archives to hunt through, that could take... a while...

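To illustrate, both of these make tar walk the whole archive (the archive name and paths are just placeholders):

  # listing means reading every header in the archive, start to finish
  tar -tzf weekly-backup.tar.gz
  # extracting one file still scans from the beginning until it hits a match
  tar -xzf weekly-backup.tar.gz home/dale/videos/some-file.mkv
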
dar works much like tar (pretty sure being a near-drop-in replacement is one of its main selling points, though it uses its own archive format) but adds an index at the end of the file, allowing listing of the contents and jumping to particular files without having to read the entire thing. It won't help with your space shortage, but it will make searching and single-file restores much faster.

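From memory the workflow looks roughly like this (check the man page; names and paths here are just examples):

  # create a compressed archive of the tree under /home/dale
  dar -c weekly-backup -R /home/dale -z
  # list the contents (only the index at the end gets read)
  dar -l weekly-backup
  # restore a single file into /tmp/restore without scanning the rest
  dar -x weekly-backup -R /tmp/restore -g videos/some-file.mkv
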
Duplicity and similar tools have the indices, and additionally a full+incremental scheme. So searching is reasonably quick, and restoring likewise doesn't have to grovel over all the data. It can be slower than tar or dar for restore, though, because it has to restore first from the full and then walk through however many incrementals are necessary to get the version you want. In exchange you get a substantial space savings, as each set of archive files after the full contains only the pieces which actually changed. Coupled with compression, that might solve your space issues for a while longer.

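Against an external drive that looks roughly like this (paths are placeholders, and you'd point it at your own GPG key or pass --no-encryption):

  # occasional full, incrementals in between, compressed and encrypted
  duplicity full /home/dale file:///mnt/backup1/dale
  duplicity incremental /home/dale file:///mnt/backup1/dale
  # pull a single file back out of the most recent backup chain
  duplicity restore --file-to-restore videos/some-file.mkv \
      file:///mnt/backup1/dale /tmp/some-file.mkv
  # expire old chains when the drive starts filling up
  duplicity remove-older-than 6M --force file:///mnt/backup1/dale
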
Borg and similar break the files into variable-size chunks and store each chunk indexed by its content hash. So each chunk gets stored exactly once regardless of how many times it may occur in the data set. Backups then become simply lists of file attributes and what chunks they contain. This results both in storing only changes between backup runs and in deduplication of commonly-occurring data chunks across the entire backup. The database-like structure also means that all backups can be searched and restored from in roughly equal amounts of time and that backup sets can be deleted in any order. Many of them (Borg included) also allow mounting backup sets via FUSE. The disadvantage is that restore requires a compatible version of the backup tool rather than just a generic utility.

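For comparison, a borg setup on an external drive goes something like this (the repo path and retention numbers are only examples):

  # one-time repository setup on the drive
  borg init --encryption=repokey /mnt/backup1/borg-repo
  # each run stores only the chunks the repo hasn't seen before
  borg create --compression lz4 /mnt/backup1/borg-repo::dale-{now} /home/dale
  # browse any backup set through FUSE
  borg mount /mnt/backup1/borg-repo /mnt/borg
  # thin out old backup sets in any order
  borg prune --keep-weekly 5 --keep-monthly 12 /mnt/backup1/borg-repo
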
LMP

Replies

Subject Author
Re: [gentoo-user] How to compress lots of tarballs Dale <rdalek1967@×××××.com>