Gentoo Archives: gentoo-user

From: Grant Edwards <grant.b.edwards@×××××.com>
To: gentoo-user@l.g.o
Subject: [gentoo-user] Re: How to copy gzip data from bytestream?
Date: Tue, 22 Feb 2022 03:05:46
Message-Id: sv1jtn$c0u$1@ciao.gmane.io
In Reply to: Re: [gentoo-user] How to copy gzip data from bytestream? by Rich Freeman
On 2022-02-22, Rich Freeman <rich0@g.o> wrote:
> On Mon, Feb 21, 2022 at 8:29 PM Grant Edwards <grant.b.edwards@×××××.com> wrote:
>>
>> But I was trying to figure out a way to do it without uncompressing
>> and recompressing the data. I had hoped that the gzip header would
>> contain a "length" field (so I would know how many bytes to copy using
>> dd), but it does not. Apparently, the only way to find the end of the
>> compressed data is to parse it using the proper algorithm (deflate, in
>> this case).
>
> I'm guessing that the reason it lacks such a header, is precisely so
> that you can use it in a stream in just this manner. In order to
> have a length in the header it would need to be able to seek back to
> the start of the file to modify the header, which isn't always
> possible.

Indeed. It's clearly designed to be used on non-seekable media/devices
like pipes and tapes. I should have realized that would be the case
and would preclude a length field in the header.

> I wouldn't be surprised if it stores some kind of metadata at the end
> of the file, but of course you can only find that if the end of the
> file is marked in some way.

The gzip file format has a length and CRC field in a trailer at the
end (after the compressed data). But, the only way to locate the end
is to parse the data using the appropriate decompression algorithm.
The header allows for multiple algorithms, but only one (deflate) is
actually defined.
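
To illustrate the trailer, here is a minimal Python sketch. It assumes you
already hold exactly one complete gzip member in memory (which is the
hard part in this thread); the function name read_gzip_trailer is made
up for the example. Per RFC 1952 the last eight bytes are CRC32 and
ISIZE, both little-endian, where ISIZE is the uncompressed length
modulo 2^32.

```python
import gzip
import struct
import zlib


def read_gzip_trailer(member: bytes) -> tuple[int, int]:
    """Return (CRC32, ISIZE) from the last 8 bytes of one gzip member.

    Hypothetical helper: it trusts that `member` ends exactly where the
    gzip member ends, which is precisely what cannot be known up front.
    """
    if len(member) < 18:  # 10-byte header + 8-byte trailer is the bare minimum
        raise ValueError("too short to be a gzip member")
    crc32, isize = struct.unpack("<II", member[-8:])
    return crc32, isize


# Small self-check against data we compress ourselves.
data = b"example payload " * 100
crc32, isize = read_gzip_trailer(gzip.compress(data))
crc_matches = crc32 == zlib.crc32(data)
size_matches = isize == len(data) % 2**32
```

The trailer is only useful once the end is known, so it cannot by
itself tell you how many bytes to copy.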

> If you google the details of the gzip file format

I did -- link is below.

> you might be able to figure out how to identify the end of the file,
> scan the image to find this marker,

I'm pretty sure the only way to find the end of the file is to parse
the compressed data payload itself. There isn't a marker.
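
A rough sketch of what "parse the payload to find the end" can look
like, assuming Python's zlib module is acceptable as the deflate
parser. The function name gzip_member_length and the 64 KiB chunk size
are arbitrary choices for the example. The data is still inflated
internally, but the output is thrown away and nothing is recompressed.

```python
import gzip
import io
import zlib


def gzip_member_length(stream, chunk_size=64 * 1024) -> int:
    """Return how many bytes the first gzip member in `stream` occupies."""
    # wbits=31 (16 + 15) tells zlib to expect a gzip header and trailer.
    inflater = zlib.decompressobj(wbits=31)
    consumed = 0
    while not inflater.eof:
        chunk = stream.read(chunk_size)
        if not chunk:
            raise ValueError("input ended before the gzip member did")
        inflater.decompress(chunk)  # decompressed output is discarded
        consumed += len(chunk)
    # Bytes read past the trailer are handed back in unused_data.
    return consumed - len(inflater.unused_data)


# Simulate a gzip member followed by zero-filled device space.
member = gzip.compress(b"some payload " * 1000)
image = member + b"\x00" * 4096
end = gzip_member_length(io.BytesIO(image))
```

The resulting byte count could then be used as the length for a dd
copy of the original compressed bytes.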

> and then use dd to extract just the desired range. Unless the file
> is VERY large I suspect that is going to take you longer than just
> recompressing it all.

Definitely. It's purely an academic question at this point.

> I can't imagine that there is any way around sequentially reading
> the entire file to find the end,

I believe you're right.

> unless you have some mechanism that can read a random block and
> determine if it is valid gzip data and if so you can do a binary
> search assuming the data on the drive past the end of the file isn't
> valid gzip.

I don't think that determining if something is valid deflate data is
easy (and it may be impossible in the general case). I implemented the
deflate algorithm from scratch once a few years ago, and vaguely
recall that you can usually inflate almost anything. It turns out
that the flash drive I used was pretty new, and almost all 0x00
bytes. Once I knew where to look it was pretty obvious where the gzip
data ended.

I've copied it the easy way (zcat | gzip -c), and verified that the
copy matches byte-for-byte except for the MTIME field in the gzip
header. It appears that gzipping stdin produces an empty MTIME
field. No surprise there.
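
For reference, a small Python sketch of that comparison. It assumes
both copies are in memory as bytes; the helper names gzip_mtime and
equal_except_mtime are invented for the example. In RFC 1952 the MTIME
field is the four little-endian bytes at offsets 4 through 7 of the
header, and a value of 0 means no timestamp is available.

```python
import gzip
import struct


def gzip_mtime(buf: bytes) -> int:
    """Return the MTIME header field (seconds since the epoch, 0 = unset)."""
    if buf[:2] != b"\x1f\x8b":
        raise ValueError("missing gzip magic bytes")
    (mtime,) = struct.unpack_from("<I", buf, 4)
    return mtime


def equal_except_mtime(a: bytes, b: bytes) -> bool:
    """Compare two gzip streams while ignoring header bytes 4..7 (MTIME)."""
    return a[:4] == b[:4] and a[8:] == b[8:]


# Build a stream with a timestamp, then a copy whose MTIME is zeroed,
# mimicking what was observed after recompressing from stdin.
stamped = gzip.compress(b"payload " * 50, mtime=1645499146)
zeroed = stamped[:4] + struct.pack("<I", 0) + stamped[8:]
```

Zeroing MTIME does not affect decompression, since the trailer CRC
covers only the uncompressed data.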

gzip file format:

https://datatracker.ietf.org/doc/html/rfc1952