On 2022-02-22, Rich Freeman <rich0@g.o> wrote:
> On Mon, Feb 21, 2022 at 8:29 PM Grant Edwards <grant.b.edwards@×××××.com> wrote:
>>
>> But I was trying to figure out a way to do it without uncompressing
>> and recompressing the data. I had hoped that the gzip header would
>> contain a "length" field (so I would know how many bytes to copy using
>> dd), but it does not. Apparently, the only way to find the end of the
>> compressed data is to parse it using the proper algorithm (deflate, in
>> this case).
>
> I'm guessing that the reason it lacks such a header, is precisely so
> that you can use it in a stream in just this manner. In order to
> have a length in the header it would need to be able to seek back to
> the start of the file to modify the header, which isn't always
> possible.
|
Indeed. It's clearly designed to be used on non-seekable media/devices
like pipes and tapes. I should have realized that would be the case
and would preclude a length field in the header.
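
To make the "no length field" point concrete, here is a minimal Python
sketch that decodes the fixed 10-byte member header as laid out in
RFC 1952. The helper name parse_gzip_header is made up for this
illustration, and the sample data is generated in memory (the mtime
argument to gzip.compress assumes Python 3.8 or later).

```python
import gzip
import struct

def parse_gzip_header(buf):
    """Decode the fixed 10-byte gzip member header (RFC 1952, section 2.3)."""
    if len(buf) < 10:
        raise ValueError("need at least 10 bytes")
    # ID1, ID2, CM, FLG are single bytes; MTIME is a 32-bit little-endian
    # integer; XFL and OS are single bytes. None of them is a length.
    id1, id2, cm, flg, mtime, xfl, os_code = struct.unpack("<BBBBIBB", buf[:10])
    if (id1, id2) != (0x1F, 0x8B):
        raise ValueError("not a gzip member")
    return {"cm": cm, "flg": flg, "mtime": mtime, "xfl": xfl, "os": os_code}

# Sample member built in memory purely for demonstration.
sample = gzip.compress(b"hello world\n", mtime=0)
fields = parse_gzip_header(sample)
print(fields)
```

Every field is known before any data has been compressed, which is what
lets gzip write the header to a pipe or tape without seeking back.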
|
> I wouldn't be surprised if it stores some kind of metadata at the end
> of the file, but of course you can only find that if the end of the
> file is marked in some way.
|
The gzip file format has a length and CRC field in a trailer at the
end (after the compressed data). But, the only way to locate the end
is to parse the data using the appropriate decompression algorithm.
The header allows for multiple algorithms, but only one (deflate) is
actually defined.
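
As a rough illustration of that trailer, the Python sketch below reads
the CRC32 and ISIZE fields from the last 8 bytes of a member and checks
them against the original data. It assumes you already hold exactly one
complete member in memory, which is precisely what is not known when
the member sits inside a larger disk image. The name read_trailer is
hypothetical.

```python
import gzip
import struct
import zlib

def read_trailer(member):
    """Return (CRC32, ISIZE) from the last 8 bytes of one gzip member."""
    # Both fields are 32-bit little-endian integers (RFC 1952).
    crc32, isize = struct.unpack("<II", member[-8:])
    return crc32, isize

# Sample member built in memory purely for demonstration.
payload = b"example payload " * 100
member = gzip.compress(payload)

crc32, isize = read_trailer(member)
crc_ok = crc32 == zlib.crc32(payload)
size_ok = isize == len(payload) % 2**32  # ISIZE is the length modulo 2^32
print(crc_ok, size_ok)
```

Note that ISIZE is the uncompressed length, so even with the trailer in
hand it does not say how many compressed bytes precede it.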
|
> If you google the details of the gzip file format
|
I did -- link is below.
|
> you might be able to figure out how to identify the end of the file,
> scan the image to find this marker,
|
I'm pretty sure the only way to find the end of the file is to parse
the compressed data payload itself. There isn't a marker.
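
One way to do that parsing without recompressing anything is to let
zlib inflate the stream and report where it stopped. The Python sketch
below is only an illustration under assumptions: gzip_member_length is
a made-up helper, and the "image" is simulated in memory as a gzip
member followed by zero padding rather than read from a real device.

```python
import gzip
import io
import zlib

def gzip_member_length(stream, chunk_size=64 * 1024):
    """Return how many bytes the first gzip member in `stream` occupies."""
    inflater = zlib.decompressobj(31)  # wbits=31: expect gzip header and trailer
    consumed = 0
    while not inflater.eof:
        chunk = stream.read(chunk_size)
        if not chunk:
            raise ValueError("input ended before the gzip trailer")
        inflater.decompress(chunk)  # decompressed output is simply discarded
        consumed += len(chunk)
    # Bytes handed to zlib after the trailer end up in unused_data.
    return consumed - len(inflater.unused_data)

# Simulated image: one gzip member followed by unrelated padding.
member = gzip.compress(b"some data " * 1000)
image = member + b"\x00" * 4096
length = gzip_member_length(io.BytesIO(image))
print(length, len(member))
```

The returned count is the number of bytes one would then copy with dd.
The whole member still has to be read and inflated sequentially; only
the recompression step is avoided.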
|
> and then use dd to extract just the desired range. Unless the file
> is VERY large I suspect that is going to take you longer than just
> recompressing it all.
|
Definitely. It's purely an academic question at this point.
|
> I can't imagine that there is any way around sequentially reading
> the entire file to find the end,
|
I believe you're right.
|
> unless you have some mechanism that can read a random block and
> determine if it is valid gzip data and if so you can do a binary
> search assuming the data on the drive past the end of the file isn't
> valid gzip.
|
I don't think that determining whether something is valid deflate data
is easy (and it may be impossible in the general case). I implemented
the deflate algorithm from scratch once a few years ago, and vaguely
recall that you can usually inflate almost anything. As it turns out,
the flash drive I used was pretty new, and almost all 0x00 bytes.
Once I knew where to look it was pretty obvious where the gzip data
ended.
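
For a drive that really is zero-filled past the data, the "look for
where the zeros start" approach can be sketched in Python as below.
This is a heuristic of my own for illustration, not something gzip
provides: last_nonzero_offset is a hypothetical helper, and the image
is simulated in memory. The guess is only a lower bound, because the
trailer's ISIZE field often ends in 0x00 bytes itself.

```python
import gzip

def last_nonzero_offset(image):
    """Return the offset just past the last non-zero byte (0 if all zero)."""
    return len(image.rstrip(b"\x00"))

# Simulated image: one gzip member followed by zero-filled flash.
member = gzip.compress(b"abc" * 50, mtime=0)
image = member + b"\x00" * 4096

guess = last_nonzero_offset(image)
# The payload is 150 bytes, so ISIZE is 96 00 00 00 and the guess
# falls 3 bytes short of the true end of the member.
print(guess, len(member))
```

Since the trailer is 8 bytes long, the true end should be at most
8 bytes past the guess under this all-zero assumption, so a handful of
candidate lengths can be checked by actually decompressing.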
|
I've copied it the easy way (zcat | gzip -c), and verified that the
copy matches byte-for-byte except for the MTIME field in the gzip
header. It appears that gzipping stdin produces an empty MTIME
field. No surprise there.
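
A comparison like that can be sketched in Python by skipping the four
MTIME bytes, which sit at offsets 4 through 7 of the header. The helper
name same_except_mtime is made up, and the two members are generated in
memory with Python's gzip module (mtime argument, Python 3.8 or later)
rather than with the gzip command line tool.

```python
import gzip

def same_except_mtime(a, b):
    """True if two gzip members match everywhere outside header bytes 4..7."""
    return len(a) == len(b) and a[:4] == b[:4] and a[8:] == b[8:]

# Same payload compressed twice with different timestamps.
data = b"the same payload\n" * 64
first = gzip.compress(data, mtime=1)
second = gzip.compress(data, mtime=2)

print(first == second)                   # differ as whole byte strings
print(same_except_mtime(first, second))  # identical apart from MTIME
```

This only works when both copies come from the same compressor with the
same settings, so that the deflate output itself is identical.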
|
gzip file format:
|
https://datatracker.ietf.org/doc/html/rfc1952