Gentoo Archives: gentoo-user

From: Andreas Fink <finkandreas@×××.de>
To: gentoo-user@l.g.o
Subject: Re: [gentoo-user] [OT] Differences between wget and browser file retrieval?
Date: Thu, 14 Jan 2021 21:36:42
Message-Id: 1MxYXF-1lxjzF26S9-00xYxm@smtp.web.de
In Reply to: Re: [gentoo-user] [OT] Differences between wget and browser file retrieval? by Jack
On Thu, 14 Jan 2021 16:10:09 -0500
Jack <ostroffjh@×××××××××××××××××.net> wrote:

> On 2021.01.14 15:49, Walter Dnes wrote:
> > I'm bored, so I do a regular daily report at the DSL Reports "CanChat"
> > sub-forum, on the Covid-19 case counts for Ontario, using provincial
> > data. I download 2 files daily as source data. One of them is a PDF
> > file, which is run through "pdftotext" and then parsed by a bash script
> > (don't ask). Today, the command...
> >
> > wget https://files.ontario.ca/moh-covid-19-report-en-2021-01-14.pdf
> >
> > ...returns a zero-byte file. *BUT*, sticking the URL into the URL bar
> > of Pale Moon and Google Chrome (and I assume Firefox/etc) brings up the
> > PDF file just fine. Is "wget" being blocked? I have to do extra steps
> > to get from the browser-invoked PDF to get the PDF file saved to the
> > standard work area where my script expects it to be, so it can work its
> > magic and parse out the daily breakdown by PHU (Public Health Unit).
> > BTW, today's posts requiring the PDF file are...
> > https://www.dslreports.com/forum/r33002718-
> > https://www.dslreports.com/forum/r33002752-
> >
> > I've tried setting --user-agent= with my browser's string as shown by
> > https://www.whatismybrowser.com/detect/what-is-my-user-agent but no
> > luck. Is there some way to get around this? I have not updated this
> > past week, so I don't think the problem is at my end.
>
> I just copy/pasted that wget command into my terminal, and it got me a
> 1.7M PDF doc. I'm in the US, but I have no idea if location/IP is an
> issue or not.
>
> Jack
>

I could also download the file with the wget command that you posted. If
you still have trouble, you could try curl and pretend to be Firefox by
sending the same request headers a browser would:

curl 'https://files.ontario.ca/moh-covid-19-report-en-2021-01-14.pdf' \
  -H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:84.0) Gecko/20100101 Firefox/84.0' \
  -H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8' \
  -H 'Accept-Language: en,de;q=0.7,en-US;q=0.3' \
  --compressed \
  -H 'DNT: 1' \
  -H 'Connection: keep-alive' \
  -H 'Upgrade-Insecure-Requests: 1' \
  -H 'Pragma: no-cache' \
  -H 'Cache-Control: no-cache' \
  > moh-covid-19-report-en-2021-01-14.pdf

Andreas
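
If the daily script should cope with this by itself, the two approaches in this thread could be combined roughly as below. This is only a sketch under some assumptions: the function names report_url and fetch_report are made up for illustration, the URL pattern is assumed to stay the same from day to day, and the -L (follow redirects) flag is an addition that was not in the curl command above. It does not explain why wget got a zero-byte file; it only detects that case and retries with browser-like headers.

```shell
#!/bin/sh
# Sketch only: try wget first, then fall back to curl with browser-like
# headers if the result is missing or empty.

# Build the report URL for a date given as YYYY-MM-DD.
# Assumption: the URL pattern quoted in this thread does not change.
report_url() {
    printf 'https://files.ontario.ca/moh-covid-19-report-en-%s.pdf\n' "$1"
}

# fetch_report URL OUTFILE
# Returns 0 only if OUTFILE exists and is non-empty afterwards.
fetch_report() {
    url=$1
    out=$2

    # First attempt: plain wget. "test -s" is true only for a file that
    # exists and has a size greater than zero, which catches the
    # zero-byte download described above.
    if wget -q -O "$out" "$url" && [ -s "$out" ]; then
        return 0
    fi

    # Fallback: curl sending a Firefox User-Agent and Accept headers.
    # -L is an addition here, not part of the command posted above.
    curl -sS -L --compressed \
        -H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:84.0) Gecko/20100101 Firefox/84.0' \
        -H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8' \
        -H 'Accept-Language: en,de;q=0.7,en-US;q=0.3' \
        -o "$out" "$url"

    # Final status: success only if something was actually written.
    [ -s "$out" ]
}

# Example call, left commented out so that sourcing this file does not
# touch the network:
# fetch_report "$(report_url "$(date +%F)")" report.pdf || echo "download failed" >&2
```

The script can then go on to run pdftotext only when fetch_report succeeds, instead of parsing an empty file.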