On Thu, 14 Jan 2021 16:10:09 -0500
Jack <ostroffjh@×××××××××××××××××.net> wrote:

> On 2021.01.14 15:49, Walter Dnes wrote:
> >   I'm bored, so I do a regular daily report at the DSL Reports "CanChat"
> > sub-forum, on the Covid-19 case counts for Ontario, using provincial
> > data.  I download 2 files daily as source data.  One of them is a PDF
> > file, which is run through "pdftotext" and then parsed by a bash script
> > (don't ask).  Today, the command...
> >
> >   wget https://files.ontario.ca/moh-covid-19-report-en-2021-01-14.pdf
> >
> > ...returns a zero-byte file.  *BUT*, sticking the URL into the URL bar
> > of Pale Moon and Google Chrome (and I assume Firefox/etc) brings up the
> > PDF file just fine.  Is "wget" being blocked?  I have to do extra steps
> > to get from the browser-invoked PDF to get the PDF file saved to the
> > standard work area where my script expects it to be, so it can work its
> > magic and parse out the daily breakdown by PHU (Public Health Unit).
> > BTW, today's posts requiring the PDF file are...
> > https://www.dslreports.com/forum/r33002718-
> > https://www.dslreports.com/forum/r33002752-
> >
> > I've tried setting --user-agent= with my browser's string as shown by
> > https://www.whatismybrowser.com/detect/what-is-my-user-agent but no
> > luck.  Is there some way to get around this?  I have not updated this
> > past week, so I don't think the problem is at my end.
>
> I just copy/pasted that wget command into my terminal, and it got me a
> 1.7M PDF doc.  I'm in the US, but I have no idea if location/IP is an
> issue or not.
>
> Jack
>

I could download the file too with the wget command that you posted. If
you still have trouble, you could try using curl and pretend to be
Firefox:

curl 'https://files.ontario.ca/moh-covid-19-report-en-2021-01-14.pdf' \
  -H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:84.0) Gecko/20100101 Firefox/84.0' \
  -H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8' \
  -H 'Accept-Language: en,de;q=0.7,en-US;q=0.3' \
  --compressed \
  -H 'DNT: 1' \
  -H 'Connection: keep-alive' \
  -H 'Upgrade-Insecure-Requests: 1' \
  -H 'Pragma: no-cache' \
  -H 'Cache-Control: no-cache' \
  > moh-covid-19-report-en-2021-01-14.pdf
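
If the daily script should notice a zero-byte download by itself, one
option is to check the file before handing it to pdftotext, and retry
with curl only when wget's result is unusable. This is only a rough
sketch, not something I have run against the Ontario server: the
function names looks_like_pdf and fetch_pdf are made up for
illustration, and the user-agent string is simply the Firefox one from
the curl command above.

```shell
#!/bin/bash
# Hypothetical helpers (names are assumptions, not from any existing script).
# The user-agent is the Firefox string from the curl command in this thread.
UA='Mozilla/5.0 (X11; Linux x86_64; rv:84.0) Gecko/20100101 Firefox/84.0'

# Succeed only if the file is non-empty and starts with the PDF magic "%PDF".
looks_like_pdf() {
    [ -s "$1" ] && [ "$(head -c 4 "$1")" = "%PDF" ]
}

# Try wget first; if the result is empty or not a PDF, retry with curl.
# The return status reflects whether a plausible PDF ended up in "$out".
fetch_pdf() {
    local url=$1 out=$2
    wget -q --user-agent="$UA" -O "$out" "$url"
    looks_like_pdf "$out" && return 0
    curl -sSfL --compressed -A "$UA" -o "$out" "$url"
    looks_like_pdf "$out"
}
```

It could then be called as, for example,
fetch_pdf "$url" report.pdf || echo "download failed" >&2
so the parsing step only runs when a real PDF has arrived.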

Andreas