On Sat, Apr 29, 2017 at 9:11 PM, lee <lee@××××××××.de> wrote:

>
> "Poison BL." <poisonbl@×××××.com> writes:
> > Half petabyte datasets aren't really something I'd personally *ever*
> > trust ftp with in the first place.
>
> Why not? (12GB are nowhere close to half a petabyte ...)

Ah... I completely misread that "or over 50k files in 12GB" as 50k files
*at* 12GB each... which works out to 0.6 PB, incidentally.

> The data would come in from suppliers. There isn't really anything
> going on atm but fetching data once a month which can be like 100MB or
> 12GB or more. That's because ppl don't use ftp ...
|
Really, if you're pulling it in from third party suppliers, you tend to be
tied to whatever they offer as a method of pulling it from them (or of them
pushing it out to you), unless you're in the unique position of being able
to dictate the decision for them. From there, assuming you can push your
choice of product on them, it becomes a question of how often the same
dataset will need to be updated from the same sources, how much it changes
between updates, how secure it needs to be in transit, how much you need to
be able to trust that the source is still legitimately who you think it is,
and how much verification you need that there wasn't any corruption during
the transfer. Generic FTP has been losing favor over time because it was
built in an era when many of those questions weren't at the top of anyone's
list of concerns.
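For the corruption question specifically, even plain FTP can be shored up a
bit if the supplier publishes checksums alongside the data. A minimal sketch
(the filenames are made up for illustration; the `echo` stands in for the
actual download):

```shell
# Supplier publishes a checksum file next to the dataset; you verify it
# on your side after the transfer completes.
echo "monthly data" > dataset.tar             # stand-in for the real file
sha256sum dataset.tar > dataset.tar.sha256    # supplier side
sha256sum -c dataset.tar.sha256               # your side, after download
```

That catches corruption in transit, though of course it says nothing about
who you were talking to, which is where the SSH-based tools come in.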
|
SFTP (or SCP) (as long as keys are handled properly) allows for pretty
solid certainty that a) both ends of the connection are who they say they
are, b) those two ends are the only ones reading the data in transit, and
c) the data that was sent is the same as the data that was received (simply
as a side benefit of the encryption/decryption). Rsync over SSH gives the
same set of benefits, reduces the bandwidth used for updating the dataset
(when it's the same dataset, at least), and will also verify that the data
on both ends (as it exists on disk) matches. If you're particularly lucky,
the data might even be just the right kind to benefit from the in-line
compression you can turn on with SSH, too, cutting down the actual amount
of bandwidth you burn through for each transfer.
|
If your suppliers all have *nix based systems available, those are also
standard tools that they'll have on hand. If they're strictly Windows
shops, SCP/SFTP are still readily available, though they aren't built into
the OS... rsync gets a bit trickier.
|
> > How often does it need moved in/out of your facility, and is there no way
> > to break up the processing into smaller chunks than a 0.6PB mass of files?
> > Distribute out the smaller pieces with rsync, scp, or the like, operate on
> > them, and pull back in the results, rather than trying to shift around the
> > entire set. There's a reason Amazon will send a physical truck to a site to
> > import large datasets into glacier... ;)
>
> Amazon has trucks? Perhaps they do in other countries. Here, amazon is
> just another web shop. They might have some delivery vans, but I've
> never seen one, so I doubt it. And why would anyone give them their
> data? There's no telling what they would do with it.
|
Amazon's also one of the best known cloud computing suppliers on the planet
(AWS = Amazon Web Services). They have everything from pure compute
offerings to cloud storage geared towards *large* data archival. The latter
offering is named "glacier", and they offer a service for the import of
data into it (usually for the "first pass"; incremental changes are
generally done over the wire) that consists of a shipping truck with a
rather nifty storage system in the back that they hook right into your
network. You fill it with data, and then they drive it back to one of
their data centers to load it into place.
|
--
Poison [BLX]
Joshua M. Murphy