Gentoo Archives: gentoo-user

From: Kerin Millar <kerframil@×××××××××××.uk>
To: gentoo-user@l.g.o
Subject: Re: [gentoo-user] Re: File system testing
Date: Fri, 19 Sep 2014 15:21:48
Message-Id: 541C4A01.10505@fastmail.co.uk
In Reply to: Re: [gentoo-user] Re: File system testing by Alec Ten Harmsel
On 18/09/2014 14:12, Alec Ten Harmsel wrote:
>
> On 09/18/2014 05:17 AM, Kerin Millar wrote:
>> On 17/09/2014 21:20, Alec Ten Harmsel wrote:
>>> As far as HDFS goes, I would only set that up if you will use it for
>>> Hadoop or related tools. It's highly specific, and the performance is
>>> not good unless you're doing a massively parallel read (what it was
>>> designed for). I can elaborate why if anyone is actually interested.
>>
>> I, for one, am very interested.
>>
>> --Kerin
>>
>
> Alright, here goes:
>
> Rich Freeman wrote:
>
>> FYI - one very big limitation of hdfs is its minimum filesize is
>> something huge like 1MB or something like that. Hadoop was designed
>> to take a REALLY big input file and chunk it up. If you use hdfs to
>> store something like /usr/portage it will turn into the sort of
>> monstrosity that you'd actually need a cluster to store.
>
> This is exactly correct, except we run with a block size of 128MB, and a large cluster will typically have a block size of 256MB or even 512MB.
>
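For reference, the block size appears to be just a client-side setting - dfs.blocksize in recent Hadoop releases - and it can even be chosen per file at creation time. A minimal sketch against the Java API; the path, sizes and class name below are made up for illustration:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockSizeDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Override the cluster-wide default (normally set in hdfs-site.xml).
            conf.setLong("dfs.blocksize", 256L * 1024 * 1024);   // 256MB

            FileSystem fs = FileSystem.get(conf);

            // Or pick a block size for a single file via the long-form create():
            // create(path, overwrite, bufferSize, replication, blockSize)
            FSDataOutputStream out = fs.create(new Path("/tmp/example.dat"),
                    true, 4096, (short) 3, 512L * 1024 * 1024);
            out.writeBytes("hello\n");
            out.close();
            fs.close();
        }
    }
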
> HDFS has two main components: a NameNode, which keeps track of which blocks are part of which file (in memory), and the DataNodes, which actually store the blocks. No data ever flows through the NameNode; it only negotiates transfers between the client and the DataNodes, and does the same for jobs. Since the NameNode stores metadata in memory, small files are bad because RAM gets wasted.
>
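That matches my reading of the client API: the NameNode answers metadata queries from its in-memory map, and the bytes themselves come straight from the DataNodes. A rough sketch, assuming a file already sitting in HDFS (the example path and class name are invented):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class WhereAreMyBlocks {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path p = new Path(args[0]);              // e.g. /data/big.log
            FileStatus st = fs.getFileStatus(p);

            // Answered by the NameNode from its in-memory metadata.
            for (BlockLocation b : fs.getFileBlockLocations(st, 0, st.getLen())) {
                System.out.printf("offset=%d length=%d hosts=%s%n",
                        b.getOffset(), b.getLength(), String.join(",", b.getHosts()));
            }
            // fs.open(p) would then stream the blocks from those hosts directly,
            // never routing data through the NameNode.
            fs.close();
        }
    }
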
> What exactly is Hadoop/HDFS used for? The most common uses are generating search indices on data (a batch job), doing non-realtime processing of log streams and/or data streams (another batch job), and allowing a large number of analysts to run disparate queries on the same large dataset (yet another batch job). Batch processing - processing the entire dataset - is really where Hadoop shines.
>
> When you put a file into HDFS, it gets split based on the block size. This is done so that a parallel read will be really fast - each map task reads in a single block and processes it. Ergo, if you put in a 1GB file with a 128MB block size and run a MapReduce job, 8 map tasks will be launched; if you put in a 1TB file, 8192 tasks will be launched. Tuning the block size is a trade-off between the overhead of launching many tasks and potentially under-utilizing the cluster. Typically, a cluster with a lot of data has a bigger block size.
>
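Just to spell out the arithmetic - assuming the default of one input split (and hence one map task) per block, the task count is simply the file size divided by the block size, rounded up:

    public class MapTaskCount {
        // One map task per block, assuming the default one-split-per-block policy.
        static long mapTasks(long fileSize, long blockSize) {
            return (fileSize + blockSize - 1) / blockSize;   // ceiling division
        }

        public static void main(String[] args) {
            final long MB = 1024L * 1024, GB = 1024 * MB, TB = 1024 * GB;
            System.out.println(mapTasks(1 * GB, 128 * MB));  // 8
            System.out.println(mapTasks(1 * TB, 128 * MB));  // 8192
            System.out.println(mapTasks(1 * TB, 512 * MB));  // 2048: fewer, longer tasks
        }
    }
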
> The downsides of HDFS:
> * Seeked reads are not supported, afaik, because no one needs them for batch processing
> * Seeked writes into an existing file are not supported, because either blocks would be added in the middle of a file and wouldn't be 128MB, or existing blocks would be edited, resulting in blocks larger than 128MB. Both of these scenarios are bad.
>
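On the first point, my understanding is that the client API does expose seeking on the read path - FSDataInputStream is Seekable - it just isn't something the batch model relies on; random writes, on the other hand, really are off the table (files are create/append only). A sketch, assuming a file large enough for the offset below (path and class name invented):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SeekRead {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            byte[] buf = new byte[4096];

            try (FSDataInputStream in = fs.open(new Path(args[0]))) {
                long offset = 128L * 1024 * 1024;
                in.seek(offset);               // jump one block in; assumes the file is larger than this
                int n = in.read(buf);          // read from that offset
                System.out.println("read " + n + " bytes starting at " + offset);
            }
            fs.close();
        }
    }
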
> Since HDFS users typically do not need seeked reads or seeked writes, these downsides aren't really a big deal.
>
> If something's not clear, let me know.

Thank you for taking the time to explain.

--Kerin