Gentoo Archives: gentoo-user

From: Alec Ten Harmsel <alec@××××××××××××××.com>
To: gentoo-user@l.g.o
Subject: Re: [gentoo-user] Re: File system testing
Date: Thu, 18 Sep 2014 13:12:46
Message-Id: 541ADA46.4010307@alectenharmsel.com
In Reply to: Re: [gentoo-user] Re: File system testing by Kerin Millar
On 09/18/2014 05:17 AM, Kerin Millar wrote:
> On 17/09/2014 21:20, Alec Ten Harmsel wrote:
>> As far as HDFS goes, I would only set that up if you will use it for
>> Hadoop or related tools. It's highly specific, and the performance is
>> not good unless you're doing a massively parallel read (what it was
>> designed for). I can elaborate why if anyone is actually interested.
>
> I, for one, am very interested.
>
> --Kerin
>

Alright, here goes:

Rich Freeman wrote:

> FYI - one very big limitation of hdfs is its minimum filesize is
> something huge like 1MB or something like that. Hadoop was designed
> to take a REALLY big input file and chunk it up. If you use hdfs to
> store something like /usr/portage it will turn into the sort of
> monstrosity that you'd actually need a cluster to store.

This is exactly correct, except we run with a block size of 128MB, and a large cluster will typically use 256MB or even 512MB.
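
To make the numbers concrete: the block size isn't a property of the filesystem as a whole, it's chosen per file at write time, with a cluster-wide default in hdfs-site.xml. Rough, untested sketch against the Java client API - the property name and sizes are just the usual Hadoop 2.x defaults, not anything specific to our cluster:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Cluster-wide default, normally set in hdfs-site.xml (128MB here).
        conf.setLong("dfs.blocksize", 128L * 1024 * 1024);
        FileSystem fs = FileSystem.get(conf);

        // Or pick a block size for one particular file at create time:
        // create(path, overwrite, bufferSize, replication, blockSize)
        FSDataOutputStream out = fs.create(new Path("/data/big-input.txt"),
                true, 4096, (short) 3, 256L * 1024 * 1024);
        out.writeBytes("one record\n");
        out.close();
    }
}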

HDFS has two main components: a NameNode, which keeps track of which blocks belong to which file (in memory), and the DataNodes, which actually store the blocks. No file data ever flows through the NameNode; it just tells clients and jobs which DataNodes hold which blocks, and the reads and writes themselves go directly to the DataNodes. Since the NameNode keeps all of this metadata in memory, lots of small files are bad because each one wastes NameNode RAM.
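
To make the "no data through the NameNode" part concrete: a client asks the NameNode where a file's blocks live, gets back nothing but metadata, and then streams the bytes directly from the DataNodes. (The usual rule of thumb is that every file, block and directory costs the NameNode somewhere on the order of 150 bytes of heap, which is why millions of tiny files hurt.) Rough, untested sketch; the path is made up:

import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WhereAreMyBlocks {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path path = new Path("/data/big-input.txt");   // made-up path
        FileStatus status = fs.getFileStatus(path);

        // Answered from the NameNode's in-memory metadata; no file data moves.
        BlockLocation[] blocks =
                fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + Arrays.toString(block.getHosts()));
        }
    }
}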

What exactly is Hadoop/HDFS used for? The most common uses are generating search indices over data, doing non-realtime processing of log and/or data streams, and letting a large number of analysts run disparate queries over the same large dataset - all batch jobs. Batch processing - processing the entire dataset - is really where Hadoop shines.

When you put a file into HDFS, it gets split based on the block size. This is done so that a parallel read will be really fast - each map task reads in a single block and processes it. Ergo, if you put in a 1GB file with a 128MB block size and run a MapReduce job, 8 map tasks will be launched. If you put in a 1TB file, 8192 tasks would be launched. Tuning the block size matters because it balances the overhead of launching lots of tasks against potentially under-utilizing the cluster. Typically, a cluster with a lot of data has a bigger block size.
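
In other words, the number of map tasks for a file is basically ceil(file size / block size). Trivial sketch of just that arithmetic, using the numbers above:

public class SplitCount {
    // Roughly how many map tasks a single file turns into.
    static long mapTasks(long fileSizeBytes, long blockSizeBytes) {
        return (fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024;
        long blockSize = 128 * mb;
        System.out.println(mapTasks(1024 * mb, blockSize));        // 1GB -> 8
        System.out.println(mapTasks(1024 * 1024 * mb, blockSize)); // 1TB -> 8192
    }
}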

The downsides of HDFS:
* Seeked (random) reads are, as far as I know, only poorly supported, because no one needs them for batch processing
* Seeked writes into an existing file are not supported, because either blocks would have to be inserted in the middle of the file and wouldn't be 128MB, or existing blocks would have to be edited and would grow past 128MB. Both of these scenarios are bad.

Since HDFS users typically do not need seeked reads or seeked writes, these downsides aren't really a big deal.
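
To illustrate the write model: through the client API you can create a file and, on a recent enough HDFS, append to the end of it, but there is no way to seek an output stream back into the middle of an existing file. Rough, untested sketch; the path is made up:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AppendOnly {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path log = new Path("/logs/app.log");   // made-up path

        // Write once...
        FSDataOutputStream out = fs.create(log, true);
        out.writeBytes("first record\n");
        out.close();

        // ...and the only way to change the file later is to add to the end.
        FSDataOutputStream more = fs.append(log);
        more.writeBytes("second record\n");
        more.close();

        // There is no more.seek(...); FSDataOutputStream can't rewind, so
        // editing the middle of an existing block simply isn't possible.
    }
}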

If something's not clear, let me know.

Alec

Replies:
Re: [gentoo-user] Re: File system testing - Kerin Millar <kerframil@×××××××××××.uk>