
Here is the HDFS wiki page!

1. Requirements
2. Hadoop Distributed File System
3. HDFS Architecture
    3.1. Getting Started with a small file
    3.2. Generate and examine large files
4. Best Practices with HDFS
    4.1. Avoiding small files
    4.2. Never delete raw data
    4.3. Vertical Partitioning
5. Alternatives to HDFS
    5.1. Amazon S3
    5.2. PFS
    5.3. Tachyon
    5.4. Pail
    5.5. Apache Ignite

Requirements

A Hadoop cluster from the Hadoop Intro Dev, with 1 NameNode and at least 3 DataNodes
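As a quick sanity check (assuming the Hadoop binaries are on your PATH and you run this on a node that has the cluster configuration), the following command reports the NameNode's view of the cluster; the "Live datanodes" count should be at least 3:

```bash
# Summarize cluster capacity and list the live DataNodes with their storage usage.
hdfs dfsadmin -report
```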

Hadoop Distributed File System

The Hadoop Distributed File System (HDFS) is a data store for general-purpose data. In many ways, HDFS is similar to a Unix-based file system (e.g. Linux, macOS), which has commands like ls, mkdir, cd, rm, etc.; a few example commands appear after the list below. The key difference is that HDFS distributes data across multiple machines, for several reasons:

  • Horizontal scaling: As the data grows, more and more machines are required to hold it. HDFS is specifically designed to handle adding machines to the cluster, and it scales roughly linearly: double the number of DataNodes in the cluster and you roughly double the storage capacity.

  • Fault-tolerance: As more machines are added to the cluster, it becomes more and more likely that one of them fails or is momentarily unavailable due to a network partition. To account for this, Hadoop splits files into “blocks” and replicates each block onto several DataNodes, so that a copy of every block is very likely to remain available.

  • Parallel processing: Since the data is distributed in blocks across several machines, the data can be processed by multiple machines at once with a parallel processing framework like MapReduce.
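To make the Unix analogy concrete, here is a minimal sketch of everyday HDFS shell commands. The paths and file names (/user/ubuntu, local_file.txt) are placeholders for illustration and assume the Hadoop binaries are on your PATH:

```bash
# List the contents of an HDFS directory (analogous to ls)
hdfs dfs -ls /user/ubuntu

# Create a directory, including parents (analogous to mkdir -p)
hdfs dfs -mkdir -p /user/ubuntu/example

# Copy a file from the local file system into HDFS, then read it back
hdfs dfs -put local_file.txt /user/ubuntu/example/
hdfs dfs -cat /user/ubuntu/example/local_file.txt

# Remove a file (analogous to rm)
hdfs dfs -rm /user/ubuntu/example/local_file.txt
```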

HDFS is a higher-level abstraction that allows us to ignore the underlying details of blocks, replication, etc. and is relatively simple to use. Still, it’s important to understand the basic concepts of the architecture in order to troubleshoot and optimize the design of your data system.

HDFS Architecture

HDFS uses a centralized setup: several “worker” DataNodes store all the data, and one “master” NameNode stores the metadata about which DataNodes hold which data. HDFS splits each file into blocks (128 MB chunks by default) and stores a copy of each block on multiple machines (3 by default). The NameNode also checks that the DataNodes are healthy by waiting for each of them to check in with a “heartbeat” (every 3 seconds by default). If a DataNode is down for too long, the NameNode re-replicates its blocks from the remaining copies onto other DataNodes.
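These defaults are all configurable. As a hedged sketch, the commands below inspect the configured values and show how a particular file's blocks are placed across DataNodes; the file path is a placeholder, and the property names assume a Hadoop 2.x installation:

```bash
# Print the configured block size (bytes), replication factor, and heartbeat interval (seconds).
# Defaults: 134217728 (128 MB), 3, and 3 respectively.
hdfs getconf -confKey dfs.blocksize
hdfs getconf -confKey dfs.replication
hdfs getconf -confKey dfs.heartbeat.interval

# List the blocks that make up a file and the DataNodes holding each replica.
# /user/ubuntu/example.txt is a hypothetical path used only for illustration.
hdfs fsck /user/ubuntu/example.txt -files -blocks -locations
```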
