# HDFS
- Horizontal scaling: As the data grows, more and more machines are required to hold it. HDFS is specifically designed to handle adding machines to the cluster, and it scales roughly linearly: if you double the number of DataNodes, you roughly double the storage capacity.
- Fault-tolerance: As more machines are added to the cluster, it becomes more and more likely that one of them fails or is momentarily unreachable due to a network partition. To account for this, Hadoop splits files into "blocks" and replicates each block onto several DataNodes, so a copy of the data remains available even when a machine goes down.
- Parallel processing: Since the data is distributed in blocks across several machines, it can be processed by multiple machines at once with a parallel processing framework like MapReduce.
HDFS is a higher-level abstraction that allows us to ignore the underlying details of blocks, replication, etc. and is relatively simple to use. Still, it’s important to understand the basic concepts of the architecture in order to troubleshoot and optimize the design of your data system.
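If you want to try these ideas out, the quickest route is the `hdfs dfs` command-line interface. The Python sketch below simply wraps that CLI with `subprocess`; it assumes the `hdfs` client is on your PATH and a cluster (or a pseudo-distributed install) is running, and the `/user/demo` directory and `example.txt` file are placeholders for illustration.

```python
import subprocess

def hdfs(*args):
    """Run an `hdfs dfs` shell command and return its stdout as text."""
    result = subprocess.run(["hdfs", "dfs", *args],
                            capture_output=True, text=True, check=True)
    return result.stdout

# Copy a local file into HDFS, then list the directory and read the file back.
hdfs("-mkdir", "-p", "/user/demo")
hdfs("-put", "-f", "example.txt", "/user/demo/example.txt")
print(hdfs("-ls", "/user/demo"))
print(hdfs("-cat", "/user/demo/example.txt"))
```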
## HDFS Architecture

HDFS uses a centralized design with many "worker" DataNodes that store the data and one "master" NameNode that stores the metadata about which DataNodes hold which data. Each file is split into blocks (128 MB chunks by default), and a copy of each block is stored on multiple machines (3 by default). The NameNode also monitors the health of the DataNodes by expecting each one to check in with a "heartbeat" periodically (every 3 seconds by default). If a DataNode is down for too long, its blocks are re-replicated from the remaining copies onto other DataNodes.
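To make those defaults concrete, here is a small back-of-the-envelope calculation (not any official Hadoop API; the helper name and constants are just illustrative) of how many blocks and block copies a file occupies under a 128 MB block size and 3x replication:

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024   # default block size (dfs.blocksize)
REPLICATION = 3                  # default replication factor (dfs.replication)

def hdfs_footprint(file_size_bytes):
    """Return (blocks, block copies, raw bytes on disk) for a file of the given size."""
    blocks = math.ceil(file_size_bytes / BLOCK_SIZE)
    return blocks, blocks * REPLICATION, file_size_bytes * REPLICATION

# Example: a 1 GB file
blocks, copies, raw = hdfs_footprint(1 * 1024**3)
print(f"{blocks} blocks, {copies} block copies, {raw / 1024**3:.1f} GB of raw disk used")
# -> 8 blocks, 24 block copies, 3.0 GB of raw disk used
```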
## Best Practices with HDFS

There are a few best practices that aren't enforced by HDFS itself, but will help you optimize your usage.
### Avoiding small files

Every file in HDFS has its own metadata stored by the NameNode, which makes it inefficient to store lots of small files rather than a few big ones. At the same time, Hadoop needs to be able to split files into blocks in order to process them in parallel.
To avoid the small-file inefficiency, files are usually combined and compressed before being "ingested" into HDFS, but not all compression formats work well with Hadoop. LZO compression produces files that can be split into blocks for MapReduce, but it cannot be distributed with Hadoop because it is licensed under the GNU General Public License. Google's Snappy compression isn't inherently splittable, but it works with container formats like SequenceFiles and Avro. Ingestion is an important topic that deserves its own Dev; check out the first two chapters of Hadoop Application Architectures (in the Dropbox) for a good overview.
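As a rough illustration of the "combine before ingesting" idea, the sketch below concatenates many small local log files into files of roughly one block in size before they are pushed to HDFS. The directory names and `.log` extension are hypothetical; in practice an ingestion tool writing a container format (SequenceFile, Avro) would usually do this for you.

```python
from pathlib import Path

TARGET_SIZE = 128 * 1024 * 1024  # aim for roughly one HDFS block per output file

def combine_small_files(input_dir, output_dir):
    """Concatenate many small files into a few large ones before ingestion."""
    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    chunk_index, written, out = 0, 0, None
    for small_file in sorted(Path(input_dir).glob("*.log")):
        if out is None or written >= TARGET_SIZE:
            if out:
                out.close()
            out = open(out_dir / f"combined-{chunk_index:04d}.log", "wb")
            chunk_index, written = chunk_index + 1, 0
        data = small_file.read_bytes()
        out.write(data)
        written += len(data)
    if out:
        out.close()

# Hypothetical local directories; the combined files would then be -put into HDFS.
combine_small_files("raw_logs", "combined_logs")
```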
### Never delete raw data

In the above examples we deleted files because they were trivial examples. In practice, disk space on a real cluster keeps getting cheaper, but the cost of accidentally deleting a file you need remains high. Even data with a few mistakes may be useful later for troubleshooting. It's generally a best practice to hold on to all data, especially raw data, and simply move undesired data into a discard directory until you're absolutely sure you don't need it or you need the space.
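One lightweight way to follow this practice is to script a "discard" move rather than ever deleting directly. The sketch below wraps the real `hdfs dfs -mv` and `-mkdir` commands; the `/data/discard` location and the example path are hypothetical.

```python
import subprocess

DISCARD_DIR = "/data/discard"  # hypothetical holding area for data you think you don't need

def discard(hdfs_path):
    """Move a path into a discard directory instead of deleting it outright."""
    subprocess.run(["hdfs", "dfs", "-mkdir", "-p", DISCARD_DIR], check=True)
    subprocess.run(["hdfs", "dfs", "-mv", hdfs_path, DISCARD_DIR], check=True)

discard("/data/tweets_with_bad_encoding")
```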
### Vertical Partitioning

Hadoop typically reads and writes whole directories rather than individual files (as we saw in the generated-data example). For this reason, storing all your data in one big directory is a bad practice. Instead, create a directory for each type of data you have. For example, you might store the tweets of all your users in a directory named `/data/tweets` and the ages of the users in another directory named `/data/age`. You can further separate the tweets by month with directories like `/data/tweets/01-2015` for January, `/data/tweets/02-2015` for February, and so on, which allows you to process only the time periods you're interested in. You can read more about vertical partitioning in Sections 4.5 and 4.6 (pp. 62-63) of Nathan Marz's Big Data (in the library and Dropbox).
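A partitioning scheme like this is easy to automate. The sketch below builds the month-level directory from a date and pushes a local file into it with the standard `hdfs dfs -mkdir -p` and `-put` commands; the file name and layout simply mirror the `/data/tweets/01-2015` example above.

```python
import subprocess
from datetime import date

def tweet_partition(day):
    """Return the month-level directory for a batch of tweets, e.g. /data/tweets/01-2015."""
    return f"/data/tweets/{day:%m-%Y}"

def ingest_tweets(local_file, day):
    target_dir = tweet_partition(day)
    subprocess.run(["hdfs", "dfs", "-mkdir", "-p", target_dir], check=True)
    subprocess.run(["hdfs", "dfs", "-put", "-f", local_file, target_dir], check=True)

# Hypothetical local file containing one day's tweets.
ingest_tweets("tweets-2015-01-15.json", date(2015, 1, 15))
```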
## Alternatives to HDFS

There are many alternatives to HDFS. Most of them offer much the same functionality, and they all have similar underlying architectures:
### Amazon S3

Amazon Simple Storage Service (S3) is Amazon's proprietary storage service, which plays the role of HDFS on AWS. S3 is fairly easy to use and integrates well with other AWS services like EMR, Kinesis, and Redshift. We'll work more with S3 later when we learn those tools.

### PFS

[The Pachyderm File System](http://www.pachyderm.io/) (PFS) is a new Dockerized version of HDFS that is lightweight for simple deployment. We will be learning more about Pachyderm and Docker on Wednesday of Week 1.

### Tachyon

[Tachyon](http://www.alluxio.org/) is a newer distributed file system optimized for memory rather than hard disk, from the same Berkeley group (AMPLab) that developed Spark and Mesos. Tachyon is compatible with popular tools like Hadoop MapReduce and Spark, and claims to be much faster.

### Pail

[Pail](https://github.com/nathanmarz/dfs-datastores) is a higher-level abstraction that runs on top of HDFS to enforce many of the best practices discussed above, like vertical partitioning, combining small files, and data serialization. You can learn more about Pail in Chapter 5 of Big Data (pp. 65-82).

### Apache Ignite

[Apache Ignite](http://ignite.apache.org/) is a brand-new in-memory platform that offers many of the same advantages as Tachyon and claims to accelerate all types of data processing. At this point, Ignite is bleeding edge but offers some promising advantages for the future.