In our previous post we introduced Cloudera's Hadoop and our partnership; now we move on to HDFS (Hadoop Distributed File System). This series is divided into five parts. Part 1 covers the introduction, features, and pros and cons.
What is HDFS?
HDFS is a filesystem that distributes and stores files across a large distributed environment. In other words, it divides a large file into blocks① of 64 MB or 128 MB, then distributes, replicates, and stores those blocks across the nodes of a cluster.
A key aim of HDFS is to deliver high processing throughput for large data sets while preventing data loss. To achieve this, HDFS splits a large file into blocks, makes copies of each block, and stores the replicas on different nodes, so the data remains accessible from other nodes even if one node fails. HDFS does not support in-place updates: when a file changes, the existing file is deleted and the new data is written as a new file.
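As a rough illustration of the block mechanism described above, the following sketch (a toy model, not HDFS's actual implementation; the 300 MB file size is made up) shows how a file is divided into fixed-size blocks:

```python
# Toy sketch of HDFS-style block splitting; not the real implementation.
BLOCK_SIZE = 128 * 1024 * 1024  # one of the default block sizes, 128 MB

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (block_id, offset, length) tuples covering a file."""
    blocks = []
    offset, block_id = 0, 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((block_id, offset, length))
        offset += length
        block_id += 1
    return blocks

# A hypothetical 300 MB file: two full 128 MB blocks plus a 44 MB remainder.
blocks = split_into_blocks(300 * 1024 * 1024)
print(len(blocks))  # 3
```

Note that the last block is simply shorter than the block size; HDFS likewise does not pad a short final block to the full block size.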
HDFS is a core module of Hadoop, a distributed processing framework for large data sets. Hadoop is open-source software written in Java; version 1.0 was released in 2011, and version 2.7.2 was released in January 2016. The components changed between versions 1.x and 2.x, and the figures below compare them.
Features
- Block (chunk)-based storage: files are split and stored in fixed-size units (e.g., 64 MB)
- Distributed filesystem: blocks are distributed and stored across multiple nodes
- Replication: each block is replicated to several nodes, so operation can continue without interruption when a node fails
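The replication feature above can be sketched with a toy placement policy. The round-robin assignment and node names here are illustrative assumptions; real HDFS uses a rack-aware placement policy:

```python
# Toy replica placement: each block goes to `replication` distinct nodes,
# so losing any single node never loses a block. Round-robin placement
# and the node names are illustrative only, not HDFS's actual policy.
from itertools import cycle, islice

def place_replicas(block_ids, nodes, replication=3):
    ring = cycle(nodes)  # shared iterator: consecutive slices rotate through nodes
    return {b: list(islice(ring, replication)) for b in block_ids}

nodes = ["node1", "node2", "node3", "node4"]
placement = place_replicas(range(4), nodes)
print(placement[0])  # ['node1', 'node2', 'node3']
```

Because each block's three replicas land on three distinct nodes, any single-node failure still leaves two live copies of every block.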
Pros and Cons
Pros
- Linear scalability: with HDFS you can provision only the storage you currently need and add capacity later as demand grows
- Global namespace: every file has a single, unique identifier across the cluster
- Higher overall throughput: because HDFS uses the disks of distributed servers, network and disk I/O load is spread across the servers and processed in parallel
Cons
- Application-level filesystem: you cannot use ordinary commands such as ls, cp, and rm as with general filesystems. To use such commands, HDFS must be mounted through FUSE (Filesystem in Userspace)②.
- Immutable files only: HDFS assumes that once a file is written, it cannot be changed.
- Namenode memory limits namespace size: HDFS keeps namespace information (directory names, file names, etc.) in the namenode's memory, so the number of files and directories that can be stored in Hadoop is limited by that memory. The namenode uses a few hundred bytes to store each file's metadata.
- Namenode SPOF③ problem: HDFS has a single point of failure (SPOF). A datanode failure does not critically affect the service, but if the namenode fails, the entire filesystem goes down.
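The namenode memory limit above can be made concrete with back-of-the-envelope arithmetic. The 150-bytes-per-object figure used here is a commonly cited rough estimate, not an exact HDFS constant:

```python
# Rough estimate of how many namespace objects (files, directories,
# blocks) fit in a given namenode heap. 150 bytes per object is a
# commonly quoted ballpark figure, not an exact value.
BYTES_PER_OBJECT = 150

def max_namespace_objects(heap_gb):
    return heap_gb * 1024**3 // BYTES_PER_OBJECT

# A 16 GB heap holds on the order of a hundred million objects,
# which is why very many small files are a poor fit for HDFS.
print(f"{max_namespace_objects(16):,}")
```

Under this estimate, doubling the number of files doubles namenode memory use regardless of file size, so a few large files are far cheaper to track than many small ones.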
① HDFS block: the minimum unit of reads and writes in HDFS. A file is distributed and stored in block units so it can be saved efficiently.
② FUSE: Filesystem in Userspace. It lets a filesystem be implemented in user space and mounted like an ordinary UNIX filesystem.
③ SPOF: Single Point of Failure. Because the namenode holds the metadata for every block, a namenode failure causes a problem across the entire HDFS.
BITNINE GLOBAL INC., THE COMPANY SPECIALIZING IN GRAPH DATABASE