# HDFS Architecture
[TOC]
## 1. Introduction
The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high throughput access to application data and is suitable for applications that have large data sets. HDFS relaxes a few POSIX requirements to enable streaming access to file system data. HDFS was originally built as infrastructure for the Apache Nutch web search engine project. HDFS is part of the Apache Hadoop Core project. The project URL is [http://hadoop.apache.org/](http://hadoop.apache.org/).
## 2. Assumptions and Goals
### Hardware Failure
Hardware failure is the norm rather than the exception. An HDFS instance may consist of hundreds or thousands of server machines, each storing part of the file system’s data. The fact that there are a huge number of components and that each component has a non-trivial probability of failure means that some component of HDFS is always non-functional. Therefore, detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS.
### Streaming Data Access
Applications that run on HDFS need streaming access to their data sets. They are not general purpose applications that typically run on general purpose file systems. HDFS is designed more for batch processing rather than interactive use by users. The emphasis is on high throughput of data access rather than low latency of data access. POSIX imposes many hard requirements that are not needed for applications that are targeted for HDFS. POSIX semantics in a few key areas has been traded to increase data throughput rates.
### Large Data Sets
Applications that run on HDFS have large data sets. A typical file in HDFS is gigabytes to terabytes in size. Thus, HDFS is tuned to support large files. It should provide high aggregate data bandwidth and scale to hundreds of nodes in a single cluster. It should support tens of millions of files in a single instance.
### Simple Coherency Model
HDFS applications need a write-once-read-many access model for files. A file once created, written, and closed need not be changed except for appends and truncates. Appending content to the end of a file is supported, but files cannot be updated at an arbitrary point. This assumption simplifies data coherency issues and enables high throughput data access. A MapReduce application or a web crawler application fits perfectly with this model.
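As a rough illustration of this model, the sketch below creates a file once and later appends to it; it assumes a reachable cluster whose `fs.defaultFS` points at the NameNode, and the path `/data/events.log` is made up for the example.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteOnceAppend {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);       // assumes fs.defaultFS points at the cluster
        Path file = new Path("/data/events.log");   // hypothetical path

        // Write once: create, write, close.
        try (FSDataOutputStream out = fs.create(file, /* overwrite */ true)) {
            out.writeBytes("first batch of records\n");
        }

        // Later: append to the end of the file (very old releases needed append support enabled).
        try (FSDataOutputStream out = fs.append(file)) {
            out.writeBytes("more records appended later\n");
        }
        // There is no API for rewriting bytes at an arbitrary offset of an existing file.
        fs.close();
    }
}
```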
### “Moving Computation is Cheaper than Moving Data”
A computation requested by an application is much more efficient if it is executed near the data it operates on. This is especially true when the size of the data set is huge. This minimizes network congestion and increases the overall throughput of the system. The assumption is that it is often better to migrate the computation closer to where the data is located rather than moving the data to where the application is running. HDFS provides interfaces for applications to move themselves closer to where the data is located.
### Portability Across Heterogeneous Hardware and Software Platforms
HDFS has been designed to be easily portable from one platform to another. This facilitates widespread adoption of HDFS as a platform of choice for a large set of applications.
## 3. NameNode and DataNodes
HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients. In addition, there are a number of DataNodes, usually one per node in the cluster, which manage storage attached to the nodes that they run on. HDFS exposes a file system namespace and allows user data to be stored in files. Internally, a file is split into one or more blocks and these blocks are stored in a set of DataNodes. The NameNode executes file system namespace operations like opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes. The DataNodes are responsible for serving read and write requests from the file system's clients. The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode.

The NameNode and DataNode are pieces of software designed to run on commodity machines. These machines typically run a GNU/Linux operating system (OS). HDFS is built using the Java language; any machine that supports Java can run the NameNode or the DataNode software. Usage of the highly portable Java language means that HDFS can be deployed on a wide range of machines. A typical deployment has a dedicated machine that runs only the NameNode software. Each of the other machines in the cluster runs one instance of the DataNode software. The architecture does not preclude running multiple DataNodes on the same machine but in a real deployment that is rarely the case.
The existence of a single NameNode in a cluster greatly simplifies the architecture of the system. The NameNode is the arbitrator and repository for all HDFS metadata. The system is designed in such a way that user data never flows through the NameNode.
## 4. The File System Namespace
HDFS supports a traditional hierarchical file organization. A user or an application can create directories and store files inside these directories. The file system namespace hierarchy is similar to most other existing file systems; one can create and remove files, move a file from one directory to another, or rename a file. HDFS supports [user quotas](http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsQuotaAdminGuide.html) and [access permissions](http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsPermissionsGuide.html). HDFS does not support hard links or soft links. However, the HDFS architecture does not preclude implementing these features.
The NameNode maintains the file system namespace. Any change to the file system namespace or its properties is recorded by the NameNode. An application can specify the number of replicas of a file that should be maintained by HDFS. The number of copies of a file is called the replication factor of that file. This information is stored by the NameNode.
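A minimal sketch of these namespace operations through the Java `FileSystem` API; the directory and file names below are invented for illustration only.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NamespaceOps {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        fs.mkdirs(new Path("/user/alice/raw"));                                   // create a directory
        fs.rename(new Path("/user/alice/raw"), new Path("/user/alice/staging"));  // move/rename

        for (FileStatus st : fs.listStatus(new Path("/user/alice"))) {            // list a directory
            System.out.println(st.getPath() + " replication=" + st.getReplication());
        }
        fs.close();
    }
}
```

Every one of these calls turns into a namespace operation executed by the NameNode; no user data moves for any of them.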
## 5. Data Replication
HDFS is designed to reliably store very large files across machines in a large cluster. It stores each file as a sequence of blocks. The blocks of a file are replicated for fault tolerance. The block size and replication factor are configurable per file.
All blocks in a file except the last block are the same size, while users can start a new block without filling out the last block to the configured block size after the support for variable length blocks was added to append and hsync.
An application can specify the number of replicas of a file. The replication factor can be specified at file creation time and can be changed later. Files in HDFS are write-once (except for appends and truncates) and have strictly one writer at any time.
The NameNode makes all decisions regarding replication of blocks. It periodically receives a Heartbeat and a Blockreport from each of the DataNodes in the cluster. Receipt of a Heartbeat implies that the DataNode is functioning properly. A Blockreport contains a list of all blocks on a DataNode.
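A hedged sketch of setting the replication factor and block size per file at creation time, and changing the factor afterwards; the path and values are illustrative, not recommendations.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PerFileReplication {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path p = new Path("/data/large.bin");       // hypothetical file

        short replication = 3;                      // replication factor for this file
        long blockSize = 256L * 1024 * 1024;        // 256 MB blocks for this file
        int bufferSize = 4096;

        try (FSDataOutputStream out =
                 fs.create(p, true, bufferSize, replication, blockSize)) {
            out.writeBytes("payload...");
        }

        // The replication factor can be changed after the file is written;
        // the NameNode then schedules extra copies (or removals) in the background.
        fs.setReplication(p, (short) 5);
        fs.close();
    }
}
```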

### Replica Placement: The First Baby Steps
The placement of replicas is critical to HDFS reliability and performance. Optimizing replica placement distinguishes HDFS from most other distributed file systems. This is a feature that needs lots of tuning and experience. The purpose of a rack-aware replica placement policy is to improve data reliability, availability, and network bandwidth utilization. The current implementation for the replica placement policy is a first effort in this direction. The short-term goals of implementing this policy are to validate it on production systems, learn more about its behavior, and build a foundation to test and research more sophisticated policies.
Large HDFS instances run on a cluster of computers that commonly spread across many racks. Communication between two nodes in different racks has to go through switches. In most cases, network bandwidth between machines in the same rack is greater than network bandwidth between machines in different racks.
The NameNode determines the rack id each DataNode belongs to via the process outlined in [Hadoop Rack Awareness](http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/RackAwareness.html). A simple but non-optimal policy is to place replicas on unique racks. This prevents losing data when an entire rack fails and allows use of bandwidth from multiple racks when reading data. This policy evenly distributes replicas in the cluster which makes it easy to balance load on component failure. However, this policy increases the cost of writes because a write needs to transfer blocks to multiple racks.
> So it can even be rack-aware?
For the common case, when the replication factor is three, HDFS's placement policy is to put one replica on the local machine if the writer is on a datanode, otherwise on a random datanode, another replica on a node in a different (remote) rack, and the last on a different node in the same remote rack. This policy cuts the inter-rack write traffic which generally improves write performance. The chance of rack failure is far less than that of node failure; this policy does not impact data reliability and availability guarantees. However, it does reduce the aggregate network bandwidth used when reading data since a block is placed in only two unique racks rather than three. With this policy, the replicas of a file do not evenly distribute across the racks. One third of replicas are on one node, two thirds of replicas are on one rack, and the other third are evenly distributed across the remaining racks. This policy improves write performance without compromising data reliability or read performance.
If the replication factor is greater than 3, the placement of the 4th and following replicas is determined randomly while keeping the number of replicas per rack below the upper limit (which is basically `(replicas - 1) / racks + 2`).
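A small worked example of that upper bound, assuming the quoted formula uses integer division:

```java
public class RackLimit {
    // Upper limit of replicas placed on a single rack, per the formula above
    // (integer division assumed).
    static int maxReplicasPerRack(int replicas, int racks) {
        return (replicas - 1) / racks + 2;
    }

    public static void main(String[] args) {
        System.out.println(maxReplicasPerRack(3, 4));   // (3-1)/4 + 2 = 2
        System.out.println(maxReplicasPerRack(10, 3));  // (10-1)/3 + 2 = 5
    }
}
```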
Because the NameNode does not allow DataNodes to have multiple replicas of the same block, the maximum number of replicas created is the total number of DataNodes at that time.
After the support for [Storage Types and Storage Policies](http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/ArchivalStorage.html) was added to HDFS, the NameNode takes the policy into account for replica placement in addition to the rack awareness described above. The NameNode chooses nodes based on rack awareness first, then checks that the candidate node has the storage required by the policy associated with the file. If the candidate node does not have the storage type, the NameNode looks for another node. If enough nodes to place replicas cannot be found in the first path, the NameNode looks for nodes having fallback storage types in the second path.
The current, default replica placement policy described here is a work in progress.
### Replica Selection
To minimize global bandwidth consumption and read latency, HDFS tries to satisfy a read request from a replica that is closest to the reader. If there exists a replica on the same rack as the reader node, then that replica is preferred to satisfy the read request. If the HDFS cluster spans multiple data centers, then a replica that is resident in the local data center is preferred over any remote replica.
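To see which DataNodes (and racks) hold each block's replicas — the information this proximity choice is based on — a client can ask the NameNode for block locations. A sketch with a hypothetical file path:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/data/large.bin")); // hypothetical file

        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation b : blocks) {
            System.out.println("offset=" + b.getOffset()
                + " hosts=" + String.join(",", b.getHosts())             // DataNodes holding replicas
                + " topology=" + String.join(",", b.getTopologyPaths())); // network path, e.g. /rack1/host
        }
        fs.close();
    }
}
```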
### Safemode
On startup, the NameNode enters a special state called Safemode. Replication of data blocks does not occur when the NameNode is in the Safemode state. The NameNode receives Heartbeat and Blockreport messages from the DataNodes. A Blockreport contains the list of data blocks that a DataNode is hosting. Each block has a specified minimum number of replicas. A block is considered safely replicated when the minimum number of replicas of that data block has checked in with the NameNode. After a configurable percentage of safely replicated data blocks checks in with the NameNode (plus an additional 30 seconds), the NameNode exits the Safemode state. It then determines the list of data blocks (if any) that still have fewer than the specified number of replicas. The NameNode then replicates these blocks to other DataNodes.
> I didn't fully follow this part at first.
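The Safemode state can be inspected (or, for an administrator, left early) from code; a sketch using the `DistributedFileSystem` API, roughly equivalent to `hdfs dfsadmin -safemode get`, and assuming the client has admin rights on the cluster:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.HdfsConstants.SafeModeAction;

public class SafemodeCheck {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        DistributedFileSystem dfs = (DistributedFileSystem) fs;  // assumes an hdfs:// default filesystem

        // Ask the NameNode whether it is currently in Safemode.
        boolean inSafemode = dfs.setSafeMode(SafeModeAction.SAFEMODE_GET);
        System.out.println("NameNode in safemode: " + inSafemode);

        // An administrator could force it to leave early (normally it exits on its own):
        // dfs.setSafeMode(SafeModeAction.SAFEMODE_LEAVE);
        dfs.close();
    }
}
```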
## 6. The Persistence of File System Metadata
The HDFS namespace is stored by the NameNode. The NameNode uses a transaction log called the EditLog to persistently record every change that occurs to file system metadata. For example, creating a new file in HDFS causes the NameNode to insert a record into the EditLog indicating this. Similarly, changing the replication factor of a file causes a new record to be inserted into the EditLog. The NameNode uses a file in its local host OS file system to store the EditLog. The entire file system namespace, including the mapping of blocks to files and file system properties, is stored in a file called the FsImage. The FsImage is stored as a file in the NameNode's local file system too.
> So the metadata is all stored on the NameNode's local file system?
The NameNode keeps an image of the entire file system namespace and file Blockmap in memory. When the NameNode starts up, or a checkpoint is triggered by a configurable threshold, it reads the FsImage and EditLog from disk, applies all the transactions from the EditLog to the in-memory representation of the FsImage, and flushes out this new version into a new FsImage on disk. It can then truncate the old EditLog because its transactions have been applied to the persistent FsImage. This process is called a checkpoint. The purpose of a checkpoint is to make sure that HDFS has a consistent view of the file system metadata by taking a snapshot of the file system metadata and saving it to FsImage. Even though it is efficient to read a FsImage, it is not efficient to make incremental edits directly to a FsImage. Instead of modifying FsImage for each edit, we persist the edits in the EditLog. During the checkpoint the changes from the EditLog are applied to the FsImage. A checkpoint can be triggered at a given time interval (dfs.namenode.checkpoint.period) expressed in seconds, or after a given number of filesystem transactions have accumulated (dfs.namenode.checkpoint.txns). If both of these properties are set, the first threshold to be reached triggers a checkpoint.
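A hedged example of tuning those two checkpoint triggers. The property names come from the paragraph above; the values are arbitrary, and in practice they would live in `hdfs-site.xml` rather than be set programmatically.

```java
import org.apache.hadoop.conf.Configuration;

public class CheckpointConfig {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // Trigger a checkpoint at least every hour...
        conf.setLong("dfs.namenode.checkpoint.period", 3600);
        // ...or after one million namespace transactions, whichever threshold is reached first.
        conf.setLong("dfs.namenode.checkpoint.txns", 1_000_000);

        System.out.println(conf.get("dfs.namenode.checkpoint.period"));
        System.out.println(conf.get("dfs.namenode.checkpoint.txns"));
    }
}
```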
The DataNode stores HDFS data in files in its local file system. The DataNode has no knowledge about HDFS files. It stores each block of HDFS data in a separate file in its local file system. The DataNode does not create all files in the same directory. Instead, it uses a heuristic to determine the optimal number of files per directory and creates subdirectories appropriately. It is not optimal to create all local files in the same directory because the local file system might not be able to efficiently support a huge number of files in a single directory. When a DataNode starts up, it scans through its local file system, generates a list of all HDFS data blocks that correspond to each of these local files, and sends this report to the NameNode. The report is called the *Blockreport*.
## 7. The Communication Protocols
All HDFS communication protocols are layered on top of the TCP/IP protocol. A client establishes a connection to a configurable TCP port on the NameNode machine. It talks the ClientProtocol with the NameNode. The DataNodes talk to the NameNode using the DataNode Protocol. A Remote Procedure Call (RPC) abstraction wraps both the Client Protocol and the DataNode Protocol. By design, the NameNode never initiates any RPCs. Instead, it only responds to RPC requests issued by DataNodes or clients.
## 8. Robustness
The primary objective of HDFS is to store data reliably even in the presence of failures. The three common types of failures are NameNode failures, DataNode failures and network partitions.
### Data Disk Failure, Heartbeats and Re-Replication
Each DataNode sends a Heartbeat message to the NameNode periodically. A network partition can cause a subset of DataNodes to lose connectivity with the NameNode. The NameNode detects this condition by the absence of a Heartbeat message. The NameNode marks DataNodes without recent Heartbeats as dead and does not forward any new IO requests to them. Any data that was registered to a dead DataNode is not available to HDFS any more. DataNode death may cause the replication factor of some blocks to fall below their specified value. The NameNode constantly tracks which blocks need to be replicated and initiates replication whenever necessary. The necessity for re-replication may arise due to many reasons: a DataNode may become unavailable, a replica may become corrupted, a hard disk on a DataNode may fail, or the replication factor of a file may be increased.
The time-out to mark DataNodes dead is conservatively long (over 10 minutes by default) in order to avoid a replication storm caused by state flapping of DataNodes. For performance-sensitive workloads, users can configure a shorter interval to mark DataNodes as stale and avoid stale nodes when reading and/or writing.
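A sketch of the stale-node tuning mentioned above. The property names used here (`dfs.namenode.stale.datanode.interval`, `dfs.namenode.avoid.read.stale.datanode`, `dfs.namenode.avoid.write.stale.datanode`) are the ones I believe Hadoop uses for this, so treat them as an assumption to verify against your release's hdfs-default.xml.

```java
import org.apache.hadoop.conf.Configuration;

public class StaleNodeTuning {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // Mark a DataNode as stale after 30 s without a heartbeat (assumed property name).
        conf.setLong("dfs.namenode.stale.datanode.interval", 30_000);
        // Prefer non-stale nodes when choosing replicas for reads and writes (assumed property names).
        conf.setBoolean("dfs.namenode.avoid.read.stale.datanode", true);
        conf.setBoolean("dfs.namenode.avoid.write.stale.datanode", true);
    }
}
```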
### Cluster Rebalancing
The HDFS architecture is compatible with data rebalancing schemes. A scheme might automatically move data from one DataNode to another if the free space on a DataNode falls below a certain threshold. In the event of a sudden high demand for a particular file, a scheme might dynamically create additional replicas and rebalance other data in the cluster. These types of data rebalancing schemes are not yet implemented.
### Data Integrity
It is possible that a block of data fetched from a DataNode arrives corrupted. This corruption can occur because of faults in a storage device, network faults, or buggy software. The HDFS client software implements checksum checking on the contents of HDFS files. When a client creates an HDFS file, it computes a checksum of each block of the file and stores these checksums in a separate hidden file in the same HDFS namespace. When a client retrieves file contents it verifies that the data it received from each DataNode matches the checksum stored in the associated checksum file. If not, then the client can opt to retrieve that block from another DataNode that has a replica of that block.
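Checksum verification happens transparently inside the client on every read. For an explicit end-to-end check, a client can also request a whole-file checksum, as in this sketch (the path is hypothetical):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileChecksum;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FileChecksumExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // Aggregated checksum computed from the per-block checksums stored alongside the data.
        FileChecksum sum = fs.getFileChecksum(new Path("/data/large.bin")); // hypothetical file
        if (sum != null) {
            System.out.println(sum.getAlgorithmName() + " : " + sum);
        }
        fs.close();
    }
}
```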
### Metadata Disk Failure
The FsImage and the EditLog are central data structures of HDFS. A corruption of these files can cause the HDFS instance to be non-functional. For this reason, the NameNode can be configured to support maintaining multiple copies of the FsImage and EditLog. Any update to either the FsImage or EditLog causes each of the FsImages and EditLogs to get updated synchronously. This synchronous updating of multiple copies of the FsImage and EditLog may degrade the rate of namespace transactions per second that a NameNode can support. However, this degradation is acceptable because even though HDFS applications are very data intensive in nature, they are not metadata intensive. When a NameNode restarts, it selects the latest consistent FsImage and EditLog to use.
Another option to increase resilience against failures is to enable High Availability using multiple NameNodes, either with a [shared storage on NFS](http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithNFS.html) or using a [distributed edit log](http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html) (called Journal). The latter is the recommended approach.
### Snapshots
[Snapshots](http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsSnapshots.html) support storing a copy of data at a particular instant of time. One usage of the snapshot feature may be to roll back a corrupted HDFS instance to a previously known good point in time.
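A minimal sketch of that usage with the HDFS Java API, assuming an administrator has first allowed snapshots on the directory; the path and snapshot name are hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class SnapshotExample {
    public static void main(String[] args) throws Exception {
        DistributedFileSystem dfs =
            (DistributedFileSystem) FileSystem.get(new Configuration());
        Path dir = new Path("/user/alice/warehouse");  // hypothetical directory

        dfs.allowSnapshot(dir);                        // admin-only: make the directory snapshottable
        dfs.createSnapshot(dir, "before-upgrade");     // data then remains readable under
                                                       // /user/alice/warehouse/.snapshot/before-upgrade
        dfs.close();
    }
}
```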