Analyzing Distributed Systems

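Baidu Systems Department
Hadoop Distributed File System

What is Hadoop
- Open source, Java
- Originated as a subproject of Lucene (the open-source search engine) under the Apache open-source organization
- map-reduce engine + HDFS (+ Hbase)
- Hadoop should not be regarded as merely a distributed file system; it is in fact a complete infrastructure for distributed computation and storage.

What is HDFS
- HDFS (Hadoop Distributed filesystem) is designed as the underlying framework for running distributed applications on large clusters built from commodity hardware, not as a distributed file system used purely for storage
- Suited to applications with large data sets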

- High reliability and high availability
- Supports the map-reduce programming model

Other GFS-like systems
- KFS (Kosmos Filesystem): an open-source project from Kosmix, a vertical search engine startup; written in C++
- Only a file system, with no MapReduce layer
- Backing store for other open source projects:
  - Hadoop (provides a Map/Reduce implementation)
  - Hypertable (provides a Big-Table interface, Zvents Inc)

Disadvantage
- GFS supports inefficient re-write and efficient concurrent append, whereas HDFS currently supports neither rewrite nor append.
- HDFS allows a file to be created only once. Data must be written at creation time, and once creation completes the file can no longer be modified, strictly following "one-writer-write-once & read-many".
- Many applications now need append, for example continuously appending log records to a file in HDFS.

Our plan
- Implement single-client append and truncate: HDFS allows a file to be opened repeatedly for modification (append and truncate). Only one client may modify the file at a time, and multiple clients may read it concurrently while it is being modified.

Architecture
- Master/Slave Arch.
- a single namenode and multiple datanodes
- Namenode
  - executes file system namespace operations like opening, closing, and renaming files and directories
  - determines the mapping of blocks to Datanodes

Architecture
- Datanodes
  - Datanodes are responsible for serving read and write requests from the file system's clients.
  - Datanodes also perform block creation, deletion, and replication upon instruction from the Namenode.

Namenode
- Serves as both directory namespace manager and "inode table"
- Filename --> blocksequence (namespace), stored on disk and very precious
- Block --> machinelist ("inodes"), rebuilt every time the NameNode comes up

Namenode
Initiation:
- new FSNamesystem:
- Load FS Image
- Check and trigger safe mode if needed
- Set the total number of blocks in the system
- Record all blocks that are getting replicated
- Start monitors
- Start http server
- Start RPC server
- Start Trash Emptier thread

Monitors
- SafeModeMonitor: periodically checks whether it is time to leave safe mode.
- PendingReplicationMonitor: a periodic thread that scans for blocks that never finished their replication request.
- HeartbeatMonitor: periodically checks whether any heartbeats have expired.

Monitors
- LeaseMonitor: periodically checks for leases that have expired, and disposes of them.
- ReplicationMonitor: periodically looks at a few datanodes and computes any replication work that can be scheduled on them.
- DecommissionedMonitor: periodically checks whether any of the nodes being decommissioned has finished moving all its datablocks to another replica.

Data Replication
- Stores each file as a sequence of blocks
- Blocks of a file are replicated for fault tolerance
- The replication factor can be specified at file creation time and can be changed later
- Files in HDFS are write-once and have strictly one writer at any time

Data Replication
- The Namenode makes all decisions regarding replication of blocks
- Namenode receives Heartbeat and Blockreport from datanodes
  - Heartbeat: I'm live! (3 seconds)
  - Blockreport: all blocks on datanode (1 hour)

HeartbeatMonitor
- Each datanode sends heartbeats to the namenode (TCP)
- If no heartbeat arrives within one interval, the datanode is considered dead
- Only one datanode may be marked dead at a time
- Updates the number of blocks that need replication
- The heartbeat response carries commands: it checks whether there is block replication work or block deletion work to hand out

ReplicationMonitor
- Computes the blocks that need replication; if there is no replication work, computes the blocks that need deletion
- Runs every 3 seconds by default
- Processes only 32% of the datanodes on each pass
- If a datanode already has a heavy replication load, it is skipped and given no new work (by default it can handle only 2 at a time)

SafeMode
- A special namenode state in which the namenode accepts no namespace operations and makes no adjustments to replica counts.
- The namenode enters safe mode automatically at startup, accepts heartbeats and block reports from datanodes, and checks the block list.
- A block is considered safe once its replica count reaches the configured minimum replication (dfs.replication.min).
- Once the system reaches the configured ratio of safe blocks (dfs.safemode.threshold.pct), the namenode stays in safe mode for a further period (configured by dfs.safemode.extension) so the remaining datanodes can check in, and then leaves safe mode automatically.

SafeMode
- Safe mode can be entered or left manually by calling the setSafeMode command in DFSAdmin.
- Note: if threshold is set to 0 or the namespace is empty, the namenode does not enter safe mode automatically at startup; if threshold is greater than 1, the namenode can only leave safe mode manually.

SafemodeMonitor
- Checks whether the Namenode can leave safe mode
- Runs every 1 second by default
- If it can leave, exits safe mode and stops this monitor

Lease
- Difference from a lock: a lease has a time limit
- When creating a file, a client must first request a lease from the namenode. The purpose is to prevent a failed client from holding resources on the node server indefinitely.
- If the namenode receives no lease-renewal call from a client for some period, it assumes the client is "dead" and must release the resources that client holds on the node.
- The namenode implements this mechanism with a class named leases. Each lease records the resource (file) it covers, the lease holder (client), and the time of the last lease renewal.

Lease
- A client periodically calls renewLease to tell the namenode it is alive. If the namenode receives no such call from a client within a certain time, it considers that client dead.
- When a lease times out, the lease instance uses a thread to clean up its resources; that thread terminates when the lease is closed.

LeaseMonitor
- Checks whether there are any leases; leases are sorted by creation time
- Runs every 2 seconds by default
- Processes only the first lease on each pass
- If that lease has timed out (1 hour), the lease is deleted

Filesystem Management
- track several important tables
  - valid fsname --> blocklist (kept on disk, logged)
  - Set of all valid blocks
  - block --> machinelist (kept in memory, rebuilt dynamically from reports)
  - machine --> blocklist
  - LRU cache of updated-heartbeat machines

Filesystem Management
    abstract class INode implements Comparable {
      protected byte[] name;
      protected INodeDirectory parent;
      protected long modificationTime;
    }

Filesystem Management
    public class INode {
      enum FileType { DIRECTORY, FILE }
      public static final FileType[] FILE_TYPES = { FileType.DIRECTORY, FileType.FILE };
      public static final INode DIRECTORY_INODE = new INode(FileType.DIRECTORY, null);
      private FileType fileType;
      private Block[] blocks;
    }

Filesystem Management
    class INodeDirectory extends INode {
      protected static final int DEFAULT_FILES_PER_DIRECTORY = 5;
      final static String ROOT_NAME = "";
      private List children;
    }
    class INodeFile extends INode {
      private BlockInfo[] blocks = null;
      protected short blockReplication;
      protected long preferredBlockSize;
    }

Filesystem Management
    class LocatedBlock implements Writable {
      private Block b;
      private long offset;  // offset of the first byte of the block in the file
      private DatanodeInfo[] locs;
    }

Filesystem Management
    public class DatanodeDescriptor extends DatanodeInfo {
      private volatile BlockInfo blockList = null;
      protected boolean isAlive = false;
      List replicateBlocks;
      List replicateTargetSets;
      List invalidateBlocks;
    }
    static class DatanodeImage implements Comparable {
      DatanodeDescriptor node;
    }

Filesystem Management
    class BlocksMap {
      static class BlockInfo extends Block {
        private INodeFile inode;
        private Object[] triplets;
      }
      private static class NodeIterator implements Iterator {
        private BlockInfo blockInfo;
        private int nextIdx = 0;
      }
    }

Filesystem Management
    ArrayList heartbeats = new ArrayList();
    private Map leases = new TreeMap();
    private SortedSet sortedLeases = new TreeSet();

Persistence of Filesystem Metadata
- EditLog
  - A transaction log: persistently records every change that occurs to file system metadata:
  - OP_ADD, OP_RENAME, OP_DELETE, OP_MKDIR, OP_SET_REPLICATION, OP_DATANODE_ADD, OP_DATANODE_REMOVE (datanode information is only partially persisted)
- FsImage
  - Stores the entire file system namespace, including the mapping of blocks to files and file system properties

Checkpoint
- Namenode startup
- Periodic checkpointing (secondary namenode, HTTP)

checkpoint
    doCheckpoint()
      doSetup();               // Do the required initialization of the merge work area.
      namenode.rollEditLog();  // start logging transactions in a new edit file
      getFSImage();            // Fetch fsimage
      getFSEdits();            // Fetch edits
      doMerge();               // Do the merge
      putFSImage(token);       // Upload the new image into the NameNode
      namenode.rollFsImage();

checkpoint
    private void doMerge() throws IOException {
      fsImage.loadFSImage(srcImage);
      fsImage.getEditLog().loadFSEdits(editFile);
      fsImage.saveFSImage(destImage);
    }

checkpoint
    loadFSEdits(File edits)
      case OP_ADD:             unprotectedAddFile
      case OP_SET_REPLICATION: unprotectedSetReplication
      case OP_RENAME:          unprotectedRenameTo
      case OP_DELETE:          unprotectedDelete
      case OP_MKDIR:           unprotectedMkdir
      case OP_DATANODE_ADD
      case OP_DATANODE_REMOVE

Namenode
close:
- close namesystem
- stop PendingReplication daemon
- stop http server
- Interrupt Heartbeat daemon
- Interrupt Replication daemon
- Inter...
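The safe-mode exit rule from the SafeMode slides can be illustrated with a small Java sketch. It is not the NameNode's actual code: the class SafeModeSketch, its fields, and its methods are hypothetical names, and the values in main are arbitrary. Only the three configuration keys come from the slides.

```java
/** Hypothetical sketch of the safe-mode exit rule; not the real NameNode code. */
final class SafeModeSketch {
    private final int minReplication;   // stands in for dfs.replication.min
    private final double thresholdPct;  // stands in for dfs.safemode.threshold.pct
    private final long extensionMs;     // stands in for dfs.safemode.extension
    private final long totalBlocks;     // blocks listed in the loaded namespace
    private long safeBlocks = 0;        // blocks that have reached minReplication
    private long reachedAtMs = -1;      // when the threshold was first seen met

    SafeModeSketch(int minReplication, double thresholdPct,
                   long extensionMs, long totalBlocks) {
        this.minReplication = minReplication;
        this.thresholdPct = thresholdPct;
        this.extensionMs = extensionMs;
        this.totalBlocks = totalBlocks;
    }

    /** Per the slides: threshold 0 or an empty namespace skips safe mode. */
    boolean needsSafeModeAtStartup() {
        return thresholdPct > 0 && totalBlocks > 0;
    }

    /** Called when a block report raises one block's replica count. */
    void replicaReported(int newReplicaCount) {
        if (newReplicaCount == minReplication) {
            safeBlocks++;  // the block just became safe; count it exactly once
        }
    }

    /** What a periodic monitor would ask on each pass. */
    boolean canLeave(long nowMs) {
        // With thresholdPct > 1 and a non-empty namespace this never holds,
        // so safe mode could only be left manually.
        if (safeBlocks < thresholdPct * totalBlocks) {
            return false;
        }
        if (reachedAtMs < 0) {
            reachedAtMs = nowMs;  // threshold met: start the extension period
        }
        return nowMs - reachedAtMs >= extensionMs;
    }

    public static void main(String[] args) {
        SafeModeSketch s = new SafeModeSketch(1, 0.5, 1000, 4);
        s.replicaReported(1);
        s.replicaReported(1);
        System.out.println(s.canLeave(100));   // threshold met, extension starts
        System.out.println(s.canLeave(1100));  // extension has elapsed
    }
}
```

Counting a block only at the moment its replica count equals the minimum keeps the tally correct when later reports add further replicas of the same block.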
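The lease bookkeeping described in the Lease and LeaseMonitor slides can be sketched as follows. This is an illustration under stated assumptions: LeaseTableSketch and its methods are hypothetical names, a lease covers a holder only (the file it protects is omitted), and leases are ordered by last renewal time so that the first lease is always the one closest to expiry. The one-hour limit is the value given on the slide.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeSet;

/** Hypothetical sketch of lease tracking; not the real NameNode code. */
final class LeaseTableSketch {
    static final long LEASE_PERIOD_MS = 60L * 60 * 1000;  // 1 hour, per the slide

    /** Immutable so that its position in the sorted set never changes. */
    private static final class Lease implements Comparable<Lease> {
        final String holder;
        final long lastRenewedMs;

        Lease(String holder, long lastRenewedMs) {
            this.holder = holder;
            this.lastRenewedMs = lastRenewedMs;
        }

        @Override
        public int compareTo(Lease other) {
            int byTime = Long.compare(lastRenewedMs, other.lastRenewedMs);
            return byTime != 0 ? byTime : holder.compareTo(other.holder);
        }
    }

    private final Map<String, Lease> byHolder = new HashMap<>();
    private final TreeSet<Lease> sorted = new TreeSet<>();

    /** Grants a lease, or renews it if the holder already has one. */
    void renew(String holder, long nowMs) {
        Lease old = byHolder.get(holder);
        if (old != null) {
            sorted.remove(old);
        }
        Lease fresh = new Lease(holder, nowMs);
        byHolder.put(holder, fresh);
        sorted.add(fresh);
    }

    /** One monitor pass: looks only at the oldest lease.
     *  Returns the expired holder, or null if nothing expired. */
    String checkOnce(long nowMs) {
        if (sorted.isEmpty()) {
            return null;
        }
        Lease oldest = sorted.first();
        if (nowMs - oldest.lastRenewedMs <= LEASE_PERIOD_MS) {
            return null;
        }
        sorted.remove(oldest);
        byHolder.remove(oldest.holder);
        return oldest.holder;
    }

    int size() {
        return byHolder.size();
    }

    public static void main(String[] args) {
        LeaseTableSketch table = new LeaseTableSketch();
        table.renew("client-a", 0);
        table.renew("client-b", 1000);
        table.renew("client-a", 2000);  // client-a is now the newer lease
        System.out.println(table.checkOnce(LEASE_PERIOD_MS + 1001));
    }
}
```

Because the set is ordered by renewal time, inspecting only the first lease per pass is enough: if the oldest lease has not expired, no other lease has.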
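The merge performed in doMerge() amounts to three steps: load the image, replay each logged operation in order, and save the result. The sketch below models the namespace as an in-memory set of paths to show the replay step. CheckpointSketch, Edit, and merge are hypothetical names, only four of the opcodes are handled, and deleting or renaming a directory does not touch its children. The real FsImage and EditLog use on-disk formats that are not modeled here.

```java
import java.util.List;
import java.util.TreeSet;

/** Hypothetical sketch of replaying an edit log over a namespace image. */
final class CheckpointSketch {
    enum Op { OP_ADD, OP_MKDIR, OP_DELETE, OP_RENAME }

    /** One logged change; target is used only by OP_RENAME. */
    static final class Edit {
        final Op op;
        final String path;
        final String target;

        Edit(Op op, String path, String target) {
            this.op = op;
            this.path = path;
            this.target = target;
        }
    }

    /** Replays edits over a copy of the image and returns the merged namespace. */
    static TreeSet<String> merge(TreeSet<String> image, List<Edit> edits) {
        TreeSet<String> namespace = new TreeSet<>(image);  // leave the input intact
        for (Edit e : edits) {
            switch (e.op) {
                case OP_ADD:
                case OP_MKDIR:
                    namespace.add(e.path);
                    break;
                case OP_DELETE:
                    namespace.remove(e.path);
                    break;
                case OP_RENAME:
                    if (namespace.remove(e.path)) {
                        namespace.add(e.target);
                    }
                    break;
            }
        }
        return namespace;
    }

    public static void main(String[] args) {
        TreeSet<String> image = new TreeSet<>(List.of("/a"));
        List<Edit> edits = List.of(
                new Edit(Op.OP_MKDIR, "/d", null),
                new Edit(Op.OP_ADD, "/d/f", null),
                new Edit(Op.OP_RENAME, "/a", "/b"),
                new Edit(Op.OP_DELETE, "/d/f", null));
        System.out.println(merge(image, edits));
    }
}
```

Replaying in log order matters: the same edits applied in a different order could produce a different namespace, which is why the log is a sequence of transactions and not a set.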
