[TOC]

# Analysis

Collection requirement: a business system generates its logs with log4j, so the log file grows continuously, and the data appended to it must be collected into HDFS in real time.

![](https://box.kancloud.cn/d678f7412c66a41023bc1453cfdc6669_666x263.png)

Based on this requirement, first define the following three key elements:

* Collection source (source): monitor the file for appended content, with exec `tail -F file`
* Sink target (sink): the HDFS file system, with the hdfs sink
* Transfer channel between source and sink (channel): either a file channel or a memory channel works

# Configuration file

~~~
# Name the components
agent1.sources = source1
agent1.sinks = sink1
agent1.channels = channel1

# Describe/configure the tail -F source1
# The source type is exec: it runs a shell command and turns its output into events
agent1.sources.source1.type = exec
# tail -F follows the file and picks up newly appended lines
agent1.sources.source1.command = tail -F /root/hadoop2/logs/access_log

# Configure host handling for the source with two interceptors, i1 and i2
agent1.sources.source1.interceptors = i1 i2
# i1 is a host interceptor
agent1.sources.source1.interceptors.i1.type = host
# Put the host into an event header named "hostname"
agent1.sources.source1.interceptors.i1.hostHeader = hostname
# true: the header holds the IP address; false: the resolved hostname
agent1.sources.source1.interceptors.i1.useIP = true
# i2 adds the timestamp header required by the time escapes in the HDFS path
agent1.sources.source1.interceptors.i2.type = timestamp

# Describe sink1
agent1.sinks.sink1.type = hdfs
# Target path on HDFS; %{hostname} and the time escapes come from the interceptor headers
agent1.sinks.sink1.hdfs.path = hdfs://master:9000/file/%{hostname}/%y-%m-%d/%H-%M
agent1.sinks.sink1.hdfs.filePrefix = access_log
agent1.sinks.sink1.hdfs.batchSize = 100
agent1.sinks.sink1.hdfs.fileType = DataStream
agent1.sinks.sink1.hdfs.writeFormat = Text
# Roll the current file at 10240 bytes, 1000 events, or 10 seconds, whichever comes first
agent1.sinks.sink1.hdfs.rollSize = 10240
agent1.sinks.sink1.hdfs.rollCount = 1000
agent1.sinks.sink1.hdfs.rollInterval = 10
# Round the timestamp used in the path down to 10-minute buckets
agent1.sinks.sink1.hdfs.round = true
agent1.sinks.sink1.hdfs.roundValue = 10
agent1.sinks.sink1.hdfs.roundUnit = minute

# Use a channel which buffers events in memory
agent1.channels.channel1.type = memory
agent1.channels.channel1.keep-alive = 120
agent1.channels.channel1.capacity = 500000
agent1.channels.channel1.transactionCapacity = 600

# Bind the source and sink to the channel
agent1.sources.source1.channels = channel1
agent1.sinks.sink1.channel = channel1
~~~

# Test

Start the agent:

~~~
flume-ng agent -c conf -f fhd.conf -n agent1 -Dflume.root.logger=INFO,console
~~~

Replace fhd.conf with the config file you wrote (include its directory in the path if it is not in the current one); -n gives the agent name defined above.

Once it is running you can watch the log output on the console: whenever `/root/hadoop2/logs/access_log` is appended to, the new records are collected and uploaded to HDFS.
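To smoke-test the pipeline, you can append a record to the monitored file and then look at what the sink wrote. A minimal sketch, assuming the paths and the master:9000 namenode from the config above (the exact hostname and time subdirectories depend on your machine and clock):

~~~
# Append a test record to the file that tail -F is following
echo "test record $(date)" >> /root/hadoop2/logs/access_log

# List everything the hdfs sink has written so far
hdfs dfs -ls -R /file

# Inspect one of the rolled files (substitute a real path printed by ls above)
hdfs dfs -cat /file/<hostname>/<yy-mm-dd>/<HH-MM>/access_log.*
~~~

A file that is still open shows up with a `.tmp` suffix; it is renamed once one of the roll conditions (rollSize, rollCount, rollInterval) fires.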
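The analysis above says the logs come from log4j. As a purely hypothetical illustration of the producing side (not part of the original setup), a minimal log4j 1.x properties file that appends to the monitored path could look like this:

~~~
# log4j 1.x: route all INFO+ application output into the monitored file
log4j.rootLogger=INFO, access
log4j.appender.access=org.apache.log4j.FileAppender
# Must match the file that agent1.sources.source1.command tails
log4j.appender.access.File=/root/hadoop2/logs/access_log
log4j.appender.access.Append=true
log4j.appender.access.layout=org.apache.log4j.PatternLayout
log4j.appender.access.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss} %p %c - %m%n
~~~

From Flume's point of view any appender works the same way, since the exec source only sees lines appended to the file.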