[TOC]

# Flume Introduction and Installation

## 1 Introduction to Flume

### 1.1 Overview

1. Flume is a distributed, reliable, and highly available system provided by Cloudera for collecting, aggregating, and transporting massive volumes of log data.
2. Flume can collect data from many kinds of sources (files, socket packets, directories, etc.) and write the collected data to many external storage systems such as HDFS, HBase, Hive, and Kafka.
3. For ordinary collection requirements, a simple Flume configuration is all that is needed.
4. Flume also offers good extensibility for special scenarios through custom components, so it fits most day-to-day data-collection use cases.

> Flume currently has two major lines:

~~~
Flume 0.9x releases are collectively called Flume-og.
Flume 1.x releases are collectively called Flume-ng.
~~~

> Flume-ng was heavily refactored and differs substantially from Flume-og, so be careful to distinguish between them.

### 1.2 How Flume works

1) The core role in a distributed Flume deployment is the agent; a Flume collection system is formed by connecting agents together.
2) Each agent acts as a data courier and contains three components:
   a) Source: the collection source, which connects to the data source and obtains the data.
   b) Sink: the destination of the collected data, which forwards data to the next agent or to the final storage system.
   c) Channel: the data-transfer channel inside the agent, which moves data from the source to the sink.

![](https://box.kancloud.cn/4f783d2f46e70bb168618831b883fc4d_554x214.png)

### 1.3 Flume collection system architectures

#### 1.3.1. Simple structure

> A single agent collects the data

![](https://box.kancloud.cn/ffb388d7a108a0943dfa0f85ae2ab3ae_556x212.png)

#### 1.3.2. Complex structures

> Multiple agents chained together

1) First form: two agents in series
![](https://box.kancloud.cn/5f541bbb6d942f0b6c8c261f72b83400_577x116.png)
2) Second form: data collected by several agents is aggregated
![](https://box.kancloud.cn/9dc9b5e8d0dcff6853488402bc58b037_481x324.png)
3) Third form: the collected data is delivered to different downstream systems
![](https://box.kancloud.cn/676fce641b9671e1b3cb01d495dbd7a6_578x307.png)

## 2 Flume hands-on examples

### 2.1 Installing and deploying Flume

1) Installing Flume is very simple: just unpack the archive (assuming a Hadoop environment is already available).

~~~
Upload the installation package to the node where the data source resides,
then unpack it:
tar -zxvf apache-flume-1.6.0-bin.tar.gz
Then enter the flume directory, edit conf/flume-env.sh, and set JAVA_HOME in it.
~~~

2) Write the collection plan in a configuration file according to the collection requirement (the file name is arbitrary).
3) Start a flume agent on the corresponding node, pointing it at that configuration file.

> Let's start with the simplest possible example to verify that the environment works.

![](https://box.kancloud.cn/a28f0aae60bce285105a9e879651e329_553x273.png)

1) First create a new file in flume's conf directory

~~~
vi netcat-logger.conf
~~~

~~~
# Name the components of this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source component: r1
a1.sources.r1.type = netcat
a1.sources.r1.bind = itcast01
a1.sources.r1.port = 44444

# Describe/configure the sink component: k1
a1.sinks.k1.type = logger

# Describe/configure the channel component; here an in-memory buffer is used
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Wire the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
~~~

2) Start the agent and begin collecting data

~~~
bin/flume-ng agent -c conf -f conf/netcat-logger.conf -n a1 -Dflume.root.logger=INFO,console

-c conf                     directory containing flume's own configuration files
-f conf/netcat-logger.conf  the collection plan we just wrote
-n a1                       the name of this agent
~~~

3) Test

> 1. First send some data to the port the agent is listening on, so the agent has something to collect.
> 2. From any machine that can reach the agent node:
> 3. telnet agent-hostname port (telnet itcast01 44444)

![](https://box.kancloud.cn/1342ddbfcd16d9cd756efc5aacffd024_469x140.png)
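> If telnet is not installed on the test machine, the same check can be done with nc. A minimal sketch, assuming the netcat utility is available on a node that can reach itcast01 and that the agent above is already running:

~~~
# Connect to the port the netcat source listens on and type a few lines,
# just as with telnet; each line should appear on the agent console as an Event
nc itcast01 44444
~~~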
### 2.2 Commonly used source, channel and sink components

#### 2.2.1 Source components

| Source type | Description |
| --- | --- |
| Avro Source | Supports the Avro protocol (actually Avro RPC); built in |
| Thrift Source | Supports the Thrift protocol; built in |
| Exec Source | Produces data from the standard output of a Unix command |
| JMS Source | Reads data from a JMS system (queues/topics); tested with ActiveMQ |
| Spooling Directory Source | Monitors a given directory for new files |
| Twitter 1% firehose Source | Continuously downloads Twitter data through the API; experimental |
| Netcat Source | Listens on a port and turns every text line flowing through it into an Event |
| Sequence Generator Source | Sequence generator source; produces sequence data |
| Syslog Sources | Read syslog data and produce Events; both UDP and TCP are supported |
| HTTP Source | Data source based on HTTP POST or GET; supports JSON and BLOB representations |
| Legacy Sources | Compatibility with sources from the old Flume OG (0.9.x) |

#### 2.2.2 Channel components

| Channel type | Description |
| --- | --- |
| Memory Channel | Events are stored in memory |
| JDBC Channel | Events are stored in a persistent store; Derby is currently the built-in option |
| File Channel | Events are stored in files on disk |
| Spillable Memory Channel | Events are stored in memory and on disk; when the in-memory queue fills up, events are persisted to disk files (currently experimental, not recommended for production) |
| Pseudo Transaction Channel | For testing purposes only |
| Custom Channel | A custom Channel implementation |

#### 2.2.3 Sink components

| Sink type | Description |
| --- | --- |
| HDFS Sink | Writes data to HDFS |
| Logger Sink | Writes data to the log |
| Avro Sink | Data is converted into Avro Events and sent to the configured RPC port |
| Thrift Sink | Data is converted into Thrift Events and sent to the configured RPC port |
| IRC Sink | Data is replayed on IRC |
| File Roll Sink | Stores data on the local file system |
| Null Sink | Discards all data |
| HBase Sink | Writes data to HBase |
| Morphline Solr Sink | Sends data to a Solr search server (cluster) |
| ElasticSearch Sink | Sends data to an Elasticsearch search server (cluster) |
| Kite Dataset Sink | Writes data to a Kite Dataset; experimental |
| Custom Sink | A custom Sink implementation |

> Flume supports many more source, channel and sink types; see the official user guide for the full reference:
> http://flume.apache.org/FlumeUserGuide.html

### 2.3 Collection cases

#### 2.3.1 Collecting a directory into HDFS

> Requirement: new files keep appearing in a particular directory on a server; whenever a new file shows up, it must be collected into HDFS.
> Based on this requirement, define the three key elements:

1. The source, which monitors a directory of files: spooldir
2. The sink, which targets the HDFS file system: hdfs sink
3. The channel between source and sink: either a file channel or a memory channel

> Configuration file:

~~~
# Name the three components
agent1.sources = source1
agent1.sinks = sink1
agent1.channels = channel1

# Configure the source component
agent1.sources.source1.type = spooldir
agent1.sources.source1.spoolDir = /root/data/
agent1.sources.source1.fileHeader = false

# Configure an interceptor
agent1.sources.source1.interceptors = i1
agent1.sources.source1.interceptors.i1.type = timestamp

# Configure the sink component
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = /weblog/flume-collection/%y-%m-%d/%H-%M
agent1.sinks.sink1.hdfs.filePrefix = access_log
agent1.sinks.sink1.hdfs.maxOpenFiles = 5000
agent1.sinks.sink1.hdfs.batchSize = 100
agent1.sinks.sink1.hdfs.fileType = DataStream
agent1.sinks.sink1.hdfs.writeFormat = Text
# Roll files by size
agent1.sinks.sink1.hdfs.rollSize = 102400
# Roll files by number of events
agent1.sinks.sink1.hdfs.rollCount = 1000000
# Roll files by time
agent1.sinks.sink1.hdfs.rollInterval = 60
# Enable directory rounding
agent1.sinks.sink1.hdfs.round = true
# Round down in steps of 10
agent1.sinks.sink1.hdfs.roundValue = 10
# Unit is minutes
agent1.sinks.sink1.hdfs.roundUnit = minute

# Use a channel which buffers events in memory
agent1.channels.channel1.type = memory
agent1.channels.channel1.capacity = 500000
agent1.channels.channel1.transactionCapacity = 600
agent1.channels.channel1.keep-alive = 120

# Bind the source and sink to the channel
agent1.sources.source1.channels = channel1
agent1.sinks.sink1.channel = channel1
~~~

> When using the spooldir source, files with the same name must never be placed into the directory twice, otherwise Flume reports an error!

> Channel parameters:

| Parameter | Description |
| --- | --- |
| capacity | Maximum number of events the channel can hold |
| transactionCapacity | Maximum number of events taken from the source or given to the sink per transaction |
| keep-alive | How long an event add or remove operation may wait on the channel |

> Another component: the Interceptor
> A chain of interceptors attached to a source decorates and filters events, in the configured order, where needed.
> The built-in interceptors can add headers to an event, such as a timestamp, host name, or static tags.
> Custom interceptors can inspect the event payload (the raw log line) and implement arbitrary business logic, which is very powerful.
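> A quick way to exercise the spooldir example is to drop a file into the spool directory and then check the HDFS output path. A rough sketch, assuming the configuration above is saved as spooldir-hdfs.conf (an assumed file name) and HDFS is reachable from this node:

~~~
# spooldir-hdfs.conf is an assumed name for the configuration file above
mkdir -p /root/data
bin/flume-ng agent -c conf -f conf/spooldir-hdfs.conf -n agent1 -Dflume.root.logger=INFO,console

# In another terminal, move a test file into the spool directory
echo "test record" > /tmp/web_1.log
mv /tmp/web_1.log /root/data/

# Flume renames processed files with a .COMPLETED suffix and writes the events to HDFS
hdfs dfs -ls -R /weblog/flume-collection
~~~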
#### 2.3.2 Collecting a file into HDFS

> Requirement: a business system writes its logs with log4j and the log file keeps growing; the data appended to the log file must be collected into HDFS in near real time.

![](https://box.kancloud.cn/83ac180abcf8657815ae2cdc480e0c2e_553x229.png)

> Based on this requirement, define the three key elements:

1. The source, which monitors new content appended to a file: exec 'tail -F file'
2. The sink, which targets the HDFS file system: hdfs sink
3. The channel between source and sink: either a file channel or a memory channel

> Configuration file:

~~~
agent1.sources = source1
agent1.sinks = sink1
agent1.channels = channel1

# Describe/configure the tail -F source
agent1.sources.source1.type = exec
agent1.sources.source1.command = tail -F /home/hadoop/logs/access_log
agent1.sources.source1.channels = channel1

# Configure interceptors: add the host name to the headers,
# and a timestamp so the %y-%m-%d/%H-%M escapes in hdfs.path can be resolved
agent1.sources.source1.interceptors = i1 i2
agent1.sources.source1.interceptors.i1.type = host
agent1.sources.source1.interceptors.i1.hostHeader = hostname
agent1.sources.source1.interceptors.i2.type = timestamp

# Describe sink1
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = hdfs://itcast01:9000/file/%{hostname}/%y-%m-%d/%H-%M
agent1.sinks.sink1.hdfs.filePrefix = access_log
agent1.sinks.sink1.hdfs.batchSize = 100
agent1.sinks.sink1.hdfs.fileType = DataStream
agent1.sinks.sink1.hdfs.writeFormat = Text
agent1.sinks.sink1.hdfs.rollSize = 10240
agent1.sinks.sink1.hdfs.rollCount = 1000
agent1.sinks.sink1.hdfs.rollInterval = 10
agent1.sinks.sink1.hdfs.round = true
agent1.sinks.sink1.hdfs.roundValue = 10
agent1.sinks.sink1.hdfs.roundUnit = minute

# Use a channel which buffers events in memory
agent1.channels.channel1.type = memory
agent1.channels.channel1.keep-alive = 120
agent1.channels.channel1.capacity = 500000
agent1.channels.channel1.transactionCapacity = 600

# Bind the source and sink to the channel
agent1.sources.source1.channels = channel1
agent1.sinks.sink1.channel = channel1
~~~
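> To see events flowing through the exec source, something has to keep appending to the monitored file. A small sketch, assuming the configuration above is saved as tail-hdfs.conf (an assumed file name):

~~~
# tail-hdfs.conf is an assumed name for the configuration file above
bin/flume-ng agent -c conf -f conf/tail-hdfs.conf -n agent1 -Dflume.root.logger=INFO,console

# In another shell, keep appending lines to the tailed log file
mkdir -p /home/hadoop/logs
while true; do date >> /home/hadoop/logs/access_log; sleep 1s; done

# The data should appear under /file/<hostname>/... on HDFS
hdfs dfs -ls -R /file
~~~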
#### 2.3.3 Chaining multiple agents

> Requirement: as before, a business system writes its logs with log4j and the appended data must be collected into HDFS in near real time, but this time through a chain of agents.

![](https://box.kancloud.cn/1f39f7bc3a06cac2704215582df3cf2d_553x160.png)

> Based on this requirement, define the three key elements:

> First flume agent:

1. The source, which monitors new content appended to a file: exec 'tail -F file'
2. The sink, which sends the data onward and serializes it: avro sink
3. The channel between source and sink: either a file channel or a memory channel

> Second flume agent:

1. The source, which receives the data and deserializes it: avro source
2. The sink, which targets the HDFS file system: HDFS sink
3. The channel between source and sink: either a file channel or a memory channel

> Configuration files (a start-up sketch follows the two configs):

~~~
# Flume-agent1: tail-avro-avro-logger.conf
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /root/logs/test.log
a1.sources.r1.channels = c1

# Describe the sink
## The avro sink acts as a data sender
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = itcast02
a1.sinks.k1.port = 41414
a1.sinks.k1.batch-size = 10

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
~~~

~~~
# Flume-agent2: avro-hdfs.conf
a1.sources = r1
a1.sinks = s1
a1.channels = c1

## The avro source acts as a receiving server
a1.sources.r1.type = avro
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 41414

a1.sinks.s1.type = hdfs
a1.sinks.s1.hdfs.path = hdfs://itcast01:9000/flumedata
a1.sinks.s1.hdfs.filePrefix = access_log
a1.sinks.s1.hdfs.batchSize = 100
a1.sinks.s1.hdfs.fileType = DataStream
a1.sinks.s1.hdfs.writeFormat = Text
a1.sinks.s1.hdfs.rollSize = 10240
a1.sinks.s1.hdfs.rollCount = 1000
a1.sinks.s1.hdfs.rollInterval = 10
a1.sinks.s1.hdfs.round = true
a1.sinks.s1.hdfs.roundValue = 10
a1.sinks.s1.hdfs.roundUnit = minute

a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

a1.sources.r1.channels = c1
a1.sinks.s1.channel = c1
~~~
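> Start-up order matters for a chain like this: the downstream avro source on itcast02 should be listening before the upstream avro sink tries to connect. A rough sketch of the commands, assuming both files above live in each node's conf directory:

~~~
# On itcast02: start the avro -> HDFS agent first
bin/flume-ng agent -c conf -f conf/avro-hdfs.conf -n a1 -Dflume.root.logger=INFO,console

# On itcast01: then start the tail -> avro agent
bin/flume-ng agent -c conf -f conf/tail-avro-avro-logger.conf -n a1 -Dflume.root.logger=INFO,console

# On itcast01: generate some test data in the tailed file
mkdir -p /root/logs
while true; do date >> /root/logs/test.log; sleep 1s; done
~~~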
#### 2.3.4 High-availability configurations

(I) Failover

> With a single-node Flume NG set up, we can now build a highly available Flume NG cluster. The architecture is shown below:

![](https://box.kancloud.cn/4a0d04f7b24ca9c2ff5313a5045a036d_553x372.png)

(1) Node allocation

> The Flume agents and collectors are distributed as follows:

| Name | IP address | Host | Role |
| --- | --- | --- | --- |
| Agent1 | 192.168.200.101 | Itcast01 | WebServer |
| Collector1 | 192.168.200.102 | Itcast02 | AgentMstr1 |
| Collector2 | 192.168.200.103 | Itcast03 | AgentMstr2 |

> Data from Agent1 flows to both Collector1 and Collector2. Flume NG provides a failover mechanism that switches over and recovers automatically. Let's configure the cluster.

(2) Configuration

> With the single-node setup already in place, only two new configuration files are needed: flume-client.conf and flume-server.conf.

1) flume-client.conf on itcast01

~~~
# agent1 name
agent1.channels = c1
agent1.sources = r1
agent1.sinks = k1 k2

# set sink group
agent1.sinkgroups = g1
agent1.sinkgroups.g1.sinks = k1 k2

# set channel
agent1.channels.c1.type = memory
agent1.channels.c1.capacity = 1000
agent1.channels.c1.transactionCapacity = 100

agent1.sources.r1.channels = c1
agent1.sources.r1.type = exec
agent1.sources.r1.command = tail -F /root/log/test.log
agent1.sources.r1.interceptors = i1 i2
agent1.sources.r1.interceptors.i1.type = static
agent1.sources.r1.interceptors.i1.key = Type
agent1.sources.r1.interceptors.i1.value = LOGIN
agent1.sources.r1.interceptors.i2.type = timestamp

# set sink1
agent1.sinks.k1.channel = c1
agent1.sinks.k1.type = avro
agent1.sinks.k1.hostname = itcast02
agent1.sinks.k1.port = 52020

# set sink2
agent1.sinks.k2.channel = c1
agent1.sinks.k2.type = avro
agent1.sinks.k2.hostname = itcast03
agent1.sinks.k2.port = 52020

# set failover
agent1.sinkgroups.g1.processor.type = failover
agent1.sinkgroups.g1.processor.priority.k1 = 10
agent1.sinkgroups.g1.processor.priority.k2 = 5
agent1.sinkgroups.g1.processor.maxpenalty = 10000

# A sink group is declared first, then the two sinks k1 and k2 are added to it
# with priorities 10 and 5; the processor's maxpenalty is set to 10 seconds
# (the default is 30 seconds).
~~~

> Start command:

~~~
bin/flume-ng agent -n agent1 -c conf -f conf/flume-client.conf -Dflume.root.logger=DEBUG,console
~~~

2) flume-server.conf on itcast02 and itcast03

~~~
# set agent name
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# set channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# receive from the upstream node (agent -> collector)
a1.sources.r1.type = avro
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 52020
a1.sources.r1.channels = c1
a1.sources.r1.interceptors = i1 i2
a1.sources.r1.interceptors.i1.type = timestamp
a1.sources.r1.interceptors.i2.type = host
a1.sources.r1.interceptors.i2.hostHeader = hostname

# set sink to hdfs
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /data/flume/logs/%{hostname}
a1.sinks.k1.hdfs.filePrefix = %Y-%m-%d
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.writeFormat = TEXT
a1.sinks.k1.hdfs.rollInterval = 10
a1.sinks.k1.channel = c1
~~~

> Start command (note that this configuration names the agent a1):

~~~
bin/flume-ng agent -n a1 -c conf -f conf/flume-server.conf -Dflume.root.logger=DEBUG,console
~~~

(3) Testing failover

1) First start the collectors on itcast02 and itcast03

~~~
bin/flume-ng agent -n a1 -c conf -f conf/flume-server.conf -Dflume.root.logger=DEBUG,console
~~~

2) Then start the client on itcast01

~~~
bin/flume-ng agent -n agent1 -c conf -f conf/flume-client.conf -Dflume.root.logger=DEBUG,console
~~~

3) Generate data with a shell loop

~~~
while true;do date >> test.log; sleep 1s ;done
~~~

4) Watch the data directories created on HDFS: only itcast02 is receiving data

![](https://box.kancloud.cn/c6e6fbaf06617d87fb186e3203f3fcba_554x32.png)

5) After the agent on itcast02 is killed, keep watching the HDFS data directories: a directory for itcast03 appears, i.e. data collection has failed over to itcast03

![](https://box.kancloud.cn/468ba516d9eaa5bc82c66ab98863efb2_553x42.png)

6) After the agent on itcast02 is restarted, keep watching the HDFS data directories: collection switches back to itcast02, which resumes its work

![](https://box.kancloud.cn/83d0e4cde7e419f1e9a4dd6baf58178a_554x42.png)
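> For step 5, the collector process on itcast02 has to be stopped by hand. One possible sketch; the grep pattern is only illustrative and the pid must be read from the output:

~~~
# On itcast02: find the flume collector process and kill it
ps -ef | grep flume | grep flume-server.conf
kill -9 <pid>          # <pid> is a placeholder for the process id printed above

# After observing the failover to itcast03, restart the collector for step 6
bin/flume-ng agent -n a1 -c conf -f conf/flume-server.conf -Dflume.root.logger=DEBUG,console
~~~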
(II) Load balancing

(1) Node allocation

> Same node allocation as in the failover case above.

(2) Configuration

> Only small changes to the failover configuration are needed.

> flume-client-loadbalance.conf on itcast01:

~~~
# agent1 name
agent1.channels = c1
agent1.sources = r1
agent1.sinks = k1 k2

# set sink group
agent1.sinkgroups = g1

# set channel
agent1.channels.c1.type = memory
agent1.channels.c1.capacity = 1000
agent1.channels.c1.transactionCapacity = 100

agent1.sources.r1.channels = c1
agent1.sources.r1.type = exec
agent1.sources.r1.command = tail -F /root/log/test.log

# set sink1
agent1.sinks.k1.channel = c1
agent1.sinks.k1.type = avro
agent1.sinks.k1.hostname = itcast02
agent1.sinks.k1.port = 52020

# set sink2
agent1.sinks.k2.channel = c1
agent1.sinks.k2.type = avro
agent1.sinks.k2.hostname = itcast03
agent1.sinks.k2.port = 52020

# set sink group
agent1.sinkgroups.g1.sinks = k1 k2

# set load balancing
agent1.sinkgroups.g1.processor.type = load_balance
# round_robin is the default; random is also available
agent1.sinkgroups.g1.processor.selector = round_robin
# if backoff is enabled, the sink processor temporarily blacklists failed sinks
agent1.sinkgroups.g1.processor.backoff = true
~~~

> flume-server-loadbalance.conf on itcast02 and itcast03:

~~~
# set agent name
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# set channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# receive from the upstream node (agent -> collector)
a1.sources.r1.type = avro
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 52020
a1.sources.r1.channels = c1
a1.sources.r1.interceptors = i1 i2
a1.sources.r1.interceptors.i1.type = timestamp
a1.sources.r1.interceptors.i2.type = host
a1.sources.r1.interceptors.i2.hostHeader = hostname
a1.sources.r1.interceptors.i2.useIP = false

# set sink to hdfs
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /data/flume/loadbalance/%{hostname}
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.writeFormat = TEXT
a1.sinks.k1.hdfs.rollInterval = 10
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.filePrefix = %Y-%m-%d
~~~

(3) Testing load balancing

1) First start the collectors on itcast02 and itcast03

~~~
bin/flume-ng agent -n a1 -c conf -f conf/flume-server-loadbalance.conf -Dflume.root.logger=DEBUG,console
~~~

2) Then start the client on itcast01

~~~
bin/flume-ng agent -n agent1 -c conf -f conf/flume-client-loadbalance.conf -Dflume.root.logger=DEBUG,console
~~~

3) Generate data with a shell loop

~~~
while true;do date >> test.log; sleep 1s ;done
~~~

4) Watch the data directories created on HDFS: because of the round-robin mechanism, both collectors receive data

![](https://box.kancloud.cn/25be59a1c1cc84826d52932e555e049a_554x41.png)

5) After the agent on itcast02 is killed, itcast02 no longer produces data

![](https://box.kancloud.cn/e0d4e496b209724ebcfd25418b65f6a5_554x44.png)

6) After the agent on itcast02 is restarted, both collectors receive data again

![](https://box.kancloud.cn/ea4ac2e1a62906567fee922c858e32f0_554x43.png)

#### 2.3.5 Classified log collection and aggregation with Flume

> <see the accompanying case materials>

#### 2.3.6 Custom Flume interceptors

> <see the accompanying case materials>

## 3 Things to watch out for when using Flume in practice

1) Be careful when writing the start-up command

> Don't misspell the agent name, and add nohup ... when running in the background.

2) Channel parameters

| Parameter | Description |
| --- | --- |
| capacity | Maximum number of events the channel can hold |
| transactionCapacity | Maximum number of events taken from the source or given to the sink per transaction |
| keep-alive | How long an event add or remove operation may wait on the channel |

> Note: capacity > transactionCapacity

3) Collecting logs into HDFS, configuration notes 1 (sink side)

> Define the sink:

~~~
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://192.168.200.101:9000/source/logs/%{type}/%Y%m%d
a1.sinks.k1.hdfs.filePrefix = events
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.writeFormat = Text
# Use the local time for the time-based escape sequences
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# Do not roll files by event count
a1.sinks.k1.hdfs.rollCount = 0
# Roll files by time
a1.sinks.k1.hdfs.rollInterval = 30
# Roll files by size
a1.sinks.k1.hdfs.rollSize = 10485760
# Number of events written to HDFS per batch
a1.sinks.k1.hdfs.batchSize = 10000
# Number of threads flume uses for HDFS operations (create, write, etc.)
a1.sinks.k1.hdfs.threadsPoolSize = 10
# Timeout for HDFS operations
a1.sinks.k1.hdfs.callTimeout = 30000
~~~

4) Collecting logs into HDFS, configuration notes 2 (sink side)

> hdfs.round        false    Should the timestamp be rounded down (if true, affects all time based escape sequences except %t)
> hdfs.roundValue   1        Rounded down to the highest multiple of this (in the unit configured using hdfs.roundUnit), less than current time.
> hdfs.roundUnit    second   The unit of the round down value - second, minute or hour.

| Parameter | Description |
| --- | --- |
| round | Default: false. Whether to round the timestamp down, similar to "rounding off" |
| roundValue | Default: 1. The value to round down to |
| roundUnit | Default: second. The unit of the rounding value: second, minute or hour |

> Case (1):

~~~
a1.sinks.k1.hdfs.path = /flume/events/%Y-%m-%d/%H:%M/%S
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute
~~~

> When the time is 2015-10-16 17:38:59, hdfs.path resolves to the 17:30 bucket; as time passes, the directories advance in 10-minute steps:

~~~
/flume/events/2015-10-16/17:30/00
/flume/events/2015-10-16/17:40/00
/flume/events/2015-10-16/17:50/00
~~~

> Because the timestamp is rounded down in 10-minute steps, a new directory is created every 10 minutes.

> Case (2):

~~~
a1.sinks.k1.hdfs.path = /flume/events/%Y-%m-%d/%H:%M/%S
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = second
~~~

> Result: a directory is created for every 10-second bucket, and each directory ends up holding a lot of small files!!!
> The HDFS directory layout looks like this:

~~~
/flume/events/2016-07-28/18:45/10
/flume/events/2016-07-28/18:45/20
/flume/events/2016-07-28/18:45/30
/flume/events/2016-07-28/18:45/40
/flume/events/2016-07-28/18:45/50
/flume/events/2016-07-28/18:46/10
/flume/events/2016-07-28/18:46/20
/flume/events/2016-07-28/18:46/30
/flume/events/2016-07-28/18:46/40
/flume/events/2016-07-28/18:46/50
~~~

> An hourly variant is sketched below.
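> As a further illustration of the rounding settings, here is a sketch of an hourly layout that avoids the small-file problem of case (2); the path and values are illustrative, not taken from the original cases:

~~~
# Illustrative values: round the timestamp down to the hour, one directory per hour
a1.sinks.k1.hdfs.path = /flume/events/%Y-%m-%d/%H
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 1
a1.sinks.k1.hdfs.roundUnit = hour
# Roll by size and time rather than by event count, so files stay reasonably large
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.hdfs.rollSize = 10485760
a1.sinks.k1.hdfs.rollInterval = 600
~~~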
5) Tailing new content appended to a file with tail -F (resuming after a restart)

> (For details see case 6 in the course materials: classified data collection using the static interceptor.)

> Source-side configuration:

~~~
a1.sources.r2.type = exec
a1.sources.r2.command = tail -F /root/data/nginx.log
a1.sources.r2.interceptors = i2
a1.sources.r2.interceptors.i2.type = static
a1.sources.r2.interceptors.i2.key = type
a1.sources.r2.interceptors.i2.value = nginx
~~~

> There is a problem with this setup: if the flume agent process dies or the server goes down while the file keeps growing, then on the next restart the agent re-consumes data it has already read. What can be done?

> Solution: use an improved command in the configuration:

~~~
a1.sources.r2.command = tail -n +$(tail -n1 /root/log) -F /root/data/nginx.log | awk 'ARGIND==1{i=$0;next}{i++;if($0~/^tail/){i=0};print $0;print i >> "/root/log";fflush("")}' /root/log -
~~~

> The idea is that every time the source reads a line, the current line number is recorded in /root/log. When the agent is restarted after a crash, it first reads the last recorded line number and resumes from there, so no data is consumed twice.

> Flume 1.7 ships this capability out of the box (the taildir source). Example configuration:

~~~
a1.channels = ch1
a1.sources = s1
a1.sinks = hdfs-sink1

# channel
a1.channels.ch1.type = memory
a1.channels.ch1.capacity = 100000
a1.channels.ch1.transactionCapacity = 50000

# source
a1.sources.s1.channels = ch1
# monitor new content appended to several files in a directory
a1.sources.s1.type = taildir
# the consumed offset of every file is stored as JSON, so consumption never restarts from the beginning
a1.sources.s1.positionFile = /var/local/apache-flume-1.7.0-bin/taildir_position.json
a1.sources.s1.filegroups = f1 f2 f3
a1.sources.s1.filegroups.f1 = /root/data/access.log
a1.sources.s1.filegroups.f2 = /root/data/nginx.log
a1.sources.s1.filegroups.f3 = /root/data/web.log
a1.sources.s1.headers.f1.headerKey = access
a1.sources.s1.headers.f2.headerKey = nginx
a1.sources.s1.headers.f3.headerKey = web
a1.sources.s1.fileHeader = true

# sink
a1.sinks.hdfs-sink1.channel = ch1
a1.sinks.hdfs-sink1.type = hdfs
a1.sinks.hdfs-sink1.hdfs.path = hdfs://master:9000/demo/data
a1.sinks.hdfs-sink1.hdfs.filePrefix = event_data
a1.sinks.hdfs-sink1.hdfs.fileSuffix = .log
a1.sinks.hdfs-sink1.hdfs.rollSize = 10485760
a1.sinks.hdfs-sink1.hdfs.rollInterval = 20
a1.sinks.hdfs-sink1.hdfs.rollCount = 0
a1.sinks.hdfs-sink1.hdfs.batchSize = 1500
a1.sinks.hdfs-sink1.hdfs.round = true
a1.sinks.hdfs-sink1.hdfs.roundUnit = minute
a1.sinks.hdfs-sink1.hdfs.threadsPoolSize = 25
a1.sinks.hdfs-sink1.hdfs.useLocalTimeStamp = true
a1.sinks.hdfs-sink1.hdfs.minBlockReplicas = 1
a1.sinks.hdfs-sink1.hdfs.fileType = DataStream
a1.sinks.hdfs-sink1.hdfs.writeFormat = Text
a1.sinks.hdfs-sink1.hdfs.callTimeout = 60000
~~~

6) Flume header parameters explained

> Configuration file test-header.conf:

~~~
a1.channels = c1
a1.sources = r1
a1.sinks = k1

# channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 100000
a1.channels.c1.transactionCapacity = 50000

# source
a1.sources.r1.channels = c1
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /var/tmp
a1.sources.r1.batchSize = 100
a1.sources.r1.inputCharset = UTF-8
# put the absolute path of the spooled file into a header named mmm
a1.sources.r1.fileHeader = true
a1.sources.r1.fileHeaderKey = mmm
# put the base name of the spooled file into a header named nnn
a1.sources.r1.basenameHeader = true
a1.sources.r1.basenameHeaderKey = nnn

# sink
a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1
~~~

> Run it with:

~~~
bin/flume-ng agent -c conf -f conf/test-header.conf -n a1 -Dflume.root.logger=DEBUG,console
~~~

> The console output looks like this:

~~~
Event: { headers:{mmm=/var/tmp/bbb, nnn=bbb} body: 30 30 30    000 }
Event: { headers:{mmm=/var/tmp/aaa, nnn=aaa} body: 31 31 31    111 }
~~~

> Here aaa and bbb are two file names under the /var/tmp directory.

> Official documentation:

![](https://box.kancloud.cn/20c1c41b3b7fc445c3ee4ac611b42d5d_463x196.png)
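> To reproduce the console output above, two small files can be placed in the spool directory before the agent starts. A sketch, assuming /var/tmp contains nothing else:

~~~
# Assumes /var/tmp holds only these two sample files
echo "000" > /var/tmp/bbb
echo "111" > /var/tmp/aaa

# Start the agent; each file becomes one event whose headers carry
# the full path (mmm) and the base name (nnn) of its source file
bin/flume-ng agent -c conf -f conf/test-header.conf -n a1 -Dflume.root.logger=DEBUG,console
~~~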