FAQ · Storm 1.1.0 中文文檔

# FAQ ## 最佳實踐 ### 針對配置 Storm+Trident, 您可以給我哪些經驗呢? * worker 的數量是機器數量的倍數; 并行度是 worker 數量的倍數; kafka partitions 的數量是 spout 并行度的倍數 * 每個機器上的每個 topology 使用一個 worker * Start with fewer, larger aggregators, one per machine with workers on it * 使用 isolation scheduler（隔離調度器） * 每個 worker 使用一個 acker -- 0.9 版本默認是這樣的, 但是更早的版本沒有這樣. * 啟用 GC 日志記錄; 在正常情況下, 你應該看到很少的 major GC. * set the trident batch millis to about 50% of your typical end-to-end latency. * Start with a max spout pending that is for sure too small -- one for trident, or the number of executors for storm -- and increase it until you stop seeing changes in the flow. You'll probably end up with something near `2*(throughput in recs/sec)*(end-to-end latency)` (2x the Little's law capacity). ### What are some of the best ways to get a worker to mysteriously and bafflingly die? ### 什么是獲取 * 您是否有對 log directory（日志目錄）的寫入權限 * 您擴大過你的 heap 大小嗎? * 是否所有的 workers 都安裝了正確的 libraries（函數庫）? * 是否 zookeeper 的 hostname（主機名）仍然設置為 localhost 了? * 您提供了一個正確, 唯一的 hostname（主機名） -- 它可以解析回機器上 -- 對于每個 worker, 并且將它們放入 storm conf 文件中? * 您是否雙向開啟了 firewall/securitygroup 的權限 a) 所有的 workers, b) storm master, c) zookeeper? 另外, 從 workers 到您的 topology 訪問的任何 kafka/kestrel/database/etc ? 使用 netcat 來檢測下對應的 ports（端口）并且確定一下. ### Help! 我不能看到: * **my logs** 默認情況下, 日志為 $STORM_HOME/logs. 檢查您是否具有該目錄的寫入權限. 他們配置在 log4j2/{cluster, worker}.xml 文件中. * **final JVM settings** 在 childopts 中添加 `-XX+PrintFlagsFinal` 命令行選項（請看配置文件） * **final Java system properties** 添加 `Properties props = System.getProperties(); props.list(System.out);` 靠近您構建 topology（拓撲）的地方. ### 我應該使用多少個 Workers? worker 的數量是由 supervisors 來確定的 -- 每個 supervisor 將監督一些 JVM slots. 您在 topology（拓撲）上設置的事情是它將嘗試聲明多少個 worker slots. 每臺機器每個 topology（拓撲）使用多個 worker 沒有很好的理由。一個 topology（拓撲）運行在三個 8 核心的節點上, 并行度是 24, 每臺機器的每個 bolt 將得到 8 個 executor（執行器）, 即每個核心一個. 運行三個 worker（每個有 8 個指定的 executor）相對于運行 24 個 worker（每個分配一個 executor）有 3 個大的優勢. 第一，對同一個 worker 的 executor 進行重新分區（shuffles 或 group-bys）的數據不必放入傳輸緩沖區. 相反, tuple 直接從發送到接收緩沖區存儲. 這是一個很大的優勢. 相反，如果目標 executor 在同一臺計算機上的不同 worker 上, 則必須執行 send - > worker transfer - > local socket - > worker recv - > exec recv buffer. 它不經過打網卡，但并不像 executor 在同一個 worker 那么大. 通常情況下，三個具有非常大的 backing cache（后備緩存）的 aggregator（聚合器）比擁有小的 backing caches（后臺緩存）的二十四個 aggregators（聚合器）更好，因為這樣減少了數據傾斜的影響，并提高了 LRU 效率. 最后，更少的 workers 降低了控制 flow 的難度. ## Topology（拓撲） ### 一個 Trident topology 可以有多個 Streams 嗎? > Trident topology 可以像帶條件路徑（if-else）的 workflow（工作流）一樣工作嗎? 例如. 一個 Spout(S1) 連接到 bolt(b0), 其基于進入 tuple（元組）中的某些值將它們引導到 blolt（B1）或 bolt（B2），而不是兩者都有. 一個 Trident 的 "each" 操作返回一個 Stream 對象, 你可以在一個變量中存儲它. 然后，您可以在同一個 Stream 上運行多個 each 進行 split 拆分, 例如: ``` Stream s = topology.each(...).groupBy(...).aggregate(...) Stream branch1 = s.each(..., FilterA) Stream branch2 = s.each(..., FilterB) ``` 你可以使用 join, merge 或 multiReduce 來 join streams. 在寫入操作時，您不能向 Trident 的 emit（發射）多個輸出流 -- 請參閱 [STORM-68](https://issues.apache.org/jira/browse/STORM-68) ### 當我啟動 topology 時, 為什么獲得一個 NotSerializableException/IllegalStateException 異常? 在 Storm 的生命周期中，在執行 topology 之前，topology 被實例化，然后序列化為字節格式以存儲在 ZooKeeper 中. 在此步驟中，如果 topology 中的 Spout 或 Bolt 具有初始化的不可序列化屬性，序列化將會失敗. 如果需要一個不序列化的字段，請在將 topology 傳遞給 worker 之后運行的 blot 或 spout 的 prepare 方法中進行初始化. ## Spouts ### coordinator 是什么, 為什么有幾個? trident-spout 實際運行在 storm _bolt_ 之內. trident topology 的 storm-spout 是 MasterBatchCoordinator -- 它協調了 trident batches，無論您使用什么 spout 都是一樣的. 當 MBC 為每個 spout-coordinators 分配一個 seed tuple（種子元組）時，batch 就誕生了. spout-coordinator bolts 知道您特定的 spouts 應該如何配合 -- 所以在 kafka 的場景中, 這有助于找出每個 spout 應該從哪個 partition 和 offset 進行 pull 操作. ### What can I store into the spout's metadata record? You should only store static data, and as little of it as possible, into the metadata record (note: maybe you _can_ store more interesting things; you shouldn't, though) ### 'emitPartitionBatchNew' 函數多久被調用一次? Since the MBC is the actual spout, all the tuples in a batch are just members of its tupletree. That means storm's "max spout pending" config effectively defines the number of concurrent batches trident runs. The MBC emits a new batch if it has fewer than max-spending tuples pending and if at least one [trident batch interval](http://github.com/apache/storm/blob/master%0A/conf/defaults.yaml#L115)'s worth of seconds has passed since the last batch. 由于 MBC 是實際的 spout，所以一個 batch 中的所有 tuple 只是它的 tupletree 的成員. 這意味著 storm 的 "max spout pending" 配置有效地定義了并發 batch trident 運行的次數. ### If nothing was emitted does Trident slow down the calls? Yes, there's a pluggable "spout wait strategy"; the default is to sleep for a [configurable amount of time](http://github.com/apache/storm/blob/master%0A/conf/defaults.yaml#L110) ### OK, 那么 trident batch 間隔是多少? 你知道 486 時代的電腦有一個 [turbo button](http://en.wikipedia.org/wiki/Turbo_button) 嗎？ Actually, it has two practical uses. One is to throttle spouts that poll a remote source without throttling processing. For example, we have a spout that looks in a given S3 bucket for a new batch-uploaded file to read, linebreak and emit. We don't want it hitting S3 more than every few seconds: files don't show up more than once every few minutes, and a batch takes a few seconds to process. The other is to limit overpressure on the internal queues during startup or under a heavy burst load -- if the spouts spring to life and suddenly jam ten batches' worth of records into the system, you could have a mass of less-urgent tuples from batch 7 clog up the transfer buffer and prevent the $commit tuple from batch 3 to get through (or even just the regular old tuples from batch 3). What we do is set the trident batch interval to about half the typical end-to-end processing latency -- if it takes 600ms to process a batch, it's OK to only kick off a batch every 300ms. Note that this is a cap, not an additional delay -- with a period of 300ms, if your batch takes 258ms Trident will only delay an additional 42ms. ### 你是怎樣設置 batch 大小的? Trident 不對 batch 數量設置自己的限制. 在 Kafka spout 的場景中，最大抓取的字節大小初一平均的記錄大小定義了每個子分區的有效記錄. ### 如何調整 batch 的大小? trident batch 是一個有點過載的設施. 與 partition（分區）數量一起，batch 大小受限于或用于定義: 1. the unit of transactional safety (tuples at risk vs time) 2. per partition, an effective windowing mechanism for windowed stream analytics 3. per partition, the number of simultaneous queries that will be made by a partitionQuery, partitionPersist, etc; 4. per partition, the number of records convenient for the spout to dispatch at the same time; 一旦生成，您將無法更改總體的 batch 大小，但您可以更改 partition 數量 - 執行 shuffle，然后更改 parallelism hint（并行度） ## Time Series（時間序列） ### 如何按時間聚合事件? 如果您的記錄具有不可變的 timestamp（時間戳），并且您想 count，average 或以其他方式將它們聚合到離散時間段中，則 Trident 是一款出色且可擴展的解決方案。編寫一個將 timestamp 轉換成 time bucket 的 `Each` 函數: 如果 bucket 的大小是 "by hour（按小時的）" , 則時間戳 `2013-08-08 12:34:56` 將被映射成 `2013-08-08 12:00:00` time bucket, 十二點鐘以后的其它時間也是這樣. 然后在那個 timebucket 上的 group（組）并使用分組的 persistentAggregate 方法. persistentAggregate 使用由數據存儲支持的本地 cacheMap. 具有許多記錄的 Groups 需要從數據存儲器讀取很少的數據, 并使用高效的批量讀寫. 只要您的數據供給相對較快，Trident 就可以非常有效地利用內存和網絡. 即使服務器脫機一天，然后在一瞬間提供全天的數據, 舊的結果將被安靜地檢索和更新 -- 并且不干擾計算當前結果. ### 我怎么知道一個時間內的 bucket 的所有記錄已經被收到? 你不能知道所有的 event（事件）都被收集 -- 這是一個 epistemological challenge（認識論的挑戰），而不是分布式系統的挑戰. 您可以: * 使用 domain knowledge 設置時間限制 * 引入 _punctuation_: 一個 record 知道緊跟特定時間 bucket 內所有記錄之后而來. Trident 使用此方案知道 batch 何時完成. 例如，如果您從一組傳感器接收記錄，則每個傳感器都將按照傳感器的順序發送，所有傳感器都會向您發送 3：02：xx 或更后版本的時間戳，以讓您知道可以 commit（提交）. * 在可能的情況下, 使您的進程增加: 進來的每個 value 會讓答案越來越正確. Trident ReducerAggregator 是一個 operator, 它采取先前的結果和一組新的記錄，并返回一個新的結果. 這樣可以將結果緩存并序列化到數據存儲; 如果一臺服務器脫機一天，然后在一天內回來一整天的數據，舊的結果將被平靜地檢索和更新. * Lambda 架構: 在接收時將所有 event（時間）記錄到 archival store（S3, HBase, HDFS）. 在快速處理的層面上, 一旦時間窗口被 clear（清楚）, 處理 bucket 以獲得可行的答案, 并忽略比時間窗口更舊的一切. 定期運行全局聚合以計算 "正確的" 答案。