
# Quick Start Guide

This document is a quick start guide for the CLI version of LightGBM. Follow the [Installation Guide](./Installation-Guide.rst) to install LightGBM first.

**List of other helpful links**

* [Parameters](./Parameters.rst)
* [Parameters Tuning](./Parameters-Tuning.rst)
* [Python Package Quick Start](./Python-Intro.rst)
* [Python API](./Python-API.rst)

## Training Data Format

LightGBM supports input data files in [CSV](https://en.wikipedia.org/wiki/Comma-separated_values), [TSV](https://en.wikipedia.org/wiki/Tab-separated_values) and [LibSVM](https://www.csie.ntu.edu.tw/~cjlin/libsvm/) formats.

The label is the first column, and the file contains no header.

### Categorical Feature Support

Update 12/5/2016: LightGBM can use categorical features directly (no separate encoding is needed). An experiment on the [Expo data](http://stat-computing.org/dataexpo/2009/) shows an 8x speed-up compared with one-hot encoding.

For configuration details, see [Parameters](./Parameters.rst).

### Weight and Query/Group Data

LightGBM also supports weighted training, which requires additional [weight data](./Parameters.rst#io-parameters). Ranking tasks require additional [query data](./Parameters.rst#io-parameters).

Update 11/3/2016:

1. Input with a header is now supported.
2. The label column, weight column and query/group id column can be specified. Both column indices and column names are supported.
3. A list of columns to ignore can be specified.

## Parameter Quick Look

The parameter format is `key1=value1 key2=value2 ...`. Parameters can be set both in a config file and on the command line.

Some important parameters:

* `config`, default=`""`, type=string, alias=`config_file`
    * path to the config file
* `task`, default=`train`, type=enum, options=`train`, `predict`, `convert_model`
    * `train`, alias=`training`, for training
    * `predict`, alias=`prediction`, `test`, for prediction
    * `convert_model`, for converting a model file into if-else format; see [Convert Model Parameters](./Parameters.rst#convert-model-parameters) for more information
* `application`, default=`regression`, type=enum, options=`regression`, `regression_l1`, `huber`, `fair`, `poisson`, `quantile`, `quantile_l2`, `binary`, `multiclass`, `multiclassova`, `xentropy`, `xentlambda`, `lambdarank`, alias=`objective`, `app`
    * regression application
        * `regression_l2`, L2 loss, alias=`regression`, `mean_squared_error`, `mse`
        * `regression_l1`, L1 loss, alias=`mean_absolute_error`, `mae`
        * `huber`, [Huber loss](https://en.wikipedia.org/wiki/Huber_loss)
        * `fair`, [Fair loss](https://www.kaggle.com/c/allstate-claims-severity/discussion/24520)
        * `poisson`, [Poisson regression](https://en.wikipedia.org/wiki/Poisson_regression)
        * `quantile`, [Quantile regression](https://en.wikipedia.org/wiki/Quantile_regression)
        * `quantile_l2`, like `quantile`, but uses L2 loss
    * `binary`, binary log loss classification application
    * multi-class classification application
        * `multiclass`, [softmax](https://en.wikipedia.org/wiki/Softmax_function) objective function, `num_class` should be set as well
        * `multiclassova`, [One-vs-All](https://en.wikipedia.org/wiki/Multiclass_classification#One-vs.-rest) binary objective function, `num_class` should be set as well
    * cross-entropy application
        * `xentropy`, objective function for cross-entropy (with optional linear weights), alias=`cross_entropy`
        * `xentlambda`, alternative parameterization of cross-entropy, alias=`cross_entropy_lambda`
        * the label can be any value in the interval [0, 1]
    * `lambdarank`, [lambdarank](https://papers.nips.cc/paper/2971-learning-to-rank-with-nonsmooth-cost-functions.pdf) application
        * in lambdarank tasks the label should be of type `int`, and a larger number means higher relevance (e.g. 0:bad, 1:fair, 2:good, 3:perfect)
        * `label_gain` can be used to set the gain (weight) of each `int` label
* `boosting`, default=`gbdt`, type=enum, options=`gbdt`, `rf`, `dart`, `goss`, alias=`boost`, `boosting_type`
    * `gbdt`, traditional Gradient Boosting Decision Tree
    * `rf`, Random Forest
    * `dart`, [Dropouts meet Multiple Additive Regression Trees](https://arxiv.org/abs/1505.01866)
    * `goss`, Gradient-based One-Side Sampling
* `data`, default=`""`, type=string, alias=`train`, `train_data`
    * training data; LightGBM trains from this data
* `valid`, default=`""`, type=multi-string, alias=`test`, `valid_data`, `test_data`
    * validation/test data; LightGBM outputs metrics for these data
    * multiple validation data files are supported, separated by `,`
* `num_iterations`, default=`100`, type=int, alias=`num_iteration`, `num_tree`, `num_trees`, `num_round`, `num_rounds`, `num_boost_round`
    * number of boosting iterations/trees
* `learning_rate`, default=`0.1`, type=double, alias=`shrinkage_rate`
    * shrinkage rate
* `num_leaves`, default=`31`, type=int, alias=`num_leaf`
    * number of leaves in one tree
* `tree_learner`, default=`serial`, type=enum, options=`serial`, `feature`, `data`, `voting`, alias=`tree`
    * `serial`, single-machine tree learner
    * `feature`, alias=`feature_parallel`, feature parallel tree learner
    * `data`, alias=`data_parallel`, data parallel tree learner
    * `voting`, alias=`voting_parallel`, voting parallel tree learner
    * see the [Parallel Learning Guide](./Parallel-Learning-Guide.rst) for more details
* `num_threads`, default=`OpenMP_default`, type=int, alias=`num_thread`, `nthread`
    * number of threads for LightGBM
    * for the best speed, set this to the number of **real CPU cores**, not the number of threads (most CPUs use [hyper-threading](https://en.wikipedia.org/wiki/Hyper-threading) to expose 2 threads per CPU core)
    * for parallel learning, do not use all CPU cores, since this causes poor network performance
* `max_depth`, default=`-1`, type=int
    * limit on the maximum depth of the tree model. This is used to deal with overfitting when `#data` is small. The tree still grows leaf-wise
    * `< 0` means no limit
* `min_data_in_leaf`, default=`20`, type=int, alias=`min_data_per_leaf`, `min_data`, `min_child_samples`
    * minimal number of data points in one leaf. Can be used to deal with overfitting
* `min_sum_hessian_in_leaf`, default=`1e-3`, type=double, alias=`min_sum_hessian_per_leaf`, `min_sum_hessian`, `min_hessian`, `min_child_weight`
    * minimal sum of hessians in one leaf. Like `min_data_in_leaf`, it can be used to deal with overfitting

For the full list of parameters, see [Parameters](./Parameters.rst).

## Run LightGBM

For Windows:

```
lightgbm.exe config=your_config_file other_args ...
```

For Unix:

```
./lightgbm config=your_config_file other_args ...
```

Parameters can be set both in the config file and on the command line, and the parameters on the command line take priority over those in the config file. For example, the following command keeps `num_trees=10` and ignores the same parameter in the config file.

```
./lightgbm config=train.conf num_trees=10
```

## Examples

* [Binary Classification](https://github.com/Microsoft/LightGBM/tree/master/examples/binary_classification)
* [Regression](https://github.com/Microsoft/LightGBM/tree/master/examples/regression)
* [Lambdarank](https://github.com/Microsoft/LightGBM/tree/master/examples/lambdarank)
* [Parallel Learning](https://github.com/Microsoft/LightGBM/tree/master/examples/parallel_learning)
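
As a rough illustration of the `key=value` parameter format and of the rule that command-line parameters take priority over the config file, the following Python sketch merges the two sources. It is not LightGBM's actual parser: the helper names `parse_params` and `merge_params`, the config contents and the file name `train.txt` are hypothetical, the config is assumed to hold one `key = value` pair per line, and parameter aliases (for example `num_trees` vs. `num_iterations`) are not resolved here.

```python
def parse_params(tokens):
    """Parse 'key=value' strings into a dict (hypothetical helper).

    Blank entries and entries starting with '#' are skipped.
    Later entries overwrite earlier ones with the same key.
    """
    params = {}
    for token in tokens:
        token = token.strip()
        if not token or token.startswith("#"):
            continue
        key, sep, value = token.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got: {token!r}")
        params[key.strip()] = value.strip()
    return params


def merge_params(config_text, cli_args):
    """Combine config-file and command-line parameters.

    Command-line values win over config-file values for the same key.
    Aliases are NOT resolved in this sketch.
    """
    merged = parse_params(config_text.splitlines())
    merged.update(parse_params(cli_args))
    return merged


# Hypothetical config file contents (assumed: one key = value per line).
config_text = """
# illustrative train.conf
task = train
application = binary
data = train.txt
num_trees = 100
"""

# Mirrors: ./lightgbm config=train.conf num_trees=10
merged = merge_params(config_text, ["num_trees=10"])
print(merged["num_trees"])
```

In this sketch the command-line value of `num_trees` replaces the value from the config text, while parameters that appear only in the config, such as `application`, are kept unchanged.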