# 5.15 Machine Learning · Machine Learning Study Notes (1) by OTreeWEN
> Source: https://uqer.io/community/share/55bc40caf9f06c91f918c604
```py
# Check that the scikit-learn package is installed
import sklearn as sk
import numpy as np
import matplotlib.pyplot as plt

# Load the iris dataset
from sklearn import datasets
iris = datasets.load_iris()
X_iris, y_iris = iris.data, iris.target

# Print the shapes of X and y
print X_iris.shape, y_iris.shape
print 'X_iris =', X_iris[0]
print 'y_iris =', y_iris[0]

# Output:
# (150, 4) (150,)
# X_iris = [ 5.1  3.5  1.4  0.2]
# y_iris = 0
```
Any machine learning problem can be represented with the following three concepts:
- We will have to learn to solve a task T. For example, build a spam filter that learns to classify e-mails as spam or ham.
- We will need some experience E to learn to perform the task. Usually, experience is represented through a dataset. For the spam filter, experience comes as a set of e-mails, manually classified by a human as spam or ham.
- We will need a measure of performance P to know how well we are solving the task, and also to know whether, after making some modifications, our results are improving or getting worse. For the spam-filtering task, P could be the percentage of e-mails that our filter correctly classifies as spam or ham.
Our first machine learning method – linear classification
```py
# ex1: linear classification
from sklearn.cross_validation import train_test_split
from sklearn import preprocessing

# Get dataset with only the first two attributes
X, y = X_iris[:, :2], y_iris

# Split the dataset into a training and a testing set.
# The test set will be 25% of the data, taken randomly.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=33)

# Standardize the features (zero mean, unit variance), using
# statistics computed on the training set only
scaler = preprocessing.StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
print 'X_train', X_train.shape  # 75% of the samples
print 'X_test', X_test.shape    # 25% of the samples

# Plot the training instances, one color per class
colors = ['red', 'greenyellow', 'blue']
for i in xrange(len(colors)):
    xs = X_train[:, 0][y_train == i]
    ys = X_train[:, 1][y_train == i]
    plt.scatter(xs, ys, c=colors[i])
plt.legend(iris.target_names)
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')

# SGD stands for Stochastic Gradient Descent
from sklearn.linear_model import SGDClassifier
clf = SGDClassifier()
clf.fit(X_train, y_train)  # three-class problem: one linear boundary per class
print 'shape of coef =', clf.coef_.shape
print 'shape of intercept =', clf.intercept_.shape

# Output:
# X_train (112, 2)
# X_test (38, 2)
# shape of coef = (3, 2)
# shape of intercept = (3,)
```
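The shapes above mean the classifier has learned one line per class (one-versus-rest): a row of `coef_` and an entry of `intercept_` for each of the three species. As a sanity check, the per-class decision values can be reproduced directly from these attributes (a minimal sketch, reusing the fitted `clf`):

```py
# For a one-vs-rest linear classifier, the decision value for class i on
# sample x is x.dot(coef_[i]) + intercept_[i]; predict takes the argmax
x = X_test[0]
manual = x.dot(clf.coef_.T) + clf.intercept_
print manual
print clf.decision_function([x])  # should match the row above
print clf.predict([x])[0], '==', manual.argmax()
```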

The following code draws the three decision boundaries and lets us know if they worked as expected:
```py
# Set the plotting bounds from the (scaled) training data
x_min, x_max = X_train[:, 0].min() - .5, X_train[:, 0].max() + .5
y_min, y_max = X_train[:, 1].min() - .5, X_train[:, 1].max() + .5
xs = np.arange(x_min, x_max, 0.5)

fig, axes = plt.subplots(1, 3)
fig.set_size_inches(10, 6)
for i in [0, 1, 2]:
    # Configure the title, labels, and limits of subplot i
    axes[i].set_aspect('equal')
    axes[i].set_title('Class ' + str(i) + ' versus the rest')
    axes[i].set_xlabel('Sepal length')
    axes[i].set_ylabel('Sepal width')
    axes[i].set_xlim(x_min, x_max)
    axes[i].set_ylim(y_min, y_max)
    plt.sca(axes[i])  # select subplot i
    # Scatter plot of the training points
    plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap=plt.cm.prism)
    # Decision boundary of class i: coef[i,0]*x + coef[i,1]*y + intercept[i] = 0
    ys = (-clf.intercept_[i] - xs * clf.coef_[i, 0]) / clf.coef_[i, 1]
    plt.plot(xs, ys)
```

Suppose that we have a new flower with a sepal length of 4.7 and a sepal width of 3.1, and we want to predict its class. We just have to apply our brand-new classifier to it (after scaling it with the same scaler used for training!). The classifier predicts the class with the largest decision function value, which here is class 0 (setosa).
```py
print clf.predict(scaler.transform([[4.7, 3.1]]))[0]
# decision_function returns one score per class; predict picks the largest
print clf.decision_function(scaler.transform([[4.7, 3.1]]))

# Output:
# 0
# [[ 19.77232705   8.13983962 -28.65250296]]
```
We want to be a little more formal when we talk about a good classifier. What does that mean? The performance of a classifier is a measure of its effectiveness. The simplest performance measure is accuracy: given a classifier and an evaluation dataset, it measures the proportion of instances correctly classified by the classifier.
```py
# Evaluating the results
from sklearn import metrics
y_train_pred = clf.predict(X_train)
print metrics.accuracy_score(y_train, y_train_pred)

# Output:
# 0.821428571429
```
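Since accuracy is just the fraction of correctly classified instances, the same number can be reproduced by hand (a minimal sketch, reusing the arrays defined above):

```py
# Accuracy = (number of correct predictions) / (total predictions);
# this matches metrics.accuracy_score above
print np.mean(y_train_pred == y_train)
```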
Probably the most important thing you should learn from this chapter is that measuring accuracy on the training set is a really bad idea. You built your model using this data, so it may fit it well yet perform poorly on future, previously unseen data, which is what the model is actually for. This phenomenon is called overfitting, and you will see it again and again as you read this book. If you measure on your training data, you will never detect overfitting. So, never measure on your training data. This is why we reserved part of the original dataset (the testing partition): we want to evaluate performance on previously unseen data. Let's check the accuracy again, now on the evaluation set (recall that it was already scaled):
```py
y_pred = clf.predict(X_test)
print metrics.accuracy_score(y_test, y_pred)

# Output:
# 0.684210526316
```
- Precision: the proportion of instances predicted as positive that really are positive (it measures how right our classifier is when it says that an instance is positive).
- Recall: the proportion of positive instances that are correctly identified (it measures how right our classifier is when faced with a positive instance).
- F1-score: the harmonic mean of precision and recall, 2 * precision * recall / (precision + recall), which combines both measures into a single number.
```py
print metrics.classification_report(y_test, y_pred, target_names=iris.target_names)

# Output:
#              precision    recall  f1-score   support
#
#      setosa       1.00      1.00      1.00         8
#  versicolor       0.43      0.27      0.33        11
#   virginica       0.65      0.79      0.71        19
#
# avg / total       0.66      0.68      0.66        38
```
Another useful metric (especially for multi-class problems) is the confusion matrix: its (i, j) cell shows the number of instances of class i that were predicted to be of class j. A good classifier accumulates its values on the diagonal of the confusion matrix, where the correctly classified instances lie.
```py
print metrics.confusion_matrix(y_test, y_pred)
```
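The per-class precision and recall reported earlier can be read directly off this matrix: for class i, recall is the diagonal entry divided by its row sum (all true instances of class i), and precision is the diagonal entry divided by its column sum (all instances predicted as class i). A minimal sketch:

```py
cm = metrics.confusion_matrix(y_test, y_pred)
for i, name in enumerate(iris.target_names):
    # Row i holds the true instances of class i, column i the predictions
    recall = cm[i, i] / float(cm[i, :].sum())
    precision = cm[i, i] / float(cm[:, i].sum())
    print name, 'precision = %.2f, recall = %.2f' % (precision, recall)
```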
To finish our evaluation process, we will introduce a very useful method known as cross-validation. As we explained before, we have to partition our dataset into a training set and a testing set. However, partitioning leaves fewer instances to train on, and the score depends on the particular (usually random) partition we happen to make, so we can get better or worse results by chance. Cross-validation lets us avoid this, reducing result variance and producing a more realistic score for our models. The usual steps of k-fold cross-validation are the following:
1. Partition the dataset into k different subsets (folds).
2. Create k different models, each trained on k-1 of the subsets and tested on the remaining one.
3. Measure the performance of each of the k models and take the average.
```py
from sklearn.cross_validation import cross_val_score, KFold
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# Create a composite estimator: a pipeline of standardization
# followed by the linear model
clf = Pipeline([('scaler', StandardScaler()), ('linear_model', SGDClassifier())])

# Create a k-fold cross-validation iterator with k=5 folds
cv = KFold(X.shape[0], 5, shuffle=True, random_state=33)

# By default the score used is the one returned by the estimator's
# score method (accuracy for classifiers)
scores = cross_val_score(clf, X, y, cv=cv)
print scores

# Output:
# [ 0.73333333  0.63333333  0.73333333  0.66666667  0.6       ]
```
We obtained an array with the k scores. We can calculate the mean and the standard error to obtain a final figure:
```py
from scipy.stats import sem

def mean_score(scores):
    # Report the mean of the k scores together with their standard error
    return "Mean score: {0:.3f} (+/-{1:.3f})".format(np.mean(scores), sem(scores))

print mean_score(scores)

# Output:
# Mean score: 0.673 (+/-0.027)
```
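For comparison, the three k-fold steps listed above can also be written out explicitly; this is essentially what cross_val_score does internally (a minimal sketch, reusing the cv iterator and the pipeline defined earlier):

```py
# Explicit k-fold loop: fit on k-1 folds, score on the held-out fold,
# then average the k scores
fold_scores = []
for train_idx, test_idx in cv:
    clf.fit(X[train_idx], y[train_idx])
    fold_scores.append(clf.score(X[test_idx], y[test_idx]))
print mean_score(fold_scores)
```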
Machine learning categories
Classification is only one of the possible machine learning problems that can be addressed with scikit-learn. We can organize them into the following categories:
- In the previous example we had a set of instances (that is, data collected from a population) represented by certain features and with a particular target attribute. Supervised learning algorithms try to build a model from this data that lets us predict the target attribute for new instances, knowing only the instance features. When the target belongs to a discrete set (such as a list of flower species), we are facing a classification problem.
- Sometimes the class we want to predict, instead of belonging to a discrete set, ranges over a continuous set such as the real number line. In this case we are trying to solve a regression problem (the term was coined by Francis Galton, who observed that the heights of descendants of tall ancestors tend to regress towards a normal value, the average human height). For example, we could try to predict the petal width based on the other three features (see the sketch after this list).
- Another type of machine learning problem is unsupervised learning. Here we do not have a target class to predict; instead we want to group instances according to some similarity measure based on the available features. For example, suppose you have a dataset of e-mails and want to group them by their main topic (this task is called clustering); as features we could use the different words that appear in each e-mail.
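To make the last two categories concrete, here is a minimal sketch on the same iris data (an illustration added here, not part of the original notebook; the choice of LinearRegression and KMeans is an assumption):

```py
# Illustrative sketch (assumed, not from the original notebook)
from sklearn.linear_model import LinearRegression
from sklearn.cluster import KMeans

# Regression: predict petal width (4th column) from the other three features
X_reg, y_reg = X_iris[:, :3], X_iris[:, 3]
reg = LinearRegression().fit(X_reg, y_reg)
print 'regression R^2 on training data =', reg.score(X_reg, y_reg)

# Clustering: group the instances without ever looking at y_iris
km = KMeans(n_clusters=3, random_state=33).fit(X_iris)
print 'cluster sizes =', np.bincount(km.labels_)
```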