# 4.5. Random projection

Proofreaders: [@FontTian](https://github.com/FontTian) [@程威](https://github.com/apachecn/scikit-learn-doc-zh) Translator: [@Sehriff](https://github.com/apachecn/scikit-learn-doc-zh)

The [`sklearn.random_projection`](classes.html#module-sklearn.random_projection "sklearn.random_projection") module implements a simple and computationally efficient way to reduce the dimensionality of the data. It trades a controlled amount of accuracy (as additional variance) for faster processing times and smaller model sizes. The module implements two types of unstructured random matrix: [Gaussian random matrix](#gaussian-random-matrix) and [sparse random matrix](#sparse-random-matrix).

The dimensions and distribution of random projection matrices are controlled so as to preserve the pairwise distances between any two samples of the dataset. Random projection is therefore a suitable approximation technique for distance-based methods.

References:

- Sanjoy Dasgupta. 2000. [Experiments with random projection.](http://cseweb.ucsd.edu/~dasgupta/papers/randomf.pdf) In Proceedings of the Sixteenth conference on Uncertainty in artificial intelligence (UAI '00), Craig Boutilier and Moisés Goldszmidt (Eds.). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 143-151.
- Ella Bingham and Heikki Mannila. 2001. [Random projection in dimensionality reduction: applications to image and text data.](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.24.5135&rep=rep1&type=pdf) In Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining (KDD '01). ACM, New York, NY, USA, 245-250.

## 4.5.1. The Johnson-Lindenstrauss lemma

The main theoretical result behind the efficiency of random projection is the [Johnson-Lindenstrauss lemma](https://en.wikipedia.org/wiki/Johnson%E2%80%93Lindenstrauss_lemma) (quoting Wikipedia):

> In mathematics, the Johnson-Lindenstrauss lemma is a result concerning low-distortion embeddings of points from high-dimensional into low-dimensional Euclidean space. The lemma states that a small set of points in a high-dimensional space can be embedded into a space of much lower dimension in such a way that distances between the points are nearly preserved. The map used for the embedding is at least Lipschitz, and can even be taken to be an orthogonal projection.

Knowing only the number of samples, [`sklearn.random_projection.johnson_lindenstrauss_min_dim`](generated/sklearn.random_projection.johnson_lindenstrauss_min_dim.html#sklearn.random_projection.johnson_lindenstrauss_min_dim "sklearn.random_projection.johnson_lindenstrauss_min_dim") conservatively estimates the minimal size of the random subspace needed to keep the distortion introduced by the random projection within a given bound:

```
>>> from sklearn.random_projection import johnson_lindenstrauss_min_dim
>>> johnson_lindenstrauss_min_dim(n_samples=1e6, eps=0.5)
663
>>> johnson_lindenstrauss_min_dim(n_samples=1e6, eps=[0.5, 0.1, 0.01])
array([    663,   11841, 1112658])
>>> johnson_lindenstrauss_min_dim(n_samples=[1e4, 1e5, 1e6], eps=0.1)
array([ 7894,  9868, 11841])
```

[![http://sklearn.apachecn.org/cn/0.19.0/_images/sphx_glr_plot_johnson_lindenstrauss_bound_0011.png](https://box.kancloud.cn/c99f3854086ec8cfdf038620bc393c3c_566x424.jpg)](../auto_examples/plot_johnson_lindenstrauss_bound.html)

[![http://sklearn.apachecn.org/cn/0.19.0/_images/sphx_glr_plot_johnson_lindenstrauss_bound_0021.png](https://box.kancloud.cn/e7027770ed6f38c7616890ab5fb2d8e3_566x424.jpg)](../auto_examples/plot_johnson_lindenstrauss_bound.html)

Examples:

- See [The Johnson-Lindenstrauss bound for embedding with random projections](../auto_examples/plot_johnson_lindenstrauss_bound.html#sphx-glr-auto-examples-plot-johnson-lindenstrauss-bound-py) for a theoretical explanation of the Johnson-Lindenstrauss lemma and an empirical validation using sparse random matrices.

References:

- Sanjoy Dasgupta and Anupam Gupta, 1999. [An elementary proof of the Johnson-Lindenstrauss Lemma.](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.39.3334&rep=rep1&type=pdf)

## 4.5.2. Gaussian random projection

The [`sklearn.random_projection.GaussianRandomProjection`](generated/sklearn.random_projection.GaussianRandomProjection.html#sklearn.random_projection.GaussianRandomProjection "sklearn.random_projection.GaussianRandomProjection") reduces the dimensionality by projecting the original input space onto a randomly generated matrix whose components are drawn from the distribution $N(0, \frac{1}{n_{\text{components}}})$.

The following snippet shows how to use the Gaussian random projection transformer:

```
>>> import numpy as np
>>> from sklearn import random_projection
>>> X = np.random.rand(100, 10000)
>>> transformer = random_projection.GaussianRandomProjection()
>>> X_new = transformer.fit_transform(X)
>>> X_new.shape
(100, 3947)
```

## 4.5.3. Sparse random matrix

The [`sklearn.random_projection.SparseRandomProjection`](generated/sklearn.random_projection.SparseRandomProjection.html#sklearn.random_projection.SparseRandomProjection "sklearn.random_projection.SparseRandomProjection") reduces the dimensionality by projecting the original input space using a sparse random matrix.

Sparse random matrices are an alternative to the dense Gaussian random projection matrix. They guarantee similar embedding quality while being much more memory efficient and allowing faster computation of the projected data.

If we define `s = 1 / density`, the elements of the random matrix are drawn from

![\left\{ \begin{array}{c c l} -\sqrt{\frac{s}{n_{\text{components}}}} & & 1 / 2s\\ 0 &\text{with probability} & 1 - 1 / s \\ +\sqrt{\frac{s}{n_{\text{components}}}} & & 1 / 2s\\ \end{array} \right.](https://box.kancloud.cn/1776cad3f525bd5638f1db63aaaeb5de_337x89.jpg)

where ![n_{\text{components}}](https://box.kancloud.cn/2a7e717ba7b2da2f0767689d4b88c668_78x14.jpg) is the size of the projected subspace. By default, the density of non-zero elements is set to the minimum density recommended by Ping Li et al.: $1 / \sqrt{n_{\text{features}}}$.

The following snippet shows how to use the sparse random projection transformer:

```
>>> import numpy as np
>>> from sklearn import random_projection
>>> X = np.random.rand(100, 10000)
>>> transformer = random_projection.SparseRandomProjection()
>>> X_new = transformer.fit_transform(X)
>>> X_new.shape
(100, 3947)
```

References:

- D. Achlioptas. 2003. [Database-friendly random projections: Johnson-Lindenstrauss with binary coins](www.cs.ucsc.edu/~optas/papers/jl.pdf). Journal of Computer and System Sciences 66 (2003) 671–687
- Ping Li, Trevor J. Hastie, and Kenneth W. Church. 2006.
  [Very sparse random projections.](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.62.585&rep=rep1&type=pdf) In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining (KDD '06). ACM, New York, NY, USA, 287-296.
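
As a rough check of the claims in sections 4.5.1 and 4.5.3, the sketch below does two things. It first recomputes the minimum dimension by hand, assuming the bound `4 * ln(n_samples) / (eps**2 / 2 - eps**3 / 3)` that the `johnson_lindenstrauss_min_dim` docstring describes (recalled here, not quoted from this page). It then projects random data with `SparseRandomProjection` and measures how many pairwise squared distances stay within a factor of `1 ± eps`. The helper names `jl_min_dim_manual` and `fraction_within_eps`, the data sizes, and the seed are illustrative assumptions, not part of scikit-learn.

```python
import numpy as np
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.random_projection import (
    SparseRandomProjection,
    johnson_lindenstrauss_min_dim,
)


def jl_min_dim_manual(n_samples, eps):
    """Hypothetical helper: evaluate the assumed bound
    4 * ln(n_samples) / (eps**2 / 2 - eps**3 / 3), rounded down."""
    denominator = eps ** 2 / 2 - eps ** 3 / 3
    return int(4 * np.log(n_samples) / denominator)


def fraction_within_eps(n_samples=50, n_features=2000, eps=0.5, seed=0):
    """Hypothetical helper: project random data and return
    (n_components, share of pairs whose squared distance ratio
    lies in [1 - eps, 1 + eps])."""
    rng = np.random.RandomState(seed)
    X = rng.rand(n_samples, n_features)

    # With the default n_components='auto', the target size is derived
    # from n_samples and eps.
    transformer = SparseRandomProjection(eps=eps, random_state=seed)
    X_new = transformer.fit_transform(X)

    # Compare squared distances over each unordered pair of samples
    # (strict upper triangle of the distance matrix).
    pairs = np.triu_indices(n_samples, k=1)
    original = euclidean_distances(X, squared=True)[pairs]
    projected = euclidean_distances(X_new, squared=True)[pairs]
    ratio = projected / original

    within = (ratio >= 1 - eps) & (ratio <= 1 + eps)
    return X_new.shape[1], within.mean()


if __name__ == "__main__":
    # The hand-computed value and the library value should agree.
    print(jl_min_dim_manual(1_000_000, 0.5))
    print(johnson_lindenstrauss_min_dim(1_000_000, eps=0.5))

    n_components, fraction = fraction_within_eps()
    print(n_components, fraction)
```

Under these assumptions, the fraction of well-preserved pairs is expected to be close to 1. Lowering `eps` tightens the distortion bound but increases the number of components, which is the trade-off shown by the `johnson_lindenstrauss_min_dim` outputs in section 4.5.1.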