SVD (Singular Value Decomposition) can be used to simplify data, remove noise, and improve the results of downstream algorithms.
### I. SVD and Recommender Systems
The table below consists of a restaurant's dishes and food critics' opinions of them. A critic rates a dish with an integer from 1 to 5; a rating of 0 means the critic has not tasted that dish.

Create a new file svdRec.py and add the following code:
~~~
def loadExData():
    return [[0, 0, 0, 2, 2],
            [0, 0, 0, 3, 3],
            [0, 0, 0, 1, 1],
            [1, 1, 1, 0, 0],
            [2, 2, 2, 0, 0],
            [5, 5, 5, 0, 0],
            [1, 1, 1, 0, 0]]
~~~
~~~
>>> from numpy import linalg
>>> import svdRec
>>> Data = svdRec.loadExData()
>>> Data
[[0, 0, 0, 2, 2], [0, 0, 0, 3, 3], [0, 0, 0, 1, 1], [1, 1, 1, 0, 0], [2, 2, 2, 0, 0], [5, 5, 5, 0, 0], [1, 1, 1, 0, 0]]
>>> U, Sigma, VT = linalg.svd(Data)
>>> Sigma
array([  9.64365076e+00,   5.29150262e+00,   8.05799147e-16,
         2.43883353e-16,   2.07518106e-17])
~~~
Notice that the first two singular values are much larger than the rest, so the last three can be dropped: their influence is negligible.
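To check that dropping the three tiny singular values loses essentially nothing, we can rebuild the matrix from the first two alone. This is a standalone numpy sketch, not part of svdRec.py:

```python
import numpy as np

Data = np.array([[0, 0, 0, 2, 2],
                 [0, 0, 0, 3, 3],
                 [0, 0, 0, 1, 1],
                 [1, 1, 1, 0, 0],
                 [2, 2, 2, 0, 0],
                 [5, 5, 5, 0, 0],
                 [1, 1, 1, 0, 0]])

U, Sigma, VT = np.linalg.svd(Data)

# Keep only the two large singular values and rebuild the matrix.
r = 2
approx = U[:, :r] @ np.diag(Sigma[:r]) @ VT[:r, :]

# The rank-2 reconstruction matches the original almost exactly,
# because the matrix has rank 2 (every row is a multiple of
# [0,0,0,1,1] or [1,1,1,0,0]).
print(np.allclose(approx, Data))  # True
```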
From the table above, the first three people like roast beef and pulled pork, dishes found only at American BBQ restaurants. The two large singular values can be mapped to two food concepts, American BBQ and Japanese food, so these three people can be treated as one class of users and the remaining four as another, which makes recommendation straightforward.
Now modify loadExData() in svdRec.py as follows:
~~~
def loadExData():
    return [[1, 1, 1, 0, 0],
            [2, 2, 2, 0, 0],
            [1, 1, 1, 0, 0],
            [5, 5, 5, 0, 0],
            [1, 1, 0, 2, 2],
            [0, 0, 0, 3, 3],
            [0, 0, 0, 1, 1]]
~~~
Perform the SVD:
~~~
>>> reload(svdRec)
<module 'svdRec' from 'svdRec.py'>
>>> Data=svdRec.loadExData()
>>> Data
[[1, 1, 1, 0, 0], [2, 2, 2, 0, 0], [1, 1, 1, 0, 0], [5, 5, 5, 0, 0], [1, 1, 0, 2, 2], [0, 0, 0, 3, 3], [0, 0, 0, 1, 1]]
>>> U,Sigma,VT=linalg.svd(Data)
>>> Sigma
array([  9.72140007e+00,   5.29397912e+00,   6.84226362e-01,
         1.67441533e-15,   3.39639411e-16])
~~~
Again, the first three singular values are much larger than the last two, so the last two can be dropped with little loss.
The original data can then be approximated by the rank-3 product U[:, :3] * diag(Sigma[:3]) * VT[:3, :], where U[:, :3] is m×3, the middle factor is 3×3, and VT[:3, :] is 3×n.
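The approximation can be checked directly. A standalone numpy sketch (using plain arrays rather than svdRec.py):

```python
import numpy as np

Data = np.array([[1, 1, 1, 0, 0],
                 [2, 2, 2, 0, 0],
                 [1, 1, 1, 0, 0],
                 [5, 5, 5, 0, 0],
                 [1, 1, 0, 2, 2],
                 [0, 0, 0, 3, 3],
                 [0, 0, 0, 1, 1]])

U, Sigma, VT = np.linalg.svd(Data)

# Rank-3 approximation: keep the three dominant singular values.
approx = U[:, :3] @ np.diag(Sigma[:3]) @ VT[:3, :]

# This matrix has rank 3, so the reconstruction agrees with Data
# to within floating-point rounding error.
print(np.allclose(approx, Data))  # True
```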

### II. A Collaborative-Filtering Recommendation Engine
Collaborative filtering makes recommendations by comparing a user's data against that of other users.
1. Computing similarity

~~~
from numpy import *
from numpy import linalg as la

def ecludSim(inA, inB):
    # Euclidean-distance-based similarity, mapped into (0, 1]
    return 1.0/(1.0 + la.norm(inA - inB))

def pearsSim(inA, inB):
    # Pearson correlation, rescaled from [-1, 1] to [0, 1]
    if len(inA) < 3: return 1.0
    return 0.5 + 0.5*corrcoef(inA, inB, rowvar=0)[0][1]

def cosSim(inA, inB):
    # cosine similarity, rescaled from [-1, 1] to [0, 1]
    num = float(inA.T*inB)
    denom = la.norm(inA)*la.norm(inB)
    return 0.5 + 0.5*(num/denom)
~~~
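A quick check of these measures on two columns of the second loadExData() matrix. This sketch uses plain arrays (so `np.dot` replaces the matrix product `inA.T*inB` from the listing):

```python
import numpy as np
from numpy import linalg as la

def ecludSim(inA, inB):
    return 1.0/(1.0 + la.norm(inA - inB))

def cosSim(inA, inB):
    num = float(np.dot(inA.ravel(), inB.ravel()))
    denom = la.norm(inA)*la.norm(inB)
    return 0.5 + 0.5*(num/denom)

M = np.array([[1, 1, 1, 0, 0],
              [2, 2, 2, 0, 0],
              [1, 1, 1, 0, 0],
              [5, 5, 5, 0, 0],
              [1, 1, 0, 2, 2],
              [0, 0, 0, 3, 3],
              [0, 0, 0, 1, 1]])

# Identical columns give the maximum similarity of 1.0;
# nearly disjoint columns 0 and 4 give a value close to 0.5.
print(ecludSim(M[:, 0], M[:, 0]))          # 1.0
print(round(cosSim(M[:, 0], M[:, 4]), 3))  # ~0.547
```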
2. Item-based versus user-based similarity
When there are many users (typically far more users than items), computing item-based similarity is the better choice.
3. Example: a dish recommendation engine based on item similarity

~~~
from numpy import *
from numpy import linalg as la

def loadExData():
    return [[1, 1, 1, 0, 0],
            [2, 2, 2, 0, 0],
            [1, 1, 1, 0, 0],
            [5, 5, 5, 0, 0],
            [1, 1, 0, 2, 2],
            [0, 0, 0, 3, 3],
            [0, 0, 0, 1, 1]]

def loadExData2():
    return [[0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 5],
            [0, 0, 0, 3, 0, 4, 0, 0, 0, 0, 3],
            [0, 0, 0, 0, 4, 0, 0, 1, 0, 4, 0],
            [3, 3, 4, 0, 0, 0, 0, 2, 2, 0, 0],
            [5, 4, 5, 0, 0, 0, 0, 5, 5, 0, 0],
            [0, 0, 0, 0, 5, 0, 1, 0, 0, 5, 0],
            [4, 3, 4, 0, 0, 0, 0, 5, 5, 0, 1],
            [0, 0, 0, 4, 0, 4, 0, 0, 0, 0, 4],
            [0, 0, 0, 2, 0, 2, 5, 0, 0, 1, 2],
            [0, 0, 0, 0, 5, 0, 0, 0, 0, 4, 0],
            [1, 0, 0, 0, 0, 0, 0, 1, 2, 0, 0]]

def ecludSim(inA, inB):
    return 1.0/(1.0 + la.norm(inA - inB))

def pearsSim(inA, inB):
    if len(inA) < 3: return 1.0
    return 0.5 + 0.5*corrcoef(inA, inB, rowvar=0)[0][1]

def cosSim(inA, inB):
    num = float(inA.T*inB)
    denom = la.norm(inA)*la.norm(inB)
    return 0.5 + 0.5*(num/denom)

# Estimate a user's rating for an item under a given similarity measure.
# dataMat is the data matrix, user the user index, simMeas the similarity
# function, and item the item index.
def standEst(dataMat, user, simMeas, item):
    n = shape(dataMat)[1]             # number of items (columns)
    simTotal = 0.0; ratSimTotal = 0.0
    for j in range(n):
        userRating = dataMat[user, j]
        if userRating == 0: continue  # rating 0 means unrated; skip item j
        # find the rows (users) in which both items have been rated
        overLap = nonzero(logical_and(dataMat[:, item].A > 0,
                                      dataMat[:, j].A > 0))[0]
        if len(overLap) == 0: similarity = 0
        else: similarity = simMeas(dataMat[overLap, item], dataMat[overLap, j])
        #print('the %d and %d similarity is: %f' % (item, j, similarity))
        simTotal += similarity
        ratSimTotal += similarity*userRating
    if simTotal == 0: return 0
    else: return ratSimTotal/simTotal

def recommend(dataMat, user, N=3, simMeas=cosSim, estMethod=standEst):
    # find the items this user has not rated yet
    unratedItems = nonzero(dataMat[user, :].A == 0)[1]
    if len(unratedItems) == 0: return 'you rated everything'
    itemScores = []
    for item in unratedItems:
        # call the estimator (e.g. standEst()) to predict a score
        # for each unrated item
        estimatedScore = estMethod(dataMat, user, simMeas, item)
        # store (item index, estimated score) pairs in itemScores
        itemScores.append((item, estimatedScore))
    # sort by estimated score, descending, and return the top N items
    return sorted(itemScores, key=lambda jj: jj[1], reverse=True)[:N]

def svdEst(dataMat, user, simMeas, item):
    n = shape(dataMat)[1]
    simTotal = 0.0; ratSimTotal = 0.0
    U, Sigma, VT = la.svd(dataMat)
    Sig4 = mat(eye(4)*Sigma[:4])      # arrange Sig4 into a diagonal matrix
    xformedItems = dataMat.T * U[:, :4] * Sig4.I  # create transformed items
    for j in range(n):
        userRating = dataMat[user, j]
        if userRating == 0 or j == item: continue
        similarity = simMeas(xformedItems[item, :].T,
                             xformedItems[j, :].T)
        print('the %d and %d similarity is: %f' % (item, j, similarity))
        simTotal += similarity
        ratSimTotal += similarity*userRating
    if simTotal == 0: return 0
    else: return ratSimTotal/simTotal
~~~
About dataMat[:, item].A: this selects column item and, since dataMat is a matrix, .A converts it to an array. logical_and then marks the rows where both column item and column j are greater than 0 (an element is True only when both ratings are positive), and nonzero returns the indices of those rows, i.e. the users who rated both items.
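A small standalone sketch of this indexing trick, using a plain array (so the .A conversion from the listing is unnecessary here):

```python
import numpy as np

dataMat = np.array([[1, 1, 1, 0, 0],
                    [2, 2, 2, 0, 0],
                    [1, 1, 1, 0, 0],
                    [5, 5, 5, 0, 0],
                    [1, 1, 0, 2, 2],
                    [0, 0, 0, 3, 3],
                    [0, 0, 0, 1, 1]])

# Rows (users) where both column 3 and column 0 hold a rating > 0.
overLap = np.nonzero(np.logical_and(dataMat[:, 3] > 0,
                                    dataMat[:, 0] > 0))[0]
print(overLap)  # [4] -- only user 4 rated both item 3 and item 0
```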
Perform the SVD on the loadExData2() matrix:
~~~
>>> from numpy import *
>>> from numpy import linalg as la
>>> import svdRec
>>> U, Sigma, VT = la.svd(mat(svdRec.loadExData2()))
>>> Sigma
array([ 15.77075346,  11.40670395,  11.03044558,   4.84639758,
         3.09292055,   2.58097379,   1.00413543,   0.72817072,
         0.43800353,   0.22082113,   0.07367823])
~~~
How do we decide r? One quantitative approach is to keep just enough singular values to capture 90% of the energy. As in PCA, the singular values are the square roots of the eigenvalues of Data × Dataᵀ, so the total energy is the sum of the squared singular values:
~~~
>>> Sig2=Sigma**2
>>> sum(Sig2)
541.99999999999932
~~~
The 90% threshold is 0.9 × 542 ≈ 487.8. The first two squared singular values alone fall short, but the first three exceed it:
~~~
>>> sum(Sig2[:3])
500.50028912757909
~~~
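The same threshold test can be done programmatically. A sketch that recomputes the SVD with numpy rather than reusing the session above:

```python
import numpy as np

Data = np.array([[0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 5],
                 [0, 0, 0, 3, 0, 4, 0, 0, 0, 0, 3],
                 [0, 0, 0, 0, 4, 0, 0, 1, 0, 4, 0],
                 [3, 3, 4, 0, 0, 0, 0, 2, 2, 0, 0],
                 [5, 4, 5, 0, 0, 0, 0, 5, 5, 0, 0],
                 [0, 0, 0, 0, 5, 0, 1, 0, 0, 5, 0],
                 [4, 3, 4, 0, 0, 0, 0, 5, 5, 0, 1],
                 [0, 0, 0, 4, 0, 4, 0, 0, 0, 0, 4],
                 [0, 0, 0, 2, 0, 2, 5, 0, 0, 1, 2],
                 [0, 0, 0, 0, 5, 0, 0, 0, 0, 4, 0],
                 [1, 0, 0, 0, 0, 0, 0, 1, 2, 0, 0]])

Sigma = np.linalg.svd(Data, compute_uv=False)
energy = Sigma**2

# Smallest r whose cumulative energy reaches 90% of the total.
r = int(np.searchsorted(np.cumsum(energy), 0.9*energy.sum()) + 1)
print(r)  # 3
```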
So r = 3 already clears the 90% bar; the svdEst() code keeps four singular values, which is also safely above the threshold. The key benefit of the SVD step is dimensionality reduction: each item's representation shrinks from the m-dimensional user space down to just 4 dimensions.
~~~
def svdEst(dataMat, user, simMeas, item):
    n = shape(dataMat)[1]
    simTotal = 0.0; ratSimTotal = 0.0
    U, Sigma, VT = la.svd(dataMat)
    Sig4 = mat(eye(4)*Sigma[:4])      # arrange Sig4 into a diagonal matrix
    xformedItems = dataMat.T * U[:, :4] * Sig4.I  # create transformed items
    for j in range(n):
        userRating = dataMat[user, j]
        if userRating == 0 or j == item: continue
        similarity = simMeas(xformedItems[item, :].T,
                             xformedItems[j, :].T)
        print('the %d and %d similarity is: %f' % (item, j, similarity))
        simTotal += similarity
        ratSimTotal += similarity*userRating
    if simTotal == 0: return 0
    else: return ratSimTotal/simTotal
~~~
The key step is dataMat.T * U[:, :4] * Sig4.I: it maps the m×n dataMat into an n×4 matrix whose rows are the items expressed in the 4-dimensional concept space, rescaled by the singular values.
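A shape check of this transformation, as an array-based sketch (np.linalg.inv stands in for the matrix .I attribute used in the listing):

```python
import numpy as np

Data = np.array([[0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 5],
                 [0, 0, 0, 3, 0, 4, 0, 0, 0, 0, 3],
                 [0, 0, 0, 0, 4, 0, 0, 1, 0, 4, 0],
                 [3, 3, 4, 0, 0, 0, 0, 2, 2, 0, 0],
                 [5, 4, 5, 0, 0, 0, 0, 5, 5, 0, 0],
                 [0, 0, 0, 0, 5, 0, 1, 0, 0, 5, 0],
                 [4, 3, 4, 0, 0, 0, 0, 5, 5, 0, 1],
                 [0, 0, 0, 4, 0, 4, 0, 0, 0, 0, 4],
                 [0, 0, 0, 2, 0, 2, 5, 0, 0, 1, 2],
                 [0, 0, 0, 0, 5, 0, 0, 0, 0, 4, 0],
                 [1, 0, 0, 0, 0, 0, 0, 1, 2, 0, 0]])

U, Sigma, VT = np.linalg.svd(Data)
Sig4 = np.diag(Sigma[:4])

# (n x m) @ (m x 4) @ (4 x 4): each of the n items becomes
# a 4-dimensional row vector in the concept space.
xformedItems = Data.T @ U[:, :4] @ np.linalg.inv(Sig4)
print(xformedItems.shape)  # (11, 4)
```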
~~~
>>> myMat = mat(svdRec.loadExData2())
>>> myMat
matrix([[0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 5],
        [0, 0, 0, 3, 0, 4, 0, 0, 0, 0, 3],
        [0, 0, 0, 0, 4, 0, 0, 1, 0, 4, 0],
        [3, 3, 4, 0, 0, 0, 0, 2, 2, 0, 0],
        [5, 4, 5, 0, 0, 0, 0, 5, 5, 0, 0],
        [0, 0, 0, 0, 5, 0, 1, 0, 0, 5, 0],
        [4, 3, 4, 0, 0, 0, 0, 5, 5, 0, 1],
        [0, 0, 0, 4, 0, 4, 0, 0, 0, 0, 4],
        [0, 0, 0, 2, 0, 2, 5, 0, 0, 1, 2],
        [0, 0, 0, 0, 5, 0, 0, 0, 0, 4, 0],
        [1, 0, 0, 0, 0, 0, 0, 1, 2, 0, 0]])
>>> svdRec.recommend(myMat,1,estMethod=svdRec.svdEst)
the 0 and 3 similarity is: 0.490950
the 0 and 5 similarity is: 0.484274
the 0 and 10 similarity is: 0.512755
the 1 and 3 similarity is: 0.491294
the 1 and 5 similarity is: 0.481516
the 1 and 10 similarity is: 0.509709
the 2 and 3 similarity is: 0.491573
the 2 and 5 similarity is: 0.482346
the 2 and 10 similarity is: 0.510584
the 4 and 3 similarity is: 0.450495
the 4 and 5 similarity is: 0.506795
the 4 and 10 similarity is: 0.512896
the 6 and 3 similarity is: 0.743699
the 6 and 5 similarity is: 0.468366
the 6 and 10 similarity is: 0.439465
the 7 and 3 similarity is: 0.482175
the 7 and 5 similarity is: 0.494716
the 7 and 10 similarity is: 0.524970
the 8 and 3 similarity is: 0.491307
the 8 and 5 similarity is: 0.491228
the 8 and 10 similarity is: 0.520290
the 9 and 3 similarity is: 0.522379
the 9 and 5 similarity is: 0.496130
the 9 and 10 similarity is: 0.493617
[(4, 3.3447149384692283), (7, 3.3294020724526967), (9, 3.328100876390069)]
~~~