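Collaborative filtering in a nutshell: people with similar interests tend to like similar things, and items with similar attributes can be recommended to people who like items of the same kind. For example, if user A and user B both like martial-arts films, the martial-arts films that A liked and B has not seen can be recommended to B, and vice versa. This is user-based collaborative filtering (User-User Collaborative Filtering Recommendation). Likewise, if user A bought *Core Java, Volume I*, the system can recommend *Core Java, Volume II* and *Thinking in Java* to that user. This is item-based collaborative filtering (Item-Item Collaborative Filtering Recommendation).
Amazon's recommendations on the product page for *Core Java, Volume I* work this way.

The rest of this post follows the book *Programming Collective Intelligence* and implements user similarity based on Euclidean distance and on the Pearson correlation, then uses those scores to make recommendations.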
### Dataset: users' movie ratings
| Critic | Lady in the Water | Snakes on a Plane | Just My Luck | Superman Returns | You, Me and Dupree | The Night Listener |
|-----|-----|-----|-----|-----|-----|-----|
| Lisa | 2.5 | 3.5 | 3.0 | 3.5 | 2.5 | 3.0 |
| Gene | 3.0 | 3.5 | 1.5 | 5.0 | 3.5 | 3.0 |
| Michael | 2.5 | 3.0 | - | 3.5 | - | 4.0 |
| Claudia | - | 3.5 | 3.0 | 4.0 | 2.5 | 4.5 |
| Mick | 3.0 | 4.0 | 2.0 | 3.0 | 2.0 | 3.0 |
| Jack | 3.0 | 4.0 | - | 5.0 | 3.5 | 3.0 |
| Toby | - | 4.5 | - | 4.0 | 1.0 | - |
### Building the data dictionary
~~~
critics={'Lisa': {'Lady in the Water': 2.5, 'Snakes on a Plane': 3.5,
'Just My Luck': 3.0, 'Superman Returns': 3.5, 'You, Me and Dupree': 2.5,
'The Night Listener': 3.0},
'Gene': {'Lady in the Water': 3.0, 'Snakes on a Plane': 3.5,
'Just My Luck': 1.5, 'Superman Returns': 5.0, 'The Night Listener': 3.0,
'You, Me and Dupree': 3.5},
'Michael': {'Lady in the Water': 2.5, 'Snakes on a Plane': 3.0,
'Superman Returns': 3.5, 'The Night Listener': 4.0},
'Claudia': {'Snakes on a Plane': 3.5, 'Just My Luck': 3.0,
'The Night Listener': 4.5, 'Superman Returns': 4.0,
'You, Me and Dupree': 2.5},
'Mick': {'Lady in the Water': 3.0, 'Snakes on a Plane': 4.0,
'Just My Luck': 2.0, 'Superman Returns': 3.0, 'The Night Listener': 3.0,
'You, Me and Dupree': 2.0},
'Jack': {'Lady in the Water': 3.0, 'Snakes on a Plane': 4.0,
'The Night Listener': 3.0, 'Superman Returns': 5.0, 'You, Me and Dupree': 3.5},
'Toby': {'Snakes on a Plane':4.5,'You, Me and Dupree':1.0,'Superman Returns':4.0}}
~~~
### Euclidean distance
~~~
from math import sqrt

# Return a distance-based similarity score for person1 and person2
def sim_distance(prefs, person1, person2):
    # Collect the items that both people have rated
    shared = [item for item in prefs[person1] if item in prefs[person2]]
    # No ratings in common: return 0
    if len(shared) == 0:
        return 0
    # Sum the squared rating differences over the shared items
    sum_of_squares = sum([pow(prefs[person1][item] - prefs[person2][item], 2)
                          for item in shared])
    # Map the distance into (0, 1]; identical ratings give 1
    return 1 / (1 + sqrt(sum_of_squares))
~~~
Compute the distance-based similarity between Lisa and Gene:
~~~
print(sim_distance(critics,'Lisa','Gene'))
~~~
Result:
~~~
0.294298055086
~~~
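To make the formula concrete, here is a small sketch that redoes the Lisa and Gene calculation step by step. The variable names are illustrative and not part of recommend.py; the ratings are copied from the `critics` dictionary.

```python
from math import sqrt

# Ratings copied from the critics dictionary (Lisa and Gene rated all six movies).
lisa = {'Lady in the Water': 2.5, 'Snakes on a Plane': 3.5, 'Just My Luck': 3.0,
        'Superman Returns': 3.5, 'You, Me and Dupree': 2.5, 'The Night Listener': 3.0}
gene = {'Lady in the Water': 3.0, 'Snakes on a Plane': 3.5, 'Just My Luck': 1.5,
        'Superman Returns': 5.0, 'You, Me and Dupree': 3.5, 'The Night Listener': 3.0}

# Movies rated by both people
shared = [movie for movie in lisa if movie in gene]

# Squared differences: 0.25 + 0 + 2.25 + 2.25 + 1.0 + 0
sum_of_squares = sum((lisa[m] - gene[m]) ** 2 for m in shared)
print(sum_of_squares)          # 5.75

# Turn the distance into a similarity in (0, 1]
similarity = 1 / (1 + sqrt(sum_of_squares))
print(round(similarity, 4))    # 0.2943
```

The score is 1 when two people give every shared movie the same rating and shrinks toward 0 as the distance grows.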
### Pearson correlation coefficient
~~~
from math import sqrt

# Return the Pearson correlation-based similarity score for p1 and p2
def sim_pearson(prefs, p1, p2):
    # Get the list of mutually rated items
    si = {}
    for item in prefs[p1]:
        if item in prefs[p2]:
            si[item] = 1
    # If there are no ratings in common, return 0
    n = len(si)
    if n == 0:
        return 0
    # Sums of all the preferences
    sum1 = sum([prefs[p1][it] for it in si])
    sum2 = sum([prefs[p2][it] for it in si])
    # Sums of the squares
    sum1Sq = sum([pow(prefs[p1][it], 2) for it in si])
    sum2Sq = sum([pow(prefs[p2][it], 2) for it in si])
    # Sum of the products
    pSum = sum([prefs[p1][it] * prefs[p2][it] for it in si])
    # Calculate r (Pearson score)
    num = pSum - (sum1 * sum2 / n)
    den = sqrt((sum1Sq - pow(sum1, 2) / n) * (sum2Sq - pow(sum2, 2) / n))
    if den == 0:
        return 0
    return num / den
~~~
Compute the Pearson correlation coefficient between Lisa and Gene:
~~~
print(sim_pearson(critics,'Lisa','Gene'))
~~~
Result:
~~~
0.396059017191
~~~
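As a cross-check, the same coefficient can be computed from the mean-centered definition of the Pearson correlation (covariance divided by the product of the standard deviations). The sketch below is an illustration only; the helper name `pearson` and the list layout are assumptions and not part of recommend.py.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length rating lists (mean-centered form)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Co-variation of the two lists around their means
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Variation of each list around its own mean
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    den = sqrt(var_x * var_y)
    return cov / den if den != 0 else 0.0

# Ratings in the column order of the table above (Lisa and Gene rated every movie).
lisa = [2.5, 3.5, 3.0, 3.5, 2.5, 3.0]
gene = [3.0, 3.5, 1.5, 5.0, 3.5, 3.0]
print(round(pearson(lisa, gene), 4))     # 0.3961

# A rater who is consistently two points higher still correlates perfectly.
print(pearson([1, 2, 3], [3, 4, 5]))     # 1.0
```

The second call shows one way the two metrics differ: a constant offset between two raters lowers the distance-based score but leaves the Pearson score at 1.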
### Selecting the top N
~~~
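# Return the best matches for person from the prefs dictionary
# The number of results and the similarity function are optional parameters
def topMatches(prefs, person, n=5, similarity=sim_pearson):
    scores = [(similarity(prefs, person, other), other)
              for other in prefs if other != person]
    # Highest similarity first
    scores.sort(reverse=True)
    return scores[0:n]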
~~~
Return the 4 users whose taste is most similar to Toby's:
~~~
print(topMatches(critics,'Toby',n=4))
~~~
Result:
~~~
yaopans-MacBook-Pro:ucas01 yaopan$ python recommend.py
[(0.9912407071619299, 'Lisa'), (0.9244734516419049, 'Mick'), (0.8934051474415647, 'Claudia'), (0.66284898035987, 'Jack')]
~~~
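Selecting the top N comes down to ordering `(score, name)` tuples from highest to lowest and keeping the first N. As a sketch of an alternative, `heapq.nlargest` from the standard library returns the same entries without sorting the whole list. The scores below are the values from the output above rounded to two decimals, used purely for illustration.

```python
import heapq

# (similarity, name) pairs; rounded, illustrative values
scores = [(0.66, 'Jack'), (0.99, 'Lisa'), (0.89, 'Claudia'), (0.92, 'Mick')]

# Keep the two highest-scoring entries, best first
top2 = heapq.nlargest(2, scores)
print(top2)    # [(0.99, 'Lisa'), (0.92, 'Mick')]

# Same result as the sort-then-slice approach
assert top2 == sorted(scores, reverse=True)[:2]
```

Tuples compare element by element, so ties in the score are broken by the name.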
### User-based recommendations
~~~
def getRecommendations(prefs, person, similarity=sim_pearson):
    totals = {}
    simSums = {}
    for other in prefs:
        # Compare against everyone else; skip the person themselves
        if other == person:
            continue
        sim = similarity(prefs, person, other)
        # Ignore similarity scores of zero or below
        if sim <= 0:
            continue
        for item in prefs[other]:
            # Only score movies the person has not rated yet
            if item not in prefs[person] or prefs[person][item] == 0:
                # Similarity * rating
                totals.setdefault(item, 0)
                totals[item] += prefs[other][item] * sim
                # Sum of similarities
                simSums.setdefault(item, 0)
                simSums[item] += sim
    # Build the normalized list
    rankings = [(total / simSums[item], item) for item, total in totals.items()]
    # Return the list sorted from highest to lowest score
    rankings.sort()
    rankings.reverse()
    return rankings
~~~
Recommendations for Toby:
~~~
print(getRecommendations(critics,'Toby'))
~~~
Recommendation result:
~~~
yaopans-MacBook-Pro:ucas01 yaopan$ python recommend.py
[(3.3477895267131013, 'The Night Listener'), (2.8325499182641614, 'Lady in the Water'), (2.5309807037655645, 'Just My Luck')]
~~~
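Each score in this list is a similarity-weighted average: the ratings of the other critics are weighted by how similar each critic is to Toby, and the total is divided by the sum of those similarities. The sketch below recomputes the score for a single movie, The Night Listener, in isolation. It is an illustration under assumptions: the helper `pearson` and the variable names are not part of recommend.py, and the ratings are copied from the `critics` dictionary.

```python
from math import sqrt

critics = {
    'Lisa': {'Lady in the Water': 2.5, 'Snakes on a Plane': 3.5, 'Just My Luck': 3.0,
             'Superman Returns': 3.5, 'You, Me and Dupree': 2.5, 'The Night Listener': 3.0},
    'Gene': {'Lady in the Water': 3.0, 'Snakes on a Plane': 3.5, 'Just My Luck': 1.5,
             'Superman Returns': 5.0, 'The Night Listener': 3.0, 'You, Me and Dupree': 3.5},
    'Michael': {'Lady in the Water': 2.5, 'Snakes on a Plane': 3.0,
                'Superman Returns': 3.5, 'The Night Listener': 4.0},
    'Claudia': {'Snakes on a Plane': 3.5, 'Just My Luck': 3.0, 'The Night Listener': 4.5,
                'Superman Returns': 4.0, 'You, Me and Dupree': 2.5},
    'Mick': {'Lady in the Water': 3.0, 'Snakes on a Plane': 4.0, 'Just My Luck': 2.0,
             'Superman Returns': 3.0, 'The Night Listener': 3.0, 'You, Me and Dupree': 2.0},
    'Jack': {'Lady in the Water': 3.0, 'Snakes on a Plane': 4.0, 'The Night Listener': 3.0,
             'Superman Returns': 5.0, 'You, Me and Dupree': 3.5},
    'Toby': {'Snakes on a Plane': 4.5, 'You, Me and Dupree': 1.0, 'Superman Returns': 4.0},
}

def pearson(prefs, p1, p2):
    """Pearson correlation over the movies both people rated (mean-centered form)."""
    shared = [m for m in prefs[p1] if m in prefs[p2]]
    n = len(shared)
    if n == 0:
        return 0.0
    mean1 = sum(prefs[p1][m] for m in shared) / n
    mean2 = sum(prefs[p2][m] for m in shared) / n
    cov = sum((prefs[p1][m] - mean1) * (prefs[p2][m] - mean2) for m in shared)
    var1 = sum((prefs[p1][m] - mean1) ** 2 for m in shared)
    var2 = sum((prefs[p2][m] - mean2) ** 2 for m in shared)
    den = sqrt(var1 * var2)
    return cov / den if den != 0 else 0.0

movie = 'The Night Listener'
weighted_total = 0.0   # sum of similarity * rating
similarity_sum = 0.0   # sum of similarities, used to normalize
for other, ratings in critics.items():
    if other == 'Toby' or movie not in ratings:
        continue
    sim = pearson(critics, 'Toby', other)
    if sim <= 0:       # critics with zero or negative similarity are ignored
        continue
    weighted_total += sim * ratings[movie]
    similarity_sum += sim

predicted = weighted_total / similarity_sum
print(round(predicted, 2))   # 3.35
```

Dividing by the sum of similarities keeps a movie from ranking higher just because more critics have rated it.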
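### Item-based recommendations
Item-based recommendation works like user-based recommendation with the roles of items and users swapped.
The transformation function: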
~~~
def transformPrefs(prefs):
    result = {}
    for person in prefs:
        for item in prefs[person]:
            result.setdefault(item, {})
            # Swap item and person
            result[item][person] = prefs[person][item]
    return result
~~~
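A quick way to see what the transformation does is to run it on a tiny dictionary and apply it twice. The sketch below uses an equivalent function under the illustrative name `transform_prefs` and a two-person sample taken from the ratings above.

```python
def transform_prefs(prefs):
    """Turn {person: {item: rating}} into {item: {person: rating}}."""
    result = {}
    for person, ratings in prefs.items():
        for item, rating in ratings.items():
            result.setdefault(item, {})[person] = rating
    return result

sample = {
    'Lisa': {'Superman Returns': 3.5, 'Just My Luck': 3.0},
    'Toby': {'Superman Returns': 4.0},
}

by_movie = transform_prefs(sample)
print(by_movie)
# {'Superman Returns': {'Lisa': 3.5, 'Toby': 4.0}, 'Just My Luck': {'Lisa': 3.0}}

# Transforming twice gives back the original dictionary
assert transform_prefs(by_movie) == sample
```

Because the result has the same shape as the input, the similarity functions and `topMatches` can be reused on it unchanged, with movies playing the role of people.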
Return the movies most similar to Superman Returns:
~~~
movies=transformPrefs(critics)
print("Movies similar to Superman Returns:")
print(topMatches(movies,'Superman Returns'))
~~~
Result:
~~~
[(0.6579516949597695, 'You, Me and Dupree'), (0.4879500364742689, 'Lady in the Water'), (0.11180339887498941, 'Snakes on a Plane'), (-0.1798471947990544, 'The Night Listener'), (-0.42289003161103106, 'Just My Luck')]
~~~
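Movies with a negative score are the least related: critics who rated Superman Returns highly tended to rate those movies low.
Code download: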
[recommend.py](http://download.csdn.net/detail/napoay/9385520)