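Preface
The source code and data used in the book are available at http://guidetodatamining.com/
The theory in this book is fairly light, it contains few errors, and it offers plenty of hands-on exercises; writing every piece of code yourself pays off. In short: a good fit for beginners.
Reposting is welcome; please credit the source. Corrections are welcome too.
Full series: https://www.zybuluo.com/hainingwyx/note/559139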
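Item-based collaborative filtering
Explicit ratings: the user states the rating directly, for example with YouTube's thumbs-up and thumbs-down buttons.
Implicit ratings: for example, a user's click trail on a website.
Neighbor-based (user-based) recommenders require a very large number of computations, so they suffer from latency; they also suffer from sparsity. This approach is also called memory-based collaborative filtering, because all ratings must be kept in order to make a recommendation.
Item-based filtering: find the most similar items ahead of time and combine them with the user's ratings to generate recommendations. This is also called model-based collaborative filtering, because the full set of ratings does not need to be kept; instead, a model representing the similarity between items is built.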
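To offset grade inflation (users who rate everything high), use the adjusted cosine similarity:
$$s(i,j) = \frac{\sum_{u \in U} (R_{u,i} - \bar{R}_u)(R_{u,j} - \bar{R}_u)}{\sqrt{\sum_{u \in U} (R_{u,i} - \bar{R}_u)^2}\,\sqrt{\sum_{u \in U} (R_{u,j} - \bar{R}_u)^2}}$$
$U$ is the set of all users who have rated both $i$ and $j$,
$R_{u,i}$ is user $u$'s rating of item $i$,
$\bar{R}_u$ is the average of user $u$'s ratings over all the items that user rated. Computing $s(i,j)$ for every pair of items yields the similarity matrix.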
from math import sqrt

users3 = {"David": {"Imagine Dragons": 3, "Daft Punk": 5,
                    "Lorde": 4, "Fall Out Boy": 1},
          "Matt": {"Imagine Dragons": 3, "Daft Punk": 4,
                   "Lorde": 4, "Fall Out Boy": 1},
          "Ben": {"Kacey Musgraves": 4, "Imagine Dragons": 3,
                  "Lorde": 3, "Fall Out Boy": 1},
          "Chris": {"Kacey Musgraves": 4, "Imagine Dragons": 4,
                    "Daft Punk": 4, "Lorde": 3, "Fall Out Boy": 1},
          "Tori": {"Kacey Musgraves": 5, "Imagine Dragons": 4,
                   "Daft Punk": 5, "Fall Out Boy": 3}}

def computeSimilarity(band1, band2, userRatings):
    # average rating of each user over everything that user rated
    averages = {}
    for (key, ratings) in userRatings.items():
        averages[key] = (float(sum(ratings.values()))
                         / len(ratings.values()))
    num = 0   # numerator
    dem1 = 0  # first half of denominator
    dem2 = 0  # second half of denominator
    for (user, ratings) in userRatings.items():
        # only users who rated both bands contribute
        if band1 in ratings and band2 in ratings:
            avg = averages[user]
            num += (ratings[band1] - avg) * (ratings[band2] - avg)
            dem1 += (ratings[band1] - avg)**2
            dem2 += (ratings[band2] - avg)**2
    return num / (sqrt(dem1) * sqrt(dem2))
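As a rough sketch of how the similarity matrix mentioned above could be built, the block below applies the adjusted cosine formula to every pair of bands in the `users3` ratings. The names `adjusted_cosine` and `similarity_matrix` are illustrative, not from the book, and returning 0.0 when the denominator is zero is an assumption of this sketch.

```python
from itertools import combinations
from math import sqrt

# Same ratings as users3 in the text.
users3 = {"David": {"Imagine Dragons": 3, "Daft Punk": 5,
                    "Lorde": 4, "Fall Out Boy": 1},
          "Matt": {"Imagine Dragons": 3, "Daft Punk": 4,
                   "Lorde": 4, "Fall Out Boy": 1},
          "Ben": {"Kacey Musgraves": 4, "Imagine Dragons": 3,
                  "Lorde": 3, "Fall Out Boy": 1},
          "Chris": {"Kacey Musgraves": 4, "Imagine Dragons": 4,
                    "Daft Punk": 4, "Lorde": 3, "Fall Out Boy": 1},
          "Tori": {"Kacey Musgraves": 5, "Imagine Dragons": 4,
                   "Daft Punk": 5, "Fall Out Boy": 3}}


def adjusted_cosine(band1, band2, user_ratings):
    """Adjusted cosine similarity of two bands (illustrative helper)."""
    num = dem1 = dem2 = 0.0
    for ratings in user_ratings.values():
        if band1 in ratings and band2 in ratings:
            avg = sum(ratings.values()) / len(ratings)
            num += (ratings[band1] - avg) * (ratings[band2] - avg)
            dem1 += (ratings[band1] - avg) ** 2
            dem2 += (ratings[band2] - avg) ** 2
    denominator = sqrt(dem1) * sqrt(dem2)
    # Assumption: a zero denominator is treated as "no similarity".
    return num / denominator if denominator else 0.0


def similarity_matrix(user_ratings):
    """Map every ordered pair of distinct bands to its similarity."""
    bands = sorted({band for r in user_ratings.values() for band in r})
    matrix = {}
    for a, b in combinations(bands, 2):
        s = adjusted_cosine(a, b, user_ratings)
        matrix[(a, b)] = matrix[(b, a)] = s  # the measure is symmetric
    return matrix


matrix = similarity_matrix(users3)
print(round(matrix[("Kacey Musgraves", "Imagine Dragons")], 4))  # 0.526
```

Because the similarities depend only on the stored ratings, a matrix like this can be computed ahead of time and reused for every prediction.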
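Prediction using the similarity matrix:
$$p(u,i) = \frac{\sum_{N \in \mathrm{similarTo}(i)} (S_{i,N} \times R_{u,N})}{\sum_{N \in \mathrm{similarTo}(i)} |S_{i,N}|}$$
$p(u,i)$ is the predicted rating of item $i$ by user $u$.
$N$ ranges over the items rated by user $u$ that are similar to $i$.
$S_{i,N}$ is the similarity between $i$ and $N$.
$R_{u,N}$ is $u$'s rating of $N$. It should lie in $[-1, 1]$, so the raw ratings may need a linear transformation first.
With $Min_R$ and $Max_R$ the lowest and highest values of the rating scale, the transformed rating is
$$NR_{u,N} = \frac{2(R_{u,N} - Min_R) - (Max_R - Min_R)}{Max_R - Min_R}$$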
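The prediction step can be sketched as follows: ratings are mapped onto $[-1, 1]$, combined with the similarities, and the result is mapped back to the original scale by inverting the linear transformation. The function names, the 1 to 5 rating scale, the toy numbers, and returning `None` when no neighbor is usable are all assumptions made for this illustration.

```python
def normalize(rating, min_r=1, max_r=5):
    """Map a rating from [min_r, max_r] onto [-1, 1]."""
    return (2 * (rating - min_r) - (max_r - min_r)) / (max_r - min_r)


def denormalize(norm_rating, min_r=1, max_r=5):
    """Inverse of normalize: map [-1, 1] back onto [min_r, max_r]."""
    return 0.5 * (norm_rating + 1) * (max_r - min_r) + min_r


def predict(user_ratings, similarities, min_r=1, max_r=5):
    """Similarity-weighted prediction p(u, i) on the original scale.

    user_ratings: {item N: rating given by user u}
    similarities: {item N: S_{i,N}} for the target item i
    """
    num = den = 0.0
    for item, rating in user_ratings.items():
        if item in similarities:
            s = similarities[item]
            num += s * normalize(rating, min_r, max_r)
            den += abs(s)
    if den == 0:
        return None  # assumption: no usable neighbors, so no prediction
    return denormalize(num / den, min_r, max_r)


# Toy numbers chosen only to keep the arithmetic easy to follow.
ratings_u = {"A": 5, "B": 3}
sims_to_target = {"A": 0.5, "B": 0.5}
prediction = predict(ratings_u, sims_to_target)
print(prediction)  # 4.0
```

Here the ratings 5 and 3 normalize to 1 and 0, the weighted average is 0.5, and 0.5 maps back to 4.0 on the 1 to 5 scale.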
Slope One algorithm
- Computing the deviations
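The average deviation of item $i$ with respect to item $j$ is
$$dev_{i,j} = \sum_{u \in S_{i,j}(X)} \frac{u_i - u_j}{card(S_{i,j}(X))}$$
$card(S)$ is the number of elements in the set $S$, and $X$ is the entire set of ratings.
$S_{i,j}(X)$ is the set of all users who have rated both $i$ and $j$, and $u_i$ is user $u$'s rating of item $i$.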
def computeDeviations(self):
    # for each person in the data, get their ratings
    for ratings in self.data.values():  # ratings: {item: rating, ...}
        # for each item & rating in that set of ratings:
        for (item, rating) in ratings.items():
            self.frequencies.setdefault(item, {})  # keyed by item
            self.deviations.setdefault(item, {})
            # for each item2 & rating2 in that set of ratings:
            for (item2, rating2) in ratings.items():
                if item != item2:
                    # add the difference between the ratings to our
                    # computation
                    self.frequencies[item].setdefault(item2, 0)
                    self.deviations[item].setdefault(item2, 0.0)
                    # frequencies[item][item2] is card(S_{item,item2}(X))
                    self.frequencies[item][item2] += 1
                    # deviations[item][item2] accumulates the rating
                    # differences over all users who rated both items
                    self.deviations[item][item2] += rating - rating2
    # divide each sum by its count to get the average deviation
    for (item, ratings) in self.deviations.items():
        for item2 in ratings:
            ratings[item2] /= self.frequencies[item][item2]

# test code for computeDeviations(self)
#r = recommender(users2)
#r.computeDeviations()
#r.deviations
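To make the deviation formula concrete, here is a minimal standalone sketch that computes $dev_{i,j}$ and $card(S_{i,j}(X))$ for a small rating set. The function name `compute_deviations` and the data set `toy_ratings` are illustrative assumptions; the method in the text stores its results on `self` instead of returning them.

```python
def compute_deviations(data):
    """Return (deviations, frequencies) for {user: {item: rating}} data."""
    deviations, frequencies = {}, {}
    for ratings in data.values():
        for item, rating in ratings.items():
            for item2, rating2 in ratings.items():
                if item == item2:
                    continue
                dev_row = deviations.setdefault(item, {})
                freq_row = frequencies.setdefault(item, {})
                # running sum of (u_i - u_j) and count of co-raters
                dev_row[item2] = dev_row.get(item2, 0.0) + rating - rating2
                freq_row[item2] = freq_row.get(item2, 0) + 1
    # turn each sum into an average
    for item, row in deviations.items():
        for item2 in row:
            row[item2] /= frequencies[item][item2]
    return deviations, frequencies


# Small illustrative rating set.
toy_ratings = {
    "Amy": {"Taylor Swift": 4, "PSY": 3, "Whitney Houston": 4},
    "Ben": {"Taylor Swift": 5, "PSY": 2},
    "Clara": {"PSY": 3.5, "Whitney Houston": 4},
    "Daisy": {"Taylor Swift": 5, "Whitney Houston": 3},
}
deviations, frequencies = compute_deviations(toy_ratings)
print(deviations["Taylor Swift"]["PSY"])     # 2.0
print(deviations["PSY"]["Whitney Houston"])  # -0.75
```

Amy and Ben both rated Taylor Swift and PSY, with differences 1 and 3, so the average deviation is 2.0 with a frequency of 2. Note that $dev_{j,i} = -dev_{i,j}$.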
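- Weighted Slope One prediction
$$P^{wS1}(u)_j = \frac{\sum_{i \in S(u) - \{j\}} (dev_{j,i} + u_i)\, c_{j,i}}{\sum_{i \in S(u) - \{j\}} c_{j,i}}$$
$P^{wS1}(u)_j$ is the weighted Slope One prediction of user $u$'s rating of item $j$, where $c_{j,i} = card(S_{j,i}(X))$ and $S(u)$ is the set of items rated by $u$.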
def slopeOneRecommendations(self, userRatings):
    recommendations = {}
    frequencies = {}
    # for every item and rating in the user's ratings
    for (userItem, userRating) in userRatings.items():  # userItem: i
        # for every item in our dataset that the user didn't rate
        for (diffItem, diffRatings) in self.deviations.items():  # diffItem: j
            if diffItem not in userRatings and \
               userItem in self.deviations[diffItem]:
                freq = self.frequencies[diffItem][userItem]  # freq: c_ji
                # setdefault adds the key with the default value
                # if the key is not in the dictionary yet
                recommendations.setdefault(diffItem, 0.0)
                frequencies.setdefault(diffItem, 0)
                # add to the running sum representing the numerator
                # of the formula
                recommendations[diffItem] += (diffRatings[userItem] +
                                              userRating) * freq
                # keep a running sum of the frequency of diffItem
                frequencies[diffItem] += freq
    # list of (item name, predicted rating P(u)_j) pairs
    recommendations = [(self.convertProductID2name(k),
                        v / frequencies[k])
                       for (k, v) in recommendations.items()]
    # finally sort and return
    recommendations.sort(key=lambda artistTuple: artistTuple[1],
                         reverse=True)
    # I am only going to return the first 50 recommendations
    return recommendations[:50]

# test code for slopeOneRecommendations
#r = recommender(users2)
#r.computeDeviations()
#r.slopeOneRecommendations(g) needs the user's ratings first:
#g = users2['Ben']
#r.slopeOneRecommendations(g)
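The weighted Slope One formula can also be traced by hand on a tiny example. The sketch below is a standalone function, not the book's class method; `weighted_slope_one` is an illustrative name, and the deviation and frequency tables are assumed to have been precomputed by a deviation step like the one in the text (the numbers here are illustrative).

```python
def weighted_slope_one(user_ratings, deviations, frequencies):
    """Predict ratings for items the user has not rated yet."""
    numerators, denominators = {}, {}
    for user_item, user_rating in user_ratings.items():      # item i
        for diff_item, diff_row in deviations.items():       # item j
            if diff_item not in user_ratings and user_item in diff_row:
                freq = frequencies[diff_item][user_item]     # c_{j,i}
                numerators[diff_item] = (
                    numerators.get(diff_item, 0.0)
                    + (diff_row[user_item] + user_rating) * freq)
                denominators[diff_item] = (
                    denominators.get(diff_item, 0) + freq)
    predictions = [(item, numerators[item] / denominators[item])
                   for item in numerators]
    # highest predicted rating first
    return sorted(predictions, key=lambda pair: pair[1], reverse=True)


# Assumed precomputed tables: deviations[j][i] = dev_{j,i},
# frequencies[j][i] = c_{j,i}.
deviations = {
    "Taylor Swift": {"PSY": 2.0, "Whitney Houston": 1.0},
    "PSY": {"Taylor Swift": -2.0, "Whitney Houston": -0.75},
    "Whitney Houston": {"Taylor Swift": -1.0, "PSY": 0.75},
}
frequencies = {
    "Taylor Swift": {"PSY": 2, "Whitney Houston": 2},
    "PSY": {"Taylor Swift": 2, "Whitney Houston": 2},
    "Whitney Houston": {"Taylor Swift": 2, "PSY": 2},
}
ben = {"Taylor Swift": 5, "PSY": 2}
result = weighted_slope_one(ben, deviations, frequencies)
print(result)  # [('Whitney Houston', 3.375)]
```

For Whitney Houston the numerator is $(-1 + 5) \times 2 + (0.75 + 2) \times 2 = 13.5$ and the denominator is $2 + 2 = 4$, which gives 3.375.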
import codecs  # needed at module level for codecs.open below

def loadMovieLens(self, path=''):
    self.data = {}
    i = 0  # total number of lines read across the three files
    #
    # First load movie ratings into self.data
    #
    #f = codecs.open(path + "u.data", 'r', 'utf8')
    f = codecs.open(path + "u.data", 'r', 'ascii')
    # f = open(path + "u.data")
    for line in f:
        i += 1
        # separate line into fields
        fields = line.split('\t')
        user = fields[0]
        movie = fields[1]
        rating = int(fields[2].strip().strip('"'))
        if user in self.data:
            currentRatings = self.data[user]
        else:
            currentRatings = {}
        currentRatings[movie] = rating
        self.data[user] = currentRatings
    f.close()
    #
    # Now load movies into self.productid2name
    # the file u.item contains movie id, title, release date among
    # other fields
    #
    #f = codecs.open(path + "u.item", 'r', 'utf8')
    f = codecs.open(path + "u.item", 'r', 'iso8859-1', 'ignore')
    #f = open(path + "u.item")
    for line in f:
        i += 1
        # separate line into fields
        fields = line.split('|')
        mid = fields[0].strip()
        title = fields[1].strip()
        self.productid2name[mid] = title
    f.close()
    #
    # Now load user info into both self.userid2name
    # and self.username2id
    #
    #f = codecs.open(path + "u.user", 'r', 'utf8')
    f = open(path + "u.user")
    for line in f:
        i += 1
        fields = line.split('|')
        userid = fields[0].strip('"')
        # the whole line is stored as the user's "name"
        self.userid2name[userid] = line
        self.username2id[line] = userid
    f.close()
    print(i)

# test code
#r = recommender(0)
#r.loadMovieLens('ml-100k/')
#r.computeDeviations()
#r.slopeOneRecommendations(r.data['1'])
#r.slopeOneRecommendations(r.data['25'])
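The rating-loading step above boils down to splitting each line on tabs and nesting the result by user. A minimal sketch of just that step, run on an in-memory sample instead of the real `u.data` file, might look like this. The function name `parse_ratings` and the sample lines are made up for illustration, and the fourth column is assumed to be a timestamp, which the loader ignores anyway.

```python
import io


def parse_ratings(lines):
    """Build {user: {movie: rating}} from tab-separated rating lines."""
    data = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        user, movie, rating = fields[0], fields[1], int(fields[2])
        data.setdefault(user, {})[movie] = rating
    return data


# Made-up sample in a user / movie / rating / timestamp layout.
sample = io.StringIO(
    "1\t10\t4\t881250949\n"
    "1\t20\t3\t881250950\n"
    "2\t10\t5\t881250951\n"
)
data = parse_ratings(sample)
print(data)  # {'1': {'10': 4, '20': 3}, '2': {'10': 5}}
```

Keeping user and movie ids as strings matches the loader in the text, which is why the test code looks up `r.data['1']` rather than `r.data[1]`.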