Reference: Sarwar B M. Item-based collaborative filtering recommendation algorithms[C]// International Conference on World Wide Web. ACM, 2001.
Background: this paper is a classic and required reading in the recommendation field. This post records the paper's main ideas together with an implementation in code.
Premises and assumptions
A user's rating of an item reflects, to some degree, that user's preference for the item.
A user's past preferences are likely to indicate or reflect that user's future interests.
Dataset
We use the MovieLens 100K Dataset: 100,000 ratings from 1000 users on 1700 movies.
Download: MovieLens dataset
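Algorithm theory
Algorithm framework: as shown in the figure, the input is the user-item rating matrix, which is very sparse. The algorithm's task is to predict the rating a specific user would give a specific item, filling in the empty cells of the matrix, and then to make top-N recommendations for that user by ranking the predicted ratings from high to low.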
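To make the sparse user-item matrix concrete, here is a minimal sketch that stores only the observed cells in a nested dictionary and measures how empty the matrix is. The toy triples and the helper names `build_matrix` and `sparsity` are hypothetical and are not taken from the paper or from MovieLens.

```python
# Hypothetical toy ratings as (user, item, rating) triples; not MovieLens data.
ratings = [
    ("u1", "i1", 5.0),
    ("u1", "i2", 3.0),
    ("u2", "i1", 4.0),
    ("u3", "i3", 2.0),
]


def build_matrix(triples):
    """Return {user: {item: rating}}, storing only the observed cells."""
    matrix = {}
    for user, item, rating in triples:
        matrix.setdefault(user, {})[item] = rating
    return matrix


def sparsity(matrix):
    """Fraction of user-item cells that have no rating."""
    items = {item for row in matrix.values() for item in row}
    filled = sum(len(row) for row in matrix.values())
    return 1.0 - filled / (len(matrix) * len(items))


matrix = build_matrix(ratings)
print(matrix["u1"])      # ratings given by user u1
print(sparsity(matrix))  # share of empty cells in the 3 x 3 toy matrix
```

A nested dictionary keeps memory proportional to the number of ratings rather than to users times items, which matters when most cells are empty.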
Prediction: the algorithm assumes that a user who likes an item is also likely to be interested in items similar to it. Prediction therefore takes two steps: compute the similarities between items, then predict ratings from those similarities.
The paper gives three similarity formulas:
Cosine-based Similarity
$$ sim(i,j) = \cos(\vec{i},\vec{j}) = \frac{\vec{i}\cdot\vec{j}}{\left\|\vec{i}\right\|_{2} * \left\|\vec{j}\right\|_{2}} $$
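The cosine formula can be sketched in a few lines. This is a minimal illustration, assuming each item is represented as a `{user: rating}` dictionary and that the vectors are restricted to users who rated both items; the function name `cosine_sim` and the sample data are hypothetical.

```python
import math


def cosine_sim(ratings_i, ratings_j):
    """Cosine similarity of two items over the users who rated both.

    ratings_i and ratings_j map user id -> rating.
    Returns None when the similarity is undefined.
    """
    common = set(ratings_i) & set(ratings_j)
    if not common:
        return None
    dot = sum(ratings_i[u] * ratings_j[u] for u in common)
    norm_i = math.sqrt(sum(ratings_i[u] ** 2 for u in common))
    norm_j = math.sqrt(sum(ratings_j[u] ** 2 for u in common))
    if norm_i == 0 or norm_j == 0:
        return None
    return dot / (norm_i * norm_j)


# Hypothetical data: item_b's co-rated ratings are exactly half of item_a's.
item_a = {"u1": 4.0, "u2": 2.0}
item_b = {"u1": 2.0, "u2": 1.0, "u3": 5.0}
print(cosine_sim(item_a, item_b))
```

Because cosine looks only at direction, the two items above come out as maximally similar even though one is rated consistently lower than the other.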
Correlation-based Similarity
$$ sim(i,j) = \frac{\sum_{u\in U}(R_{u,i}-\bar{R}_{i})(R_{u,j}-\bar{R}_{j})}{\sqrt{\sum_{u\in U}(R_{u,i}-\bar{R}_{i})^{2}}\sqrt{\sum_{u\in U}(R_{u,j}-\bar{R}_{j})^{2}}} $$
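A minimal sketch of the correlation-based formula follows. It assumes the item means are taken over the co-rating users U only, which is one common reading of the formula; the function name `pearson_sim` and the sample ratings are hypothetical.

```python
import math


def pearson_sim(ratings_i, ratings_j):
    """Pearson correlation of two items over the users who rated both.

    ratings_i and ratings_j map user id -> rating.
    Assumption: each item mean is computed over the co-rating users only.
    """
    common = sorted(set(ratings_i) & set(ratings_j))
    if len(common) < 2:
        return None
    mean_i = sum(ratings_i[u] for u in common) / len(common)
    mean_j = sum(ratings_j[u] for u in common) / len(common)
    top = sum((ratings_i[u] - mean_i) * (ratings_j[u] - mean_j) for u in common)
    bot_i = sum((ratings_i[u] - mean_i) ** 2 for u in common)
    bot_j = sum((ratings_j[u] - mean_j) ** 2 for u in common)
    if bot_i == 0 or bot_j == 0:
        return None  # one item has constant ratings; correlation is undefined
    return top / (math.sqrt(bot_i) * math.sqrt(bot_j))


# Hypothetical ratings: item_y rises with item_x, item_z falls as item_x rises.
item_x = {"u1": 1.0, "u2": 2.0, "u3": 3.0}
item_y = {"u1": 2.0, "u2": 4.0, "u3": 6.0}
item_z = {"u1": 6.0, "u2": 4.0, "u3": 2.0}
print(pearson_sim(item_x, item_y), pearson_sim(item_x, item_z))
```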
Adjusted Cosine Similarity
$$ sim(i,j) = \frac{\sum_{u\in U}(R_{u,i}-\bar{R}_{u})(R_{u,j}-\bar{R}_{u})}{\sqrt{\sum_{u\in U}(R_{u,i}-\bar{R}_{u})^{2}}\sqrt{\sum_{u\in U}(R_{u,j}-\bar{R}_{u})^{2}}} $$
All of these similarity formulas must, however, be computed over co-rated entries only, that is, over the historical ratings of users who rated both i and j.
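The co-rating restriction and the adjusted cosine formula can be sketched together. This is an illustration under two assumptions: ratings are stored as `{user: {item: rating}}`, and each user's mean is taken over all of that user's ratings. The name `adjusted_cosine` and the toy data are hypothetical.

```python
import math


def adjusted_cosine(user_ratings, item_i, item_j):
    """Adjusted cosine similarity between item_i and item_j.

    user_ratings maps user id -> {item id: rating}. Only users who rated
    both items contribute, and each rating is centered on that user's mean.
    """
    top = bot_i = bot_j = 0.0
    for ratings in user_ratings.values():
        if item_i not in ratings or item_j not in ratings:
            continue  # skip users who did not rate both items
        mean = sum(ratings.values()) / len(ratings)
        dev_i = ratings[item_i] - mean
        dev_j = ratings[item_j] - mean
        top += dev_i * dev_j
        bot_i += dev_i * dev_i
        bot_j += dev_j * dev_j
    if bot_i == 0 or bot_j == 0:
        return None  # no co-raters, or no variation around the user means
    return top / (math.sqrt(bot_i) * math.sqrt(bot_j))


# Hypothetical toy data: both users have a mean rating of 3.
toy = {
    "u1": {"a": 5.0, "b": 3.0, "c": 1.0},
    "u2": {"a": 4.0, "b": 2.0, "c": 3.0},
}
print(adjusted_cosine(toy, "a", "b"))
```

Subtracting the user mean removes the effect of users who rate everything high or everything low, which plain cosine does not account for.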
The algorithm takes the N items most similar to the target item as the basis for prediction. The prediction formula is:
$$ P_{u,i} = \frac{\sum_{\text{all similar items},N}(S_{i,N} * R_{u,N})}{\sum_{\text{all similar items},N}(\left|S_{i,N}\right|)} $$
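As a worked instance of the weighted-sum formula, the sketch below predicts one rating from a handful of neighbors. It assumes the neighbor similarities have already been computed and that neighbors the user has not rated are left out of both sums; the name `predict_rating` and the numbers are hypothetical.

```python
def predict_rating(user_ratings, neighbor_sims):
    """Weighted-sum prediction P_{u,i}.

    user_ratings:  {item id: rating} for the target user.
    neighbor_sims: {item id: similarity to the target item} for the top-N items.
    Returns None when the user rated none of the neighbors.
    """
    top = 0.0
    bot = 0.0
    for item, sim in neighbor_sims.items():
        if item in user_ratings:
            top += sim * user_ratings[item]
            bot += abs(sim)
    return top / bot if bot else None


# Hypothetical example: neighbor "c" is ignored because the user never rated it.
user = {"a": 4.0, "b": 2.0}
sims = {"a": 0.5, "b": 0.5, "c": 0.9}
print(predict_rating(user, sims))  # (0.5*4 + 0.5*2) / (0.5 + 0.5)
```

Dividing by the sum of absolute similarities keeps the prediction within the range of the user's own ratings when the similarities are non-negative.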
As the final step, the algorithm recommends items in descending order of predicted rating.
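The ranking step can be sketched with the standard library. This assumes predicted ratings are already available as an `{item: score}` dictionary and that items the user has already rated should be excluded; `top_n` and the sample scores are hypothetical.

```python
import heapq


def top_n(predictions, seen, n):
    """Return the n unseen items with the highest predicted rating."""
    candidates = (
        (item, score) for item, score in predictions.items() if item not in seen
    )
    return heapq.nlargest(n, candidates, key=lambda pair: pair[1])


# Hypothetical predicted ratings for one user who has already rated item "c".
predicted = {"a": 4.5, "b": 3.0, "c": 4.9, "d": 2.0}
print(top_n(predicted, seen={"c"}, n=2))
```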
Evaluation metric
The paper measures error with MAE (mean absolute error), defined as:
$$ MAE = \frac{\sum_{i=1}^{N}\left|p_{i}-q_{i}\right|}{N} $$
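A small numeric example of the MAE formula, where p are predicted ratings and q the true ratings. The function name and the three sample pairs are hypothetical.

```python
def mean_absolute_error(predicted, actual):
    """MAE over paired predicted and true ratings."""
    if not predicted or len(predicted) != len(actual):
        raise ValueError("need two non-empty sequences of equal length")
    return sum(abs(p - q) for p, q in zip(predicted, actual)) / len(predicted)


# Hypothetical predictions and true ratings: errors are 0.5, 0.0 and 1.0.
p = [3.5, 4.0, 2.0]
q = [4.0, 4.0, 1.0]
print(mean_absolute_error(p, q))
```

A lower MAE means the predicted ratings sit closer, on average, to the ratings users actually gave.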
Python code
#!/usr/bin/python
# -*- coding: utf-8 -*-
import math
import operator


# Load the data
def loadData():
    # trainSet format (testSet uses the same format):
    # {
    #     userid: {
    #         itemid1: rating,
    #         itemid2: rating
    #     }
    # }
    # movieUser format: all users who rated a given movie
    # {
    #     itemid: {
    #         userid1: rating,
    #         userid2: rating
    #     }
    # }
    trainSet = {}
    testSet = {}
    movieUser = {}
    TrainFile = "./dataset/u1.base"  # training set
    TestFile = "./dataset/u1.test"   # test set
    # Read the training set (fields are whitespace-separated)
    with open(TrainFile, "r") as f:
        for line in f:
            arr = line.strip().split()
            userId = arr[0]
            itemId = arr[1]
            rating = arr[2]
            trainSet.setdefault(userId, {})
            trainSet[userId].setdefault(itemId, float(rating))
            movieUser.setdefault(itemId, {})
            movieUser[itemId].setdefault(userId, float(rating))
    # Read the test set
    with open(TestFile, "r") as f1:
        for line1 in f1:
            arr1 = line1.strip().split()
            userId1 = arr1[0]
            itemId1 = arr1[1]
            rating1 = arr1[2]
            testSet.setdefault(userId1, {})
            testSet[userId1].setdefault(itemId1, float(rating1))
    arr = [trainSet, movieUser]
    return arr


# Build the set of users who rated both movies
def i_j_users(i_id, j_id, movieUser):
    # Returned format:
    # {
    #     (i_id, j_id): {userid1: None, userid2: None, ...}
    # }
    i_users = movieUser.get(i_id, {})
    j_users = movieUser.get(j_id, {})
    inter = dict.fromkeys([x for x in i_users if x in j_users])
    return {(i_id, j_id): inter}


# Compute one user's average rating
def getAverageRating(trainSet, userid):
    average = (sum(trainSet[userid].values()) * 1.0) / len(trainSet[userid].keys())
    return average


# Compute item similarity (adjusted cosine)
def getItemSim(i_j_users, i_id, j_id, trainSet):
    # Numerator: sumtop
    # Denominator: sumbot1, sumbot2
    sumtop = 0
    sumbot1 = 0
    sumbot2 = 0
    ij_users = i_j_users[(i_id, j_id)]
    if not ij_users:
        ij_sim = -9999  # Open question: should this be 0 or None instead?
    else:
        for user in ij_users.keys():
            avr_user = getAverageRating(trainSet, user)
            # Numerator
            left = trainSet[user][i_id] - avr_user
            right = trainSet[user][j_id] - avr_user
            sumtop += left * right
            # Denominator
            sumbot1 += left * left
            sumbot2 += right * right
        if sumbot1 == 0 or sumbot2 == 0:
            ij_sim = 1
        else:
            ij_sim = sumtop * 1.0 / (math.sqrt(sumbot1) * math.sqrt(sumbot2))
    return ij_sim


# Compute the similarity between item i and every other item, then sort
# The returned dict has the format:
# {
#     j_id1: s1,
#     j_id2: s2
# }
def i_allitem_sort(i_id, movieUser, trainSet, N):
    i_allitem = {}
    for j in movieUser.keys():
        if j != i_id:
            i_j_user = i_j_users(i_id, j, movieUser)
            s = getItemSim(i_j_user, i_id, j, trainSet)
            i_allitem.setdefault(j, s)
    # Keep the N most similar items
    i_allitem_sort1 = sorted(i_allitem.items(), key=operator.itemgetter(1), reverse=True)[0:N]
    i_allitem_sort_dict = {}
    for j1, s in i_allitem_sort1:
        i_allitem_sort_dict.setdefault(j1, s)
    return i_allitem_sort_dict


# Predict a rating
def prediction(userid, itemid, movieUser, trainSet, N):
    # The parameter was misspelled "moviUser" while the body used "movieUser",
    # so the function silently relied on a global; the names now match.
    predict = 0
    sumtop = 0
    sumbot = 0
    nsets = i_allitem_sort(itemid, movieUser, trainSet, N)
    user_ratings = trainSet.get(userid, {})
    for j in nsets.keys():
        # Neighbors j of item i that the user has not rated contribute nothing
        if j not in user_ratings:
            ruj = 0
            mid = 0
        else:
            ruj = user_ratings[j]
            mid = abs(nsets[j])
        sumtop += nsets[j] * ruj
        sumbot += mid
    # Guard against a zero denominator
    if sumbot == 0:
        predict = 0
    else:
        predict = sumtop * 1.0 / sumbot
    return predict


def saveFile(movieUser, trainSet, N):
    # Write one line per test entry: user, item, true rating, predicted rating
    with open("../Collaborative Filtering/dataset/u1.test") as f, \
            open("../Collaborative Filtering/predict", "w") as fw:
        for line in f:
            arr = line.split()
            uid = arr[0]
            item = arr[1]
            rating = float(arr[2])
            predictScore = prediction(uid, item, movieUser, trainSet, N)
            fw.write("\t".join([uid, item, str(rating), str(predictScore)]) + "\n")


# Compute the prediction accuracy
def getMAE():
    s = 0
    counttest = 0  # number of test entries
    with open("../Collaborative Filtering/predict") as f:
        for line in f:
            arr = line.split()
            rating = float(arr[2])
            predictScore = float(arr[3])
            # A prediction of 0 means no rating could be predicted; it adds
            # zero error but is still counted, which lowers the reported MAE.
            if predictScore == 0:
                mid = 0
            else:
                mid = abs(predictScore - rating)
            counttest = counttest + 1
            s = s + mid
    mae = s / counttest
    print(mae)


if __name__ == "__main__":
    N = 30
    arr = loadData()
    trainSet = arr[0]
    movieUser = arr[1]
    saveFile(movieUser, trainSet, N)
    # getMAE()
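Copyright of this article belongs to the author. Please do not repost without permission; if this article violates any rules, you can contact the administrator to have it removed.
When reposting, please cite this article's address: http://systransis.cn/yun/42694.html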