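The formatting here has not been polished much; the linked OneNote notes can be used as a reference.
Because OneNote has removed single-page sharing, leave your email address if you need the notes and I will send a PDF version by email. I will look for a better solution later.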
Installing the Surprise recommendation library
pip install surprise
Basic usage
● Automatic cross-validation

    from surprise import SVD
    from surprise import Dataset
    from surprise.model_selection import cross_validate

    # Load the movielens-100k dataset (download it if needed).
    data = Dataset.load_builtin("ml-100k")

    # We'll use the famous SVD algorithm.
    algo = SVD()

    # Run 5-fold cross-validation and print results.
    cross_validate(algo, data, measures=["RMSE", "MAE"], cv=5, verbose=True)

The load_builtin method downloads the movielens-100k dataset when it is not found locally (it asks for confirmation first) and stores it under the ~/.surprise_data directory.

● Using a custom dataset

    import os

    from surprise import BaselineOnly
    from surprise import Dataset
    from surprise import Reader
    from surprise.model_selection import cross_validate

    # Path to the dataset file.
    file_path = os.path.expanduser("~/.surprise_data/ml-100k/ml-100k/u.data")

    # As we're loading a custom dataset, we need to define a reader. In the
    # movielens-100k dataset, each line has the following format:
    # "user item rating timestamp", separated by "\t" characters.
    reader = Reader(line_format="user item rating timestamp", sep="\t")

    data = Dataset.load_from_file(file_path, reader=reader)

    # We can now use this dataset as we please, e.g. calling cross_validate.
    cross_validate(BaselineOnly(), data, verbose=True)
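To make the roles of line_format and sep concrete, here is a minimal pure-Python sketch of how one tab-separated line of u.data can be split into named fields. The helper parse_line is hypothetical and exists only for illustration; it is not how Surprise's Reader is implemented.

```python
def parse_line(line, line_format="user item rating timestamp", sep="\t"):
    """Split one ratings line into a dict keyed by the names in line_format.

    Hypothetical helper for illustration only; not part of Surprise.
    """
    names = line_format.split()
    values = line.strip().split(sep)
    if len(values) != len(names):
        raise ValueError("line does not match line_format: %r" % line)
    fields = dict(zip(names, values))
    # The rating is numeric; raw user and item ids stay as strings.
    fields["rating"] = float(fields["rating"])
    return fields


example = parse_line("196\t242\t3\t881250949")
print(example["user"], example["item"], example["rating"])
```

Note that the user and item ids remain strings, which matches how raw ids read from a file behave in Surprise.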
Cross-validation
○ cross_validate(algo, data, measures=[...], cv=...) takes the algorithm, the dataset, the list of accuracy measures to compute, and the number of cross-validation folds.
○ For finer control over each fold, combine the test() method with a KFold iterator; LeaveOneOut and ShuffleSplit can be used in the same way.

    from surprise import SVD
    from surprise import Dataset
    from surprise import accuracy
    from surprise.model_selection import KFold

    # Load the movielens-100k dataset.
    data = Dataset.load_builtin("ml-100k")

    # Define a cross-validation iterator.
    kf = KFold(n_splits=3)

    algo = SVD()

    for trainset, testset in kf.split(data):

        # Train and test the algorithm.
        algo.fit(trainset)
        predictions = algo.test(testset)

        # Compute and print the Root Mean Squared Error.
        accuracy.rmse(predictions, verbose=True)
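For readers unfamiliar with the two measures used above, the following pure-Python sketch shows what RMSE and MAE compute from pairs of true and estimated ratings. The function names and sample values are made up for illustration; in Surprise itself you would call accuracy.rmse and accuracy.mae on a list of predictions.

```python
import math


def rmse(pairs):
    """Root mean squared error over (true_rating, estimated_rating) pairs."""
    return math.sqrt(sum((r - est) ** 2 for r, est in pairs) / len(pairs))


def mae(pairs):
    """Mean absolute error over (true_rating, estimated_rating) pairs."""
    return sum(abs(r - est) for r, est in pairs) / len(pairs)


# Illustrative values only.
pairs = [(4.0, 3.5), (2.0, 3.0), (5.0, 5.0)]
print("RMSE: %.4f" % rmse(pairs))
print("MAE:  %.4f" % mae(pairs))
```

RMSE penalizes large errors more heavily than MAE because each error is squared before averaging.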
Tuning algorithm parameters with GridSearchCV
To compare how an algorithm performs under different parameter values, use the GridSearchCV class.
For example, trying different values for the SVD parameters:

    from surprise import SVD
    from surprise import Dataset
    from surprise.model_selection import GridSearchCV

    # Use movielens-100k.
    data = Dataset.load_builtin("ml-100k")

    param_grid = {"n_epochs": [5, 10], "lr_all": [0.002, 0.005],
                  "reg_all": [0.4, 0.6]}
    gs = GridSearchCV(SVD, param_grid, measures=["rmse", "mae"], cv=3)

    gs.fit(data)

    # Best RMSE score.
    print(gs.best_score["rmse"])

    # Combination of parameters that gave the best RMSE score.
    print(gs.best_params["rmse"])

    # We can now use the algorithm that yields the best RMSE:
    algo = gs.best_estimator["rmse"]
    algo.fit(data.build_full_trainset())
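A grid search evaluates every combination of the listed values. The sketch below expands the same param_grid with itertools.product to show which combinations are tried; expand_grid is a hypothetical helper written for this note, not a Surprise function.

```python
from itertools import product

param_grid = {"n_epochs": [5, 10], "lr_all": [0.002, 0.005],
              "reg_all": [0.4, 0.6]}


def expand_grid(grid):
    """Return one dict per combination of the values in grid."""
    names = list(grid)
    return [dict(zip(names, values))
            for values in product(*(grid[name] for name in names))]


combos = expand_grid(param_grid)
print(len(combos))
for combo in combos[:2]:
    print(combo)
```

With 2 x 2 x 2 = 8 combinations and cv=3, the search trains and tests the algorithm 24 times, so grids grow expensive quickly as more values are added.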
Using prediction algorithms
○ Baseline estimates configuration
§ Parameters when using alternating least squares (ALS):
1) reg_i: regularization parameter for items, default 10
2) reg_u: regularization parameter for users, default 15
3) n_epochs: number of iterations of the ALS procedure, default 10

    from surprise import BaselineOnly, KNNBasic

    print("Using ALS")
    bsl_options = {"method": "als",
                   "n_epochs": 5,
                   "reg_u": 12,
                   "reg_i": 5
                   }
    algo = BaselineOnly(bsl_options=bsl_options)

§ Parameters when using stochastic gradient descent (SGD):
1) reg: regularization parameter of the cost function being optimized, default 0.02
2) learning_rate: learning rate of SGD, default 0.005
3) n_epochs: number of iterations of the SGD procedure, default 20

    print("Using SGD")
    bsl_options = {"method": "sgd",
                   "learning_rate": .00005,
                   }
    algo = BaselineOnly(bsl_options=bsl_options)

§ Passing baseline options when creating a KNN algorithm:

    bsl_options = {"method": "als",
                   "n_epochs": 20,
                   }
    sim_options = {"name": "pearson_baseline"}
    algo = KNNBasic(bsl_options=bsl_options, sim_options=sim_options)

○ Similarity configuration
§ name: name of the similarity measure to use, default MSD
§ user_based: whether similarities are computed between users (True) or between items (False), default True
§ min_support: minimum number of common items (or common users) required; when two users or items share fewer than min_support, their similarity is 0
§ shrinkage: shrinkage parameter, default 100
i. Cosine similarity between items:

    sim_options = {"name": "cosine",
                   "user_based": False  # compute similarities between items
                   }
    algo = KNNBasic(sim_options=sim_options)

ii. Pearson baseline similarity without shrinkage:

    sim_options = {"name": "pearson_baseline",
                   "shrinkage": 0  # no shrinkage
                   }
    algo = KNNBasic(sim_options=sim_options)

● Other questions
○ How to get the top-N recommendations

    from collections import defaultdict

    from surprise import SVD
    from surprise import Dataset


    def get_top_n(predictions, n=10):
        """Return the top-N recommendation for each user from a set of predictions.

        Args:
            predictions(list of Prediction objects): The list of predictions, as
                returned by the test method of an algorithm.
            n(int): The number of recommendation to output for each user. Default
                is 10.

        Returns:
            A dict where keys are user (raw) ids and values are lists of tuples:
            [(raw item id, rating estimation), ...] of size n.
        """

        # First map the predictions to each user.
        top_n = defaultdict(list)
        for uid, iid, true_r, est, _ in predictions:
            top_n[uid].append((iid, est))

        # Then sort the predictions for each user and retrieve the n highest ones.
        for uid, user_ratings in top_n.items():
            user_ratings.sort(key=lambda x: x[1], reverse=True)
            top_n[uid] = user_ratings[:n]

        return top_n


    # First train an SVD algorithm on the movielens dataset.
    data = Dataset.load_builtin("ml-100k")
    trainset = data.build_full_trainset()
    algo = SVD()
    algo.fit(trainset)

    # Then predict ratings for all pairs (u, i) that are NOT in the training set.
    testset = trainset.build_anti_testset()
    predictions = algo.test(testset)

    top_n = get_top_n(predictions, n=10)

    # Print the recommended items for each user.
    for uid, user_ratings in top_n.items():
        print(uid, [iid for (iid, _) in user_ratings])

○ How to compute precision@k and recall@k
    from collections import defaultdict

    from surprise import Dataset
    from surprise import SVD
    from surprise.model_selection import KFold


    def precision_recall_at_k(predictions, k=10, threshold=3.5):
        """Return precision and recall at k metrics for each user."""

        # First map the predictions to each user.
        user_est_true = defaultdict(list)
        for uid, _, true_r, est, _ in predictions:
            user_est_true[uid].append((est, true_r))

        precisions = dict()
        recalls = dict()
        for uid, user_ratings in user_est_true.items():

            # Sort user ratings by estimated value.
            user_ratings.sort(key=lambda x: x[0], reverse=True)

            # Number of relevant items.
            n_rel = sum((true_r >= threshold) for (_, true_r) in user_ratings)

            # Number of recommended items in top k.
            n_rec_k = sum((est >= threshold) for (est, _) in user_ratings[:k])

            # Number of relevant and recommended items in top k.
            n_rel_and_rec_k = sum(((true_r >= threshold) and (est >= threshold))
                                  for (est, true_r) in user_ratings[:k])

            # Precision@K: proportion of recommended items that are relevant.
            # Undefined when n_rec_k is 0; this example sets it to 1 in that case.
            precisions[uid] = n_rel_and_rec_k / n_rec_k if n_rec_k != 0 else 1

            # Recall@K: proportion of relevant items that are recommended.
            # Undefined when n_rel is 0; this example sets it to 1 in that case.
            recalls[uid] = n_rel_and_rec_k / n_rel if n_rel != 0 else 1

        return precisions, recalls


    data = Dataset.load_builtin("ml-100k")
    kf = KFold(n_splits=5)
    algo = SVD()

    for trainset, testset in kf.split(data):
        algo.fit(trainset)
        predictions = algo.test(testset)
        precisions, recalls = precision_recall_at_k(predictions, k=5, threshold=4)

        # Precision and recall can then be averaged over all users.
        print(sum(prec for prec in precisions.values()) / len(precisions))
        print(sum(rec for rec in recalls.values()) / len(recalls))

○ How to get the k nearest neighbors of a user (or item)
    import io  # needed because of weird encoding of u.item file

    from surprise import KNNBaseline
    from surprise import Dataset
    from surprise import get_dataset_dir


    def read_item_names():
        """Read the u.item file from MovieLens 100-k dataset and return two
        mappings to convert raw ids into movie names and movie names into raw ids.
        """

        file_name = get_dataset_dir() + "/ml-100k/ml-100k/u.item"
        rid_to_name = {}
        name_to_rid = {}
        with io.open(file_name, "r", encoding="ISO-8859-1") as f:
            for line in f:
                line = line.split("|")
                rid_to_name[line[0]] = line[1]
                name_to_rid[line[1]] = line[0]

        return rid_to_name, name_to_rid


    # First, train the algorithm to compute the similarities between items.
    data = Dataset.load_builtin("ml-100k")
    trainset = data.build_full_trainset()
    sim_options = {"name": "pearson_baseline", "user_based": False}
    algo = KNNBaseline(sim_options=sim_options)
    algo.fit(trainset)

    # Read the mappings raw id <-> movie name.
    rid_to_name, name_to_rid = read_item_names()

    # Retrieve the inner id of the movie Toy Story.
    toy_story_raw_id = name_to_rid["Toy Story (1995)"]
    toy_story_inner_id = algo.trainset.to_inner_iid(toy_story_raw_id)

    # Retrieve the inner ids of the nearest neighbors of Toy Story.
    toy_story_neighbors = algo.get_neighbors(toy_story_inner_id, k=10)

    # Convert the inner ids of the neighbors into names.
    toy_story_neighbors = (algo.trainset.to_raw_iid(inner_id)
                           for inner_id in toy_story_neighbors)
    toy_story_neighbors = (rid_to_name[rid]
                           for rid in toy_story_neighbors)

    print()
    print("The 10 nearest neighbors of Toy Story are:")
    for movie in toy_story_neighbors:
        print(movie)

○ What are raw ids and inner ids?
i. Every user and every item has both a raw id and an inner id. The raw id is the id defined in the ratings file or pandas dataframe; when ratings are read from a file, raw ids are strings. This matters because predict() and the other methods that take user or item ids expect raw ids.
ii. When a trainset is created, each raw id is mapped to an inner id, a unique integer that is more convenient for Surprise to manipulate. Convert between raw and inner ids with the trainset methods to_inner_uid(), to_inner_iid(), to_raw_uid(), and to_raw_iid().
○ Where are the built-in datasets downloaded, and how can that location be changed?
i. By default, datasets are downloaded to "~/.surprise_data".
ii. To change the location, set the "SURPRISE_DATA_FOLDER" environment variable.

● API overview
○ Prediction algorithms package
    random_pred.NormalPredictor: Algorithm predicting a random rating based on the distribution of the training set, which is assumed to be normal.
    baseline_only.BaselineOnly: Algorithm predicting the baseline estimate for given user and item.
    knns.KNNBasic: A basic collaborative filtering algorithm.
    knns.KNNWithMeans: A basic collaborative filtering algorithm, taking into account the mean ratings of each user.
    knns.KNNWithZScore: A basic collaborative filtering algorithm, taking into account the z-score normalization of each user.
    knns.KNNBaseline: A basic collaborative filtering algorithm taking into account a baseline rating.
    matrix_factorization.SVD: The famous SVD algorithm, as popularized by Simon Funk during the Netflix Prize.
    matrix_factorization.SVDpp: The SVD++ algorithm, an extension of SVD taking into account implicit ratings.
    matrix_factorization.NMF: A collaborative filtering algorithm based on Non-negative Matrix Factorization.
    slope_one.SlopeOne: A simple yet accurate collaborative filtering algorithm.
    co_clustering.CoClustering: A collaborative filtering algorithm based on co-clustering.
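As a rough illustration of the first entry in the list, NormalPredictor, the sketch below estimates the mean and standard deviation of some training ratings and then draws predictions from the corresponding normal distribution. The function names, the sample ratings, the use of the population standard deviation, and the clipping to a (1, 5) scale are all assumptions made for this example; use surprise.NormalPredictor for real work.

```python
import random
import statistics


def fit_normal(ratings):
    """Estimate the mean and (population) standard deviation of the ratings."""
    return statistics.mean(ratings), statistics.pstdev(ratings)


def predict_random(mu, sigma, rng, rating_scale=(1, 5)):
    """Draw one rating from N(mu, sigma) and clip it into rating_scale."""
    low, high = rating_scale
    return min(high, max(low, rng.gauss(mu, sigma)))


# Illustrative training ratings only.
train_ratings = [4, 5, 3, 4, 2, 5, 4]
mu, sigma = fit_normal(train_ratings)

rng = random.Random(0)  # fixed seed so repeated runs give the same draws
samples = [predict_random(mu, sigma, rng) for _ in range(5)]
print(round(mu, 3), round(sigma, 3))
print(samples)
```

Because it ignores who the user and item are, a predictor of this kind is mainly useful as a baseline to compare the other algorithms against.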
○ Prediction algorithm base class
§ class surprise.prediction_algorithms.algo_base.AlgoBase(**kwargs)
§ If the algorithm needs to compute baseline estimates, the baseline options (bsl_options) configure how they are computed.
§ Methods:
1) compute_baselines()
   Computes the user and item baselines. Only relevant for algorithms that use the pearson_baseline similarity and for the BaselineOnly algorithm. Returns a tuple containing the user baselines and the item baselines.
2) compute_similarities()
   Builds the similarity matrix. How it is computed depends on the sim_options passed when the algorithm was created. Returns the similarity matrix.
3) default_prediction()
   The default prediction, used when a prediction turns out to be impossible. By default it is the mean of all ratings in the trainset (it can be overridden in child classes). Returns a float.
4) fit(trainset)
   Trains the algorithm on the given training set. Every derived class calls this method as the first basic step of training; it initializes some internal structures and sets the self.trainset attribute. Returns self.
5) get_neighbors(iid, k)
   Returns the k nearest neighbors of the inner id iid, which refers to a user or an item depending on whether user_based in sim_options is True or False. Returns a list of the k inner ids of the closest neighbors.
6) predict(uid, iid, r_ui=None, clip=True, verbose=False)
   Computes the rating prediction for the given user and item. The method converts raw ids to inner ids and then calls the estimate method defined in each derived class. If the prediction is impossible, the estimate is set according to default_prediction().
   clip: whether to clip the estimate into the rating scale. For example, if the estimate is 5.5 and the rating scale is [1, 5], the estimate is set to 5; if it is below 1, it is set to 1. Default is True.
   verbose: whether to print the details of each prediction. Default is False.
   Returns a Prediction object containing:
   a) the raw user id
   b) the raw item id
   c) the true rating
   d) the estimated rating
   e) additional details that may be useful for later analysis
7) test(testset, verbose=False)
   Tests the algorithm on the given testset, i.e. estimates every rating in it. Returns a list of Prediction objects.
○ Predictions module
§ The surprise.prediction_algorithms.predictions module defines the Prediction named tuple and the PredictionImpossible exception.
§ Prediction
□ A named tuple for storing the result of a prediction
□ It is wrapped in a class only for documentation and printing purposes
□ Parameters:
   uid: the raw user id
   iid: the raw item id
   r_ui: the true rating (float)
   est: the estimated rating (float)
   details: additional details about the prediction
§ surprise.prediction_algorithms.predictions.PredictionImpossible
□ Raised when a prediction is impossible
□ When it is raised, the estimate is set to the default value (the global mean)
○ model_selection package
§ Cross-validation iterators
□ The module contains several cross-validation iterators:
   KFold: a basic cross-validation iterator
   RepeatedKFold: repeated KFold cross-validation iterator
   ShuffleSplit: a basic cross-validation iterator with random trainsets and testsets
   LeaveOneOut: cross-validation iterator where each user has exactly one rating in the testset
   PredefinedKFold: cross-validation iterator for datasets loaded with the load_from_folds method
□ The module also contains a function for splitting a dataset into a trainset and a testset:
   train_test_split(data, test_size=0.2, train_size=None, random_state=None, shuffle=True)
   data: the dataset to split
   test_size: if a float, the proportion of ratings to include in the testset; if an int, the absolute number of ratings in the testset; if None, the complement of the trainset size. Default is 0.2
   train_size: if a float, the proportion of ratings to include in the trainset; if an int, the absolute number of ratings in the trainset; if None, the complement of the testset size. Default is None
   random_state: int, a random seed; set it to obtain the same trainset and testset across multiple splits
   shuffle: bool, whether to shuffle the ratings in the dataset. Default is True
§ Cross-validation
   surprise.model_selection.validation.cross_validate(algo, data, measures=["rmse", "mae"], cv=None, return_train_measures=False, n_jobs=1, pre_dispatch="2*n_jobs", verbose=False)
   ▪ algo: the algorithm
   ▪ data: the dataset
   ▪ measures: list of strings naming the accuracy measures to compute
   ▪ cv: a cross-validation iterator, an int, or None. An iterator is used as given; an int selects a KFold iterator with that number of folds; None selects the default KFold with 5 folds
   ▪ return_train_measures: whether to also compute performance measures on the trainsets. Default is False
   ▪ n_jobs: int, the maximum number of folds evaluated in parallel. If -1, all CPUs are used; if 1, nothing runs in parallel (useful for debugging); for values below -1, (number of CPUs + n_jobs + 1) CPUs are used. Default is 1
   ▪ pre_dispatch: int or string, controls the number of jobs dispatched during parallel execution. Reducing it helps avoid excessive memory consumption when more jobs are dispatched than the CPUs can process. It can be:
      None: all jobs are created and spawned immediately
      int: the exact total number of jobs that are spawned
      string: an expression as a function of n_jobs, such as "2*n_jobs"
      Default is "2*n_jobs"
   Returns a dict with the keys:
   ▪ test_*: where * is an accuracy measure, e.g. "test_rmse"
   ▪ train_*: where * is an accuracy measure, e.g. "train_rmse". Only available when return_train_measures is True
   ▪ fit_time: array with the training time of each split, in seconds
   ▪ test_time: array with the testing time of each split, in seconds
§ Parameter search
□ class surprise.model_selection.search.GridSearchCV(algo_class, param_grid, measures=["rmse", "mae"], cv=None, refit=False, return_train_measures=False, n_jobs=1, pre_dispatch="2*n_jobs", joblib_verbose=0)
   ▪ The parameters are similar to those of cross_validate above
   ▪ refit: bool or string. If True, the algorithm is refit on the whole dataset using the parameters that gave the best average performance for the first measure in measures; another measure can be selected by passing its name as a string. Default is False
   ▪ joblib_verbose: int, controls the verbosity of joblib; the higher the number, the more messages
□ Attributes and methods:
   a) best_estimator: dict; using an accuracy measure as key, gives the algorithm that achieved the best results for that measure, averaged over all splits
   b) best_score: dict of floats; using an accuracy measure as key, gives the best average score achieved for that measure
   c) best_params: dict; using an accuracy measure as key, gives the parameter combination that achieved the best results for that measure
   d) best_index: dict of ints; using an accuracy measure as key, gives the index into cv_results of the entry with the best average accuracy for that measure
   e) cv_results: dict of arrays holding the accuracy measures over all splits, as well as the train and test times, for every parameter combination
   f) fit(data): runs the fit on every parameter combination, over the splits given by the cv parameter
   g) predict(): calls predict() on the estimator with the best parameters found; only available when refit is not False. Arguments are as described above
   h) test(): calls test() on the estimator with the best parameters found; only available when refit is not False. Arguments are as described above
□ class surprise.model_selection.search.RandomizedSearchCV(algo_class, param_distributions, n_iter=10, measures=["rmse", "mae"], cv=None, refit=False, return_train_measures=False, n_jobs=1, pre_dispatch="2*n_jobs", random_state=None, joblib_verbose=0)
   Samples parameter combinations at random instead of enumerating all of them exhaustively as GridSearchCV does
○ Similarities module
§ The similarities module contains tools for computing similarities between users or between items:
1) cosine
2) msd
3) pearson
4) pearson_baseline
○ Accuracy module
§ The surprise.accuracy module provides tools for computing accuracy metrics on a set of predictions:
1) rmse (root mean squared error)
2) mae (mean absolute error)
3) fcp (fraction of concordant pairs)
○ Dataset module
§ The dataset module defines the Dataset class and other subclasses used for managing datasets
§ class surprise.dataset.Dataset(reader)
§ Methods:
1) load_builtin(name="ml-100k"): loads a built-in dataset and returns a Dataset object
2) load_from_df(df, reader): df is a dataframe that must have three columns, in this order: raw user ids, raw item ids, ratings; reader is a Reader, of which only the rating_scale field needs to be specified
3) load_from_file(file_path, reader): loads data from a file, given its path and a reader
4) load_from_folds(folds_files, reader): handles the special case where the trainsets and testsets are already defined, as in the movielens-100k dataset, which ships with predefined folds
○ Trainset class
§ class surprise.Trainset(ur, ir, n_users, n_items, n_ratings, rating_scale, offset, raw2inner_id_users, raw2inner_id_items)
§ Attributes:
1) ur: the users' ratings, a dict of lists of (item_inner_id, rating) tuples whose keys are user inner ids
2) ir: the items' ratings, a dict of lists of (user_inner_id, rating) tuples whose keys are item inner ids
3) n_users: number of users
4) n_items: number of items
5) n_ratings: total number of ratings
6) rating_scale: tuple with the lowest and highest possible rating
7) global_mean: mean of all ratings
§ Methods:
1) all_items(): generator that iterates over all items and yields their inner ids
2) all_ratings(): generator that iterates over all ratings and yields (uid, iid, rating) tuples
3) all_users(): generator that iterates over all users and yields their inner ids
4) build_anti_testset(fill=None): returns a list of ratings that can be used as a testset in the test() method, made of the user-item pairs that are not in the trainset. fill is the value used for the unknown ratings; if None, global_mean is used
5) knows_item(iid): indicates whether the item is part of the trainset
6) knows_user(uid): indicates whether the user is part of the trainset
7) to_inner_iid(riid): converts an item raw id to an inner id
8) to_inner_uid(ruid): converts a user raw id to an inner id
9) to_raw_iid(iiid): converts an item inner id to a raw id
10) to_raw_uid(iuid): converts a user inner id to a raw id
○ Reader class
§ class surprise.reader.Reader(name=None, line_format="user item rating", sep=None, rating_scale=(1, 5), skip_lines=0)
   The Reader class parses a file containing ratings. Such a file must specify exactly one rating per line, and each line must follow the structure user ; item ; rating ; [timestamp]. The order of the fields and the separator are arbitrary but must be specified.
§ Parameters:
1) name: if specified, a Reader for one of the built-in datasets is returned and all other parameters are ignored. Accepted values are "ml-100k", "ml-1m", and "jester". Default is None
2) line_format: string, the field names separated by spaces. Default is "user item rating"
3) sep: char, the separator between fields
4) rating_scale: tuple, the rating scale. Default is (1, 5)
5) skip_lines: int, number of lines to skip at the beginning of the file. Default is 0
○ Dump module
§ surprise.dump.dump(file_name, predictions=None, algo=None, verbose=0)
□ A basic wrapper around pickle for serializing a list of predictions and/or an algorithm
□ Parameters:
   a) file_name: str, where to write the dump
   b) predictions: list of Prediction objects, the predictions to dump
   c) algo: the algorithm to dump
   d) verbose: level of verbosity, 0 or 1
§ surprise.dump.load(file_name)
□ Reads a dump file
□ Returns a tuple (predictions, algo), where either element may be None
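Since the dump module is described as a thin wrapper around pickle, the round trip can be pictured with the standard library alone. The sketch below is a simplified stand-in: dump_sketch, load_sketch, the dictionary layout, and the sample prediction are assumptions for illustration and are not guaranteed to match the file format Surprise actually writes.

```python
import os
import pickle
import tempfile


def dump_sketch(file_name, predictions=None, algo=None):
    """Pickle predictions and/or an algorithm into a single file."""
    with open(file_name, "wb") as f:
        pickle.dump({"predictions": predictions, "algo": algo}, f,
                    protocol=pickle.HIGHEST_PROTOCOL)


def load_sketch(file_name):
    """Return the (predictions, algo) tuple stored by dump_sketch."""
    with open(file_name, "rb") as f:
        stored = pickle.load(f)
    return stored["predictions"], stored["algo"]


# Stand-in data shaped like (uid, iid, r_ui, est, details).
predictions = [("196", "242", 3.0, 3.6, {"was_impossible": False})]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "dump_file")
    dump_sketch(path, predictions=predictions)  # no algorithm is stored
    loaded_predictions, loaded_algo = load_sketch(path)

print(loaded_predictions == predictions, loaded_algo is None)
```

As with any pickle-based format, only load dump files that come from a source you trust, since unpickling can execute arbitrary code.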
When reposting, please cite the address of this article: http://systransis.cn/yun/43807.html