

Using Machine Learning to Identify Cheating Bot Users in an Auction (Part 2)

YanceyOfficial / 3227 reads


This article continues from the previous one: Using Machine Learning to Identify Cheating Bot Users in an Auction

This project is a competition hosted by Facebook on Kaggle. The competition address is given under "Project data source" below, and the complete code is on my GitHub. Come and play!

Code

Data exploration: Data_Exploration.ipynb

Data preprocessing and feature engineering: Feature_Engineering.ipynb & Feature_Engineering2.ipynb

Model design and evaluation: Model_Design.ipynb

Project data source

kaggle

Additional packages required

numpy

pandas

matplotlib

sklearn

xgboost

lightgbm

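mlxtend: provides the Stacking ensemble algorithm

The whole project is estimated to take about 60 minutes to run on Ubuntu with 8 GB of memory. For the run results, see the submitted Jupyter notebook files.


Because the article is long, it is split into two posts covering four parts in total:

Data exploration

Data preprocessing and feature engineering

Model design

Evaluation and summary


Feature Engineering (continued)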
import numpy as np
import pandas as pd
import pickle
%matplotlib inline
from IPython.display import display
# bids = pd.read_csv("bids.csv")
with open("bids.pkl", "rb") as f:  # pickle files must be opened in binary mode
    bids = pickle.load(f)
print(bids.shape)
display(bids.head())
(7656329, 9)

bid_id bidder_id auction merchandise device time country ip url
0 0 8dac2b259fd1c6d1120e519fb1ac14fbqvax8 ewmzr jewelry phone0 9759243157894736 us 69.166.231.58 vasstdc27m7nks3
1 1 668d393e858e8126275433046bbd35c6tywop aeqok furniture phone1 9759243157894736 in 50.201.125.84 jmqlhflrzwuay9c
2 2 aa5f360084278b35d746fa6af3a7a1a5ra3xe wa00e home goods phone2 9759243157894736 py 112.54.208.157 vasstdc27m7nks3
3 3 3939ac3ef7d472a59a9c5f893dd3e39fh9ofi jefix jewelry phone4 9759243157894736 in 18.99.175.133 vasstdc27m7nks3
4 4 8393c48eaf4b8fa96886edc7cf27b372dsibi jefix jewelry phone5 9759243157894736 in 145.138.5.37 vasstdc27m7nks3
bidders = bids.groupby("bidder_id")
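Expand the two multi-category features, country and merchandise, into separate per-category features for each bidder: a bid count for each merchandise category and a 0/1 flag for each country the bidder has bid from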
cates = (bids["merchandise"].unique()).tolist()
countries = (bids["country"].unique()).tolist()

def dummy_coun_cate(group):
    coun_cate = dict.fromkeys(cates, 0)
    coun_cate.update(dict.fromkeys(countries, 0))
    for cat, value in group["merchandise"].value_counts().items():
        coun_cate[cat] = value

    for c in group["country"].unique():
        coun_cate[c] = 1

    coun_cate = pd.Series(coun_cate)
    return coun_cate
bidder_coun_cate = bidders.apply(dummy_coun_cate)
display(bidder_coun_cate.describe())
bidder_coun_cate.to_csv("coun_cate.csv")
ad ae af ag al am an ao ar at ... vc ve vi vn ws ye za zm zw zz
count 6609.000000 6609.000000 6609.000000 6609.000000 6609.000000 6609.000000 6609.000000 6609.000000 6609.000000 6609.000000 ... 6609.000000 6609.000000 6609.000000 6609.000000 6609.000000 6609.000000 6609.000000 6609.000000 6609.000000 6609.000000
mean 0.002724 0.205629 0.054774 0.001059 0.048570 0.023907 0.000303 0.036314 0.120442 0.052655 ... 0.000605 0.033591 0.000303 0.130882 0.001967 0.040551 0.274474 0.067181 0.069753 0.000757
std 0.052121 0.404191 0.227555 0.032530 0.214984 0.152770 0.017395 0.187085 0.325502 0.223362 ... 0.024596 0.180186 0.017395 0.337297 0.044311 0.197262 0.446283 0.250354 0.254750 0.027497
min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
50% 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
75% 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000
max 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 ... 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000

8 rows × 209 columns
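
To make the shape of these features concrete, here is a minimal sketch on a made-up four-row bids frame. It uses pd.crosstab rather than the groupby/apply above; that is a swapped-in, vectorised alternative and not the code used in this project. The toy data and the names toy_bids, merch_counts, country_flags and toy_coun_cate are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-in for the bids table (values are made up for illustration).
toy_bids = pd.DataFrame({
    "bidder_id":   ["a", "a", "a", "b"],
    "merchandise": ["jewelry", "jewelry", "jewelry", "mobile"],
    "country":     ["us", "in", "us", "us"],
})

# Merchandise: number of bids per category for each bidder.
merch_counts = pd.crosstab(toy_bids["bidder_id"], toy_bids["merchandise"])

# Country: 1 if the bidder ever bid from that country, otherwise 0.
country_flags = (pd.crosstab(toy_bids["bidder_id"], toy_bids["country"]) > 0).astype(int)

# One row per bidder, one column per merchandise category and per country.
toy_coun_cate = pd.concat([merch_counts, country_flags], axis=1)
print(toy_coun_cate)
```

Bidder "a" ends up with a jewelry count of 3 and a flag of 1 for both "in" and "us", which is the same layout as the 209-column table described above.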

Likewise, for each user we compute statistics on the time intervals between that user's consecutive bids

def bidder_interval(group):
    time_diff = np.ediff1d(group["time"])
    bidder_interval = {}
    if len(time_diff) == 0:
        diff_mean = 0
        diff_std = 0
        diff_median = 0
        diff_zeros = 0
    else:
        diff_mean = np.mean(time_diff)
        diff_std = np.std(time_diff)
        diff_median = np.median(time_diff)
        diff_zeros = time_diff.shape[0] - np.count_nonzero(time_diff)
    bidder_interval["tmean"] = diff_mean
    bidder_interval["tstd"] = diff_std
    bidder_interval["tmedian"] = diff_median
    bidder_interval["tzeros"] = diff_zeros
    bidder_interval = pd.Series(bidder_interval)
    return bidder_interval
bidder_inv = bidders.apply(bidder_interval)
display(bidder_inv.describe())
bidder_inv.to_csv("bidder_inv.csv")
tmean tmedian tstd tzeros
count 6.609000e+03 6.609000e+03 6.609000e+03 6609.000000
mean 2.933038e+12 1.860285e+12 3.440901e+12 122.986231
std 8.552343e+12 7.993497e+12 6.512992e+12 3190.805229
min 0.000000e+00 0.000000e+00 0.000000e+00 0.000000
25% 1.192853e+10 2.578947e+09 1.749995e+09 0.000000
50% 2.641139e+11 5.726316e+10 5.510107e+11 0.000000
75% 1.847456e+12 6.339474e+11 2.911282e+12 0.000000
max 7.610295e+13 7.610295e+13 3.800092e+13 231570.000000
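
As a small worked example of the interval statistics, the sketch below applies the same np.ediff1d logic to a hand-made list of timestamps. The helper name interval_stats is hypothetical, and the explicit sort is an added assumption: the function above relies on the bids already being in time order within each group.

```python
import numpy as np

def interval_stats(times):
    """Mean, std, median and number of zero gaps between consecutive bids."""
    # Sorting is an assumption added here; gaps are only meaningful in time order.
    diffs = np.ediff1d(np.sort(np.asarray(times)))
    if diffs.size == 0:
        # A bidder with a single bid has no gaps at all.
        return {"tmean": 0.0, "tstd": 0.0, "tmedian": 0.0, "tzeros": 0}
    return {
        "tmean": float(np.mean(diffs)),
        "tstd": float(np.std(diffs)),
        "tmedian": float(np.median(diffs)),
        "tzeros": int(diffs.size - np.count_nonzero(diffs)),
    }

# Gaps are [0, 3, 7]: one zero gap, i.e. two bids with the same timestamp.
stats = interval_stats([10, 10, 13, 20])
print(stats)
```

A large tzeros value means many bids share exactly the same timestamp, which is hard for a human bidder to produce and is presumably why it is kept as a feature.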

Further analysis grouped by user-auction pair

The previous statistics were grouped by user, describing each user's bidding behaviour as a whole. Below, users are further subdivided by auction to look at how each user behaves in different auctions. As with the per-user features above, the following features are computed for each user-auction pair:

Basic counts: for each user in each auction, the number of distinct devices, countries, IPs and URLs and the number of bids are used as new features (the merchandise-category count is commented out in the code below)

Time interval statistics: the mean, standard deviation and median of the intervals between each user's bids in each auction, and the number of zero-length intervals

Merchandise category and country are again expanded into per-category features

def auc_features_count(group):
    time_diff = np.ediff1d(group["time"])
    
    if len(time_diff) == 0:
        diff_mean = 0
        diff_std = 0
        diff_median = 0
        diff_zeros = 0
    else:
        diff_mean = np.mean(time_diff)
        diff_std = np.std(time_diff)
        diff_median = np.median(time_diff)
        diff_zeros = time_diff.shape[0] - np.count_nonzero(time_diff)

    row = dict.fromkeys(cates, 0)
    row.update(dict.fromkeys(countries, 0))

    row["devices_c"] = group["device"].unique().shape[0]
    row["countries_c"] = group["country"].unique().shape[0]
    row["ip_c"] = group["ip"].unique().shape[0]
    row["url_c"] = group["url"].unique().shape[0]
#     row["merch_c"] = group["merchandise"].unique().shape[0]
    row["bids_c"] = group.shape[0]
    row["tmean"] = diff_mean
    row["tstd"] = diff_std
    row["tmedian"] = diff_median
    row["tzeros"] = diff_zeros

    for cat, value in group["merchandise"].value_counts().items():
        row[cat] = value

    for c in group["country"].unique():
        row[c] = 1

    row = pd.Series(row)
    return row
bidder_auc = bids.groupby(["bidder_id", "auction"]).apply(auc_features_count)
bidder_auc.to_csv("bids_auc.csv")
print(bidder_auc.shape)
(382336, 218)
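
The two-level grouping can be illustrated on a tiny example. This is only a sketch under assumptions: the toy rows are made up, and named aggregation with agg replaces the custom apply function used above. The final averaging step mirrors how the per-auction rows are later collapsed to one row per bidder when the features are merged.

```python
import pandas as pd

# Toy bids: bidder "a" takes part in auctions x and y, bidder "b" only in x.
toy_bids = pd.DataFrame({
    "bidder_id": ["a", "a", "a", "b", "b"],
    "auction":   ["x", "x", "y", "x", "x"],
    "device":    ["d1", "d2", "d1", "d3", "d3"],
    "ip":        ["1.1.1.1", "1.1.1.2", "1.1.1.1", "2.2.2.2", "2.2.2.2"],
})

# One row per (bidder, auction) pair: distinct devices, distinct IPs, number of bids.
per_auction = toy_bids.groupby(["bidder_id", "auction"]).agg(
    devices_c=("device", "nunique"),
    ip_c=("ip", "nunique"),
    bids_c=("device", "size"),
)

# Collapse back to one row per bidder by averaging over that bidder's auctions.
per_bidder = per_auction.groupby(level="bidder_id").mean()
print(per_bidder)
```

Bidder "a" uses two devices in auction x and one in auction y, so the averaged devices_c is 1.5.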
Model Design and Parameter Evaluation

Merging Features

Merge the features generated earlier to produce the final feature space

import numpy as np
import pandas as pd
%matplotlib inline
from IPython.display import display

First merge the per-user statistical features, then merge in the features grouped by user-auction pair, and finally add the time features. The result is joined to the training set and the test set to produce the feature files used for training and prediction

def merge_data():    
    train = pd.read_csv("train.csv")
    test = pd.read_csv("test.csv")

    time_differences = pd.read_csv("tdiff.csv", index_col=0)
    bids_auc = pd.read_csv("bids_auc.csv")

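    # average each bidder's per-auction features; numeric_only skips the text "auction" column
    bids_auc = bids_auc.groupby("bidder_id").mean(numeric_only=True)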
    
    bidders = pd.read_csv("cnt_bidder.csv", index_col=0)
    country_cate = pd.read_csv("coun_cate.csv", index_col=0)
    bidder_inv = pd.read_csv("bidder_inv.csv", index_col=0)
    bidders = bidders.merge(country_cate, right_index=True, left_index=True)
    bidders = bidders.merge(bidder_inv, right_index=True, left_index=True)

    bidders = bidders.merge(bids_auc, right_index=True, left_index=True)
    bidders = bidders.merge(time_differences, right_index=True,
                            left_index=True)

    train = train.merge(bidders, left_on="bidder_id", right_index=True)
    train.to_csv("train_full.csv", index=False)

    test = test.merge(bidders, left_on="bidder_id", right_index=True)
    test.to_csv("test_full.csv", index=False)    
merge_data()
train_full = pd.read_csv("train_full.csv")
test_full = pd.read_csv("test_full.csv")
print(train_full.shape)
print(test_full.shape)
(1983, 445)
(4626, 444)
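
Two side effects of merge are worth keeping in mind here, and the following sketch shows both on made-up frames (all names and values are assumptions). First, the per-bidder and per-auction tables share column names, so pandas adds _x and _y suffixes. Second, merge defaults to an inner join, so bidders with no rows in the feature table drop out, which fits with the submission step later filling missing bidders with the mean prediction.

```python
import pandas as pd

# Labels for three bidders; "c" has no bids and therefore no features.
labels = pd.DataFrame({"bidder_id": ["a", "b", "c"], "outcome": [0, 1, 0]})

# Two feature tables indexed by bidder id, sharing the same column names.
per_bidder = pd.DataFrame({"bids_c": [10, 3], "tmean": [5.0, 2.0]}, index=["a", "b"])
per_auction_mean = pd.DataFrame({"bids_c": [2.5, 1.5], "tmean": [4.0, 1.0]}, index=["a", "b"])

# Index-on-index merge: overlapping column names get _x / _y suffixes.
features = per_bidder.merge(per_auction_mean, left_index=True, right_index=True)

# Column-on-index merge with the default how="inner": bidder "c" is dropped.
full = labels.merge(features, left_on="bidder_id", right_index=True)
print(full)
```

Passing how="left" in the last merge would keep bidder "c" with NaN features instead of dropping the row.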

train_full["outcome"] = train_full["outcome"].astype(int)
ytrain = train_full["outcome"]
train_full.drop("outcome", axis=1, inplace=True)

test_ids = test_full["bidder_id"]

labels = ["payment_account", "address", "bidder_id"]
train_full.drop(labels=labels, axis=1, inplace=True)
test_full.drop(labels=labels, axis=1, inplace=True)
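Designing Cross-Validation

Model Selection

Based on the earlier analysis, the dataset is imbalanced between positive and negative examples, so four models were chosen to train and predict on it: RandomForestClassifier, GradientBoostingClassifier, xgboost and lightgbm. The basic approach for settling on the final model is as follows:

Run cross-validation on each of the four models with the roc_auc metric and plot the ROC curves; average each model's score over the cross-validation folds and print it

Decide on the final model or models based on the scores

If one model turns out to outperform all the others by a wide margin, select that model for further parameter tuning

If several models perform well, ensemble them into an aggregated model

Use GridSearchCV to select the best parameter combination from a manually specified parameter list to determine the final model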
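
The first two steps of this plan can be sketched compactly with scikit-learn's cross_val_score. This is not the plotting helper used in this article, only a minimal stand-in on synthetic imbalanced data from make_classification; the dataset, the two candidate models and their settings are assumptions chosen to keep the example small.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, imbalanced stand-in for the training set (about 10% positives).
X, y = make_classification(n_samples=400, n_features=10, weights=[0.9, 0.1],
                           random_state=7)

# Stratified folds keep the positive/negative ratio similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=7)

candidates = {
    "forest": RandomForestClassifier(n_estimators=50, random_state=7),
    "gbm": GradientBoostingClassifier(n_estimators=50, random_state=7),
}

# Mean ROC AUC over the folds for each candidate model.
mean_auc = {name: cross_val_score(model, X, y, cv=cv, scoring="roc_auc").mean()
            for name, model in candidates.items()}
best = max(mean_auc, key=mean_auc.get)
print(mean_auc, best)
```

ROC AUC is a sensible metric for this kind of data because it does not depend on a classification threshold and is not inflated by simply predicting the majority class.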

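import time  # used by kfold_plot below to time each fold
from numpy import interp  # scipy.interp was a deprecated alias of numpy.interp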
import matplotlib.pyplot as plt
from itertools import cycle

# from sklearn.cross_validation import StratifiedKFold
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score, roc_curve, auc

def kfold_plot(train, ytrain, model):
#     kf = StratifiedKFold(y=ytrain, n_folds=5)
    kf = StratifiedKFold(n_splits=5)
    scores = []
    mean_tpr = 0.0
    mean_fpr = np.linspace(0, 1, 100)
    exe_time = []
    
    colors = cycle(["cyan", "indigo", "seagreen", "yellow", "blue"])
    lw = 2
    
    i=0
    for (train_index, test_index), color in zip(kf.split(train, ytrain), colors):
        X_train, X_test = train.iloc[train_index], train.iloc[test_index]
        y_train, y_test = ytrain.iloc[train_index], ytrain.iloc[test_index]
        begin_t = time.time()
        predictions = model(X_train, X_test, y_train)
        end_t = time.time()
        exe_time.append(round(end_t-begin_t, 3))
#         model = model
#         model.fit(X_train, y_train)    
#         predictions = model.predict_proba(X_test)[:, 1]        
        scores.append(roc_auc_score(y_test.astype(float), predictions))        
        fpr, tpr, thresholds = roc_curve(y_test, predictions)
        mean_tpr += interp(mean_fpr, fpr, tpr)
        mean_tpr[0] = 0.0
        roc_auc = auc(fpr, tpr)
        plt.plot(fpr, tpr, lw=lw, color=color, label="ROC fold %d (area = %0.2f)" % (i, roc_auc))
        i += 1
    plt.plot([0, 1], [0, 1], linestyle="--", lw=lw, color="k", label="Luck")
    
    mean_tpr /= kf.get_n_splits(train, ytrain)
    mean_tpr[-1] = 1.0
    mean_auc = auc(mean_fpr, mean_tpr)
    plt.plot(mean_fpr, mean_tpr, color="g", linestyle="--", label="Mean ROC (area = %0.2f)" % mean_auc, lw=lw)
    
    plt.xlim([-0.05, 1.05])
    plt.ylim([-0.05, 1.05])
    plt.xlabel("False Positive Rate")
    plt.ylabel("True Positive Rate")
    plt.title("Receiver operating characteristic")
    plt.legend(loc="lower right")
    plt.show()
    
#     print("scores: ", scores)
    print("mean scores: ", np.mean(scores))
    print("mean model process time: ", np.mean(exe_time), "s")
    
    return scores, np.mean(scores), np.mean(exe_time)

Collect each model's cross-validation results: the AUC score of every fold, the mean AUC score, and the model's training time

dct_scores = {}
mean_score = {}
mean_time = {}
RandomForestClassifier
from sklearn.model_selection import GridSearchCV
import time
from sklearn.ensemble import RandomForestClassifier

def forest_model(X_train, X_test, y_train):
#     begin_t = time.time()
    model = RandomForestClassifier(n_estimators=160, max_features=35, max_depth=8, random_state=7)
    model.fit(X_train, y_train)    
#     end_t = time.time()
#     print "train time of forest model: ",round(end_t-begin_t, 3), "s"
    predictions = model.predict_proba(X_test)[:, 1]
    return predictions
dct_scores["forest"], mean_score["forest"], mean_time["forest"] = kfold_plot(train_full, ytrain, forest_model)
# kfold_plot(train_full, ytrain, model_forest)

mean scores:  0.909571935157
mean model process time:  0.643 s

from sklearn.ensemble import GradientBoostingClassifier
def gradient_model(X_train, X_test, y_train):
    model = GradientBoostingClassifier(n_estimators=200, random_state=7, max_depth=5, learning_rate=0.03)
    model.fit(X_train, y_train)
    predictions = model.predict_proba(X_test)[:, 1]
    return predictions
dct_scores["gbm"], mean_score["gbm"], mean_time["gbm"] = kfold_plot(train_full, ytrain, gradient_model)

mean scores:  0.911847771023
mean model process time:  4.1948 s

import xgboost as xgb
def xgboost_model(X_train, X_test, y_train):
    X_train = xgb.DMatrix(X_train.values, label=y_train.values)
    X_test = xgb.DMatrix(X_test.values)
    params = {"objective": "binary:logistic", "eval_metric": "auc", "silent": 1, "seed": 7,
              "max_depth": 6, "eta": 0.01}    
    model = xgb.train(params, X_train, 600)
    predictions = model.predict(X_test)
    return predictions

dct_scores["xgboost"], mean_score["xgboost"], mean_time["xgboost"] = kfold_plot(train_full, ytrain, xgboost_model)

mean scores:  0.915372340426
mean model process time:  3.1482 s

import lightgbm as lgb
def lightgbm_model(X_train, X_test, y_train):
    X_train = lgb.Dataset(X_train.values, y_train.values)
    params = {"objective": "binary", "metric": {"auc"}, "learning_rate": 0.01, "max_depth": 6, "seed": 7}
    model = lgb.train(params, X_train, num_boost_round=600)
    predictions = model.predict(X_test)
    return predictions
dct_scores["lgbm"], mean_score["lgbm"], mean_time["lgbm"] = kfold_plot(train_full, ytrain, lightgbm_model)

mean scores:  0.921512158055
mean model process time:  0.3558 s

Model Comparison

Compare the four models' mean roc_auc score on the cross-validation sets and their training time

def plot_model_comp(title, y_label, dct_result):
    # materialise keys/values as lists so they can be indexed
    data_source = list(dct_result.keys())
    y_pos = np.arange(len(data_source))
    # model_auc = [0.910, 0.912, 0.915, 0.922]
    model_auc = list(dct_result.values())
    barlist = plt.bar(y_pos, model_auc, align="center", alpha=0.5)
    # get the index of highest score
    max_val = max(model_auc)
    idx = model_auc.index(max_val)
    barlist[idx].set_color("r")
    plt.xticks(y_pos, data_source)
    plt.ylabel(y_label)
    plt.title(title)
    plt.show()
    print("The highest auc score is {0} of model: {1}".format(max_val, data_source[idx]))
plot_model_comp("Model Performance", "roc-auc score", mean_score)

The highest auc score is 0.921512158055 of model: lgbm

def plot_time_comp(title, y_label, dct_result):
    # materialise keys/values as lists so they can be indexed
    data_source = list(dct_result.keys())
    y_pos = np.arange(len(data_source))
    model_time = list(dct_result.values())
    barlist = plt.bar(y_pos, model_time, align="center", alpha=0.5)
    # get the index of the shortest time
    min_val = min(model_time)
    idx = model_time.index(min_val)
    barlist[idx].set_color("r")
    plt.xticks(y_pos, data_source)
    plt.ylabel(y_label)
    plt.title(title)
    plt.show()
    print("The shortest time is {0} of model: {1}".format(min_val, data_source[idx]))
plot_time_comp("Time of Building Model", "time(s)", mean_time)

The shortest time is 0.3558 of model: lgbm

auc_forest = dct_scores["forest"]
auc_gb = dct_scores["gbm"]
auc_xgb = dct_scores["xgboost"]
auc_lgb = dct_scores["lgbm"]
print("std of forest auc score: ", np.std(auc_forest))
print("std of gbm auc score: ", np.std(auc_gb))
print("std of xgboost auc score: ", np.std(auc_xgb))
print("std of lightgbm auc score: ", np.std(auc_lgb))
data_source = ["roc-fold-1", "roc-fold-2", "roc-fold-3", "roc-fold-4", "roc-fold-5"]
y_pos = np.arange(len(data_source))
plt.plot(y_pos, auc_forest, "b-", label="forest")
plt.plot(y_pos, auc_gb, "r-", label="gbm")
plt.plot(y_pos, auc_xgb, "y-", label="xgboost")
plt.plot(y_pos, auc_lgb, "g-", label="lightgbm")
plt.title("roc-auc score of each epoch")
plt.xlabel("epoch")
plt.ylabel("roc-auc score")
plt.legend()
plt.show()
std of forest auc score:  0.0413757504568
std of gbm auc score:  0.027746291638
std of xgboost auc score:  0.0232931322563
std of lightgbm auc score:  0.0287156755513

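Judging only by each model's roc-auc scores over the 5 cross-validation folds, xgboost is the most stable (lowest standard deviation, about 0.023)

Ensemble Models

The comparison above shows that all four models perform well under cross-validation, but overall xgboost and lightgbm come out ahead on AUC (about 0.915 and 0.922). lightgbm is also the fastest to train (about 0.36 s per fold), while xgboost (about 3.1 s) is faster than GradientBoostingClassifier (about 4.2 s) but slower than the random forest (about 0.64 s). The next step is therefore to ensemble the models, as follows:

First use GridSearchCV to tune each of the four models on the whole training set to obtain the best sub-models

Then combine the sub-models using:

stacking: the stacking method from the third-party library mlxtend aggregates the sub-models into an ensemble, which is scored with the same cv method as before

voting: sklearn's built-in VotingClassifier aggregates the four models

Finally, score the ensemble models with cv once more and pick the final model based on the results

First, select the parameter combination for each model through cross-validation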

def choose_xgb_model(X_train, y_train): 
    tuned_params = [{"objective": ["binary:logistic"], "learning_rate": [0.01, 0.03, 0.05], 
                     "n_estimators": [100, 150, 200], "max_depth":[4, 6, 8]}]
    begin_t = time.time()
    clf = GridSearchCV(xgb.XGBClassifier(seed=7), tuned_params, scoring="roc_auc")
    clf.fit(X_train, y_train)
    end_t = time.time()
    print("train time: ", round(end_t-begin_t, 3), "s")
    print("current best parameters of xgboost: ", clf.best_params_)
    return clf.best_estimator_
bst_xgb = choose_xgb_model(train_full, ytrain)
train time:  48.141 s
current best parameters of xgboost:  {"n_estimators": 150, "objective": "binary:logistic", "learning_rate": 0.05, "max_depth": 4}

def choose_lgb_model(X_train, y_train): 
    tuned_params = [{"objective": ["binary"], "learning_rate": [0.01, 0.03, 0.05], 
                     "n_estimators": [100, 150, 200], "max_depth":[4, 6, 8]}]
    begin_t = time.time()
    clf = GridSearchCV(lgb.LGBMClassifier(seed=7), tuned_params, scoring="roc_auc")
    clf.fit(X_train, y_train)
    end_t = time.time()
    print("train time: ", round(end_t-begin_t, 3), "s")
    print("current best parameters of lgb: ", clf.best_params_)
    return clf.best_estimator_
bst_lgb = choose_lgb_model(train_full, ytrain)
train time:  12.543 s
current best parameters of lgb:  {"n_estimators": 150, "objective": "binary", "learning_rate": 0.05, "max_depth": 4}
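
One caveat with this setup: the parameters are selected on the whole training set, and the tuned models are then scored by cross-validation on the same data, so the later scores may be somewhat optimistic. A common way to check this is nested cross-validation, sketched below on synthetic data. This is not what the article's code does; the dataset, the small grid and the fold counts are assumptions chosen to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

# Synthetic, imbalanced stand-in for the training set.
X, y = make_classification(n_samples=300, n_features=8, weights=[0.85, 0.15],
                           random_state=7)

inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=7)  # picks parameters
outer_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=8)  # estimates the score

search = GridSearchCV(
    GradientBoostingClassifier(n_estimators=30, random_state=7),
    param_grid={"max_depth": [2, 4], "learning_rate": [0.05, 0.1]},
    scoring="roc_auc",
    cv=inner_cv,
)

# Each outer fold reruns the whole grid search on its own training part,
# so the held-out part never influences the parameter choice.
nested_scores = cross_val_score(search, X, y, cv=outer_cv, scoring="roc_auc")
print(nested_scores.mean())
```

If the nested score is close to the plain cross-validated score of the tuned model, the optimism is small and the simpler procedure is fine.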

First, use stacking to combine the two best overall models, lgb and xgb. The meta-classifier here is a relatively simple LR (logistic regression) model, which further refines the predictions on top of the already tuned models

from mlxtend.classifier import StackingClassifier
from sklearn import linear_model

def stacking_model(X_train, X_test, y_train):    
    lr = linear_model.LogisticRegression(random_state=7)
    sclf = StackingClassifier(classifiers=[bst_xgb, bst_lgb], use_probas=True, average_probas=False, 
                              meta_classifier=lr)
    sclf.fit(X_train, y_train)
    predictions = sclf.predict_proba(X_test)[:, 1]
    return predictions
dct_scores["stacking_1"], mean_score["stacking_1"], mean_time["stacking_1"] = kfold_plot(train_full, ytrain, stacking_model)

mean scores:  0.92157674772
mean model process time:  0.7022 s

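Compared with lightgbm, the previous top scorer (0.92151), stacking lightgbm and xgboost with LR as the meta-classifier gives only a very slight improvement in AUC (0.92158). Next, the RandomForest and GBDT models are added to see whether a bit more model diversity improves the stacking result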
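
For comparison, scikit-learn (0.22 and later) ships its own StackingClassifier, which trains the meta-classifier on out-of-fold predictions of the base models. As far as I understand the mlxtend documentation, its StackingClassifier instead fits the meta-classifier on predictions for the same rows the base models were trained on, with StackingCVClassifier as the cross-validated variant. The sketch below uses the scikit-learn class on synthetic data; it is an alternative under those assumptions, not the code used in this article, and the base models here are stand-ins.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier, RandomForestClassifier,
                              StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the training data.
X, y = make_classification(n_samples=400, n_features=10, weights=[0.85, 0.15],
                           random_state=7)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, test_size=0.25,
                                          random_state=7)

stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(n_estimators=50, random_state=7)),
                ("gbm", GradientBoostingClassifier(n_estimators=50, random_state=7))],
    final_estimator=LogisticRegression(),  # simple LR meta-classifier, as in the article
    stack_method="predict_proba",          # feed class probabilities to the meta-classifier
    cv=5,                                  # meta-features come from 5-fold out-of-fold predictions
)
stack.fit(X_tr, y_tr)

stack_auc = roc_auc_score(y_te, stack.predict_proba(X_te)[:, 1])
print(stack_auc)
```

Out-of-fold meta-features matter most when the base models fit the training data almost perfectly, as deep tree ensembles often do.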

def choose_forest_model(X_train, y_train):    
    tuned_params = [{"n_estimators": [100, 150, 200], "max_features": [8, 15, 30], "max_depth":[4, 8, 10]}]
    begin_t = time.time()
    clf = GridSearchCV(RandomForestClassifier(random_state=7), tuned_params, scoring="roc_auc")
    clf.fit(X_train, y_train)
    end_t = time.time()
    print("train time: ", round(end_t-begin_t, 3), "s")
    print("current best parameters: ", clf.best_params_)
    return clf.best_estimator_
bst_forest = choose_forest_model(train_full, ytrain)
train time:  42.201 s
current best parameters:  {"max_features": 15, "n_estimators": 150, "max_depth": 8}

def choose_gradient_model(X_train, y_train):    
    tuned_params = [{"n_estimators": [100, 150, 200], "learning_rate": [0.03, 0.05, 0.07], 
                     "min_samples_leaf": [8, 15, 30], "max_depth":[4, 6, 8]}]
    begin_t = time.time()
    clf = GridSearchCV(GradientBoostingClassifier(random_state=7), tuned_params, scoring="roc_auc")
    clf.fit(X_train, y_train)
    end_t = time.time()
    print("train time: ", round(end_t-begin_t, 3), "s")
    print("current best parameters: ", clf.best_params_)
    return clf.best_estimator_
bst_gradient = choose_gradient_model(train_full, ytrain)
train time:  641.872 s
current best parameters:  {"n_estimators": 100, "learning_rate": 0.03, "max_depth": 8, "min_samples_leaf": 30}

def stacking_model2(X_train, X_test, y_train):    
    lr = linear_model.LogisticRegression(random_state=7)
    sclf = StackingClassifier(classifiers=[bst_xgb, bst_forest, bst_gradient, bst_lgb], use_probas=True, 
                              average_probas=False, meta_classifier=lr)
    sclf.fit(X_train, y_train)
    predictions = sclf.predict_proba(X_test)[:, 1]
    return predictions
dct_scores["stacking_2"], mean_score["stacking_2"], mean_time["stacking_2"] = kfold_plot(train_full, ytrain, stacking_model2)

mean scores:  0.92686550152
mean model process time:  4.0878 s

The four-model stacking (0.9269) is noticeably better than the two-model stacking (0.9216). Next, try aggregating the four models with voting

from sklearn.ensemble import VotingClassifier

def voting_model(X_train, X_test, y_train):    
    vclf = VotingClassifier(estimators=[("xgb", bst_xgb), ("rf", bst_forest), ("gbm",bst_gradient),
                                       ("lgb", bst_lgb)], voting="soft", weights=[2, 1, 1, 2])
    vclf.fit(X_train, y_train)
    predictions = vclf.predict_proba(X_test)[:, 1]
    return predictions
dct_scores["voting"], mean_score["voting"], mean_time["voting"] = kfold_plot(train_full, ytrain, voting_model)

mean scores:  0.926889564336
mean model process time:  4.055 s

Compare the scores of the single models and the ensemble models once more

plot_model_comp("Model Performance", "roc-auc score", mean_score)

The highest auc score is 0.926889564336 of model: voting

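As shown above, aggregating the four models with voting gives the highest score (0.92689), marginally ahead of the four-model stacking (0.92687), so it is chosen as the final model for prediction

Final prediction on the test file with the combined model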
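
With voting="soft", VotingClassifier averages the class probabilities of its members using the given weights. The arithmetic can be reproduced by hand, as in this small sketch. The probabilities are made-up numbers, not outputs of the models in this article; only the weights [2, 1, 1, 2] are taken from the code above.

```python
import numpy as np

# Hypothetical positive-class probabilities from four models for three bidders.
probas = np.array([
    [0.90, 0.20, 0.60],   # xgb
    [0.70, 0.10, 0.40],   # rf
    [0.80, 0.30, 0.50],   # gbm
    [0.95, 0.15, 0.70],   # lgb
])
weights = np.array([2, 1, 1, 2])  # xgb and lgb count double

# Weighted mean over the model axis: sum(w_i * p_i) / sum(w_i) for each bidder.
soft_vote = np.average(probas, axis=0, weights=weights)
print(soft_vote)
```

For the first bidder this is (2*0.90 + 0.70 + 0.80 + 2*0.95) / 6, roughly 0.867, so the two boosted models dominate the final probability.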
# predict(train_full, test_full, y_train)
def submit(X_train, X_test, y_train, test_ids):
    predictions = voting_model(X_train, X_test, y_train)

    sub = pd.read_csv("sampleSubmission.csv")
    result = pd.DataFrame()
    result["bidder_id"] = test_ids
    result["outcome"] = predictions
    sub = sub.merge(result, on="bidder_id", how="left")

    # Fill missing values with mean
    mean_pred = np.mean(predictions)
    sub.fillna(mean_pred, inplace=True)

    sub.drop("prediction", axis=1, inplace=True)
    sub.to_csv("result.csv", index=False, header=["bidder_id", "prediction"])
submit(train_full, test_full, ytrain, test_ids)

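The final result was submitted to Kaggle for scoring; the score is as follows

[Figure: Kaggle submission score]

That is the complete workflow. Of course, there are many other models to try and many other ensembling methods to use. The feature engineering part also still has a lot of room for improvement, which is left for you to explore.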


Changelog

Thanks to @Frank in the comments for the correction. The stacking error in the original article has been fixed, and some details such as the plots have been tidied up.
