

Machine Learning: Random Forest Study Notes



Preface

Random forest is a very powerful model: a group of decision trees vote to produce the final result. To understand random forests properly, you first need to understand decision trees, and then understand how a random forest improves on a single tree by ensembling many of them.

The purpose of this article is to gather in one place the materials I found useful while learning this model.

Decision Tree Basics

Key points of decision trees:

ID3: information gain
C4.5: information gain ratio
CART: Gini index
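
To make the three splitting criteria concrete, below is a minimal Python sketch (the helper names entropy, gini, information_gain and gain_ratio are mine, not from any library) that computes each measure for a toy binary split:

import numpy as np

def entropy(labels):
    # Shannon entropy of a label vector -- the basis of ID3's information gain
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gini(labels):
    # Gini impurity of a label vector -- the criterion used by CART
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def information_gain(labels, groups):
    # parent entropy minus the weighted entropy of the child groups (ID3)
    n = len(labels)
    weighted = sum(len(g) / n * entropy(g) for g in groups)
    return entropy(labels) - weighted

def gain_ratio(labels, groups):
    # information gain divided by the split's intrinsic information (C4.5)
    n = len(labels)
    split_info = -sum(len(g) / n * np.log2(len(g) / n) for g in groups if len(g) > 0)
    return information_gain(labels, groups) / split_info

# toy example: a binary label split into two groups by some feature
y = np.array([1, 1, 1, 0, 0, 0, 0, 1])
left, right = y[:4], y[4:]
print(entropy(y), gini(y))
print(information_gain(y, [left, right]), gain_ratio(y, [left, right]))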

Pros and cons of decision trees

From Programming Collective Intelligence:

Advantages:

The biggest advantage is that the resulting model is easy to interpret.

It accepts both categorical and numerical data, with no need for preprocessing or normalization.

It allows uncertain outcomes: when a leaf node contains several possible result values and cannot be split further, you can count them and estimate a probability.

Disadvantages:

The algorithm is effective for problems with only a few possible outcomes; on a dataset with many possible outcomes, the tree becomes extremely complex and its predictive performance may degrade considerably.

Although it can handle simple numerical data, it can only create nodes that split on "greater than / less than" conditions. If the class depends on a more complex combination of variables, classifying with a decision tree becomes difficult. For example, if the outcome is determined by the difference of two variables, the tree grows extremely large and its prediction accuracy drops off quickly.

In short, decision trees are best suited to datasets with clear split points, built from a mix of categorical and numerical data.

To check the book's claim that the tree becomes huge and its accuracy drops quickly when the outcome is determined by the difference of two variables, we can run the following experiment:

library(rpart)
library(rpart.plot)
library(dplyr)  # provides %>% and mutate()

# simulate two independent integer ages in [18, 30)
age1 <- as.integer(runif(1000, min=18, max=30))
age2 <- as.integer(runif(1000, min=18, max=30))

df <- data.frame(age1, age2)

# the label depends only on the difference between the two variables
df <- df %>% dplyr::mutate(diff = age1 - age2, label = diff >= 0 & diff <= 5)

ct <- rpart.control(xval=10, minsplit=20, cp=0.01)

# first tree: fit on the raw variables age1 and age2
cfit <- rpart(label ~ age1 + age2,
              data=df, method="class", control=ct,
              parms=list(split="gini")
)
print(cfit)


rpart.plot(cfit, branch=1, branch.type=2, type=1, extra=102,  
           shadow.col="gray", box.col="green",  
           border.col="blue", split.col="red",  
           split.cex=1.2, main="Decision Tree");  


# second tree: fit directly on the engineered feature diff
cfit <- rpart(label ~ diff,
              data=df, method="class", control=ct,
              parms=list(split="gini")
)
print(cfit)

rpart.plot(cfit, branch=1, branch.type=2, type=1, extra=102,  
           shadow.col="gray", box.col="green",  
           border.col="blue", split.col="red",  
           split.cex=1.2, main="Decision Tree");  

The decision tree obtained by predicting from age1 and age2 (screenshot below) is large and deeply nested:

The decision tree obtained by predicting from diff (screenshot below) needs only a couple of splits:

Random Forest Theory

From the sklearn official documentation:

Each tree in the ensemble is built from a sample drawn with replacement (bootstrap sample) from the training set. When splitting a node during the construction of the tree, the split that is chosen is no longer the best split among all features. Instead, the split that is picked is the best split among a random subset of the features.

As a result of this randomness, the bias of the forest usually slightly increases with respect to the bias of a single non-random tree, but, due to averaging, its variance also decreases, usually more than compensating for the increase in bias, hence yielding an overall better model.

In contrast to the original publication, the sklearn implementation combines classifiers by averaging their probabilistic prediction, instead of letting each classifier vote for a single class.
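
As a rough illustration of the two sources of randomness (bootstrap rows, random feature subsets) and of combining trees by averaging probabilities, here is a hand-rolled sketch. Note one simplification: sklearn re-draws the feature subset at every split, while this toy version fixes one subset per tree.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# purely illustrative data: the label depends on the first two features
X = rng.normal(size=(200, 5))
y = (X[:, 0] - X[:, 1] > 0).astype(int)

n_trees, max_features = 10, 2
trees, feature_subsets = [], []

for _ in range(n_trees):
    rows = rng.integers(0, len(X), size=len(X))                       # bootstrap sample (with replacement)
    cols = rng.choice(X.shape[1], size=max_features, replace=False)   # random feature subset for this tree
    trees.append(DecisionTreeClassifier().fit(X[rows][:, cols], y[rows]))
    feature_subsets.append(cols)

# combine the trees by averaging their probabilistic predictions (soft voting),
# which is what sklearn's RandomForestClassifier does
probas = np.mean(
    [t.predict_proba(X[:, cols]) for t, cols in zip(trees, feature_subsets)],
    axis=0,
)
print("training accuracy of the hand-rolled ensemble:", (probas.argmax(axis=1) == y).mean())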

Random Forest Implementation

from sklearn.ensemble import RandomForestClassifier

# minimal example: two samples, two features
X = [[0, 0], [1, 1]]
Y = [0, 1]
clf = RandomForestClassifier(n_estimators=10)  # a forest of 10 trees
clf = clf.fit(X, Y)

Parameter Tuning

From the sklearn website:

The two core parameters are n_estimators and max_features:

n_estimators: the number of trees in the forest

max_features: the size of the random subsets of features to consider when splitting a node. Default values: max_features=n_features for regression problems, and max_features=sqrt(n_features) for classification tasks.

Other parameters: Good results are often achieved when setting max_depth=None in combination with min_samples_split=1.

n_jobs=k: computations are partitioned into k jobs and run on k cores of the machine. If n_jobs=-1, all cores available on the machine are used.
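
Putting the tuning advice together, here is a hedged sketch using GridSearchCV over the two core parameters (the grid values and the synthetic dataset are arbitrary examples; note that recent sklearn versions require min_samples_split >= 2, so it is left at its default here):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# search over the two core parameters; "sqrt" is the classification default
param_grid = {
    "n_estimators": [50, 100, 200],
    "max_features": ["sqrt", 0.5, None],
}

search = GridSearchCV(
    RandomForestClassifier(max_depth=None, n_jobs=-1, random_state=0),
    param_grid,
    cv=5,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)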

Feature Importance Evaluation

From the sklearn official documentation:

The depth of a feature used as a decision node in a tree can be used to assess the relative importance of that feature with respect to the predictability of the target variable. Features used at the top of the tree contribute to the final prediction decision of a larger fraction of the input samples. The expected fraction of the samples they contribute to can thus be used as an estimate of the relative importance of the features.

By averaging those expected activity rates over several randomized trees one can reduce the variance of such an estimate and use it for feature selection.

In practice those estimates are stored as an attribute named feature_importances_ on the fitted model. This is an array with shape (n_features,) whose values are positive and sum to 1.0. The higher the value, the more important is the contribution of the matching feature to the prediction function.
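
For example, the attribute can be read directly off a fitted forest (toy data, purely illustrative):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# toy data: only 3 of the 8 features are informative
X, y = make_classification(n_samples=500, n_features=8, n_informative=3, random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# one non-negative value per feature, summing to 1.0
for i, imp in enumerate(forest.feature_importances_):
    print(f"feature {i}: {imp:.3f}")
print("sum:", forest.feature_importances_.sum())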

StackOverflow

You initialize an array feature_importances of all zeros with size n_features.

You traverse the tree: for each internal node that splits on feature i you compute the error reduction of that node multiplied by the number of samples that were routed to the node and add this quantity to feature_importances[i].

The error reduction depends on the impurity criterion that you use (e.g. Gini, Entropy). It's the impurity of the set of observations that gets routed to the internal node minus the sum of the impurities of the two partitions created by the split.
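
The procedure described above can be reproduced for a single sklearn tree by walking its low-level tree_ arrays (children_left, children_right, feature, impurity, weighted_n_node_samples). This is a sketch of the per-node impurity-decrease accumulation; after normalization it should match the tree's own feature_importances_:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=6, random_state=0)
tree = DecisionTreeClassifier(random_state=0).fit(X, y)
t = tree.tree_  # low-level tree structure

importances = np.zeros(X.shape[1])
for node in range(t.node_count):
    left, right = t.children_left[node], t.children_right[node]
    if left == -1:  # leaf node: no split, nothing to attribute
        continue
    # impurity decrease of this split, weighted by the samples routed to each node
    decrease = (
        t.weighted_n_node_samples[node] * t.impurity[node]
        - t.weighted_n_node_samples[left] * t.impurity[left]
        - t.weighted_n_node_samples[right] * t.impurity[right]
    )
    importances[t.feature[node]] += decrease

importances /= importances.sum()  # normalize so the values sum to 1
print(np.allclose(importances, tree.feature_importances_))  # expected: True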

About the author: 丹追兵 is a data analyst who works in Python and R and uses Spark, Hadoop, Storm and ODPS. This article comes from 丹追兵's pytrafficR column; when reposting, please credit the author and the source: https://segmentfault.com/blog...
