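The project shared here comes from a classic Kaggle competition: house price prediction. It is presented in two parts, data analysis and data mining. This post is the data analysis part.
Many factors affect house prices. The dataset for this competition has 79 variables that describe almost every aspect of residential homes in Ames, Iowa, and the task is to predict the final sale price of each home.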
Tech stack
Feature engineering (creative feature engineering)
Regression models (advanced regression techniques like random forest and gradient boosting)
Predict the price of each house: for every Id in the test set, give the corresponding value of the SalePrice variable.
Submission format

Id,SalePrice
1461,169000.1
1462,187724.1233
1463,175221
etc.

Data analysis

Data description
First, we load the data and take a look:
import numpy as np
import pandas as pd

train_df = pd.read_csv("./input/train.csv", index_col=0)  # use the Id column as the index
test_df = pd.read_csv("./input/test.csv", index_col=0)
train_df.head()
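We can see that there are 80 columns, that is, 79 features plus the target SalePrice.
Next, we merge the training set and the test set. This makes preprocessing more convenient, because the features of both sets are transformed into the same format. Once preprocessing is done, we split them apart again.
SalePrice is our training target, so it appears only in the training set and not in the test set. We therefore need to take this column out before merging. Before doing so, let's look at it first, that is, check its distribution.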
prices = pd.DataFrame({"price": train_df["SalePrice"], "log(price+1)": np.log1p(train_df["SalePrice"])})
prices.hist()  # one histogram per column: raw price and log-transformed price
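The label itself is not smoothly distributed, so to help the model learn more accurately we first smooth the label (bring it closer to a normal distribution). Here I use log1p, which is log(x+1). Note that because this step transforms the data, the predictions must be transformed back when computing the final result. The inverse of log1p() is expm1(), which is covered in more detail when it is used later.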
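The round trip between log1p() and expm1() can be sketched as follows. This is only a minimal illustration: the price values are hypothetical, not taken from the competition data, and skewness is used here as one rough indicator of how far a distribution is from symmetric.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed prices (illustrative values, not from the dataset).
prices = pd.Series([100000.0, 120000.0, 135000.0, 150000.0,
                    180000.0, 250000.0, 755000.0])

log_prices = np.log1p(prices)    # forward transform: log(x + 1)
restored = np.expm1(log_prices)  # inverse transform: exp(y) - 1

# Skewness closer to 0 indicates a more symmetric distribution.
print("skew before:", prices.skew())
print("skew after: ", log_prices.skew())
print("round trip ok:", np.allclose(prices, restored))
```

The same expm1() call would be applied to model predictions made in log space before writing them out as SalePrice values.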
Then we take this column out:
y_train = np.log1p(train_df.pop("SalePrice"))  # pop() also removes the column from train_df
y_train.head()
This gives:
Id
1    12.247699
2    12.109016
3    12.317171
4    11.849405
5    12.429220
Name: SalePrice, dtype: float64
At this point, y_train holds the log-transformed SalePrice column, and train_df no longer contains it.
Then we merge the two datasets:
df = pd.concat((train_df, test_df), axis=0)
Check the shape:
df.shape
(2919, 79)
df is the merged DataFrame.
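Splitting the merged frame apart again after preprocessing can be sketched as below. This is a hedged illustration, not the author's code: the two small frames are toy stand-ins with hypothetical values, pd.get_dummies is used only as an example of a shared preprocessing step, and the split relies on the assumption that the Id values of the training and test sets do not overlap.

```python
import pandas as pd

# Toy stand-ins for train_df / test_df (hypothetical values), indexed by Id.
train_df = pd.DataFrame(
    {"LotArea": [8450, 9600], "Street": ["Pave", "Pave"]},
    index=pd.Index([1, 2], name="Id"),
)
test_df = pd.DataFrame(
    {"LotArea": [11622, 14267, 13830], "Street": ["Pave", "Grvl", "Pave"]},
    index=pd.Index([1461, 1462, 1463], name="Id"),
)

df = pd.concat((train_df, test_df), axis=0)

# Example of a shared preprocessing step: one-hot encode the merged frame,
# so both parts end up with exactly the same columns.
df = pd.get_dummies(df)

# Split back using the original Id indexes (assumes the Ids are disjoint).
X_train = df.loc[train_df.index]
X_test = df.loc[test_df.index]
print(X_train.shape, X_test.shape)
```

Encoding the merged frame matters here because the category "Grvl" appears only in the toy test set; encoding the two sets separately would produce different columns.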
According to the description given by Kaggle, the features and their meanings are as follows:
SalePrice - the property's sale price in dollars. This is the target variable that you're trying to predict.
MSSubClass: The building class
MSZoning: The general zoning classification
LotFrontage: Linear feet of street connected to property
LotArea: Lot size in square feet
Street: Type of road access
Alley: Type of alley access
LotShape: General shape of property
LandContour: Flatness of the property
Utilities: Type of utilities available
LotConfig: Lot configuration
LandSlope: Slope of property
Neighborhood: Physical locations within Ames city limits
Condition1: Proximity to main road or railroad
Condition2: Proximity to main road or railroad (if a second is present)
BldgType: Type of dwelling
HouseStyle: Style of dwelling
OverallQual: Overall material and finish quality
OverallCond: Overall condition rating
YearBuilt: Original construction date
YearRemodAdd: Remodel date
RoofStyle: Type of roof
RoofMatl: Roof material
Exterior1st: Exterior covering on house
Exterior2nd: Exterior covering on house (if more than one material)
MasVnrType: Masonry veneer type
MasVnrArea: Masonry veneer area in square feet
ExterQual: Exterior material quality
ExterCond: Present condition of the material on the exterior
Foundation: Type of foundation
BsmtQual: Height of the basement
BsmtCond: General condition of the basement
BsmtExposure: Walkout or garden level basement walls
BsmtFinType1: Quality of basement finished area
BsmtFinSF1: Type 1 finished square feet
BsmtFinType2: Quality of second finished area (if present)
BsmtFinSF2: Type 2 finished square feet
BsmtUnfSF: Unfinished square feet of basement area
TotalBsmtSF: Total square feet of basement area
Heating: Type of heating
HeatingQC: Heating quality and condition
CentralAir: Central air conditioning
Electrical: Electrical system
1stFlrSF: First Floor square feet
2ndFlrSF: Second floor square feet
LowQualFinSF: Low quality finished square feet (all floors)
GrLivArea: Above grade (ground) living area square feet
BsmtFullBath: Basement full bathrooms
BsmtHalfBath: Basement half bathrooms
FullBath: Full bathrooms above grade
HalfBath: Half baths above grade
Bedroom: Number of bedrooms above basement level
Kitchen: Number of kitchens
KitchenQual: Kitchen quality
TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)
Functional: Home functionality rating
Fireplaces: Number of fireplaces
FireplaceQu: Fireplace quality
GarageType: Garage location
GarageYrBlt: Year garage was built
GarageFinish: Interior finish of the garage
GarageCars: Size of garage in car capacity
GarageArea: Size of garage in square feet
GarageQual: Garage quality
GarageCond: Garage condition
PavedDrive: Paved driveway
WoodDeckSF: Wood deck area in square feet
OpenPorchSF: Open porch area in square feet
EnclosedPorch: Enclosed porch area in square feet
3SsnPorch: Three season porch area in square feet
ScreenPorch: Screen porch area in square feet
PoolArea: Pool area in square feet
PoolQC: Pool quality
Fence: Fence quality
MiscFeature: Miscellaneous feature not covered in other categories
MiscVal: $Value of miscellaneous feature
MoSold: Month Sold
YrSold: Year Sold
SaleType: Type of sale
SaleCondition: Condition of sale
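Next, we analyze the features. The list above contains one target variable, SalePrice, and 79 features, which is a lot. This analysis step prepares for the feature engineering that follows.
Let's check which features have missing values: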
print(pd.isnull(df).sum())
This output is not easy to read, so we first look at the 10 features with the most missing values:
df.isnull().sum().sort_values(ascending=False).head(10)
To make this clearer, we use the missing rate to examine the missing values:
df_na = (df.isnull().sum() / len(df)) * 100  # missing rate per feature, in percent
df_na = df_na.drop(df_na[df_na == 0].index).sort_values(ascending=False)  # keep only features with missing values
missing_data = pd.DataFrame({"Missing rate (%)": df_na})
missing_data.head(10)
Visualize it:
import matplotlib.pyplot as plt
import seaborn as sns

f, ax = plt.subplots(figsize=(15, 12))
plt.xticks(rotation=90)  # rotate the feature names so they do not overlap
sns.barplot(x=df_na.index, y=df_na)
plt.xlabel("Features", fontsize=15)
plt.ylabel("Percent of missing values", fontsize=15)
plt.title("Percent missing data by feature", fontsize=15)
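We can see that PoolQC, MiscFeature, Alley, Fence, FireplaceQu and some other features have a large share of missing values, LotFrontage has a missing rate of 16.7%, and GarageType, GarageFinish, GarageQual and GarageCond have similar missing rates. Some of these features are categorical and some are numerical. How to handle their missing values is covered in the feature engineering part.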
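Since the handling differs for categorical and numerical features, it can help to label each incomplete column by kind. The sketch below is one possible way to do that. The helper name missing_by_kind and the toy values are hypothetical, and any non-numeric dtype is simply treated as categorical.

```python
import numpy as np
import pandas as pd


def missing_by_kind(frame: pd.DataFrame) -> pd.DataFrame:
    """Return the missing rate (%) and a coarse kind label for columns with gaps."""
    rate = frame.isnull().mean() * 100
    kind = frame.dtypes.map(
        lambda dt: "numerical" if pd.api.types.is_numeric_dtype(dt) else "category"
    )
    out = pd.DataFrame({"missing_rate": rate, "kind": kind})
    # Keep only columns that actually have missing values, worst first.
    return out[out["missing_rate"] > 0].sort_values("missing_rate", ascending=False)


# Toy frame with hypothetical values; column names follow the dataset.
toy = pd.DataFrame({
    "PoolQC": [None, None, None, "Gd"],
    "LotFrontage": [65.0, np.nan, 68.0, 60.0],
    "LotArea": [8450, 9600, 11250, 9550],
})
summary = missing_by_kind(toy)
print(summary)
```

On the real data the same call would be missing_by_kind(df), using the merged DataFrame from above.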
Finally, we run a correlation analysis on the features and look at the heatmap:
corrmat = train_df.select_dtypes(include="number").corr()  # correlation is only defined for numeric columns
plt.subplots(figsize=(15, 12))
sns.heatmap(corrmat, vmax=0.9, square=True)
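We can see that some features are highly correlated with each other, which can easily lead to overfitting, so they need to be removed. In the next post, the data mining part, we process these features and train the model.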
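Reading highly correlated pairs off a heatmap by eye is error-prone, so they can also be listed programmatically. The following is a minimal sketch under stated assumptions: the helper name high_corr_pairs, the 0.8 threshold, and the toy values are all hypothetical choices for illustration, not part of the original analysis.

```python
import numpy as np
import pandas as pd


def high_corr_pairs(frame: pd.DataFrame, threshold: float = 0.8) -> pd.Series:
    """List feature pairs whose absolute Pearson correlation exceeds threshold."""
    corr = frame.select_dtypes(include="number").corr().abs()
    # Keep only the upper triangle so each pair appears once
    # and the diagonal (self-correlation) is dropped.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    return pairs[pairs > threshold].sort_values(ascending=False)


# Toy frame with hypothetical values; column names follow the dataset.
toy = pd.DataFrame({
    "GarageCars": [1, 2, 2, 3, 3, 1],
    "GarageArea": [250, 480, 500, 720, 760, 260],
    "MoSold": [2, 7, 5, 7, 1, 11],
})
result = high_corr_pairs(toy)
print(result)
```

On the real data, high_corr_pairs(train_df) would return the candidate pairs from which one feature of each pair could be considered for removal.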
If you spot any shortcomings, corrections are welcome.
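The copyright of this article belongs to the author. Please do not repost it without permission. If this article violates any rules, you can contact the administrator to have it removed.
When reposting, please cite the address of this article: http://systransis.cn/yun/44981.html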