Kaggle Beginner Competition: House Price Prediction, Data Analysis Part

sarva, posted 2019-07-31 11:18 / 1396 reads

The project shared here comes from a classic Kaggle competition: house price prediction. It is presented in two parts, data analysis and data mining. This post is the data analysis part.

Understanding the competition

Competition overview

Many factors influence house prices. The dataset for this competition has 79 variables describing almost every aspect of residential homes in Ames, Iowa, and the task is to predict the final sale price.

Tech stack

Feature engineering (creative feature engineering)

Regression models (advanced regression techniques like random forest and gradient boosting)

Goal

Predict the price of every house: for each Id in the test set, give the corresponding value of the variable SalePrice.

Submission format

Id,SalePrice
1461,169000.1
1462,187724.1233
1463,175221
etc.
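As a minimal sketch of producing a file in this layout, assuming pandas is available and that the predictions are already on the original dollar scale. The ids, the prices, and the helper name make_submission are made up for illustration and are not part of the original post.

```python
import io

import pandas as pd


def make_submission(ids, predicted_prices):
    """Build a two-column frame in the Id,SalePrice layout (hypothetical helper)."""
    if len(ids) != len(predicted_prices):
        raise ValueError("ids and predicted_prices must have the same length")
    return pd.DataFrame({"Id": list(ids), "SalePrice": list(predicted_prices)})


# Made-up ids and predictions, for illustration only.
submission = make_submission([1461, 1462, 1463], [169000.1, 187724.1233, 175221.0])

# Write to an in-memory buffer here; pass a path such as "submission.csv" to save to disk.
buffer = io.StringIO()
submission.to_csv(buffer, index=False)  # index=False keeps only the two required columns
csv_text = buffer.getvalue()
print(csv_text)
```

Passing index=False matters because the default would add an extra unnamed index column that the expected format does not contain.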

Data analysis

Data description

First we load the data and take a look:

import numpy as np
import pandas as pd

# Use the first column (Id) as the index
train_df = pd.read_csv("./input/train.csv", index_col=0)
test_df = pd.read_csv("./input/test.csv", index_col=0)

train_df.head()

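We can see that there are 80 columns: 79 features plus the target SalePrice.

Next we merge the training set and the test set. This makes preprocessing more convenient, because the features of both sets are transformed into the same format. Once preprocessing is done, we split them apart again.

SalePrice is our training target, so it appears only in the training set and not in the test set. We therefore need to take this column out before merging. Before doing so, let us look at its distribution.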

# Compare the raw price distribution with its log(1 + x) transform
prices = pd.DataFrame({"price": train_df["SalePrice"], "log(price+1)": np.log1p(train_df["SalePrice"])})
prices.hist()

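The label itself is skewed rather than smooth. To help the regression model learn more accurately, we first smooth the label so that it is closer to a normal distribution. Here I use log1p, that is, log(x+1). Note that because the target is transformed in this step, the predicted values must be transformed back when computing the final result. The inverse of log1p() is expm1(), which is discussed in more detail when it is used later.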
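The effect of the transform and of its inverse can be checked on a few numbers. The following is only a sketch: the prices are made up (not taken from the competition data), and the variable names are chosen for illustration.

```python
import numpy as np
import pandas as pd

# Made-up, right-skewed prices for illustration only.
prices = pd.Series([34900, 129500, 163000, 214000, 755000], dtype=float)

log_prices = np.log1p(prices)    # log(x + 1)
restored = np.expm1(log_prices)  # exp(y) - 1, which undoes log1p

skew_before = prices.skew()
skew_after = log_prices.skew()
round_trip_ok = bool(np.allclose(restored, prices))

print("skew before:", skew_before)
print("skew after: ", skew_after)
print("round trip recovers the prices:", round_trip_ok)
```

On this sample the skewness of the transformed values is much closer to zero, and expm1 recovers the original prices up to floating-point error.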

Then we take this column out:

y_train = np.log1p(train_df.pop("SalePrice"))

y_train.head()

This gives:

Id
1    12.247699
2    12.109016
3    12.317171
4    11.849405
5    12.429220
Name: SalePrice, dtype: float64

At this point, y_train is the log-transformed SalePrice column.

Then we concatenate the two datasets:

df = pd.concat((train_df, test_df), axis=0)

Check the shape:

df.shape

(2919, 79)

df is the merged DataFrame.
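Splitting the merged frame apart again after preprocessing is not shown in this post. One possible way, sketched here on tiny made-up stand-ins for train_df and test_df, is to select rows by the original index labels. This assumes the Id values are unique across both sets. The names X_train and X_test are illustrative.

```python
import pandas as pd

# Tiny made-up stand-ins for train_df and test_df, indexed by Id.
train_df = pd.DataFrame(
    {"LotArea": [8450, 9600], "Street": ["Pave", "Pave"]},
    index=pd.Index([1, 2], name="Id"),
)
test_df = pd.DataFrame(
    {"LotArea": [11622, 14267, 13830], "Street": ["Pave", "Grvl", "Pave"]},
    index=pd.Index([1461, 1462, 1463], name="Id"),
)

df = pd.concat((train_df, test_df), axis=0)

# Placeholder for the shared preprocessing applied to both sets at once.
df["LotArea"] = df["LotArea"].astype(float)

# Split back by index label; this relies on the Ids not overlapping.
X_train = df.loc[train_df.index]
X_test = df.loc[test_df.index]
print(X_train.shape, X_test.shape)
```

Selecting by label rather than by row position keeps the split correct even if preprocessing reorders rows.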

Data preprocessing

According to the description provided by Kaggle, the features and their meanings are as follows:

SalePrice - the property's sale price in dollars. This is the target variable that you're trying to predict.
MSSubClass: The building class
MSZoning: The general zoning classification
LotFrontage: Linear feet of street connected to property
LotArea: Lot size in square feet
Street: Type of road access
Alley: Type of alley access
LotShape: General shape of property
LandContour: Flatness of the property
Utilities: Type of utilities available
LotConfig: Lot configuration
LandSlope: Slope of property
Neighborhood: Physical locations within Ames city limits
Condition1: Proximity to main road or railroad
Condition2: Proximity to main road or railroad (if a second is present)
BldgType: Type of dwelling
HouseStyle: Style of dwelling
OverallQual: Overall material and finish quality
OverallCond: Overall condition rating
YearBuilt: Original construction date
YearRemodAdd: Remodel date
RoofStyle: Type of roof
RoofMatl: Roof material
Exterior1st: Exterior covering on house
Exterior2nd: Exterior covering on house (if more than one material)
MasVnrType: Masonry veneer type
MasVnrArea: Masonry veneer area in square feet
ExterQual: Exterior material quality
ExterCond: Present condition of the material on the exterior
Foundation: Type of foundation
BsmtQual: Height of the basement
BsmtCond: General condition of the basement
BsmtExposure: Walkout or garden level basement walls
BsmtFinType1: Quality of basement finished area
BsmtFinSF1: Type 1 finished square feet
BsmtFinType2: Quality of second finished area (if present)
BsmtFinSF2: Type 2 finished square feet
BsmtUnfSF: Unfinished square feet of basement area
TotalBsmtSF: Total square feet of basement area
Heating: Type of heating
HeatingQC: Heating quality and condition
CentralAir: Central air conditioning
Electrical: Electrical system
1stFlrSF: First Floor square feet
2ndFlrSF: Second floor square feet
LowQualFinSF: Low quality finished square feet (all floors)
GrLivArea: Above grade (ground) living area square feet
BsmtFullBath: Basement full bathrooms
BsmtHalfBath: Basement half bathrooms
FullBath: Full bathrooms above grade
HalfBath: Half baths above grade
Bedroom: Number of bedrooms above basement level (column BedroomAbvGr in the CSV files)
Kitchen: Number of kitchens (column KitchenAbvGr in the CSV files)
KitchenQual: Kitchen quality
TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)
Functional: Home functionality rating
Fireplaces: Number of fireplaces
FireplaceQu: Fireplace quality
GarageType: Garage location
GarageYrBlt: Year garage was built
GarageFinish: Interior finish of the garage
GarageCars: Size of garage in car capacity
GarageArea: Size of garage in square feet
GarageQual: Garage quality
GarageCond: Garage condition
PavedDrive: Paved driveway
WoodDeckSF: Wood deck area in square feet
OpenPorchSF: Open porch area in square feet
EnclosedPorch: Enclosed porch area in square feet
3SsnPorch: Three season porch area in square feet
ScreenPorch: Screen porch area in square feet
PoolArea: Pool area in square feet
PoolQC: Pool quality
Fence: Fence quality
MiscFeature: Miscellaneous feature not covered in other categories
MiscVal: $Value of miscellaneous feature
MoSold: Month Sold
YrSold: Year Sold
SaleType: Type of sale
SaleCondition: Condition of sale

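Next we analyze the features. The list above contains one target variable, SalePrice, and 79 features, which is quite a lot. The feature analysis in this step prepares for the feature engineering that follows.

Let us check which features have missing values: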

print(pd.isnull(df).sum())

This output is not convenient to read, so we first look at the 10 features with the most missing values:

df.isnull().sum().sort_values(ascending=False).head(10)

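To make this clearer, we examine the missing rate instead:

# Percentage of missing values per feature, keeping only features that have any
df_na = (df.isnull().sum() / len(df)) * 100
df_na = df_na.drop(df_na[df_na == 0].index).sort_values(ascending=False)
missing_data = pd.DataFrame({"Missing rate (%)": df_na})
missing_data.head(10)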

Visualize it:

import matplotlib.pyplot as plt
import seaborn as sns

f, ax = plt.subplots(figsize=(15, 12))
plt.xticks(rotation=90)  # pass the angle as a number, not the string "90"
sns.barplot(x=df_na.index, y=df_na)
plt.xlabel("Features", fontsize=15)
plt.ylabel("Percent of missing values", fontsize=15)
plt.title("Percent missing data by feature", fontsize=15)

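We can see that PoolQC, MiscFeature, Alley, Fence, and FireplaceQu have a large share of missing values, LotFrontage has a missing rate of 16.7%, and GarageType, GarageFinish, GarageQual, and GarageCond have similar missing rates. Some of these features are categorical and some are numerical. How to handle their missing values is covered in the part on feature engineering.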
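Since the handling differs for categorical and numerical features, it can help to separate the columns with missing values by type first. The following is a sketch on a small made-up frame standing in for df. Treating every non-numeric column as categorical is an assumption made here, not something stated in the post.

```python
import numpy as np
import pandas as pd

# Small made-up frame standing in for the merged df.
df = pd.DataFrame({
    "PoolQC": [np.nan, np.nan, "Gd", np.nan],
    "LotFrontage": [65.0, np.nan, 68.0, 60.0],
    "GarageType": ["Attchd", np.nan, "Detchd", "Attchd"],
    "LotArea": [8450, 9600, 11250, 9550],
})

missing_counts = df.isnull().sum()
cols_with_na = missing_counts[missing_counts > 0].index

# Assumption: any column that is not numeric is treated as categorical.
numerical_na = [c for c in cols_with_na if pd.api.types.is_numeric_dtype(df[c])]
categorical_na = [c for c in cols_with_na if not pd.api.types.is_numeric_dtype(df[c])]

print("categorical with missing values:", categorical_na)
print("numerical with missing values:  ", numerical_na)
```

Note that a numerically coded category such as MSSubClass would land in the numerical group under this rule, so the result is a starting point rather than a final classification.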

Finally, we run a correlation analysis across the features and look at the heatmap:

# numeric_only=True skips the string-valued columns (needs pandas 1.5 or later)
corrmat = train_df.corr(numeric_only=True)
plt.subplots(figsize=(15, 12))
sns.heatmap(corrmat, vmax=0.9, square=True)

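We can see that some features are strongly correlated with each other. Such redundant features can contribute to overfitting, so some of them should be removed. In the next post, the data mining part, we process these features and train the model.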
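Reading strongly correlated pairs off a heatmap by eye is error-prone, so they can also be listed programmatically. This is a sketch on synthetic data: the column names are borrowed from the dataset, but the values are randomly generated, and the 0.8 threshold is an assumed cut-off rather than one used in the post.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
garage_cars = rng.integers(0, 4, size=n).astype(float)

# Synthetic data: GarageArea is built as a noisy linear function of GarageCars.
frame = pd.DataFrame({
    "GarageCars": garage_cars,
    "GarageArea": 250.0 * garage_cars + rng.normal(0, 20, size=n),
    "MoSold": rng.integers(1, 13, size=n).astype(float),
})

corr = frame.corr().abs()

# Keep only the strict upper triangle so that each pair appears once.
upper_mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
upper = corr.where(upper_mask)

threshold = 0.8  # assumed cut-off for "strongly correlated"
pairs = upper.stack()
high_pairs = pairs[pairs > threshold]
print(high_pairs)
```

For each pair found this way, one of the two features is a candidate for removal. Which one to drop is a modelling decision that this sketch does not make.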

If you spot any shortcomings, corrections are welcome.

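Copyright of this article belongs to the author. Please do not repost without permission. If this article violates any rules, you can contact the administrator to have it removed.

When reposting, please cite the address of this article: http://systransis.cn/yun/44981.html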
