Regression_Step 01_02

Abstract

Machine learning applications have spread from Natural Language Processing to the vast majority of business functions, one of which is real estate. While machine learning will not take over the tasks of real estate professionals, it can help people perform complicated tasks, especially those involving relatively large amounts of data.

Objectives

This project focuses on quantitative data science and mainly aims to achieve two goals. One is to predict housing prices based on multiple variables such as room size and building type. The other is to forecast the sales of a real estate company.

Regression_Step 02 Data Wrangling

Source of a data set

Source: Kaggle page https://www.kaggle.com/ruiqurm/lianjia

The dataset includes: URL, ID, Lng, Lat, community ID (Cid), trade time, DOM (days on market), followers, total price, price, square, number of living rooms, drawing rooms, kitchens and bathrooms, building type, construction time, renovation condition, building structure, ladder ratio (the proportion between the number of residents on the same floor and the number of elevators; it describes how many elevators a resident has access to on average), elevator, property rights for five years (related to China's restricted house-purchase policy), subway, district, and community average price.

All the data come from https://bj.lianjia.com/chengjiao.

'price' in this dataset appears to be the unit price, which should equal totalPrice/square.
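As a quick sanity check once the data is loaded, this relationship can be verified directly. Note that the scale factor is an assumption: the summary statistics later in this notebook suggest totalPrice is recorded in units of 10,000 while price is per square meter.

# Sketch: verify that price is roughly totalPrice / square (totalPrice assumed in 10k units)
implied_price = df['totalPrice'] * 10_000 / df['square']
print((implied_price - df['price']).abs().median())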

Category Assignment

According to this page, the category labels can be decoded as follows, but in this project we go ahead with the original encoded dataset.

(buildingType == 1 ~ "Tower", buildingType == 2 ~ "Bungalow", buildingType == 3 ~ "Plate/Tower", buildingType == 4 ~ "Plate")

(renovationCondition == 1 ~ "Other", renovationCondition == 2 ~ "Rough", renovationCondition == 3 ~ "Simplicity", renovationCondition == 4 ~ "Hardcover")

(buildingStructure == 1 ~ "Unavailable", buildingStructure == 2 ~ "Mixed", buildingStructure == 3 ~ "Brick/Wood", buildingStructure == 4 ~ "Brick/Concrete", buildingStructure == 5 ~ "Steel", buildingStructure == 6 ~ "Steel/Concrete")

(elevator == 1 ~ "Has_Elevator", elevator != 1 ~ "No_elevator")

(subway == 1 ~ "Has_Subway", subway != 1 ~ "No_Subway")

(fiveYearsProperty == 1 ~ "Ownership < 5y", fiveYearsProperty != 1 ~ "Ownership > 5y")

(district == 1 ~ "DongCheng", district == 2 ~ "FengTai", district == 3 ~ "DaXing", district == 4 ~ "FaXing", district == 5 ~ "FangShang", district == 6 ~ "ChangPing", district == 7 ~ "ChaoYang", district == 8 ~ "HaiDian", district == 9 ~ "ShiJingShan", district == 10 ~ "XiCheng", district == 11 ~ "TongZhou", district == 12 ~ "ShunYi", district == 13 ~ "MenTouGou")
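Although we keep the encoded values, a minimal pandas sketch of this decoding (shown for two of the columns, runnable once the data is loaded) would look like the following; the label dictionaries simply mirror the rules above.

# Sketch: decode two of the categorical columns back to readable labels
building_type_labels = {1: 'Tower', 2: 'Bungalow', 3: 'Plate/Tower', 4: 'Plate'}
renovation_labels = {1: 'Other', 2: 'Rough', 3: 'Simplicity', 4: 'Hardcover'}
decoded = df.assign(
    buildingType=df['buildingType'].map(building_type_labels),
    renovationCondition=df['renovationCondition'].map(renovation_labels),
)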

Introduction

This notebook is the data wrangling part of a self-directed project on housing prices in Beijing.

Imports

For data import, cleaning, and visualization, import pandas, numpy, scipy, random, and matplotlib.pyplot.

For machine learning, import LinearRegression from sklearn.

In [1]:
import warnings
warnings.filterwarnings('ignore')
In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from IPython.display import clear_output

import os
from library.sb_utils import save_file

Other libraries and functions will be imported as necessary.

Configuration

In [3]:
pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 100)
pd.options.display.float_format = '{:.4f}'.format

Objective

The goals of this project are as follows.

  • Goal 1: Prediction model for housing price
  • Goal 2: Forecast model for sales

In this notebook, we wrangle the data and make it ready for EDA, which will be the next step.

Load Data

The csv file of the raw data has already been downloaded from the Kaggle website. The file is located in the folder named raw_data.

The file includes a date column, 'tradeTime' (yyyy-mm-dd), which will be parsed as datetime type. As for the encoding, we go with 'iso-8859-1'.

Here, we load the file with the specifications mentioned above as df. We also keep the original dataframe, named original_df, to make checking and comparison easier.

In [4]:
df = pd.read_csv('../raw_data/Housing price in Beijing.csv',encoding='iso-8859-1',parse_dates = ['tradeTime'])
original_df = df.copy()
df.sort_values('tradeTime',inplace=True)

read_csv warns that some columns contain mixed data types; we will keep this in mind when cleaning.

Explore The Data for cleaning

Display general information with info method to check #rows, column names, Null value, Dtype.

In [5]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 318851 entries, 94502 to 90666
Data columns (total 26 columns):
 #   Column               Non-Null Count   Dtype         
---  ------               --------------   -----         
 0   url                  318851 non-null  object        
 1   id                   318851 non-null  object        
 2   Lng                  318851 non-null  float64       
 3   Lat                  318851 non-null  float64       
 4   Cid                  318851 non-null  int64         
 5   tradeTime            318851 non-null  datetime64[ns]
 6   DOM                  160874 non-null  float64       
 7   followers            318851 non-null  int64         
 8   totalPrice           318851 non-null  float64       
 9   price                318851 non-null  int64         
 10  square               318851 non-null  float64       
 11  livingRoom           318851 non-null  object        
 12  drawingRoom          318851 non-null  object        
 13  kitchen              318851 non-null  int64         
 14  bathRoom             318851 non-null  object        
 15  floor                318851 non-null  object        
 16  buildingType         316830 non-null  float64       
 17  constructionTime     318851 non-null  object        
 18  renovationCondition  318851 non-null  int64         
 19  buildingStructure    318851 non-null  int64         
 20  ladderRatio          318851 non-null  float64       
 21  elevator             318819 non-null  float64       
 22  fiveYearsProperty    318819 non-null  float64       
 23  subway               318819 non-null  float64       
 24  district             318851 non-null  int64         
 25  communityAverage     318388 non-null  float64       
dtypes: datetime64[ns](1), float64(11), int64(7), object(7)
memory usage: 65.7+ MB

Missing values are found in columns 6 (DOM), 16 (buildingType), 21-23 (elevator, fiveYearsProperty, subway), and 25 (communityAverage). These will be fixed later.

In [6]:
df.describe().T
Out[6]:
count mean std min 25% 50% 75% max
Lng 318851.0000 116.4185 0.1121 116.0725 116.3450 116.4168 116.4776 116.7324
Lat 318851.0000 39.9496 0.0920 39.6270 39.8932 39.9345 40.0030 40.2528
Cid 318851.0000 1129113334722.7512 2363446816043.5889 1111027373683.0000 1111027376255.0000 1111027378407.0000 1111027380579.0000 1114619720585020.0000
DOM 160874.0000 28.8223 50.2373 1.0000 1.0000 6.0000 37.0000 1677.0000
followers 318851.0000 16.7315 34.2092 0.0000 0.0000 5.0000 18.0000 1143.0000
totalPrice 318851.0000 349.0302 230.7808 0.1000 205.0000 294.0000 425.5000 18130.0000
price 318851.0000 43530.4364 21709.0242 1.0000 28050.0000 38737.0000 53819.5000 156250.0000
square 318851.0000 83.2406 37.2347 6.9000 57.9000 74.2600 98.7100 1745.5000
kitchen 318851.0000 0.9946 0.1096 0.0000 1.0000 1.0000 1.0000 4.0000
buildingType 316830.0000 3.0098 1.2699 0.0480 1.0000 4.0000 4.0000 4.0000
renovationCondition 318851.0000 2.6063 1.3117 0.0000 1.0000 3.0000 4.0000 4.0000
buildingStructure 318851.0000 4.4510 1.9018 0.0000 2.0000 6.0000 6.0000 6.0000
ladderRatio 318851.0000 63.1649 25068.5061 0.0000 0.2500 0.3330 0.5000 10009400.0000
elevator 318819.0000 0.5771 0.4940 0.0000 0.0000 1.0000 1.0000 1.0000
fiveYearsProperty 318819.0000 0.6456 0.4783 0.0000 0.0000 1.0000 1.0000 1.0000
subway 318819.0000 0.6011 0.4897 0.0000 0.0000 1.0000 1.0000 1.0000
district 318851.0000 6.7636 2.8126 1.0000 6.0000 7.0000 8.0000 13.0000
communityAverage 318388.0000 63682.4463 22329.2154 10847.0000 46339.0000 59015.0000 75950.0000 183109.0000

Check the first 5 rows to visually inspect the data.

In [7]:
df.head()
Out[7]:
url id Lng Lat Cid tradeTime DOM followers totalPrice price square livingRoom drawingRoom kitchen bathRoom floor buildingType constructionTime renovationCondition buildingStructure ladderRatio elevator fiveYearsProperty subway district communityAverage
94502 https://bj.lianjia.com/chengjiao/BJCP84309432.... BJCP84309432 116.3866 40.0865 1111027382546 2002-06-01 NaN 0 152.5000 17053 89.4300 2 2 1 1 ÖÐ 9 4.0000 2009 1 6 0.3330 1.0000 0.0000 0.0000 6 47574.0000
127021 https://bj.lianjia.com/chengjiao/BJCY84594947.... BJCY84594947 116.4823 40.0142 1111027380302 2002-07-06 NaN 0 325.0000 27041 120.1900 3 1 1 1 ¸ß 26 1.0000 2002 1 6 0.2500 1.0000 1.0000 1.0000 7 65595.0000
126849 https://bj.lianjia.com/chengjiao/BJCY84582560.... BJCY84582560 116.6035 39.9197 1111027377416 2002-07-10 NaN 0 186.0000 14210 130.9000 3 1 1 2 ¸ß 21 1.0000 2001 1 6 0.3330 1.0000 0.0000 1.0000 7 43307.0000
223738 https://bj.lianjia.com/chengjiao/BJFT85137402.... BJFT85137402 116.2992 39.8381 1111027380071 2003-07-31 NaN 0 160.7000 26923 59.6900 2 1 1 1 ÖÐ 28 1.0000 2005 1 6 0.9090 1.0000 0.0000 1.0000 2 48853.0000
93197 https://bj.lianjia.com/chengjiao/BJCP00426371.... BJCP00426371 116.3302 40.0742 1111027381003 2008-10-07 NaN 0 126.0000 22195 56.7700 1 1 1 1 ¸ß 20 3.0000 δ֪ 1 6 0.3330 1.0000 0.0000 1.0000 6 50205.0000

All the data look fine except the floor column, which holds two pieces of information: a leading part and a trailing part. The leading part is unreadable (mis-encoded characters), but it can still be used as a category. The trailing part, a number, looks fine.

Data Cleaning

Overview

Based on df.info(), some problems have been found. In this section, we will deal with the errors one by one.

To make comparisons easier, extract the year and month into separate columns.

In [8]:
df['year'] = df.tradeTime.apply(lambda x: x.year)
df['month'] = df.tradeTime.apply(lambda x: x.month)
df.sort_values('tradeTime',inplace=True)
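As a side note, since tradeTime is already datetime64, pandas' vectorized .dt accessor is an equivalent and faster alternative to apply here:

# Equivalent vectorized form
df['year'] = df['tradeTime'].dt.year
df['month'] = df['tradeTime'].dt.month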

'bathRoom' column: Compare correct and wrong data

The first error found is data placed in the wrong columns. For example, the bathRoom column contains some unreasonable values.

In [9]:
df.bathRoom.unique()
Out[9]:
array([1, 2, 2006, 2003, 1990, 2000, '1', 1996, 2005, 4, 3, '2', '0', '3',
       0, 5, 2011, 2004, 6, 1994, 'δ֪', '2003', '4', '5', '7', 7],
      dtype=object)

Now we explore what is going wrong in the 'bathRoom' column. As we can see, it contains values such as 2006 and 2003, which look like years. However, such values are supposed to appear only in the constructionTime column.

To see the details, we compare correct data (samples with bathRoom == 6 or 7) against wrong data (bathRoom == 1990).

In [10]:
df[(df['bathRoom']==1990)|(df['bathRoom']==6)|(df['bathRoom']==7)].head(4)
Out[10]:
url id Lng Lat Cid tradeTime DOM followers totalPrice price square livingRoom drawingRoom kitchen bathRoom floor buildingType constructionTime renovationCondition buildingStructure ladderRatio elevator fiveYearsProperty subway district communityAverage year month
92304 https://bj.lianjia.com/chengjiao/BJ0000615229.... BJ0000615229 116.1170 39.9339 1111052073923 2010-04-17 1.0000 0 90.0000 14288 62.9900 #NAME? ÖÐ 6 4 1990 »ìºÏ½á¹¹ 0.3330 0 0 0 0.0000 NaN NaN NaN 12 39381.0000 2010 4
92356 https://bj.lianjia.com/chengjiao/BJ0000615469.... BJ0000615469 116.1170 39.9339 1111052073923 2010-10-21 1.0000 0 126.8000 15849 80.0100 #NAME? µÍ 6 4 1990 »ìºÏ½á¹¹ 0.3330 0 0 0 0.0000 NaN NaN NaN 12 39381.0000 2010 10
124056 https://bj.lianjia.com/chengjiao/BJCY84338384.... BJCY84338384 116.3805 40.0252 1111042784632 2012-02-27 1.0000 0 698.3000 25300 276.0000 4 2 2 6 ¶¥ 18 4.0000 2012 1 6 1.0000 1.0000 0.0000 1.0000 7 90147.0000 2012 2
93008 https://bj.lianjia.com/chengjiao/BJCP00382717.... BJCP00382717 116.3036 40.0909 1111027377758 2012-12-24 NaN 0 1140.0000 45950 248.1000 7 2 1 6 µ× 3 NaN 2003 1 4 0.5000 0.0000 0.0000 0.0000 6 56117.0000 2012 12

We also found that the rows with wrong data in bathRoom contain another piece of wrong data: '#NAME?' in the livingRoom column.

Now we check the rows with '#NAME?' in the livingRoom column.

In [11]:
df[df.livingRoom=='#NAME?'].head()
Out[11]:
url id Lng Lat Cid tradeTime DOM followers totalPrice price square livingRoom drawingRoom kitchen bathRoom floor buildingType constructionTime renovationCondition buildingStructure ladderRatio elevator fiveYearsProperty subway district communityAverage year month
92235 https://bj.lianjia.com/chengjiao/BJ0000615040.... BJ0000615040 116.1177 39.9412 1111052110644 2010-02-21 1.0000 0 198.0000 15729 125.8900 #NAME? ÖÐ 14 3 2006 ¸Ö»ì½á¹¹ 0.5000 1 0 0 0.0000 NaN NaN NaN 12 39492.0000 2010 2
92251 https://bj.lianjia.com/chengjiao/BJ0000615084.... BJ0000615084 116.1220 39.9398 1111027375862 2010-03-07 1.0000 0 136.0000 14601 93.1500 #NAME? ÖÐ 15 4 2003 ¸Ö»ì½á¹¹ 0.3330 1 0 0 0.0000 NaN NaN NaN 12 37588.0000 2010 3
92267 https://bj.lianjia.com/chengjiao/BJ0000615132.... BJ0000615132 116.1184 39.9379 1111043529456 2010-03-22 1.0000 0 127.0000 14613 86.9100 #NAME? ÖÐ 16 4 2006 ¸Ö»ì½á¹¹ 0.3330 1 0 0 0.0000 NaN NaN NaN 12 38753.0000 2010 3
92270 https://bj.lianjia.com/chengjiao/BJ0000615138.... BJ0000615138 116.1113 39.9499 1111027377794 2010-03-23 1.0000 0 115.0000 14240 80.7600 #NAME? ÖÐ 6 4 2003 »ìºÏ½á¹¹ 0.5000 0 0 0 0.0000 NaN NaN NaN 12 36923.0000 2010 3
92297 https://bj.lianjia.com/chengjiao/BJ0000615208.... BJ0000615208 116.1177 39.9412 1111052110644 2010-04-10 1.0000 0 100.0000 17242 58.0000 #NAME? ¸ß 14 3 2006 ¸Ö»ì½á¹¹ 0.1250 1 0 0 0.0000 NaN NaN NaN 12 39492.0000 2010 4
In [12]:
len(df[df.livingRoom=='#NAME?'])
Out[12]:
32

Based on this result, we found 32 rows with livingRoom=='#NAME?', and also found that not only bathRoom but all columns up to ladderRatio hold data that belongs in the column three positions to the right. One possible fix follows from this observation: for rows whose 'bathRoom' value is a year, or anything other than an integer below 100 (no house has 100 bathrooms), shift the misplaced data horizontally by three columns to the right, from livingRoom through ladderRatio.
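For illustration, a rough sketch of that shift repair is below. It is not applied in this notebook; it assumes the misaligned block runs contiguously from livingRoom to ladderRatio and would still leave NaNs in the first three columns of the block.

# Sketch only: shift the misaligned block right by 3 columns for the bad rows
cols = df.loc[:, 'livingRoom':'ladderRatio'].columns
bad = df['livingRoom'] == '#NAME?'
df.loc[bad, cols] = df.loc[bad, cols].shift(3, axis=1)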

Another problem is that the original data in buildingType actually seems to belong in ladderRatio, which is FOUR columns to the right.

Although there may be some way to save these 32 rows, it would be hard to determine the correct column to shift each wrong value into. In addition, 32 rows are negligible compared to the 318,851 rows in total, and it would be difficult to fill the missing values that shifting would create. Our conclusion is to drop the 32 rows with livingRoom=='#NAME?'.

In [13]:
df = df[df.livingRoom!='#NAME?']

Separate Categorical feature

The original 'floor' column holds two features in one: a floor type and a floor height. Separate them into two new columns.

In [14]:
df['floor'].unique()
Out[14]:
array(['ÖÐ 9', '¸ß 26', '¸ß 21', 'ÖÐ 28', '¸ß 20', 'ÖÐ 6', '¸ß 12',
       '¸ß 6', 'µ× 6', 'ÖÐ 11', 'ÖÐ 16', 'µÍ 6', 'ÖÐ 14', 'µÍ 14', '¶¥ 6',
       '¸ß 15', 'µÍ 13', '¸ß 14', '¶¥ 16', 'µÍ 16', 'ÖÐ 13', '¶¥ 15',
       'µÍ 18', 'µ× 18', 'µÍ 12', 'µÍ 15', '¶¥ 11', 'ÖÐ 15', '¶¥ 5',
       '¸ß 16', '¶¥ 13', 'µ× 12', 'ÖÐ 5', 'ÖÐ 12', 'ÖÐ 18', '¸ß 13',
       '¸ß 28', '¶¥ 12', 'µ× 5', 'ÖÐ 31', 'µ× 7', 'ÖÐ 29', 'ÖÐ 20',
       '¶¥ 10', '¸ß 19', 'ÖÐ 24', 'µÍ 17', 'ÖÐ 7', '¸ß 24', '¸ß 11',
       'ÖÐ 26', 'ÖÐ 25', '¸ß 23', 'δ֪ 6', '¸ß 18', 'µÍ 21', 'µÍ 28',
       '¶¥ 14', 'µÍ 24', 'ÖÐ 21', 'µÍ 22', '¸ß 17', 'δ֪ 5', 'ÖÐ 8',
       'µÍ 29', 'µ× 32', '¸ß 7', 'ÖÐ 4', 'ÖÐ 27', 'µÍ 11', 'µ× 10',
       'ÖÐ 17', 'µÍ 19', 'µÍ 7', 'ÖÐ 22', 'µÍ 32', '¶¥ 9', 'µÍ 10',
       '¸ß 32', '¸ß 9', 'µÍ 26', '¶¥ 28', 'ÖÐ 23', 'µ× 13', 'µÍ 30',
       'ÖÐ 32', '¶¥ 18', 'ÖÐ 19', 'δ֪ 21', 'µ× 16', 'µÍ 9', '¸ß 29',
       '¸ß 10', 'ÖÐ 30', 'µ× 3', 'µ× 4', '¸ß 22', '¸ß 33', 'µÍ 20',
       '¸ß 8', '¸ß 27', '¸ß 25', '¶¥ 7', 'ÖÐ 10', '¶¥ 22', '¶¥ 21',
       'µ× 17', '¶¥ 19', 'µÍ 27', '¶¥ 4', 'µ× 14', 'δ֪ 16', 'δ֪ 20',
       'δ֪ 27', 'µ× 11', 'µ× 20', '¶¥ 20', 'µÍ 25', 'µ× 28', 'µ× 9',
       'µÍ 31', '¸ß 31', 'µÍ 23', 'µ× 21', 'δ֪ 25', '¶¥ 25', '¸ß 30',
       '¶¥ 24', 'δ֪ 18', 'µÍ 8', 'µ× 1', 'µÍ 3', 'δ֪ 23', 'δ֪ 12',
       'µ× 2', 'δ֪ 26', 'δ֪ 9', '¶¥ 3', '¸ß 34', 'δ֪ 22', 'δ֪ 7',
       '¶¥ 2', 'µ× 25', '¶¥ 30', '¶¥ 8', 'µÍ 34', 'δ֪ 17', 'µ× 30',
       'µ× 22', 'µ× 26', 'µ× 24', '¶¥ 17', 'µ× 15', 'µ× 8', 'ÖÐ 33',
       'δ֪ 15', 'δ֪ 4', 'ÖÐ 37', 'µ× 29', '¶¥ 32', 'µ× 19', 'µ× 23',
       'µÍ 33', 'δ֪ 11', '¶¥ 26', '¶¥ 34', 'µ× 27', 'δ֪ 10', '¶¥ 23',
       'δ֪ 14', '¶¥ 27', '¶¥ 29', 'δ֪ 19', 'ÖÐ 34', 'δ֪ 8', '¶¥ 33',
       'µ× 33', '¶¥ 31', 'δ֪ 28', 'µÍ 37', 'µÍ 42', 'µ× 31', 'δ֪ 24',
       'δ֪ 3', 'δ֪ 30', 'δ֪ 31', 'ÖÐ 36', '¸ß 37', 'ÖÐ 57',
       'δ֪ 13', '¸ß 36', 'δ֪ 29', 'ÖÐ 42', 'µÍ 5', 'µÍ 2', '¸ß 42',
       'µÍ 36', 'ÖÐ 35', 'µÍ 35', 'µÍ 63', 'µ× 34'], dtype=object)
In [15]:
def floorType(s):
    # Leading token: the (mis-encoded) floor-type category
    try:
        return s.split(' ')[0]
    except:
        return np.nan

def floorHeight(s):
    # Trailing token: the number of floors, as an integer
    try:
        return int(s.split(' ')[1])
    except:
        return np.nan
In [16]:
df['floorType'] = df['floor'].map(floorType)
df['floorHeight'] = df['floor'].map(floorHeight)
In [17]:
df.head()
Out[17]:
url id Lng Lat Cid tradeTime DOM followers totalPrice price square livingRoom drawingRoom kitchen bathRoom floor buildingType constructionTime renovationCondition buildingStructure ladderRatio elevator fiveYearsProperty subway district communityAverage year month floorType floorHeight
94502 https://bj.lianjia.com/chengjiao/BJCP84309432.... BJCP84309432 116.3866 40.0865 1111027382546 2002-06-01 NaN 0 152.5000 17053 89.4300 2 2 1 1 ÖÐ 9 4.0000 2009 1 6 0.3330 1.0000 0.0000 0.0000 6 47574.0000 2002 6 ÖÐ 9
127021 https://bj.lianjia.com/chengjiao/BJCY84594947.... BJCY84594947 116.4823 40.0142 1111027380302 2002-07-06 NaN 0 325.0000 27041 120.1900 3 1 1 1 ¸ß 26 1.0000 2002 1 6 0.2500 1.0000 1.0000 1.0000 7 65595.0000 2002 7 ¸ß 26
126849 https://bj.lianjia.com/chengjiao/BJCY84582560.... BJCY84582560 116.6035 39.9197 1111027377416 2002-07-10 NaN 0 186.0000 14210 130.9000 3 1 1 2 ¸ß 21 1.0000 2001 1 6 0.3330 1.0000 0.0000 1.0000 7 43307.0000 2002 7 ¸ß 21
223738 https://bj.lianjia.com/chengjiao/BJFT85137402.... BJFT85137402 116.2992 39.8381 1111027380071 2003-07-31 NaN 0 160.7000 26923 59.6900 2 1 1 1 ÖÐ 28 1.0000 2005 1 6 0.9090 1.0000 0.0000 1.0000 2 48853.0000 2003 7 ÖÐ 28
93197 https://bj.lianjia.com/chengjiao/BJCP00426371.... BJCP00426371 116.3302 40.0742 1111027381003 2008-10-07 NaN 0 126.0000 22195 56.7700 1 1 1 1 ¸ß 20 3.0000 δ֪ 1 6 0.3330 1.0000 0.0000 1.0000 6 50205.0000 2008 10 ¸ß 20

Confirmed that floorType and floorHeight have been created, holding categorical and numerical values respectively.

In [18]:
df.drop('floor',axis=1,inplace=True)

Data type in bathRoom

In [19]:
df.bathRoom.unique()
Out[19]:
array([1, 2, '1', 4, 3, '2', '0', '3', 0, 5, 6, '4', '5', '7', 7],
      dtype=object)

Columns 11, 12, and 14 (livingRoom, drawingRoom, bathRoom) hold object-type data with mixed types. We will convert them to int.

In [20]:
df.iloc[:,[11,12,14]] = df.iloc[:,[11,12,14]].applymap(int)
In [21]:
# Confirm the result
df.bathRoom.unique()
Out[21]:
array([1, 2, 4, 3, 0, 5, 6, 7])

Data type in floorType

In [22]:
# Check the current categorical values in floorType
df.floorType.unique()
Out[22]:
array(['ÖÐ', '¸ß', 'µ×', 'µÍ', '¶¥', 'δ֪'], dtype=object)
In [23]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
df.floorType = le.fit_transform(df.floorType)
In [24]:
# Confirm the new categorical values
df.floorType.unique()
Out[24]:
array([5, 3, 1, 0, 2, 4])
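Since the encoded values are no longer self-explanatory, it is worth keeping the label mapping from the fitted encoder; a small sketch:

# Recover the encoder's label-to-code mapping for later reference
floor_type_map = dict(zip(le.classes_, le.transform(le.classes_)))
print(floor_type_map)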

Data type in constructionTime

The datatype of 'constructionTime' should be a year or a time.

In [25]:
df.constructionTime.unique()
Out[25]:
array(['2009', '2002', '2001', '2005', 'δ֪', '1992', '2008', '1998',
       '2003', '2006', '1990', '1991', '2000', '1996', '2004', '1995',
       '1987', '1997', '1980', '1975', '1989', '2007', '2014', '1984',
       '1993', '2010', '1999', '1981', '1988', '1982', '1963', '1977',
       '1983', '1994', '1985', '1955', '1986', '2011', '1967', '1979',
       '1978', '2012', '1960', '1976', '1970', '1954', '1974', '1972',
       '1965', '1956', '1958', '2013', '1973', '2015', '1959', '1966',
       '1964', '1961', '1944', '1953', '1962', '1957', '1952', '1971',
       '1950', '1906', '1968', '1969', '2016', '1934', '1914', '1933'],
      dtype=object)
In [26]:
len(df[df.constructionTime=='δ֪']),df[df.constructionTime=='δ֪']['tradeTime'].apply(lambda x: x.year).unique()
Out[26]:
(19283, array([2008, 2011, 2012, 2013, 2014, 2015, 2016, 2017]))

One unreadable string, 'δ֪', has been found, with a count of 19283. Those rows have relatively recent trade years.

Our hypothesis is that 'δ֪' means a newly built house.

Now we check the rows without 'δ֪' to compare the construction year against the tradeTime year.

In [27]:
constIsLater = df[df.constructionTime!='δ֪'].constructionTime.astype(int) > df[df.constructionTime!='δ֪']['tradeTime'].apply(lambda x: x.year)
constIsSame = df[df.constructionTime!='δ֪'].constructionTime.astype(int) == df[df.constructionTime!='δ֪']['tradeTime'].apply(lambda x: x.year)
constIsEarlier = df[df.constructionTime!='δ֪'].constructionTime.astype(int) < df[df.constructionTime!='δ֪']['tradeTime'].apply(lambda x: x.year)
print(sum(constIsLater),
      sum(constIsSame),
      sum(constIsEarlier)
     )
110 330 299096

Based on this comparison, we conclude our hypothesis is reasonable: construction almost never postdates the trade year.

Now we replace 'δ֪' with the tradeTime year.

In [28]:
def _const_correction(cols):
    # cols holds [constructionTime, year]; replace the unknown marker with the trade year
    construction_year = cols[0]
    if construction_year=='δ֪':
        return int(cols[1])
    else:
        return int(cols[0])
In [29]:
df['constructionTime']= df[['constructionTime','year']].apply(_const_correction,axis=1)
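For reference, a vectorized alternative is sketched below. It is shown for illustration only, since it assumes constructionTime is still the raw object column.

# Sketch: vectorized replacement of the unknown marker with the trade year
df['constructionTime'] = (
    df['constructionTime']
    .mask(df['constructionTime'] == 'δ֪', df['year'])
    .astype(int)
)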

Final check for data type

In [30]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 318819 entries, 94502 to 90666
Data columns (total 29 columns):
 #   Column               Non-Null Count   Dtype         
---  ------               --------------   -----         
 0   url                  318819 non-null  object        
 1   id                   318819 non-null  object        
 2   Lng                  318819 non-null  float64       
 3   Lat                  318819 non-null  float64       
 4   Cid                  318819 non-null  int64         
 5   tradeTime            318819 non-null  datetime64[ns]
 6   DOM                  160849 non-null  float64       
 7   followers            318819 non-null  int64         
 8   totalPrice           318819 non-null  float64       
 9   price                318819 non-null  int64         
 10  square               318819 non-null  float64       
 11  livingRoom           318819 non-null  int64         
 12  drawingRoom          318819 non-null  int64         
 13  kitchen              318819 non-null  int64         
 14  bathRoom             318819 non-null  int64         
 15  buildingType         316798 non-null  float64       
 16  constructionTime     318819 non-null  int64         
 17  renovationCondition  318819 non-null  int64         
 18  buildingStructure    318819 non-null  int64         
 19  ladderRatio          318819 non-null  float64       
 20  elevator             318819 non-null  float64       
 21  fiveYearsProperty    318819 non-null  float64       
 22  subway               318819 non-null  float64       
 23  district             318819 non-null  int64         
 24  communityAverage     318356 non-null  float64       
 25  year                 318819 non-null  int64         
 26  month                318819 non-null  int64         
 27  floorType            318819 non-null  int64         
 28  floorHeight          318819 non-null  int64         
dtypes: datetime64[ns](1), float64(11), int64(15), object(2)
memory usage: 73.0+ MB

Null values

The major options are as follows; each is sketched in code below.

  1. Drop the rows with nulls
  2. Impute a null cell (missing value) with a value
  3. Drop the columns with nulls
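In pandas, each option is essentially a one-liner; a minimal sketch using DOM as the example column:

# 1. Drop rows with nulls in a given column
df_rows_dropped = df.dropna(subset=['DOM'])
# 2. Impute nulls with a value (here the column median, as an example)
df_imputed = df.fillna({'DOM': df['DOM'].median()})
# 3. Drop the column entirely
df_col_dropped = df.drop(columns=['DOM'])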

Visualise and quantify the NaN values

In [31]:
import missingno as msno
import matplotlib.pyplot as plt

print(df.isnull().sum()/len(df)*100)
# sns.heatmap(df.isnull(),yticklabels=False,cbar=False,cmap='viridis')

# Visualize missingness
msno.matrix(df)
plt.show()
url                    0.0000
id                     0.0000
Lng                    0.0000
Lat                    0.0000
Cid                    0.0000
tradeTime              0.0000
DOM                   49.5485
followers              0.0000
totalPrice             0.0000
price                  0.0000
square                 0.0000
livingRoom             0.0000
drawingRoom            0.0000
kitchen                0.0000
bathRoom               0.0000
buildingType           0.6339
constructionTime       0.0000
renovationCondition    0.0000
buildingStructure      0.0000
ladderRatio            0.0000
elevator               0.0000
fiveYearsProperty      0.0000
subway                 0.0000
district               0.0000
communityAverage       0.1452
year                   0.0000
month                  0.0000
floorType              0.0000
floorHeight            0.0000
dtype: float64


Based on this information, three columns have null data, and the missing percentage in the DOM column is especially large at 49.55%. In the following sections, we handle these missing data column by column.

Null in 'buildingType'

Replace the null data in this column with valid data.

In [32]:
df['buildingType'].isnull().sum()
Out[32]:
2021

2,021 missing values is quite a lot.

In [33]:
df['buildingType'].unique()
Out[33]:
array([ 4.,  1.,  3., nan,  2.])

Instead of dropping them, create and assign a new category label, 5.

In [34]:
df['buildingType'] = df['buildingType'].replace({np.nan:5})
In [35]:
df['buildingType'].unique()
Out[35]:
array([4., 1., 3., 5., 2.])
In [36]:
df.isnull().sum()/len(df)*100
Out[36]:
url                    0.0000
id                     0.0000
Lng                    0.0000
Lat                    0.0000
Cid                    0.0000
tradeTime              0.0000
DOM                   49.5485
followers              0.0000
totalPrice             0.0000
price                  0.0000
square                 0.0000
livingRoom             0.0000
drawingRoom            0.0000
kitchen                0.0000
bathRoom               0.0000
buildingType           0.0000
constructionTime       0.0000
renovationCondition    0.0000
buildingStructure      0.0000
ladderRatio            0.0000
elevator               0.0000
fiveYearsProperty      0.0000
subway                 0.0000
district               0.0000
communityAverage       0.1452
year                   0.0000
month                  0.0000
floorType              0.0000
floorHeight            0.0000
dtype: float64

Null in 'communityAverage'

Before imputing community average, check how community average has been calculated.

In [37]:
# Make a list of columns that seem related to community average.
checklist = ['communityAverage','Cid','totalPrice','price','year','month','district','buildingType','buildingStructure']
In [38]:
# Make a new dataframe of checklist
comAveSet = df[checklist]
In [39]:
# Check the max of communityAverage
comAveSet.communityAverage.max()
Out[39]:
183109.0
In [40]:
# Filter the rows with the maximum communityAverage to see which values feed into that average
comAveSetmax = comAveSet[comAveSet.communityAverage==comAveSet.communityAverage.max()]
pd.set_option('display.float_format', lambda x: '%.2f' % x)
comAveSetmax.describe().T
Out[40]:
count mean std min 25% 50% 75% max
communityAverage 53.00 183109.00 0.00 183109.00 183109.00 183109.00 183109.00 183109.00
Cid 53.00 1111027374315.00 0.00 1111027374315.00 1111027374315.00 1111027374315.00 1111027374315.00 1111027374315.00
totalPrice 53.00 786.18 610.49 55.00 500.00 610.00 905.00 3840.00
price 53.00 92479.87 32431.31 15546.00 71392.00 87597.00 120250.00 147503.00
year 53.00 2014.38 1.38 2011.00 2014.00 2015.00 2015.00 2016.00
month 53.00 5.72 3.52 1.00 3.00 5.00 9.00 12.00
district 53.00 10.00 0.00 10.00 10.00 10.00 10.00 10.00
buildingType 53.00 3.04 0.27 3.00 3.00 3.00 3.00 5.00
buildingStructure 53.00 5.91 0.69 1.00 6.00 6.00 6.00 6.00
In [41]:
# Another search
comAveSetsecmax = comAveSet[comAveSet.communityAverage==146634.0]
pd.set_option('display.float_format', lambda x: '%.2f' % x)
comAveSetsecmax.describe().T
Out[41]:
count mean std min 25% 50% 75% max
communityAverage 14.00 146634.00 0.00 146634.00 146634.00 146634.00 146634.00 146634.00
Cid 14.00 1111027382080.00 0.00 1111027382080.00 1111027382080.00 1111027382080.00 1111027382080.00 1111027382080.00
totalPrice 14.00 511.07 175.70 179.00 454.25 562.50 646.50 738.00
price 14.00 99890.86 36226.53 32966.00 74412.50 103056.00 131861.25 149276.00
year 14.00 2014.57 1.34 2011.00 2014.00 2015.00 2015.00 2016.00
month 14.00 7.21 3.64 1.00 5.00 7.50 9.75 12.00
district 14.00 10.00 0.00 10.00 10.00 10.00 10.00 10.00
buildingType 14.00 4.00 0.00 4.00 4.00 4.00 4.00 4.00
buildingStructure 14.00 2.00 0.00 2.00 2.00 2.00 2.00 2.00

From these analyses, communityAverage seems to be the average price within a district or Cid. More importantly, it is a value averaged over multiple years.

First, check the number of unique values in each column.

In [42]:
len(df['Cid'].unique()),len(df['communityAverage'].unique()),len(df['district'].unique())
Out[42]:
(4035, 4073, 13)
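If communityAverage really is a per-community figure, each Cid should map to only one (or very few) distinct values; a quick sketch to test this:

# How many distinct communityAverage values does each Cid carry?
per_cid = df.groupby('Cid')['communityAverage'].nunique()
print(per_cid.value_counts().head())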

Next, check the summary for a single Cid (117800835289894).

In [43]:
df[df.Cid==117800835289894].describe().T
Out[43]:
count mean std min 25% 50% 75% max
Lng 8.00 116.37 0.00 116.37 116.37 116.37 116.37 116.37
Lat 8.00 40.08 0.00 40.08 40.08 40.08 40.08 40.08
Cid 8.00 117800835289894.00 0.00 117800835289894.00 117800835289894.00 117800835289894.00 117800835289894.00 117800835289894.00
DOM 8.00 71.12 56.24 17.00 32.00 43.00 114.25 169.00
followers 8.00 42.88 40.15 0.00 22.75 33.50 44.00 133.00
totalPrice 8.00 531.86 87.26 420.00 486.00 519.45 550.75 715.00
price 8.00 51673.75 6277.83 41380.00 47969.00 52531.00 56265.75 59941.00
square 8.00 103.32 14.30 90.09 93.13 95.51 116.00 127.68
livingRoom 8.00 2.25 0.46 2.00 2.00 2.00 2.25 3.00
drawingRoom 8.00 1.38 0.52 1.00 1.00 1.00 2.00 2.00
kitchen 8.00 1.00 0.00 1.00 1.00 1.00 1.00 1.00
bathRoom 8.00 1.38 0.52 1.00 1.00 1.00 2.00 2.00
buildingType 8.00 3.00 0.00 3.00 3.00 3.00 3.00 3.00
constructionTime 8.00 2015.62 0.52 2015.00 2015.00 2016.00 2016.00 2016.00
renovationCondition 8.00 2.00 1.07 1.00 1.00 2.00 2.25 4.00
buildingStructure 8.00 6.00 0.00 6.00 6.00 6.00 6.00 6.00
ladderRatio 8.00 0.47 0.34 0.25 0.25 0.25 0.62 1.00
elevator 8.00 0.25 0.46 0.00 0.00 0.00 0.25 1.00
fiveYearsProperty 8.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
subway 8.00 1.00 0.00 1.00 1.00 1.00 1.00 1.00
district 8.00 6.00 0.00 6.00 6.00 6.00 6.00 6.00
communityAverage 8.00 54570.12 1618.21 53998.00 53998.00 53998.00 53998.00 58575.00
year 8.00 2016.62 0.52 2016.00 2016.00 2017.00 2017.00 2017.00
month 8.00 8.75 2.19 5.00 7.75 9.00 9.50 12.00
floorType 8.00 1.38 1.77 0.00 0.00 1.00 1.50 5.00
floorHeight 8.00 19.12 1.55 18.00 18.00 18.00 21.00 21.00

communityAverage has 4073 unique values, while Cid has 4035; these counts are very close. In contrast, the 13 districts are far too few to explain 4073 unique communityAverage values.

A single district contains many Cids. Also, although we removed the map analysis because the data is hundreds of MB, the geographical relationship between Cid and district shows that district boundaries do not line up exactly with Cid boundaries; near a district boundary, several Cids are mixed.

We conclude the following.

    1. It is unlikely that communityAverage is the average over a district.
    2. Cid appears to be the variable most strongly related to communityAverage.
    3. communityAverage is an average computed regardless of year.

In order to keep the feature variables independent and meaningful, drop the communityAverage column.

In [44]:
df.drop('communityAverage',inplace=True,axis=1)
In [45]:
# confirm the remaining null.
df.isnull().sum()/len(df)*100
Out[45]:
url                    0.00
id                     0.00
Lng                    0.00
Lat                    0.00
Cid                    0.00
tradeTime              0.00
DOM                   49.55
followers              0.00
totalPrice             0.00
price                  0.00
square                 0.00
livingRoom             0.00
drawingRoom            0.00
kitchen                0.00
bathRoom               0.00
buildingType           0.00
constructionTime       0.00
renovationCondition    0.00
buildingStructure      0.00
ladderRatio            0.00
elevator               0.00
fiveYearsProperty      0.00
subway                 0.00
district               0.00
year                   0.00
month                  0.00
floorType              0.00
floorHeight            0.00
dtype: float64

Null in 'DOM'

DOM is defined as days on market. If a house stays on the market longer than others, customers looking for houses start to suspect there are problems with it.

In [46]:
msno.matrix(df)
Out[46]:
<AxesSubplot:>

The vast majority of missing values occur in the earlier period of this data set.

In [47]:
df.describe().DOM
Out[47]:
count   160849.00
mean        28.83
std         50.24
min          1.00
25%          1.00
50%          6.00
75%         37.00
max       1677.00
Name: DOM, dtype: float64

DOM seems to be an important variable, so we would like to impute the missing values somehow.

Because of the large discrepancy among the median, mean, and maximum, we will analyze it further in the next step, EDA. For now, we keep the missing values in the DOM column.
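For the EDA notebook, two candidate strategies worth evaluating are median imputation combined with a missingness indicator. The column names below are hypothetical and nothing is applied here.

# Sketch only: candidate DOM handling to evaluate later (hypothetical column names)
candidate = df.copy()
candidate['DOM_missing'] = candidate['DOM'].isna().astype(int)  # preserve the missingness signal
candidate['DOM_median'] = candidate['DOM'].fillna(candidate['DOM'].median())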

In [48]:
# df['DOM'] = df['DOM'].fillna(1.0)
In [49]:
# Final check
df.isnull().sum()/len(df)*100
Out[49]:
url                    0.00
id                     0.00
Lng                    0.00
Lat                    0.00
Cid                    0.00
tradeTime              0.00
DOM                   49.55
followers              0.00
totalPrice             0.00
price                  0.00
square                 0.00
livingRoom             0.00
drawingRoom            0.00
kitchen                0.00
bathRoom               0.00
buildingType           0.00
constructionTime       0.00
renovationCondition    0.00
buildingStructure      0.00
ladderRatio            0.00
elevator               0.00
fiveYearsProperty      0.00
subway                 0.00
district               0.00
year                   0.00
month                  0.00
floorType              0.00
floorHeight            0.00
dtype: float64

Save data

This should be done after handling communityAverage, since that column contained null values.

In [50]:
# save the data to a new csv file
datapath = '../data'
save_file(df, 'data_step2_cleaned.csv', datapath)
A file already exists with this name.

Do you want to overwrite? (Y/N)y
Writing file.  "../data/data_step2_cleaned.csv"

Summary

  • The key processes and findings from this notebook are as follows.

    • Original rows in the data: 318851 rows
    • Rows deleted: 32 rows. These rows had data misplaced across 10 columns and 6 missing feature values.
    • Rows after cleaning: 318819 rows
  • The removed columns

    • communityAverage: This column held values averaged over several years, was strongly related to Cid, and appeared dependent on Cid and on price or totalPrice. Since one of our goals is to analyze the time-series change in price, this average price is unnecessary and has been deleted.
  • DOM column: Although the DOM column has a large number of missing values, DOM is considered an important key performance indicator in real estate, so we leave its imputation out of this data wrangling and explore it further in the 03 EDA notebook.

  • price and totalPrice: As discussed above, the price information is duplicated; 'totalPrice' seems equal to 'price' multiplied by 'square'. One of these columns will need to be removed, but we will discuss this in EDA.