Springboard Data Science Career Track Unit 4 Challenge - Tier 3 Complete

Objectives

Hey! Great job getting through those challenging DataCamp courses. You're learning a lot in a short span of time.

In this notebook, you're going to apply the skills you've been learning, bridging the gap between the controlled environment of DataCamp and the slightly messier work that data scientists do with actual datasets!

Here’s the mystery we’re going to solve: which boroughs of London have seen the greatest increase in housing prices, on average, over the last two decades?

A borough is just a fancy word for district. You may be familiar with the five boroughs of New York… well, there are 32 boroughs within Greater London (here's some info for the curious). Some of them are more desirable areas to live in, and the data will reflect that with a greater rise in housing prices.

This is the Tier 3 notebook, which means it's not filled in at all: we'll just give you the skeleton of a project, the brief and the data. It's up to you to play around with it and see what you can find out! Good luck! If you struggle, feel free to look at easier tiers for help; but try to dip in and out of them, as the more independent work you do, the better it is for your learning!

This challenge will make use of only what you learned in the following DataCamp courses:

  • Prework courses (Introduction to Python for Data Science, Intermediate Python for Data Science)
  • Data Types for Data Science
  • Python Data Science Toolbox (Part One)
  • pandas Foundations
  • Manipulating DataFrames with pandas
  • Merging DataFrames with pandas

Of the tools, techniques and concepts in the above DataCamp courses, this challenge should require the application of the following:

  • pandas
    • data ingestion and inspection (pandas Foundations, Module One)
    • exploratory data analysis (pandas Foundations, Module Two)
    • tidying and cleaning (Manipulating DataFrames with pandas, Module Three)
    • transforming DataFrames (Manipulating DataFrames with pandas, Module One)
    • subsetting DataFrames with lists (Manipulating DataFrames with pandas, Module One)
    • filtering DataFrames (Manipulating DataFrames with pandas, Module One)
    • grouping data (Manipulating DataFrames with pandas, Module Four)
    • melting data (Manipulating DataFrames with pandas, Module Three)
    • advanced indexing (Manipulating DataFrames with pandas, Module Four)
  • matplotlib (Intermediate Python for Data Science, Module One)
  • fundamental data types (Data Types for Data Science, Module One)
  • dictionaries (Intermediate Python for Data Science, Module Two)
  • handling dates and times (Data Types for Data Science, Module Four)
  • function definition (Python Data Science Toolbox - Part One, Module One)
  • default arguments, variable length, and scope (Python Data Science Toolbox - Part One, Module Two)
  • lambda functions and error handling (Python Data Science Toolbox - Part One, Module Four)

The Data Science Pipeline

This is Tier Three, so we'll get you started. But after that, it's all in your hands! When you feel done with your investigations, look back over what you've accomplished, and prepare a quick presentation of your findings for the next mentor meeting.

Data Science is magical. In this case study, you'll get to apply some complex machine learning algorithms. But as David Spiegelhalter reminds us, there is no substitute for simply taking a really, really good look at the data. Sometimes, this is all we need to answer our question.

Data Science projects generally adhere to the four stages of the Data Science Pipeline:

  1. Sourcing and loading
  2. Cleaning, transforming, and visualizing
  3. Modeling
  4. Evaluating and concluding

1. Sourcing and Loading

Any Data Science project kicks off by importing pandas. The documentation of this wonderful library can be found here. As you've seen, pandas is conveniently connected to the NumPy and Matplotlib libraries.

Hint: This part of the data science pipeline will test those skills you acquired in the pandas Foundations course, Module One.

1.1. Importing Libraries

In [1]:
# Let's import the pandas and numpy libraries as pd and np, respectively.
import pandas as pd
import numpy as np

# Load the pyplot collection of functions from matplotlib, as plt 
import matplotlib.pyplot as plt

1.2. Loading the data

Your data comes from the London Datastore: a free, open-source data-sharing portal for London-oriented datasets.

In [2]:
# First, make a variable called url_LondonHousePrices, and assign it the following link, enclosed in quotation-marks as a string:
# https://data.london.gov.uk/download/uk-house-price-index/70ac0766-8902-4eb5-aab5-01951aaed773/UK%20House%20price%20index.xls

url_LondonHousePrices = "https://data.london.gov.uk/download/uk-house-price-index/70ac0766-8902-4eb5-aab5-01951aaed773/UK_House_price_index.xlsx"

# The dataset we're interested in contains the Average prices of the houses, and is actually on a particular sheet of the Excel file. 
# As a result, we need to specify the sheet name in the read_excel() method.
# Put this data into a variable called properties.  
properties = pd.read_excel(url_LondonHousePrices, sheet_name='Average price', index_col= None)

2. Cleaning, transforming, and visualizing

This second stage is arguably the most important part of any Data Science project. The first thing to do is take a proper look at the data. Cleaning forms the majority of this stage, and can be done either before or after transformation.

The end goal of data cleaning is to have tidy data. When data is tidy:

  1. Each variable has a column.
  2. Each observation forms a row.

Keep the end goal in mind as you move through this process; every step will take you closer, as the toy example below illustrates.
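To make those two rules concrete, here is a minimal sketch (made-up numbers, not the challenge data) of melting a wide table into tidy form:

import pandas as pd

# Wide/untidy: each month is its own column, even though 'month' is a variable.
wide = pd.DataFrame({
    'borough': ['Camden', 'Bexley'],
    '1995-01': [120932.9, 64958.1],
    '1995-02': [119508.9, 64787.9],
})

# Tidy: one observation per row, one variable per column.
tidy = wide.melt(id_vars='borough', var_name='month', value_name='average_price')
print(tidy)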

Hint: This part of the data science pipeline should test those skills you acquired in:

  • Intermediate Python for Data Science, all modules.
  • pandas Foundations, all modules.
  • Manipulating DataFrames with pandas, all modules.
  • Data Types for Data Science, Module Four.
  • Python Data Science Toolbox - Part One, all modules.

2.1. Exploring your data

Think about your pandas functions for checking out a dataframe.
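For reference, a few of the usual inspection calls (a sketch; the next cell uses the first of them):

properties.head()    # first five rows
properties.info()    # column names, dtypes, non-null counts
properties.shape     # (number of rows, number of columns)
properties.dtypes    # dtype of each column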

In [3]:
properties.head()
Out[3]:
Unnamed: 0 City of London Barking & Dagenham Barnet Bexley Brent Bromley Camden Croydon Ealing ... NORTH WEST YORKS & THE HUMBER EAST MIDLANDS WEST MIDLANDS EAST OF ENGLAND LONDON SOUTH EAST SOUTH WEST Unnamed: 47 England
0 NaT E09000001 E09000002 E09000003 E09000004 E09000005 E09000006 E09000007 E09000008 E09000009 ... E12000002 E12000003 E12000004 E12000005 E12000006 E12000007 E12000008 E12000009 NaN E92000001
1 1995-01-01 91448.98487 50460.2266 93284.51832 64958.09036 71306.56698 81671.47692 120932.8881 69158.16225 79885.89069 ... 43958.48001 44803.42878 45544.52227 48527.52339 56701.5961 74435.76052 64018.87894 54705.1579 NaN 53202.77128
2 1995-02-01 82202.77314 51085.77983 93190.16963 64787.92069 72022.26197 81657.55944 119508.8622 68951.09542 80897.06551 ... 43925.42289 44528.80721 46051.57066 49341.29029 56593.59475 72777.93709 63715.02399 54356.14843 NaN 53096.1549
3 1995-03-01 79120.70256 51268.96956 92247.52435 64367.49344 72015.76274 81449.31143 120282.2131 68712.44341 81379.86288 ... 44434.8681 45200.46775 45383.82395 49442.17973 56171.18278 73896.84204 64113.60858 53583.07667 NaN 53201.2843
4 1995-04-01 77101.20804 53133.50526 90762.87492 64277.66881 72965.63094 81124.41227 120097.899 68610.04641 82188.90498 ... 44267.7796 45614.34341 46124.23045 49455.93299 56567.89582 74455.28754 64623.22395 54786.01938 NaN 53590.8548

5 rows × 49 columns

2.2. Cleaning the data

You might find you need to transpose your dataframe, check out what its row indexes are, and reset the index. You also might find you need to assign the values of the first row to your column headings. (Hint: recall the .columns attribute of DataFrames, as well as the iloc[] indexer).

Don't be afraid to use StackOverflow for help with this.

In [4]:
properties_T = properties.transpose()
In [5]:
properties_T = properties_T.reset_index()
properties_T.head()
Out[5]:
index 0 1 2 3 4 5 6 7 8 ... 308 309 310 311 312 313 314 315 316 317
0 Unnamed: 0 NaT 1995-01-01 00:00:00 1995-02-01 00:00:00 1995-03-01 00:00:00 1995-04-01 00:00:00 1995-05-01 00:00:00 1995-06-01 00:00:00 1995-07-01 00:00:00 1995-08-01 00:00:00 ... 2020-08-01 00:00:00 2020-09-01 00:00:00 2020-10-01 00:00:00 2020-11-01 00:00:00 2020-12-01 00:00:00 2021-01-01 00:00:00 2021-02-01 00:00:00 2021-03-01 00:00:00 2021-04-01 00:00:00 2021-05-01 00:00:00
1 City of London E09000001 91448.98487 82202.77314 79120.70256 77101.20804 84409.14932 94900.51244 110128.0423 112329.4376 ... 827658.5167 798499.1646 836807.0518 769391.7847 801999.0466 760599.2322 729315.7243 717528.2633 713094.5391 765822.2726
2 Barking & Dagenham E09000002 50460.2266 51085.77983 51268.96956 53133.50526 53042.24852 53700.34831 52113.12157 52232.19868 ... 304923.5027 301702.4656 304852.1338 305012.3291 309560.1023 311007.1517 313835.8023 314187.4036 309680.5549 308760.1446
3 Barnet E09000003 93284.51832 93190.16963 92247.52435 90762.87492 90258.00033 90107.23471 91441.24768 92361.31512 ... 529660.1524 533619.23 530353.5205 528816.4928 532528.0926 538842.4015 538215.8867 542872.3895 545746.3543 546082.2634
4 Bexley E09000004 64958.09036 64787.92069 64367.49344 64277.66881 63997.13588 64252.32335 63722.70055 64432.60005 ... 346679.9244 345061.4945 345404.9836 346252.9145 352260.8385 356664.073 360116.1134 364480.9055 365519.4195 368554.8059

5 rows × 319 columns

In [6]:
# Use the values of the first row as the column headings...
properties_T.columns = list(properties_T.iloc[0, :])

# ...then drop that first row, keeping everything from row 1 onwards.
properties_T = properties_T.loc[1:, :]
In [7]:
properties_T.head()
Out[7]:
Unnamed: 0 NaN 1995-01-01 00:00:00 1995-02-01 00:00:00 1995-03-01 00:00:00 1995-04-01 00:00:00 1995-05-01 00:00:00 1995-06-01 00:00:00 1995-07-01 00:00:00 1995-08-01 00:00:00 ... 2020-08-01 00:00:00 2020-09-01 00:00:00 2020-10-01 00:00:00 2020-11-01 00:00:00 2020-12-01 00:00:00 2021-01-01 00:00:00 2021-02-01 00:00:00 2021-03-01 00:00:00 2021-04-01 00:00:00 2021-05-01 00:00:00
1 City of London E09000001 91448.98487 82202.77314 79120.70256 77101.20804 84409.14932 94900.51244 110128.0423 112329.4376 ... 827658.5167 798499.1646 836807.0518 769391.7847 801999.0466 760599.2322 729315.7243 717528.2633 713094.5391 765822.2726
2 Barking & Dagenham E09000002 50460.2266 51085.77983 51268.96956 53133.50526 53042.24852 53700.34831 52113.12157 52232.19868 ... 304923.5027 301702.4656 304852.1338 305012.3291 309560.1023 311007.1517 313835.8023 314187.4036 309680.5549 308760.1446
3 Barnet E09000003 93284.51832 93190.16963 92247.52435 90762.87492 90258.00033 90107.23471 91441.24768 92361.31512 ... 529660.1524 533619.23 530353.5205 528816.4928 532528.0926 538842.4015 538215.8867 542872.3895 545746.3543 546082.2634
4 Bexley E09000004 64958.09036 64787.92069 64367.49344 64277.66881 63997.13588 64252.32335 63722.70055 64432.60005 ... 346679.9244 345061.4945 345404.9836 346252.9145 352260.8385 356664.073 360116.1134 364480.9055 365519.4195 368554.8059
5 Brent E09000005 71306.56698 72022.26197 72015.76274 72965.63094 73704.04743 74310.48167 74127.03788 73547.0411 ... 497729.0327 519561.8277 522239.3216 513100.814 495066.6459 511347.8856 514388.7387 519212.9441 498083.9417 508570.7598

5 rows × 319 columns

2.3. Cleaning the data (part 2)

You might have to rename a couple of columns. How do you do this? The clue's pretty bold...

In [8]:
properties_T = properties_T.rename(columns={"Unnamed: 0": "borough", pd.NaT: "id"})
properties_T.head()
Out[8]:
borough id 1995-01-01 00:00:00 1995-02-01 00:00:00 1995-03-01 00:00:00 1995-04-01 00:00:00 1995-05-01 00:00:00 1995-06-01 00:00:00 1995-07-01 00:00:00 1995-08-01 00:00:00 ... 2020-08-01 00:00:00 2020-09-01 00:00:00 2020-10-01 00:00:00 2020-11-01 00:00:00 2020-12-01 00:00:00 2021-01-01 00:00:00 2021-02-01 00:00:00 2021-03-01 00:00:00 2021-04-01 00:00:00 2021-05-01 00:00:00
1 City of London E09000001 91448.98487 82202.77314 79120.70256 77101.20804 84409.14932 94900.51244 110128.0423 112329.4376 ... 827658.5167 798499.1646 836807.0518 769391.7847 801999.0466 760599.2322 729315.7243 717528.2633 713094.5391 765822.2726
2 Barking & Dagenham E09000002 50460.2266 51085.77983 51268.96956 53133.50526 53042.24852 53700.34831 52113.12157 52232.19868 ... 304923.5027 301702.4656 304852.1338 305012.3291 309560.1023 311007.1517 313835.8023 314187.4036 309680.5549 308760.1446
3 Barnet E09000003 93284.51832 93190.16963 92247.52435 90762.87492 90258.00033 90107.23471 91441.24768 92361.31512 ... 529660.1524 533619.23 530353.5205 528816.4928 532528.0926 538842.4015 538215.8867 542872.3895 545746.3543 546082.2634
4 Bexley E09000004 64958.09036 64787.92069 64367.49344 64277.66881 63997.13588 64252.32335 63722.70055 64432.60005 ... 346679.9244 345061.4945 345404.9836 346252.9145 352260.8385 356664.073 360116.1134 364480.9055 365519.4195 368554.8059
5 Brent E09000005 71306.56698 72022.26197 72015.76274 72965.63094 73704.04743 74310.48167 74127.03788 73547.0411 ... 497729.0327 519561.8277 522239.3216 513100.814 495066.6459 511347.8856 514388.7387 519212.9441 498083.9417 508570.7598

5 rows × 319 columns

2.4. Transforming the data

Remember what Wes McKinney said about tidy data?

You might need to melt your DataFrame here.

In [9]:
properties_T_melt = properties_T.melt(id_vars=['borough','id'])
properties_T_melt.head()
Out[9]:
borough id variable value
0 City of London E09000001 1995-01-01 91448.98487
1 Barking & Dagenham E09000002 1995-01-01 50460.2266
2 Barnet E09000003 1995-01-01 93284.51832
3 Bexley E09000004 1995-01-01 64958.09036
4 Brent E09000005 1995-01-01 71306.56698

Remember to make sure your column data types are all correct. Average prices, for example, should be floating point numbers...
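If you'd rather not write your own converter, pandas has one built in; a sketch of the alternative (errors='coerce' turns anything unparseable into NaN, and keeps the prices as floats):

properties_T_melt['value'] = pd.to_numeric(properties_T_melt['value'], errors='coerce')

The cells below take the hand-rolled route instead, which lets you practise function definition and error handling.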

In [10]:
properties_T_rename = properties_T_melt.rename(columns = {'variable':'date','value':'average_price'})
In [11]:
properties_T_rename.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15216 entries, 0 to 15215
Data columns (total 4 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   borough        15216 non-null  object        
 1   id             14265 non-null  object        
 2   date           15216 non-null  datetime64[ns]
 3   average_price  14265 non-null  object        
dtypes: datetime64[ns](1), object(3)
memory usage: 475.6+ KB
In [12]:
# Try to convert a value to an integer (truncating any decimal part); return NaN if it can't be converted.
def str2int(s):
    try:
        return int(s)
    except (ValueError, TypeError):
        return np.nan
In [13]:
properties_T_rename['average_price'] = properties_T_rename['average_price'].apply(lambda x: str2int(x))

2.5. Cleaning the data (part 3)

Do we have an equal number of observations in the ID, Average Price, Month, and London Borough columns? Remember that there are only 32 London Boroughs. How many entries do you have in that column?

Check out the contents of the London Borough column, and if you find null values, get rid of them however you see fit.

In [14]:
set(properties_T_rename['borough'].unique())
Out[14]:
{'Barking & Dagenham',
 'Barnet',
 'Bexley',
 'Brent',
 'Bromley',
 'Camden',
 'City of London',
 'Croydon',
 'EAST MIDLANDS',
 'EAST OF ENGLAND',
 'Ealing',
 'Enfield',
 'England',
 'Greenwich',
 'Hackney',
 'Hammersmith & Fulham',
 'Haringey',
 'Harrow',
 'Havering',
 'Hillingdon',
 'Hounslow',
 'Inner London',
 'Islington',
 'Kensington & Chelsea',
 'Kingston upon Thames',
 'LONDON',
 'Lambeth',
 'Lewisham',
 'Merton',
 'NORTH EAST',
 'NORTH WEST',
 'Newham',
 'Outer London',
 'Redbridge',
 'Richmond upon Thames',
 'SOUTH EAST',
 'SOUTH WEST',
 'Southwark',
 'Sutton',
 'Tower Hamlets',
 'Unnamed: 34',
 'Unnamed: 37',
 'Unnamed: 47',
 'WEST MIDLANDS',
 'Waltham Forest',
 'Wandsworth',
 'Westminster',
 'YORKS & THE HUMBER'}
In [15]:
properties_T_rename.isnull().sum()
Out[15]:
borough            0
id               951
date               0
average_price    951
dtype: int64
In [16]:
properties_T_rename[properties_T_rename['id'].isnull()]
Out[16]:
borough id date average_price
33 Unnamed: 34 NaN 1995-01-01 NaN
36 Unnamed: 37 NaN 1995-01-01 NaN
46 Unnamed: 47 NaN 1995-01-01 NaN
81 Unnamed: 34 NaN 1995-02-01 NaN
84 Unnamed: 37 NaN 1995-02-01 NaN
... ... ... ... ...
15156 Unnamed: 37 NaN 2021-04-01 NaN
15166 Unnamed: 47 NaN 2021-04-01 NaN
15201 Unnamed: 34 NaN 2021-05-01 NaN
15204 Unnamed: 37 NaN 2021-05-01 NaN
15214 Unnamed: 47 NaN 2021-05-01 NaN

951 rows × 4 columns

In [17]:
NaNFree_properties_T_rename = properties_T_rename.dropna()
In [18]:
NaNFree_properties_T_rename.isnull().sum()
Out[18]:
borough          0
id               0
date             0
average_price    0
dtype: int64

2.6. Visualizing the data

To visualize the data, why not subset on a particular London Borough? Maybe do a line plot of Month against Average Price?
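For instance, a single-borough line plot takes only a few lines (a sketch using the cleaned frame from above; swap in any borough name from the list below):

camden = NaNFree_properties_T_rename[NaNFree_properties_T_rename['borough'] == 'Camden']
plt.plot(camden['date'], camden['average_price'])
plt.xlabel('Month')
plt.ylabel('Average price (£)')
plt.title('Camden: average house price over time')
plt.show()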

In [19]:
set(NaNFree_properties_T_rename['borough'].unique())
Out[19]:
{'Barking & Dagenham',
 'Barnet',
 'Bexley',
 'Brent',
 'Bromley',
 'Camden',
 'City of London',
 'Croydon',
 'EAST MIDLANDS',
 'EAST OF ENGLAND',
 'Ealing',
 'Enfield',
 'England',
 'Greenwich',
 'Hackney',
 'Hammersmith & Fulham',
 'Haringey',
 'Harrow',
 'Havering',
 'Hillingdon',
 'Hounslow',
 'Inner London',
 'Islington',
 'Kensington & Chelsea',
 'Kingston upon Thames',
 'LONDON',
 'Lambeth',
 'Lewisham',
 'Merton',
 'NORTH EAST',
 'NORTH WEST',
 'Newham',
 'Outer London',
 'Redbridge',
 'Richmond upon Thames',
 'SOUTH EAST',
 'SOUTH WEST',
 'Southwark',
 'Sutton',
 'Tower Hamlets',
 'WEST MIDLANDS',
 'Waltham Forest',
 'Wandsworth',
 'Westminster',
 'YORKS & THE HUMBER'}
In [20]:
subBorough = ['Barking & Dagenham',
 'Barnet',
 'Bexley',
 'Brent',
 'Bromley',
 'Camden',
#  'City of London',
 'Croydon',
#  'EAST MIDLANDS',
#  'EAST OF ENGLAND',
 'Ealing',
 'Enfield',
 'England',
 'Greenwich',
 'Hackney',
 'Hammersmith & Fulham',
 'Haringey',
 'Harrow',
 'Havering',
 'Hillingdon',
 'Hounslow',
#  'Inner London',
 'Islington',
 'Kensington & Chelsea',
 'Kingston upon Thames',
 'LONDON',
 'Lambeth',
 'Lewisham',
 'Merton',
#  'NORTH EAST',
#  'NORTH WEST',
 'Newham',
 'Outer London',
 'Redbridge',
 'Richmond upon Thames',
#  'SOUTH EAST',
#  'SOUTH WEST',
 'Southwark',
 'Sutton',
 'Tower Hamlets',
 'WEST MIDLANDS',
 'Waltham Forest',
 'Wandsworth',
 'Westminster',
 'YORKS & THE HUMBER']
In [21]:
subset_NaNFree_properties_T_rename = NaNFree_properties_T_rename[NaNFree_properties_T_rename['borough'].isin(subBorough)]
# subset_NaNFree_properties_T_rename.head()

To limit the number of data points you have, you might want to extract the year from every month value in your Month column.

To this end, you could apply a lambda function. Your logic could work as follows:

  1. look through the Month column
  2. extract the year from each individual value in that column
  3. store that corresponding year as a separate column.

Whether you go ahead with this is up to you. Just so long as you answer our initial brief: which boroughs of London have seen the greatest house price increase, on average, over the past two decades?

In [22]:
# Take an explicit copy first, so adding the new column doesn't trigger a SettingWithCopyWarning.
subset_NaNFree_properties_T_rename = subset_NaNFree_properties_T_rename.copy()
subset_NaNFree_properties_T_rename['year'] = subset_NaNFree_properties_T_rename['date'].apply(lambda x: x.year)
subset_NaNFree_properties_T_rename.head()
Out[22]:
borough id date average_price year
1 Barking & Dagenham E09000002 1995-01-01 50460.0 1995
2 Barnet E09000003 1995-01-01 93284.0 1995
3 Bexley E09000004 1995-01-01 64958.0 1995
4 Brent E09000005 1995-01-01 71306.0 1995
5 Bromley E09000006 1995-01-01 81671.0 1995
In [23]:
import cufflinks as cf
In [24]:
# Set all charts to public sharing, with the 'pearl' theme, then switch to offline mode.
cf.set_config_file(sharing='public',theme='pearl',offline=False)
cf.go_offline()
In [25]:
subset_NaNFree_properties_T_rename.pivot_table(index='year', columns='borough', values='average_price', aggfunc=np.mean).iplot(kind='scatter', xTitle='year', yTitle='average price', title='average price over years')
In [26]:
fig, ax = plt.subplots(figsize=(20, 15))
subset_NaNFree_properties_T_rename.pivot_table(index='year', columns='borough', values='average_price',aggfunc=np.mean ).plot(ax=ax)
ax.set_xlim(subset_NaNFree_properties_T_rename['year'].min(), subset_NaNFree_properties_T_rename['year'].max())
ax.set_ylim(0, subset_NaNFree_properties_T_rename['average_price'].max())
ax.ticklabel_format(style='plain')
plt.show()

3. Modeling

Consider creating a function that will calculate a ratio of house prices, comparing the price of a house in 2018 to the price in 1998.

Consider calling this function create_price_ratio.

You'd want this function to:

  1. Take a filter of dfg, specifically where this filter constrains the London_Borough, as an argument. For example, one admissible argument should be: dfg[dfg['London_Borough']=='Camden'].
  2. Get the Average Price for that Borough, for the years 1998 and 2018.
  3. Calculate the ratio of the Average Price for 2018 divided by the Average Price for 1998.
  4. Return that ratio.

Once you've written this function, you ultimately want to use it to iterate through all the unique London_Boroughs and work out the ratio capturing the difference of house prices between 1998 and 2018.

Bear in mind: you don't have to write a function like this if you don't want to. If you can solve the brief otherwise, then great! (A sketch of one possible shape for the function follows the hint below.)

Hint: This section should test the skills you acquired in:

  • Python Data Science Toolbox - Part One, all modules
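As a sketch of the shape the brief describes (the column names London_Borough, Year and Average Price are hypothetical, taken from the brief's wording rather than from the frames used in the cells below):

def create_price_ratio(df):
    # df is expected to be a one-borough slice, e.g. dfg[dfg['London_Borough'] == 'Camden']
    avg_1998 = df.loc[df['Year'] == 1998, 'Average Price'].mean()
    avg_2018 = df.loc[df['Year'] == 2018, 'Average Price'].mean()
    return avg_2018 / avg_1998

The cells that follow solve the same problem with the actual column names (borough, year, average_price).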
In [27]:
# Mean average_price per borough per year (select the column explicitly, so non-numeric columns like 'id' and 'date' stay out of the aggregation).
dfg = subset_NaNFree_properties_T_rename.groupby(['borough','year'])[['average_price']].mean()
dfg.reset_index().head()
Out[27]:
borough year average_price
0 Barking & Dagenham 1995 51817.500000
1 Barking & Dagenham 1996 51717.833333
2 Barking & Dagenham 1997 55973.666667
3 Barking & Dagenham 1998 60285.333333
4 Barking & Dagenham 1999 65320.500000
In [28]:
ratio = (dfg.xs(2021,level='year')/dfg.xs(1998,level='year')).rename(columns={'average_price':'ratio'})
ratio.sort_values('ratio',ascending=False,inplace=True)
ratio.reset_index().head(5)
Out[28]:
borough ratio
0 Hackney 6.667049
1 Waltham Forest 6.383784
2 Lewisham 5.899919
3 Southwark 5.797158
4 Newham 5.736885
In [29]:
ratio.plot(kind='bar',y='ratio',figsize=(10,5),xlabel='Borough',ylabel='Price ratio(2021/1998)')

plt.show()
In [30]:
ratio.iplot(kind='bar',xTitle='Borough',yTitle='Price ratio(2021/1998)',title='Cufflinks - Bar Chart')

**If a function has to be used, as the brief suggests...**

In [31]:
def create_price_ratio(df):
    # Ratio of a borough's average price in 2018 to its average price in 1998.
    return (df[df['year']==2018]['average_price']).values/(df[df['year']==1998]['average_price']).values

In [32]:
reset_dfg = dfg.reset_index()
reset_dfg.head()
# reset_dfg.info()
Out[32]:
borough year average_price
0 Barking & Dagenham 1995 51817.500000
1 Barking & Dagenham 1996 51717.833333
2 Barking & Dagenham 1997 55973.666667
3 Barking & Dagenham 1998 60285.333333
4 Barking & Dagenham 1999 65320.500000
In [33]:
ratio = {}  # note: this reuses the name `ratio`, overwriting the earlier DataFrame
for name in list(reset_dfg.borough.unique()):
    ratio[name] = create_price_ratio(reset_dfg[reset_dfg['borough']==name])
ratio
Out[33]:
{'Barking & Dagenham': array([4.89664706]),
 'Barnet': array([4.35820594]),
 'Bexley': array([4.24899905]),
 'Brent': array([4.89458596]),
 'Bromley': array([4.09479358]),
 'Camden': array([4.93536423]),
 'Croydon': array([4.20111391]),
 'Ealing': array([4.31146566]),
 'Enfield': array([4.26348447]),
 'England': array([3.81047519]),
 'Greenwich': array([4.76305948]),
 'Hackney': array([6.1983043]),
 'Hammersmith & Fulham': array([4.13780717]),
 'Haringey': array([5.13464921]),
 'Harrow': array([4.05920936]),
 'Havering': array([4.32524484]),
 'Hillingdon': array([4.20028803]),
 'Hounslow': array([3.97642502]),
 'Islington': array([4.84406149]),
 'Kensington & Chelsea': array([5.08247169]),
 'Kingston upon Thames': array([4.2705614]),
 'LONDON': array([4.67980038]),
 'Lambeth': array([4.95776824]),
 'Lewisham': array([5.44925656]),
 'Merton': array([4.7412939]),
 'Newham': array([5.30543022]),
 'Outer London': array([4.41896555]),
 'Redbridge': array([4.47119386]),
 'Richmond upon Thames': array([4.00516807]),
 'Southwark': array([5.51650489]),
 'Sutton': array([4.11854535]),
 'Tower Hamlets': array([4.62672185]),
 'WEST MIDLANDS': array([3.31128386]),
 'Waltham Forest': array([5.83478808]),
 'Wandsworth': array([4.75772672]),
 'Westminster': array([5.35357218]),
 'YORKS & THE HUMBER': array([3.35607992])}
In [34]:
pd_ratio = pd.DataFrame(ratio).T.rename(columns={0:'ratio'})
pd_ratio.sort_values('ratio').plot(kind='bar',xlabel='Borough',ylabel='Price ratio(2018/1998)',figsize=(15,5))
plt.show()

4. Conclusion

What can you conclude? Type out your conclusion below.

Look back at your notebook. Think about how you might summarize what you have done, and prepare a quick presentation on it to your mentor at your next meeting.

We hope you enjoyed this practical project. It should have consolidated your data hygiene and pandas skills by looking at a real-world problem involving just the kind of dataset you might encounter as a budding data scientist. Congratulations, and looking forward to seeing you at the next step in the course!

My answer

The graph shows the price ratio of houses in each borough, comparing the average price of a house in 2018 to that in 1998.

Overall, all the ratios were greater than 1, meaning average prices rose in every borough. The median of the ratios is 4.5, the mean is 4.6, the IQR is 0.73, and the standard deviation is 0.63.

The highest price ratio was about 6.2, in Hackney. This means the average price of a house in Hackney in 2018 was roughly six times the average price in 1998. In contrast, the lowest price ratio was 3.3, in WEST MIDLANDS. The size of the increase clearly varies with the borough.

Looking at the rate of change of prices in the final year of the graph, some boroughs show an upward trend while others show a downward one.

In addition, since the graph shows only averages, other factors likely affect prices and their rate of change, such as public transport, offices, or other facilities around a house.

In [35]:
greatest_price_2021 = dfg.xs(2021,level='year').reset_index().sort_values('average_price',ascending=False)
greatest_price_2021.head(5)
Out[35]:
borough average_price
19 Kensington & Chelsea 1247367.0
35 Westminster 965211.2
5 Camden 809411.8
12 Hammersmith & Fulham 741961.2
28 Richmond upon Thames 699516.0
In [36]:
# Annualized growth rate: (price ratio) ** (1/n) - 1; this cell uses an exponent of 1/26.
profit_ratio = ((dfg.xs(2021,level='year')/dfg.xs(1998,level='year'))**(1/26)-1).rename(columns={'average_price':'profit_ratio'})
profit_ratio.sort_values('profit_ratio',ascending=False,inplace=True)
profit_ratio.reset_index().head(30)
Out[36]:
borough profit_ratio
0 Hackney 0.075696
1 Waltham Forest 0.073902
2 Lewisham 0.070651
3 Southwark 0.069928
4 Newham 0.069498
5 Haringey 0.066576
6 Lambeth 0.065516
7 Barking & Dagenham 0.065203
8 Merton 0.064686
9 Brent 0.064411
10 Westminster 0.064393
11 Islington 0.064052
12 Greenwich 0.063723
13 Tower Hamlets 0.063379
14 LONDON 0.062962
15 Wandsworth 0.062514
16 Redbridge 0.061924
17 Camden 0.061692
18 Outer London 0.060892
19 Kensington & Chelsea 0.060877
20 Enfield 0.059800
21 Ealing 0.059728
22 Havering 0.059590
23 Bexley 0.059580
24 Barnet 0.058939
25 Croydon 0.058534
26 Hillingdon 0.058532
27 Kingston upon Thames 0.058177
28 Hounslow 0.057962
29 Bromley 0.057734
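For reference, the quantity above is a compound annual growth rate: over n years, rate = ratio ** (1/n) − 1. A quick sanity check with Hackney's price ratio from Out[28] (a sketch; note that 1998 to 2021 spans 23 years, while the cell above uses an exponent of 1/26):

ratio_hackney = 6.667049                # 2021 vs 1998 average price, from Out[28]
print(ratio_hackney ** (1 / 26) - 1)    # ~0.0757, matching the table above
print(ratio_hackney ** (1 / 23) - 1)    # ~0.0860, using the 23-year span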
In [43]:
# Merge the profit_ratio table with itself on borough (so the _x and _y columns below are identical).
profit_ratio.reset_index().merge(profit_ratio.reset_index(),how='outer',on='borough')
Out[43]:
borough profit_ratio_x profit_ratio_y
0 Hackney 0.075696 0.075696
1 Waltham Forest 0.073902 0.073902
2 Lewisham 0.070651 0.070651
3 Southwark 0.069928 0.069928
4 Newham 0.069498 0.069498
5 Haringey 0.066576 0.066576
6 Lambeth 0.065516 0.065516
7 Barking & Dagenham 0.065203 0.065203
8 Merton 0.064686 0.064686
9 Brent 0.064411 0.064411
10 Westminster 0.064393 0.064393
11 Islington 0.064052 0.064052
12 Greenwich 0.063723 0.063723
13 Tower Hamlets 0.063379 0.063379
14 LONDON 0.062962 0.062962
15 Wandsworth 0.062514 0.062514
16 Redbridge 0.061924 0.061924
17 Camden 0.061692 0.061692
18 Outer London 0.060892 0.060892
19 Kensington & Chelsea 0.060877 0.060877
20 Enfield 0.059800 0.059800
21 Ealing 0.059728 0.059728
22 Havering 0.059590 0.059590
23 Bexley 0.059580 0.059580
24 Barnet 0.058939 0.058939
25 Croydon 0.058534 0.058534
26 Hillingdon 0.058532 0.058532
27 Kingston upon Thames 0.058177 0.058177
28 Hounslow 0.057962 0.057962
29 Bromley 0.057734 0.057734
30 Sutton 0.057709 0.057709
31 Richmond upon Thames 0.057635 0.057635
32 Hammersmith & Fulham 0.057051 0.057051
33 England 0.056776 0.056776
34 Harrow 0.056623 0.056623
35 WEST MIDLANDS 0.052952 0.052952
36 YORKS & THE HUMBER 0.052943 0.052943