Data Wrangling

Last updated on Feb 27, 2022

import numpy as np
import pandas as pd

Setup

For the scope of this tutorial we are going to use AirBnb Scraped data for the city of Bologna. The data is freely available at Inside AirBnb: http://insideairbnb.com/get-the-data.html.

A description of all variables in all datasets is avaliable here.

We are going to use 2 datasets:

  • listing dataset: contains listing-level information
  • pricing dataset: contains pricing data, over time
# Import listings data
url_listings = "http://data.insideairbnb.com/italy/emilia-romagna/bologna/2021-12-17/visualisations/listings.csv"
df_listings = pd.read_csv(url_listings)

# Import pricing data
url_prices = "http://data.insideairbnb.com/italy/emilia-romagna/bologna/2021-12-17/data/calendar.csv.gz"
df_prices = pd.read_csv(url_prices, compression="gzip")

Editing the Dataframe

You can sort the data using the sort_values function.

Options

  • ascending: bool or list of bool, default True
  • na_position: {‘first’, ‘last’}, default ‘last’
df_listings.sort_values(by=['name', 'price'], 
                        ascending=[False, True], 
                        na_position='last').head()

id name host_id host_name neighbourhood_group neighbourhood latitude longitude room_type price minimum_nights number_of_reviews last_review reviews_per_month calculated_host_listings_count availability_365 number_of_reviews_ltm license
2280 38601411 🏡Giardino di Annabella-relax in città-casa intera 240803020 Annabella NaN Porto - Saragozza 44.49303 11.31986 Entire home/apt 90 2 53 2021-12-13 1.96 1 76 27 392901
2988 48177313 ❤ Romantic Suite with SPA Bath ❤ 4starbologna.com 239491712 4 Star Bologna NaN Santo Stefano 44.50271 11.34998 Entire home/apt 309 1 1 2021-03-14 0.11 14 344 1 NaN
3302 52367336 ✨House of Alchemy✨ 140013413 Greta NaN Porto - Saragozza 44.49072 11.30890 Entire home/apt 96 2 7 2021-11-28 3.18 1 88 7 NaN
2039 34495335 ♥ Romantic for Couple in Love ♥ | 4 Star Boutique 239491712 4 Star Bologna NaN Santo Stefano 44.50368 11.34972 Entire home/apt 143 1 25 2021-08-20 0.79 14 262 6 NaN
2964 47866124 ♡Amazing Suite with Private SPA ♡ 4starbologna... 239491712 4 Star Bologna NaN Santo Stefano 44.50381 11.34951 Entire home/apt 347 1 2 2021-10-17 0.72 14 337 2 NaN

You can remane columns using the rename() function. It takes a dictionary as column argument in the form {"old_name": "new_name"}.

df_listings.rename(columns={'name': 'listing_name', 
                            'id': 'listing_id'}).head()

listing_id listing_name host_id host_name neighbourhood_group neighbourhood latitude longitude room_type price minimum_nights number_of_reviews last_review reviews_per_month calculated_host_listings_count availability_365 number_of_reviews_ltm license
0 42196 50 sm Studio in the historic centre 184487 Carlo NaN Santo Stefano 44.48507 11.34786 Entire home/apt 68 3 180 2021-11-12 1.32 1 161 6 NaN
1 46352 A room in Pasolini's house 467810 Eleonora NaN Porto - Saragozza 44.49168 11.33514 Private room 29 1 300 2021-11-30 2.20 2 248 37 NaN
2 59697 COZY LARGE BEDROOM in the city center 286688 Paolo NaN Santo Stefano 44.48817 11.34124 Private room 50 1 240 2020-10-04 2.18 2 327 0 NaN
3 85368 Garden House Bologna 467675 Anna Maria NaN Santo Stefano 44.47834 11.35672 Entire home/apt 126 2 40 2019-11-03 0.34 1 332 0 NaN
4 145779 SINGLE ROOM 705535 Valerio NaN Porto - Saragozza 44.49306 11.33786 Private room 50 10 69 2021-12-05 0.55 9 365 5 NaN

Aggregating

We can compute statistics by group using groupby().

df_listings.groupby('neighbourhood')[['price', 'reviews_per_month']].mean()

price reviews_per_month
neighbourhood
Borgo Panigale - Reno 83.020548 0.983488
Navile 142.200993 1.156745
Porto - Saragozza 129.908312 1.340325
San Donato - San Vitale 91.618138 0.933011
Santo Stefano 119.441841 1.344810
Savena 69.626016 0.805888

If you want to perform more than one function, maybe on different columns, you can use aggregate() which can be shortened to agg(). The sintax is agg(output_var = ("input_var", function)) and it accepts also numpy functions.

df_listings.groupby('neighbourhood').agg(mean_reviews=("reviews_per_month", "mean"),
                                         min_price=("price", "min"),
                                         max_price=("price", np.max)).reset_index()

neighbourhood mean_reviews min_price max_price
0 Borgo Panigale - Reno 0.983488 9 1429
1 Navile 1.156745 14 5000
2 Porto - Saragozza 1.340325 7 9999
3 San Donato - San Vitale 0.933011 10 1600
4 Santo Stefano 1.344810 11 9999
5 Savena 0.805888 9 680

If we want to build a new column by group, we can use transform() on the grouped data. Unfortunately, it does not work as nicely as aggregate() and we have to do one column at the time.

df_listings.groupby('neighbourhood')[['price', 'reviews_per_month']].transform('mean')

price reviews_per_month
0 119.441841 1.344810
1 129.908312 1.340325
2 119.441841 1.344810
3 119.441841 1.344810
4 129.908312 1.340325
... ... ...
3448 142.200993 1.156745
3449 129.908312 1.340325
3450 142.200993 1.156745
3451 119.441841 1.344810
3452 129.908312 1.340325

3453 rows × 2 columns

Reshaping


Combining Datasets

We can concatenate datasets using pd.concat(). It takes as argument a list of dataframes. By default, pd.concat() performs the outer join. We can change it using the join option.

df_listings1 = df_listings[:2000]
np.shape(df_listings1)
(2000, 18)
df_listings2 = df_listings[1000:]
np.shape(df_listings2)
(2453, 18)
np.shape(
    pd.concat([df_listings1, df_listings2], join='inner')
)
(4453, 18)

To instead merge dataframes, we can use the pd.merge function.

Options

  • how: {‘left’, ‘right’, ‘outer’, ‘inner’, ‘cross’}, default ‘inner’
  • on: label or list
pd.merge(df_listings, df_prices, left_on='id', right_on='listing_id', how='inner')

id name host_id host_name neighbourhood_group neighbourhood latitude longitude room_type price_x ... availability_365 number_of_reviews_ltm license listing_id date available price_y adjusted_price minimum_nights_y maximum_nights
0 42196 50 sm Studio in the historic centre 184487 Carlo NaN Santo Stefano 44.485070 11.347860 Entire home/apt 68 ... 161 6 NaN 42196 2021-12-17 f $68.00 $68.00 3 360
1 42196 50 sm Studio in the historic centre 184487 Carlo NaN Santo Stefano 44.485070 11.347860 Entire home/apt 68 ... 161 6 NaN 42196 2021-12-18 f $68.00 $68.00 3 360
2 42196 50 sm Studio in the historic centre 184487 Carlo NaN Santo Stefano 44.485070 11.347860 Entire home/apt 68 ... 161 6 NaN 42196 2021-12-19 f $68.00 $68.00 3 360
3 42196 50 sm Studio in the historic centre 184487 Carlo NaN Santo Stefano 44.485070 11.347860 Entire home/apt 68 ... 161 6 NaN 42196 2021-12-20 f $68.00 $68.00 3 360
4 42196 50 sm Studio in the historic centre 184487 Carlo NaN Santo Stefano 44.485070 11.347860 Entire home/apt 68 ... 161 6 NaN 42196 2021-12-21 f $68.00 $68.00 3 360
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1260340 53854962 A pochi passi dal Mercato delle Erbe by Wonder... 13036400 Wonderful Italy NaN Porto - Saragozza 44.497456 11.337492 Entire home/apt 115 ... 98 0 NaN 53854962 2022-12-12 f $115.00 $115.00 1 1125
1260341 53854962 A pochi passi dal Mercato delle Erbe by Wonder... 13036400 Wonderful Italy NaN Porto - Saragozza 44.497456 11.337492 Entire home/apt 115 ... 98 0 NaN 53854962 2022-12-13 f $115.00 $115.00 1 1125
1260342 53854962 A pochi passi dal Mercato delle Erbe by Wonder... 13036400 Wonderful Italy NaN Porto - Saragozza 44.497456 11.337492 Entire home/apt 115 ... 98 0 NaN 53854962 2022-12-14 f $115.00 $115.00 1 1125
1260343 53854962 A pochi passi dal Mercato delle Erbe by Wonder... 13036400 Wonderful Italy NaN Porto - Saragozza 44.497456 11.337492 Entire home/apt 115 ... 98 0 NaN 53854962 2022-12-15 f $115.00 $115.00 1 1125
1260344 53854962 A pochi passi dal Mercato delle Erbe by Wonder... 13036400 Wonderful Italy NaN Porto - Saragozza 44.497456 11.337492 Entire home/apt 115 ... 98 0 NaN 53854962 2022-12-16 f $115.00 $115.00 1 1125

1260345 rows × 25 columns

As you can see, since the variable price was present in both datasets, we now have a price.x and a price.y.

Previous
Next