Data Wrangling

Last updated on May 1, 2022

Setup

For the scope of this tutorial we are going to use AirBnb Scraped data for the city of Bologna. The data is freely available at Inside AirBnb: http://insideairbnb.com/get-the-data.html.

A description of all variables in all datasets is avaliable here.

We are going to use 2 datasets:

  • listing dataset: contains listing-level information
  • pricing dataset: contains pricing data, over time
import numpy as np
import pandas as pd
# Import listings data
url_listings = "http://data.insideairbnb.com/italy/emilia-romagna/bologna/2021-12-17/visualisations/listings.csv"
df_listings = pd.read_csv(url_listings)

# Import pricing data
url_prices = "http://data.insideairbnb.com/italy/emilia-romagna/bologna/2021-12-17/data/calendar.csv.gz"
df_prices = pd.read_csv(url_prices, compression="gzip")

Sorting and Renaming

You can sort the data using the sort_values function.

Options

  • ascending: bool or list of bool, default True
  • na_position: {‘first’, ‘last’}, default ‘last’
df_listings.sort_values(by=['name', 'price'], 
                        ascending=[False, True], 
                        na_position='last').head()

id name host_id host_name neighbourhood_group neighbourhood latitude longitude room_type price minimum_nights number_of_reviews last_review reviews_per_month calculated_host_listings_count availability_365 number_of_reviews_ltm license
2280 38601411 🏡Giardino di Annabella-relax in città-casa intera 240803020 Annabella NaN Porto - Saragozza 44.49303 11.31986 Entire home/apt 90 2 53 2021-12-13 1.96 1 76 27 392901
2988 48177313 ❤ Romantic Suite with SPA Bath ❤ 4starbologna.com 239491712 4 Star Bologna NaN Santo Stefano 44.50271 11.34998 Entire home/apt 309 1 1 2021-03-14 0.11 14 344 1 NaN
3302 52367336 ✨House of Alchemy✨ 140013413 Greta NaN Porto - Saragozza 44.49072 11.30890 Entire home/apt 96 2 7 2021-11-28 3.18 1 88 7 NaN
2039 34495335 ♥ Romantic for Couple in Love ♥ | 4 Star Boutique 239491712 4 Star Bologna NaN Santo Stefano 44.50368 11.34972 Entire home/apt 143 1 25 2021-08-20 0.79 14 262 6 NaN
2964 47866124 ♡Amazing Suite with Private SPA ♡ 4starbologna... 239491712 4 Star Bologna NaN Santo Stefano 44.50381 11.34951 Entire home/apt 347 1 2 2021-10-17 0.72 14 337 2 NaN

You can remane columns using the rename() function. It takes a dictionary as column argument in the form {"old_name": "new_name"}.

df_listings.rename(columns={'name': 'listing_name', 
                            'id': 'listing_id'}).head()

listing_id listing_name host_id host_name neighbourhood_group neighbourhood latitude longitude room_type price minimum_nights number_of_reviews last_review reviews_per_month calculated_host_listings_count availability_365 number_of_reviews_ltm license
0 42196 50 sm Studio in the historic centre 184487 Carlo NaN Santo Stefano 44.48507 11.34786 Entire home/apt 68 3 180 2021-11-12 1.32 1 161 6 NaN
1 46352 A room in Pasolini's house 467810 Eleonora NaN Porto - Saragozza 44.49168 11.33514 Private room 29 1 300 2021-11-30 2.20 2 248 37 NaN
2 59697 COZY LARGE BEDROOM in the city center 286688 Paolo NaN Santo Stefano 44.48817 11.34124 Private room 50 1 240 2020-10-04 2.18 2 327 0 NaN
3 85368 Garden House Bologna 467675 Anna Maria NaN Santo Stefano 44.47834 11.35672 Entire home/apt 126 2 40 2019-11-03 0.34 1 332 0 NaN
4 145779 SINGLE ROOM 705535 Valerio NaN Porto - Saragozza 44.49306 11.33786 Private room 50 10 69 2021-12-05 0.55 9 365 5 NaN

Aggregating

If we want to count observations across 2 categorical variables, we can use pd.crosstab().

pd.crosstab(df_listings['neighbourhood'], df_listings['room_type'])

room_type Entire home/apt Hotel room Private room Shared room
neighbourhood
Borgo Panigale - Reno 107 0 39 0
Navile 250 3 149 1
Porto - Saragozza 842 16 299 10
San Donato - San Vitale 280 1 134 4
Santo Stefano 924 29 237 5
Savena 73 0 48 2

We can compute statistics by group using groupby().

df_listings.groupby('neighbourhood')[['price', 'reviews_per_month']].mean()

price reviews_per_month
neighbourhood
Borgo Panigale - Reno 83.020548 0.983488
Navile 142.200993 1.156745
Porto - Saragozza 129.908312 1.340325
San Donato - San Vitale 91.618138 0.933011
Santo Stefano 119.441841 1.344810
Savena 69.626016 0.805888

If you want to perform more than one function, maybe on different columns, you can use aggregate() which can be shortened to agg(). The sintax is agg(output_var = ("input_var", function)) and it accepts also numpy functions.

df_listings.groupby('neighbourhood').agg(mean_reviews=("reviews_per_month", "mean"),
                                         min_price=("price", "min"),
                                         max_price=("price", np.max)).reset_index()

neighbourhood mean_reviews min_price max_price
0 Borgo Panigale - Reno 0.983488 9 1429
1 Navile 1.156745 14 5000
2 Porto - Saragozza 1.340325 7 9999
3 San Donato - San Vitale 0.933011 10 1600
4 Santo Stefano 1.344810 11 9999
5 Savena 0.805888 9 680

If we want to build a new column by group, we can use transform() on the grouped data. Unfortunately, it does not work as nicely as aggregate() and we have to do one column at the time.

df_listings.groupby('neighbourhood')[['price', 'reviews_per_month']].transform('mean').head()

price reviews_per_month
0 119.441841 1.344810
1 129.908312 1.340325
2 119.441841 1.344810
3 119.441841 1.344810
4 129.908312 1.340325

Combining Datasets

We can concatenate datasets using pd.concat(). It takes as argument a list of dataframes. By default, pd.concat() performs the outer join. We can change it using the join option (in this case, it makes no difference).

df_listings1 = df_listings[:2000]
np.shape(df_listings1)
(2000, 18)
df_listings2 = df_listings[1000:]
np.shape(df_listings2)
(2453, 18)
np.shape(
    pd.concat([df_listings1, df_listings2])
)
(4453, 18)

To instead merge dataframes, we can use the pd.merge function.

Options

  • how: {‘left’, ‘right’, ‘outer’, ‘inner’, ‘cross’}, default ‘inner’
  • on: label or list
df_merged = pd.merge(df_listings, df_prices, left_on='id', right_on='listing_id', how='inner')
df_merged.head()

id name host_id host_name neighbourhood_group neighbourhood latitude longitude room_type price_x ... availability_365 number_of_reviews_ltm license listing_id date available price_y adjusted_price minimum_nights_y maximum_nights
0 42196 50 sm Studio in the historic centre 184487 Carlo NaN Santo Stefano 44.48507 11.34786 Entire home/apt 68 ... 161 6 NaN 42196 2021-12-17 f $68.00 $68.00 3 360
1 42196 50 sm Studio in the historic centre 184487 Carlo NaN Santo Stefano 44.48507 11.34786 Entire home/apt 68 ... 161 6 NaN 42196 2021-12-18 f $68.00 $68.00 3 360
2 42196 50 sm Studio in the historic centre 184487 Carlo NaN Santo Stefano 44.48507 11.34786 Entire home/apt 68 ... 161 6 NaN 42196 2021-12-19 f $68.00 $68.00 3 360
3 42196 50 sm Studio in the historic centre 184487 Carlo NaN Santo Stefano 44.48507 11.34786 Entire home/apt 68 ... 161 6 NaN 42196 2021-12-20 f $68.00 $68.00 3 360
4 42196 50 sm Studio in the historic centre 184487 Carlo NaN Santo Stefano 44.48507 11.34786 Entire home/apt 68 ... 161 6 NaN 42196 2021-12-21 f $68.00 $68.00 3 360

5 rows × 25 columns

As you can see, since the variable price was present in both datasets, we now have a price.x and a price.y.

Reshaping

First, let’s compute average prices by neighbourhood and date using the merged dataset.

df_long = df_merged.groupby(['neighbourhood', 'date'])['price_x'].agg('mean').reset_index()
df_long.head()

neighbourhood date price_x
0 Borgo Panigale - Reno 2021-12-17 83.020548
1 Borgo Panigale - Reno 2021-12-18 83.020548
2 Borgo Panigale - Reno 2021-12-19 83.020548
3 Borgo Panigale - Reno 2021-12-20 83.020548
4 Borgo Panigale - Reno 2021-12-21 83.020548

This is what is called long format since it has one or more variables (price_x in this case) stacked vertically along a categorical variable (neighborhood and date here), which acts as index.

The alternative is the wide format where we have one separate column for each neighborhood.

We can reshape the dataset from long to wide using the pd.pivot() command. d

df_wide = pd.pivot(data=df_long, index='date', columns='neighbourhood').reset_index()
df_wide.head()

date price_x
neighbourhood Borgo Panigale - Reno Navile Porto - Saragozza San Donato - San Vitale Santo Stefano Savena
0 2021-12-17 83.020548 142.200993 129.908312 91.618138 119.441841 69.626016
1 2021-12-18 83.020548 142.200993 129.908312 91.618138 119.441841 69.626016
2 2021-12-19 83.020548 142.200993 129.908312 91.618138 119.441841 69.626016
3 2021-12-20 83.020548 142.200993 129.908312 91.618138 119.441841 69.626016
4 2021-12-21 83.020548 142.200993 129.908312 91.618138 119.441841 69.626016

We can reshape the dataset from wide to long using the pd.melt() command. It takes the following arguments

  • data: the dataframe
  • id_vars: the variable that was indexing the old dataset
pd.melt(df_wide, id_vars='date', value_name='price').head()

date None neighbourhood price
0 2021-12-17 price_x Borgo Panigale - Reno 83.020548
1 2021-12-18 price_x Borgo Panigale - Reno 83.020548
2 2021-12-19 price_x Borgo Panigale - Reno 83.020548
3 2021-12-20 price_x Borgo Panigale - Reno 83.020548
4 2021-12-21 price_x Borgo Panigale - Reno 83.020548

If we do not have MultiIndex columns, but just a common prefix, we can reshape the dataset from wide to long using the pd.wide_to_long() command.

df_wide2 = df_wide.copy()
df_wide2.columns = [''.join(col) for col in df_wide2.columns]
df_wide2.head()

date price_xBorgo Panigale - Reno price_xNavile price_xPorto - Saragozza price_xSan Donato - San Vitale price_xSanto Stefano price_xSavena
0 2021-12-17 83.020548 142.200993 129.908312 91.618138 119.441841 69.626016
1 2021-12-18 83.020548 142.200993 129.908312 91.618138 119.441841 69.626016
2 2021-12-19 83.020548 142.200993 129.908312 91.618138 119.441841 69.626016
3 2021-12-20 83.020548 142.200993 129.908312 91.618138 119.441841 69.626016
4 2021-12-21 83.020548 142.200993 129.908312 91.618138 119.441841 69.626016

The pd.wide_to_long() command takes the following arguments

  • data: the dataframe
  • stubnames: the prefixes of the variables that we want to reshape into one
  • i: the variable that was indexing the old dataset
  • j: the name of the new categorical variable that we extract from stubnames
  • suffix: regular expression of the suffix, the default is \d+, i.e. digits
pd.wide_to_long(df_wide2, stubnames='price_x', i='date', j='neighborhood', suffix='\D+').head()

price_x
date neighborhood
2021-12-17 Borgo Panigale - Reno 83.020548
2021-12-18 Borgo Panigale - Reno 83.020548
2021-12-19 Borgo Panigale - Reno 83.020548
2021-12-20 Borgo Panigale - Reno 83.020548
2021-12-21 Borgo Panigale - Reno 83.020548

Note that we had to change the suffix to \D+, i.e. not digits.

Window Functions

Methods

  • shift()
  • expanding()
  • rolling()

When we have time series data, we might want to do operations across time. First, let’s aggregate the df_price dataset at the year-month level.

temp = df_prices.copy()
temp['price'] = temp['price'].str.replace('[$|,]', '', regex=True).astype(float)
temp['date'] = pd.to_datetime(temp['date']).dt.to_period('M')
temp = temp.groupby(['listing_id', 'date'])['price'].mean().reset_index()\
    .sort_values(by=['listing_id', 'date'], ascending=[False, True])
temp.head()

listing_id date price
44876 53854962 2021-12 147.400000
44877 53854962 2022-01 137.645161
44878 53854962 2022-02 124.642857
44879 53854962 2022-03 285.096774
44880 53854962 2022-04 115.000000

We can lead or lag one variable using shift().

temp['price1'] = temp['price'].shift(1)
temp.head(15)

listing_id date price price1
44876 53854962 2021-12 147.400000 NaN
44877 53854962 2022-01 137.645161 147.400000
44878 53854962 2022-02 124.642857 137.645161
44879 53854962 2022-03 285.096774 124.642857
44880 53854962 2022-04 115.000000 285.096774
44881 53854962 2022-05 115.000000 115.000000
44882 53854962 2022-06 115.000000 115.000000
44883 53854962 2022-07 115.000000 115.000000
44884 53854962 2022-08 115.000000 115.000000
44885 53854962 2022-09 115.000000 115.000000
44886 53854962 2022-10 115.000000 115.000000
44887 53854962 2022-11 115.000000 115.000000
44888 53854962 2022-12 115.000000 115.000000
44863 53837654 2021-12 184.133333 115.000000
44864 53837654 2022-01 148.741935 184.133333

If we want to lead or lag a variable within a group, we can combine shift() with groupby()

temp['price1'] = temp.groupby('listing_id')['price'].shift(1)
temp.head(15)

listing_id date price price1
44876 53854962 2021-12 147.400000 NaN
44877 53854962 2022-01 137.645161 147.400000
44878 53854962 2022-02 124.642857 137.645161
44879 53854962 2022-03 285.096774 124.642857
44880 53854962 2022-04 115.000000 285.096774
44881 53854962 2022-05 115.000000 115.000000
44882 53854962 2022-06 115.000000 115.000000
44883 53854962 2022-07 115.000000 115.000000
44884 53854962 2022-08 115.000000 115.000000
44885 53854962 2022-09 115.000000 115.000000
44886 53854962 2022-10 115.000000 115.000000
44887 53854962 2022-11 115.000000 115.000000
44888 53854962 2022-12 115.000000 115.000000
44863 53837654 2021-12 184.133333 NaN
44864 53837654 2022-01 148.741935 184.133333

We can perform cumulative operations using the expanding() function

temp['avg_cum_price'] = temp['price'].expanding().mean()
temp.head(15)

listing_id date price price1 avg_cum_price
44876 53854962 2021-12 147.400000 NaN 147.400000
44877 53854962 2022-01 137.645161 147.400000 142.522581
44878 53854962 2022-02 124.642857 137.645161 136.562673
44879 53854962 2022-03 285.096774 124.642857 173.696198
44880 53854962 2022-04 115.000000 285.096774 161.956959
44881 53854962 2022-05 115.000000 115.000000 154.130799
44882 53854962 2022-06 115.000000 115.000000 148.540685
44883 53854962 2022-07 115.000000 115.000000 144.348099
44884 53854962 2022-08 115.000000 115.000000 141.087199
44885 53854962 2022-09 115.000000 115.000000 138.478479
44886 53854962 2022-10 115.000000 115.000000 136.344072
44887 53854962 2022-11 115.000000 115.000000 134.565399
44888 53854962 2022-12 115.000000 115.000000 133.060369
44863 53837654 2021-12 184.133333 NaN 136.708438
44864 53837654 2022-01 148.741935 184.133333 137.510671

To perform cumulative operations within a group, we can combine expanding() with groupby(). Since groups with not enough observations get dropped, we need to merge the dataset back.

temp.groupby('listing_id')['price'].expanding().mean().reset_index(level=0).head(15)

listing_id price
0 42196 68.000000
1 42196 68.000000
2 42196 68.000000
3 42196 68.000000
4 42196 68.000000
5 42196 68.000000
6 42196 68.000000
7 42196 68.000000
8 42196 68.000000
9 42196 68.000000
10 42196 68.000000
11 42196 68.000000
12 42196 68.000000
13 46352 29.333333
14 46352 29.311828

If we want to perform an operation over a rolling window, we can use the rolling() function

temp['avg3_price'] = temp['price'].rolling(3).mean()
temp.head(15)

listing_id date price price1 avg_cum_price avg3_price
44876 53854962 2021-12 147.400000 NaN 147.400000 NaN
44877 53854962 2022-01 137.645161 147.400000 142.522581 NaN
44878 53854962 2022-02 124.642857 137.645161 136.562673 136.562673
44879 53854962 2022-03 285.096774 124.642857 173.696198 182.461598
44880 53854962 2022-04 115.000000 285.096774 161.956959 174.913210
44881 53854962 2022-05 115.000000 115.000000 154.130799 171.698925
44882 53854962 2022-06 115.000000 115.000000 148.540685 115.000000
44883 53854962 2022-07 115.000000 115.000000 144.348099 115.000000
44884 53854962 2022-08 115.000000 115.000000 141.087199 115.000000
44885 53854962 2022-09 115.000000 115.000000 138.478479 115.000000
44886 53854962 2022-10 115.000000 115.000000 136.344072 115.000000
44887 53854962 2022-11 115.000000 115.000000 134.565399 115.000000
44888 53854962 2022-12 115.000000 115.000000 133.060369 115.000000
44863 53837654 2021-12 184.133333 NaN 136.708438 138.044444
44864 53837654 2022-01 148.741935 184.133333 137.510671 149.291756
Previous
Next