Data Wrangling
Last updated on May 1, 2022
Setup
For this tutorial we are going to use Airbnb scraped data for the city of Bologna. The data is freely available at Inside AirBnb: http://insideairbnb.com/get-the-data.html.
A description of all variables in all datasets is available here.
We are going to use 2 datasets:
- listing dataset: contains listing-level information
- pricing dataset: contains pricing data, over time
import numpy as np
import pandas as pd
# Import listings data
url_listings = "http://data.insideairbnb.com/italy/emilia-romagna/bologna/2021-12-17/visualisations/listings.csv"
df_listings = pd.read_csv(url_listings)
# Import pricing data
url_prices = "http://data.insideairbnb.com/italy/emilia-romagna/bologna/2021-12-17/data/calendar.csv.gz"
df_prices = pd.read_csv(url_prices, compression="gzip")
Sorting and Renaming
You can sort the data using the sort_values() function.
Options
ascending: bool or list of bool, default True
na_position: {‘first’, ‘last’}, default ‘last’
df_listings.sort_values(by=['name', 'price'],
ascending=[False, True],
na_position='last').head()
id | name | host_id | host_name | neighbourhood_group | neighbourhood | latitude | longitude | room_type | price | minimum_nights | number_of_reviews | last_review | reviews_per_month | calculated_host_listings_count | availability_365 | number_of_reviews_ltm | license | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2280 | 38601411 | 🏡Giardino di Annabella-relax in città-casa intera | 240803020 | Annabella | NaN | Porto - Saragozza | 44.49303 | 11.31986 | Entire home/apt | 90 | 2 | 53 | 2021-12-13 | 1.96 | 1 | 76 | 27 | 392901 |
2988 | 48177313 | ❤ Romantic Suite with SPA Bath ❤ 4starbologna.com | 239491712 | 4 Star Bologna | NaN | Santo Stefano | 44.50271 | 11.34998 | Entire home/apt | 309 | 1 | 1 | 2021-03-14 | 0.11 | 14 | 344 | 1 | NaN |
3302 | 52367336 | ✨House of Alchemy✨ | 140013413 | Greta | NaN | Porto - Saragozza | 44.49072 | 11.30890 | Entire home/apt | 96 | 2 | 7 | 2021-11-28 | 3.18 | 1 | 88 | 7 | NaN |
2039 | 34495335 | ♥ Romantic for Couple in Love ♥ \| 4 Star Boutique | 239491712 | 4 Star Bologna | NaN | Santo Stefano | 44.50368 | 11.34972 | Entire home/apt | 143 | 1 | 25 | 2021-08-20 | 0.79 | 14 | 262 | 6 | NaN |
2964 | 47866124 | ♡Amazing Suite with Private SPA ♡ 4starbologna... | 239491712 | 4 Star Bologna | NaN | Santo Stefano | 44.50381 | 11.34951 | Entire home/apt | 347 | 1 | 2 | 2021-10-17 | 0.72 | 14 | 337 | 2 | NaN |
You can rename columns using the rename() function. It takes a dictionary as the columns argument, in the form {"old_name": "new_name"}.
df_listings.rename(columns={'name': 'listing_name',
'id': 'listing_id'}).head()
listing_id | listing_name | host_id | host_name | neighbourhood_group | neighbourhood | latitude | longitude | room_type | price | minimum_nights | number_of_reviews | last_review | reviews_per_month | calculated_host_listings_count | availability_365 | number_of_reviews_ltm | license | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 42196 | 50 sm Studio in the historic centre | 184487 | Carlo | NaN | Santo Stefano | 44.48507 | 11.34786 | Entire home/apt | 68 | 3 | 180 | 2021-11-12 | 1.32 | 1 | 161 | 6 | NaN |
1 | 46352 | A room in Pasolini's house | 467810 | Eleonora | NaN | Porto - Saragozza | 44.49168 | 11.33514 | Private room | 29 | 1 | 300 | 2021-11-30 | 2.20 | 2 | 248 | 37 | NaN |
2 | 59697 | COZY LARGE BEDROOM in the city center | 286688 | Paolo | NaN | Santo Stefano | 44.48817 | 11.34124 | Private room | 50 | 1 | 240 | 2020-10-04 | 2.18 | 2 | 327 | 0 | NaN |
3 | 85368 | Garden House Bologna | 467675 | Anna Maria | NaN | Santo Stefano | 44.47834 | 11.35672 | Entire home/apt | 126 | 2 | 40 | 2019-11-03 | 0.34 | 1 | 332 | 0 | NaN |
4 | 145779 | SINGLE ROOM | 705535 | Valerio | NaN | Porto - Saragozza | 44.49306 | 11.33786 | Private room | 50 | 10 | 69 | 2021-12-05 | 0.55 | 9 | 365 | 5 | NaN |
Aggregating
If we want to count observations across 2 categorical variables, we can use pd.crosstab().
pd.crosstab(df_listings['neighbourhood'], df_listings['room_type'])
room_type | Entire home/apt | Hotel room | Private room | Shared room |
---|---|---|---|---|
neighbourhood | ||||
Borgo Panigale - Reno | 107 | 0 | 39 | 0 |
Navile | 250 | 3 | 149 | 1 |
Porto - Saragozza | 842 | 16 | 299 | 10 |
San Donato - San Vitale | 280 | 1 | 134 | 4 |
Santo Stefano | 924 | 29 | 237 | 5 |
Savena | 73 | 0 | 48 | 2 |
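If we want shares instead of raw counts, pd.crosstab() also accepts a normalize argument ('index', 'columns', or 'all'). A minimal sketch, normalizing within each neighbourhood:
# Share of each room type within each neighbourhood (rows sum to 1)
pd.crosstab(df_listings['neighbourhood'], df_listings['room_type'], normalize='index')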
We can compute statistics by group using groupby().
df_listings.groupby('neighbourhood')[['price', 'reviews_per_month']].mean()
price | reviews_per_month | |
---|---|---|
neighbourhood | ||
Borgo Panigale - Reno | 83.020548 | 0.983488 |
Navile | 142.200993 | 1.156745 |
Porto - Saragozza | 129.908312 | 1.340325 |
San Donato - San Vitale | 91.618138 | 0.933011 |
Santo Stefano | 119.441841 | 1.344810 |
Savena | 69.626016 | 0.805888 |
If you want to apply more than one function, possibly to different columns, you can use aggregate(), which can be shortened to agg(). The syntax is agg(output_var=("input_var", function)) and it also accepts numpy functions.
df_listings.groupby('neighbourhood').agg(mean_reviews=("reviews_per_month", "mean"),
min_price=("price", "min"),
max_price=("price", np.max)).reset_index()
neighbourhood | mean_reviews | min_price | max_price | |
---|---|---|---|---|
0 | Borgo Panigale - Reno | 0.983488 | 9 | 1429 |
1 | Navile | 1.156745 | 14 | 5000 |
2 | Porto - Saragozza | 1.340325 | 7 | 9999 |
3 | San Donato - San Vitale | 0.933011 | 10 | 1600 |
4 | Santo Stefano | 1.344810 | 11 | 9999 |
5 | Savena | 0.805888 | 9 | 680 |
If we want to build a new column by group, we can use transform() on the grouped data. Unfortunately, it is not as flexible as aggregate(): we can only apply one function at a time, and we cannot name the output columns.
df_listings.groupby('neighbourhood')[['price', 'reviews_per_month']].transform('mean').head()
price | reviews_per_month | |
---|---|---|
0 | 119.441841 | 1.344810 |
1 | 129.908312 | 1.340325 |
2 | 119.441841 | 1.344810 |
3 | 119.441841 | 1.344810 |
4 | 129.908312 | 1.340325 |
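To actually attach the group average as a new column, we can assign the transformed series back. A minimal sketch using assign() so the original dataframe is left untouched (the column name mean_price_nbhd is our choice):
# Attach the neighbourhood-level average price as a new column, on a copy
df_listings.assign(
    mean_price_nbhd=df_listings.groupby('neighbourhood')['price'].transform('mean')
).head()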
Combining Datasets
We can concatenate datasets using pd.concat(). It takes as argument a list of dataframes. By default, pd.concat() performs an outer join. We can change this with the join option (in this case, it makes no difference, as the sketch below shows).
df_listings1 = df_listings[:2000]
np.shape(df_listings1)
(2000, 18)
df_listings2 = df_listings[1000:]
np.shape(df_listings2)
(2453, 18)
np.shape(
pd.concat([df_listings1, df_listings2])
)
(4453, 18)
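As a sketch, with join='inner' only the columns present in all dataframes are kept; since both slices share the same 18 columns, the shape is unchanged here:
# Inner join keeps only common columns (identical here, so no effect)
np.shape(pd.concat([df_listings1, df_listings2], join='inner'))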
To merge dataframes instead, we can use the pd.merge() function.
Options
how: {‘left’, ‘right’, ‘outer’, ‘inner’, ‘cross’}, default ‘inner’
on: label or list
df_merged = pd.merge(df_listings, df_prices, left_on='id', right_on='listing_id', how='inner')
df_merged.head()
id | name | host_id | host_name | neighbourhood_group | neighbourhood | latitude | longitude | room_type | price_x | ... | availability_365 | number_of_reviews_ltm | license | listing_id | date | available | price_y | adjusted_price | minimum_nights_y | maximum_nights | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 42196 | 50 sm Studio in the historic centre | 184487 | Carlo | NaN | Santo Stefano | 44.48507 | 11.34786 | Entire home/apt | 68 | ... | 161 | 6 | NaN | 42196 | 2021-12-17 | f | $68.00 | $68.00 | 3 | 360 |
1 | 42196 | 50 sm Studio in the historic centre | 184487 | Carlo | NaN | Santo Stefano | 44.48507 | 11.34786 | Entire home/apt | 68 | ... | 161 | 6 | NaN | 42196 | 2021-12-18 | f | $68.00 | $68.00 | 3 | 360 |
2 | 42196 | 50 sm Studio in the historic centre | 184487 | Carlo | NaN | Santo Stefano | 44.48507 | 11.34786 | Entire home/apt | 68 | ... | 161 | 6 | NaN | 42196 | 2021-12-19 | f | $68.00 | $68.00 | 3 | 360 |
3 | 42196 | 50 sm Studio in the historic centre | 184487 | Carlo | NaN | Santo Stefano | 44.48507 | 11.34786 | Entire home/apt | 68 | ... | 161 | 6 | NaN | 42196 | 2021-12-20 | f | $68.00 | $68.00 | 3 | 360 |
4 | 42196 | 50 sm Studio in the historic centre | 184487 | Carlo | NaN | Santo Stefano | 44.48507 | 11.34786 | Entire home/apt | 68 | ... | 161 | 6 | NaN | 42196 | 2021-12-21 | f | $68.00 | $68.00 | 3 | 360 |
5 rows × 25 columns
As you can see, since the variable price was present in both datasets, we now have a price_x and a price_y column.
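If we want more informative names than the default _x and _y, pd.merge() accepts a suffixes argument. A minimal sketch (the variable and suffix names below are our choice):
# Control the suffixes used for overlapping column names
df_merged2 = pd.merge(df_listings, df_prices, left_on='id', right_on='listing_id',
                      how='inner', suffixes=('_listing', '_calendar'))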
Reshaping
First, let’s compute average prices by neighbourhood
and date
using the merged dataset.
df_long = df_merged.groupby(['neighbourhood', 'date'])['price_x'].agg('mean').reset_index()
df_long.head()
neighbourhood | date | price_x | |
---|---|---|---|
0 | Borgo Panigale - Reno | 2021-12-17 | 83.020548 |
1 | Borgo Panigale - Reno | 2021-12-18 | 83.020548 |
2 | Borgo Panigale - Reno | 2021-12-19 | 83.020548 |
3 | Borgo Panigale - Reno | 2021-12-20 | 83.020548 |
4 | Borgo Panigale - Reno | 2021-12-21 | 83.020548 |
This is what is called the long format, since it has one or more variables (price_x in this case) stacked vertically along one or more categorical variables (neighbourhood and date here), which act as the index.
The alternative is the wide format, where we have one separate column for each neighbourhood.
We can reshape the dataset from long to wide using the pd.pivot() command.
df_wide = pd.pivot(data=df_long, index='date', columns='neighbourhood').reset_index()
df_wide.head()
date | price_x | ||||||
---|---|---|---|---|---|---|---|
neighbourhood | Borgo Panigale - Reno | Navile | Porto - Saragozza | San Donato - San Vitale | Santo Stefano | Savena | |
0 | 2021-12-17 | 83.020548 | 142.200993 | 129.908312 | 91.618138 | 119.441841 | 69.626016 |
1 | 2021-12-18 | 83.020548 | 142.200993 | 129.908312 | 91.618138 | 119.441841 | 69.626016 |
2 | 2021-12-19 | 83.020548 | 142.200993 | 129.908312 | 91.618138 | 119.441841 | 69.626016 |
3 | 2021-12-20 | 83.020548 | 142.200993 | 129.908312 | 91.618138 | 119.441841 | 69.626016 |
4 | 2021-12-21 | 83.020548 | 142.200993 | 129.908312 | 91.618138 | 119.441841 | 69.626016 |
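Note that, since we did not specify which column holds the values, pivot produced MultiIndex columns (the price_x level in the header above). A minimal sketch that avoids this by passing values explicitly (the name df_wide_flat is our choice):
# Passing values= keeps flat column names instead of a MultiIndex
df_wide_flat = pd.pivot(data=df_long, index='date', columns='neighbourhood', values='price_x').reset_index()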
We can reshape the dataset from wide to long using the pd.melt() command. It takes the following arguments:
data: the dataframe
id_vars: the variable that was indexing the old dataset
pd.melt(df_wide, id_vars='date', value_name='price').head()
date | None | neighbourhood | price | |
---|---|---|---|---|
0 | 2021-12-17 | price_x | Borgo Panigale - Reno | 83.020548 |
1 | 2021-12-18 | price_x | Borgo Panigale - Reno | 83.020548 |
2 | 2021-12-19 | price_x | Borgo Panigale - Reno | 83.020548 |
3 | 2021-12-20 | price_x | Borgo Panigale - Reno | 83.020548 |
4 | 2021-12-21 | price_x | Borgo Panigale - Reno | 83.020548 |
The extra column labelled None above comes from the unnamed first level of the MultiIndex columns. If we do not have MultiIndex columns, but just a common prefix, we can reshape the dataset from wide to long using the pd.wide_to_long() command.
df_wide2 = df_wide.copy()
df_wide2.columns = [''.join(col) for col in df_wide2.columns]
df_wide2.head()
date | price_xBorgo Panigale - Reno | price_xNavile | price_xPorto - Saragozza | price_xSan Donato - San Vitale | price_xSanto Stefano | price_xSavena | |
---|---|---|---|---|---|---|---|
0 | 2021-12-17 | 83.020548 | 142.200993 | 129.908312 | 91.618138 | 119.441841 | 69.626016 |
1 | 2021-12-18 | 83.020548 | 142.200993 | 129.908312 | 91.618138 | 119.441841 | 69.626016 |
2 | 2021-12-19 | 83.020548 | 142.200993 | 129.908312 | 91.618138 | 119.441841 | 69.626016 |
3 | 2021-12-20 | 83.020548 | 142.200993 | 129.908312 | 91.618138 | 119.441841 | 69.626016 |
4 | 2021-12-21 | 83.020548 | 142.200993 | 129.908312 | 91.618138 | 119.441841 | 69.626016 |
The pd.wide_to_long() command takes the following arguments:
data: the dataframe
stubnames: the prefixes of the variables that we want to reshape into one
i: the variable that was indexing the old dataset
j: the name of the new categorical variable that we extract from stubnames
suffix: a regular expression for the suffix; the default is \d+, i.e. digits
pd.wide_to_long(df_wide2, stubnames='price_x', i='date', j='neighborhood', suffix=r'\D+').head()
price_x | ||
---|---|---|
date | neighborhood | |
2021-12-17 | Borgo Panigale - Reno | 83.020548 |
2021-12-18 | Borgo Panigale - Reno | 83.020548 |
2021-12-19 | Borgo Panigale - Reno | 83.020548 |
2021-12-20 | Borgo Panigale - Reno | 83.020548 |
2021-12-21 | Borgo Panigale - Reno | 83.020548 |
Note that we had to change the suffix to \D+, i.e. non-digits.
Window Functions
Methods
shift()
expanding()
rolling()
When we have time series data, we might want to perform operations across time. First, let's aggregate the df_prices dataset to the year-month level.
temp = df_prices.copy()
# Strip '$' and thousands separators, then convert prices to float
temp['price'] = temp['price'].str.replace('[$,]', '', regex=True).astype(float)
# Convert dates to monthly periods
temp['date'] = pd.to_datetime(temp['date']).dt.to_period('M')
# Average price by listing and month
temp = temp.groupby(['listing_id', 'date'])['price'].mean().reset_index()\
    .sort_values(by=['listing_id', 'date'], ascending=[False, True])
temp.head()
listing_id | date | price | |
---|---|---|---|
44876 | 53854962 | 2021-12 | 147.400000 |
44877 | 53854962 | 2022-01 | 137.645161 |
44878 | 53854962 | 2022-02 | 124.642857 |
44879 | 53854962 | 2022-03 | 285.096774 |
44880 | 53854962 | 2022-04 | 115.000000 |
We can lead or lag a variable using shift().
temp['price1'] = temp['price'].shift(1)
temp.head(15)
listing_id | date | price | price1 | |
---|---|---|---|---|
44876 | 53854962 | 2021-12 | 147.400000 | NaN |
44877 | 53854962 | 2022-01 | 137.645161 | 147.400000 |
44878 | 53854962 | 2022-02 | 124.642857 | 137.645161 |
44879 | 53854962 | 2022-03 | 285.096774 | 124.642857 |
44880 | 53854962 | 2022-04 | 115.000000 | 285.096774 |
44881 | 53854962 | 2022-05 | 115.000000 | 115.000000 |
44882 | 53854962 | 2022-06 | 115.000000 | 115.000000 |
44883 | 53854962 | 2022-07 | 115.000000 | 115.000000 |
44884 | 53854962 | 2022-08 | 115.000000 | 115.000000 |
44885 | 53854962 | 2022-09 | 115.000000 | 115.000000 |
44886 | 53854962 | 2022-10 | 115.000000 | 115.000000 |
44887 | 53854962 | 2022-11 | 115.000000 | 115.000000 |
44888 | 53854962 | 2022-12 | 115.000000 | 115.000000 |
44863 | 53837654 | 2021-12 | 184.133333 | 115.000000 |
44864 | 53837654 | 2022-01 | 148.741935 | 184.133333 |
If we want to lead or lag a variable within a group, we can combine shift() with groupby().
temp['price1'] = temp.groupby('listing_id')['price'].shift(1)
temp.head(15)
listing_id | date | price | price1 | |
---|---|---|---|---|
44876 | 53854962 | 2021-12 | 147.400000 | NaN |
44877 | 53854962 | 2022-01 | 137.645161 | 147.400000 |
44878 | 53854962 | 2022-02 | 124.642857 | 137.645161 |
44879 | 53854962 | 2022-03 | 285.096774 | 124.642857 |
44880 | 53854962 | 2022-04 | 115.000000 | 285.096774 |
44881 | 53854962 | 2022-05 | 115.000000 | 115.000000 |
44882 | 53854962 | 2022-06 | 115.000000 | 115.000000 |
44883 | 53854962 | 2022-07 | 115.000000 | 115.000000 |
44884 | 53854962 | 2022-08 | 115.000000 | 115.000000 |
44885 | 53854962 | 2022-09 | 115.000000 | 115.000000 |
44886 | 53854962 | 2022-10 | 115.000000 | 115.000000 |
44887 | 53854962 | 2022-11 | 115.000000 | 115.000000 |
44888 | 53854962 | 2022-12 | 115.000000 | 115.000000 |
44863 | 53837654 | 2021-12 | 184.133333 | NaN |
44864 | 53837654 | 2022-01 | 148.741935 | 184.133333 |
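shift() also accepts negative periods, which produce a lead instead of a lag. A minimal sketch using assign() so temp is left unchanged (the column name price_lead1 is our choice):
# A negative shift gives the next month's price (a lead)
temp.assign(price_lead1=temp.groupby('listing_id')['price'].shift(-1)).head()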
We can perform cumulative operations using the expanding() function.
temp['avg_cum_price'] = temp['price'].expanding().mean()
temp.head(15)
listing_id | date | price | price1 | avg_cum_price | |
---|---|---|---|---|---|
44876 | 53854962 | 2021-12 | 147.400000 | NaN | 147.400000 |
44877 | 53854962 | 2022-01 | 137.645161 | 147.400000 | 142.522581 |
44878 | 53854962 | 2022-02 | 124.642857 | 137.645161 | 136.562673 |
44879 | 53854962 | 2022-03 | 285.096774 | 124.642857 | 173.696198 |
44880 | 53854962 | 2022-04 | 115.000000 | 285.096774 | 161.956959 |
44881 | 53854962 | 2022-05 | 115.000000 | 115.000000 | 154.130799 |
44882 | 53854962 | 2022-06 | 115.000000 | 115.000000 | 148.540685 |
44883 | 53854962 | 2022-07 | 115.000000 | 115.000000 | 144.348099 |
44884 | 53854962 | 2022-08 | 115.000000 | 115.000000 | 141.087199 |
44885 | 53854962 | 2022-09 | 115.000000 | 115.000000 | 138.478479 |
44886 | 53854962 | 2022-10 | 115.000000 | 115.000000 | 136.344072 |
44887 | 53854962 | 2022-11 | 115.000000 | 115.000000 | 134.565399 |
44888 | 53854962 | 2022-12 | 115.000000 | 115.000000 | 133.060369 |
44863 | 53837654 | 2021-12 | 184.133333 | NaN | 136.708438 |
44864 | 53837654 | 2022-01 | 148.741935 | 184.133333 | 137.510671 |
To perform cumulative operations within a group, we can combine expanding() with groupby(). Note that the result comes back indexed by group, so to use it as a new column we need to align it back with the original dataframe, as shown in the sketch after the output below.
temp.groupby('listing_id')['price'].expanding().mean().reset_index(level=0).head(15)
listing_id | price | |
---|---|---|
0 | 42196 | 68.000000 |
1 | 42196 | 68.000000 |
2 | 42196 | 68.000000 |
3 | 42196 | 68.000000 |
4 | 42196 | 68.000000 |
5 | 42196 | 68.000000 |
6 | 42196 | 68.000000 |
7 | 42196 | 68.000000 |
8 | 42196 | 68.000000 |
9 | 42196 | 68.000000 |
10 | 42196 | 68.000000 |
11 | 42196 | 68.000000 |
12 | 42196 | 68.000000 |
13 | 46352 | 29.333333 |
14 | 46352 | 29.311828 |
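A minimal sketch of attaching the grouped cumulative mean as a new column (the column name avg_cum_price_grp is our choice): dropping the group level from the index lets pandas align the result with the original rows.
# Drop the listing_id index level so the result aligns with temp's row index
temp.assign(
    avg_cum_price_grp=temp.groupby('listing_id')['price'].expanding().mean()
        .reset_index(level=0, drop=True)
).head()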
If we want to perform an operation over a rolling window, we can use the rolling() function.
temp['avg3_price'] = temp['price'].rolling(3).mean()
temp.head(15)
listing_id | date | price | price1 | avg_cum_price | avg3_price | |
---|---|---|---|---|---|---|
44876 | 53854962 | 2021-12 | 147.400000 | NaN | 147.400000 | NaN |
44877 | 53854962 | 2022-01 | 137.645161 | 147.400000 | 142.522581 | NaN |
44878 | 53854962 | 2022-02 | 124.642857 | 137.645161 | 136.562673 | 136.562673 |
44879 | 53854962 | 2022-03 | 285.096774 | 124.642857 | 173.696198 | 182.461598 |
44880 | 53854962 | 2022-04 | 115.000000 | 285.096774 | 161.956959 | 174.913210 |
44881 | 53854962 | 2022-05 | 115.000000 | 115.000000 | 154.130799 | 171.698925 |
44882 | 53854962 | 2022-06 | 115.000000 | 115.000000 | 148.540685 | 115.000000 |
44883 | 53854962 | 2022-07 | 115.000000 | 115.000000 | 144.348099 | 115.000000 |
44884 | 53854962 | 2022-08 | 115.000000 | 115.000000 | 141.087199 | 115.000000 |
44885 | 53854962 | 2022-09 | 115.000000 | 115.000000 | 138.478479 | 115.000000 |
44886 | 53854962 | 2022-10 | 115.000000 | 115.000000 | 136.344072 | 115.000000 |
44887 | 53854962 | 2022-11 | 115.000000 | 115.000000 | 134.565399 | 115.000000 |
44888 | 53854962 | 2022-12 | 115.000000 | 115.000000 | 133.060369 | 115.000000 |
44863 | 53837654 | 2021-12 | 184.133333 | NaN | 136.708438 | 138.044444 |
44864 | 53837654 | 2022-01 | 148.741935 | 184.133333 | 137.510671 | 149.291756 |
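As with shift() and expanding(), the rolling average above spills across listings (see the first row of listing 53837654, which averages in values from the previous listing). To keep the window within each listing, we can combine rolling() with groupby() using the same alignment trick; a minimal sketch (the column name avg3_price_grp is our choice):
# 3-month rolling average computed within each listing
temp.assign(
    avg3_price_grp=temp.groupby('listing_id')['price'].rolling(3).mean()
        .reset_index(level=0, drop=True)
).head()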