Data Wrangling
Last updated on Feb 27, 2022
import numpy as np
import pandas as pd
Setup
For the scope of this tutorial we are going to use AirBnb Scraped data for the city of Bologna. The data is freely available at Inside AirBnb: http://insideairbnb.com/get-the-data.html.
A description of all variables in all datasets is avaliable here.
We are going to use 2 datasets:
- listing dataset: contains listing-level information
- pricing dataset: contains pricing data, over time
# Import listings data
url_listings = "http://data.insideairbnb.com/italy/emilia-romagna/bologna/2021-12-17/visualisations/listings.csv"
df_listings = pd.read_csv(url_listings)
# Import pricing data
url_prices = "http://data.insideairbnb.com/italy/emilia-romagna/bologna/2021-12-17/data/calendar.csv.gz"
df_prices = pd.read_csv(url_prices, compression="gzip")
Editing the Dataframe
You can sort the data using the sort_values
function.
Options
ascending
: bool or list of bool, default Truena_position
: {‘first’, ‘last’}, default ‘last’
df_listings.sort_values(by=['name', 'price'],
ascending=[False, True],
na_position='last').head()
id | name | host_id | host_name | neighbourhood_group | neighbourhood | latitude | longitude | room_type | price | minimum_nights | number_of_reviews | last_review | reviews_per_month | calculated_host_listings_count | availability_365 | number_of_reviews_ltm | license | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2280 | 38601411 | 🏡Giardino di Annabella-relax in città-casa intera | 240803020 | Annabella | NaN | Porto - Saragozza | 44.49303 | 11.31986 | Entire home/apt | 90 | 2 | 53 | 2021-12-13 | 1.96 | 1 | 76 | 27 | 392901 |
2988 | 48177313 | ❤ Romantic Suite with SPA Bath ❤ 4starbologna.com | 239491712 | 4 Star Bologna | NaN | Santo Stefano | 44.50271 | 11.34998 | Entire home/apt | 309 | 1 | 1 | 2021-03-14 | 0.11 | 14 | 344 | 1 | NaN |
3302 | 52367336 | ✨House of Alchemy✨ | 140013413 | Greta | NaN | Porto - Saragozza | 44.49072 | 11.30890 | Entire home/apt | 96 | 2 | 7 | 2021-11-28 | 3.18 | 1 | 88 | 7 | NaN |
2039 | 34495335 | ♥ Romantic for Couple in Love ♥ | 4 Star Boutique | 239491712 | 4 Star Bologna | NaN | Santo Stefano | 44.50368 | 11.34972 | Entire home/apt | 143 | 1 | 25 | 2021-08-20 | 0.79 | 14 | 262 | 6 | NaN |
2964 | 47866124 | ♡Amazing Suite with Private SPA ♡ 4starbologna... | 239491712 | 4 Star Bologna | NaN | Santo Stefano | 44.50381 | 11.34951 | Entire home/apt | 347 | 1 | 2 | 2021-10-17 | 0.72 | 14 | 337 | 2 | NaN |
You can remane columns using the rename()
function. It takes a dictionary as column
argument in the form {"old_name": "new_name"}
.
df_listings.rename(columns={'name': 'listing_name',
'id': 'listing_id'}).head()
listing_id | listing_name | host_id | host_name | neighbourhood_group | neighbourhood | latitude | longitude | room_type | price | minimum_nights | number_of_reviews | last_review | reviews_per_month | calculated_host_listings_count | availability_365 | number_of_reviews_ltm | license | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 42196 | 50 sm Studio in the historic centre | 184487 | Carlo | NaN | Santo Stefano | 44.48507 | 11.34786 | Entire home/apt | 68 | 3 | 180 | 2021-11-12 | 1.32 | 1 | 161 | 6 | NaN |
1 | 46352 | A room in Pasolini's house | 467810 | Eleonora | NaN | Porto - Saragozza | 44.49168 | 11.33514 | Private room | 29 | 1 | 300 | 2021-11-30 | 2.20 | 2 | 248 | 37 | NaN |
2 | 59697 | COZY LARGE BEDROOM in the city center | 286688 | Paolo | NaN | Santo Stefano | 44.48817 | 11.34124 | Private room | 50 | 1 | 240 | 2020-10-04 | 2.18 | 2 | 327 | 0 | NaN |
3 | 85368 | Garden House Bologna | 467675 | Anna Maria | NaN | Santo Stefano | 44.47834 | 11.35672 | Entire home/apt | 126 | 2 | 40 | 2019-11-03 | 0.34 | 1 | 332 | 0 | NaN |
4 | 145779 | SINGLE ROOM | 705535 | Valerio | NaN | Porto - Saragozza | 44.49306 | 11.33786 | Private room | 50 | 10 | 69 | 2021-12-05 | 0.55 | 9 | 365 | 5 | NaN |
Aggregating
We can compute statistics by group using groupby()
.
df_listings.groupby('neighbourhood')[['price', 'reviews_per_month']].mean()
price | reviews_per_month | |
---|---|---|
neighbourhood | ||
Borgo Panigale - Reno | 83.020548 | 0.983488 |
Navile | 142.200993 | 1.156745 |
Porto - Saragozza | 129.908312 | 1.340325 |
San Donato - San Vitale | 91.618138 | 0.933011 |
Santo Stefano | 119.441841 | 1.344810 |
Savena | 69.626016 | 0.805888 |
If you want to perform more than one function, maybe on different columns, you can use aggregate()
which can be shortened to agg()
. The sintax is agg(output_var = ("input_var", function))
and it accepts also numpy
functions.
df_listings.groupby('neighbourhood').agg(mean_reviews=("reviews_per_month", "mean"),
min_price=("price", "min"),
max_price=("price", np.max)).reset_index()
neighbourhood | mean_reviews | min_price | max_price | |
---|---|---|---|---|
0 | Borgo Panigale - Reno | 0.983488 | 9 | 1429 |
1 | Navile | 1.156745 | 14 | 5000 |
2 | Porto - Saragozza | 1.340325 | 7 | 9999 |
3 | San Donato - San Vitale | 0.933011 | 10 | 1600 |
4 | Santo Stefano | 1.344810 | 11 | 9999 |
5 | Savena | 0.805888 | 9 | 680 |
If we want to build a new column by group, we can use transform()
on the grouped data. Unfortunately, it does not work as nicely as aggregate()
and we have to do one column at the time.
df_listings.groupby('neighbourhood')[['price', 'reviews_per_month']].transform('mean')
price | reviews_per_month | |
---|---|---|
0 | 119.441841 | 1.344810 |
1 | 129.908312 | 1.340325 |
2 | 119.441841 | 1.344810 |
3 | 119.441841 | 1.344810 |
4 | 129.908312 | 1.340325 |
... | ... | ... |
3448 | 142.200993 | 1.156745 |
3449 | 129.908312 | 1.340325 |
3450 | 142.200993 | 1.156745 |
3451 | 119.441841 | 1.344810 |
3452 | 129.908312 | 1.340325 |
3453 rows × 2 columns
Reshaping
Combining Datasets
We can concatenate datasets using pd.concat()
. It takes as argument a list of dataframes. By default, pd.concat()
performs the outer join. We can change it using the join
option.
df_listings1 = df_listings[:2000]
np.shape(df_listings1)
(2000, 18)
df_listings2 = df_listings[1000:]
np.shape(df_listings2)
(2453, 18)
np.shape(
pd.concat([df_listings1, df_listings2], join='inner')
)
(4453, 18)
To instead merge dataframes, we can use the pd.merge
function.
Options
how
: {‘left’, ‘right’, ‘outer’, ‘inner’, ‘cross’}, default ‘inner’on
: label or list
pd.merge(df_listings, df_prices, left_on='id', right_on='listing_id', how='inner')
id | name | host_id | host_name | neighbourhood_group | neighbourhood | latitude | longitude | room_type | price_x | ... | availability_365 | number_of_reviews_ltm | license | listing_id | date | available | price_y | adjusted_price | minimum_nights_y | maximum_nights | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 42196 | 50 sm Studio in the historic centre | 184487 | Carlo | NaN | Santo Stefano | 44.485070 | 11.347860 | Entire home/apt | 68 | ... | 161 | 6 | NaN | 42196 | 2021-12-17 | f | $68.00 | $68.00 | 3 | 360 |
1 | 42196 | 50 sm Studio in the historic centre | 184487 | Carlo | NaN | Santo Stefano | 44.485070 | 11.347860 | Entire home/apt | 68 | ... | 161 | 6 | NaN | 42196 | 2021-12-18 | f | $68.00 | $68.00 | 3 | 360 |
2 | 42196 | 50 sm Studio in the historic centre | 184487 | Carlo | NaN | Santo Stefano | 44.485070 | 11.347860 | Entire home/apt | 68 | ... | 161 | 6 | NaN | 42196 | 2021-12-19 | f | $68.00 | $68.00 | 3 | 360 |
3 | 42196 | 50 sm Studio in the historic centre | 184487 | Carlo | NaN | Santo Stefano | 44.485070 | 11.347860 | Entire home/apt | 68 | ... | 161 | 6 | NaN | 42196 | 2021-12-20 | f | $68.00 | $68.00 | 3 | 360 |
4 | 42196 | 50 sm Studio in the historic centre | 184487 | Carlo | NaN | Santo Stefano | 44.485070 | 11.347860 | Entire home/apt | 68 | ... | 161 | 6 | NaN | 42196 | 2021-12-21 | f | $68.00 | $68.00 | 3 | 360 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1260340 | 53854962 | A pochi passi dal Mercato delle Erbe by Wonder... | 13036400 | Wonderful Italy | NaN | Porto - Saragozza | 44.497456 | 11.337492 | Entire home/apt | 115 | ... | 98 | 0 | NaN | 53854962 | 2022-12-12 | f | $115.00 | $115.00 | 1 | 1125 |
1260341 | 53854962 | A pochi passi dal Mercato delle Erbe by Wonder... | 13036400 | Wonderful Italy | NaN | Porto - Saragozza | 44.497456 | 11.337492 | Entire home/apt | 115 | ... | 98 | 0 | NaN | 53854962 | 2022-12-13 | f | $115.00 | $115.00 | 1 | 1125 |
1260342 | 53854962 | A pochi passi dal Mercato delle Erbe by Wonder... | 13036400 | Wonderful Italy | NaN | Porto - Saragozza | 44.497456 | 11.337492 | Entire home/apt | 115 | ... | 98 | 0 | NaN | 53854962 | 2022-12-14 | f | $115.00 | $115.00 | 1 | 1125 |
1260343 | 53854962 | A pochi passi dal Mercato delle Erbe by Wonder... | 13036400 | Wonderful Italy | NaN | Porto - Saragozza | 44.497456 | 11.337492 | Entire home/apt | 115 | ... | 98 | 0 | NaN | 53854962 | 2022-12-15 | f | $115.00 | $115.00 | 1 | 1125 |
1260344 | 53854962 | A pochi passi dal Mercato delle Erbe by Wonder... | 13036400 | Wonderful Italy | NaN | Porto - Saragozza | 44.497456 | 11.337492 | Entire home/apt | 115 | ... | 98 | 0 | NaN | 53854962 | 2022-12-16 | f | $115.00 | $115.00 | 1 | 1125 |
1260345 rows × 25 columns
As you can see, since the variable price
was present in both datasets, we now have a price.x
and a price.y
.