Data Wrangling

Last updated on Feb 27, 2022

import numpy as np
import pandas as pd

Setup

For the scope of this tutorial we are going to use AirBnb Scraped data for the city of Bologna. The data is freely available at Inside AirBnb: http://insideairbnb.com/get-the-data.html.

A description of all variables in all datasets is avaliable here.

We are going to use 2 datasets:

listing dataset: contains listing-level information
pricing dataset: contains pricing data, over time

# Import listings data
url_listings = "http://data.insideairbnb.com/italy/emilia-romagna/bologna/2021-12-17/visualisations/listings.csv"
df_listings = pd.read_csv(url_listings)

# Import pricing data
url_prices = "http://data.insideairbnb.com/italy/emilia-romagna/bologna/2021-12-17/data/calendar.csv.gz"
df_prices = pd.read_csv(url_prices, compression="gzip")

Editing the Dataframe

You can sort the data using the sort_values function.

Options

ascending: bool or list of bool, default True
na_position: {‘first’, ‘last’}, default ‘last’

df_listings.sort_values(by=['name', 'price'], 
                        ascending=[False, True], 
                        na_position='last').head()

	id	name	host_id	host_name	neighbourhood_group	neighbourhood	latitude	longitude	room_type	price	minimum_nights	number_of_reviews	last_review	reviews_per_month	calculated_host_listings_count	availability_365	number_of_reviews_ltm	license
2280	38601411	🏡Giardino di Annabella-relax in città-casa intera	240803020	Annabella	NaN	Porto - Saragozza	44.49303	11.31986	Entire home/apt	90	2	53	2021-12-13	1.96	1	76	27	392901
2988	48177313	❤ Romantic Suite with SPA Bath ❤ 4starbologna.com	239491712	4 Star Bologna	NaN	Santo Stefano	44.50271	11.34998	Entire home/apt	309	1	1	2021-03-14	0.11	14	344	1	NaN
3302	52367336	✨House of Alchemy✨	140013413	Greta	NaN	Porto - Saragozza	44.49072	11.30890	Entire home/apt	96	2	7	2021-11-28	3.18	1	88	7	NaN
2039	34495335	♥ Romantic for Couple in Love ♥ \| 4 Star Boutique	239491712	4 Star Bologna	NaN	Santo Stefano	44.50368	11.34972	Entire home/apt	143	1	25	2021-08-20	0.79	14	262	6	NaN
2964	47866124	♡Amazing Suite with Private SPA ♡ 4starbologna...	239491712	4 Star Bologna	NaN	Santo Stefano	44.50381	11.34951	Entire home/apt	347	1	2	2021-10-17	0.72	14	337	2	NaN

You can remane columns using the rename() function. It takes a dictionary as column argument in the form {"old_name": "new_name"}.

df_listings.rename(columns={'name': 'listing_name', 
                            'id': 'listing_id'}).head()

	listing_id	listing_name	host_id	host_name	neighbourhood_group	neighbourhood	latitude	longitude	room_type	price	minimum_nights	number_of_reviews	last_review	reviews_per_month	calculated_host_listings_count	availability_365	number_of_reviews_ltm	license
0	42196	50 sm Studio in the historic centre	184487	Carlo	NaN	Santo Stefano	44.48507	11.34786	Entire home/apt	68	3	180	2021-11-12	1.32	1	161	6	NaN
1	46352	A room in Pasolini's house	467810	Eleonora	NaN	Porto - Saragozza	44.49168	11.33514	Private room	29	1	300	2021-11-30	2.20	2	248	37	NaN
2	59697	COZY LARGE BEDROOM in the city center	286688	Paolo	NaN	Santo Stefano	44.48817	11.34124	Private room	50	1	240	2020-10-04	2.18	2	327	0	NaN
3	85368	Garden House Bologna	467675	Anna Maria	NaN	Santo Stefano	44.47834	11.35672	Entire home/apt	126	2	40	2019-11-03	0.34	1	332	0	NaN
4	145779	SINGLE ROOM	705535	Valerio	NaN	Porto - Saragozza	44.49306	11.33786	Private room	50	10	69	2021-12-05	0.55	9	365	5	NaN

Aggregating

We can compute statistics by group using groupby().

df_listings.groupby('neighbourhood')[['price', 'reviews_per_month']].mean()

	price	reviews_per_month
neighbourhood
Borgo Panigale - Reno	83.020548	0.983488
Navile	142.200993	1.156745
Porto - Saragozza	129.908312	1.340325
San Donato - San Vitale	91.618138	0.933011
Santo Stefano	119.441841	1.344810
Savena	69.626016	0.805888

If you want to perform more than one function, maybe on different columns, you can use aggregate() which can be shortened to agg(). The sintax is agg(output_var = ("input_var", function)) and it accepts also numpy functions.

df_listings.groupby('neighbourhood').agg(mean_reviews=("reviews_per_month", "mean"),
                                         min_price=("price", "min"),
                                         max_price=("price", np.max)).reset_index()

	neighbourhood	mean_reviews	min_price	max_price
0	Borgo Panigale - Reno	0.983488	9	1429
1	Navile	1.156745	14	5000
2	Porto - Saragozza	1.340325	7	9999
3	San Donato - San Vitale	0.933011	10	1600
4	Santo Stefano	1.344810	11	9999
5	Savena	0.805888	9	680

If we want to build a new column by group, we can use transform() on the grouped data. Unfortunately, it does not work as nicely as aggregate() and we have to do one column at the time.

df_listings.groupby('neighbourhood')[['price', 'reviews_per_month']].transform('mean')

	price	reviews_per_month
0	119.441841	1.344810
1	129.908312	1.340325
2	119.441841	1.344810
3	119.441841	1.344810
4	129.908312	1.340325
...	...	...
3448	142.200993	1.156745
3449	129.908312	1.340325
3450	142.200993	1.156745
3451	119.441841	1.344810
3452	129.908312	1.340325

3453 rows × 2 columns

Reshaping

Combining Datasets

We can concatenate datasets using pd.concat(). It takes as argument a list of dataframes. By default, pd.concat() performs the outer join. We can change it using the join option.

df_listings1 = df_listings[:2000]
np.shape(df_listings1)

(2000, 18)

df_listings2 = df_listings[1000:]
np.shape(df_listings2)

(2453, 18)

np.shape(
    pd.concat([df_listings1, df_listings2], join='inner')
)

(4453, 18)

To instead merge dataframes, we can use the pd.merge function.

Options

how: {‘left’, ‘right’, ‘outer’, ‘inner’, ‘cross’}, default ‘inner’
on: label or list

pd.merge(df_listings, df_prices, left_on='id', right_on='listing_id', how='inner')

	id	name	host_id	host_name	neighbourhood_group	neighbourhood	latitude	longitude	room_type	price_x	...	availability_365	number_of_reviews_ltm	license	listing_id	date	available	price_y	adjusted_price	minimum_nights_y	maximum_nights
0	42196	50 sm Studio in the historic centre	184487	Carlo	NaN	Santo Stefano	44.485070	11.347860	Entire home/apt	68	...	161	6	NaN	42196	2021-12-17	f	$68.00	$68.00	3	360
1	42196	50 sm Studio in the historic centre	184487	Carlo	NaN	Santo Stefano	44.485070	11.347860	Entire home/apt	68	...	161	6	NaN	42196	2021-12-18	f	$68.00	$68.00	3	360
2	42196	50 sm Studio in the historic centre	184487	Carlo	NaN	Santo Stefano	44.485070	11.347860	Entire home/apt	68	...	161	6	NaN	42196	2021-12-19	f	$68.00	$68.00	3	360
3	42196	50 sm Studio in the historic centre	184487	Carlo	NaN	Santo Stefano	44.485070	11.347860	Entire home/apt	68	...	161	6	NaN	42196	2021-12-20	f	$68.00	$68.00	3	360
4	42196	50 sm Studio in the historic centre	184487	Carlo	NaN	Santo Stefano	44.485070	11.347860	Entire home/apt	68	...	161	6	NaN	42196	2021-12-21	f	$68.00	$68.00	3	360
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
1260340	53854962	A pochi passi dal Mercato delle Erbe by Wonder...	13036400	Wonderful Italy	NaN	Porto - Saragozza	44.497456	11.337492	Entire home/apt	115	...	98	0	NaN	53854962	2022-12-12	f	$115.00	$115.00	1	1125
1260341	53854962	A pochi passi dal Mercato delle Erbe by Wonder...	13036400	Wonderful Italy	NaN	Porto - Saragozza	44.497456	11.337492	Entire home/apt	115	...	98	0	NaN	53854962	2022-12-13	f	$115.00	$115.00	1	1125
1260342	53854962	A pochi passi dal Mercato delle Erbe by Wonder...	13036400	Wonderful Italy	NaN	Porto - Saragozza	44.497456	11.337492	Entire home/apt	115	...	98	0	NaN	53854962	2022-12-14	f	$115.00	$115.00	1	1125
1260343	53854962	A pochi passi dal Mercato delle Erbe by Wonder...	13036400	Wonderful Italy	NaN	Porto - Saragozza	44.497456	11.337492	Entire home/apt	115	...	98	0	NaN	53854962	2022-12-15	f	$115.00	$115.00	1	1125
1260344	53854962	A pochi passi dal Mercato delle Erbe by Wonder...	13036400	Wonderful Italy	NaN	Porto - Saragozza	44.497456	11.337492	Entire home/apt	115	...	98	0	NaN	53854962	2022-12-16	f	$115.00	$115.00	1	1125

1260345 rows × 25 columns

As you can see, since the variable price was present in both datasets, we now have a price.x and a price.y.