Data Exploration

Last updated on May 1, 2022

import numpy as np
import pandas as pd

Setup

For the scope of this tutorial we are going to use AirBnb Scraped data for the city of Bologna. The data is freely available at Inside AirBnb: http://insideairbnb.com/get-the-data.html.

A description of all variables in all datasets is avaliable here.

We are going to use 2 datasets:

listing dataset: contains listing-level information
pricing dataset: contains pricing data, over time

Importing Data

Pandas has a variety of function to import data

pd.read_csv()
pd.read_html()
pd.read_parquet()

Importatly for our purpose, pd.read_csv() can directly import data from the web.

The first dataset that we are going to import is the dataset of Airbnb listings in Bologna. It contains listing-level information.

url_listings = "http://data.insideairbnb.com/italy/emilia-romagna/bologna/2021-12-17/visualisations/listings.csv"
df_listings = pd.read_csv(url_listings)

The second dataset that we are going to use is the dataset of calendar prices. This time the dataset is compressed but we can use the compression option to import it directly.

url_prices = "http://data.insideairbnb.com/italy/emilia-romagna/bologna/2021-12-17/data/calendar.csv.gz"
df_prices = pd.read_csv(url_prices, compression="gzip")

Inspecting Data

Methods

info()

head()

describe()

The first way yo have a quick look at the data is the info() method. If called with the option verbose=False, it gives a quick overview of the dimensions of the data.

df_listings.info(verbose=False)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3453 entries, 0 to 3452
Columns: 18 entries, id to license
dtypes: float64(4), int64(8), object(6)
memory usage: 485.7+ KB

If we want to know how the data looks like, we can use the head() method. It prints the first 5 lines of the data by default.

df_listings.head()

	id	name	host_id	host_name	neighbourhood_group	neighbourhood	latitude	longitude	room_type	price	minimum_nights	number_of_reviews	last_review	reviews_per_month	calculated_host_listings_count	availability_365	number_of_reviews_ltm	license
0	42196	50 sm Studio in the historic centre	184487	Carlo	NaN	Santo Stefano	44.48507	11.34786	Entire home/apt	68	3	180	2021-11-12	1.32	1	161	6	NaN
1	46352	A room in Pasolini's house	467810	Eleonora	NaN	Porto - Saragozza	44.49168	11.33514	Private room	29	1	300	2021-11-30	2.20	2	248	37	NaN
2	59697	COZY LARGE BEDROOM in the city center	286688	Paolo	NaN	Santo Stefano	44.48817	11.34124	Private room	50	1	240	2020-10-04	2.18	2	327	0	NaN
3	85368	Garden House Bologna	467675	Anna Maria	NaN	Santo Stefano	44.47834	11.35672	Entire home/apt	126	2	40	2019-11-03	0.34	1	332	0	NaN
4	145779	SINGLE ROOM	705535	Valerio	NaN	Porto - Saragozza	44.49306	11.33786	Private room	50	10	69	2021-12-05	0.55	9	365	5	NaN

We can print a description of the data using describe(). If we have many variables, it’s best to print it transposed using the .T attribute.

df_listings.describe().T[:5]

	count	mean	std	min	25%	50%	75%	max
id	3453.0	2.950218e+07	1.523988e+07	42196.0000	1.748597e+07	3.078707e+07	4.220094e+07	5.385496e+07
host_id	3453.0	1.236424e+08	1.160756e+08	38468.0000	2.550007e+07	8.845438e+07	2.005926e+08	4.354316e+08
neighbourhood_group	0.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN
latitude	3453.0	4.449756e+01	1.173569e-02	44.4236	4.449186e+01	4.449699e+01	4.450271e+01	4.455093e+01
longitude	3453.0	1.134509e+01	1.986071e-02	11.2320	1.133732e+01	1.134519e+01	1.135406e+01	1.142027e+01

You can select which variables to display using the include option. include='all' includes also categorical variables.

df_listings.describe(include='all').T[:5]

	count	unique	top	freq	mean	std	min	25%	50%	75%	max
id	3453.0	NaN	NaN	NaN	29502177.118158	15239877.346777	42196.0	17485973.0	30787074.0	42200938.0	53854962.0
name	3453	3410	Luxury Industrial Design LOFT, HEPA UV airpuri...	5	NaN	NaN	NaN	NaN	NaN	NaN	NaN
host_id	3453.0	NaN	NaN	NaN	123642405.854619	116075571.230048	38468.0	25500072.0	88454378.0	200592620.0	435431590.0
host_name	3444	747	Andrea	101	NaN	NaN	NaN	NaN	NaN	NaN	NaN
neighbourhood_group	0.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

We can get the list of columns using the .columns attribute.

df_listings.columns

Index(['id', 'name', 'host_id', 'host_name', 'neighbourhood_group',
       'neighbourhood', 'latitude', 'longitude', 'room_type', 'price',
       'minimum_nights', 'number_of_reviews', 'last_review',
       'reviews_per_month', 'calculated_host_listings_count',
       'availability_365', 'number_of_reviews_ltm', 'license'],
      dtype='object')

We can get the index using the .index attribute,

df_listings.index

RangeIndex(start=0, stop=3453, step=1)

Data Selection

We can access single columns as if the DataFrame was a dictionary.

df_listings['price']

0        68
1        29
2        50
3       126
4        50
       ... 
3448     32
3449     45
3450     50
3451    134
3452    115
Name: price, Length: 3453, dtype: int64

We can select rows and columns by index, using the .iloc attribute.

df_listings.iloc[:7, 5:9]

	neighbourhood	latitude	longitude	room_type
0	Santo Stefano	44.48507	11.34786	Entire home/apt
1	Porto - Saragozza	44.49168	11.33514	Private room
2	Santo Stefano	44.48817	11.34124	Private room
3	Santo Stefano	44.47834	11.35672	Entire home/apt
4	Porto - Saragozza	44.49306	11.33786	Private room
5	Navile	44.51628	11.33074	Private room
6	Santo Stefano	44.48787	11.35392	Entire home/apt

If we want to condition only on rows or columns, we have use : for the unrestricted dimesion, otherwise we get an error.

df_listings.iloc[:, 5:9].head()

	neighbourhood	latitude	longitude	room_type
0	Santo Stefano	44.48507	11.34786	Entire home/apt
1	Porto - Saragozza	44.49168	11.33514	Private room
2	Santo Stefano	44.48817	11.34124	Private room
3	Santo Stefano	44.47834	11.35672	Entire home/apt
4	Porto - Saragozza	44.49306	11.33786	Private room

Instead, the .loc attribute allows us to use row and column names.

df_listings.loc[:, ['neighbourhood', 'latitude', 'longitude']].head()

	neighbourhood	latitude	longitude
0	Santo Stefano	44.48507	11.34786
1	Porto - Saragozza	44.49168	11.33514
2	Santo Stefano	44.48817	11.34124
3	Santo Stefano	44.47834	11.35672
4	Porto - Saragozza	44.49306	11.33786

We can also select ranges.

df_listings.loc[:, 'neighbourhood':'room_type'].head()

	neighbourhood	latitude	longitude	room_type
0	Santo Stefano	44.48507	11.34786	Entire home/apt
1	Porto - Saragozza	44.49168	11.33514	Private room
2	Santo Stefano	44.48817	11.34124	Private room
3	Santo Stefano	44.47834	11.35672	Entire home/apt
4	Porto - Saragozza	44.49306	11.33786	Private room

There is an easy way to select numerical columns, the .select_dtypes() function.

df_listings.select_dtypes(include=['number']).head()

	id	host_id	neighbourhood_group	latitude	longitude	price	minimum_nights	number_of_reviews	reviews_per_month	calculated_host_listings_count	availability_365	number_of_reviews_ltm
0	42196	184487	NaN	44.48507	11.34786	68	3	180	1.32	1	161	6
1	46352	467810	NaN	44.49168	11.33514	29	1	300	2.20	2	248	37
2	59697	286688	NaN	44.48817	11.34124	50	1	240	2.18	2	327	0
3	85368	467675	NaN	44.47834	11.35672	126	2	40	0.34	1	332	0
4	145779	705535	NaN	44.49306	11.33786	50	10	69	0.55	9	365	5

Other types include

object for strings
bool for booleans
int for integers
float for floats (numbers that are not integers)

We can also use logical operators to selet rows.

df_listings.loc[df_listings['number_of_reviews']>500, :].head()

	id	name	host_id	host_name	neighbourhood_group	neighbourhood	latitude	longitude	room_type	price	minimum_nights	number_of_reviews	last_review	reviews_per_month	calculated_host_listings_count	availability_365	number_of_reviews_ltm	license
52	884148	APOSA FLAT / CITY CENTER - BO	4664996	Vie D'Acqua Di Sandra Maria	NaN	Santo Stefano	44.49945	11.34566	Entire home/apt	46	1	668	2021-12-11	6.24	5	252	20	NaN
92	1435627	heart of Bologna Piazza Maggiore	7714013	Carlotta	NaN	Porto - Saragozza	44.49321	11.33569	Entire home/apt	56	2	508	2021-12-12	5.08	1	131	69	NaN
98	1566003	"i portici di via Piella "	8325248	Massimo	NaN	Santo Stefano	44.49855	11.34411	Entire home/apt	51	2	764	2021-12-14	7.62	3	119	120	NaN
131	2282623	S.Orsola zone,parking for free and self check-in	11658074	Cecilia	NaN	San Donato - San Vitale	44.49328	11.36650	Entire home/apt	38	1	689	2021-10-24	7.20	1	5	72	NaN
175	3216486	Stanza Privata	16289536	Fabio	NaN	Navile	44.50903	11.34200	Private room	82	1	569	2021-12-05	6.93	1	7	5	NaN

We can use logical operations as well. But remember to use paranthesis.

Note: the and and or expressions do not work in this setting. We have to use & and | instead.

df_listings.loc[(df_listings['number_of_reviews']>300) &
                (df_listings['reviews_per_month']>7), 
                :].head()

	id	name	host_id	host_name	neighbourhood_group	neighbourhood	latitude	longitude	room_type	price	minimum_nights	number_of_reviews	last_review	reviews_per_month	calculated_host_listings_count	availability_365	number_of_reviews_ltm	license
98	1566003	"i portici di via Piella "	8325248	Massimo	NaN	Santo Stefano	44.498550	11.344110	Entire home/apt	51	2	764	2021-12-14	7.62	3	119	120	NaN
131	2282623	S.Orsola zone,parking for free and self check-in	11658074	Cecilia	NaN	San Donato - San Vitale	44.493280	11.366500	Entire home/apt	38	1	689	2021-10-24	7.20	1	5	72	NaN
204	4166793	Centralissimo a Bologna	8325248	Massimo	NaN	Santo Stefano	44.500920	11.344560	Entire home/apt	71	2	750	2021-12-10	9.21	3	233	84	NaN
751	15508481	Monolocale in zona fiera /centro	99632788	Walid	NaN	Navile	44.514462	11.353731	Entire home/apt	64	1	475	2021-12-01	7.56	1	4	48	NaN
773	15886516	Monolocale nel cuore del ghetto ebraico di Bol...	103024123	Catia	NaN	Santo Stefano	44.495080	11.347220	Entire home/apt	58	1	428	2021-12-15	7.88	1	285	17	NaN

For a single column (i.e. a Series), we can get the unique values using the unique() function.

df_listings['neighbourhood'].unique()

array(['Santo Stefano', 'Porto - Saragozza', 'Navile',
       'San Donato - San Vitale', 'Savena', 'Borgo Panigale - Reno'],
      dtype=object)

For multiple columns, we can use the drop_duplicates function.

df_listings[['neighbourhood', 'room_type']].drop_duplicates()

	neighbourhood	room_type
0	Santo Stefano	Entire home/apt
1	Porto - Saragozza	Private room
2	Santo Stefano	Private room
5	Navile	Private room
7	Navile	Entire home/apt
8	Porto - Saragozza	Entire home/apt
19	San Donato - San Vitale	Private room
24	Savena	Private room
36	Borgo Panigale - Reno	Entire home/apt
41	San Donato - San Vitale	Entire home/apt
70	Porto - Saragozza	Hotel room
75	Borgo Panigale - Reno	Private room
110	Santo Stefano	Hotel room
111	Savena	Entire home/apt
388	Porto - Saragozza	Shared room
678	Navile	Shared room
1393	Savena	Shared room
1416	San Donato - San Vitale	Shared room
1572	San Donato - San Vitale	Hotel room
1637	Santo Stefano	Shared room
1751	Navile	Hotel room

Aggregation and Pivot Tables

We can compute statistics by group using .groupby().

df_listings.groupby('neighbourhood')[['price', 'reviews_per_month']].mean()

	price	reviews_per_month
neighbourhood
Borgo Panigale - Reno	83.020548	0.983488
Navile	142.200993	1.156745
Porto - Saragozza	129.908312	1.340325
San Donato - San Vitale	91.618138	0.933011
Santo Stefano	119.441841	1.344810
Savena	69.626016	0.805888

If you want to perform more than one function, maybe on different columns, you can use .aggregate() which can be shortened to .agg(). It takes as argument a dictionary with variables as keys and lists of functions as values.

df_listings.groupby('neighbourhood').agg({"reviews_per_month": ["mean"],
                                          "price": ["min", np.max]}).reset_index()

	neighbourhood	reviews_per_month	price
		mean	min	amax
0	Borgo Panigale - Reno	0.983488	9	1429
1	Navile	1.156745	14	5000
2	Porto - Saragozza	1.340325	7	9999
3	San Donato - San Vitale	0.933011	10	1600
4	Santo Stefano	1.344810	11	9999
5	Savena	0.805888	9	680

The problem with this syntax is that it generates a hierarchical structure for variable names, which might not be so easy to work with. In the example above, to access the mean price, you have to use df.price["min"].

To perform variable naming and aggregation and the same time, you can ise the following syntax: agg(output_var = ("input_var", function)).

df_listings.groupby('neighbourhood').agg(mean_reviews=("reviews_per_month", "mean"),
                                         min_price=("price", "min"),
                                         max_price=("price", np.max)).reset_index()

	neighbourhood	mean_reviews	min_price	max_price
0	Borgo Panigale - Reno	0.983488	9	1429
1	Navile	1.156745	14	5000
2	Porto - Saragozza	1.340325	7	9999
3	San Donato - San Vitale	0.933011	10	1600
4	Santo Stefano	1.344810	11	9999
5	Savena	0.805888	9	680

We can make pivot tables with the .pivot_table() function. It takes the folling arguments:

index: rows
columns: columns
values: values
aggfunc: aggregation function

df_listings.pivot_table(index='neighbourhood', columns='room_type', values='price', aggfunc='mean')

room_type	Entire home/apt	Hotel room	Private room	Shared room
neighbourhood
Borgo Panigale - Reno	96.700935	NaN	45.487179	NaN
Navile	172.140000	1350.000000	68.416107	28.0
Porto - Saragozza	148.410926	102.375000	83.070234	16.5
San Donato - San Vitale	106.775000	55.000000	61.194030	59.0
Santo Stefano	129.990260	103.827586	80.734177	95.4
Savena	86.301370	NaN	46.229167	22.5