Data Exploration
Last updated on May 1, 2022
import numpy as np
import pandas as pd
Setup
For the scope of this tutorial we are going to use AirBnb Scraped data for the city of Bologna. The data is freely available at Inside AirBnb: http://insideairbnb.com/get-the-data.html.
A description of all variables in all datasets is avaliable here.
We are going to use 2 datasets:
- listing dataset: contains listing-level information
- pricing dataset: contains pricing data, over time
Importing Data
Pandas has a variety of function to import data
pd.read_csv()
pd.read_html()
pd.read_parquet()
Importatly for our purpose, pd.read_csv()
can directly import data from the web.
The first dataset that we are going to import is the dataset of Airbnb listings in Bologna. It contains listing-level information.
url_listings = "http://data.insideairbnb.com/italy/emilia-romagna/bologna/2021-12-17/visualisations/listings.csv"
df_listings = pd.read_csv(url_listings)
The second dataset that we are going to use is the dataset of calendar prices. This time the dataset is compressed but we can use the compression
option to import it directly.
url_prices = "http://data.insideairbnb.com/italy/emilia-romagna/bologna/2021-12-17/data/calendar.csv.gz"
df_prices = pd.read_csv(url_prices, compression="gzip")
Inspecting Data
Methods
info()
head()
describe()
The first way yo have a quick look at the data is the info()
method. If called with the option verbose=False
, it gives a quick overview of the dimensions of the data.
df_listings.info(verbose=False)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3453 entries, 0 to 3452
Columns: 18 entries, id to license
dtypes: float64(4), int64(8), object(6)
memory usage: 485.7+ KB
If we want to know how the data looks like, we can use the head()
method. It prints the first 5 lines of the data by default.
df_listings.head()
id | name | host_id | host_name | neighbourhood_group | neighbourhood | latitude | longitude | room_type | price | minimum_nights | number_of_reviews | last_review | reviews_per_month | calculated_host_listings_count | availability_365 | number_of_reviews_ltm | license | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 42196 | 50 sm Studio in the historic centre | 184487 | Carlo | NaN | Santo Stefano | 44.48507 | 11.34786 | Entire home/apt | 68 | 3 | 180 | 2021-11-12 | 1.32 | 1 | 161 | 6 | NaN |
1 | 46352 | A room in Pasolini's house | 467810 | Eleonora | NaN | Porto - Saragozza | 44.49168 | 11.33514 | Private room | 29 | 1 | 300 | 2021-11-30 | 2.20 | 2 | 248 | 37 | NaN |
2 | 59697 | COZY LARGE BEDROOM in the city center | 286688 | Paolo | NaN | Santo Stefano | 44.48817 | 11.34124 | Private room | 50 | 1 | 240 | 2020-10-04 | 2.18 | 2 | 327 | 0 | NaN |
3 | 85368 | Garden House Bologna | 467675 | Anna Maria | NaN | Santo Stefano | 44.47834 | 11.35672 | Entire home/apt | 126 | 2 | 40 | 2019-11-03 | 0.34 | 1 | 332 | 0 | NaN |
4 | 145779 | SINGLE ROOM | 705535 | Valerio | NaN | Porto - Saragozza | 44.49306 | 11.33786 | Private room | 50 | 10 | 69 | 2021-12-05 | 0.55 | 9 | 365 | 5 | NaN |
We can print a description of the data using describe()
. If we have many variables, it’s best to print it transposed using the .T
attribute.
df_listings.describe().T[:5]
count | mean | std | min | 25% | 50% | 75% | max | |
---|---|---|---|---|---|---|---|---|
id | 3453.0 | 2.950218e+07 | 1.523988e+07 | 42196.0000 | 1.748597e+07 | 3.078707e+07 | 4.220094e+07 | 5.385496e+07 |
host_id | 3453.0 | 1.236424e+08 | 1.160756e+08 | 38468.0000 | 2.550007e+07 | 8.845438e+07 | 2.005926e+08 | 4.354316e+08 |
neighbourhood_group | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
latitude | 3453.0 | 4.449756e+01 | 1.173569e-02 | 44.4236 | 4.449186e+01 | 4.449699e+01 | 4.450271e+01 | 4.455093e+01 |
longitude | 3453.0 | 1.134509e+01 | 1.986071e-02 | 11.2320 | 1.133732e+01 | 1.134519e+01 | 1.135406e+01 | 1.142027e+01 |
You can select which variables to display using the include
option. include='all'
includes also categorical variables.
df_listings.describe(include='all').T[:5]
count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max | |
---|---|---|---|---|---|---|---|---|---|---|---|
id | 3453.0 | NaN | NaN | NaN | 29502177.118158 | 15239877.346777 | 42196.0 | 17485973.0 | 30787074.0 | 42200938.0 | 53854962.0 |
name | 3453 | 3410 | Luxury Industrial Design LOFT, HEPA UV airpuri... | 5 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
host_id | 3453.0 | NaN | NaN | NaN | 123642405.854619 | 116075571.230048 | 38468.0 | 25500072.0 | 88454378.0 | 200592620.0 | 435431590.0 |
host_name | 3444 | 747 | Andrea | 101 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
neighbourhood_group | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
We can get the list of columns using the .columns
attribute.
df_listings.columns
Index(['id', 'name', 'host_id', 'host_name', 'neighbourhood_group',
'neighbourhood', 'latitude', 'longitude', 'room_type', 'price',
'minimum_nights', 'number_of_reviews', 'last_review',
'reviews_per_month', 'calculated_host_listings_count',
'availability_365', 'number_of_reviews_ltm', 'license'],
dtype='object')
We can get the index using the .index
attribute,
df_listings.index
RangeIndex(start=0, stop=3453, step=1)
Data Selection
We can access single columns as if the DataFrame was a dictionary.
df_listings['price']
0 68
1 29
2 50
3 126
4 50
...
3448 32
3449 45
3450 50
3451 134
3452 115
Name: price, Length: 3453, dtype: int64
We can select rows and columns by index, using the .iloc
attribute.
df_listings.iloc[:7, 5:9]
neighbourhood | latitude | longitude | room_type | |
---|---|---|---|---|
0 | Santo Stefano | 44.48507 | 11.34786 | Entire home/apt |
1 | Porto - Saragozza | 44.49168 | 11.33514 | Private room |
2 | Santo Stefano | 44.48817 | 11.34124 | Private room |
3 | Santo Stefano | 44.47834 | 11.35672 | Entire home/apt |
4 | Porto - Saragozza | 44.49306 | 11.33786 | Private room |
5 | Navile | 44.51628 | 11.33074 | Private room |
6 | Santo Stefano | 44.48787 | 11.35392 | Entire home/apt |
If we want to condition only on rows or columns, we have use :
for the unrestricted dimesion, otherwise we get an error.
df_listings.iloc[:, 5:9].head()
neighbourhood | latitude | longitude | room_type | |
---|---|---|---|---|
0 | Santo Stefano | 44.48507 | 11.34786 | Entire home/apt |
1 | Porto - Saragozza | 44.49168 | 11.33514 | Private room |
2 | Santo Stefano | 44.48817 | 11.34124 | Private room |
3 | Santo Stefano | 44.47834 | 11.35672 | Entire home/apt |
4 | Porto - Saragozza | 44.49306 | 11.33786 | Private room |
Instead, the .loc
attribute allows us to use row and column names.
df_listings.loc[:, ['neighbourhood', 'latitude', 'longitude']].head()
neighbourhood | latitude | longitude | |
---|---|---|---|
0 | Santo Stefano | 44.48507 | 11.34786 |
1 | Porto - Saragozza | 44.49168 | 11.33514 |
2 | Santo Stefano | 44.48817 | 11.34124 |
3 | Santo Stefano | 44.47834 | 11.35672 |
4 | Porto - Saragozza | 44.49306 | 11.33786 |
We can also select ranges.
df_listings.loc[:, 'neighbourhood':'room_type'].head()
neighbourhood | latitude | longitude | room_type | |
---|---|---|---|---|
0 | Santo Stefano | 44.48507 | 11.34786 | Entire home/apt |
1 | Porto - Saragozza | 44.49168 | 11.33514 | Private room |
2 | Santo Stefano | 44.48817 | 11.34124 | Private room |
3 | Santo Stefano | 44.47834 | 11.35672 | Entire home/apt |
4 | Porto - Saragozza | 44.49306 | 11.33786 | Private room |
There is an easy way to select numerical columns, the .select_dtypes()
function.
df_listings.select_dtypes(include=['number']).head()
id | host_id | neighbourhood_group | latitude | longitude | price | minimum_nights | number_of_reviews | reviews_per_month | calculated_host_listings_count | availability_365 | number_of_reviews_ltm | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 42196 | 184487 | NaN | 44.48507 | 11.34786 | 68 | 3 | 180 | 1.32 | 1 | 161 | 6 |
1 | 46352 | 467810 | NaN | 44.49168 | 11.33514 | 29 | 1 | 300 | 2.20 | 2 | 248 | 37 |
2 | 59697 | 286688 | NaN | 44.48817 | 11.34124 | 50 | 1 | 240 | 2.18 | 2 | 327 | 0 |
3 | 85368 | 467675 | NaN | 44.47834 | 11.35672 | 126 | 2 | 40 | 0.34 | 1 | 332 | 0 |
4 | 145779 | 705535 | NaN | 44.49306 | 11.33786 | 50 | 10 | 69 | 0.55 | 9 | 365 | 5 |
Other types include
object
for stringsbool
for booleansint
for integersfloat
for floats (numbers that are not integers)
We can also use logical operators to selet rows.
df_listings.loc[df_listings['number_of_reviews']>500, :].head()
id | name | host_id | host_name | neighbourhood_group | neighbourhood | latitude | longitude | room_type | price | minimum_nights | number_of_reviews | last_review | reviews_per_month | calculated_host_listings_count | availability_365 | number_of_reviews_ltm | license | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
52 | 884148 | APOSA FLAT / CITY CENTER - BO | 4664996 | Vie D'Acqua Di Sandra Maria | NaN | Santo Stefano | 44.49945 | 11.34566 | Entire home/apt | 46 | 1 | 668 | 2021-12-11 | 6.24 | 5 | 252 | 20 | NaN |
92 | 1435627 | heart of Bologna Piazza Maggiore | 7714013 | Carlotta | NaN | Porto - Saragozza | 44.49321 | 11.33569 | Entire home/apt | 56 | 2 | 508 | 2021-12-12 | 5.08 | 1 | 131 | 69 | NaN |
98 | 1566003 | "i portici di via Piella " | 8325248 | Massimo | NaN | Santo Stefano | 44.49855 | 11.34411 | Entire home/apt | 51 | 2 | 764 | 2021-12-14 | 7.62 | 3 | 119 | 120 | NaN |
131 | 2282623 | S.Orsola zone,parking for free and self check-in | 11658074 | Cecilia | NaN | San Donato - San Vitale | 44.49328 | 11.36650 | Entire home/apt | 38 | 1 | 689 | 2021-10-24 | 7.20 | 1 | 5 | 72 | NaN |
175 | 3216486 | Stanza Privata | 16289536 | Fabio | NaN | Navile | 44.50903 | 11.34200 | Private room | 82 | 1 | 569 | 2021-12-05 | 6.93 | 1 | 7 | 5 | NaN |
We can use logical operations as well. But remember to use paranthesis.
Note: the and
and or
expressions do not work in this setting. We have to use &
and |
instead.
df_listings.loc[(df_listings['number_of_reviews']>300) &
(df_listings['reviews_per_month']>7),
:].head()
id | name | host_id | host_name | neighbourhood_group | neighbourhood | latitude | longitude | room_type | price | minimum_nights | number_of_reviews | last_review | reviews_per_month | calculated_host_listings_count | availability_365 | number_of_reviews_ltm | license | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
98 | 1566003 | "i portici di via Piella " | 8325248 | Massimo | NaN | Santo Stefano | 44.498550 | 11.344110 | Entire home/apt | 51 | 2 | 764 | 2021-12-14 | 7.62 | 3 | 119 | 120 | NaN |
131 | 2282623 | S.Orsola zone,parking for free and self check-in | 11658074 | Cecilia | NaN | San Donato - San Vitale | 44.493280 | 11.366500 | Entire home/apt | 38 | 1 | 689 | 2021-10-24 | 7.20 | 1 | 5 | 72 | NaN |
204 | 4166793 | Centralissimo a Bologna | 8325248 | Massimo | NaN | Santo Stefano | 44.500920 | 11.344560 | Entire home/apt | 71 | 2 | 750 | 2021-12-10 | 9.21 | 3 | 233 | 84 | NaN |
751 | 15508481 | Monolocale in zona fiera /centro | 99632788 | Walid | NaN | Navile | 44.514462 | 11.353731 | Entire home/apt | 64 | 1 | 475 | 2021-12-01 | 7.56 | 1 | 4 | 48 | NaN |
773 | 15886516 | Monolocale nel cuore del ghetto ebraico di Bol... | 103024123 | Catia | NaN | Santo Stefano | 44.495080 | 11.347220 | Entire home/apt | 58 | 1 | 428 | 2021-12-15 | 7.88 | 1 | 285 | 17 | NaN |
For a single column (i.e. a Series), we can get the unique values using the unique()
function.
df_listings['neighbourhood'].unique()
array(['Santo Stefano', 'Porto - Saragozza', 'Navile',
'San Donato - San Vitale', 'Savena', 'Borgo Panigale - Reno'],
dtype=object)
For multiple columns, we can use the drop_duplicates
function.
df_listings[['neighbourhood', 'room_type']].drop_duplicates()
neighbourhood | room_type | |
---|---|---|
0 | Santo Stefano | Entire home/apt |
1 | Porto - Saragozza | Private room |
2 | Santo Stefano | Private room |
5 | Navile | Private room |
7 | Navile | Entire home/apt |
8 | Porto - Saragozza | Entire home/apt |
19 | San Donato - San Vitale | Private room |
24 | Savena | Private room |
36 | Borgo Panigale - Reno | Entire home/apt |
41 | San Donato - San Vitale | Entire home/apt |
70 | Porto - Saragozza | Hotel room |
75 | Borgo Panigale - Reno | Private room |
110 | Santo Stefano | Hotel room |
111 | Savena | Entire home/apt |
388 | Porto - Saragozza | Shared room |
678 | Navile | Shared room |
1393 | Savena | Shared room |
1416 | San Donato - San Vitale | Shared room |
1572 | San Donato - San Vitale | Hotel room |
1637 | Santo Stefano | Shared room |
1751 | Navile | Hotel room |
Aggregation and Pivot Tables
We can compute statistics by group using .groupby()
.
df_listings.groupby('neighbourhood')[['price', 'reviews_per_month']].mean()
price | reviews_per_month | |
---|---|---|
neighbourhood | ||
Borgo Panigale - Reno | 83.020548 | 0.983488 |
Navile | 142.200993 | 1.156745 |
Porto - Saragozza | 129.908312 | 1.340325 |
San Donato - San Vitale | 91.618138 | 0.933011 |
Santo Stefano | 119.441841 | 1.344810 |
Savena | 69.626016 | 0.805888 |
If you want to perform more than one function, maybe on different columns, you can use .aggregate()
which can be shortened to .agg()
. It takes as argument a dictionary with variables as keys and lists of functions as values.
df_listings.groupby('neighbourhood').agg({"reviews_per_month": ["mean"],
"price": ["min", np.max]}).reset_index()
neighbourhood | reviews_per_month | price | ||
---|---|---|---|---|
mean | min | amax | ||
0 | Borgo Panigale - Reno | 0.983488 | 9 | 1429 |
1 | Navile | 1.156745 | 14 | 5000 |
2 | Porto - Saragozza | 1.340325 | 7 | 9999 |
3 | San Donato - San Vitale | 0.933011 | 10 | 1600 |
4 | Santo Stefano | 1.344810 | 11 | 9999 |
5 | Savena | 0.805888 | 9 | 680 |
The problem with this syntax is that it generates a hierarchical structure for variable names, which might not be so easy to work with. In the example above, to access the mean price, you have to use df.price["min"]
.
To perform variable naming and aggregation and the same time, you can ise the following syntax: agg(output_var = ("input_var", function))
.
df_listings.groupby('neighbourhood').agg(mean_reviews=("reviews_per_month", "mean"),
min_price=("price", "min"),
max_price=("price", np.max)).reset_index()
neighbourhood | mean_reviews | min_price | max_price | |
---|---|---|---|---|
0 | Borgo Panigale - Reno | 0.983488 | 9 | 1429 |
1 | Navile | 1.156745 | 14 | 5000 |
2 | Porto - Saragozza | 1.340325 | 7 | 9999 |
3 | San Donato - San Vitale | 0.933011 | 10 | 1600 |
4 | Santo Stefano | 1.344810 | 11 | 9999 |
5 | Savena | 0.805888 | 9 | 680 |
We can make pivot tables with the .pivot_table()
function. It takes the folling arguments:
index
: rowscolumns
: columnsvalues
: valuesaggfunc
: aggregation function
df_listings.pivot_table(index='neighbourhood', columns='room_type', values='price', aggfunc='mean')
room_type | Entire home/apt | Hotel room | Private room | Shared room |
---|---|---|---|---|
neighbourhood | ||||
Borgo Panigale - Reno | 96.700935 | NaN | 45.487179 | NaN |
Navile | 172.140000 | 1350.000000 | 68.416107 | 28.0 |
Porto - Saragozza | 148.410926 | 102.375000 | 83.070234 | 16.5 |
San Donato - San Vitale | 106.775000 | 55.000000 | 61.194030 | 59.0 |
Santo Stefano | 129.990260 | 103.827586 | 80.734177 | 95.4 |
Savena | 86.301370 | NaN | 46.229167 | 22.5 |