Data Exploration

Last updated on May 1, 2022

import numpy as np
import pandas as pd

Setup

For the scope of this tutorial we are going to use AirBnb Scraped data for the city of Bologna. The data is freely available at Inside AirBnb: http://insideairbnb.com/get-the-data.html.

A description of all variables in all datasets is avaliable here.

We are going to use 2 datasets:

  • listing dataset: contains listing-level information
  • pricing dataset: contains pricing data, over time

Importing Data

Pandas has a variety of function to import data

  • pd.read_csv()
  • pd.read_html()
  • pd.read_parquet()

Importatly for our purpose, pd.read_csv() can directly import data from the web.

The first dataset that we are going to import is the dataset of Airbnb listings in Bologna. It contains listing-level information.

url_listings = "http://data.insideairbnb.com/italy/emilia-romagna/bologna/2021-12-17/visualisations/listings.csv"
df_listings = pd.read_csv(url_listings)

The second dataset that we are going to use is the dataset of calendar prices. This time the dataset is compressed but we can use the compression option to import it directly.

url_prices = "http://data.insideairbnb.com/italy/emilia-romagna/bologna/2021-12-17/data/calendar.csv.gz"
df_prices = pd.read_csv(url_prices, compression="gzip")

Inspecting Data

Methods

  • info()
  • head()
  • describe()

The first way yo have a quick look at the data is the info() method. If called with the option verbose=False, it gives a quick overview of the dimensions of the data.

df_listings.info(verbose=False)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3453 entries, 0 to 3452
Columns: 18 entries, id to license
dtypes: float64(4), int64(8), object(6)
memory usage: 485.7+ KB

If we want to know how the data looks like, we can use the head() method. It prints the first 5 lines of the data by default.

df_listings.head()

id name host_id host_name neighbourhood_group neighbourhood latitude longitude room_type price minimum_nights number_of_reviews last_review reviews_per_month calculated_host_listings_count availability_365 number_of_reviews_ltm license
0 42196 50 sm Studio in the historic centre 184487 Carlo NaN Santo Stefano 44.48507 11.34786 Entire home/apt 68 3 180 2021-11-12 1.32 1 161 6 NaN
1 46352 A room in Pasolini's house 467810 Eleonora NaN Porto - Saragozza 44.49168 11.33514 Private room 29 1 300 2021-11-30 2.20 2 248 37 NaN
2 59697 COZY LARGE BEDROOM in the city center 286688 Paolo NaN Santo Stefano 44.48817 11.34124 Private room 50 1 240 2020-10-04 2.18 2 327 0 NaN
3 85368 Garden House Bologna 467675 Anna Maria NaN Santo Stefano 44.47834 11.35672 Entire home/apt 126 2 40 2019-11-03 0.34 1 332 0 NaN
4 145779 SINGLE ROOM 705535 Valerio NaN Porto - Saragozza 44.49306 11.33786 Private room 50 10 69 2021-12-05 0.55 9 365 5 NaN

We can print a description of the data using describe(). If we have many variables, it’s best to print it transposed using the .T attribute.

df_listings.describe().T[:5]

count mean std min 25% 50% 75% max
id 3453.0 2.950218e+07 1.523988e+07 42196.0000 1.748597e+07 3.078707e+07 4.220094e+07 5.385496e+07
host_id 3453.0 1.236424e+08 1.160756e+08 38468.0000 2.550007e+07 8.845438e+07 2.005926e+08 4.354316e+08
neighbourhood_group 0.0 NaN NaN NaN NaN NaN NaN NaN
latitude 3453.0 4.449756e+01 1.173569e-02 44.4236 4.449186e+01 4.449699e+01 4.450271e+01 4.455093e+01
longitude 3453.0 1.134509e+01 1.986071e-02 11.2320 1.133732e+01 1.134519e+01 1.135406e+01 1.142027e+01

You can select which variables to display using the include option. include='all' includes also categorical variables.

df_listings.describe(include='all').T[:5]

count unique top freq mean std min 25% 50% 75% max
id 3453.0 NaN NaN NaN 29502177.118158 15239877.346777 42196.0 17485973.0 30787074.0 42200938.0 53854962.0
name 3453 3410 Luxury Industrial Design LOFT, HEPA UV airpuri... 5 NaN NaN NaN NaN NaN NaN NaN
host_id 3453.0 NaN NaN NaN 123642405.854619 116075571.230048 38468.0 25500072.0 88454378.0 200592620.0 435431590.0
host_name 3444 747 Andrea 101 NaN NaN NaN NaN NaN NaN NaN
neighbourhood_group 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

We can get the list of columns using the .columns attribute.

df_listings.columns
Index(['id', 'name', 'host_id', 'host_name', 'neighbourhood_group',
       'neighbourhood', 'latitude', 'longitude', 'room_type', 'price',
       'minimum_nights', 'number_of_reviews', 'last_review',
       'reviews_per_month', 'calculated_host_listings_count',
       'availability_365', 'number_of_reviews_ltm', 'license'],
      dtype='object')

We can get the index using the .index attribute,

df_listings.index
RangeIndex(start=0, stop=3453, step=1)

Data Selection

We can access single columns as if the DataFrame was a dictionary.

df_listings['price']
0        68
1        29
2        50
3       126
4        50
       ... 
3448     32
3449     45
3450     50
3451    134
3452    115
Name: price, Length: 3453, dtype: int64

We can select rows and columns by index, using the .iloc attribute.

df_listings.iloc[:7, 5:9]

neighbourhood latitude longitude room_type
0 Santo Stefano 44.48507 11.34786 Entire home/apt
1 Porto - Saragozza 44.49168 11.33514 Private room
2 Santo Stefano 44.48817 11.34124 Private room
3 Santo Stefano 44.47834 11.35672 Entire home/apt
4 Porto - Saragozza 44.49306 11.33786 Private room
5 Navile 44.51628 11.33074 Private room
6 Santo Stefano 44.48787 11.35392 Entire home/apt

If we want to condition only on rows or columns, we have use : for the unrestricted dimesion, otherwise we get an error.

df_listings.iloc[:, 5:9].head()

neighbourhood latitude longitude room_type
0 Santo Stefano 44.48507 11.34786 Entire home/apt
1 Porto - Saragozza 44.49168 11.33514 Private room
2 Santo Stefano 44.48817 11.34124 Private room
3 Santo Stefano 44.47834 11.35672 Entire home/apt
4 Porto - Saragozza 44.49306 11.33786 Private room

Instead, the .loc attribute allows us to use row and column names.

df_listings.loc[:, ['neighbourhood', 'latitude', 'longitude']].head()

neighbourhood latitude longitude
0 Santo Stefano 44.48507 11.34786
1 Porto - Saragozza 44.49168 11.33514
2 Santo Stefano 44.48817 11.34124
3 Santo Stefano 44.47834 11.35672
4 Porto - Saragozza 44.49306 11.33786

We can also select ranges.

df_listings.loc[:, 'neighbourhood':'room_type'].head()

neighbourhood latitude longitude room_type
0 Santo Stefano 44.48507 11.34786 Entire home/apt
1 Porto - Saragozza 44.49168 11.33514 Private room
2 Santo Stefano 44.48817 11.34124 Private room
3 Santo Stefano 44.47834 11.35672 Entire home/apt
4 Porto - Saragozza 44.49306 11.33786 Private room

There is an easy way to select numerical columns, the .select_dtypes() function.

df_listings.select_dtypes(include=['number']).head()

id host_id neighbourhood_group latitude longitude price minimum_nights number_of_reviews reviews_per_month calculated_host_listings_count availability_365 number_of_reviews_ltm
0 42196 184487 NaN 44.48507 11.34786 68 3 180 1.32 1 161 6
1 46352 467810 NaN 44.49168 11.33514 29 1 300 2.20 2 248 37
2 59697 286688 NaN 44.48817 11.34124 50 1 240 2.18 2 327 0
3 85368 467675 NaN 44.47834 11.35672 126 2 40 0.34 1 332 0
4 145779 705535 NaN 44.49306 11.33786 50 10 69 0.55 9 365 5

Other types include

  • object for strings
  • bool for booleans
  • int for integers
  • float for floats (numbers that are not integers)

We can also use logical operators to selet rows.

df_listings.loc[df_listings['number_of_reviews']>500, :].head()

id name host_id host_name neighbourhood_group neighbourhood latitude longitude room_type price minimum_nights number_of_reviews last_review reviews_per_month calculated_host_listings_count availability_365 number_of_reviews_ltm license
52 884148 APOSA FLAT / CITY CENTER - BO 4664996 Vie D'Acqua Di Sandra Maria NaN Santo Stefano 44.49945 11.34566 Entire home/apt 46 1 668 2021-12-11 6.24 5 252 20 NaN
92 1435627 heart of Bologna Piazza Maggiore 7714013 Carlotta NaN Porto - Saragozza 44.49321 11.33569 Entire home/apt 56 2 508 2021-12-12 5.08 1 131 69 NaN
98 1566003 "i portici di via Piella " 8325248 Massimo NaN Santo Stefano 44.49855 11.34411 Entire home/apt 51 2 764 2021-12-14 7.62 3 119 120 NaN
131 2282623 S.Orsola zone,parking for free and self check-in 11658074 Cecilia NaN San Donato - San Vitale 44.49328 11.36650 Entire home/apt 38 1 689 2021-10-24 7.20 1 5 72 NaN
175 3216486 Stanza Privata 16289536 Fabio NaN Navile 44.50903 11.34200 Private room 82 1 569 2021-12-05 6.93 1 7 5 NaN

We can use logical operations as well. But remember to use paranthesis.

Note: the and and or expressions do not work in this setting. We have to use & and | instead.

df_listings.loc[(df_listings['number_of_reviews']>300) &
                (df_listings['reviews_per_month']>7), 
                :].head()

id name host_id host_name neighbourhood_group neighbourhood latitude longitude room_type price minimum_nights number_of_reviews last_review reviews_per_month calculated_host_listings_count availability_365 number_of_reviews_ltm license
98 1566003 "i portici di via Piella " 8325248 Massimo NaN Santo Stefano 44.498550 11.344110 Entire home/apt 51 2 764 2021-12-14 7.62 3 119 120 NaN
131 2282623 S.Orsola zone,parking for free and self check-in 11658074 Cecilia NaN San Donato - San Vitale 44.493280 11.366500 Entire home/apt 38 1 689 2021-10-24 7.20 1 5 72 NaN
204 4166793 Centralissimo a Bologna 8325248 Massimo NaN Santo Stefano 44.500920 11.344560 Entire home/apt 71 2 750 2021-12-10 9.21 3 233 84 NaN
751 15508481 Monolocale in zona fiera /centro 99632788 Walid NaN Navile 44.514462 11.353731 Entire home/apt 64 1 475 2021-12-01 7.56 1 4 48 NaN
773 15886516 Monolocale nel cuore del ghetto ebraico di Bol... 103024123 Catia NaN Santo Stefano 44.495080 11.347220 Entire home/apt 58 1 428 2021-12-15 7.88 1 285 17 NaN

For a single column (i.e. a Series), we can get the unique values using the unique() function.

df_listings['neighbourhood'].unique()
array(['Santo Stefano', 'Porto - Saragozza', 'Navile',
       'San Donato - San Vitale', 'Savena', 'Borgo Panigale - Reno'],
      dtype=object)

For multiple columns, we can use the drop_duplicates function.

df_listings[['neighbourhood', 'room_type']].drop_duplicates()

neighbourhood room_type
0 Santo Stefano Entire home/apt
1 Porto - Saragozza Private room
2 Santo Stefano Private room
5 Navile Private room
7 Navile Entire home/apt
8 Porto - Saragozza Entire home/apt
19 San Donato - San Vitale Private room
24 Savena Private room
36 Borgo Panigale - Reno Entire home/apt
41 San Donato - San Vitale Entire home/apt
70 Porto - Saragozza Hotel room
75 Borgo Panigale - Reno Private room
110 Santo Stefano Hotel room
111 Savena Entire home/apt
388 Porto - Saragozza Shared room
678 Navile Shared room
1393 Savena Shared room
1416 San Donato - San Vitale Shared room
1572 San Donato - San Vitale Hotel room
1637 Santo Stefano Shared room
1751 Navile Hotel room

Aggregation and Pivot Tables

We can compute statistics by group using .groupby().

df_listings.groupby('neighbourhood')[['price', 'reviews_per_month']].mean()

price reviews_per_month
neighbourhood
Borgo Panigale - Reno 83.020548 0.983488
Navile 142.200993 1.156745
Porto - Saragozza 129.908312 1.340325
San Donato - San Vitale 91.618138 0.933011
Santo Stefano 119.441841 1.344810
Savena 69.626016 0.805888

If you want to perform more than one function, maybe on different columns, you can use .aggregate() which can be shortened to .agg(). It takes as argument a dictionary with variables as keys and lists of functions as values.

df_listings.groupby('neighbourhood').agg({"reviews_per_month": ["mean"],
                                          "price": ["min", np.max]}).reset_index()

neighbourhood reviews_per_month price
mean min amax
0 Borgo Panigale - Reno 0.983488 9 1429
1 Navile 1.156745 14 5000
2 Porto - Saragozza 1.340325 7 9999
3 San Donato - San Vitale 0.933011 10 1600
4 Santo Stefano 1.344810 11 9999
5 Savena 0.805888 9 680

The problem with this syntax is that it generates a hierarchical structure for variable names, which might not be so easy to work with. In the example above, to access the mean price, you have to use df.price["min"].

To perform variable naming and aggregation and the same time, you can ise the following syntax: agg(output_var = ("input_var", function)).

df_listings.groupby('neighbourhood').agg(mean_reviews=("reviews_per_month", "mean"),
                                         min_price=("price", "min"),
                                         max_price=("price", np.max)).reset_index()

neighbourhood mean_reviews min_price max_price
0 Borgo Panigale - Reno 0.983488 9 1429
1 Navile 1.156745 14 5000
2 Porto - Saragozza 1.340325 7 9999
3 San Donato - San Vitale 0.933011 10 1600
4 Santo Stefano 1.344810 11 9999
5 Savena 0.805888 9 680

We can make pivot tables with the .pivot_table() function. It takes the folling arguments:

  • index: rows
  • columns: columns
  • values: values
  • aggfunc: aggregation function
df_listings.pivot_table(index='neighbourhood', columns='room_type', values='price', aggfunc='mean')

room_type Entire home/apt Hotel room Private room Shared room
neighbourhood
Borgo Panigale - Reno 96.700935 NaN 45.487179 NaN
Navile 172.140000 1350.000000 68.416107 28.0
Porto - Saragozza 148.410926 102.375000 83.070234 16.5
San Donato - San Vitale 106.775000 55.000000 61.194030 59.0
Santo Stefano 129.990260 103.827586 80.734177 95.4
Savena 86.301370 NaN 46.229167 22.5
Previous
Next