Data Types

Last updated on May 1, 2022

import numpy as np
import pandas as pd

Setup

For the scope of this tutorial we are going to use AirBnb Scraped data for the city of Bologna. The data is freely available at Inside AirBnb: http://insideairbnb.com/get-the-data.html.

A description of all variables in all datasets is avaliable here.

We are going to use 2 datasets:

listing dataset: contains listing-level information
pricing dataset: contains pricing data, over time

# Import listings data
url_listings = "http://data.insideairbnb.com/italy/emilia-romagna/bologna/2021-12-17/visualisations/listings.csv"
df_listings = pd.read_csv(url_listings)

# Import pricing data
url_prices = "http://data.insideairbnb.com/italy/emilia-romagna/bologna/2021-12-17/data/calendar.csv.gz"
df_prices = pd.read_csv(url_prices, compression="gzip")

Numerical Data

Methods

+, -, *, /

numpy functions

pd.cut()

Standard mathematical operations between columns are done row-wise.

df_prices['maximum_nights'] - df_prices['minimum_nights']

0           148
1           357
2           357
3           357
4           357
           ... 
1260340    1124
1260341    1124
1260342    1124
1260343    1124
1260344    1124
Length: 1260345, dtype: int64

We can use most numpy operations element-wise on a single column.

np.log(df_listings['price'])

0       4.219508
1       3.367296
2       3.912023
3       4.836282
4       3.912023
          ...   
3448    3.465736
3449    3.806662
3450    3.912023
3451    4.897840
3452    4.744932
Name: price, Length: 3453, dtype: float64

We can create a categorical variables from a numerical one using the pd.cut() function.

pd.cut(df_listings['price'], 
       bins = [0, 50, 100, np.inf], 
       labels=['cheap', 'ok', 'expensive'])

0              ok
1           cheap
2           cheap
3       expensive
4           cheap
          ...    
3448        cheap
3449        cheap
3450        cheap
3451    expensive
3452    expensive
Name: price, Length: 3453, dtype: category
Categories (3, object): ['cheap' < 'ok' < 'expensive']

String Data

Methods

+

.str.replace

.str.contains

.astype(str) -pd.get_dummies()

We can use the + operator between columns, to do pairwise append.

Note: we cannot do it with strings.

df_listings['host_name'] + df_listings['neighbourhood']

0                     CarloSanto Stefano
1              EleonoraPorto - Saragozza
2                     PaoloSanto Stefano
3                Anna MariaSanto Stefano
4               ValerioPorto - Saragozza
                      ...               
3448                        IleanaNavile
3449           FernandaPorto - Saragozza
3450                        IleanaNavile
3451        Wonderful ItalySanto Stefano
3452    Wonderful ItalyPorto - Saragozza
Length: 3453, dtype: object

Pandas Series have a lot of vectorized string functions. You can find a list here.

For example, we want to remove the dollar symbol from the price variable in the df_prices dataset.

df_prices['price'].str.replace('$', '', regex=False)

0           70.00
1           68.00
2           68.00
3           68.00
4           68.00
            ...  
1260340    115.00
1260341    115.00
1260342    115.00
1260343    115.00
1260344    115.00
Name: price, Length: 1260345, dtype: object

Some of these functions use regular expressions.

match(): Call re.match() on each element, returning a boolean.
extract(): Call re.match() on each element, returning matched groups as strings.
findall(): Call re.findall() on each element
replace(): Replace occurrences of pattern with some other string
contains(): Call re.search() on each element, returning a boolean
count(): Count occurrences of pattern
split(): Equivalent to str.split(), but accepts regexps rsplit()

For example, the next code checks whether in the word centre or center are contained in the text description.

df_listings['name'].str.contains('centre|center')

0        True
1       False
2        True
3       False
4       False
        ...  
3448    False
3449    False
3450    False
3451    False
3452    False
Name: name, Length: 3453, dtype: bool

Lastly, we can (try to) convert string variables to numeric using astype(float).

df_prices['price'].str.replace('[$|,]', '', regex=True).astype(float)

0           70.0
1           68.0
2           68.0
3           68.0
4           68.0
           ...  
1260340    115.0
1260341    115.0
1260342    115.0
1260343    115.0
1260344    115.0
Name: price, Length: 1260345, dtype: float64

We can also use it to convert numerics to strings using astype(str).

df_listings['id'].astype(str)

0          42196
1          46352
2          59697
3          85368
4         145779
          ...   
3448    53810648
3449    53820830
3450    53837098
3451    53837654
3452    53854962
Name: id, Length: 3453, dtype: object

We can generate dummies from a categorical variable using pd.get_dummies()

pd.get_dummies(df_listings['neighbourhood']).head()

	Porto - Saragozza	Santo Stefano
0	0	1
1	1	0
2	0	1
3	0	1
4	1	0

Time Data

Methods

pd.to_datetime()

.dt.year

.df.to_period()

pd.to_timedelta()

In the df_prices we have a date variable, date. Which format is it in? We can check it with the .dtypes attribute.

df_prices['date'].dtypes

dtype('O')

We can convert a variable into a date using the

df_prices['datetime'] = pd.to_datetime(df_prices['date'])

Indeed, if we now check the format of the datetime variable, it’s datetime.

df_prices['datetime'].dtypes

dtype('<M8[ns]')

Once we have a variable in datetime format, we gain plenty of datetime operations through the dt accessor object for datetime-like properties.

For example, we can extract the year using .dt.year. We can do the same with month, week and day.

df_prices['datetime'].dt.year

0          2021
1          2021
2          2021
3          2021
4          2021
           ... 
1260340    2022
1260341    2022
1260342    2022
1260343    2022
1260344    2022
Name: datetime, Length: 1260345, dtype: int64

We can change the level of aggregation of a date using .dt.to_period(). The option M converts to year-month level.

df_prices['datetime'].dt.to_period('M')

0          2021-12
1          2021-12
2          2021-12
3          2021-12
4          2021-12
            ...   
1260340    2022-12
1260341    2022-12
1260342    2022-12
1260343    2022-12
1260344    2022-12
Name: datetime, Length: 1260345, dtype: period[M]

We can add or subtract time periods from a date using the pd.to_timedelta() function. We need to specify the unit of measurement with the unit option.

df_prices['datetime'] -  pd.to_timedelta(3, unit='d')

0         2021-12-14
1         2021-12-14
2         2021-12-15
3         2021-12-16
4         2021-12-17
             ...    
1260340   2022-12-09
1260341   2022-12-10
1260342   2022-12-11
1260343   2022-12-12
1260344   2022-12-13
Name: datetime, Length: 1260345, dtype: datetime64[ns]

Missing Data

Methods

.isna()

.dropna()

.fillna()

The function isna() reports missing values.

df_listings.isna().head()

	id	name	host_id	host_name	neighbourhood_group	neighbourhood	latitude	longitude	room_type	price	minimum_nights	number_of_reviews	last_review	reviews_per_month	calculated_host_listings_count	availability_365	number_of_reviews_ltm	license
0	False	False	False	False	True	False	False	False	False	False	False	False	False	False	False	False	False	True
1	False	False	False	False	True	False	False	False	False	False	False	False	False	False	False	False	False	True
2	False	False	False	False	True	False	False	False	False	False	False	False	False	False	False	False	False	True
3	False	False	False	False	True	False	False	False	False	False	False	False	False	False	False	False	False	True
4	False	False	False	False	True	False	False	False	False	False	False	False	False	False	False	False	False	True

To get a quick description of the amount of missing data in the dataset, we can use

df_listings.isna().sum()

id                                   0
name                                 0
host_id                              0
host_name                            9
neighbourhood_group               3453
neighbourhood                        0
latitude                             0
longitude                            0
room_type                            0
price                                0
minimum_nights                       0
number_of_reviews                    0
last_review                        409
reviews_per_month                  409
calculated_host_listings_count       0
availability_365                     0
number_of_reviews_ltm                0
license                           3318
dtype: int64

We can drop missing values using dropna(). It drops all rows with at least one missing value.

df_listings.dropna().shape

(0, 18)

In this case unfortunately, it drops all the rows. If we wa to drop only rows with all missing values, we can use the parameter how='all'.

df_listings.dropna(how='all').shape

(3453, 18)

If we want to drop only missing values for one particular value, we can use the subset option.

df_listings.dropna(subset=['reviews_per_month']).shape

(3044, 18)

We can also fill the missing values instead of dropping them, using fillna().

df_listings.fillna(' -- This was NA  -- ').head()

	id	name	host_id	host_name	neighbourhood_group	neighbourhood	latitude	longitude	room_type	price	minimum_nights	number_of_reviews	last_review	reviews_per_month	calculated_host_listings_count	availability_365	number_of_reviews_ltm	license
0	42196	50 sm Studio in the historic centre	184487	Carlo	-- This was NA --	Santo Stefano	44.48507	11.34786	Entire home/apt	68	3	180	2021-11-12	1.32	1	161	6	-- This was NA --
1	46352	A room in Pasolini's house	467810	Eleonora	-- This was NA --	Porto - Saragozza	44.49168	11.33514	Private room	29	1	300	2021-11-30	2.2	2	248	37	-- This was NA --
2	59697	COZY LARGE BEDROOM in the city center	286688	Paolo	-- This was NA --	Santo Stefano	44.48817	11.34124	Private room	50	1	240	2020-10-04	2.18	2	327	0	-- This was NA --
3	85368	Garden House Bologna	467675	Anna Maria	-- This was NA --	Santo Stefano	44.47834	11.35672	Entire home/apt	126	2	40	2019-11-03	0.34	1	332	0	-- This was NA --
4	145779	SINGLE ROOM	705535	Valerio	-- This was NA --	Porto - Saragozza	44.49306	11.33786	Private room	50	10	69	2021-12-05	0.55	9	365	5	-- This was NA --

We can also make missing values if we want.

df_listings.iloc[2, 2] = np.nan
df_listings.iloc[:3, :3]

	id	name	host_id
0	42196	50 sm Studio in the historic centre	184487.0
1	46352	A room in Pasolini's house	467810.0
2	59697	COZY LARGE BEDROOM in the city center	NaN