Data Types
Last updated on Feb 27, 2022
import numpy as np
import pandas as pd
Setup
For the scope of this tutorial we are going to use AirBnb Scraped data for the city of Bologna. The data is freely available at Inside AirBnb: http://insideairbnb.com/get-the-data.html.
A description of all variables in all datasets is avaliable here.
We are going to use 2 datasets:
- listing dataset: contains listing-level information
- pricing dataset: contains pricing data, over time
# Import listings data
url_listings = "http://data.insideairbnb.com/italy/emilia-romagna/bologna/2021-12-17/visualisations/listings.csv"
df_listings = pd.read_csv(url_listings)
# Import pricing data
url_prices = "http://data.insideairbnb.com/italy/emilia-romagna/bologna/2021-12-17/data/calendar.csv.gz"
df_prices = pd.read_csv(url_prices, compression="gzip")
Numerical Data
Methods
pd.cut()
Standard mathematical operations between columns are done row-wise.
df_prices['maximum_nights'] - df_prices['minimum_nights']
0 148
1 357
2 357
3 357
4 357
...
1260340 1124
1260341 1124
1260342 1124
1260343 1124
1260344 1124
Length: 1260345, dtype: int64
We can use most numpy
operations element-wise on a single column.
np.log(df_listings['price'])
0 4.219508
1 3.367296
2 3.912023
3 4.836282
4 3.912023
...
3448 3.465736
3449 3.806662
3450 3.912023
3451 4.897840
3452 4.744932
Name: price, Length: 3453, dtype: float64
We can create a categorical variables from a numerical one using the pd.cut()
function.
pd.cut(df_listings['price'],
bins = [0, 50, 100, np.inf],
labels=['cheap', 'ok', 'expensive'])
0 ok
1 cheap
2 cheap
3 expensive
4 cheap
...
3448 cheap
3449 cheap
3450 cheap
3451 expensive
3452 expensive
Name: price, Length: 3453, dtype: category
Categories (3, object): ['cheap' < 'ok' < 'expensive']
String Data
Methods
.str()
We can use the +
operator between columns, to do pairwise append.
Note: we cannot do it with strings.
df_listings['host_name'] + df_listings['neighbourhood']
0 CarloSanto Stefano
1 EleonoraPorto - Saragozza
2 PaoloSanto Stefano
3 Anna MariaSanto Stefano
4 ValerioPorto - Saragozza
...
3448 IleanaNavile
3449 FernandaPorto - Saragozza
3450 IleanaNavile
3451 Wonderful ItalySanto Stefano
3452 Wonderful ItalyPorto - Saragozza
Length: 3453, dtype: object
Pandas Series have a lot of vectorized string functions you can call on them.
[i for i in dir(pd.Series.str) if '_' not in i]
['capitalize',
'casefold',
'cat',
'center',
'contains',
'count',
'decode',
'encode',
'endswith',
'extract',
'extractall',
'find',
'findall',
'fullmatch',
'get',
'index',
'isalnum',
'isalpha',
'isdecimal',
'isdigit',
'islower',
'isnumeric',
'isspace',
'istitle',
'isupper',
'join',
'len',
'ljust',
'lower',
'lstrip',
'match',
'normalize',
'pad',
'partition',
'repeat',
'replace',
'rfind',
'rindex',
'rjust',
'rpartition',
'rsplit',
'rstrip',
'slice',
'split',
'startswith',
'strip',
'swapcase',
'title',
'translate',
'upper',
'wrap',
'zfill']
For example, we want to remove the dollar symbol from the price
variable in the df_prices
dataset.
df_prices['price'].str.replace('$', '')
/var/folders/xy/szm3b7211cj16ldwcywbvfzdqv7yhd/T/ipykernel_74833/117745113.py:1: FutureWarning: The default value of regex will change from True to False in a future version. In addition, single character regular expressions will*not* be treated as literal strings when regex=True.
df_prices['price'].str.replace('$', '')
0 70.00
1 68.00
2 68.00
3 68.00
4 68.00
...
1260340 115.00
1260341 115.00
1260342 115.00
1260343 115.00
1260344 115.00
Name: price, Length: 1260345, dtype: object
Some of these functions use regular expressions.
match()
: Call re.match() on each element, returning a boolean.extract()
: Call re.match() on each element, returning matched groups as strings.findall()
: Call re.findall() on each elementreplace()
: Replace occurrences of pattern with some other stringcontains()
: Call re.search() on each element, returning a booleancount()
: Count occurrences of patternsplit()
: Equivalent to str.split(), but accepts regexps rsplit()
For example, the next code checks whether in the word centre
or center
are contained in the text description.
df_listings['name'].str.contains('centre|center')
0 True
1 False
2 True
3 False
4 False
...
3448 False
3449 False
3450 False
3451 False
3452 False
Name: name, Length: 3453, dtype: bool
Lastly, we can (try to) convert string variables to numeric using astype(float)
.
df_prices['price'].str.replace('[$|,]', '', regex=True).astype(float)
0 70.0
1 68.0
2 68.0
3 68.0
4 68.0
...
1260340 115.0
1260341 115.0
1260342 115.0
1260343 115.0
1260344 115.0
Name: price, Length: 1260345, dtype: float64
We can also use it to convert numerics to strings using astype(str)
.
df_listings['id'].astype(str)
0 42196
1 46352
2 59697
3 85368
4 145779
...
3448 53810648
3449 53820830
3450 53837098
3451 53837654
3452 53854962
Name: id, Length: 3453, dtype: object
Time Data
Methods
pd.to_datetime()
.dt.year()
pd.to_timedelta()
In the df_prices
we have a date variable, date
. Which format is it in? We can check it with the .dtypes
attribute.
df_prices['date'].dtypes
dtype('O')
We can convert a variable into a date using the
df_prices['datetime'] = pd.to_datetime(df_prices['date'])
Indeed, if we now check the format of the datetime
variable, it’s datetime
.
df_prices['datetime'].dtypes
dtype('<M8[ns]')
Once we have a variable in datetime
format, we gain plenty of datetime operations through the dt
library. For example, we can extract the year using .dt.year
.
df_prices['datetime'].dt.year
0 2021
1 2021
2 2021
3 2021
4 2021
...
1260340 2022
1260341 2022
1260342 2022
1260343 2022
1260344 2022
Name: datetime, Length: 1260345, dtype: int64
We can add or subtract time periods from a date using the pd.to_timedelta()
function. We need to specify the unit of measurement with the unit
option.
df_prices['datetime'] - pd.to_timedelta(3, unit='d')
0 2021-12-14
1 2021-12-14
2 2021-12-15
3 2021-12-16
4 2021-12-17
...
1260340 2022-12-09
1260341 2022-12-10
1260342 2022-12-11
1260343 2022-12-12
1260344 2022-12-13
Name: datetime, Length: 1260345, dtype: datetime64[ns]
Missing Data
Methods
.isna()
.dropna()
.fillna()
The function isna()
reports missing values.
df_listings.isna().head()
id | name | host_id | host_name | neighbourhood_group | neighbourhood | latitude | longitude | room_type | price | minimum_nights | number_of_reviews | last_review | reviews_per_month | calculated_host_listings_count | availability_365 | number_of_reviews_ltm | license | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | False | False | False | False | True | False | False | False | False | False | False | False | False | False | False | False | False | True |
1 | False | False | False | False | True | False | False | False | False | False | False | False | False | False | False | False | False | True |
2 | False | False | False | False | True | False | False | False | False | False | False | False | False | False | False | False | False | True |
3 | False | False | False | False | True | False | False | False | False | False | False | False | False | False | False | False | False | True |
4 | False | False | False | False | True | False | False | False | False | False | False | False | False | False | False | False | False | True |
To get a quick description of the amount of missing data in the dataset, we can use
df_listings.isna().sum()
id 0
name 0
host_id 0
host_name 9
neighbourhood_group 3453
neighbourhood 0
latitude 0
longitude 0
room_type 0
price 0
minimum_nights 0
number_of_reviews 0
last_review 409
reviews_per_month 409
calculated_host_listings_count 0
availability_365 0
number_of_reviews_ltm 0
license 3318
dtype: int64
We can drop missing values using dropna()
. It drops all rows with at least one missing value.
df_listings.dropna().shape
(0, 18)
In this case unfortunately, it drops all the rows. If we wa to drop only rows with all missing values, we can use the parameter how='all'
.
df_listings.dropna(how='all').shape
(3453, 18)
If we want to drop only missing values for one particular value, we can use the subset
option.
df_listings.dropna(subset=['reviews_per_month']).shape
(3044, 18)
We can also fill the missing values instead of dropping them, using fillna()
.
df_listings.fillna(' -- This was NA -- ').head()
id | name | host_id | host_name | neighbourhood_group | neighbourhood | latitude | longitude | room_type | price | minimum_nights | number_of_reviews | last_review | reviews_per_month | calculated_host_listings_count | availability_365 | number_of_reviews_ltm | license | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 42196 | 50 sm Studio in the historic centre | 184487 | Carlo | -- This was NA -- | Santo Stefano | 44.48507 | 11.34786 | Entire home/apt | 68 | 3 | 180 | 2021-11-12 | 1.32 | 1 | 161 | 6 | -- This was NA -- |
1 | 46352 | A room in Pasolini's house | 467810 | Eleonora | -- This was NA -- | Porto - Saragozza | 44.49168 | 11.33514 | Private room | 29 | 1 | 300 | 2021-11-30 | 2.2 | 2 | 248 | 37 | -- This was NA -- |
2 | 59697 | COZY LARGE BEDROOM in the city center | 286688 | Paolo | -- This was NA -- | Santo Stefano | 44.48817 | 11.34124 | Private room | 50 | 1 | 240 | 2020-10-04 | 2.18 | 2 | 327 | 0 | -- This was NA -- |
3 | 85368 | Garden House Bologna | 467675 | Anna Maria | -- This was NA -- | Santo Stefano | 44.47834 | 11.35672 | Entire home/apt | 126 | 2 | 40 | 2019-11-03 | 0.34 | 1 | 332 | 0 | -- This was NA -- |
4 | 145779 | SINGLE ROOM | 705535 | Valerio | -- This was NA -- | Porto - Saragozza | 44.49306 | 11.33786 | Private room | 50 | 10 | 69 | 2021-12-05 | 0.55 | 9 | 365 | 5 | -- This was NA -- |
We can also make missing values if we want.
df_listings.iloc[2, 2] = np.nan
df_listings.iloc[:3, :3]
id | name | host_id | |
---|---|---|---|
0 | 42196 | 50 sm Studio in the historic centre | 184487.0 |
1 | 46352 | A room in Pasolini's house | 467810.0 |
2 | 59697 | COZY LARGE BEDROOM in the city center | NaN |