Data Types

Last updated on May 1, 2022

import numpy as np
import pandas as pd

Setup

For the scope of this tutorial we are going to use AirBnb Scraped data for the city of Bologna. The data is freely available at Inside AirBnb: http://insideairbnb.com/get-the-data.html.

A description of all variables in all datasets is avaliable here.

We are going to use 2 datasets:

  • listing dataset: contains listing-level information
  • pricing dataset: contains pricing data, over time
# Import listings data
url_listings = "http://data.insideairbnb.com/italy/emilia-romagna/bologna/2021-12-17/visualisations/listings.csv"
df_listings = pd.read_csv(url_listings)

# Import pricing data
url_prices = "http://data.insideairbnb.com/italy/emilia-romagna/bologna/2021-12-17/data/calendar.csv.gz"
df_prices = pd.read_csv(url_prices, compression="gzip")

Numerical Data

Methods

  • +, -, *, /
  • numpy functions
  • pd.cut()

Standard mathematical operations between columns are done row-wise.

df_prices['maximum_nights'] - df_prices['minimum_nights']
0           148
1           357
2           357
3           357
4           357
           ... 
1260340    1124
1260341    1124
1260342    1124
1260343    1124
1260344    1124
Length: 1260345, dtype: int64

We can use most numpy operations element-wise on a single column.

np.log(df_listings['price'])
0       4.219508
1       3.367296
2       3.912023
3       4.836282
4       3.912023
          ...   
3448    3.465736
3449    3.806662
3450    3.912023
3451    4.897840
3452    4.744932
Name: price, Length: 3453, dtype: float64

We can create a categorical variables from a numerical one using the pd.cut() function.

pd.cut(df_listings['price'], 
       bins = [0, 50, 100, np.inf], 
       labels=['cheap', 'ok', 'expensive'])
0              ok
1           cheap
2           cheap
3       expensive
4           cheap
          ...    
3448        cheap
3449        cheap
3450        cheap
3451    expensive
3452    expensive
Name: price, Length: 3453, dtype: category
Categories (3, object): ['cheap' < 'ok' < 'expensive']

String Data

Methods

  • +
  • .str.replace
  • .str.contains
  • .astype(str) -pd.get_dummies()

We can use the + operator between columns, to do pairwise append.

Note: we cannot do it with strings.

df_listings['host_name'] + df_listings['neighbourhood']
0                     CarloSanto Stefano
1              EleonoraPorto - Saragozza
2                     PaoloSanto Stefano
3                Anna MariaSanto Stefano
4               ValerioPorto - Saragozza
                      ...               
3448                        IleanaNavile
3449           FernandaPorto - Saragozza
3450                        IleanaNavile
3451        Wonderful ItalySanto Stefano
3452    Wonderful ItalyPorto - Saragozza
Length: 3453, dtype: object

Pandas Series have a lot of vectorized string functions. You can find a list here.

For example, we want to remove the dollar symbol from the price variable in the df_prices dataset.

df_prices['price'].str.replace('$', '', regex=False)
0           70.00
1           68.00
2           68.00
3           68.00
4           68.00
            ...  
1260340    115.00
1260341    115.00
1260342    115.00
1260343    115.00
1260344    115.00
Name: price, Length: 1260345, dtype: object

Some of these functions use regular expressions.

  • match(): Call re.match() on each element, returning a boolean.
  • extract(): Call re.match() on each element, returning matched groups as strings.
  • findall(): Call re.findall() on each element
  • replace(): Replace occurrences of pattern with some other string
  • contains(): Call re.search() on each element, returning a boolean
  • count(): Count occurrences of pattern
  • split(): Equivalent to str.split(), but accepts regexps rsplit()

For example, the next code checks whether in the word centre or center are contained in the text description.

df_listings['name'].str.contains('centre|center')
0        True
1       False
2        True
3       False
4       False
        ...  
3448    False
3449    False
3450    False
3451    False
3452    False
Name: name, Length: 3453, dtype: bool

Lastly, we can (try to) convert string variables to numeric using astype(float).

df_prices['price'].str.replace('[$|,]', '', regex=True).astype(float)
0           70.0
1           68.0
2           68.0
3           68.0
4           68.0
           ...  
1260340    115.0
1260341    115.0
1260342    115.0
1260343    115.0
1260344    115.0
Name: price, Length: 1260345, dtype: float64

We can also use it to convert numerics to strings using astype(str).

df_listings['id'].astype(str)
0          42196
1          46352
2          59697
3          85368
4         145779
          ...   
3448    53810648
3449    53820830
3450    53837098
3451    53837654
3452    53854962
Name: id, Length: 3453, dtype: object

We can generate dummies from a categorical variable using pd.get_dummies()

pd.get_dummies(df_listings['neighbourhood']).head()

Borgo Panigale - Reno Navile Porto - Saragozza San Donato - San Vitale Santo Stefano Savena
0 0 0 0 0 1 0
1 0 0 1 0 0 0
2 0 0 0 0 1 0
3 0 0 0 0 1 0
4 0 0 1 0 0 0

Time Data

Methods

  • pd.to_datetime()
  • .dt.year
  • .df.to_period()
  • pd.to_timedelta()

In the df_prices we have a date variable, date. Which format is it in? We can check it with the .dtypes attribute.

df_prices['date'].dtypes
dtype('O')

We can convert a variable into a date using the

df_prices['datetime'] = pd.to_datetime(df_prices['date'])

Indeed, if we now check the format of the datetime variable, it’s datetime.

df_prices['datetime'].dtypes
dtype('<M8[ns]')

Once we have a variable in datetime format, we gain plenty of datetime operations through the dt accessor object for datetime-like properties.

For example, we can extract the year using .dt.year. We can do the same with month, week and day.

df_prices['datetime'].dt.year
0          2021
1          2021
2          2021
3          2021
4          2021
           ... 
1260340    2022
1260341    2022
1260342    2022
1260343    2022
1260344    2022
Name: datetime, Length: 1260345, dtype: int64

We can change the level of aggregation of a date using .dt.to_period(). The option M converts to year-month level.

df_prices['datetime'].dt.to_period('M')
0          2021-12
1          2021-12
2          2021-12
3          2021-12
4          2021-12
            ...   
1260340    2022-12
1260341    2022-12
1260342    2022-12
1260343    2022-12
1260344    2022-12
Name: datetime, Length: 1260345, dtype: period[M]

We can add or subtract time periods from a date using the pd.to_timedelta() function. We need to specify the unit of measurement with the unit option.

df_prices['datetime'] -  pd.to_timedelta(3, unit='d')
0         2021-12-14
1         2021-12-14
2         2021-12-15
3         2021-12-16
4         2021-12-17
             ...    
1260340   2022-12-09
1260341   2022-12-10
1260342   2022-12-11
1260343   2022-12-12
1260344   2022-12-13
Name: datetime, Length: 1260345, dtype: datetime64[ns]

Missing Data

Methods

  • .isna()
  • .dropna()
  • .fillna()

The function isna() reports missing values.

df_listings.isna().head()

id name host_id host_name neighbourhood_group neighbourhood latitude longitude room_type price minimum_nights number_of_reviews last_review reviews_per_month calculated_host_listings_count availability_365 number_of_reviews_ltm license
0 False False False False True False False False False False False False False False False False False True
1 False False False False True False False False False False False False False False False False False True
2 False False False False True False False False False False False False False False False False False True
3 False False False False True False False False False False False False False False False False False True
4 False False False False True False False False False False False False False False False False False True

To get a quick description of the amount of missing data in the dataset, we can use

df_listings.isna().sum()
id                                   0
name                                 0
host_id                              0
host_name                            9
neighbourhood_group               3453
neighbourhood                        0
latitude                             0
longitude                            0
room_type                            0
price                                0
minimum_nights                       0
number_of_reviews                    0
last_review                        409
reviews_per_month                  409
calculated_host_listings_count       0
availability_365                     0
number_of_reviews_ltm                0
license                           3318
dtype: int64

We can drop missing values using dropna(). It drops all rows with at least one missing value.

df_listings.dropna().shape
(0, 18)

In this case unfortunately, it drops all the rows. If we wa to drop only rows with all missing values, we can use the parameter how='all'.

df_listings.dropna(how='all').shape
(3453, 18)

If we want to drop only missing values for one particular value, we can use the subset option.

df_listings.dropna(subset=['reviews_per_month']).shape
(3044, 18)

We can also fill the missing values instead of dropping them, using fillna().

df_listings.fillna(' -- This was NA  -- ').head()

id name host_id host_name neighbourhood_group neighbourhood latitude longitude room_type price minimum_nights number_of_reviews last_review reviews_per_month calculated_host_listings_count availability_365 number_of_reviews_ltm license
0 42196 50 sm Studio in the historic centre 184487 Carlo -- This was NA -- Santo Stefano 44.48507 11.34786 Entire home/apt 68 3 180 2021-11-12 1.32 1 161 6 -- This was NA --
1 46352 A room in Pasolini's house 467810 Eleonora -- This was NA -- Porto - Saragozza 44.49168 11.33514 Private room 29 1 300 2021-11-30 2.2 2 248 37 -- This was NA --
2 59697 COZY LARGE BEDROOM in the city center 286688 Paolo -- This was NA -- Santo Stefano 44.48817 11.34124 Private room 50 1 240 2020-10-04 2.18 2 327 0 -- This was NA --
3 85368 Garden House Bologna 467675 Anna Maria -- This was NA -- Santo Stefano 44.47834 11.35672 Entire home/apt 126 2 40 2019-11-03 0.34 1 332 0 -- This was NA --
4 145779 SINGLE ROOM 705535 Valerio -- This was NA -- Porto - Saragozza 44.49306 11.33786 Private room 50 10 69 2021-12-05 0.55 9 365 5 -- This was NA --

We can also make missing values if we want.

df_listings.iloc[2, 2] = np.nan
df_listings.iloc[:3, :3]

id name host_id
0 42196 50 sm Studio in the historic centre 184487.0
1 46352 A room in Pasolini's house 467810.0
2 59697 COZY LARGE BEDROOM in the city center NaN
Previous
Next