Non-Parametric Regression
Last updated on Mar 9, 2022
# Setup
%matplotlib inline
from utils.lecture03 import *
Dataset
For this session, we are mostly going to work with the wage dataset.
df = pd.read_csv('data/Wage.csv', index_col=0)
df.head(3)
 | year | age | maritl | race | education | region | jobclass | health | health_ins | logwage | wage
---|---|---|---|---|---|---|---|---|---|---|---
231655 | 2006 | 18 | 1. Never Married | 1. White | 1. < HS Grad | 2. Middle Atlantic | 1. Industrial | 1. <=Good | 2. No | 4.318063 | 75.043154 |
86582 | 2004 | 24 | 1. Never Married | 1. White | 4. College Grad | 2. Middle Atlantic | 2. Information | 2. >=Very Good | 2. No | 4.255273 | 70.476020 |
161300 | 2003 | 45 | 2. Married | 1. White | 3. Some College | 2. Middle Atlantic | 1. Industrial | 1. <=Good | 1. Yes | 4.875061 | 130.982177 |
This dataset contains information on wages and individual characteristics.
Our main objective is going to be to explain wages using the observables contained in the dataset.
Polynomial Regression and Step Functions
As we have seen in the first lecture, the most common way to introduce non-linearities is to replace the standard linear model
$$ y_i = \beta_0 + \beta_1 x_i + \varepsilon_i $$
with a polynomial function
$$ y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \beta_3 x_i^3 + \dots + \varepsilon_i $$
Explore the Data
Suppose we want to investigate the relationship between wage
and age
. Let’s first plot the two variables.
# Scatterplot of the data
df.plot.scatter('age','wage',color='w', edgecolors='k', alpha=0.3);
Polynomials of different degrees
The relationship is highly complex and non-linear. Let's expand our linear regression with polynomials of different degrees, from 1 to 5.
X_poly1 = PolynomialFeatures(1).fit_transform(df.age.values.reshape(-1,1))
X_poly2 = PolynomialFeatures(2).fit_transform(df.age.values.reshape(-1,1))
X_poly3 = PolynomialFeatures(3).fit_transform(df.age.values.reshape(-1,1))
X_poly4 = PolynomialFeatures(4).fit_transform(df.age.values.reshape(-1,1))
X_poly5 = PolynomialFeatures(5).fit_transform(df.age.values.reshape(-1,1))
Variables
In addition to wage itself, we construct a dummy dependent variable for wages above 250 (i.e. above 250,000 USD).
# Get X and y
X = df.age
y = df.wage
y01 = (df.wage > 250).map({False:0, True:1}).values
Polynomial Regression
If we run a linear regression on a degree-4 polynomial expansion of age, this is what it looks like:
# Fit ols on 4th degree polynomial
fit = sm.OLS(y, X_poly4).fit()
fit.summary().tables[1]
 | coef | std err | t | P>\|t\| | [0.025 | 0.975]
---|---|---|---|---|---|---
const | -184.1542 | 60.040 | -3.067 | 0.002 | -301.879 | -66.430 |
x1 | 21.2455 | 5.887 | 3.609 | 0.000 | 9.703 | 32.788 |
x2 | -0.5639 | 0.206 | -2.736 | 0.006 | -0.968 | -0.160 |
x3 | 0.0068 | 0.003 | 2.221 | 0.026 | 0.001 | 0.013 |
x4 | -3.204e-05 | 1.64e-05 | -1.952 | 0.051 | -6.42e-05 | 1.45e-07 |
Measures of Fit
In this case, the single coefficients are not of particular interest. We are mostly interested in capturing the relationship between age and wage as well as possible. How can we pick among the different polynomials?
We compare different polynomial degrees. For each regression, we are going to look at a series of metrics:
- absolute residuals
- sum of squared residuals (SSR)
- the difference in SSR with respect to the previous, lower-degree model
- F statistic
# Run regressions
fit_1 = sm.OLS(y, X_poly1).fit()
fit_2 = sm.OLS(y, X_poly2).fit()
fit_3 = sm.OLS(y, X_poly3).fit()
fit_4 = sm.OLS(y, X_poly4).fit()
fit_5 = sm.OLS(y, X_poly5).fit()
# Compare fit
sm.stats.anova_lm(fit_1, fit_2, fit_3, fit_4, fit_5, typ=1)
 | df_resid | ssr | df_diff | ss_diff | F | Pr(>F)
---|---|---|---|---|---|---
0 | 2998.0 | 5.022216e+06 | 0.0 | NaN | NaN | NaN |
1 | 2997.0 | 4.793430e+06 | 1.0 | 228786.010128 | 143.593107 | 2.363850e-32 |
2 | 2996.0 | 4.777674e+06 | 1.0 | 15755.693664 | 9.888756 | 1.679202e-03 |
3 | 2995.0 | 4.771604e+06 | 1.0 | 6070.152124 | 3.809813 | 5.104620e-02 |
4 | 2994.0 | 4.770322e+06 | 1.0 | 1282.563017 | 0.804976 | 3.696820e-01 |
A degree-3 or degree-4 polynomial seems enough: the degree-4 term is only borderline significant and the degree-5 term adds nothing. We keep the degree-4 polynomial.
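To see where these F statistics come from, here is a minimal by-hand check of the last comparison (degree 4 vs. degree 5); it should reproduce the F value in the last row of the table above:
# By-hand F test for adding the degree-5 term (compare with the last row above)
ssr_4, ssr_5 = fit_4.ssr, fit_5.ssr
df_diff = fit_4.df_resid - fit_5.df_resid          # one extra parameter
f_stat = ((ssr_4 - ssr_5) / df_diff) / (ssr_5 / fit_5.df_resid)
print(f_stat)                                      # ~0.80, as in the last row (p-value ~0.37)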
# Set polynomial X to 4th degree
X_poly = X_poly4
Binary Dependent Variable
Since we have a binary dependent variable, it would be best to account for it in our regression framework. One way to do so is to run a logistic regression.
How to interpret a Logistic Regression?
$$ y_i = \mathbb I \Big( \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \beta_3 x_i^3 + \dots + \varepsilon_i > 0 \Big) $$
where $\mathbb I(\cdot)$ is an indicator function and now $\varepsilon_i$ is the error term.
Binomial Link Functions
Depending on the assumed distribution of the error term, we get different results. I list below the link functions supported by the Binomial family.
# List link functions for the Binomial family
sm.families.family.Binomial.links
[statsmodels.genmod.families.links.logit,
statsmodels.genmod.families.links.probit,
statsmodels.genmod.families.links.cauchy,
statsmodels.genmod.families.links.log,
statsmodels.genmod.families.links.cloglog,
statsmodels.genmod.families.links.identity]
Logit Link Function
We are going to pick the logit link, i.e. we are going to assume that the error term is Type 1 Extreme Value (or Gumbel) distributed. If instead we take the usual standard normal distribution assumption for $\varepsilon_i$, we get a probit regression.
# Pick the logit link for the Binomial family
logit_link = sm.families.Binomial(sm.genmod.families.links.logit())
Given the error distribution, we can write the probability that $y_i=1$ as
$$ \Pr(y_i=1 \mid x_i) = \frac{e^{ \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \beta_3 x_i^3 + \dots }}{1 + e^{ \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \beta_3 x_i^3 + \dots } } $$
Logistic Regression
We now estimate the regression and plot the estimated relationship between age and the probability that wage exceeds 250.
# Run logistic regression
logit_poly = sm.GLM(y01, X_poly, family=logit_link).fit()
logit_poly.summary().tables[1]
 | coef | std err | z | P>\|z\| | [0.025 | 0.975]
---|---|---|---|---|---|---
const | -109.5530 | 47.655 | -2.299 | 0.022 | -202.956 | -16.150 |
x1 | 8.9950 | 4.187 | 2.148 | 0.032 | 0.789 | 17.201 |
x2 | -0.2816 | 0.135 | -2.081 | 0.037 | -0.547 | -0.016 |
x3 | 0.0039 | 0.002 | 2.022 | 0.043 | 0.000 | 0.008 |
x4 | -1.949e-05 | 9.91e-06 | -1.966 | 0.049 | -3.89e-05 | -6.41e-08 |
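As a sanity check (a minimal sketch, assuming `scipy` is available), the fitted probabilities are just the logistic transformation of the linear index built from these coefficients:
# Check: predicted Pr(wage > 250) by hand vs. the model's predict()
from scipy.special import expit
p_by_hand = expit(X_poly @ logit_poly.params)
print(np.allclose(p_by_hand, logit_poly.predict(X_poly)))  # True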
Linear Model Comparison
What is the difference with respect to the linear probability model?
# Run OLS regression with binary outcome
ols_poly = sm.OLS(y01, X_poly).fit()
ols_poly.summary().tables[1]
 | coef | std err | t | P>\|t\| | [0.025 | 0.975]
---|---|---|---|---|---|---
const | -0.1126 | 0.240 | -0.468 | 0.640 | -0.584 | 0.359 |
x1 | 0.0086 | 0.024 | 0.363 | 0.717 | -0.038 | 0.055 |
x2 | -0.0002 | 0.001 | -0.270 | 0.787 | -0.002 | 0.001 |
x3 | 3.194e-06 | 1.23e-05 | 0.260 | 0.795 | -2.09e-05 | 2.73e-05 |
x4 | -1.939e-08 | 6.57e-08 | -0.295 | 0.768 | -1.48e-07 | 1.09e-07 |
The magnitude of the coefficients is different, but the signs are the same.
Plot data and predictions
Let’s plot the estimated curves against the data distribution.
# Generate predictions
x_grid = np.arange(df.age.min(), df.age.max()).reshape(-1,1)
X_poly_test = PolynomialFeatures(4).fit_transform(x_grid)
y_hat1 = sm.OLS(y, X_poly).fit().predict(X_poly_test)
y01_hat1 = logit_poly.predict(X_poly_test)
plot_predictions(X, y, x_grid, y01, y_hat1, y01_hat1, 'Figure 7.1: Degree-4 Polynomial')
Using polynomial functions of the features as predictors in a linear model imposes a global structure on the non-linear function of age. We can instead use step functions to avoid imposing such a global structure: for example, we could break the range of age into bins and fit a different constant in each bin.
Step Functions
Building a step function means first picking $K$ cutpoints $c_1, c_2, \dots, c_K$ in the range of age, and then constructing $K + 1$ new variables
$$
\begin{aligned}
C_0(\text{age}) &= \mathbb I ( \text{age} < c_1) \\
C_1(\text{age}) &= \mathbb I ( c_1 < \text{age} < c_2) \\
C_2(\text{age}) &= \mathbb I ( c_2 < \text{age} < c_3) \\
&\ \ \vdots \\
C_{K-1}(\text{age}) &= \mathbb I ( c_{K-1} < \text{age} < c_K) \\
C_K(\text{age}) &= \mathbb I ( c_K < \text{age})
\end{aligned}
$$
where $\mathbb I(\cdot)$ is the indicator function.
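To make the construction concrete, here is a by-hand version of these indicators for three illustrative cutpoints (the same ones `pd.cut` will produce below); the `pd.cut` call that follows automates exactly this:
# Hand-built step-function dummies for illustrative cutpoints c1 < c2 < c3
c1, c2, c3 = 33.5, 49.0, 64.5
C0 = (df.age <= c1).astype(int)
C1 = ((df.age > c1) & (df.age <= c2)).astype(int)
C2 = ((df.age > c2) & (df.age <= c3)).astype(int)
C3 = (df.age > c3).astype(int)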
Binning
First, we generate the cuts.
# Generate cuts for the variable age
df_cut, bins = pd.cut(df.age, 4, retbins=True, right=True)
df_cut.value_counts(sort=False)
type(df_cut)
pandas.core.series.Series
Let’s generate a DataFrame out of this series.
# Generate bins for "age" from the cuts
df_steps = pd.concat([df.age, df_cut, df.wage], keys=['age','age_cuts','wage'], axis=1)
df_steps.head(5)
 | age | age_cuts | wage
---|---|---|---
231655 | 18 | (17.938, 33.5] | 75.043154 |
86582 | 24 | (17.938, 33.5] | 70.476020 |
161300 | 45 | (33.5, 49.0] | 130.982177 |
155159 | 43 | (33.5, 49.0] | 154.685293 |
11443 | 50 | (49.0, 64.5] | 75.043154 |
Dummy Variables
Now we can generate different dummy variables out of each bin.
# Create dummy variables for the age groups
df_steps_dummies = pd.get_dummies(df_steps['age_cuts'])
# Statsmodels requires explicit adding of a constant (intercept)
df_steps_dummies = sm.add_constant(df_steps_dummies)
df_steps_dummies.head(5)
 | const | (17.938, 33.5] | (33.5, 49.0] | (49.0, 64.5] | (64.5, 80.0]
---|---|---|---|---|---
231655 | 1.0 | 1 | 0 | 0 | 0 |
86582 | 1.0 | 1 | 0 | 0 | 0 |
161300 | 1.0 | 0 | 1 | 0 | 0 |
155159 | 1.0 | 0 | 1 | 0 | 0 |
11443 | 1.0 | 0 | 0 | 1 | 0 |
Step Function Regression
We are now ready to run our regression.
# Generate our new X variable, dropping the first bin dummy to avoid collinearity with the constant
X_step = df_steps_dummies.drop(df_steps_dummies.columns[1], axis=1)
# OLS Regression on step functions
ols_step = sm.OLS(y, X_step).fit()
ols_step.summary().tables[1]
 | coef | std err | t | P>\|t\| | [0.025 | 0.975]
---|---|---|---|---|---|---
const | 94.1584 | 1.476 | 63.790 | 0.000 | 91.264 | 97.053 |
(33.5, 49.0] | 24.0535 | 1.829 | 13.148 | 0.000 | 20.466 | 27.641 |
(49.0, 64.5] | 23.6646 | 2.068 | 11.443 | 0.000 | 19.610 | 27.719 |
(64.5, 80.0] | 7.6406 | 4.987 | 1.532 | 0.126 | -2.139 | 17.420 |
From the regression output, we can see that the bin coefficients are significant except for the last one.
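Note that this regression simply recovers bin means: the constant is the average wage in the first age bin, and each coefficient is the difference between that bin's average and the first one. A quick check:
# Average wage by age bin; the first equals the constant above (~94.2),
# the others equal the constant plus the corresponding coefficient
df_steps.groupby('age_cuts')['wage'].mean()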
# Put the test data in the same bins as the training data.
bin_mapping = np.digitize(x_grid.ravel(), bins)
bin_mapping
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4])
# Get dummies, drop first dummy category, add constant
X_step_test = sm.add_constant(pd.get_dummies(bin_mapping).drop(1, axis=1))
X_step_test.head()
 | const | 2 | 3 | 4
---|---|---|---|---
0 | 1.0 | 0 | 0 | 0 |
1 | 1.0 | 0 | 0 | 0 |
2 | 1.0 | 0 | 0 | 0 |
3 | 1.0 | 0 | 0 | 0 |
4 | 1.0 | 0 | 0 | 0 |
# Step prediction
y_hat2 = ols_step.predict(X_step_test)
Logistic Step Regression
We are going again to run a logistic regression, given that our outcome is binary.
# Logistic regression on step functions
logit_step = sm.GLM(y01, X_step, family=logit_link).fit()
y01_hat2 = logit_step.predict(X_step_test)
logit_step.summary().tables[1]
 | coef | std err | z | P>\|z\| | [0.025 | 0.975]
---|---|---|---|---|---|---
const | -5.0039 | 0.449 | -11.152 | 0.000 | -5.883 | -4.124 |
(33.5, 49.0] | 1.5998 | 0.474 | 3.378 | 0.001 | 0.672 | 2.528 |
(49.0, 64.5] | 1.7147 | 0.488 | 3.512 | 0.000 | 0.758 | 2.672 |
(64.5, 80.0] | 0.7413 | 1.102 | 0.672 | 0.501 | -1.420 | 2.902 |
Plotting
What does the predicted function look like?
plot_predictions(X, y, x_grid, y01, y_hat2, y01_hat2, 'Figure 7.2: Piecewise Constant')
Regression Splines
Spline regression, or piece-wise polynomial regression, involves fitting separate low-degree polynomials over different regions of $X$. The idea is to have one regression specification but with different coefficients in different parts of the $X$ range. The points where the coefficients change are called knots.
For example, we could fit a third-degree polynomial and split the sample in two at a cutpoint $c$:
$$ y_{i}=\begin{cases} \beta_{01}+\beta_{11} x_{i}+\beta_{21} x_{i}^{2}+\beta_{31} x_{i}^{3}+\varepsilon_{i} & \text{if } x_{i}<c \\ \beta_{02}+\beta_{12} x_{i}+\beta_{22} x_{i}^{2}+\beta_{32} x_{i}^{3}+\varepsilon_{i} & \text{if } x_{i} \geq c \end{cases} $$
We have now two sets of coefficients, one for each subsample.
Generally, using more knots leads to a more flexible piecewise polynomial. Also increasing the degree of the polynomial increases the degree of flexibility.
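One way to see the added flexibility is to count basis columns: with degree 3 and no intercept term, patsy's `bs()` produces (number of interior knots + 3) columns, so each extra knot adds one more coefficient to estimate. A quick sketch (the knot locations here are purely illustrative):
# Count spline basis columns as the number of (illustrative) knots grows
for knots in [(40,), (30, 50), (25, 40, 60)]:
    formula = "bs(x, knots=%s, degree=3, include_intercept=False)" % str(knots)
    basis = dmatrix(formula, {"x": df.age}, return_type='dataframe')
    print(len(knots), 'knot(s):', basis.shape[1] - 1, 'spline columns (plus intercept)')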
Example
We are now going to plot 4 different examples for the age-wage relationship:
- Discontinuous piecewise cubic
- Continuous piecewise cubic
- Quadratic (continuous)
- Continuous piecewise linear
# Cut dataset
df_short = df.iloc[:80,:]
X_short = df_short.age
y_short = df_short.wage
x_grid_short = np.arange(df_short.age.min(), df_short.age.max()+1).reshape(-1,1)
# 1. Discontinuous piecewise cubic
spline1 = "bs(x, knots=(50,50,50,50), degree=3, include_intercept=False)"
# 2. Continuous piecewise cubic
spline2 = "bs(x, knots=(50,50,50), degree=3, include_intercept=False)"
# 3. Quadratic (continuous)
spline3 = "bs(x, knots=(%s,%s), degree=2, include_intercept=False)" % (min(df.age), min(df.age))
# 4. Continuous piecewise linear
spline4 = "bs(x, knots=(%s,50), degree=1, include_intercept=False)" % min(df.age)
Generate Predictions
# Generate spline predictions
def fit_predict_spline(spline, X, y, x_grid):
    transformed_x = dmatrix(spline, {"x": X}, return_type='dataframe')
    fit = sm.GLM(y, transformed_x).fit()
    y_hat = fit.predict(dmatrix(spline, {"x": x_grid}, return_type='dataframe'))
    return y_hat
y_hats = [fit_predict_spline(s, X_short, y_short, x_grid_short) for s in [spline1, spline2, spline3, spline4]]
Plotting
plot_splines(df_short, x_grid_short, y_hats)
Comment
The first example makes us wonder why we would want our function to be discontinuous. Unless we expect a sudden wage jump at a certain age, we would like the function to be continuous. However, if for example we split age around the retirement age, we might expect a discontinuity.
The second example (top right) makes us wonder why we would want our function to be non-differentiable. Unless we have some specific mechanism in mind, there is usually a trade-off between making the function non-differentiable and increasing the degree of the polynomial, as the last two examples show: we get a similar fit with a quadratic fit or a continuous piecewise linear fit. The main difference is that in the second case we pick the kink point by hand instead of letting the data choose how to change the slope of the curve.
The Spline Basis Representation
How can we fit a piecewise degree-$d$ polynomial under the constraint that it (and possibly its first $d-1$ derivatives) be continuous?
The most direct way to represent a cubic spline is to start off with a basis for a cubic polynomial, namely $x$, $x^2$, $x^3$, and then add one truncated power basis function per knot. A truncated power basis function is defined as
$$ h(x, c)=(x-c)_{+}^{3} = \begin{cases} (x-c)^{3} & \text{if } x>c \\ 0 & \text{otherwise} \end{cases} $$
One can show that adding a term of the form $\beta_4 h(x, c)$ to the model for a cubic polynomial will lead to a discontinuity in only the third derivative at $c$; the function will remain continuous, with continuous first and second derivatives, at each of the knots.
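As a concrete sketch (my own illustration, not from the original notes), we can build this truncated power basis by hand and fit it with OLS; with the same knots it spans the same function space as the B-spline basis used by patsy's `bs()` in the next cell, so the fitted curve is identical up to numerical error.
# Cubic spline via the truncated power basis: 1, x, x^2, x^3 plus (x - c)_+^3 per knot
def truncated_power_basis(x, knots):
    x = np.asarray(x, dtype=float)
    cols = [np.ones_like(x), x, x**2, x**3]
    cols += [np.clip(x - c, 0, None)**3 for c in knots]
    return np.column_stack(cols)
X_tpb = truncated_power_basis(df.age, knots=[25, 40, 60])
ols_tpb = sm.OLS(y, X_tpb).fit()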
Cubic Splines
One way to specify the spline is by choosing the knots and the polynomial degree.
# Cubic spline with 3 knots (at ages 25, 40, 60)
spline5 = "bs(x, knots=(25,40,60), degree=3, include_intercept=False)"
pred5 = fit_predict_spline(spline5, X, y, x_grid)
Choosing the Knots
When we fit a spline, where should we place the knots?
The regression spline is most flexible in regions that contain a lot of knots, because in those regions the polynomial coefficients can change rapidly. Hence, one option is to place more knots in places where we feel the function might vary most rapidly, and to place fewer knots where it seems more stable. While this option can work well, in practice it is common to place knots in a uniform fashion. One way to do this is to specify the desired degrees of freedom, and then have the software automatically place the corresponding number of knots at uniform quantiles of the data.
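For example, with `df=6` and `degree=3` (and no intercept term in the basis) there is room for three interior knots, which patsy then places at evenly spaced quantiles of age. A rough sketch of the implied locations (the quantile rule is my reading of patsy's behaviour, so treat it as an assumption):
# Approximate interior knots implied by df=6, degree=3: the 25th, 50th, 75th percentiles of age
print(np.percentile(df.age, [25, 50, 75]))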
# Specifying degree 3 and 6 degrees of freedom
spline6 = "bs(x, df=6, degree=3, include_intercept=False)"
pred6 = fit_predict_spline(spline6, X, y, x_grid)
Natural Splines
A natural spline is a regression spline with additional boundary constraints: the function is required to be linear at the boundary (in the region where $X$ is smaller than the smallest knot, or larger than the largest knot). This additional constraint means that natural splines generally produce more stable estimates at the boundaries.
# Natural spline with 4 degrees of freedom
spline7 = "cr(x, df=4)"
pred7 = fit_predict_spline(spline7, X, y, x_grid)
Comparison
# Compare predictons
preds = [pred5, pred6, pred7]
labels = ['degree 3, knots 3', 'degree 3, degrees of freedom 6', 'natural, degrees of freedom 4']
compare_predictions(X, y, x_grid, preds, labels)
Comparison to Polynomial Regression
Regression splines often give superior results to polynomial regression. This is because unlike polynomials, which must use a high degree to produce flexible fits, splines introduce flexibility by increasing the number of knots but keeping the degree fixed.
We are now fitting a polynomial of degree 15 and a spline with 15 degrees of freedom.
# Polynomial of degree 15
X_poly15 = PolynomialFeatures(15).fit_transform(df.age.values.reshape(-1,1))
ols_poly_15 = sm.OLS(y, X_poly15).fit()
pred8 = ols_poly_15.predict(PolynomialFeatures(15).fit_transform(x_grid))
# Spline with 15 degrees of freedom
spline9 = "bs(x, df=15, degree=3, include_intercept=False)"
pred9 = fit_predict_spline(spline9, X, y, x_grid)
Plotting
# Compare predictons
preds = [pred8, pred9]
labels = ['Polynomial', 'Spline']
compare_predictions(X, y, x_grid, preds, labels)
As we can see, despite the two regressions having the same degrees of freedom, the polynomial fit is much more volatile. We can compare them along some dimensions, as in the sketch below.
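One rough way to quantify the extra volatility (my own sketch, not part of the original comparison) is the typical one-year jump in each fitted curve:
# Standard deviation of one-year changes in the two fitted curves
print('Polynomial:', np.std(np.diff(pred8)))
print('Spline:    ', np.std(np.diff(pred9)))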
Local Regression
So far we have looked at so-called “global methods”: methods that try to fit a unique function specification over the whole data. The function specification can be complex, as in the case of splines, but can be expressed globally.
Local regression is a different approach for fitting flexible non-linear functions, which involves computing the fit at a target point $x_0$ using only the nearby training observations.
Details
How does local regression work?
Ingredients: $X$, $y$.
How do you produce a prediction $\hat y_i$ at a new point $x_i$? (A minimal sketch in code follows this list.)
- Take the points in $X$ close to $x_i$: $X_{\text{close-to-i}}$
- Assign a weight to each of these points
- Fit a weighted least squares regression of $y_{\text{close-to-i}}$ on $X_{\text{close-to-i}}$
- Use the estimated coefficients $\hat \beta$ to predict $\hat y_i = \hat \beta_0 + \hat \beta_1 x_i$
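Here is a minimal self-contained sketch of this recipe with a Gaussian kernel (an illustration of the idea, not the statsmodels implementation we use next):
# Local linear prediction at a single point x0 with Gaussian kernel weights
def local_linear_predict(x0, X, y, bandwidth=0.1):
    w = np.exp(-0.5 * ((X - x0) / bandwidth) ** 2)   # Gaussian weights
    Z = np.column_stack([np.ones_like(X), X - x0])   # intercept + slope, centered at x0
    WZ = Z * w[:, None]                              # weight each observation
    beta = np.linalg.solve(Z.T @ WZ, Z.T @ (w * y))  # weighted least squares
    return beta[0]                                   # intercept = fitted value at x0
Further below we redo this computation explicitly with sm.WLS and check that it matches KernelReg.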
Generate Data
# Set seed
np.random.seed(1)
# Generate data
X_sim = np.sort(np.random.uniform(0,1,100))
e = np.random.uniform(-.5,.5,100)
y_sim = -4*X_sim**2 + 3*X_sim + e
# True Generating process without noise
X_grid = np.linspace(0,1,100)
y_grid = -4*X_grid**2 + 3*X_grid
Plotting
Let’s visualize the simulated data and the curve without noise.
plot_simulated_data(X_sim, y_sim, X_grid, y_grid);
Fit LL Regression
Now we fit a local linear regression.
# Settings
spec = 'll'
bandwidth = 0.1
kernel = 'gaussian'
# Locally linear regression
local_reg = KernelReg(y_sim, X_sim.reshape(-1,1),
var_type='c',
reg_type=spec,
bw=[bandwidth])
y_hat = KernelReg.fit(local_reg)
What do the parameters mean?
- var_type: the type of the conditioning variable (c, i.e. continuous)
- reg_type: local regression specification (ll, i.e. locally linear)
- bw: bandwidth length (0.1)
- ckertype: kernel type (gaussian, the default)
Prediction
What does the prediction look like?
fig, ax = plot_simulated_data(X_sim, y_sim, X_grid, y_grid);
make_figure_7_9a(fig, ax, X_sim, y_hat);
Details
How exactly was the prediction generated? It was generated pointwise. We are now going to look at the prediction at one particular point: $x_i=0.5$.
We proceed as follows:
- We select the focal point: $x_i=0.5$
- We select observations close to $x_i$, i.e. $x_{\text{close to i}} = \{ x \in X : |x_i - x| < 0.1 \}$ and the corresponding $y_{\text{close to i}}$
- We apply Gaussian weights
- We run a weighted linear regression of $y_{\text{close to i}}$ on $x_{\text{close to i}}$
# Get local X and y
x_i = 0.5
close_to_i = (x_i-bandwidth < X_sim) & (X_sim < x_i+bandwidth)
X_tilde = X_sim[close_to_i]
y_tilde = y_sim[close_to_i]
# Get local estimates
local_estimate = KernelReg.fit(local_reg, data_predict=[x_i])
y_i_hat = local_estimate[0]
beta_i_hat = local_estimate[1]
alpha_i_hat = y_i_hat - beta_i_hat*x_i
print('Estimates: alpha=%1.4f, beta=%1.4f' % (alpha_i_hat, beta_i_hat))
Estimates: alpha=0.7006, beta=-0.6141
Visualization
Now we can use the locally estimated coefficients to predict the value of $\hat y_i(x_i)$ for $x_i = 0.5$.
# Build local predictions
close_to_i_grid = (x_i-bandwidth < X_grid) & (X_grid < x_i+bandwidth)
X_grid_tilde = X_grid[close_to_i_grid].reshape(-1,1)
y_grid_tilde = alpha_i_hat + X_grid_tilde*beta_i_hat
fig, ax = plot_simulated_data(X_sim, y_sim, X_grid, y_grid);
make_figure_7_9a(fig, ax, X_sim, y_hat);
make_figure_7_9b(fig, ax, X_tilde, y_tilde, X_grid_tilde, y_grid_tilde, x_i, y_i_hat)
Zooming in
We can zoom in and look only at the “close to i” sample.
sns.regplot(X_tilde, y_tilde);
Why is the line upward sloping? Because we forgot the Gaussian weights.
# Weights
w = norm.pdf((X_sim-x_i)/bandwidth)
# Estimate WLS (weighted least squares)
mod_wls = sm.WLS(y_sim, sm.add_constant(X_sim), weights=w)
results = mod_wls.fit()
print('Estimates: alpha=%1.4f, beta=%1.4f' % tuple(results.params))
Estimates: alpha=0.7006, beta=-0.6141
We indeed got the same estimates as before. Note two things:
- the bandwidth defines the scale parameter of the Gaussian weights
- our locally linear regression is actually global (see the quick check below)
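A quick check of the last point: with a Gaussian kernel every observation receives a strictly positive (if tiny) weight, so the "local" regression formally uses the whole sample.
# Every observation gets a positive weight under the Gaussian kernel
print((w > 0).all(), w.min())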
Plotting
make_figure_7_9d(X_sim, y_sim, w, results, X_grid, x_i, y_i_hat)
Now the slope is indeed negative, as in the locally linear regression.
Generalized Additive Models
Generalized additive models (GAMs) provide a general framework for extending a standard linear model by allowing non-linear functions of each of the variables, while maintaining additivity.
GAM for Regression Problems
Imagine extending the general regression framework to a separably additive model of the form
$$ y_i = \beta_0 + \sum_{k=1}^K \beta_k f_k(x_{ik}) + \varepsilon_i $$
It is called an additive model because we calculate a separate $f_k$ for each $X_k$, and then add together all of their contributions.
Consider for example the following model
$$ \text{wage} = \beta_0 + f_1(\text{year}) + f_2(\text{age}) + f_3(\text{education}) + \varepsilon $$
Example
We are going to use the following functions:
- $f_1$: natural spline with 8 degrees of freedom
- $f_2$: natural spline with 10 degrees of freedom
- $f_3$: step function
# Set X and y
df['education_'] = LabelEncoder().fit_transform(df["education"])
X = df[['year','age','education_']].to_numpy()
y = df[['wage']].to_numpy()
## model
linear_gam = LinearGAM(s(0, n_splines=8) + s(1, n_splines=10) + f(2))
linear_gam.gridsearch(X, y);
100% (11 of 11) |########################| Elapsed Time: 0:00:00 Time: 0:00:00
Plotting
plot_gam(linear_gam)
Pros and Cons
Before we move on, let us summarize the advantages of a GAM.
- GAMs allow us to fit a non-linear $f_k$ to each $X_k$, so that we can automatically model non-linear relationships that standard linear regression will miss
- The non-linear fits can potentially make more accurate predictions
- Because the model is additive, we can still examine the effect of each $X_k$ on $Y$ separately
- The smoothness of the function $f_k$ for the variable $X_k$ can be summarized via degrees of freedom.
The main limitation of GAMs is that the model is restricted to be additive. With many variables, important interactions can be missed. However, as with linear regression, we can manually add interaction terms to the GAM model by including additional predictors of the form $X_j \times X_k$.
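For instance (a hedged sketch, assuming the installed pygam version provides the tensor-product term `te()`), a year-age interaction could be added like this:
# Additive model plus a hypothetical year-age tensor-product interaction
from pygam import LinearGAM, s, f, te
gam_interact = LinearGAM(s(0, n_splines=8) + s(1, n_splines=10) + f(2) + te(0, 1))
gam_interact.gridsearch(X, y);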
GAMs for Classification Problems
We can also use GAMs with a binary dependent variable.
# Binary dependent variable
y_binary = (y>250)
## Logit link function
logit_gam = LogisticGAM(s(0, n_splines=8) + s(1, n_splines=10) + f(2), fit_intercept=True)
logit_gam.gridsearch(X, y_binary);
100% (11 of 11) |########################| Elapsed Time: 0:00:00 Time: 0:00:00
Plotting
plot_gam(logit_gam)
The results are qualitatively similar to the non-binary case.