Abstract
Tutorial looking at spline regression models; we also explore other advanced regression methods that extend splines, such as smoothing splines, wavelets, multivariate adaptive regression splines, and generalised additive models.
We use several packages in this tutorial; they are all loaded below. More details on the packages, with CRAN links, can be found at the end of the tutorial.
library(dplyr)
library(ggplot2)
library(ggpubr)
library(reshape2)
library(splines)
library(gam)
library(ISLR2)
The work and content in this tutorial are based on several statistics courses taken during my time at UCT and, in turn, draw largely on An Introduction to Statistical Learning in R (Second Edition) by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani, as well as The Elements of Statistical Learning.
We look at a host of regression models and techniques. The specific selection of regression models which shall be explored are listed below.
Regression Splines
Smoothing Splines
Generalized Additive Models (GAMS)
Multivariate Adaptive Regression Splines (MARS)
Wavelets
Functional Data Analysis (FDA)
Due to the breadth of models covered in this tutorial, we essentially summarise each regression method listed and look briefly at its details. The majority of the chapters are concerned with exploring examples, of either a simple or complex nature.
In what follows we present a number of approaches for modeling the relationship between a response \(Y\) and a single predictor \(X\) in a flexible way.
We covered polynomial regression and piece-wise constant regression (using step functions) in the previous tutorial - Advanced Regression 2.0. These models are special cases of a basis function approach. The idea is to have at hand a family of functions or transformations that can be applied to a variable \(X\) : \(b_{1}(X), b_{2}(X), \ldots, b_{K}(X)\).
So, instead of fitting a linear model in \(X\), we fit the model \[ y_{i}=\beta_{0}+\beta_{1} b_{1}\left(x_{i}\right)+\beta_{2} b_{2}\left(x_{i}\right)+\beta_{3} b_{3}\left(x_{i}\right)+\ldots+\beta_{K} b_{K}\left(x_{i}\right)+\epsilon_{i} . \] Note that the basis functions \(b_{1}(\cdot), b_{2}(\cdot), \ldots, b_{K}(\cdot)\) are fixed and known.
So, we can think of this, as essentially, choosing the functions ahead of time. For polynomial regression, the basis functions are \(b_{j}\left(x_{i}\right)=x_{i}^{j}\), and for piece-wise constant functions they are \(b_{j}\left(x_{i}\right)=I\left(c_{j} \leq x_{i}<c_{j+1}\right)\).
We can think of the new fitted model with \(y_i\) (shown above) as a standard linear model with predictors \(b_{1}\left(x_{i}\right), b_{2}\left(x_{i}\right), \ldots, b_{K}\left(x_{i}\right)\). Hence, we can use least squares to estimate the unknown regression coefficients. Importantly, this means that all of the inference tools for linear models that were discussed in the first tutorial in Regression (Basic Regression Analysis), such as standard errors for the coefficient estimates and \(\mathrm{F}\)-statistics for the model’s overall significance, are available in this setting.
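To make this concrete, here is a minimal sketch on simulated data (the data-generating function and the particular polynomial basis are illustrative, not from the tutorial): a basis-function model is fit by handing `lm()` a design matrix of transformed predictors.

```r
# A basis-function model is just a linear model whose design matrix
# holds transformed versions of the predictor (simulated data).
set.seed(1)
x <- runif(100, 0, 10)
y <- sin(x) + rnorm(100, sd = 0.3)

# Polynomial basis b_j(x) = x^j, here with K = 3
B <- cbind(b1 = x, b2 = x^2, b3 = x^3)
fit <- lm(y ~ B)

# The usual linear-model inference tools are all available:
summary(fit)$coefficients  # estimates with standard errors
anova(fit)                 # F-test for the basis terms
```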
We have only considered the use of polynomial functions and piece-wise constant functions for our basis functions; however, many alternatives are possible. For instance, we can use wavelets or Fourier series to construct basis functions. Next, we look at a very common choice for a basis function: regression splines.
Ultimately, we use the basis expansions as a device to achieve more flexible representations for \(f(X)\).
Basis Function: functions or transformations that can be applied to a variable \(X\): \(b_{1}(X), b_{2}(X), \ldots, b_{K}(X)\).
Global Basis Function: A global basis function extends over the whole range of \(x\). So each observation influences the entire curve.
Local Basis Function: a local basis function is only non-zero for a small section of the \(x\) variable.
Regression splines are more flexible than polynomials and step functions, and in fact are an extension of the two. They involve dividing the range of \(X\) into \(K\) distinct regions. Within each region, a polynomial function is fit to the data. However, these polynomials are constrained so that they join smoothly at the region boundaries, or knots. Provided that the interval is divided into enough regions, this can produce an extremely flexible fit.
Later we shall look at penalised regression splines, natural splines, and smoothing splines.
Instead of fitting a high-degree polynomial over the entire range of \(X\), piece-wise polynomial regression involves fitting separate low-degree polynomials over different regions of \(X\).
So, a piece-wise polynomial function \(f(X)\) is obtained by dividing the domain of \(X\) into contiguous intervals, and representing \(f\) by a separate polynomial in each interval. The figure below shows two simple piece-wise polynomials. The first is piece-wise constant, with three basis functions: \[ h_{1}(X)=I\left(X<\xi_{1}\right), \quad h_{2}(X)=I\left(\xi_{1} \leq X<\xi_{2}\right), \quad h_{3}(X)=I\left(\xi_{2} \leq X\right) . \]
and a piece-wise linear fit, which has three additional basis functions: \(h_{m+3}(X)=h_{m}(X) X, m=1, \ldots, 3\).
The panel shows a piece-wise constant function fit to some artificial data. The broken vertical lines indicate the positions of the two knots \(\xi_1\) and \(\xi_2\). The blue curve represents the true function, from which the data were generated with Gaussian noise. We also show a continuous piece-wise linear fit and a piece-wise linear basis function below.
The remaining two panels show piece-wise linear functions fit to the same data, one unrestricted and one restricted to be continuous at the knots. The final panel shows a piece-wise linear basis function, \(h_{3}(X)=\left(X-\xi_{1}\right)_{+}\), continuous at \(\xi_{1}\). The black points indicate the sample evaluations \(h_{3}\left(x_{i}\right), i=1, \ldots, N\).
We also can have a look at piece-wise cubic polynomials fit to artificial data.
Remember the blue curve is the true function, from which the data were generated. Note that the function in the lower right panel is continuous, and has continuous first and second derivatives at the knots. It is known as a cubic spline - more on this in the next sub-section.
So, a piece-wise cubic polynomial works by fitting a cubic regression model of the form \[ y_{i}=\beta_{0}+\beta_{1} x_{i}+\beta_{2} x_{i}^{2}+\beta_{3} x_{i}^{3}+\epsilon_{i}, \]where the coefficients \(\beta_{0}, \beta_{1}, \beta_{2}\), and \(\beta_{3}\) differ in different parts of the range of \(X\). The points where the coefficients change are called knots. For example, a piece-wise cubic with no knots is just a standard cubic polynomial.
A piece-wise cubic polynomial with a single knot at a point \(c\) takes the form:
\[y_{i}= \begin{cases}\beta_{01}+\beta_{11} x_{i}+\beta_{21} x_{i}^{2}+\beta_{31} x_{i}^{3}+\epsilon_{i} & \text { if } x_{i}<c \\ \beta_{02}+\beta_{12} x_{i}+\beta_{22} x_{i}^{2}+\beta_{32} x_{i}^{3}+\epsilon_{i} & \text { if } x_{i} \geq c\end{cases}\]
In other words, we fit two different polynomial functions to the data, one on the subset of the observations with \(x_{i}<c\), and one on the subset of the observations with \(x_{i} \geq c\). The first polynomial function has coefficients \(\beta_{01}, \beta_{11}, \beta_{21}, \beta_{31}\), and the second has coefficients \(\beta_{02}, \beta_{12}, \beta_{22}, \beta_{32}\). Each of these polynomial functions can be fit using least squares applied to simple functions of the original predictor.
Using more knots leads to a more flexible piece-wise polynomial - the number of knots generally determines the flexibility/wiggliness or smoothness of the curve. In general, if we place \(K\) different knots throughout the range of \(X\), then we will end up fitting \(K+1\) different cubic polynomials.
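As an illustrative sketch (simulated data, with a single knot chosen arbitrarily), the unconstrained piece-wise cubic amounts to fitting separate cubics on each side of the knot:

```r
# Unconstrained piece-wise cubic: separate cubic fits on each side of
# a single knot at c0 (simulated data; no continuity imposed).
set.seed(1)
x <- runif(200, 0, 10)
y <- sin(x) + rnorm(200, sd = 0.3)
c0 <- 5  # the single knot

left  <- lm(y ~ poly(x, 3, raw = TRUE), subset = x <  c0)
right <- lm(y ~ poly(x, 3, raw = TRUE), subset = x >= c0)

# (K + 1) = 2 cubics, 4 parameters each: 8 coefficients in total
length(coef(left)) + length(coef(right))  # 8
```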
When we fit this piece-wise polynomial, we will notice that the function is discontinuous and will often look a little crazy or nonsensical. It looks wrong because the fitted curve is just too flexible. To remedy this problem, we can fit a piece-wise polynomial under the constraint that the fitted curve must be continuous; in other words, there cannot be a jump when we arrive at a knot. We can add further constraints still: both the first and second derivatives of the piece-wise polynomials must be continuous at each knot. In other words, we require that the piece-wise polynomial be not only continuous at each knot, but also very smooth. This is known as a cubic spline with \(K\) knots, which uses a total of \(4+K\) degrees of freedom.
The general definition of a degree-\(d\) spline is that it is a piece-wise degree-\(d\) polynomial, with continuity in derivatives up to degree \(d−1\) at each knot.
A cubic spline with \(K\) knots can be modeled as \[ y_{i}=\beta_{0}+\beta_{1} b_{1}\left(x_{i}\right)+\beta_{2} b_{2}\left(x_{i}\right)+\cdots+\beta_{K+3} b_{K+3}\left(x_{i}\right)+\epsilon_{i}, \]for an appropriate choice of basis functions \(b_{1}, b_{2}, \ldots, b_{K+3}\). The model (shown above) can then be fit using least squares.
Just as there were several ways to represent polynomials, there are also many equivalent ways to represent cubic splines using different choices of basis functions. The most direct way to represent a cubic spline is to start off with a basis for a cubic polynomial, namely \(x, x^{2}, x^{3}\), and then add one truncated power basis function per knot. A truncated power basis function is defined as
\[ h(x, \xi)=(x-\xi)_{+}^{3}=\left\{\begin{array}{cl} (x-\xi)^{3} & \text { if } x>\xi \\ 0 & \text { otherwise } \end{array}\right. \]where \(\xi\) is the knot. So, in order to fit a cubic spline to a data set with \(K\) knots, we perform least squares regression with an intercept and \(3+K\) predictors, of the form \(X, X^{2}, X^{3}, h\left(X, \xi_{1}\right), h\left(X, \xi_{2}\right), \ldots, h\left(X, \xi_{K}\right)\), where \(\xi_{1}, \ldots, \xi_{K}\) are the knots. This amounts to estimating a total of \(K+4\) regression coefficients; for this reason, fitting a cubic spline with \(K\) knots uses \(K+4\) degrees of freedom.
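A minimal sketch of this construction on simulated data (the knot locations are illustrative): build the truncated power basis by hand and fit it with least squares.

```r
# Cubic spline "by hand" via the truncated power basis
# 1, x, x^2, x^3, (x - xi_k)_+^3  (simulated data, K = 3 knots).
set.seed(1)
x <- runif(200, 0, 10)
y <- sin(x) + rnorm(200, sd = 0.3)
knots <- c(2.5, 5, 7.5)

tp <- function(x, xi) pmax(x - xi, 0)^3              # (x - xi)_+^3
H  <- cbind(x, x^2, x^3,
            sapply(knots, function(xi) tp(x, xi)))   # 3 + K columns

fit <- lm(y ~ H)   # intercept plus 3 + K predictors
length(coef(fit))  # K + 4 = 7 coefficients, i.e. K + 4 degrees of freedom
```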
Splines can have high variance at the outer range of the predictors — that is, when X takes on either a very small or very large value. This is where Natural Splines step in.
In summary, the basis representation for a cubic spline with knots at \(\xi_1\) and \(\xi_2\):
\[ \begin{array}{lll} h_{1}(X)=1, & h_{3}(X)=X^{2}, & h_{5}(X)=\left(X-\xi_{1}\right)_{+}^{3}, \\ h_{2}(X)=X, & h_{4}(X)=X^{3}, & h_{6}(X)=\left(X-\xi_{2}\right)_{+}^{3} \end{array} \]
More generally, an order-\(M\) spline with knots \(\xi_{j}, j=1, \ldots, K\) is a piecewise-polynomial of order \(M\), and has continuous derivatives up to order \(M-2\) (a cubic spline has \(M=4\)). The general form for the truncated-power basis set would be \[ \begin{aligned} h_{j}(X) &=X^{j-1}, j=1, \ldots, M \\ h_{M+\ell}(X) &=\left(X-\xi_{\ell}\right)_{+}^{M-1}, \ell=1, \ldots, K \end{aligned} \] Note that the fixed-knot splines are also known as regression splines.
The regression spline function \(f(x)\) can be written as:
\[ \begin{aligned} f(x) &=\sum_{m=1}^{M} \beta_{m} h_{m}(x) \\ &=\sum \beta_{j} h_{j}(x)=\beta_{0}+\beta_{1} x+\beta_{2} x^{2}+\beta_{3} x^{3}+\sum_{k=1}^{K} b_{k}\left(x-\xi_{k}\right)_{+}^{3} \end{aligned} \] where the \(h_{m}(x)\) are basis functions (all still functions or transformations of our variable \(x\) ). The model is still linear (linear in parameters), and can still be written as \(X \beta\), and can still be fitted using our usual regression equations - \(\hat{\beta}=\left(X^{\prime} X\right)^{-1} X^{\prime} Y\).
basis for cubic spline: \(1, x, x^{2},x^{3},\left(x-\xi_{1}\right)_{+}^{3}, \ldots,\left(x-\xi_{K}\right)_{+}^{3}\)
using other notation: \[ \begin{gathered} h_{1}(x)=1, h_{2}(x)=x, h_{3}(x)=x^{2}, h_{4}(x)=x^{3} \\ h_{5}(x)=\left(x-\xi_{1}\right)_{+}^{3}, \ldots \\ h_{4+K}(x)=\left(x-\xi_{K}\right)_{+}^{3} \end{gathered} \]
The expression \(\text{bs}(x, df=7)\) in \(\mathrm{R}\) generates a basis matrix of cubic-spline functions evaluated at the \(N\) observations in \(\mathrm{x}\), with the \(7-3=4\) interior knots at the appropriate percentiles of \(x\) (the 20th, 40th, 60th and 80th). One can be more explicit, however: \(\text{bs}(x, \text{degree} =1, \text{knots} =c(0.2,0.4,0.6))\) generates a basis for linear splines, with three interior knots, and returns an \(N \times 4\) matrix.
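As a quick check of this behaviour (using an arbitrary uniform x vector for illustration):

```r
library(splines)
x <- seq(0, 1, length.out = 200)

# df = 7 cubic spline: 7 - 3 = 4 interior knots at the
# 20/40/60/80th percentiles of x (about 0.2, 0.4, 0.6, 0.8 here)
B7 <- bs(x, df = 7)
dim(B7)            # 200 x 7
attr(B7, "knots")

# Explicit knots, linear spline: an N x 4 basis matrix
B1 <- bs(x, degree = 1, knots = c(0.2, 0.4, 0.6))
dim(B1)            # 200 x 4
```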
A natural spline is a regression spline with additional boundary constraints: the function is required to be linear at the boundary (in the region where X is smaller than the smallest knot, or larger than the largest knot). This additional constraint means that natural splines generally produce more stable estimates at the boundaries.
When we say linear beyond the boundary knots, we are essentially saying that in the first and last sections we don't have cubic polynomials but just a line, so we have two parameters fewer in each boundary region.
In fact, compared to a cubic spline with the same number of knots, a natural spline has four fewer degrees of freedom (parameters).
A natural cubic spline with \(K\) knots is represented by \(K\) basis functions.
The natural cubic spline basis is: \[ N_{1}(x)=1, N_{2}(X)=X, N_{k+2}(X)=d_{k}(X)-d_{K-1}(X) \] where \[ d_{k}(X)=\frac{\left(X-\xi_{k}\right)_{+}^{3}-\left(X-\xi_{K}\right)_{+}^{3}}{\xi_{K}-\xi_{k}} \]
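A small sketch of what `ns()` returns (with an arbitrary x vector for illustration): a natural cubic spline with df = 4 has four basis functions, three interior knots, and boundary knots at the extremes of the data.

```r
library(splines)
x <- seq(0, 10, length.out = 200)

# Natural cubic spline basis with df = 4 (constant term excluded)
N <- ns(x, df = 4)
dim(N)                     # 200 x 4: four basis functions
attr(N, "knots")           # three interior knots at uniform quantiles
attr(N, "Boundary.knots")  # boundary knots at the extremes of x
```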
Here we look at duplicating the South African Heart Disease example. It essentially involves logistic regression with spline-transformed features. We explore non-linearities in the functions using natural splines.
Below is the functional form of the model:
\[\operatorname{logit}[\operatorname{Pr}(\operatorname{chd} \mid X)]=\theta_{0}+h_{1}\left(X_{1}\right)^{T} \theta_{1}+h_{2}\left(X_{2}\right)^{T} \theta_{2}+\cdots+h_{p}\left(X_{p}\right)^{T} \theta_{p}\] where each of the \(\theta_{j}\) are vectors of coefficients multiplying their associated vector of natural spline basis functions \(h_{j}\).
We use four natural spline bases for each term in the model. For example, with \(X_{1}\) representing sbp, \(h_{1}\left(X_{1}\right)\) is a basis consisting of four basis functions. This actually implies three rather than two interior knots (chosen at uniform quantiles of sbp), plus two boundary knots at the extremes of the data, since we exclude the constant term from each of the \(h_{j}\).
Since famhist is a two-level factor, it is coded by a simple binary or dummy variable, and is associated with a single coefficient in the fit of the model.
More compactly we can combine all \(p\) vectors of basis functions (and the constant term) into one big vector \(h(X)\), and then the model is simply \(h(X)^{T} \theta\).
In the textbook, a backward step-wise deletion process was used, dropping terms from this model while preserving the group structure of each term, rather than dropping one coefficient at a time. The AIC statistic was used to decide which terms to drop, and deleting any of the terms remaining in the final model would cause the AIC to increase.
We mimic the plots of the final model selected by the step-wise regression. The functions displayed are \(\hat{f}_{j}\left(X_{j}\right)=h_{j}\left(X_{j}\right)^{T} \hat{\theta}_{j}\) for each variable \(X_{j}\).
library(ElemStatLearn) # loads the data frame SAheart
library(splines)
attach(SAheart)
# Computes the logistic regression model using natural splines (note famhist is included as a factor):
form = "chd ~ ns(sbp,df=4) + ns(tobacco,df=4) + ns(ldl,df=4) + famhist + ns(obesity,df=4) + ns(alcohol,df=4) + ns(age,df=4)"
form = formula(form)
m = glm(form, data=SAheart, family=binomial)
print( summary(m), digits=3 )
##
## Call:
## glm(formula = form, family = binomial, data = SAheart)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.724 -0.827 -0.388 0.887 2.959
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.1160 2.4124 -0.88 0.38040
## ns(sbp, df = 4)1 -1.4794 0.8450 -1.75 0.07998 .
## ns(sbp, df = 4)2 -1.3197 0.7637 -1.73 0.08397 .
## ns(sbp, df = 4)3 -3.7537 2.0254 -1.85 0.06383 .
## ns(sbp, df = 4)4 1.3973 1.0039 1.39 0.16395
## ns(tobacco, df = 4)1 0.6495 0.4587 1.42 0.15677
## ns(tobacco, df = 4)2 0.4181 0.9033 0.46 0.64346
## ns(tobacco, df = 4)3 3.3626 1.1877 2.83 0.00464 **
## ns(tobacco, df = 4)4 3.8534 2.3775 1.62 0.10506
## ns(ldl, df = 4)1 1.8688 1.3280 1.41 0.15937
## ns(ldl, df = 4)2 1.7217 1.0327 1.67 0.09547 .
## ns(ldl, df = 4)3 4.5209 3.0015 1.51 0.13201
## ns(ldl, df = 4)4 3.3454 1.4526 2.30 0.02127 *
## famhistPresent 1.0787 0.2390 4.51 6.4e-06 ***
## ns(obesity, df = 4)1 -3.1058 1.7216 -1.80 0.07122 .
## ns(obesity, df = 4)2 -2.3753 1.2057 -1.97 0.04884 *
## ns(obesity, df = 4)3 -5.0542 3.8266 -1.32 0.18657
## ns(obesity, df = 4)4 0.0545 1.7499 0.03 0.97516
## ns(alcohol, df = 4)1 0.1367 0.4010 0.34 0.73314
## ns(alcohol, df = 4)2 -0.3561 0.7422 -0.48 0.63144
## ns(alcohol, df = 4)3 -0.1101 0.8081 -0.14 0.89168
## ns(alcohol, df = 4)4 0.5908 1.0934 0.54 0.58893
## ns(age, df = 4)1 2.5975 1.1189 2.32 0.02026 *
## ns(age, df = 4)2 3.1008 0.9179 3.38 0.00073 ***
## ns(age, df = 4)3 7.5511 2.5674 2.94 0.00327 **
## ns(age, df = 4)4 1.5199 0.5964 2.55 0.01082 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 596.11 on 461 degrees of freedom
## Residual deviance: 457.63 on 436 degrees of freedom
## AIC: 509.6
##
## Number of Fisher Scoring iterations: 6
plot(m)
# Duplicates the numbers from Table 5.1:
drop1( m, scope=form, test="Chisq" )
## Single term deletions
##
## Model:
## chd ~ ns(sbp, df = 4) + ns(tobacco, df = 4) + ns(ldl, df = 4) +
## famhist + ns(obesity, df = 4) + ns(alcohol, df = 4) + ns(age,
## df = 4)
## Df Deviance AIC LRT Pr(>Chi)
## <none> 457.63 509.63
## ns(sbp, df = 4) 4 466.77 510.77 9.1429 0.0576257 .
## ns(tobacco, df = 4) 4 469.61 513.61 11.9753 0.0175355 *
## ns(ldl, df = 4) 4 470.90 514.90 13.2710 0.0100249 *
## famhist 1 478.76 528.76 21.1319 4.287e-06 ***
## ns(obesity, df = 4) 4 465.41 509.41 7.7749 0.1001811
## ns(alcohol, df = 4) 4 458.09 502.09 0.4562 0.9776262
## ns(age, df = 4) 4 480.37 524.37 22.7414 0.0001426 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
library(splines)
library(nnet)
SA <- SAheart
attach(SA)
## The following objects are masked from SAheart:
##
## adiposity, age, alcohol, chd, famhist, ldl, obesity, sbp, tobacco,
## typea
X <- cbind(ns(sbp,4), ns(tobacco,4), ns(ldl,4), famhist, ns(obesity,4), ns(age,4))
df <- data.frame(chd,X)
logis <- multinom(chd~., data = df) #Fit Multinomial Log-linear Models
## # weights: 23 (22 variable)
## initial value 320.233997
## iter 10 value 231.163766
## iter 20 value 229.112192
## iter 30 value 229.044140
## final value 229.043974
## converged
beta <- coef(logis)
weights <- as.vector(logis$fitted.values*(1-logis$fitted.values))
H=cbind(1,X)
sigma=solve(t(H)%*%(weights*H))
# Table
pval=beta/sqrt(diag(sigma))
# Plot
par(mfrow=c(3,2))
# SBP
N <- cbind(1,ns(sbp,df=4))
yhat <- N%*%beta[1:5]
SigmaHat <- solve(t(N)%*%(weights*N))
ycov <- N%*%SigmaHat%*%t(N)
yvar <- diag(ycov)
m <- min(yhat-2*sqrt(yvar))
M <- max(yhat+2*sqrt(yvar))
vv <- order(sbp)
plot(sort(sbp),yhat[vv],type='l',ylim=c(m,M),pch=19,xlab = 'sbp',ylab='f(sbp)')
lines(sort(sbp),yhat[vv]+2*sqrt(yvar[vv]),col='red',type='l',pch=19)
lines(sort(sbp),yhat[vv]-2*sqrt(yvar[vv]),col='red',type='l',pch=19)
# Tobacco
N <- cbind(1,ns(tobacco,df=4))
yhat <- N%*%beta[c(1,6:9)]
SigmaHat <- solve(t(N)%*%(weights*N))
ycov <- N%*%SigmaHat%*%t(N)
yvar <- diag(ycov)
m <- min(yhat-2*sqrt(yvar))
M <- max(yhat+2*sqrt(yvar))
vv <- order(tobacco)
plot(sort(tobacco),yhat[vv],type='l',ylim=c(m,M),pch=19,xlab='tobacco',ylab='f(tobacco)')
lines(sort(tobacco),yhat[vv]+2*sqrt(yvar[vv]),col='red',type='l',pch=19)
lines(sort(tobacco),yhat[vv]-2*sqrt(yvar[vv]),col='red',type='l',pch=19)
# LDL
N <- cbind(1,ns(ldl,df=4))
yhat <- N%*%beta[c(1,10:13)]
SigmaHat <- solve(t(N)%*%(weights*N))
ycov <- N%*%SigmaHat%*%t(N)
yvar <- diag(ycov)
m <- min(yhat-2*sqrt(yvar))
M <- max(yhat+2*sqrt(yvar))
vv <- order(ldl)
plot(sort(ldl),yhat[vv],type='l',ylim=c(m,M),pch=19,xlab="ldl",ylab='f(ldl)')
lines(sort(ldl),yhat[vv]+2*sqrt(yvar[vv]),col='red',type='l',pch=19)
lines(sort(ldl),yhat[vv]-2*sqrt(yvar[vv]),col='red',type='l',pch=19)
# Obesity
N <- cbind(1,ns(obesity,df=4))
yhat <- N%*%beta[c(1,15:18)]
SigmaHat <- solve(t(N)%*%(weights*N))
ycov <- N%*%SigmaHat%*%t(N)
yvar <- diag(ycov)
m <- min(yhat-2*sqrt(yvar))
M <- max(yhat+2*sqrt(yvar))
vv <- order(obesity)
plot(sort(obesity),yhat[vv],type='l',ylim=c(m,M),pch=19,xlab='Obesity',ylab='f(Obesity)')
lines(sort(obesity),yhat[vv]+2*sqrt(yvar[vv]),col='red',type='l',pch=19)
lines(sort(obesity),yhat[vv]-2*sqrt(yvar[vv]),col='red',type='l',pch=19)
# Age
N <- cbind(1,ns(age,df=4))
yhat <- N%*%beta[c(1,19:22)]
SigmaHat <- solve(t(N)%*%(weights*N))
ycov <- N%*%SigmaHat%*%t(N)
yvar <- diag(ycov)
m <- min(yhat-2*sqrt(yvar))
M <- max(yhat+2*sqrt(yvar))
vv <- order(age)
plot(sort(age),yhat[vv],type='l',ylim=c(m,M),pch=19,xlab='Age',ylab='f(Age)')
lines(sort(age),yhat[vv]+2*sqrt(yvar[vv]),col='red',type='l',pch=19)
lines(sort(age),yhat[vv]-2*sqrt(yvar[vv]),col='red',type='l',pch=19)
detach(SA)
Fitted natural-spline functions for each of the terms in the final model selected by the step-wise procedure (not shown here). Included are point-wise standard-error bands.
library(ElemStatLearn)
bone <- bone
attach(bone)
## The following object is masked from SAheart:
##
## age
males <- gender == "male"
females <- gender == "female"
boneMaleSmooth = smooth.spline(age[males], spnbmd[males], df=12 )
boneFemaleSmooth = smooth.spline(age[females], spnbmd[females], df=12)
plot(boneMaleSmooth$x, boneMaleSmooth$y, type='l', ylim=c(min(spnbmd),max(spnbmd)),
col="blue", xlab="Age", ylab="relative change in Spinal BMD")
points(age[males], spnbmd[males], col="blue", pch=20)
lines(boneFemaleSmooth$x, boneFemaleSmooth$y, col="red", type='l')
points(age[females], spnbmd[females], col="red", pch=20)
legend("topright", legend = c("Male","Female"), col=c("blue","red"), lty=c(1,1), cex = 0.7)
detach(bone)
The response is the relative change in bone mineral density measured at the spine in adolescents, as a function of age. A separate smoothing spline was fit to the males and females, with \(\lambda \approx\) 0.00022. This choice corresponds to about 12 degrees of freedom.
The plot shows a smoothing spline fit to some data on bone mineral density (BMD) in adolescents. The data are color coded by gender, and two separate curves were fit. This simple summary reinforces the evidence in the data that the growth spurt for females precedes that for males by about two years.
library(ElemStatLearn)
ozone.dat <- read.csv("~/Library/CloudStorage/GoogleDrive-pavansingho23@gmail.com/My Drive/Portfolio/Tutorials/Rmarkdown Tutorials/Regression/Data/LAozone.data.txt")
attach(ozone.dat)
## The following object is masked from package:ElemStatLearn:
##
## ozone
smoother1 <- smooth.spline(dpg, ozone.dat$ozone,df=5,tol = 0.01)
smoother2 <- smooth.spline(dpg, ozone.dat$ozone,df=11,tol=0.01)
M <- max(smoother1$y, smoother2$y, ozone.dat$ozone)
m <- min(smoother1$y, smoother2$y, ozone.dat$ozone)
plot(dpg, ozone.dat$ozone, ylim=c(m-1,M+1),type='p',pch=20,
xlab="Daggot Pressure Gradient",ylab="Ozone Concentration")
points(smoother1$x, smoother1$y, col="blue",type='l')
points(smoother2$x, smoother2$y, col="red",type='l')
detach(ozone.dat)
The plot shows the results of applying a cubic smoothing spline to some air pollution data (128 observations). Two fits are given: a smoother fit corresponding to a larger penalty \(\lambda\) and a rougher fit for a smaller penalty. That is, we have a smoothing spline fit of ozone concentration versus Daggot pressure gradient, with the two fits corresponding to different values of the smoothing parameter, chosen to achieve five and eleven effective degrees of freedom.
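A small sketch of the df/penalty correspondence on simulated data (the 5 and 11 df mirror the two ozone fits; the data here are illustrative): fixing the effective degrees of freedom in `smooth.spline()` implicitly fixes the penalty \(\lambda\), with more degrees of freedom corresponding to a smaller penalty.

```r
# Each choice of effective df maps to a value of the penalty lambda
# (simulated data; 5 and 11 df mirror the two ozone fits).
set.seed(1)
x <- runif(128, 0, 10)
y <- sin(x) + rnorm(128, sd = 0.3)

smoother <- smooth.spline(x, y, df = 5)   # larger penalty, smoother fit
rougher  <- smooth.spline(x, y, df = 11)  # smaller penalty, wigglier fit

c(smoother$lambda, rougher$lambda)  # smoother fit has the larger lambda
```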
We know that the regression spline is most flexible in regions that contain a lot of knots, because in those regions the polynomial coefficients can change rapidly. One possible strategy is to place more knots where we feel the function might vary most rapidly, and fewer knots where it seems more stable. In practice, however, it is common to place knots in a uniform fashion.
To decide on the number of knots to use (i.e. how many degrees of freedom our spline should contain), we can try out different numbers of knots and see which produces the best looking curve. A somewhat more objective approach is to use cross-validation. With this method, we remove a portion of the data (say \(10\%\)), fit a spline with a certain number of knots to the remaining data, and then use the spline to make predictions for the held-out portion. We repeat this process until each observation has been left out once, and then compute the overall cross-validated RSS. This procedure can be repeated for different numbers of knots \(K\); the value of \(K\) giving the smallest RSS is then chosen.
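A minimal sketch of this cross-validation procedure on simulated data (10 folds; the candidate df grid is chosen arbitrarily):

```r
# 10-fold cross-validation over the spline degrees of freedom
# (i.e. over the number of knots) for a cubic regression spline.
library(splines)
set.seed(1)
x <- runif(300, 0, 10)
y <- sin(x) + rnorm(300, sd = 0.3)
folds <- sample(rep(1:10, length.out = length(x)))

dfs <- 4:12
cv.rss <- sapply(dfs, function(df) {
  sum(sapply(1:10, function(k) {
    test <- folds == k
    fit  <- lm(y ~ bs(x, df = df), subset = !test)
    pred <- predict(fit, newdata = data.frame(x = x[test]))
    sum((y[test] - pred)^2)  # held-out RSS for this fold
  }))
})
dfs[which.min(cv.rss)]  # df with the smallest cross-validated RSS
```

Note that `predict()` may warn when held-out x values fall outside the training boundary knots; this is exactly the boundary-variance issue that motivates natural splines.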
Regression splines often give superior results to polynomial regression. This is because unlike polynomials, which must use a high degree (e.g. \(X^{15}\)) to produce flexible fits, splines introduce flexibility by increasing the number of knots but keeping the degree fixed. Generally, this approach produces more stable estimates. Splines also allow us to place more knots, and hence flexibility, over regions where the function \(f\) seems to be changing rapidly, and fewer knots where \(f\) appears more stable.
The truncated power basis is conceptually simple, but it can lead to numeric problems: for example, the basis functions are highly correlated.
The B-spline basis allows for efficient computations even when the number of knots \(\mathrm{K}\) is large. It is an alternative to the truncated power basis. Its basis functions are local (vs global), which means that each basis function is only non-zero over a limited range.
This reduces the collinearity problem.
The basis functions have considerable computational benefits compared to the truncated power basis (computationally more stable).
x <- seq(0, 1, length = 1000)
knots <- seq(0.1, 0.9, by = 0.1)
BM <- bs(x, knots = knots, degree = 3)
matplot(x = x, y = BM, type = "l", lwd = 2, las = 1, ylab = "")
# To plot multiple sets of observations from the same object (particularly a matrix), use matplot(); one could also call plot() and then lines() separately.
There are 9 knots and 12 basis functions; the intercept (constant) is often omitted by default. The B-spline basis functions are locally defined, and are not power functions of the x-variable. ESL page 186 looks at how we can construct a B-spline basis recursively.
So how do we get from our initial \(x\) variable to basis functions (shown above) and then to the fitted curve? Consider the figure below:
The curve in the second plot is obtained as a weighted sum of the bumps (basis functions) in the first plot: \(f(x)=\sum \beta_{i} h_{i}(x)\).
Now consider the final example, which demonstrates the fitting part of the procedure. The curve shown in the figure below is smooth; it is a B-spline with 5 degrees of freedom (no intercept) and has 2 knots, at 0.333 and 0.66. The bs() function from the splines library was used to create the B-spline basis: \(\text{bs}(x, df=5)\).
The first plot shows observations. The second plot (top right) shows the B-spline basis. The bottom left plot shows the same basis functions but scaled by coefficients, some negative, some positive. Note how the resulting functions have been adapted in order to explain the observations. The bottom right plot shows the same weighted basis functions plus their sum (black line \(=f(x))\).
To fit these curves we use the equations below:
\[ \begin{gathered} \hat{f}(x)=X \hat{\beta}=X\left(X^{\prime} X\right)^{-1} X^{\prime} Y=H Y \\ \operatorname{Var}(\hat{\beta})=\hat{\sigma}^{2}\left(X^{\prime} X\right)^{-1}=\Sigma \\ \operatorname{Var}(\hat{f}(x))=\operatorname{Var}(X \hat{\beta})=X \Sigma X^{\prime} \end{gathered} \]
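The equations above can be sketched directly in R (simulated data; a B-spline basis with df = 5 chosen for illustration):

```r
# Pointwise standard-error bands for a B-spline fit, using
# Var(f_hat(x)) = X Sigma X' from the equations above (simulated data).
library(splines)
set.seed(1)
x <- sort(runif(200, 0, 1))
y <- sin(2 * pi * x) + rnorm(200, sd = 0.3)

X    <- cbind(1, bs(x, df = 5))        # design matrix with intercept
beta <- solve(t(X) %*% X, t(X) %*% y)  # beta_hat = (X'X)^{-1} X'Y
fhat <- X %*% beta
s2   <- sum((y - fhat)^2) / (length(y) - ncol(X))  # sigma_hat^2
Sig  <- s2 * solve(t(X) %*% X)         # Var(beta_hat)
se   <- sqrt(diag(X %*% Sig %*% t(X))) # pointwise SE of f_hat(x)

plot(x, y, pch = 20, col = "gray")
lines(x, fhat, lwd = 2)
lines(x, fhat + 2 * se, lty = 2)
lines(x, fhat - 2 * se, lty = 2)
```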
Our aim was to find more flexible options for \(f(x)\) than offered by global polynomials. We can do this using piece-wise polynomials or splines.
In terms of splines, we have so far dealt with cubic splines, B-splines, and natural splines. These are all regression splines, which means that we determine knot positions, calculate the basis functions and fit them to data using the usual regression (linear or generalized linear) procedures. Their purpose is to highlight the underlying trend or to fit flexible non-linear relationships between response and predictors.
A spline is a sequence of piece-wise functions (often polynomials), each of order \(M\), joined at knots, and continuous up to the \((M-2)\)th derivative at each knot.
A regression spline is a spline set up with a set of fixed knots, as above.
order of spline \((M)=\) order of polynomial; degree of polynomial \(=M-1\), so \(M =\) order \(=\) degree \(+1\)
knots at \(\xi_{1}, \ldots, \xi_{K}\)
cubic spline: \(M=4\) (need \(1, x, x^{2}, x^{3}\))
linear spline: \(M=2\), continuous at knots
Regression splines involve dividing the range of \(X\) into \(K\) distinct regions. Within each region, a polynomial function is fit to the data. However, these polynomials are constrained so that they join smoothly at the region boundaries, or knots.
So a cubic spline is essentially a piece-wise polynomial where both the first and second derivatives of the piece-wise polynomials are continuous at the knot.
The most direct way to represent a cubic spline is to start off with a basis for a cubic polynomial, namely \(x, x^{2}, x^{3}\), and then add one truncated power basis function per knot. A truncated power basis function is defined as
\[ h(x, \xi)=(x-\xi)_{+}^{3}=\left\{\begin{array}{cl} (x-\xi)^{3} & \text { if } x>\xi \\ 0 & \text { otherwise } \end{array}\right. \] where \(\xi\) is the knot.
So, in order to fit a cubic spline to a data set with \(K\) knots, we perform least squares regression with an intercept and \(3+K\) predictors, of the form \(X, X^{2}, X^{3}, h\left(X, \xi_{1}\right), h\left(X, \xi_{2}\right), \ldots, h\left(X, \xi_{K}\right)\), where \(\xi_{1}, \ldots, \xi_{K}\) are the knots.
basis for cubic spline: \(1, x, x^{2},x^{3},\left(x-\xi_{1}\right)_{+}^{3}, \ldots,\left(x-\xi_{K}\right)_{+}^{3}\)
using other notation: \[ \begin{gathered} h_{1}(x)=1, h_{2}(x)=x, h_{3}(x)=x^{2}, h_{4}(x)=x^{3} \\ h_{5}(x)=\left(x-\xi_{1}\right)_{+}^{3}, \ldots \\ h_{4+K}(x)=\left(x-\xi_{K}\right)_{+}^{3} \end{gathered} \]
The regression spline function \(f(x)\) can be written as:
\[ \begin{aligned} f(x) &=\sum_{m=1}^{M} \beta_{m} h_{m}(x) \\ &=\sum \beta_{j} h_{j}(x)=\beta_{0}+\beta_{1} x+\beta_{2} x^{2}+\beta_{3} x^{3}+\sum_{k=1}^{K} b_{k}\left(x-\xi_{k}\right)_{+}^{3} \end{aligned} \] where the \(h_{m}(x)\) are basis functions (all still functions or transformations of our variable \(x\) ).
A natural spline is essentially the same as the regression spline described above, but with additional boundary constraints: the function is required to be linear at the boundary. We use cross-validation to decide on the number of knots to use, and we place the knots in a uniform fashion across the range of \(X\).
To fit regression splines in R, we use the splines library. Following the basis approach, we know that regression splines can be fit by constructing an appropriate matrix of basis functions. The \(bs()\) function generates the entire matrix of basis functions for splines with the specified set of knots. By default, cubic splines are produced.
Fitting wage to age using a regression spline is shown:
First we need a grid of age values at which to make predictions, which we shall use to plot our regression spline.
library(ISLR2)
data(Wage)
attach(Wage)
agelims <- range(age)
age.grid <- seq(from = agelims[1], to = agelims[2], by = 1)
Now we can fit wage to age using a regression spline.
library(splines)
fit <- lm(wage ~ bs(age, knots = c(25, 40, 60)), data = Wage)
pred <- predict(fit, newdata = list(age = age.grid), se = T)
plot(age, wage, col = "gray")
title("Regression Spline with knots at 25, 40 and 60")
lines(age.grid, pred$fit, lwd = 2)
lines(age.grid, pred$fit + 2 * pred$se, lty = "dashed")
lines(age.grid, pred$fit - 2 * pred$se, lty = "dashed")
abline(v = c(25, 40, 60), col = "red", lty = 2)
As noted earlier, splines can have high variance at the outer range of the predictors - that is, when \(X\) takes on either a very small or very large value. The plot shows a fit to the Wage data with three knots. We see that the confidence bands in the boundary regions appear fairly wild (wider than in the interior regions between the boundaries).
We could look to fit a natural spline, which is a regression spline with additional boundary constraints: the function is required to be linear at the boundary (in the region where \(X\) is smaller than the smallest knot, or larger than the largest knot). This additional constraint means that natural splines generally produce more stable estimates at the boundaries.
We have pre-specified knots at ages 25, 40, and 60. This produces a spline with six basis functions. Recall that a cubic spline with three knots has seven degrees of freedom; these degrees of freedom are used up by an intercept, plus six basis functions. We could also use the df option to produce a spline with knots at uniform quantiles of the data.
# Using pre-specified knots
dim(bs(age, knots = c(25, 40, 60)))
## [1] 3000 6
# Using df for uniform knots
dim(bs(age, df = 6))
## [1] 3000 6
# Where are the uniform knots location
attr(bs(age, df = 6), "knots")
## 25% 50% 75%
## 33.75 42.00 51.00
In this case \(\mathrm{R}\) chooses knots at ages \(33.8,42.0\), and \(51.0\), which correspond to the 25th, 50th, and 75th percentiles of age. The function \(bs()\) also has a degree argument, so we can fit splines of any degree, rather than the default degree of 3 (which yields a cubic spline).
In order to instead fit a natural spline, we use the \(ns()\) function. Here we fit a natural spline with four degrees of freedom.
fit2 <- lm(wage ~ ns(age, df = 4), data = Wage)
pred2 <- predict(fit2, newdata = list(age = age.grid),se = T)
plot(age, wage, col = "gray")
lines(age.grid, pred$fit, lwd = 2) #Regression Spline from Example 1
lines(age.grid, pred$fit + 2 * pred$se, lty = "dashed")
lines(age.grid, pred$fit - 2 * pred$se, lty = "dashed")
lines(age.grid, pred2$fit, col = "red", lwd = 2) #Natural Spline
abline(v=attr(bs(age, df = 6), "knots"))
detach(Wage)
The red line is the natural spline. The black line is the regression spline. We do not include confidence bands for the natural spline in the plot, although it would be very simple to add. Note, as with the \(bs()\) function, we could instead specify the knots directly using the knots option.
With regression splines (i.e., the splines we have covered so far - cubic, B-spline and natural), we have difficulty selecting where to place knots and how many knots to choose. Also, model selection would involve dropping knots - this could be problematic and could potentially involve many models. A more flexible and automatic approach is needed: a spline should be a good trade-off between fit and smoothness.
Penalized regression splines take away the problem of choosing the number and placement of knots. Instead, one chooses a sufficiently large number of knots, but then constrains their influence. With this approach, we can end up with far more basis functions than data points, so we need to control the complexity of such a model. There are several ways to do this:
we can restrict the number of terms in the model
we can use selection methods: scan the set of variables or basis functions and only select those which contribute to the fit of the model (MARS, for example)
we can use regularization methods: use the entire dictionary but restrict the coefficients (ridge regression, for example)
In the spline context, a less complex model generally requires a smoother function. One way of achieving a smoother function is by penalizing large \(\beta\) coefficients through a constraint.
In general with regularization we minimize a penalized residual sum of squares:
\[ \text { PRSS }=\sum\left(y_{i}-f\left(x_{i}\right)\right)^{2}+\lambda J(f) \] The first term measures how well the data are explained by \(f(x)\), the second term is a penalty for complexity. \(J(f)\) is the regularization or penalty term. \(\lambda\) is often called the smoothing parameter. By choosing different values for \(\lambda\) we can control the importance of the regularization term.
\(\lambda=0: f\) can be any function that interpolates the data.
\(\lambda=\infty\): all penalized coefficients are shrunk to zero.
The effect of this penalty is to shrink the coefficients of the spline basis functions towards 0.
The normal equations for regression models with quadratic penalties are \[ \hat{\beta}=\left(\mathbf{X}^{\prime} \mathbf{X}+\lambda \mathbf{D}\right)^{-1} \mathbf{X}^{\prime} \mathbf{y} \]
Most often penalized regression splines refer to the following penalty: \[ \|\mathbf{y}-\mathbf{X} \boldsymbol{\beta}\|^{2}+\lambda \int\left[f^{\prime \prime}(x)\right]^{2} d x \]
We can write \(\int\left[f^{\prime \prime}(x)\right]^{2} d x=\boldsymbol{\beta}^{\prime} \mathbf{P} \boldsymbol{\beta}\) and then have a similar form of penalized error sum of squares (penalized least squares) as above: \[ P S S=\|\mathbf{y}-\mathbf{X} \boldsymbol{\beta}\|^{2}+\lambda \boldsymbol{\beta}^{\prime} \mathbf{P} \boldsymbol{\beta} \]
\[ \hat{\boldsymbol{\beta}}=\left(\mathbf{X}^{\prime} \mathbf{X}+\lambda \mathbf{P}\right)^{-1} \mathbf{X}^{\prime} \mathbf{y} \]
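The closed-form solution above can be computed directly. Here is a minimal sketch on simulated data; for illustration only, it uses a ridge-type identity penalty matrix (a spline fit would use the \(\mathbf{P}\) from the roughness penalty) and, for simplicity, penalizes every coefficient including the intercept, which in practice is usually left unpenalized.

```r
# Penalized least squares in closed form: beta-hat = (X'X + lambda P)^{-1} X'y
set.seed(42)
n <- 100; p <- 5
X <- cbind(1, matrix(rnorm(n * (p - 1)), n))  # design with intercept column
beta_true <- c(1, 2, -1, 0.5, 0)
y <- X %*% beta_true + rnorm(n)

lambda <- 10
P <- diag(p)  # illustrative identity penalty; penalizes the intercept too

beta_hat <- solve(t(X) %*% X + lambda * P, t(X) %*% y)  # penalized fit
beta_ols <- solve(t(X) %*% X, t(X) %*% y)               # unpenalized fit

# The penalty shrinks the coefficient vector towards zero
sum(beta_hat^2) < sum(beta_ols^2)
```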
When minimizing the above penalized sum of squares, \(\lambda\) is fixed. An appropriate lambda can be chosen using cross-validation (more on this later).
P-splines use the B-spline basis plus a difference penalty. P-splines penalize the squared difference between adjacent \(\beta_{i}\) values: \[ \mathcal{P}=\sum_{i=1}^{k-1}\left(\beta_{i+1}-\beta_{i}\right)^{2} \]
P-splines are a form of penalized regression spline. They are easy to implement, calculating \(P\) is straightforward: \[ \mathcal{P}=\sum_{i=1}^{k-1}\left(\beta_{i+1}-\beta_{i}\right)^{2}=\beta^{\prime} P^{\prime} P \beta \] In \(R, P\) can be calculated as follows:
k <- 6 #example basis dimension
P <- diff(diag(k), differences = 1) #sqrt of penalty matrix
S <- t(P) %*% P #penalty matrix
P-spline is not synonymous with “penalized spline”! P-splines are a form of penalized spline, but use a very specific penalty. The penalty encourages adjacent coefficients to be similar, which means that adjacent bumps of the B-spline basis are weighted similarly. The effect of this is that the curve changes more slowly (e.g., you wouldn’t have a positive and then a negative bump next to each other). The overall effect is a smoother curve.
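Putting the pieces together, a P-spline fit can be sketched directly: a rich B-spline basis from `bs()`, the first-order difference penalty above, and the penalized normal equations. The data, basis dimension, and \(\lambda\) below are all illustrative.

```r
# A minimal P-spline fit on simulated data.
library(splines)
set.seed(7)
x <- seq(0, 1, length.out = 150)
y <- sin(2 * pi * x) + rnorm(150, sd = 0.2)

B <- bs(x, df = 20, intercept = TRUE)  # deliberately rich B-spline basis
k <- ncol(B)
D <- diff(diag(k), differences = 1)    # first-order difference penalty, as above
S <- t(D) %*% D                        # penalty matrix

lambda <- 1
beta_hat <- solve(t(B) %*% B + lambda * S, t(B) %*% y)
fitted_vals <- B %*% beta_hat          # smooth fitted curve
```

Second-order differences (`differences = 2`) are also common and penalize departures from linearity rather than from a constant.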
So far we have looked at regression splines, which we create by specifying a set of knots, producing a sequence of basis functions, and then using least squares to estimate the spline coefficients - this is the basis function approach! What we look at now are smoothing splines, which are similar to regression splines but arise in a slightly different situation. Smoothing splines result from minimizing a residual sum of squares criterion subject to a smoothness penalty. This is a somewhat different approach to producing a spline from the basis approach we have looked at so far.
This is essentially a method that avoids the knot selection problem completely by using a maximal set of knots; the complexity of the fit is controlled by regularization. Among all functions \(f(x)\) with two continuous derivatives, we find the one that minimizes the penalized residual sum of squares:
\[ \operatorname{RSS}(f, \lambda)=\sum_{i=1}^{N}\left\{y_{i}-f\left(x_{i}\right)\right\}^{2}+\lambda \int\left\{f^{\prime \prime}(t)\right\}^{2} d t, \] where \(\lambda\) is a fixed smoothing parameter. The first term measures closeness to the data, while the second term penalizes curvature in the function, and \(\lambda\) establishes a trade-off between the two.
Two special cases are:
\(\lambda=0: f\) can be any function that interpolates the data.
\(\lambda=\infty\) : the simple least squares line fit, since no second derivative can be tolerated.
When we fit a smooth curve to a set of data, what we essentially want is to find some function \(g(x)\) that fits the observed data well: that is, we want RSS \(=\sum_{i=1}^{n}\left(y_{i}-g\left(x_{i}\right)\right)^{2}\) to be small. If we don’t put any constraints on \(g\left(x_{i}\right)\), then we can always make RSS zero simply by choosing \(g\) such that it interpolates all of the \(y_{i}\) - this would overfit the data and be far too flexible. So what we actually want is a function \(g\) that makes RSS small, but that is also smooth. To make it smooth, we find the function \(g\) that minimizes \[ \sum_{i=1}^{n}\left(y_{i}-g\left(x_{i}\right)\right)^{2}+\lambda \int g^{\prime \prime}(t)^{2} d t \] where \(\lambda\) is a non-negative tuning parameter. The function \(g\) that minimizes the above equation is known as a smoothing spline.
This equation is essentially a \(Loss + Penalty\) formulation. The term \(\sum_{i=1}^{n}\left(y_{i}-g\left(x_{i}\right)\right)^{2}\) is a loss function that encourages \(g\) to fit the data well, and the term \(\lambda \int g^{\prime \prime}(t)^{2} d t\) is a penalty term that penalizes the variability in \(g\).
Note that, the first derivative \(g^{\prime}(t)\) measures the slope of a function at \(t\), and the second derivative corresponds to the amount by which the slope is changing. Hence, broadly speaking, the second derivative of a function is a measure of its roughness: it is large in absolute value if \(g(t)\) is very wiggly near \(t\), and it is close to zero otherwise. (The second derivative of a straight line is zero; note that a line is perfectly smooth.)
The \(\int\) notation is an integral, which we can think of as a summation over the range of \(t\) - \(\int g^{\prime \prime}(t)^{2} d t\) is simply a measure of the total change in the function \(g^{\prime}(t)\), over its entire range. If \(g\) is very smooth, then \(g^{\prime}(t)\) will be close to constant and \(\int g^{\prime \prime}(t)^{2} d t\) will take on a small value. Conversely, if \(g\) is jumpy and variable then \(g^{\prime}(t)\) will vary significantly and \(\int g^{\prime \prime}(t)^{2} d t\) will take on a large value. Therefore, in the full equation, \(\lambda \int g^{\prime \prime}(t)^{2} d t\) encourages \(g\) to be smooth.
The larger the value of \(\lambda\), the smoother \(g\) will be. When \(\lambda=0\), then the penalty term has no effect, and so the function \(g\) will be very jumpy and will exactly interpolate the training observations. When \(\lambda \rightarrow \infty, g\) will be perfectly smooth-it will just be a straight line that passes as closely as possible to the training points. For an intermediate value of \(\lambda, g\) will approximate the training observations but will be somewhat smooth. We see that \(\lambda\) controls the bias-variance trade-off of the smoothing spline.
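The effect of \(\lambda\) on the effective degrees of freedom can be seen directly with `smooth.spline()`, which accepts \(\lambda\) as an argument. A small sketch on simulated data (the particular \(\lambda\) values are illustrative):

```r
# How lambda controls the roughness of a smoothing spline (simulated data).
set.seed(3)
x <- runif(100)
y <- sin(4 * pi * x) + rnorm(100, sd = 0.3)

f_rough  <- smooth.spline(x, y, lambda = 1e-8)  # tiny penalty: near-interpolation
f_smooth <- smooth.spline(x, y, lambda = 10)    # heavy penalty: close to a line

f_rough$df   # large effective degrees of freedom
f_smooth$df  # effective degrees of freedom approaching 2
```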
The function \(g(x)\) that minimizes the equation we gave (Loss + Penalty equation) can be shown to have some special properties: it is a piece-wise cubic polynomial with knots at the unique values of \(x_{1}, \ldots, x_{n}\), and continuous first and second derivatives at each knot. Furthermore, it is linear in the region outside of the extreme knots. In other words, the function \(g(x)\) is a natural cubic spline with knots at \(x_{1}, \ldots, x_{n}\). However, it is not the same natural cubic spline that one would get if one applied the basis function approach described in previous section with knots at \(x_{1}, \ldots, x_{n}\) - rather, it is a shrunken version of such a natural cubic spline, where the value of the tuning parameter \(\lambda\) controls the level of shrinkage.
What we have established is that a smoothing spline is simply a natural cubic spline with knots at every unique value of \(x_{i}\) . It might seem that a smoothing spline will have far too many degrees of freedom, since a knot at each data point allows a great deal of flexibility. But the tuning parameter \(\lambda\) controls the roughness of the smoothing spline, and hence the effective degrees of freedom. It is possible to show that as \(\lambda\) increases from 0 to \(\infty\), the effective degrees of freedom, which we write \(d f_{\lambda}\), decrease from \(n\) to 2 .
In fitting a smoothing spline, we do not need to select the number or location of the knots - there will be a knot at each training observation, \(x_{1}, \ldots, x_{n}\). Instead, we have another problem: we need to choose the value of \(\lambda\). It should come as no surprise that one possible solution to this problem is cross-validation. In other words, we can find the value of \(\lambda\) that makes the cross-validated RSS as small as possible.
Remember in general, simpler models are better unless the data provides evidence in support of a more complex model.
Smoothing splines result from minimizing a residual sum of squares criterion subject to a smoothness penalty. What we want is a function \(g\) that makes RSS small, but that is also smooth, i.e., the function \(g\) that minimizes \[ \sum_{i=1}^{n}\left(y_{i}-g\left(x_{i}\right)\right)^{2}+\lambda \int g^{\prime \prime}(t)^{2} d t \] where \(\lambda\) is a non-negative tuning parameter. The function \(g\) that minimizes the above equation is known as a smoothing spline. The function \(g(x)\) is a natural cubic spline with knots at \(x_{1}, \ldots, x_{n}\). However, it is not the same natural cubic spline that one would get if one applied the basis function approach.
In order to fit a smoothing spline, we use the \(smooth.spline()\) function.
attach(Wage)
plot(age, wage, xlim = agelims, cex = .5, col = "darkgrey")
title("Smoothing Spline")
fit <- smooth.spline(age, wage, df = 16)
fit2 <- smooth.spline(age, wage, cv = TRUE)
## Warning in smooth.spline(age, wage, cv = TRUE): cross-validation with non-unique
## 'x' values seems doubtful
fit2$df
## [1] 6.794596
lines(fit, col = "red", lwd = 2)
lines(fit2, col = "blue", lwd = 2)
legend("topright", legend = c("16 DF", "6.8 DF"),col = c("red", "blue"), lty = 1, lwd = 2, cex = .8)
Notice that in the first call to \(smooth.spline()\), we specified \(df = 16\). The function then determines which value of \(\lambda\) leads to 16 degrees of freedom. In the second call to \(smooth.spline()\), we select the smoothness level by cross-validation; this results in a value of \(\lambda\) that yields 6.8 degrees of freedom.
Looking at the plot we have a smoothing spline fit to the Wage data. The red curve results from specifying 16 effective degrees of freedom. For the blue curve, \(\lambda\) was found automatically by leave-one-out cross-validation, which resulted in 6.8 effective degrees of freedom.
For this data, there is little discernible difference between the two smoothing splines, beyond the fact that the one with 16 degrees of freedom seems slightly wigglier. Since there is little difference between the two fits, the smoothing spline fit with 6.8 degrees of freedom is preferable, since in general simpler models are better unless the data provides evidence in support of a more complex model.
So far we have looked at regression and smoothing splines as an approach for flexibly predicting a response \(Y\) on the basis of a single predictor \(X\). Now, we explore the problem of flexibly predicting \(Y\) on the basis of several predictors, \(X_{1}, \ldots, X_{p}\). This amounts to an extension of multiple linear regression.
Essentially, Generalized additive models (GAMs) allow us to extend the methods above to deal with multiple predictors. That is, GAMs provide a general framework for extending a standard linear model by allowing non-linear functions of each of the variables, while maintaining additivity. Just like linear models, GAMs can be applied with both quantitative and qualitative responses.
We can extend the multiple linear regression model of
\[ y_{i}=\beta_{0}+\beta_{1} x_{i 1}+\beta_{2} x_{i 2}+\cdots+\beta_{p} x_{i p}+\epsilon_{i} \] in order to allow for non-linear relationships between each feature and the response is to replace each linear component \(\beta_{j} x_{i j}\) with a (smooth) nonlinear function \(f_{j}\left(x_{i j}\right)\).
So now we can write the model as:
\[ \begin{aligned} y_{i} &=\beta_{0}+\sum_{j=1}^{p} f_{j}\left(x_{i j}\right)+\epsilon_{i} \\ &=\beta_{0}+f_{1}\left(x_{i 1}\right)+f_{2}\left(x_{i 2}\right)+\cdots+f_{p}\left(x_{i p}\right)+\epsilon_{i} . \end{aligned} \] This is a GAM. It is called an additive model because we calculate a separate \(f_{j}\) for each \(X_{j}\), and then add together all of their contributions.
We can use the methods we have covered - polynomials, step functions and splines, where we fit functions of a single variable - as building blocks in GAMs for fitting an additive model.
For example, consider using natural splines. We want to fit the model:
\[ \text { wage }=\beta_{0}+f_{1}(\text { year })+f_{2}(\text { age })+f_{3}(\text { education })+\epsilon \]
Here year and age are quantitative variables, and education is a qualitative variable with five levels: \(<\)HS, HS, \(<\)Coll, Coll, \(>\)Coll, referring to the amount of high school or college education that an individual has completed. We fit the first two functions using natural splines. We fit the third function using a separate constant for each level, via the usual dummy variable approach.
The natural splines can be constructed using an appropriately chosen set of basis functions. The entire model is just a big regression onto spline basis variables and dummy variables, all packed into one big regression matrix.
Note that we do not have to use only splines as the building blocks for GAMs. We can just as well use polynomial regression, or any combination of the approaches we have seen thus far, in order to create a GAM.
GAMs allow us to fit a non-linear \(f_{j}\) to each \(X_{j}\), so that we can automatically model non-linear relationships that standard linear regression will miss. This means that we do not need to manually try out many different transformations on each variable individually.
The non-linear fits can potentially make more accurate predictions for the response \(Y\). And because the model is additive, we can examine the effect of each \(X_{j}\) on \(Y\) individually while holding all of the other variables fixed.
The main limitation of GAMs is that the model is restricted to be additive. We can manually add interaction terms to the GAM model by including additional predictors of the form \(X_{j} \times X_{k}\). In addition we can add low-dimensional interaction functions of the form \(f_{j k}\left(X_{j}, X_{k}\right)\) into the model; such terms can be fit using two-dimensional smoothers such as two-dimensional splines.
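As a sketch of this idea on simulated data (the variable names, degrees of freedom, and simulated model are illustrative), we can compare a purely additive fit with one that adds a parametric \(X_{j} \times X_{k}\) term, using \(lm()\) with natural-spline bases:

```r
# Additive fit vs. additive fit plus a simple interaction term (simulated data).
library(splines)
set.seed(11)
x1 <- runif(300); x2 <- runif(300)
y <- sin(2 * pi * x1) + x2^2 + 0.5 * x1 * x2 + rnorm(300, sd = 0.2)

fit_add <- lm(y ~ ns(x1, 4) + ns(x2, 4))          # purely additive
fit_int <- lm(y ~ ns(x1, 4) + ns(x2, 4) + x1:x2)  # additive + X_j * X_k term

anova(fit_add, fit_int)  # F-test for whether the interaction improves the fit
```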
We fit a GAM to predict wage using natural spline functions of year and age, treating education as a qualitative predictor. This is essentially just a big linear regression model using an appropriate choice of basis functions, we can simply do this using the \(lm()\) function
str(Wage)
## 'data.frame': 3000 obs. of 11 variables:
## $ year : int 2006 2004 2003 2003 2005 2008 2009 2008 2006 2004 ...
## $ age : int 18 24 45 43 50 54 44 30 41 52 ...
## $ maritl : Factor w/ 5 levels "1. Never Married",..: 1 1 2 2 4 2 2 1 1 2 ...
## $ race : Factor w/ 4 levels "1. White","2. Black",..: 1 1 1 3 1 1 4 3 2 1 ...
## $ education : Factor w/ 5 levels "1. < HS Grad",..: 1 4 3 4 2 4 3 3 3 2 ...
## $ region : Factor w/ 9 levels "1. New England",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ jobclass : Factor w/ 2 levels "1. Industrial",..: 1 2 1 2 2 2 1 2 2 2 ...
## $ health : Factor w/ 2 levels "1. <=Good","2. >=Very Good": 1 2 1 2 1 2 2 1 2 2 ...
## $ health_ins: Factor w/ 2 levels "1. Yes","2. No": 2 2 1 1 1 1 1 1 1 1 ...
## $ logwage : num 4.32 4.26 4.88 5.04 4.32 ...
## $ wage : num 75 70.5 131 154.7 75 ...
contrasts(education)
## 2. HS Grad 3. Some College 4. College Grad
## 1. < HS Grad 0 0 0
## 2. HS Grad 1 0 0
## 3. Some College 0 1 0
## 4. College Grad 0 0 1
## 5. Advanced Degree 0 0 0
## 5. Advanced Degree
## 1. < HS Grad 0
## 2. HS Grad 0
## 3. Some College 0
## 4. College Grad 0
## 5. Advanced Degree 1
We see that education is stored as a factor variable. We also see the encoding of the variable. We now proceed to fitting the model using \(lm()\). We plot the GAM using \(plot.Gam()\).
gam1 <- lm(wage ~ ns(year, 4) + ns(age, 5) + education, data = Wage)
plot.Gam(gam1, se = TRUE, col = "blue")
For the Wage data, the plots show the relationship between each feature and the response, wage, in the fitted model
\[ \text { wage }=\beta_{0}+f_{1}(\text { year })+f_{2}(\text { age })+f_{3}(\text { education })+\epsilon \]
Each plot displays the fitted function and pointwise standard errors. The first two functions are natural splines in year and age, with four and five degrees of freedom, respectively. The third function is a step function, fit to the qualitative variable education.
The left-hand panel indicates that holding age and education fixed, wage tends to increase slightly with year; this may be due to inflation. The center panel indicates that holding education and year fixed, wage tends to be highest for intermediate values of age, and lowest for the very young and very old. The right-hand panel indicates that holding year and age fixed, wage tends to increase with education: the more educated a person is, the higher their salary, on average.
In order to fit more general sorts of GAMs, using smoothing splines or other components that cannot be expressed in terms of basis functions and then fit using least squares regression, we will need to use the gam library in R.
The \(s()\) function, which is part of the gam library, is used to indicate that we would like to use a smoothing spline. We specify that the function of year should have 4 degrees of freedom, and that the function of age will have 5 degrees of freedom. Since education is qualitative, we leave it as is, and it is converted into four dummy variables. We use the \(gam()\) function in order to fit a GAM using these components.
gam.m3 <- gam(wage ~ s(year, 4) + s(age, 5) + education, data = Wage)
par(mfrow = c(1, 3))
plot(gam.m3, se = TRUE, col = "blue")
Now \(f_1\) and \(f_2\) are smoothing splines with four and five degrees of freedom, respectively. Fitting a GAM with a smoothing spline is not quite as simple as fitting a GAM with a natural spline, since in the case of smoothing splines, least squares cannot be used. However, standard software such as the \(gam()\) function in R can be used to fit GAMs using smoothing splines, via an approach known as backfitting.
Backfitting: This method fits a model involving multiple predictors by repeatedly updating the fit for each predictor in turn, holding the others fixed. The beauty of this approach is that each time we update a function, we simply apply the fitting method for that variable to a partial residual.
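A minimal backfitting sketch on simulated data, using \(smooth.spline()\) as the per-variable smoother (fixed degrees of freedom and a fixed iteration count for simplicity; a real implementation would iterate to convergence and handle identifiability more carefully):

```r
# Backfitting for y = alpha + f1(x1) + f2(x2) + noise (simulated data).
set.seed(5)
n <- 300
x1 <- runif(n); x2 <- runif(n)
y <- sin(2 * pi * x1) + (x2 - 0.5)^2 + rnorm(n, sd = 0.2)

alpha <- mean(y)
f1 <- rep(0, n); f2 <- rep(0, n)

for (iter in 1:20) {
  # Update f1 against the partial residual, holding f2 fixed
  r1 <- y - alpha - f2
  f1 <- predict(smooth.spline(x1, r1, df = 5), x1)$y
  f1 <- f1 - mean(f1)  # center each function for identifiability
  # Update f2 against the partial residual, holding f1 fixed
  r2 <- y - alpha - f1
  f2 <- predict(smooth.spline(x2, r2, df = 5), x2)$y
  f2 <- f2 - mean(f2)
}

var(y - alpha - f1 - f2)  # residual variance after backfitting
```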
In these plots, the function of year looks rather linear. We can perform a series of ANOVA tests in order to determine which of these three models is best.
That is, a GAM that excludes year \(\left(\mathcal{M}_{1}\right)\), a GAM that uses a linear function of year \(\left(\mathcal{M}_{2}\right)\), or a GAM that uses a spline function of year \(\left(\mathcal{M}_{3}\right)\).
In model 1, we have a model that does not include year.
In model 2, we have a model that uses a linear function of year.
In model 3, we have a model that uses a spline function of year.
gam.m1 <- gam(wage ~ s(age, 5) + education, data = Wage)
gam.m2 <- gam(wage ~ year + s(age, 5) + education, data = Wage)
gam.m3 <- gam(wage ~ s(year, 4) + s(age, 5) + education, data = Wage)
anova(gam.m1, gam.m2, gam.m3, test = "F")
## Analysis of Deviance Table
##
## Model 1: wage ~ s(age, 5) + education
## Model 2: wage ~ year + s(age, 5) + education
## Model 3: wage ~ s(year, 4) + s(age, 5) + education
## Resid. Df Resid. Dev Df Deviance F Pr(>F)
## 1 2990 3711731
## 2 2989 3693842 1 17889.2 14.4771 0.0001447 ***
## 3 2986 3689770 3 4071.1 1.0982 0.3485661
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Looking at the p-values in the resulting ANOVA output, we find that there is compelling evidence that a GAM with a linear function of year is better than a GAM that does not include year at all \((\mathrm{p}\)-value \(=0.00014)\).
Also, there is no evidence that a non-linear function of year is needed \((\mathrm{p}\)-value \(=0.349)\). In other words, based on the results of this ANOVA, \(\mathcal{M}_{2}\) is preferred.
The \(summary()\) function produces a summary of the GAM fit.
summary(gam.m3)
##
## Call: gam(formula = wage ~ s(year, 4) + s(age, 5) + education, data = Wage)
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -119.43 -19.70 -3.33 14.17 213.48
##
## (Dispersion Parameter for gaussian family taken to be 1235.69)
##
## Null Deviance: 5222086 on 2999 degrees of freedom
## Residual Deviance: 3689770 on 2986 degrees of freedom
## AIC: 29887.75
##
## Number of Local Scoring Iterations: NA
##
## Anova for Parametric Effects
## Df Sum Sq Mean Sq F value Pr(>F)
## s(year, 4) 1 27162 27162 21.981 2.877e-06 ***
## s(age, 5) 1 195338 195338 158.081 < 2.2e-16 ***
## education 4 1069726 267432 216.423 < 2.2e-16 ***
## Residuals 2986 3689770 1236
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Anova for Nonparametric Effects
## Npar Df Npar F Pr(F)
## (Intercept)
## s(year, 4) 3 1.086 0.3537
## s(age, 5) 4 32.380 <2e-16 ***
## education
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The “Anova for Parametric Effects” p-values clearly demonstrate that year, age, and education are all highly statistically significant, even when only assuming a linear relationship.
Alternatively, the “Anova for Nonparametric Effects” p-values for year and age correspond to a null hypothesis of a linear relationship versus the alternative of a non-linear relationship. The large p-value for year reinforces our conclusion from the ANOVA test that a linear function is adequate for this term. There is very clear evidence that a non-linear term is required for age.
We can make predictions using the \(predict()\) method for the class GAM. Here we make predictions on the training set.
preds <- predict(gam.m2, newdata = Wage)
We can also fit a GAM with a qualitative response. The GAM we fit takes the form \[ \log \left(\frac{p(X)}{1-p(X)}\right)=\beta_{0}+\beta_{1} \times \text { year }+f_{2}(\text { age })+f_{3}(\text { education }) \] where
\[ p(X)=\operatorname{Pr}(\text { wage }>250 \mid \text { year, age, education }) . \]
In order to fit a logistic regression GAM, we once again use the \(I()\) function in constructing the binary response variable, and set \(family=binomial\).
gam.lr <- gam(I(wage > 250) ~ year + s(age, df = 5) + education, family = binomial, data = Wage)
par(mfrow = c(1, 3))
plot(gam.lr, se = T, col = "green")
Once again \(f_{2}\) is fit using a smoothing spline with five degrees of freedom, and \(f_{3}\) is fit as a step function, by creating dummy variables for each of the levels of education. The resulting fit is shown in the plots.
The last panel looks suspicious, with very wide confidence intervals for level \(<\text{HS}\). In fact, no response values equal one for that category: no individuals with less than a high school education make more than \(\$ 250,000\) per year.
It is easy to see that there are no high earners in the \(< HS\) category:
table(education, I(wage > 250))
##
## education FALSE TRUE
## 1. < HS Grad 268 0
## 2. HS Grad 966 5
## 3. Some College 643 7
## 4. College Grad 663 22
## 5. Advanced Degree 381 45
Hence we refit the GAM, excluding the individuals with less than a high school education. That is, we fit a logistic regression GAM using all but this category. This provides more sensible results.
gam.lr.s <- gam(I(wage > 250) ~ year + s(age, df = 5) + education, family = binomial, data = Wage, subset = (education != "1. < HS Grad"))
plot(gam.lr.s, se = T, col = "green")
The resulting model is shown in plot. As in previous plots, all three panels have similar vertical scales. This allows us to visually assess the relative contributions of each of the variables. We observe that age and education have a much larger effect than year on the probability of being a high earner.
So far we have focused on one-dimensional spline models.
The MARS procedure is a greedy forward algorithm for including only those tensor products that are deemed necessary by least squares.