Following this morning’s course, I got a very interesting question from one of my students. The question was about having non-significant components in a spline regression. Should we prefer a model with a small number of knots and all components significant, or one with a (much) larger number of knots, many of them non-significant?

My initial intuition was to prefer the second alternative, as with autoregressive models in R. When we fit an AR(6) model, it’s not really a big deal if most coefficients are not significant (except the last one). It won’t affect the forecast much. So here, it might be the same. With a larger number of knots, we should be able to capture small bumps that we would never capture with a smaller number.

Here is what I have with a small number of knots, and cubic splines

and with a larger number of knots
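The two bases can be drawn with a short sketch like the one below. It assumes the "small" basis is `bs(x, 3)` and the "large" one is `bs(x, 10)`, which matches the models fitted later in the post; the names `B3` and `B10` are mine.

```r
library(splines)

x <- seq(0, 1, by = 0.01)

# Cubic B-spline bases (no intercept column):
# df = 3 has no interior knot, df = 10 has seven interior knots
B3  <- bs(x, df = 3)
B10 <- bs(x, df = 10)

# One panel per basis, one curve per basis function
op <- par(mfrow = c(1, 2))
matplot(x, B3,  type = "l", lty = 1, main = "bs(x, 3)",  ylab = "basis function")
matplot(x, B10, type = "l", lty = 1, main = "bs(x, 10)", ylab = "basis function")
par(op)
```

With `df = 10` each basis function is non-zero only on a short interval, which is why such a basis can pick up local bumps that the three wide functions of `bs(x, 3)` cannot.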

In order to understand what’s going on, consider a simple model, with the two splines above, in red

> set.seed(1)
> library(splines)
> x=seq(0,1,by=.01)
> v=bs(x,10)
> x2=v[,2]
> x10=v[,10]
> set.seed(1)
> y=1+3*x2+5*x10+rnorm(length(x))/4
> y_test=1+3*x2+5*x10+rnorm(length(x))/4

Note that here I have generated two sets of data, one to train a model, and one to test it. The training data look like this

> plot(x,y)

It is based on two splines,

> lines(x,1+3*x2+5*x10)

If we use a spline model with 10 degrees of freedom, we get

> df=data.frame(x,y)
> reg=lm(y~bs(x,10),data=df)
> summary(reg)

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)   0.91671    0.17068   5.371 6.08e-07 ***
bs(x, 10)1    0.20485    0.32696   0.627    0.533
bs(x, 10)2    3.15593    0.22534  14.005  < 2e-16 ***
bs(x, 10)3    0.04847    0.25075   0.193    0.847
bs(x, 10)4    0.09373    0.21597   0.434    0.665
bs(x, 10)5    0.11624    0.22939   0.507    0.614
bs(x, 10)6    0.24829    0.22293   1.114    0.268
bs(x, 10)7   -0.06825    0.23498  -0.290    0.772
bs(x, 10)8    0.19633    0.26241   0.748    0.456
bs(x, 10)9    0.27557    0.26976   1.022    0.310
bs(x, 10)10   4.78134    0.24116  19.826  < 2e-16 ***

which makes sense, given how the data were generated. Indeed, none of the components is significant except the second and the tenth. We can actually test that all the other coefficients are null (jointly)

> library(car)
> A=matrix(0,8,11)
> colnames(A)=names(coefficients(reg))
> A[1,2]=A[2,4]=A[3,5]=A[4,6]=A[5,7]=
+ A[6,8]=A[7,9]=A[8,10]=1
> b=rep(0,8)
> linearHypothesis(reg, A,b)
Linear hypothesis test

Hypothesis:
bs(x, 10)1 = 0
bs(x, 10)3 = 0
bs(x, 10)4 = 0
bs(x, 10)5 = 0
bs(x, 10)6 = 0
bs(x, 10)7 = 0
bs(x, 10)8 = 0
bs(x, 10)9 = 0

Model 1: restricted model
Model 2: y ~ bs(x, 10)

  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1     98 4.8766
2     90 4.6196  8   0.25701 0.6259  0.754

and yes, those coefficients are not significant.
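The same joint hypothesis can also be written as a comparison of two nested models, which avoids building the restriction matrix by hand. This is only a sketch: it regenerates the data with the same seed as above, and the names `full`, `restricted` and `ftest` are mine, not part of the original code.

```r
library(splines)

set.seed(1)
x   <- seq(0, 1, by = 0.01)
v   <- bs(x, 10)
x2  <- v[, 2]
x10 <- v[, 10]
y   <- 1 + 3 * x2 + 5 * x10 + rnorm(length(x)) / 4

full       <- lm(y ~ v)         # all ten basis functions
restricted <- lm(y ~ x2 + x10)  # only the 2nd and the 10th

# F test of the 8 restrictions (coefficients 1, 3, ..., 9 equal to zero)
ftest <- anova(restricted, full)
print(ftest)
```

Since the restricted model drops exactly the eight columns set to zero in the hypothesis, the F statistic is computed from the same two residual sums of squares.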

> yp10=predict(reg)
> lines(df$x,yp10,col="red")

The red curve is not far away from the black one (the simulated model, when the noise is removed).

Now, if we compare with a spline regression with three degrees of freedom

> reg=lm(y~bs(x),data=df)
> yp3=predict(reg)
> lines(df$x,yp3,col="blue",lwd=3)
> summary(reg)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   1.5381     0.2188   7.029 2.91e-10 ***
bs(x)1        2.2276     0.6348   3.509 0.000683 ***
bs(x)2       -4.4788     0.4089 -10.952  < 2e-16 ***
bs(x)3        1.9936     0.3434   5.805 8.12e-08 ***

This time, all components are significant, but the quality of fit is rather poor, isn’t it?

What is the best thing we can do? Let us fit different models. Say, consider a basis with 8 degrees of freedom, but keep only the best 1, or 2, or 3, etc. components. Then get a prediction on the test dataset, and see which model has the smallest sum of squared prediction errors. We do this for every basis dimension from 4 to 15,

> me=1e9
> ERR=matrix(NA,15,15)
> for(k in 4:15){
+ bsX=bs(df$x,k)
+ dfX=data.frame(y=df$y,bsX)
+ reg=lm(y~.,data=dfX)
+ library(leaps)
+ W=leaps( bsX , df$y , method="Cp" ,
+ nbest=1)$which
+ for(i in 1:k){
+ dfXi=data.frame(y=df$y,bsX[,W[i,]])
+ reg=lm(y~.,data=dfXi)
+ dfXi_s=data.frame(y=y_test,bsX[,W[i,]])
+ names(dfXi_s)=c("y",
+ names(coefficients(reg))[-1])
+ erreur=dfXi_s$y-predict(reg,newdata=dfXi_s)
+ ERR[k,i]=sum(erreur^2)
+ if(ERR[k,i]<=me){LOC=c(k,i);me=ERR[k,i]}
+ }}

The best model is when we keep the ten best components, out of 11 degrees of freedom

> LOC
[1] 11 10

More specifically,

> library(leaps)
> bsX=bs(df$x,LOC[1])
> W=leaps( bsX , df$y , method="Cp" ,
+ nbest=1)$which
> dfXi=data.frame(y=df$y,bsX[,W[LOC[2],]])
> reg=lm(y~.,data=dfXi)
> ypopt=predict(reg)
> lines(df$x,ypopt,col="purple")

The *best* model here is one with more degrees of freedom than the one we used to generate the data.

Actually, if we use 10 degrees of freedom, but keep only the best two components, we get something rather close

> bsX=bs(df$x,10)
> W=leaps( bsX , df$y , method="Cp" ,
+ nbest=1)$which
> dfXi=data.frame(y=df$y,bsX[,W[2,]])
> reg=lm(y~.,data=dfXi)
> ypopt=predict(reg)
> lines(df$x,ypopt,col="red",lwd=3)

So, it looks like having a lot of non-significant components in a spline regression is not a major issue. And reducing the degrees of freedom just to make every component significant is clearly a bad option here.

I feel this echoes the criticisms of stepwise regression, which uses significance-based tools to select features at each step. In the end, the final set of most significant components may be very far from providing the best fit.

It appears to me that there is a bug in the model selection code using leaps below.

+ for(i in 1:k){

+ dfXi=data.frame(y=df$y,bsX[W[i,]])

…

}

It should really select columns over bsX as follows. Please double check.

+ for(i in 1:k){

+ dfXi=data.frame(y=df$y,bsX[,W[i,]])

….

}

Indeed, sorry for the typo! I will change the code in the post.

Very interesting as usual! Thanks.

This post echoes the ongoing debate: whether we are after an interpretable model, or one which provides accurate information.

There is a very nice discussion here: http://projecteuclid.org/euclid.ss/1009213726 (‘Statistical Modeling: The Two Cultures’ by Breiman).

This is a rather interesting post! I always wondered what the purpose of testing the significance of the knots is. I find it very hard to interpret their meaning. When I have to deal with splines I usually resort to information criteria. Even this is tricky, as AIC picks the more complex models in big samples, even when there is no need for that. BIC, on the other hand, can be too stringent in smaller samples. Just like others, I never really figured out where the line between the two is.
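To see what the two criteria would pick on the simulated data of the post, one possible sketch is given below. It regenerates the training sample with the same seed and fits `bs(x, k)` for `k` from 3 to 15; the range of `k` and the names `dat`, `ks`, `crit`, `best_aic` and `best_bic` are my own choices, not something taken from the post.

```r
library(splines)

set.seed(1)
x <- seq(0, 1, by = 0.01)
v <- bs(x, 10)
y <- 1 + 3 * v[, 2] + 5 * v[, 10] + rnorm(length(x)) / 4
dat <- data.frame(x, y)

# AIC and BIC of the full spline regression for each basis dimension k
ks <- 3:15
crit <- t(sapply(ks, function(k) {
  fit <- lm(y ~ bs(x, k), data = dat)
  c(df = k, AIC = AIC(fit), BIC = BIC(fit))
}))
print(round(crit, 2))

# Basis dimension selected by each criterion
best_aic <- crit[which.min(crit[, "AIC"]), "df"]
best_bic <- crit[which.min(crit[, "BIC"]), "df"]
c(AIC = best_aic, BIC = best_bic)
```

With 101 observations the BIC penalty per parameter is log(101), about 4.6, against 2 for AIC, so BIC can only select a basis that is as small as or smaller than the AIC choice.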

This semester, I have my ‘time series’ course, and I think I will publish a post on the same topic for AR time series… a few years back, I had a course based on SAS, where we could specify the lags we have in the autoregressive part, like t-1, t-3 and t-12… and then, when I moved to R, it took me some time to understand that it was not a big deal to have all lags up to t-12… I will probably post something in a few weeks!

Well, the “best” model will always be the one that matches the data-generating model, which in this case uses the second and 10th splines from the B-spline basis of dimension 10 that you defined. This is equivalent to your last model (the thick red line). Fitting a model with the full bs(10) basis has the potential to over-fit the data, but as you can see (by the purple line) in this case it isn’t that bad. If you had a lower signal-to-noise ratio, it might become a problem.

Reducing to bs(3) fails because the splines in this basis are wider and in the wrong location from the data-generating model – this is clear by looking at your first two figures.

Usually the best thing to do (when you don’t know the data-generating model) is to use penalized splines (e.g., p-splines). As long as your basis is big enough to capture the complexity of the data (here at least dimension 10), fitting with a penalized spline will get you something similar to the red line without an arbitrary selection procedure. The penalty helps avoid over-fitting.

> library(mgcv)

> fit plot(fit)

yes, that’s a comment I got from Twitter, to use p-splines

on my to-do list

Twitter is so smart.

My code got cut off when I tried to use the assignment operator.

> library(mgcv)

> fit = gam(y~s(x))

> plot(fit)
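Note that `s(x)` uses mgcv's default basis, a thin plate regression spline, rather than a P-spline. For an explicit P-spline fit on the simulated data of the post, a minimal sketch could look as follows; the basis dimension `k = 20` and the choice of REML for the smoothing parameter are assumptions of mine, picked only so that the basis is clearly larger than the one used to generate the data.

```r
library(splines)
library(mgcv)

set.seed(1)
x <- seq(0, 1, by = 0.01)
v <- bs(x, 10)
truth <- 1 + 3 * v[, 2] + 5 * v[, 10]     # noise-free signal
y <- truth + rnorm(length(x)) / 4

# P-spline basis of dimension 20, smoothing parameter estimated by REML
fit <- gam(y ~ s(x, bs = "ps", k = 20), method = "REML")

plot(x, y)
lines(x, truth)                              # simulated model
lines(x, fitted(fit), col = "red", lwd = 2)  # penalized fit

# Effective degrees of freedom actually used (intercept included)
sum(fit$edf)
```

The effective degrees of freedom reported at the end show how much of the 20-dimensional basis the penalty lets the fit use, which plays the role of the hand-made component selection in the post.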

thanks

I am a big fan of the mgcv package (especially for multivariate spline functions)

Important post! Would be interesting to see how AIC, BIC and reduced chi-square perform here for model selection…

Cheers,

Andrej