2 Moving Beyond Linearity
Literature:
- Moving Beyond Linearity (ISL CH7)
Recall that complexity = also means lower interpretibility. This subject extents the linear models with the following:
- Polunomial Regression - where polynomials of the variables are added.
- Step Functions - where the x range is cut into k distinct regions to produce a qualitative variable. Hence also the name, piecewise constant function.
- Regression Splines - a combination / extensions of number one and two. Where polynomials functions are applied in specified regions of an X range.
- Smoothing Splines - Similar to the one above, but slightly different in the fitting process.
- Local Regression - Similar to regression splines, but these are able to overlap.
- Generalized Additive Models - allows to extent the model with several predictors.
2.1 Models Beyond Linearity
Notice that all approaches despite GAM are extensions of simple linear regression, as it only takes on one predictor variable.
2.1.1 Polynomial Regression
Can be defined by the following
\[\begin{equation} y_{i\ }=\ \beta_0+\beta_1x_i+\beta_2x_i^2+...+\ \beta_dx_i^d\ +\ \epsilon_i \tag{2.1} \end{equation}\]
Rules of thumb:
- We don’t take on more than 3 or 4 degrees of d, as that yields strange lines
Note that we can still use standard errors for coefficient estimates.
2.1.1.1 Beta coefficients and variance
Each beta coefficient has its own variance (just as in linear regression).
It can be defined by a matrix of j dimensions, e.g., if you have 5 betas (including beta 0) we can construct the correlation matrix.
Covariance matrix can be identified by \(\hat{C}\).
Generally we get point estimates, but it is also interesting to show the confidence intervals (using 2 standard errors).
Notice, that we cant really interprete beta coefficients as we do with linear regression, hence we dont have the same ability to do inference as the coefficients are missleading.
2.1.1.2 Application procedure
- Use
lm()
orglm()
- Use DV~poly(IV,degree)
- Perform CV with
cv.glment()
/ aonva F-test to select degree- This is basically either visually selecting the degrees that are the best using CV or using an ANOVA to assess if the MSE are significantly different from each other, hence an ANOVA test. Regarding ANOVA, if there stop being significance, then significant changes, e.g., in a poly 8, where the previous polynomials was not significant, then we can also disregard the 8’th polynomial.
- Fit the selected model
- Look at
coef(summary(fit))
- Plot data and predictions with
predict()
- Check residuals
- Interpret
The lecture shows exercise number 6 in chapter 7.
2.1.2 Step Functions
This is literally just fitting a constant in different bins, see example on page 269. This is also called discretizing x. Thus it is not polynoomial, but it is non linear.
It is often applied when we see, e.g., five year age bins, e.g., a 20-25 year old is expected to earn so and so much etc. But we can only really use it, when there are natural cuts, hence one must be considerate using the model.
Major disadvantage: If there are no natural breakpoints, then the model is likely to miss variance and also generalize too much.
Remember, that the steps reflect the average increase in Y for each step. Hence the first bin (range) is defined by \(\beta_0\), and can be regarded as the average of that range of x. Thus each coefficient of the ranges of x is to be understood as the average increase of response.
In other words, \(\beta_0\) is the reference level, where the following cuts reflect the average increase or decrease hereon.
Note that we can still use standard errors for coefficient estimates.
2.1.3 Regression Splines
2.1.3.1 Piecewise Polynomials
This is basically polynomial regression, where the coefficients apply to specified ranges of X. The points where the coefficients changes are called knots. Hence a cubic function will look like the following.
Notice, that a piecewise polynomial function with no knots, is merely a standard polynomial function.
\[\begin{equation} y_{i\ }=\ \beta_0+\beta_1x_i+\beta_2x_i^2+\ \beta_3x_i^3\ +\ \epsilon_i \tag{2.2} \end{equation}\]
Where it can be extended to be written for each range.
The coefficients can then be written with: \(\beta_{01},\beta_{11},\beta_{21}\) etc. and for the second set: \(\beta_{02},\beta_{12},\beta_{22}\)
And for each of the betas in all of the cuts, you add one degree of freedom.
Rule of thumb:
- The more knots, the more complex = the more variance capture and noise trailed, hence low model bias but large model variance.

Figure 2.1: Piecewise Polynomials
2.1.3.2 Constraints and Splines
It is a spline when the piecewise polynomials have been imposed with restrictions for continouity and derivatives, so the knot don’t break or so to say, hence the knot will not be visible.
Figure 7.3 show how the splines look, as the top left window tell, the jump is rather odd. Hence, we can force the fit to be continuous, by imposing a constraint.
Notice, that it is a piecewise polynomial regression, when you merely fit polynomials onto bins of data. If you want to have the ends tied together, you impose contraints and thus created a spline.
We can further constrain the model, with adding derivatives of the functions, hence the first derivative and the second derivative (where it in this instance created linear splines, that is because we have added \(d-1\) i.e. 2 derivatives, if the function was to the power of 4, then we should have imposed 3 derivatives to achieve linearity in the splines.)
Hence, the linear spline can be defined by: It is piecewise splines of degree-d polynomials, with continuity in derivatives up to degree \(d-1\) at each knot.
Hence we have the following constraints:
- Continuity
- Derivatives
2.1.3.3 Choosing the number and location of the Knots
Choosing amount of knots? One may ask themself, how many degrees of freedom do you want to include in the model?
Amount of knots is therefore corresponding to amount of degrees of freedom.
We can let software estimate the best amount and the best locations. Here one can use:
- In-sample performance
- Out-of-sample performance, e.g. with CV, perhaps extent to K folds with K tests, to ensure, that each variable has been held out once.
This can be followed by visualizing the MSE for the different simulations with different amount of knots.
2.1.3.4 Degrees of freedom
You count the amount of coefficients in the piecewise polynomial. Then you deduct df as you impose restrictions.
e.g. with one knot, you impose continuity, then you deduct one 1 df. for each derivative that we impose we can subtract one df.
Therefore the example above: 8 df in the beginning less 1 for the cut and two for the two derivatives. Hence, we end up with 5 df.
2.1.3.5 Basis splines vs. natural splines
We see that in polynomial regression and splines we are able to apply the finction bs()
and ns()
, this stands for basis splines and natural splines. In principle they are very similar, although the natural spline just have an additional constraint in the regions below and above the first and last knot, where it is linear.
Notice, that a polynomial and step functions are basically basis functions
Natiral splies are often beneficiary as you often don’t have much information in these regions, hence a normal polynomial (basis functions) will become very wiggly, hence we seek for linearity here.
2.1.3.6 Comparison with Polynomial Regression
With regression splines we are able to introduce knots that account for variance as it slightly resets the model in each section, hence we can fit the model to the data without having to impose as much complexity as we would in normal polynomial regression.
Hence one often observes that regression splines have more stability than polynomial regression.
2.1.4 Smoothing Splines
In prediction this is slightly better than basic functions and natural splines.
This is basically attempting to find a model that captures the variance by a smoothing line. Doing so, we fit a very flexible model and impose restrictions upon this, to achieve a shrunken model, just as with Lasso and Ridge Regression. Thus, the smoothing (imposing restrictions) deals with overfitting.
Also as we havee discovered previously, that degrees of freedom is equivilant with amount of knots, e.g., in polynomial splines, then three knots in a cubic function leads to 6 degrees of freedom, hence an smooth spline with df = 6.8 can be said to approximately have 3 knots, but we will never really know.
This can be defined as a cubic spline with knot at every unique value of \(x_i\)
Hence we have the following model:
\[\begin{equation} RSS=\sum_{i=1}^n\left(y_i-g\left(x_i\right)\right)^{^2}+\lambda\int g''\left(t\right)^{^2}dt \tag{2.3} \end{equation}\]
i.e. Loss + Penalty
Recall that the penalty controls the curvature of the function
Where:
- We define model g(x)
- \((y_i-g(x_i))^{^2}\) = the loss, meaning the difference between the fitted model and the actual y’s
- \(\lambda\) = the tuning parameter, hence the restriction that we want to impose. If lambda is low, then much flexibility, if lambda is high, then low flexibility. Hence, controls the bias variance tradeoff.
- \(\int g''\left(t\right)^{^2}dt\) = a measure of how much \(g'(x)\) changes oer time. Hence, the higher we set \(\lambda\) the more imposed restrictions, meaning the smoother the model, as lambda gets closer to infinity, the model becomes linear.
2.1.4.1 Choosing optimal tuning parameter
The analytic LOOCV can be calculated, the procedure appears to be the same as for lasso and ridge regression. The book (page 279) describes this a bit in details. However it says that software is able to do this.
Basically what is done, is LOOCV and simulating different tuning parameters to assess what model that performs the best.
Notice, that in R we are not working with \(\lambda\), but we can control the df associated with lambda. Or we can just let the model choose.
With this, degrees of freedom is not the same as we are used to. This creates sparsity as we know from regularization.
We are not expected to explain this, but we should be able to interpret the results
2.1.5 Local Regression
This i basically fitting a linear regression to each x, where s observations are included in the fitting procedure. Thus, one creates several fits, that are based on the observations weighted, where observations close to \(x_0\) (the center of the regression) are given the highest weight and then the weight is gradually decreasing.
This is often really good when you have outliers, as you define how big a neighborhood you want to evaluate (also called the span, e.g. span of 0.5 = 50% of the observations).
This can be visualized with:

Figure 2.2: Local regression
Doing local regression has the following procedure (algorithm)
- Gather the fraction \(s = k/n\) of training points whose \(x_i\) are closest to \(x_0\).
- Assign a weight \(K_{i0} = K(x_i, x_0)\) to each point in this neighborhood, so that the point furthest from x0 has weight zero, and the closest has the highest weight. All but these k nearest neighbors get weight zero.
- Fit a weighted least squares regression of the \(y_i\) on the \(x_i\) using the aforementioned weights, by finding \(\hat\beta_0\) and \(\hat\beta_1\) that minimize
\[\begin{equation} \sum_{i=1}^nK_{i0}\left(y_i-\beta_0-\beta_1x_i\right)^{^2} \tag{2.4} \end{equation}\]
- The fitted value at \(x_0\) is given by \(\hat{f}(x_0)=\hat\beta_0+\hat\beta_1x_0\)
Where we see how the model is
2.1.6 Generalized Additive Models
This can naturally both be applied in regression and classification problems, futher elaborated in the following.
It is called generalized, as the dependent variable can be both continuous (e.g. Gaussian) and categorical (e.g., binomial, Poisson, or other distributions) distributed
Additive = the model is adding different polynomials of the IDV toghether. Notice, as the model is additive, it does not account for interactions, then you have to specify the interactions.
Thus GAM is merely an approach to make a model, where we include the posibility of having non linear components. Hence we include more complex model (with the ability to trail the observations more than linear models).
But the advantage of linear regressions, are that we are able to quickly deduct the effects the variables. Although we dont always have a linear relationship, hence you can be forced to choose a more complex model. (see the R file “GAMs with discussion R”).
We have previously worked with non parametric models (e.g., KNN regression). GAM is in between linear regression and non parametric models.
That is the beuty of GAMs, as we preserve the ability of having transparancy in the model, despite it coming at a cost of worse prediction power than neural networks, but at such complex models, you are not able to deduct how the variables are interrelated, you can only say which are important and which are not.
What are the assumptions for GAMS??
2.1.6.1 Feature Selection
There are naturally a bunch of different approaches, we either manually assess this, running an automatic feature selection algorithm or both, see examples below. Also the casestudy in section 2.5 provides an example of all approaches to assess the features.
Feature selection approaches includes:
Generally looking at the variables “one by one,” to understand what features are important and to figure out how they contribute towards solving the problem.
Looking at the correlation matrix: If we are working with a model which assumes a linear relationship between the dependent and the independent variables, corr matrix can help us come up with an initial list of variable importance. However, corr matrix also works as a “rough informative tool” for nonlinear modelling.
Running automatic feature selection algorithms. Functions in R include, among others:
- c1. regsubsets() function in “leaps” library (presented in ISL, p. 244); used to select the best size model that contains a given number of predictors, where best is quantified using Residual Sum of Squares (RSS). Although regsubsets() is based on testing linear models, it works as a “rough” list for nonlinear models.
- c2. step.Gam() function in “gam” library for stepwise selection of variables in GAM models. This is useful when the number of predictors is not very high.
- c3. advanced feature selection methods based on other data-mining techniques, including but not only: random forests, Bayesian Networks, Neural Networks, or other. Notice, that we can use very complex methods of feature selection, and then construct a model that is more transparant, for instance GAMs
We will end up with candidate models, during the case study a new approach was introduced, which was quite nice, this has the following procedure:
- Train the candidate model
- Do cross validation of k partitions on the train data.
- Select the best model.
- Run this on the test data. NB: Ideally we don’t want to run more than one model on the test data, as that will lead to data leakage and we may choose the best model based on the best fit to the test data. That is wrong.
2.1.6.2 GAM for regression problems
Now we move beyond being constrained to only one predictor variable, hence GAM can be seen more as an extension of multiple linear regression. Hence GAM is a combination of different functions, where they are each fitted while holding the other variables fixed. GAM can consist of any different non-linear model, e.g., we can just as well use local regression, polynomial regression, or any combination of the approaches seen above in this subsection.
See section @ref(fig:GAMPlotLab7.8.3) for explanation of interpretation of the plots.
Disadvantages of GAM:
- The fitting procedure holds the other variables fixed, hence it does not count for interactions. Therefore, one may manually construct interaction variables to account for this, just like in mulitple linear regression.
- Prediction wise it is not competitive with Neural Networks and Support Vector Machines.
Advantages of GAM:
- Allowing to fit non linear function for j variables (\(f_j\))
- Has potential of making more accurate predictions
- As the model is additive (meaning that each function is fitted holding the other variables fixed) we are still able to make inference, e.g., assessing how one variable affects the y variable.
- Smoothness of function \(f_j\) can be summarized with degrees of freedom.
- Often applied when aiming for explanatory analysis (instead of prediction)
2.1.6.2.1 Interpretation of output
Remember GAM is a construct of M different models, hence there are different areas one must be aware of.
With smooth variables:
summary()
will output approximate signifcance of smooth terms,- Here we can assess if there is statistical evidence for including the smoothing on the variable. But notice, that this is merely the p-value.
- We can also assess the
edf
= estimated degrees of freedom, hence the flexibility of each smoothed variable.
With linear variables we get Parametric coefficients (from
summary()
): These are what we know from the linear scenario. It assigns a p-value and coefficients to the linear variables. Interpretation wise, we should not pay too much attention to this, although it is always good to have the p-values in mind.With
summary()
we also get the traditional information, such as \(R^2\) and adjusted. This can also be found in the documentation.
2.1.6.3 GAM for classification problems
When y is qualitative (categorical), GAM can also be applied in the logistical form.
As discovered in the classification section, we can apply logits (log of odds) and odds, see material from first semester.
The same advantages and disadvantages as in the prior section applies.
2.1.7 Model assessment
The same applies as in ML1, thus I refer to ML1.
Also one must always assess the residuals, e.g., for normality.
2.2 Lecture notes
Talking about polynomials and what they are, e.g., can be parabula, etc.
With splines we set polynomials in each of the X regions. Where just polynomial regression is fitted to the whole dataset, and not just in regions.
By default poly() will make orthogonal polynomials. Meaning that it tries to create orthogonal terms, where the polynomials are orthogonal (not related to each other). As that is default, then we have to define, that we want to use raw data, hence raw = TRUE, hence we will get the regular polynomials of the dataa.
See notes in her R file.
General Wrap-Up
- The approaches are similar
- Must be aware of why the models are used
- Try them out
- Assess how they look
2.3 Lab section
Loading the data that will be used throughout the lab section.
library(ISLR)
attach(Wage)
<- Wage df
2.3.1 Polynomial Regression and Step Functions
2.3.1.1 Continous model
Fitting the model:
<- lm(wage ~ poly(age,4) #Orthogonal polynomials
fit data = df)
,<- lm(wage ~ poly(age,4,raw = TRUE) #Orthogonal polynomials
fit2 data = df) ,
Note: poly()
returns orthogonal polynomials, which is some linear combination of the variables to the d power. See the following two examples when using orthogonal and normal polynomials:
{print("Orthogonal")
cbind(df$age,poly(x = df$age,degree = 4))[1:5,] %>% print()
print("Regular")
cbind(df$age,poly(x = df$age,degree = 4,raw = TRUE))[1:5,] %>% print()
}
## [1] "Orthogonal"
## 1 2 3 4
## [1,] 18 -0.0386247992 0.055908727 -0.0717405794 0.08672985
## [2,] 24 -0.0291326034 0.026298066 -0.0145499511 -0.00259928
## [3,] 45 0.0040900817 -0.014506548 -0.0001331835 0.01448009
## [4,] 43 0.0009260164 -0.014831404 0.0045136682 0.01265751
## [5,] 50 0.0120002448 -0.009815846 -0.0111366263 0.01021146
## [1] "Regular"
## 1 2 3 4
## [1,] 18 18 324 5832 104976
## [2,] 24 24 576 13824 331776
## [3,] 45 45 2025 91125 4100625
## [4,] 43 43 1849 79507 3418801
## [5,] 50 50 2500 125000 6250000
In the end, it does not have a noticeable effect.
options(scipen = 5)
{coef(summary(fit)) %>% print()
coef(summary(fit2)) %>% print()
}
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 111.70361 0.7287409 153.283015 0.000000e+00
## poly(age, 4)1 447.06785 39.9147851 11.200558 1.484604e-28
## poly(age, 4)2 -478.31581 39.9147851 -11.983424 2.355831e-32
## poly(age, 4)3 125.52169 39.9147851 3.144742 1.678622e-03
## poly(age, 4)4 -77.91118 39.9147851 -1.951938 5.103865e-02
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -184.1541797743 60.04037718327 -3.067172 0.0021802539
## poly(age, 4, raw = TRUE)1 21.2455205321 5.88674824448 3.609042 0.0003123618
## poly(age, 4, raw = TRUE)2 -0.5638593126 0.20610825640 -2.735743 0.0062606446
## poly(age, 4, raw = TRUE)3 0.0068106877 0.00306593115 2.221409 0.0263977518
## poly(age, 4, raw = TRUE)4 -0.0000320383 0.00001641359 -1.951938 0.0510386498
Even though the coefficients are different and the p-values hereof, the fitted values will be indistinguishable (Hastie et al. 2013, 288). This is also shown later.
Alternatives to using poly()
??
We have two alternatives:
Using
I()
Using
cbind()
Using
I()
<- lm(wage ~ age + I(age^2) + I(age^3) + I(age^4) #Note that 'I()' is added
fit2a data = df)
,coef(fit2a)
## (Intercept) age I(age^2) I(age^3) I(age^4)
## -184.1541797743 21.2455205321 -0.5638593126 0.0068106877 -0.0000320383
Notice I()
as ‘^’ has another special meaning in formulas
Hence we see that the coefficients are the same.
- Using
cbind()
<- lm(wage ~ cbind(age,age^2,age^3,age^4)
fit2b data = df)
,coef(fit2b)
## (Intercept) cbind(age, age^2, age^3, age^4)age
## -184.1541797743 21.2455205321
## cbind(age, age^2, age^3, age^4) cbind(age, age^2, age^3, age^4)
## -0.5638593126 0.0068106877
## cbind(age, age^2, age^3, age^4)
## -0.0000320383
We see that we are now able to use ‘^’ within the cbind()
.
proceeding with the lab sections. We can now present a grid of values for age, at which we want predictions and then call the predict()
and also plot the standard errors.
<- range(df$age) #The min and max
agelims <- seq(from = agelims[1],to = agelims[2]) #Creating a counter within the range
age.grid <- predict(object = fit
preds newdata = list(age = age.grid) #Creating a list with the counter named age, so it fits the IV naming
,se.fit = TRUE)
,<- cbind(preds$fit + 2*preds$se.fit #Upper band
se.bands $fit-2*preds$se.fit) #Lower band ,preds
Notice that 2 SE (2 sd) = 95%, hence we expect to contain 95% of the data within confidence levels.
Now we can plot the data
plot(x = df$age,y = df$wage
xlim = agelims
,cex = 0.5 #Size of dots
,col = "darkgrey")
,title("Degree-4 Polynomial",outer = TRUE)
lines(x = age.grid,y = preds$fit
lwd = 2
,col = "blue") ,
Comparison between raw polynomials and orthogonal polynomials
With the following we see that the difference of the fitted values are practically 0.
<- predict(object = fit2
preds2 newdata = list(age = age.grid)
,se.fit = TRUE)
,max(abs(preds$fit-preds2$fit))
## [1] 7.81597e-11
In terms of predictions, the two approaches are more or less the same, although the orthogonal polynomials removes some effect of collinearity.
Assessing what polynomial to include
Now we can compare models with different orthogonal polynomials. Using ANOVA, which compare the RSS to see if the decrease in RSS is significant.
.1 <- lm(wage~age,data=df)
fit.2 <- lm(wage~poly(df$age,2),data=df)
fit.3 <- lm(wage~poly(df$age,3),data=df)
fit.4 <- lm(wage~poly(df$age,4),data=df)
fit.5 <- lm(wage~poly(df$age,5),data=df)
fitanova(fit.1,fit.2,fit.3,fit.4,fit.5)
Res.Df | RSS | Df | Sum of Sq | F | Pr(>F) |
---|---|---|---|---|---|
2998 | 5022216 | NA | NA | NA | NA |
2997 | 4793430 | 1 | 228786.010 | 143.5931074 | 0.0000000 |
2996 | 4777674 | 1 | 15755.694 | 9.8887559 | 0.0016792 |
2995 | 4771604 | 1 | 6070.152 | 3.8098134 | 0.0510462 |
2994 | 4770322 | 1 | 1282.563 | 0.8049758 | 0.3696820 |
Note,
- the anova compares the sum of resduals squared.
- the anova follows an F distribution, hence we could apple the critical values
based on the anova we see that the errors change significantly until the 5th degree, hence the decision should be to take the model with order 4 of polynomials.
Notice, that the model will never become worse in sample when complexity is added, as we fit the model more to the data.
Alternative
We could also have obtained the same output using coef()
instead of the anove, where we see that teh p-values are the same, also the squared value of t (\(t^2=F\)).
coef(summary(fit.5))
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 111.70361 0.7287647 153.2780243 0.000000e+00
## poly(df$age, 5)1 447.06785 39.9160847 11.2001930 1.491111e-28
## poly(df$age, 5)2 -478.31581 39.9160847 -11.9830341 2.367734e-32
## poly(df$age, 5)3 125.52169 39.9160847 3.1446392 1.679213e-03
## poly(df$age, 5)4 -77.91118 39.9160847 -1.9518743 5.104623e-02
## poly(df$age, 5)5 -35.81289 39.9160847 -0.8972045 3.696820e-01
Notice: this is only an alternative when we exclusively have polynomials in the model!
Using ANOVA to assess for best models
The following is another example of using ANOVA where different variables are used:
And recall, that we should never apply p-values of variables in a model, to decide which that should be included.
NOTE; this approach only works when the models are nested, meanin that the overall variables are the same, hence M2 could not have regian for instance, they all need to have the same overall variable
.1 = lm(wage~education +age ,data=df)
fit.2 = lm(wage~education +poly(age ,2) ,data=df)
fit.3 = lm(wage~education +poly(age ,3) ,data=df)
fitanova(fit.1,fit.2,fit.3)
Res.Df | RSS | Df | Sum of Sq | F | Pr(>F) |
---|---|---|---|---|---|
2994 | 3867992 | NA | NA | NA | NA |
2993 | 3725395 | 1 | 142597.10 | 114.696898 | 0.0000000 |
2992 | 3719809 | 1 | 5586.66 | 4.493588 | 0.0341043 |
What are we looking for?
- What model lowers the RSS significantly.
Using CV
We could also have chosen the order of polynomials using cross validation.
library(boot)
set.seed(19)
= rep (0, 5)
cv.error for (i in 1:5)
{=glm(wage~poly(age,i),data=Wage) # notice glm here in conjunction with cv.glm function
fit.i=cv.glm(Wage, fit.i, K=10)$delta[1] #K fold CV
cv.error[i]
}# the CV errors of the five polynomials models cv.error
## [1] 1674.979 1600.176 1595.913 1594.003 1596.304
Concl: A 5 order model is not justified as the it starts increasing. Also we see that the 4th order d, is the best, which corresponds with previous findings.
2.3.1.2 Logarithmic model
The procedure is per se the same, but now we are working with a probabilistic model instead of. Hence the outcome must be binary. Thus, it is decided to predict whether a persons wage is higher or lower than 250.
<- glm(I(wage > 250) ~ poly(age,4) #Note the use of I()
fit data = df
,family = binomial) ,
Note, that again I()
is used, where the expression is evaluated on the fly, one could naturally also had made a vector of the classes.
Note, by default glm()
will transform TRUE and FALSE to respectively 1 and 0.
Now we can make predictions, which are logits, these can be used for much, hence later we will transform them into probabilities.
= predict(fit
preds newdata = list(age=age.grid)
,se.fit = TRUE)
,# We could have added type = response to get probabilities
$fit[1:10]#First 10 logits preds
## 1 2 3 4 5 6 7
## -18.438190 -16.395452 -14.560646 -12.919746 -11.459196 -10.165904 -9.027249
## 8 9 10
## -8.031077 -7.165700 -6.419899
To make confidence intevals for Pr(Y = 1|X), i.e.
\[\begin{equation} Pr(Y=1|X)= \frac{exp(X\beta)}{1+exp(X\beta)} \tag{2.5} \end{equation}\]
Where \(X\beta\) can be explained by:
\[\begin{equation} log(\frac{Pr(Y=1|X)}{1-Pr(Y=1|X)})=X\beta \tag{2.6} \end{equation}\]
Hence we must first calculate \(X\beta\) to find Pr(Y=1|X).
#Making Prbabilities
= exp(preds$fit)/(1+exp(preds$fit)) #See equation above
pfit
#X beta
= cbind(preds$fit+2*preds$se.fit #Upper level
se.bands.logit $fit-2*preds$se.fit) #Lower level
,preds
#Pr(Y = 1|X)
= exp(se.bands.logit)/(1+exp(se.bands.logit)) se.bands
Remember that 2 SE = 95%, thus with the confidence levels we expect to contain 95% of the data.
Notice, that the posterior probabilities could also have been found by using predict()
, see the following:
= predict (fit
preds newdata = list(age = age.grid)
,type = "response" #Getting probabilities instead of logits
,se.fit = TRUE) ,
NOTICE: for some reason this will lead to wrong confidence intervals (Hastie et al. 2013, 292), thus we prefer the regular approach, as shown before
Now we can make the right hand plot, so we can compare with continous result.
par(mfrow = c(1,2)
mar = c(3,4.5,4.5,1.1) #Controls the margins
,oma = c(0,0,4,0)) #Controls the margins
,#Copy from earlier to combine plots
<- lm(wage ~ poly(age,4) #Orthogonal polynomials
fit data = df)
,<- predict(object = fit
preds newdata = list(age = age.grid)
,se.fit = TRUE)
,plot(x = df$age,y = df$wage
xlim = agelims
,cex = 0.5 #Size of dots
,col = "darkgrey")
,title("Degree-4 Polynomial",outer = TRUE)
lines(x = age.grid,y = preds$fit
lwd = 2
,col = "blue")
,
#The new plot
plot(x = age,y = I(wage >250)
xlim = agelims
,type ="n"
,ylim = c(0,.2))
,
points(jitter(age)
I((wage>250)/5)
,cex = .5
,pch = "|"
,col = "darkgrey")
,
lines(x = age.grid,y = pfit
lwd = 2
,col= "blue")
,
matlines(x = age.grid
y = se.bands
,lwd = 1
,col = "blue"
,lty = 3) ,
We see on the right hand panel that the all the observations that have a wage above 250 is in the top and all those below hare in the bottom of the visualization. Although at the tail, we aren’t able to conclude much, as confidence interval is really high, hence it can both be high and low earners.
jitter()
is merely an approach to avoid observations to overlap each other.
2.3.1.3 Step function
To fit the step function we must do:
- Define the cuts,
cut()
is able to automatically pick cutpoints. One could also usebreak()
to define where the cuts should be. - Train the model. Notice, that
lm()
will automatically create dummy variables for the ranges.
table(cut(df$age,4)) %>% print()
{<- lm(wage ~ cut(df$age,4)
fit data = df)
,coef(summary(fit)) %>% print()}
##
## (17.9,33.5] (33.5,49] (49,64.5] (64.5,80.1]
## 750 1399 779 72
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 94.158392 1.476069 63.789970 0.000000e+00
## cut(df$age, 4)(33.5,49] 24.053491 1.829431 13.148074 1.982315e-38
## cut(df$age, 4)(49,64.5] 23.664559 2.067958 11.443444 1.040750e-29
## cut(df$age, 4)(64.5,80.1] 7.640592 4.987424 1.531972 1.256350e-01
We see that the p value of the cuts are significant, not that we can use the p-values for much.
Notice, that the first range is the base level, thus it is also left out. We can then use the intercept as the average wage for all in the range of up to 33.5 years.
Hence for a 40 year old person, the model will say that he has an wage of 94 + 24 = 118
rm(list = ls())
2.3.2 Splines
The different approaches to splines are presented in the following.
2.3.2.1 Basis Function Splines
library(ISLR)
<- Wage
df library(splines)
<- range(df$age) #The min and max
agelims <- seq(from = agelims[1],to = agelims[2]) #Creating a counter within the range age.grid
The splines library contain what we need. We introduce the following functions:
bs()
: Basis functions for splines. Generates entire matrix of basis functions for splines with the specified set of knots.ns()
: Natural splines.smooth.spline()
: Used when fitting smoothing splines.loess()
: When fitting local regression.
Note, that by default the splines will be choosen to be 3, this can also be found in the function documentation.
par(mfrow = c(1,1),oma = c(0,0,0,0))
<- lm(wage ~ bs(age,knots = c(25,40,60)) #we just chose the knots randomly
fit.bs data = df)
,<- predict(fit.bs
pred.bs newdata = list(age = age.grid)
,se.fit = TRUE)
,plot(df$age
$wage
,dfcol = "gray")
,
lines(age.grid
$fit
,pred.bslwd = 2)
,
lines(age.grid
$fit+2*pred.bs$se
,pred.bslty = "dashed")
,
lines(age.grid
$fit-2*pred.bs$se
,pred.bslty = "dashed")
,
title("Splines - Basis Functions")
We see that the splines have been fitted to the data and notice that the tails have wider confidence intervals.
We can get the amount of degrees of freedom by calling the dim()
function.
{#Specifying the knots
dim(bs(age,knots = c(25,40,60))) %>% print()
#df can be specified instead of knots
dim(bs(age,df = 6)) %>% print()
}
## [1] 3000 6
## [1] 3000 6
We see that the two alternatives produce the same results.
Notice, that there are packages that will optimize the amount of knots.
We can assess where the bs()
placed the knots, by calling the attr()
.
attr(bs(age,df=6),"knots")
## 25% 50% 75%
## 33.75 42.00 51.00
In this case, R chose the 25%, 50% and 75% quantiles.
2.3.2.2 Natural Splines
It similar to bs(), but it has an additional condition. I did not really get it.
The fitting procedure is the same, but now we just use ns()
instead of bs()
.
= lm(wage ~ ns(age
fit.ns df = 4 #Note, as with bs() we could have specified the knots instead of.
,
)data = df)
,
= predict(fit.ns
pred.ns newdata = list(age=age.grid)
,se.fit = TRUE)
,
#Copy of old plot
plot(df$age
$wage
,dfcol = "gray")
,
lines(age.grid
$fit
,pred.bslwd = 2)
,
lines(age.grid
$fit+2*pred.bs$se
,pred.bslty = "dashed")
,
lines(age.grid
$fit-2*pred.bs$se
,pred.bslty = "dashed")
,
#Adding natural splines
lines(age.grid
$fit
,pred.nscol ="red"
,lwd =2)
,
title("Splines - Basis Functions + Natural Splines")
legend("topright",c("Basis","Natural"),lty = 1,col = c("Black","Red"),cex = 0.6)
2.3.2.3 Smooth Splines
As we discovered in the first part of the chapter, it sets a knot at each observation, and then we will penalize the function with a lamda (\(\lambda\)), to avoid overfitting.
The code show the procedure.
#Hardcoding degrees of freedom
<- smooth.spline(x = df$age,y = df$wage
fit.ss df = 16) #Remember that we must impose constraints
,
#Choosing smoothing param with CV
<- smooth.spline (df$age
fit.ss2 $wage
,dfcv = TRUE) #we choose cv instead of fixed amount of df
,$df fit.ss2
## [1] 6.794596
We get sparsity hence we have degrees of freedom of 6.8. That is due to the tuning parameter which was found by the cross validation proces. We can find the specific lambda value with the following:
$lambda fit.ss2
## [1] 0.02792303
plot(age,wage
xlim = agelims
,cex = .5
,col = "darkgrey")
,title("Smoothing Spline")
lines(fit.ss,col = "red",lwd = 2)
lines(fit.ss2,col = "blue",lwd =2)
legend("topright",legend = c("16 DF","6.8 DF")
col = c("red","blue")
,lty = 1
,lwd = 2
,cex = .8) ,
As expected, we see that the more complex model (highest amount of df) is the more flexible model.
Note: tuning parameter = \(\lambda\), where the CV seeks to choose the parameter that leads to the lowest error and return the df that leads to this level.
2.3.2.4 Local Regression
Recall that local regression makes a linear regression for the observations that are close to the observation under evaluation (\(x_0\)).
Thus we have to specify the span, the larger the span the smoother the fit, as we will include more observations.
NB: locfit
library can also be used for fitting local regress
plot(x = df$age,y = df$wage
xlim = agelims
,cex = .5
,col = "darkgrey")
,title ("Local Regression")
<- loess(wage ~ age
fit.lr span = .2 #Degree of smoothing / neighborhood to be included
,data = df)
,
<- loess(wage ~ age
fit.lr2 span = .5 #Degree of smoothing / neighborhood to be included
,data = df)
,
lines(x = age.grid,y = predict(object = fit.lr,newdata = data.frame(age=age.grid))
col = "red"
,lwd = 2)
,
lines(x = age.grid,y = predict(object = fit.lr2,newdata = data.frame(age=age.grid))
col =" blue"
,lwd = 2)
,
legend(x = "topright"
legend = c("Span = 0.2","Span = 0.5")
,col=c("red","blue")
,lty = 1
,lwd = 2
,cex = .8) ,
From the plot we also see that the model with the largest span has the smoothest fit.
2.3.3 GAMs
We want to predict wage, where year, age and education (as categorical) as predictors.
2.3.3.1 With only natural splines
According to the Hastie et al. (2013), 294, this is just a bunch of linear functions, hence we can merely apply lm()
, see the following.
<- lm(wage ~ ns(year,df = 4) + ns(age,df = 5) + education #NOTICE, that we just use lm()
gam.m1 data = df)
,summary(gam.m1)
##
## Call:
## lm(formula = wage ~ ns(year, df = 4) + ns(age, df = 5) + education,
## data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -120.513 -19.608 -3.583 14.112 214.535
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 46.949 4.704 9.980 < 2e-16 ***
## ns(year, df = 4)1 8.625 3.466 2.488 0.01289 *
## ns(year, df = 4)2 3.762 2.959 1.271 0.20369
## ns(year, df = 4)3 8.127 4.211 1.930 0.05375 .
## ns(year, df = 4)4 6.806 2.397 2.840 0.00455 **
## ns(age, df = 5)1 45.170 4.193 10.771 < 2e-16 ***
## ns(age, df = 5)2 38.450 5.076 7.575 4.78e-14 ***
## ns(age, df = 5)3 34.239 4.383 7.813 7.69e-15 ***
## ns(age, df = 5)4 48.678 10.572 4.605 4.31e-06 ***
## ns(age, df = 5)5 6.557 8.367 0.784 0.43328
## education2. HS Grad 10.983 2.430 4.520 6.43e-06 ***
## education3. Some College 23.473 2.562 9.163 < 2e-16 ***
## education4. College Grad 38.314 2.547 15.042 < 2e-16 ***
## education5. Advanced Degree 62.554 2.761 22.654 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 35.16 on 2986 degrees of freedom
## Multiple R-squared: 0.293, Adjusted R-squared: 0.2899
## F-statistic: 95.2 on 13 and 2986 DF, p-value: < 2.2e-16
From the summary we see the variables that have been created and also the factor levels for education.
Again, we don’t have to interprete the coefficients, we just need to look at the shape.
2.3.3.2 With different splines
Now we have to apply the package gam
.
This is the best approach.
library(gam)
We can also construct a GAM model, that contains smoothing splines, that is done by calling s()
. Where year and age will be included with up to 4 and 5 degrees of freedom.
<- gam(wage ~ s(year,df = 4) + s(age,df = 5) + education
gam.m3 data = df) ,
Remember, that GAM fits each variable while holding all other variables fixed. The actual fitting procedure is called backfitting, and fits variables by repeatedly updating the fit for each predictor (Hastie et al. 2013, 284–85). Hence, we create plots to interprete how.
par(mfrow = c(1,3))
plot(gam.m3 #Note, automatically identifies the GAM object, hence plots for each variable
se = TRUE
,col ="blue") ,

(#fig:GAMPlotLab7.8.3)GAM plot and intepretation
Interpreting the plot: Recall that the plots assumes that we hold the other variables fixed, hence we see the following:
- Left: We see that holding education and age fixed, the wage tends to increase over the years, that is quite natural, e.g., because of inflation.
- Center: Holding year and education fixed, we see that the wage tends to be highest in the middle region around 40-45 years of age. That is also quite intuitive that the wage first increase and then decreasing as the person gets closer to the retirement age.
- Right: Holding year and age fixed, we see that the higher education you have, the higher will your wage be.
par(mfrow = c(1,3))
plot.Gam(gam.m1
se = TRUE
,col = "red") ,

Figure 2.3: GAM of natural splines
Notice, that this plot looks very similar to @(fig:GAMPlotLab7.8.3).
This command could naturally also be used for the other GAM object, it is just that plot()
does not automatically identify, that it is in fact intended to be interpretet as a GAM.
2.3.3.3 But what variables to include?
It looks as is year is rather linear. To make this assessment, we can apply an ANOVA test of the different combinations. Hence:
Note, the first model is nested in the second model (has the same variables), hence we can use ANOVA
#Excluding year
<- gam(wage ~ s(age,df = 5) + education
gam.m1 data = df)
,
#Including year, but as a linear
<- gam(wage ~ year + s(age,df = 5) + education
gam.m2 data = df)
,
anova(gam.m1,gam.m2,gam.m3,test = "F")
Resid. Df | Resid. Dev | Df | Deviance | F | Pr(>F) |
---|---|---|---|---|---|
2990 | 3711731 | NA | NA | NA | NA |
2989 | 3693842 | 1.000000 | 17889.243 | 14.477130 | 0.0001447 |
2986 | 3689770 | 2.999989 | 4071.134 | 1.098212 | 0.3485661 |
We see that performance is significantly better going from model 1 to model 2, but on a five percent level, we are able to say, that we don’t gain anything with the third model, which is most complex model.
Thus, the linear constellation of year, with polynomials on age + education as factors, appear to be the best performing model.
With this in mind, it is interesting to assess the summary of the complex model:
summary(gam.m3)
##
## Call: gam(formula = wage ~ s(year, df = 4) + s(age, df = 5) + education,
## data = df)
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -119.43 -19.70 -3.33 14.17 213.48
##
## (Dispersion Parameter for gaussian family taken to be 1235.69)
##
## Null Deviance: 5222086 on 2999 degrees of freedom
## Residual Deviance: 3689770 on 2986 degrees of freedom
## AIC: 29887.75
##
## Number of Local Scoring Iterations: NA
##
## Anova for Parametric Effects
## Df Sum Sq Mean Sq F value Pr(>F)
## s(year, df = 4) 1 27162 27162 21.981 0.000002877 ***
## s(age, df = 5) 1 195338 195338 158.081 < 2.2e-16 ***
## education 4 1069726 267432 216.423 < 2.2e-16 ***
## Residuals 2986 3689770 1236
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Anova for Nonparametric Effects
## Npar Df Npar F Pr(F)
## (Intercept)
## s(year, df = 4) 3 1.086 0.3537
## s(age, df = 5) 4 32.380 <2e-16 ***
## education
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Looking at the section: “Anova for Nonparametric Effects,” we see that the smoothing spline on year, is not significant, hence it supports the conclusion from above, that we are better off, including the year as a linear variable.
Now we can make predictions.
#Predictions with linear year, non linear age and factors of education
<-predict(gam.m2
preds newdata = df) ,
2.3.3.4 GAM with local regression
We are also able to make GAM on other building blocks, for instance local regression, that will be shown in the following
For some reason the following cant be run.
#GAM with local regression
<- gam(wage ~ s(df$year,df = 4) + lo(df$age,span = 0.7) + education
gam.lo data = df)
,
# plot.Gam(gam.lo #For some reason it cant be plotted
# ,se = TRUE
# ,col = "green")
Making interactions in the local regression:
<- gam(wage ~ lo(year,age,span = 0.5) + education
gam.lo.i data = df)
,
library(akima)
plot(gam.lo.i)
2.3.3.5 Logistic Regression
Plotting logistic regression GAM, here we can apply I()
as previous used, to make the expression on the fly.
<- gam(I(wage > 250) ~ year + s(age,df = 5) + education
gam.lr family = binomial
,data = df)
,
par(mfrow =c(1,3))
plot(gam.lr,se=T,col =" green ")
One could interprete the plot and assess each window to see how the variable influence the decision wether the observation is above or below. Remember that the outcome can be seen as probabilities, these can also be plotted to be shown the spread:
par(mfrow = c(1,1))
plot(gam.lr$fitted.values)
From th plot, we see that there is a tendency that the lower the education the lower the wage, the following table show how the high earners are distributed.
table(education,I(wage > 250))
##
## education FALSE TRUE
## 1. < HS Grad 268 0
## 2. HS Grad 966 5
## 3. Some College 643 7
## 4. College Grad 663 22
## 5. Advanced Degree 381 45
We see that there are no people with less than high school degree that earns more than 250.
To get more sensible result, we can remove the observations with a low degree, this will also show a more sensible result for the other degrees, see the following.
= gam(I(wage > 250) ~ year + s(age,df = 5) + education
gam.lr.s family = binomial
,data = df
,subset = (education != "1. < HS Grad")) #removing people in the lowest group of education.
,par(mfrow = c(1,3))
plot(gam.lr.s
se = TRUE #Standard errors
,col =" green ") ,
Do we need a nonlinear term for year? Use anova for comparing the previous model with a model that includes a smooth spline of year with df=4
We can do an ANOVA, but please notice, we use Chi Square now.
= gam(I(wage>250) ~ s(year, 4) + s(age,5) + education,family=binomial,data = df,subset=(education!="1. < HS Grad"))
gam.y.s anova(gam.lr.s,gam.y.s, test="Chisq") # Chi-square test as Dep variable is categorical
Resid. Df | Resid. Dev | Df | Deviance | Pr(>Chi) |
---|---|---|---|---|
2722 | 602.4588 | NA | NA | NA |
2719 | 601.5718 | 2.999982 | 0.8869514 | 0.8285731 |
We do not need a non-linear term for year.
2.4 Exercises
2.4.1 Exercise 6
Purpose, to practice polynomial regression and step functions
library(ISLR)
<- Wage df
2.4.1.1 6.a Polynomial Regression
We use orthogonal polynomials in the modeling process as we know that these are slightly better than raw polynomials due to the fact that this tend to avoid collinearity.
Training the model
library(boot) #For the cv.glm() function
set.seed(1337)
= rep (0,10)
cv.error for (i in seq(from = 1,to = length(cv.error),by = 1)) {
#Training
<- glm(wage ~ poly(age,i),data = df) # notice glm here in conjunction with cv.glm function
fit.i
#Performing cross validation
<- cv.glm(data = df,glmfit = fit.i,K = 10)$delta[1] #K fold CV, delta = prediction errer i.e. MSE
cv.error[i]
}
#Printing the
# MSE the CV errors of the five polynomials models cv.error
## [1] 1675.056 1600.832 1594.505 1594.872 1594.608 1593.053 1594.069 1596.428
## [9] 1593.284 1595.530
The vector above are all of the prediction errors computed in the loop.
which.min(cv.error)
## [1] 6
We see that the fifth prediction appear to yield the lowest MSE, but is it significantly different than e.g. forth or third order polynomial?
.1 <- glm(wage ~ poly(age,1),data = df)
fit.2 <- glm(wage ~ poly(age,2),data = df)
fit.3 <- glm(wage ~ poly(age,3),data = df)
fit.4 <- glm(wage ~ poly(age,4),data = df)
fit.5 <- glm(wage ~ poly(age,5),data = df)
fitanova(fit.1,fit.2,fit.3,fit.4,fit.5,test = "F")
Resid. Df | Resid. Dev | Df | Deviance | F | Pr(>F) |
---|---|---|---|---|---|
2998 | 5022216 | NA | NA | NA | NA |
2997 | 4793430 | 1 | 228786.010 | 143.5931074 | 0.0000000 |
2996 | 4777674 | 1 | 15755.694 | 9.8887559 | 0.0016792 |
2995 | 4771604 | 1 | 6070.152 | 3.8098134 | 0.0510462 |
2994 | 4770322 | 1 | 1282.563 | 0.8049758 | 0.3696820 |
We can only use these as the models are nested as the variables are the same
Using the F test, we see that on a five percent level the 4th polynomial is not justified, but close to. This argues that we should select the third order of polynomials as that is the last where there is statistical evidence for lowering the residuals.
Thus we select a model with three polynomials. Plotting the errors, we also see that there does not happen much after the third polynomial. We also plotted the standard errors and thus we are able to select based on this.
plot(cv.error,type = "b")
= min(cv.error)
min.point = sd(cv.error)
sd.points abline(h=min.point + 0.2 * sd.points, col="red", lty="dashed") #0.2 is just a rule of thumb, could be anything
abline(h=min.point - 0.2 * sd.points, col="red", lty="dashed")
legend("topright", "0.2-standard deviation lines", lty="dashed", col="red")
Thus, there is even more information supporting selecting three degrees of freedom.
Plotting the polynomial regression
This is done with the following procedure:
- Make a grid counting IDV (Age)
- Make predictions
- Make a plot with the variables
- Fit a line onto the predictions
- Perhaps calculate confidence levels and plot these
#Grid of X
<- seq(from = min(df$age),to = max(df$age),by = 1)
age.grid
#Predictions
<- predict(object = fit.3
preds newdata = list(age = age.grid) #Renaming age.grid to age
,se.fit = TRUE) #We want to produce confidence levels
,
#Plotting
plot(x = df$age,y = df$wage,col = "darkgrey",cex = 0.8)
grid()
lines(x = age.grid #We need to define the grid, otherwise the fit will not be alligned with the data
y = preds$fit
,col = "red")
,title("Polynomial of 3rd order")
2.4.1.2 6.b Step function
<- 4
cuts
#Cutting the x variable
table(cut(df$age
breaks = cuts)) ,
##
## (17.9,33.5] (33.5,49] (49,64.5] (64.5,80.1]
## 750 1399 779 72
#' Note, this only shows where the cuts lie and how many there are in each
#Fitting the step function
<- lm(wage ~ cut(df$age,4)
fit.step data = df)
,
coef(summary(fit.step))
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 94.158392 1.476069 63.789970 0.000000e+00
## cut(df$age, 4)(33.5,49] 24.053491 1.829431 13.148074 1.982315e-38
## cut(df$age, 4)(49,64.5] 23.664559 2.067958 11.443444 1.040750e-29
## cut(df$age, 4)(64.5,80.1] 7.640592 4.987424 1.531972 1.256350e-01
We see that the the first cut (bin with people up to 33,5) have been left out. That is because they are contained in the intercept.
Now we can fit the step function
library(stats)
#Predictions
<- predict(object = fit.step
preds newdata = list(age = age.grid)) #Renaming age.grid to age
,
#Plotting
# plot(x = df$age,y = df$wage,col = "darkgrey",cex = 0.8)
# grid()
# lines(age.grid
# ,preds
# ,col = "red")
# title("Step function of 3rd order")
I need to check what she is doing, one could perhaps manually order the
2.4.2 Exercise 7
<- Wage df
Evaluating features other features to see how age respond hereon.
We can plot the variables agains each other, to see how they interact.
library(dplyr)
for (i in 1:10) {
plot(y = df$wage,x = df[,i],xlab = names(df)[i],ylab = "Wage")
grid()
names(df)[i] %>% title()
}
Looking at race, it appears as if there is some relationship between race and wage the same with maritial status. Region only has values in one category, jobclass appear to visually have different means. The same goes for health and health insurance. Naturally log of wage has a non linear relationship with wage. Although the variable is the same, thus it cant be used for much to predict wage levels.
Since all the variables of interest, and we haven’t worked with are all categorical, then we can’t really do any polynomial regression with the data, as they are all factors.
What one could do is a mutlivariate linear model with different factors, or step functions or perhaps GAM where a continous varaible with polynomials are included.
Therefore, I will not elaborate much more on this.
Ana made three different models, notice, that these are linear models, as the polynomial regression is not able to handle this.
= lm(wage ~ maritl, data = df)
fit1 deviance(fit1) # here deviance = RSS
## [1] 4858941
= lm(wage ~ jobclass, data = df)
fit2 deviance(fit2)
## [1] 4998547
= lm(wage ~ maritl + jobclass, data = df)
fit3 deviance(fit3)# Select model fit3 (smallest deviance)
## [1] 4654752
summary(fit3)
##
## Call:
## lm(formula = wage ~ maritl + jobclass, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -107.108 -22.689 -5.749 16.445 212.492
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 85.315 1.679 50.818 < 2e-16 ***
## maritl2. Married 25.356 1.776 14.279 < 2e-16 ***
## maritl3. Widowed 8.137 9.178 0.887 0.37541
## maritl4. Divorced 9.664 3.166 3.052 0.00229 **
## maritl5. Separated 7.189 5.539 1.298 0.19441
## jobclass2. Information 16.523 1.442 11.460 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 39.43 on 2994 degrees of freedom
## Multiple R-squared: 0.1086, Adjusted R-squared: 0.1072
## F-statistic: 72.98 on 5 and 2994 DF, p-value: < 2.2e-16
We can assess the groups with the contrasts function.
# To interpret first identify which is ref category
contrasts(Wage$maritl) # Never Married is the reference category
## 2. Married 3. Widowed 4. Divorced 5. Separated
## 1. Never Married 0 0 0 0
## 2. Married 1 0 0 0
## 3. Widowed 0 1 0 0
## 4. Divorced 0 0 1 0
## 5. Separated 0 0 0 1
contrasts(Wage$jobclass) # Industrial is the reference category
## 2. Information
## 1. Industrial 0
## 2. Information 1
Anova can also show the deviances etc. but notice, these does not appear to be neste d(JK note)?????
- The answer, fit 1 and fit 2 are nested into fit 3. Thus we dont compare fit 1 and fit 2, as these are not nested.
anova(fit1,fit2,fit3)
Res.Df | RSS | Df | Sum of Sq | F | Pr(>F) |
---|---|---|---|---|---|
2995 | 4858941 | NA | NA | NA | NA |
2998 | 4998547 | -3 | -139606 | 29.93215 | 0 |
2994 | 4654752 | 4 | 343795 | 55.28341 | 0 |
Now we can check the residuals
par(mfrow = c(2,2))
plot(fit3)
Looking at:
- The top left of the plot (residuals vs fitted), we would like these to be around 0.
- The top right, we want them to have a linear shape. This looks odd
Based on this, the model may be questionable. The solution:
- Exclude the extreme values
- Finding a variable that account for them.
2.4.3 Exercise 8
<- Auto df
Are we able to predict how old a car is based on the variables at hand?
Hence year = DV
Name contains a lot of value, let us only use the first word, as that appear to be the brand. Therefore a loop is created to correct all the misspelled names.
<- strsplit(x = as.character(df$name),split = " ")
brand
<- as.vector(rep(0,length(brand)))
brand.name
for (i in c(1:length(brand))) {
<- brand[[i]][1]
brand.name[i]
}
table(brand.name)
## brand.name
## amc audi bmw buick cadillac
## 27 7 2 17 2
## capri chevroelt chevrolet chevy chrysler
## 1 1 43 3 6
## datsun dodge fiat ford hi
## 23 28 8 48 1
## honda maxda mazda mercedes mercedes-benz
## 13 2 10 1 2
## mercury nissan oldsmobile opel peugeot
## 11 1 10 4 8
## plymouth pontiac renault saab subaru
## 31 16 3 4 4
## toyota toyouta triumph vokswagen volkswagen
## 25 1 1 1 15
## volvo vw
## 6 6
<- matrix(byrow = TRUE,ncol = 2
misspelled data = c("mercedes","mercedes-benz"
,"toyouta","toyota"
,"chevroelt","chevrolet"
,"maxda","mazda"
,"vokswagen","volkswagen"
,"vw","volkswagen"))
,
<- as.vector("")
index <- 0
n <- as.list(0)
bn.list <- brand.name
brand.name.recent for (i in c(misspelled[,1])) {
<- n + 1
n <- rep(FALSE,length(brand.name))
index == i] <- TRUE
index[brand.name <- replace(x = brand.name.recent,list = index,values = misspelled[n,2])
bn.list[[n]] <- replace(x = brand.name.recent,list = index,values = misspelled[n,2])
brand.name.recent
}<- cbind(df[,-9],as.factor(bn.list[[6]]))
df names(df)[names(df) == 'bn.list[[6]]'] <- "brand.name"
Also we must convert origin to a factor.
$origin <- as.factor(df$origin) df
Checking correlations.
The following can be run to see all the combinations
# par(mfrow = c(1,1))
# for (i in 1:dim(mm)[2]) {
# plot(y = df$year,x = mm[,i],xlab = names(mm)[i],ylab = "Year")
# grid()
# colnames(mm)[i] %>% title()
# }
Before training the model, we can partition the data to test the model out of sample
set.seed(1337)
<- round(x = nrow(df)*0.8,digits = 0) #Setting the training size
train.size <- sample(x = c(1:nrow(df)),size = train.size) #setting seed and creating vector for index
train.index <- model.matrix(year ~ .,data = df)[,-1] #tried to make it mm first, to get rid of having variables that were in one partition but not the other.
mm <- df$year
year <- as.data.frame(cbind(year,mm[train.index,])) #crating the training set
train.df <- as.data.frame(cbind(year,mm[-train.index,])) #creating the testing set test.df
library(gam)
<- gam(year ~ s(train.df$mpg,df = 5) + s(train.df$cylinders,df = 5) + s(train.df$displacement,df = 5) + s(train.df$horsepower,df = 5) + s(train.df$weight,df = 5) + s(train.df$acceleration,df = 5) + .
gam.m1 data = train.df)
,
summary(gam.m1)
##
## Call: gam(formula = year ~ s(train.df$mpg, df = 5) + s(train.df$cylinders,
## df = 5) + s(train.df$displacement, df = 5) + s(train.df$horsepower,
## df = 5) + s(train.df$weight, df = 5) + s(train.df$acceleration,
## df = 5) + ., data = train.df)
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -5.5818 -2.0705 -0.2068 2.0660 6.0843
##
## (Dispersion Parameter for gaussian family taken to be 8.889)
##
## Null Deviance: 2675.86 on 313 degrees of freedom
## Residual Deviance: 2266.693 on 255.0004 degrees of freedom
## AIC: 1631.771
##
## Number of Local Scoring Iterations: NA
##
## Anova for Parametric Effects
## Df Sum Sq Mean Sq F value
## s(train.df$mpg, df = 5) 1 10.71 10.707 1.2045
## s(train.df$cylinders, df = 5) 1 0.00 0.002 0.0003
## s(train.df$displacement, df = 5) 1 2.44 2.441 0.2747
## s(train.df$horsepower, df = 5) 1 0.13 0.134 0.0151
## s(train.df$weight, df = 5) 1 5.93 5.926 0.6667
## s(train.df$acceleration, df = 5) 1 1.52 1.517 0.1706
## origin2 1 16.05 16.050 1.8056
## origin3 1 11.10 11.097 1.2484
## `\\`as.factor(bn.list[[6]])\\`audi` 1 6.53 6.526 0.7342
## `\\`as.factor(bn.list[[6]])\\`bmw` 1 32.67 32.668 3.6751
## `\\`as.factor(bn.list[[6]])\\`buick` 1 8.31 8.307 0.9345
## `\\`as.factor(bn.list[[6]])\\`cadillac` 1 0.85 0.854 0.0960
## `\\`as.factor(bn.list[[6]])\\`capri` 1 0.00 0.003 0.0004
## `\\`as.factor(bn.list[[6]])\\`chevrolet` 1 0.23 0.230 0.0258
## `\\`as.factor(bn.list[[6]])\\`chevy` 1 0.00 0.003 0.0004
## `\\`as.factor(bn.list[[6]])\\`chrysler` 1 46.30 46.301 5.2088
## `\\`as.factor(bn.list[[6]])\\`datsun` 1 18.32 18.321 2.0611
## `\\`as.factor(bn.list[[6]])\\`dodge` 1 2.70 2.702 0.3040
## `\\`as.factor(bn.list[[6]])\\`fiat` 1 1.89 1.895 0.2132
## `\\`as.factor(bn.list[[6]])\\`ford` 1 25.85 25.847 2.9077
## `\\`as.factor(bn.list[[6]])\\`hi` 1 0.34 0.337 0.0379
## `\\`as.factor(bn.list[[6]])\\`honda` 1 3.40 3.401 0.3826
## `\\`as.factor(bn.list[[6]])\\`mazda` 1 4.99 4.986 0.5609
## `\\`as.factor(bn.list[[6]])\\`mercedes-benz` 1 0.01 0.014 0.0016
## `\\`as.factor(bn.list[[6]])\\`mercury` 1 0.31 0.313 0.0352
## `\\`as.factor(bn.list[[6]])\\`nissan` 1 3.68 3.684 0.4145
## `\\`as.factor(bn.list[[6]])\\`oldsmobile` 1 18.12 18.124 2.0389
## `\\`as.factor(bn.list[[6]])\\`opel` 1 0.30 0.303 0.0341
## `\\`as.factor(bn.list[[6]])\\`peugeot` 1 1.79 1.794 0.2018
## `\\`as.factor(bn.list[[6]])\\`plymouth` 1 0.53 0.527 0.0593
## `\\`as.factor(bn.list[[6]])\\`pontiac` 1 4.37 4.374 0.4921
## `\\`as.factor(bn.list[[6]])\\`renault` 1 16.53 16.533 1.8600
## `\\`as.factor(bn.list[[6]])\\`saab` 1 0.10 0.103 0.0115
## `\\`as.factor(bn.list[[6]])\\`subaru` 1 0.00 0.001 0.0001
## `\\`as.factor(bn.list[[6]])\\`volkswagen` 1 0.63 0.634 0.0713
## Residuals 255 2266.69 8.889
## Pr(>F)
## s(train.df$mpg, df = 5) 0.27346
## s(train.df$cylinders, df = 5) 0.98665
## s(train.df$displacement, df = 5) 0.60068
## s(train.df$horsepower, df = 5) 0.90226
## s(train.df$weight, df = 5) 0.41496
## s(train.df$acceleration, df = 5) 0.67989
## origin2 0.18023
## origin3 0.26491
## `\\`as.factor(bn.list[[6]])\\`audi` 0.39233
## `\\`as.factor(bn.list[[6]])\\`bmw` 0.05635 .
## `\\`as.factor(bn.list[[6]])\\`buick` 0.33460
## `\\`as.factor(bn.list[[6]])\\`cadillac` 0.75690
## `\\`as.factor(bn.list[[6]])\\`capri` 0.98494
## `\\`as.factor(bn.list[[6]])\\`chevrolet` 0.87244
## `\\`as.factor(bn.list[[6]])\\`chevy` 0.98454
## `\\`as.factor(bn.list[[6]])\\`chrysler` 0.02330 *
## `\\`as.factor(bn.list[[6]])\\`datsun` 0.15233
## `\\`as.factor(bn.list[[6]])\\`dodge` 0.58186
## `\\`as.factor(bn.list[[6]])\\`fiat` 0.64469
## `\\`as.factor(bn.list[[6]])\\`ford` 0.08937 .
## `\\`as.factor(bn.list[[6]])\\`hi` 0.84586
## `\\`as.factor(bn.list[[6]])\\`honda` 0.53679
## `\\`as.factor(bn.list[[6]])\\`mazda` 0.45459
## `\\`as.factor(bn.list[[6]])\\`mercedes-benz` 0.96840
## `\\`as.factor(bn.list[[6]])\\`mercury` 0.85133
## `\\`as.factor(bn.list[[6]])\\`nissan` 0.52028
## `\\`as.factor(bn.list[[6]])\\`oldsmobile` 0.15454
## `\\`as.factor(bn.list[[6]])\\`opel` 0.85372
## `\\`as.factor(bn.list[[6]])\\`peugeot` 0.65362
## `\\`as.factor(bn.list[[6]])\\`plymouth` 0.80775
## `\\`as.factor(bn.list[[6]])\\`pontiac` 0.48365
## `\\`as.factor(bn.list[[6]])\\`renault` 0.17383
## `\\`as.factor(bn.list[[6]])\\`saab` 0.91452
## `\\`as.factor(bn.list[[6]])\\`subaru` 0.99320
## `\\`as.factor(bn.list[[6]])\\`volkswagen` 0.78966
## Residuals
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Anova for Nonparametric Effects
## Npar Df Npar F Pr(F)
## (Intercept)
## s(train.df$mpg, df = 5) 4 0.97458 0.4219
## s(train.df$cylinders, df = 5) 3 0.63467 0.5933
## s(train.df$displacement, df = 5) 4 1.72358 0.1452
## s(train.df$horsepower, df = 5) 4 1.13935 0.3384
## s(train.df$weight, df = 5) 4 0.69772 0.5941
## s(train.df$acceleration, df = 5) 4 0.93342 0.4451
## mpg
## cylinders
## displacement
## horsepower
## weight
## acceleration
## origin2
## origin3
## `\\`as.factor(bn.list[[6]])\\`audi`
## `\\`as.factor(bn.list[[6]])\\`bmw`
## `\\`as.factor(bn.list[[6]])\\`buick`
## `\\`as.factor(bn.list[[6]])\\`cadillac`
## `\\`as.factor(bn.list[[6]])\\`capri`
## `\\`as.factor(bn.list[[6]])\\`chevrolet`
## `\\`as.factor(bn.list[[6]])\\`chevy`
## `\\`as.factor(bn.list[[6]])\\`chrysler`
## `\\`as.factor(bn.list[[6]])\\`datsun`
## `\\`as.factor(bn.list[[6]])\\`dodge`
## `\\`as.factor(bn.list[[6]])\\`fiat`
## `\\`as.factor(bn.list[[6]])\\`ford`
## `\\`as.factor(bn.list[[6]])\\`hi`
## `\\`as.factor(bn.list[[6]])\\`honda`
## `\\`as.factor(bn.list[[6]])\\`mazda`
## `\\`as.factor(bn.list[[6]])\\`mercedes-benz`
## `\\`as.factor(bn.list[[6]])\\`mercury`
## `\\`as.factor(bn.list[[6]])\\`nissan`
## `\\`as.factor(bn.list[[6]])\\`oldsmobile`
## `\\`as.factor(bn.list[[6]])\\`opel`
## `\\`as.factor(bn.list[[6]])\\`peugeot`
## `\\`as.factor(bn.list[[6]])\\`plymouth`
## `\\`as.factor(bn.list[[6]])\\`pontiac`
## `\\`as.factor(bn.list[[6]])\\`renault`
## `\\`as.factor(bn.list[[6]])\\`saab`
## `\\`as.factor(bn.list[[6]])\\`subaru`
## `\\`as.factor(bn.list[[6]])\\`toyota`
## `\\`as.factor(bn.list[[6]])\\`triumph`
## `\\`as.factor(bn.list[[6]])\\`volkswagen`
## `\\`as.factor(bn.list[[6]])\\`volvo`
It appears as if non of the parameters are good predictors.
Then one could try out other models, or perhaps it is just very difficult with the data at hand to predict the year of the car.
2.4.4 Exercise 9
library(MASS)
<- Boston
df <- as.data.frame(cbind(df$nox,df$dis))
df colnames(df) <- c("nox","dis")
2.4.4.1 (a) using poly function to fit cubic polynomial regression
<- lm(nox ~ poly(dis,3),data = df)
fit.poly summary(fit.poly)
##
## Call:
## lm(formula = nox ~ poly(dis, 3), data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.121130 -0.040619 -0.009738 0.023385 0.194904
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.554695 0.002759 201.021 < 2e-16 ***
## poly(dis, 3)1 -2.003096 0.062071 -32.271 < 2e-16 ***
## poly(dis, 3)2 0.856330 0.062071 13.796 < 2e-16 ***
## poly(dis, 3)3 -0.318049 0.062071 -5.124 0.000000427 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06207 on 502 degrees of freedom
## Multiple R-squared: 0.7148, Adjusted R-squared: 0.7131
## F-statistic: 419.3 on 3 and 502 DF, p-value: < 2.2e-16
Remember that we are not interested in the coefficients as they are misleading, thus we want to look at the shape.
The table above is mostly presented for explanatory reasons.
As we are interested in the curve, we can fit that.
#Defining range
<- range(df$dis)
dislims <- (dislims[2]-dislims[1])/nrow(df)
n <- seq(from = dislims[1],to = dislims[2],by = n)
dis.grid
#Predictions for the plot
<- predict(object = fit.poly,newdata = list(dis = dis.grid))
preds
#Plotting
plot(nox ~ dis, data = df, col = "darkgrey")
grid()
lines(x = dis.grid,y = preds, col = "blue",lwd = 2)
title("Cubic polynomial")
2.4.4.2 (b) Plotting polynomial fits for a range of polynomials
<- list()
models <- 0
RSS for (d in 1:10) {
<- lm(nox ~ poly(dis,d),data = df)
models[[d]] <- sum(residuals(models[[d]])^2)
RSS[d]
}plot(RSS,type = "b")
points(x = which.min(RSS),y = RSS[which.min(RSS)],col = "red",pch = 19)
grid()
abline(h = min(RSS),col = "blue",lty = 2)
title("In-sample error")
We see that the RSS decrease with complexity, that it as expected, as we fit to the in sample data. We could do this with a partition of the data to see out of performance instead.
2.4.4.3 (c) Using CV to select best degree of d
Here we run a loop with cross validation to see how the different order of d performs. As the partitions are randomly selected, we preduce 10 simulations to see which orders that tend to occur most often.
<- list()
models <- 0
RSS <- 0
CV.RSS <- 0
CV.RSS.sim
for (i in 1:20) {
for (d in 1:10) {
<- glm(nox ~ poly(dis,d),data = df)
models[[d]] <- sum(residuals(models[[d]])^2)
RSS[d] <- cv.glm(data = df,glmfit = models[[d]],K = 10)$delta[2] #Delta = prediction error (adjusted)
CV.RSS[d]
}
<- which.min(CV.RSS)
CV.RSS.sim[i]
}
#Plotting prediction error
plot(CV.RSS,type = "b")
points(x = which.min(CV.RSS),y = CV.RSS[which.min(CV.RSS)],col = "red",pch = 19)
grid()
abline(h = min(CV.RSS),col = "blue",lty = 2)
title("CV K = 10 prediction error")
#Plotting simulations
barplot(table(CV.RSS.sim),xlab = "Degree of d",ylab = "Frequency")
abline(h = 0)
title("CV K = 10, Iterations = 20")
In these simulations we see that the best fit is likely to be with using .
It is actually quite interesting that a model with 10 degrees of d is as competitive as 4 in this example, although the cubic model is far superior than the other models.
2.4.4.4 (d) Use bs()
to fit a regression spline
par(mfrow = c(1,1))
<- lm(nox ~ bs(dis,df = 4),data = df) #Note as degree is not defined, default = 3
fit.bs <- predict(object = fit.bs,newdata = list(dis = dis.grid))
preds plot(x = df$dis,y = df$nox,col = "darkgrey",pch = 20,ylab = "nox",xlab = "dis")
lines(x = dis.grid,y = preds,col = "blue",lwd = 2)
grid()
title("Regression Spline df = 4")
abline(v = 3.20745,col = "red",lty = 2) #This is the cut, found in next chunk
legend(x = "topright",legend = c("fit","cut"),lty = 1:2,col = c("blue","red"))
Notice that we merely specified the amount of df that we wanted. The function merely specified them automatically. We can interpret these, by using dim()
and attr()
.
print(dim(bs(df$dis,df = 4)))
## [1] 506 4
attr(bs(df$dis,df = 4),"knots")
## 50%
## 3.20745
We see that a model with 4 degrees of freedom yields one cut. Where the model put this at 50%, hence the first half (up to 3.20745). For simplicity, this cut has been added to the plot above, to show where the spline is split.
2.4.4.5 (e) Now fit a regression spline
par(mfrow = c(2,2))
for (d in 4:7) {
#The fit + preds
<- lm(nox ~ bs(dis,df = d,degree = 3),data = df)
fit.bs <- predict(object = fit.bs,newdata = list(dis = dis.grid))
preds
#Cut
<- attr(bs(df$dis,df = d),"knots")
cut
#Plot
plot(x = df$dis,y = df$nox,col = "darkgrey",pch = 20,ylab = "nox",xlab = "dis")
lines(x = dis.grid,y = preds,col = "blue",lwd = 2)
grid()
title(paste("Cubic Regression Spline, df =",d))
abline(v = cut,col = "red",lty = 2) #This is the cut, found in next chunk
legend(x = "topright",legend = c("fit","cut"),lty = 1:2,col = c("blue","red"))
}
We start at four degrees of freedom as a model with only three degrees of freedom, hence cubic regression (three orders of polynomials) = three degrees of freedom (this has to be fact checked).
As we add complexity with knots we also adds degrees of freedom, where we add one degree of freedom for each cut, hence for the cubic spline with 7 degrees of freedom, four cuts and three polynomials (this has to be fact checked).
2.4.4.6 (f) Perform cross-validation, to select degrees
<- list()
models <- 0
RSS <- 0
CV.RSS.sim
for (i in 1:20) {
for (d in 4:15) {
<- glm(nox ~ bs(dis,df = d,degree = 3),data = df)
models[[d]] <- sum(residuals(models[[d]])^2)
RSS[d] <- cv.glm(data = df,glmfit = models[[d]],K = 10)$delta[2] #Delta = prediction error (adjusted)
CV.RSS[d]
}
<- which.min(CV.RSS)
CV.RSS.sim[i]
}
par(mfrow = c(1,1),mar = c(5,4.5,4.5,2.1),oma = c(0,0,0,0))
#Plotting prediction error
plot(CV.RSS,type = "b")
points(x = which.min(CV.RSS),y = CV.RSS[which.min(CV.RSS)],col = "red",pch = 19)
grid()
abline(h = min(CV.RSS),col = "blue",lty = 2)
title("CV K = 10 prediction error")
#Plotting simulations
barplot(table(CV.RSS.sim),xlab = "Degree of d",ylab = "Frequency")
abline(h = 0)
title("CV K = 10, Iterations = 20")
First we see the last iteration and the prediction error hereof. Overall we see that it tend to be the rather complex models tend to be
2.4.5 Exercise 10
2.4.5.1 (a) Partitioning the data
#Loading
<- College
df
#Partitioning
set.seed(1337)
<- round(x = nrow(df)*0.8,digits = 0) #Setting the training size
train.size <- sample(x = c(1:nrow(df)),size = train.size) #setting seed and creating vector for index
train.index <- df[train.index,] #crating the training set
train.df <- df[-train.index,] #creating the testing set
test.df rm(train.size)
rm(train.index)
Finding the best subset using forward selection
<- lm(Outstate ~ 1,data = train.df) #The null models
reg_null <- lm(Outstate ~ .,data = train.df) #The full model
reg_full <- stepAIC(direction = "forward",object = reg_null,trace = TRUE,scope = list(upper = reg_full,lower = reg_null)) #This could also have been done with regsubsets() step.for
## Start: AIC=10345.68
## Outstate ~ 1
##
## Df Sum of Sq RSS AIC
## + Expend 1 4629213116 5745572675 9980.1
## + Room.Board 1 4590249421 5784536370 9984.3
## + Grad.Rate 1 3627082940 6747702852 10080.1
## + Top10perc 1 3457381465 6917404327 10095.6
## + perc.alumni 1 3450011987 6924773804 10096.2
## + S.F.Ratio 1 3359826734 7014959058 10104.3
## + Private 1 3075173999 7299611793 10129.0
## + Top25perc 1 2644625673 7730160118 10164.7
## + Terminal 1 2115825269 8258960523 10205.8
## + PhD 1 1886906305 8487879487 10222.8
## + Personal 1 735337851 9639447941 10301.9
## + P.Undergrad 1 571871111 9802914681 10312.4
## + F.Undergrad 1 404016737 9970769054 10323.0
## + Enroll 1 180722874 10194062918 10336.7
## + Apps 1 52841694 10321944097 10344.5
## <none> 10374785791 10345.7
## + Books 1 14584993 10360200799 10346.8
## + Accept 1 90338 10374695453 10347.7
##
## Step: AIC=9980.11
## Outstate ~ Expend
##
## Df Sum of Sq RSS AIC
## + Private 1 1549696145 4195876531 9786.6
## + Room.Board 1 1512070540 4233502135 9792.1
## + Grad.Rate 1 1328830834 4416741842 9818.5
## + perc.alumni 1 1089768014 4655804661 9851.3
## + Personal 1 501146318 5244426358 9925.3
## + S.F.Ratio 1 464809224 5280763451 9929.6
## + F.Undergrad 1 454756490 5290816185 9930.8
## + P.Undergrad 1 349207380 5396365295 9943.1
## + Top10perc 1 346779721 5398792955 9943.4
## + Top25perc 1 345553704 5400018971 9943.5
## + Enroll 1 331862326 5413710349 9945.1
## + Terminal 1 269786468 5475786207 9952.2
## + PhD 1 210830779 5534741896 9958.9
## + Apps 1 110876800 5634695875 9970.0
## + Accept 1 85558833 5660013842 9972.8
## + Books 1 18850616 5726722060 9980.1
## <none> 5745572675 9980.1
##
## Step: AIC=9786.59
## Outstate ~ Expend + Private
##
## Df Sum of Sq RSS AIC
## + Room.Board 1 876836109 3319040422 9642.8
## + Terminal 1 743790341 3452086190 9667.2
## + Grad.Rate 1 703122598 3492753933 9674.5
## + PhD 1 693220664 3502655866 9676.3
## + perc.alumni 1 408270963 3787605568 9724.9
## + Top25perc 1 401010315 3794866215 9726.1
## + Top10perc 1 319956878 3875919653 9739.3
## + Accept 1 152840390 4043036140 9765.5
## + Personal 1 128047157 4067829374 9769.3
## + Apps 1 118364448 4077512083 9770.8
## + Enroll 1 37847497 4158029034 9783.0
## + S.F.Ratio 1 28944091 4166932439 9784.3
## + F.Undergrad 1 16848576 4179027955 9786.1
## <none> 4195876531 9786.6
## + Books 1 5437161 4190439370 9787.8
## + P.Undergrad 1 3721562 4192154968 9788.0
##
## Step: AIC=9642.78
## Outstate ~ Expend + Private + Room.Board
##
## Df Sum of Sq RSS AIC
## + perc.alumni 1 419740777 2899299645 9560.7
## + Grad.Rate 1 412477957 2906562465 9562.2
## + PhD 1 379305385 2939735036 9569.3
## + Terminal 1 369656216 2949384206 9571.3
## + Top25perc 1 311299881 3007740541 9583.5
## + Top10perc 1 280108774 3038931648 9589.9
## + Personal 1 84780679 3234259743 9628.7
## + Accept 1 42471882 3276568540 9636.8
## + Books 1 35245677 3283794745 9638.1
## + P.Undergrad 1 29148614 3289891808 9639.3
## + S.F.Ratio 1 27660641 3291379781 9639.6
## + Apps 1 24109291 3294931131 9640.2
## + Enroll 1 12561816 3306478606 9642.4
## <none> 3319040422 9642.8
## + F.Undergrad 1 2021439 3317018983 9644.4
##
## Step: AIC=9560.68
## Outstate ~ Expend + Private + Room.Board + perc.alumni
##
## Df Sum of Sq RSS AIC
## + PhD 1 248319695 2650979950 9507.0
## + Terminal 1 241588269 2657711376 9508.6
## + Grad.Rate 1 194250957 2705048688 9519.5
## + Top25perc 1 146504265 2752795380 9530.4
## + Top10perc 1 124783671 2774515974 9535.3
## + Accept 1 68407796 2830891849 9547.8
## + Apps 1 37351605 2861948040 9554.6
## + Personal 1 22021761 2877277884 9557.9
## + Enroll 1 21250394 2878049251 9558.1
## + Books 1 16826748 2882472897 9559.1
## + S.F.Ratio 1 12022816 2887276829 9560.1
## <none> 2899299645 9560.7
## + F.Undergrad 1 9250654 2890048991 9560.7
## + P.Undergrad 1 5862353 2893437292 9561.4
##
## Step: AIC=9506.99
## Outstate ~ Expend + Private + Room.Board + perc.alumni + PhD
##
## Df Sum of Sq RSS AIC
## + Grad.Rate 1 138096695 2512883255 9475.7
## + Top25perc 1 42169169 2608810781 9499.0
## + Top10perc 1 35379421 2615600528 9500.6
## + Terminal 1 26636450 2624343499 9502.7
## + Personal 1 26014281 2624965669 9502.9
## + Accept 1 25856057 2625123892 9502.9
## + S.F.Ratio 1 20504248 2630475702 9504.2
## + P.Undergrad 1 15565029 2635414921 9505.3
## + Books 1 14472738 2636507212 9505.6
## + Apps 1 10845099 2640134851 9506.4
## <none> 2650979950 9507.0
## + Enroll 1 1715151 2649264798 9508.6
## + F.Undergrad 1 235072 2650744878 9508.9
##
## Step: AIC=9475.71
## Outstate ~ Expend + Private + Room.Board + perc.alumni + PhD +
## Grad.Rate
##
## Df Sum of Sq RSS AIC
## + Terminal 1 29069264 2483813991 9470.5
## + S.F.Ratio 1 23980164 2488903091 9471.7
## + Personal 1 15282065 2497601190 9473.9
## + Top25perc 1 14080081 2498803174 9474.2
## + Books 1 13787420 2499095834 9474.3
## + Top10perc 1 10584708 2502298546 9475.1
## + Accept 1 10459352 2502423903 9475.1
## <none> 2512883255 9475.7
## + P.Undergrad 1 5510297 2507372958 9476.3
## + F.Undergrad 1 1533109 2511350146 9477.3
## + Apps 1 974955 2511908300 9477.5
## + Enroll 1 16562 2512866693 9477.7
##
## Step: AIC=9470.48
## Outstate ~ Expend + Private + Room.Board + perc.alumni + PhD +
## Grad.Rate + Terminal
##
## Df Sum of Sq RSS AIC
## + S.F.Ratio 1 21234904 2462579087 9467.1
## + Books 1 18033724 2465780267 9467.9
## + Personal 1 17438422 2466375568 9468.1
## + Top25perc 1 11061012 2472752979 9469.7
## + Top10perc 1 10503050 2473310941 9469.8
## + Accept 1 8692072 2475121919 9470.3
## <none> 2483813991 9470.5
## + P.Undergrad 1 6212554 2477601437 9470.9
## + F.Undergrad 1 2755569 2481058422 9471.8
## + Apps 1 617630 2483196361 9472.3
## + Enroll 1 29501 2483784490 9472.5
##
## Step: AIC=9467.13
## Outstate ~ Expend + Private + Room.Board + perc.alumni + PhD +
## Grad.Rate + Terminal + S.F.Ratio
##
## Df Sum of Sq RSS AIC
## + Personal 1 19556149 2443022938 9464.2
## + Books 1 17642529 2444936558 9464.7
## + Accept 1 12220653 2450358434 9466.0
## + Top25perc 1 10915780 2451663307 9466.4
## + Top10perc 1 9884752 2452694335 9466.6
## <none> 2462579087 9467.1
## + P.Undergrad 1 5425216 2457153871 9467.8
## + Apps 1 1708616 2460870471 9468.7
## + F.Undergrad 1 1057807 2461521279 9468.9
## + Enroll 1 169448 2462409639 9469.1
##
## Step: AIC=9464.18
## Outstate ~ Expend + Private + Room.Board + perc.alumni + PhD +
## Grad.Rate + Terminal + S.F.Ratio + Personal
##
## Df Sum of Sq RSS AIC
## + Accept 1 16268824 2426754113 9462.0
## + Books 1 12585987 2430436951 9463.0
## + Top25perc 1 11662715 2431360223 9463.2
## + Top10perc 1 10776478 2432246460 9463.4
## <none> 2443022938 9464.2
## + Apps 1 3319537 2439703401 9465.3
## + P.Undergrad 1 2234491 2440788447 9465.6
## + Enroll 1 1504102 2441518836 9465.8
## + F.Undergrad 1 17762 2443005175 9466.2
##
## Step: AIC=9462.02
## Outstate ~ Expend + Private + Room.Board + perc.alumni + PhD +
## Grad.Rate + Terminal + S.F.Ratio + Personal + Accept
##
## Df Sum of Sq RSS AIC
## + F.Undergrad 1 34267884 2392486229 9455.2
## + Apps 1 28349962 2398404152 9456.7
## + Enroll 1 21722919 2405031194 9458.4
## + Books 1 13764794 2412989319 9460.5
## + Top10perc 1 9007820 2417746293 9461.7
## + Top25perc 1 8657418 2418096695 9461.8
## + P.Undergrad 1 7813774 2418940340 9462.0
## <none> 2426754113 9462.0
##
## Step: AIC=9455.17
## Outstate ~ Expend + Private + Room.Board + perc.alumni + PhD +
## Grad.Rate + Terminal + S.F.Ratio + Personal + Accept + F.Undergrad
##
## Df Sum of Sq RSS AIC
## + Apps 1 33237860 2359248369 9448.5
## + Top10perc 1 14141658 2378344572 9453.5
## + Top25perc 1 13224939 2379261290 9453.7
## + Books 1 11896812 2380589418 9454.1
## <none> 2392486229 9455.2
## + P.Undergrad 1 793001 2391693228 9457.0
## + Enroll 1 19024 2392467206 9457.2
##
## Step: AIC=9448.47
## Outstate ~ Expend + Private + Room.Board + perc.alumni + PhD +
## Grad.Rate + Terminal + S.F.Ratio + Personal + Accept + F.Undergrad +
## Apps
##
## Df Sum of Sq RSS AIC
## + Top10perc 1 34199271 2325049099 9441.4
## + Top25perc 1 24042385 2335205985 9444.1
## + Books 1 10661970 2348586399 9447.7
## <none> 2359248369 9448.5
## + Enroll 1 1072888 2358175482 9450.2
## + P.Undergrad 1 755069 2358493300 9450.3
##
## Step: AIC=9441.39
## Outstate ~ Expend + Private + Room.Board + perc.alumni + PhD +
## Grad.Rate + Terminal + S.F.Ratio + Personal + Accept + F.Undergrad +
## Apps + Top10perc
##
## Df Sum of Sq RSS AIC
## + Books 1 15416176 2309632923 9439.3
## <none> 2325049099 9441.4
## + Enroll 1 3260513 2321788586 9442.5
## + Top25perc 1 166941 2324882158 9443.3
## + P.Undergrad 1 256 2325048842 9443.4
##
## Step: AIC=9439.25
## Outstate ~ Expend + Private + Room.Board + perc.alumni + PhD +
## Grad.Rate + Terminal + S.F.Ratio + Personal + Accept + F.Undergrad +
## Apps + Top10perc + Books
##
## Df Sum of Sq RSS AIC
## <none> 2309632923 9439.3
## + Enroll 1 3152187 2306480736 9440.4
## + Top25perc 1 252598 2309380324 9441.2
## + P.Undergrad 1 53 2309632870 9441.3
summary(step.for)
##
## Call:
## lm(formula = Outstate ~ Expend + Private + Room.Board + perc.alumni +
## PhD + Grad.Rate + Terminal + S.F.Ratio + Personal + Accept +
## F.Undergrad + Apps + Top10perc + Books, data = train.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6206 -1233 10 1325 5390
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2286.93077 865.13003 -2.643 0.008418 **
## Expend 0.16360 0.02379 6.878 1.51e-11 ***
## PrivateYes 2242.59519 268.94651 8.338 5.06e-16 ***
## Room.Board 0.94499 0.09401 10.052 < 2e-16 ***
## perc.alumni 42.13604 8.50315 4.955 9.38e-07 ***
## PhD 14.09387 10.16544 1.386 0.166118
## Grad.Rate 28.68532 6.13106 4.679 3.56e-06 ***
## Terminal 32.16147 11.27001 2.854 0.004468 **
## S.F.Ratio -61.88646 28.54270 -2.168 0.030531 *
## Personal -0.19357 0.13361 -1.449 0.147935
## Accept 0.67907 0.13205 5.143 3.66e-07 ***
## F.Undergrad -0.14101 0.03952 -3.568 0.000388 ***
## Apps -0.28245 0.07533 -3.749 0.000194 ***
## Top10perc 23.14400 7.23340 3.200 0.001448 **
## Books -1.00042 0.49701 -2.013 0.044572 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1951 on 607 degrees of freedom
## Multiple R-squared: 0.7774, Adjusted R-squared: 0.7722
## F-statistic: 151.4 on 14 and 607 DF, p-value: < 2.2e-16
NOTE: Ana’s solution has a nice example with regsubsets, where she applies standard errors to see if the results of the different combinations are the same.
We see that the forward selection decides on 14 variables to be included
2.4.5.2 (b) Fitting a GAM
par(mfrow = c(1,1))
library(forecast)
<- formula(step.for)
best.formula <-gam::gam(formula = best.formula,data = train.df) #Notice that it is linear
fit.gam <- predict(object = fit.gam,newdata = test.df)
preds plot(x = preds,y = preds-test.df$Outstate,ylab = "Residuals",xlab = "Predicted values")
From the plot we see that in general we have a resdiausl around 0 withinn +/- 5000. Also the residuals do look rather normal. Although one could argue that the variance is a but smaller in the lower region of the predicted value, and it does in fact appear as if we are under estimating these result.
We can interprete how Outstate responds in the following:
par(mfrow = c(4,4))
plot(fit.gam #Note, automatically identifies the GAM object, hence plots for each variable
se = TRUE
,col ="blue") ,
Interpreting the plot: Recall that the plots assumes that we hold the other variables fixed, hence we see the following:
- e.g., Expend: We see that holding the other variables fixed, the outstate tends to increase over the expenditure.
- e.g., Apps: I assume that this is applicants, we see that holder the other variables fixed, outstate students tend to decrease as amount of applicants decrease.
We can also assess the overall accuracy
accuracy(preds,x = test.df$Outstate)
## ME RMSE MAE MPE MAPE
## Test set 85.03267 2041.19 1587.582 -1.863903 16.86691
We see that the MAE is 1587, where the mean absolute percentage error is almost 17%, hence it appear to be rather high.
2.4.5.3 (c) Evaluating on the test set
This is what was done above. It is expected that if we compared applying on the train and test set, we will observe that the model has a lot of optimism on the train data, thus we should also see that the MAPE is lower on this partition, has this is what the model was trained on.
2.4.5.4 (d) Which variables appear to have a non linear relationship?
par(mfrow = c(3,3))
for (i in c(1:8,10:18)) {
plot(y = df$Outstate,x = df[,i],xlab = names(df)[i],ylab = "Outstate",pch = 20,col = "darkgrey")
grid()
colnames(df)[i] %>% title()
}
It appears as if expend has som non linear relationship with Outstate. Perhaps enroll, F.Undergrad, and P.Undergrad also have a non linear trend.
To further decide if there is evidence for a non linear relationship, one could make, e.g., a smoothing model to assess the performance hereof.
set.seed(1337)
library(mgcv)
<- mgcv::gam(Outstate ~ Private + s(Room.Board) + s(PhD) + s(perc.alumni) + s(Expend) + s(Grad.Rate), data = train.df,method = 'REML')
gam.mgcv summary(gam.mgcv)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## Outstate ~ Private + s(Room.Board) + s(PhD) + s(perc.alumni) +
## s(Expend) + s(Grad.Rate)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8775.6 175.3 50.05 <2e-16 ***
## PrivateYes 2370.6 217.5 10.90 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(Room.Board) 2.318 2.953 25.81 < 2e-16 ***
## s(PhD) 1.258 1.478 11.03 0.000303 ***
## s(perc.alumni) 1.677 2.115 10.98 0.0000173 ***
## s(Expend) 5.727 6.882 31.37 < 2e-16 ***
## s(Grad.Rate) 2.816 3.593 12.85 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.794 Deviance explained = 79.9%
## -REML = 5531.6 Scale est. = 3.446e+06 n = 622
# We look at the Approximate significance of smooth terms table, in particular to edf.
# These edf´s suggests that the previous gam model (imposing all nonlinear) may be a little too restrictive
par(mfrow = c(2, 3))
plot(gam.mgcv, se = T, col = "blue")
# Check residuals diagnostics
par(mfrow = c(2,2))
gam.check(gam.mgcv)
##
## Method: REML Optimizer: outer newton
## full convergence after 5 iterations.
## Gradient range [-0.001152703,0.003980755]
## (score 5531.561 & scale 3445952).
## Hessian positive definite, eigenvalue range [0.01619762,307.5188].
## Model rank = 47 / 47
##
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
##
## k' edf k-index p-value
## s(Room.Board) 9.00 2.32 0.98 0.290
## s(PhD) 9.00 1.26 0.94 0.075 .
## s(perc.alumni) 9.00 1.68 1.07 0.955
## s(Expend) 9.00 5.73 1.08 0.970
## s(Grad.Rate) 9.00 2.82 0.95 0.120
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
2.5 Casestudy - Predicting the Return on Advertising Spent
(Case and data source: kaggle.com)
2.5.1 1. Background
Along with Google’s search and display networks, Facebook is one of the big players when it comes to online advertising. As Facebook users interact with the platform, adding demographic information, liking particular pages and commenting on specific posts, Facebook builds a profile of that user based on who they are and what they’re interested in.
This fact makes Facebook very attractive for advertisers. Advertisers can create Facebook adverts, then create an ‘Audience’ for that advert or group of adverts. Audiences can be built from a range of attributes including gender, age, location and interests. This specific targeting means advertisers can tailor content appropriately for a specific audience, even if the product being marketed is the same.
For example, let’s imagine a company wants to advertise its new car. They may wish to promote one set of features, performance and the 2 kW stereo, to women in their early twenties. They might decide that they want to talk about it’s fuel efficiency and reduced emissions to men in their thirties, and they might want to push the spacious interior and safety rating to men and women in their thirties and early forties who are interested in Families magazine and who like pages of nappy and baby clothes manufacturers.
In 2016, Facebook’s revenue from advertising was 26bn, up from 17bn the year before. This compares to Google’s 79bn, the 638m that Twitter advertising made in Q4 2016, and $173m that LinkedIn made from ads in Q3 2016. These figures illustrate just how big an advertising platform is, although it faces challenges for the future with a decline in younger users in 2017, with generation Z moving to Snapchat and Instagram. When it comes to analyzing the Facebook adverts dataset, there are a lot of questions we can ask, and a lot of insight we can generate. However, from a business perspective we want to ask questions that will give us answers we can use to improve business performance.
2.5.2 2. Case study (Business Understanding Phase)
The company xyz is an e-marketing company. It earns money based on how many people click the ads and how many people actually purchase the product advertised. The company will receive from the manufacturer 5 for each user that clicks on ad and signs up in the e-commerce shop, and respectively, 100 for each registered user that buys the product. On the other hand, the xyz pays to Facebook to display the ads. The objective of the company xyz is to maximize the return on advertising spent, while minimizing the amount spent on advertising. In order to attain this objective, the company needs to identify the factors (e.g. spending, campaign type, customer type, etc.) influencing most significantly the return on advertising spent (ROAS) based on the available data.
The analytical objective in this case is to predict the Return on Advertising Spent (ROAS) using the dataset KAG_conversion_data.csv. Given the scope of this case study, we aim to set up a non-linear regression model. The model can be used either to predict the value of the target variable, ROAS, for future observations of the explanatory variables or to provide a better understanding of the relationships (form and direction) between dependent and independent variables.
2.5.3 3. The data (Data Understanding Phase)
The documentation describes the columns in the data as follows:
- ad_id: unique ID for each ad. It is just an identifier.
- xyz_campaign_id: an ID associated with each ad campaign of XYZ company
- fb_campaign_id: an ID associated with how Facebook tracks each campaign. Ignore this variable in the analysis
- age: age of the person to whom the ad is shown
- gender: gender of the person to whom the add is shown
- interest: a code specifying the category to which the person’s interest belongs (interests are as mentioned in the person’s Facebook public profile)
- Impressions: the number of times the ad was shown
- Clicks: number of clicks on for that ad
- Spent: Amount paid by company xyz to Facebook, to show that ad
- Total conversion: Total number of people who sign up to the webshop after seeing the ad (here we (the add company) receives 5$)
- Approved conversion/leads: Total number of people who bought the product after seeing the ad (here we (the add company) receives 10$)
We want to predict ROAS, that is return on advertising spending
2.5.4 4. Specific requirements:
2.5.4.1 4.1 Task 1 - Import and overview
Import and view the data. A critical step in the data mining process is to understand the variables, and further to ensure that the software interprets the variables in the correct way.
<- read.csv("Data/1. MovingBeyondLinearity/FacebookCasestudy/KAG_conversion_data.csv") KAG_conversion_data
2.5.4.2 4.2 Task 2 - Data inspection
Inspect your data and do the required variable adaptations and transformations. These may include: treating the missing data, treating the outliers, log transformations, binning, standardizing, and creating additional features (e.g. in this particular case, one needs to calculate the DV (ROAS), which is not ready available in the dataset).
NOTICE for the exam, we will not be asked to transform the data. Or at least the will specify what to do
In real practice, data preprocessing takes about 75% of the total time assigned to a project.
library(tidyverse)
<- KAG_conversion_data
dataTf glimpse(KAG_conversion_data) #Just a quick inspection
## Rows: 1,143
## Columns: 11
## $ ad_id <int> 708746, 708749, 708771, 708815, 708818, 708820, 70…
## $ xyz_campaign_id <int> 916, 916, 916, 916, 916, 916, 916, 916, 916, 916, …
## $ fb_campaign_id <int> 103916, 103917, 103920, 103928, 103928, 103929, 10…
## $ age <chr> "30-34", "30-34", "30-34", "30-34", "30-34", "30-3…
## $ gender <chr> "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", …
## $ interest <int> 15, 16, 20, 28, 28, 29, 15, 16, 27, 28, 31, 7, 16,…
## $ Impressions <int> 7350, 17861, 693, 4259, 4133, 1915, 15615, 10951, …
## $ Clicks <int> 1, 2, 0, 1, 1, 0, 3, 1, 1, 3, 0, 0, 0, 0, 7, 0, 1,…
## $ Spent <dbl> 1.43, 1.82, 0.00, 1.25, 1.29, 0.00, 4.77, 1.27, 1.…
## $ Total_Conversion <int> 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ Approved_Conversion <int> 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0,…
Renaming variables
In general it is a good idea not to have special charactors in the variable name.
<- dataTf %>%
dataTf rename(xyzCampId = xyz_campaign_id, fbCampId = fb_campaign_id, impr = Impressions,
conv = Total_Conversion, appConv = Approved_Conversion)
Create new features and calculate ROAS
Notice, that we use mutate()
to make the new variables
- Click-through-rate: how many of the impressions became clicks.CTR = Clicks/Impressions
- Cost Per Click: how much on average did each click cost.CPC = Spent/Clicks
library(dplyr)
<- dataTf %>%
dataTf mutate(CTR = ((Clicks / impr) * 100), CPC = Spent / Clicks)
$CTR <- round(dataTf$CTR, 4) #NB: Sometime rounding is necessary for packages to work
dataTf$CPC <- round(dataTf$CPC, 2) dataTf
Now we can also make the following variables:
- Conversion value: conval = 5 * conv
- Approved conversion value: appConVal = 100 * appConv
- Total conversion: totConVal = conval + appConVal
- Cost per conversion: costPerCon = Spent/totConv
- Return on Advertising Spend (the revenue as a % of advertising spend) : ROAS = totConVal/Spent
- Cost Per Mille: the cost of one thousand impressions. If our objective is ad exposure to increase brand awareness,this might be an important to measure. CPM = Spent/Impr *1000. I.e. how much of the overall spenditure divided by the amount of impressions. We multiply with 1000, to have more readable numbers instead of decimals.
<- dataTf %>%
dataTf mutate(totConv = conv + appConv,
conVal = conv * 5,
appConVal = appConv * 100) %>%
mutate(totConVal = conVal + appConVal) %>%
mutate(costPerCon = round(Spent / totConv, 2),
ROAS = round(totConVal / Spent, 2))
<- dataTf %>%
dataTf mutate(CPM = round((Spent / impr) * 1000, 2)) #Cost per Mille.
We see that this produce a lot of INF, that is because some of the underlying varaibles contain NaNs, which is regarded as non available values, hence we cant do math with it. These observations will be identified in the following.
Hence: Dealing with missing, currupt and invalid data
# Decide between missing values imputation or deletion
table(is.na(dataTf))
##
## FALSE TRUE
## 22647 213
summary(dataTf)
## ad_id xyzCampId fbCampId age
## Min. : 708746 Min. : 916 Min. :103916 Length:1143
## 1st Qu.: 777632 1st Qu.: 936 1st Qu.:115716 Class :character
## Median :1121185 Median :1178 Median :144549 Mode :character
## Mean : 987261 Mean :1067 Mean :133784
## 3rd Qu.:1121804 3rd Qu.:1178 3rd Qu.:144658
## Max. :1314415 Max. :1178 Max. :179982
##
## gender interest impr Clicks
## Length:1143 Min. : 2.00 Min. : 87 Min. : 0.00
## Class :character 1st Qu.: 16.00 1st Qu.: 6504 1st Qu.: 1.00
## Mode :character Median : 25.00 Median : 51509 Median : 8.00
## Mean : 32.77 Mean : 186732 Mean : 33.39
## 3rd Qu.: 31.00 3rd Qu.: 221769 3rd Qu.: 37.50
## Max. :114.00 Max. :3052003 Max. :421.00
##
## Spent conv appConv CTR
## Min. : 0.00 Min. : 0.000 Min. : 0.000 Min. :0.00000
## 1st Qu.: 1.48 1st Qu.: 1.000 1st Qu.: 0.000 1st Qu.:0.01005
## Median : 12.37 Median : 1.000 Median : 1.000 Median :0.01600
## Mean : 51.36 Mean : 2.856 Mean : 0.944 Mean :0.01642
## 3rd Qu.: 60.02 3rd Qu.: 3.000 3rd Qu.: 1.000 3rd Qu.:0.02340
## Max. :639.95 Max. :60.000 Max. :21.000 Max. :0.10590
##
## CPC totConv conVal appConVal
## Min. :0.180 Min. : 0.0 Min. : 0.00 Min. : 0.0
## 1st Qu.:1.390 1st Qu.: 1.0 1st Qu.: 5.00 1st Qu.: 0.0
## Median :1.500 Median : 2.0 Median : 5.00 Median : 100.0
## Mean :1.499 Mean : 3.8 Mean : 14.28 Mean : 94.4
## 3rd Qu.:1.643 3rd Qu.: 4.0 3rd Qu.: 15.00 3rd Qu.: 100.0
## Max. :2.210 Max. :77.0 Max. :300.00 Max. :2100.0
## NA's :207
## totConVal costPerCon ROAS CPM
## Min. : 0.0 Min. : 0.000 Min. : 0.000 Min. :0.0000
## 1st Qu.: 5.0 1st Qu.: 1.048 1st Qu.: 0.800 1st Qu.:0.1500
## Median : 105.0 Median : 6.290 Median : 3.025 Median :0.2500
## Mean : 108.7 Mean : Inf Mean : Inf Mean :0.2392
## 3rd Qu.: 110.0 3rd Qu.:17.415 3rd Qu.:33.097 3rd Qu.:0.3300
## Max. :2300.0 Max. : Inf Max. : Inf Max. :1.5000
## NA's :3 NA's :3
library(DataExplorer)
plot_missing(dataTf)
Genereal rule of thumb, if we have less than 5%, we can exclude them, if more than 5%, then we should impute values, so we preserve the dataset. One must make aware of this
We just choose to ommit despite missing 18% in CPC
= na.omit(dataTf) #Df with the NA's omitted
dataTfo
# Dividing by zero create Inf values
$costPerCon <- ifelse(dataTfo$costPerCon == Inf, 0, dataTfo$costPerCon)
dataTfosummary(dataTfo$costPerCon)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 3.072 9.215 16.131 19.773 332.990
# Re-check the correctness of data type and adapt
$xyzCampId=as.factor(dataTfo$xyzCampId)
dataTfo$age=as.factor(dataTfo$age)
dataTfo$gender=as.factor(dataTfo$gender)
dataTfo$interest=as.factor(dataTfo$interest) # too many (non-representative) categories?
dataTfo
$appConv= as.numeric(dataTfo$appConv)
dataTfo$conv= as.numeric(dataTfo$conv)
dataTfo$impr= as.numeric(dataTfo$impr)
dataTfo$Clicks= as.numeric(dataTfo$Clicks)
dataTfo$totConv= as.numeric(dataTfo$totConv)
dataTfo
str(dataTfo)
## 'data.frame': 936 obs. of 20 variables:
## $ ad_id : int 708746 708749 708815 708818 708889 708895 708953 708958 709059 709115 ...
## $ xyzCampId : Factor w/ 3 levels "916","936","1178": 1 1 1 1 1 1 1 1 1 1 ...
## $ fbCampId : int 103916 103917 103928 103928 103940 103941 103951 103952 103968 103978 ...
## $ age : Factor w/ 4 levels "30-34","35-39",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ gender : Factor w/ 2 levels "F","M": 2 2 2 2 2 2 2 2 2 2 ...
## $ interest : Factor w/ 40 levels "2","7","10","15",..: 4 5 16 16 4 5 15 16 8 18 ...
## $ impr : num 7350 17861 4259 4133 15615 ...
## $ Clicks : num 1 2 1 1 3 1 1 3 7 1 ...
## $ Spent : num 1.43 1.82 1.25 1.29 4.77 ...
## $ conv : num 2 2 1 1 1 1 1 1 1 1 ...
## $ appConv : num 1 0 0 1 0 1 0 0 1 0 ...
## $ CTR : num 0.0136 0.0112 0.0235 0.0242 0.0192 0.0091 0.0425 0.0316 0.0477 0.0434 ...
## $ CPC : num 1.43 0.91 1.25 1.29 1.59 1.27 1.5 1.05 1.47 0.57 ...
## $ totConv : num 3 2 1 2 1 2 1 1 2 1 ...
## $ conVal : num 10 10 5 5 5 5 5 5 5 5 ...
## $ appConVal : num 100 0 0 100 0 100 0 0 100 0 ...
## $ totConVal : num 110 10 5 105 5 105 5 5 105 5 ...
## $ costPerCon: num 0.48 0.91 1.25 0.64 4.77 0.63 1.5 3.16 5.14 0.57 ...
## $ ROAS : num 76.92 5.49 4 81.4 1.05 ...
## $ CPM : num 0.19 0.1 0.29 0.31 0.31 0.12 0.64 0.33 0.7 0.25 ...
## - attr(*, "na.action")= 'omit' Named int [1:207] 3 6 11 12 13 14 16 18 21 25 ...
## ..- attr(*, "names")= chr [1:207] "3" "6" "11" "12" ...
# Check distribution of variables (univariate analysis)
library(DataExplorer)
options(repr.plot.width=4, repr.plot.height=4)
plot_bar(dataTfo) #This is a really good tool to plot all bars
We can also check all of the distributions.
plot_histogram(dataTfo)
Now we can treat outliers.
# Treat outliers
attach(dataTfo)
# e.g. for ROAS
boxplot(ROAS)
We want to see the boxplot being like a plot.
We observe that some of the observations are outside of the whiskers, these appear to be outliers. We can identify these with the $out command, see the following.
boxplot(ROAS)$out
## [1] 76.92 81.40 82.68 23.76 34.31 35.23 62.13 12.40 68.63 65.22
## [11] 37.50 97.35 18.32 18.65 19.23 71.92 36.97 77.78 68.63 15.52
## [21] 75.54 66.88 81.40 74.47 31.44 20.15 20.83 66.46 36.21 122.09
## [31] 19.23 19.13 16.67 91.30 33.02 107.14 70.95 35.71 22.83 44.49
## [41] 28.23 16.56 23.97 214.29 62.50 31.72 21.04 34.37 184.21 58.01
## [51] 88.98 23.26 76.09 29.33 70.00 583.33 19.48 106.06 19.89 100.00
## [61] 78.36 57.38 23.18 33.98 14.59 39.62 25.42 145.83 62.50 29.33
## [71] 38.18 60.69 26.58 78.95 73.94 85.37 80.29 76.09 17.77 25.74
## [81] 22.98 64.02 85.37 72.92 38.46 45.26 17.02 22.27 16.56 23.57
## [91] 67.31 34.27 36.97 48.50 70.00 25.75 68.18 25.64 32.31 66.46
## [101] 41.98 132.08 33.33 12.20 69.54 67.31 69.18 13.62 13.92 52.38
## [111] 14.85 21.40 14.10 37.10 15.80 63.25 26.16 19.49 18.42 35.29
## [121] 27.85 13.62 13.32 21.78 24.88 23.03 60.69 40.08 37.77 39.92
## [131] 25.93 18.75 55.78 40.89 35.94 35.47 12.69 70.47 14.46 18.81
## [141] 12.22
We see that all of these above are outliers.
# Here I assign outliers to a vector and remove them
<- boxplot(ROAS
outliers plot=FALSE)$out #PLOT = TRUE will plot the boxplot
,
# remove the rows containing the outliers
<- dataTfo[-which(ROAS %in% outliers),]
dataTfo1 boxplot(dataTfo1$ROAS)
# Alterative ways to treat outliers exist
Naturally there are other ways of identifying outliers, this is just one approach.
Looking at the new boxplot, we see that most observations are within the whiskers (the lines).
Notice, it does not have to be errors in the data and thus it does not necessarily be non representing of the data, thus one may not want to delete them
# Check bivariate distributions or measures of association
# pairs(data) # only numeric
# or one by one
ggplot(dataTfo1, aes(x=xyzCampId, y=ROAS)) + stat_summary(fun.y="mean", geom="bar")
ggplot(dataTfo1, aes(x=age, y=ROAS)) + stat_summary(fun.y="mean", geom="bar")
ggplot(dataTfo1, aes(x=gender, y=ROAS)) + stat_summary(fun.y="mean", geom="bar")
ggplot(dataTfo1, aes(x=interest, y=ROAS)) + stat_summary(fun.y="mean", geom="bar")
ggplot(dataTfo1, aes(Spent, ROAS)) + geom_point() + labs(x = "Amount spent on campaign", y = "ROAS")
options(repr.plot.width=6, repr.plot.height=3)
ggplot(dataTfo1, aes(Spent, totConv)) + geom_point() + labs(x = "Amount spent on campaign", y = "Total Conversions")
# Measure of associations
par(mfrow = c(1, 1))
library(corrplot)
= corrplot(cor(dataTfo1 [, -c(1:6)])) cormatrix
We seee that some variables appear to have a strong correlation with other variables, e.g. CTR and CPM.
We want to see how they interact with ROAS as well, to get an idea of wether they are positive and negative relationship.
NOTICE, this is based on a linear relationship between the variables. Since we are now in non linear models, we cannot use it the same way as when working in the linear scenarios. But we must be aware of variables that appear to have a really high correlation with one and other, hence we may exclude these.
The following is the same, but presented just with actual correlations.
cor(dataTfo1[,-c(1:6)],dataTfo1$ROAS)
## [,1]
## impr -0.12495737
## Clicks -0.19118476
## Spent -0.18289935
## conv 0.09806281
## appConv 0.27926699
## CTR -0.11644181
## CPC -0.08201179
## totConv 0.15227797
## conVal 0.09806281
## appConVal 0.27926699
## totConVal 0.26170129
## costPerCon -0.37974060
## ROAS 1.00000000
## CPM -0.17622449
Other exploration techniques can be implemented. Data pre-processing remains an active area of research because of the huge amount of inconsistent or “dirty” data.
2.5.4.3 4.3 Task 3 - Building different models
Build several predictive models and evaluate their performance. First, consider reflecting about these specific questions:
Q1) Which of the so far introduced models can be applied to explain and predict the return on advertising spent (ROAS)? Later on, construct several models that answer to this question, using a training dataset.
Q2) Which sample should one use to evaluate the model? Furthermore, which model assessment and selection criteria applies in this case? Later on, evaluate the model based on applicable assessment and selection criteria.
Q3) Reflect about how one can improve the model prediction power.
# Splitting the data
library(ISLR)
<- floor(0.75 * nrow(dataTfo1))
smp_size set.seed(123)
<- sample(seq_len(nrow(dataTfo1)), size = smp_size)
train_ind <- dataTfo1[train_ind, ]
trainTfo1 <- dataTfo1[-train_ind, ] testTfo1
2.5.4.3.1 A Generalized Additive Model (GAM) to predict ROAS
The questions that arise are:
- What subset of all the predictors can best explain and predict ROAS?
- What is the form of the relationships between them and ROAS?
Feature selection approaches include:
Generally looking at the variables “one by one,” to understand what features are important and to figure out how they contribute towards solving the problem.
Looking at the correlation matrix: If we are working with a model which assumes a linear relationship between the dependent and the independent variables, corr matrix can help us come up with an initial list of variable importance. However, corr matrix also works as a “rough informative tool” for nonlinear modelling.
Running automatic feature selection algorithms. Functions in R include, among others:
- c1. regsubsets() function in “leaps” library (presented in ISL, p. 244); used to select the best size model that contains a given number of predictors, where best is quantified using Residual Sum of Squares (RSS). Although regsubsets() is based on testing linear models, it works as a “rough” list for nonlinear models.
- c2. step.Gam() function in “gam” library for stepwise selection of variables in GAM models. This is useful when the number of predictors is not very high.
- c3. advanced feature selection methods based on other data-mining techniques, including but not only: random forests, Bayesian Networks, Neural Networks, or other. Notice, that we can use very complex methods of feature selection, and then construct a model that is more transparant, for instance GAMs
2.5.4.3.1.1 c1) Feature selection using regsubsets()
One may exclude ad_id and variables used to calculate our dependent variable ROAS, as they have a deterministic relation with it. The variable “interests” can be preprocessed to reduce the categories to the most representative ones, but for simplicity we keep it as it is. See later it creates some problems.
str(trainTfo1)
## 'data.frame': 596 obs. of 20 variables:
## $ ad_id : int 1121532 1121642 781508 1121798 950068 776861 1121284 1121110 1121132 710480 ...
## $ xyzCampId : Factor w/ 3 levels "916","936","1178": 3 3 2 3 2 2 3 3 3 1 ...
## $ fbCampId : int 144612 144630 116397 144656 123438 115574 144571 144534 144537 104205 ...
## $ age : Factor w/ 4 levels "30-34","35-39",..: 3 4 1 1 1 4 1 1 1 1 ...
## $ gender : Factor w/ 2 levels "F","M": 2 2 1 1 2 2 2 2 2 1 ...
## $ interest : Factor w/ 40 levels "2","7","10","15",..: 18 11 22 12 3 2 25 6 9 5 ...
## $ impr : num 123126 99698 5040 107548 4012 ...
## $ Clicks : num 25 21 1 19 1 6 23 123 12 14 ...
## $ Spent : num 39.73 33.35 1.44 29.31 1.57 ...
## $ conv : num 2 1 1 1 1 1 1 6 4 1 ...
## $ appConv : num 1 0 0 0 0 0 0 2 2 1 ...
## $ CTR : num 0.0203 0.0211 0.0198 0.0177 0.0249 0.0364 0.0131 0.014 0.0104 0.0243 ...
## $ CPC : num 1.59 1.59 1.44 1.54 1.57 1.54 1.77 1.71 1.98 1.29 ...
## $ totConv : num 3 1 1 1 1 1 1 8 6 2 ...
## $ conVal : num 10 5 5 5 5 5 5 30 20 5 ...
## $ appConVal : num 100 0 0 0 0 0 0 200 200 100 ...
## $ totConVal : num 110 5 5 5 5 5 5 230 220 105 ...
## $ costPerCon: num 13.24 33.35 1.44 29.31 1.57 ...
## $ ROAS : num 2.77 0.15 3.47 0.17 3.18 0.54 0.12 1.09 9.27 5.81 ...
## $ CPM : num 0.32 0.33 0.29 0.27 0.39 0.56 0.23 0.24 0.21 0.31 ...
## - attr(*, "na.action")= 'omit' Named int [1:207] 3 6 11 12 13 14 16 18 21 25 ...
## ..- attr(*, "names")= chr [1:207] "3" "6" "11" "12" ...
Now we can make the selection with forward selection.
library(leaps)
= regsubsets(trainTfo1$ROAS ~ ., data = trainTfo1[,-c(1,3, 14:17)], method = "forward", nvmax=14)
reg.fit = summary(reg.fit)
reg.summary reg.summary
## Subset selection object
## Call: regsubsets.formula(trainTfo1$ROAS ~ ., data = trainTfo1[, -c(1,
## 3, 14:17)], method = "forward", nvmax = 14)
## 54 Variables (and intercept)
## Forced in Forced out
## xyzCampId936 FALSE FALSE
## xyzCampId1178 FALSE FALSE
## age35-39 FALSE FALSE
## age40-44 FALSE FALSE
## age45-49 FALSE FALSE
## genderM FALSE FALSE
## interest7 FALSE FALSE
## interest10 FALSE FALSE
## interest15 FALSE FALSE
## interest16 FALSE FALSE
## interest18 FALSE FALSE
## interest19 FALSE FALSE
## interest20 FALSE FALSE
## interest21 FALSE FALSE
## interest22 FALSE FALSE
## interest23 FALSE FALSE
## interest24 FALSE FALSE
## interest25 FALSE FALSE
## interest26 FALSE FALSE
## interest27 FALSE FALSE
## interest28 FALSE FALSE
## interest29 FALSE FALSE
## interest30 FALSE FALSE
## interest31 FALSE FALSE
## interest32 FALSE FALSE
## interest36 FALSE FALSE
## interest63 FALSE FALSE
## interest64 FALSE FALSE
## interest65 FALSE FALSE
## interest66 FALSE FALSE
## interest100 FALSE FALSE
## interest101 FALSE FALSE
## interest102 FALSE FALSE
## interest103 FALSE FALSE
## interest104 FALSE FALSE
## interest105 FALSE FALSE
## interest106 FALSE FALSE
## interest107 FALSE FALSE
## interest108 FALSE FALSE
## interest109 FALSE FALSE
## interest110 FALSE FALSE
## interest111 FALSE FALSE
## interest112 FALSE FALSE
## interest113 FALSE FALSE
## interest114 FALSE FALSE
## impr FALSE FALSE
## Clicks FALSE FALSE
## Spent FALSE FALSE
## conv FALSE FALSE
## appConv FALSE FALSE
## CTR FALSE FALSE
## CPC FALSE FALSE
## costPerCon FALSE FALSE
## CPM FALSE FALSE
## 1 subsets of each size up to 14
## Selection Algorithm: forward
## xyzCampId936 xyzCampId1178 age35-39 age40-44 age45-49 genderM
## 1 ( 1 ) " " " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " "
## 3 ( 1 ) " " " " " " " " " " " "
## 4 ( 1 ) " " " " " " " " " " " "
## 5 ( 1 ) " " " " " " " " " " " "
## 6 ( 1 ) " " " " " " " " " " " "
## 7 ( 1 ) " " " " " " " " " " " "
## 8 ( 1 ) " " " " " " " " " " " "
## 9 ( 1 ) " " " " " " " " " " " "
## 10 ( 1 ) " " " " " " " " "*" " "
## 11 ( 1 ) " " " " " " " " "*" " "
## 12 ( 1 ) " " " " " " " " "*" " "
## 13 ( 1 ) " " " " " " " " "*" " "
## 14 ( 1 ) " " " " " " " " "*" " "
## interest7 interest10 interest15 interest16 interest18 interest19
## 1 ( 1 ) " " " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " "
## 3 ( 1 ) " " " " " " " " " " " "
## 4 ( 1 ) " " " " " " " " " " " "
## 5 ( 1 ) " " " " " " " " " " " "
## 6 ( 1 ) " " " " " " " " " " " "
## 7 ( 1 ) " " " " " " " " " " " "
## 8 ( 1 ) " " " " " " " " " " " "
## 9 ( 1 ) " " " " "*" " " " " " "
## 10 ( 1 ) " " " " "*" " " " " " "
## 11 ( 1 ) " " " " "*" " " "*" " "
## 12 ( 1 ) " " " " "*" " " "*" " "
## 13 ( 1 ) " " " " "*" " " "*" " "
## 14 ( 1 ) " " " " "*" " " "*" " "
## interest20 interest21 interest22 interest23 interest24 interest25
## 1 ( 1 ) " " " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " "
## 3 ( 1 ) " " " " " " " " " " " "
## 4 ( 1 ) " " " " " " " " " " " "
## 5 ( 1 ) " " " " " " " " " " " "
## 6 ( 1 ) " " " " " " " " " " " "
## 7 ( 1 ) " " " " "*" " " " " " "
## 8 ( 1 ) " " " " "*" " " " " " "
## 9 ( 1 ) " " " " "*" " " " " " "
## 10 ( 1 ) " " " " "*" " " " " " "
## 11 ( 1 ) " " " " "*" " " " " " "
## 12 ( 1 ) " " " " "*" " " " " " "
## 13 ( 1 ) " " " " "*" " " " " " "
## 14 ( 1 ) " " " " "*" " " " " " "
## interest26 interest27 interest28 interest29 interest30 interest31
## 1 ( 1 ) " " " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " "
## 3 ( 1 ) " " " " " " " " " " " "
## 4 ( 1 ) " " " " " " " " " " " "
## 5 ( 1 ) " " " " " " " " " " " "
## 6 ( 1 ) " " " " " " " " " " " "
## 7 ( 1 ) " " " " " " " " " " " "
## 8 ( 1 ) " " " " " " " " " " " "
## 9 ( 1 ) " " " " " " " " " " " "
## 10 ( 1 ) " " " " " " " " " " " "
## 11 ( 1 ) " " " " " " " " " " " "
## 12 ( 1 ) " " " " " " " " " " " "
## 13 ( 1 ) " " " " " " " " " " " "
## 14 ( 1 ) " " " " " " " " " " " "
## interest32 interest36 interest63 interest64 interest65 interest66
## 1 ( 1 ) " " " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " "
## 3 ( 1 ) " " " " " " " " " " " "
## 4 ( 1 ) " " " " " " " " " " " "
## 5 ( 1 ) " " " " " " " " " " " "
## 6 ( 1 ) " " " " " " " " " " " "
## 7 ( 1 ) " " " " " " " " " " " "
## 8 ( 1 ) " " "*" " " " " " " " "
## 9 ( 1 ) " " "*" " " " " " " " "
## 10 ( 1 ) " " "*" " " " " " " " "
## 11 ( 1 ) " " "*" " " " " " " " "
## 12 ( 1 ) " " "*" " " " " " " " "
## 13 ( 1 ) " " "*" " " " " " " " "
## 14 ( 1 ) " " "*" " " " " " " " "
## interest100 interest101 interest102 interest103 interest104
## 1 ( 1 ) " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " "
## 3 ( 1 ) " " " " " " " " " "
## 4 ( 1 ) " " " " " " " " " "
## 5 ( 1 ) " " " " " " " " " "
## 6 ( 1 ) " " " " " " " " " "
## 7 ( 1 ) " " " " " " " " " "
## 8 ( 1 ) " " " " " " " " " "
## 9 ( 1 ) " " " " " " " " " "
## 10 ( 1 ) " " " " " " " " " "
## 11 ( 1 ) " " " " " " " " " "
## 12 ( 1 ) " " " " " " " " " "
## 13 ( 1 ) " " " " " " " " " "
## 14 ( 1 ) " " " " " " " " " "
## interest105 interest106 interest107 interest108 interest109
## 1 ( 1 ) " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " "
## 3 ( 1 ) " " " " " " " " " "
## 4 ( 1 ) " " " " " " " " " "
## 5 ( 1 ) " " " " " " " " " "
## 6 ( 1 ) " " " " " " " " " "
## 7 ( 1 ) " " " " " " " " " "
## 8 ( 1 ) " " " " " " " " " "
## 9 ( 1 ) " " " " " " " " " "
## 10 ( 1 ) " " " " " " " " " "
## 11 ( 1 ) " " " " " " " " " "
## 12 ( 1 ) " " " " " " " " " "
## 13 ( 1 ) " " " " " " " " " "
## 14 ( 1 ) " " " " " " " " " "
## interest110 interest111 interest112 interest113 interest114 impr
## 1 ( 1 ) " " " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " "
## 3 ( 1 ) " " " " " " " " " " "*"
## 4 ( 1 ) " " " " " " " " " " "*"
## 5 ( 1 ) " " " " " " " " " " "*"
## 6 ( 1 ) " " " " " " " " " " "*"
## 7 ( 1 ) " " " " " " " " " " "*"
## 8 ( 1 ) " " " " " " " " " " "*"
## 9 ( 1 ) " " " " " " " " " " "*"
## 10 ( 1 ) " " " " " " " " " " "*"
## 11 ( 1 ) " " " " " " " " " " "*"
## 12 ( 1 ) "*" " " " " " " " " "*"
## 13 ( 1 ) "*" " " " " " " " " "*"
## 14 ( 1 ) "*" " " " " "*" " " "*"
## Clicks Spent conv appConv CTR CPC costPerCon CPM
## 1 ( 1 ) " " " " " " " " " " " " "*" " "
## 2 ( 1 ) " " " " " " "*" " " " " "*" " "
## 3 ( 1 ) " " " " " " "*" " " " " "*" " "
## 4 ( 1 ) " " " " "*" "*" " " " " "*" " "
## 5 ( 1 ) " " " " "*" "*" " " " " "*" "*"
## 6 ( 1 ) " " " " "*" "*" "*" " " "*" "*"
## 7 ( 1 ) " " " " "*" "*" "*" " " "*" "*"
## 8 ( 1 ) " " " " "*" "*" "*" " " "*" "*"
## 9 ( 1 ) " " " " "*" "*" "*" " " "*" "*"
## 10 ( 1 ) " " " " "*" "*" "*" " " "*" "*"
## 11 ( 1 ) " " " " "*" "*" "*" " " "*" "*"
## 12 ( 1 ) " " " " "*" "*" "*" " " "*" "*"
## 13 ( 1 ) " " "*" "*" "*" "*" " " "*" "*"
## 14 ( 1 ) " " "*" "*" "*" "*" " " "*" "*"
Recall, an asteriks indicates that a given variable is included in the corresponding model. The default is “1 subsets of each size up to 8,” but using “nvmax” option we can control as many variables as are desired.
Use a prediction error criteria (Adj^R2 BIC, Cp) to select the “best overall” model: names(reg.summary)
In this example \(R^2\) is applied.
$rsq reg.summary
## [1] 0.1596916 0.2163240 0.3276550 0.3437687 0.3584745 0.3676564 0.3766836
## [8] 0.3843347 0.3880903 0.3913334 0.3946034 0.3973916 0.4003809 0.4033252
Rsquare increases monotonically as more variables are included. This is expected; the model containing all predictors will always have the smallest RSS and the largest \(R^2\).
plot(reg.summary$adjr2, xlab = "Number of Variables", ylab = "Adjusted R2", type = "p", ylim = c(0.1, 1), main = " Adj R2 for each model")
grid(col = "darkgrey")
= max(reg.summary$adjr2)
max.adjr2 = sd(reg.summary$adjr2)
std.adjr2 abline(h = max.adjr2 + 0.2 * std.adjr2, col = "red", lty = 2)
abline(h = max.adjr2 - 0.2 * std.adjr2, col = "red", lty = 2)
The plot shows the size 10-11 is the minimum size for the subset for which the scores are within 0.2 standard deviations of optimum.
That is because the \(R^2\) does not significantly change. Therefore, we can just limit ourselves with a more simple model.
We can see how the selection changes if we use the full model
A common approach when dataset is small is to find best 10 variables using entire dataset.
Notice, that some researches will argue that this implies data leakage.
= regsubsets(dataTfo1$ROAS ~ ., data = dataTfo1[,-c(1,3, 14:17)], method = "forward", nvmax=14)
reg.fit summary(reg.fit)
## Subset selection object
## Call: regsubsets.formula(dataTfo1$ROAS ~ ., data = dataTfo1[, -c(1,
## 3, 14:17)], method = "forward", nvmax = 14)
## 54 Variables (and intercept)
## Forced in Forced out
## xyzCampId936 FALSE FALSE
## xyzCampId1178 FALSE FALSE
## age35-39 FALSE FALSE
## age40-44 FALSE FALSE
## age45-49 FALSE FALSE
## genderM FALSE FALSE
## interest7 FALSE FALSE
## interest10 FALSE FALSE
## interest15 FALSE FALSE
## interest16 FALSE FALSE
## interest18 FALSE FALSE
## interest19 FALSE FALSE
## interest20 FALSE FALSE
## interest21 FALSE FALSE
## interest22 FALSE FALSE
## interest23 FALSE FALSE
## interest24 FALSE FALSE
## interest25 FALSE FALSE
## interest26 FALSE FALSE
## interest27 FALSE FALSE
## interest28 FALSE FALSE
## interest29 FALSE FALSE
## interest30 FALSE FALSE
## interest31 FALSE FALSE
## interest32 FALSE FALSE
## interest36 FALSE FALSE
## interest63 FALSE FALSE
## interest64 FALSE FALSE
## interest65 FALSE FALSE
## interest66 FALSE FALSE
## interest100 FALSE FALSE
## interest101 FALSE FALSE
## interest102 FALSE FALSE
## interest103 FALSE FALSE
## interest104 FALSE FALSE
## interest105 FALSE FALSE
## interest106 FALSE FALSE
## interest107 FALSE FALSE
## interest108 FALSE FALSE
## interest109 FALSE FALSE
## interest110 FALSE FALSE
## interest111 FALSE FALSE
## interest112 FALSE FALSE
## interest113 FALSE FALSE
## interest114 FALSE FALSE
## impr FALSE FALSE
## Clicks FALSE FALSE
## Spent FALSE FALSE
## conv FALSE FALSE
## appConv FALSE FALSE
## CTR FALSE FALSE
## CPC FALSE FALSE
## costPerCon FALSE FALSE
## CPM FALSE FALSE
## 1 subsets of each size up to 14
## Selection Algorithm: forward
## xyzCampId936 xyzCampId1178 age35-39 age40-44 age45-49 genderM
## 1 ( 1 ) " " " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " "
## 3 ( 1 ) " " " " " " " " " " " "
## 4 ( 1 ) " " " " " " " " " " " "
## 5 ( 1 ) " " " " " " " " " " " "
## 6 ( 1 ) " " " " " " " " " " " "
## 7 ( 1 ) " " " " " " " " " " " "
## 8 ( 1 ) " " " " " " " " " " " "
## 9 ( 1 ) " " " " " " " " " " " "
## 10 ( 1 ) " " " " " " " " " " " "
## 11 ( 1 ) " " " " " " " " "*" " "
## 12 ( 1 ) " " " " " " " " "*" " "
## 13 ( 1 ) " " " " " " " " "*" " "
## 14 ( 1 ) " " "*" " " " " "*" " "
## interest7 interest10 interest15 interest16 interest18 interest19
## 1 ( 1 ) " " " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " "
## 3 ( 1 ) " " " " " " " " " " " "
## 4 ( 1 ) " " " " " " " " " " " "
## 5 ( 1 ) " " " " " " " " " " " "
## 6 ( 1 ) " " " " " " " " " " " "
## 7 ( 1 ) " " " " " " " " " " " "
## 8 ( 1 ) " " " " " " " " " " " "
## 9 ( 1 ) " " " " " " " " " " " "
## 10 ( 1 ) " " " " "*" " " " " " "
## 11 ( 1 ) " " " " "*" " " " " " "
## 12 ( 1 ) " " " " "*" " " " " " "
## 13 ( 1 ) " " " " "*" " " " " " "
## 14 ( 1 ) " " " " "*" " " " " " "
## interest20 interest21 interest22 interest23 interest24 interest25
## 1 ( 1 ) " " " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " "
## 3 ( 1 ) " " " " " " " " " " " "
## 4 ( 1 ) " " " " " " " " " " " "
## 5 ( 1 ) " " " " " " " " " " " "
## 6 ( 1 ) " " " " " " " " " " " "
## 7 ( 1 ) " " " " "*" " " " " " "
## 8 ( 1 ) " " " " "*" " " " " " "
## 9 ( 1 ) " " " " "*" " " " " " "
## 10 ( 1 ) " " " " "*" " " " " " "
## 11 ( 1 ) " " " " "*" " " " " " "
## 12 ( 1 ) " " " " "*" " " " " " "
## 13 ( 1 ) " " " " "*" " " " " " "
## 14 ( 1 ) " " " " "*" " " " " " "
## interest26 interest27 interest28 interest29 interest30 interest31
## 1 ( 1 ) " " " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " "
## 3 ( 1 ) " " " " " " " " " " " "
## 4 ( 1 ) " " " " " " " " " " " "
## 5 ( 1 ) " " " " " " " " " " " "
## 6 ( 1 ) " " " " " " " " " " " "
## 7 ( 1 ) " " " " " " " " " " " "
## 8 ( 1 ) " " " " " " " " " " " "
## 9 ( 1 ) " " " " " " " " " " " "
## 10 ( 1 ) " " " " " " " " " " " "
## 11 ( 1 ) " " " " " " " " " " " "
## 12 ( 1 ) " " " " " " " " " " " "
## 13 ( 1 ) "*" " " " " " " " " " "
## 14 ( 1 ) "*" " " " " " " " " " "
## interest32 interest36 interest63 interest64 interest65 interest66
## 1 ( 1 ) " " " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " "
## 3 ( 1 ) " " " " " " " " " " " "
## 4 ( 1 ) " " " " " " " " " " " "
## 5 ( 1 ) " " " " " " " " " " " "
## 6 ( 1 ) " " " " " " " " " " " "
## 7 ( 1 ) " " " " " " " " " " " "
## 8 ( 1 ) " " "*" " " " " " " " "
## 9 ( 1 ) " " "*" " " " " " " " "
## 10 ( 1 ) " " "*" " " " " " " " "
## 11 ( 1 ) " " "*" " " " " " " " "
## 12 ( 1 ) " " "*" " " " " " " " "
## 13 ( 1 ) " " "*" " " " " " " " "
## 14 ( 1 ) " " "*" " " " " " " " "
## interest100 interest101 interest102 interest103 interest104
## 1 ( 1 ) " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " "
## 3 ( 1 ) " " " " " " " " " "
## 4 ( 1 ) " " " " " " " " " "
## 5 ( 1 ) " " " " " " " " " "
## 6 ( 1 ) " " " " " " " " " "
## 7 ( 1 ) " " " " " " " " " "
## 8 ( 1 ) " " " " " " " " " "
## 9 ( 1 ) " " " " " " " " " "
## 10 ( 1 ) " " " " " " " " " "
## 11 ( 1 ) " " " " " " " " " "
## 12 ( 1 ) " " " " " " " " " "
## 13 ( 1 ) " " " " " " " " " "
## 14 ( 1 ) " " " " " " " " " "
## interest105 interest106 interest107 interest108 interest109
## 1 ( 1 ) " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " "
## 3 ( 1 ) " " " " " " " " " "
## 4 ( 1 ) " " " " " " " " " "
## 5 ( 1 ) " " " " " " " " " "
## 6 ( 1 ) " " " " " " " " " "
## 7 ( 1 ) " " " " " " " " " "
## 8 ( 1 ) " " " " " " " " " "
## 9 ( 1 ) " " " " " " " " " "
## 10 ( 1 ) " " " " " " " " " "
## 11 ( 1 ) " " " " " " " " " "
## 12 ( 1 ) " " " " " " " " " "
## 13 ( 1 ) " " " " " " " " " "
## 14 ( 1 ) " " " " " " " " " "
## interest110 interest111 interest112 interest113 interest114 impr
## 1 ( 1 ) " " " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " "
## 3 ( 1 ) " " " " " " " " " " "*"
## 4 ( 1 ) " " " " " " " " " " "*"
## 5 ( 1 ) " " " " " " " " " " "*"
## 6 ( 1 ) " " " " " " " " " " "*"
## 7 ( 1 ) " " " " " " " " " " "*"
## 8 ( 1 ) " " " " " " " " " " "*"
## 9 ( 1 ) " " " " " " "*" " " "*"
## 10 ( 1 ) " " " " " " "*" " " "*"
## 11 ( 1 ) " " " " " " "*" " " "*"
## 12 ( 1 ) " " " " " " "*" " " "*"
## 13 ( 1 ) " " " " " " "*" " " "*"
## 14 ( 1 ) " " " " " " "*" " " "*"
## Clicks Spent conv appConv CTR CPC costPerCon CPM
## 1 ( 1 ) " " " " " " " " " " " " "*" " "
## 2 ( 1 ) " " " " " " "*" " " " " "*" " "
## 3 ( 1 ) " " " " " " "*" " " " " "*" " "
## 4 ( 1 ) " " " " " " "*" " " " " "*" "*"
## 5 ( 1 ) " " " " " " "*" "*" " " "*" "*"
## 6 ( 1 ) " " " " "*" "*" "*" " " "*" "*"
## 7 ( 1 ) " " " " "*" "*" "*" " " "*" "*"
## 8 ( 1 ) " " " " "*" "*" "*" " " "*" "*"
## 9 ( 1 ) " " " " "*" "*" "*" " " "*" "*"
## 10 ( 1 ) " " " " "*" "*" "*" " " "*" "*"
## 11 ( 1 ) " " " " "*" "*" "*" " " "*" "*"
## 12 ( 1 ) " " "*" "*" "*" "*" " " "*" "*"
## 13 ( 1 ) " " "*" "*" "*" "*" " " "*" "*"
## 14 ( 1 ) " " "*" "*" "*" "*" " " "*" "*"
We want to select 10 variables.
= coef(reg.fit
coefi id=10) #Selection 10 variables
, names(coefi) # most important variables based on regsubsets() feature selection
## [1] "(Intercept)" "interest15" "interest22" "interest36" "interest113"
## [6] "impr" "conv" "appConv" "CTR" "costPerCon"
## [11] "CPM"
2.5.4.3.1.1.1 1) I run a linear regression using the selected features, to have a benchmark
In general, one should start with a simple model to see graudally advance, instead of jumping directly to complex models.
<-lm(ROAS ~ interest + impr + conv + appConv + CTR + costPerCon + CPM, data = trainTfo1)
model1summary(model1) # summary.aov(model1) #Alternatice
##
## Call:
## lm(formula = ROAS ~ interest + impr + conv + appConv + CTR +
## costPerCon + CPM, data = trainTfo1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.1662 -1.2349 -0.4432 0.7776 8.6722
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.3763879089 0.7060735443 4.782 0.0000022324 ***
## interest7 -0.0583638167 0.8693104092 -0.067 0.946496
## interest10 0.3734569279 0.7214261646 0.518 0.604900
## interest15 1.2644513305 0.8019953469 1.577 0.115455
## interest16 0.3374435422 0.7113702609 0.474 0.635434
## interest18 1.1057788779 0.7863738142 1.406 0.160237
## interest19 0.9323670080 0.8226657597 1.133 0.257562
## interest20 0.5455427694 0.7734543762 0.705 0.480902
## interest21 0.4776314482 0.8140258838 0.587 0.557611
## interest22 -0.9810753002 0.8151526846 -1.204 0.229282
## interest23 -0.3833780502 0.8244934737 -0.465 0.642125
## interest24 0.4104935167 0.8587144041 0.478 0.632817
## interest25 0.0810874657 0.8224831441 0.099 0.921501
## interest26 -0.3607817261 0.8259069717 -0.437 0.662405
## interest27 0.5381079815 0.7408381010 0.726 0.467933
## interest28 0.0740605292 0.7759433375 0.095 0.923996
## interest29 0.1688327587 0.7394183904 0.228 0.819473
## interest30 0.2578505667 0.9459553954 0.273 0.785277
## interest31 1.1640980299 1.0164662116 1.145 0.252608
## interest32 0.0403419742 0.8559559577 0.047 0.962426
## interest36 2.1016082503 0.9016484081 2.331 0.020122 *
## interest63 0.5435879396 0.7789099391 0.698 0.485545
## interest64 0.6639079122 0.7722974638 0.860 0.390355
## interest65 0.2431197639 0.9014227709 0.270 0.787487
## interest66 0.8330235765 1.0160487256 0.820 0.412648
## interest100 0.8516465674 1.2281375647 0.693 0.488323
## interest101 -0.0641588098 1.3666934608 -0.047 0.962574
## interest102 0.9276318226 1.3630425863 0.681 0.496437
## interest103 0.8208051392 1.2313704057 0.667 0.505321
## interest104 0.7144614519 1.4011137921 0.510 0.610309
## interest105 -0.0104297991 1.1441788217 -0.009 0.992730
## interest106 -0.4410980827 1.1341681728 -0.389 0.697488
## interest107 0.8812751901 1.0344524204 0.852 0.394627
## interest108 0.8096481829 1.1383717983 0.711 0.477241
## interest109 0.2773726276 1.1440075722 0.242 0.808516
## interest110 1.6981730219 1.0731149157 1.582 0.114117
## interest111 0.8419734890 1.2354163562 0.682 0.495823
## interest112 -0.8046055498 1.2397091379 -0.649 0.516591
## interest113 1.5852227480 1.1413885504 1.389 0.165439
## interest114 -0.9363468962 1.2231230818 -0.766 0.444280
## impr -0.0000030933 0.0000005615 -5.509 0.0000000556 ***
## conv -0.1835589139 0.0474635767 -3.867 0.000123 ***
## appConv 1.1139191669 0.0950624431 11.718 < 2e-16 ***
## CTR 109.1401641334 31.7080460628 3.442 0.000621 ***
## costPerCon -0.0195060636 0.0047454249 -4.110 0.0000455039 ***
## CPM -10.0887785837 2.3882226628 -4.224 0.0000280419 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.058 on 550 degrees of freedom
## Multiple R-squared: 0.413, Adjusted R-squared: 0.3649
## F-statistic: 8.598 on 45 and 550 DF, p-value: < 2.2e-16
AIC(model1)
## [1] 2597.989
The model diagnostics
par(mfrow = c(2, 2))
plot(model1)
Notice, that the top right is the theoretical distribution (or quantiles) of a normal distribution. We see that the is a wave in the residuals, implying that there may be some non linearity.
Further inspecting the residuals
par(mfrow = c(1,1))
hist(model1$residuals)
The residuals look quite well
2.5.4.3.1.1.2 2) Linked to the topic of this lecture, run a non-liner model, such as the Generalized Additive Model (GAM) with smoothing splines.**
We have two available approaches at hand:
- Option 1: using “gam” function from “gam” package
- Option 2: using “gam” function from “mgcv” package
- Recall mgcv package and gam package may interact. Uncall one of them if you get an error.
library('mgcv')
set.seed(123)
<- mgcv::gam(ROAS ~ interest + s(impr) + s(conv) + s(appConv) + s(CTR) + s(costPerCon) + s(CPM), data = trainTfo1) # "gam" function from "mgcv" package estimates the smoothness automatically based on cross-validation (LOOCV, the default) or REML (Restricted Maximum Likelihood)
gam.mgcv summary(gam.mgcv)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## ROAS ~ interest + s(impr) + s(conv) + s(appConv) + s(CTR) + s(costPerCon) +
## s(CPM)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.17980 0.43775 4.980 0.000000869 ***
## interest7 0.24123 0.57911 0.417 0.6772
## interest10 0.18304 0.48420 0.378 0.7056
## interest15 0.38606 0.53880 0.717 0.4740
## interest16 0.11986 0.47583 0.252 0.8012
## interest18 -0.09142 0.52458 -0.174 0.8617
## interest19 0.82120 0.54766 1.499 0.1344
## interest20 0.13610 0.51544 0.264 0.7918
## interest21 0.02640 0.53909 0.049 0.9610
## interest22 0.09319 0.53946 0.173 0.8629
## interest23 -0.56172 0.54856 -1.024 0.3063
## interest24 -0.08992 0.57528 -0.156 0.8759
## interest25 -0.04603 0.55372 -0.083 0.9338
## interest26 -0.30219 0.55505 -0.544 0.5864
## interest27 0.51573 0.49750 1.037 0.3004
## interest28 0.12121 0.51861 0.234 0.8153
## interest29 -0.03154 0.49268 -0.064 0.9490
## interest30 -0.40478 0.62268 -0.650 0.5159
## interest31 0.83810 0.66915 1.252 0.2110
## interest32 0.26961 0.56769 0.475 0.6350
## interest36 1.12968 0.59284 1.906 0.0573 .
## interest63 -0.04652 0.51940 -0.090 0.9287
## interest64 0.41739 0.51161 0.816 0.4150
## interest65 0.03027 0.59451 0.051 0.9594
## interest66 0.52158 0.66676 0.782 0.4344
## interest100 -0.18099 0.82344 -0.220 0.8261
## interest101 0.13743 0.90918 0.151 0.8799
## interest102 -0.04860 0.90212 -0.054 0.9571
## interest103 0.47957 0.81182 0.591 0.5550
## interest104 -1.30671 0.94229 -1.387 0.1661
## interest105 1.19263 0.76381 1.561 0.1190
## interest106 0.24244 0.74505 0.325 0.7450
## interest107 0.32292 0.70132 0.460 0.6454
## interest108 -0.63161 0.75881 -0.832 0.4056
## interest109 -0.29230 0.76418 -0.383 0.7022
## interest110 0.53685 0.71649 0.749 0.4540
## interest111 0.68357 0.83983 0.814 0.4161
## interest112 0.36783 0.88298 0.417 0.6772
## interest113 0.49405 0.76593 0.645 0.5192
## interest114 -0.50182 0.80131 -0.626 0.5314
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(impr) 6.429 7.371 15.745 < 2e-16 ***
## s(conv) 6.220 7.047 3.453 0.000984 ***
## s(appConv) 7.191 7.717 84.077 < 2e-16 ***
## s(CTR) 5.281 6.340 2.375 0.026540 *
## s(costPerCon) 7.980 8.690 28.832 < 2e-16 ***
## s(CPM) 3.160 4.015 4.666 0.001024 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.735 Deviance explained = 76.9%
## GCV = 2.027 Scale est. = 1.7676 n = 596
We see from the approximate significance of the smooth terms, that the smoothed variables appear to be siginificant, hence the smoothing splines are justified. We can also see the edf (estimated degrees of freedom), for each of the variables.
The first parametric coefficients show how the varibles perform compared to the reference (baseline), we see that there are different p-values, where it looks as if interest 36 is the only significant, although we must remember that this is in reference to the baseline, hence it does not imply that the categories with p > 5% is insignificant. If we look further into the variable, we see that some of the categories have very few variables, hence we can bin the categories together. E.g.,
table(trainTfo1$interest)
##
## 2 7 10 15 16 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
## 10 13 48 21 60 23 17 25 18 19 17 14 18 17 37 26 38 9 7 14
## 36 63 64 65 66 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114
## 11 24 25 11 7 4 3 3 4 3 5 5 7 5 5 6 4 4 5 4
E.g. interest 114 have only four obsevationbs, we can aggregate groups, e.g., by looking at the means of the groups.
aggregate(trainTfo1$ROAS,by = list(trainTfo1$interest),mean)
Group.1 | x |
---|---|
2 | 2.3410000 |
7 | 2.2507692 |
10 | 2.2810417 |
15 | 3.1423810 |
16 | 2.3663333 |
18 | 2.8282609 |
19 | 2.8247059 |
20 | 2.9944000 |
21 | 2.6822222 |
22 | 0.5963158 |
23 | 1.4623529 |
24 | 2.2957143 |
25 | 1.8588889 |
26 | 1.3188235 |
27 | 1.5972973 |
28 | 1.6423077 |
29 | 2.4734211 |
30 | 2.4400000 |
31 | 3.5300000 |
32 | 2.0764286 |
36 | 4.4445455 |
63 | 2.2858333 |
64 | 2.6624000 |
65 | 2.4272727 |
66 | 2.8700000 |
100 | 3.8350000 |
101 | 1.9500000 |
102 | 3.3900000 |
103 | 1.7875000 |
104 | 2.3033333 |
105 | 0.9300000 |
106 | 1.1320000 |
107 | 3.1528571 |
108 | 3.2280000 |
109 | 1.1380000 |
110 | 4.3983333 |
111 | 3.6075000 |
112 | 1.9150000 |
113 | 3.7440000 |
114 | 1.0275000 |
We see that they are close to each other, where one could combine groups based on this.
For which variables, if any, is there evidence of a non-linear relationship with the response? in the output, the edf (estimated degrees of freedom) column allow us to interpret the degree of nonlinearity for each predictor.
# contrasts(trainTfo1$interest)
AIC(gam.mgcv) # decreases compared to linear model
## [1] 2103.794
Other checks
One should always check the residuals
par(mfrow = c(1,1))
plot(gam.mgcv$residuals)
qqnorm(gam.mgcv$residuals)
qqline(gam.mgcv$residuals)
hist((gam.mgcv$residuals))
interpreting the partial effects of the smooth terms visually using plot(gam_model) (see p. 284 ISL).
par(mfrow = c(2, 2))
plot(gam.mgcv, se = T, col = "blue", scheme=1,unconditional = TRUE, residuals = TRUE)
Holding all the other variables in the model fixed, return on advertising spent tends to decrease with increasing impr (the number of times the ad was shown). holding all the other variables in the model fixed, total people signing up to the webshop (conv) has no significant effect on ROAS. Holding all the other variables in the model fixed, total people purchasing the product (appConv) and click-through-rate (CTR) have a significant positive nonlinear effect on ROAS. However, the CI in some areas is also very large due to very few observations, thus more data is required to evaluate the results. Finally cost-per-click (CPM = Spent/Impr *1000) has a decreasing nonlinear effect on ROAS, but the effect is very uncertain given CI become very large when CPM >1. Overall, the relationships found are intuitive but the effec ts are not strong.
# Evaluating test MSE using using k-fold cross-validation
set.seed(1)
=5
k= sample(1:k, nrow(trainTfo1), replace = TRUE)
folds = matrix(NA, k)
gam.err = matrix(NA, k)
gam.tss = matrix(NA, k)
cv.r2
for (j in 1:k) {
= mgcv::gam(ROAS ~ s(impr) + s(conv) + s(appConv) + s(CTR) + s(costPerCon) + s(CPM), data = trainTfo1[folds!=j,]) # folds !=j returns the in-sample data
gam.mgcv = predict(gam.mgcv, trainTfo1[folds==j,])
pred = mean((trainTfo1$ROAS[folds==j] - pred)^2)
gam.err[j] = mean((trainTfo1$ROAS[folds==j] - mean(trainTfo1$ROAS[folds==j]))^2)
gam.tss[j] = 1 - gam.err[j]/gam.tss[j]
cv.r2[j]
cv.r2[j]
}
cv.r2
## [,1]
## [1,] 0.6894129
## [2,] 0.7195426
## [3,] 0.7299054
## [4,] 0.7580761
## [5,] 0.5879333
We see the different R squares, this we are now going to take the mean of.
=apply (cv.r2, 2, mean)
mean.cv.R2 mean.cv.R2
## [1] 0.6969741
The mean is 0.697. Now we can compare the model with the test partition.
Evaluating the model test MSE and R^2 using the test dataset
= predict(gam.mgcv, testTfo1)
gam.pred = mean((testTfo1$ROAS - gam.pred)^2)
gam.err = mean((testTfo1$ROAS - mean(testTfo1$ROAS))^2)
gam.tss = 1 - gam.err/gam.tss
test.r2 # the % of explained variance (R^2 in the data test) test.r2
## [1] 0.6703216
We see that R2 is similar, hence the model does not appear to have high optimisim.
Accuracy
Accuracy: correlation Actual vs. Predicted values
<- data.frame(cbind(actuals=testTfo1$ROAS,predicteds=gam.pred))
actuals_preds <- cor(actuals_preds)
correlation_accuracy correlation_accuracy
## actuals predicteds
## actuals 1.0000000 0.8305841
## predicteds 0.8305841 1.0000000
General discussion: Although the fitted GAM yields to a decent R^2 on the test set and cross-validation, it could further be evaluated in comparison with the fit of other models. It is also possible to improve the model by including additional variables. Product advertised or Price may improve the model fit. It is important to consider all possible input variables and make a good selection early in the analytical process.
2.5.4.3.1.2 c2) Feature selection using step.GAM
Notice, that this is just another approach.
in package gam, the analysist has to search for the effective df. One can create a scope list with possible splines for each continous variable. Below, we considered splines up to maximum 10 degrees of freedom.
Notice, that the categorical variables are just included as raw data, as we can’t take the polynomial of this.
library(gam)
#Making a list with the different models
= list("xyzCampId" = ~1 + xyzCampId,
scope_list "age" = ~1 + age,
"gender" = ~1 + gender,
"interest" = ~1 + interest,
"impr" = ~1 + impr + s(impr, df=2) + s(impr, df=3) + s(impr, df =4) + s(impr, df=5) + s(impr, df=6) + s(impr, df=7) + s(impr, df=8) + s(impr, df=9) + s(impr, df=10),
"Clicks" = ~1+Clicks+s(Clicks, df=2)+s(Clicks, df=3)+s(Clicks, df=4)+s(Clicks, df=5)+s(Clicks, df=6)+s(Clicks, df=7)+s(Clicks, df=8)+s(Clicks, df=9)+s(Clicks, df=10),
"Spent" = ~1 + Spent + s(Spent, df=2) + s(Spent, df=3) + s(Spent, df=4) + s(Spent, df=5) + s(Spent, df=6) + s(Spent, df=7) + s(Spent, df=8) + s(Spent, df=9) + s(Spent, df=10),
"conv" = ~1 + conv + s(conv, df=2) + s(conv, df=3) + s(conv, df=4) + s(conv, df=5) + s(conv, df=6) + s(conv, df=7) + s(conv, df=8) + s(conv, df=9) + s(conv, df=10),
"appConv" = ~1+appConv+s(appConv,df=2)+s(appConv,df=3)+s(appConv,df=4)+s(appConv,df=5) + s(appConv,df=6) + s(appConv,df=7) + s(appConv,df=8) + s(appConv,df=9) + s(appConv,df=10),
"CTR" = ~1 + CTR + s(CTR, df=2) + s(CTR, df=3) + s(CTR, df=4) + s(CTR, df=5) + s(CTR, df=6) + s(CTR, df=7) + s(CTR, df=8) + s(CTR, df=9) + s(CTR, df=10),
"CPC" = ~1 + CPC + s(CPC, df=2) + s(CPC, df=3) + s(CPC, df=4) + s(CPC, df=5) + s(CPC, df=6) + s(CPC, df=7) + s(CPC, df=8) + s(CPC, df=9) + s(CPC, df=10),
"CPM" = ~1 + CPM + s(CPM, df=2) + s(CPM, df=3) + s(CPM, df=4) + s(CPM, df=5) + s(CPM, df=6) + s(CPM, df=7) + s(CPM, df=8) + s(CPM, df=9) + s(CPM, df=10))
# Define the starting model
<- gam::gam(ROAS ~ xyzCampId + age + gender + interest + impr + Clicks + Spent + conv + appConv + CTR + CPC + CPM, data = trainTfo1) #Note, this is just linear
gam.start
# Finally, search among different models
# Remove "tick / check mark" from the package mgcv if you get an error next
::step.Gam(gam.start,scope = scope_list) gam
## Start: ROAS ~ xyzCampId + age + gender + interest + impr + Clicks + Spent + conv + appConv + CTR + CPC + CPM; AIC= 2611.094
## Step:1 ROAS ~ xyzCampId + age + gender + interest + impr + Clicks + Spent + conv + s(appConv, df = 2) + CTR + CPC + CPM ; AIC= 2541.326
## Step:2 ROAS ~ xyzCampId + age + gender + interest + impr + Clicks + Spent + conv + s(appConv, df = 3) + CTR + CPC + CPM ; AIC= 2487.485
## Step:3 ROAS ~ xyzCampId + age + gender + impr + Clicks + Spent + conv + s(appConv, df = 3) + CTR + CPC + CPM ; AIC= 2449.452
## Step:4 ROAS ~ xyzCampId + age + gender + impr + s(Clicks, df = 2) + Spent + conv + s(appConv, df = 3) + CTR + CPC + CPM ; AIC= 2409.099
## Step:5 ROAS ~ xyzCampId + age + gender + impr + s(Clicks, df = 2) + Spent + conv + s(appConv, df = 4) + CTR + CPC + CPM ; AIC= 2368.959
## Step:6 ROAS ~ xyzCampId + age + gender + impr + s(Clicks, df = 3) + Spent + conv + s(appConv, df = 4) + CTR + CPC + CPM ; AIC= 2347.551
## Step:7 ROAS ~ xyzCampId + age + gender + impr + s(Clicks, df = 3) + Spent + conv + s(appConv, df = 4) + CTR + s(CPC, df = 2) + CPM ; AIC= 2324.868
## Step:8 ROAS ~ xyzCampId + age + gender + impr + s(Clicks, df = 3) + Spent + conv + s(appConv, df = 5) + CTR + s(CPC, df = 2) + CPM ; AIC= 2302.618
## Step:9 ROAS ~ xyzCampId + age + gender + impr + s(Clicks, df = 4) + Spent + conv + s(appConv, df = 5) + CTR + s(CPC, df = 2) + CPM ; AIC= 2290.11
## Step:10 ROAS ~ xyzCampId + age + gender + impr + s(Clicks, df = 4) + Spent + conv + s(appConv, df = 6) + CTR + s(CPC, df = 2) + CPM ; AIC= 2279.844
## Step:11 ROAS ~ xyzCampId + age + gender + impr + s(Clicks, df = 4) + Spent + conv + s(appConv, df = 6) + CTR + s(CPC, df = 3) + CPM ; AIC= 2272.156
## Step:12 ROAS ~ xyzCampId + age + gender + impr + s(Clicks, df = 5) + Spent + conv + s(appConv, df = 6) + CTR + s(CPC, df = 3) + CPM ; AIC= 2265.412
## Step:13 ROAS ~ xyzCampId + age + gender + impr + s(Clicks, df = 5) + Spent + conv + s(appConv, df = 7) + CTR + s(CPC, df = 3) + CPM ; AIC= 2261.135
## Step:14 ROAS ~ xyzCampId + age + gender + impr + s(Clicks, df = 5) + Spent + conv + s(appConv, df = 7) + s(CTR, df = 2) + s(CPC, df = 3) + CPM ; AIC= 2256.847
## Step:15 ROAS ~ xyzCampId + gender + impr + s(Clicks, df = 5) + Spent + conv + s(appConv, df = 7) + s(CTR, df = 2) + s(CPC, df = 3) + CPM ; AIC= 2253.772
## Step:16 ROAS ~ xyzCampId + gender + impr + s(Clicks, df = 6) + Spent + conv + s(appConv, df = 7) + s(CTR, df = 2) + s(CPC, df = 3) + CPM ; AIC= 2250.892
## Step:17 ROAS ~ xyzCampId + impr + s(Clicks, df = 6) + Spent + conv + s(appConv, df = 7) + s(CTR, df = 2) + s(CPC, df = 3) + CPM ; AIC= 2249.067
## Step:18 ROAS ~ xyzCampId + impr + s(Clicks, df = 6) + Spent + s(appConv, df = 7) + s(CTR, df = 2) + s(CPC, df = 3) + CPM ; AIC= 2247.226
## Step:19 ROAS ~ xyzCampId + impr + s(Clicks, df = 6) + s(appConv, df = 7) + s(CTR, df = 2) + s(CPC, df = 3) + CPM ; AIC= 2245.43
## Step:20 ROAS ~ impr + s(Clicks, df = 6) + s(appConv, df = 7) + s(CTR, df = 2) + s(CPC, df = 3) + CPM ; AIC= 2243.744
## Step:21 ROAS ~ impr + s(Clicks, df = 6) + s(appConv, df = 8) + s(CTR, df = 2) + s(CPC, df = 3) + CPM ; AIC= 2242.46
## Step:22 ROAS ~ impr + s(Clicks, df = 7) + s(appConv, df = 8) + s(CTR, df = 2) + s(CPC, df = 3) + CPM ; AIC= 2241.376
## Step:23 ROAS ~ impr + s(Clicks, df = 7) + s(appConv, df = 8) + s(CTR, df = 2) + s(CPC, df = 4) + CPM ; AIC= 2241.011
## Step:24 ROAS ~ impr + s(Clicks, df = 7) + s(appConv, df = 8) + s(CTR, df = 3) + s(CPC, df = 4) + CPM ; AIC= 2240.781
## Step:25 ROAS ~ impr + s(Clicks, df = 8) + s(appConv, df = 8) + s(CTR, df = 3) + s(CPC, df = 4) + CPM ; AIC= 2240.731
## Call:
## gam::gam(formula = ROAS ~ impr + s(Clicks, df = 8) + s(appConv,
## df = 8) + s(CTR, df = 3) + s(CPC, df = 4) + CPM, data = trainTfo1,
## trace = FALSE)
##
## Degrees of Freedom: 595 total; 570 Residual
## Residual Deviance: 1368.37
The last model is the best model, hence
gam::gam(formula = ROAS ~ impr + s(Clicks, df = 5) + s(appConv,df = 5) + s(CTR, df = 3) + s(CPC, df = 4) + CPM, data = trainTfo1,trace = FALSE)
- AIC is used internally to pick the best model.
- This is reported in the function output at every step.
- The “best” model meaning it has the lowest AIC in the space it has searched.
- In this case, the best model was:
- ROAS ~ impr + s(Clicks, df = 5) + s(appConv,df = 5) + s(CTR, df = 3) + s(CPC, df = 4) + CPM.
Next, run the recommended model and evaluate the AIC. Is it better than the previous ones?
library(gam)
#The model found in gam.step()
<- gam::gam(ROAS ~ impr + s(Clicks, df = 5) + s(appConv,df = 5) + s(CTR, df = 3) + s(CPC, df = 4) + CPM, data = trainTfo1)
gam.step
summary(gam.step)
##
## Call: gam::gam(formula = ROAS ~ impr + s(Clicks, df = 5) + s(appConv,
## df = 5) + s(CTR, df = 3) + s(CPC, df = 4) + CPM, data = trainTfo1)
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.1065 -0.9782 -0.4980 0.6788 6.5816
##
## (Dispersion Parameter for gaussian family taken to be 2.5114)
##
## Null Deviance: 3969.325 on 595 degrees of freedom
## Residual Deviance: 1446.543 on 575.9999 degrees of freedom
## AIC: 2261.843
##
## Number of Local Scoring Iterations: NA
##
## Anova for Parametric Effects
## Df Sum Sq Mean Sq F value Pr(>F)
## impr 1 77.85 77.85 30.9991 0.00000003963 ***
## s(Clicks, df = 5) 1 315.38 315.38 125.5807 < 2.2e-16 ***
## s(appConv, df = 5) 1 958.24 958.24 381.5628 < 2.2e-16 ***
## s(CTR, df = 3) 1 3.49 3.49 1.3895 0.23897
## s(CPC, df = 4) 1 70.54 70.54 28.0879 0.00000016536 ***
## CPM 1 15.41 15.41 6.1352 0.01354 *
## Residuals 576 1446.54 2.51
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Anova for Nonparametric Effects
## Npar Df Npar F Pr(F)
## (Intercept)
## impr
## s(Clicks, df = 5) 4 35.223 < 2.2e-16 ***
## s(appConv, df = 5) 4 99.433 < 2.2e-16 ***
## s(CTR, df = 3) 2 5.207 0.005736 **
## s(CPC, df = 4) 3 14.083 0.000000007189 ***
## CPM
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
par(mfrow = c(2, 3))
plot(gam.step, se = T, col = "blue")
AIC(gam.step)
## [1] 2261.843
Compare AIC values for all fitted models
AIC(model1, gam.mgcv, gam.step)
df | AIC | |
---|---|---|
model1 | 47.00000 | 2597.989 |
gam.mgcv | 39.72367 | 1640.318 |
gam.step | 8.00000 | 2261.843 |
We see that the second approach appear to be better than the step model.
2.5.4.3.1.3 c3) Fetaure selection using random forest
Advanced feature selection methods is an active domain of research. Modern techniques include, among others, using random forests to find a set of predictors that best explains the variance in the response variable (Random forests models is covered in Ch.8). For a review of all methods, consider this suplementary readings: A. Jović, K. Brkić and N. Bogunović “A review of feature selection methods with applications.”
Feature selection using Random Forest (RF)
RF is covered in-depth in the next lectures
library(party)
<- cforest(trainTfo1$ROAS ~ . , data = trainTfo1[,-c(1, 3, 14:17)]
cf1 control=cforest_unbiased(mtry=2,ntree=100))
,
#We sort the variables for where importance is decreasing
sort(varimp(cf1), decreasing = TRUE)
## appConv costPerCon Clicks Spent impr conv
## 4.982585204 2.651832080 1.193535977 1.028742086 0.664008402 0.269207329
## CPM xyzCampId CTR age CPC interest
## 0.248213011 0.225695612 0.136556541 0.119650200 0.089944044 0.042268625
## gender
## 0.009347033
Based on the estimated variable importance, select the most relevant ones. Different feature selection methods, yield to slighly different variables, but majority coincide.
library(mgcv)
<- mgcv::gam(ROAS ~ s(appConv) + s(costPerCon) + s(Spent) + s(Clicks) + s(impr) + s(conv) + s(CPM) + xyzCampId + s(CTR), data = trainTfo1)
gam.rforest summary(gam.rforest)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## ROAS ~ s(appConv) + s(costPerCon) + s(Spent) + s(Clicks) + s(impr) +
## s(conv) + s(CPM) + xyzCampId + s(CTR)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.6312 0.3864 6.809 2.57e-11 ***
## xyzCampId936 -0.4511 0.3706 -1.217 0.224
## xyzCampId1178 -0.2580 0.4153 -0.621 0.535
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(appConv) 7.475 7.871 90.791 < 2e-16 ***
## s(costPerCon) 8.052 8.727 29.497 < 2e-16 ***
## s(Spent) 1.000 1.000 0.368 0.544537
## s(Clicks) 3.372 4.335 1.748 0.127220
## s(impr) 5.979 6.988 2.725 0.008853 **
## s(conv) 6.429 7.214 4.220 0.000236 ***
## s(CPM) 3.506 4.434 4.058 0.002305 **
## s(CTR) 5.515 6.594 2.771 0.008908 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.74 Deviance explained = 75.9%
## GCV = 1.8726 Scale est. = 1.7334 n = 596
discuss sig.
AIC(gam.rforest)
## [1] 2063.802
Compare AIC values for all fitted models
AIC(model1, gam.mgcv, gam.step, gam.rforest)
df | AIC | |
---|---|---|
model1 | 47.00000 | 2597.989 |
gam.mgcv | 39.72367 | 1640.318 |
gam.step | 8.00000 | 2261.843 |
gam.rforest | 45.32798 | 2063.802 |
Evaluate the gam.rforest using k-fold cv-error, MSE and R^2 as done before with gam.mgcv.
Conclude about the best model.