6.5 Exercises
This section contains several exercises covering:
- Decomposing and forecasting without cycles
- Decomposing and forecasting with cycles
- Time series regression with other variables
- Fitting a trend variable
- Decomposing using OLS, decompose() and CMA
6.5.1 Decomposition + Forecast Exercise
We use the Alomega Foodstore sales data to show the following:
- How to make a decomposition of a time series using decompose()
- How to use decompose() to retrieve the different components and store them
- How to interpret the components
- How to forecast using the components
library(readxl)   #needed for read_excel()
library(forecast) #needed for tsdisplay() and the forecasting functions below
y <- read_excel("Data/Week47/Alomegafoodstores.xlsx")
y <- y$Sales
y <- ts(data = y          #The data
        ,frequency = 12   #Data is monthly
        ,end = c(2006,12))
options(scipen = 999)
tsdisplay(y)
It appears as if we have seasonality, but whether there is a trend is difficult to say.
The data appears to be additive, as the variance looks more or less constant over time.
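One rough way to back this visual judgement (an illustrative addition, not part of the original exercise) is to compare the standard deviation of the series year by year; a roughly constant level supports the additive assumption:
#Illustrative check: yearly standard deviation of the series
#(a roughly constant spread suggests an additive composition)
yearly.sd <- tapply(y, floor(time(y)), sd)
plot(yearly.sd, type = "b", main = "Yearly standard deviation")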
6.5.1.1 a. Decomposing the time series
Now we can decompose the series using decompose(), where moving averages are applied to deseasonalize the data.
But just for the sake of it, the multiplicative decomposition is made as well.
y.decom.add <- decompose(x = y                #The ts
                         ,type = "additive")   #The composition
y.decom.mult <- decompose(x = y                     #The ts
                          ,type = "multiplicative")  #The composition
plot(y.decom.add)
6.5.1.2 b. Retrieving components
Recall that we have the following components:
- Trend (T), called Tr in this example, as T is reserved for TRUE in R
- Seasons (S)
- Irregular (I) - NOTICE: we don't want to use this for forecasting, as it is random, hence it is left out of the assembled yhat
We see no cycles and assume that they are not present.
Tr <- y.decom.add$trend
S  <- y.decom.add$seasonal
I  <- y.decom.add$random
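As a quick sanity check (our addition, not in the original), the three stored components should reassemble the original series wherever the trend is defined:
#Illustrative check: Tr + S + I reproduces y, except at the ends
#where the CMA-based trend is NA
recon <- Tr + S + I
head(cbind(y, recon), 12)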
6.5.1.3 c. Interpreting the results
For comparison, the additive and multiplicative decompositions are both plotted.
plot(y.decom.add)
plot(y.decom.mult)
We see that the decomposition finds a trend, and there appear to be recurring patterns in the seasonal component.
Notice, however, that the seasonal component of the multiplicative decomposition is different: it is a factor relative to the trend rather than an absolute amount.
6.5.1.4 d. Forecasting using the components
6.5.1.4.1 Fitting the model
We see that we are able to reconstruct the in-sample data 1:1 if the composition is assembled from all three components.
Let us instead build the fit using only the trend and season components.
We are able to plot the following:
yhat <- Tr + S #Notice that I is left out, as it is a random component
plot(y, type = "l")
lines(Tr, col = "green")
lines(yhat, col = "red")
We can assess the accuracy of the composition of the trend and seasons.
accuracy(yhat,y)
## ME RMSE MAE MPE MAPE ACF1 Theil's U
## Test set -3913.472 56659.88 43104.6 -2.781556 11.18807 -0.09662767 0.2250085
We see that the mean percentage error is -2.78 percent; since it is negative, the model tends to slightly overestimate y on average.
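For reference, the MPE is defined as follows (standard definition, added here for convenience):

$$MPE = \frac{100}{n}\sum_{t=1}^{n}\frac{y_t - \hat{y}_t}{y_t}$$

A negative value therefore means that, on average, the fitted values exceed the actuals.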
6.5.1.4.2 Forecasting using the composition
Now we can apply the model in a forecasting setting to forecast h periods ahead.
forecast.decom.add <- forecast(object = yhat, h = 18)
t(t(forecast.decom.add$mean))
## [,1]
## [1,] 446496.7
## [2,] 320351.2
## [3,] 541732.4
## [4,] 380127.7
## [5,] 361224.6
## [6,] 262271.3
## [7,] 635257.0
## [8,] 317322.8
## [9,] 598609.9
## [10,] 362402.8
## [11,] 526939.7
## [12,] 414048.1
## [13,] 464803.6
## [14,] 338658.2
## [15,] 560039.4
## [16,] 398434.7
## [17,] 379531.6
## [18,] 280578.3
A horizon of 18 months is chosen because the CMA-based components are undefined for the last six months of 2006, so forecast() starts forecasting from mid-2006.
Using 18 months, we therefore get forecasts covering the known last half-year and thereafter a whole year ahead.
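This can be verified directly (an illustrative check of our own): the CMA-based trend component is missing for the final six months, which is why the fitted series ends in mid-2006.
#Illustrative: the trend component is NA for the last 6 months
tail(Tr, 8)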
We can also plot the forecast:
plot(forecast.decom.add)
We can assess the accuracy of the first 6 forecasted periods, as they are known.
y.test <- y[(length(y)-5):length(y)]
accuracy(object = forecast.decom.add$mean[1:6], x = y.test)
## ME RMSE MAE MPE MAPE
## Test set 36881.52 90588.21 60504.58 3.564348 13.03319
We see that the MPE over these known periods is 3.56%.
rm(list = ls())
6.5.2 Decomposition with cycles + Forecasting
Process:
- Finding CMA (deseasonalizing)
- Finding SF (seasonal factors)
- Identifying the trend
- Finding detrended and deseasonalized data
- Identifying the cycle factor
y <- read_excel("Data/Week47/Alomegafoodstores.xlsx")
y <- y$Sales
y <- ts(data = y          #The data
        ,frequency = 12   #Data is monthly
        ,end = c(2006,12))
6.5.2.1 1. Finding CMA (deseasonalizing)
CMA <- ma(x = y           #The data
          ,order = 12      #When the same period occurs
          ,centre = TRUE)  #As we have an even period
We have now deseasonalized the data, as the moving average has removed the effect of the seasons.
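For monthly data, the centered moving average computed by ma(order = 12, centre = TRUE) is the standard 2×12 average (standard definition, added here for reference):

$$CMA_t = \frac{\tfrac{1}{2}\,y_{t-6} + y_{t-5} + \dots + y_{t+5} + \tfrac{1}{2}\,y_{t+6}}{12}$$

Each month of the year enters exactly once at full weight, which is what removes the seasonal effect.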
6.5.2.2 2. Finding SF
That is, identifying whether we are above or below the deseasonalized data.
SF <- y / CMA
plot(SF, type = "l")
abline(h = 1        #Plotted as a reference: at 1 we are neither above nor below
       ,col = "red")
6.5.2.3 3. Identifying Trend
We introduce a trend variable to take out the effect of the trend
trend <- seq_along(y) #Making a counter (1, 2, ..., n)
linmod <- lm(CMA ~ trend
             ,na.action = "na.exclude")
Notice that we don’t care about the model, we just need the fitted values for identifying the cycle factor.
6.5.2.4 4. Finding detrended and deseasonalized data
In step 1 we deseasonalized the data; based on the fitted model we now obtain the estimated trend, which lets us detrend the deseasonalized series.
CMAT <- linmod$fitted.values
6.5.2.5 5. Cycle identifying factor
With the deseasonalized and detrended data, we are able to identify whether the deseasonalized series is above or below the estimated trend.
CF <- na.exclude(CMA) / CMAT
ts.plot(CF)
abline(h = 1, col = "red")
We want to look for wavelike patterns. We don't see any, hence it is fair to assume that there are no cycles.
If we had found cycles, what could we do?
- When multiplicative: one can apply the cycle factor in the regression (a sketch follows below)
- When additive: one can apply the difference (between CMA and CMAT) in the regression
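A minimal sketch of the multiplicative case (the NA padding, the object names CF.full and reg.cf, and the use of seasonaldummy() are our own illustration, not part of the original exercise):
#Hypothetical sketch: using the cycle factor as an extra regressor.
#CF must be padded with NAs, as the CMA loses 6 observations at each end.
CF.full <- c(rep(NA, 6), CF, rep(NA, 6))
month   <- seasonaldummy(y) #seasonal dummies (forecast package)
reg.cf  <- lm(y ~ trend + month + CF.full, na.action = "na.exclude")
summary(reg.cf)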
End of exercise:
rm(list = ls())
6.5.3 Time series regression
This example shows how one can include other variables in a regression.
Alomegasales <- read_excel("Data/Week47/Alomegafoodstores.xlsx")
y <- Alomegasales$Sales
y <- ts(data = y          #The data
        ,frequency = 12   #Data is monthly
        ,end = c(2006,12))

#fetch explanatory variables
paper <- ts(Alomegasales$Paper, frequency = 12)
tv    <- ts(Alomegasales$TV, frequency = 12)
month <- seasonaldummy(y) #creates seasonal dummies if they are not available in the data

#Regression model for sales
reg <- lm(y ~ paper + tv + month)
summary(reg)
##
## Call:
## lm(formula = y ~ paper + tv + month)
##
## Residuals:
## Min 1Q Median 3Q Max
## -88551 -23745 -951 21590 107195
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 184393.15345 23401.84900 7.879 0.000000003560 ***
## paper 0.36319 0.06854 5.299 0.000007029523 ***
## tv 0.31530 0.03638 8.667 0.000000000398 ***
## monthJan 200847.13731 39150.70001 5.130 0.000011646830 ***
## monthFeb 55491.47382 32399.34749 1.713 0.095868 .
## monthMar 199556.46868 34147.06238 5.844 0.000001372948 ***
## monthApr 100151.42985 32387.77669 3.092 0.003952 **
## monthMay 190293.03783 32822.15893 5.798 0.000001577161 ***
## monthJun 135441.04479 32580.72319 4.157 0.000206 ***
## monthJul 156608.64859 32699.04733 4.789 0.000032157745 ***
## monthAug 51585.64542 32419.67959 1.591 0.120825
## monthSep 183619.40275 36521.78560 5.028 0.000015816441 ***
## monthOct 109096.18894 32438.67351 3.363 0.001919 **
## monthNov 96205.98997 32416.66110 2.968 0.005461 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 45800 on 34 degrees of freedom
## Multiple R-squared: 0.9083, Adjusted R-squared: 0.8732
## F-statistic: 25.9 on 13 and 34 DF, p-value: 0.00000000000007305
We see the model summary above. Now we can check the quality of the regression.
#obtain residuals from the model to perform model quality checks
res <- reg$residuals
plot(res)
We see that the residuals appear homoskedastic.
We can check them for normality
#check for normality
hist(res)
By visual inspection, they appear to be normally distributed.
library(tseries)
jarque.bera.test(res) #Jarque-Bera tests the null of normality
##
## Jarque Bera Test
##
## data: res
## X-squared = 1.0947, df = 2, p-value = 0.5785
H0: Normality
We cannot reject H0, so the residuals appear normally distributed.
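For reference, the test statistic is (standard definition, added for convenience):

$$JB = \frac{n}{6}\left(S^2 + \frac{(K-3)^2}{4}\right)$$

where S is the sample skewness and K the sample kurtosis; under normality JB is asymptotically chi-squared with 2 degrees of freedom, matching the df = 2 in the output above.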
Now we can check for multicollinearity
#testing for multicollinearity
library(car)
vif(reg)
## GVIF Df GVIF^(1/(2*Df))
## paper 2.054002 1 1.433179
## tv 1.559810 1 1.248924
## month 2.581443 11 1.044049
We see that they are all ≈ 2 or less, hence multicollinearity is not a problem.
We can check the residuals for autocorrelation (serial correlation), which we do not want.
#serial correlation
#acf(res)
plot(acf(res, plot = FALSE)[1:20]) #alternative command if you want to skip lag 0
Despite a few significant spikes, the residuals do not appear autocorrelated. This can be tested with the Durbin-Watson test.
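For reference, the Durbin-Watson statistic is (standard definition, added for convenience):

$$DW = \frac{\sum_{t=2}^{n}(e_t - e_{t-1})^2}{\sum_{t=1}^{n}e_t^2} \approx 2(1 - \hat{\rho}_1)$$

so values near 2 indicate no first-order autocorrelation; the DW of 2.27 below corresponds to a mildly negative first-order residual correlation.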
#Durbin-Watson serial correlation test
library(lmtest)
dwtest(reg)
##
## Durbin-Watson test
##
## data: reg
## DW = 2.2742, p-value = 0.8231
## alternative hypothesis: true autocorrelation is greater than 0
H0: No autocorrelation
We cannot reject H0 at the five percent level.
rm(list = ls())
6.5.4 Alomega Food Stores, case 6 p. 166 + case 7 p. 348 (Another example)
y <- read_excel("Data/Week47/Alomegafoodstores.xlsx")
str(y)
## tibble [48 × 21] (S3: tbl_df/tbl/data.frame)
## $ Sales : num [1:48] 425075 315305 432101 357191 347874 ...
## $ Paper : num [1:48] 75253 15036 134440 119740 135590 ...
## $ TV : num [1:48] 114433 63599 64988 66842 39626 ...
## $ Month : num [1:48] 1 2 3 4 5 6 7 8 9 10 ...
## $ Dum1 : num [1:48] 1 0 0 0 0 0 0 0 0 0 ...
## $ Dum2 : num [1:48] 0 1 0 0 0 0 0 0 0 0 ...
## $ Dum3 : num [1:48] 0 0 1 0 0 0 0 0 0 0 ...
## $ Dum4 : num [1:48] 0 0 0 1 0 0 0 0 0 0 ...
## $ Dum5 : num [1:48] 0 0 0 0 1 0 0 0 0 0 ...
## $ Dum6 : num [1:48] 0 0 0 0 0 1 0 0 0 0 ...
## $ Dum7 : num [1:48] 0 0 0 0 0 0 1 0 0 0 ...
## $ Dum8 : num [1:48] 0 0 0 0 0 0 0 1 0 0 ...
## $ Dum9 : num [1:48] 0 0 0 0 0 0 0 0 1 0 ...
## $ Dum10 : num [1:48] 0 0 0 0 0 0 0 0 0 1 ...
## $ Dum11 : num [1:48] 0 0 0 0 0 0 0 0 0 0 ...
## $ Paper1: num [1:48] 220 75253 15036 134440 119740 ...
## $ Paper2: num [1:48] 5 220 75253 15036 134440 ...
## $ TV1 : num [1:48] 88218 114433 63599 64988 66842 ...
## $ TV2 : num [1:48] 76001 88218 114423 63599 64988 ...
## $ M01-48: num [1:48] 1 2 3 4 5 6 7 8 9 10 ...
## $ Compet: num [1:48] 1 2 2 1 2 3 3 2 3 3 ...
We see that there are 21 variables, where Sales is the dependent variable and all others are independent variables, consisting of continuous variables and factors.
Now we are interested in constructing a time series consisting of the dependent variable. That is done in the following:
y <- ts(data = y$Sales     #The dependent variable
        ,end = c(2006,12)  #The end date
        ,frequency = 12)   #The frequency, 12 as we are working with months
options(scipen = 999)
plot(y)
Now we want to address whether there is:
- Trend: visually it does not look like it, but this will be tested by regressing the time series on a trend variable
- Seasons:
- Cycles: this is difficult to say, also because we only have data for four years
Trend
#Creating trend variable
trend <- seq(from = 1, to = length(y), by = 1)

#Creating a linear model with the trend variable
lm.fit <- lm(y ~ trend)
summary(lm.fit)
##
## Call:
## lm(formula = y ~ trend)
##
## Residuals:
## Min 1Q Median 3Q Max
## -200465 -84502 -13613 71381 333683
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 350840 37726 9.300 0.00000000000389 ***
## trend 1335 1340 0.996 0.325
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 128600 on 46 degrees of freedom
## Multiple R-squared: 0.0211, Adjusted R-squared: -0.0001811
## F-statistic: 0.9915 on 1 and 46 DF, p-value: 0.3246
We see that the trend variable appears to be insignificant; hence there is not enough evidence to reject the null hypothesis that there is no relationship between the dependent and the independent variable.
accuracy(object = lm.fit #The fitted model
         ,y = y)          #The actual values
## ME RMSE MAE MPE MAPE MASE
## Training set -0.000000000002425023 125940.3 99341.6 -10.69473 28.49834 0.992719
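For reference, the two headline measures reported above are defined as (standard definitions, added for convenience):

$$RMSE = \sqrt{\frac{1}{n}\sum_{t=1}^{n}(y_t - \hat{y}_t)^2}, \qquad MAE = \frac{1}{n}\sum_{t=1}^{n}\lvert y_t - \hat{y}_t\rvert$$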
We see an RMSE of about 125,940 and a mean absolute error just below 100,000 units. The fitted values can also be plotted against the residuals. Notice that this test is done in-sample.
plot(x = lm.fit$fitted.values, y = lm.fit$residuals, main = "Residuals plot")
grid(col = "lightgrey")
abline(h = 0, lty = 3, col = "blue")
6.5.4.1 What might Jackson Tilson say about the forecasts?
With a mean absolute percentage error of about 28%, the forecasts may not be very good. Still, the fuller regression model (with paper, TV and monthly dummies) was shown to be statistically significant, so we must try to convince him that the analysis does in fact contribute some knowledge.
What other forecast methods could be used?
E.g., smoothing or moving averages.
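As an illustrative sketch only (our own example using hw() from the forecast package, not part of the case solution):
#Illustrative alternative: Holt-Winters exponential smoothing.
#Additive seasonality is assumed here, matching the earlier visual judgement.
hw.fit <- hw(y, seasonal = "additive", h = 12)
plot(hw.fit)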
6.5.4.2 P.348 Case 7 (Alomega Food Stores)
This is an extension of the case above. It has questions regarding what other things she might do.
How can Julie be sure that these are the right predictors?
We see on page 348 that some of the variables have p-values above the 5 percent level, hence there is a greater risk that there is no statistical evidence for a relationship between those individual variables and the dependent variable.
There are also changes in the VIF values, although none are close to 10, so by the rule of thumb there is no indication of multicollinearity.
How would you sell the model?
Focus on how it prepares them for future changes, planning staffing, procurement, etc.
How may the model guide future advertising?
We see that it makes sense to spend money on all three types of advertisements.
What conditions should one be aware of?
- Autocorrelation in the error terms?
- Multicollinearity
Other models that could be used?
- ARIMA
- Smoothing
rm(list = ls())
6.5.5 Monthly Sales Data
Loading the data and storing only the y variable as a time series.
df <- read_excel("Data/Week47/Salesdataseasonal.xlsx")
y <- ts(df$Sales
        ,frequency = 12     #Monthly data
        ,start = c(1979,1)) #start year 1979, month 1
#Plotting for visual interpretation
plot(y)
Goal:
- Seasonal dummies
- Trend variable
Hereafter, different approaches to decomposing the data are presented:
- OLS using seasonal dummies and a trend variable
- Decomposition using decompose()
- Using CMA (centered moving averages)
a. finding seasonal dummies
Finding seasonal dummies using seasonaldummy():
#Creating matrix of dummies for each month
month <- seasonaldummy(x = y)
This can be shown with the following table
Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | Sep | Oct | Nov |
---|---|---|---|---|---|---|---|---|---|---|
1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
We see that a matrix has been created identifying what month each period belongs to.
b. finding the trend variable
In addition, we must create a trend variable to account for the trend:
trend <- seq(from = 1, to = length(y), by = 1)
1. OLS using seasonal dummies and trend variable
#Constructing linear model
lmmod <- lm(y ~ month + trend)
summary(lmmod)
##
## Call:
## lm(formula = y ~ month + trend)
##
## Residuals:
## Min 1Q Median 3Q Max
## -41.782 -17.094 -3.591 15.140 91.349
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 84.01705 8.51294 9.869 < 0.0000000000000002 ***
## monthJan 5.78164 10.54471 0.548 0.5844
## monthFeb 42.89804 10.54352 4.069 0.00008250119327 ***
## monthMar 79.84777 10.54259 7.574 0.00000000000656 ***
## monthApr 76.88084 10.54192 7.293 0.00000000002882 ***
## monthMay 25.49724 10.54153 2.419 0.0170 *
## monthJun -13.05303 10.54139 -1.238 0.2179
## monthJul -49.51997 10.54153 -4.698 0.00000673141501 ***
## monthAug -23.23690 10.54192 -2.204 0.0293 *
## monthSep -17.94010 10.76929 -1.666 0.0982 .
## monthOct -28.56613 10.76864 -2.653 0.0090 **
## monthNov 5.80784 10.76825 0.539 0.5906
## trend 2.71693 0.05288 51.383 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 25.25 on 127 degrees of freedom
## Multiple R-squared: 0.9586, Adjusted R-squared: 0.9547
## F-statistic: 245 on 12 and 127 DF, p-value: < 0.00000000000000022
What can we deduce from this?
- January, June, September and November are not significant at the 5% level in the full model, hence the dummies may not explain sales for these months (a joint test of the dummies is sketched below)
- There is overwhelming evidence for a trend being present in the data
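One way to back up the first point (an illustrative check of our own, not in the original) is to test the seasonal dummies jointly rather than one by one:
#Illustrative: joint F-test of all seasonal dummies via model comparison.
lmmod.trend <- lm(y ~ trend) #restricted model without the dummies
anova(lmmod.trend, lmmod)    #H0: all seasonal dummy coefficients are zero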
Let us now look at the fitted values and the residuals:
#Plotting fitted values
plot(lmmod$fitted.values, main = 'Fitted values', ylab = 'Sales')
lines(lmmod$fitted.values, col = 'red')
grid(col = 'lightgrey')

#Plotting residuals
plot(lmmod$residuals, main = 'Residuals', ylab = 'Residuals')
lines(lmmod$residuals, col = 'green')
grid(col = 'lightgrey')
2. Decomposition using decompose()
The following decomposes the series under both an additive and a multiplicative assumption.
Since the variance is increasing over time, one may expect the multiplicative approach to be more appropriate.
#Additive
decompSalesAdd <- decompose(x = y                #The timeseries
                            ,type = "additive")   #for additive decomposition
#Plotting
plot(decompSalesAdd)

#Multiplicative
decompSalesMult <- decompose(x = y                      #The timeseries
                             ,type = "multiplicative")   #for multiplicative decomposition
#Plotting
plot(decompSalesMult)
3. CMA
See the process in section 6.4.2.
- Find the MA_t with equation (6.1) and (6.2) = Deseasonalizing data
#Finding centered MA
CMA <- ma(x = y           #The time series
          ,order = 12      #We want the average of 12 periods
          ,centre = TRUE)  #As the order is even, we want a centered MA
plot(CMA)
We see that the centered moving averages have removed the seasons from the data. But there is clearly still a trend.
We are now able to estimate the trend using the deseasonalized data and the trend variable,
just as we would with any other data!
- Then fit a trend to the CMA (CMAT) = detrending the data
#Detrending data
linmod <- lm(CMA ~ trend
             ,na.action = "na.exclude") #We want to ignore rows with NAs

# NOTE: we don't really care about the model itself, we only want the fitted
# values, as they represent the estimated linear trend

#Estimated trend
CMAT <- linmod$fitted.values #Extracting trend estimates and saving in an object
plot(CMAT)
With the deseasonalized data and the trend estimates, we are able to assess whether we are above or below the trend throughout the deseasonalized data.
The cycle
#Identifying the cycle
Cycle <- na.exclude(CMA) / CMAT
ts.plot(Cycle)
abline(h = 1, col = "red")
Now we see whether we are above or below the trend in the respective periods; e.g., period 60 appears to be far below the trend level, indicating a low point in the cycle.
6.5.6 Own test - Monthly Sales Data decomposition + forecast
df <- read_excel("Data/Week47/Salesdataseasonal.xlsx")

y <- df$Sales
y <- ts(data = y
        ,frequency = 12     #Monthly data
        ,start = c(1979,1))

ts.plot(y)
We see that the data clearly has a trend and seasonality, and the composition appears to be multiplicative, as the variance is increasing.
Process ----
- How to make a decomposition of a time series using decompose()
- How to use decompose() to retrieve the different components and store these
- How to interpret the components
- How to forecast using the components
Decomposing ----
y.decom.mult <- decompose(x = y                     #The ts
                          ,type = "multiplicative")  #The composition
plot(y.decom.mult)

Tr <- y.decom.mult$trend
S  <- y.decom.mult$seasonal
I  <- y.decom.mult$random
Forecasting ----
yhat <- Tr * S #Note I is not used, as it is a random component

plot(y, type = "l")
lines(Tr, col = "green")
lines(yhat, col = "red")
grid(col = "grey")
One sees that the seasons and trend are reconstructed without the irregular component.
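As a quick in-sample check, mirroring the additive exercise (our addition):
#In-sample accuracy of the trend * season reconstruction
accuracy(yhat, y)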
forecast.decom.mult <- forecast(object = yhat, h = 60)
plot(forecast.decom.mult)
rm(list = ls())