4.2 Data Patterns and terminology

The data are assumed to consist of up to four components:

  1. Trend
    • Long-term change in the level of data
    • Positive vs. negative trends
    • Stationary series have no trend
    • Example: Increasing technology leading to increase in productivity
  2. Seasonal
    • Repeated, regular variation in the level of the data
    • Example: Number of tourists in Mallorca
  3. Cyclical
    • Wavelike upward and downward movements around the long-term trend
    • Longer duration than seasonal fluctuations
    • Example: Business cycles
    • Note: this is very often difficult to identify
  4. Irregular
    • Random fluctuations
    • Possibly carrying more dynamics than just deterministic ones
    • Hardest to capture in a forecasting model

The four components may look similar to this:

Figure 4.2: Components in a timeseries
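To make the four components concrete, here is a minimal sketch that simulates each one and sums them into a single series. The additive combination, the parameter values and the function name `simulate_components` are assumptions made for illustration; the text above does not state how the components combine.

```python
import math
import random


def simulate_components(n=48, period=12, cycle_length=36, seed=0):
    """Simulate the four components and an additive series (assumed form)."""
    rng = random.Random(seed)
    # Trend: long-term change in the level of the data (positive here).
    trend = [100 + 0.5 * t for t in range(n)]
    # Seasonal: regular variation repeating every `period` observations.
    seasonal = [10 * math.sin(2 * math.pi * t / period) for t in range(n)]
    # Cyclical: wavelike movement with a longer duration than the season.
    cyclical = [5 * math.sin(2 * math.pi * t / cycle_length) for t in range(n)]
    # Irregular: random fluctuations.
    irregular = [rng.gauss(0, 2) for _ in range(n)]
    # Assumption: the components add up to the observed series.
    series = [a + b + c + d
              for a, b, c, d in zip(trend, seasonal, cyclical, irregular)]
    return {"trend": trend, "seasonal": seasonal, "cyclical": cyclical,
            "irregular": irregular, "series": series}


components = simulate_components()
```

Plotting each list in `components` separately should give a picture similar in spirit to the figure above.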

4.2.1 Terminology

\(Y_t\): Denotes a time series variable

\(\hat{Y_t}\): Denotes the forecasted value of \(Y_t\)

\(e_t=Y_t-\hat{Y_t}\): Denotes the residual or the forecast error.

\(Y_{t-k}\): Denotes a time series variable lagged by k periods.
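The notation can be illustrated with a small sketch. The helper names `lag` and `forecast_errors` are hypothetical, and the naive forecast \(\hat{Y_t}=Y_{t-1}\) is only an assumption used to have something to compute errors against.

```python
def lag(series, k):
    """Return Y_{t-k} aligned with Y_t; the first k entries are undefined."""
    return [None] * k + list(series[:len(series) - k])


def forecast_errors(actual, forecast):
    """Return e_t = Y_t - Y_hat_t for every period that has a forecast."""
    return [y - y_hat for y, y_hat in zip(actual, forecast)
            if y_hat is not None]


y = [10, 12, 13, 15, 18]        # Y_t
y_hat = lag(y, 1)               # assumed naive forecast: Y_hat_t = Y_{t-1}
errors = forecast_errors(y, y_hat)
```

Here `errors` holds one residual per period from the second observation onwards, since the first period has no lagged value to forecast from.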

4.2.1.1 Autocorrelation

Autocorrelation is the correlation between a time series and its own past (lagged) observations. To identify it, one treats the lagged values as a series in their own right and compares the actual time series against the lagged time series. The autocorrelation at lag k can be written as:

\[r_k=\frac{\sum_{t=k+1}^n\left(Y_t-\bar{Y}\right)\left(Y_{t-k}-\bar{Y}\right)}{\sum_{t=1}^n\left(Y_t-\bar{Y}\right)^2}\]

Where \(\bar{Y}\) is the sample mean of the series and \(k = 0,1,2,...\) is the lag, which takes non-negative whole numbers since it counts periods.
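As a minimal sketch, the lag-k autocorrelation can be computed directly as the sum of cross-products of deviations from the sample mean, divided by the total sum of squared deviations. The function name `autocorrelation` and the example data are assumptions for illustration.

```python
def autocorrelation(series, k):
    """Lag-k sample autocorrelation r_k of a list of numbers."""
    n = len(series)
    mean = sum(series) / n
    # Denominator: total sum of squared deviations from the mean.
    denominator = sum((y - mean) ** 2 for y in series)
    # Numerator: cross-products of Y_t and Y_{t-k}, both centred on the mean.
    # With 0-based indexing, t runs from k to n-1.
    numerator = sum((series[t] - mean) * (series[t - k] - mean)
                    for t in range(k, n))
    return numerator / denominator


y = [1, 2, 3, 4, 5]
r0 = autocorrelation(y, 0)
r1 = autocorrelation(y, 1)
r2 = autocorrelation(y, 2)
```

For this short trending series, `r0` is 1 by construction, and `r1` is larger than `r2`, which reflects the usual decay of the autocorrelation as the lag grows.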

We assess autocorrelation to identify whether the data have a trend, seasons or cycles, or whether they are random. If there are trends, seasons or cycles, the model must account for them. Otherwise one is prone to end up with a model that only appears to fit because of the autocorrelation: the series is, as the word says, correlated with itself, and that correlation is not necessarily caused by the variables in the model but rather by other factors. Often these are macro factors, e.g. an economic boom.

Autocorrelation can be inspected using the autocorrelation function (ACF), typically by plotting a correlogram, i.e. a plot of the autocorrelation for each lag k, that looks like the following:

Figure 4.3: Correlogram Example

One wants the autocorrelations to lie within the upper and lower limits, as values inside these limits are not significantly different from zero.

Manually testing for autocorrelation

One must:

  1. Calculate \(r_k\)
  2. Calculate \(SE(r_k)\)
  3. Hypothesis: \(H_0 : \rho_k=0\), \(H_1 : \rho_k\neq 0\)
    • We apply t-test

Where:

\[SE\left(r_k\right)=\sqrt{\frac{1+2\sum_{i=1}^{k-1}r_i^2}{n}}\]

Alternatively, with the normal approximation:

\[SE\left(r_k\right)=\frac{1}{\sqrt{n-k}}\]

and the test statistic equals:

\[t=\frac{r_k}{SE(r_k)}\]

Then one merely has to look up the cut-off values and assess whether there is statistical evidence of autocorrelation or not.
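The manual procedure can be sketched as follows, using the first standard-error formula above. The function names, the example series and the choice of \(n-1\) degrees of freedom for the cut-off are assumptions for illustration, not something stated in the text.

```python
import math


def autocorrelation(series, k):
    """Lag-k sample autocorrelation r_k."""
    n = len(series)
    mean = sum(series) / n
    denominator = sum((y - mean) ** 2 for y in series)
    numerator = sum((series[t] - mean) * (series[t - k] - mean)
                    for t in range(k, n))
    return numerator / denominator


def standard_error(series, k):
    """SE(r_k) = sqrt((1 + 2 * sum_{i=1}^{k-1} r_i^2) / n)."""
    n = len(series)
    sum_sq = sum(autocorrelation(series, i) ** 2 for i in range(1, k))
    return math.sqrt((1 + 2 * sum_sq) / n)


def t_statistic(series, k):
    """Test statistic t = r_k / SE(r_k)."""
    return autocorrelation(series, k) / standard_error(series, k)


y = list(range(1, 21))          # a purely trending series, n = 20
r1 = autocorrelation(y, 1)
t1 = t_statistic(y, 1)
# Assumption: two-sided 5% cut-off from the t-distribution with n-1 = 19 df.
critical_value = 2.093
reject_h0 = abs(t1) > critical_value
```

For this trending series the lag-1 test statistic lies well above the cut-off, so the hypothesis of no autocorrelation at lag 1 is rejected.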

Alternative: Ljung-Box Q statistic

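The Ljung-Box Q statistic tests whether the autocorrelations at the first m lags are jointly zero, rather than testing one lag at a time. Hence \(H_0: \rho_1=\rho_2=\dots=\rho_m=0\). If \(H_0\) is rejected, at least one of the autocorrelations differs from zero, so the series contains a pattern. If it is not rejected, there is no evidence of autocorrelation at any of the tested lags, and the series is consistent with consisting of irregular components only.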

\[Q\ =\ n\left(n+2\right)\sum_{k=1}^m\frac{r_k^2}{n-k}\]

Where m is the number of lags to be tested.

The Q statistic is commonly used for testing correlation in the residuals of a forecast model, and the comparison is made to \(\chi^2_{m-q}\), where q is the number of parameters in the model.
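A minimal sketch of the Q statistic, computed directly from the formula above. The function names and the example series are assumptions; the series is raw data rather than model residuals, so q = 0 is assumed and the statistic is compared with the 5% critical value of the chi-square distribution with m degrees of freedom.

```python
def autocorrelation(series, k):
    """Lag-k sample autocorrelation r_k."""
    n = len(series)
    mean = sum(series) / n
    denominator = sum((y - mean) ** 2 for y in series)
    numerator = sum((series[t] - mean) * (series[t - k] - mean)
                    for t in range(k, n))
    return numerator / denominator


def ljung_box_q(series, m):
    """Q = n (n + 2) * sum_{k=1}^{m} r_k^2 / (n - k)."""
    n = len(series)
    return n * (n + 2) * sum(autocorrelation(series, k) ** 2 / (n - k)
                             for k in range(1, m + 1))


y = list(range(1, 21))          # a purely trending series, n = 20
q1 = ljung_box_q(y, 1)
q5 = ljung_box_q(y, 5)
# 5% critical value of the chi-square distribution with 5 df (m = 5, q = 0).
chi2_critical_5df = 11.070
reject_h0 = q5 > chi2_critical_5df
```

Because the series trends, the autocorrelations at the first lags are large and the joint hypothesis of zero autocorrelation is rejected.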

4.2.1.2 Random vs. correlated data

Randomness is important for the residuals of a forecast model. We do not want patterns in our errors, where the previous error can explain the next error. E.g. if the data contain a trend or seasons that we have not accounted for, then the errors will be able to predict the coming errors (this can be tested by testing the errors (residuals) against the lagged errors (residuals)). A simple random model, in which the series is a constant plus a random error, can be written as:

\[Y_t=c+\epsilon_t\]

Where c is the constant component and \(\epsilon_t\) is the random error component, which is assumed to be uncorrelated from period to period.
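The idea of testing the errors against the lagged errors can be sketched as follows. A constant model is fitted (with c estimated by the sample mean, an assumption) to one simulated random series and to one trending series, and the lag-1 autocorrelation of the residuals is compared. All names, parameter values and the seed are hypothetical.

```python
import random


def autocorrelation(series, k):
    """Lag-k sample autocorrelation r_k."""
    n = len(series)
    mean = sum(series) / n
    denominator = sum((y - mean) ** 2 for y in series)
    numerator = sum((series[t] - mean) * (series[t - k] - mean)
                    for t in range(k, n))
    return numerator / denominator


def constant_model_residuals(series):
    """Residuals e_t = Y_t - c, with c estimated by the sample mean."""
    c = sum(series) / len(series)
    return [y - c for y in series]


rng = random.Random(42)
# Data that follow Y_t = c + e_t with uncorrelated errors.
random_series = [50 + rng.gauss(0, 1) for _ in range(200)]
# Data with a trend that the constant model does not account for.
trending_series = [50 + 0.5 * t for t in range(200)]

r1_random = autocorrelation(constant_model_residuals(random_series), 1)
r1_trend = autocorrelation(constant_model_residuals(trending_series), 1)
```

The residuals of the random series should show a lag-1 autocorrelation close to zero, while the residuals of the trending series are strongly autocorrelated, which signals that the constant model has missed the trend.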

4.2.1.3 Stationary vs. non stationary data

A stationary series is not trending, whereas a non-stationary series is trending; the trend can be either linear or exponential.

Then how is it solved?

One can apply differencing, repeated up to order k if needed. First-order differencing is equal to:

\[\Delta Y_t=Y_t-Y_{t-1}\]

One could also apply growth rates or log differencing instead.
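The three transformations can be sketched as below. The function names and the example series are assumptions for illustration; log differencing and growth rates require strictly positive, respectively non-zero, values.

```python
import math


def difference(series):
    """First difference: Y_t - Y_{t-1}."""
    return [series[t] - series[t - 1] for t in range(1, len(series))]


def growth_rate(series):
    """Period-on-period growth rate: (Y_t - Y_{t-1}) / Y_{t-1}."""
    return [(series[t] - series[t - 1]) / series[t - 1]
            for t in range(1, len(series))]


def log_difference(series):
    """Log difference: log(Y_t) - log(Y_{t-1})."""
    return [math.log(series[t]) - math.log(series[t - 1])
            for t in range(1, len(series))]


linear = [2, 4, 6, 8, 10]                        # linear trend
exponential = [100 * 1.1 ** t for t in range(5)]  # 10% growth per period

diff_linear = difference(linear)
growth_exponential = growth_rate(exponential)
logdiff_exponential = log_difference(exponential)
```

Differencing turns the linear trend into a constant series, while the growth rate and the log difference do the same for the exponential trend; the log difference is close to, but slightly below, the growth rate.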