%% Cell type:markdown id: tags:
 
# GA 2.4: Beary Icy
 
<h1 style="position: absolute; display: flex; flex-grow: 0; flex-shrink: 0; flex-direction: row-reverse; top: 60px;right: 30px; margin: 0; border: 0">
<style>
.markdown {width:100%; position: relative}
article { position: relative }
</style>
<img src="https://gitlab.tudelft.nl/mude/public/-/raw/main/tu-logo/TU_P1_full-color.png" style="width:100px" />
<img src="https://gitlab.tudelft.nl/mude/public/-/raw/main/mude-logo/MUDE_Logo-small.png" style="width:100px" />
</h1>
<h2 style="height: 10px">
</h2>
 
*[CEGM1000 MUDE](http://mude.citg.tudelft.nl/): Week 2.4, Time Series Analysis. For: December 6, 2024*
 
%% Cell type:markdown id: tags:
 
Winter is coming and it is time to start getting our models ready for the ice classic. Our first goal is to improve the temperature model, as that seems to be an important factor in determining breakup day. Temperature is notoriously hard to predict, but we can analyze historical data to get a better understanding of the patterns.
 
In this assignment we will analyze a time series from a **single year**; in fact, only the **first 152 days of the year**, from January 1 until June 1, with hourly data available. This is the period of interest for the ice classic, as the ice forms in this period, reaching its maximum thickness between January and March, and then starts melting, with breakup day typically happening in April or May.
 
Remember that we have until April 5 to place a bet. Why, then, do we want to fit a model several months beyond this point? Because it gives us confidence in assessing the ability of the model to predict temperature, so that when we use it on April 5 to make **predictions** about the future, we understand the uncertainty associated with them.

Let's start by loading the data and plotting it; then we will determine which components should be used to detrend it.
 
%% Cell type:code id: tags:
 
``` python
# Importing the required libraries
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf
from scipy.stats import chi2
from scipy.signal import periodogram
```
 
%% Cell type:markdown id: tags:
 
## Part 1: Load the Data and Plot It
%% Cell type:markdown id: tags:
<div style="background-color:#AABAB2; color: black; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px; width: 95%">
<p>
<b>Task 1.1:</b>
Do the following:
- load the data
- create a time vector
- plot the data
</p>
</div>
 
%% Cell type:code id: tags:
 
``` python
# YOUR_CODE_HERE
 
# SOLUTION
# Reading the data from the file
data = np.loadtxt('temperature.csv')
time_hours = np.arange(0, len(data))
time_days = time_hours / 24
dt = time_days[1] - time_days[0]
fs = 1 / dt
 
# Plotting the data
plt.figure(figsize=(10, 3))
plt.plot(time_days, data)
plt.xlabel('Time [days]')
plt.ylabel('Temperature [°C]')
plt.title('Temperature data Nenana, Alaska')
plt.grid(True)
# END SOLUTION BLOCK
```
 
%% Output
 
 
%% Cell type:markdown id: tags:
 
<div style="background-color:#AABAB2; color: black; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px; width: 95%">
<p>
<b>Task 1.2:</b>
Use the Markdown cell below to describe the data (you can use a few bullet points). For example, confirm relevant characteristics like number of points, units, describe the values (qualitatively), etc.
</p>
</div>
%% Cell type:markdown id: tags:
_Your answer here._
%% Cell type:markdown id: tags:
<div style="background-color:#FAE99E; color: black; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px; width: 95%">
<p>
<b>Solution:</b>
- There are 24 × 152 = 3648 points (hourly data for 152 days)
- There is some variation around a central line, but the general increase in temperature throughout the spring is larger than that variation
- The general trend looks linear in the middle, but flattens out at the ends, which is to be expected with a periodic annual cycle
- It looks like something "funny" is happening around days 70 and 120
</p>
</div>
%% Cell type:markdown id: tags:
## Part 2: Extract the Dominant Patterns
We clearly see that the data contains a strong pattern (the general increase in temperature from winter to summer). We will start by fitting a functional model to the data in order to stationarize it. To find the frequency of the seasonal pattern we will use the power spectrum of the data.
 
We will reuse the procedure of the function `find_seasonal_pattern` from the workshop.
 
Remember that to run this function we need to predefine the A-matrix used to detrend the data. Since the data only contain the first 5 months of the year, we see that the temperature is increasing over time. What type of model would be most appropriate to remove the seasonal trend?
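
%% Cell type:markdown id: tags:

To see the idea before applying it to the real data, here is a small illustrative sketch (not part of the assignment; the `_demo` names are made up for this example): a synthetic signal with a known daily frequency is generated, its power spectral density is computed with `scipy.signal.periodogram`, and the frequency with the largest power is picked. This is exactly the operation that the `find_frequency` function below automates.

%% Cell type:code id: tags:

``` python
# Illustrative sketch only: recover a known frequency from a synthetic signal
import numpy as np
from scipy.signal import periodogram

fs_demo = 24                                    # hourly sampling: 24 samples/day
t_demo = np.arange(0, 30, 1/fs_demo)            # 30 days of synthetic data
x_demo = 2.0*np.cos(2*np.pi*1.0*t_demo) + 0.5*np.random.randn(len(t_demo))

freqs_demo, pxx_demo = periodogram(x_demo, fs=fs_demo)
print(f'Dominant frequency: {freqs_demo[np.argmax(pxx_demo)]:.2f} cycles/day')  # ~1.00
```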
 
%% Cell type:markdown id: tags:
<div style="background-color:#AABAB2; color: black; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px; width: 95%">
<p>
<b>Task 2.1:</b>
Define functions to help carry out this analysis, for example, <code>fit_model</code> and <code>find_frequency</code>.
</p>
</div>
 
%% Cell type:code id: tags:
 
``` python
def fit_model(data, time, A, plot=False):
    '''
    Function to find the least squares solution of the data
    data: input data
    time: time vector
    A: A-matrix to fit the data
    plot: boolean to plot the results or not
    '''

    # x_hat = YOUR_CODE_HERE # least squares solution
    # y_hat = YOUR_CODE_HERE # model prediction
    # e_hat = YOUR_CODE_HERE # residuals

    # SOLUTION
    x_hat = np.linalg.solve(A.T @ A, A.T @ data)
    y_hat = A @ x_hat
    e_hat = data - y_hat
    # END SOLUTION BLOCK

    if plot:
        plt.figure(figsize=(10, 5))
        plt.subplot(211)
        plt.plot(time, data, label='Data')
        plt.plot(time, y_hat, label='Estimated data')
        plt.xlabel('Time [days]')
        plt.ylabel('Temperature [°C]')
        plt.title('Data vs Estimated data')
        plt.grid(True)
        plt.legend()
        plt.subplot(212)
        plt.plot(time, e_hat, label='Residuals')
        plt.xlabel('Time [days]')
        plt.ylabel('Temperature [°C]')
        plt.title('Residuals')
        plt.grid(True)
        plt.legend()
        plt.tight_layout()

    return x_hat, y_hat, e_hat

def find_frequency(data, time, A, fs, plot=True):
    '''
    Function to find the dominant frequency of the signal
    data: input data
    time: time vector
    A: A-matrix to detrend the data (prior to spectral analysis)
    fs: sampling frequency
    plot: boolean to plot the psd or not
    '''
    # Detrending the data
    _, _, e_hat = fit_model(data, time, A)

    N = len(data)

    # Finding the dominant frequency in e_hat
    # freqs, pxx = periodogram(YOUR_CODE_HERE, fs=YOUR_CODE_HERE, window='boxcar',
    #                          nfft=N, return_onesided=False,
    #                          scaling='density')

    # SOLUTION
    freqs, pxx = periodogram(e_hat, fs=fs, window='boxcar',
                             nfft=N, return_onesided=False,
                             scaling='density')
    # END SOLUTION BLOCK

    # finding the dominant frequency and amplitude
    # Note: there are many ways to do this
    # amplitude = YOUR_CODE_HERE # Amplitude of the dominant frequency
    # dominant_frequency = YOUR_CODE_HERE # Dominant frequency

    # SOLUTION
    dominant_frequency, amplitude = freqs[np.argmax(pxx)], np.max(pxx)
    # END SOLUTION BLOCK

    # Plotting the PSD
    if plot:
        plt.figure(figsize=(10, 5))
        plt.subplot(211)
        plt.plot(time, e_hat)
        plt.title('Residuals')
        plt.ylabel('Temperature [°C]')
        plt.grid(True)
        plt.subplot(212)
        plt.plot(freqs[freqs>0], pxx[freqs>0], label='PSD of residuals')
        plt.xlabel('Frequency')
        plt.ylabel('PSD')
        plt.title('Power Spectral Density')
        plt.grid(True)
        plt.plot(dominant_frequency, amplitude, 'ro', label='Dominant Frequency')
        plt.yscale('log')
        plt.xscale('log')
        plt.legend()
        plt.tight_layout()

    return dominant_frequency
```
 
%% Cell type:markdown id: tags:
 
<div style="background-color:#AABAB2; color: black; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px; width: 95%">
<p>
<b>Task 2.2:</b>
Now provide an A-matrix that removes the trend from the data. There are multiple answers that will work, but some are better than others.
First, use the Markdown cell below to define your A-matrix and include a brief explanation justifying your choice.
</p>
</div>
%% Cell type:markdown id: tags:
_Your answer here._
%% Cell type:markdown id: tags:
<div style="background-color:#FAE99E; color: black; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px; width: 95%">
<p>
<b>Solution:</b>
- The best model (chosen a priori) would be a periodic signal, as we know the temperature behaves this way.
- A linear model may approximate the increasing temperature well, but it is not going to fit the ends well.
- It turns out a power law model gives the best quantitative fit, but this would also be problematic if we need to extrapolate beyond the data range.
</p>
</div>
%% Cell type:markdown id: tags:
<div style="background-color:#AABAB2; color: black; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px; width: 95%">
<p>
<b>Task 2.3:</b>
Now define the A-matrix in code and extract the seasonal pattern. Continue extracting components until the time series is stationary (you will then summarize your findings in the next task).
</p>
</div>
 
%% Cell type:code id: tags:
 
``` python
# YOUR_CODE_HERE
 
 
# SOLUTION
A = np.column_stack((np.ones(len(data)),
                     np.cos(2*np.pi*time_days/365),
                     np.sin(2*np.pi*time_days/365)))
dom_f = find_frequency(data, time_days, A, fs=fs)
print(f'Dominant Frequency: {dom_f:.2f}')
# END SOLUTION BLOCK
```
 
%% Output
 
Dominant Frequency: 1.00
 
 
%% Cell type:markdown id: tags:
 
<div style="background-color:#AABAB2; color: black; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px; width: 95%">
<p>
<b>Task 2.4:</b>
Describe your results for making the time series stationary. Include at least: a) the number and types of components used (and their parameters), b) how you decided to stop extracting components.
</p>
</div>
%% Cell type:markdown id: tags:
_Your answer here._
 
%% Cell type:markdown id: tags:
 
<div style="background-color:#FAE99E; color: black; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px; width: 95%">
<p>
<b>Solution:</b>

- The annual signal is obvious; then of course the daily signal (definitely periodic)
- The PSD after taking out the daily signal indicates this is enough
</p>
</div>
%% Cell type:markdown id: tags:
## Fitting the Functional Model
In the next cell we will fit the model to generate stationary residuals. Above, you may have found a periodic signal, where for each dominant frequency $f_i$ ($i=1,2$) the model is:
 
$$a_i \cos(2\pi f_i t) + b_i \sin(2\pi f_i t)$$
 
However, to report the periodic signals we would like to have the amplitude, phase shift and the frequency of those signals, which can be recovered from:
$$A_i \cos(2\pi f_i t + \theta_i)$$
where the amplitude is $A_i = \sqrt{a_i^2 + b_i^2}$ and the phase is $\theta_i = \arctan(-b_i/a_i)$, computed in practice with the two-argument arctangent $\mathrm{atan2}(-b_i, a_i)$ so that the correct quadrant is recovered.
 
Note: this was shown in Section 4.1 of the book, where the angular frequency $\omega = 2\pi f$ was used.
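
%% Cell type:markdown id: tags:

As a quick numerical sanity check (illustrative only; the values of `a`, `b` and `f` below are arbitrary), the identity $a\cos(2\pi f t) + b\sin(2\pi f t) = A\cos(2\pi f t + \theta)$ can be verified directly:

%% Cell type:code id: tags:

``` python
# Minimal sketch: verify the amplitude/phase rewriting for arbitrary a, b
import numpy as np

a, b, f = 2.0, -3.0, 1.0
t = np.linspace(0, 2, 200)

A_amp = np.sqrt(a**2 + b**2)
theta = np.arctan2(-b, a)

lhs = a*np.cos(2*np.pi*f*t) + b*np.sin(2*np.pi*f*t)
rhs = A_amp*np.cos(2*np.pi*f*t + theta)
print(np.allclose(lhs, rhs))  # True
```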
 
%% Cell type:markdown id: tags:
<div style="background-color:#AABAB2; color: black; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px; width: 95%">
<p>
<b>Task 2.5:</b>
Complete the code cell below to create the functional model.
</p>
</div>
%% Cell type:code id: tags:
 
``` python
def rewrite_seasonal_comp(ak, bk):
    '''
    Function to rewrite the seasonal component in terms of amplitude and phase
    ak: seasonal component coefficient for cos
    bk: seasonal component coefficient for sin
    '''
    # YOUR_CODE_HERE

    # SOLUTION
    Ak = np.sqrt(ak**2 + bk**2)
    theta_k = np.arctan2(-bk, ak)
    return Ak, theta_k
    # END SOLUTION BLOCK

# creating the A matrix of the functional model
# A = YOUR_CODE_HERE
# x_hat, y_hat, e_hat = YOUR_CODE_HERE

# SOLUTION
A = np.column_stack((np.ones(len(data)),
                     np.cos(2*np.pi*1*time_days), np.sin(2*np.pi*1*time_days),
                     np.cos(2*np.pi*time_days/365), np.sin(2*np.pi*time_days/365)))

x_hat, y_hat, e_hat0 = fit_model(data, time_days, A)
# END SOLUTION BLOCK

# Plotting the data and the estimated trend
plt.figure(figsize=(10, 3))
plt.plot(time_days, data, label='Original data')
plt.plot(time_days, y_hat, label='Estimated trend')
plt.xlabel('Time [days]')
plt.ylabel('Temperature [°C]')
plt.title('Temperature data Nenana, Alaska')
plt.grid(True)
plt.legend()

# Plotting the residuals
plt.figure(figsize=(10, 3))
plt.plot(time_days, e_hat0)
plt.xlabel('Time [days]')
plt.ylabel('Temperature [°C]')
plt.title('Residuals')
plt.grid(True)

# Extracting the seasonal component coefficients from the estimated parameters
# a_i = YOUR_CODE_HERE
# b_i = YOUR_CODE_HERE
# freqs = YOUR_CODE_HERE

# SOLUTION
a_i = np.array([x_hat[1], x_hat[3]])
b_i = np.array([x_hat[2], x_hat[4]])
freqs = np.array([1, 1/365])
# END SOLUTION BLOCK

print('Estimated Parameters:')
for i in range(len(x_hat)):
    print(f'x{i} = {x_hat[i]:.3f}')

print('\nThe seasonal component is rewritten as:')
for i, j, k in zip(a_i, b_i, freqs):
    Ak, theta_k = rewrite_seasonal_comp(i, j)
    print(f'Ak = {Ak:.3f}, theta_k = {theta_k:.3f}, f_k = {k:.3f}')
 
```
 
%% Output
 
Estimated Parameters:
x0 = -5.702
x1 = 0.227
x2 = -2.989
x3 = -17.784
x4 = 10.409
The seasonal component is rewritten as:
Ak = 2.998, theta_k = 1.495, f_k = 1.000
Ak = 20.606, theta_k = -2.612, f_k = 0.003
 
 
 
%% Cell type:markdown id: tags:
 
<div style="background-color:#AABAB2; color: black; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px; width: 95%">
<p>
<b>Task 2.6:</b>
Are the residuals stationary? State yes or no and describe why in the cell below.
</p>
</div>
%% Cell type:markdown id: tags:
_Your answer here._
%% Cell type:markdown id: tags:
<div style="background-color:#FAE99E; color: black; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px; width: 95%">
<p>
<b>Solution:</b>
No! There is an offset...
</p>
</div>
%% Cell type:markdown id: tags:
## Part 3: Finding the Grizzly
 
When we look at the residuals after removing the periodic pattern(s), we see that there is still a pattern in the data. From researchers in the Nenana area we have heard that there is a grizzly bear that likes to take a nap (hibernate) in the area. We suspect that the grizzly bear has slept too close to the temperature sensor and has influenced the data.
 
In the next cell we will write an offset detection algorithm to find the offset in the data. The offset detection algorithm is based on the likelihood ratio test framework. However, due to the presence of autocorrelation in the residuals, the traditional critical values for the likelihood ratio test are not valid. Therefore, we will use a bootstrap approach to estimate the critical values (a rough sketch of this bootstrap idea is given after the algorithm description below). Luckily, this is **not** the first time we had to remove a grizzly bear from our data, so we know that the estimated critical value is approximately 100.
 
## The offset detection algorithm
The likelihood ratio test statistic is given by:
 
$$\Lambda = n \log \left( \frac{S_0}{S_1} \right)$$
 
$$S_j = \sum_{t=1}^{n} \hat{e}_t^{\,2}, \qquad j = 0, 1$$

where $S_0$ is the sum of the squared residuals for the model without an offset, $S_1$ is the sum of the squared residuals for the model with an offset, and $n$ is the number of data points. The likelihood ratio test statistic is compared to a critical value to determine whether an offset is present in the data.
 
The cell below defines several functions which roughly accomplish the following:
1. Calculate the sum of the squared residuals for the model without an offset, $S_0$.
2. Calculate the sum of the squared residuals for the model with an offset at each data point, $S_1$. The A-matrix for the model with an offset is the same as the A-matrix for the model without an offset, but with an additional column that is 0 up to the data point and 1 after it.
3. Find the offset location that maximizes the likelihood ratio test statistic.
4. Include the offset in the model and repeat the process until the likelihood ratio test statistic is below the critical value.
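
%% Cell type:markdown id: tags:

Before implementing this, it may help to see where a critical value like 100 could come from. The cell below is a rough, illustrative sketch of the bootstrap idea under the null hypothesis: simulate autocorrelated noise with no offset, compute the maximum LR statistic over all candidate offset locations, and take a high quantile. The mean-only model, the AR(1) coefficient, the sample sizes and the helper `max_lr_stat` are all assumptions chosen for the demo; this is not the procedure used to obtain the value 100.

%% Cell type:code id: tags:

``` python
# Rough sketch: bootstrap a critical value for the max-LR statistic under H0
import numpy as np

rng = np.random.default_rng(42)
n, phi, n_boot = 300, 0.9, 200          # illustrative sizes; real runs use more

def max_lr_stat(y):
    '''Largest LR statistic over all candidate offset locations (mean-only model).'''
    e0 = y - y.mean()
    S0 = e0 @ e0
    best = 0.0
    for k in range(10, len(y) - 10):     # avoid the ends (near-singular fits)
        e1 = np.concatenate((y[:k] - y[:k].mean(), y[k:] - y[k:].mean()))
        best = max(best, len(y)*np.log(S0 / (e1 @ e1)))
    return best

stats = []
for _ in range(n_boot):
    e = np.zeros(n)                      # simulate AR(1) noise with no offset
    for t in range(1, n):
        e[t] = phi*e[t-1] + rng.standard_normal()
    stats.append(max_lr_stat(e))

print(f'Bootstrapped 95% critical value: {np.quantile(stats, 0.95):.1f}')
```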
 
%% Cell type:markdown id: tags:
<div style="background-color:#AABAB2; color: black; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px; width: 95%">
<p>
<b>Task 3.1:</b>
Using the description above and the comments and docstring in the code, fill in the code below to complete the offset detection algorithm.
</p>
</div>
%% Cell type:code id: tags:
 
``` python
def A1_matrix(A0, break_point):
    '''
    Function to create the A1 matrix
    A0: A matrix under H0
    break_point: break point location
    return: A1 matrix
    '''
    # create the new column and stack it to the A0 matrix
    # YOUR_CODE_HERE

    # SOLUTION
    new_col = np.zeros(A0.shape[0])
    new_col[break_point:] = 1
    A1 = np.column_stack((A0, new_col))
    # END SOLUTION BLOCK
    return A1


def LR(e0, e1, cv=100, verbose=True):
    '''
    Function to perform the LR test
    e0: residuals under H0
    e1: residuals under H1
    cv: critical value
    '''
    # n = YOUR_CODE_HERE
    # SSR0 = YOUR_CODE_HERE
    # SSR1 = YOUR_CODE_HERE
    # test_stat = YOUR_CODE_HERE

    # SOLUTION
    n = len(e0)
    SSR0 = e0.T @ e0
    SSR1 = e1.T @ e1
    test_stat = n*np.log(SSR0 / SSR1)
    # END SOLUTION

    if test_stat > cv:
        if verbose:
            print(f'Test Statistic: {test_stat:.3f} > Critical Value: {cv:.3f}')
            print('Reject the null hypothesis')
    else:
        if verbose:
            print(f'Test Statistic: {test_stat:.3f} < Critical Value: {cv:.3f}')
            print('Fail to reject the null hypothesis')
    return test_stat

def jump_detection(data, time, A, cv=100, plot=True):
    '''
    Function to detect the jump in the data
    data: input data
    time: time vector
    A: A matrix under H0
    cv: critical value
    plot: boolean to plot the results or not
    '''
    # initialize the results vector
    # results = YOUR_CODE_HERE
    # find the residuals under H0
    # YOUR_CODE_HERE

    # SOLUTION
    _, _, e_hat0 = fit_model(data, time, A)
    results = np.zeros(len(data))
    # END SOLUTION BLOCK

    # loop over the data points
    for i in range(1, len(data)):
        # create the A1 matrix
        # A1 = YOUR_CODE_HERE

        # SOLUTION
        A1 = A1_matrix(A, i)
        # END SOLUTION BLOCK

        # We need this statement to avoid singular matrices
        if np.linalg.matrix_rank(A1) < A1.shape[1]:
            pass
        else:
            # find the residuals under H1
            # _, _, e_hat1 = YOUR_CODE_HERE
            # test_stat = YOUR_CODE_HERE
            # results[i] = YOUR_CODE_HERE

            # SOLUTION
            _, _, e_hat1 = fit_model(data, time, A1)
            test_stat = LR(e_hat0, e_hat1, verbose=False)
            results[i] = test_stat
            # END SOLUTION BLOCK

    results = np.array(results)
    # finding the offset location:
    # the offset is where the test statistic is maximum

    # location = YOUR_CODE_HERE
    # value = YOUR_CODE_HERE

    # SOLUTION
    location = np.argmax(results)
    value = results[location]
    # END SOLUTION BLOCK

    if plot:
        plt.figure(figsize=(10, 3))
        plt.plot(time, results)
        plt.plot(time[location], value, 'ro', label='offset')
        plt.plot([0, max(time)], [cv, cv], 'k--', label='Critical Value')
        plt.xlabel('Time [days]')
        plt.ylabel('Test Statistic')
        plt.title('LR Test')
        plt.grid(True)
        plt.legend()

    return location, value
```
 
%% Cell type:markdown id: tags:
 
<div style="background-color:#AABAB2; color: black; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px; width: 95%">
<p>
<b>Task 3.2:</b>
Now we will implement the offset detection algorithm by running the function to find the offset in the data. The function will provide figures from which you will be able to determine the offset.
</p>
</div>
 
%% Cell type:code id: tags:
 
``` python
# YOUR_CODE_HERE

# SOLUTION
A_offset = A.copy()

while True:
    break_point, test_stat = jump_detection(data, time_days, A_offset)
    print(f'Break Point: {break_point} with test statistic: {test_stat:.2f}')
    if test_stat < 100:
        break
    A_offset = A1_matrix(A_offset, break_point)
# END SOLUTION BLOCK
```
 
%% Cell type:markdown id: tags:
<div style="background-color:#AABAB2; color: black; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px; width: 95%">
<p>
<b>Task 3.3:</b>
Write your chosen offset in the cell below (report both the size and location of the offset).
</p>
</div>
%% Cell type:markdown id: tags:
My offset is: ...
%% Cell type:markdown id: tags:
<div style="background-color:#AABAB2; color: black; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px; width: 95%">
<p>
<b>Task 3.4:</b>
Once you have found the offset, identify the offset location and update your A-matrix to include it in the model. Then repeat the process until the likelihood ratio test statistic is below the critical value.
</p>
</div>
%% Cell type:code id: tags:
 
``` python
# A2 = YOUR_CODE_HERE
# YOUR_CODE_HERE = fit_model(YOUR_CODE_HERE)
# SOLUTION
A2 = A_offset
x_hat, y_hat, e_hat0 = fit_model(data, time_days, A2)
# END SOLUTION (PART 1 of 2)
 
# Plotting the data and the estimated trend
plt.figure(figsize=(10, 3))
plt.plot(time_days, data, label='Original data')
plt.plot(time_days, y_hat, label='Estimated trend')
plt.xlabel('Time [days]')
plt.ylabel('Temperature [°C]')
plt.title('Temperature data Nenana, Alaska')
plt.grid(True)
plt.legend()
 
# Plotting the residuals
plt.figure(figsize=(10, 3))
plt.plot(time_days, e_hat0)
plt.xlabel('Time [days]')
plt.ylabel('Temperature [°C]')
plt.title('Residuals')
plt.grid(True)
 
# Extracting the seasonal component coefficients from the estimated parameters
# a_i = YOUR_CODE_HERE
# b_i = YOUR_CODE_HERE
# freqs = YOUR_CODE_HERE
 
# SOLUTION
a_i = np.array([x_hat[1], x_hat[3]])
b_i = np.array([x_hat[2], x_hat[4]])
freqs = np.array([1, 1/365])
# END SOLUTION BLOCK
 
print('Estimated Parameters:')
for i in range(len(x_hat)):
    print(f'x{i} = {x_hat[i]:.3f}')

print('The seasonal component is rewritten as:')
for i, j, k in zip(a_i, b_i, freqs):
    Ak, theta_k = rewrite_seasonal_comp(i, j)
    print(f'Ak = {Ak:.3f}, theta_k = {theta_k:.3f}, f_k = {k:.3f}')
```
 
%% Output
 
Estimated Parameters:
x0 = -1.022
x1 = 0.225
x2 = -3.002
x3 = -19.707
x4 = 2.315
x5 = -6.076
x6 = 4.487
The seasonal component is rewritten as:
Ak = 3.011, theta_k = 1.496, f_k = 1.000
Ak = 19.842, theta_k = -3.025, f_k = 0.003
 
 
 
%% Cell type:markdown id: tags:
 
<div style="background-color:#AABAB2; color: black; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px; width: 95%">
<p>
<b>Task 3.5:</b>
Use the Markdown cell below to summarize the number and value of all offsets that you found.
</p>
</div>
%% Cell type:markdown id: tags:
_Your answer here._
%% Cell type:markdown id: tags:
<div style="background-color:#FAE99E; color: black; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px; width: 95%">
<p>
<b>Solution:</b>
There should be 2 offsets, at days 70 and 120, each with a magnitude of around 5 (the true values were around 6 and 4).
</p>
</div>
%% Cell type:markdown id: tags:
## Part 4: Analyzing the residuals
Now that we have our residuals, we can fit an AR model to them. We will start by plotting the ACF of the residuals, then fit an AR model and report its parameters. Using the likelihood ratio test framework we will determine the order of the AR model (a sketch of this comparison is given after the AR(2) fit below).
 
%% Cell type:code id: tags:
 
``` python
# Let's start with the ACF plot
fig, ax = plt.subplots(1, 1, figsize=(10, 3))
plot_acf(e_hat0, ax=ax, lags=20);
ax.grid()
```
 
%% Output
 
 
%% Cell type:markdown id: tags:
<div style="background-color:#AABAB2; color: black; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px; width: 95%">
<p>
<b>Task 4.1:</b>
Begin by completing the function below to define AR(1) (hint: you did this on Wednesday).
</p>
</div>
%% Cell type:code id: tags:
 
``` python
def AR1(s, time, plot=True):
    '''
    Function to find the AR(1) model of the given data
    s: input data
    return: x_hat, e_hat
    '''
    # y = YOUR_CODE_HERE
    # y_lag_1 = YOUR_CODE_HERE
    # A = np.atleast_2d(y_lag_1).T
    # x_hat, y_hat, e_hat = fit_model(YOUR_CODE_HERE)

    # SOLUTION
    y = s[1:]
    y_lag_1 = s[:-1]
    A = np.atleast_2d(y_lag_1).T
    x_hat, y_hat, e_hat = fit_model(y, time, A)
    # END SOLUTION

    if plot:
        fig, ax = plt.subplots(2, 1, figsize=(10, 5))
        ax[0].plot(time[1:], y, label='Original Residuals')
        ax[0].plot(time[1:], y_hat, label='Estimated Residuals')
        ax[0].set_xlabel('Time [days]')
        ax[0].set_ylabel('Temperature [°C]')
        ax[0].set_title('Original Data vs Estimated Data')
        ax[0].grid(True)
        ax[0].legend()
        plot_acf(e_hat, ax=ax[1], lags=20)
        ax[1].grid()
        fig.tight_layout()

    print('Estimated Parameters:')
    print(f'phi = {x_hat[0]:.4f}')

    return x_hat, e_hat

# Estimating the AR(1) model
# phi_hat_ar1, e_hat_ar1 = AR1(YOUR_CODE_HERE)

# SOLUTION
phi_hat_ar1, e_hat_ar1 = AR1(e_hat0, time_days)
# END SOLUTION
```
 
%% Output
 
Estimated Parameters:
phi = 0.8847
 
 
%% Cell type:markdown id: tags:
<div style="background-color:#AABAB2; color: black; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px; width: 95%">
<p>
<b>Task 4.2:</b>
As you can see, the next task asks you to implement AR(2). State why this is necessary, using the results from the cell above.
</p>
</div>
%% Cell type:markdown id: tags:
_Your answer here._
%% Cell type:markdown id: tags:
<div style="background-color:#FAE99E; color: black; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px; width: 95%">
<p>
<b>Solution:</b>
The ACF of the residuals at lag 1 is not zero (i.e., not within the confidence interval), so AR(1) is not sufficient.
</p>
</div>
%% Cell type:markdown id: tags:
<div style="background-color:#AABAB2; color: black; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px; width: 95%">
<p>
<b>Task 4.3:</b>
Now complete the functions to set up AR(2).
</p>
</div>
%% Cell type:code id: tags:
 
``` python
def AR2(s, time, plot=True):
    '''
    Function to find the AR(2) model of the given data
    s: input data
    return: x_hat, e_hat
    '''
    # y = YOUR_CODE_HERE
    # y_lag_1 = YOUR_CODE_HERE
    # A = YOUR_CODE_HERE
    # x_hat, y_hat, e_hat = fit_model(YOUR_CODE_HERE)

    # SOLUTION
    y = s[2:]
    y_lag_1 = s[1:-1]
    y_lag_2 = s[:-2]
    A = np.column_stack((y_lag_1, y_lag_2))
    x_hat, y_hat, e_hat = fit_model(y, time, A)
    # END SOLUTION

    if plot:
        fig, ax = plt.subplots(2, 1, figsize=(10, 5))
        ax[0].plot(time[2:], y, label='Original Residuals')
        ax[0].plot(time[2:], y_hat, label='Estimated Residuals')
        ax[0].set_xlabel('Time [days]')
        ax[0].set_ylabel('Temperature [°C]')
        ax[0].set_title('Original Data vs Estimated Data')
        ax[0].grid(True)
        ax[0].legend()
        plot_acf(e_hat, ax=ax[1], lags=20)
        ax[1].grid()
        fig.tight_layout()

    print('Estimated Parameters:')
    print(f'phi_1 = {x_hat[0]:.4f}, phi_2 = {x_hat[1]:.4f}')

    return x_hat, e_hat

# Estimating the AR(2) model
# phi_hat_ar2, e_hat_ar2 = AR2(YOUR_CODE_HERE)

# SOLUTION
phi_hat_ar2, e_hat_ar2 = AR2(e_hat0, time_days)
# END SOLUTION
```
 
%% Output
 
Estimated Parameters:
phi_1 = 0.7088, phi_2 = 0.1989
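
%% Cell type:markdown id: tags:

The introduction to this part mentions using the likelihood ratio framework to determine the AR order. As a rough, optional sketch (assuming `e_hat_ar1` and `e_hat_ar2` from the cells above), the two fits can be compared with the same $\Lambda$ statistic used in Part 3. Note that with autocorrelated data the $\chi^2$ critical value is only approximate, which is why a bootstrap was used for the offset detection.

%% Cell type:code id: tags:

``` python
# Sketch (not the graded procedure): LR comparison of AR(1) vs AR(2).
# The AR(2) fit drops one more observation than AR(1), so we align the
# AR(1) residuals to the common sample before comparing.
n = len(e_hat_ar2)
SSR1 = e_hat_ar1[1:] @ e_hat_ar1[1:]
SSR2 = e_hat_ar2 @ e_hat_ar2
test_stat = n * np.log(SSR1 / SSR2)
cv = chi2.ppf(0.95, df=1)   # one extra parameter in AR(2); chi2 imported at the top
print(f'LR statistic: {test_stat:.2f}, critical value: {cv:.2f}')
print('AR(2) preferred' if test_stat > cv else 'AR(1) suffices')
```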
 
 
%% Cell type:markdown id: tags:
 
## Part 5: Report the Results
_Note: you did this on Wednesday! It was optional then, so you are not expected to know this for the exam; however, you should implement the code using the WS as a template, and your interpretation at the end will be part of the grade for this assignment._
 
Now that we have found the periodic signals in the data and fitted an AR model to the residuals, we can report the results. By including the AR (noise) process, we get residuals that are white noise. When the model has white-noise residuals, we can also report the confidence intervals of the model parameters.
 
We will use the unbiased estimate of the variance of the residuals to calculate the confidence intervals. The unbiased estimate of the variance is given by:
 
$$\hat{\sigma}^2 = \frac{1}{n-p} \sum_{t=1}^{n} \hat{e}_t^2$$
 
Where $n$ is the number of observations and $p$ is the number of parameters in the model.
 
The covariance matrix of the parameters is given by:
 
$$\hat{\Sigma} = \hat{\sigma}^2 (\mathbf{A}^T \mathbf{A})^{-1}$$
 
Where $\mathbf{A}$ is the design matrix of the model.
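
%% Cell type:markdown id: tags:

For illustration, here is a minimal self-contained sketch of these formulas on toy data (the `_toy` names are made up for this demo); the task below applies the same steps to the real A-matrix. The 95% confidence interval for each parameter is $\hat{x}_i \pm 1.96 \, \hat{\sigma}_{\hat{x}_i}$.

%% Cell type:code id: tags:

``` python
# Minimal sketch: standard errors and 95% confidence intervals for x_hat (toy data)
import numpy as np

rng = np.random.default_rng(0)
n_obs = 100
A_toy = np.column_stack((np.ones(n_obs), np.linspace(0, 1, n_obs)))
y_toy = A_toy @ np.array([1.0, 2.0]) + 0.1*rng.standard_normal(n_obs)

x_hat_toy = np.linalg.solve(A_toy.T @ A_toy, A_toy.T @ y_toy)
e_toy = y_toy - A_toy @ x_hat_toy

N, p = A_toy.shape
sigma2 = (e_toy @ e_toy) / (N - p)               # unbiased variance estimate
Cov = sigma2 * np.linalg.inv(A_toy.T @ A_toy)    # covariance of the parameters
se = np.sqrt(np.diag(Cov))                       # standard errors

for xi, si in zip(x_hat_toy, se):
    print(f'{xi:.3f} ± {1.96*si:.3f}  (95% CI: [{xi-1.96*si:.3f}, {xi+1.96*si:.3f}])')
```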
 
%% Cell type:markdown id: tags:
 
 
 
<div style="background-color:#AABAB2; color: black; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px; width: 95%"> <p> <b>Task 4.1:</b>
<p>
Complete the missing parts of the code cell below.
Complete the missing parts of the code cell below. Note that you will need to add one additional term, compared to Wednesday.
</p>
</div>
 
%% Cell type:code id: tags:
 
``` python
# combine ar2 and functional model
 
# A_final = YOUR_CODE_HERE
# x_hat, y_hat, e_hat_final = fit_model(YOUR_CODE_HERE)
 
# SOLUTION
A_final = np.column_stack((A2[2:], e_hat0[1:-1], e_hat0[:-2]))
x_hat, y_hat, e_hat_final = fit_model(data[2:], time_days[2:], A_final, plot=True)
# END SOLUTION
 
# Plotting the ACF of the residuals
 
# fig, ax = plt.subplots(1, 1, figsize=(10, 3))
# plot_acf(YOUR_CODE_HERE, ax=ax, lags=20);
# ax.grid()
 
# SOLUTION
fig, ax = plt.subplots(1, 1, figsize=(10, 3))
plot_acf(e_hat_final, ax=ax, lags=20);
ax.grid()
# END SOLUTION
 
# # compute the standard errors
# N = YOUR_CODE_HERE
# p = YOUR_CODE_HERE
# sigma2 = YOUR_CODE_HERE
# Cov = YOUR_CODE_HERE
# se = YOUR_CODE_HERE
 
# # Extracting the seasonal component coefficients from the estimated parameters
# a_i = YOUR_CODE_HERE
# b_i = YOUR_CODE_HERE
# freqs = YOUR_CODE_HERE
 
# SOLUTION
# compute the standard errors
N = A_final.shape[0]
p = A_final.shape[1]
sigma2 = np.sum(e_hat_final**2) / (N - p)
Cov = sigma2 * np.linalg.inv(A_final.T @ A_final)
se = np.sqrt(np.diag(Cov))
 
# Extracting the seasonal component coefficients from the estimated parameters
a_i = np.array([x_hat[1], x_hat[3]])
b_i = np.array([x_hat[2], x_hat[4]])
freqs = np.array([1, 1/365])
# END SOLUTION
 
# Check if the number of coefficients match the number of frequencies
assert len(a_i) == len(b_i) == len(freqs), 'The number of coefficients do not match'
 
print('Estimated Parameters (standard deviation):')
for i in range(len(x_hat)):
    print(f'x{i} = {x_hat[i]:.3f}\t\t ({se[i]:.3f})')

print('\nThe seasonal component is rewritten as:')
i = 0
for a, b, f in zip(a_i, b_i, freqs):
    A_i, theta_i = rewrite_seasonal_comp(a, b)
    i = i + 1
    print(f'A_{i} = {A_i:.3f}, theta_{i} = {theta_i:.3f}, f_{i} = {f:.3f}')
```
 
%% Output
 
Estimated Parameters (standard deviation):
x0 = -1.019 (0.091)
x1 = 0.225 (0.021)
x2 = -3.002 (0.021)
x3 = -19.714 (0.074)
x4 = 2.323 (0.095)
x5 = -6.074 (0.076)
x6 = 4.475 (0.071)
x7 = 0.709 (0.016)
x8 = 0.199 (0.016)
The seasonal component is rewritten as:
A_1 = 3.011, theta_1 = 1.496, f_1 = 1.000
A_2 = 19.851, theta_2 = -3.024, f_2 = 0.003
 
 
 
%% Cell type:markdown id: tags:
 
<div style="background-color:#AABAB2; color: black; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px; width: 95%">
<p>
<b>Task 5.2:</b>
Now we have the complete functional model. Reflect on its suitability for capturing the time-dependent variation of temperature throughout the spring. Comment specifically on the time series components that were included and which ones have the most significant influence on the result.
Comment also on the suitability of this model for predicting the temperature **beyond the betting deadline of April 5**, assuming that you have data up **until** that date. Remember that the ice typically breaks apart 2 to 6 weeks after the betting deadline.
</p>
</div>
%% Cell type:markdown id: tags:
_Your answer here._
%% Cell type:markdown id: tags:
<div style="background-color:#FAE99E; color: black; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px; width: 95%">
<p>
<b>Solution:</b>
The model only includes "memory" for 2 points, which is 2 hours. However, from the ACF computed before fitting the AR model, we see that the "effect" of the memory lasts around 20 points, i.e., 20 hours. The AR(2) terms therefore add little for predictions several weeks ahead, and the functional model without these terms would be a fine best estimate for the temperature on a given day several weeks in the future.
</p>
</div>
%% Cell type:markdown id: tags:
*End of the assignment.*