*[CEGM1000 MUDE](http://mude.citg.tudelft.nl/): Week 1.3. Due: Friday, September 20, 2024.*
%% Cell type:markdown id:4ad8b9cb tags:
<div style="background-color:#ffa6a6; color: black; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px; width: 95%"><p><b>Note:</b> don't forget to read the "Assignment Context" section of the README; it contains important information for understanding this analysis.</p></div>
As described above, several functions in this assignment require the use of a Python dictionary to make it easier to keep track of important data, variables and results for the various _models_ we will be constructing and validating.
_It may be useful to revisit PA 1.1, where there was a brief introduction to dictionaries. That PA contains all the dictionary info you need for GA 1.3. A [read-only copy is here](https://mude.citg.tudelft.nl/2024/files/Week_1_1/PA_1_1_Catch_Them_All.html) and [the source code (notebook) is here](https://gitlab.tudelft.nl/mude/2024-week-1-1)._
Test your knowledge by adding a new key <code>new_key</code> and then executing the function to print the value.
</p>
</div>
%% Cell type:code id:41c56f43 tags:
``` python
# YOUR_CODE_HERE
# function_that_uses_my_dictionary(my_dictionary)
# SOLUTION:
my_dictionary['new_key'] = 'new_value'
function_that_uses_my_dictionary(my_dictionary)
```
%% Output
value1
Dictionary Example
[1, 2, 3]
[1 2 3]
hello
new_key exists and has value: new_value
%% Cell type:markdown id:160d6250 tags:
## Task 1: Preparing the data
Within this assignment you will work with two types of data: InSAR data and GNSS data. The cell below will load the data and visualize the observed displacements over time. In this task we use the package `pandas`, which is really useful for handling time series. We will learn how to use it later in the quarter; for now, you only need to recognize that it imports the data as a `dataframe` object, which we then convert into a numpy array using the code below.
%% Cell type:markdown id:02b12781 tags:
<div style="background-color:#facb8e; color: black; width:95%; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px"> <p>Tip: note that we have converted all observations to millimeters.</p></div>
Once you have used the cell above to import the data, investigate the data sets using the code cell below. Then provide some relevant summary information in the Markdown cell.
<em>Hint: at the least, you should be able to tell how many data points are in each data set and get an understanding of the mean and standard deviation of each. Make sure you compare the different datasets and use consistent units.</em>
<div style="background-color:#facb8e; color: black; width:95%; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px"> <p>The code below gives some examples of the quantitative and qualitative ways you could have looked at the data. It is more than you were expected to do; the important thing is that you showed the ability to learn something about the data and describe aspects that are relevant to our problem. We use a dictionary to easily access the different data series using their names, which are entered as the dictionary keys (also not expected of you, but it's hopefully fun to learn useful tricks).</p></div>
There are a lot more GNSS data points than InSAR or groundwater data points. The GNSS observations also have more noise, and what seem to be outliers. In this case the mean and standard deviation do not mean much, because there is clearly a trend with time. We can at least confirm that the time periods of the measurements overlap, although the intervals between measurements are certainly not uniform (note that you don't need to do anything with the times, since they are pandas time series and we have not covered them yet).
</p>
</div>
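As a minimal sketch of this kind of summary, the snippet below stores two data series in a dictionary and reports the number of points, mean and standard deviation of each. The arrays are synthetic stand-ins (an assumption for illustration, not the assignment data), and `summarize` is a hypothetical helper name.

```python
import numpy as np

# Synthetic stand-ins for the observation arrays (values in mm); assumption only.
rng = np.random.default_rng(42)
data = {
    'InSAR': rng.normal(loc=-20.0, scale=3.0, size=61),
    'GNSS': rng.normal(loc=-20.0, scale=15.0, size=730),
}

def summarize(series):
    """Return the number of points, mean and sample standard deviation."""
    return {'n': series.size, 'mean': series.mean(), 'std': series.std(ddof=1)}

# The dictionary keys double as readable names for each data series.
summary = {name: summarize(series) for name, series in data.items()}
for name, stats in summary.items():
    print(f"{name:6s}: n = {stats['n']:4d}, "
          f"mean = {stats['mean']:7.2f} mm, std = {stats['std']:6.2f} mm")
```

Keeping all series in one dictionary means the same loop works unchanged if another data set (for example groundwater) is added later.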
%% Cell type:markdown id:9fe5a729 tags:
You may have noticed that the groundwater data is available at different times than the GNSS and InSAR data. You will therefore have to *interpolate* the data to the same times for further analysis. You can use the SciPy function ```interpolate.interp1d``` (read its [documentation](https://docs.scipy.org/doc/scipy/reference/generated/scipy.interpolate.interp1d.html)).
The cells below do the following:
1. Define a function to convert the time unit
2. Convert the time stamps for all data
3. Use `interp1d` to interpolate the groundwater measurements at the time of the satellite measurements
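
The interpolation in step 3 can be sketched as follows. This is an illustration under stated assumptions: the times are already converted to days, and the arrays `days_gw`, `y_gw` and `days_satellite` are made-up values rather than the assignment data.

```python
import numpy as np
from scipy import interpolate

# Hypothetical groundwater observations: time in days, level in mm.
days_gw = np.array([0.0, 10.0, 20.0, 30.0])
y_gw = np.array([-100.0, -110.0, -105.0, -120.0])

# Hypothetical satellite observation times, inside the groundwater time span.
days_satellite = np.array([5.0, 12.0, 30.0])

# interp1d returns a callable; its default kind is linear interpolation.
interp = interpolate.interp1d(days_gw, y_gw)

# Evaluate the groundwater level at the satellite observation times.
gw_at_satellite_times = interp(days_satellite)
print(gw_at_satellite_times)
```

Note that, by default, `interp1d` raises an error for times outside the range of the groundwater observations, so the satellite times must fall within that range.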
%% Cell type:code id:f02ed4c4 tags:
``` python
def to_days_years(times):
    '''Convert the observation times to days and years.'''
```
<li><code>interp</code> is a function that will return a value (gw level) for the input(s) (date(s)). The interpolated value is found by linearly interpolating between the two nearest times in the gw observations.</li>
<li>The observation arrays of <code>GW_at_GNSS_times</code> and <code>GW_at_InSAR_times</code> changed in size to match the size of the GNSS and InSAR observations, respectively.</li>
Describe the datasets based on the figure above and your observations from the previous tasks. What kind of deformation do you see? What are the differences between the two datasets? Be quantitative.
The points clearly show subsidence, and the displacement follows a similar pattern in both datasets. The GNSS data is much noisier than InSAR (range is around 60 mm versus only a few mm), but has a higher sampling rate. There also seem to be more outliers in the GNSS data than in the InSAR data, especially at the start of the observation period. InSAR only has observations every 6 days but is less noisy.
</p>
</div>
%% Cell type:markdown id:f9a7bdd4 tags:
Before we move on, it is time to do a little bit of housekeeping.
Have you found it confusing to keep track of two sets of variables---one for each data type? Let's use a dictionary to store relevant information about each model. We will use this in the plotting functions for this task (and again next week), so make sure you take the time to see what is happening. Review also Part 0 at the top of this notebook if you need a refresher on dictionaries.
Run the cell below to define a dictionary for storing information about the two (future) models.
</p>
</div>
%% Cell type:code id:2c27b4f3 tags:
``` python
model_insar = {'data_type': 'InSAR',
               'y': y_insar,
               'times': times_insar,
               'groundwater': GW_at_InSAR_times
               }

model_gnss = {'data_type': 'GNSS',
              'y': y_gnss,
              'times': times_gnss,
              'groundwater': GW_at_GNSS_times
              }
```
%% Cell type:markdown id:76c9115b tags:
## Task 2: Set-up linear functional model
We want to investigate how we could model the observed displacements of the road. Because the road is built in the Green Heart we expect that the observed displacements are related to the groundwater level. Furthermore, we assume that the displacements can be modeled using a constant velocity. The model is defined as
$$
d = d_0 + vt + k \ \textrm{GW},
$$
where $d$ is the displacement, $t$ is time and $\textrm{GW}$ is the groundwater level (that we assume to be deterministic).
Therefore, the model has 3 unknowns:
1. $d_0$, as the initial displacement at $t_0$;
2. $v$, as the displacement velocity;
3. $k$, as the 'groundwater factor', which can be seen as the response of the soil to changes in the groundwater level.
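
Because the model is linear in the three unknowns, it can be written as a matrix-vector product with one column per unknown. The sketch below illustrates this; the arrays `days` and `gw` and the parameter values in `x` are hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical inputs: observation times in days and the interpolated
# groundwater levels (mm) at those same times.
days = np.array([0.0, 6.0, 12.0, 18.0])
gw = np.array([-100.0, -104.0, -99.0, -110.0])

# One row per observation, one column per unknown: [d_0, v, k].
A = np.column_stack((np.ones(len(days)), days, gw))

# Evaluate d = d_0 + v*t + k*GW for hypothetical parameter values.
x = np.array([10.0, -0.02, 0.1])
d = A @ x

print(A.shape)
print(d)
```

The column of ones multiplies $d_0$, the time column multiplies $v$, and the groundwater column multiplies $k$.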
As a group you will construct the **functional model** that is defined as
$$
\mathbb{E}(Y) = \mathrm{A x}.
$$
Add the A matrix to the dictionaries for each model. This will be used to plot results later in the notebook.
</p>
</div>
%% Cell type:code id:396ac3a5 tags:
``` python
# model_insar['A'] = YOUR_CODE_HERE
# model_gnss['A'] = YOUR_CODE_HERE
# SOLUTION:
model_insar['A'] = A_insar
model_gnss['A'] = A_gnss
print("Keys and Values (type) for model_insar:")
for key, value in model_insar.items():
    print(f"{key:16s} -->    {type(value)}")

print("\nKeys and Values (type) for model_gnss:")
for key, value in model_gnss.items():
    print(f"{key:16s} -->    {type(value)}")
```
%% Output
Keys and Values (type) for model_insar:
data_type --> <class 'str'>
y --> <class 'numpy.ndarray'>
times --> <class 'pandas.core.series.Series'>
groundwater --> <class 'numpy.ndarray'>
A --> <class 'numpy.ndarray'>
Keys and Values (type) for model_gnss:
data_type --> <class 'str'>
y --> <class 'numpy.ndarray'>
times --> <class 'pandas.core.series.Series'>
groundwater --> <class 'numpy.ndarray'>
A --> <class 'numpy.ndarray'>
%% Cell type:markdown id:9325d32b tags:
## Task 3: Set-up stochastic model
We will use the Best Linear Unbiased Estimator (BLUE) to solve for the unknown parameters. Therefore we also need a stochastic model, which is defined as
$$
\mathbb{D}(Y) = \Sigma_{Y}.
$$
where $\Sigma_{Y}$ is the covariance matrix of the observables' vector.
- The covariance matrix contains information on the quality of the observations, where an entry on the diagonal represents the variance of one observation at a particular epoch. If there is an indication that for instance the quality for a particular time interval differs, different $\sigma$ values can be put in the stochastic model for these epochs.
- The off-diagonal terms in the matrix are related to the correlation between observations at different epochs, where a zero value on the off-diagonal indicates zero correlation.
- The dimension of the matrix is 61x61 for InSAR and 730x730 for GNSS.
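
As an illustration of the points above, the sketch below builds diagonal covariance matrices with the stated dimensions. It assumes independent observations of equal precision within each data set; the standard deviations `std_insar` and `std_gnss` are placeholder values, not values taken from the assignment.

```python
import numpy as np

# Assumed standard deviations in mm, for illustration only.
std_insar = 2.0
std_gnss = 15.0

# Independent observations with equal precision: variance (sigma^2) on the
# diagonal, zeros (no correlation) everywhere else.
Sigma_Y_insar = np.identity(61) * std_insar**2
Sigma_Y_gnss = np.identity(730) * std_gnss**2

print(Sigma_Y_insar.shape, Sigma_Y_gnss.shape)
print(np.sqrt(Sigma_Y_insar.diagonal()[:3]))
```

If the quality differed for a particular time interval, the corresponding diagonal entries could be overwritten with a different variance.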
Write a function to apply BLUE in the cell below and use the function to estimate the unknowns for the model using the data.
Compute the modeled displacements ($\hat{\mathrm{y}}$), and corresponding residuals ($\hat{\mathrm{\epsilon}}$), as well as associated values (as requested by the blank code lines).
</p>
</div>
%% Cell type:markdown id:936d6b0c tags:
<div style="background-color:#facb8e; color: black; width:95%; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px"> <p><strong>Note on code implementation</strong>: you'll see that the functions in this assignment use a dictionary; this greatly reduces the number of input/output variables needed in a function. However, it can make the code inside the function more difficult to read due to the key syntax (e.g., <code>dict['variable_1']</code> versus <code>variable_1</code>). To make this assignment easier for you to implement we have split these functions into three parts: 1) define variables from the dictionary, 2) perform analysis, 3) add results to the dictionary. Note that this is not the most efficient way to write this code; it is done here specifically for clarity and to help you focus on writing the equations properly and understanding the meaning of each term.</p></div>
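
The three-part structure described in the note can be sketched on a small synthetic problem. This is an illustration, not the assignment's function: `blue_sketch`, the toy line-fit data and the output key names are assumptions chosen for the example.

```python
import numpy as np

def blue_sketch(d):
    """Sketch of a dictionary-based BLUE with the three-part structure."""
    # 1) Define variables from the dictionary.
    y = d['y']
    A = d['A']
    Sigma_Y = d['Sigma_Y']

    # 2) Perform the analysis.
    inv_Sigma_Y = np.linalg.inv(Sigma_Y)
    Sigma_X_hat = np.linalg.inv(A.T @ inv_Sigma_Y @ A)
    x_hat = Sigma_X_hat @ A.T @ inv_Sigma_Y @ y
    y_hat = A @ x_hat
    e_hat = y - y_hat

    # 3) Add results to the dictionary (key names are assumptions).
    d['Sigma_X_hat'] = Sigma_X_hat
    d['x_hat'] = x_hat
    d['y_hat'] = y_hat
    d['e_hat'] = e_hat
    return d

# Synthetic, noise-free straight line: the estimate should recover x_true.
t = np.arange(10.0)
A_toy = np.column_stack((np.ones(t.size), t))
x_true = np.array([5.0, -0.5])
toy = {'y': A_toy @ x_true,
       'A': A_toy,
       'Sigma_Y': np.identity(t.size) * 4.0}

toy = blue_sketch(toy)
print(toy['x_hat'])
```

Testing an estimator on noise-free synthetic data first is a quick way to catch transposition or ordering mistakes before applying it to real observations.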
%% Cell type:code id:d85b1826 tags:
``` python
def BLUE(d):
    """Calculate the Best Linear Unbiased Estimator

    Uses dict as input/output:
      - inputs defined from existing values in dict
      - outputs defined as new values in dict
    """

    y = d['y']
    A = d['A']
    Sigma_Y = d['Sigma_Y']

    # Sigma_X_hat = YOUR_CODE_HERE
    # x_hat = YOUR_CODE_HERE

    # y_hat = YOUR_CODE_HERE

    # e_hat = YOUR_CODE_HERE

    # Sigma_Y_hat = YOUR_CODE_HERE
    # std_y = YOUR_CODE_HERE

    # Sigma_e_hat = YOUR_CODE_HERE
    # std_e_hat = YOUR_CODE_HERE

    # SOLUTION:
    Sigma_X_hat = np.linalg.inv(A.T @ np.linalg.inv(Sigma_Y) @ A)
    x_hat = Sigma_X_hat @ A.T @ np.linalg.inv(Sigma_Y) @ y
```
Do the values that you just estimated make sense? Explain, using quantitative results.
<em>Hint: all you need to do is use the figures created above to verify that the parameter values printed above are reasonable (e.g., order of magnitude, units, etc).</em>
A velocity that is negative and around -0.02 mm/day or -10 mm/yr makes sense when compared with the plots of the observations. Since a load is applied to the soil layers, we expect the road to subside. We also expect to see a positive value for the GW factor.
As shown above, the standard deviations of the estimated parameters are equal to the square roots of the diagonal elements of their covariance matrix. Compared with the estimated values, the standard deviations seem quite small, except for the estimated offsets, which means that the complete estimated model can be shifted up or down.
The off-diagonal elements show the covariances between the estimated parameters, which are nonzero since the estimates are all computed as a function of the same vector of observations and the same model. A different value for the estimated velocity would imply a different value for the GW factor and offset.
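
The relation between the covariance matrix, the standard deviations and the correlations can be sketched as follows. The matrix `Sigma_X_hat` below contains made-up numbers for three parameters and is only an assumption for illustration.

```python
import numpy as np

# Hypothetical covariance matrix of three estimated parameters.
Sigma_X_hat = np.array([[4.00, -0.30, 0.20],
                        [-0.30, 0.09, -0.01],
                        [0.20, -0.01, 0.25]])

# Standard deviations: square roots of the diagonal elements.
std_x_hat = np.sqrt(np.diag(Sigma_X_hat))

# Correlation coefficients: covariances scaled by both standard deviations.
corr = Sigma_X_hat / np.outer(std_x_hat, std_x_hat)

print(std_x_hat)
print(corr.round(2))
```

Correlation coefficients are easier to interpret than raw covariances because they are unitless and bounded between -1 and 1.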
Complete the function below to help us compute the confidence intervals, then apply the function. Use a confidence interval of 96% in your analysis.
<em>Hint: it can be used in exactly the same way as the <code>BLUE</code> function above, although it has one extra input.</em>
</p>
</div>
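As a small sketch of how a confidence level translates into the multiplier applied to a standard deviation, assuming normally distributed errors and a two-sided interval: `std_example` is a hypothetical value in mm used only to show the scaling.

```python
from scipy.stats import norm

confidence_level = 0.96
alpha = 1 - confidence_level  # total probability left in the two tails

# Two-sided interval: alpha/2 in each tail of the standard normal distribution.
k = norm.ppf(1 - 0.5 * alpha)

# Half-width of the interval for a hypothetical standard deviation (mm).
std_example = 3.0
half_width = k * std_example

print(f"alpha = {alpha:.2f}, k = {k:.3f}, half-width = {half_width:.2f} mm")
```

A 96% confidence interval therefore corresponds to `alpha = 0.04`, with the interval extending `k` standard deviations on either side of the estimate.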
%% Cell type:code id:2711da12 tags:
``` python
def get_CI(d, alpha):
    """Compute the confidence intervals.

    Uses dict as input/output:
      - inputs defined from existing values in dict
      - outputs defined as new values in dict
    """

    std_e_hat = d['std_e_hat']
    std_y = d['std_y']

    # k = YOUR_CODE_HERE
    # CI_y = YOUR_CODE_HERE
    # CI_res = YOUR_CODE_HERE

    # SOLUTION:
    k = norm.ppf(1 - 0.5*alpha)
    CI_y = k*std_y
    CI_res = k*std_e_hat
    CI_y_hat = k*np.sqrt(d['Sigma_Y_hat'].diagonal())

    d['alpha'] = alpha
    d['CI_y'] = CI_y
    d['CI_res'] = CI_res
    d['CI_Y_hat'] = CI_y_hat

    return d
```
%% Cell type:code id:d9a41ea5 tags:
``` python
# model_insar = YOUR_CODE_HERE
# model_gnss = YOUR_CODE_HERE
# SOLUTION:
model_insar = get_CI(model_insar, 0.04)
model_gnss = get_CI(model_gnss, 0.04)
```
%% Cell type:markdown id:53cf3663 tags:
At this point we have all the important results entered in our dictionary and we will be able to use the plots that have been written for you in the next Tasks. In case you would like to easily see all of the key-value pairs that have been added to the dictionary, you can run the cell below:
Read the contents of the file <code>functions.py</code> and identify what it is doing: you should be able to recognize that the functions use our model dictionary as an input and create three different figures. Note also that the functions that create the figures have already been imported at the top of this notebook.
Use the functions provided to visualize the results of our two models.
<div style="background-color:#facb8e; color: black; width:95%; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px"> <p><strong>Note</strong>: remember that you will have to use the same function to look at <em>both</em> models when writing your interpretation in the Report.</p></div>
%% Cell type:code id:ec7c8bef tags:
``` python
# _, _ = plot_model(YOUR_CODE_HERE)
# SOLUTION:
_, _ = plot_model(model_insar)
_, _ = plot_model(model_gnss)
```
%% Output
%% Cell type:code id:104d155d tags:
``` python
# _, _ = plot_residual(YOUR_CODE_HERE)
# SOLUTION:
_, _ = plot_residual(model_insar)
_, _ = plot_residual(model_gnss)
```
%% Output
%% Cell type:code id:1dc93ce9 tags:
``` python
# _, _ = plot_residual_histogram(YOUR_CODE_HERE)
# SOLUTION:
_, _ = plot_residual_histogram(model_insar)
_, _ = plot_residual_histogram(model_gnss)
```
%% Output
The mean value of the InSAR residuals is 0.0 mm
The standard deviation of the InSAR residuals is 3.115 mm
The mean value of the GNSS residuals is -0.0 mm
The standard deviation of the GNSS residuals is 15.393 mm
The data used in this exercise was generated using Monte Carlo Simulation. It is added to the plots here to illustrate where and how our models differ (it is your job to interpret "why").