Commit c54b507c authored by Sandra Verhagen

Merge branch 'San_Part' into 'main'

add Chapters of Sandra in sandbox

See merge request !17
parents 9fe5e28a bfb77655
Pipeline #204008 passed
Showing changes with 741 additions and 1 deletion
...@@ -29,6 +29,25 @@ parts:
- file: sandbox/continuous/Parameterization.md
- file: sandbox/continuous/fitting.md
- file: sandbox/continuous/GOF
- file: sandbox/SanPart/ObservationTheory/01_Introduction.md
  title: Sensing and Observation Theory
  sections:
  - file: sandbox/SanPart/ObservationTheory/02_LeastSquares.md
  - file: sandbox/SanPart/ObservationTheory/03_WeightedLSQ.md
  - file: sandbox/SanPart/ObservationTheory/04_BLUE.md
  - file: sandbox/SanPart/ObservationTheory/05_Precision.md
  - file: sandbox/SanPart/ObservationTheory/06_MLE.md
  - file: sandbox/SanPart/ObservationTheory/07_NLSQ.md
  - file: sandbox/SanPart/ObservationTheory/08_Testing.md
  - file: sandbox/SanPart/ObservationTheory/99_NotationFormulas.md
- file: sandbox/SanPart/premath/00_00_PreMath.md
  title: Pre-Math
  sections:
  - file: sandbox/SanPart/premath/00_01a_PreMath.md
  - file: sandbox/SanPart/premath/00_01b_PreMath.md
  - file: sandbox/SanPart/premath/00_01c_PreMath.md
  - file: sandbox/SanPart/premath/00_02_PreMath.md
  - file: sandbox/SanPart/premath/00_03_PreMath.md
- caption: MUDE Cookbook
  chapters:
  - file: cookbook/blank
...
# Sensing and observation theory
As in other physical sciences, empirical data are used in Civil and Environmental Engineering and Applied Earth Sciences to make inferences so as to describe physical reality. Many such problems involve the determination of unknown parameters that bear a functional relationship to the set of observed data or measurements. This determination or parameter estimation can be based on different estimation principles.
From experience we know that various uncertain phenomena can be modeled as a random variable (or a random vector), say $Y$. Examples are: the uncertainty in instrument readings due to measurement errors; the variable strength of material of a certain type, the variable lifetime of pieces of equipment of a particular type after they have been put into service; the number of "hits" in a certain time interval on a particular Web page; the variable size, weight, or contents of products of a particular type delivered by a production line, etc.
...
File added
File added
# Introduction
As in other physical sciences, empirical data are used in Civil and Environmental Engineering and Applied Earth Sciences to make inferences so as to describe physical reality. Many such problems involve the determination of unknown model parameters that bear a functional relationship to the set of observed data or measurements. This determination or parameter estimation can be based on different estimation principles.
From experience we know that various uncertain phenomena can be modeled as a random variable (or a random vector), say $Y$. Examples are: the uncertainty in instrument readings due to measurement errors; the variable strength of material of a certain type; the variable lifetime of pieces of equipment of a particular type after they have been put into service; the number of "hits" in a certain time interval on a particular Web page; the variable size, weight, or contents of products of a particular type delivered by a production line, etcetera. We will refer to these random variables (our input data) $Y$ as the *observables*.
The unknown model parameters will be denoted as $\mathrm{x}$. The ingredients for this part on sensing and observation theory can then be summarized as follows:
* a model to describe the relation between the observables $Y$ and the parameters of interest $\mathrm{x}$
* estimation method(s) to estimate the parameters as a function of $Y$: $\hat{X}=q(Y)$ (i.e., the estimation method defines the function $q$)
* uncertainty propagation to assess the precision of the estimated parameters $\hat{X}$
* hypothesis testing to account for errors in the data and/or to choose the best model from different candidates
```{admonition} Definition
The *functional model* describes the functional relationship between the observed variables (*observables*) and the unknown parameters.
In case the functional relationship is linear, the functional model will have the form:
$$
\mathbb{E}(Y) = \mathrm{Ax}
$$
where $\mathrm{x}$ is a vector with the unknown model parameters, and $\mathrm{A}$ is referred to as the *design matrix*. $\mathbb{E}(.)$ is the expectation operator.
```
Some examples will be shown in [the next subsection](01_funcmodel).
Next, we will develop the principles and underlying theory of (1) Least-squares estimation, (2) Best linear unbiased estimation, and (3) Maximum likelihood estimation.
Special attention is then given to how the uncertainty in the measurements propagates into parameter estimation results, such that we can assess the precision of the estimated parameters. For that purpose we need the stochastic model.
```{admonition} Definition
The *stochastic model* describes the uncertainty of the observables in the form of the covariance matrix $\Sigma_Y$.
```
Parameter estimation requires specification of the underlying functional and stochastic models. It may happen, however, that some parts of the models are misspecified, thereby invalidating the results of estimation. Some measurements, for instance, may be corrupted by blunders, or the chosen model may fail to give an adequate description of physical reality.
(01_funcmodel)=
## Functional model: examples
Relate to 'modelling' in MUDE.
## Estimation and linear regression
The goal of estimation is to estimate *model parameters* from a set of observations. In Civil Engineering, Applied Earth Sciences and Environmental Engineering this is needed in many monitoring and sensing applications, such as:
* Sea level rise
* Land subsidence / uplift
* Air quality monitoring
* Settlement of soils
* Tunnel deformation
* Bridge motions
* Traffic flow rates
* Water vapor content for numerical weather prediction
* Ground water level
In this part, we will introduce different estimation principles, starting with (weighted) least-squares. Next, Best Linear Unbiased estimation and Maximum Likelihood estimation will be introduced, providing the probabilistic framework for estimating *model* parameters from *uncertain data*.
*Linear regression* aims to estimate the relationship between so-called *dependent* and *independent* variables, the latter also referred to as *predictors*, since once the relationship is known, these can be used to predict the behaviour of the dependent variables. Examples of bivariate regression model applications:
* predict *river run-off* (dependent variable) based on *amount of rainfall* (independent variable), see {numref}`Regres`
* predict *house prices* (dependent variable) based on their *energy label* (independent variable)
* predict *geothermal power production* (dependent variable) based on *brine temperature* (independent variable)
* predict *structural deformation* (dependent variable) based on *wind load* (independent variable)
```{figure} ../figures/ObservationTheory/01_Regression.png
---
height: 350px
name: Regres
---
Linear regression example for amount of rainfall (independent variable) and river runoff (dependent variable).
```
The least-squares and maximum likelihood principles are used for estimating these relationships.
Supervised machine learning [**ADD LINK TO MACHINE LEARNING PART**] is all about finding relationships between target (dependent) variables and certain features (predictors), and therefore relies on regression analysis. However, in this part, we will focus on monitoring and sensing applications, and not on linear regression.
# Least-squares estimation
## Functional model: dealing with inconsistency
Given a set of observations which contain noise, and a model that is assumed to explain these data, the goal is to estimate the unknown parameters of that model. The least-squares principle can be used to solve this task.
Mathematically, we use the ‘system of observation equations’, with the vector of observations $\mathrm{y}$, the vector of unknowns $\mathrm{x}$, and the design matrix $\mathrm{A}$. In a simple form, we can write this system of observation equations as
$$
\mathrm{y = Ax}
$$
If all the observations fit this model perfectly, the system is called [consistent](PM_consistent). This is generally only possible if the number of observations is equal to (or smaller than) the number of unknowns.
If the number of observations is greater than the number of unknowns (and the design matrix $\mathrm{A}$ is of full column rank), it is very unlikely that the system is consistent. Physical reality would need to be perfectly described by the conceptual mathematical model. It is evident that in real life this is never the case, since (i) our observations are always contaminated by some form of noise, and (ii) physical reality is often more complex than can be described with a simplified mathematical model.
Thus, in the case in which there are more observations than unknowns (and design matrix $\mathrm{A}$ is of full column rank), the system of equations $\mathrm{y=Ax}$ has no solution. In other words: every 'solution' would be wrong, since the model would not 'fit' the data.
### Example: linear trend model
Assume we have $m$ observations and we try to fit the linear trend model:
$$
\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{bmatrix} = \underset{\mathrm{A}}{\underbrace{\begin{bmatrix} 1 & t_1 \\ 1 & t_2 \\ \vdots & \vdots \\ 1 & t_m \end{bmatrix}}}\underset{\mathrm{x}}{\underbrace{\begin{bmatrix} x_1 \\ x_2 \end{bmatrix}}}
$$
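As a small illustration, the design matrix of this linear trend model can be set up as follows (a sketch with hypothetical observation times `t` and observed values `y`, not the data shown in the figure below):
```python
import numpy as np

# hypothetical observation times and observed values (m = 5)
t = np.array([0., 1., 2., 3., 4.])
y = np.array([1.2, 1.9, 3.1, 3.9, 5.2])

# design matrix of the linear trend model:
# first column of ones (intercept x1), second column the times (slope x2)
A = np.column_stack((np.ones_like(t), t))
print(A.shape)  # (5, 2): m observations, n = 2 unknown parameters
```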
{numref}`LSfit` shows the observations, together with a fitted linear trend line. Obviously, it is impossible to fit a line that perfectly matches all observations due to the small measurement errors, which are assumed to be random.
```{figure} ../figures/ObservationTheory/02_LeastSquares_fit.png
---
height: 250px
name: LSfit
---
Linear trend line fitted to a set of $m$ observations affected by random errors.
```
We solve the problem of having an ‘unsolvable’ equation by considering the random errors for each observation:
$$
\mathrm{y=Ax + \epsilon}
$$
The length of the error vector (or vector of residuals) $\mathrm{\epsilon}$ is equal to the length of the vector of observations $\mathrm{y}$.
## Least-squares principle
We are looking for a solution for $\mathrm{x}$; this solution will be denoted by $\hat{\mathrm{x}}$. Based on this solution, the 'adjusted observations' would be $\mathrm{\hat{y}= A\hat{x}}$ (the solution of the forward model).
To find a solution for an inconsistent system of equations, we prefer the one for which the observations are as ‘close as possible’ to the model. This is a ‘minimum distance’ objective. In mathematical terms, we look at the vector of residuals $\hat{\epsilon}$: the difference between the observations $\mathrm{y}$ and the model realization $\mathrm{\hat{y}}$:
$$
\mathrm{\hat{\epsilon}=y-\hat{y}=y-A\hat{x}}
$$
We want this vector of residuals to be as small as possible (i.e., the minimum distance objective). The ‘least-squares’ principle attempts to achieve this by minimizing (‘least’) the sum of the squared residuals (‘squares’).
If we take the square-root of this sum, we find the length of the vector, also known as the ‘norm’ of the vector. Thus, the various possible error vectors have a norm defined as:
$$
\left \| \epsilon \right \| = \sqrt{\epsilon_1^2+\epsilon_2^2+...+\epsilon_m^2}=\sqrt{\epsilon^T\epsilon}
$$
Finding the minimum of (a) the sum of the squared differences, or (b) the square-root of the sum of the squared differences, yields the same result, since both terms are always positive or zero.
If we find this minimum, the corresponding solution $\mathrm{\hat{x}}$ is then the least-squares solution of the system of observation equations. In mathematical terms this is written as
$$
\mathrm{\hat{x}} = \arg \underset{\mathrm{x}}{\min} \left \| \epsilon \right \|^2= \arg \underset{x}{\min} {(\epsilon^T\epsilon)}
$$
From the system observation equations $\mathrm{y=Ax+\epsilon}$, it follows directly that $\epsilon=\mathrm{y-Ax}$ and therefore the least-squares solution follows as:
$$
\mathrm{\hat{x}} =\arg \underset{\mathrm{x}}{\min} {\mathrm{(y-Ax)^T (y-Ax)}}.
$$
In other words, we find $\mathrm{\hat{x}}$ by finding the minimum of $\mathrm{(y-Ax)^T (y-Ax)}$.
## Least-squares solution
We can find the minimum of a function by taking the first derivative with respect to the 'free variable'. Since the observation vector is not free, and also the design matrix $A$ is fixed, the only variable which we can vary is $x$. The first derivative of our objective function should be equal to zero to reach a minimum (see [Gradient](PM_gradient)):
$$
\begin{align*}
\partial_x \mathrm{(y-Ax)^T (y-Ax)} &=0\\
\partial_x \mathrm{( y^Ty -(Ax)^T y -y^T Ax + (Ax)^T(Ax) )}&=0\\
\partial_x \mathrm{( y^Ty -x^TA^T y -y^T Ax + x^T A^TAx )}&=0\\
\partial_x \mathrm{(y^Ty -2x^TA^T y + x^T A^TAx) }&=0\\
\mathrm{-2A^T y + 2A^TAx} &=0\\
\mathrm{A^TAx} &=\mathrm{A^T y }
\end{align*}
$$
This last equation is known as the normal equation, and the matrix $\mathrm{N=A^T A}$ is known as the normal matrix.
```{admonition} MUDE exam information
:class: tip, dropdown
You will not be asked to derive the least-squares solution as above.
```
If it is possible to compute the inverse of the normal matrix, the normal equation can be written to express the estimate of $\mathrm{x}$, denoted by $\mathrm{\hat{x}}$, as a linear function of the observations:
$$
\mathrm{\hat{x}= (A^T A)^{-1} A^T y}
$$
## Overview
In summary, the least-squares estimates of $\mathrm{x}$, $\mathrm{y}$ and $\epsilon$ are given by:
$$
\mathrm{\hat{x}= (A^T A)^{-1} A^T y}
$$
$$
\mathrm{\hat{y} = A \hat{x}}
$$
$$
\mathrm{\hat{\epsilon} = y - A\hat{x} = y - \hat{y}}
$$
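The three expressions above can be evaluated with a few lines of NumPy. This is a minimal sketch that continues the hypothetical linear trend example; it solves the normal equations with `np.linalg.solve` instead of explicitly inverting $\mathrm{A^T A}$, which is numerically preferable but gives the same result.
```python
import numpy as np

t = np.array([0., 1., 2., 3., 4.])
y = np.array([1.2, 1.9, 3.1, 3.9, 5.2])
A = np.column_stack((np.ones_like(t), t))

# normal equations: (A^T A) x_hat = A^T y
x_hat = np.linalg.solve(A.T @ A, A.T @ y)

y_hat = A @ x_hat   # adjusted observations
e_hat = y - y_hat   # residuals

print(x_hat)        # estimated intercept and slope
print(e_hat)
```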
{numref}`LSsol` visualizes the least-squares solution based on a linear trend model.
```{figure} ../figures/ObservationTheory/02_LeastSquares_sol.png
---
height: 350px
name: LSsol
---
Least-squares fit to a set of 5 observations affected by random errors. Index $i$ refers to the $i$th observation.
```
# Weighted least-squares estimation
In ordinary least-squares estimation, we assume that all observations are equally important. In many cases this is not realistic, as observations may be obtained by different measurement systems, or under different circumstances. We want our methodology for least-squares estimation to be able to take this into account.
We achieve this goal by introducing a weight matrix in the normal equation. In the unweighted least-squares approach, we arrive at the normal equation by pre-multiplying both sides of $\mathrm{y=Ax}$ with the transpose of the design matrix $\mathrm{A^T}$:
$$
\mathrm{y=Ax} \; \rightarrow \; \mathrm{A^T\; y = A^T\; A x }
$$
In the weighted least-squares approach, we add weight matrix $W$ to this pre-multiplication factor, i.e., $ \mathrm{A^T W}$, to obtain the normal equation:
$$
\mathrm{y=Ax} \; \rightarrow \; \mathrm{A^T W \; y = A^TW \; A x}
$$
The normal matrix is now defined as $\mathrm{N=A^T W A}$. From this, assuming that the normal matrix $\mathrm{N}$ is invertible (non-singular), we find the weighted least-squares estimate $\mathrm{\hat{x}}$:
$$
\mathrm{\hat{x} = (A^T W A)^{-1} A^T W y}
$$
We also find the derived estimates $\mathrm{\hat{y}}$ and $\mathrm{\hat{\epsilon}}$:
$$
\mathrm{\hat{y} = A \hat{x} = A (A^T W A )^{-1} A^T W y}
$$
$$
\mathrm{\hat{\epsilon} = y - \hat{y}= y - A \hat{x} = y-A (A^T W A )^{-1} A^T W y = (I- A(A^T W A )^{-1} A^T W) y}
$$
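As an illustration, a weighted least-squares estimate could be computed as in the sketch below, where the weight matrix is a hypothetical diagonal matrix in which the last two observations are considered twice as important as the others.
```python
import numpy as np

t = np.array([0., 1., 2., 3., 4.])
y = np.array([1.2, 1.9, 3.1, 3.9, 5.2])
A = np.column_stack((np.ones_like(t), t))

# hypothetical weight matrix: last two observations count twice as much
W = np.diag([1., 1., 1., 2., 2.])

# weighted normal equations: (A^T W A) x_hat = A^T W y
x_hat = np.linalg.solve(A.T @ W @ A, A.T @ W @ y)
y_hat = A @ x_hat
e_hat = y - y_hat
print(x_hat, e_hat)
```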
## Video
```{eval-rst}
.. raw:: html

   <iframe width="560" height="315" src="https://www.youtube.com/embed/iJmkkz37EuU" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
```
## Discussion on the weight matrix
The weight matrix $\mathrm{W}$ expresses the (relative) weights between the observations. It is always a square matrix; since its size is determined by the number of observations $m$, it is an $m\times m$ matrix.
If it is a unit matrix ($\mathrm{W=I}$), this implies that all observations have equal weight. Note that in this case the solution is identical to the ordinary least-squares solution.
If it is a diagonal matrix, with different values on the diagonal, this implies that entries with a higher value correspond to the observations which are considered to be of more importance. If the weight matrix has non-zero elements on the off-diagonal positions, this implies that (some) observations are correlated.
## Weighted least-squares estimator: properties
Until now, we have looked at the weighted least-squares solution of a single *realization* of the observations, where generally we assume that it is a realization of the *random* observable vector $Y$, since measurements are affected by random errors. It follows that the weighted least-squares *estimator* is given by:
$$
\hat{X} = \mathrm{(A^T W A )^{-1} A^T W} Y
$$
This estimator has two important properties: it is *linear* and *unbiased*.
The linearity property is due to the fact that $\hat{X}$ is a linear function of the observables $Y$.
The unbiased property means that the expectation of $\hat{X}$ is equal to the true (but unknown) $\mathrm{x}$. This can be shown as follows:
$$
\mathbb{E}(\hat{X}) = \mathrm{(A^T W A )^{-1} A^T W} \mathbb{E}(Y) = \mathrm{(A^T W A )^{-1} A^T W Ax = x}
$$
This is a very desirable property. It implies that if we were to repeat the measurements many times to obtain new estimates, the *average of the estimated values* would be equal to the true values.
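The unbiasedness can be illustrated with a small Monte Carlo sketch: for a hypothetical true parameter vector and noise level (the values below are only for illustration), the average of many weighted least-squares estimates approaches the true values.
```python
import numpy as np

rng = np.random.default_rng(42)

t = np.array([0., 1., 2., 3., 4.])
A = np.column_stack((np.ones_like(t), t))
W = np.diag([1., 1., 1., 2., 2.])
x_true = np.array([1.0, 1.0])   # hypothetical true parameters
sigma = 0.2                     # hypothetical standard deviation of the random errors

estimates = []
for _ in range(5000):
    y = A @ x_true + rng.normal(0.0, sigma, size=len(t))  # simulated observations
    estimates.append(np.linalg.solve(A.T @ W @ A, A.T @ W @ y))

print(np.mean(estimates, axis=0))  # close to x_true: empirically unbiased
```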
# Best linear unbiased estimation
Although Best Linear Unbiased estimation (BLUE) is based on a fundamentally different principle than (weighted) least-squares (WLS), it can be regarded as a special case of WLS.
Comparing the weighted least-squares estimator:
$$
\hat{X}_{\text{WLS}}= \mathrm{(\mathrm{A^T} W \mathrm{A})^{-1} \mathrm{A^T} W} Y
$$
with the estimator of the vector of unknowns following from BLUE:
$$
\hat{X}_{\text{BLU}}= (\mathrm{A^T} \Sigma_Y^{-1} \mathrm{A})^{-1} \mathrm{A^T} \Sigma_Y^{-1} Y
$$
We see that the weight matrix $\mathrm{W}$ is now replaced by the inverse of the covariance matrix $\Sigma_Y$.
This makes sense intuitively: suppose the observables are all independent and thus uncorrelated, such that the covariance matrix is a diagonal matrix with the variances as its diagonal elements. The weights for the individual observables are then equal to the inverse of their variances, which implies that a more precise observable (smaller variance) receives a larger weight.
Taking this particular weight matrix, i.e., $\mathrm{W}=\Sigma_Y^{-1}$, has a special meaning. It has the “Best” property. This means that using this particular weight matrix, we obtain a linear estimator which has minimal variance. In other words, with this particular weight matrix we get the best possible estimator among all linear estimators, where ‘best’ represents optimal precision or minimal variance.
Given the BLU-estimator for $\hat{X}$, we can also find the BLU-estimators for $\hat{Y} =\mathrm{A}\hat{X}$, and for $\hat{\epsilon} = Y-\hat{Y}$:
$$
\hat{Y}= \mathrm{A}(\mathrm{A^T} \Sigma_Y^{-1} \mathrm{A})^{-1} \mathrm{A^T} \Sigma_Y^{-1} Y
$$
$$
\hat{\epsilon}= Y-\mathrm{A}(\mathrm{A^T} \Sigma_Y^{-1} \mathrm{A})^{-1} \mathrm{A^T} \Sigma_Y^{-1} Y
$$
## Video
```{eval-rst}
.. raw:: html

   <iframe width="560" height="315" src="https://www.youtube.com/embed/EIZvJrlxZhs" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
```
## BLUE decomposed
In BLUE, Best Linear Unbiased Estimation, the parts of the acronym ‘B’, ‘L’, and ‘U’ refer to specific properties.
*Linear* means that there is a linear (matrix) relation between the variables. Such linear relations imply that if we have a normally distributed vector $Y\sim N(\mathrm{Ax},\Sigma_Y)$, which is multiplied with a matrix $\mathrm{L}$, this product will be normally distributed as well:
$$
\hat{X}=\mathrm{L^T}Y\sim N(\mathrm{L^TAx},\mathrm{L^T} \Sigma_Y \mathrm{L})
$$
where we use the linear propagation laws of the mean and covariance.
*Unbiased* means that the expected value of an estimator of a parameter is equal to the value of that parameter. In other words:
$$
\mathbb{E}(\hat{X})= \mathrm{x}.
$$
The unbiased property also implies that:
$$
\begin{align*}\mathbb{E}(\hat{Y} ) &=\mathrm{A} \mathbb{E}(\hat{X} )=\mathrm{Ax} \\ \mathbb{E}(\hat{\epsilon} ) &= \mathbb{E}(Y)-\mathbb{E}(\hat{Y})= \mathrm{Ax-Ax}= 0 \end{align*}
$$
*Best* means that the estimator has minimum variance (best precision), when compared to all other possible linear estimators:
$$
\text{trace}(\Sigma_{\hat{X}}) = \sum_{i=1}^n \sigma^2_{\hat{X}_i} = \text{minimum}
$$
(Note that this is equivalent to "minimum mean squared errors" $\mathbb{E}(\|\hat{X}-\mathrm{x}\|^2)$. )
(04_cov)=
## Covariance matrices of the BLU estimators
The quality of the estimators is expressed by their covariance matrices. For the ‘best linear unbiased’ estimators $\hat{X}$, $\hat{Y}$ and $\hat{\epsilon}$ we find (by applying the [linear covariance propagation laws](99_proplaw)):
$$
\begin{align*}
\Sigma_{\hat{X}} &= \mathrm{L^T} \Sigma_Y \mathrm{L} \\ &= (\mathrm{A^T} \Sigma_Y^{-1} \mathrm{A})^{-1}\mathrm{A^T} \Sigma_Y^{-1} \cdot \Sigma_Y \cdot \Sigma_Y^{-1} \mathrm{A} (\mathrm{A^T} \Sigma_Y^{-1} \mathrm{A})^{-1}\\ &=(\mathrm{A^T} \Sigma_Y^{-1} \mathrm{A})^{-1}\mathrm{A^T} \Sigma_Y^{-1} \mathrm{A} (\mathrm{A^T} \Sigma_Y^{-1} \mathrm{A})^{-1}\\ &= (\mathrm{A^T} \Sigma_Y^{-1} \mathrm{A})^{-1}\\ \\ \Sigma_{\hat{Y}} &=\mathrm{A}\Sigma_{\hat{X}} \mathrm{A^T} \\ &=\mathrm{A} (\mathrm{A^T} \Sigma_Y^{-1} \mathrm{A})^{-1} \mathrm{A^T} \\ \\
\Sigma_{\hat{\epsilon}} &= (\mathrm{I}_m - A (\mathrm{A^T} \Sigma_Y^{-1} \mathrm{A})^{-1}\mathrm{A^T} \Sigma_Y^{-1}) \cdot \Sigma_Y \cdot (\mathrm{I}_m - A (\mathrm{A^T} \Sigma_Y^{-1} \mathrm{A})^{-1}\mathrm{A^T} \Sigma_Y^{-1})^T \\ &= (\Sigma_Y - \mathrm{A}(\mathrm{A^T} \Sigma_Y^{-1} \mathrm{A})^{-1}\mathrm{A^T}) (\mathrm{I}_m -\Sigma_Y^{-1}\mathrm{A} (\mathrm{A^T} \Sigma_Y^{-1} \mathrm{A})^{-1}\mathrm{A^T})\\ &= \Sigma_Y - \Sigma_{\hat{Y}}-\Sigma_{\hat{Y}}+\Sigma_{\hat{Y}}\\ &= \Sigma_Y - \Sigma_{\hat{Y}}\end{align*}
$$
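A sketch of how the BLU-estimate and these covariance matrices could be computed, assuming a hypothetical stochastic model with independent observables (continuing the hypothetical linear trend example):
```python
import numpy as np

t = np.array([0., 1., 2., 3., 4.])
y = np.array([1.2, 1.9, 3.1, 3.9, 5.2])
A = np.column_stack((np.ones_like(t), t))

# hypothetical stochastic model: independent observables, sigma = 0.2
Sigma_Y = 0.2**2 * np.eye(len(y))
W = np.linalg.inv(Sigma_Y)                 # BLUE: weight matrix = inverse covariance matrix

Sigma_xhat = np.linalg.inv(A.T @ W @ A)    # covariance matrix of the estimated parameters
x_hat = Sigma_xhat @ A.T @ W @ y           # BLU-estimate
Sigma_yhat = A @ Sigma_xhat @ A.T          # covariance matrix of the adjusted observations
Sigma_ehat = Sigma_Y - Sigma_yhat          # covariance matrix of the residuals

print(x_hat)
print(np.sqrt(np.diag(Sigma_xhat)))        # standard deviations of the estimated parameters
```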
In summary, by applying the BLUE method, we can compute best estimators among all linear unbiased estimators, where ‘best’ is quantitatively expressed via the covariance matrix.
## Additional note on the linearity condition of BLUE
The BLUE estimator is the best (or minimum variance) among all linear unbiased estimators. So if we drop the condition of linearity, then BLUE is not necessarily the best. It means that there may be some other non-linear estimators that have even better precision than the BLUE. However, it can be proven that, in the case of normally distributed observations, the BLUE is also the best among all possible estimators. So we can say: for observations $Y$ that are normally distributed, the BLUE is BUE (best unbiased estimator).
# Precision and confidence intervals
When evaluating and reporting estimation results, it is important to give a quality assessment. Thereby we should distinguish between:
* the precision of the estimated parameters (topic of this part)
* how well the model fits the data (topic of chapter [Model testing](08_testing))
The precision of the estimated parameters is provided by the covariance matrix $\Sigma_{\hat{X}}$. But the question is: how to present and interpret the numbers in the matrix? For that purpose we will introduce the concept of confidence intervals, which basically means we will 'convert' the precision to probabilities.
## Video
```{eval-rst}
.. raw:: html

   <iframe width="560" height="315" src="https://www.youtube.com/embed/7XJP_kaeL7c" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
```
(05_normal)=
## Review normal distribution and probabilities
Let's start with a standard normally distributed random variable $Z\sim N(0,1)$, which has an expectation equal to 0, and variance (for us: precision) of 1. {numref}`standard` shows the corresponding PDF.
```{figure} ../figures/ObservationTheory/05_standard.png
---
height: 250px
name: standard
---
PDF of standard normal distributed random variable.
```
We are now interested in the probability that a realization of $Z$ will be within a certain interval $\pm k$ from the expectation 0:
$$
P(-k<Z<k) = P(|Z|<k) = 1-\alpha
$$
In this case, $k$ is the $(1-0.5\alpha)$-quantile, since it can be seen in {numref}`standard` that
$$
P(Z<k) = 1-0.5\alpha
$$
If we choose a certain value for $\alpha$ we can thus find the value of $k$ using the inverse CDF, e.g., with $\alpha = 0.05$ we would obtain $k=1.96$.
A minimal sketch of this computation, using the inverse CDF (`ppf`) of `scipy.stats.norm`, is shown below.
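```python
from scipy.stats import norm

alpha = 0.05
k = norm.ppf(1 - 0.5 * alpha)  # inverse CDF of the standard normal distribution
print(f"k = {k:.2f}")          # k = 1.96
```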
## Confidence interval of observations
```{admonition} Definition
A *confidence interval* of a random variable $\epsilon \sim N(0,\sigma_{\epsilon}^2)$ is defined as the range or interval $\epsilon \pm c$ such that:
$$
P(|\epsilon|<c) = 1-\alpha
$$
where $(1-\alpha)\cdot 100\%$ is the *confidence level*.
```
On purpose we considered $\epsilon \sim N(0,\sigma_{\epsilon}^2)$ in the above definition, since in observation theory we generally assume that the random errors in our observations are normally distributed with zero mean and variance $\sigma^2_Y=\sigma^2_{\epsilon}$.
Similarly as for the standard normal distribution, the interval $\pm c$ can be computed. We will do this by first applying the transformation:
$$
Z = \frac{\epsilon - 0}{\sigma_{\epsilon}} \sim N(0,1)
$$
With this we have that:
$$
P(|\epsilon|<c) = P(|\frac{\epsilon}{\sigma_{\epsilon}}|<\frac{c}{\sigma_{\epsilon}})= P(|Z|< k)
$$
with $c = k\sigma_{\epsilon}$.
This allows us to compute $k$ as shown in {ref}`05_normal` and express the confidence interval as:
$$
\epsilon \pm k\sigma_{\epsilon}
$$
For example, for a confidence level of 95\% we have $\alpha = 0.05$, for which we found $k=1.96$. Hence the corresponding confidence interval is $\epsilon \pm 1.96\sigma_{\epsilon}$.
So far we considered the confidence interval of a single random error. If we have multiple observations, the corresponding intervals can be evaluated in the same way. Even though the precision of each observation may be different, we only need to compute $k$ once to find the intervals $\epsilon_i \pm k\sigma_{\epsilon_i}$, $i=1,\ldots,m$.
{numref}`CI` shows how the confidence intervals can be visualized as error bars for the corresponding observations $y_i$.
```{figure} ../figures/ObservationTheory/05_CI.png
---
height: 350px
name: CI
---
Confidence intervals of two observations $y_1$ and $y_2$ visualized with error bars.
```
```{note}
The error bars in {numref}`CI` should not be interpreted as the interval such that there is $1-\alpha$ probability that the true value of $Y_i$ is in that interval. This is because we are only looking at a single realization of each $Y_i$, while the confidence interval is computed based on the probability that the *random* error deviates $\pm k\sigma_{\epsilon_i}$ from 0, and hence that $Y_i$ deviates by that amount from its expected value. If we were to repeat the two measurements, the values of $y_1$ and $y_2$ would be different and hence the error bars would be shifted.
```
## Confidence intervals of estimated parameters
The covariance matrix $\Sigma_{\hat{X}}$ of a linear estimator can be evaluated by applying the linear propagation law of covariances. Moreover, we know that if our observables $Y$ are normally distributed, applying linear estimation implies that the estimated parameters $\hat{X}$ are also normally distributed. Recall that the weighted least-squares and best linear unbiased estimators are indeed linear estimators.
The [covariance matrices of the BLUE estimator](04_cov) were obtained as:
$$
\begin{align*}
\Sigma_{\hat{X}}&=(\mathrm{A^T} \Sigma_Y^{-1} \mathrm{A})^{-1}\\
\Sigma_{\hat{Y}}&= \mathrm{A}\Sigma_{\hat{X}} \mathrm{A^T} \\
\Sigma_{\hat{\epsilon}}&=\Sigma_Y-\Sigma_{\hat{Y}}
\end{align*}
$$
Following the same logic as above, the confidence intervals for each estimated parameter can be evaluated as:
$$
\begin{align*}
\hat{X}_i &\pm k\sigma_{\hat{X}_i} \\
\hat{Y}_i &\pm k\sigma_{\hat{Y}_i} \\
\hat{\epsilon}_i &\pm k\sigma_{\hat{\epsilon}_i} \\
\end{align*}
$$
where $k$ is determined by the chosen confidence level.
{numref}`CI_model` shows a visualization of the confidence interval of the *predicted* observations, i.e., the fitted model.
```{figure} ../figures/ObservationTheory/05_CI_model.png
---
height: 350px
name: CI_model
---
Confidence interval of fitted model, including error bars for the individual adjusted observations $\hat{y}_i$.
```
Note that covariance matrices can be computed without the need for actual observations. Hence, we can also set up a functional model including future epochs. Let's call the corresponding design matrix $\mathrm{A_p}$, then $\Sigma_{\hat{Y}_p}= \mathrm{A_p}\Sigma_{\hat{X}} \mathrm{A_p^T}$. Note that $\Sigma_{\hat{X}}$ is based on the original model, since the fitted model is only based on the actual observations. In this way, we can visualize the uncertainty of extrapolated values based on the fitted model.
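A sketch of how these confidence intervals could be computed for the fitted model and for (hypothetical) future epochs, continuing the linear trend example with an assumed $\Sigma_Y$:
```python
import numpy as np
from scipy.stats import norm

t = np.array([0., 1., 2., 3., 4.])
y = np.array([1.2, 1.9, 3.1, 3.9, 5.2])
A = np.column_stack((np.ones_like(t), t))
Sigma_Y = 0.2**2 * np.eye(len(y))            # hypothetical stochastic model

W = np.linalg.inv(Sigma_Y)
Sigma_xhat = np.linalg.inv(A.T @ W @ A)      # covariance matrix of the BLU-estimator
x_hat = Sigma_xhat @ A.T @ W @ y

k = norm.ppf(1 - 0.5 * 0.05)                 # 95% confidence level

# confidence interval of the fitted model at the observation epochs
y_hat = A @ x_hat
sigma_yhat = np.sqrt(np.diag(A @ Sigma_xhat @ A.T))
ci_fit = np.column_stack((y_hat - k * sigma_yhat, y_hat + k * sigma_yhat))

# uncertainty of extrapolated values at hypothetical future epochs
t_p = np.array([5., 6.])
A_p = np.column_stack((np.ones_like(t_p), t_p))
sigma_yp = np.sqrt(np.diag(A_p @ Sigma_xhat @ A_p.T))
ci_pred = np.column_stack((A_p @ x_hat - k * sigma_yp, A_p @ x_hat + k * sigma_yp))

print(ci_fit)
print(ci_pred)
```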
# Maximum Likelihood Estimation
Maximum likelihood estimation is yet another estimation principle. In contrast to (weighted) least-squares and Best Linear Unbiased estimation, it relies on the probability distribution of the observables. Recall that for BLUE, we need to know the first two moments, given by the functional and stochastic model, respectively:
$$
\mathbb{E}(Y) = \mathrm{A} \mathrm{x}\;, \mathbb{D}(Y) = \Sigma_{Y}=\Sigma_{\epsilon}
$$
in order to estimate the unknown parameters $\mathrm{x}$. However, the solution $\hat{X}$ does not depend on the underlying distribution of $Y$.
## Maximum likelihood principle
The principle of Maximum Likelihood estimation (MLE) is to find the *most likely* $\mathrm{x}$ given a realization $\mathrm{y}$ of $Y$.
This boils down to estimating the unknown parameters $\alpha$ of the underlying distribution, which means that the probability density function (PDF) is known apart from the $n$ parameters in $\alpha$. We will now distinguish between a PDF and likelihood function.
A *probability density function* $f_Y(\mathrm{y}|\alpha)$ with $\alpha$ known can be evaluated for any value of $\mathrm{y}$.
A *likelihood function* $f_Y(\mathrm{y}|\alpha)$ for a given realization $\mathrm{y}$ can be evaluated for all possible values of $\alpha$.
The goal is to find the $\alpha$ which maximizes the likelihood function for the given realization $\mathrm{y}$.
### Example: exponential distribution
If $Y\sim \text{Exp}(\lambda)$, the goal would be to estimate $\lambda$ based on a realization $\mathrm{y_{obs}}$ of $Y$. The figure shows the PDF of $Y$
$$
f_Y(\mathrm{y}|\lambda)=\lambda \exp(-\lambda \mathrm{y})
$$
for different possible values of $\lambda$. The likelihood function for the given $\mathrm{y_{obs}}$ is shown on the right-hand side. It is shown that for instance the likelihood value for $\lambda_1$ is equal to the corresponding density value in the left panel. The maximum likelihood estimate $\hat{\lambda}$ is the value for which the likelihood function is maximized as shown in the figure.
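Since the figure is not reproduced here, a small numerical sketch may help (with a hypothetical observed value `y_obs`): the likelihood $\lambda \exp(-\lambda \mathrm{y_{obs}})$ is evaluated on a grid of candidate $\lambda$ values, and its maximizer coincides with the analytical solution $\hat{\lambda}=1/\mathrm{y_{obs}}$ for a single observation.
```python
import numpy as np

y_obs = 2.5                              # hypothetical single observation
lam = np.linspace(0.01, 3.0, 1000)       # candidate values of lambda
likelihood = lam * np.exp(-lam * y_obs)  # f_Y(y_obs | lambda) as a function of lambda

lam_hat = lam[np.argmax(likelihood)]
print(lam_hat, 1 / y_obs)                # numerical maximizer vs. analytical 1/y_obs
```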
## Maximum Likelihood estimator of $\mathrm{x}$
We have that our observables are assumed to be normally distributed: $Y\sim N(\mathrm{Ax},\Sigma_Y)$, where $\mathrm{x}$ is unknown. The covariance matrix $\Sigma_Y$ is assumed to be known, for instance from a calibration campaign.
```{note}
It is also possible to consider the case that $\Sigma_Y$ is not (completely) known, for instance in case of a new sensor. Maximum Likelihood estimation can then also be applied to estimate this covariance matrix. However, this is beyond the scope of MUDE.
```
The likelihood function of the multivariate normal distribution is given by:
$$
f_{Y}(y|\mathrm{Ax},\Sigma_Y)=\underset{c}{\underbrace{(\det{2\pi \Sigma_Y})^{-0.5}}}\exp(-\frac{1}{2}\mathrm{(y-Ax)^T} \Sigma_Y^{-1}(\mathrm{y-Ax}))
$$
Maximizing this likelihood function for $\mathrm{x}$ means that we have to find the $\mathrm{x}$ such that:
* the first-order partial derivatives ([gradient](PM_gradient)) are zero: $\partial_{\mathrm{x} }f_{Y}(y|\mathrm{Ax},\Sigma_Y)=0$
* the matrix of second-order partial derivatives is negative definite.
Instead of working with the likelihood function, we prefer to work with the *loglikelihood* function:
$$
\ln f_{Y}(y|\mathrm{Ax},\Sigma_Y)=\ln c -\frac{1}{2}\mathrm{(y-Ax)^T} \Sigma_Y^{-1}(\mathrm{y-Ax})
$$
since that is easier and results in the same maximum. Setting the gradient to zero gives:
$$
\begin{align*}
\frac{\partial \ln f_{Y}(y|\mathrm{Ax},\Sigma_Y)}{\partial \mathrm{x}}&= \frac{\partial (-\frac{1}{2}\mathrm{(y-Ax)^T} \Sigma_Y^{-1}(\mathrm{y-Ax}))}{\partial \mathrm{x}}\\
&= -\frac{1}{2}(-2\mathrm{A^T} \Sigma_Y^{-1} \mathrm{y}+2\mathrm{A^T} \Sigma_Y^{-1} \mathrm{A}\mathrm{x})=0\\
\mathrm{A^T} \Sigma_Y^{-1} \mathrm{A}\mathrm{x} &= \mathrm{A^T} \Sigma_Y^{-1} \mathrm{y}
\end{align*}
$$
The maximum likelihood estimate for a given realization $\mathrm{y}$ follows thus as:
$$
\hat{\mathrm{x}}= (\mathrm{A^T} \Sigma_Y^{-1} \mathrm{A})^{-1} \mathrm{A^T} \Sigma_Y^{-1} \mathrm{y}
$$
It can be verified that the matrix of second-order partial derivatives is $-\mathrm{A^T} \Sigma_Y^{-1} \mathrm{A}$ and hence indeed negative definite (since $\mathrm{A^T} \Sigma_Y^{-1} \mathrm{A}$ is positive definite).
```{admonition} MUDE exam information
:class: tip, dropdown
You will not be asked to derive the MLE solution as above.
```
If we now replace the realization $\mathrm{y}$ by the random observable vector $Y$ we obtain the *Maximum Likelihood estimator* of $\mathrm{x}$:
$$
\hat{X}= (\mathrm{A^T} \Sigma_Y^{-1} \mathrm{A})^{-1} \mathrm{A^T} \Sigma_Y^{-1} \mathrm{Y}
$$
It follows that the Maximum Likelihood estimator of $\mathrm{x}$ is identical to the BLUE **if** $Y$ is normally distributed.
# Non-linear least-squares estimation
Let the non-linear system of observation equations be given by:
$$
\mathbb{E}\left( \begin{bmatrix} Y_1\\ Y_2\\ \vdots\\ Y_m \end{bmatrix} \right) = A(x) = \begin{bmatrix} a_{1}(x)\\ a_{2}(x)\\ \vdots\\ a_{m}(x) \end{bmatrix}
$$
This system cannot be directly solved using the weighted least-squares or the best linear unbiased estimators. Here, we will introduce the non-linear least-squares principle, based on a linearization of the system of equations.
## Video
```{eval-rst}
.. raw:: html

   <iframe width="560" height="315" src="https://www.youtube.com/embed/qkPzMlnkg6o" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
```
## Linearization
For the linearization we need the first-order Taylor polynomial, which gives the linear approximation of for instance the first observation $y_1 = a_1(x)$ at $x_{[0]}$ as:
$$
y_1\approx a_1( x_{[0]})+ \partial_{x_1} a_1(x_{[0]})\Delta x_{1,[0]}+ \partial_{x_2} a_1(x_{[0]})\Delta x_{2,[0]}+ \ldots + \partial_{x_n} a_1(x_{[0]})\Delta x_{n,[0]}
$$
where $\Delta x_{[0]} = x- x_{[0]}$ (with $i$-th element $\Delta x_{i,[0]}$) and $x_{[0]}$ is the initial guess of $x$.
The difference between the actual observation $y_1$ and the solution of the forward model at $x_{[0]}$ is thus given by:
$$
\begin{align} \Delta y_{1[0]} &= y_1- a_1(x_{[0]}) \\ &\approx \left[ \partial_{x_1} a_1(x_{[0]}) \quad \partial_{x_2} a_1(x_{[0]}) \quad \ldots \quad \partial_{x_n} a_1(x_{[0]})\right]\Delta x_{[0]} \end{align}
$$
We can now obtain the linearized functional model:
$$
\begin{align} \mathbb{E}\left(\begin{bmatrix} \Delta Y_1 \\ \Delta Y_2 \\ \vdots \\ \Delta Y_m \end{bmatrix}_{[0]}\right) &=\begin{bmatrix} \partial_{x_1} a_1(x_{[0]}) &\partial_{x_2} a_1(x_{[0]}) & \ldots & \partial_{x_n} a_1(x_{[0]}) \\ \partial_{x_1} a_2(x_{[0]}) &\partial_{x_2} a_2(x_{[0]}) & \ldots & \partial_{x_n} a_2(x_{[0]}) \\ \vdots & \vdots & \ddots & \vdots\\ \partial_{x_1} a_m(x_{[0]}) &\partial_{x_2} a_m(x_{[0]}) & \ldots& \partial_{x_n} a_m(x_{[0]}) \end{bmatrix}\Delta x_{[0]} \\ &= J_{[0]}\Delta x_{[0]}\end{align}
$$
with [Jacobian matrix](PM_jacobian) $J_{[0]}$.
And this allows to apply best linear unbiased estimation (BLUE) to get an estimate for $\Delta x_{[0]}$:
$$
\Delta \hat{x}_{[0]}=\left(J_{[0]}^T \Sigma_Y^{-1} J_{[0]} \right)^{-1}J_{[0]}^T \Sigma_Y^{-1}\Delta y_{[0]}
$$
since the Jacobian $J_{[0]}$ takes the role of the design matrix $A$ and the observation vector is replaced by its delta-version, as compared to the BLUE of the linear model $\mathbb{E}(Y) = Ax$.
Since we have $\Delta x_{[0]} = x- x_{[0]}$, an “estimate” of $x$ can thus be obtained as:
$$
x_{[1]}=x_{[0]}+\Delta \hat{x}_{[0]}
$$
However, the quality of the linear approximation depends very much on the closeness of $x_{[0]}$ to $x$. Therefore, instead of using $x_{[1]}$ as the estimate, we repeat the process with $x_{[1]}$ as our new guess. This procedure is continued until $\Delta \hat{x}_{[i]}$ becomes small enough. This is called the Gauss-Newton iteration procedure, which is summarized below.
## Gauss-Newton iteration
As stop criterion
$$
\Delta \hat{x}_{[i]}^T W \Delta \hat{x}_{[i]} < \delta
$$
is used. The weight matrix $W$ can be set equal to, for instance, the identity matrix, but a more logical choice may be the normal matrix $N_{[i]}=J_{[i]}^T \Sigma_Y^{-1} J_{[i]}$. The threshold $\delta$ must be set to a very small value.
Once the stop criterion is met, we say that the solution converged, and the last solution $ x_{[i+1]}$ is then finally used as our estimate of $x$.
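A sketch of the Gauss-Newton procedure for a generic non-linear model is given below; the forward model `a_func`, its Jacobian `jac_func`, the exponential-decay example and all numerical values are hypothetical illustrations, not part of the material above.
```python
import numpy as np

def gauss_newton(a_func, jac_func, y, Sigma_Y, x0, delta=1e-8, max_iter=50):
    """Gauss-Newton iteration for non-linear least-squares (sketch)."""
    W_Y = np.linalg.inv(Sigma_Y)
    x_i = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        J = jac_func(x_i)                        # Jacobian at the current guess
        dy = y - a_func(x_i)                     # observed minus computed observations
        N = J.T @ W_Y @ J                        # normal matrix
        dx = np.linalg.solve(N, J.T @ W_Y @ dy)  # BLUE of the linearized model
        x_i = x_i + dx
        if dx @ N @ dx < delta:                  # stop criterion with W = N
            break
    return x_i, np.linalg.inv(N)                 # estimate and approximate covariance matrix

# hypothetical example: observations y = x1 * exp(-x2 * t) at times t
t = np.array([0., 1., 2., 3., 4.])
a_func = lambda x: x[0] * np.exp(-x[1] * t)
jac_func = lambda x: np.column_stack((np.exp(-x[1] * t),
                                      -x[0] * t * np.exp(-x[1] * t)))
y = np.array([2.05, 1.22, 0.80, 0.45, 0.30])
x_hat, Sigma_xhat = gauss_newton(a_func, jac_func, y, 0.05**2 * np.eye(5), x0=[1., 0.1])
print(x_hat)
```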
## Remarks and properties
There is no default recipe for making the initial guess $x_{[0]}$; it must be made based on insight into the problem at hand. A good initial guess is important for two reasons:
* a bad initial guess may result in the solution not converging;
* a good initial guess will speed up the convergence, i.e., fewer iterations are required.
In the iteration scheme, the covariance matrix of the estimator was given by:
$$
\Sigma_{\hat{X}}=\left(J_{[i]}^T \Sigma_Y^{-1} J_{[i]} \right)^{-1}
$$
In fact, this is not strictly true: the estimator $\hat{X}$ is not a best linear unbiased estimator of $x$, since we used a linear approximation. And since the estimator is not a linear estimator, it is not normally distributed.
However, in practice the linear (first-order Taylor) approximation often works so well that the performance is very close to that of BLUE and we may assume the normal distribution.
(08_testing)=
# Model testing
How well does the model fit the data?
A first idea is based on plotting the observations, the fitted model and the confidence intervals...
# Notation and formulas
```{list-table}
:header-rows: 1
* - Description
- Notation / formula
* - Observables (random)
- $Y=\begin{bmatrix} Y_1,Y_2,\ldots, Y_m \end{bmatrix}^T$
* - Observations (realization of $Y$)
- $\mathrm{y}=\begin{bmatrix} y_1,y_2,\ldots, y_m \end{bmatrix}^T$
* - Unknown parameters (deterministic)
- $\mathrm{x}=\begin{bmatrix} x_1,x_2,\ldots, x_n \end{bmatrix}^T$
* - Random errors
- $\epsilon = \begin{bmatrix} \epsilon_1,\epsilon_2,\ldots, \epsilon_m \end{bmatrix}^T$ with $\epsilon\sim N(0,\Sigma_{\epsilon})$
* - Functional model (linear)
- $\mathbb{E}(Y) = \mathrm{A} \mathrm{x}\;$ or $\;Y = \mathrm{A} \mathrm{x} + \epsilon$
* - Stochastic model
- $\mathbb{D}(Y) = \Sigma_{Y}=\Sigma_{\epsilon}$
* - Estimator of $\mathrm{x}$
- $\hat{X}$
* - Estimate of $\mathrm{x}$ (realization of $\hat{X}$)
- $\hat{\mathrm{x}}$
* - Adjusted (or predicted) observations
- $\hat{\mathrm{y}}=\mathrm{A}\cdot\hat{\mathrm{x}}$
* - Residuals
- $\hat{\epsilon}=\mathrm{y}-\mathrm{A}\hat{\mathrm{x}}$
```
## Estimators of $\mathrm{x}$
```{list-table}
:header-rows: 1
* - Estimators
- Formula
* - Least-squares (LS)
- $\hat{X}=\left(\mathrm{A}^T \mathrm{A} \right)^{-1} \mathrm{A}^T Y $
* - Weighted least-squares (WLS)
- $\hat{X}=\left(\mathrm{A}^T\mathrm{WA} \right)^{-1} \mathrm{A}^T\mathrm{W} Y $
* - Best linear unbiased (BLU)
- $\hat{X}=\left(\mathrm{A}^T\Sigma_Y^{-1} \mathrm{A} \right)^{-1} \mathrm{A}^T\Sigma_Y^{-1} Y $
```
## Test statistics (with distribution under $\mathcal{H}_0$)
```{list-table}
:header-rows: 1
* - Test statistic
- Formula
* - Generalized likelihood ratio test
- $T_q = \hat{\epsilon}^T\Sigma_Y^{-1}\hat{\epsilon}-\hat{\epsilon}_a^T\Sigma_Y^{-1}\hat{\epsilon}_a\sim \chi^2(q,0)$
* - Overall model test
- $T_{q=m-n} = \hat{\epsilon}^T\Sigma_Y^{-1}\hat{\epsilon}\sim \chi^2(q,0)$
* - w-test
- $W = \frac{\mathrm{C}^T\Sigma_Y^{-1} \hat{\epsilon}}{\sqrt{\mathrm{C}^T\Sigma_Y^{-1}\Sigma_{\hat{\epsilon}}\Sigma_Y^{-1} \mathrm{C}}} \sim N(0,1)$
```
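As an illustration of the overall model test, the test statistic and its critical value could be computed as in the sketch below, assuming hypothetical residuals and a hypothetical $\Sigma_Y$ from a fit with $m=5$ observations and $n=2$ unknowns.
```python
import numpy as np
from scipy.stats import chi2

# hypothetical inputs from a BLUE fit: m = 5 observations, n = 2 unknowns
m, n = 5, 2
Sigma_Y = 0.2**2 * np.eye(m)
e_hat = np.array([0.05, -0.12, 0.20, -0.16, 0.03])  # hypothetical residuals

q = m - n
T = e_hat @ np.linalg.inv(Sigma_Y) @ e_hat          # overall model test statistic
k_crit = chi2.ppf(1 - 0.05, df=q)                   # critical value for alpha = 0.05
print(T, k_crit, T <= k_crit)                       # do not reject H0 if T <= k_crit
```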
(99_proplaw)=
## Linear propagation laws if $\hat{X}=\mathrm{L^T} Y $
```{list-table}
:header-rows: 1
* - Propagation law of the ...
- Formula
* - ... mean
- $\mathbb{E}(\hat{X}) = \mathrm{L^T}\mathbb{E}(Y)$
* - ... covariance matrix
- $\Sigma_{\hat{X}} =\mathrm{L^T}\Sigma_Y \mathrm{L} $
```
Files added:
book/sandbox/SanPart/figures/ObservationTheory/01_Regression.png (95.5 KiB)
book/sandbox/SanPart/figures/ObservationTheory/02_LeastSquares_fit.png (122 KiB)
book/sandbox/SanPart/figures/ObservationTheory/02_LeastSquares_sol.png (99.5 KiB)
book/sandbox/SanPart/figures/ObservationTheory/05_CI.png (70 KiB)
book/sandbox/SanPart/figures/ObservationTheory/05_CI_model.png (53.6 KiB)
book/sandbox/SanPart/figures/ObservationTheory/05_standard.png (64.4 KiB)