diff --git a/book/_toc.yml b/book/_toc.yml
index c333ff1f2d8e80e1b034136dafdf59d3e16c50f0..ba85f9d21d27d8feffe7f9ccbab6100f0fff5591 100644
--- a/book/_toc.yml
+++ b/book/_toc.yml
@@ -29,6 +29,25 @@ parts:
       - file: sandbox/continuous/Parameterization.md
       - file: sandbox/continuous/fitting.md
       - file: sandbox/continuous/GOF
+    - file: sandbox/SanPart/ObservationTheory/01_Introduction.md
+      title: Sensing and Observation Theory
+      sections:
+      - file: sandbox/SanPart/ObservationTheory/02_LeastSquares.md
+      - file: sandbox/SanPart/ObservationTheory/03_WeightedLSQ.md
+      - file: sandbox/SanPart/ObservationTheory/04_BLUE.md
+      - file: sandbox/SanPart/ObservationTheory/05_Precision.md
+      - file: sandbox/SanPart/ObservationTheory/06_MLE.md
+      - file: sandbox/SanPart/ObservationTheory/07_NLSQ.md
+      - file: sandbox/SanPart/ObservationTheory/08_Testing.md
+      - file: sandbox/SanPart/ObservationTheory/99_NotationFormulas.md
+    - file: sandbox/SanPart/premath/00_00_PreMath.md
+      title: Pre-Math
+      sections:
+      - file: sandbox/SanPart/premath/00_01a_PreMath.md
+      - file: sandbox/SanPart/premath/00_01b_PreMath.md
+      - file: sandbox/SanPart/premath/00_01c_PreMath.md
+      - file: sandbox/SanPart/premath/00_02_PreMath.md
+      - file: sandbox/SanPart/premath/00_03_PreMath.md
   - caption: MUDE Cookbook
     chapters:
     - file: cookbook/blank
diff --git a/book/sandbox/ObservationTheory/01_Introduction.md b/book/sandbox/ObservationTheory/01_Introduction.md
index cd40133696743a955cb5ad417453c229a6e4db95..cabe034634c46a42828b61332b8293bf579129e0 100644
--- a/book/sandbox/ObservationTheory/01_Introduction.md
+++ b/book/sandbox/ObservationTheory/01_Introduction.md
@@ -1,4 +1,4 @@
-# Sensing and observation theory
+# Observation theory
 
 As in other physical sciences, empirical data are used in Civil and Environmental Engineering and Applied Earth Sciences to make inferences so as to describe physical reality. Many such problems involve the determination of unknown parameters that bear a functional relationship to the set of observed data or measurements. This determination or parameter estimation can be based on different estimation principles. 
 From experience we know that various uncertain phenomena can be modeled as a random variable (or a random vector), say $Y$. Examples are: the uncertainty in instrument readings due to measurement errors; the variable strength of material of a certain type,  the variable lifetime of pieces of equipment of a particular type after they have been put into service; the number of "hits" in a certain time interval on a particular Web page; the variable size, weight, or contents of products of a particular type delivered by a production line, etc. 
diff --git a/book/sandbox/SanPart/.DS_Store b/book/sandbox/SanPart/.DS_Store
new file mode 100644
index 0000000000000000000000000000000000000000..9a398437b8bffaff530cd863b3d317b5b0a5124d
Binary files /dev/null and b/book/sandbox/SanPart/.DS_Store differ
diff --git a/book/sandbox/SanPart/ObservationTheory/.DS_Store b/book/sandbox/SanPart/ObservationTheory/.DS_Store
new file mode 100644
index 0000000000000000000000000000000000000000..5008ddfcf53c02e82d7eee2e57c38e5672ef89f6
Binary files /dev/null and b/book/sandbox/SanPart/ObservationTheory/.DS_Store differ
diff --git a/book/sandbox/SanPart/ObservationTheory/01_Introduction.md b/book/sandbox/SanPart/ObservationTheory/01_Introduction.md
new file mode 100644
index 0000000000000000000000000000000000000000..67d40762b934b726d05d9df3cc331f94b0362b81
--- /dev/null
+++ b/book/sandbox/SanPart/ObservationTheory/01_Introduction.md
@@ -0,0 +1,74 @@
+# Introduction
+
+As in other physical sciences, empirical data are used in Civil and Environmental Engineering and Applied Earth Sciences to make inferences so as to describe physical reality. Many such problems involve the determination of unknown model parameters that bear a functional relationship to the set of observed data or measurements. This determination or parameter estimation can be based on different estimation principles. 
+
+From experience we know that various uncertain phenomena can be modeled as a random variable (or a random vector), say $Y$. Examples are: the uncertainty in instrument readings due to measurement errors; the variable strength of material of a certain type; the variable lifetime of pieces of equipment of a particular type after they have been put into service; the number of "hits" in a certain time interval on a particular Web page; the variable size, weight, or contents of products of a particular type delivered by a production line, etcetera. We will refer to these random variables (our input data) $Y$ as the *observables*.
+
+The unknown model parameters will be denoted as $\mathrm{x}$. The ingredients for this part on sensing and observation theory can then be summarized as follows:
+* a model to describe the relation between the observables $Y$ and the parameters of interest $\mathrm{x}$
+* estimation method(s) to estimate the parameters as a function of $Y$: $\hat{X}=q(Y)$ (i.e., the estimation method defines the function $q$)
+* uncertainty propagation to assess the precision of the estimated parameters $\hat{X}$
+* hypothesis testing to account for errors in the data and/or to choose the best model from different candidates
+
+```{admonition} Definition
+The *functional model* describes the functional relationship between the observed variables (*observables*) and the unknown parameters. 
+
+In case the functional relationship is linear, the functional model will have the form:
+
+$$
+\mathbb{E}(Y) = \mathrm{Ax}
+$$
+
+where $\mathrm{x}$ is a vector with the unknown model parameters, and $\mathrm{A}$ is referred to as the *design matrix*. $\mathbb{E}(.)$ is the expectation operator.
+```
+
+Some examples will be shown in [the next subsection](01_funcmodel).
+
+Next, we will develop the principles and underlying theory of (1) Least-squares estimation, (2) Best linear unbiased estimation, and (3) Maximum likelihood estimation. 
+
+Special attention is then given to how the uncertainty in the measurements propagates into parameter estimation results, such that we can assess the precision of the estimated parameters. For that purpose we need the stochastic model.
+
+```{admonition} Definition
+The *stochastic model* describes the uncertainty of the observables in the form of the covariance matrix $\Sigma_Y$.
+```
+
+Parameter estimation requires specification of the underlying functional and stochastic models. It may happen, however, that some parts of the models are misspecified, thereby invalidating the results of estimation. Some measurements, for instance, may be corrupted by blunders, or the chosen model may fail to give an adequate description of physical reality. 
+
+(01_funcmodel)=
+## Functional model: examples
+Relate to 'modelling' in MUDE.
+
+
+## Estimation and linear regression
+
+The goal of estimation is to estimate *model parameters* from a set of observations. In Civil Engineering, Applied Earth Sciences and Environmental Engineering this is needed in many monitoring and sensing applications, such as:
+* Sea level rise
+* Land subsidence / uplift
+* Air quality monitoring
+* Settlement of soils 
+* Tunnel deformation
+* Bridge motions
+* Traffic flow rates
+* Water vapor content for numerical weather prediction
+* Ground water level
+
+In this part, we will introduce different estimation principles, starting with (weighted) least-squares. Next, Best Linear Unbiased estimation and Maximum Likelihood estimation will be introduced, providing the probabilistic framework  for estimating *model* parameters from *uncertain data*.
+
+*Linear regression* aims to estimate the relationship between so-called *dependent* and *independent* variables, the latter also referred to as *predictors*, since once the relationship is known, these can be used to predict the behaviour of the dependent variables. Examples of bivariate regression model applications:
+* predict *river run-off* (dependent variable) based on *amount of rainfall* (independent variable), see {numref}`Regres`
+* predict *house prices* (dependent variable) based on their *energy label* (independent variable)
+* predict *geothermal power production* (dependent variable) based on *brine temperature* (independent variable)
+* predict *structural deformation* (dependent variable) based on *wind load* (independent variable)
+
+
+```{figure} ../figures/ObservationTheory/01_Regression.png
+---
+height: 350px
+name: Regres
+---
+Linear regression example for amount of rainfall (independent variable) and river runoff (dependent variable).
+```
+
+The least-squares and maximum likelihood principles are used for estimating these relationships.
+
+Supervised machine learning [**ADD LINK TO MACHINE LEARNING PART**] is all about finding relationships between target (dependent) variables and certain features (predictors), and therefore relies on regression analysis. However, in this part, we will focus on monitoring and sensing applications, and not on linear regression.
diff --git a/book/sandbox/SanPart/ObservationTheory/02_LeastSquares.md b/book/sandbox/SanPart/ObservationTheory/02_LeastSquares.md
new file mode 100644
index 0000000000000000000000000000000000000000..5e144ce36cc0e96f97b1a5a4cb5c2c84ec387a78
--- /dev/null
+++ b/book/sandbox/SanPart/ObservationTheory/02_LeastSquares.md
@@ -0,0 +1,126 @@
+# Least-squares estimation
+
+## Functional model: dealing with inconsistency
+Given a set of observations which contain noise, and a model that is assumed to explain these data, the goal is to estimate the unknown parameters of that model. The least-squares principle can be used to solve this task. 
+
+Mathematically, we use the ‘system of observation equations’, with the vector of observations $\mathrm{y}$, the vector of unknowns $\mathrm{x}$, and the design matrix $\mathrm{A}$. In a simple form, we can write this system of observation equations as
+
+$$
+\mathrm{y = Ax}
+$$
+
+If all the observations fit this model perfectly, the system is called [consistent](PM_consistent). In practice, this is only the case if the number of observations is equal to (or smaller than) the number of unknowns.
+
+If the number of observations is greater than the number of unknowns (and the design matrix $\mathrm{A}$ is of full column rank), it is very unlikely that the system is consistent. Physical reality would need to be perfectly described by the conceptual mathematical model. It is evident that in real life this is never the case, since (i) our observations are always contaminated by some form of noise, and (ii) physical reality is often more complex than can be described with a simplified mathematical model. 
+
+Thus, in the case in which there are more observations than unknowns (and design matrix $\mathrm{A}$ is of full column rank), the $\mathrm{y=Ax}$ system of equations has no solution. In other words: every 'solution' would be wrong, since the model would not 'fit' the data.
+
+### Example: linear trend model
+Assume we have $m$ observations and we try to fit the linear trend model:
+
+$$
+\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{bmatrix} = \underset{\mathrm{A}}{\underbrace{\begin{bmatrix} 1 & t_1 \\ 1 & t_2  \\ \vdots & \vdots \\ 1 & t_m \end{bmatrix}}}\underset{\mathrm{x}}{\underbrace{\begin{bmatrix} x_1 \\ x_2 \end{bmatrix}}}
+$$
+
+{numref}`LSfit` shows the observations, together with a fitted linear trend line. Obviously, it is impossible to fit a line that perfectly matches all observations due to the small measurement errors, which are assumed to be random.
+
+```{figure} ../figures/ObservationTheory/02_LeastSquares_fit.png
+---
+height: 250px
+name: LSfit
+---
+Linear trend line fitted to a set of $m$ observations affected by random errors.
+```
+
+We solve the problem of having an ‘unsolvable’ equation by considering the random errors for each observation:
+
+$$
+\mathrm{y=Ax + \epsilon}
+$$
+
+The length of the error vector (or vector of residuals) $\mathrm{\epsilon}$ is equal to the length of the vector of observations $\mathrm{y}$.
+
+## Least-squares principle
+We are looking for a solution for $\mathrm{x}$; this solution will be denoted by $\mathrm{\hat{x}}$. Based on this solution, the 'adjusted observations' would be $\mathrm{\hat{y}= A\hat{x}}$ (solution of the forward model).
+
+To find a solution for an inconsistent system of equations, we prefer the one for which the observations are as ‘close as possible’ to the model. This is a ‘minimum distance’ objective. In mathematical terms, we look at the vector of residuals $\hat{\epsilon}$: the difference between the observations $\mathrm{y}$ and the model realization $\mathrm{\hat{y}}$:
+
+$$
+\mathrm{\hat{\epsilon}=y-\hat{y}=y-A\hat{x}}
+$$
+
+We want this vector of residuals to be as small as possible (i.e., the minimum distance objective). The ‘least-squares’ principle achieves this by minimizing (‘least’) the sum of the squared residuals (‘squares’). 
+
+If we take the square-root of this sum, we find the length of the vector, also known as the ‘norm’ of the vector. Thus, the various possible error vectors have a norm defined as:
+
+$$
+\left \| \epsilon \right \| = \sqrt{\epsilon_1^2+\epsilon_2^2+...+\epsilon_m^2}=\sqrt{\epsilon^T\epsilon}
+$$
+
+Finding the minimum of (a) the sum of the squared differences, or (b) the square-root of the sum of the squared differences, yields the same result, since the square root is a monotonically increasing function of its non-negative argument.
+
+If we find this minimum, the corresponding solution $\mathrm{\hat{x}}$ is then the least-squares solution of the system of observation equations. In mathematical terms this is written as
+
+$$
+\mathrm{\hat{x}} = \arg \underset{\mathrm{x}}{\min} \left \| \epsilon \right \|^2= \arg \underset{\mathrm{x}}{\min} {(\epsilon^T\epsilon)}
+$$
+
+From the system of observation equations $\mathrm{y=Ax+\epsilon}$, it follows directly that $\epsilon=\mathrm{y-Ax}$ and therefore the least-squares solution follows as:
+
+$$
+\mathrm{\hat{x}} =\arg \underset{\mathrm{x}}{\min} {\mathrm{(y-Ax)^T (y-Ax)}}.
+$$
+
+ In other words, we find $\mathrm{\hat{x}}$ by finding the minimum of $\mathrm{(y-Ax)^T (y-Ax)}$.
+
+## Least-squares solution
+We can find the minimum of a function by taking the first derivative with respect to the 'free variable'. Since the observation vector is not free, and the design matrix $\mathrm{A}$ is also fixed, the only variable we can vary is $\mathrm{x}$. The first derivative of our objective function should be equal to zero to reach a minimum (see [Gradient](PM_gradient)):
+
+$$
+\begin{align*} 
+\partial_x \mathrm{(y-Ax)^T (y-Ax)} &=0\\  
+\partial_x \mathrm{( y^Ty -(Ax)^T y -y^T Ax + (Ax)^T(Ax) )}&=0\\ 
+\partial_x \mathrm{( y^Ty -x^TA^T y -y^T Ax + x^T A^TAx )}&=0\\ 
+\partial_x \mathrm{(y^Ty -2x^TA^T y + x^T A^TAx) }&=0\\ 
+\mathrm{-2A^T y +  2A^TAx} &=0\\ 
+\mathrm{A^TAx} &=\mathrm{A^T y }
+\end{align*}
+$$
+
+This last equation is known as the normal equation, and the matrix $\mathrm{N=A^T A}$ is known as the normal matrix.
+
+```{admonition} MUDE exam information
+:class: tip, dropdown
+You will not be asked to derive the least-squares solution as above.
+```
+
+If it is possible to compute the inverse of the normal matrix, the normal equation can be written to express the estimate of $\mathrm{x}$, denoted by $\mathrm{\hat{x}}$, as a linear function of the observations: 
+
+$$
+\mathrm{\hat{x}= (A^T A)^{-1} A^T y}
+$$
+
+## Overview
+In summary, the least-squares estimates of $\mathrm{x}$, $\mathrm{y}$ and $\epsilon$ are given by:
+
+$$
+\mathrm{\hat{x}= (A^T A)^{-1} A^T y}
+$$
+
+$$
+\mathrm{\hat{y} = A \hat{x}}
+$$
+
+$$
+\mathrm{\hat{\epsilon} = y - A\hat{x} = y - \hat{y}}
+$$
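+
+As an illustration, the snippet below is a minimal sketch of how these expressions could be evaluated with NumPy for the linear trend model from the example above; the observation times `t` and observations `y` are made-up values for demonstration only.
+
+```python
+import numpy as np
+
+# made-up observation times and observations (for illustration only)
+t = np.array([0., 1., 2., 3., 4.])
+y = np.array([2.1, 2.9, 4.2, 5.1, 5.8])
+
+# design matrix of the linear trend model
+A = np.column_stack((np.ones_like(t), t))
+
+# least-squares solution: solve the normal equation (A^T A) xhat = A^T y
+xhat = np.linalg.solve(A.T @ A, A.T @ y)
+yhat = A @ xhat          # adjusted observations
+ehat = y - yhat          # residuals
+```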
+
+{numref}`LSsol` visualizes the least-squares solution based on a linear trend model.
+
+```{figure} ../figures/ObservationTheory/02_LeastSquares_sol.png
+---
+height: 350px
+name: LSsol
+---
+Least-squares fit to a set of 5 observations affected by random errors. Index $i$ refers to the $i$th observation.
+```
\ No newline at end of file
diff --git a/book/sandbox/SanPart/ObservationTheory/03_WeightedLSQ.md b/book/sandbox/SanPart/ObservationTheory/03_WeightedLSQ.md
new file mode 100644
index 0000000000000000000000000000000000000000..af66d98d213d75b95686c8686b04022673506031
--- /dev/null
+++ b/book/sandbox/SanPart/ObservationTheory/03_WeightedLSQ.md
@@ -0,0 +1,64 @@
+# Weighted least-squares estimation
+In ordinary least-squares estimation, we assume that all observations are equally important. In many cases this is not realistic, as observations may be obtained by different measurement systems, or under different circumstances. We want our methodology for least-squares estimation to be able to take this into account.
+
+We achieve this goal by introducing a weight matrix in the normal equation. In the unweighted least-squares approach, we arrive at the normal equation by pre-multiplying both sides of $\mathrm{y=Ax}$ with the transpose of the design matrix $\mathrm{A^T}$:
+
+$$ 
+\mathrm{y=Ax} \; \rightarrow \; \mathrm{A^T\;  y = A^T\; A x }
+$$
+
+In the weighted least-squares approach, we add weight matrix $W$ to this pre-multiplication factor, i.e., $ \mathrm{A^T W}$, to obtain the normal equation:
+
+$$ 
+\mathrm{y=Ax} \; \rightarrow \; \mathrm{A^T W \; y = A^TW \; A x}
+$$
+
+The normal matrix is now defined as $\mathrm{N=A^T W A}$. From this, assuming that the normal matrix $\mathrm{N}$ is invertible (non-singular), we find the weighted least-squares estimate $\mathrm{\hat{x}}$:
+
+$$ 
+\mathrm{\hat{x} = (A^T W A)^{-1} A^T W y}
+$$
+
+We also find the derived estimates $\mathrm{\hat{y}}$ and $\mathrm{\hat{\epsilon}}$:
+
+$$
+\mathrm{\hat{y} = A \hat{x}  = A (A^T W A )^{-1} A^T W y}
+$$
+
+$$
+ \mathrm{\hat{\epsilon} = y - \hat{y}= y - A \hat{x} = y-A (A^T W A )^{-1} A^T W y = (I- A(A^T W A )^{-1} A^T W) y}
+$$
+
+## Video
+```{eval-rst}
+.. raw:: html
+
+    <iframe width="560" height="315" src="https://www.youtube.com/embed/iJmkkz37EuU" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
+```
+
+## Discussion on the weight matrix
+The weight matrix $\mathrm{W}$ expresses the (relative) weights between the observations. It is always a square matrix of size $m\times m$, where $m$ is the number of observations.
+
+If it is a unit matrix ($\mathrm{W=I}$), this implies that all observations have equal weight. Note that in this case the equations above reduce to the ordinary least-squares solution. 
+
+If it is a diagonal matrix, with different values on the diagonal, this implies that entries with a higher value correspond to the observations which are considered to be of more importance. If the weight matrix has non-zero elements on the off-diagonal positions, this implies that (some) observations are correlated.
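+
+As a minimal sketch (with made-up numbers), a diagonal weight matrix could be applied as follows; here the last two observations are simply given a lower weight than the others.
+
+```python
+import numpy as np
+
+# made-up linear trend data (for illustration only)
+t = np.array([0., 1., 2., 3., 4.])
+y = np.array([2.1, 2.9, 4.2, 5.1, 5.8])
+A = np.column_stack((np.ones_like(t), t))
+
+# diagonal weight matrix: the last two observations get a lower weight
+W = np.diag([1., 1., 1., 0.5, 0.5])
+
+# weighted least-squares solution
+xhat = np.linalg.solve(A.T @ W @ A, A.T @ W @ y)
+yhat = A @ xhat
+ehat = y - yhat
+```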
+
+## Weighted least-squares estimator: properties
+
+Until now, we have looked at the weighted least-squares solution of a single *realization* of the observations, where generally we assume that it is a realization of the *random* observable vector $Y$, since measurements are affected by random errors. It follows that the weighted least-squares *estimator* is given by:
+
+$$
+\hat{X}  = \mathrm{(A^T W A )^{-1} A^T W} Y
+$$
+
+This estimator has two important properties: it is *linear* and *unbiased*.
+
+The linearity property is due to the fact that $\hat{X}$ is a linear function of the observables $Y$.
+
+The unbiased property means that the expectation of $\hat{X}$ is equal to the true (but unknown) $\mathrm{x}$. This can be shown as follows:
+
+$$
+\mathbb{E}(\hat{X})  = \mathrm{(A^T W A )^{-1} A^T W} \mathbb{E}(Y) = \mathrm{(A^T W A )^{-1} A^T W Ax = x}
+$$
+
+This is a very desirable property. It implies that if we were to repeat the measurements many times to obtain new estimates, the *average of the estimated values* would be equal to the true values.
\ No newline at end of file
diff --git a/book/sandbox/SanPart/ObservationTheory/04_BLUE.md b/book/sandbox/SanPart/ObservationTheory/04_BLUE.md
new file mode 100644
index 0000000000000000000000000000000000000000..bb1741a671f237c28fdaba4b7fd50e0468e7deac
--- /dev/null
+++ b/book/sandbox/SanPart/ObservationTheory/04_BLUE.md
@@ -0,0 +1,86 @@
+# Best linear unbiased estimation
+Although Best Linear Unbiased estimation (BLUE) is based on a fundamentally different principle than (weighted) least-squares (WLS), it can be regarded as a special case of WLS.
+
+Comparing the weighted least-squares estimator:
+
+$$ 
+\hat{X}_{\text{WLS}}= \mathrm{(\mathrm{A^T} W \mathrm{A})^{-1} \mathrm{A^T} W} Y
+$$
+
+with the estimator of the vector of unknowns following from BLUE:
+
+$$
+\hat{X}_{\text{BLU}}= (\mathrm{A^T} \Sigma_Y^{-1} \mathrm{A})^{-1} \mathrm{A^T} \Sigma_Y^{-1} Y
+$$
+
+We see that the weight matrix $\mathrm{W}$ is now replaced by the inverse of the covariance matrix $\Sigma_Y$.
+
+This makes sense intuitively: suppose the observables are all independent and thus uncorrelated, such that the covariance matrix is a diagonal matrix with the variances as its diagonal elements. The weights for the individual observables are then equal to the inverse of their variances, which implies that a more precise observable (smaller variance) receives a larger weight.
+
+Taking this particular weight matrix, i.e., $\mathrm{W}=\Sigma_Y^{-1}$, has a special meaning. It has the “Best” property. This means that using this particular weight matrix, we obtain a linear estimator which has minimal variance. In other words, with this particular weight matrix we get the best possible estimator among all linear estimators, where ‘best’ represents optimal precision or minimal variance.
+
+Given the BLU-estimator $\hat{X}$, we can also find the BLU-estimators for $\hat{Y} =\mathrm{A}\hat{X}$ and for $\hat{\epsilon} =  Y-\hat{Y}$:
+
+$$
+\hat{Y}= \mathrm{A}(\mathrm{A^T} \Sigma_Y^{-1} \mathrm{A})^{-1} \mathrm{A^T} \Sigma_Y^{-1} Y
+$$
+
+$$
+\hat{\epsilon}= Y-\mathrm{A}(\mathrm{A^T} \Sigma_Y^{-1} \mathrm{A})^{-1} \mathrm{A^T} \Sigma_Y^{-1} Y
+$$
+
+## Video
+
+```{eval-rst}
+.. raw:: html
+
+    <iframe width="560" height="315" src="https://www.youtube.com/embed/EIZvJrlxZhs" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
+```
+
+## BLUE decomposed
+In BLUE, Best Linear Unbiased Estimation, the parts of the acronym ‘B’, ‘L’, and ‘U’ refer to specific properties.
+
+*Linear* means that there is a linear (matrix) relation between the variables. Such linear relations imply that if we have a normally distributed vector $Y\sim N(\mathrm{Ax},\Sigma_Y)$, which is multiplied with a matrix $\mathrm{L}$, this product will be normally distributed as well:
+
+$$
+\hat{X}=\mathrm{L^T}Y\sim N(\mathrm{L^TAx},\mathrm{L^T} \Sigma_Y \mathrm{L})
+$$ 
+
+where we use the linear propagation laws of the mean and covariance.
+
+*Unbiased* means that the expected value of an estimator of a parameter is equal to the value of that parameter. In other words: 
+
+$$
+\mathbb{E}(\hat{X})= \mathrm{x}.
+$$
+
+The unbiased property also implies that:
+
+$$
+\begin{align*}\mathbb{E}(\hat{Y} ) &=\mathrm{A} \mathbb{E}(\hat{X})=\mathrm{Ax} \\ \mathbb{E}(\hat{\epsilon} )  &= \mathbb{E}(Y)-\mathbb{E}(\hat{Y})= \mathrm{Ax-Ax}= 0 \end{align*}
+$$
+
+*Best* means that the estimator has minimum variance (best precision), when compared to all other possible linear estimators:
+
+$$
+\text{trace}(\Sigma_{\hat{X}}) = \sum_{i=1}^n \sigma^2_{\hat{X}_i} = \text{minimum}
+$$
+
+
+(Note that this is equivalent to "minimum mean squared errors" $\mathbb{E}(\|\hat{X}-\mathrm{x}\|^2)$. )
+
+(04_cov)=
+## Covariance matrices of the BLU estimators
+The quality of an estimator is expressed by its covariance matrix. For the ‘best linear unbiased’ estimators $\hat{X}$, $\hat{Y}$ and $\hat{\epsilon}$ we find (by applying the [linear covariance propagation laws](99_proplaw)):
+
+$$
+\begin{align*}
+\Sigma_{\hat{X}} &= \mathrm{L^T} \Sigma_Y \mathrm{L} \\ &= (\mathrm{A^T} \Sigma_Y^{-1} \mathrm{A})^{-1}\mathrm{A^T} \Sigma_Y^{-1} \cdot \Sigma_Y \cdot \Sigma_Y^{-1} \mathrm{A} (\mathrm{A^T} \Sigma_Y^{-1} \mathrm{A})^{-1}\\ &=(\mathrm{A^T} \Sigma_Y^{-1} \mathrm{A})^{-1}\mathrm{A^T} \Sigma_Y^{-1} \mathrm{A} (\mathrm{A^T} \Sigma_Y^{-1} \mathrm{A})^{-1}\\ &= (\mathrm{A^T} \Sigma_Y^{-1} \mathrm{A})^{-1}\\ \\ \Sigma_{\hat{Y}} &=\mathrm{A}\Sigma_{\hat{X}} \mathrm{A^T} \\ &=\mathrm{A} (\mathrm{A^T} \Sigma_Y^{-1} \mathrm{A})^{-1} \mathrm{A^T} \\ \\
+\Sigma_{\hat{\epsilon}} &= (\mathrm{I}_m - A (\mathrm{A^T} \Sigma_Y^{-1} \mathrm{A})^{-1}\mathrm{A^T} \Sigma_Y^{-1}) \cdot \Sigma_Y \cdot (\mathrm{I}_m - A (\mathrm{A^T} \Sigma_Y^{-1} \mathrm{A})^{-1}\mathrm{A^T} \Sigma_Y^{-1})^T \\ &= (\Sigma_Y - \mathrm{A}(\mathrm{A^T} \Sigma_Y^{-1} \mathrm{A})^{-1}\mathrm{A^T}) (\mathrm{I}_m -\Sigma_Y^{-1}\mathrm{A} (\mathrm{A^T} \Sigma_Y^{-1} \mathrm{A})^{-1}\mathrm{A^T})\\ &= \Sigma_Y - \Sigma_{\hat{Y}}-\Sigma_{\hat{Y}}+\Sigma_{\hat{Y}}\\ &= \Sigma_Y - \Sigma_{\hat{Y}}\end{align*}
+$$
+
+In summary, by applying the BLUE method, we can compute best estimators among all linear unbiased estimators, where ‘best’ is quantitatively expressed via the covariance matrix.
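+
+The sketch below illustrates how the BLU estimate and the covariance matrices above could be computed with NumPy, assuming a diagonal $\Sigma_Y$ and the linear trend model; the numbers are again invented for illustration only.
+
+```python
+import numpy as np
+
+# made-up linear trend data and an assumed stochastic model (illustration only)
+t = np.array([0., 1., 2., 3., 4.])
+y = np.array([2.1, 2.9, 4.2, 5.1, 5.8])
+A = np.column_stack((np.ones_like(t), t))
+Sigma_Y = 0.2**2 * np.eye(len(y))        # uncorrelated observations, std. dev. 0.2
+inv_Sigma_Y = np.linalg.inv(Sigma_Y)
+
+# BLU estimate and covariance matrices
+Sigma_Xhat = np.linalg.inv(A.T @ inv_Sigma_Y @ A)
+xhat = Sigma_Xhat @ A.T @ inv_Sigma_Y @ y
+Sigma_Yhat = A @ Sigma_Xhat @ A.T
+Sigma_ehat = Sigma_Y - Sigma_Yhat
+```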
+
+
+## Additional note on the linearity condition of BLUE 
+The BLUE estimator is the best (or minimum variance) among all linear unbiased estimators. So if we drop the condition of linearity, then BLUE is not necessarily the best.  It means that there may be some other non-linear estimators that have even better precision than the BLUE. However, it can be proven that, in the case of normally distributed observations, the BLUE is also the best among all possible estimators. So we can say: for observations $Y$ that are normally distributed, the BLUE is BUE (best unbiased estimator). 
\ No newline at end of file
diff --git a/book/sandbox/SanPart/ObservationTheory/05_Precision.md b/book/sandbox/SanPart/ObservationTheory/05_Precision.md
new file mode 100644
index 0000000000000000000000000000000000000000..4d62003c0d539eef710a6481a35313cf87bc71b8
--- /dev/null
+++ b/book/sandbox/SanPart/ObservationTheory/05_Precision.md
@@ -0,0 +1,129 @@
+# Precision and confidence intervals
+When evaluating and reporting estimation results, it is important to give a quality assessment. Thereby we should distinguish between:
+* the precision of the estimated parameters (topic of this part)
+* how well the model fits the data (topic of chapter [Model testing](08_testing))
+
+The precision of the estimated parameters is provided by the covariance matrix $\Sigma_{\hat{X}}$. But the question is: how to present and interpret the numbers in the matrix? For that purpose we will introduce the concept of confidence intervals, which basically means we will 'convert' the precision to probabilities.
+
+## Video
+```{eval-rst}
+.. raw:: html
+
+    <iframe width="560" height="315" src="https://www.youtube.com/embed/7XJP_kaeL7c" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
+```
+
+(05_normal)=
+## Review normal distribution and probabilities
+Let's start with a standard normally distributed random variable $Z\sim N(0,1)$, which has an expectation equal to 0, and variance (for us: precision) of 1. {numref}`standard` shows the corresponding PDF.
+
+```{figure} ../figures/ObservationTheory/05_standard.png
+---
+height: 250px
+name: standard
+---
+PDF of a standard normally distributed random variable.
+```
+
+We are now interested in the probability that a realization of $Z$ will be within a certain interval $\pm k$ from the expectation 0:
+
+$$
+P(-k<Z<k)  = P(|Z|<k) = 1-\alpha
+$$
+
+In this case, $k$ is the $(1-0.5\alpha)$-quantile, since it can be seen in {numref}`standard` that 
+
+$$
+P(Z<k) = 1-0.5\alpha
+$$ 
+
+If we choose a certain value for $\alpha$ we can thus find the value of $k$ using the inverse CDF, e.g., with $\alpha = 0.05$ we would obtain $k=1.96$.
+
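+For example, $k$ can be computed with the inverse CDF (the `ppf` method) of the standard normal distribution in `scipy.stats`:
+
+```python
+from scipy.stats import norm
+
+alpha = 0.05
+k = norm.ppf(1 - 0.5 * alpha)   # inverse CDF of the standard normal
+print(k)                        # approximately 1.96
+```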
+
+## Confidence interval of observations
+
+```{admonition} Definition
+A *confidence interval* of a random variable $\epsilon \sim N(0,\sigma_{\epsilon}^2)$ is defined as the range or interval $\epsilon \pm c$ such that:
+
+$$
+P(|\epsilon|<c) = 1-\alpha
+$$
+
+where $(1-\alpha)\times 100\%$ is the *confidence level*.
+```
+
+On purpose we considered $\epsilon \sim N(0,\sigma_{\epsilon}^2)$ in the above definition, since in observation theory we generally assume that the random errors in our observations are normally distributed with zero mean and variance $\sigma^2_Y=\sigma^2_{\epsilon}$.
+
+Similarly as for the standard normal distribution, the interval $\pm c$ can be computed. We will do this by first applying the transformation:
+
+$$
+Z = \frac{\epsilon - 0}{\sigma_{\epsilon}} \sim N(0,1)
+$$
+
+With this we have that:
+
+$$
+P(|\epsilon|<c) = P(|\frac{\epsilon}{\sigma_{\epsilon}}|<\frac{c}{\sigma_{\epsilon}})= P(|Z|< k)
+$$
+
+with $c = k\sigma_{\epsilon}$.
+
+This allows us to compute $k$ as shown in {ref}`05_normal` and express the confidence interval as:
+
+$$
+\epsilon \pm k\sigma_{\epsilon}
+$$
+
+For example, for a confidence level of 95\% we have $\alpha = 0.05$, for which we found $k=1.96$. Hence the corresponding confidence interval is $\epsilon \pm 1.96\sigma_{\epsilon}$.
+
+So far we considered the confidence level of a single random error. If we have multiple observations, the corresponding intervals can be evaluated. Even though the precision of each observation may be different, we only need to compute $k$ once to find the intervals $\epsilon_i \pm k\sigma_{\epsilon_i}$, $i=1,\ldots,m$.
+
+{numref}`CI` shows how the confidence intervals can be visualized as error bars for the corresponding observations $y_i$.
+
+```{figure} ../figures/ObservationTheory/05_CI.png
+---
+height: 350px
+name: CI
+---
+Confidence intervals of two observations $y_1$ and $y_2$ visualized with error bars.
+```
+
+```{note}
+The error bars in {numref}`CI` should not be interpreted as the interval such that there is $1-\alpha$ probability that the true value of $Y_i$ is in that interval. This is because we are only looking at a single realization of each $Y_i$, while the confidence interval is computed based on the probability that the *random* error deviates $\pm k\sigma_{\epsilon_i}$ from 0, and hence that $Y_i$ deviates by that amount from its expected value. If we were to repeat the two measurements, the values of $y_1$ and $y_2$ would be different and hence the error bars would be shifted. 
+```
+## Confidence intervals of estimated parameters
+The covariance matrix $\Sigma_{\hat{X}}$ of a linear estimator can be evaluated by applying the linear propagation law of covariances. Moreover, we know that if our observables $Y$ are normally distributed, applying linear estimation implies that the estimated parameters $\hat{X}$ are also normally distributed. Recall that the weighted least-squares and Best Linear Unbiased estimators are indeed linear estimators.
+
+The [covariance matrices of the BLUE estimator](04_cov) were obtained as:
+
+$$
+\begin{align*}
+\Sigma_{\hat{X}}&=(\mathrm{A^T} \Sigma_Y^{-1} \mathrm{A})^{-1}\\
+\Sigma_{\hat{Y}}&= \mathrm{A}\Sigma_{\hat{X}} \mathrm{A^T} \\
+\Sigma_{\hat{\epsilon}}&=\Sigma_Y-\Sigma_{\hat{Y}}
+\end{align*}
+$$
+
+Following the same logic as above, the confidence intervals for each estimated parameter can be evaluated as:
+
+$$
+\begin{align*}
+\hat{X}_i &\pm k\sigma_{\hat{X}_i} \\
+\hat{Y}_i &\pm k\sigma_{\hat{Y}_i} \\
+\hat{\epsilon}_i &\pm k\sigma_{\hat{\epsilon}_i} \\
+\end{align*}
+$$
+
+where $k$ is determined by the chosen confidence level.
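+
+Continuing the illustrative NumPy sketch from the BLUE chapter (made-up data and an assumed $\Sigma_Y$), the confidence intervals of the estimated parameters could be computed as follows.
+
+```python
+import numpy as np
+from scipy.stats import norm
+
+# made-up linear trend data and assumed stochastic model (illustration only)
+t = np.array([0., 1., 2., 3., 4.])
+y = np.array([2.1, 2.9, 4.2, 5.1, 5.8])
+A = np.column_stack((np.ones_like(t), t))
+Sigma_Y = 0.2**2 * np.eye(len(y))
+
+# BLU estimate and its covariance matrix
+Sigma_Xhat = np.linalg.inv(A.T @ np.linalg.inv(Sigma_Y) @ A)
+xhat = Sigma_Xhat @ A.T @ np.linalg.inv(Sigma_Y) @ y
+
+# confidence intervals xhat_i +/- k * sigma_xhat_i for a 95% confidence level
+k = norm.ppf(1 - 0.5 * 0.05)
+sigma_xhat = np.sqrt(np.diag(Sigma_Xhat))
+ci = np.column_stack((xhat - k * sigma_xhat, xhat + k * sigma_xhat))
+```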
+
+{numref}`CI_model` shows a visualization of the confidence interval of the *predicted* observations, i.e., the fitted model.
+
+```{figure} ../figures/ObservationTheory/05_CI_model.png
+---
+height: 350px
+name: CI_model
+---
+Confidence interval of fitted model, including error bars for the individual adjusted observations $\hat{y}_i$.
+```
+
+Note that covariance matrices can be computed without the need for actual observations. Hence, we can also set up a functional model including future epochs. Let's call the corresponding design matrix $\mathrm{A_p}$, then $\Sigma_{\hat{Y}_p}= \mathrm{A_p}\Sigma_{\hat{X}} \mathrm{A_p^T}$. Note that $\Sigma_{\hat{X}}$ is based on the original model, since the fitted model is based only on the actual observations. In this way, we can visualize the uncertainty of extrapolated values based on the fitted model.
\ No newline at end of file
diff --git a/book/sandbox/SanPart/ObservationTheory/06_MLE.md b/book/sandbox/SanPart/ObservationTheory/06_MLE.md
new file mode 100644
index 0000000000000000000000000000000000000000..242b2f01b3a3adfd63d49d2e51ce758ccd8b77b8
--- /dev/null
+++ b/book/sandbox/SanPart/ObservationTheory/06_MLE.md
@@ -0,0 +1,83 @@
+# Maximum Likelihood Estimation
+Maximum likelihood estimation is yet another estimation principle. In contrast to (weighted) least-squares and Best Linear Unbiased estimation, it relies on the probability distribution of the observables. Recall that for BLUE, we need to know the first two moments, given by the functional and stochastic model, respectively:
+
+$$
+\mathbb{E}(Y) = \mathrm{A} \mathrm{x}\;, \mathbb{D}(Y) = \Sigma_{Y}=\Sigma_{\epsilon}
+$$
+
+in order to estimate the unknown parameters $\mathrm{x}$. However, the solution $\hat{X}$ does not depend on the underlying distribution of $Y$.
+
+## Maximum likelihood principle
+The principle of Maximum Likelihood estimation (MLE) is to find the *most likely* $\mathrm{x}$ given a realization $\mathrm{y}$ of $Y$.
+
+This boils down to estimating the unknown parameters $\alpha$ of the underlying distribution, which means that the probability density function (PDF) is known apart from the $n$ parameters in $\alpha$. We will now distinguish between a PDF and likelihood function.
+
+A *probability density function* $f_Y(\mathrm{y}|\alpha)$ with $\alpha$ known can be evaluated for any value of $\mathrm{y}$.
+
+A *likelihood function* $f_Y(\mathrm{y}|\alpha)$ for a given realization $\mathrm{y}$ can be evaluated for all possible values of $\alpha$.
+
+The goal is to find the $\alpha$ which maximizes the likelihood function for the given realization $\mathrm{y}$.
+
+### Example exponential distribution:
+
+If $Y\sim \text{Exp}(\lambda)$, the goal would be to estimate $\lambda$ based on a realization $\mathrm{y_{obs}}$ of $Y$. The FIG shows the PDF of $Y$
+
+$$
+f_Y(\mathrm{y}|\lambda)=\lambda \exp(-\lambda \mathrm{y})
+$$
+
+for different possible values of $\lambda$. The likelihood function for the given $\mathrm{y_{obs}}$ is shown on the right-hand side. It is shown that for instance the likelihood value for $\lambda_1$ is equal to the corresponding density value in the left panel. The maximum likelihood estimate $\hat{\lambda}$ is the value for which the likelihood function is maximized as shown in the figure.
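+
+As a small numerical sketch (with a made-up observation), the likelihood function of the exponential example can be evaluated on a grid of candidate $\lambda$ values; the grid search recovers the analytical maximizer $1/\mathrm{y_{obs}}$.
+
+```python
+import numpy as np
+
+y_obs = 2.5                        # made-up single realization of Y ~ Exp(lambda)
+
+# evaluate the likelihood f(y_obs | lambda) for a grid of candidate lambda values
+lambdas = np.linspace(0.01, 2.0, 1000)
+likelihood = lambdas * np.exp(-lambdas * y_obs)
+
+lam_hat = lambdas[np.argmax(likelihood)]
+print(lam_hat, 1 / y_obs)          # the maximizer is (approximately) 1 / y_obs
+```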
+
+## Maximum Likelihood estimator of $\mathrm{x}$
+We have that our observables are assumed to be normally distributed: $Y\sim N(\mathrm{Ax},\Sigma_Y)$, where $\mathrm{x}$ is unknown. The covariance matrix $\Sigma_Y$ is assumed to be known, for instance from a calibration campaign.
+
+```{note}
+It is also possible to consider the case that $\Sigma_Y$ is not (completely) known, for instance in case of a new sensor. Maximum Likelihood estimation can then also be applied to estimate this covariance matrix. However, this is beyond the scope of MUDE.
+```
+The likelihood function of the multivariate normal distribution is given by:
+
+$$
+f_{Y}(y|\mathrm{Ax},\Sigma_Y)=\underset{c}{\underbrace{(\det{2\pi \Sigma_Y})^{-0.5}}}\exp(-\frac{1}{2}\mathrm{(y-Ax)^T} \Sigma_Y^{-1}(\mathrm{y-Ax})) 
+$$
+
+Maximizing this likelihood function for $\mathrm{x}$ means that we have to find the $\mathrm{x}$ such that:
+* the first-order partial derivatives ([gradient](PM_gradient)) are zero: $\partial_{\mathrm{x} }f_{Y}(y|\mathrm{Ax},\Sigma_Y)=0$
+* the matrix of second-order partial derivatives is negative definite.
+
+Instead of working with the likelihood function, we prefer to work with the *loglikelihood* function:
+
+$$
+\ln f_{Y}(y|\mathrm{Ax},\Sigma_Y)=\ln c -\frac{1}{2}\mathrm{(y-Ax)^T} \Sigma_Y^{-1}(\mathrm{y-Ax}) 
+$$
+
+since that is easier and results in the same maximum. Setting the gradient to zero gives:
+
+$$
+\begin{align*}
+\frac{\partial \ln f_{Y}(y|\mathrm{Ax},\Sigma_Y)}{\partial \mathrm{x}}&= \frac{\partial (-\frac{1}{2}\mathrm{(y-Ax)^T} \Sigma_Y^{-1}(\mathrm{y-Ax}))}{\partial \mathrm{x}}\\
+&= -\frac{1}{2}(-2\mathrm{A^T} \Sigma_Y^{-1} \mathrm{y}+2\mathrm{A^T} \Sigma_Y^{-1} \mathrm{A}\mathrm{x})=0\\
+\mathrm{A^T} \Sigma_Y^{-1} \mathrm{A}\mathrm{x} &= \mathrm{A^T} \Sigma_Y^{-1} \mathrm{y}
+\end{align*}
+$$
+
+The maximum likelihood estimate for a given realization $\mathrm{y}$ follows thus as:
+
+$$
+\hat{\mathrm{x}}= (\mathrm{A^T} \Sigma_Y^{-1} \mathrm{A})^{-1} \mathrm{A^T} \Sigma_Y^{-1} \mathrm{y}
+$$
+
+It can be verified that the matrix of second-order partial derivatives is $-\mathrm{A^T} \Sigma_Y^{-1} \mathrm{A}$ and hence indeed negative definite (since $\mathrm{A^T} \Sigma_Y^{-1} \mathrm{A}$ is positive definite).
+
+```{admonition} MUDE exam information
+:class: tip, dropdown
+You will not be asked to derive the MLE solution as above.
+```
+
+If we now replace the realization $\mathrm{y}$ by the random observable vector $Y$ we obtain the *Maximum Likelihood estimator* of $\mathrm{x}$:
+
+$$
+\hat{X}= (\mathrm{A^T} \Sigma_Y^{-1} \mathrm{A})^{-1} \mathrm{A^T} \Sigma_Y^{-1} Y
+$$
+
+It follows that the Maximum Likelihood estimator of $\mathrm{x}$ is identical to the BLUE **if** $Y$ is normally distributed.
+
diff --git a/book/sandbox/SanPart/ObservationTheory/07_NLSQ.md b/book/sandbox/SanPart/ObservationTheory/07_NLSQ.md
new file mode 100644
index 0000000000000000000000000000000000000000..9e5b19a4d612615eecf3581d6766025ceca2ae9b
--- /dev/null
+++ b/book/sandbox/SanPart/ObservationTheory/07_NLSQ.md
@@ -0,0 +1,86 @@
+# Non-linear least-squares estimation 
+
+Let the non-linear system of observation equations be given by:
+
+$$
+\mathbb{E}\left( \begin{bmatrix} Y_1\\ Y_2\\ \vdots\\ Y_m \end{bmatrix} \right) = A(x) = \begin{bmatrix} a_{1}(x)\\ a_{2}(x)\\ \vdots\\  a_{m}(x) \end{bmatrix}
+$$
+
+This system cannot be directly solved using the weighted least-squares or the best linear unbiased estimators. Here, we will introduce the non-linear least-squares principle, based on a linearization of the system of equations. 
+
+## Video
+```{eval-rst}
+.. raw:: html
+
+    <iframe width="560" height="315" src="https://www.youtube.com/embed/qkPzMlnkg6o" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
+```
+
+## Linearization
+For the linearization we need the first-order Taylor polynomial, which gives the linear approximation of, for instance, the first observation equation $y_1 = a_1(x)$ at $x_{[0]}$ as:
+
+$$
+y_1\approx a_1( x_{[0]})+ \partial_{x_1} a_1(x_{[0]})\Delta x_{1,[0]}+ \partial_{x_2} a_1(x_{[0]})\Delta x_{2,[0]}+ \ldots + \partial_{x_n} a_1(x_{[0]})\Delta x_{n,[0]}
+$$
+
+where $\Delta x_{[0]} = x- x_{[0]}$, with $\Delta x_{j,[0]}$ its $j$th entry, and $x_{[0]}$ is the initial guess of $x$.
+
+The difference between the actual observation $y_1$ and the solution of the forward model at $x_{[0]}$ is thus given by:
+
+$$
+\begin{align} \Delta y_{1[0]} &= y_1- a_1(x_{[0]}) \\ &\approx \left[ \partial_{x_1} a_1(x_{[0]}) \quad \partial_{x_2} a_1(x_{[0]}) \quad \ldots \quad \partial_{x_n} a_1(x_{[0]})\right]\Delta x_{[0]} \end{align}
+$$
+
+We can now obtain the linearized functional model:
+
+$$
+\begin{align} \mathbb{E}\left(\begin{bmatrix} \Delta Y_1 \\ \Delta Y_2 \\ \vdots \\ \Delta Y_m \end{bmatrix}_{[0]}\right) &=\begin{bmatrix} \partial_{x_1} a_1(x_{[0]}) &\partial_{x_2} a_1(x_{[0]}) & \ldots & \partial_{x_n} a_1(x_{[0]}) \\ \partial_{x_1} a_2(x_{[0]}) &\partial_{x_2} a_2(x_{[0]}) & \ldots & \partial_{x_n} a_2(x_{[0]}) \\ \vdots & \vdots  & \ddots  & \vdots\\ \partial_{x_1} a_m(x_{[0]}) &\partial_{x_2} a_m(x_{[0]}) & \ldots& \partial_{x_n} a_m(x_{[0]}) \end{bmatrix}\Delta x_{[0]} \\ &= J_{[0]}\Delta x_{[0]}\end{align}
+$$
+
+with [Jacobian matrix](PM_jacobian) $J_{[0]}$.
+
+And this allows to apply best linear unbiased estimation (BLUE) to get an estimate for $\Delta x_{[0]}$:
+
+$$
+\Delta \hat{x}_{[0]}=\left(J_{[0]}^T \Sigma_Y^{-1} J_{[0]} \right)^{-1}J_{[0]}^T \Sigma_Y^{-1}\Delta y_{[0]}
+$$
+
+since the Jacobian $ J_{[0]}$ takes the role of the design matrix $A$ and the observation vector is replaced by its delta-version as compared to the BLUE of the linear model $\mathbb{E}(Y) = Ax$.
+
+Since we have $\Delta x_{[0]} = x- x_{[0]}$, an “estimate” of $x$ can thus be obtained as:
+
+$$
+x_{[1]}=x_{[0]}+\Delta \hat{x}_{[0]}
+$$
+
+However, the quality of the linear approximation depends very much on the closeness of $x_{[0]}$ to $x$. Therefore, instead of using $x_{[1]}$ as the estimate, we repeat the process with $x_{[1]}$ as our new guess. This procedure is continued until the $\Delta \hat{x}_{[i]}$ becomes small enough. This is called the Gauss-Newton iteration procedure, and is shown in the scheme below.
+
+## Gauss-Newton iteration
+
+In this scheme you can see that as stop criterion
+
+$$
+\Delta \hat{x}_{[i]}^T W \Delta \hat{x}_{[i]} < \delta
+$$
+
+is used. The weight matrix $W$ can, for instance, be set equal to the identity matrix, but a more logical choice may be the normal matrix $N_{[i]}=J_{[i]}^T \Sigma_Y^{-1} J_{[i]}$. The threshold $\delta$ must be set to a very small value. 
+
+Once the stop criterion is met, we say that the solution converged, and the last solution $ x_{[i+1]}$ is then finally used as our estimate of $x$. 
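+
+The sketch below applies the Gauss-Newton procedure to a made-up exponential growth model $y = x_1 \exp(x_2 t)$ with an assumed $\Sigma_Y$; the data, model and initial guess are invented for demonstration only.
+
+```python
+import numpy as np
+
+# made-up data for the model y = x1 * exp(x2 * t) (illustration only)
+t = np.array([0., 1., 2., 3., 4.])
+y = np.array([1.1, 1.5, 2.3, 3.2, 4.9])
+Sigma_Y = 0.1**2 * np.eye(len(y))
+inv_Sigma_Y = np.linalg.inv(Sigma_Y)
+
+def forward(x):
+    return x[0] * np.exp(x[1] * t)
+
+def jacobian(x):
+    return np.column_stack((np.exp(x[1] * t), x[0] * t * np.exp(x[1] * t)))
+
+x_i = np.array([1.0, 0.1])           # initial guess
+delta = 1e-12
+for _ in range(50):
+    J = jacobian(x_i)
+    dy = y - forward(x_i)
+    N = J.T @ inv_Sigma_Y @ J
+    dx = np.linalg.solve(N, J.T @ inv_Sigma_Y @ dy)
+    x_i = x_i + dx
+    if dx @ N @ dx < delta:          # stop criterion with W = N
+        break
+
+Sigma_xhat = np.linalg.inv(N)        # approximate covariance matrix of the estimate
+```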
+
+## Remarks and properties
+There is no default recipe for making the initial guess $x_{[0]}$; it must be made based on insight into the problem at hand. A good initial guess is important for two reasons:
+
+* a bad initial guess may result in the solution not converging;
+* a good initial guess will speed up the convergence, i.e., require fewer iterations.
+
+In the iteration scheme the covariance matrix of $\hat{x}$ was given by:
+
+$$
+\Sigma_{\hat{X}}=\left(J_{[i]}^T \Sigma_Y^{-1} J_{[i]} \right)^{-1}
+$$
+
+In fact, this is not strictly true: the estimator $\hat{X}$ is not a best linear unbiased estimator of $x$, since we used a linear approximation. And since the estimator is not a linear estimator, it is not normally distributed. 
+
+However, in practice the linear (first-order Taylor) approximation often works so well that the performance is very close to that of BLUE and we may assume the normal distribution.
\ No newline at end of file
diff --git a/book/sandbox/SanPart/ObservationTheory/08_Testing.md b/book/sandbox/SanPart/ObservationTheory/08_Testing.md
new file mode 100644
index 0000000000000000000000000000000000000000..effca3093683902d31b261d5a900fa999c0cb545
--- /dev/null
+++ b/book/sandbox/SanPart/ObservationTheory/08_Testing.md
@@ -0,0 +1,5 @@
+(08_testing)=
+# Model testing
+How well does the model fit the data?
+
+First idea based on plotting observations, fitted model and confidence intervals...
\ No newline at end of file
diff --git a/book/sandbox/SanPart/ObservationTheory/99_NotationFormulas.md b/book/sandbox/SanPart/ObservationTheory/99_NotationFormulas.md
new file mode 100644
index 0000000000000000000000000000000000000000..687142395f3e59327b0b8b735df48be2140a2198
--- /dev/null
+++ b/book/sandbox/SanPart/ObservationTheory/99_NotationFormulas.md
@@ -0,0 +1,68 @@
+# Notation and formulas
+
+```{list-table}
+:header-rows: 1
+
+* - Description
+  - Notation / formula
+* - Observables (random)
+  - $Y=\begin{bmatrix} Y_1,Y_2,\ldots, Y_m \end{bmatrix}^T$
+* - Observations (realization of $Y$)
+  - $\mathrm{y}=\begin{bmatrix} y_1,y_2,\ldots, y_m \end{bmatrix}^T$
+* - Unknown parameters (deterministic)
+  - $\mathrm{x}=\begin{bmatrix} x_1,x_2,\ldots, x_n \end{bmatrix}^T$
+* - Random errors
+  - $\epsilon = \begin{bmatrix} \epsilon_1,\epsilon_2,\ldots, \epsilon_m \end{bmatrix}^T$ with $\epsilon\sim N(0,\Sigma_{\epsilon})$
+* - Functional model (linear)
+  - $\mathbb{E}(Y) = \mathrm{A} \mathrm{x}\;$  or  $\;Y = \mathrm{A} \mathrm{x} + \epsilon$
+* - Stochastic model
+  - $\mathbb{D}(Y) = \Sigma_{Y}=\Sigma_{\epsilon}$
+* - Estimator of $\mathrm{x}$ 
+  - $\hat{X}$ 
+* - Estimate of $\mathrm{x}$ (realization of $\hat{X}$) 
+  - $\hat{\mathrm{x}}$
+* - Adjusted (or predicted) observations 
+  - $\hat{\mathrm{y}}=\mathrm{A}\cdot\hat{\mathrm{x}}$
+* - Residuals 
+  - $\hat{\epsilon}=\mathrm{y}-\mathrm{A}\hat{\mathrm{x}}$
+```
+
+## Estimators of $\mathrm{x}$
+```{list-table}
+:header-rows: 1
+
+* - Estimators
+  - Formula
+* - Least-squares (LS) 
+  - $\hat{X}=\left(\mathrm{A}^T \mathrm{A} \right)^{-1} \mathrm{A}^T Y $
+* - Weighted least-squares (WLS)
+  - $\hat{X}=\left(\mathrm{A}^T\mathrm{WA} \right)^{-1} \mathrm{A}^T\mathrm{W} Y $
+* - Best linear unbiased (BLU) 
+  - $\hat{X}=\left(\mathrm{A}^T\Sigma_Y^{-1} \mathrm{A} \right)^{-1} \mathrm{A}^T\Sigma_Y^{-1} Y $
+```
+
+## Test statistics (with distribution under $\mathcal{H}_0$)
+```{list-table}
+:header-rows: 1
+
+* - Test statistic
+  - Formula
+* - Generalized likelihood ratio test 
+  - $T_q = \hat{\epsilon}^T\Sigma_Y^{-1}\hat{\epsilon}-\hat{\epsilon}_a^T\Sigma_Y^{-1}\hat{\epsilon}_a\sim \chi^2(q,0)$
+* - Overall model test 
+  - $T_{q=m-n} = \hat{\epsilon}^T\Sigma_Y^{-1}\hat{\epsilon}\sim \chi^2(q,0)$
+* - w-test 
+  - $W = \frac{\mathrm{C}^T\Sigma_Y^{-1} \hat{\epsilon}}{\sqrt{\mathrm{C}^T\Sigma_Y^{-1}\Sigma_{\hat{\epsilon}}\Sigma_Y^{-1} \mathrm{C}}} \sim N(0,1)$
+```
+
+(99_proplaw)= 
+## Linear propagation laws if $\hat{X}=\mathrm{L^T} Y $
+```{list-table}
+:header-rows: 1
+
+* - Propagation law of the ...
+  - Formula
+* - ... mean 
+  -  $\mathbb{E}(\hat{X}) = \mathrm{L^T}\mathbb{E}(Y)$ 
+* - ... covariance matrix 
+  - $\Sigma_{\hat{X}} =\mathrm{L^T}\Sigma_Y \mathrm{L} $ 
diff --git a/book/sandbox/SanPart/figures/.DS_Store b/book/sandbox/SanPart/figures/.DS_Store
new file mode 100644
index 0000000000000000000000000000000000000000..3ea7b9852cfefd600d83f4cd798b3d84e84d9ac6
Binary files /dev/null and b/book/sandbox/SanPart/figures/.DS_Store differ
diff --git a/book/sandbox/SanPart/figures/ObservationTheory/01_Regression.png b/book/sandbox/SanPart/figures/ObservationTheory/01_Regression.png
new file mode 100644
index 0000000000000000000000000000000000000000..55c2c0f0a50e69ec456ee2bbedc3283edfc11f08
Binary files /dev/null and b/book/sandbox/SanPart/figures/ObservationTheory/01_Regression.png differ
diff --git a/book/sandbox/SanPart/figures/ObservationTheory/02_LeastSquares_fit.png b/book/sandbox/SanPart/figures/ObservationTheory/02_LeastSquares_fit.png
new file mode 100644
index 0000000000000000000000000000000000000000..b0b641829e3a16b7fceb4171d31dd328d1850b3c
Binary files /dev/null and b/book/sandbox/SanPart/figures/ObservationTheory/02_LeastSquares_fit.png differ
diff --git a/book/sandbox/SanPart/figures/ObservationTheory/02_LeastSquares_sol.png b/book/sandbox/SanPart/figures/ObservationTheory/02_LeastSquares_sol.png
new file mode 100644
index 0000000000000000000000000000000000000000..e148bc33d5c0607590168cd261e0668ff249c831
Binary files /dev/null and b/book/sandbox/SanPart/figures/ObservationTheory/02_LeastSquares_sol.png differ
diff --git a/book/sandbox/SanPart/figures/ObservationTheory/05_CI.png b/book/sandbox/SanPart/figures/ObservationTheory/05_CI.png
new file mode 100644
index 0000000000000000000000000000000000000000..33d6ccf5c846e29f6ac400e4d9ea124de873d20b
Binary files /dev/null and b/book/sandbox/SanPart/figures/ObservationTheory/05_CI.png differ
diff --git a/book/sandbox/SanPart/figures/ObservationTheory/05_CI_model.png b/book/sandbox/SanPart/figures/ObservationTheory/05_CI_model.png
new file mode 100644
index 0000000000000000000000000000000000000000..dae1d35b9645f550c26ccea0471ad87141d9c16c
Binary files /dev/null and b/book/sandbox/SanPart/figures/ObservationTheory/05_CI_model.png differ
diff --git a/book/sandbox/SanPart/figures/ObservationTheory/05_standard.png b/book/sandbox/SanPart/figures/ObservationTheory/05_standard.png
new file mode 100644
index 0000000000000000000000000000000000000000..6196cb9cbbc2f0bf2712bc9c4ca52c5d2e2f004b
Binary files /dev/null and b/book/sandbox/SanPart/figures/ObservationTheory/05_standard.png differ
diff --git a/book/sandbox/SanPart/premath/00_00_PreMath.md b/book/sandbox/SanPart/premath/00_00_PreMath.md
new file mode 100644
index 0000000000000000000000000000000000000000..e67ee7648679d1e3f90a34ac5acf631597da0be6
--- /dev/null
+++ b/book/sandbox/SanPart/premath/00_00_PreMath.md
@@ -0,0 +1,5 @@
+# Introduction
+
+This chapter serves as a quick reference to certain topics from Calculus and Linear Algebra that will be useful for MUDE.
+
+Another useful resource is the [Math Explained](https://www.tudelft.nl/en/eemcs/study/online-education/math-explained) website with videos on various math subjects (Calculus, Linear Algebra, Differential Equations, Probability and Statistics) developed by TU Delft.
\ No newline at end of file
diff --git a/book/sandbox/SanPart/premath/00_01a_PreMath.md b/book/sandbox/SanPart/premath/00_01a_PreMath.md
new file mode 100644
index 0000000000000000000000000000000000000000..b619f5db6975fb28b85a60355ff98fe8a139439d
--- /dev/null
+++ b/book/sandbox/SanPart/premath/00_01a_PreMath.md
@@ -0,0 +1,92 @@
+# Vectors and matrices
+
+## Matrix
+A matrix is a rectangular array of numbers. If $A$ is a matrix, it can be written in generic form as: 
+
+$$ 
+A= \left[ \begin{array}{cccc} a_{11} & a_{12} & \ldots & a_{1n} \\ a_{21} & a_{22} & \ldots & a_{2n} \\ \vdots & \vdots &        & \vdots \\ a_{m1} & a_{m2} & \ldots & a_{mn} \\ \end{array} \right], 
+$$ 
+
+where $a_{ij}=(A)_{ij}$ denotes the $ (i,j) $th element of $A$.  The size of a matrix is defined as its number of rows times the number of columns. In the above example,  the size of the matrix  $A$ is $ m \times n $.   
+
+## Vector
+An $m \times 1$ matrix 
+
+$$ 
+x = \left[ \begin{array}{c} x_{1} \\ x_{2} \\ \vdots \\ x_{m} \\ \end{array} \right] $$ 
+
+is called an $m$-dimensional column vector. A $ 1 \times n $ matrix is called an $ n$-dimensional row vector. When we speak of vectors, we usually refer to column vectors, unless otherwise stated.
+
+## Matrix operations 
+*Summation*: The sum of two matrices $A$ and $B$ is defined if they have the same size. Then $A+B=(a_{ij}+b_{ij})$. The product of a scalar $\alpha$ and a matrix $A$ is $ \alpha A = A \alpha = (\alpha a_{ij}) $.
+
+*Multiplication*: The product $AB$ of two matrices $A$ and $B$ is defined only if the number of columns of matrix $A$ equals the number of rows of matrix $B$. Thus if $A$ is of size $m \times n$ and $B$ is of size $n \times o$, then $C=AB$ is of size $m \times o$, which has its $(ij)$th entry $c_{ij}$ given by $ c_{ij} = \sum_{k=1}^{n} a_{ik}b_{kj} $. Note that in general $AB \neq BA$.
+
+## Special matrices
+*Square matrices*: Matrices with the same number of columns and rows are called square matrices. Let $A$ be an $m \times n$ matrix. If $m=n$, then $A$ is a square matrix.  
+
+*Symmetric matrices*: A square matrix $A$ is called symmetric when $A^{T}=A$. For example, the following matrix is a symmetric matrix: 
+
+$$ 
+A= \left[ \begin{array}{cccc} 4 & 3& 2& 1 \\ 3 & 4& 3& 2 \\ 2 & 3& 4& 3 \\ 1 & 2& 3& 4 \\ \end{array} \right].  
+$$ 
+
+In other words, in symmetric matrices,  $a_{ij}=a_{ji}$.
+
+*Diagonal matrices*: The diagonal entries of an $m \times m$ matrix $A$, with entries $(A)_{ij}=a_{ij}$, are $a_{11}, a_{22}, \ldots, a_{mm}$. Matrix $A$ is called a diagonal matrix if all other entries are equal to zero. Such a matrix will be denoted as $A= \text{diag}(a_{11}, a_{22}, \ldots, a_{mm})$. For example, the following matrix is a diagonal matrix: 
+
+$$ 
+A= \left[ \begin{array}{cccc} 1 & 0& 0& 0 \\ 0 & 2& 0& 0 \\ 0 & 0& 5& 0 \\ 0 & 0& 0& 3 \\ \end{array} \right].  
+$$
+
+*Identity matrices*: A diagonal matrix is called an identity matrix if all of its diagonal entries equal $1$. An identity matrix of size $m \times m$ will be written as $A=I_{m}$ or sometimes simply as $A=I$. Examples of identity matrices are: 
+
+$$
+I_{1}=1, ~ ~ ~ ~ I_{2}= \left[ \begin{array}{cc}  1&0\\0&1 \end{array} \right], ~ ~ ~ ~ I_{3}=\left[ \begin{array}{ccc}  1&0&0\\0&1&0\\0&0&1 \end{array} \right], ~ ~ ~ ~   I_{4}=\left[ \begin{array}{cccc}  1&0&0&0\\0&1&0&0\\0&0&1&0\\0&0&0&1 \end{array} \right].  
+$$
+
+*Zero matrix*: A matrix of which all its entries equal zero is called the zero-matrix and is written as $A=0$. 
+
+## Matrix Transpose 
+The transpose of a matrix $A$ is the matrix $A^{T}$ obtained by interchanging the rows and columns of $A$. Therefore, the $(i,j)$th entry of $A^{T}$ is $a_{ji}$. If the size of $A$ is $ m{\times}n$, then its transpose $A^{T}$ is a matrix of size $ n{\times}m$. Let $\alpha$ and $\beta$ be scalars and $A$ and $B$ be matrices of appropriate size. Then the following properties hold:
+
+$$ 
+\begin{array}{llll} (a) & (\alpha A)^{T}=\alpha A^{T} & (c) & (\alpha A+\beta B)^{T}= \alpha A^{T} + \beta B^{T}\\ (b) & (A^{T})^{T}=A & (d) & (AB)^{T}=B^{T}A^{T} \\ \end{array}. 
+$$
+
+Check yourself that for a symmetric matrix $W$ we have $W=W^T$.
+Consequently, also $\left(\mathrm{A}^T W\mathrm{A} \right) $ and $\left(\mathrm{A}^T W\mathrm{A} \right)^{-1} $ are symmetric.
+
+## Trace of a matrix
+The trace is a function that is only defined on square matrices. The trace of a square matrix $A$ of size $m \times m$, denoted as ${\rm trace}(A)$, is defined to be the sum of all its diagonal entries, $${\rm trace}(A)= \sum_{i=1}^{m} a_{ii}.$$ For example: 
+
+$$ 
+\text{trace}\Big( \left[ \begin{array}{ccc} 4 & 2 & 1 \\ 2 & 3 & 1 \\ 1 & 1 & 2 \\ \end{array} \right] \Big)= 4+3+2=9. 
+$$ 
+
+## Linear Combination
+Assume $x_{1}, ~ x_{2}, \dots, x_{n}$ are $n$ vectors with the size of $m \times 1$ each, and $\alpha_{1}, ~ \alpha_{2}, \dots \alpha_{n}$, are $n$ scalars, then $ \sum_{i=1}^{n} \alpha_{i} x_{i} $ is called a linear combination of $ x_{1}, \ldots, x_{n} $.  In other words, if we multiply vectors by scalars, and add or subtract them from each other, the result is called the linear combination of the vectors. 
+
+## Linear Dependency
+Vectors $ x_{i} $, $ i=1, \ldots, n $, are said to be linearly dependent if there exist scalars $ \alpha_{i} $, not all equal to zero, such that $ \alpha_{1}x_{1}+\alpha_{2}x_{2}+\ldots + \alpha_{n}x_{n} = 0 $. An alternative but equivalent definition is: a set of vectors is linearly dependent if at least one of the vectors can be written as a linear combination of the others. If not, we say that the vectors $ x_{i} $ are linearly independent. The vectors $ x_{1}, \ldots, x_{n} $ are linearly independent if and only if: 
+
+$$ 
+\alpha_{1}x_{1}+\alpha_{2}x_{2}+\ldots + \alpha_{n}x_{n} = 0 \Rightarrow \alpha_{1}= \ldots = \alpha_{n}=0 
+$$
+
+## Rank of a matrix
+The maximum number of linearly independent column vectors of a matrix $A$ is called the rank of $A$, and it is denoted as $\text{rank}(A)$.  The maximum number of linearly independent column vectors of a matrix always equals its maximum number of linearly independent row vectors. A matrix $A$ of size $m \times n$ is said to have full row rank if $\text{rank}(A)=m$ and full column rank if $\text{rank}(A)=n$ . The matrix is said to have a rank deficiency if $ \text{rank}(A) < \text{min}(m,n) $.  
+
+## Singular matrices 
+Square matrices with a rank deficiency are called singular. Alternatively, if a square matrix of size $m \times m$ has a rank equal to $m$, the matrix is called nonsingular. 
+
+## Inverse of a matrix
+Let $A$ be a square $m \times m$ and nonsingular matrix. Then there exists a unique matrix $A^{-1}$, called the inverse of $A$, such that 
+
+$$
+AA^{-1}=A^{-1}A=I_{m}.
+$$ 
+
+It can be shown that $ (A^{-1})^{-1}=A$. If $\alpha$ is a nonzero scalar, then $ (\alpha A)^{-1}=\frac{1}{\alpha}A^{-1} $. Note that singular matrices (or matrices with rank deficiency) are not invertible. 
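+
+These matrix operations map directly onto NumPy, as the short sketch below illustrates.
+
+```python
+import numpy as np
+
+A = np.array([[4., 3.],
+              [3., 4.]])
+
+print(A.T)                       # transpose
+print(np.trace(A))               # trace: 4 + 4 = 8
+print(np.linalg.matrix_rank(A))  # rank: 2, so A is nonsingular
+print(np.linalg.inv(A) @ A)      # A^{-1} A equals the identity matrix
+```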
+
+
diff --git a/book/sandbox/SanPart/premath/00_01b_PreMath.md b/book/sandbox/SanPart/premath/00_01b_PreMath.md
new file mode 100644
index 0000000000000000000000000000000000000000..d5a7315f7cc5c692e6707321a9b51e227541679b
--- /dev/null
+++ b/book/sandbox/SanPart/premath/00_01b_PreMath.md
@@ -0,0 +1,83 @@
+# Vector spaces
+
+## Vector space and subspace 
+A set $ \mathcal{W} $  is called a vector space if its elements are vectors and the sum of two elements of $ \mathcal{W} $ is again an element of $ \mathcal{W} $, and the product of an element of $ \mathcal{W} $ with a scalar is an element of $ \mathcal{W} $.  
+A subset of a vector space which itself is a vector space is called a subspace. 
+
+## Span 
+Let $ a_{i} \in \mathcal{W} $,  $ i=1, \ldots, n $, where $ \mathcal{W}$ is a vector space. The set of all linear combinations of $ a_{1}, \ldots, a_{n} $, denoted as $ \{ a_{1}, \ldots, a_{n} \} $, is a subspace of $ \mathcal{W}$: $ \{ a_{1}, \ldots, a_{n} \} \subset \mathcal{W} $. 
+
+If every vector of a vector space $ \mathcal{V} $ can be written as a linear combination of $ a_{1}, \ldots, a_{n} $, then $ a_{1}, \ldots, a_{n} $ are said to span $ \mathcal{V}$: $\mathcal{V} = \{ a_{1}, \ldots, a_{n} \} $.
+
+## Basis and dimension of a vector space
+A basis of a vector space $ \mathcal{W} $ is a set of linearly independent vectors which span $ \mathcal{W} $.
+
+Every vector space contains a basis and every vector can be written as a unique linear combination of the vectors of a basis.
+
+The dimension of a vector space $ \mathcal{W} $, denoted as $ \dim \mathcal{W} $, is the number of vectors of a basis of $ \mathcal{W} $. In an $ n $-dimensional vector space $ \mathcal{W} $, every linearly independent set of $n$ vectors is a basis of $ \mathcal{W} $. 
+
+As an example, the three-dimensional space $\mathbb{R}^3$ is a vector space, and one possible basis for $\mathbb{R}^3$ is the following set of unit vectors:
+
+$$ 
+\begin{bmatrix}1\\0\\0 \end{bmatrix}, ~~~ \begin{bmatrix}0\\1\\0 \end{bmatrix}, ~~~ \begin{bmatrix}0\\0\\1 \end{bmatrix}. 
+$$
+
+That is, every vector in $\mathbb{R}^3$ can be written as a linear combination of these three unit vectors. For example, the arbitrary three-dimensional vector $ \begin{bmatrix} 4\\3\\5 \end{bmatrix} $ can be written as the linear combination:
+
+$$
+\begin{bmatrix} 4\\3\\5 \end{bmatrix}=4\begin{bmatrix}1\\0\\0 \end{bmatrix}+3\begin{bmatrix}0\\1\\0 \end{bmatrix}+5\begin{bmatrix}0\\0\\1 \end{bmatrix}. 
+$$ 
+
+We can say that the three vectors $ \begin{bmatrix}1\\0\\0 \end{bmatrix}$, $\begin{bmatrix}0\\1\\0 \end{bmatrix}$, and $\begin{bmatrix}0\\0\\1 \end{bmatrix}$ span the vector space $\mathbb{R}^3$. 
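+
+As a small numerical sketch (numpy; the non-standard basis below is an arbitrary choice for illustration), the coefficients of a vector with respect to a basis follow from solving a linear system:
+
+```python
+import numpy as np
+
+# an arbitrarily chosen basis of R^3, stored as the columns of B
+B = np.array([[1., 1., 0.],
+              [0., 1., 1.],
+              [0., 0., 1.]])
+v = np.array([4., 3., 5.])
+
+coeff = np.linalg.solve(B, v)      # unique coefficients of v in this basis
+print(coeff)                       # [ 6. -2.  5.]
+print(np.allclose(B @ coeff, v))   # True: v is recovered as a linear combination
+```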
+
+## Column Space (or Range Space) of a Matrix
+The column space (or range space) of a matrix $A$ of size $m\times n$ is the subspace of $\mathbb{R}^{m}$ which is spanned by the column vectors of $A$. The range space of $A$ is denoted as $\mathcal{R}(A)$. The dimension of $\mathcal{R}(A)$ equals the maximum number of linearly independent column vectors of $A$.
+
+For example, let $A=\begin{bmatrix} 1&1\\1&2\\1&3 \end{bmatrix}$. Then $\mathcal{R}(A)$ is a subspace of $\mathbb{R}^{3}$, and its elements can be written as linear combinations of the two column vectors of $A$:
+
+$$
+\begin{bmatrix} 1\\1\\1 \end{bmatrix}, ~ \text{and} ~ \begin{bmatrix} 1\\2\\3 \end{bmatrix}. 
+$$ 
+
+In this example, the dimension of $\mathcal{R}(A)$ is 2 because $A$ has two linearly independent columns. 
+
+## Inner product of two vectors 
+The inner product of two vectors $x$ and $y$, denoted by $(x,y)$, is defined as: 
+
+$$ 
+(x,y)=x^Ty. 
+$$ 
+
+For example, the inner product of the two vectors $ x= \begin{bmatrix} 3 \\ 2 \\ 1 \\ 1 \end{bmatrix} $ and $ y= \begin{bmatrix} 1 \\ 2 \\ 3 \\ 4 \end{bmatrix} $ is calculated as: 
+
+$$   
+(x,y)=x^Ty= \left[ \begin{array}{cccc} 3&2&1&1\\ \end{array} \right] \left[ \begin{array}{c} 1 \\ 2 \\ 3 \\ 4 \end{array} \right] =(3\times 1)+(2\times 2)+(1\times 3) +(1\times 4)=14.  
+$$
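+
+The same computation as a numpy sketch (illustrative only):
+
+```python
+import numpy as np
+
+x = np.array([3., 2., 1., 1.])
+y = np.array([1., 2., 3., 4.])
+
+print(x @ y)          # 14.0, the inner product x^T y
+print(np.dot(x, y))   # equivalent
+```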
+
+## Norm or length of a vector
+The length or norm of a vector $x$, denoted as $\|x\|$, is defined as:
+
+$$ 
+\| x \| = \sqrt{(x, x)}= \sqrt{x^Tx}. 
+$$ 
+
+For example, the length of the vector $ x= \begin{bmatrix} 3 \\ 2 \\ 1 \\ 1 \end{bmatrix} $ is computed as: 
+
+$$ 
+\| x \|=\sqrt{ (3\times 3)+ (2\times 2)+ (1\times 1)+(1\times 1)}=\sqrt{15}. 
+$$ 
+
+## Distance between two vectors
+The distance between two vectors $x$ and $y$ is defined as the norm of their difference, and is given as $ \|x-y\|$.
+
+## Angle between two vectors and orthogonality
+The angle between two vectors $x$ and $y$ is defined to be the number $\theta \in [0, \pi] $ such that:
+
+$$
+\cos (\theta) = \frac{ (x,y)}{ \|x\| \|y\|}.
+$$ 
+
+Two vectors are said to be orthogonal (or normal to each other) if the angle between them is $\pi/2$, or in other words when their inner product equals zero: $ (x,y)=x^Ty=0 $ .
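+
+A short numpy sketch illustrating norm, distance, angle and orthogonality for the vectors used above (illustrative only):
+
+```python
+import numpy as np
+
+x = np.array([3., 2., 1., 1.])
+y = np.array([1., 2., 3., 4.])
+
+print(np.linalg.norm(x))             # sqrt(15), approximately 3.873
+print(np.linalg.norm(x - y))         # distance between x and y
+
+cos_theta = (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))
+print(np.arccos(cos_theta))          # angle theta in radians
+
+e1, e2 = np.array([1., 0.]), np.array([0., 1.])
+print(np.isclose(e1 @ e2, 0.0))      # True: the unit vectors e1 and e2 are orthogonal
+```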
+
diff --git a/book/sandbox/SanPart/premath/00_01c_PreMath.md b/book/sandbox/SanPart/premath/00_01c_PreMath.md
new file mode 100644
index 0000000000000000000000000000000000000000..a40418614e2168fc25231ca3289afe6ae80eaa2f
--- /dev/null
+++ b/book/sandbox/SanPart/premath/00_01c_PreMath.md
@@ -0,0 +1,102 @@
+# System of linear equations
+Vectors and matrices play an important role in describing and solving systems of linear equations. Let a linear system of $m$ equations in $n$ unknown parameters $x_{i}$, $i=1, \ldots, n$, be given as
+
+$$
+\begin{array}{ccc} y_{1} & = & a_{11}x_{1}+a_{12}x_{2}+\ldots + a_{1n}x_{n} \\ y_{2} & = & a_{21}x_{1}+a_{22}x_{2}+\ldots + a_{2n}x_{n} \\ \vdots & & \vdots \\ y_{m} & = & a_{m1}x_{1}+a_{m2}x_{2}+\ldots + a_{mn}x_{n} \end{array} 
+$$
+
+This system of linear equations can be written in a matrix form as:
+
+$$ 
+\begin{bmatrix} y_1\\ y_2 \\ \vdots \\ y_m \end{bmatrix}= \begin{bmatrix} a_{11}&a_{12}&\dots&a_{1n}\\a_{21}&a_{22}&\dots&a_{2n} \\ \vdots&\vdots&\vdots&\vdots \\ a_{m1}&a_{m2}&\dots&a_{mn} \end{bmatrix} \begin{bmatrix} x_1\\ x_2 \\ \vdots \\ x_n \end{bmatrix}
+$$
+
+By introducing  
+
+$$
+\mathrm{y}=\begin{bmatrix} y_1\\ y_2 \\ \vdots \\ y_m \end{bmatrix},  ~  \mathrm{A}=\begin{bmatrix} a_{11}&a_{12}&\dots&a_{1n}\\a_{21}&a_{22}&\dots&a_{2n} \\ \vdots&\vdots&\vdots&\vdots \\ a_{m1}&a_{m2}&\dots&a_{mn} \end{bmatrix}, ~ \mathrm{x}=\begin{bmatrix} x_1\\ x_2 \\ \vdots \\ x_n \end{bmatrix},
+$$ 
+
+the linear system can be written in the compact matrix form $\mathrm{y} = \mathrm{A} \mathrm{x}$.
+
+Note that the right-hand side of the system $\mathrm{y} = \mathrm{A} \mathrm{x}$ can be re-written as a linear combination of the column vectors of $\mathrm{A}$:
+
+$$
+\mathrm{A}\mathrm{x}=\begin{bmatrix} a_{11}\\a_{21}\\ \vdots \\ a_{m1} \end{bmatrix}x_1+\begin{bmatrix} a_{12}\\a_{22}\\ \vdots \\ a_{m2} \end{bmatrix}x_2 + \dots +\begin{bmatrix} a_{1n}\\a_{2n}\\ \vdots \\ a_{mn} \end{bmatrix}x_n.
+$$
+
+The linear system of equations is solved once it is known which linear combination(s) of the columns of $\mathrm{A}$ produce $\mathrm{y}$.
+
+Example: Assume the following system of three equations with two unknowns:
+
+$$
+1 =x_1-x_2\\
+-1=x_1-3x_2\\
+3 =x_1+x_2
+$$
+
+These three equations can be written in a matrix form as
+
+$$
+\begin{bmatrix} 1\\-1\\3 \end{bmatrix} = \begin{bmatrix} 1&-1\\1&-3\\1&1 \end{bmatrix} \begin{bmatrix} x_1\\x_2 \end{bmatrix}
+$$
+
+The solution of this system is $x_1=2$ and $x_2=1$. We can see that the particular linear combination of columns of $\mathrm{A}$ using $x_1=2$ and $x_2=1$ produces the vector $\begin{bmatrix} 1\\-1\\3 \end{bmatrix}$ :  
+
+$$
+\begin{bmatrix} 1\\1\\1 \end{bmatrix}2 + \begin{bmatrix} -1\\-3\\1 \end{bmatrix}1=\begin{bmatrix} 1\\-1\\3 \end{bmatrix}.
+$$
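+
+The same example can be solved numerically (a numpy sketch; since there are three equations in two unknowns, `np.linalg.lstsq` is used, which here returns the exact solution because the system is consistent):
+
+```python
+import numpy as np
+
+A = np.array([[1., -1.],
+              [1., -3.],
+              [1.,  1.]])
+y = np.array([1., -1., 3.])
+
+x, residuals, rank, _ = np.linalg.lstsq(A, y, rcond=None)
+print(x)                        # [2. 1.]
+print(rank)                     # 2: full column rank
+print(np.allclose(A @ x, y))    # True: the system is consistent
+```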
+
+## Properties of linear systems of equations
+Consider a linear system of $m$ equations in $n$ unknowns:
+
+$$ \begin{bmatrix} y_1\\ y_2 \\ \vdots \\ y_m \end{bmatrix}= \begin{bmatrix} a_{11}&a_{12}&\dots&a_{1n}\\a_{21}&a_{22}&\dots&a_{2n} \\ \vdots&\vdots&\vdots&\vdots \\ a_{m1}&a_{m2}&\dots&a_{mn} \end{bmatrix} \begin{bmatrix} x_1\\ x_2 \\ \vdots \\ x_n \end{bmatrix}$$
+
+In compact form this is written as $\mathrm{y} = \mathrm{A}\mathrm{x}$.
+
+Such a system may or may not have a solution, and if a solution exists, this solution may or may not be unique.
+
+(PM_consistent)=
+## Consistent systems: at least one solution exists
+
+If a solution exists, the system of equations is called consistent. A solution exists if and only if $\mathrm{y}$ can be written as a linear combination of the column vectors of the matrix $\mathrm{A}$:
+
+$$\mathrm{y}=\begin{bmatrix} a_{11}\\a_{21}\\ \vdots \\ a_{m1} \end{bmatrix}x_1+\begin{bmatrix} a_{12}\\a_{22}\\ \vdots \\ a_{m2} \end{bmatrix}x_2 + \dots +\begin{bmatrix} a_{1n}\\a_{2n}\\ \vdots \\ a_{mn} \end{bmatrix}x_n$$ 
+
+In that case $\mathrm{y}$ is an element of the range space of $\mathrm{A}$: $\mathrm{y} \in \mathcal{R}(\mathrm{A})$.
+
+For a consistent system of equations it is always possible to find a solution $\mathrm{x}$ such that $\mathrm{y}=\mathrm{A}\mathrm{x}$. If it is not possible to find a solution, the system is called inconsistent.
+
+### Consistency is guaranteed if $\text{rank}(\mathrm{A}) = m$
+
+Explanation: $ \mathrm{y} \in \mathbb{R}^{m}$ and the system is consistent if $ \mathrm{y} \in \mathcal{R}(\mathrm{A}) $, hence consistency is guaranteed if $\mathcal{R}(\mathrm{A})=\mathbb{R}^{m}$. This means that the columns of $\mathrm{A}$ must span the complete space $\mathbb{R}^{m}$, which is the case if $\text{rank}(\mathrm{A}) = m$.
+
+If $\text{rank}(\mathrm{A}) < m$, the system may or may not be consistent; whether it is depends on the actual entries of the vector $\mathrm{y}$.
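+
+Numerically, one way to check whether a given $\mathrm{y}$ lies in $\mathcal{R}(\mathrm{A})$ is to compare the rank of $\mathrm{A}$ with the rank of the matrix obtained by appending $\mathrm{y}$ as an extra column; the two ranks are equal exactly when $\mathrm{y}$ is a linear combination of the columns of $\mathrm{A}$. A sketch with numpy, reusing the earlier example:
+
+```python
+import numpy as np
+
+A = np.array([[1., -1.],
+              [1., -3.],
+              [1.,  1.]])
+y_consistent   = np.array([1., -1., 3.])   # lies in R(A)
+y_inconsistent = np.array([1., -1., 4.])   # does not lie in R(A)
+
+def is_consistent(A, y):
+    """y is in the column space of A iff appending y does not increase the rank."""
+    return np.linalg.matrix_rank(np.column_stack([A, y])) == np.linalg.matrix_rank(A)
+
+print(is_consistent(A, y_consistent))    # True
+print(is_consistent(A, y_inconsistent))  # False
+```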
+
+## Unique solution
+
+A consistent system has a unique solution if and only if the column vectors of matrix $\mathrm{A}$ are linearly independent, i.e. if $ \text{rank}(\mathrm{A}) = n$.
+
+This can be seen as follows: assume that $\mathrm{x}$ and $\mathrm{x}' \neq \mathrm{x}$ are two different solutions. Then $\mathrm{A}\mathrm{x}=\mathrm{A}\mathrm{x}'$ or $\mathrm{A}(\mathrm{x}-\mathrm{x}')=0$. But this can only be the case if some of the column vectors of $\mathrm{A}$ are linearly dependent, which contradicts the assumption of full column rank.
+
+## Determined, overdetermined and underdetermined systems
+
+A system of equations $\mathrm{y}=\mathrm{A}\mathrm{x}$ with $\text{rank}(\mathrm{A}) =m=n$ is consistent and has a unique solution: $\hat{\mathrm{x}} = \mathrm{A}^{-1}\mathrm{y}$. Such a system is called *determined*.
+
+A system is *underdetermined* if $\text{rank}(\mathrm{A})  < n$, i.e. if it does not have a unique solution.
+
+A system is *overdetermined* if $\text{rank}(\mathrm{A})  < m$; in that case the system may or may not be consistent.
+
+## Redundancy
+
+The redundancy of a system of equations is equal to $m - \text{rank}(\mathrm{A})$.
+
+If we restrict ourselves to systems of observation equations that are of full column rank: $\text{rank}(\mathrm{A}) = n $, the system can either be
+
+*determined*: $\text{rank}(\mathrm{A}) =n =m$, the redundancy is equal to 0
+
+*overdetermined*: $\text{rank}(\mathrm{A}) =n < m$, the redundancy is equal to $m-n>0$
+
+
+
+
diff --git a/book/sandbox/SanPart/premath/00_02_PreMath.md b/book/sandbox/SanPart/premath/00_02_PreMath.md
new file mode 100644
index 0000000000000000000000000000000000000000..cff418b6e32951178aa3dc97441734695e12b64b
--- /dev/null
+++ b/book/sandbox/SanPart/premath/00_02_PreMath.md
@@ -0,0 +1,61 @@
+# Multivariate differentiation
+
+(PM_gradient)=
+## Gradient vector
+For a multivariate function $f(x)$, where $x=\begin{bmatrix} x_1\\ \vdots\\x_n \end{bmatrix}$, the vector whose entries are the partial derivatives with respect to $x_1,x_2,\dots,$ and $x_n$ is called the gradient vector and is denoted as:
+
+$$
+\partial_x f= \begin{bmatrix} \partial f/\partial x_1\\ \vdots\\ \partial f/\partial x_n \end{bmatrix}.
+$$
+
+For example, let's assume  $x=\begin{bmatrix} x_1\\ x_2\\x_3 \end{bmatrix}$ and $f(x)=x_1^2+3x_2+\cos(x_3)$. Then the gradient vector of $f(x)$ is:
+
+$$
+\partial_x f= \begin{bmatrix} 2x_1\\3 \\-\sin(x_3)  \end{bmatrix}.
+$$
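+
+The same gradient can be reproduced with symbolic differentiation (a sketch using sympy, purely as an illustration):
+
+```python
+import sympy as sp
+
+x1, x2, x3 = sp.symbols('x1 x2 x3')
+f = x1**2 + 3*x2 + sp.cos(x3)
+
+# stack the partial derivatives into a column vector
+grad = sp.Matrix([sp.diff(f, xi) for xi in (x1, x2, x3)])
+print(grad)   # Matrix([[2*x1], [3], [-sin(x3)]])
+```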
+
+## Gradient of linear functions
+If $b$ and $x$ are column vectors of the same size, the gradient of $f(x)=b^Tx=x^Tb$ with respect to $x$ is derived as:
+
+$$
+\partial_x f=b. 
+$$
+
+If $b$ is an $m\times 1$ vector, $x$ is an $n\times 1$ vector, and $A$ is a matrix of size $m\times n$, the gradient of $f(x)=b^TAx=x^TA^Tb$ with respect to $x$ is derived as:
+
+$$
+\partial_x f=A^Tb.
+$$
+
+## Gradient of quadratic functions
+If $x$ is a column vector of size $n\times 1$ and $B$ is a square matrix of size $n\times n$, the gradient of $f(x)=x^TBx=x^TB^Tx$ with respect to $x$ is derived as:
+
+$$
+\partial_x f=(B+B^T)x. 
+$$
+
+Note that if $B$ is a symmetric matrix, then $B^T=B$, and so the gradient of $f(x)=x^TBx=x^TB^Tx$  with respect to $x$  is derived as: 
+
+$$
+\partial_x f=2Bx.
+$$
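+
+A small numerical sanity check (a numpy sketch with an arbitrarily chosen symmetric $B$): the gradient $2Bx$ can be compared against a finite-difference approximation of $f(x)=x^TBx$.
+
+```python
+import numpy as np
+
+B = np.array([[4., 2.],
+              [2., 3.]])           # symmetric, so the gradient is 2 B x
+x = np.array([1., -2.])
+
+f = lambda v: v @ B @ v            # f(x) = x^T B x
+
+grad_analytic = 2 * B @ x
+eps = 1e-6
+grad_numeric = np.array([
+    (f(x + eps * e) - f(x - eps * e)) / (2 * eps)   # central differences
+    for e in np.eye(2)
+])
+
+print(grad_analytic)                              # [ 0. -8.]
+print(np.allclose(grad_analytic, grad_numeric))   # True (up to rounding)
+```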
+
+(PM_jacobian)=
+## Jacobian matrix
+Assume we have $m$ different multivariate functions $f_1(x),f_2(x),\dots, f_m(x)$, where $x=\begin{bmatrix} x_1\\ \vdots\\x_n \end{bmatrix}$. If we collect all the $f_i(x)$ functions in a column vector $F(x)$ as
+
+$$
+F(x)=\begin{bmatrix} f_1(x)\\ \vdots\\f_m(x) \end{bmatrix},
+$$
+
+$F(x)$ is itself an $m$-vector function of an $n$-vector variable $x$. For such a multivariate vector function, the $m\times n$ matrix of partial derivatives $\partial F_{i}/ \partial x_{j}$, denoted as $J_x$, is called the Jacobian matrix of $F$ and is defined as:
+
+$$ 
+J_x=\partial_{x^{T}} F =\left[\begin{array}{ccc}\partial F_{1} / \partial x_{1} & \ldots & \partial F_{1}/ \partial x_{n} \\\partial F_{2} / \partial x_{1} & \ldots & \partial F_{2}/ \partial x_{n} \\ \vdots & & \vdots \\ \partial F_{m} / \partial x_{1} & \ldots & \partial F_{m}/ \partial x_{n}\end{array}\right]. 
+$$
+
+For example, let $x=\begin{bmatrix} x_1\\ x_2\\x_3 \end{bmatrix}$ and  $ F(x)=\begin{bmatrix} x_1+x_2+x_3 \\ x_1^2+2x_2 \\ x_1^2+x_2^3+\cos(x_3)\end{bmatrix} $. Then the Jacobian matrix of $F(x)$ is derived as:
+
+$$ 
+J_x=\partial_{x^{T}} F =\left[\begin{array}{ccc}  1& 1& 1 \\ 2x_1&2 &0  \\ 2x_1 & 3x_2^2& -\sin(x_3) \end{array}\right].
+$$
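+
+The same Jacobian can be obtained symbolically (a sympy sketch, for illustration only):
+
+```python
+import sympy as sp
+
+x1, x2, x3 = sp.symbols('x1 x2 x3')
+F = sp.Matrix([x1 + x2 + x3,
+               x1**2 + 2*x2,
+               x1**2 + x2**3 + sp.cos(x3)])
+
+J = F.jacobian(sp.Matrix([x1, x2, x3]))
+print(J)
+# Matrix([[1, 1, 1], [2*x1, 2, 0], [2*x1, 3*x2**2, -sin(x3)]])
+```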
\ No newline at end of file
diff --git a/book/sandbox/SanPart/premath/00_03_PreMath.md b/book/sandbox/SanPart/premath/00_03_PreMath.md
new file mode 100644
index 0000000000000000000000000000000000000000..27c0fc0be0fa1e7e5874def1a0e55233f9017fa3
--- /dev/null
+++ b/book/sandbox/SanPart/premath/00_03_PreMath.md
@@ -0,0 +1,38 @@
+# Taylor series
+
+## Taylor's theorem for approximating functions of 1 variable
+
+Taylor’s theorem can be used to approximate a function $f(x)$ with the so-called $p$-th order Taylor polynomial:
+
+$$
+f(x) \approx f(x_0) +\frac{\partial}{\partial x}f(x_0)\Delta x + \frac{1}{2!} \frac{\partial^2}{\partial x^2}f(x_0)\Delta x^2 + \ldots +\frac{1}{p!} \frac{\partial^p}{\partial x^p}f(x_0)\Delta x^p=P_p (x)
+$$
+
+where it is required that the function $f: \mathbb{R}\mapsto \mathbb{R}$ is $p$-times differentiable at the point $x_0 \in \mathbb{R}$, and $\Delta x = x-x_0$.
+
+The approximation error is equal to 
+
+$$
+R_p(x) = f(x)- P_p (x) 
+$$
+
+and is called the remainder term.
+
+Example:
+
+A linear approximation (also called linearization) of $f(x) = \cos(x)$ at $x_0$ is obtained by the first-order Taylor polynomial as:
+
+$$
+f(x) \approx \cos (x_0) - \sin (x_0) \Delta x
+$$
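+
+A quick numerical illustration of this linearization (a numpy sketch; the expansion point $x_0 = \pi/4$ is an arbitrary choice):
+
+```python
+import numpy as np
+
+x0 = np.pi / 4                    # arbitrary expansion point
+x  = x0 + 0.1                     # a point close to x0
+dx = x - x0
+
+linearization = np.cos(x0) - np.sin(x0) * dx
+print(np.cos(x))                        # exact value
+print(linearization)                    # first-order Taylor approximation
+print(abs(np.cos(x) - linearization))   # small remainder term
+```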
+
+## First-order Taylor polynomial for linearizing a function of $n$ variables
+
+In this course we will only need first-order Taylor approximations for linearizing non-linear functions of a vector $x$ with $n$ variables. The corresponding Taylor polynomial is then given by:
+
+$$
+f(x) \approx f(x_0)  + \partial_{x_1} f(x_0) \Delta x_1+ \partial_{x_2} f(x_0) \Delta x_2+ \ldots + \partial_{x_n} f(x_0) \Delta x_n
+$$
+
+where we need the $n$ partial derivatives of the function $f$ evaluated at $x_0$, and $\Delta x_i$ is the $i$-th entry of $\Delta x = x-x_0$.
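+
+As a sketch (numpy; the function $f(x) = x_1^2 + 3x_2 + \cos(x_3)$ and its gradient are reused from the earlier example, while the expansion point is an arbitrary choice), the first-order approximation is the value at $x_0$ plus the gradient at $x_0$ multiplied with $\Delta x$:
+
+```python
+import numpy as np
+
+def f(x):
+    return x[0]**2 + 3*x[1] + np.cos(x[2])
+
+def grad_f(x):
+    # gradient derived earlier: [2 x1, 3, -sin(x3)]
+    return np.array([2*x[0], 3.0, -np.sin(x[2])])
+
+x0 = np.array([1.0, 2.0, 0.5])            # arbitrary expansion point
+x  = x0 + np.array([0.05, -0.02, 0.01])   # a point close to x0
+dx = x - x0
+
+approx = f(x0) + grad_f(x0) @ dx          # first-order Taylor approximation
+print(f(x), approx)                       # the two values should be close
+```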
+