"<div style=\"background-color:#facb8e; color: black; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px; width: 95%\"> <p><em>Note: there is an intermediate variable defined in the cell below <code>mu_V_0</code> that is not needed; the intention was to provide an easier way to calculate the mean of V, as it represents the first term in the Taylor series approximation.</em></p></div>"
]
},
{
"cell_type": "code",
"execution_count": 2,
...
...
@@ -618,6 +626,14 @@
"validate_distribution(10000, 0.01)"
]
},
{
"cell_type": "markdown",
"id": "434afdc5",
"metadata": {},
"source": [
"<div style=\"background-color:#facb8e; color: black; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px; width: 95%\"> <p><em>Note: in the figure above (and widget below) the right-hand plot by default is labeled \"Theoretical quantiles,\" but in fact it is the q value described above (the number of standard deviations).</em></p></div>"
In this notebook you will apply the propagation laws for the mean and variance of a function of two independent random variables. You will assess how well the approximations correspond with the <em>simulation-based</em> equivalents. You will also assess the distribution of the function.
_You do not need to turn in this notebook._
### Objectives
1. Observe how uncertainty "propagates" from the inputs to the output of a function by estimating moments of the function of random variables and seeing how they change relative to the moments of the input random variables.
2. Recognize that a non-linear function of random variables that have the (joint) Normal distribution (the inputs) produces a non-Normal random variable (the output).
3. Use _sampling_ (Monte Carlo Simulation) to _validate_ the linearized error propagation technique introduced in the textbook. Specifically, by:
1. Comparing the estimated moments with that of the sample, and
2. Comparing the Normal distribution defined by the estimated moments to the sample
### A Note on "Sampling"
We will use Monte Carlo Simulation to create an empirical "sample" of the random values of our function of random variables, $V$ (the output). This approach is widely used in science and engineering applications. It is a numerical way of computing the distribution of a function that is useful when analytic approaches are not possible (for example, when the input distributions are non-Normal or the function is non-linear). For our purposes today, Monte Carlo Simulation is quite simple and involves the following steps:
1. Define a function of random variables and the distributions of its input parameters.
2. Create a random sample of each input parameter according to the specified distribution.
3. Create a random sample of the output variable by computing the function for every set of input samples.
4. Evaluate the resulting distribution of the output.
A few key points to recognize are:
1. As the sample size increases, the resulting distribution becomes more accurate.
2. This is a way to get the (approximately) "true" distribution of a function of random variables.
3. Accuracy is relative to the propagation of uncertainty through the function based on the assumed distributions of the input random variables. In other words, MCS can't help you if your function and distributions are poor representations of reality!
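The four steps above can be sketched in a few lines of numpy. This is a minimal illustration using a <em>hypothetical</em> function $g(Y_1, Y_2) = Y_1 \cdot Y_2$ and made-up Normal parameters, not the sewer-pipe case treated below:

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Step 1: a hypothetical function and assumed input distributions
def g(y1, y2):
    """Illustrative function of two random variables (not Manning's formula)."""
    return y1 * y2

mu_1, sigma_1 = 2.0, 0.3   # assumed moments of Y1 (Normal)
mu_2, sigma_2 = 5.0, 0.5   # assumed moments of Y2 (Normal)

# Step 2: create a random sample of each input parameter
N = 10_000
y1 = rng.normal(mu_1, sigma_1, N)
y2 = rng.normal(mu_2, sigma_2, N)

# Step 3: compute the function for every set of input samples
x = g(y1, y2)

# Step 4: evaluate the resulting distribution (here: the first two moments)
print(f"sample mean = {x.mean():.3f}, sample std = {x.std():.3f}")
```

Note that fixing the seed makes the "random" sample reproducible, which is convenient when debugging but hides the run-to-run variability discussed later in this notebook.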
### Application: Sewer Pipe Flow Velocity
We will apply Manning's formula for the flow velocity $V$ in a sewer:
$$
V =\frac{1}{n}R^{2/3}S^{1/2}
$$
where $R$ is the hydraulic radius of the sewer pipe (in $m$), $S$ the slope of the pipe (in $m/m$), and $n$ is the coefficient of roughness [$-$].
Both $R$ and $S$ are random variables, as it is known that sewer pipes are susceptible to deformations; $n$ is assumed to be deterministic and in our case $n=0.013$ $s/m^{1/3}$. The sewer pipe considered here has mean values $\mu_R$, $\mu_S$, and standard deviations $\sigma_R$ and $\sigma_S$; $R$ and $S$ are independent.
We are now interested in the mean flow velocity in the sewer as well as the uncertainty expressed by the standard deviation. This is important for the design of the sewer system.
*Disclaimer: the dimensions of the pipe come from a real case study, but some aspects of the exercise are...less realistic.*
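For reference, Manning's formula is straightforward to evaluate in code. The values of $\mu_R$ and $\mu_S$ below are illustrative placeholders, not the case-study dimensions:

```python
def manning_velocity(R, S, n=0.013):
    """Flow velocity V = (1/n) * R**(2/3) * S**(1/2) (Manning's formula)."""
    return (1 / n) * R ** (2 / 3) * S ** (1 / 2)

# illustrative values (assumptions for this sketch)
mu_R = 0.5    # hydraulic radius [m]
mu_S = 0.015  # slope [m/m]
print(f"V = {manning_velocity(mu_R, mu_S):.3f} m/s")
```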
### Programming
Remember to use your `mude-base` environment when running this notebook.
Some of the functions below use <em>keyword arguments</em> to specify some of the parameter values; this is a way of setting "default" values. You can override a default by passing a different value when calling the function. For example, the function here can be used in the following way to print `5` and `6`, respectively:
```python
def test(x=5):
    return x

print(test())
print(test(x=6))
```
Copy/paste into a cell to explore further!
Note also in the cell below that we can increase the default size of the text in our figures to make them more readable!
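For example, the default font size for all subsequent figures can be raised with a single rcParams update (14 pt here is just an illustrative choice):

```python
import matplotlib.pyplot as plt

plt.rcParams.update({'font.size': 14})  # larger default text in all figures
```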
### Theory: Propagation laws for a function of 2 random variables
We are interested in the mean and variance of $X$, which is a function of two random variables: $X=q(Y_1,Y_2)$. The mean and covariance matrix of $Y=(Y_1,Y_2)$ are assumed to be known.
We are interested to know how the uncertainty in $R$ and $S$ propagates into the uncertainty of the flow velocity $V$. We will first do this analytically and then implement it in code.
Use the Taylor series approximation to find the expression for $\mu_V$ and $\sigma_V$ as a function of $\mu_R$, $\sigma_R$, $\mu_S$, $\sigma_S$. Write your answer on paper or using a tablet; later we will learn how to include images directly in our notebooks! For now you can skip this step, as you are not turning this notebook in.
Complete the function below, such that <code>moments_of_taylor_approximation</code> will compute the approximated $\mu_V$ and $\sigma_V$, as found in the previous Task.
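One possible sketch of what <code>moments_of_taylor_approximation</code> could compute, using the first-order Taylor expansion of Manning's formula (the partial derivatives follow from differentiating $V$ with respect to $R$ and $S$; the input values at the bottom are illustrative assumptions, and this is one way to structure it, not the required solution):

```python
import numpy as np

def moments_of_taylor_approximation(mu_R, sigma_R, mu_S, sigma_S, n=0.013):
    """First-order Taylor approximation of the moments of V = (1/n) R^(2/3) S^(1/2)."""
    # first term of the Taylor series = approximated mean
    mu_V_0 = (1 / n) * mu_R ** (2 / 3) * mu_S ** (1 / 2)

    # partial derivatives evaluated at the mean values
    dVdR = (2 / (3 * n)) * mu_R ** (-1 / 3) * mu_S ** (1 / 2)   # dV/dR
    dVdS = (1 / (2 * n)) * mu_R ** (2 / 3) * mu_S ** (-1 / 2)   # dV/dS

    # variance propagation for independent R and S
    mu_V = mu_V_0
    sigma_V = np.sqrt(dVdR ** 2 * sigma_R ** 2 + dVdS ** 2 * sigma_S ** 2)
    return mu_V, sigma_V

# illustrative (assumed) input moments
mu_V, sigma_V = moments_of_taylor_approximation(0.5, 0.05, 0.015, 0.002)
print(f"mu_V = {mu_V:.3f} m/s, sigma_V = {sigma_V:.3f} m/s")
```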
%% Cell type:markdown id:199753a6 tags:
<div style="background-color:#facb8e; color: black; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px; width: 95%"><p><em>Note: there is an intermediate variable defined in the cell below <code>mu_V_0</code> that is not needed; the intention was to provide an easier way to calculate the mean of V, as it represents the first term in the Taylor series approximation.</em></p></div>
Interpret the figures above, specifically looking at differences between Case 1 and Case 2. Also look at the equations you derived to understand why for Case 1 we get a non-linear relation, and for Case 2 a linear one.
The standard deviation of $V$ is a non-linear function of $\sigma_R$ and $\sigma_S$: the left figure shows how $\sigma_V$ increases as a function of $\sigma_R$ for a given value of $\sigma_S$.
If $\sigma_S$ is zero, there is no uncertainty in the slope of the pipe, and the standard deviation of $V$ becomes a linear function of $\sigma_R$ (right figure). The uncertainty of $V$ is smaller now, since it only depends on the uncertainty in $R$.
Furthermore, it is assumed that $R$ and $S$ are independent, Normally distributed random variables. We will generate at least 10,000 simulated realizations each of $R$ and $S$ using a random number generator; you will then use these to calculate the corresponding sample values of $V$ and find the moments of that sample.
Complete the functions <code>function_of_random_variables</code> and <code>get_samples</code> below to define the function of random variables and then generate a sample of the output from this function, assuming the inputs are also random variables with the Normal distribution. Then find the moments of the samples.
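A possible sketch of these two functions (the identifiers follow the task description, but the specific moments and seed below are assumptions for illustration only):

```python
import numpy as np

def function_of_random_variables(R, S, n=0.013):
    """Manning's formula; works element-wise on arrays of samples."""
    return (1 / n) * R ** (2 / 3) * S ** (1 / 2)

def get_samples(N, mu_R, sigma_R, mu_S, sigma_S, rng=None):
    """Draw N independent Normal samples of R and S and return samples of V."""
    rng = np.random.default_rng() if rng is None else rng
    R = rng.normal(mu_R, sigma_R, N)
    S = rng.normal(mu_S, sigma_S, N)
    return function_of_random_variables(R, S)

# illustrative moments; fixed seed only so the sketch is reproducible
V_samples = get_samples(10_000, 0.5, 0.05, 0.015, 0.002,
                        rng=np.random.default_rng(seed=42))
print(f"sample mean = {V_samples.mean():.3f} m/s, "
      f"sample std = {V_samples.std():.3f} m/s")
```

In the notebook itself you would normally omit the seed, so that re-running the cell draws a fresh sample, which is exactly the behavior the next task asks you to observe.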
Are the results similar for the linearized and simulated values? Describe the difference quantitatively. Check your result also for the range of values of $\sigma_R$ from 0.01 to 0.10; are they consistent?
The estimated moments are both generally within 0.01 m/s of the result from the simulation when $\sigma_R=0.05$ m. When $\sigma_R$ changes, the mean, $\mu_V$, is affected much less than the standard deviation, $\sigma_V$. When $\sigma_R=0.01$ m the Taylor approximation is <em>higher</em> by 0.15 m/s; when $\sigma_R=0.1$ m the Taylor approximation is <em>lower</em> by 0.30 m/s.
Run the cell with the sampling algorithm above repeatedly and look at the values printed in the cell output. Which values change? Which values do <em>not</em> change? Explain why, in each case.
The moments calculated from the sample change because they are computed from random values sampled from the distribution; thus every time we run the code the samples are slightly different. The sample size seems large enough to give accurate results for our purposes, since the values don't change significantly (i.e., less than 1%; less than the difference between the two estimates of the moments). The moments calculated from the Taylor series approximation remain fixed, as they are based on moments of the input variables, not randomly generated samples.
%% Cell type:markdown id:7cfc8c41 tags:
## Part 3: Validating the Moments with a Distribution
In Part 2 we used a sample of the function of random variables to _validate_ the Taylor approximation (we found that the moments are generally well approximated). Now we will assume that the function of random variables has the Normal distribution to validate the moments and see for which range of values they remain a good approximation. This is done by comparing the sample to the assumed distribution; the former is represented by a histogram (also called an empirical probability density function, PDF, when normalized), the latter by a Normal distribution with moments calculated using the Taylor approximation.
We will also use a normal probability plot to assess how well the assumption that $V$ is normally distributed holds up while varying the value of $\sigma_R$, introduced next.
### Theoretical Quantiles with `probplot`
The method `probplot` is built into `scipy.stats` (Documentation [here](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.probplot.html)) and _validates_ a probability model by comparing samples (i.e., data) to a theoretical distribution (in this case, Normal). The "theoretical quantile" plotted on the x-axis measures the distance from the median of the distribution, normalized by the standard deviation, so that a quantile $q$ corresponds to the value $\mu + q\cdot\sigma$. For example, $q=-1.5$ corresponds to $\mu-1.5\cdot\sigma$. The vertical axis is the value of the random variable.
Because we are comparing a theoretical distribution and a sample (data) on the same plot, one of the lines is the Normal distribution itself, which of course matches the _theoretical quantiles_ exactly. This is why the Normal distribution plots as a straight line in `probplot`. Comparing the (vertical) distance between the samples and the theoretical distribution (the red line) allows us to _validate_ our model. In particular, it allows us to validate the model for different regions of the distribution. In your interpretation, for example, you should try to identify whether the model is a good fit for the center and/or tails of the distribution.
Note that `probplot` needs to know what to use for samples (you will tell it this), and what type of theoretical distribution you are using (we already did this for you...`norm`).
Complete the function <code>validate_distribution</code> below (instructions are in the docstring) to plot the empirical probability density function (PDF) of $V$ using your simulated samples. Also plot the Normal PDF in the same figure using the moments computed from the error propagation law.
%% Cell type:markdown id:eea242ab tags:
<div style="background-color:#facb8e; color: black; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px; width: 95%"><p><em>Hint: if you are struggling with the code below, re-read the introduction to Part 3 carefully!</em></p></div>
``` python
# NOTE IN PARTICULAR WHICH mu AND sigma ARE USED!
probplot(V_samples, dist=norm, fit=True, plot=ax[1])
ax[1].legend(['Generated samples', 'Normal fit'])
ax[1].get_lines()[1].set_linewidth(2.5)
plt.show()

validate_distribution(10000, 0.01)
```
%% Output
%% Cell type:markdown id:434afdc5 tags:
<div style="background-color:#facb8e; color: black; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px; width: 95%"><p><em>Note: in the figure above (and widget below) the right-hand plot by default is labeled "Theoretical quantiles," but in fact it is the q value described above (the number of standard deviations).</em></p></div>
### Validate the Distribution of $V$ for Various $\sigma_R$
The code below uses a widget to call your function to make the plots and add a slider to change the values of $\sigma_R$ and visualize the change in the distributions.
Run the cell below, then play with the slider to change $\sigma_R$. How well does the error propagation law match the "true" distribution (the samples)? State your conclusion and explain why. Check also whether there is an impact for different $\sigma_R$ values.
Using a different value for $\sigma_R$ has several impacts:
- For a larger $\sigma_R$, $\sigma_V$ becomes larger (we saw this in Task 2) and $\mu_V$ becomes smaller.
- For a larger $\sigma_R$, the PDF becomes wider and less peaked (we saw this in Task 2).
- Depending on the value of $\sigma_R$, the more <em>extreme</em> values of the random variable $V$ deviate from what the Normal distribution predicts, as evidenced by the distance between the blue dots and the red line.
- $V$ does not follow a normal distribution; it is skewed slightly and the tails deviate most; the amount of deviation depends on the values of $\sigma_R$ and $\sigma_S$.
The reason for this is that $V$ is a non-linear function of the Normally distributed random variables $R$ and $S$; due to this non-linearity, $V$ will <em>not</em> be Normally distributed.
To further validate the error propagation law, estimate the probability that the "true" velocity in the sewer pipe falls in the range of values where the approximation is inaccurate (assuming that the Normal distribution is a suitable model for the distribution of the function of random variables).
<em>Hint: recall that in this notebook a quantile, $q$, is defined as a standard deviation, and that the probability of observing a random variable $X$ such that $P[X\lt q]$ can be found with the CDF of the standard Normal distribution, evaluated in Python using <code>scipy.stats.norm.cdf(q)</code>.</em>
As we saw in 3.2, the Normal distribution does not fit the tails of the distribution of the function of random variables perfectly. Although this means using the Normal distribution may not be a good way of estimating the probability of "being in the tails," the approach in this Task is still suitable for getting an idea of the order of magnitude, and observing how severe this "error" may be for different assumptions of $\sigma_R$.
%% Cell type:code id:c7a8d678 tags:
``` python
# p = YOUR_CODE_HERE
# Solution:
p = 2 * norm.cdf(-3)
print(f'The probability is {p:0.6e}')
# extra solution
print(f'The probability is {2*norm.cdf(-2.5):0.6e} for sigma_R = 0.01')
print(f'The probability is {2*norm.cdf(-3.0):0.6e} for sigma_R = 0.05')
print(f'The probability is {2*norm.cdf(-3.5):0.6e} for sigma_R = 0.10')
```
%% Output
The probability is 2.699796e-03
The probability is 1.241933e-02 for sigma_R = 0.01
The probability is 2.699796e-03 for sigma_R = 0.05
The probability is 4.652582e-04 for sigma_R = 0.10
Since the probability plots give the quantiles, we simply need to recall that the probability of being in the "tails" is twice the probability of being less than the negative quantile; the factor of two accounts for both tails. As we can see, the probability is generally small (on the order of 1% or less). It is interesting that the probability decreases by a factor of roughly 25 as the value of $\sigma_R$ increases from 0.01 to 0.10.