Commit 9fe5e28a authored by Robert Lanzafame
Merge branch '5-continuous-distribution-outline' into 'main'

Resolve "continuous distribution outline"

Closes #5

See merge request !16
parents abfb2404 7e434e0e
Pipeline #204001 passed
Showing 414 additions and 0 deletions
@@ -12,6 +12,23 @@ parts:
- file: sandbox/ObservationTheory/01_Introduction.md
sections:
- file: sandbox/ObservationTheory/02_Notebook_LeastSquares_dont_execute.ipynb
- file: sandbox/prob/prob-intro
title: Probability
sections:
- file: sandbox/prob/prob-background
- file: sandbox/prob/prob-rv
- file: sandbox/prob/prob-discrete
- file: sandbox/prob/prob-notation
title: Notation
- file: sandbox/continuous/Reminder_intro.md
sections:
- file: sandbox/continuous/PDF_CDF
- file: sandbox/continuous/empirical
- file: sandbox/continuous/Param_distr.md
- file: sandbox/continuous/other_distr.md
- file: sandbox/continuous/Parameterization.md
- file: sandbox/continuous/fitting.md
- file: sandbox/continuous/GOF
- caption: MUDE Cookbook
chapters:
- file: cookbook/blank
# Goodness of Fit
In the previous sections you studied the different mathematical models (continuous distribution functions) that we can use to model the univariate uncertainty of a random variable, and you were introduced to some methods to fit those models to observations. **But how do I choose between different models?**

The choice of an appropriate distribution function should be based first on the **physics of the random variable** we are studying. For instance, if I am studying the concentration of a gas in the atmosphere, negative values have no physical meaning, so the selected distribution function should not produce such estimates.

Once we have accounted for the physical characteristics of the random variable, we can make use of **goodness of fit (GOF) techniques** to support our decision. That is, GOF techniques are not a ground truth, but an objective way of comparing models. Different techniques may lead to different judgments, and it is you, as the expert, who has to weigh those outputs and select the best model. It is therefore recommended to use more than one GOF technique in the decision-making process. In the subsequent sections, some GOF techniques commonly used in statistics are presented.

To illustrate these techniques, the following toy example will be used. The set of observations is represented in the plots below by its PDF and CDF. A Gaussian distribution ($N(5.17, 5.76)$) and an Exponential distribution ($Expon(-5.25, 10.42)$) are fitted to the data. GOF techniques will be applied to determine which of the two models fits the data best.
```{figure} /sandbox/continuous/figures/GOF_data.png
---
---
Data overview.
```
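The parameter values quoted above can be obtained with `scipy`; the sketch below is illustrative only, since the actual sample behind the figure is not given here, so a synthetic stand-in is generated.

```python
import numpy as np
from scipy import stats

# stand-in sample (the actual observations behind the figure are not given here)
rng = np.random.default_rng(42)
observations = rng.normal(5.17, 5.76, size=500)

# maximum likelihood fits of both candidate models (scipy loc/scale convention)
mean_gaussian, sd_gaussian = stats.norm.fit(observations)
loc_expon, scale_expon = stats.expon.fit(observations)
```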
## Graphical methods
GOF graphical methods are useful tools to get a first intuition of how different models perform and to confirm the results of other quantitative analyses. Here, you are introduced to three techniques: (1) the QQ-plot, (2) the log-scale plot, and (3) the probability plot.
### QQ-plot
This technique is as simple as comparing the observations used to fit the model with the predictions of the model. Typically, the observations are plotted on the x-axis and the predictions on the y-axis, so a perfect fit would fall on the $45 ^\circ$-line.

Let's apply it to the example data. Note that the term *"quantile"* is used in statistics to denote the values of the random variable.
```{figure} /sandbox/continuous/figures/QQplot.png
---
scale: 75%
name: QQplot
---
QQ-plot.
```
The QQ-plot shows that the predictions given by the Gaussian distribution (in blue) closely follow the $45 ^\circ$-line, while those provided by the Exponential distribution are much further away, detaching significantly from the $45 ^\circ$-line in the upper tail. Based on this graphical technique, we can conclude that the Normal distribution seems to be the better model for the data.
**Let's code it!**
The following Python sketch illustrates the procedure to build a QQ-plot.
```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# `observations` is assumed to be a 1D array, e.g. the stand-in sample above

# empirical non-exceedance probabilities of the sorted observations;
# ranks are divided by n + 1 so that p < 1 and the inverse cdf stays finite
q_emp = np.sort(observations)
n = len(q_emp)
p_emp = np.arange(1, n + 1) / (n + 1)

# parameters of the fitted Gaussian distribution
mean_gaussian = 5.17
sd_gaussian = 5.76
# quantiles predicted by the Gaussian distribution (inverse cdf of p_emp)
q_gaussian = stats.norm.ppf(p_emp, loc=mean_gaussian, scale=sd_gaussian)

# parameters of the fitted Exponential distribution
loc_expon = -5.25
scale_expon = 10.42
# quantiles predicted by the fitted Exponential distribution
q_exponential = stats.expon.ppf(p_emp, loc=loc_expon, scale=scale_expon)

# QQ-plot: empirical quantiles against model quantiles
plt.scatter(q_emp, q_gaussian, label='Gaussian')
plt.scatter(q_emp, q_exponential, label='Exponential')
plt.legend()
```
### Log-scale
As previously introduced, the tails of a distribution are key to inferring values which have not yet been observed. Therefore, it is important to check whether the distribution used to model the observations performs properly in that region. A simple trick to do so is to use a logarithmic scale (log-scale) for the exceedance probability plot. That way, we "zoom in" on the points in the tail instead of focusing on the bulk of the data. The figure below shows the exceedance probability plot in both regular and log scale.
```{figure} /sandbox/continuous/figures/log-scale.png
---
name: log-scale
---
Exceedance probability plot represented both in regular and logarithmic scale.
```
Analyzing the left panel, we can see that the observations follow the Normal distribution more closely. However, it is not clear which of the two distributions performs better in the tail. The right panel answers that question: overall, the data points again follow the Gaussian distribution more closely, but the observations in the tail are not well represented by the Gaussian distribution and actually lie closer to the Exponential distribution. Since neither of the considered distributions performs properly in the tail, it may be necessary to consider another distribution that can model the asymmetry of the data, such as the Gumbel or Lognormal distribution.
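As a minimal sketch, such a plot can be produced by drawing the empirical and modelled exceedance probabilities on a logarithmic y-axis (reusing `observations` and the fitted parameters from the QQ-plot code above):

```python
# exceedance probability plot in log scale (continues the QQ-plot sketch above)
x = np.sort(observations)
exc_emp = 1 - np.arange(1, len(x) + 1) / (len(x) + 1)

plt.semilogy(x, exc_emp, 'k.', label='observations')
plt.semilogy(x, stats.norm.sf(x, loc=mean_gaussian, scale=sd_gaussian), label='Gaussian')
plt.semilogy(x, stats.expon.sf(x, loc=loc_expon, scale=scale_expon), label='Exponential')
plt.xlabel('value of the random variable')
plt.ylabel('exceedance probability')
plt.legend()
```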
### Probability plot or probability paper
This graphical technique consists of adapting the axes of the CDF plot to the fitted parametric distribution so that the CDF appears as a straight line. That is, the CDF of any Gaussian distribution will plot as a straight line on Normal probability paper. The x-axis then shows a function of the values of the random variable, while the y-axis shows a function of the non-exceedance probabilities.
Let's see it with the example of the Exponential distribution. Its CDF is given by

$$
F(x) = 1 - \exp(-\lambda[x-\mu])
$$

where $\lambda$ is the rate parameter (the inverse of the scale parameter) and $\mu$ is the location parameter. A transformation is applied to the CDF so that a linear relationship is established between the value of the random variable $X$ and (a function of) the non-exceedance probability. In the case of the Exponential distribution, it is just a matter of taking the natural logarithm of both sides of the equation:

$$
\ln[1-F(x)] = -\lambda[x-\mu]
$$

In this manner, there is a linear relationship between $\ln[1-F(x)]$ and $x$. Note that for the Exponential distribution the probability plot is the same as the log-scale plot! That is why the Exponential distribution appeared as a straight line in the previous plot, while the Gaussian distribution did not.
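As a sketch (again continuing from the code above), the Exponential probability plot amounts to plotting $\ln[1-\hat{F}(x)]$ against the sorted observations; the closer the points are to a straight line, the better the Exponential model fits:

```python
# exponential probability paper (continues the sketch above)
x = np.sort(observations)
F_emp = np.arange(1, len(x) + 1) / (len(x) + 1)

plt.plot(x, np.log(1 - F_emp), '.')
plt.xlabel('value of the random variable')
plt.ylabel('ln[1 - F(x)]')
```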
## Formal hypothesis test: Kolmogorov-Smirnov test
The Kolmogorov-Smirnov (KS) test is one of the most popular nonparametric formal hypothesis tests in statistics. It can be used for two purposes: (1) to compare a sample with a reference parametric distribution, and (2) to compare two samples. Here, the first option is considered, since it is the one used for GOF purposes. The test thus aims to determine how likely it is that the sample was drawn from the reference parametric distribution.
This test is based on the KS statistic, which is (roughly) the maximum distance between the empirical cumulative distribution and the parametric distribution fitted to the observations. The statistic is mathematically defined as

$$
D_n = \sup_x|\hat{F}(x)-F(x)|
$$

where $D_n$ is the KS statistic, $\sup_x$ is the supremum of the set of distances (intuitively, the largest absolute difference between the two distribution functions across all values of the random variable $X$), $\hat{F}(x)$ is the empirical cumulative distribution and $F(x)$ the fitted parametric cumulative distribution.
Once $D_n$ is computed, a formal hypothesis test is performed. The null hypothesis is that the sample underlying $\hat{F}$ was drawn from the distribution $F$. In mathematical terms:

$$
H_0: \hat{F} \sim F
$$

The distribution of $D_n$ under the null hypothesis has already been derived and is implemented in statistical software packages; conveniently, it does not depend on which continuous parametric distribution is being tested. From it we can compute the $p$-value: the probability of observing a distance at least as large as the computed $D_n$ if the null hypothesis is true. A significance level (typically $\alpha=0.05$) is selected as the threshold for rejecting the null hypothesis: if the $p$-value is below $\alpha$, $H_0$ is rejected, and we conclude that the sample does not come from the fitted parametric distribution.
Let's see it in an example. The figure below shows both the empirical distribution (step function) and the fitted Normal distribution. The maximum distance between the two distributions is marked in red.
```{figure} /sandbox/continuous/figures/sketch_KS.png
---
name: KS
---
Maximum distance between the empirical and fitted normal distribution ($D_n$).
```
If we compute the KS statistic using the implementation available in software (the Scipy package, in this case), we obtain $D_n = 0.12$, which (roughly) corresponds to what is shown in the previous plot.

The $p$-value is then computed, yielding $p = 0.93$: if the null hypothesis ($H_0: \hat{F} \sim F$, the sample comes from the parametric distribution) is true, the probability of observing a distance at least this large is 0.93. Considering a significance level $\alpha = 0.05$, we have $0.93 > 0.05$, so we cannot reject the null hypothesis.
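In code, the whole test is a one-liner; a minimal sketch with `scipy` (reusing `observations` and the fitted Gaussian parameters from before):

```python
from scipy import stats

# KS test of the sample against the fitted Gaussian distribution
result = stats.kstest(observations, 'norm', args=(5.17, 5.76))
print(result.statistic, result.pvalue)  # D_n and p-value, e.g. 0.12 and 0.93 above
```

Note that the tabulated KS distribution strictly applies when the reference distribution is specified independently of the data; when its parameters are estimated from the same sample, as here, the resulting $p$-value is only approximate.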
# PDF and CDF
## Probability Density Function (PDF)
To mathematically describe the distribution of probability for a continuous random variable, we define the probability density function (PDF) of $X$ as $f_X(x)$, such that
$$
f_X(x)dx = P(x < X \leq x + dx)
$$
To qualify as a probability distribution, the function must satisfy the conditions $f_X(x) \geq 0$ and $\int_{-\infty}^{+\infty}f_X(x)dx =1$, which can be related to the axioms. Note that in this case we use lower case $x$ as the argument of the PDF, and upper case $X$ denotes the random variable. Similarly, the function $f_Y(u)$ describes the PDF of the random variable $Y$.
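As a quick numerical illustration of the second condition (a sketch using `scipy`, with the standard Gaussian chosen arbitrarily as the example PDF):

```python
import numpy as np
from scipy import stats, integrate

# a pdf must integrate to 1 over the whole real line
total, _ = integrate.quad(stats.norm.pdf, -np.inf, np.inf)
print(total)  # 1.0 (up to numerical precision)
```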
## Cumulative Distribution Function (CDF)
It’s important to realize that while the PDF describes the distribution of probability across all values of the random variable, probability density is not equivalent to probability. The density allows us to quantify the probability of a certain interval of the continuous random variable, through integration. The equation below shows the mathematical relationship between the CDF (denoted here as $F(x)$) and the PDF (denoted as $f(x)$).

$$
F(x) = \int_{-\infty}^{x}f(u)\,du
$$

The definition of the CDF includes an integral that begins at negative infinity and continues to a specific value, $x$, which defines the interval over which the probability is computed. In other words, **the CDF gives the probability that the random variable $X$ has a value less than or equal to $x$**.
It should be easy to see from the definition of the CDF that the probability of observing an exact value of a continuous random variable is exactly zero. This is an important observation, and also an important characteristic that separates continuous and discrete random variables.
## PDF and CDF of Gaussian distribution
You have already used one parametric distribution extensively! Does 'the bell curve' ring a bell? During your BSc, you have probably used the Normal, or Gaussian, distribution, whose PDF has a bell shape. The PDF of the Normal distribution is given by
$$
f(x) = \frac{1}{\sigma \sqrt{2 \pi}}e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}
$$
where $x$ is the value of the random variable and $\mu$ and $\sigma$ are the two parameters of the distribution. Through this equation we establish a relationship between the probability densities and the values of the random variable.

In the case of the Normal distribution, the parameters $\mu$ and $\sigma$ correspond to the mean and standard deviation of the random variable. However, this is not the case for all distributions, and it also depends on how the distribution is parameterized.

As you have seen, the previous expression provides probability densities, so we need to integrate it to obtain actual probabilities (non-exceedance probabilities) through the CDF. In the case of the Normal distribution, there is no closed-form expression for the CDF (the integral must be evaluated numerically).
Let's see how the distribution looks. In the figure below, the PDF and CDF of the Gaussian distribution are shown for different values of its parameters. In the PDF plot, you can see the bell shape that was already mentioned.
```{figure} /sandbox/continuous/figures/gaussian.png
---
scale: 75%
name: gaussian_distr
---
Gaussian distribution function: PDF and CDF.
```
As shown in the legend, the black and blue lines have the same value of the standard deviation ($\sigma$), so in the PDF plot the width of the bell is the same. However, they have different values of the mean ($\mu$), which acts as a location parameter: increasing the mean moves the distribution to the right, making higher values of the random variable more likely. You can also see this in the CDF plot. The distribution moves to the right, so for a given value $x = 2$, $F(2) \approx 0.98$ for the black line and $F(2) \approx 0.84$ for the blue line.

The standard deviation ($\sigma$) can be interpreted as the dispersion around the mean ($\mu$). You can see in the PDF plot that the red distribution is wider than the black and blue ones, since its standard deviation is twice that of the other two. The effect is also visible in the CDF plot, where the slope of the red curve is gentler than those of the black and blue curves.
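Because the Gaussian CDF has no closed form, these probabilities are evaluated numerically in practice. A sketch with `scipy`, assuming for illustration that the black line is $N(0,1)$ and the blue line is $N(1,1)$ (parameter values consistent with the numbers quoted above):

```python
from scipy import stats

print(stats.norm.cdf(2, loc=0, scale=1))  # ~0.977, black line
print(stats.norm.cdf(2, loc=1, scale=1))  # ~0.841, blue line
```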
## Probability of other intervals
We saw that the CDF provides the non-exceedance probabilities, that is, the probability of the interval $(-\infty, x]$. But what happens if we are interested in the probabilities of other intervals? It is common to be interested in the probability of exceeding a value: for instance, wind speeds above a threshold can damage a structure, or concentrations of a nutrient above a threshold can lead to eutrophication. In that case we want to integrate from a value $x$ to $\infty$. Here the probability axioms make this easy, since the PDF integrates to 1 over the sample space of the random variable:
$$
\int_x^{+\infty}{f(u)\,du} = 1 - \int_{-\infty}^x{f(u)\,du} = 1 - F(x)
$$
The figure below shows both the CDF and the complementary CDF.
```{figure} /sandbox/continuous/figures/survival.png
---
scale: 75%
name: survival_gaussian
---
Gaussian distribution function: CDF and survival function (complementary CDF).
```
Thus, the *exceedance probabilities* can be computed directly by subtracting the non-exceedance probabilities obtained from the CDF from 1. The result is called the *complementary CDF*, although this function has many alternative names. The name *survival function* may sound odd due to its positive connotation, but it is appropriate when the random variable describes, for example, the lifetime of a structure.

Another interval of interest is that between two values, $x_1$ and $x_2$ (where $x_2>x_1$). The CDF value $F(x_2)$ gives the probability of values below $x_2$, but this includes the values below $x_1$ as well. We therefore need to subtract $F(x_1)$ from $F(x_2)$ to obtain the probability of being in the interval $[x_1, x_2]$. In mathematical terms:
$$
\int_{x_1}^{x_2}{f(u)}\,du = \int_{-\infty}^{x_2}{f(u)}\,du - \int_{-\infty}^{x_1}{f(u)}\,du = F(x_2)-F(x_1)
$$
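Both computations are one-liners once a distribution is fitted; a sketch for a standard Gaussian (an arbitrary example):

```python
from scipy import stats

dist = stats.norm(loc=0, scale=1)

# exceedance probability P(X > 1) via the survival function
print(dist.sf(1))                    # identical to 1 - dist.cdf(1)

# probability of the interval [x1, x2]
x1, x2 = -1, 1
print(dist.cdf(x2) - dist.cdf(x1))   # ~0.683 for the standard Gaussian
```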
## Inverse CDF
Often, regulations and guidelines require us to design a structure or system for a value which is not exceeded more than $p$ percent of the time. We then face the opposite problem: what is the value of the random variable, $x$, whose non-exceedance probability has a specified value, $p$? The solution is the inverse of the CDF, $x = F^{-1}(p)$. When the CDF has a closed form, this equation can be inverted analytically; otherwise, numerical routines are readily available.
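In `scipy`, the inverse CDF is exposed as the `ppf` (percent point function); a minimal sketch for the standard Gaussian:

```python
from scipy import stats

p = 0.95
x_design = stats.norm.ppf(p, loc=0, scale=1)
print(x_design)  # ~1.645: the value with non-exceedance probability 0.95
```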
# Parametric Distributions
In the previous section, you were introduced to the concepts of a random variable, the probability density function (PDF) and the cumulative distribution function (CDF), and how to compute them from empirical data. Here, the concept of a parametric distribution as a model of the observed empirical distribution is introduced.

Parametric distribution functions are mathematical models for the empirical distributions that we observe in our data. That is, a parametric CDF is just an equation which relates the non-exceedance probability to the value of the studied random variable. This equation has some parameters or coefficients that need to be fitted using our observations.
**But why do we need them?**
We typically fit a parametric distribution to our data for several reasons. The most important one is that the empirical distribution is limited to the observations we have: using the empirical CDF, we can interpolate between the observed values, but we cannot extrapolate beyond them to infer probabilities higher or lower than those we have observed.

Another good reason to fit a parametric distribution is practical: an equation gives us all the power of analytical solutions and is very easy to transfer and handle. We can also make use of the properties of the fitted distribution to gain further insight into the random variable we are studying.
## Gaussian revisited
```{note}
Explicit illustration of parameters of the Gaussian.
```
## Generalized parameterization
Location, scale, and shape parameterization (to be written).
# Continuous Distributions
```{note}
Check what's in the probability chapter and include a short overview of what's there and what's here. Depending on what the random-variable chapter includes, this page should show the Normal distribution and point out that we will take this concept MUCH further in this chapter. It shows the PDF/CDF, but the theory is in the next section.
```
```{note}
This page needs to explicitly state that we will be using continuous parametric distributions as models, and have a strong link to the modeling terminology and framework that will be added elsewhere in the MUDE book.
```
# Empirical Distributions
As you can imagine, it is possible to define a PDF and a CDF based on observations. Let's see it with an example dataset of wind speeds measured close to Schiphol Airport. The figure below shows the dataset, which spans one year.
```{figure} /sandbox/continuous/figures/data_overview.png
---
scale: 100%
name: data_wind
---
Time series of wind speed close to Schiphol Airport.
```
Let's start by computing the empirical CDF. We need to assign a non-exceedance probability to each observation. To do so, we sort the observations and compute the non-exceedance probabilities from the ranks. This is illustrated with the Python sketch below.
```python
import numpy as np
import matplotlib.pyplot as plt

# read the observations into a 1D array (placeholder file name)
observations = np.loadtxt('wind_speeds.txt')

# sort the observations in ascending order
x = np.sort(observations)

# calculate the non-exceedance probability of each ranked observation
n = len(x)
non_exc_prob = np.arange(1, n + 1) / n

# plot the ecdf as a step function
plt.step(x, non_exc_prob, where='post')
```
Using the above algorithm, the following figure is obtained. Note that empirical CDFs are usually plotted using a step plot.
```{figure} /sandbox/continuous/figures/ecdf_wind.png
---
scale: 75%
name: ecdf
---
Empirical cumulative distribution function of the wind speed data.
```
It can be useful to also visualize the empirical PDF. As mentioned above, the PDF is the derivative of the CDF, leading to the following equation.
$$
f(x) = F'(x) = \lim_{\Delta x \to 0} \frac{F(x+\Delta x)-F(x)}{\Delta x}
$$
Thus, we can compute the empirical PDF by assuming a bin size. We count the number of observations in each bin and calculate the relative frequency of each bin by dividing that count by the total number of observations. The density is then the relative frequency divided by the bin size. This process is illustrated with the following Python sketch [^density].
```python
import numpy as np
import matplotlib.pyplot as plt

# read the observations into a 1D array (placeholder file name)
observations = np.loadtxt('wind_speeds.txt')

# assume a bin size
bin_size = 2

# calculate the bin edges given the bin size, covering the full data range
min_value = np.floor(observations.min())
max_value = np.ceil(observations.max())
n_bins = int(np.ceil((max_value - min_value) / bin_size))
bin_edges = min_value + bin_size * np.arange(n_bins + 1)

# count the number of observations in each bin
count = np.zeros(n_bins)
for i in range(n_bins):
    in_bin = (observations >= bin_edges[i]) & (observations < bin_edges[i + 1])
    count[i] = in_bin.sum()

# relative frequency of each bin
freq = count / len(observations)

# density: relative frequency divided by the bin size
densities = freq / bin_size

# plot the empirical pdf as a bar chart centred on the bins
mid_points = (bin_edges[1:] + bin_edges[:-1]) * 0.5
plt.bar(mid_points, densities, width=bin_size)
```
Using the above algorithm, the following figure is obtained. We can see that most of the density is concentrated between 2 and 9 m/s.
```{figure} /sandbox/continuous/figures/epdf_wind.png
---
scale: 75%
name: epdf
---
Empirical probability density function of the wind speed data.
```
[^density]: Fortunately, in most programming languages the algorithm to compute the empirical PDF is already implemented, and we only need to plot a histogram with the option to show densities (e.g., `plt.hist(observations, density=True)` in matplotlib).
New image files added in this commit:

- book/sandbox/continuous/figures/GOF_data.png (22.9 KiB)
- book/sandbox/continuous/figures/QQplot.png (41.3 KiB)
- book/sandbox/continuous/figures/data_overview.png (72.2 KiB)
- book/sandbox/continuous/figures/ecdf_wind.png (17.2 KiB)
- book/sandbox/continuous/figures/epdf_wind.png (18.2 KiB)
- book/sandbox/continuous/figures/gaussian.png (66.6 KiB)
- book/sandbox/continuous/figures/log-scale.png (59.6 KiB)
- book/sandbox/continuous/figures/sketch_KS.png (23.4 KiB)
- book/sandbox/continuous/figures/survival.png (35.3 KiB)
# Fitting a Distribution
In the previous section, distr.
## L-moments
Idea: set the moments of the distribution equal to the moments of the data.
## MLE
MLE video
## Quiz/exercises
practice
# Non-Gaussian distributions
In the previous section, Gaussian.
## Concept of tail, asymmetry
Tail
## Lognormal distribution
for instance
## Exponential distribution
for instance
## Gumbel distribution
(The list of distributions is tentative.)
# Probability Theory
The discussion about what probability is, as well as the interpretation of what it means, has been going on for centuries, and we won’t cover it here. Luckily, most modern perspectives on probability are compatible and can be traced back to a few key fundamental concepts, known as the axioms of probability. These three axioms are summarized here:
1. The probability of any event lies in the interval $[0,1]$.
2. The set of all possible outcomes has probability 1.0.
3. If two events are mutually exclusive, the probability that either of them occurs (the union) is the sum of their probabilities: $P(A \cup B)=P(A)+P(B)$.
While they may seem simple, these axioms are precise mathematical statements that provide the basis for a number of theorems and proofs which allow us to apply probability theory to a wide range of applications.
# Discrete Case
This section will point out the existence and use of key rules (set theory, events) and discrete distributions, but then largely set them aside, as MUDE focuses on the continuous case.
# Probability: an Introduction and Refresher
In this chapter, a reminder of the basic concepts of probability theory is given, focusing on the following:
- The axioms and rules of probability
- The random variable, introduced with the Gaussian (or Normal) distribution
- A quick shout-out to discrete distributions
In a separate chapter, we focus on key concepts for applying probability to engineering problems, where our fundamental tools will be continuous distributions and related concepts: the probability density function (PDF), the cumulative distribution function (CDF) and empirical distributions.
## Probability Theory
Fundamental probability concepts are a prerequisite for this course. While we try to explain key concepts throughout the course materials, you should refer to an appropriate textbook if something is unclear, or if you desire further explanation or mathematical proofs. The [online course](https://tudelft-citg.github.io/learn-probability/intro_in_toc.html) we prepared for you is also available.