"This workshop focuses fundamental skills for building, using and understanding multivariate distribtutions: in particular when our variables are no longer statistically independent.\n",
"This workshop focuses fundamental skills for building, using and understanding multivariate distributions: in particular when our variables are no longer statistically independent.\n",
"\n",
"For our case study we will use a Thingamajig: an imaginary object for which we have limited information. One thing we _do_ know, however, is that it is very much influenced by two random variables, $X_1$ and $X_2$: high values for these variables can cause the Thingamajig to fail. We will use a multivariate probability distribution to compute the probability of interest under various _cases_ (we aren't sure which one is relevant, so we consider them all!). We will also use a comparison of distributions drawn from our multivariate probability model with the empirical distributions to validate the model.\n",
"\n",
...
...
@@ -42,7 +42,7 @@
"f_{X_1,X_2}(x_1,x_2)\n",
"$$\n",
"\n",
"This distribution is implemented in `scipy.stats.multivariate_normal`. The bivariate normal distribution is defined by 5 parameters: the parameters of the Gaussian distribution for $X_1$ and $X_2$, as well as the correlation coefficient between them, $\\rho_{X_1,X_2}$. In this case we often refer to $X_1$ and $X_2$ as the marginal variables (univariate) and the bivariate distribtution as the joint distribution. We will use the bivariate PDF to create contour plots of probability density, as well as the CDF to evaluate probabilities of different cases:\n",
"This distribution is implemented in `scipy.stats.multivariate_normal`. The bivariate normal distribution is defined by 5 parameters: the parameters of the Gaussian distribution for $X_1$ and $X_2$, as well as the correlation coefficient between them, $\\rho_{X_1,X_2}$. In this case we often refer to $X_1$ and $X_2$ as the marginal variables (univariate) and the bivariate distribution as the joint distribution. We will use the bivariate PDF to create contour plots of probability density, as well as the CDF to evaluate probabilities of different cases:\n",
"\n",
"$$\n",
"F_{X_1,X_2}(x_1,x_2)\n",
...
...
@@ -101,7 +101,7 @@
"3. Do something with the samples (deterministic calculation) \n",
"4. Evaluate the results: e.g., “empirical” PDF, CDF of samples, etc.\n",
"\n",
"_Note that as we now have a multivariate distribtution we can no longer sample from the univariate distributions independently!_\n",
"_Note that as we now have a multivariate distribution we can no longer sample from the univariate distributions independently!_\n",
"\n",
"### Task 3: Validating the Bivariate Distribution\n",
"<b>Task 3.1:</b> first the histogram. Note that you probably already computed the samples in Part 2.\n",
"<b>Task 3.1:</b> Plot histograms of $Z$ based on the Monte Carlo samples, and based on the data. Note that you probably already computed the samples in Part 2.\n",
This workshop focuses fundamental skills for building, using and understanding multivariate distribtutions: in particular when our variables are no longer statistically independent.
This workshop focuses on fundamental skills for building, using and understanding multivariate distributions: in particular when our variables are no longer statistically independent.
For our case study we will use a Thingamajig: an imaginary object for which we have limited information. One thing we _do_ know, however, is that it is very much influenced by two random variables, $X_1$ and $X_2$: high values for these variables can cause the Thingamajig to fail. We will use a multivariate probability distribution to compute the probability of interest under various _cases_ (we aren't sure which one is relevant, so we consider them all!). We will also use a comparison of distributions drawn from our multivariate probability model with the empirical distributions to validate the model.
### Multivariate Distribution (Task 1)
In Task 1 we will build a multivariate distribution, which is defined by a probability density function. From now on, we will call it _bivariate_, since there are only two random variables:
$$
f_{X_1,X_2}(x_1,x_2)
$$
This distribution is implemented in `scipy.stats.multivariate_normal`. The bivariate normal distribution is defined by 5 parameters: the parameters of the Gaussian distribution for $X_1$ and $X_2$, as well as the correlation coefficient between them, $\rho_{X_1,X_2}$. In this case we often refer to $X_1$ and $X_2$ as the marginal variables (univariate) and the bivariate distribtution as the joint distribution. We will use the bivariate PDF to create contour plots of probability density, as well as the CDF to evaluate probabilities of different cases:
This distribution is implemented in `scipy.stats.multivariate_normal`. The bivariate normal distribution is defined by 5 parameters: the parameters of the Gaussian distribution for $X_1$ and $X_2$, as well as the correlation coefficient between them, $\rho_{X_1,X_2}$. In this case we often refer to $X_1$ and $X_2$ as the marginal variables (univariate) and the bivariate distribution as the joint distribution. We will use the bivariate PDF to create contour plots of probability density, as well as the CDF to evaluate probabilities of different cases:
$$
F_{X_1,X_2}(x_1,x_2)
$$
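As a minimal sketch of how the five parameters map onto `scipy.stats.multivariate_normal`, the snippet below builds the mean vector and covariance matrix from two means, two standard deviations and a correlation coefficient, then evaluates the joint PDF and CDF at one point. The numerical values are placeholders assumed for illustration; they are not the case-study parameters.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Placeholder parameters (assumed for illustration only)
mu_1, sigma_1 = 10.0, 3.0
mu_2, sigma_2 = 12.0, 4.0
rho = 0.6

# Off-diagonal covariance term: Cov(X1, X2) = rho * sigma_1 * sigma_2
cov_12 = rho * sigma_1 * sigma_2
mean = np.array([mu_1, mu_2])
cov = np.array([[sigma_1**2, cov_12],
                [cov_12, sigma_2**2]])

bivariate = multivariate_normal(mean=mean, cov=cov)

x = np.array([12.0, 15.0])
density = bivariate.pdf(x)  # joint PDF f(x1, x2)
prob = bivariate.cdf(x)     # joint CDF F(x1, x2) = P[X1 <= x1, X2 <= x2]
print(f"pdf = {density:.5f}, cdf = {prob:.4f}")
```

Note that the distribution object takes the covariance matrix, not the correlation coefficient, so the conversion through `rho * sigma_1 * sigma_2` is the step that is easiest to forget.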
### Cases (Task 2)
We will consider three different cases and see how the probability of interest is different for each, as well as how they are influenced by the dependence structure of the data. The cases are described here; although they vary slightly, they have something in common: _they are all integrals of the bivariate PDF over some domain of interest $\Omega$._
#### Case 1: Union (OR)
The union case is relevant if the Thingamajig fails when either or both random variables exceed a specified value:
$$
P[X_1>x_1 \cup X_2>x_2]
$$
This is also called the "OR" probability because it considers either one variable _or_ the other _or_ both exceeding a specified value.
#### Case 2: Intersection (AND)
The intersection case is relevant if the Thingamajig fails when the specified values for both random variables are exceeded together:
$$
P[X_1>20 \cap X_2>20]
$$
This is also called the "AND" probability because it considers _both_ variables exceeding a specified value.
#### Case 3: Function of Random Variables
Often it is not possible to describe a region of interest $\Omega$ as a simple union or intersection probability. Instead, there are many combinations of $X_1$ and $X_2$ that define $\Omega$. If we can integrate the probability density function over this region we can evaluate the probability.
Luckily, it turns out there is some extra information about the Thingamajig: a function that describes some aspect of its behavior that we are very interested in:
$$
Z(X_{1},X_{2}) = 800 - X_{1}^2 - 20X_{2}
$$
where the condition in which we are interested occurs when $Z(X_{1},X_{2})<0$. Thus, the probability of interest is:
$$
P[X_1,X_2:\; Z<0]
$$
#### Evaluating Probabilities in Task 2
Cases 1 and 2 can be evaluated with the bivariate CDF directly because the integration bounds are relatively simple (be aware that some arithmetic and thinking is required; it is not as simple as calling `multivariate.cdf()`).
Case 3 is not easy to evaluate because it must be integrated over a complicated region. Instead, we will approximate the integral numerically using _Monte Carlo simulation_ (MCS). This is also how we will evaluate the distribution of the function of random variables in Task 3. Remember, there are four essential steps to MCS:
1. Define distributions for random variables (probability density over a domain)
2. Generate random samples
3. Do something with the samples (deterministic calculation)
4. Evaluate the results: e.g., “empirical” PDF, CDF of samples, etc.
_Note that as we now have a multivariate distribtution we can no longer sample from the univariate distributions independently!_
_Note that as we now have a multivariate distribution we can no longer sample from the univariate distributions independently!_
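The four MCS steps can be sketched as follows. This is an illustration under assumed placeholder parameters (the mean vector and covariance matrix below are not the fitted case-study values); the key point is that `rvs` draws $(X_1, X_2)$ pairs jointly, so the dependence between the variables is carried into the sample.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Step 1: define the joint distribution (placeholder parameters, assumed)
mean = [10.0, 12.0]
cov = [[9.0, 7.2],
       [7.2, 16.0]]
bivariate = multivariate_normal(mean=mean, cov=cov)

# Step 2: draw (X1, X2) pairs jointly so the dependence is preserved
n = 100_000
samples = bivariate.rvs(size=n, random_state=42)  # shape (n, 2)
x1, x2 = samples[:, 0], samples[:, 1]

# Step 3: deterministic calculation on each sample pair
z = 800 - x1**2 - 20 * x2

# Step 4: evaluate the result, here the fraction of samples with Z < 0
p_fail = np.mean(z < 0)
sample_rho = np.corrcoef(x1, x2)[0, 1]
print(f"P[Z < 0] is approximately {p_fail:.4f}")
print(f"sample correlation = {sample_rho:.3f}")
```

Sampling each marginal separately with `scipy.stats.norm` would give the right marginal histograms but a sample correlation near zero, which is why the joint object is used in Step 2.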
### Task 3: Validating the Bivariate Distribution
This task uses the distribution of the function of random variables (univariate) to validate the bivariate distribution, by comparing the empirical distribution to our model. Once the sample is generated, it involves the same goodness of fit tools that we used last week.
We need to represent our two dependent random variables with a bivariate distribution; a simple model is the bivariate Gaussian distribution, which is readily available via `scipy.stats.multivariate_normal`. To use it in this case study, we first need to check that the marginal distributions are each Gaussian, as well as compute the covariance and correlation coefficient.
Import the data in <code>data.csv</code>, then find the parameters of a normal distribution to fit to the data for each marginal. <em>Quickly</em> check the goodness of fit and state whether you think it is an appropriate distribution (we will keep using it anyway, regardless of the answer).
<p>
<em>Don't spend more than a few minutes on this, you should be able to quickly use some of your code from last week.</em>
Build the bivariate distribution using <code>scipy.stats.multivariate_normal</code> (as well as the mean vector and covariance matrix). To validate the result, create a plot that shows contours of the joint PDF, compared with the data (see note below). Comment on the quality of the fit in 2-3 sentences or bullet points.
<div style="background-color:#facb8e; color: black; width: 95%; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px"><p>Use the helper function <code>plot_contour</code> in <code>helper.py</code>; it was already imported above. Either look in the file to read it, or view the documentation in the notebook with <code>plot_contour?</code></p>
<p><em>Hint: for this Task use the optional </em><code>data</code><em> argument!</em></p></div>
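A possible workflow for fitting the marginals and building the joint distribution is sketched below. The column layout of `data.csv` is not shown here, so the snippet uses a synthetic two-column array as a stand-in (an assumption); with the real data, `data` would simply be the imported array. The last lines evaluate the joint PDF on a grid, which is the form that contour plotting routines typically expect.

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

# Synthetic stand-in for the two columns of the data set (assumed layout)
rng = np.random.default_rng(0)
data = rng.multivariate_normal([10.0, 12.0],
                               [[9.0, 7.2], [7.2, 16.0]], size=500)

# Fit a normal distribution to each marginal (maximum likelihood estimates)
mu_1, sigma_1 = norm.fit(data[:, 0])
mu_2, sigma_2 = norm.fit(data[:, 1])

# Covariance matrix and correlation coefficient estimated from the data;
# rowvar=False because each column is one variable
cov = np.cov(data, rowvar=False)
rho = np.corrcoef(data, rowvar=False)[0, 1]

bivariate = multivariate_normal(mean=[mu_1, mu_2], cov=cov)

# Joint PDF evaluated on a regular grid spanning the data
g1 = np.linspace(data[:, 0].min(), data[:, 0].max(), 50)
g2 = np.linspace(data[:, 1].min(), data[:, 1].max(), 50)
G1, G2 = np.meshgrid(g1, g2)
density = bivariate.pdf(np.dstack((G1, G2)))  # shape (50, 50)
print(f"rho = {rho:.3f}, density grid shape = {density.shape}")
```

`norm.fit` divides by $n$ while `np.cov` divides by $n-1$ by default, so the fitted standard deviations and the square roots of the covariance diagonal differ slightly; for a sample of this size the difference is negligible.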
%% Cell type:code id:896fd1c2 tags:
``` python
# plot_contour? # uncomment and run to read docstring
<li>Compute the requested probability using the empirical distribution.</li>
<li>Compute the requested probability using the bivariate distribution.</li>
<li>Create a bivariate plot that includes PDF contours <em>and</em> the region of interest.</li>
<li>Repeat the calculations for additional cases of correlation coefficient (+0.9, 0.0, -0.9) to see how the answer changes (you can simply regenerate the plot, you don't need to make multiple versions). <em>You can save this sub-task for later if you are running out of time. It is more important to get through Task 3 during the in-class session.</em></li>
<li>Write two or three sentences that summarize the results and explain the quantitative impact of the correlation coefficient. Make a particular note of whether one case is affected more or less than the others.</li>
<div style="background-color:#facb8e; color: black; width: 95%; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px"><p>Note that the optional arguments in the helper function <code>plot_contour</code> will be useful here--<b>also for the Project on Friday!</b>
Here is an example code that shows you what it can do (the values are meaningless)
<b>Task 2.3:</b> create cells below to carry out the Case 3 calculations.
Note that in this case you need to make the plot to visualize the region over which we want to integrate. We need to define the boundary of the region of interest by solving $Z(X_1,X_2)=0$ for $X_2$.
</p>
</div>
%% Cell type:markdown id:e960cb2f tags:
The equation can be defined as follows:
$$
\textrm{WRITE THE EQUATION HERE}
$$
which is then defined in Python and passed to the `plot_contour` function as an array for the keyword argument `region`.
<div style="background-color:#facb8e; color: black; width: 95%; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px"><p>Note: the bivariate figures are an important concept for the exam, so if using the code is too difficult for you to use when studying on your own, try sketching it on paper.</p></div>
## Part 3: Validate Bivariate with Monte Carlo Simulation
Now that we have seen how the different cases give different values of probability, let's focus on the function of random variables. This is a more interesting case because we can use the samples of $Z$ to approximate the distribution $f_Z(z)$ and use the empirical distribution of $Z$ to help validate the bivariate model.
- Use Monte Carlo Simulation to create a sample of $Z(X_1,X_2)$ and compare this distribution to the empirical distribution.</li>
- Write 2-3 sentences assessing the quality of the distribution from MCS, and whether the bivariate distribution is acceptable for this problem. Use qualitative and quantitative measures from last week to support your arguments.
</p>
<p>
<em>Note: it would be interesting to fit a parametric distribution to the MCS sample, but it is not required for this assignment.</em>
<b>Task 3.1:</b> first the histogram. Note that you probably already computed the samples in Part 2.
<b>Task 3.1:</b> Plot histograms of $Z$ based on the Monte Carlo samples, and based on the data. Note that you probably already computed the samples in Part 2.
<div style="background-color:#facb8e; color: black; width: 95%; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px"><p>In case you are wondering, the data for this exercise was computed with a Clayton Copula. A Copula is a useful way of modelling non-linear dependence. If you would like to learn more about this, you should consider the 2nd year cross-over module CEGM2005 Probabilistic Modelling of real-world phenomena through ObseRvations and Elicitation (MORE).</p></div>
"This workshop focuses fundamental skills for building, using and understanding multivariate distribtutions: in particular when our variables are no longer statistically independent.\n",
"This workshop focuses fundamental skills for building, using and understanding multivariate distributions: in particular when our variables are no longer statistically independent.\n",
"\n",
"For our case study we will use a Thingamajig: an imaginary object for which we have limited information. One thing we _do_ know, however, is that it is very much influenced by two random variables, $X_1$ and $X_2$: high values for these variables can cause the Thingamajig to fail. We will use a multivariate probability distribution to compute the probability of interest under various _cases_ (we aren't sure which one is relevant, so we consider them all!). We will also use a comparison of distributions drawn from our multivariate probability model with the empirical distributions to validate the model.\n",
"\n",
...
...
@@ -42,7 +42,7 @@
"f_{X_1,X_2}(x_1,x_2)\n",
"$$\n",
"\n",
"This distribution is implemented in `scipy.stats.multivariate_normal`. The bivariate normal distribution is defined by 5 parameters: the parameters of the Gaussian distribution for $X_1$ and $X_2$, as well as the correlation coefficient between them, $\\rho_{X_1,X_2}$. In this case we often refer to $X_1$ and $X_2$ as the marginal variables (univariate) and the bivariate distribtution as the joint distribution. We will use the bivariate PDF to create contour plots of probability density, as well as the CDF to evaluate probabilities of different cases:\n",
"This distribution is implemented in `scipy.stats.multivariate_normal`. The bivariate normal distribution is defined by 5 parameters: the parameters of the Gaussian distribution for $X_1$ and $X_2$, as well as the correlation coefficient between them, $\\rho_{X_1,X_2}$. In this case we often refer to $X_1$ and $X_2$ as the marginal variables (univariate) and the bivariate distribution as the joint distribution. We will use the bivariate PDF to create contour plots of probability density, as well as the CDF to evaluate probabilities of different cases:\n",
"\n",
"$$\n",
"F_{X_1,X_2}(x_1,x_2)\n",
...
...
@@ -101,7 +101,7 @@
"3. Do something with the samples (deterministic calculation) \n",
"4. Evaluate the results: e.g., “empirical” PDF, CDF of samples, etc.\n",
"\n",
"_Note that as we now have a multivariate distribtution we can no longer sample from the univariate distributions independently!_\n",
"_Note that as we now have a multivariate distribution we can no longer sample from the univariate distributions independently!_\n",
"\n",
"### Task 3: Validating the Bivariate Distribution\n",
"\n",
...
...
@@ -968,7 +968,7 @@
"\n",
"The reason why is best explained by looking at the bivariate plot above from Task 1: we can see that the data is not well matched by the probability density contours in the upper right corner of the figure, which is precisely where our region of interest is (when $Z<0$). Whether or not the model is \"good enough\" depends on what values of $Z$ we are interested (we might need more information about the Thingamajig). Since we have focused on the condition where $Z<0$, which does not have a good fit, we should not accept this model and consider a different multivariate model for $f_{X_1,X_2}(x_1,x_2)$.\n",
"\n",
"Note that the probability $P[Z,0]$ calculated for Case 3 in Part 2 indicates the model is not bad (0.002 and 0.004); however, the univariate distribution tells a different story!\n",
"Note that the probability $P[Z<0]$ calculated for Case 3 in Part 2 indicates the model is not bad (0.002 and 0.007); however, the univariate distribution tells a different story!\n",
" \n",
"This is actually a case where you can clearly see that the dependence structure in the data is <em>non-linear</em>. To do this we need a non-Gaussian multivariate distribution; unfortunately these are outside the scope of MUDE.\n",
"<b>Task 3.1:</b> first the histogram. Note that you probably already computed the samples in Part 2.\n",
"<b>Task 3.1:</b> Plot histograms of $Z$ based on the Monte Carlo samples, and based on the data. Note that you probably already computed the samples in Part 2.\n",
This workshop focuses fundamental skills for building, using and understanding multivariate distribtutions: in particular when our variables are no longer statistically independent.
This workshop focuses on fundamental skills for building, using and understanding multivariate distributions: in particular when our variables are no longer statistically independent.
For our case study we will use a Thingamajig: an imaginary object for which we have limited information. One thing we _do_ know, however, is that it is very much influenced by two random variables, $X_1$ and $X_2$: high values for these variables can cause the Thingamajig to fail. We will use a multivariate probability distribution to compute the probability of interest under various _cases_ (we aren't sure which one is relevant, so we consider them all!). We will also use a comparison of distributions drawn from our multivariate probability model with the empirical distributions to validate the model.
### Multivariate Distribution (Task 1)
In Task 1 we will build a multivariate distribution, which is defined by a probability density function. From now on, we will call it _bivariate_, since there are only two random variables:
$$
f_{X_1,X_2}(x_1,x_2)
$$
This distribution is implemented in `scipy.stats.multivariate_normal`. The bivariate normal distribution is defined by 5 parameters: the parameters of the Gaussian distribution for $X_1$ and $X_2$, as well as the correlation coefficient between them, $\rho_{X_1,X_2}$. In this case we often refer to $X_1$ and $X_2$ as the marginal variables (univariate) and the bivariate distribtution as the joint distribution. We will use the bivariate PDF to create contour plots of probability density, as well as the CDF to evaluate probabilities of different cases:
This distribution is implemented in `scipy.stats.multivariate_normal`. The bivariate normal distribution is defined by 5 parameters: the parameters of the Gaussian distribution for $X_1$ and $X_2$, as well as the correlation coefficient between them, $\rho_{X_1,X_2}$. In this case we often refer to $X_1$ and $X_2$ as the marginal variables (univariate) and the bivariate distribution as the joint distribution. We will use the bivariate PDF to create contour plots of probability density, as well as the CDF to evaluate probabilities of different cases:
$$
F_{X_1,X_2}(x_1,x_2)
$$
### Cases (Task 2)
We will consider three different cases and see how the probability of interest is different for each, as well as how they are influenced by the dependence structure of the data. The cases are described here; although they vary slightly, they have something in common: _they are all integrals of the bivariate PDF over some domain of interest $\Omega$._
#### Case 1: Union (OR)
The union case is relevant if the Thingamajig fails when either or both random variables exceed a specified value:
$$
P[X_1>x_1 \cup X_2>x_2]
$$
This is also called the "OR" probability because it considers either one variable _or_ the other _or_ both exceeding a specified value.
#### Case 2: Intersection (AND)
The intersection case is relevant if the Thingamajig fails when the specified values for both random variables are exceeded together:
$$
P[X_1>20 \cap X_2>20]
$$
This is also called the "AND" probability because it considers _both_ variables exceeding a specified value.
#### Case 3: Function of Random Variables
Often it is not possible to describe a region of interest $\Omega$ as a simple union or intersection probability. Instead, there are many combinations of $X_1$ and $X_2$ that define $\Omega$. If we can integrate the probability density function over this region we can evaluate the probability.
Luckily, it turns out there is some extra information about the Thingamajig: a function that describes some aspect of its behavior that we are very interested in:
$$
Z(X_{1},X_{2}) = 800 - X_{1}^2 - 20X_{2}
$$
where the condition in which we are interested occurs when $Z(X_{1},X_{2})<0$. Thus, the probability of interest is:
$$
P[X_1,X_2:\; Z<0]
$$
#### Evaluating Probabilities in Task 2
Cases 1 and 2 can be evaluated with the bivariate CDF directly because the integration bounds are relatively simple (be aware that some arithmetic and thinking is required; it is not as simple as calling `multivariate.cdf()`).
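One way to work out the arithmetic is to start from the fact that the joint CDF gives the probability that neither threshold is exceeded, $F_{X_1,X_2}(x_1,x_2)=P[X_1\le x_1 \cap X_2\le x_2]$. The union is the complement of that event:

$$
P[X_1>x_1 \cup X_2>x_2] = 1 - F_{X_1,X_2}(x_1,x_2)
$$

The intersection then follows from $P[A\cap B]=P[A]+P[B]-P[A\cup B]$, with the marginal CDFs $F_{X_1}$ and $F_{X_2}$ supplying the two single-variable exceedance probabilities:

$$
P[X_1>x_1 \cap X_2>x_2] = 1 - F_{X_1}(x_1) - F_{X_2}(x_2) + F_{X_1,X_2}(x_1,x_2)
$$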
Case 3 is not easy to evaluate because it must be integrated over a complicated region. Instead, we will approximate the integral numerically using _Monte Carlo simulation_ (MCS). This is also how we will evaluate the distribution of the function of random variables in Task 3. Remember, there are four essential steps to MCS:
1. Define distributions for random variables (probability density over a domain)
2. Generate random samples
3. Do something with the samples (deterministic calculation)
4. Evaluate the results: e.g., “empirical” PDF, CDF of samples, etc.
_Note that as we now have a multivariate distribtution we can no longer sample from the univariate distributions independently!_
_Note that as we now have a multivariate distribution we can no longer sample from the univariate distributions independently!_
### Task 3: Validating the Bivariate Distribution
This task uses the distribution of the function of random variables (univariate) to validate the bivariate distribution, by comparing the empirical distribution to our model. Once the sample is generated, it involves the same goodness of fit tools that we used last week.
We need to represent our two dependent random variables with a bivariate distribution; a simple model is the bivariate Gaussian distribution, which is readily available via `scipy.stats.multivariate_normal`. To use it in this case study, we first need to check that the marginal distributions are each Gaussian, as well as compute the covariance and correlation coefficient.
Import the data in <code>data.csv</code>, then find the parameters of a normal distribution to fit to the data for each marginal. <em>Quickly</em> check the goodness of fit and state whether you think it is an appropriate distribution (we will keep using it anyway, regardless of the answer).
<p>
<em>Don't spend more than a few minutes on this, you should be able to quickly use some of your code from last week.</em>
In addition to the code above, you should use some of the techniques from WS 1.7 and GA 1.7 to confirm that the marginal distributions are well-approximated by a normal distribution.
Build the bivariate distribution using <code>scipy.stats.multivariate_normal</code> (as well as the mean vector and covariance matrix). To validate the result, create a plot that shows contours of the joint PDF, compared with the data (see note below). Comment on the quality of the fit in 2-3 sentences or bullet points.
<div style="background-color:#facb8e; color: black; width: 95%; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px"> <p>Use the helper function <code>plot_contour</code> in <code>helper.py</code>; it was already imported above. Either look in the file to read it, or view the documentation in the notebook with <code>plot_contour?</code></p>
<p><em>Hint: for this Task use the optional </em><code>data</code><em> argument!</em></p></div>
%% Cell type:code id:896fd1c2 tags:
``` python
# plot_contour? # uncomment and run to read docstring
<li>Compute the requested probability using the empirical distribution.</li>
<li>Compute the requested probability using the bivariate distribution.</li>
<li>Create a bivariate plot that includes PDF contours <em>and</em> the region of interest.</li>
<li>Repeat the calculations for additional cases of correlation coefficient (+0.9, 0.0, -0.9) to see how the answer changes (you can simply regenerate the plot, you don't need to make multiple versions). <em>You can save this sub-task for later if you are running out of time. It is more important to get through Task 3 during the in-class session.</em></li>
<li>Write two or three sentences that summarize the results and explain the quantitative impact of the correlation coefficient. Make a particular note of whether one case is affected more or less than the others.</li>
<div style="background-color:#facb8e; color: black; width: 95%; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px"> <p>Note that the optional arguments in the helper function <code>plot_contour</code> will be useful here--<b>also for the Project on Friday!</b>
Here is an example code that shows you what it can do (the values are meaningless)
<b>Task 2.3:</b> create cells below to carry out the Case 3 calculations.
Note that in this case you need to make the plot to visualize the region over which we want to integrate. We need to define the boundary of the region of interest by solving $Z(X_1,X_2)=0$ for $X_2$.
</p>
</div>
%% Cell type:markdown id:0bea4cff tags:
The equation can be defined as follows:
$$
\textrm{WRITE THE EQUATION HERE}
$$
which is then defined in Python and passed to the `plot_contour` function as an array for the keyword argument `region`.
<div style="background-color:#facb8e; color: black; width: 95%; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px"> <p>Note: order of the tasks in this solution is not important.</p></div>
It looks like the union probabilities are pretty close, whereas the intersection probabilities differ by nearly an order of magnitude! The reason why can be seen by using the plots.
Now for the function of random variables. First we will make the plot to visualize the region over which we want to integrate. We need to define the boundary of the region of interest by solving $Z(X_1,X_2)=0$ for $X_2$:
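Setting $800 - X_1^2 - 20X_2 = 0$ and rearranging gives $X_2 = 40 - X_1^2/20$; points above this curve have $Z<0$. A minimal sketch of the boundary as arrays is shown below. The function names and the grid range are assumptions made for illustration, and the exact array shape expected by the `region` argument is defined in `helper.py`, so check its docstring before passing the result on.

```python
import numpy as np

def z_function(x1, x2):
    """Function of random variables from Case 3."""
    return 800 - x1**2 - 20 * x2

def boundary_x2(x1):
    """Value of X2 on the boundary Z = 0, for a given X1."""
    return 40 - x1**2 / 20

# Grid of X1 values (range assumed for illustration)
x1_grid = np.linspace(0, 30, 61)
x2_boundary = boundary_x2(x1_grid)

# Sanity check: Z should vanish everywhere on the boundary
print(np.allclose(z_function(x1_grid, x2_boundary), 0))

# A point above the curve lies inside the region of interest (Z < 0)
print(z_function(20.0, 25.0))
```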
We do not include the calculations for the different correlation coefficients here for compactness. All you need to do is redefine the covariance matrix and define a new multivariate normal object, then re-run the analyses (actually, it's best to do it in a function, which returns the probabilities of interest).
Some key observations about the calculated probabilities (not exhaustive):
<ol>
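<li>Higher positive dependence increases probability for cases 2 and 3, but decreases probability for case 1. The union and intersection probabilities always sum to $P[X_1>x_1]+P[X_2>x_2]$, which does not depend on the correlation, so they must move in opposite directions.</li>
<li>Independence decreases probability compared to original correlation coefficient for cases 2 and 3, but increases probability for case 1.</li>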
<li>Negative correlation decreases probability for cases 2 and 3, but increases probability for case 1.</li>
</ol>
You should be able to confirm these observations by considering how the contours of probability density change relative to the region of interest, as well as by computing them numerically.
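A numerical check along these lines could look like the sketch below: one function rebuilds the covariance matrix for a given correlation coefficient and returns the three probabilities. The means, standard deviations and thresholds are placeholders assumed for illustration, not the fitted values, and the function name is hypothetical.

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

def case_probabilities(mean, sigma, rho, x, n=100_000, seed=42):
    """Return (P_or, P_and, P[Z<0]) for a bivariate normal with correlation rho."""
    s1, s2 = sigma
    cov = [[s1**2, rho * s1 * s2],
           [rho * s1 * s2, s2**2]]
    dist = multivariate_normal(mean=mean, cov=cov)

    F_joint = dist.cdf(x)                # P[X1 <= x1, X2 <= x2]
    F_1 = norm.cdf(x[0], mean[0], s1)    # marginal CDF of X1
    F_2 = norm.cdf(x[1], mean[1], s2)    # marginal CDF of X2

    p_or = 1 - F_joint                   # Case 1: union
    p_and = 1 - F_1 - F_2 + F_joint      # Case 2: intersection

    # Case 3: Monte Carlo estimate of P[Z < 0]
    xs = dist.rvs(size=n, random_state=seed)
    z = 800 - xs[:, 0]**2 - 20 * xs[:, 1]
    p_z = np.mean(z < 0)
    return p_or, p_and, p_z

# Placeholder marginals and thresholds (assumed)
mean, sigma, x = [10.0, 12.0], [3.0, 4.0], [14.0, 17.0]
results = {rho: case_probabilities(mean, sigma, rho, x)
           for rho in (-0.9, 0.0, 0.9)}

for rho, (p_or, p_and, p_z) in results.items():
    print(f"rho = {rho:+.1f}: OR = {p_or:.4f}, AND = {p_and:.4f}, Z<0 = {p_z:.4f}")
```

Printing the three rows side by side makes it easy to compare each case across correlation coefficients. The joint CDF is computed by numerical integration, so very small intersection probabilities are only accurate to the integration tolerance.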
<div style="background-color:#facb8e; color: black; width: 95%; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px"> <p>Note: the bivariate figures are an important concept for the exam, so if using the code is too difficult for you to use when studying on your own, try sketching it on paper.</p></div>
## Part 3: Validate Bivariate with Monte Carlo Simulation
Now that we have seen how the different cases give different values of probability, let's focus on the function of random variables. This is a more interesting case because we can use the samples of $Z$ to approximate the distribution $f_Z(z)$ and use the empirical distribution of $Z$ to help validate the bivariate model.
- Use Monte Carlo Simulation to create a sample of $Z(X_1,X_2)$ and compare this distribution to the empirical distribution.</li>
- Write 2-3 sentences assessing the quality of the distribution from MCS, and whether the bivariate distribution is acceptable for this problem. Use qualitative and quantitative measures from last week to support your arguments.
</p>
<p>
<em>Note: it would be interesting to fit a parametric distribution to the MCS sample, but it is not required for this assignment.</em>
Two plots were created below to compare the distributions from the data (empirical) and MCS sample, which is based on the probability "model"---our bivariate distribution. It is clear that the fit is not great: although the upper tail looks like a good fit (third figure), the first and second figures clearly show that the lower tail (see, e.g., value of $Z=0$) and center of the distribution (see, e.g., the mode) differ by around 100 units.
The reason why is best explained by looking at the bivariate plot above from Task 1: we can see that the data is not well matched by the probability density contours in the upper right corner of the figure, which is precisely where our region of interest is (when $Z<0$). Whether or not the model is "good enough" depends on which values of $Z$ we are interested in (we might need more information about the Thingamajig). Since we have focused on the condition where $Z<0$, which does not have a good fit, we should not accept this model and should instead consider a different multivariate model for $f_{X_1,X_2}(x_1,x_2)$.
Note that the probability $P[Z,0]$ calculated for Case 3 in Part 2 indicates the model is not bad (0.002 and 0.004); however, the univariate distribution tells a different story!
Note that the probability $P[Z<0]$ calculated for Case 3 in Part 2 indicates the model is not bad (0.002 and 0.007); however, the univariate distribution tells a different story!
This is actually a case where you can clearly see that the dependence structure in the data is <em>non-linear</em>. To capture this we need a non-Gaussian multivariate distribution; unfortunately these are outside the scope of MUDE.
<b>Task 3.1:</b> first the histogram. Note that you probably already computed the samples in Part 2.
<b>Task 3.1:</b> Plot histograms of $Z$ based on the Monte Carlo samples, and based on the data. Note that you probably already computed the samples in Part 2.
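The comparison can be prepared as sketched below, using shared bin edges so that the two density histograms are directly comparable, plus an empirical CDF for each sample. The two arrays are synthetic stand-ins (an assumption): in the notebook, `z_data` would be $Z$ evaluated on the observations and `z_mcs` would be $Z$ evaluated on the Monte Carlo sample. The plotting-position formula $i/(n+1)$ is one common choice, not necessarily the one used last week.

```python
import numpy as np

def ecdf(sample):
    """Return sorted values and empirical non-exceedance probabilities."""
    x = np.sort(sample)
    p = np.arange(1, len(x) + 1) / (len(x) + 1)
    return x, p

# Synthetic stand-ins for Z from the data and Z from the MCS sample (assumed)
rng = np.random.default_rng(1)
z_data = rng.normal(300.0, 120.0, size=500)
z_mcs = rng.normal(350.0, 150.0, size=10_000)

# Shared bin edges so both histograms use the same intervals
edges = np.histogram_bin_edges(np.concatenate([z_data, z_mcs]), bins=30)
hist_data, _ = np.histogram(z_data, bins=edges, density=True)
hist_mcs, _ = np.histogram(z_mcs, bins=edges, density=True)

# Empirical CDFs, useful for comparing the tails
x_data, p_data = ecdf(z_data)
x_mcs, p_mcs = ecdf(z_mcs)

# Probability of interest estimated from each sample
p_emp = np.mean(z_data < 0)
p_model = np.mean(z_mcs < 0)
print(f"P[Z<0]: data = {p_emp:.4f}, MCS = {p_model:.4f}")
```

The `edges`, `hist_data` and `hist_mcs` arrays can be handed to any plotting routine; passing the same `edges` to both histograms is what keeps the visual comparison fair.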
<div style="background-color:#facb8e; color: black; width: 95%; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px"><p>In case you are wondering, the data for this exercise was computed with a Clayton Copula. A Copula is a useful way of modelling non-linear dependence. If you would like to learn more about this, you should consider the 2nd year cross-over module CEGM2005 Probabilistic Modelling of real-world phenomena through ObseRvations and Elicitation (MORE).</p></div>