diff --git a/content/GA_1_7/Discharge/Distribution_Fitting_Disch.ipynb b/content/GA_1_7/Discharge/Distribution_Fitting_Disch.ipynb index b6c5e0a7392528849826fb58fbd4a6a19e54990a..daaefbab3bb9ec360d8ad5b908c2377cf53a0a31 100644 --- a/content/GA_1_7/Discharge/Distribution_Fitting_Disch.ipynb +++ b/content/GA_1_7/Discharge/Distribution_Fitting_Disch.ipynb @@ -178,9 +178,10 @@ "id": "bfadcf3f-4578-4809-acdb-625ab3a71f27" }, "source": [ - "<div style=\"background-color:#AABAB2; color: black; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px\">\n", + "<div style=\"background-color:#AABAB2; color: black; width:95%; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px\">\n", "<p>\n", - "<b>Task 1:</b> \n", + "<b>Task 1:</b> \n", + " \n", "Describe the data based on the previous statistics:\n", " <li>Which variable presents a higher variability?</li>\n", " <li>What does the skewness coefficient means? Which kind of distribution functions should we consider to fit them?</li>\n", @@ -192,10 +193,11 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "<div style=\"background-color:#FAE99E; color: black; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px\">\n", + "<div style=\"background-color:#FAE99E; color: black; width:95%; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px\">\n", " <b>Solution 1:</b>\n", - " <li>$h$ presents a higher variance while $h$ and $u$ have a similar mean. Then, if we compute the coefficient of variation to standardize that variability, we obtain $CV(h)=0.130/1.211 = 0.107$ and $CV(u)= 0.092/1.464 = 0.063$. Thus, $h$ has higher variability than $u$.</li>\n", - " <li>Both $h$ and $u$ have a positive non-zero skewness, being the one for $u$ significantly higher. Thus, the data presents a right tail and mode < median < mean. 
An appropriate distribution for $h$ and $u$ would be one which: (1) it is bounded in 0 (no negative values of $h$ or $u$ are physically possible), and (2) has a positive tail. If we consider the distributions that you have been introduced to, Lognormal, Gumbel or Exponential would be a possibility. Also, Gaussian distribution might be a possibility for $h$ as the skewness is relatively low and might not be significant.</li>\n", + "\n", + "- $h$ presents a higher variance while $h$ and $u$ have a similar mean. Then, if we compute the coefficient of variation to standardize that variability, we obtain $CV(h)=0.130/1.211 = 0.107$ and $CV(u)= 0.092/1.464 = 0.063$. Thus, $h$ has higher variability than $u$.</li>\n", + "- Both $h$ and $u$ have a positive skewness, with that of $u$ being significantly higher. Thus, the data present a right tail and mode < median < mean. An appropriate distribution for $h$ and $u$ would be one that: (1) is bounded at 0 (no negative values of $h$ or $u$ are physically possible), and (2) has a positive tail. Among the distributions that you have been introduced to, Lognormal, Gumbel or Exponential would be possibilities.
Also, Gaussian distribution might be a possibility for $h$ as the skewness is relatively low and might not be significant.</li>\n", "</div>\n", "</div>" ] @@ -221,9 +223,10 @@ "id": "b20641d9", "metadata": {}, "source": [ - "<div style=\"background-color:#AABAB2; color: black; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px\">\n", + "<div style=\"background-color:#AABAB2; color: black; width:95%; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px\">\n", "<p>\n", "<b>Task 2:</b> \n", + "\n", "Define a function to compute the empirical CDF.\n", "</p>\n", "</div>" @@ -290,9 +293,10 @@ "id": "bfadcf3f-4578-4809-acdb-625ab3a71f27" }, "source": [ - "<div style=\"background-color:#AABAB2; color: black; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px\">\n", + "<div style=\"background-color:#AABAB2; color: black; width:95%; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px\">\n", "<p>\n", "<b>Task 3:</b> \n", + "\n", "Based on the results of Task 1 and the empirical PDF and CDF, select <b>one</b> distribution to fit to each variable. 
For $h$, select between Uniform or Gaussian distribution, while for $u$ choose between Exponential or Gumbel.\n", "</p>\n", "</div>" @@ -302,10 +306,11 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "<div style=\"background-color:#FAE99E; color: black; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px\">\n", + "<div style=\"background-color:#FAE99E; color: black; width:95%; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px\">\n", " <b>Solution 3:</b>\n", - " $h$: Gaussian\n", - " $u$: Gumbel</li>\n", + "\n", + "$h$: Gaussian\n", + "$u$: Gumbel\n", "</div>\n", "</div>" ] @@ -323,9 +328,10 @@ "id": "6a24dbfa", "metadata": {}, "source": [ - "<div style=\"background-color:#AABAB2; color: black; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px\">\n", + "<div style=\"background-color:#AABAB2; color: black; width:95%; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px\">\n", "<p>\n", "<b>Task 4:</b> \n", + "\n", "Fit the selected distributions to the observations using MLE.\n", "</p>\n", "</div>\n", @@ -357,9 +363,10 @@ "id": "30f5112c", "metadata": {}, "source": [ - "<div style=\"background-color:#AABAB2; color: black; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px\">\n", + "<div style=\"background-color:#AABAB2; color: black; width:95%; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px\">\n", "<p>\n", "<b>Task 5:</b> \n", + "\n", "Assess the goodness of fit of the selected distribution using:\n", " <li> One graphical method: QQplot or Logscale. 
Choose one.</li>\n", " <li> Kolmogorov-Smirnov test.</li>\n", @@ -490,9 +497,10 @@ "id": "ad185d88", "metadata": {}, "source": [ - "<div style=\"background-color:#AABAB2; color: black; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px\">\n", + "<div style=\"background-color:#AABAB2; color: black; width:95%; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px\">\n", "<p>\n", "<b>Task 6:</b> \n", + "\n", "Interpret the results of the GOF techniques. How does the selected parametric distribution perform?\n", "</p>\n", "</div>" @@ -502,11 +510,12 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "<div style=\"background-color:#FAE99E; color: black; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px\">\n", + "<div style=\"background-color:#FAE99E; color: black; width:95%; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px\">\n", " <b>Solution 6:</b>\n", - " <li> Logscale plot: This technique allows to visually assess the fitting of the parametric distribution to the tail of the empirical distribution. For both $h$ and $u$, Gaussian and Gumbel distributions performs well even in the tail of the distribution. For $h$, high values start to deviate from the Gaussian distribution, indicating that for lower non-exceedance probabilities it might not be a good fit. </li>\n", - " <li> QQplot: Similar conclusions to those for Logscale can be derived.</li>\n", - " <li> Kolmogorov-Smirnov test: remember that the null hypothesis of this test is that the samples follow the parametric distribution. Therefore, the p-value represents the probability of the null hypothesis being true. If p-value is lower than the significance ($\\alpha=$0.05, for instance), the null hypothesis is rejected. 
Considering here $\\alpha=0.05$, we can accept that the variable $h$ comes from a Gaussian distribution and that $u$ comes from a Gumbel distribution.</li>\n", + " \n", + "- Logscale plot: This technique allows us to visually assess the fit of the parametric distribution to the tail of the empirical distribution. For both $h$ and $u$, the Gaussian and Gumbel distributions perform well even in the tail of the distribution. For $h$, high values start to deviate from the Gaussian distribution, indicating that for lower exceedance probabilities it might not be a good fit.\n", + "- QQplot: Similar conclusions to those for Logscale can be derived.\n", + "- Kolmogorov-Smirnov test: remember that the null hypothesis of this test is that the samples follow the parametric distribution. The p-value is the probability of obtaining a test statistic at least as extreme as the observed one if the null hypothesis were true; it is not the probability of the null hypothesis being true. If the p-value is lower than the significance level ($\\alpha=0.05$, for instance), the null hypothesis is rejected. Considering here $\\alpha=0.05$, we cannot reject that the variable $h$ comes from a Gaussian distribution and that $u$ comes from a Gumbel distribution.\n", "</div>\n", "</div>" ] @@ -526,7 +535,7 @@ "source": [ "Using the fitted distributions, we are going to propagate the uncertainty from $h$ and $u$ to $q$ **assuming that $h$ and $u$ are independent**.\n", "\n", - "<div style=\"background-color:#AABAB2; color: black; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px\">\n", + "<div style=\"background-color:#AABAB2; color: black; width:95%; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px\">\n", "<p>\n", "<b>Task 7:</b> \n", " \n", @@ -597,12 +606,13 @@ "id": "f0841e5c", "metadata": {}, "source": [ - "<div style=\"background-color:#AABAB2; color: black; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px\">\n", + "<div style=\"background-color:#AABAB2; color: black; width:95%; vertical-align: middle; padding:15px; margin:
10px; border-radius: 10px\">\n", "<p>\n", - "<b>Task 8:</b> \n", + "<b>Task 8:</b> \n", + " \n", "Interpret the figures above, answering the following questions:\n", - " <li>Are there differences between the two computed distributions for $q$?</li>\n", - " <li>What are the advantages and disadvantages of using the simulations?</li>\n", + "- Are there differences between the two computed distributions for $q$?</li>\n", + "- What are the advantages and disadvantages of using the simulations?</li>\n", "</p>\n", "</div>" ] @@ -611,10 +621,11 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "<div style=\"background-color:#FAE99E; color: black; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px\">\n", + "<div style=\"background-color:#FAE99E; color: black; width:95%; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px\">\n", " <b>Solution 8:</b>\n", - " <li> In the PDF plot, we can see that the shape of the distribution is similar for $q$. In the CDF plot, we can see that there are significant differences in the tail of the distribution of $q$, being the values from the observations higher than those computed from the simulations. This is because the Gaussian distribution does not properly fit the tail of the distribution of $h$ and when inferring values with very low exceedance probabilities, that becomes more noticeable. </li>\n", - " <li> <b>Disadvantages:</b> we are assuming that $h$ and $u$ are independent (we will see how to address this issue next week). But is that true? Also, the results are conditioned to how good model is the selected parametric distribution. In this case, since the tail of the distribution of $h$ is not properly fitted, the obtained distribution for $q$ deviates from the one obtained from the observations. Also, some simulated values are negative and, thus, non-physical. That could be corrected using distributions bounded in 0. 
<li><b>Advantages:</b> I can draw all the samples I want allowing the computation of events I have not observed yet (extreme events).\n", + "\n", + "- In the PDF plot, we can see that the shape of the distribution is similar for $q$. In the CDF plot, we can see that there are significant differences in the tail of the distribution of $q$, with the values from the observations being higher than those computed from the simulations. This is because the Gaussian distribution does not properly fit the tail of the distribution of $h$, which becomes more noticeable when inferring values with very low exceedance probabilities.\n", + "- <b>Disadvantages:</b> we are assuming that $h$ and $u$ are independent (we will see how to address this issue next week). But is that true? Also, the results depend on how well the selected parametric distribution models the data. In this case, since the tail of the distribution of $h$ is not properly fitted, the obtained distribution for $q$ deviates from the one obtained from the observations. Also, some simulated values are negative and, thus, non-physical. That could be corrected using distributions bounded at 0. <b>Advantages:</b> I can draw all the samples I want, allowing the computation of events I have not observed yet (extreme events).\n", "</div>" ] }, @@ -624,7 +635,7 @@ "source": [ "If you run the code in the cell below, you will obtain a scatter plot of both variables.
Explore the relationship between both variables and answer the following questions:\n", "\n", - "<div style=\"background-color:#AABAB2; color: black; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px\">\n", + "<div style=\"background-color:#AABAB2; color: black; width:95%; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px\">\n", "<p>\n", "<b>Task 9:</b> \n", " \n", @@ -689,10 +700,12 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "<div style=\"background-color:#FAE99E; color: black; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px\">\n", + "<div style=\"background-color:#FAE99E; color: black; width:95%; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px\">\n", " <b>Solution 9:</b>\n", - " <li> The observations are focussed in an area of the plot while the simulations are spreaded all around. this is because the observations are dependent to each other, there is a physical relationship between the water depth and the velocity of the flow, while the simualtions were assumed to be independent. </li>\n", - " <li> There is a correlation of 0.39 between the observed $h$ and $u$, indicating the physical dependence between the variables. On the contrary, no significant correlation between the generated samples is observed.</li> <li><b>Some suggestions:</b> Improve the fit in the tail of $h$. Account for the dependence between the two variables. </li>\n", + "\n", + "- The observations are concentrated in one area of the plot while the simulations are spread all around. This is because the observed variables depend on each other (there is a physical relationship between the water depth and the velocity of the flow), while the simulations were assumed to be independent. </li>\n", + "- There is a correlation of 0.39 between the observed $h$ and $u$, indicating the physical dependence between the variables.
On the contrary, no significant correlation between the generated samples is observed.</li>\n", + "- <b>Some suggestions:</b> Improve the fit in the tail of $h$. Account for the dependence between the two variables. </li>\n", "</div>" ] }, diff --git a/content/GA_1_7/Force/Distribution_Fitting_force.ipynb b/content/GA_1_7/Force/Distribution_Fitting_force.ipynb index 9b5b2bd539a7a96414c7642e311c935fffffa17f..22cdc7be8f743feddecc482e47a7d42ce01d269c 100644 --- a/content/GA_1_7/Force/Distribution_Fitting_force.ipynb +++ b/content/GA_1_7/Force/Distribution_Fitting_force.ipynb @@ -32,7 +32,7 @@ "source": [ "## Case 1: Wave impacts on a crest wall\n", "\n", - "**What's the propagated uncertainty? *How large will be the horizontal force?***\n", + "**What's the propagated uncertainty? *How large will the horizontal force be?***\n", "\n", "In this project, you have chosen to work on the uncertainty of wave periods and wave heights in the Alboran sea to estimate the impacts on a crest wall: a concrete element installed on top of mound breakwater. You have observations from buoys of the significant wave height ($H$) and the peak wave period ($T$) each hour for several years. As you know, $H$ and $T$ are hydrodynamic variables relevant to estimate wave impacts on the structure. The maximum horizontal force (exceeded by 0.1% of incoming waves) can be estimated using the following equation (USACE, 2002).\n", "\n", @@ -49,7 +49,7 @@ "**The goal of this project is:**\n", "1. Choose a reasonable distribution function for $H$ and $T$.\n", "2. Fit the chosen distributions to the observations of $H$ and $T$.\n", - "3. Assuming $H$ and $d$ are independent, propagate their distributions to obtain the distribution of $F_h$.\n", + "3. Assuming $H$ and $T$ are independent, propagate their distributions to obtain the distribution of $F_h$.\n", "4. Analyze the distribution of $F_h$." 
] }, @@ -178,9 +178,10 @@ "id": "bfadcf3f-4578-4809-acdb-625ab3a71f27" }, "source": [ - "<div style=\"background-color:#AABAB2; color: black; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px\">\n", + "<div style=\"background-color:#AABAB2; color: black; width:95%; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px\">\n", "<p>\n", - "<b>Task 1:</b> \n", + "<b>Task 1:</b>\n", + "\n", "Describe the data based on the previous statistics:\n", " <li>Which variable presents a higher variability?</li>\n", " <li>What does the skewness coefficient means? Which kind of distribution functions should we consider to fit them?</li>\n", @@ -192,10 +193,13 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "<div style=\"background-color:#FAE99E; color: black; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px\">\n", - " <b>Solution 1:</b>\n", - " <li>$T$ presents a higher variance but a much higher mean. Then, if we compute the coefficient of variation to standardize that variability, we obtain $CV(H)=0.664/1.296 = 0.512$ and $CV(T)= 4.710/6.861 = 0.686.3$. Thus, $T$ has higher variability than $H$.</li>\n", - " <li>Both $H$ and $T$ has a positive non-zero skewness, being the one for $H$ significantly higher. Thus, the data presents a right tail and mode < median < mean. An appropriate distribution for $H$ and $T$ would be one which: (1) it is bounded in 0 (no negative values of $H$ or $T$ are physically possible), and (2) has a positive tail. If we consider the distributions that you have been introduced to, Lognormal, Gumbel or Exponential would be a possibility</li>\n", + "<div style=\"background-color:#FAE99E; color: black; width:95%; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px\">\n", + "<p>\n", + "<b>Solution:</b>\n", + "\n", + "- $T$ presents a higher variance but a much higher mean. 
Then, if we compute the coefficient of variation to standardize that variability, we obtain $CV(H)=0.664/1.296 = 0.512$ and $CV(T)= 4.710/6.861 = 0.686$. Thus, $T$ has higher variability than $H$.\n", + "- Both $H$ and $T$ have a positive skewness, with that of $H$ being significantly higher. Thus, the data present a right tail and mode < median < mean. An appropriate distribution for $H$ and $T$ would be one that: (1) is bounded at 0 (no negative values of $H$ or $T$ are physically possible), and (2) has a positive tail. Among the distributions that you have been introduced to, Lognormal, Gumbel or Exponential would be possibilities.\n", + "\n", "</div>\n", "</div>" ] @@ -221,9 +225,10 @@ "id": "b20641d9", "metadata": {}, "source": [ - "<div style=\"background-color:#AABAB2; color: black; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px\">\n", + "<div style=\"background-color:#AABAB2; color: black; width:95%; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px\">\n", "<p>\n", - "<b>Task 2:</b> \n", + "<b>Task 2:</b> \n", + " \n", "Define a function to compute the empirical CDF.\n", "</p>\n", "</div>" @@ -238,7 +243,7 @@ "source": [ "def ecdf(var):\n", " x = np.sort(var) # sort the values from small to large\n", - " n = x.size # determine the number of datapoints\\\n", + " n = x.size # determine the number of datapoints\n", " y = np.arange(1, n+1) / (n+1)\n", " return [y, x]" ] @@ -290,9 +295,10 @@ "id": "bfadcf3f-4578-4809-acdb-625ab3a71f27" }, "source": [ - "<div style=\"background-color:#AABAB2; color: black; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px\">\n", + "<div style=\"background-color:#AABAB2; color: black; width:95%; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px\">\n", "<p>\n", "<b>Task 3:</b> \n", + "\n", "Based on the results of Task 1 and the empirical PDF and CDF, select <b>one</b> distribution to fit to each variable.
For $H$, select between Exponential or Gaussian distribution, while for $T$ choose between Uniform or Gumbel.\n", "</p>\n", "</div>" @@ -302,10 +308,12 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "<div style=\"background-color:#FAE99E; color: black; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px\">\n", - " <b>Solution 3:</b>\n", - " $H$: Exponential\n", - " $T$: Gumbel</li>\n", + "<div style=\"background-color:#FAE99E; color: black; width:95%; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px\">\n", + "<b>Solution:</b>\n", + "\n", + "$H$: Exponential\n", + "$T$: Gumbel\n", + "\n", "</div>\n", "</div>" ] @@ -323,9 +331,10 @@ "id": "6a24dbfa", "metadata": {}, "source": [ - "<div style=\"background-color:#AABAB2; color: black; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px\">\n", + "<div style=\"background-color:#AABAB2; color: black; width:95%; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px\">\n", "<p>\n", - "<b>Task 4:</b> \n", + "<b>Task 4:</b> \n", + " \n", "Fit the selected distributions to the observations using MLE.\n", "</p>\n", "</div>\n", @@ -357,9 +366,10 @@ "id": "30f5112c", "metadata": {}, "source": [ - "<div style=\"background-color:#AABAB2; color: black; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px\">\n", + "<div style=\"background-color:#AABAB2; color: black; width:95%; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px\">\n", "<p>\n", - "<b>Task 5:</b> \n", + "<b>Task 5:</b> \n", + " \n", "Assess the goodness of fit of the selected distribution using:\n", " <li> One graphical method: QQplot or Logscale. 
Choose one.</li>\n", " <li> Kolmogorov-Smirnov test.</li>\n", @@ -490,9 +500,10 @@ "id": "ad185d88", "metadata": {}, "source": [ - "<div style=\"background-color:#AABAB2; color: black; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px\">\n", + "<div style=\"background-color:#AABAB2; color: black; width:95%; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px\">\n", "<p>\n", - "<b>Task 6:</b> \n", + "<b>Task 6:</b> \n", + " \n", "Interpret the results of the GOF techniques. How does the selected parametric distribution perform?\n", "</p>\n", "</div>" @@ -502,11 +513,12 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "<div style=\"background-color:#FAE99E; color: black; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px\">\n", + "<div style=\"background-color:#FAE99E; color: black; width:95%; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px\">\n", " <b>Solution 6:</b>\n", - " <li> Logscale plot: This technique allows to visually assess the fitting of the parametric distribution to the tail of the empirical distribution. For $H$, Exponential distribution performs well for low values. On the contrary, it does not properly model the right tail. It is on the safe side providing predictions higher than those observed. Note that this may lead to predictions of $H$ that are not physically possible. Regarding $T$, the Gumbel distribution seems to follow the low observations and those around the central moments but not those on the right tail. The predictions provided by the Gumbel distribution are on the safe side. </li>\n", - " <li> QQplot: Similar conclusions to those for Logscale can be derived.</li>\n", - " <li> Kolmogorov-Smirnov test: remember that the null hypothesis of this test is that the samples follow the parametric distribution. Therefore, the p-value represents the probability of the null hypothesis being true. 
If p-value is lower than the significance ($\\alpha=$0.05, for instance), the null hypothesis is rejected. Considering here $\\alpha=0.05$, we can reject that the variable $H$ comes from a Exponential distribution and that $T$ comes from a Gumbel distribution.</li>\n", + "\n", + "- Logscale plot: This technique allows us to visually assess the fit of the parametric distribution to the tail of the empirical distribution. For $H$, the Exponential distribution performs well for low values. However, it does not properly model the right tail. It is on the safe side, providing predictions higher than those observed. Note that this may lead to predictions of $H$ that are not physically possible. Regarding $T$, the Gumbel distribution seems to follow the low observations and those around the central moments but not those on the right tail. The predictions provided by the Gumbel distribution are on the safe side. \n", + "- QQplot: Similar conclusions to those for Logscale can be derived.\n", + "- Kolmogorov-Smirnov test: remember that the null hypothesis of this test is that the samples follow the parametric distribution. The p-value is the probability of obtaining a test statistic at least as extreme as the observed one if the null hypothesis were true; it is not the probability of the null hypothesis being true. If the p-value is lower than the significance level ($\\alpha=0.05$, for instance), the null hypothesis is rejected.
Considering here $\\alpha=0.05$, we can reject that the variable $H$ comes from an Exponential distribution and that $T$ comes from a Gumbel distribution.\n", "</div>\n", "</div>" ] @@ -526,7 +538,7 @@ "source": [ "Using the fitted distributions, we are going to propagate the uncertainty from $H$ and $T$ to $F_h$ **assuming that $H$ and $T$ are independent**.\n", "\n", - "<div style=\"background-color:#AABAB2; color: black; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px\">\n", + "<div style=\"background-color:#AABAB2; color: black; width:95%; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px\">\n", "<p>\n", "<b>Task 7:</b> \n", " \n", @@ -597,12 +609,13 @@ "id": "f0841e5c", "metadata": {}, "source": [ - "<div style=\"background-color:#AABAB2; color: black; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px\">\n", + "<div style=\"background-color:#AABAB2; color: black; width:95%; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px\">\n", "<p>\n", "<b>Task 8:</b> \n", + "\n", "Interpret the figures above, answering the following questions:\n", - " <li>Are there differences between the two computed distributions for $F_h$?</li>\n", - " <li>What are the advantages and disadvantages of using the simulations?</li>\n", + "- Are there differences between the two computed distributions for $F_h$?\n", + "- What are the advantages and disadvantages of using the simulations?\n", "</p>\n", "</div>" ] @@ -611,10 +624,12 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "<div style=\"background-color:#FAE99E; color: black; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px\">\n", + "<div style=\"background-color:#FAE99E; color: black; width:95%; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px\">\n", " <b>Solution 8:</b>\n", - " <li> In the PDF plot, we can see that the shape of the distribution is similar for $F_h$ although more density
around the central moments is concentrated in the simualted data. In the CDF plot, we can see that there are significant differences in the tail of the distribution of $F_h$, being the values from the simulations higher than those computed from the observations. This is because both the Exponential and the Gumbel distribution overpredict the tail of the distributions of $H$ and $T$, respectively. </li>\n", - " <li> <b>Disadvantages:</b> we are assuming that $H$ and $T$ are independent (we will see how to address this issue next week). But is that true? Also, the results are conditioned to how good model is the selected parametric distribution. In this case, since the tail of the distributions of $H$ and $T$ are not properly fitted, the obtained distribution for $F_h$ deviates from the one obtained from the observations. <b>Advantages:</b> I can draw all the samples I want allowing the computation of events I have not observed yet (extreme events).<li> <b>Extra note:</b> The equation you are applying to compute $F_h$ is prepared for extreme waves. Thus, when applied out its range of application, it leads to negative forces which do not have physical meaning.</li>\n", + "\n", + "- In the PDF plot, we can see that the shape of the distribution is similar for $F_h$, although more density is concentrated around the central moments in the simulated data. In the CDF plot, we can see that there are significant differences in the tail of the distribution of $F_h$, with the values from the simulations being higher than those computed from the observations. This is because both the Exponential and the Gumbel distribution overpredict the tail of the distributions of $H$ and $T$, respectively. \n", + "- <b>Disadvantages:</b> we are assuming that $H$ and $T$ are independent (we will see how to address this issue next week). But is that true? Also, the results depend on how well the selected parametric distribution models the data.
In this case, since the tails of the distributions of $H$ and $T$ are not properly fitted, the obtained distribution for $F_h$ deviates from the one obtained from the observations. <b>Advantages:</b> I can draw all the samples I want, allowing the computation of events I have not observed yet (extreme events).\n", + "- <b>Extra note:</b> The equation you are applying to compute $F_h$ is intended for extreme waves. Thus, when applied outside its range of application, it leads to negative forces, which have no physical meaning.\n", "</div>\n", "</div>" ] @@ -625,7 +640,7 @@ "source": [ "If you run the code in the cell below, you will obtain a scatter plot of both variables. Explore the relationship between both variables and answer the following questions:\n", "\n", - "<div style=\"background-color:#AABAB2; color: black; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px\">\n", + "<div style=\"background-color:#AABAB2; color: black; width:95%; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px\">\n", "<p>\n", "<b>Task 9:</b> \n", " \n", @@ -690,10 +705,12 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "<div style=\"background-color:#FAE99E; color: black; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px\">\n", + "<div style=\"background-color:#FAE99E; color: black; width:95%; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px\">\n", " <b>Solution 9:</b>\n", - " <li> The observations are focussed in an area of the plot while the simulations are spreaded all around. this is because the observations are dependent to each other, there is a physical relationship between the wave height and the wave period, while the simualtions were assumed to be independent. </li>\n", - " <li> There is a correlation of 0.46 between the observed $H$ and $T$, indicating the physical dependence between the variables.
On the contrary, no significant correlation between the generated samples is observed.</li> <li><b>Some suggestions:</b> Improve the fit of $H$ and $T$. Maybe propose Gumbel or Lognormal distribution for $H$ and Lognormal for $T$. Account for the dependence between the two variables. </li>\n", + "\n", + "- The observations are concentrated in one area of the plot while the simulations are spread all around. This is because the observed variables depend on each other (there is a physical relationship between the wave height and the wave period), while the simulations were assumed to be independent.\n", + "- There is a correlation of 0.46 between the observed $H$ and $T$, indicating the physical dependence between the variables. On the contrary, no significant correlation between the generated samples is observed.\n", + "- <b>Some suggestions:</b> Improve the fit of $H$ and $T$. Maybe propose Gumbel or Lognormal distribution for $H$ and Lognormal for $T$. Account for the dependence between the two variables.\n", "</div>" ] }, diff --git a/content/GA_1_7/emissions/Distribution_Fitting_emissions.ipynb b/content/GA_1_7/emissions/Distribution_Fitting_emissions.ipynb index e721fb1bb8cff27c3daef89097ef198fb5478c84..64b2334a9dca66c7d1722ac43ecfab500025eb76 100644 --- a/content/GA_1_7/emissions/Distribution_Fitting_emissions.ipynb +++ b/content/GA_1_7/emissions/Distribution_Fitting_emissions.ipynb @@ -30,7 +30,7 @@ "id": "1db6fea9-f3ad-44bc-a4c8-7b2b3008e945" }, "source": [ - "## Case 1: $CO_2$ emissions from traffic\n", + "## Case 2: $CO_2$ emissions from traffic\n", "\n", "**What's the propagated uncertainty?
*How large will be the $CO_2$ emissions?***\n", "\n", @@ -172,12 +172,13 @@ "id": "bfadcf3f-4578-4809-acdb-625ab3a71f27" }, "source": [ - "<div style=\"background-color:#AABAB2; color: black; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px\">\n", + "<div style=\"background-color:#AABAB2; color: black; width:95%; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px\">\n", "<p>\n", - "<b>Task 1:</b> \n", + "<b>Task 1:</b> \n", + "\n", "Describe the data based on the previous statistics:\n", - " <li>Which variable presents a higher variability?</li>\n", - " <li>What does the skewness coefficient means? Which kind of distribution functions should we consider to fit them?</li>\n", + "- Which variable presents a higher variability?</li>\n", + "- What does the skewness coefficient means? Which kind of distribution functions should we consider to fit them?</li>\n", "</p>\n", "</div>" ] @@ -186,10 +187,11 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "<div style=\"background-color:#FAE99E; color: black; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px\">\n", + "<div style=\"background-color:#FAE99E; color: black; width:95%; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px\">\n", " <b>Solution 1: TO UPDATE</b>\n", - " <li>$T$ presents a higher variance but a much higher mean. Then, if we compute the coefficient of variation to standardize that variability, we obtain $CV(H)=0.664/1.296 = 0.512$ and $CV(T)= 4.710/6.861 = 0.686.3$. Thus, $T$ has higher variability than $H$.</li>\n", - " <li>Both $H$ and $T$ has a positive non-zero skewness, being the one for $H$ significantly higher. Thus, the data presents a right tail and mode < median < mean. An appropriate distribution for $H$ and $T$ would be one which: (1) it is bounded in 0 (no negative values of $H$ or $T$ are physically possible), and (2) has a positive tail. 
If we consider the distributions that you have been introduced to, Lognormal, Gumbel or Exponential would be a possibility</li>\n", + "\n", + "- $T$ presents a higher variance but also a much higher mean. Then, if we compute the coefficient of variation to standardize that variability, we obtain $CV(H)=0.664/1.296 = 0.512$ and $CV(T)= 4.710/6.861 = 0.686$. Thus, $T$ has higher variability than $H$.</li>\n", + "- Both $H$ and $T$ have a positive non-zero skewness, with that of $H$ being significantly higher. Thus, the data presents a right tail and mode < median < mean. An appropriate distribution for $H$ and $T$ would be one which: (1) is bounded at 0 (no negative values of $H$ or $T$ are physically possible), and (2) has a positive tail. If we consider the distributions that you have been introduced to, Lognormal, Gumbel or Exponential would be possibilities.</li>\n", "</div>\n", "</div>" ] @@ -215,9 +217,10 @@ "id": "b20641d9", "metadata": {}, "source": [ - "<div style=\"background-color:#AABAB2; color: black; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px\">\n", + "<div style=\"background-color:#AABAB2; color: black; width:95%; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px\">\n", "<p>\n", "<b>Task 2:</b> \n", + "\n", "Define a function to compute the empirical CDF.\n", "</p>\n", "</div>" @@ -284,9 +287,10 @@ "id": "bfadcf3f-4578-4809-acdb-625ab3a71f27" }, "source": [ - "<div style=\"background-color:#AABAB2; color: black; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px\">\n", + "<div style=\"background-color:#AABAB2; color: black; width:95%; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px\">\n", "<p>\n", "<b>Task 3:</b> \n", + "\n", "Based on the results of Task 1 and the empirical PDF and CDF, select <b>one</b> distribution to fit to each variable. 
For $H$, select between Gumbel or Gaussian distribution, while for $C$ choose between Uniform or Lognormal.\n", "</p>\n", "</div>" @@ -296,10 +300,11 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "<div style=\"background-color:#FAE99E; color: black; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px\">\n", + "<div style=\"background-color:#FAE99E; color: black; width:95%; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px\">\n", " <b>Solution 3:</b>\n", - " $H$: Gaussian\n", - " $C$: Uniform</li>\n", + "\n", + "$H$: Gaussian\n", + "$C$: Uniform\n", "</div>\n", "</div>" ] @@ -317,9 +322,10 @@ "id": "6a24dbfa", "metadata": {}, "source": [ - "<div style=\"background-color:#AABAB2; color: black; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px\">\n", + "<div style=\"background-color:#AABAB2; color: black; width:95%; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px\">\n", "<p>\n", "<b>Task 4:</b> \n", + "\n", "Fit the selected distributions to the observations using MLE.\n", "</p>\n", "</div>\n", @@ -351,12 +357,13 @@ "id": "30f5112c", "metadata": {}, "source": [ - "<div style=\"background-color:#AABAB2; color: black; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px\">\n", + "<div style=\"background-color:#AABAB2; color: black; width:95%; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px\">\n", "<p>\n", "<b>Task 5:</b> \n", + "\n", "Assess the goodness of fit of the selected distribution using:\n", - " <li> One graphical method: QQplot or Logscale. Choose one.</li>\n", - " <li> Kolmogorov-Smirnov test.</li>\n", + "- One graphical method: QQplot or Logscale. 
Choose one.</li>\n", + "- Kolmogorov-Smirnov test.</li>\n", "</p>\n", "</div>\n", "\n", @@ -484,9 +491,10 @@ "id": "ad185d88", "metadata": {}, "source": [ - "<div style=\"background-color:#AABAB2; color: black; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px\">\n", + "<div style=\"background-color:#AABAB2; color: black; width:95%; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px\">\n", "<p>\n", "<b>Task 6:</b> \n", + "\n", "Interpret the results of the GOF techniques. How does the selected parametric distribution perform?\n", "</p>\n", "</div>" @@ -496,11 +504,12 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "<div style=\"background-color:#FAE99E; color: black; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px\">\n", + "<div style=\"background-color:#FAE99E; color: black; width:95%; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px\">\n", " <b>Solution 6: TO UPDATE</b>\n", - " <li> Logscale plot: This technique allows to visually assess the fitting of the parametric distribution to the tail of the empirical distribution. For $H$, Exponential distribution performs well for low values. On the contrary, it does not properly model the right tail. It is on the safe side providing predictions higher than those observed. Note that this may lead to predictions of $H$ that are not physically possible. Regarding $T$, the Gumbel distribution seems to follow the low observations and those around the central moments but not those on the right tail. The predictions provided by the Gumbel distribution are on the safe side. </li>\n", - " <li> QQplot: Similar conclusions to those for Logscale can be derived.</li>\n", - " <li> Kolmogorov-Smirnov test: remember that the null hypothesis of this test is that the samples follow the parametric distribution. Therefore, the p-value represents the probability of the null hypothesis being true. 
If p-value is lower than the significance ($\\alpha=$0.05, for instance), the null hypothesis is rejected. Considering here $\\alpha=0.05$, we can reject that the variable $H$ comes from a Exponential distribution and that $T$ comes from a Gumbel distribution.</li>\n", + "\n", + "- Logscale plot: This technique allows us to visually assess the fit of the parametric distribution to the tail of the empirical distribution. For $H$, the Exponential distribution performs well for low values. However, it does not properly model the right tail. It is on the safe side, providing predictions higher than those observed. Note that this may lead to predictions of $H$ that are not physically possible. Regarding $T$, the Gumbel distribution seems to follow the low observations and those around the central moments but not those on the right tail. The predictions provided by the Gumbel distribution are on the safe side. </li>\n", + "- QQplot: Similar conclusions to those for Logscale can be derived.</li>\n", + "- Kolmogorov-Smirnov test: remember that the null hypothesis of this test is that the samples follow the parametric distribution. The p-value is the probability of obtaining a test statistic at least as extreme as the one observed, assuming that the null hypothesis is true. If the p-value is lower than the significance level ($\\alpha=0.05$, for instance), the null hypothesis is rejected. 
Considering here $\\alpha=0.05$, we can reject the hypotheses that $H$ comes from an Exponential distribution and that $T$ comes from a Gumbel distribution.</li>\n", "</div>\n", "</div>" ] @@ -520,7 +529,7 @@ "source": [ "Using the fitted distributions, we are going to propagate the uncertainty from $H$ and $C$ to the emissions of $CO_2$ **assuming that $H$ and $C$ are independent**.\n", "\n", - "<div style=\"background-color:#AABAB2; color: black; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px\">\n", + "<div style=\"background-color:#AABAB2; color: black; width:95%; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px\">\n", "<p>\n", "<b>Task 7:</b> \n", " \n", @@ -591,9 +600,10 @@ "id": "f0841e5c", "metadata": {}, "source": [ - "<div style=\"background-color:#AABAB2; color: black; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px\">\n", + "<div style=\"background-color:#AABAB2; color: black; width:95%; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px\">\n", "<p>\n", "<b>Task 8:</b> \n", + "\n", "Interpret the figures above, answering the following questions:\n", " <li>Are there differences between the two computed distributions for $F_h$?</li>\n", " <li>What are the advantages and disadvantages of using the simulations?</li>\n", @@ -605,10 +615,12 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "<div style=\"background-color:#FAE99E; color: black; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px\">\n", + "<div style=\"background-color:#FAE99E; color: black; width:95%; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px\">\n", " <b>Solution 8: TO UPDATE</b>\n", - " <li> In the PDF plot, we can see that the shape of the distribution is similar for $F_h$ although more density around the central moments is concentrated in the simualted data. 
In the CDF plot, we can see that there are significant differences in the tail of the distribution of $F_h$, being the values from the simulations higher than those computed from the observations. This is because both the Exponential and the Gumbel distribution overpredict the tail of the distributions of $H$ and $T$, respectively. </li>\n", - " <li> <b>Disadvantages:</b> we are assuming that $H$ and $T$ are independent (we will see how to address this issue next week). But is that true? Also, the results are conditioned to how good model is the selected parametric distribution. In this case, since the tail of the distributions of $H$ and $T$ are not properly fitted, the obtained distribution for $F_h$ deviates from the one obtained from the observations. <b>Advantages:</b> I can draw all the samples I want allowing the computation of events I have not observed yet (extreme events).<li> <b>Extra note:</b> The equation you are applying to compute $F_h$ is prepared for extreme waves. Thus, when applied out its range of application, it leads to negative forces which do not have physical meaning.</li>\n", + "\n", + "- In the PDF plot, we can see that the shape of the distribution is similar for $F_h$, although more density is concentrated around the central moments in the simulated data. In the CDF plot, we can see that there are significant differences in the tail of the distribution of $F_h$, with the values from the simulations being higher than those computed from the observations. This is because the Exponential and the Gumbel distributions overpredict the tails of the distributions of $H$ and $T$, respectively. </li>\n", + "- <b>Disadvantages:</b> we are assuming that $H$ and $T$ are independent (we will see how to address this issue next week). But is that true? Also, the results are conditioned on how good a model the selected parametric distribution is. 
In this case, since the tails of the distributions of $H$ and $T$ are not properly fitted, the obtained distribution for $F_h$ deviates from the one obtained from the observations. <b>Advantages:</b> we can draw as many samples as we want, which allows the computation of events that have not been observed yet (extreme events).\n", + "- <b>Extra note:</b> The equation you are applying to compute $F_h$ is intended for extreme waves. Thus, when applied outside its range of application, it leads to negative forces, which do not have physical meaning.</li>\n", "</div>\n", "</div>" ] @@ -619,7 +631,7 @@ "source": [ "If you run the code in the cell below, you will obtain a scatter plot of both variables. Explore the relationship between both variables and answer the following questions:\n", "\n", - "<div style=\"background-color:#AABAB2; color: black; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px\">\n", + "<div style=\"background-color:#AABAB2; color: black; width:95%; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px\">\n", "<p>\n", "<b>Task 9:</b> \n", " \n", @@ -684,10 +696,12 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "<div style=\"background-color:#FAE99E; color: black; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px\">\n", + "<div style=\"background-color:#FAE99E; color: black; width:95%; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px\">\n", " <b>Solution 9: TO UPDATE</b>\n", - " <li> The observations are focussed in an area of the plot while the simulations are spreaded all around. This is because the observations are dependent to each other, there is a physical relationship between the number of cars and the number of trucks, while the simualtions were assumed to be independent. Moreover, negative numbers for the number of vehicles are sampled, which do not have a physical meaning. 
</li>\n", - " <li> There is a correlation of 0.46 between the observed $H$ and $T$, indicating the physical dependence between the variables. On the contrary, no significant correlation between the generated samples is observed.</li> <li><b>Some suggestions:</b> Improve the fit of $H$ and $T$. Maybe propose Gumbel or Lognormal distribution for $H$ and Lognormal for $T$. Account for the dependence between the two variables. </li>\n", + "\n", + "- The observations are concentrated in one area of the plot, while the simulations are spread all around. This is because the observations are dependent on each other: there is a physical relationship between the number of cars and the number of trucks, whereas the simulations were assumed to be independent. Moreover, negative values for the number of vehicles are sampled, which have no physical meaning. </li>\n", + "- There is a correlation of 0.46 between the observed $H$ and $T$, indicating the physical dependence between the variables. In contrast, no significant correlation is observed between the generated samples.</li>\n", + "- <b>Some suggestions:</b> Improve the fit of $H$ and $T$, for example by proposing a Gumbel or Lognormal distribution for $H$ and a Lognormal distribution for $T$. Account for the dependence between the two variables. </li>\n", "</div>" ] },