Merge branch '5-continuous-distribution-outline' into 'main'

Resolve "continuous distribution outline"

Closes #5

See merge request !16

9fe5e28a · Robert Lanzafame · abfb2404 · 7e434e0e · 9fe5e28a · 9fe5e28a
Commit 9fe5e28a authored 1 year ago by Robert Lanzafame
--- a/book/sandbox/prob/prob-notation.md
+++ b/book/sandbox/prob/prob-notation.md
+# Probability Notation
+
+Maybe we should have a notation section in the main chapters?
\ No newline at end of file
--- a/book/sandbox/prob/prob-rv.md
+++ b/book/sandbox/prob/prob-rv.md
+# Random variables
+
+From a previous course on probability and statistics you probably remember the use of Venn diagrams to aid our understanding of arithmetic involving probability. The box represented the sample space, with circles (and their relative size) representing various events and their probability of occurrence. The concept of a random variable is nothing more than a mapping of this sample space to a number line, which allows us to combine probability theory with calculus. We use a capital letter to denote a random variable and realizations of that random variable are described with a lower case letter. For example, consider a discrete random variable, $X$, which can take 1, 2 or 3 as possible outcomes. We can write the probability of each event mathematically as:
+
+$$
+p_X(x_i)=P(X=x_i)
+$$
+
+Here $i = 1,2,3$, and $P(\cdot)$ denotes the probability of the event in parentheses. The function $p_X$ gives the probability of every possible outcome of $X$ (i.e., the whole sample space), so, as implied by the axioms, these probabilities must sum to 1.
+
+This simple example is a discrete random variable because $X$ can take only a finite (or, more generally, countable) number of values. For most of this course, however, we will work with continuous random variables, which can take any value within an interval and therefore have uncountably many possible outcomes. A key characteristic of continuous random variables is that the probability of any single value is zero, so the probability of an event must be defined over an interval. For example, we might be interested in the probability that $X$ takes a value between 3 and 4, which we will see in the next sections.
+
+The mapping of a sample space to a number line, combined with a (mathematical) specification of probability, describes how probability is distributed across all events in the sample space. For this reason we use the term *probability distribution* to describe the mathematical functions defining probability for outcomes of a random variable, regardless of whether it is discrete or continuous.
\ No newline at end of file
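
Note on prob-rv.md: the distinction the added text draws between discrete and continuous random variables can be illustrated with a few lines of Python. This is only a sketch under stated assumptions: the PMF values and the normal distribution (mean 5, standard deviation 2) are made up for illustration and do not come from the book.

```python
import numpy as np
from scipy.stats import norm

# Discrete case: a hypothetical PMF for X with outcomes 1, 2, 3
# (the probabilities are invented for illustration).
outcomes = np.array([1, 2, 3])
pmf = np.array([0.2, 0.5, 0.3])
total_probability = pmf.sum()  # the axioms require this to equal 1

# Continuous case: P(3 < X <= 4) is the difference of the CDF at the
# interval ends. The normal distribution used here is an assumption.
X = norm(loc=5, scale=2)
p_interval = X.cdf(4) - X.cdf(3)

# A single point is an interval of zero width, so its probability is zero.
p_point = X.cdf(3) - X.cdf(3)

print(total_probability, p_interval, p_point)
```

The same pattern works for any `scipy.stats` continuous distribution, since each one exposes a `cdf` method.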
--- a/code/pd/GOF_plots.py
+++ b/code/pd/GOF_plots.py
+#!/usr/bin/env python3
+# -*- coding: utf-8 -*-
+"""
+Created on Wed Jun 21 13:36:10 2023
+
+@author: pmaresnasarre
+"""
+
+import numpy as np
+import pandas as pd
+import matplotlib.pyplot as plt
+from scipy.stats import expon
+from scipy.stats import norm
+from scipy.stats import kstest
+
+#%% Functions
+
+def ecdf(var):
+    x = np.sort(var) # sort the values from small to large
+    n = x.size # determine the number of datapoints
+    y = np.arange(1, n+1) / n
+    return [y, x]
+
+#%% Plots
+
+#overview
+
+obs = [19.54, 9.12, 11.89, -0.29, 2.65, 3.63, 10.49, 3.61, 8.50, -5.25, 3.23, 
+       0.88, -2.88, 7.53, 6.40, 5.16, -1.66, 10.63, 6.75, 3.50, 12.32, 32.67, 17.21]
+
+plt.rcParams.update({'font.size': 12})
+
+fig, ax = plt.subplots(1,2, figsize=(10,5), layout='constrained')
+
+ax[0].hist(obs, density=True, edgecolor = 'cornflowerblue', 
+           facecolor = 'lightskyblue', alpha = 0.6, bins = 8)
+ax[0].set_xlabel('x')
+ax[0].set_ylabel('pdf')
+ax[0].grid()
+
+ax[1].step(ecdf(obs)[1], ecdf(obs)[0], color = 'cornflowerblue')
+ax[1].set_xlabel('x')
+ax[1].set_ylabel(r'${P[X \leq x]}$')  # raw string: '\l' is not a valid escape sequence
+ax[1].grid()
+
+#normal distribution
+q_norm=norm.ppf(ecdf(obs)[0], loc=5.17, scale=5.76)
+
+#exponential distribution
+q_expon=expon.ppf(ecdf(obs)[0], loc=-5.25, scale=10.42)
+
+#QQplot
+fig, ax=plt.subplots(figsize=(6,6))
+ax.scatter(ecdf(obs)[1],q_norm, 40, 'cornflowerblue', label='N(5.17, 5.76)')
+ax.scatter(ecdf(obs)[1],q_expon, 40, edgecolor = 'k', facecolor='w', label='Expon(-5.25, 10.42)')
+ax.plot([-10,30], [-10,30], 'k')
+ax.set_ylabel('Theoretical quantiles')
+ax.set_xlabel('Empirical quantiles')
+ax.grid()
+ax.set_ylim([-10,30])
+ax.set_xlim([-10,30])
+ax.legend()
+
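+#theoretical quantiles at the probabilities in x (plotted as x-values against x, they trace the fitted CDFs)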
+x = np.linspace(0.01, 0.99, 50)
+cdf_norm=norm.ppf(x, loc=5.17, scale=5.76)
+cdf_expon=expon.ppf(x, loc=-5.25, scale=10.42)
+
+#Log-scale
+fig, ax = plt.subplots(1,2, figsize=(10,5), layout='constrained')
+
+ax[0].plot(cdf_norm, 1-x,'cornflowerblue', label='N(5.17, 5.76)')
+ax[0].plot(cdf_expon, 1-x, 'k', label='Expon(-5.25, 10.42)')
+ax[0].scatter(ecdf(obs)[1], 1-ecdf(obs)[0], 40, 'r', label = 'Observations')
+ax[0].grid()
+ax[0].legend()
+ax[0].set_xlabel('x')
+ax[0].set_ylabel('${P[X > x]}$')
+
+ax[1].plot(cdf_norm, 1-x,'cornflowerblue', label='N(5.17, 5.76)')
+ax[1].set_yscale('log')
+ax[1].plot(cdf_expon, 1-x, 'k', label='Expon(-5.25, 10.42)')
+ax[1].scatter(ecdf(obs)[1], 1-ecdf(obs)[0], 40, 'r', label = 'Observations')
+ax[1].grid()
+ax[1].legend()
+ax[1].set_xlabel('x')
+ax[1].set_ylabel('${P[X > x]}$')
+
+#KS test
+
+fig, ax = plt.subplots(1,1, figsize=(6,5), layout='constrained')
+
+ax.step(ecdf(obs)[1], ecdf(obs)[0], color = 'cornflowerblue')
+ax.plot(cdf_norm, x, color = 'k')
+ax.set_xlabel('x')
+ax.set_ylabel(r'${P[X \leq x]}$')  # raw string: '\l' is not a valid escape sequence
+ax.grid()
+ax.annotate('', xy=(3, 0.39), xytext=(2.94, 0.49), arrowprops=dict(arrowstyle='<->', ec = 'r'))
+ax.annotate('Maximum distance', xy=(3, 0.39), xytext=(-8.75, 0.42), fontsize = 10, c ='r')
+
+
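+# one-sample KS test against the fitted normal distribution; passing the cdf_norm
+# array would instead run a two-sample test against those 50 quantiles
+stat, pvalue = kstest(obs, norm(loc=5.17, scale=5.76).cdf)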
+
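Note on GOF_plots.py: the "Maximum distance" annotated in the KS plot is the one-sample Kolmogorov-Smirnov statistic. As a rough sketch, it can be computed by hand from the sorted observations and checked against `scipy.stats.kstest`. The observations and the N(5.17, 5.76) parameters are taken from the script; the variable names `d_plus`, `d_minus` and `d_manual` are introduced here for illustration only.

```python
import numpy as np
from scipy.stats import norm, kstest

obs = np.array([19.54, 9.12, 11.89, -0.29, 2.65, 3.63, 10.49, 3.61, 8.50,
                -5.25, 3.23, 0.88, -2.88, 7.53, 6.40, 5.16, -1.66, 10.63,
                6.75, 3.50, 12.32, 32.67, 17.21])

x = np.sort(obs)
n = x.size
F = norm.cdf(x, loc=5.17, scale=5.76)  # fitted CDF at each observation

# The ECDF jumps from (i-1)/n to i/n at the i-th sorted observation,
# so the gap to the fitted CDF is checked on both sides of each jump.
d_plus = np.max(np.arange(1, n + 1) / n - F)
d_minus = np.max(F - np.arange(0, n) / n)
d_manual = max(d_plus, d_minus)

# One-sample test: the second argument is a callable CDF, not an array.
result = kstest(obs, norm(loc=5.17, scale=5.76).cdf)

print(d_manual, result.statistic, result.pvalue)
```

Passing a callable CDF (or a distribution name with `args`) is what makes `kstest` run the one-sample test; an array as second argument is treated as a second sample.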
--- a/code/pd/ecdf_epdf_plots.py
+++ b/code/pd/ecdf_epdf_plots.py
+#!/usr/bin/env python3
+# -*- coding: utf-8 -*-
+"""
+Created on Fri Jun 30 14:21:12 2023
+
+@author: pmaresnasarre
+"""
+
+import numpy as np
+import pandas as pd
+import matplotlib.pyplot as plt
+import datetime
+from math import ceil, trunc
+
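+data = pd.read_csv('Wind_Speed_1974_1975.csv', sep = ',')
+# replace the whole column so it gets a datetime dtype (assigning through .iloc may keep the object dtype)
+data[data.columns[0]] = pd.to_datetime(data.iloc[:,0])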
+
+plt.rcParams.update({'font.size': 12})
+
+#%% Timeseries plot
+
+fig, axs=plt.subplots(1, 1)
+axs.plot(data.iloc[:,0], data.iloc[:,2], 'k')
+axs.set_xlim([datetime.datetime(1974, 9, 1, 0, 0), datetime.datetime(1975, 9, 1, 0, 0)])
+axs.grid()
+axs.set_xlabel('Date')
+axs.set_ylabel('${W_s}$ (m/s)')
+fig.set_size_inches(12, 5)
+fig.savefig('data_overview.png')
+
+
+#%% ecdf
+def ecdf(var):
+    x = np.sort(var) # sort the values from small to large
+    n = x.size # determine the number of datapoints
+    y = np.arange(1, n+1) / n
+    return [y, x]
+
+ecdf_wind = ecdf(data.iloc[:,2])
+
+fig, ax = plt.subplots(1,1, figsize=(7,5), layout='constrained')
+ax.step(ecdf_wind[1], ecdf_wind[0], color = 'cornflowerblue')
+ax.set_xlabel('${W_s}$ (m/s)')
+ax.set_ylabel(r'${P[X \leq x]}$')  # raw string: '\l' is not a valid escape sequence
+ax.grid()
+
+fig.savefig('ecdf_wind.png')
+
+#%% epdf
+obs = data.iloc[:,2]
+
+bin_size = 2
+
+min_value = data.iloc[:,2].min()
+max_value = data.iloc[:,2].max()
+
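+# build edges with a constant width of exactly bin_size, so that freq/bin_size is a valid density
+lower_edge = np.floor(min_value)
+n_bins = ceil((max_value - lower_edge)/bin_size)
+bin_edges = lower_edge + bin_size*np.arange(n_bins+1)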
+
+count = []
+for i in range(len(bin_edges)-1):
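+    # bins are half-open [left, right) so a value on an edge is counted exactly once;
+    # the last bin also includes its right edge
+    below_right = (obs<=bin_edges[i+1]) if i == len(bin_edges)-2 else (obs<bin_edges[i+1])
+    n = len(obs[below_right & (obs>=bin_edges[i])])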
+    count.append(n)
+    
+freq = [k/len(obs) for k in count]
+densities = [k/bin_size for k in freq]
+
+fig, ax = plt.subplots(1,1, figsize=(7,5), layout='constrained')
+ax.bar((bin_edges[1:] + bin_edges[:-1]) * .5, densities, width=(bin_edges[1] - bin_edges[0]), 
+       edgecolor = 'cornflowerblue', color = 'lightskyblue', alpha = 0.6)
+ax.set_xlabel('${W_s}$ (m/s)')
+ax.set_ylabel('pdf')
+ax.grid()
+
+fig.savefig('epdf_wind.png')
+
+
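Note on ecdf_epdf_plots.py: the manual empirical-PDF computation can be cross-checked against `np.histogram`. The sketch below assumes a synthetic Weibull sample as a stand-in for the wind-speed column, since the CSV file is not part of this diff; the names `wind`, `edges` and `area` are illustrative only.

```python
import numpy as np

# Synthetic stand-in for the wind-speed data (an assumption, not the real series).
rng = np.random.default_rng(0)
wind = 8.0 * rng.weibull(2.0, size=1000)

bin_size = 2

# Edges with a constant width of exactly bin_size that cover all the data.
lower = np.floor(wind.min())
n_bins = int(np.ceil((wind.max() - lower) / bin_size))
edges = lower + bin_size * np.arange(n_bins + 1)

# Counts -> relative frequencies -> densities (frequency per unit width).
counts, _ = np.histogram(wind, bins=edges)
densities = counts / (wind.size * bin_size)

# A valid empirical PDF has unit area under the bars.
area = np.sum(densities * np.diff(edges))

# NumPy's built-in normalisation should give the same densities.
density_np, _ = np.histogram(wind, bins=edges, density=True)

print(area, np.allclose(densities, density_np))
```

`np.histogram` treats every bin as half-open except the last, which also includes its right edge, so no observation is dropped or counted twice.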