Commit 9fe5e28a authored by Robert Lanzafame

Merge branch '5-continuous-distribution-outline' into 'main'

Resolve "continuous distribution outline"

Closes #5

See merge request !16
parents abfb2404 7e434e0e
Pipeline #204001 passed
# Probability Notation
Maybe we should have a notation section in the main chapters?
\ No newline at end of file
# Random variables
From a previous course on probability and statistics you probably remember the use of Venn diagrams to aid our understanding of arithmetic involving probability. The box represented the sample space, with circles (and their relative size) representing various events and their probability of occurrence. The concept of a random variable is nothing more than a mapping of this sample space to a number line, which allows us to combine probability theory with calculus. We use a capital letter to denote a random variable and realizations of that random variable are described with a lower case letter. For example, consider a discrete random variable, $X$, which can take 1, 2 or 3 as possible outcomes. We can write the probability of each event mathematically as:
$$
p_X(x_i)=P(X=x_i)
$$
where $i = 1, 2, 3$ and $P(\cdot)$ denotes the probability of the event in the parentheses. The function $p_X$, known as the probability mass function, gives the probability of every outcome of $X$ (i.e., of the whole sample space); as implied by the axioms of probability, these probabilities sum to 1.
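As a minimal sketch of this definition, the snippet below stores a probability mass function for $X$ in a dictionary and checks that it sums to 1. The probabilities 0.2, 0.5 and 0.3 are illustrative assumptions, not values from the text, and the name `p_X` is chosen only to mirror the notation above.

```python
import math

# Hypothetical PMF for a discrete random variable X with outcomes 1, 2 and 3.
# The probabilities are illustrative assumptions, not values from the text.
pmf = {1: 0.2, 2: 0.5, 3: 0.3}


def p_X(x):
    """Return P(X = x); values outside the sample space get probability 0."""
    return pmf.get(x, 0.0)


# Sum the probabilities over the whole sample space.
total = sum(p_X(x) for x in pmf)

print(p_X(2))                    # probability of the outcome x = 2
print(math.isclose(total, 1.0))  # True when the PMF sums to 1
```

Comparing with `math.isclose` rather than `==` avoids false alarms from floating-point rounding when the probabilities are summed.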
In this simple example $X$ is a discrete random variable, because it can take only a finite (or, more generally, countable) number of values. For most of this course, however, we will work with continuous random variables, which can take any value within an interval of the number line and therefore have uncountably many possible values. A key characteristic of continuous random variables is that probability must be defined over an interval rather than at a single point. For example, we might be interested in the probability that $X$ takes a value between 3 and 4, which we will see in the next sections.
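The interval idea can be sketched numerically. The example below assumes, purely for illustration, that $X$ follows a normal distribution with mean 5.17 and standard deviation 5.76 (the values used in the plotting script that accompanies this commit). It computes the probability that $X$ lies between 3 and 4 as a difference of two cumulative probabilities, using `scipy.stats.norm`.

```python
from scipy.stats import norm

# Assumed distribution for illustration only: X ~ N(5.17, 5.76).
X = norm(loc=5.17, scale=5.76)

# Probability over an interval: P(3 < X <= 4) = F(4) - F(3),
# where F is the cumulative distribution function of X.
p_interval = X.cdf(4) - X.cdf(3)

# An "interval" of zero width carries zero probability for a continuous variable.
p_point = X.cdf(3) - X.cdf(3)

print(p_interval)
print(p_point)
```

The second result is exactly zero, which is why probability for a continuous random variable is stated over intervals and not at single points.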
The mapping of a sample space to a number line, combined with a (mathematical) specification of probability, describes how probability is distributed across all events in the sample space. For this reason we use the term *probability distribution* to describe the mathematical functions defining probability for outcomes of a random variable, regardless of whether it is discrete or continuous.
\ No newline at end of file
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Wed Jun 21 13:36:10 2023
@author: pmaresnasarre
"""
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import expon
from scipy.stats import norm
from scipy.stats import kstest
#%% Functions
def ecdf(var):
    x = np.sort(var) # sort the values from small to large
    n = x.size # determine the number of datapoints
    y = np.arange(1, n+1) / n # empirical non-exceedance probabilities
    return [y, x]
#%% Plots
#overview
obs = [19.54, 9.12, 11.89, -0.29, 2.65, 3.63, 10.49, 3.61, 8.50, -5.25, 3.23,
0.88, -2.88, 7.53, 6.40, 5.16, -1.66, 10.63, 6.75, 3.50, 12.32, 32.67, 17.21]
plt.rcParams.update({'font.size': 12})
fig, ax = plt.subplots(1,2, figsize=(10,5), layout='constrained')
ax[0].hist(obs, density=True, edgecolor = 'cornflowerblue',
facecolor = 'lightskyblue', alpha = 0.6, bins = 8)
ax[0].set_xlabel('x')
ax[0].set_ylabel('pdf')
ax[0].grid()
ax[1].step(ecdf(obs)[1], ecdf(obs)[0], color = 'cornflowerblue')
ax[1].set_xlabel('x')
ax[1].set_ylabel(r'${P[X \leq x]}$')
ax[1].grid()
#normal distribution
q_norm=norm.ppf(ecdf(obs)[0], loc=5.17, scale=5.76)
#exponential distribution
q_expon=expon.ppf(ecdf(obs)[0], loc=-5.25, scale=10.42)
#QQplot
fig, ax=plt.subplots(figsize=(6,6))
ax.scatter(ecdf(obs)[1],q_norm, 40, 'cornflowerblue', label='N(5.17, 5.76)')
ax.scatter(ecdf(obs)[1],q_expon, 40, edgecolor = 'k', facecolor='w', label='Expon(-5.25, 10.42)')
ax.plot([-10,30], [-10,30], 'k')
ax.set_ylabel('Theoretical quantiles')
ax.set_xlabel('Empirical quantiles')
ax.grid()
ax.set_ylim([-10,30])
ax.set_xlim([-10,30])
ax.legend()
#cdfs
x = np.linspace(0.01, 0.99, 50)
cdf_norm=norm.ppf(x, loc=5.17, scale=5.76)
cdf_expon=expon.ppf(x, loc=-5.25, scale=10.42)
#Log-scale
fig, ax = plt.subplots(1,2, figsize=(10,5), layout='constrained')
ax[0].plot(cdf_norm, 1-x,'cornflowerblue', label='N(5.17, 5.76)')
ax[0].plot(cdf_expon, 1-x, 'k', label='Expon(-5.25, 10.42)')
ax[0].scatter(ecdf(obs)[1], 1-ecdf(obs)[0], 40, 'r', label = 'Observations')
ax[0].grid()
ax[0].legend()
ax[0].set_xlabel('x')
ax[0].set_ylabel('${P[X > x]}$')
ax[1].plot(cdf_norm, 1-x,'cornflowerblue', label='N(5.17, 5.76)')
ax[1].set_yscale('log')
ax[1].plot(cdf_expon, 1-x, 'k', label='Expon(-5.25, 10.42)')
ax[1].scatter(ecdf(obs)[1], 1-ecdf(obs)[0], 40, 'r', label = 'Observations')
ax[1].grid()
ax[1].legend()
ax[1].set_xlabel('x')
ax[1].set_ylabel('${P[X > x]}$')
#KS test
fig, ax = plt.subplots(1,1, figsize=(6,5), layout='constrained')
ax.step(ecdf(obs)[1], ecdf(obs)[0], color = 'cornflowerblue')
ax.plot(cdf_norm, x, color = 'k')
ax.set_xlabel('x')
ax.set_ylabel(r'${P[X \leq x]}$')
ax.grid()
ax.annotate('', xy=(3, 0.39), xytext=(2.94, 0.49), arrowprops=dict(arrowstyle='<->', ec = 'r'))
ax.annotate('Maximum distance', xy=(3, 0.39), xytext=(-8.75, 0.42), fontsize = 10, c ='r')
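# one-sample KS test against N(5.17, 5.76); passing the cdf_norm array would run a two-sample test instead
stat, pvalue = kstest(obs, 'norm', args=(5.17, 5.76))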
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Fri Jun 30 14:21:12 2023
@author: pmaresnasarre
"""
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import datetime
from math import ceil, trunc
data = pd.read_csv('Wind_Speed_1974_1975.csv', sep = ',')
# replace the whole column so that it takes a datetime dtype
data[data.columns[0]] = pd.to_datetime(data.iloc[:,0])
plt.rcParams.update({'font.size': 12})
#%% Timeseries plot
fig, axs=plt.subplots(1, 1)
axs.plot(data.iloc[:,0], data.iloc[:,2], 'k')
axs.set_xlim([datetime.datetime(1974, 9, 1, 0, 0), datetime.datetime(1975, 9, 1, 0, 0)])
axs.grid()
axs.set_xlabel('Date')
axs.set_ylabel('${W_s}$ (m/s)')
fig.set_size_inches(12, 5)
fig.savefig('data_overview.png')
#%% ecdf
def ecdf(var):
    x = np.sort(var) # sort the values from small to large
    n = x.size # determine the number of datapoints
    y = np.arange(1, n+1) / n # empirical non-exceedance probabilities
    return [y, x]
ecdf_wind = ecdf(data.iloc[:,2])
fig, ax = plt.subplots(1,1, figsize=(7,5), layout='constrained')
ax.step(ecdf_wind[1], ecdf_wind[0], color = 'cornflowerblue')
ax.set_xlabel('${W_s}$ (m/s)')
ax.set_ylabel(r'${P[X \leq x]}$')
ax.grid()
fig.savefig('ecdf_wind.png')
#%% epdf
obs = data.iloc[:,2]
bin_size = 2
min_value = data.iloc[:,2].min()
max_value = data.iloc[:,2].max()
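n_bins = ceil((max_value-min_value)/bin_size)
bin_edges = np.linspace(trunc(min_value), ceil(max_value), n_bins+1)
bin_width = bin_edges[1] - bin_edges[0] # actual bin width, which can differ from bin_size
count = []
for i in range(len(bin_edges)-1):
    # left-closed bins [edge_i, edge_i+1), so observations on an edge are counted
    in_bin = (obs >= bin_edges[i]) & (obs < bin_edges[i+1])
    if i == len(bin_edges)-2:
        in_bin |= (obs == bin_edges[-1]) # the last bin also includes its right edge
    count.append(in_bin.sum())
freq = [k/len(obs) for k in count]
densities = [k/bin_width for k in freq] # divide by the actual width so the bars integrate to 1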
fig, ax = plt.subplots(1,1, figsize=(7,5), layout='constrained')
ax.bar((bin_edges[1:] + bin_edges[:-1]) * .5, densities, width=(bin_edges[1] - bin_edges[0]),
edgecolor = 'cornflowerblue', color = 'lightskyblue', alpha = 0.6)
ax.set_xlabel('${W_s}$ (m/s)')
ax.set_ylabel('pdf')
ax.grid()
fig.savefig('epdf_wind.png')