Fitting Theoretical Distributions to Data¶
Faisal Qureshi
faisal.qureshi@ontariotechu.ca
http://www.vclab.ca
Copyright information¶
© Faisal Qureshi
License¶
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
import numpy as np
import matplotlib.pyplot as plt
Fitting a Gaussian distribution¶
from scipy.stats import norm
Generating samples¶
true_mean = 1
true_std = 1  # norm's scale parameter is the standard deviation, not the variance
true_var = true_std**2
samples = norm.rvs(loc=true_mean, scale=true_std, size=15000)
plt.figure(figsize=(5,5))
plt.title('Samples')
plt.hist(samples, alpha=.8, color='magenta', density=True);
Fitting a Gaussian distribution¶
params = norm.fit(samples)  # maximum-likelihood estimates of (loc, scale)
fitted_mean = params[0]
fitted_std = params[1]  # scale is the standard deviation, not the variance
print(f'mean = {fitted_mean}, standard deviation = {fitted_std}')
mean = 1.020268336495857, standard deviation = 1.0078619653217207
Visualization¶
x = np.linspace(-5, 5, 100)
plt.figure()
plt.title('Fitting a Normal distribution')
plt.plot(x, norm.pdf(x, loc=true_mean, scale=np.sqrt(true_var)), 'r-', label='True Normal')
plt.hist(samples, alpha=.3, density=True, color='magenta', label='Samples')
plt.plot(x, norm.pdf(x, loc=fitted_mean, scale=fitted_std), 'b-', label='Fitted Normal')
plt.xlabel('Value')
plt.ylabel('Probability density')
plt.legend()
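For a Gaussian, the maximum-likelihood estimates have a closed form: the sample mean and the standard deviation computed with a divisor of n (`ddof=0`). The sketch below checks that `norm.fit` agrees with these formulas. The seed, the variable names, and the sample size are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
from scipy.stats import norm

# Illustrative data; the seed and sample size are arbitrary choices.
rng = np.random.default_rng(0)
data = rng.normal(loc=1.0, scale=1.0, size=15000)

# norm.fit returns (loc, scale) = (mean, standard deviation).
fit_mean, fit_std = norm.fit(data)

# Closed-form maximum-likelihood estimates for comparison.
mle_mean = data.mean()
mle_std = data.std(ddof=0)  # divides by n, not n - 1

print(f'fit:    mean = {fit_mean:.6f}, std = {fit_std:.6f}')
print(f'closed: mean = {mle_mean:.6f}, std = {mle_std:.6f}')
```

If a variance is needed, square the fitted scale rather than reading it off directly.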
Fitting a Poisson distribution¶
The Poisson distribution is a discrete probability distribution that describes the number of events occurring in a fixed interval of time or space, given a known average rate of occurrence and assuming that the events occur independently of the time since the last event. It is named after the French mathematician Siméon Denis Poisson.
The Poisson distribution is defined by a single parameter, often denoted by $\lambda$, which represents the average rate of occurrence of the events. The probability mass function (PMF) of the Poisson distribution, which gives the probability of observing $k$ events in a fixed interval of time or space, is given by the formula:
$$ P(k; \lambda) = \frac{e^{-\lambda} \lambda^k}{k!} $$
Where:
- $ k $ is the number of events,
- $ \lambda $ is the average rate of occurrence,
- $ e $ is Euler's number (approximately equal to 2.71828), and
- $ k! $ represents the factorial of $ k $.
Key properties of the Poisson distribution include:
- It is defined for non-negative integer values of $ k $, representing counts.
- The mean and variance of a Poisson distribution are both equal to $ \lambda $.
- It is often used to model rare events or the number of occurrences of events in a fixed interval of time or space, such as the number of phone calls received by a call center in an hour, the number of arrivals at a service point within a certain time period, or the number of accidents at a particular intersection in a day.
The Poisson distribution is commonly used in various fields such as queueing theory, telecommunications, insurance, and reliability engineering to model random events that occur independently of one another with a known average rate.
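The PMF formula and the property that the mean and variance both equal $\lambda$ can be checked numerically. The following is a minimal sketch using only the standard library; the helper name `poisson_pmf_direct`, the choice $\lambda = 5$, and the truncation at $k = 60$ are assumptions made for illustration.

```python
import math

def poisson_pmf_direct(k, lam):
    """Evaluate P(k; lambda) = exp(-lambda) * lambda**k / k! directly."""
    return math.exp(-lam) * lam**k / math.factorial(k)

lam = 5.0
ks = range(0, 60)  # the tail beyond k = 60 is negligible for lambda = 5
probs = [poisson_pmf_direct(k, lam) for k in ks]

total = sum(probs)                                     # should be close to 1
mean = sum(k * p for k, p in zip(ks, probs))           # should be close to lambda
variance = sum((k - mean) ** 2 * p for k, p in zip(ks, probs))  # also close to lambda

print(f'total = {total:.6f}, mean = {mean:.6f}, variance = {variance:.6f}')
```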
from scipy.stats import poisson
Generating samples¶
true_lambda = 5
samples = poisson.rvs(mu=true_lambda, size=150) # samples generation
plt.figure(figsize=(5,5))
plt.title('Samples')
plt.xlabel('Number of events')
plt.hist(samples, alpha=.8, color='magenta', density=True);
Fitting a Poisson distribution¶
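from scipy.optimize import curve_fit
def poisson_pmf(k, lamb):
    return poisson.pmf(k, lamb)
# Least-squares fit of the PMF to the relative frequencies (counts divided by the sample size, so they sum to 1 like the PMF)
k_values = np.arange(0, max(samples)+1)
rel_freq = np.bincount(samples) / len(samples)
params, cov = curve_fit(poisson_pmf, k_values, rel_freq, p0=[np.mean(samples)])
fitted_lambda = params[0]
print(f'Fitted lambda = {fitted_lambda}')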
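Least squares on the histogram is one way to estimate $\lambda$. An alternative is the maximum-likelihood estimate, which for a Poisson distribution is simply the sample mean. The sketch below assumes freshly generated counts with an arbitrary seed; the names `counts` and `lambda_mle` are illustrative and do not appear in the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
true_lambda = 5
counts = rng.poisson(lam=true_lambda, size=150)

# Maximum-likelihood estimate of lambda: the sample mean.
lambda_mle = counts.mean()

# Approximate standard error, using Var = lambda for a Poisson variable.
std_error = np.sqrt(lambda_mle / len(counts))

print(f'MLE lambda = {lambda_mle:.3f} +/- {std_error:.3f}')
```

Unlike the curve fit, this estimate needs no binning and no starting value.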
Visualization¶
bins = np.arange(0, max(samples)+2)-0.5  # one bin centred on each integer from 0 to max(samples)
x = np.arange(0, max(samples)+1)
plt.plot(x, poisson_pmf(x, fitted_lambda), 'b-', label='Fitted Poisson')
plt.hist(samples, bins=bins, density=True, label='Samples', color='magenta')
plt.plot(x, poisson_pmf(x, true_lambda), 'r-', label='True Poisson')
plt.xlabel('Value')
plt.ylabel('Probability')
plt.title('Fitting a Poisson Distribution to Data')
plt.legend()
Fitting an exponential distribution¶
The exponential distribution is a continuous probability distribution that describes the time between events in a Poisson process, where events occur continuously and independently at a constant average rate. It is often used to model the time until the next occurrence of an event.
The exponential distribution is defined by a single parameter, often denoted by $\lambda$, which represents the rate parameter or the average number of events per unit time. The probability density function (PDF) of the exponential distribution is given by:
$$ f(x; \lambda) = \lambda e^{-\lambda x} $$
Where:
- $x$ is the random variable representing the time until the next event,
- $\lambda$ is the rate parameter.
The mean (expected value) of the exponential distribution is $\frac{1}{\lambda}$ and its variance is $\frac{1}{\lambda^2}$, so the standard deviation equals the mean.
The exponential distribution is commonly used in various fields such as reliability engineering, queueing theory, telecommunications, and survival analysis. In reliability engineering, it models the time until failure of a component, while in queueing theory, it represents the time between arrivals of customers at a service point.
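SciPy parametrizes the exponential distribution by `scale` rather than by the rate, with `scale` $= 1/\lambda$. The sketch below compares the PDF formula with `expon.pdf` under that convention; the rate $\lambda = 2$ and the evaluation grid are illustrative assumptions.

```python
import numpy as np
from scipy.stats import expon

lam = 2.0                      # illustrative rate parameter
x = np.linspace(0, 5, 11)      # points at which to compare the two densities

pdf_formula = lam * np.exp(-lam * x)        # f(x; lambda) = lambda * exp(-lambda * x)
pdf_scipy = expon.pdf(x, scale=1.0 / lam)   # SciPy uses scale = 1 / lambda

max_abs_diff = np.max(np.abs(pdf_formula - pdf_scipy))
mean_scipy = expon.mean(scale=1.0 / lam)    # equals 1 / lambda

print(f'max abs difference = {max_abs_diff:.2e}, mean = {mean_scipy}')
```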
Aside: Difference between Poisson distribution and Exponential distribution¶
Poisson and exponential distributions are both used to model the timing of events, but they serve different purposes. The Poisson distribution models the count of events, while the exponential distribution models the timing between events.
Poisson distribution¶
- Describes the number of events occurring in a fixed interval of time or space.
- Assumes events occur at a constant average rate and independently of each other.
- Discrete distribution: Values are non-negative integers
- Parameter $\lambda$ represents the average rate of occurrence of events.
- Example applications:
  - Number of phone calls received by a call center in an hour.
  - Number of arrivals at a service point within a certain time period.
Exponential distribution¶
- Describes the time between events in a Poisson process.
- Assumes events occur continuously and independently at a constant average rate.
- Continuous distribution: Values are non-negative real numbers.
- Parameter $\lambda$ represents the average number of events per unit time.
- Example applications:
  - Time until failure of a component in reliability engineering.
  - Time between arrivals of customers at a service point in queueing theory.
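The link between the two distributions can be illustrated by simulation: draw exponential gaps between events, accumulate them into arrival times, and count the events that fall in each unit interval. The counts should then behave like Poisson samples, with mean and variance both close to the rate. This is a rough sketch; the rate, seed, and number of events are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)   # arbitrary seed
rate = 5.0                       # events per unit time (illustrative)
n_events = 200_000

# Exponential gaps between consecutive events, then cumulative arrival times.
gaps = rng.exponential(scale=1.0 / rate, size=n_events)
arrival_times = np.cumsum(gaps)

# Count events in each complete unit interval [0, 1), [1, 2), ...
n_intervals = int(arrival_times[-1])
in_window = arrival_times[arrival_times < n_intervals]
counts = np.bincount(in_window.astype(int), minlength=n_intervals)

print(f'mean count = {counts.mean():.3f}, variance = {counts.var():.3f}')
```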
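from scipy.stats import rayleigh, expon
samples = rayleigh.rvs(loc=5, scale=2, size=150) # samples generation (drawn from a Rayleigh distribution, so the exponential is only an approximate model here)
params = expon.fit(samples) # distribution fitting; returns (loc, scale), where scale = 1/lambda
print(params)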
(5.104829420271813, 2.321443803081821)
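The tuple returned by `expon.fit` is `(loc, scale)`. Assuming SciPy's closed-form maximum-likelihood fit for `expon`, `loc` is the smallest sample and `scale` is the sample mean minus `loc`, so the implied rate is `1/scale`. The sketch below checks this on illustrative data; the shifted-exponential samples and the seed are assumptions, not the notebook's data.

```python
import numpy as np
from scipy.stats import expon

rng = np.random.default_rng(0)                      # arbitrary seed
data = 5.0 + rng.exponential(scale=2.0, size=150)   # shifted exponential, illustrative only

# Free fit: both the location and the scale are estimated.
loc, scale = expon.fit(data)
rate = 1.0 / scale                                  # implied rate parameter lambda

# Fit with the location pinned at zero: only the scale is estimated.
loc0, scale0 = expon.fit(data, floc=0)

print(f'loc = {loc:.4f} (min = {data.min():.4f})')
print(f'scale = {scale:.4f} (mean - min = {data.mean() - data.min():.4f})')
print(f'with floc=0: scale = {scale0:.4f} (mean = {data.mean():.4f})')
```

Pinning `floc=0` is the appropriate choice when the data are known to start at zero, as waiting times usually do.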
x = np.linspace(0, 20, 1000)
plt.figure()
plt.hist(samples, density=True, alpha=.3, color='magenta', label='Samples')
plt.plot(x, expon.pdf(x, loc=params[0], scale=params[1]), 'b-', label='Fitted Exponential')
plt.xlabel('Value')
plt.ylabel('Probability')
plt.title('Fitting an Exponential Distribution to Data')
plt.legend()
Fitting a triangular distribution¶
The triangular distribution is a continuous probability distribution with a triangular shape, bounded by a minimum, maximum, and a mode value within that range. It is often used when there is limited data available about the distribution of a variable.
In a triangular distribution, the probability density function (PDF) is triangular in shape and defined by three parameters:
- $a$: the minimum value (left boundary),
- $b$: the maximum value (right boundary), and
- $c$: the mode value (the peak of the distribution within the range $a$ to $b$).
The PDF of the triangular distribution is defined as follows:
$$ f(x;a, b, c) = \begin{cases} 0 & \text{for } x < a \\ \frac{2(x - a)}{(b - a)(c - a)} & \text{for } a \leq x < c \\ \frac{2(b - x)}{(b - a)(b - c)} & \text{for } c \leq x \leq b \\ 0 & \text{for } x > b \end{cases} $$
The mean, variance, and mode of the triangular distribution can be calculated using the following formulas:
$$ \text{Mean} = \frac{a + b + c}{3} $$
$$ \text{Variance} = \frac{a^2 + b^2 + c^2 - (ab + ac + bc)}{18} $$
$$ \text{Mode} = c $$
The triangular distribution is often used in simulation, risk analysis, and project management when there is uncertainty about the values of variables, but some information about their minimum, maximum, and most likely values is available. Note that values within the range are not equally likely: the density rises linearly from $a$ to the mode $c$ and then falls linearly to $b$.
If you have specific data and want to fit a triangular distribution to it or generate random variates from it, you can use libraries like SciPy in Python.
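SciPy's `triang` does not take $a$, $b$, and the mode directly. It uses `loc` $= a$, `scale` $= b - a$, and a shape parameter `c` that gives the relative position of the peak in $[0, 1]$. The sketch below converts between the two conventions and compares the mean and variance formulas with SciPy's values; the numbers chosen for `a`, `b`, and `mode` are illustrative assumptions.

```python
from scipy.stats import triang

a, b, mode = 2.0, 10.0, 4.0        # illustrative values only
shape = (mode - a) / (b - a)       # SciPy's c: relative position of the peak in [0, 1]
dist = triang(c=shape, loc=a, scale=b - a)

# Closed-form mean and variance of the triangular distribution.
mean_formula = (a + b + mode) / 3
var_formula = (a**2 + b**2 + mode**2 - a*b - a*mode - b*mode) / 18

print(f'mean: scipy = {dist.mean():.6f}, formula = {mean_formula:.6f}')
print(f'var:  scipy = {dist.var():.6f}, formula = {var_formula:.6f}')
print(f'peak density = {dist.pdf(mode):.6f}, 2/(b - a) = {2 / (b - a):.6f}')
```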
from scipy.stats import rayleigh, triang
samples = rayleigh.rvs(loc=5,scale=2,size=150)  # samples generation
params = triang.fit(samples)  # returns (c, loc, scale): loc = a, scale = b - a, and c = (mode - a)/(b - a)
print(params)
(0.18410371336804132, 4.9769540456328745, 7.3705126516115085)
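The printed tuple is in SciPy's `(c, loc, scale)` convention. To read it in terms of the minimum $a$, maximum $b$, and mode used in the formulas above, it can be converted as sketched below. The variable names are illustrative, and the numbers are copied from the output above.

```python
# Values copied from the printed tuple: (c, loc, scale).
c_rel, loc, scale = 0.18410371336804132, 4.9769540456328745, 7.3705126516115085

a = loc                       # minimum value
b = loc + scale               # maximum value
mode = loc + c_rel * scale    # location of the peak

print(f'a = {a:.3f}, b = {b:.3f}, mode = {mode:.3f}')
```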
x = np.linspace(0, 20, 1000)
plt.figure()
plt.hist(samples, density=True, alpha=.3, color='magenta', label='Samples')
plt.plot(x, triang.pdf(x,c=params[0], loc=params[1], scale=params[2]), 'b-', label='Fitted Triangular')
plt.xlabel('Value')
plt.ylabel('Probability')
plt.title('Fitting a Triangular Distribution to Data')
plt.legend()