Statistics Distributions of Uncertainty¶
Todo
Intro
Common Statistical Distributions¶
Todo
Uniform
Normal/Truncated normal
Lognormal
Gamma
Beta
Ensemble
Etc.
Log-normal distribution¶
A random variable \(X\) has a log-normal distribution with parameters \(\mu\) and \(\sigma\), denoted \(X\sim \mathrm{Lognorm}(\mu, \sigma^2)\), if and only if its logarithm \(Y=\log(X)\) has a normal distribution with mean \(\mu\) and variance \(\sigma^2\). Equivalently, a random variable \(Y\) satisfies \(Y\sim \mathcal{N}(\mu, \sigma^2)\) if and only if its exponential \(X = \exp(Y)\) satisfies \(X \sim \mathrm{Lognorm}(\mu, \sigma^2)\).
Defining a log-normal distribution to simulate parameter uncertainty¶
A lognormal distribution may be an appropriate model of uncertainty for any positive, unbounded quantity \(X\). Rates and relative risks are common model parameters for which a lognormal distribution may be appropriate. If the literature reports a value \(v\) for \(X\) and an asymmetric 95% confidence interval \((a,b)\) such that \(v\) is close to the geometric mean of the endpoints (that is, \(v \approx \sqrt{ab}\), which is to the left of the interval’s midpoint), this is a good indication that the uncertainty in \(X\) can be modeled by a lognormal distribution with geometric mean \(v\) and central 95% interval \((a,b)\).
A lognormal distribution with parameters \(\mu\) and \(\sigma\) has
geometric mean \(e^\mu\), which is also equal to the median of the
distribution. Below is Python code to construct a scipy.stats lognormal
distribution with a specified geometric mean (median) \(e^\mu = v =\)
median
and approximate central 95% interval \((a,b)\) (where \(a =\)
lower
and \(b =\) upper
). The interval \((a,b)\) will be exactly
the central 95% interval of the distribution if and only if \(v =
\sqrt{ab}\).
import numpy as np
from scipy import stats
def lognorm_from_median_lower_upper(median, lower, upper, quantile_ranks=(0.025,0.975)):
"""Returns a frozen lognormal distribution with the specified median, such that
the values (lower, upper) are approximately equal to the quantiles with ranks
(quantile_ranks[0], quantile_ranks[1]). More precisely, if q0 and q1 are
the quantiles of the returned distribution with ranks quantile_ranks[0]
and quantile_ranks[1], respectively, then q1/q0 = upper/lower. If the
quantile ranks are symmetric about 0.5, lower and upper will coincide with
q0 and q1 precisely when median^2 = lower*upper.
"""
# Let Y ~ Norm(mu, sigma^2) and X = exp(Y), where mu = log(median)
# so X ~ Lognorm(s=sigma, scale=exp(mu)) in scipy's notation.
# We will determine sigma from the two specified quantiles lower and upper.
# mean (and median) of the normal random variable Y = log(X)
mu = np.log(median)
# quantiles of the standard normal distribution corresponding to quantile_ranks
stdnorm_quantiles = stats.norm.ppf(quantile_ranks)
# quantiles of Y = log(X) corresponding to the quantiles (lower, upper) for X
norm_quantiles = np.log([lower, upper])
# standard deviation of Y = log(X) computed from the above quantiles for Y
# and the corresponding standard normal quantiles
sigma = (norm_quantiles[1] - norm_quantiles[0]) / (stdnorm_quantiles[1] - stdnorm_quantiles[0])
# Frozen lognormal distribution for X = exp(Y)
# (s=sigma is the shape parameter; the scale parameter is exp(mu), which equals the median)
return stats.lognorm(s=sigma, scale=median)
Note that since the log-normal distribution has only two parameters \(\mu\)
and \(\sigma\), using three parameters \(v=\) median
, \(a=\)
lower
, and \(b=\) upper
to specify the distribution results in an
overdetermined system if the specified median \(v\) is not exactly equal to
the geometric mean of \(a\) and \(b\). In ths case, the above code
actually determines the distribution’s two parameters from the median \(v\)
and the ratio \(b/a\), and the returned distribution’s central 95% interval
will only be approximately equal to \((a,b)\).
If the confidence interval \((a,b)\) was in fact generated from a log-normal distribution with the reported value \(v\) being an estimate of the geometric mean \(e^\mu\) of \(X\), then any discrepancy between \(v\) and \(\sqrt{ab}\) would be due to rounding errors. In this case, the above function incorporates all available data and generally results in a better fit than using only one of the two estimated quantiles \(a\) or \(b\). See Nathaniel’s notebook investigating different lognormal distributions for the CIFF acute malnutrition project.
Todo
Investigate more fully what the above algorithm does when there is no
lognormal distribution matching the three parameters median
, lower
and
upper
for the specified quantile ranks.
As noted in the docstring, here’s a brief description of what the algorithm
does. Let \(v=\) median
, \(a=\) lower
, and \(b=\)
upper
, and let \((p_0, p_1) =\) (quantile_ranks[0],
quantile_ranks[1])
be the specified quantile ranks. Then the returned
lognormal distribution has median \(v\) and quantiles \(a'\) and
\(b'\) of ranks \(p_0\) and \(p_1\) such that \(b'/a' = b/a\).
Common Model Parameters and Their Possible Appropriate Uncertainty Distributions¶
Todo
Relative risk
Mean difference
Proportion
Cost estimate
Etc.
Other Considerations¶
Todo
How to handle very asymmetric confidence intervals
How to handle uncertainty in data source(s) rather than statistical uncertainty from a single high quality data source? - Ex: combining multiple estimates from published papers with their own statistical uncertainty
How to handle uncertaity when extrapolating a subnataional estimate to a national estimate?
How to handle uncertainty distribution in the case of joint distributions