Statistics Distributions of Uncertainty

Todo

Intro

Common Statistical Distributions

Todo

  • Uniform

  • Normal/Truncated normal

  • Lognormal

  • Gamma

  • Beta

  • Ensemble

  • Etc.

Log-normal distribution

A random variable \(X\) has a log-normal distribution with parameters \(\mu\) and \(\sigma\), denoted \(X\sim \mathrm{Lognorm}(\mu, \sigma^2)\), if and only if its logarithm \(Y=\log(X)\) has a normal distribution with mean \(\mu\) and variance \(\sigma^2\). Equivalently, a random variable \(Y\) satisfies \(Y\sim \mathcal{N}(\mu, \sigma^2)\) if and only if its exponential \(X = \exp(Y)\) satisfies \(X \sim \mathrm{Lognorm}(\mu, \sigma^2)\).

Defining a log-normal distribution to simulate parameter uncertainty

A lognormal distribution may be an appropriate model of uncertainty for any positive, unbounded quantity \(X\). Rates and relative risks are common model parameters for which a lognormal distribution may be appropriate. If the literature reports a value \(v\) for \(X\) and an asymmetric 95% confidence interval \((a,b)\) such that \(v\) is close to the geometric mean of the endpoints (that is, \(v \approx \sqrt{ab}\), which is to the left of the interval’s midpoint), this is a good indication that the uncertainty in \(X\) can be modeled by a lognormal distribution with geometric mean \(v\) and central 95% interval \((a,b)\).

A lognormal distribution with parameters \(\mu\) and \(\sigma\) has geometric mean \(e^\mu\), which is also equal to the median of the distribution. Below is Python code to construct a scipy.stats lognormal distribution with a specified geometric mean (median) \(e^\mu = v =\) median and approximate central 95% interval \((a,b)\) (where \(a =\) lower and \(b =\) upper). The interval \((a,b)\) will be exactly the central 95% interval of the distribution if and only if \(v = \sqrt{ab}\).

import numpy as np
from scipy import stats

def lognorm_from_median_lower_upper(median, lower, upper, quantile_ranks=(0.025,0.975)):
  """Returns a frozen lognormal distribution with the specified median, such that
  the values (lower, upper) are approximately equal to the quantiles with ranks
  (quantile_ranks[0], quantile_ranks[1]). More precisely, if q0 and q1 are
  the quantiles of the returned distribution with ranks quantile_ranks[0]
  and quantile_ranks[1], respectively, then q1/q0 = upper/lower. If the
  quantile ranks are symmetric about 0.5, lower and upper will coincide with
  q0 and q1 precisely when median^2 = lower*upper.
  """
  # Let Y ~ Norm(mu, sigma^2) and X = exp(Y), where mu = log(median)
  # so X ~ Lognorm(s=sigma, scale=exp(mu)) in scipy's notation.
  # We will determine sigma from the two specified quantiles lower and upper.

  # mean (and median) of the normal random variable Y = log(X)
  mu = np.log(median)
  # quantiles of the standard normal distribution corresponding to quantile_ranks
  stdnorm_quantiles = stats.norm.ppf(quantile_ranks)
  # quantiles of Y = log(X) corresponding to the quantiles (lower, upper) for X
  norm_quantiles = np.log([lower, upper])
  # standard deviation of Y = log(X) computed from the above quantiles for Y
  # and the corresponding standard normal quantiles
  sigma = (norm_quantiles[1] - norm_quantiles[0]) / (stdnorm_quantiles[1] - stdnorm_quantiles[0])
  # Frozen lognormal distribution for X = exp(Y)
  # (s=sigma is the shape parameter; the scale parameter is exp(mu), which equals the median)
  return stats.lognorm(s=sigma, scale=median)

Note that since the log-normal distribution has only two parameters \(\mu\) and \(\sigma\), using three parameters \(v=\) median, \(a=\) lower, and \(b=\) upper to specify the distribution results in an overdetermined system if the specified median \(v\) is not exactly equal to the geometric mean of \(a\) and \(b\). In ths case, the above code actually determines the distribution’s two parameters from the median \(v\) and the ratio \(b/a\), and the returned distribution’s central 95% interval will only be approximately equal to \((a,b)\).

If the confidence interval \((a,b)\) was in fact generated from a log-normal distribution with the reported value \(v\) being an estimate of the geometric mean \(e^\mu\) of \(X\), then any discrepancy between \(v\) and \(\sqrt{ab}\) would be due to rounding errors. In this case, the above function incorporates all available data and generally results in a better fit than using only one of the two estimated quantiles \(a\) or \(b\). See Nathaniel’s notebook investigating different lognormal distributions for the CIFF acute malnutrition project.

Todo

Investigate more fully what the above algorithm does when there is no lognormal distribution matching the three parameters median, lower and upper for the specified quantile ranks.

As noted in the docstring, here’s a brief description of what the algorithm does. Let \(v=\) median, \(a=\) lower, and \(b=\) upper, and let \((p_0, p_1) =\) (quantile_ranks[0], quantile_ranks[1]) be the specified quantile ranks. Then the returned lognormal distribution has median \(v\) and quantiles \(a'\) and \(b'\) of ranks \(p_0\) and \(p_1\) such that \(b'/a' = b/a\).

Common Model Parameters and Their Possible Appropriate Uncertainty Distributions

Todo

  • Relative risk

  • Mean difference

  • Proportion

  • Cost estimate

  • Etc.

Other Considerations

Todo

  • How to handle very asymmetric confidence intervals

  • How to handle uncertainty in data source(s) rather than statistical uncertainty from a single high quality data source? - Ex: combining multiple estimates from published papers with their own statistical uncertainty

  • How to handle uncertaity when extrapolating a subnataional estimate to a national estimate?

  • How to handle uncertainty distribution in the case of joint distributions