Ensemble Distributions in GBD

What Is an Ensemble Distribution?

In GBD, many risk factor exposures are modeled by continuous probability distributions. Occasionally, if we’re lucky, the exposure data may closely follow a standard probability distribution, such as normal, lognormal, Weibull, gamma, Gumbel, etc. But in general there is no reason to expect real-world data to match these nicely parameterized, mathematically convenient models. Thus, in order to model exposure data more accurately, GBD uses an ensemble learning approach, fitting multiple distributions to the data, and averaging the distribution functions together to get an overall “ensemble” distribution that fits the data better than any of its individual components.

Mathematical Definition of Ensemble Distributions

Mathematically, an ensemble distribution in GBD is a mixture distribution, which is defined as follows.

Mixture Distributions

Suppose \(F_1, F_2,\ldots, F_k\) are cumulative distribution functions (CDFs), called the base distributions of the ensemble or the components of the mixture, and suppose we have a collection of non-negative numbers \(w_1, w_2,\ldots, w_k\), called weights, such that \(\sum_1^k w_i = 1\). Then the mixture distribution with components \(\{F_i\}_1^k\) and weights \(\{w_i\}_1^k\) is the probability distribution on \(\mathbb{R}\) whose cumulative distribution function \(F\) is defined by

\[F(x) = \sum_{i=1}^k w_i F_i(x)\quad \text{for } x\in \mathbb{R}.\]

That is, the CDF of the mixture is a weighted average of the CDFs of the components, with the weight of \(F_i\) being \(w_i\).

If the \(F_i\)’s all correspond to continuous probability distributions on \(\mathbb{R}\), then differentiating the above equation shows that the mixture distribution is also continuous, with probability density function (PDF) given by the coresponding weighted average of the component densities, i.e. \(f(x) = \sum_1^k w_i f_i(x)\), where \(f=F'\) and \(f_i=F_i'\). Thus, the mixture distribution can be defined in terms of density functions (PDFs) instead of distribution functions (CDFs) in the continuous case.

Fitting an Ensemble Distribution to GBD Risk Exposure Data

This section describes how GBD uses mixture distributions to define ensemble distributions for modeling risk exposures.

Problem setup

Suppose we have a continuous risk exposure variable \(E\) whose distribution we want to model. For example, \(E\) could be body mass index, fasting plasma glucose, birthweight, or hemoglobin level. Suppose that we want to model \(E\) for \(n\) different population groups, e.g. indexed by (location, age, sex, year) as in GBD. Let \(j\in \{1,2,\ldots,n\}\) specify which population group we are referring to.

Typically, a GBD team will have microdata on \(E\) for at least some of the population groups, and they are able to estimate the mean \(\mu_j\) and standard deviation \(\sigma_j\) of \(E\) for each group \(j\) using various regression methods. The goal is to use all the available microdata plus the estimates of \(\mu_j\) and \(\sigma_j\) to reconstruct the approxmiate distribution function of \(E\) for each group \(j\).

More precisely, the goal is to find a family of cumulative distribution functions \(F(x | \mu, \sigma)\), parameterized by mean \(\mu\) and standard deviation \(\sigma\), satisfying the following criteria:

  • For each group \(j\) that has microdata, the distribution function \(F(x | \mu_j, \sigma_j)\) provides a good approximation of the empirical CDF of \(E\) ; and

  • For all groups \(j\), the function \(F(x | \mu_j, \sigma_j)\) provides a plausible estimate of the population distribution of \(E\) — in particular, its mean and standard deviation match \(\mu_j\) and \(\sigma_j\).

A simple strategy for estimating exposure distributions

One simple approach to finding such a family of distributions \(F(x | \mu, \sigma)\) would be to choose a 2-parameter family of probability distributions on \(\mathbb{R}\), such as the family of normal distributions \(\mathcal{N}(\mu,\sigma^2)\), and to fit distributions from that family to the exposure data using the method of moments. That is, we would set \(F(x | \mu, \sigma) \equiv \mathcal{N}(\mu,\sigma^2)\) so that \(E\) would be normally distributed for each population group, with mean \(\mu_j\) and standard deviation \(\sigma_j\) for the population group \(j\).

Improving the distribution estimates with an ensemble approach

Todo

Fill in this section. Here are some unfinished snippets I started:

The family of functions \(F(x | \mu, \sigma)\) will be defined as a mixture of a fixed set of base distributions \(F_1(x | \mu, \sigma), F_2(x | \mu, \sigma),\ldots, F_k(x | \mu, \sigma)\) parameterized by mean \(\mu\) and standard deviation \(\sigma\), using a fixed global set of weights \(\{w_i\}_1^k\) for all \(\mu\) and \(\sigma\). GBD calls \(F(x | \mu, \sigma)\) an ensemble distribution.

The goal for modeling \(E\) with an ensemble distribution is to find a global set of weights \(\{w_i\}_1^k\)

to start with a specified set of base distributions \(F_1(x | \mu, \sigma), F_2(x | \mu, \sigma),\ldots, F_k(x | \mu, \sigma)\) parameterized by mean \(\mu\) and standard deviation \(\sigma\) and

To model risk exposures using ensemble distributions, GBD uses the following procedure:

1. Start with a collection of base distributions \(F_1, F_2,\ldots, F_k\)

Where Is the Code for GBD’s Ensemble Distributions?

Here are the repositories implementing ensemble distributions for GBD modeling and for Vivarium:

  • Ensemble Distributions repository (R code) on Stash, maintained by Central Comp and accessible on the cluster at /ihme/code/risk/ensemble/

    • pdf_families.R contains the list of base distributions for the ensemble, each with functions ddistname to compute the density function (PDF), pdistname to compute the probability distribution function (CDF), and distname_mv2p to implement the method of moments (i.e. calculate the parameters of the distribution given its mean and variance), where distname is the name of the base distribution in the R code. Each distribution is added to one of the classes classA (standard distributions supported on \([0,\infty)\) or \((-\infty, \infty)\)), classB (the shifted, scaled Beta distribution supported on \([a,b]\)), or classM (mirrored distributions) at the end of the file. The three distribution classes then get concatenated into a single distribution list called dlist in fit.R and passed to the eKS function in eKS_parallel.R.

    • eKS_parallel.R contains the optimization routine that finds the best set of ensemble weights by minimizing the Kolmogorov–Smirnov statistic between the ensemble distribution and the empirical distribution of the data. This file also has code to plot the results of the fitting procedure.

    • fit.R loads the list of distributions from pdf_families.R and calls eKS_parallel.R to do the actual distribution fitting and plot the results.

    • fit_submit.R is the main program to submit a job on the cluster to fit an ensemble distribution to data by calling fit.R.

    • scale_density_simpson.cpp is C++ code to rescale a distribution so that it integrates to 1 after it has been truncated at min and max values, using Simpson’s rule. This function is called by the get_edensity function in edensity.R. However, neither edensity.R nor scale_density_simpson.cpp appear to be used in any of the files above, as the get_edensity function has a different implementation in eKS_parallel.R.

    • edensity.R is the function that returns an ensemble distribution density function for use in PAF calculations. Note that this function sets the XMIN and XMAX parameters that the mirrored gamma, mirrored gumbel, and betasr functions rely on to the 0.001 and 0.999 percentiles of a lognormal distribution with the specified mean and standard deviation (or else a normal distribution if the specified mean is equal to zero). However, XMIN and XMAX parameter values are overwritten with custom values for several risk factors that can be found here.

  • Risk Distributions repository (Python code) on GitHub, maintained by the Vivarium Engineering Team as part of Vivarium Public Health

    • risk_distributions.py implements GBD’s ensemble distributions via a collection of classes that wrap scipy.stats distributions for use as risk exposure distributions in Vivarium. There is a class for each base distribution, each extending the BaseDistribution class, and an EnsembleDistribution class implementing the mixture of the base distributions. Each base distribution class implements the functions get_parameters to calculate the parameters of the distribution given its mean and standard deviation (implementing the method of moments), pdf to compute the probability density function, cdf to compute the cumulative distribution function, and ppf to compute the percent point function (i.e. quantile function, used for inverse transform sampling).

    • formatting.py contains helper functions for formatting data and converting between data types.

    • Note that as of July 2023 by default, risk_distributions.py sets XMIN and XMAX parameter values to the 0.001 and 0.999 percentile values of a lognormal distribution with the specified mean and standard deviation, as shown here

Note

Given that there are two different code bases meant to perform the same task, we have verified that the python implementation used in vivarium accurately replicates the R implementation used in GBD processes in a notebook that can be found here. We found that the python implementation does accurately replicate the R implementation, but with a few exceptions:

1. Accurate replication of the shape of mirrored gamma and betasr distributions rely on values for the XMIN and XMAX parameters. Notably, the mean value of the distribution is unaffected by the XMIN and XMAX parameter values. While the GBD R code implementation and Vivarium python code implementation share common default values for the XMIN and XMAX parameters, it is critical to ensure the python implementation reflect custom values used in the GBD implementation to accurately replicate the GBD distributions.

2. The GBD code for the invweibull distribution relies on an optimization function that can return unexpected results. The python function appears to solve the distribution algebraically rather than via optimization and therefore does not return these unexpected results. See more in this slack thread: https://ihme.slack.com/archives/C01N6LFMN3W/p1688175801435049?thread_ts=1688162029.138979&cid=C01N6LFMN3W

TODO: Run more comparisons to determine when python optimization may/may not replicate the R optimization, as brought up in this github comment

3. The cumulative distribution function (.cdf) in the python risk_distribution implementation does not function appropriately for the mirrored gamma or mirrored gumbel distributions.

4. The python implementation reads in standard deviation while the R implementation reads in variance (equal to standard deviation squared), so input data must be appropriately converted when comparing the two implementations.

What Base Distributions Are Used in GBD’s Ensembles?

Central Comp’s R code base lists 14 distribution families in the file pdf_families.R, and 12 of them are implemented for Vivarium in risk_distributions.py (all except the Generalized Normal and Generalized Log-normal):

  • Gamma

  • Mirrored Gamma

  • Inverse Gamma

  • Normal

  • Generalized Normal (3-parameter)

  • Log-normal

  • Generalized Log-normal (3-parameter)

  • Exponential (1-parameter)

  • Weibull

  • Inverse Weibull

  • Log-logistic

  • Gumbel

  • Mirrored Gumbel

  • Beta (with shift+scale)

Each of the above distribution families has 2 parameters except as otherwise specified.

Todo

Make a table out of the above list, including the following possible columns:

  • Distribution name + Wikipedia link (or other source)

  • Variable names for the distributions used in the R and Python repos

  • R function + documentation link

  • scipy.stats function + documentation link

  • Number of parameters for the distribution family (1, 2, or 3)

  • Formula for pdf or cdf?

  • How to get distribution parameters from mean and variance (i.e. method of moments)?

  • Tail behavior?

  • Whether min and max are needed in addition to other parameters

It may be better to make a table with some small number of the above things, then add a brief section on each distribution going into more details.

Also, here are some other things to add somewhere:

  • Explain how the mean and variance will stay the same when we average the distributions together because of the Laws of Total Expectation/Variance, unless the weight for the Exponential distribution is nonzero, which will throw the variance off.

  • Add examples of GBD and Vivarium models that use ensemble distributions and/or specific distributions from the list. It would be nice to show pictures of the ensembles and list the ensemble weights.

  • Add a link to the anemia team’s repo with its own implementation of ensembles, and any other custom implementations that we find.

  • Add an explanation of how/why the support of the ensemble distribution is cut off at the 0.001 and 0.999 quantiles of the lognormal distribution. It appears that this is done in edensity.R, but does that code actually get used, or does GBD use the version of get_edensity in eKS_parallel.R instead? This truncation is also implemented in Vivarium in risk_distributions.py.

  • Note somewhere that in the R code, betasr evidently stands for “beta with shifted range”.

Note

As of May 12, 2021, it appears that three of the above distributions (Generalized Normal, Generalized Log-normal, Inverse Weibull) are not currently used for ensemble modeling by Central Comp, because they don’t get added to any of the distribution classes classA (most distributions), classB (beta distribution), or classM (mirrored distributions) at the end of pdf_families.R, and hence they don’t get added to the distribution list variable dlist in fit.R. In previous versions of the code, the Generalized Log-normal and Inverse Weibull distributions were put in classA, but they have been removed.

Sampling from Ensemble Distributions in Vivarium

Vivarium needs to sample values of the exposure variable \(E\) from its estimated ensemble distribution in order to assign an exposure value to each simulant. This contrasts with the usage of ensemble distributions in GBD, where a typical use case might be to estimate the prevalence of each exposure category of \(E\) by computing areas under its PDF. In particular, computing the PDF or CDF of \(E\) is generally sufficient for a GBD team, whereas sampling values from the ensemble distribution requires an additional algorithm.

Below, we describe two possible strategies for sampling from an ensemble distribution. The first strategy (two-step sampling) requires two random choices for each sampled value but is easy to implement if we already have a sampling strategy for all the base distributions. The second strategy (inverse transform sampling) uses a single random choice for each sampled value but requires calculation of the ensemble distribution’s quantile function, which is not straightforward. As noted above, a GBD ensemble distribution is mathematically defined as a mixture distribution, so it is sufficient to describe how to sample from mixture distributions.

Sampling from a Mixture Distribution

Let \(F = \sum_1^k w_i F_i\) be the CDF of a mixture distribution with component CDFs \(\{F_i\}_1^k\) and weights \(\{w_i\}_1^k\). Our goal is to algorithmically generate a random variable \(E\) whose CDF is \(F\) (i.e. generate a “draw” from the distribution \(F\)).

Two-step sampling from mixture distributions

There is a simple two-step procedure to sample a random variable \(E\) distributed according to \(F\), assuming that we have a method of sampling from each of the component distributions \(F_i\):

  1. Randomly choose a number \(i\in \{1,2,\ldots, k\}\), with the probability of \(i\) being \(w_i\).

  2. Independently sample \(E\) from the distribution \(F_i\), where \(i\) is the number chosen in Step 1.

This sampling method is the source of the name “mixture”: You can think of a sequence of independent draws from the mixture distribution \(F\) as each coming from one of the component distributions \(F_i\), but the draws are mixed together in a random order, with the proportion of draws from \(F_i\) being \(w_i\) on average. The following theorem shows that this sampling strategy works.

Theorem

(Two-step mixture sampling) If \(E\) is sampled using the above two-step procedure, then the distribution of \(E\) is the mixture distribution \(F = \sum_1^k w_i F_i\).

Proof. Let \(Y\) be a random variable having the categorical distribution on \(\{1,2,\ldots, k\}\) with \(\Pr(Y=i) = w_i\). Let \(X_1,X_2,\ldots, X_k\) be random variables with distributions \(F_1,F_2,\ldots F_k\), respectively, and assume that each \(X_i\) is independent of \(Y\). Then the above sampling procedure can be formalized as the statement

\[E = X_1 \cdot \mathbf{1}_{\{Y=1\}} + X_2\cdot \mathbf{1}_{\{Y=2\}} + \dotsb + X_k\cdot \mathbf{1}_{\{Y=k\}},\]

where \(\mathbf{1}_{\{Y=i\}}\) is the indicator function of the event \(\{Y=i\}\) (i.e. for each outcome \(\omega\), \(\mathbf{1}_{\{Y=i\}}(\omega) = 1\) if \(Y(\omega) = i\), and \(\mathbf{1}_{\{Y=i\}}(\omega) = 0\) otherwise). In particular, we have

\[E=X_i \text{ on the event } \{Y=i\}\]

because the events \(\{Y=i\}\) partition the sample space, so exactly one of the indicator functions \(\mathbf{1}_{\{Y=i\}}\) will be nonzero for any outcome. Using this observation, plus the independence between \(Y\) and \(X_i\), it follows that for any \(x\in\mathbb{R}\) we have

\[\begin{split}\Pr(E\le x) = \sum_{i=1}^k \Pr(Y=i \text{ and } E\le x) &= \sum_{i=1}^k \Pr(Y=i \text{ and } X_i\le x)\\ &= \sum_{i=1}^k \Pr(Y=i) \Pr(X_i\le x)\\ &= \sum_{i=1}^k w_i F_i(x) = F(x).\end{split}\]

This shows that the distribution function of \(E\) is \(F\). \(\blacksquare\)

Inverse transform sampling from mixture distributions

Like any probability distribution, a draw \(E\) from a mixture distribution can be generated using inverse transform sampling. That is, sample a \(\mathrm{Uniform}(0,1)\) random variable \(U\), then compute \(E = Q(U)\), where \(Q\) is the quantile function for \(E\) (also called the percent point function or inverse CDF of \(E\)).

The challenge with this approach is that in general, there is no simple formula for the quantile function \(Q\) of the mixture distribution \(F = \sum_1^k w_i F_i\). In particular, even when \(F\) and \(F_i\) are invertible, we have \(Q = F^{-1}\), but there is generally no simple way to write \(F^{-1}\) in terms of the components’ quantile functions \(Q_i = F_i^{-1}\).

However, it is still possible to compute the quantile function directly from the definition \(Q(p) = \min \{x\in \mathbb{R} : F(x) \ge p\}\). Namely, to compute \(Q(p)\) for \(p\in [0,1]\), we look for the smallest number \(x\) such that \(F(x)\ge p\). Since \(F\) is an increasing function, this optimization problem can be solved efficiently using a binary search. For an example of this approach coded in Python, see this blog post by Andrew Webb.

Propensity-Based Sampling from Ensemble Distributions

A key design feature of Vivarium is propensity-based sampling, in which each simulant posesses a random “propensity” for an attribute \(E\) (such as a risk exposure) that is invariant across scenarios and/or time and/or draws of model parameters, and the propensity is used to deterministically assign the value of \(E\) for the simulant at each time step. This is typically done by defining the propensities as real numbers that are drawn uniformly from the interval \([0,1]\) and then applying inverse transform sampling in order to guarantee that \(E\) follows a prescribed distribution across the simulated population. Here we describe two propensity-based approaches to sampling from an ensemble distribution, based on the two sampling strategies for mixture distributions deescribed above.

One-propensity sampling from ensemble distributions

Todo

Add discussion of pros and cons of the above sampling approaches, including what would happen if we make propensities change over time. Here’s a note from Abie, based on a conversation with James:

In future simulations, we might want individual simulants to have risk factor exposures that are correlated over time, but not perfectly correlated over time. For example, this could be accomplished in a normally distributed risk by changing the risk factor propensity by a small amount every timestep, in a way that keeps the propensity uniformly distributed between zero and one. The mixture interpretation of the ensemble distribution is amenable to a similar sort of imperfectly autocorrelated risk exposure, by changing the second propensity while leaving the first fixed. This might result in unexpected features, however, and we will need to proceed with caution, if and when the time comes.