Understanding and Pulling GBD Data¶

Global Burden of Disease (GBD) Study data is a fundamental data source for our simulation models. Understanding what data is available in the GBD and what modeling processes produced it is a difficult task. Some helpful resources for understanding the GBD study are listed below:

IHME onboarding trainings
GBD capstone papers and their methods appendices
The GBD compare tool, which allows you to visualize GBD estimates
GBD ID viewer
GBD risk factors toolbox
Your simulation science team members!
Talking to GBD modelers directly

Pulling GBD Data using Shared Functions¶

IHME central computation maintains functions for accessing GBD data, referred to as “Shared Functions.” The main HUB page for shared functions can be found here.

Note that central computation maintains a conda environment called gbd_env that is guaranteed to have the latest version of all GBD shared functions, as described on the Shared Functions HUB page.

Note that archived GBD rounds (for example, GBD 2017) may require archived GBD environments to access; see the “Current and Archive GBD environments” subpage for more details.
Also note that while gbd_env is guaranteed to have the most up-to-date versions of shared functions, it is unlikely to include additional packages you may want to use.

If you wish to use your own environment and add shared functions to it, you may do so using pip, but you will first need to add artifactory.ihme.washington.edu as a trusted host in your ~/.pip/pip.conf file, as described on this HUB page. See the computing onboarding resource page for more information on managing conda environments.
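As a rough sketch, the trusted-host entry in ~/.pip/pip.conf might look like the fragment below. The `[global]` section and `trusted-host` key are standard pip configuration options. Any index URL settings that may also be required are described on the HUB page and are deliberately not shown here.

```ini
[global]
trusted-host = artifactory.ihme.washington.edu
```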

The packages most relevant to pulling GBD data using shared functions include db_queries and get_draws.

Overview of `db_queries`¶

Documentation for db_queries can be found here.

Some particularly helpful functions in db_queries include:

get_ids: Returns a list of GBD IDs for any entities in GBD (age groups, locations, causes, etc.)
get_outputs: Returns mean value and uncertainty interval for GBD results
get_population: Returns population size estimates
get_covariate_estimates: Returns mean value and uncertainty interval for GBD covariates

Overview of `get_draws`¶

Documentation for get_draws can be found here.

get_draws differs from db_queries.get_outputs in that, rather than returning a mean estimate and uncertainty interval, it returns draw-level estimates from which a mean value and uncertainty interval can be calculated. Unlike db_queries.get_outputs, get_draws does not automatically aggregate results from the most detailed estimates. For instance, it returns sex-specific values and will not automatically return values for sex_id=3 (“both” sexes combined).
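The sketch below illustrates both points using plain Python and made-up numbers rather than a real get_draws call. It combines sex-specific draws of a rate into both-sex draws using population weights, then collapses the draws into a mean and a 95% uncertainty interval taken from the 2.5th and 97.5th percentiles. The function names, the draw values, and the population counts are all hypothetical, and population weighting is only appropriate for rates (counts would simply be summed).

```python
from statistics import mean


def percentile(sorted_values, q):
    """Linearly interpolated percentile (q in [0, 100]) of a pre-sorted list."""
    position = (len(sorted_values) - 1) * q / 100
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] + fraction * (sorted_values[upper] - sorted_values[lower])


def summarize_draws(draws):
    """Collapse draw-level values into a mean and a 95% uncertainty interval."""
    ordered = sorted(draws)
    return {
        "mean": mean(ordered),
        "lower": percentile(ordered, 2.5),
        "upper": percentile(ordered, 97.5),
    }


def combine_sexes(male_rates, female_rates, male_pop, female_pop):
    """Population-weighted both-sex rate, computed draw by draw."""
    total_pop = male_pop + female_pop
    return [
        (m * male_pop + f * female_pop) / total_pop
        for m, f in zip(male_rates, female_rates)
    ]


# Hypothetical draw-level rates for one age group, location, and year.
male_draws = [0.010, 0.012, 0.011, 0.013, 0.009]
female_draws = [0.020, 0.018, 0.022, 0.019, 0.021]

both_sex_draws = combine_sexes(male_draws, female_draws, male_pop=1_000, female_pop=3_000)
summary = summarize_draws(both_sex_draws)
print(summary)
```

Aggregating draw by draw before summarizing keeps the uncertainty interval consistent with the both-sex draws; averaging the sex-specific interval bounds directly would not.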

Additionally, certain intermediate values used in GBD, such as risk exposures and relative risks, are not available in the final GBD results returned by db_queries.get_outputs and can only be pulled using get_draws. The various data sources available in get_draws are summarized in the table below and described in more detail on the get_draws documentation page here.

Sources of draws¶
Source	Description	GBD ID type	Note
`epi`	Dismod and custom epi models. This source contains data that is computed by GBD modelers and often used as inputs to central GBD processes.	modelable_entity_id
`codcorrect`	Deaths and YLLs	cause_id	Returns counts only
`como`	YLDs, incidence, and comorbidity-adjusted prevalence	cause_id, sequela_id, rei_id	Returns rates only
`dalynator`	DALYs	cause_id
`exposure`	Risk factor exposure	rei_id	Can be continuous (like mean BMI) or categorical (like stunting prevalence)
`exposure_sd`	Risk exposure standard deviation	rei_id	Only for continuous risks
`rr`	Risk factor relative risk	rei_id, cause_id	Will return values for all affected causes unless a cause_id is specified
`burdenator`	Risk attributable burden (deaths/dalys/ylls/ylds) and mediated/aggregated PAFs	cause_id, rei_id
`paf`	Pre-burdenator (non-finalized) PAF estimates	rei_id, cause_id
`sev`	Summary exposure values	rei_id
`tmrel`	Risk factor theoretical minimum exposure level	rei_id
`rr_max`	Relative risk maximum value	rei_id, cause_id
`codem`	codem models and custom cod models	cause_id
`stgpr`	ST-GPR models	modelable_entity_id	If you pass an MEID with a dismod model type but try to use the ST-GPR source, get_draws will use the epi source instead.
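One way to keep the table above at hand while writing queries is to transcribe it into a small lookup, as in the sketch below. The dictionary contents are copied from the table; the `SOURCE_ID_TYPES` name and the `check_source` helper are hypothetical and are not part of get_draws.

```python
# ID types listed for each get_draws source, transcribed from the table above.
SOURCE_ID_TYPES = {
    "epi": {"modelable_entity_id"},
    "codcorrect": {"cause_id"},
    "como": {"cause_id", "sequela_id", "rei_id"},
    "dalynator": {"cause_id"},
    "exposure": {"rei_id"},
    "exposure_sd": {"rei_id"},
    "rr": {"rei_id", "cause_id"},
    "burdenator": {"cause_id", "rei_id"},
    "paf": {"rei_id", "cause_id"},
    "sev": {"rei_id"},
    "tmrel": {"rei_id"},
    "rr_max": {"rei_id", "cause_id"},
    "codem": {"cause_id"},
    "stgpr": {"modelable_entity_id"},
}


def check_source(source, gbd_id_type):
    """Raise ValueError if the table does not list this ID type for the source."""
    if source not in SOURCE_ID_TYPES:
        raise ValueError(f"Unknown source: {source!r}")
    allowed = SOURCE_ID_TYPES[source]
    if gbd_id_type not in allowed:
        raise ValueError(
            f"Source {source!r} expects one of {sorted(allowed)}, got {gbd_id_type!r}"
        )
    return True


# Example: como accepts sequela IDs according to the table.
print(check_source("como", "sequela_id"))
```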

Handling GBD versioning¶

Decomposition (or “decomp”) steps are a versioning scheme used in some GBD rounds that allowed updates to GBD results based on iterative updates to certain parts of the computation process. For instance, the first step may be equivalent to the prior GBD round in all aspects except for an updated demographic model; the second step may be equivalent to the prior steps, but with updated risk exposures; and so on. This process allowed GBD researchers to evaluate how individual components of the many changes included in a GBD round advancement influenced the main results of the GBD study, rather than updating the entire pipeline at once.

When pulling GBD data from GBD rounds that used decomp step versioning, you are required to specify a decomp_step value in your shared functions call.

Unfortunately, the steps are not necessarily equivalent between GBD rounds. For this reason, we advise consulting the HUB space specific to the GBD round you are interested in, which often contains information about that round’s “Decomposition rules.”

For reference, the decomposition rules for GBD 2021 can be found here.

Additionally, you may be required to specify a version_id, release_id, and/or status when pulling GBD results from certain GBD rounds. The HUB space for a given GBD round is a good resource on where to obtain this information, but do not hesitate to open a helpdesk ticket to inquire or confirm whether you are using appropriate versioning IDs for your GBD shared functions call.
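A minimal sketch of how versioning arguments might be assembled before a shared functions call is shown below. The `versioning_kwargs` helper is hypothetical, and it assumes that a call uses either release_id or the gbd_round_id plus decomp_step pair, never both; check the documentation for the specific function and GBD round before relying on that assumption. The ID values in the usage lines are placeholders.

```python
def versioning_kwargs(release_id=None, gbd_round_id=None, decomp_step=None):
    """Return a dict of versioning arguments, using exactly one scheme."""
    uses_release = release_id is not None
    uses_round = gbd_round_id is not None or decomp_step is not None
    if uses_release and uses_round:
        raise ValueError("Pass release_id or gbd_round_id/decomp_step, not both.")
    if uses_release:
        return {"release_id": release_id}
    if gbd_round_id is None or decomp_step is None:
        raise ValueError("Provide release_id, or both gbd_round_id and decomp_step.")
    return {"gbd_round_id": gbd_round_id, "decomp_step": decomp_step}


# Placeholder values for illustration only; consult the HUB space for the
# GBD round of interest to find the values that actually apply.
kwargs = versioning_kwargs(gbd_round_id=6, decomp_step="step4")
print(kwargs)
```

The resulting dictionary could then be unpacked into a shared functions call alongside the other query arguments.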

Todo

Discuss release_id as preferred alternative to gbd_round_id + decomp_step.

Pulling GBD Data using Vivarium Inputs¶

There are two main packages within the Vivarium software framework that are especially useful for interacting with GBD data: gbd_mapping and vivarium_inputs.

Both of these packages translate ID numbers used in GBD to human-readable text.

Overview of `gbd_mapping`¶

gbd_mapping provides a convenient way to access all of the metadata associated with a given GBD entity (e.g., the diarrheal diseases cause or the child growth failure risk factor), but does not return any estimates associated with that entity (e.g., prevalence or relative risks).

Overview of `vivarium_inputs`¶

vivarium_inputs provides simplified functions to query GBD data and reformats the data to be compatible with the data structure required for building Vivarium Artifact objects. vivarium_inputs generally returns data for the most up-to-date complete GBD round/release and does not allow users to specify prior rounds/releases. Ask the software engineers if you have questions about which GBD round/release is active in vivarium_inputs at any given time. Additionally, if there is any doubt as to which GBD version is being returned by a given vivarium_inputs call, you can use get_raw_data, which returns the full data, including GBD versioning IDs, for that call.

For documentation on Vivarium Inputs, click here.

Some important notes and considerations not included in the documentation above are listed below:

Todo

List default behavior of get_measures/other functions once the GBD 2021 update is finalized, including things like:

Returning most recent available year - note potential exception with risk effects?
Filtering of draws (reduction of 1,000 COD draws down to 500 that are present in COMO)?
Returning all ages/sexes and filling NaNs with zeros
Version ID behavior with GBD 2021?
Anything else?

Notable default behavior of get_measures¶
Measure	Data returned	Note
`'incidence'`	GBD_incidence / (1 - GBD_prevalence)	By default, get_measures automatically converts GBD’s “population-level incidence rates” to “susceptible population incidence rates” using the GBD estimate of prevalence. Note that if a model is using an alternative value for prevalence, this rescaling should be done separately using that prevalence value.
`'raw_incidence_rate'`	GBD_incidence
`'cause_specific_mortality'`	GBD_death_count / GBD_population_counts
`'excess_mortality'`	cause_specific_mortality / GBD_prevalence	By default, get_measures calculates excess mortality rates in accordance with the GBD estimate of prevalence. If a model is using an alternative value for cause prevalence, excess mortality rates should likely be calculated separately using that prevalence value.
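The formulas in the table above can be written out as the short sketch below, which may be useful when a model uses an alternative prevalence value and the rescaling has to be done separately. The function names and input numbers are hypothetical, and raising an error on out-of-range prevalence is a choice made for this example, not a description of how get_measures behaves.

```python
def susceptible_incidence(gbd_incidence, prevalence):
    """Convert a population-level incidence rate to a susceptible-population rate."""
    if not 0 <= prevalence < 1:
        raise ValueError("prevalence must be in [0, 1)")
    return gbd_incidence / (1 - prevalence)


def cause_specific_mortality(death_count, population_count):
    """Deaths due to the cause divided by total population."""
    return death_count / population_count


def excess_mortality(csmr, prevalence):
    """Cause-specific mortality rate divided by cause prevalence."""
    if not 0 < prevalence <= 1:
        raise ValueError("prevalence must be in (0, 1]")
    return csmr / prevalence


# Hypothetical inputs for a single age/sex/location/year group.
gbd_prevalence = 0.2
incidence = susceptible_incidence(0.02, gbd_prevalence)
csmr = cause_specific_mortality(death_count=50, population_count=100_000)
emr_gbd = excess_mortality(csmr, gbd_prevalence)

# Recomputing with an alternative, model-specific prevalence value.
emr_alternative = excess_mortality(csmr, 0.1)
print(incidence, csmr, emr_gbd, emr_alternative)
```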

Applied examples¶

Todo

Link notebook that shows examples of using these functions.

Considerations of each approach¶

Generally, GBD shared functions offer greater flexibility in querying GBD data than Vivarium Inputs, but require specification of detailed IDs that are not human-readable and require translation with get_ids. Vivarium Inputs offers less flexibility in favor of the convenience of returning a human-readable version of the most relevant data for running Vivarium simulations and compatibility with required Vivarium Artifact formatting. Therefore, GBD shared functions may be the code base to use when taking deep dives into GBD data, and Vivarium Inputs when preparing GBD data for Vivarium simulations. Some additional specific considerations about the differences between the two options are summarized in the table below.

Topic	GBD Shared Functions	Vivarium Inputs
GBD round	Able to specify any GBD round/release; useful for noting and comparing major changes between rounds	Returns most recent complete GBD round/release only
DALYs	Returns YLD, YLL, DALY estimates	Does not return YLD, YLL, or DALY estimates
Metrics	Returns counts, rates, and prevalence estimates	Conveniently returns rate estimates, with the exception of population structure, which is returned in counts
Summary values	Can return mean, upper, and lower estimates using get_outputs	Returns draw-level estimates only
Age/sex/location specificity	Allows specification across all these parameters, as well as grouping (via get_outputs) and/or aggregation (via make_custom_aggregates) across demographic categories	Returns all most-detailed age and sex estimates. Supports only one location at a time.
Format	Generally uses ID numbers that are not human-readable before pairing with get_ids information	Converts to human readable entity names rather than IDs and is compatible with formatting required for vivarium Artifacts and simulations
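To illustrate the ID translation step mentioned for shared functions, the sketch below attaches human-readable names to ID-keyed results using a lookup table shaped like the output of get_ids. It uses plain Python dictionaries and illustrative rows rather than real shared functions calls; the `add_names` helper and the result values are hypothetical, and the cause IDs shown should be confirmed against actual get_ids output.

```python
def add_names(rows, id_rows, id_column, name_column):
    """Return copies of rows with a name column looked up from id_rows."""
    lookup = {row[id_column]: row[name_column] for row in id_rows}
    return [{**row, name_column: lookup[row[id_column]]} for row in rows]


# Illustrative rows shaped like an ID table (one row per cause).
cause_ids = [
    {"cause_id": 302, "cause_name": "Diarrheal diseases"},
    {"cause_id": 322, "cause_name": "Lower respiratory infections"},
]

# Hypothetical ID-keyed results, as a shared functions query might return.
results = [
    {"cause_id": 302, "sex_id": 1, "val": 0.012},
    {"cause_id": 322, "sex_id": 1, "val": 0.034},
]

named_results = add_names(results, cause_ids, "cause_id", "cause_name")
for row in named_results:
    print(row["cause_name"], row["val"])
```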

Note

Generally, to convert a GBD shared function entity name (such as a cause_name value) to the corresponding entity name in Vivarium Inputs, convert it to all lower case and replace spaces with underscores. Python code to do this is shown below:

vivarium_inputs_entity_name = gbd_entity_name.lower().replace(' ', '_')

There are some exceptions that require additional conversion. These can be viewed in the clean_entity_list method in the Vivarium Inputs source code, found here.
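As a small usage example, the general rule can be wrapped in a function and applied to a couple of entity names. The `to_vivarium_name` function name is hypothetical, and this sketch applies only the general rule; it does not handle the exceptions covered by clean_entity_list.

```python
def to_vivarium_name(gbd_entity_name):
    """Lower-case a GBD entity name and replace spaces with underscores."""
    return gbd_entity_name.lower().replace(" ", "_")


for name in ["Diarrheal diseases", "Child growth failure"]:
    print(name, "->", to_vivarium_name(name))
# Diarrheal diseases -> diarrheal_diseases
# Child growth failure -> child_growth_failure
```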
