Understanding and Pulling GBD Data

Global Burden of Disease (GBD) Study data is a fundamental data source for our simulation models. Understanding what data is available in the GBD and what modeling processes produced it is a difficult task. Some helpful resources for understanding the GBD study are listed below:

Pulling GBD Data using Shared Functions

IHME central computation maintains functions for accessing GBD data, referred to as “Shared Functions.” The main HUB page for shared functions can be found here.

Note that central computation maintains a conda environment called gbd_env that is guaranteed to have the latest version of all GBD shared functions, as described on the Shared Functions HUB page.

  • Note that archived GBD rounds (for example, GBD 2017) may require archived GBD environments to access; see the “Current and Archive GBD environments” subpage for more details.

  • Also note that while the gbd_env environment is guaranteed to have the most up-to-date versions of shared functions, it is unlikely to include additional packages you may want to use.

If you wish to use your own environment and add shared functions to it, you may do so using pip, but you will first need to add artifactory.ihme.washington.edu as a trusted host in your ~/.pip/pip.conf file, as described on this HUB page. See the computing onboarding resource page for more information on managing conda environments.

The packages most relevant to pulling GBD data using shared functions include db_queries and get_draws.

Overview of db_queries

Documentation for db_queries can be found here.

Some particularly helpful functions in db_queries include:

  • get_ids: Returns a list of GBD IDs for any entities in GBD (age groups, locations, causes, etc.)

  • get_outputs: Returns mean value and uncertainty interval for GBD results

  • get_population: Returns population size estimates

  • get_covariate_estimates: Returns mean value and uncertainty interval for GBD covariates
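As a rough illustration of why get_ids matters, the sketch below attaches human-readable names to ID-coded rows. It runs on hand-written stand-in data rather than a live db_queries call: the rows, their values, and the helper name add_names are assumptions made for illustration, and the lookup dictionary stands in for the kind of ID table get_ids returns.

```python
# Stand-in for an ID lookup table of the kind get_ids provides.
# Hand-written here so the example runs without IHME infrastructure.
SEX_NAMES = {1: "Male", 2: "Female", 3: "Both"}

# Stand-in for ID-coded results; the values are made up for illustration.
rows = [
    {"sex_id": 1, "val": 0.012},
    {"sex_id": 2, "val": 0.010},
]


def add_names(rows, id_column, names, name_column):
    """Return copies of the rows with a readable name added for each ID."""
    return [{**row, name_column: names[row[id_column]]} for row in rows]


labeled = add_names(rows, "sex_id", SEX_NAMES, "sex")
for row in labeled:
    print(row)
```

The same pattern applies to other ID columns (age groups, locations, causes), each with its own lookup table.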

Overview of get_draws

Documentation for get_draws can be found here.

get_draws differs from db_queries.get_outputs in that, rather than returning a mean estimate and uncertainty interval, it returns draw-level estimates from which a mean value and uncertainty interval can be estimated. Unlike db_queries.get_outputs, get_draws does not automatically aggregate results from the most detailed estimates (for instance, it returns sex-specific values and will not automatically return values for sex_id=3, “both” sexes combined).

Additionally, there are certain intermediate values used in GBD that are not available in GBD’s final results found in db_queries.get_outputs and can only be pulled using get_draws, such as risk exposures and relative risks. The various data sources available in get_draws are summarized in the table below and described in more detail on the get_draws documentation page here.
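The two points above can be sketched on made-up numbers: combining sex-specific rate draws into a both-sexes value, then summarizing the draws as a mean and 95% uncertainty interval. This is a minimal sketch, not the GBD implementation. The draw values, population counts, and helper names are hypothetical, the population is treated as a fixed count with no draw-level uncertainty, and the interval is taken as the 2.5th and 97.5th percentiles of the draws.

```python
from statistics import mean


def percentile(values, q):
    """Percentile q (0-100), interpolating linearly between sorted values."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * q / 100
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (position - lower)


def summarize(draws):
    """Mean and 95% uncertainty interval across draws."""
    return {
        "mean": mean(draws),
        "lower": percentile(draws, 2.5),
        "upper": percentile(draws, 97.5),
    }


def both_sexes_rate(male_rate, female_rate, male_pop, female_pop):
    """Population-weighted rate for both sexes combined, for a single draw."""
    total = male_pop + female_pop
    return (male_rate * male_pop + female_rate * female_pop) / total


# Made-up rate draws for one age group, location, and year.
male_draws = [0.010, 0.012, 0.011, 0.013, 0.009]
female_draws = [0.008, 0.009, 0.010, 0.007, 0.011]
male_pop, female_pop = 1_000, 3_000  # hypothetical population counts

# Aggregate within each draw first, then summarize across draws.
both_draws = [
    both_sexes_rate(m, f, male_pop, female_pop)
    for m, f in zip(male_draws, female_draws)
]
summary = summarize(both_draws)
print(summary)
```

Aggregating within each draw before summarizing keeps the uncertainty interval consistent with the combined estimate. For count-space measures, the both-sexes value would be a simple sum rather than a weighted average.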

Sources of draws

| Source | Description | GBD ID type | Note |
|---|---|---|---|
| epi | Dismod and custom epi models. This source contains data that is computed by GBD modelers and often used as inputs to central GBD processes. | modelable_entity_id | |
| codcorrect | Deaths and YLLs | cause_id | Returns counts only |
| como | YLDs, incidence, and comorbidity-adjusted prevalence | cause_id, sequela_id, rei_id | Returns rates only |
| dalynator | DALYs | cause_id | |
| exposure | Risk factor exposure | rei_id | Can be continuous (like mean BMI) or categorical (like stunting prevalence) |
| exposure_sd | Risk exposure standard deviation | rei_id | Only for continuous risks |
| rr | Risk factor relative risk | rei_id, cause_id | Will return values for all affected causes unless a cause_id is specified |
| burdenator | Risk attributable burden (deaths/dalys/ylls/ylds) and mediated/aggregated PAFs | cause_id, rei_id | |
| paf | Pre-burdenator (non-finalized) PAF estimates | rei_id, cause_id | |
| sev | Summary exposure values | rei_id | |
| tmrel | Risk factor theoretical minimum exposure level | rei_id | |
| rr_max | Relative risk maximum value | rei_id, cause_id | |
| codem | codem models and custom cod models | cause_id | |
| stgpr | ST-GPR models | modelable_entity_id | If you pass an MEID with a dismod model type but try to use the ST-GPR source, get_draws will use the epi source instead. |

Handling GBD versioning

Decomposition (or “decomp”) steps are a versioning scheme used in some GBD rounds that allowed updates to GBD results based on iterative updates to certain parts of the computation process. For instance, the first step may be equivalent to the prior GBD round in all aspects except for an updated demographic model; the second step may be equivalent to the prior steps, but with updated risk exposures; and so on. This process allowed GBD researchers to evaluate how individual components of the many changes included in a GBD round advancement influenced the main results of the GBD study, rather than updating the entire pipeline at once.

When pulling GBD data from GBD rounds that used decomp step versioning, you are required to specify a decomp_step value in your shared functions call.

Unfortunately, the steps are not necessarily equivalent between GBD rounds. For this reason, we advise consulting the HUB space specific to the GBD round you are interested in, which often contains information about that round’s “Decomposition rules.”

For reference, the decomposition rules for GBD 2021 can be found here.

Additionally, you may be required to specify a version_id, release_id, and/or status when pulling GBD results from certain GBD rounds. The HUB space for a given GBD round is a good resource on where to obtain this information, but do not hesitate to open a helpdesk ticket to inquire or confirm whether you are using appropriate versioning IDs for your GBD shared functions call.
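One way to keep versioning arguments consistent across several calls is to build them once and unpack them into each call. The helper below is a hypothetical sketch, not part of the shared functions: it assumes a call is versioned either by release_id alone or by gbd_round_id together with decomp_step, and it does not check that the values are valid for any particular round. The values shown are placeholders.

```python
def version_kwargs(release_id=None, gbd_round_id=None, decomp_step=None):
    """Build versioning keyword arguments for a shared functions call.

    Hypothetical helper: it only checks that exactly one versioning style
    is used, under the assumption that the two styles are not mixed.
    """
    uses_round = gbd_round_id is not None or decomp_step is not None
    if release_id is not None:
        if uses_round:
            raise ValueError("Pass release_id or gbd_round_id/decomp_step, not both.")
        return {"release_id": release_id}
    if gbd_round_id is None or decomp_step is None:
        raise ValueError("gbd_round_id and decomp_step must be given together.")
    return {"gbd_round_id": gbd_round_id, "decomp_step": decomp_step}


# Placeholder values for illustration; look up the real ones for your round.
kwargs = version_kwargs(gbd_round_id=6, decomp_step="step4")
print(kwargs)
# The result would then be unpacked into each call with **kwargs.
```

Consult the HUB space for the round in question before relying on any specific combination of versioning arguments.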

Todo

Discuss release_id as preferred alternative to gbd_round_id + decomp_step.

Pulling GBD Data using Vivarium Inputs

There are two main packages within the Vivarium software framework that are especially useful for interacting with GBD data: gbd_mapping and vivarium_inputs.

Both of these packages translate ID numbers used in GBD to human-readable text.

Overview of gbd_mapping

gbd_mapping provides a convenient way to access all of the metadata associated with a given GBD entity (ex: diarrheal diseases cause or child growth failure risk factor), but does not return any estimates associated with that entity (ex: prevalence or relative risks).

Overview of vivarium_inputs

vivarium_inputs provides simplified functions to query GBD data and reformats the data to be compatible with the data structure required for building Vivarium Artifact objects. vivarium_inputs generally returns data for the most up-to-date complete GBD round/release and does not allow for user-specification of prior rounds/releases – ask the software engineers if you have questions about which GBD round/release is active in vivarium_inputs at any given time. Additionally, if there is any doubt as to which GBD versioning is being returned by a given vivarium_inputs call, you can utilize get_raw_data, which will return full data including GBD versioning IDs for a given call.

For documentation on Vivarium Inputs, click here.

Some important notes and considerations not included in the documentation above are listed below:

Todo

List default behavior of get_measures/other functions once the GBD 2021 update is finalized, including things like:

  • Returning most recent available year - note potential exception with risk effects?

  • Filtering of draws (reduction of 1,000 COD draws down to 500 that are present in COMO)?

  • Returning all ages/sexes and filling NANs with zeros

  • Version ID behavior with GBD 2021?

  • Anything else?

Notable default behavior of get_measures

| Measure | Data returned | Note |
|---|---|---|
| 'incidence' | GBD_incidence / (1 - GBD_prevalence) | By default, get_measures automatically converts GBD’s “population-level incidence rates” to “susceptible population incidence rates” using the GBD estimate of prevalence. Note that if a model is using an alternative value for prevalence, this rescaling should be done separately using that prevalence value. |
| 'raw_incidence_rate' | GBD_incidence | |
| 'cause_specific_mortality' | GBD_death_count / GBD_population_counts | |
| 'excess_mortality' | cause_specific_mortality / GBD_prevalence | By default, get_measures calculates excess mortality rates in accordance with the GBD estimate of prevalence. If a model is using an alternative value for cause prevalence, excess mortality rates should likely be calculated separately using that prevalence value. |
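The formulas in the table above can be worked through on made-up numbers. This is a sketch of the arithmetic only, not the vivarium_inputs implementation. The function names and input values are hypothetical, and the last step shows how excess mortality changes when a model substitutes its own prevalence.

```python
def susceptible_incidence(gbd_incidence, prevalence):
    """Rescale a population-level incidence rate to the susceptible population."""
    return gbd_incidence / (1 - prevalence)


def cause_specific_mortality(death_count, population_count):
    """Deaths due to the cause per person in the total population."""
    return death_count / population_count


def excess_mortality(csmr, prevalence):
    """Cause-specific deaths per person with the condition."""
    return csmr / prevalence


# Made-up inputs for a single demographic group.
gbd_incidence = 0.02    # cases per person-year, whole population
gbd_prevalence = 0.20   # proportion of the population with the condition
deaths = 50
population = 100_000

incidence = susceptible_incidence(gbd_incidence, gbd_prevalence)  # 0.02 / 0.8
csmr = cause_specific_mortality(deaths, population)               # 50 / 100000
emr = excess_mortality(csmr, gbd_prevalence)                      # 0.0005 / 0.2

# A model using its own prevalence should recompute rather than reuse emr.
model_prevalence = 0.25
model_emr = excess_mortality(csmr, model_prevalence)              # 0.0005 / 0.25

print(incidence, csmr, emr, model_emr)
```

With these inputs, the susceptible-population incidence is higher than the raw incidence rate because the denominator excludes the 20% of people who already have the condition.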

Applied examples

Todo

Link notebook that shows examples of using these functions.

Considerations of each approach

Generally, GBD shared functions offer greater flexibility in querying GBD data than Vivarium Inputs, but require specification of detailed IDs that are not human-readable and require translation with get_ids. Vivarium Inputs offers less flexibility in favor of the convenience of returning a human-readable version of the most relevant data for running Vivarium simulations and compatibility with required Vivarium Artifact formatting. Therefore, GBD shared functions may be the code base to use when taking deep dives into GBD data, and Vivarium Inputs when preparing GBD data for Vivarium simulations. Some additional specific considerations about the differences between the two options are summarized in the table below.

| Topic | GBD Shared Functions | Vivarium Inputs |
|---|---|---|
| GBD round | Able to specify any GBD round/release; useful for noting and comparing major changes between rounds | Returns most recent complete GBD round/release only |
| DALYs | Returns YLD, YLL, DALY estimates | Does not return YLD, YLL, or DALY estimates |
| Metrics | Returns counts, rates, and prevalence estimates | Returns rate estimates with the exception of population structure, which are in counts; convenient |
| Summary values | Can return mean, upper, and lower estimates using get_outputs | Returns draw-level estimates only |
| Age/sex/location specificity | Allows for specification across all these parameters, allows for grouping (via get_outputs) and/or aggregation (via make_custom_aggregates) across demographic categories | Returns all most-detailed age and sex estimates. Supports only one location at a time. |
| Format | Generally uses ID numbers that are not human-readable before pairing with get_ids information | Converts to human readable entity names rather than IDs and is compatible with formatting required for vivarium Artifacts and simulations |

Note

Generally, to convert a GBD shared function entity name (such as a cause_name) to the corresponding entity name in Vivarium Inputs, convert the name to all lower case and replace spaces with underscores. Python code to do this is shown below:

vivarium_inputs_entity_name = gbd_entity_name.lower().replace(' ', '_')

Some entity names are exceptions to this rule and require additional conversion. These cases are handled in the clean_entity_list method in the Vivarium Inputs source code, found here.
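As a small usage sketch of the general rule, the conversion can be wrapped in a function and applied to a list of names. The function name to_vivarium_name is hypothetical, and the sketch covers only the lower-case and underscore rule, not the exceptions handled by clean_entity_list.

```python
def to_vivarium_name(gbd_entity_name):
    """Apply the general rule: lower-case the name, spaces to underscores."""
    return gbd_entity_name.lower().replace(" ", "_")


gbd_names = ["Diarrheal diseases", "Child growth failure"]
converted = [to_vivarium_name(name) for name in gbd_names]
print(converted)  # ['diarrheal_diseases', 'child_growth_failure']
```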