LDL-C, SBP, IKF, and FPG Risk Factor Correlation

Risk Factor Correlation Overview

Let X and Y be random variables with marginal densities \(f_{X}\) and \(g_{Y}\), and with joint density \(h_{XY}\). The corresponding CDFs are denoted by capital letters: \(F_{X}\), \(G_{Y}\), and \(H_{XY}\). Although I am using terminology consistent with continuous random variables, I am sloppy herein concerning whether X and Y are in fact discrete or continuous (e.g., see the assumption that the inverse CDF exists).

Define the matrix of correlation parameters between random variables to be \(\theta\), such that, in the two-variable case:

Todo

Add correct matrix

\[\begin{split}\theta = \begin{bmatrix} 1 & \rho \\ \rho & 1 \\ \end{bmatrix}\end{split}\]

Description of general approach

We are starting with two correlated random variables. Using a variety of transformations, we can express these random variables in terms of independent standard normal variables. (See literature on dependence measures, such as Pearson’s rho, Spearman’s rho and Kendall’s tau, as well as literature on Nataf transformation, Rosenblatt transformation, a ton of literature on “copulas”, including Archimedean, Gaussian, Clayton and elliptical varieties.) Some background is provided here, but the scope is severely limited compared to the available literature on the subject.

Background

A copula is a multivariate distribution function whose marginal distributions are uniform on the interval [0,1]. There are many different copulas, all satisfying the following (Sklar’s theorem): if random variables \(X_{1},\ldots,X_{n}\) have marginal distributions \(F_{X_{1}},\ldots,F_{X_{n}}\) and joint distribution \(F_{X_{1},\ldots,X_{n}}\), then there exists an n-dimensional copula C such that

\(F_{X_{1},\ldots,X_{n}}\left( x_{1},\ldots,x_{n} \right) = C\left\lbrack F_{X_{1}}\left( x_{1} \right),\ldots,F_{X_{n}}(x_{n}) \right\rbrack\)
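As a concrete illustration, here is a minimal numerical check (assuming numpy and scipy are available; the exponential marginals and parameter values are arbitrary choices for the example) that two variables joined by a Gaussian copula, the copula used later in this document, satisfy \(H(x,y) = C\left\lbrack F(x),G(y) \right\rbrack\):

```python
import numpy as np
from scipy import stats

# Two arbitrary (non-normal) marginals joined by a Gaussian copula with rho = 0.5.
rho = 0.5
cov = [[1.0, rho], [rho, 1.0]]
rng = np.random.default_rng(12345)

# Sample from the copula: correlated standard normals -> uniforms -> marginals.
z = rng.multivariate_normal([0.0, 0.0], cov, size=200_000)
u = stats.norm.cdf(z)                      # copula-scale (uniform) variables
x = stats.expon.ppf(u[:, 0], scale=2.0)    # X ~ Exponential(scale=2), illustrative
y = stats.expon.ppf(u[:, 1], scale=5.0)    # Y ~ Exponential(scale=5), illustrative

# Sklar's theorem: H(x0, y0) = C[F(x0), G(y0)].
x0, y0 = 1.5, 4.0
empirical_H = np.mean((x <= x0) & (y <= y0))
copula_H = stats.multivariate_normal([0.0, 0.0], cov).cdf(
    [stats.norm.ppf(stats.expon.cdf(x0, scale=2.0)),
     stats.norm.ppf(stats.expon.cdf(y0, scale=5.0))]
)
print(empirical_H, copula_H)  # the two values should agree to ~3 decimals
```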

Since the relationship between the random variables is often not linear, Pearson’s rho is not an appropriate metric. Instead, metrics such as Spearman’s rho or Kendall’s tau, which measure correlation in terms of rank, are a better choice. Rank correlation is called concordance, defined as follows for random variables X and Y: if large values of X are associated with large values of Y, then X and Y are concordant (as opposed to discordant). A concordance function, Q, gives the difference between the probability of concordance and the probability of discordance for an independent pair of vectors \(\left( X_{1},Y_{1} \right)\) and \(\left( X_{2},Y_{2} \right)\) of random variables:

\(Q = P\left\lbrack \left( X_{1} - X_{2} \right)\left( Y_{1} - Y_{2} \right) > 0 \right\rbrack - P\left\lbrack \left( X_{1} - X_{2} \right)\left( Y_{1} - Y_{2} \right) < 0 \right\rbrack\)

Note that the vectors are independent of each other, but the random variables which make up the vectors are correlated. Specifically, they have the following joint distributions:

\(H_{1}\left( x,y \right) = C_{1}\left( F\left( x \right),G\left( y \right) \right)\ \text{and}\ H_{2}\left( x,y \right) = C_{2}\left( F\left( x \right),G\left( y \right) \right)\)

The concordance function can be expressed in terms of the two copulas:

\(Q = Q\left( C_{1},C_{2} \right) = 4\iint C_{2}\left( F\left( x \right),G\left( y \right) \right)\, dC_{1}\left( F\left( x \right),G\left( y \right) \right) - 1\)

Spearman’s rho is proportional to the probability of concordance minus the probability of discordance between two random vectors \(\left( X_{1},Y_{1} \right)\) and \(\left( X_{2},Y_{2} \right)\) with the same marginal distributions F(x) and G(y), but with different copulas:

\(H_{1}\left( x,y \right) = C_{1}\left( F\left( x \right),G\left( y \right) \right)\ \text{for}\ \left( X_{1},Y_{1} \right)\)

and

\(H_{2}\left( x,y \right) = C_{2}\left( x,y \right) = F\left( x \right)G\left( y \right)\ \text{for}\ \left( X_{2},Y_{2} \right)\)

The population version of Spearman’s rho is defined as

\(\rho_{s} = 3\left( P\left\lbrack \left( X_{1} - X_{2} \right)\left( Y_{1} - Y_{2} \right) > 0 \right\rbrack - P\left\lbrack \left( X_{1} - X_{2} \right)\left( Y_{1} - Y_{2} \right) < 0 \right\rbrack \right)\)

where multiplication by 3 normalizes Spearman’s rho to be on the interval [-1,1]. A result of the definition of copula H2 is that Spearman’s rho, when written in terms of the integration of copulas,

\(\rho_{s} = 3Q\left( C_{1},C_{2} \right) = 12\iint F\left( x \right)G\left( y \right)\, dC_{1}\left( F\left( x \right),G\left( y \right) \right) - 3\)

simplifies to the following:

\(\rho_{s} = 12\iint C_{1}\left( F\left( x \right),G\left( y \right) \right)\, dF\left( x \right)\, dG\left( y \right) - 3\)

For certain copulas (Frank, Farlie-Gumbel-Morgenstern, and Gaussian), Spearman’s rho can be expressed as a simple function of the correlation parameter, \(\rho_{s} = k(\theta)\), where \(\theta\) is the linear correlation between the two random variables.

Kendall’s tau is the probability of concordance minus the probability of discordance between two random vectors \(\left( X_{1},Y_{1} \right)\) and \(\left( X_{2},Y_{2} \right)\) with the same marginal distributions F(x) and G(y), and with a common copula:

\(H\left( x,y \right) = C\left( F\left( x \right),G\left( y \right) \right)\ \text{for}\ \left( X_{1},Y_{1} \right)\ \text{and}\ \left( X_{2},Y_{2} \right)\)

The population version of Kendall’s tau is defined as

\(\tau = Q\left( C,C \right) = 4\iint C\left( F\left( x \right),G\left( y \right) \right)\, dC\left( F\left( x \right),G\left( y \right) \right) - 1\)

Kendall’s tau can be expressed as a function of the correlation parameter for a broader set of copulas than Spearman’s rho.

A Gaussian copula is a multivariate normal distribution of standard normal variables:

\(C\left( x_{1},\ldots,x_{n} \right) = \Phi\left\lbrack \Phi^{- 1}\left( x_{1} \right),\ldots,\Phi^{- 1}\left( x_{n} \right) \right\rbrack\)

and is used in the Nataf transformation to transform the original variables X into correlated standard normal variables Y distributed as \(\Phi(0,P')\), where \(P'\) is the reduced covariance matrix.

Spearman’s rho with a Gaussian copula can be expressed as follows:

\(\rho_{s} = k\left( \theta \right) = \frac{6}{\pi}\arcsin\left( \frac{\theta}{2} \right)\)
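This relation can be inverted, \(\theta = 2\sin\left( \pi\rho_{s}/6 \right)\), to choose the Gaussian-copula parameter that reproduces an observed Spearman’s rho (e.g., the NHANES estimates tabulated below). The following is a minimal numerical check, assuming numpy and scipy; the value of \(\theta\) is arbitrary:

```python
import numpy as np
from scipy import stats

theta = 0.3  # linear correlation of the underlying standard normals (arbitrary)
rho_s_predicted = (6 / np.pi) * np.arcsin(theta / 2)

# Empirical check: Spearman's rho of a large sample from the Gaussian copula.
# (Spearman's rho is rank-based, so computing it on the normals directly is fine.)
rng = np.random.default_rng(42)
z = rng.multivariate_normal([0.0, 0.0], [[1.0, theta], [theta, 1.0]], size=500_000)
rho_s_empirical, _ = stats.spearmanr(z[:, 0], z[:, 1])

# Inverse map: recover the copula parameter from an observed Spearman's rho.
theta_recovered = 2 * np.sin(np.pi * rho_s_empirical / 6)

print(rho_s_predicted, rho_s_empirical, theta_recovered)
```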

Application

This process takes several steps. The ultimate goal is to generate risk-correlated distributions from which simulants will be initialized. I calculated Spearman’s rho using data from NHANES 2011. The steps below assume the underlying correlated variables are normally distributed (i.e., a Gaussian copula), but it would probably be wise to try to find the “true” functional form from a literature review. The first step is simply to create inverse CDFs from GBD data. Since the GBD distributions are empirical, the inverse CDF exists. It will be a step function, but this shouldn’t be problematic. Step 2 generates a random variable for each risk factor and simultaneously builds in the correlation. Next, transform these correlated normal variables into uniform variables on [0,1] by way of the standard normal CDF (the Gaussian copula step). Finally, use the inverse CDFs to transform the correlated uniform variables into correlated marginal distributions that match GBD. Sample from these for initialization data.

  1. Compute the CDF for each GBD risk factor distribution and find each inverse CDF. Call this \(\mathrm{GBD}_{i}^{-1}\), where the subscript denotes the risk factor.

  2. Define \(Z_{i}\) to be a random variable, where i indexes the risk factors. Generate random values for each \((Z_{i}, Z_{j})\) pair, drawn from a bivariate normal distribution with (Spearman’s) correlation matrix \(\rho_{s}\).

  3. Define \(U_{i} = \Phi_{Z}(Z_{i})\), the standard normal CDF applied to each \(Z_{i}\). Since we sampled values for \(Z_{i}\), this can be computed (it may be a step function; smoothing might be too fancy).

  4. Generate \(X_{i} = \mathrm{GBD}_{i}^{-1}(U_{i})\) for each risk factor. These will have the same distributions as their counterparts in GBD, and they will have the appropriate correlation thanks to step #2.

  5. Sample from each \(X_{i}\) distribution to initialize the simulation population.

If I’m not mistaken, this approach should work for categorical risk factors as well. The inverse CDF from the GBD data for the categorical risks will be very much a step function, but I’m not sure that matters – since I can’t see where it would crash this recipe. As long as the inverse CDF is well defined, I think this should work.

Step #2 could be generalized, I think, so that values are drawn not pairwise but from a generic multivariate normal with dimension equal to the number of risk factors; a sketch of that generalized version is given below. I started writing this with the idea that values would need to be sampled from different distributions (not always normal), but the more time I spend on this, the more I convince myself that we only need the normal distribution, regardless of the risk factor and “true” underlying distribution. (I hope I’m not overlooking negative values here…) I also computed rho values pairwise, and I don’t want to take the time to calculate the 4x4 matrix again.
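Here is a minimal sketch of steps 1-4 using that generalized multivariate draw. It assumes numpy and scipy; `gbd_exposure_samples` is a hypothetical dict of empirical exposure draws per risk factor standing in for the GBD distributions, `spearman_corr` is the corresponding age-specific Spearman matrix (e.g., from the table below), and the Spearman values are converted to copula parameters via \(\theta = 2\sin(\pi\rho_{s}/6)\) from the relation above:

```python
import numpy as np
from scipy import stats

def correlated_risk_exposures(gbd_exposure_samples, spearman_corr, n_simulants, seed=0):
    """Gaussian-copula sketch of steps 1-4 (hypothetical helper).

    gbd_exposure_samples: dict mapping risk name -> 1-D array of empirical
        exposure values standing in for that risk's GBD distribution.
    spearman_corr: Spearman correlation matrix, ordered like the dict keys.
    """
    risks = list(gbd_exposure_samples)

    # Convert Spearman correlations to the linear correlations of the
    # underlying normals: theta = 2 * sin(pi * rho_s / 6).
    theta = 2 * np.sin(np.pi * np.asarray(spearman_corr, dtype=float) / 6)
    np.fill_diagonal(theta, 1.0)

    # Step 2 (generalized): one joint normal draw per simulant.
    rng = np.random.default_rng(seed)
    z = rng.multivariate_normal(np.zeros(len(risks)), theta, size=n_simulants)

    # Step 3: the standard normal CDF maps each Z_i to a uniform U_i on [0, 1].
    u = stats.norm.cdf(z)

    # Steps 1 and 4: empirical inverse CDF of each GBD distribution
    # (np.quantile interpolates between order statistics, which smooths
    # the step function slightly; fine for a sketch).
    return {
        risk: np.quantile(gbd_exposure_samples[risk], u[:, i])
        for i, risk in enumerate(risks)
    }
```

Step #5 then amounts to reading off rows of the returned arrays, since each simulant corresponds to one joint draw.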

The biggest weakness is obviously use of the Gaussian copula, which could be generalized with some additional time and effort. I know selection of the copula can make a reasonably significant difference (depending on the shape of the scatter plot), but time constraints are binding here, so it’s saved for future work.

Spearman correlations between LDL-c, SBP, FPG, GFR

age_group   risk_factor   SBP_correlation   LDLC_correlation   FPG_correlation   GFR_correlation
30_to_34    SBP           1                 0.1176672          0.1948481         -0.1819566
30_to_34    LDLC          0.1176672         1                  0.1186915         -0.1757608
30_to_34    FPG           0.1948481         0.1186915          1                 0.0103628
30_to_34    GFR           -0.1819566        -0.1757608         0.0103628         1
35_to_39    SBP           1                 0.1392054          0.2200049         -0.17667221
35_to_39    LDLC          0.1392054         1                  0.15174071        -0.20808153
35_to_39    FPG           0.2200049         0.1517407          1                 -0.03396372
35_to_39    GFR           -0.1766722        -0.2080815         -0.03396372       1
40_to_44    SBP           1                 0.1522813          0.23464385        -0.178237
40_to_44    LDLC          0.1522813         1                  0.16361198        -0.23030906
40_to_44    FPG           0.2346439         0.163612           1                 -0.04468611
40_to_44    GFR           -0.178237         -0.2303091         -0.04468611       1
45_to_49    SBP           1                 0.1517517          0.25028775        -0.18378793
45_to_49    LDLC          0.1517517         1                  0.17982247        -0.24775229
45_to_49    FPG           0.2502877         0.1798225          1                 -0.07947989
45_to_49    GFR           -0.1837879        -0.2477523         -0.07947989       1
50_to_54    SBP           1                 0.1610482          0.2728209         -0.202464
50_to_54    LDLC          0.1610482         1                  0.1958882         -0.2614925
50_to_54    FPG           0.2728209         0.1958882          1                 -0.1052538
50_to_54    GFR           -0.202464         -0.2614925         -0.1052538        1
55_to_59    SBP           1                 0.166594           0.2771441         -0.2173886
55_to_59    LDLC          0.166594          1                  0.1812806         -0.2538965
55_to_59    FPG           0.2771441         0.1812806          1                 -0.124089
55_to_59    GFR           -0.2173886        -0.2538965         -0.124089         1
60_to_64    SBP           1                 0.1733038          0.2989019         -0.2511495
60_to_64    LDLC          0.1733038         1                  0.1735191         -0.2469309
60_to_64    FPG           0.2989019         0.1735191          1                 -0.1645333
60_to_64    GFR           -0.2511495        -0.2469309         -0.1645333        1
65_to_69    SBP           1                 0.1705171          0.3144756         -0.2722179
65_to_69    LDLC          0.1705171         1                  0.1673229         -0.2294974
65_to_69    FPG           0.3144756         0.1673229          1                 -0.1917644
65_to_69    GFR           -0.2722179        -0.2294974         -0.1917644        1
70_to_74    SBP           1                 0.1542949          0.3270861         -0.295815
70_to_74    LDLC          0.1542949         1                  0.1434287         -0.2044215
70_to_74    FPG           0.3270861         0.1434287          1                 -0.2133685
70_to_74    GFR           -0.295815         -0.2044215         -0.2133685        1
75_to_79    SBP           1                 0.1468609          0.3351865         -0.3086651
75_to_79    LDLC          0.1468609         1                  0.1305559         -0.184982
75_to_79    FPG           0.3351865         0.1305559          1                 -0.233851
75_to_79    GFR           -0.3086651        -0.184982          -0.233851         1

PAF adjustment

With the correlated risk distributions in hand, we can make an adjustment to the GBD PAF calculation. Let \(PAF_{joint}\) be the population attributable fraction which incorporates the correlated risks, such that

\[PAF_{joint} = 1 - \left\lbrack \int_{LDL}\int_{FPG}\int_{IKF}\int_{SBP} \rho\left( e_{FPG},e_{IKF},e_{SBP},e_{LDL} \right) \times \prod_{i \in \left\{ LDL,SBP,IKF,FPG \right\}} RR_{i}^{e_{i}}\, de_{i} \right\rbrack^{-1}\]

where \(\rho\left( e_{FPG},e_{IKF},e_{SBP},e_{LDL} \right)\) is the correlated joint exposure density of the four risks and \(RR_{i}\) is the relative risk associated with risk factor i.
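Because the bracketed integral is the expectation of \(\prod_{i} RR_{i}^{e_{i}}\) over the correlated joint exposure distribution, it can be estimated by Monte Carlo directly from correlated exposure draws like those generated above. The sketch below is built on several assumptions: `correlated_exposures` is output shaped like the earlier sketch, `rr_per_unit` is a hypothetical dict of per-unit relative risks, and exposures are measured in excess of a TMREL; the actual GBD relative risk inputs would replace these stand-ins.

```python
import numpy as np

def joint_paf(correlated_exposures, rr_per_unit, tmrel):
    """Monte Carlo estimate of PAF_joint = 1 - 1 / E[prod_i RR_i ** e_i].

    correlated_exposures: dict risk -> array of correlated exposure draws
        (e.g., output of correlated_risk_exposures above).
    rr_per_unit: dict risk -> relative risk per unit of exposure (hypothetical).
    tmrel: dict risk -> theoretical minimum-risk exposure level (hypothetical).
    """
    n_draws = len(next(iter(correlated_exposures.values())))
    rr_product = np.ones(n_draws)
    for risk, exposure in correlated_exposures.items():
        # Exposure in excess of the TMREL, floored at zero so values at or
        # below the TMREL contribute no excess risk.
        excess = np.clip(exposure - tmrel[risk], 0.0, None)
        rr_product *= rr_per_unit[risk] ** excess
    # The sample mean approximates the integral of rho(e) * prod RR_i^{e_i} de.
    return 1.0 - 1.0 / rr_product.mean()
```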
