Introduction to Qgcomp.jl
Note: this is (copied directly from the qgcomp package in R, basic vignette).
Qgcomp.jl is a module to implement g-computation for analyzing the effects of exposure mixtures. Quantile g-computation yields estimates of the effect of increasing all exposures by one quantile, simultaneously. This, it estimates a "mixture effect" useful in the study of exposure mixtures such as air pollution, diet, and water contamination.
Using terminology from methods developed for causal effect estimation, quantile g-computation estimates the parameters of a marginal structural model that characterizes the change in the expected potential outcome given a joint intervention on all exposures, possibly conditional on confounders. Under the assumptions of exchangeability, causal consistency, positivity, no interference, and correct model specification, this model yields a causal effect for an intervention on the mixture as a whole. While these assumptions may not be met exactly, they provide a useful road map for how to interpret the results of a qgcomp fit, and where efforts should be spent in terms of ensuring accurate model specification and selection of exposures that are sufficient to control co-pollutant confounding.
The model
Say we have an outcome $Y$, some exposures $\mathbb{X}$ and possibly some other covariates (e.g. potential confounders) denoted by $\mathbb{Z}$.
The basic model of quantile g-computation is a joint marginal structural model given by
\[\mathbb{E}(Y^{\mathbf{X}_q} | \mathbf{Z,\psi,\eta}) = g(\psi_0 + \psi_1 S_q + \mathbf{\eta Z})\]
where $g(\cdot)$ is a link function in a generalized linear model (e.g. the inverse logit function in the case of a logistic model for the probability that $Y=1$), $\psi_0$ is the model intercept, $\mathbf{\eta}$ is a set of model coefficients for the covariates and $S_q$ is an "index" that represents a joint value of exposures. Quantile g-computation (by default) transforms all exposures $\mathbf{X}$ into $\mathbf{X}_q$, which are "scores" taking on discrete values 0,1,2,etc. representing a categorical "bin" of exposure. By default, there are four bins with evenly spaced quantile cutpoints for each exposure, so ${X}_q=0$ means that $X$ was below the observed 25th percentile for that exposure. The index $S_q$ represents all exposures being set to the same value (again, by default, discrete values 0,1,2,3). Thus, the parameter $\psi_1$ quantifies the expected change in the outcome, given a one quantile increase in all exposures simultaneously, possibly adjusted for $\mathbf{Z}$.
There are nuances to this particular model form that are available in the qgcomp package which will be explored below. There exists one special case of quantile g-computation that leads to fast fitting: linear/additive exposure effects. Here we simulate "pre-quantized" data where the exposures $X_1, X_2, X_3$ can only take on values of 0,1,2,3 in equal proportions. The model underlying the outcomes is given by the linear regression:
\[\mathbb{E}(Y | \mathbf{X,\beta}) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3\]
with the true values of $\beta_0=0, \beta_1 =0.25, \beta_2 =-0.1, \beta_3=0.05$, and $X_1$ is strongly positively correlated with $X_2$ ($\rho=0.95$) and negatively correlated with $X_3$ ($\rho=-0.3$). In this simple setting, the parameter $\psi_1$ will equal the sum of the $\beta$ coefficients (0.2). Here we see that qgcomp estimates a value very close to 0.2 (as we increase sample size, the estimated value will be expected to become increasingly close to 0.2).
#cd("Qgcomp.jl/docs/src/fig/")
using Qgcomp, DataFrames, Random, StatsBase, GLM, StatsModels
# a function to generate quantized data with a specific correlation structure
function genxq(rng, n, corr=(0.95, -0.3), q=4)
x = rand(rng, n, length(corr)+1)
props = abs.(corr)
ns = floor.(Int, n .* props)
nidx = [setdiff(1:n, sample(rng, 1:n, nsi, replace=false)) for nsi in ns]
for (j,c) in enumerate(corr)
x[:,j+1] .= x[:,1]
x[nidx[j],j+1] .= sample(rng, x[nidx[j],1], length(nidx[j]), replace=false)
if c < 0
x[:,j+1] .= 1.0 .- x[:,j+1]
end
end
xq, _ = Qgcomp.get_xq(x, q)
x, xq
end
# generate some data under a linear model with "quantized" exposures
rng = Xoshiro(321)
n = 1000
X, Xq = genxq(rng, 1000, (0.95, -0.3))
y = randn(rng, n) + Xq * [0.25, -0.1, 0.05]
lindata = DataFrame(hcat(y, X), [:y, :x1, :x2, :x3])
# check correlations
println(cor(Xq))
# fit model
qgcomp_glm_noboot(@formula(y~x1+x2+x3), lindata, ["x1", "x2", "x3"], 4, Normal())Scaled effect size (negative direction)
1×4 DataFrame
Row │ exposure coef weight ψ_partial
│ String Float64 Float64 Float64
─────┼─────────────────────────────────────────
1 │ x2 -0.166811 1.0 -0.166811
Scaled effect size (positive direction)
2×4 DataFrame
Row │ exposure coef weight ψ_partial
│ String Float64 Float64 Float64
─────┼───────────────────────────────────────────
1 │ x3 0.0209875 0.0596686 0.351734
2 │ x1 0.330746 0.940331 0.351734
─────────────────────────────────────────────────────────────────────────
Coef. Std. Error z Pr(>|z|) Lower 95% Upper 95%
─────────────────────────────────────────────────────────────────────────
(Intercept) 0.0292843 0.0784494 0.37 0.7089 -0.124474 0.183042
mixture 0.184923 0.0480701 3.85 0.0001 0.0907069 0.279138
─────────────────────────────────────────────────────────────────────────
How to use the Qgcomp module
Here we use a running example from the metals dataset (part of the qgcomp package in R) to demonstrate some features of the package and method.
Namely, the examples below demonstrate use of the package for:
- Fast estimation of exposure effects under a linear model for quantized exposures for continuous (normal) outcomes
- Estimating conditional and marginal odds/risk ratios of a mixture effect for binary outcomes
- Adjusting for non-exposure covariates when estimating effects of the mixture
- Allowing non-linear and non-homogeneous effects of individual exposures and the mixture as a whole by including product terms
- Using qgcomp to fit a time-to-event model to estimate conditional and marginal hazard ratios for the exposure mixture
For analogous approaches to estimating exposure mixture effects, illustrative examples can be seen in the gQWS package help files, which implements weighted quantile sum (WQS) regression, and at https://jenfb.github.io/bkmr/overview.html, which describes Bayesian kernel machine regression.
The metals dataset from the package qgcomp, comprises a set of simulated well water exposures and two health outcomes (one continuous, one binary/time-to-event). The exposures are transformed to have mean = 0.0, standard deviation = 1.0. The data are used throughout to demonstrate usage and features of the qgcomp package.
using RData
tf = tempname() * ".RData"
download("https://github.com/alexpkeil1/qgcomp/raw/refs/heads/main/data/metals.RData", tf)
metals = load(tf)["metals"]
println(metals[1:10,:])
# we save the names of the mixture variables in the variable "Xnm"
Xnm = [
"arsenic","barium","cadmium","calcium","chromium","copper",
"iron","lead","magnesium","manganese","mercury","selenium","silver",
"sodium","zinc"
];
covars = ["nitrate","nitrite","sulfate","ph", "total_alkalinity","total_hardness"];6-element Vector{String}:
"nitrate"
"nitrite"
"sulfate"
"ph"
"total_alkalinity"
"total_hardness"Example 1: linear model
Qgcomp.qgcomp_glm_noboot — Functionusing Qgcomp, DataFrames, StatsModels
x1 = rand(100, 3)
x = rand(100, 3)
z = rand(100, 3)
xq, _ = Qgcomp.get_xq(x, 4)
y = randn(100) + xq * [.1, 0.05, 0]
data = DataFrame(hcat(y,x,z), [:y, :x1, :x2, :x3, :z1, :z2, :z3])
form = @formula(y~x1+x2+x3+z1+z2+z3)
form_noint = @formula(y~-1+x1+x2+x3+z1+z2+z3)
expnms = [:x1, :x2, :x3]
m = qgcomp_glm_noboot(form, data, expnms, 4, Normal())
m = qgcomp_glm_noboot(form_noint, data, expnms, 4, Normal())
fitted(m)
aic(m)
aicc(m)
bic(m)
loglikelihood(m)# Example 1: linear model
# Run the model and save the results "qc_fit"
f = @formula(y~1+a+b)
ff = FormulaTerm(f.lhs, (Term.(Symbol.(Xnm))...,))
qgcomp_glm_noboot(ff, metals[:,vcat(Xnm, "y")], Xnm, 4, Normal()) # run once to compile
@time qc_fit = qgcomp_glm_noboot(ff, metals[:,vcat(Xnm, "y")], Xnm, 4, Normal())
# 0.001750 seconds (49.98 k allocations: 2.679 MiB)
# contrasting other methods with computational speed
# WQS regression (v3.0.1 of gWQS package in R)
#system.time(wqs.fit = gWQS::gwqs(y~wqs,mix_name=Xnm, data=metals[:,vcat(Xnm, "y")], Normal(), 4))
# user system elapsed
# 35.775 0.124 36.114
# Bayesian kernel machine regression (note that the number of iterations here would
# need to be >5,000, at minimum, so this underestimates the run time by a factor
# of 50+
#system.time(bkmr.fit = kmbayes(y=metals$y, Z=metals[,Xnm], family="gaussian", iter=100))
# user system elapsed
# 81.644 4.194 86.520Scaled effect size (negative direction)
5×4 DataFrame
Row │ exposure coef weight ψ_partial
│ String Float64 Float64 Float64
─────┼─────────────────────────────────────────────────
1 │ selenium -0.000105864 0.000856712 -0.12357
2 │ manganese -0.00788716 0.0638276 -0.12357
3 │ lead -0.00914645 0.0740185 -0.12357
4 │ copper -0.0476113 0.385299 -0.12357
5 │ magnesium -0.058819 0.475999 -0.12357
Scaled effect size (positive direction)
10×4 DataFrame
Row │ exposure coef weight ψ_partial
│ String Float64 Float64 Float64
─────┼─────────────────────────────────────────────
1 │ zinc 0.00271249 0.00695574 0.389964
2 │ cadmium 0.00517726 0.0132763 0.389964
3 │ chromium 0.00802308 0.0205739 0.389964
4 │ sodium 0.00843016 0.0216178 0.389964
5 │ mercury 0.0095572 0.0245079 0.389964
6 │ arsenic 0.013444 0.0344749 0.389964
7 │ silver 0.0136815 0.0350839 0.389964
8 │ barium 0.0231929 0.0594745 0.389964
9 │ iron 0.0241278 0.061872 0.389964
10 │ calcium 0.281618 0.722163 0.389964
─────────────────────────────────────────────────────────────────────────
Coef. Std. Error z Pr(>|z|) Lower 95% Upper 95%
─────────────────────────────────────────────────────────────────────────
(Intercept) -0.35667 0.107878 -3.31 0.0009 -0.568107 -0.145233
mixture 0.266394 0.0710247 3.75 0.0002 0.127188 0.4056
─────────────────────────────────────────────────────────────────────────
First note that qgcomp can be very fast relative to competing methods (with their example times given from single runs from a laptop).
One advantage of quantile g-computation over other methods that estimate "mixture effects" (the effect of changing all exposures at once), is that it is very computationally efficient. Contrasting methods such as WQS (gWQS package) and Bayesian Kernel Machine regression (bkmr package), quantile g-computation can provide results many orders of magnitude faster. For example, the example above ran 3000X faster for quantile g-computation versus WQS regression, and we estimate the speedup would be several hundred thousand times versus Bayesian kernel machine regression.
The speed relies on an efficient method to fit qgcomp when exposures are added additively to the model. When exposures are added using non-linear terms or non-additive terms (see below for examples), then qgcomp will be slower but often still faster than competetive approaches.
Quantile g-computation yields fixed weights in the estimation procedure, similar to WQS regression. However, note that the weights from qgcomp_glm_noboot can be negative or positive. When all effects are linear and in the same direction ("directional homogeneity"), quantile g-computation is equivalent to weighted quantile sum regression in large samples.
The overall mixture effect from quantile g-computation ($\psi$1) is interpreted as the effect on the outcome of increasing every exposure by one quantile, possibly conditional on covariates. Given the overall exposure effect, the weights are considered fixed and so do not have confidence intervals or p-values.
qc_fitScaled effect size (negative direction)
5×4 DataFrame
Row │ exposure coef weight ψ_partial
│ String Float64 Float64 Float64
─────┼─────────────────────────────────────────────────
1 │ selenium -0.000105864 0.000856712 -0.12357
2 │ manganese -0.00788716 0.0638276 -0.12357
3 │ lead -0.00914645 0.0740185 -0.12357
4 │ copper -0.0476113 0.385299 -0.12357
5 │ magnesium -0.058819 0.475999 -0.12357
Scaled effect size (positive direction)
10×4 DataFrame
Row │ exposure coef weight ψ_partial
│ String Float64 Float64 Float64
─────┼─────────────────────────────────────────────
1 │ zinc 0.00271249 0.00695574 0.389964
2 │ cadmium 0.00517726 0.0132763 0.389964
3 │ chromium 0.00802308 0.0205739 0.389964
4 │ sodium 0.00843016 0.0216178 0.389964
5 │ mercury 0.0095572 0.0245079 0.389964
6 │ arsenic 0.013444 0.0344749 0.389964
7 │ silver 0.0136815 0.0350839 0.389964
8 │ barium 0.0231929 0.0594745 0.389964
9 │ iron 0.0241278 0.061872 0.389964
10 │ calcium 0.281618 0.722163 0.389964
─────────────────────────────────────────────────────────────────────────
Coef. Std. Error z Pr(>|z|) Lower 95% Upper 95%
─────────────────────────────────────────────────────────────────────────
(Intercept) -0.35667 0.107878 -3.31 0.0009 -0.568107 -0.145233
mixture 0.266394 0.0710247 3.75 0.0002 0.127188 0.4056
─────────────────────────────────────────────────────────────────────────
Now let"s take a brief look under the hood. qgcomp works in steps. First, the exposure variables are "quantized" or turned into score variables based on the total number of quantiles from the parameter q. You can access these via the qx object from the qgcomp fit object.
# quantized data
println(qc_fit.data[1:10,:])10×16 DataFrame
Row │ arsenic barium cadmium calcium chromium copper iron lead magnesium manganese mercury selenium silver sodium zinc y
│ Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64
─────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
1 │ 2.0 2.0 3.0 0.0 1.0 3.0 3.0 3.0 1.0 3.0 0.0 2.0 0.0 0.0 3.0 -0.600799
2 │ 2.0 2.0 0.0 0.0 2.0 2.0 1.0 2.0 2.0 0.0 1.0 3.0 1.0 3.0 0.0 -0.20223
3 │ 2.0 2.0 3.0 1.0 3.0 3.0 1.0 1.0 3.0 1.0 2.0 2.0 1.0 3.0 0.0 -1.21641
4 │ 1.0 0.0 0.0 3.0 1.0 3.0 0.0 2.0 3.0 1.0 2.0 2.0 3.0 3.0 3.0 0.182631
5 │ 2.0 3.0 2.0 3.0 0.0 1.0 2.0 2.0 1.0 3.0 3.0 1.0 0.0 0.0 0.0 1.17605
6 │ 3.0 2.0 0.0 0.0 0.0 2.0 1.0 1.0 0.0 1.0 2.0 0.0 3.0 3.0 2.0 -0.410091
7 │ 1.0 3.0 2.0 2.0 0.0 3.0 2.0 1.0 3.0 1.0 0.0 1.0 2.0 2.0 3.0 -0.592918
8 │ 2.0 0.0 1.0 2.0 2.0 3.0 0.0 1.0 2.0 0.0 2.0 0.0 0.0 0.0 3.0 0.0405167
9 │ 2.0 1.0 3.0 1.0 2.0 2.0 2.0 3.0 2.0 2.0 0.0 3.0 2.0 2.0 1.0 -0.472098
10 │ 2.0 0.0 0.0 3.0 1.0 1.0 3.0 2.0 3.0 3.0 3.0 0.0 2.0 3.0 3.0 0.434029You can re-fit a linear model using these quantized exposures. This is the "underlying model" of a qgcomp fit.
# regression with quantized data
newfit = lm(@formula(y ~ arsenic + barium + cadmium + calcium + chromium + copper +
iron + lead + magnesium + manganese + mercury + selenium +
silver + sodium + zinc), qc_fit.data)
println(newfit)StatsModels.TableRegressionModel{GLM.LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}
y ~ 1 + arsenic + barium + cadmium + calcium + chromium + copper + iron + lead + magnesium + manganese + mercury + selenium + silver + sodium + zinc
Coefficients:
──────────────────────────────────────────────────────────────────────────────
Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
──────────────────────────────────────────────────────────────────────────────
(Intercept) -0.35667 0.107878 -3.31 0.0010 -0.568696 -0.144644
arsenic 0.013444 0.0184371 0.73 0.4663 -0.0227926 0.0496805
barium 0.0231929 0.0183474 1.26 0.2069 -0.0128675 0.0592533
cadmium 0.00517726 0.0181048 0.29 0.7750 -0.0304063 0.0407608
calcium 0.281618 0.0221639 12.71 <1e-30 0.238056 0.325179
chromium 0.00802308 0.0183908 0.44 0.6629 -0.0281225 0.0441687
copper -0.0476113 0.0184657 -2.58 0.0103 -0.0839042 -0.0113184
iron 0.0241278 0.0201366 1.20 0.2315 -0.015449 0.0637047
lead -0.00914645 0.0183426 -0.50 0.6183 -0.0451973 0.0269044
magnesium -0.058819 0.0195185 -3.01 0.0027 -0.0971811 -0.0204569
manganese -0.00788716 0.0210958 -0.37 0.7087 -0.0493493 0.033575
mercury 0.0095572 0.018297 0.52 0.6017 -0.0264041 0.0455185
selenium -0.000105864 0.0183615 -0.01 0.9954 -0.036194 0.0359823
silver 0.0136815 0.0183149 0.75 0.4555 -0.022315 0.049678
sodium 0.00843016 0.0194193 0.43 0.6644 -0.0297368 0.0465971
zinc 0.00271249 0.0188122 0.14 0.8854 -0.0342614 0.0396864
──────────────────────────────────────────────────────────────────────────────Here you can see that, for a GLM in which all quantized exposures enter linearly and additively into the underlying model, the overall effect from qgcomp is simply the sum of the adjusted coefficients from the underlying model.
println(sum(coef(newfit)[2:end])) # sum of all coefficients excluding intercept and confounders, if any
println(qc_fit.fit[1][2]) # overall effect and intercept from qgcomp fit0.2663941776757069
0.2663941776757085This equality is why we can fit qgcomp so efficiently under such a model. This is a specific case, and qgcomp also allows deviations from linear/additive approaches via Monte-Carlo (here generally referred to as bootstrapping methods) and estimating-equation-based methods, which require a 2-stage approach. qgcomp can allow for non-linearity and non-additivity in the underlying model, as well as non-linearity in the overall model. These extensions are described in some of the following examples.
Example 2: conditional odds ratio, marginal odds ratio in a logistic model<a name="ex-logistic"></a>
This example introduces the use of a binary outcome in qgcomp via the qgcomp_glm_noboot function, which yields a conditional odds ratio or the qgcomp_glm_boot, which yields a marginal odds ratio or risk/prevalence ratio. These will not equal each other when there are non-exposure covariates (e.g. confounders) included in the model because the odds ratio is not collapsible (both are still valid). Marginal parameters will yield estimates of the population average exposure effect, which is often of more interest due to better interpretability over conditional odds ratios. Further, odds ratios are not generally of interest when risk ratios can be validly estimated, so qgcomp_glm_boot will estimate the risk ratio by default for binary data (set rr=FALSE to allow estimation of ORs when using qgcomp_glm_boot).
f = @formula(disease_state~1+a+b)
ff = FormulaTerm(f.lhs, (GLM.Term.(Symbol.(Xnm))...,))
# conditional odds ratio
qc_fit2 = qgcomp_glm_noboot(ff, metals[:,vcat(Xnm, "disease_state")], Xnm, 4, Binomial())
# marginal odds ratio
qcboot_fit2 = qgcomp_glm_boot(Xoshiro(122),ff, metals[:,vcat(Xnm, "disease_state")], Xnm, 4, Binomial(), B=10)
# marginal risk ratio
qcboot_fit2b = qgcomp_glm_boot(Xoshiro(122),ff, metals[:,vcat(Xnm, "disease_state")], Xnm, 4, Binomial(), B=10, msmlink=LogLink())Underlying fit (Model-based CI)
────────────────────────────────────────────────────────────────────────────
Coef. Std. Error z Pr(>|z|) Lower 95% Upper 95%
────────────────────────────────────────────────────────────────────────────
(Intercept) 0.26362 0.516085 0.51 0.6095 -0.747888 1.27513
arsenic -0.0885288 0.0882484 -1.00 0.3158 -0.261492 0.0844348
barium 0.138002 0.0881834 1.56 0.1176 -0.0348345 0.310838
cadmium -0.0447591 0.0868225 -0.52 0.6062 -0.214928 0.12541
calcium -0.085805 0.106409 -0.81 0.4200 -0.294363 0.122753
chromium 0.0628293 0.0881578 0.71 0.4760 -0.109957 0.235615
copper -0.113293 0.0886233 -1.28 0.2011 -0.286991 0.0604059
iron -0.0215195 0.0966865 -0.22 0.8239 -0.211022 0.167983
lead -0.0299151 0.0881026 -0.34 0.7342 -0.202593 0.142763
magnesium 0.0506531 0.0937246 0.54 0.5889 -0.133044 0.23435
manganese -0.0718927 0.10069 -0.71 0.4752 -0.269242 0.125457
mercury -0.033734 0.0878275 -0.38 0.7009 -0.205873 0.138405
selenium -0.206726 0.0884817 -2.34 0.0195 -0.380147 -0.0333048
silver 0.0367468 0.0879218 0.42 0.6760 -0.135577 0.20907
sodium 0.0252791 0.0928861 0.27 0.7855 -0.156774 0.207332
zinc 0.0784991 0.0900685 0.87 0.3835 -0.098032 0.25503
────────────────────────────────────────────────────────────────────────────
MSM (Bootstrap based CI)
Exposure specific weights not estimated in this type of model
─────────────────────────────────────────────────────────────────────────
Coef. Std. Error z Pr(>|z|) Lower 95% Upper 95%
─────────────────────────────────────────────────────────────────────────
(Intercept) -0.56237 0.205017 -2.74 0.0061 -0.964197 -0.160543
mixture -0.163725 0.138491 -1.18 0.2371 -0.435162 0.107712
─────────────────────────────────────────────────────────────────────────
Compare a qgcomp_glm_noboot fit:
qc_fit2Scaled effect size (negative direction)
9×4 DataFrame
Row │ exposure coef weight ψ_partial
│ String Float64 Float64 Float64
─────┼─────────────────────────────────────────────
1 │ iron -0.0215195 0.0309112 -0.696173
2 │ lead -0.0299151 0.0429708 -0.696173
3 │ mercury -0.033734 0.0484564 -0.696173
4 │ cadmium -0.0447591 0.0642931 -0.696173
5 │ manganese -0.0718927 0.103268 -0.696173
6 │ calcium -0.085805 0.123252 -0.696173
7 │ arsenic -0.0885288 0.127165 -0.696173
8 │ copper -0.113293 0.162736 -0.696173
9 │ selenium -0.206726 0.296946 -0.696173
Scaled effect size (positive direction)
6×4 DataFrame
Row │ exposure coef weight ψ_partial
│ String Float64 Float64 Float64
─────┼────────────────────────────────────────────
1 │ sodium 0.0252791 0.064486 0.392009
2 │ silver 0.0367468 0.0937398 0.392009
3 │ magnesium 0.0506531 0.129214 0.392009
4 │ chromium 0.0628293 0.160275 0.392009
5 │ zinc 0.0784991 0.200248 0.392009
6 │ barium 0.138002 0.352037 0.392009
─────────────────────────────────────────────────────────────────────────
Coef. Std. Error z Pr(>|z|) Lower 95% Upper 95%
─────────────────────────────────────────────────────────────────────────
(Intercept) 0.26362 0.516085 0.51 0.6095 -0.747888 1.27513
mixture -0.304163 0.340134 -0.89 0.3712 -0.970813 0.362486
─────────────────────────────────────────────────────────────────────────
with a qgcompglmboot fit:
qcboot_fit2Underlying fit (Model-based CI)
────────────────────────────────────────────────────────────────────────────
Coef. Std. Error z Pr(>|z|) Lower 95% Upper 95%
────────────────────────────────────────────────────────────────────────────
(Intercept) 0.26362 0.516085 0.51 0.6095 -0.747888 1.27513
arsenic -0.0885288 0.0882484 -1.00 0.3158 -0.261492 0.0844348
barium 0.138002 0.0881834 1.56 0.1176 -0.0348345 0.310838
cadmium -0.0447591 0.0868225 -0.52 0.6062 -0.214928 0.12541
calcium -0.085805 0.106409 -0.81 0.4200 -0.294363 0.122753
chromium 0.0628293 0.0881578 0.71 0.4760 -0.109957 0.235615
copper -0.113293 0.0886233 -1.28 0.2011 -0.286991 0.0604059
iron -0.0215195 0.0966865 -0.22 0.8239 -0.211022 0.167983
lead -0.0299151 0.0881026 -0.34 0.7342 -0.202593 0.142763
magnesium 0.0506531 0.0937246 0.54 0.5889 -0.133044 0.23435
manganese -0.0718927 0.10069 -0.71 0.4752 -0.269242 0.125457
mercury -0.033734 0.0878275 -0.38 0.7009 -0.205873 0.138405
selenium -0.206726 0.0884817 -2.34 0.0195 -0.380147 -0.0333048
silver 0.0367468 0.0879218 0.42 0.6760 -0.135577 0.20907
sodium 0.0252791 0.0928861 0.27 0.7855 -0.156774 0.207332
zinc 0.0784991 0.0900685 0.87 0.3835 -0.098032 0.25503
────────────────────────────────────────────────────────────────────────────
MSM (Bootstrap based CI)
Exposure specific weights not estimated in this type of model
─────────────────────────────────────────────────────────────────────────
Coef. Std. Error z Pr(>|z|) Lower 95% Upper 95%
─────────────────────────────────────────────────────────────────────────
(Intercept) 0.26362 0.441401 0.60 0.5504 -0.60151 1.12875
mixture -0.304163 0.276131 -1.10 0.2707 -0.84537 0.237043
─────────────────────────────────────────────────────────────────────────
with a qgcompglmboot fit, where the risk/prevalence ratio is estimated, rather than the odds ratio:
qcboot_fit2bUnderlying fit (Model-based CI)
────────────────────────────────────────────────────────────────────────────
Coef. Std. Error z Pr(>|z|) Lower 95% Upper 95%
────────────────────────────────────────────────────────────────────────────
(Intercept) 0.26362 0.516085 0.51 0.6095 -0.747888 1.27513
arsenic -0.0885288 0.0882484 -1.00 0.3158 -0.261492 0.0844348
barium 0.138002 0.0881834 1.56 0.1176 -0.0348345 0.310838
cadmium -0.0447591 0.0868225 -0.52 0.6062 -0.214928 0.12541
calcium -0.085805 0.106409 -0.81 0.4200 -0.294363 0.122753
chromium 0.0628293 0.0881578 0.71 0.4760 -0.109957 0.235615
copper -0.113293 0.0886233 -1.28 0.2011 -0.286991 0.0604059
iron -0.0215195 0.0966865 -0.22 0.8239 -0.211022 0.167983
lead -0.0299151 0.0881026 -0.34 0.7342 -0.202593 0.142763
magnesium 0.0506531 0.0937246 0.54 0.5889 -0.133044 0.23435
manganese -0.0718927 0.10069 -0.71 0.4752 -0.269242 0.125457
mercury -0.033734 0.0878275 -0.38 0.7009 -0.205873 0.138405
selenium -0.206726 0.0884817 -2.34 0.0195 -0.380147 -0.0333048
silver 0.0367468 0.0879218 0.42 0.6760 -0.135577 0.20907
sodium 0.0252791 0.0928861 0.27 0.7855 -0.156774 0.207332
zinc 0.0784991 0.0900685 0.87 0.3835 -0.098032 0.25503
────────────────────────────────────────────────────────────────────────────
MSM (Bootstrap based CI)
Exposure specific weights not estimated in this type of model
─────────────────────────────────────────────────────────────────────────
Coef. Std. Error z Pr(>|z|) Lower 95% Upper 95%
─────────────────────────────────────────────────────────────────────────
(Intercept) -0.56237 0.205017 -2.74 0.0061 -0.964197 -0.160543
mixture -0.163725 0.138491 -1.18 0.2371 -0.435162 0.107712
─────────────────────────────────────────────────────────────────────────
Example 3: adjusting for covariates, plotting estimates
In the following code we run a maternal age-adjusted linear model with qgcomp (family = Normal()). Further, we plot both the weights, as well as the mixture slope which yields overall model confidence bounds, representing the bounds that, for each value of the joint exposure are expected to contain the true regression line over 95% of trials (so-called 95% "pointwise" bounds for the regression line). The pointwise comparison bounds, denoted by error bars on the plot, represent comparisons of the expected difference in outcomes at each quantile, with reference to a specific quantile (which can be specified by the user, as below). These pointwise bounds are similar to the bounds created in the bkmr package when plotting the overall effect of all exposures. The pointwise bounds can be obtained via the pointwisebound.boot function. To avoid confusion between "pointwise regression" and "pointwise comparison" bounds, the pointwise regression bounds are denoted as the "model confidence band" in the plots, since they yield estimates of the same type of bounds as the predict function in R when applied to linear model fits.
Note that the underlying regression model is on the exposure quantile "scores", which take on integer values 0, 1, ..., q-1. For plotting purposes (when plotting regression line results from qgcompglmboot), the quantile score is translated into a quantile (range = [0-1]). This is not a perfect correspondence, because the quantile g-computation model treats the quantile score as a continuous variable, but the each quantile category spans a range of quantiles. For visualization, we fix the ends of the plot at the mid-points of the first and last quantile cut-point, so the range of the plot will change slightly if "q" is changed.
using Plots
qc_fit3 = qgcomp_glm_noboot(@formula(y ~ mage35 + arsenic + barium + cadmium + calcium + chloride +
chromium + copper + iron + lead + magnesium + manganese +
mercury + selenium + silver + sodium + zinc),
metals, Xnm, 4, Normal())
println(qc_fit3)Scaled effect size (negative direction)
5×4 DataFrame
Row │ exposure coef weight ψ_partial
│ String Float64 Float64 Float64
─────┼────────────────────────────────────────────────
1 │ selenium -0.000460942 0.00371898 -0.123943
2 │ manganese -0.00755256 0.0609356 -0.123943
3 │ lead -0.0105484 0.0851069 -0.123943
4 │ copper -0.0439328 0.354459 -0.123943
5 │ magnesium -0.0614485 0.495779 -0.123943
Scaled effect size (positive direction)
10×4 DataFrame
Row │ exposure coef weight ψ_partial
│ String Float64 Float64 Float64
─────┼─────────────────────────────────────────────
1 │ cadmium 0.00206825 0.00542973 0.380913
2 │ sodium 0.00295159 0.00774873 0.380913
3 │ zinc 0.00420104 0.0110289 0.380913
4 │ chromium 0.00893029 0.0234445 0.380913
5 │ mercury 0.0103034 0.0270493 0.380913
6 │ arsenic 0.0107529 0.0282292 0.380913
7 │ silver 0.0143415 0.0376503 0.380913
8 │ iron 0.018433 0.0483918 0.380913
9 │ barium 0.025279 0.0663644 0.380913
10 │ calcium 0.283652 0.744663 0.380913
────────────────────────────────────────────────────────────────────────────
Coef. Std. Error z Pr(>|z|) Lower 95% Upper 95%
────────────────────────────────────────────────────────────────────────────
(Intercept) -0.348084 0.108037 -3.22 0.0013 -0.559832 -0.136336
mixture 0.256969 0.0603624 4.26 <1e-04 0.138661 0.375277
mage35 0.0346352 0.071459 0.48 0.6279 -0.105422 0.174692
chloride 0.0284367 0.0215315 1.32 0.1866 -0.0137643 0.0706378
────────────────────────────────────────────────────────────────────────────weightplot(qc_fit3)From the first plot we see weights from qgcomp_glm_noboot function, which include both positive and negative effect directions. When the weights are all on a single side of the null, these plots are easy to in interpret since the weight corresponds to the proportion of the overall effect from each exposure. WQS uses a constraint in the model to force all of the weights to be in the same direction - unfortunately such constraints lead to biased effect estimates. The qgcomp package takes a different approach and allows that "weights" might go in either direction, indicating that some exposures may beneficial, and some harmful, or there may be sampling variation due to using small or moderate sample sizes (or, more often, systematic bias such as unmeasured confounding). The "weights" in qgcomp correspond to the proportion of the overall effect when all of the exposures have effects in the same direction, but otherwise they correspond to the proportion of the effect in a particular direction, which may be small (or large) compared to the overall "mixture" effect. NOTE: the left and right sides of the plot should not be compared with each other because the length of the bars corresponds to the effect size only relative to other effects in the same direction. The darkness of the bars corresponds to the overall effect size - in this case the bars on the right (positive) side of the plot are darker because the overall "mixture" effect is positive. Thus, the shading allows one to make informal comparisons across the left and right sides: a large, darkly shaded bar indicates a larger independent effect than a large, lightly shaded bar.
qcboot_fit3 = qgcomp_glm_boot(Xoshiro(122), @formula(y ~ mage35 + arsenic + barium + cadmium + calcium + chloride +
chromium + copper + iron + lead + magnesium + manganese +
mercury + selenium + silver + sodium + zinc), metals, Xnm, 4, Normal(),
B=50)# B should be 200-500+ in practice
println(qcboot_fit3)Underlying fit (Model-based CI)
───────────────────────────────────────────────────────────────────────────────
Coef. Std. Error z Pr(>|z|) Lower 95% Upper 95%
───────────────────────────────────────────────────────────────────────────────
(Intercept) -0.348084 0.108037 -3.22 0.0013 -0.559832 -0.136336
mage35 0.0346352 0.0603624 0.57 0.5661 -0.0836729 0.152943
arsenic 0.0107529 0.0188063 0.57 0.5675 -0.0261068 0.0476125
barium 0.025279 0.0184157 1.37 0.1698 -0.0108151 0.0613732
cadmium 0.00206825 0.0188566 0.11 0.9127 -0.03489 0.0390265
calcium 0.283652 0.0222722 12.74 <1e-36 0.239999 0.327304
chloride 0.0284367 0.0215315 1.32 0.1866 -0.0137643 0.0706378
chromium 0.00893029 0.0184023 0.49 0.6275 -0.0271375 0.0449981
copper -0.0439328 0.0186386 -2.36 0.0184 -0.0804638 -0.00740184
iron 0.018433 0.0205405 0.90 0.3695 -0.0218257 0.0586918
lead -0.0105484 0.0183997 -0.57 0.5664 -0.0466112 0.0255143
magnesium -0.0614485 0.019601 -3.13 0.0017 -0.0998658 -0.0230312
manganese -0.00755256 0.0211057 -0.36 0.7205 -0.048919 0.0338139
mercury 0.0103034 0.0184555 0.56 0.5767 -0.0258688 0.0464756
selenium -0.000460942 0.0183756 -0.03 0.9800 -0.0364765 0.0355546
silver 0.0143415 0.0183184 0.78 0.4337 -0.0215619 0.0502448
sodium 0.00295159 0.019978 0.15 0.8825 -0.0362045 0.0421077
zinc 0.00420104 0.0188433 0.22 0.8236 -0.0327311 0.0411332
───────────────────────────────────────────────────────────────────────────────
MSM (Bootstrap based CI)
Exposure specific weights not estimated in this type of model
─────────────────────────────────────────────────────────────────────────
Coef. Std. Error z Pr(>|z|) Lower 95% Upper 95%
─────────────────────────────────────────────────────────────────────────
(Intercept) -0.342787 0.119434 -2.87 0.0041 -0.576874 -0.108701
mixture 0.256969 0.0782475 3.28 0.0010 0.103607 0.410332
─────────────────────────────────────────────────────────────────────────qcee_fit3 = qgcomp_glm_ee(@formula(y ~ mage35 + arsenic + barium + cadmium + calcium + chloride +
chromium + copper + iron + lead + magnesium + manganese +
mercury + selenium + silver + sodium + zinc), metals, Xnm, 4, Normal())
println(qcee_fit3)Underlying fit (estimating equation based CI)
────────────────────────────────────────────────────────────────────────────────
Coef. Std. Error z Pr(>|z|) Lower 95% Upper 95%
────────────────────────────────────────────────────────────────────────────────
(Intercept) -0.348084 0.101578 -3.43 0.0006 -0.547173 -0.148995
mage35 0.0346352 0.0580126 0.60 0.5505 -0.0790675 0.148338
arsenic 0.0107529 0.0180889 0.59 0.5522 -0.0247006 0.0462063
barium 0.025279 0.0175226 1.44 0.1491 -0.00906456 0.0596227
cadmium 0.00206825 0.0186968 0.11 0.9119 -0.0345768 0.0387133
calcium 0.283652 0.0198227 14.31 <1e-45 0.2448 0.322503
chloride 0.0284367 0.0212988 1.34 0.1818 -0.0133081 0.0701816
chromium 0.00893029 0.0181076 0.49 0.6219 -0.02656 0.0444205
copper -0.0439328 0.0178306 -2.46 0.0137 -0.0788801 -0.00898556
iron 0.018433 0.0186486 0.99 0.3229 -0.0181176 0.0549837
lead -0.0105484 0.0183068 -0.58 0.5645 -0.046429 0.0253322
magnesium -0.0614485 0.0191222 -3.21 0.0013 -0.0989273 -0.0239697
manganese -0.00755256 0.0211863 -0.36 0.7215 -0.0490769 0.0339718
mercury 0.0103034 0.017422 0.59 0.5542 -0.023843 0.0444499
selenium -0.000460942 0.0171991 -0.03 0.9786 -0.0341706 0.0332487
silver 0.0143415 0.0175776 0.82 0.4146 -0.02011 0.0487929
sodium 0.00295159 0.0178978 0.16 0.8690 -0.0321275 0.0380307
zinc 0.00420104 0.01872 0.22 0.8224 -0.0324894 0.0408915
────────────────────────────────────────────────────────────────────────────────
Exposure specific weights not estimated in this type of model
MSM (estimating equation based CI)
─────────────────────────────────────────────────────────────────────────
Coef. Std. Error z Pr(>|z|) Lower 95% Upper 95%
─────────────────────────────────────────────────────────────────────────
(Intercept) -0.342787 0.1014 -3.38 0.0007 -0.541528 -0.144047
mixture 0.256969 0.0677098 3.80 0.0001 0.124261 0.389678
─────────────────────────────────────────────────────────────────────────We can change the referent category for pointwise comparisons via the referentindex parameter:
printbounds(bounds(qcee_fit3))Pointwise bounds
4×6 DataFrame
Row │ mixture linpred diff ll_diff ul_diff se_diff
│ Float64 Float64 Float64 Float64 Float64 Float64
─────┼──────────────────────────────────────────────────────────────
1 │ 0.0 -0.342787 0.0 0.0 0.0 0.0
2 │ 1.0 -0.0858181 0.256969 0.124261 0.389678 0.0677098
3 │ 2.0 0.171151 0.513939 0.248521 0.779356 0.13542
4 │ 3.0 0.428121 0.770908 0.372782 1.16903 0.203129
Modelwise bounds
4×4 DataFrame
Row │ mixture linpred ll_simul ul_simul
│ Float64 Float64 Float64 Float64
─────┼────────────────────────────────────────────
1 │ 0.0 -0.342787 -0.749405 0.0436042
2 │ 1.0 -0.0858181 -0.238027 0.0720151
3 │ 2.0 0.171151 0.0202361 0.319111
4 │ 3.0 0.428121 0.0662244 0.807802printbounds(bounds(qcboot_fit3))Pointwise bounds
4×6 DataFrame
Row │ mixture linpred diff ll_diff ul_diff se_diff
│ Float64 Float64 Float64 Float64 Float64 Float64
─────┼──────────────────────────────────────────────────────────────
1 │ 0.0 -0.342787 0.0 0.0 0.0 0.0
2 │ 1.0 -0.0858181 0.256969 0.103607 0.410332 0.0782475
3 │ 2.0 0.171151 0.513939 0.207214 0.820663 0.156495
4 │ 3.0 0.428121 0.770908 0.310821 1.231 0.234743
Modelwise bounds
4×4 DataFrame
Row │ mixture linpred ll_simul ul_simul
│ Float64 Float64 Float64 Float64
─────┼─────────────────────────────────────────────
1 │ 0.0 -0.342787 -0.44688 -0.44688
2 │ 1.0 -0.0858181 -0.0910631 -0.0910631
3 │ 2.0 0.171151 0.264754 0.264754
4 │ 3.0 0.428121 0.620571 0.620571responseplot(qcee_fit3, referentindex = 3, plots=["pointwise", "model"])responseplot(qcboot_fit3, referentindex = 3, plots=["pointwise", "model"])Using qgcomp_glm_boot also allows us to assess linearity of the total exposure effect (the second plot). Similar output is available for WQS (gWQS package), though WQS results will generally be less interpretable when exposure effects are non-linear (see below how to do this with qgcomp_glm_boot and qgcomp_glm_ee).
The plot for the qcboot_fit3 object (using g-computation with bootstrap variance) gives predictions at the joint intervention levels of exposure. It also displays a smoothed (graphical) fit.
Note that the uncertainty intervals given in the plot are directly accessible via the pointwisebound (pointwise comparison confidence intervals) and modelbound functions (confidence interval for the regression line):
printbounds(bounds(qcee_fit3, qcee_fit3.intvals, qcee_fit3.intvals[3]))Pointwise bounds
4×6 DataFrame
Row │ mixture linpred diff ll_diff ul_diff se_diff
│ Float64 Float64 Float64 Float64 Float64 Float64
─────┼─────────────────────────────────────────────────────────────────
1 │ 0.0 -0.342787 -0.513939 -0.779356 -0.248521 0.13542
2 │ 1.0 -0.0858181 -0.256969 -0.389678 -0.124261 0.0677098
3 │ 2.0 0.171151 0.0 0.0 0.0 0.0
4 │ 3.0 0.428121 0.256969 0.124261 0.389678 0.0677098
Modelwise bounds
4×4 DataFrame
Row │ mixture linpred ll_simul ul_simul
│ Float64 Float64 Float64 Float64
─────┼─────────────────────────────────────────────
1 │ 0.0 -0.342787 -0.717575 0.12359
2 │ 1.0 -0.0858181 -0.220909 0.0855921
3 │ 2.0 0.171151 0.0161505 0.324496
4 │ 3.0 0.428121 0.00959701 0.845532printbounds(bounds(qcboot_fit3))Pointwise bounds
4×6 DataFrame
Row │ mixture linpred diff ll_diff ul_diff se_diff
│ Float64 Float64 Float64 Float64 Float64 Float64
─────┼──────────────────────────────────────────────────────────────
1 │ 0.0 -0.342787 0.0 0.0 0.0 0.0
2 │ 1.0 -0.0858181 0.256969 0.103607 0.410332 0.0782475
3 │ 2.0 0.171151 0.513939 0.207214 0.820663 0.156495
4 │ 3.0 0.428121 0.770908 0.310821 1.231 0.234743
Modelwise bounds
4×4 DataFrame
Row │ mixture linpred ll_simul ul_simul
│ Float64 Float64 Float64 Float64
─────┼─────────────────────────────────────────────
1 │ 0.0 -0.342787 -0.44688 -0.44688
2 │ 1.0 -0.0858181 -0.0910631 -0.0910631
3 │ 2.0 0.171151 0.264754 0.264754
4 │ 3.0 0.428121 0.620571 0.620571Because qgcomp estimates a joint effect of multiple exposures, we cannot, in general, assess model fit by overlaying predictions from the plots above with the data. Hence, it is useful to explore non-linearity by fitting models that allow for non-linear effects, as in the next example.
Example 4: non-linearity (and non-homogeneity)
qgcomp (specifically qgcomp_*_boot and qgcomp_*_ee methods) addresses non-linearity in a way similar to standard parametric regression models, which lends itself to being able to leverage R language features for n-lin parametric models (or, more precisely, parametric models that deviate from a purely additive, linear function on the link function basis via the use of basis function representation of non-linear functions). Here is an example where we use a feature of the R language for fitting models with interaction terms. We use y~. + .^2 as the model formula, which fits a model that allows for quadratic term for every predictor in the model.
Aside: some details on qgcomp methods for non-linearity
Note that both qgcomp_*_boot (bootstrap) and qgcomp_*_ee (estimating equations) use standard methods for g-computation, whereas the qgcomp_*_noboot methods use a fast algorithm that works under the assumption of linearity and additivity of exposures (as described in the original paper on quantile-based g-computation). The "standard" method of g-computation with time-fixed exposures involves first fitting conditional models for the outcome, making predictions from those models under set exposure values, and then summarizing the predicted outcome distribution, possibly by fitting a second (marginal structural) model. qgcomp_*_boot follows this three-step process, while qgcomp_*_ee leverages estimating equations (sometimes: M-estimation) to estimate the parameters of the conditional and marginal structural model simultaneously. qgcomp_*_ee uses a sandwich variance estimator, which is similar to GEE (generalized estimating equation) approaches, and thus, when used correctly, can yield inference for longitudinal data in the same way that GEE does. The bootstrapping approach can also do this, but it takes longer. The extension to longitudinal data is representative of the broader concept that qgcomp_*_boot and qgcomp_*_ee can be used in a broader number of settings than qgcomp_*_noboot algorithms, but if one assumes linearity and additivity with no clustering of observations, and conditional parameters are of interest, then they are just a slower way to get equivalent results to qgcomp_*_noboot.
Below, we demonstrate a non-linear conditional fit (with a linear MSM) using the bootstrap approach. Similar approaches could be used to include interaction terms between exposures, as well as between exposures and covariates. Note this example is purposefully done incorrectly, as explained below.
# create a formula with all polynomial terms for expsosure programatically by:
# 1: outcome term
lhs = GLM.Term(:y)
# 2: main terms for all exposures
main_terms = GLM.Term.(Symbol.(Xnm))
# 3: create squared terms for all exposures via a custom function
function powerterm(x::S,power::I) where {S<:AbstractString, I<:Int}
FunctionTerm(^, [Term(Symbol(x)),ConstantTerm(power)], Expr(:call, :^, Symbol(x), power))
end
squared_terms = powerterm.(Xnm, 2)
# use ... "splatting" to combine all terms into a tuple
rhs = (vcat(main_terms, squared_terms)...,)
ffsq = FormulaTerm(lhs, rhs) # this is a shorthand way of constructing the same object you would get from @formula(y~ x1 + x1^2 + x2 + x2^2 + ...)
qcboot_fit4 = qgcomp_glm_boot(rng, ffsq, metals, Xnm, 4, Normal(), B=100)
println(qcboot_fit4)Underlying fit (Model-based CI)
────────────────────────────────────────────────────────────────────────────────
Coef. Std. Error z Pr(>|z|) Lower 95% Upper 95%
────────────────────────────────────────────────────────────────────────────────
(Intercept) -0.376056 0.136222 -2.76 0.0058 -0.643045 -0.109066
arsenic 0.030428 0.065257 0.47 0.6410 -0.0974733 0.158329
barium -0.0198754 0.0658611 -0.30 0.7628 -0.148961 0.10921
cadmium 0.0911165 0.0640401 1.42 0.1548 -0.0343997 0.216633
calcium 0.145071 0.0855895 1.69 0.0901 -0.0226818 0.312823
chromium 0.0957044 0.0644359 1.49 0.1375 -0.0305877 0.221996
copper -0.102326 0.065496 -1.56 0.1182 -0.230695 0.0260441
iron 0.0470454 0.0692798 0.68 0.4971 -0.0887405 0.182831
lead 0.0545809 0.0653268 0.84 0.4034 -0.0734572 0.182619
magnesium -0.0198047 0.068335 -0.29 0.7720 -0.153739 0.114129
manganese -0.00827842 0.0657335 -0.13 0.8998 -0.137114 0.120557
mercury -0.0495722 0.0652753 -0.76 0.4476 -0.177509 0.0783651
selenium 0.0843153 0.064828 1.30 0.1934 -0.0427453 0.211376
silver 0.00961506 0.0651674 0.15 0.8827 -0.118111 0.137341
sodium 0.112357 0.074819 1.50 0.1332 -0.0342858 0.258999
zinc -0.0530423 0.0646022 -0.82 0.4116 -0.17966 0.0735756
arsenic ^ 2 -0.00548169 0.0210363 -0.26 0.7944 -0.0467121 0.0357487
barium ^ 2 0.0128003 0.0211339 0.61 0.5447 -0.0286214 0.054222
cadmium ^ 2 -0.02871 0.0204728 -1.40 0.1608 -0.0688359 0.0114158
calcium ^ 2 0.0415287 0.0261331 1.59 0.1120 -0.00969135 0.0927487
chromium ^ 2 -0.0290913 0.0206698 -1.41 0.1593 -0.0696034 0.0114208
copper ^ 2 0.0190171 0.0210433 0.90 0.3661 -0.0222269 0.0602611
iron ^ 2 -0.00839646 0.0228599 -0.37 0.7134 -0.0532011 0.0364082
lead ^ 2 -0.0213287 0.0210484 -1.01 0.3109 -0.0625829 0.0199255
magnesium ^ 2 -0.0127241 0.0218525 -0.58 0.5604 -0.0555542 0.0301061
manganese ^ 2 0.001592 0.0222575 0.07 0.9430 -0.0420318 0.0452158
mercury ^ 2 0.0190848 0.0207263 0.92 0.3572 -0.0215381 0.0597077
selenium ^ 2 -0.0282434 0.0207192 -1.36 0.1728 -0.0688522 0.0123654
silver ^ 2 0.00027449 0.020829 0.01 0.9895 -0.0405497 0.0410987
sodium ^ 2 -0.0370364 0.0247802 -1.49 0.1350 -0.0856047 0.0115319
zinc ^ 2 0.0176189 0.0207208 0.85 0.3952 -0.022993 0.0582309
────────────────────────────────────────────────────────────────────────────────
MSM (Bootstrap based CI)
Exposure specific weights not estimated in this type of model
─────────────────────────────────────────────────────────────────────────
Coef. Std. Error z Pr(>|z|) Lower 95% Upper 95%
─────────────────────────────────────────────────────────────────────────
(Intercept) -0.31696 0.099879 -3.17 0.0015 -0.512719 -0.121201
mixture 0.240047 0.0655512 3.66 0.0003 0.111569 0.368525
─────────────────────────────────────────────────────────────────────────responseplot(qcboot_fit4, plots=["pointwise", "model"])Note that allowing for a non-linear effect of all exposures induces an apparent non-linear trend in the overall exposure effect. The smoothed regression line is still well within the confidence bands of the marginal linear model (by default, the overall effect of joint exposure is assumed linear, though this assumption can be relaxed via the "degree" parameter in qgcompglmboot or qgcompglmee, as follows:
qcboot_fit5 = qgcomp_glm_boot(Xoshiro(122),ffsq, metals, Xnm, 4, Normal(), msmformula = @formula(y~mixture+mixture^2)) # directly specify MSM formula
responseplot(qcboot_fit5)qcee_fit5b = qgcomp_glm_ee(ffsq, metals, Xnm, 4, Normal(), msmformula = @formula(y~mixture+mixture^2)) #Using estimating equations
responseplot(qcee_fit5b)Note that some features are not availble to qgcomp_*_ee methods, which use estimating equations, rather than maximum likelihood methods. Briefly, these allow assessment of uncertainty under n-lin (and other) scenarios where the qgcomp_*_noboot functions cannot, since they rely on the additivity and linearity assumptions to achieve speed. The qgcomp_*_ee methods will generally be faster than a bootstrapped version, but they are not used extensively here because they are the newest additions to the qgcomp package, and the bootstrapped versions can be made fast (but not accurate) by reducing the number of bootstraps. Where available, the qgcomp_*_ee will be preferred to the qgcomp_*_boot versions for more stable and faster analyses when bootstrapping would otherwise be necessary.
Once again, we can access numerical estimates of uncertainty (answers differ between the qgcomp_*_boot and qgcomp_*_ee fits due to the small number of bootstrap samples):
# not yet implemented
printbounds(bounds(qcboot_fit5))Pointwise bounds
4×6 DataFrame
Row │ mixture linpred diff ll_diff ul_diff se_diff
│ Float64 Float64 Float64 Float64 Float64 Float64
─────┼──────────────────────────────────────────────────────────────
1 │ 0.0 -0.376056 0.0 0.0 0.0 0.0
2 │ 1.0 -0.0178171 0.358239 0.0194254 0.697052 0.172867
3 │ 2.0 0.22223 0.598286 0.180538 1.01603 0.213141
4 │ 3.0 0.344086 0.720141 0.300055 1.14023 0.214334
Modelwise bounds
4×4 DataFrame
Row │ mixture linpred ll_simul ul_simul
│ Float64 Float64 Float64 Float64
─────┼─────────────────────────────────────────────
1 │ 0.0 -0.376056 -0.662441 0.00779986
2 │ 1.0 -0.0178171 -0.252487 0.244904
3 │ 2.0 0.22223 0.0256724 0.461402
4 │ 3.0 0.344086 -0.0834874 0.783709printbounds(bounds(qcee_fit5b))Pointwise bounds
4×6 DataFrame
Row │ mixture linpred diff ll_diff ul_diff se_diff
│ Float64 Float64 Float64 Float64 Float64 Float64
─────┼─────────────────────────────────────────────────────────────
1 │ 0.0 -0.376056 0.0 0.0 0.0 0.0
2 │ 1.0 -0.0178171 0.358239 0.02908 0.687397 0.167941
3 │ 2.0 0.22223 0.598286 0.192297 1.00427 0.207141
4 │ 3.0 0.344086 0.720141 0.318168 1.12211 0.205092
Modelwise bounds
4×4 DataFrame
Row │ mixture linpred ll_simul ul_simul
│ Float64 Float64 Float64 Float64
─────┼───────────────────────────────────────────
1 │ 0.0 -0.376056 -0.882322 0.0608276
2 │ 1.0 -0.0178171 -0.337988 0.275058
3 │ 2.0 0.22223 -0.066583 0.565705
4 │ 3.0 0.344086 -0.166319 0.825812Ideally, the smooth fit will look very similar to the model prediction regression line.
Interpretation of model parameters
As the output below shows, setting "degree=2" yields a second parameter in the model fit ($\psi_2$ or mixture$^2$). The output of qgcomp now corresponds to estimates of the marginal structural model given by
\[\mathbb{E}\left(Y^{\mathbf{X}_q}\right) = g(\psi_0 + \psi_1 S_q + \psi_2 S_q^2)\]
println(qcboot_fit5)Underlying fit (Model-based CI)
────────────────────────────────────────────────────────────────────────────────
Coef. Std. Error z Pr(>|z|) Lower 95% Upper 95%
────────────────────────────────────────────────────────────────────────────────
(Intercept) -0.376056 0.136222 -2.76 0.0058 -0.643045 -0.109066
arsenic 0.030428 0.065257 0.47 0.6410 -0.0974733 0.158329
barium -0.0198754 0.0658611 -0.30 0.7628 -0.148961 0.10921
cadmium 0.0911165 0.0640401 1.42 0.1548 -0.0343997 0.216633
calcium 0.145071 0.0855895 1.69 0.0901 -0.0226818 0.312823
chromium 0.0957044 0.0644359 1.49 0.1375 -0.0305877 0.221996
copper -0.102326 0.065496 -1.56 0.1182 -0.230695 0.0260441
iron 0.0470454 0.0692798 0.68 0.4971 -0.0887405 0.182831
lead 0.0545809 0.0653268 0.84 0.4034 -0.0734572 0.182619
magnesium -0.0198047 0.068335 -0.29 0.7720 -0.153739 0.114129
manganese -0.00827842 0.0657335 -0.13 0.8998 -0.137114 0.120557
mercury -0.0495722 0.0652753 -0.76 0.4476 -0.177509 0.0783651
selenium 0.0843153 0.064828 1.30 0.1934 -0.0427453 0.211376
silver 0.00961506 0.0651674 0.15 0.8827 -0.118111 0.137341
sodium 0.112357 0.074819 1.50 0.1332 -0.0342858 0.258999
zinc -0.0530423 0.0646022 -0.82 0.4116 -0.17966 0.0735756
arsenic ^ 2 -0.00548169 0.0210363 -0.26 0.7944 -0.0467121 0.0357487
barium ^ 2 0.0128003 0.0211339 0.61 0.5447 -0.0286214 0.054222
cadmium ^ 2 -0.02871 0.0204728 -1.40 0.1608 -0.0688359 0.0114158
calcium ^ 2 0.0415287 0.0261331 1.59 0.1120 -0.00969135 0.0927487
chromium ^ 2 -0.0290913 0.0206698 -1.41 0.1593 -0.0696034 0.0114208
copper ^ 2 0.0190171 0.0210433 0.90 0.3661 -0.0222269 0.0602611
iron ^ 2 -0.00839646 0.0228599 -0.37 0.7134 -0.0532011 0.0364082
lead ^ 2 -0.0213287 0.0210484 -1.01 0.3109 -0.0625829 0.0199255
magnesium ^ 2 -0.0127241 0.0218525 -0.58 0.5604 -0.0555542 0.0301061
manganese ^ 2 0.001592 0.0222575 0.07 0.9430 -0.0420318 0.0452158
mercury ^ 2 0.0190848 0.0207263 0.92 0.3572 -0.0215381 0.0597077
selenium ^ 2 -0.0282434 0.0207192 -1.36 0.1728 -0.0688522 0.0123654
silver ^ 2 0.00027449 0.020829 0.01 0.9895 -0.0405497 0.0410987
sodium ^ 2 -0.0370364 0.0247802 -1.49 0.1350 -0.0856047 0.0115319
zinc ^ 2 0.0176189 0.0207208 0.85 0.3952 -0.022993 0.0582309
────────────────────────────────────────────────────────────────────────────────
MSM (Bootstrap based CI)
Exposure specific weights not estimated in this type of model
────────────────────────────────────────────────────────────────────────────
Coef. Std. Error z Pr(>|z|) Lower 95% Upper 95%
────────────────────────────────────────────────────────────────────────────
(Intercept) -0.376056 0.133002 -2.83 0.0047 -0.636736 -0.115376
mixture 0.417334 0.246336 1.69 0.0902 -0.0654763 0.900145
mixture ^ 2 -0.0590957 0.0783354 -0.75 0.4506 -0.21263 0.0944388
────────────────────────────────────────────────────────────────────────────so that $\psi_2$ can be interpreted similar to quadratic terms that might appear in a generalized linear model. $\psi_2$ estimates the change in the outcome for an additional unit of squared joint exposure, over-and-above the linear effect given by $\psi_1$. Informally, this is a way of assessing specific types of non-linearity in the joint exposure-response curves, and there are many other (slightly incorrect but intuitively useful) ways of interpreting parameters for squared terms in regressions (beyond the scope of this document). Intuition from generalized linear models (i.e. regarding interpretation of coefficients) applies directly to the models fit by quantile g-computation.
Example 5: comparing model fits and further exploring non-linearity
Exploring a non-linear fit in settings with multiple exposures is challenging. One way to explore non-linearity, as demonstrated above, is to to include all 2-way interaction terms (including quadratic terms, or "self-interactions"). Sometimes this approach is not desired, either because the number of terms in the model can become very large, or because some sort of model selection procedure is required, which risks inducing over-fit (biased estimates and standard errors that are too small). Short of having a set of a priori non-linear terms to include, we find it best to take a default approach (e.g. taking all second order terms) that doesn't rely on statistical significance, or to simply be honest that the search for a non-linear model is exploratory and shouldn't be relied upon for robust inference. Methods such as kernel machine regression may be good alternatives, or supplementary approaches to exploring non-linearity.
NOTE: qgcomp necessarily fits a regression model with exposures that have a small number of possible values, based on the quantile chosen. By package default, this is q=4, but it is difficult to fully examine non-linear fits using only four points, so we recommend exploring larger values of q, which will change effect estimates (i.e. the model coefficient implies a smaller change in exposures, so the expected change in the outcome will also decrease).
Here, we examine a one strategy for default and exploratory approaches to mixtures that can be implemented in qgcomp using a smaller subset of exposures (iron, lead, cadmium), which we choose via the correlation matrix. High correlations between exposures may result from a common source, so small subsets of the mixture may be useful for examining hypotheses that relate to interventions on a common environmental source or set of behaviors. We can still adjust for the measured exposures, even though only 3 our exposures of interest are considered as the mixture of interest. This next example will require a new R package to help in exploring non-linearity: splines. Note that qgcomp_glm_boot must be used in order to produce the graphics below, as qgcomp_glm_noboot does not calculate the necessary quantities.
Graphical approach to explore non-linearity in a correlated subset of exposures using splines
#library(splines)
# find all correlations > 0.6 (this is an arbitrary choice)
#cormat = cor(metals[:,Xnm])
#idx = which(cormat>0.6 & cormat <1.0, arr.ind = TRUE)
#newXnm = unique(rownames(idx)) # iron, lead, and cadmium
newXnm = [:iron, :lead, :cadmium]
qc_fit6lin = qgcomp_glm_boot(Xoshiro(122),@formula(y ~ iron + lead + cadmium +
mage35 + arsenic + magnesium + manganese + mercury +
selenium + silver + sodium + zinc),
metals, newXnm,
8, Normal(), B=100)
#=
qc_fit6nonlin = qgcomp_glm_boot(Xoshiro(122),y ~ bs(iron) + bs(cadmium) + bs(lead) +
mage35 + arsenic + magnesium + manganese + mercury +
selenium + silver + sodium + zinc,
expnms=newXnm,
metals, family=gaussian(), q=8, B=10, degree=2)
qc_fit6nonhom = qgcomp_glm_boot(Xoshiro(122),y ~ bs(iron)*bs(lead) + bs(iron)*bs(cadmium) + bs(lead)*bs(cadmium) +
mage35 + arsenic + magnesium + manganese + mercury +
selenium + silver + sodium + zinc,
expnms=newXnm,
metals, family=gaussian(), q=8, B=10, degree=3)
=#Underlying fit (Model-based CI)
──────────────────────────────────────────────────────────────────────────────
Coef. Std. Error z Pr(>|z|) Lower 95% Upper 95%
──────────────────────────────────────────────────────────────────────────────
(Intercept) -0.11903 0.0656342 -1.81 0.0698 -0.24767 0.00961124
iron 0.061136 0.0109759 5.57 <1e-07 0.0396237 0.0826482
lead -0.0114455 0.0104534 -1.09 0.2736 -0.0319338 0.00904289
cadmium -0.00834917 0.0108388 -0.77 0.4411 -0.0295929 0.0128945
mage35 0.0881685 0.072692 1.21 0.2252 -0.0543052 0.230642
arsenic 0.0241288 0.0260853 0.92 0.3550 -0.0269974 0.075255
magnesium -0.0151559 0.0243148 -0.62 0.5331 -0.0628121 0.0325002
manganese 0.00980085 0.0249445 0.39 0.6944 -0.0390895 0.0586912
mercury 0.0101796 0.0244803 0.42 0.6775 -0.0378008 0.0581601
selenium -0.0400341 0.0570126 -0.70 0.4826 -0.151777 0.0717086
silver 0.0258944 0.0240887 1.07 0.2824 -0.0213187 0.0731074
sodium -0.0664921 0.0239597 -2.78 0.0055 -0.113452 -0.0195321
zinc 0.0223891 0.0238369 0.94 0.3476 -0.0243304 0.0691086
──────────────────────────────────────────────────────────────────────────────
MSM (Bootstrap based CI)
Exposure specific weights not estimated in this type of model
────────────────────────────────────────────────────────────────────────────
Coef. Std. Error z Pr(>|z|) Lower 95% Upper 95%
────────────────────────────────────────────────────────────────────────────
(Intercept) -0.103926 0.0682166 -1.52 0.1276 -0.237628 0.0297762
mixture 0.0413413 0.0184607 2.24 0.0251 0.00515897 0.0775236
────────────────────────────────────────────────────────────────────────────
It helps to place the plots on a common y-axis, which is easy due to dependence of the qgcomp plotting functions on ggplot. Here"s the linear fit :
pl_fit6lin = responseplot(qc_fit6lin, referentindex = 4);
plot!(pl_fit6lin, ylim=(-0.75, .75))Here's the non-linear fit :
pl_fit6nonlin = plot(qc_fit6nonlin, suppressprint = TRUE, referentindex = 4)
pl_fit6nonlin + coord_cartesian(ylim=c(-0.75, .75)) +
ggtitle("Non-linear fit: mixture of iron, lead, and cadmium")And here's the non-linear fit with statistical interaction between exposures (recalling that this will lead to non-linearity in the overall effect):
pl_fit6nonhom = plot(qc_fit6nonhom, suppressprint = TRUE, referentindex = 4)
pl_fit6nonhom + coord_cartesian(ylim=c(-0.75, .75)) +
ggtitle("Non-linear, non-homogeneous fit: mixture of iron, lead, and cadmium")Caution about graphical approaches
The underlying conditional model fit can be made extremely flexible, and the graphical representation of this (via the smooth conditional fit) can look extremely flexible. Simply matching the overall (MSM) fit to this line is not a viable strategy for identifying parsimonious models because that would ignore potential for overfit. Thus, caution should be used when judging the accuracy of a fit when comparing the "smooth conditional fit" to the "MSM fit."
println("This is a placeholder for an example with spline terms")
#qc_overfit = qgcomp_glm_boot(Xoshiro(122),y ~ bs(iron) + bs(cadmium) + bs(lead) +
# mage35 + bs(arsenic) + bs(magnesium) + bs(manganese) #+ bs(mercury) +
# bs(selenium) + bs(silver) + bs(sodium) + bs(zinc),
# expnms=Xnm,
# metals, family=gaussian(), q=8, B=100)
#qc_overfit
#responseplot(qc_overfit, referentindex = 5)This is a placeholder for an example with spline termsHere, there is little statistical evidence for even a linear trend, which makes the smoothed conditional fit appear to be overfit. The smooth conditional fit can be turned off, as below. #@example metals #responseplot(qc_overfit, referentindex = 5, plots=["pointwise", "model"]) #
Example 6: miscellaneous other ways to allow non-linearity
Note that these are included as examples of how to include non-linearities, and are not intended as a demonstration of appropriate model selection. In fact, qc_fit7b is generally a bad idea in small to moderate sample sizes due to large numbers of parameters.
using indicator terms for each quantile
qc_fit7a = qgcomp_glm_boot(Xoshiro(122),@formula(y ~ iron + lead + cadmium +
mage35 + arsenic + magnesium + manganese + mercury +
selenium + silver + sodium + zinc),
metals, newXnm, 8,
Normal(), B=100,
msmformula = @formula(y~mixture + mixture^2),
contrasts = Dict(:iron => DummyCoding()))
# underlying fit
println(qc_fit7a.ulfit)StatsModels.TableRegressionModel{GLM.GeneralizedLinearModel{GLM.GlmResp{Vector{Float64}, Normal{Float64}, IdentityLink}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}
y ~ 1 + iron + lead + cadmium + mage35 + arsenic + magnesium + manganese + mercury + selenium + silver + sodium + zinc
Coefficients:
─────────────────────────────────────────────────────────────────────────────
Coef. Std. Error z Pr(>|z|) Lower 95% Upper 95%
─────────────────────────────────────────────────────────────────────────────
(Intercept) -0.0529811 0.0806243 -0.66 0.5111 -0.211002 0.10504
iron: 1.0 0.0465717 0.0946691 0.49 0.6228 -0.138976 0.23212
iron: 2.0 -0.0569843 0.0958166 -0.59 0.5520 -0.244781 0.130813
iron: 3.0 0.143813 0.0955805 1.50 0.1324 -0.0435213 0.331148
iron: 4.0 0.0533191 0.0964207 0.55 0.5803 -0.135662 0.2423
iron: 5.0 0.303968 0.0974326 3.12 0.0018 0.113003 0.494932
iron: 6.0 0.24626 0.0973439 2.53 0.0114 0.0554691 0.43705
iron: 7.0 0.447046 0.097862 4.57 <1e-05 0.25524 0.638852
lead -0.00921034 0.0104667 -0.88 0.3789 -0.0297247 0.011304
cadmium -0.010503 0.0108644 -0.97 0.3337 -0.0317969 0.0107908
mage35 0.0811147 0.0727458 1.12 0.2648 -0.0614645 0.223694
arsenic 0.0217555 0.0260585 0.83 0.4038 -0.0293182 0.0728292
magnesium -0.0107584 0.0246989 -0.44 0.6631 -0.0591674 0.0376507
manganese 0.00441827 0.0255145 0.17 0.8625 -0.0455892 0.0544257
mercury 0.0039139 0.0244808 0.16 0.8730 -0.0440676 0.0518953
selenium -0.0580853 0.057148 -1.02 0.3094 -0.170093 0.0539228
silver 0.0209716 0.024074 0.87 0.3837 -0.0262126 0.0681557
sodium -0.0620863 0.0240445 -2.58 0.0098 -0.109213 -0.01496
zinc 0.0170784 0.0239238 0.71 0.4753 -0.0298114 0.0639682
─────────────────────────────────────────────────────────────────────────────responseplot(qc_fit7a)interactions between indicator terms
qc_fit7b = qgcomp_glm_boot(Xoshiro(122),@formula(y ~ iron*lead + cadmium +
mage35 + arsenic + magnesium + manganese + mercury +
selenium + silver + sodium + zinc),
metals, newXnm, 8,
Normal(), B=200,
msmformula = @formula(y~mixture + mixture^2),
contrasts = Dict(
:iron => DummyCoding(),
#:lead => DummyCoding()
)
)Underlying fit (Model-based CI)
──────────────────────────────────────────────────────────────────────────────────
Coef. Std. Error z Pr(>|z|) Lower 95% Upper 95%
──────────────────────────────────────────────────────────────────────────────────
(Intercept) -0.044317 0.119742 -0.37 0.7113 -0.279008 0.190374
iron: 1.0 -0.0677779 0.172513 -0.39 0.6944 -0.405897 0.270341
iron: 2.0 -0.0979828 0.176259 -0.56 0.5783 -0.443444 0.247478
iron: 3.0 0.159512 0.187103 0.85 0.3939 -0.207204 0.526228
iron: 4.0 -6.83558e-5 0.177794 -0.00 0.9997 -0.348538 0.348401
iron: 5.0 0.187668 0.166142 1.13 0.2587 -0.137966 0.513301
iron: 6.0 0.308667 0.162845 1.90 0.0580 -0.0105021 0.627837
iron: 7.0 0.557852 0.170882 3.26 0.0011 0.222929 0.892775
lead -0.0139104 0.030755 -0.45 0.6511 -0.0741892 0.0463683
cadmium -0.00857004 0.0110772 -0.77 0.4391 -0.0302809 0.0131408
mage35 0.0734449 0.0735961 1.00 0.3183 -0.0708008 0.217691
arsenic 0.0210758 0.0261915 0.80 0.4210 -0.0302587 0.0724103
magnesium -0.0115834 0.0249359 -0.46 0.6423 -0.0604569 0.0372902
manganese 0.00808792 0.026201 0.31 0.7576 -0.0432652 0.059441
mercury 0.0057199 0.0248216 0.23 0.8177 -0.0429296 0.0543694
selenium -0.0637555 0.0574697 -1.11 0.2673 -0.176394 0.0488831
silver 0.0207556 0.0242572 0.86 0.3922 -0.0267877 0.0682988
sodium -0.0584547 0.0245255 -2.38 0.0172 -0.106524 -0.0103855
zinc 0.0154227 0.0240857 0.64 0.5220 -0.0317845 0.0626299
iron: 1.0 & lead 0.0366519 0.0459872 0.80 0.4254 -0.0534813 0.126785
iron: 2.0 & lead 0.0118357 0.0420999 0.28 0.7786 -0.0706785 0.0943499
iron: 3.0 & lead -0.00298205 0.0446856 -0.07 0.9468 -0.0905641 0.0846001
iron: 4.0 & lead 0.0152142 0.0435494 0.35 0.7268 -0.0701411 0.100569
iron: 5.0 & lead 0.0340209 0.0408946 0.83 0.4055 -0.0461311 0.114173
iron: 6.0 & lead -0.0187322 0.0406756 -0.46 0.6451 -0.0984549 0.0609906
iron: 7.0 & lead -0.033601 0.042793 -0.79 0.4323 -0.117474 0.0502718
──────────────────────────────────────────────────────────────────────────────────
MSM (Bootstrap based CI)
Exposure specific weights not estimated in this type of model
─────────────────────────────────────────────────────────────────────────────
Coef. Std. Error z Pr(>|z|) Lower 95% Upper 95%
─────────────────────────────────────────────────────────────────────────────
(Intercept) -0.0778448 0.0856058 -0.91 0.3632 -0.245629 0.0899395
mixture 0.0107838 0.0356163 0.30 0.7621 -0.0590229 0.0805906
mixture ^ 2 0.00307447 0.00427635 0.72 0.4722 -0.00530703 0.011456
─────────────────────────────────────────────────────────────────────────────
responseplot(qc_fit7b, plots=["p", "m"])breaks at specific quantiles (these breaks act on the quantized basis)
qc_fit7c = qgcomp_glm_boot(Xoshiro(122),@formula(y ~ (iron>4)*(lead>4) + cadmium +
mage35 + arsenic + magnesium + manganese + mercury +
selenium + silver + sodium + zinc),
metals, newXnm, 8,
Normal(), B=100,
msmformula = @formula(y~mixture + mixture^2),
)
# underlying fit
qc_fit7c.ulfitStatsModels.TableRegressionModel{GLM.GeneralizedLinearModel{GLM.GlmResp{Vector{Float64}, Normal{Float64}, IdentityLink}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}
y ~ 1 + :(iron > 4) + :(lead > 4) + cadmium + mage35 + arsenic + magnesium + manganese + mercury + selenium + silver + sodium + zinc + :(iron > 4) & :(lead > 4)
Coefficients:
──────────────────────────────────────────────────────────────────────────────────────
Coef. Std. Error z Pr(>|z|) Lower 95% Upper 95%
──────────────────────────────────────────────────────────────────────────────────────
(Intercept) -0.0591011 0.0518238 -1.14 0.2541 -0.160674 0.0424717
iron > 4 0.364994 0.0644886 5.66 <1e-07 0.238599 0.491389
lead > 4 -9.00407e-5 0.0618159 -0.00 0.9988 -0.121247 0.121067
cadmium -0.00687475 0.0107834 -0.64 0.5238 -0.0280098 0.0142603
mage35 0.0761367 0.0725511 1.05 0.2940 -0.0660608 0.218334
arsenic 0.0204237 0.02578 0.79 0.4282 -0.0301042 0.0709516
magnesium -0.00327998 0.0242751 -0.14 0.8925 -0.0508584 0.0442984
manganese 0.0105598 0.0247745 0.43 0.6699 -0.0379974 0.059117
mercury 0.0093969 0.0243506 0.39 0.6996 -0.0383294 0.0571231
selenium -0.0433773 0.0567001 -0.77 0.4443 -0.154507 0.0677528
silver 0.0180725 0.0239111 0.76 0.4498 -0.0287925 0.0649374
sodium -0.0553797 0.0240381 -2.30 0.0212 -0.102493 -0.00826591
zinc 0.0234991 0.0238576 0.98 0.3246 -0.023261 0.0702591
iron > 4 & lead > 4 -0.182884 0.102778 -1.78 0.0752 -0.384325 0.0185575
──────────────────────────────────────────────────────────────────────────────────────responseplot(qc_fit7c)Note one restriction on exploring non-linearity: while we can use flexible functions such as splines for individual exposures, the overall fit is limited via the degree parameter to polynomial functions (here a quadratic polynomial fits the non-linear model well, and a cubic polynomial fits the non-linear/non-homogeneous model well - though this is an informal argument and does not account for the wide confidence intervals). We note here that only 10 bootstrap iterations are used to calculate confidence intervals (to increase computational speed for the example), which is far too low.
Statistical approach explore non-linearity in a correlated subset of exposures using splines
The graphical approaches don't give a clear picture of which model might be preferred, but we can compare the model fits using AIC, or BIC (information criterion that weigh model fit with over-parameterization). Both of these criterion suggest the linear model fits best (lowest AIC and BIC), which suggests that the apparently non-linear fits observed in the graphical approaches don't improve prediction of the health outcome, relative to the linear fit, due to the increase in variance associated with including more parameters.
println(aic(qc_fit6lin))
#aic(qc_fit6nonlin) |> display
#aic(qc_fit6nonhom) |> display676.0430636979562println(bic(qc_fit6lin))
#bic(qc_fit6nonlin) |> display
#bic(qc_fit6nonhom) |> display733.6346142156075More examples on advanced topics can be viewed in the other package vignette.
FAQ
Why don't I get weights/scaled effects from the boot or ee functions? (and other questions about the weights/scaled effect sizes)
Users often use the qgcomp_*_boot or qgcomp_*_ee functions because they want to marginalize over confounders or fit a non-linear joint exposure function. In both cases, the overall exposure response will no longer correspond to a simple weighted average of model coefficients, so none of the qgcomp_*_boot or qgcomp_*_ee functions will calculate weights. In most use cases, the weights would vary according to which level of joint exposure you"re at, so it is not a straightforward proposition to calculate them (and you may not wish to report 4 sets of weights if you use the default q=4). That is, the contribution of each exposure to the overall effect will change across levels of exposure if there is any non-linearity, which makes the weights not useful as simple inferential tools (and, at best, an approximation). If you wish to have an approximation, then use a "noboot" method and report the weights from that along with a caveat that they are not directly applicable to the results in which non-linearity/non-additivity/marginalization-over-covariates is performed (using boot methods). If you fit an otherwise linear model, you can get weights from a qgcomp_*_noboot which will be very close to the weights you might get from a linear model fit via qgcomp_*_boot functions, but be explicit that the weights come from a different model than the inference about joint exposure effects.
It should be emphasized here that the weights are not a stable or entirely useful quantity for many research goals. Qgcomp addresses the mixtures problem of variance inflation by focusing on a parameter that is less susceptible to variance inflation than independent effects (the psi parameters, or overall effects of a mixture). The weights are a form of independent effect and will always be sensitive to this issue, regardless of the statistical method that is used. Some statistical approaches to improving estimation of independent effects (e.g. setting bayes=TRUE, R package only) is accessible in many qgcomp functions, but these approaches universally introduce bias in exchange for reducing variance and shouldn't be used without a good understanding of what shrinkage and penalization methods actually accomplish. Principled and rigorous integration of these statistical approaches with qgcomp is in progress, but that work is inherently more bespoke and likely will not be available in this R package. The shrinkage and penalization literature is large and outside the scope of this software and documentation, so no other guidance is given here. In any case, the calculated weights are only interpretable as proportional effect sizes in a setting in which there is linearity, additivity, and collapsibility, and so the package makes no efforts to try to introduce weights into other settings in which those assumptions may not be met. Outside of those narrow settings, the weights would have a dubious interpretation and the programming underlying the qgcomp package errs on the side of preventing the reporting of results that are mutually inconsistent. If you are using the boot versions of a qgcomp function in a setting in which you know that the weights are valid, it is very likely that you do not actually need to be using the boot versions of the functions.
Do I need to model non-linearity and non-additivity of exposures?
Maybe. The inferential object of qgcomp is the set of $\psi$ parameters that correspond to a joint exposure response. As it turns out, with correlated exposures non-linearity can disguise itself as non-additivity (Belzak and Bauer [2019] Addictive Behaviors). If we were inferring independent effects, this distinction would be crucial, but for joint effects it may turn out that it doesn't matter much if you model non-linearity in the joint response function through non-additivity or non-linearity of individual exposures in a given study. Models fit in qgcomp still make the crucial assumption that you are able to model the joint exposure response via parametric models, so that assumption should not be forgotten in an effort to try to disentagle non-linearity (e.g. quadratic terms of exposures) from non-additivity (e.g. product terms between exposures). The important part to note about parametric modeling is that we have to explicitly tell the model to be non-linear, and no adaptation to non-linear settings will happen automatically. Exploring non-linearity is not a trivial endeavor.
Do I have to use quantiles?
No. You can turn off "quantization" by setting q=nothing or you can supply your own categorization cutpoints via the "breaks" argument. It is up to the user to interpret the results if either of these options is taken. Frequently, q=nothing is used in concert with standardizing exposure variables by dividing them by their interquartile ranges (IQR). The joint exposure response can then be interpreted as the effect of an IQR change in all exposures. Using IQR/2 (with or without a log transformation before hand) will yield results that are most (roughly) compatible with the package defaults (q=4) but that does not require quantization. Quantized variables have nice properties: they prevent extrapolation and reduce influence of outliers, but the choice of how to include exposures in the model should be a deliberate and well-informed one. There are examples of setting q=nothing in the help files for qgcompglmboot and qgcompglmee, but this approach is available for any of the qgcomp methods (and is accomplished nearly 100% outside of the package functions, aside from setting q=nothing).
Can I cite this document?
Probably not in a scientific manuscript. If you find an idea here that is not published anywhere else and wish to develop it into a full manuscript, feel free! (But probably check with alex.keil@nih.gov to ask if a paper is already in development or is, perhaps, already published.)
Where else can I get help?
The vignettes of the package and the help files of the functions give many, many examples of usage. Additionally, some edge case or interesting applicationsare available in the form of github gists at https://gist.github.com/alexpkeil1. If you come up with an interesting problem that you think could be solved in this package, but is currently not, feel free to submit an issue on the R package github page https://github.com/alexpkeil1/qgcomp/issues or J package github page https://github.com/alexpkeil1/Qgcomp.jl/issues. Several additions to the R package have already come about through that avenue (though not always quickly).