Cohort Causal Graph

The following cohort causal graphs (CCGs) are based on 2- to 16-year-old European children and adolescents from the IDEFICS/I.family cohort. The data set contains N = 5,112 children born between 1997 and 2006 who participated in all three waves of the study.

We used the temporal order of the variables as prior knowledge for the analysis by distributing the variables into different tiers. The analysis data set consisted of 51 variables that were distributed over five tiers:

  1. Context variables (social and cultural background)
  2. Early life factors
  3. Baseline variables (B, the first cohort survey)
  4. First follow-up (FU1, two years after baseline)
  5. Second follow-up (FU2, six years after baseline)

We assume that variables in a lower-numbered tier can affect variables in higher-numbered tiers, but not vice versa. In addition, we forbid edges between some variable pairs, such as edges pointing to age and edges from individual child variables (e.g. physical activity) to ISCED or parental income.
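The tier ordering and the additional forbidden edges can be written down as a tier vector and a logical matrix. The following is only a minimal sketch in base R: the variable names are hypothetical stand-ins for a handful of the 51 study variables, and it assumes the convention that a TRUE entry in row i and column j forbids the directed edge from i to j (check the tpc documentation for the exact convention expected by forbEdges).

```r
# Toy variable set (hypothetical names, not the actual study variables)
vars  <- c("sex", "isced_B", "age_B", "pa_B", "zbmi_B", "pa_FU1", "zbmi_FU1")
tiers <- c(1, 3, 3, 3, 3, 4, 4)   # tier number of each variable
names(tiers) <- vars

# Logical matrix of forbidden edges
# Assumption: fg[i, j] = TRUE means the directed edge i -> j is forbidden
fg <- matrix(FALSE, nrow = length(vars), ncol = length(vars),
             dimnames = list(vars, vars))

# No edges pointing to age
fg[, "age_B"] <- TRUE
# Individual child variables must not point to parental education
fg[c("pa_B", "zbmi_B", "pa_FU1", "zbmi_FU1"), "isced_B"] <- TRUE
diag(fg) <- FALSE

# Implied by the tiers: i -> j is excluded whenever i lies in a later tier than j
later_to_earlier <- outer(tiers, tiers, ">")
```

In this sketch the tier restriction and the pairwise restrictions are kept separate, which mirrors how they enter the analysis as two different arguments (tiers and forbEdges).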

All CCGs of childhood obesity were estimated with the temporal PC algorithm (tPC) on multiply imputed data sets using the R packages tpc and micd. The tpc package makes use of prior knowledge about the temporal order of the cohort data, and the micd package makes it possible to run the PC algorithm on multiply imputed data sets with mixed variable scales. PC and tPC are both constraint-based structure learning algorithms, and both tpc and micd rely on the PC algorithm as implemented in pcalg.

| Tier | Variable/Node | Unit | Comments |
|---|---|---|---|
| Context | Sex | female/male | Sex of child |
| Context | Region | North/Central/South | Place of residence |
| Context | Migrant | no/yes | Children were assumed to have a migrant background if they usually speak with their parents in a language other than the national language of the corresponding country |
| Early life | Mother's age at birth | years | |
| Early life | Total breastfeeding | months | Including breastfeeding combinations before the child's diet was fully integrated into the usual household diet |
| Early life | Birthweight | grams | |
| Early life | Weeks of pregnancy | weeks | |
| Early life | Formula milk | no/yes | Type of feeding before the child's diet was fully integrated into the usual household diet |
| Early life | HH diet | months | Month in which the child was introduced to the household's diet |
| Early life | Smoking during pregnancy | no/yes | Mother consumed tobacco during pregnancy |
| Context: B, FU1, FU2 | Age | months | |
| Context: B, FU1, FU2 | School | kindergarten/school/neither | |
| Context: B, FU1, FU2 | Income | low/middle/high | Country-specific parental income |
| Context: B, FU1, FU2 | ISCED | low/middle/high | International Standard Classification of Education, highest parental education |
| B, FU1, FU2 | AVM | h/day | Audio-visual media consumption |
| B, FU1, FU2 | zBMI | z-score | Body mass index |
| B, FU1, FU2 | Mother's BMI | kg/m^2 | Body mass index of the child's mother |
| B, FU1, FU2 | Daily family meals | no/yes | |
| B, FU1, FU2 | PA | h/day | Physical activity measured by questionnaire |
| B, FU1, FU2 | Sleep | h/day | Total sleep |
| B, FU1, FU2 | Well-being | % | Sum score based on the KINDL-R quality of life questionnaire |
| B, FU1, FU2 | YHEI | % | Youth healthy eating score |
| B, FU1, FU2 | HOMA | z-score | Homeostatic model assessment |
| FU2 | Alcohol | no/yes | Ever drank alcohol in the teen's lifetime |
| FU2 | Puberty | pre- or early pubertal/pubertal | Pubertal status |
| FU2 | Smoking | no/yes | Ever smoked tobacco in the teen's lifetime |
library(tpc)
library(micd)
library(parallel)  # provides detectCores()

## sufficient statistic computed from the multiply imputed data
suff.all <- getSuff(my.mids.data, test = "flexMItest")

## CCG
graph <- tpc(suffStat    = suff.all,
             indepTest   = flexMItest,
             skel.method = "stable.parallel",
             labels      = V.pa,
             alpha       = 0.05,
             tiers       = c(rep(1, 3), rep(2, 7), rep(3, 13), rep(4, 13), rep(5, 15)),
             forbEdges   = fg, # a matrix of size
                               # ncol(my.mids.data$data) x ncol(my.mids.data$data)
             numCores    = detectCores() - 1)

Fig.1 Missing values were imputed ten times by multiple imputation using chained equations with the mice R package. Random forests served as the imputation method. Graph discovery used a nominal level of 0.05. Click on nodes to shift them in the graph.

Note: Nodes are coloured according to when they appear in the life course. Edges without arrowheads could not be oriented by the algorithm.

| Graph characteristics | Main graph |
|---|---|
| Number of selected edges | 104 |
| Number of undirected edges | 12 |
| Avg. number of outgoing edges | 2.4 |
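Characteristics of this kind can be read off an adjacency matrix. The following base R sketch uses a toy graph and makes two assumptions that are not stated in the text: amat[i, j] = 1 encodes an edge from i to j (an undirected edge is marked in both directions), and an undirected edge counts as outgoing from both of its endpoints. The function name graph_characteristics is hypothetical.

```r
# Assumed convention: amat[i, j] != 0 means an edge i -> j;
# an undirected edge i - j has non-zero entries in both [i, j] and [j, i].
graph_characteristics <- function(amat) {
  adj      <- amat != 0
  both     <- adj & t(adj)   # marked in both directions: undirected
  skeleton <- adj | t(adj)   # any edge between the pair
  c(edges        = sum(skeleton[upper.tri(skeleton)]),
    undirected   = sum(both[upper.tri(both)]),
    avg_outgoing = sum(adj) / nrow(adj))
}

# Toy graph: a -> b, b - c (undirected), a -> d
nodes <- c("a", "b", "c", "d")
amat  <- matrix(0, 4, 4, dimnames = list(nodes, nodes))
amat["a", "b"] <- 1
amat["b", "c"] <- 1
amat["c", "b"] <- 1
amat["a", "d"] <- 1

stats <- graph_characteristics(amat)
stats
```

For the toy graph this gives 3 edges, 1 undirected edge, and an average of 1 outgoing edge per node.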

Sensitivity analysis

CCG based on MI with \(\alpha = 0.1\)

g.alpha <- tpc(suffStat    = suff.all,
               indepTest   = flexMItest,
               skel.method = "stable.parallel",
               labels      = V.pa,
               alpha       = 0.1,
               tiers       = c(rep(1, 3), rep(2, 7), rep(3, 13), rep(4, 13), rep(5, 15)),
               forbEdges   = fg,
               numCores    = detectCores() - 1)

Fig.2 CCG based on the same multiply imputed data as in Fig. 1, but graph discovery used a nominal level of 0.1.

| Graph characteristics | Main graph | MI, α = 0.1 |
|---|---|---|
| Number of selected edges | 104 | 113 |
| Number of undirected edges | 12 | 13 |
| Avg. number of outgoing edges | 2.4 | 2.5 |
| Hamming distance | - | 19 |
| Structural Hamming distance | - | 34 |
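To make the two distance measures concrete, here is a small base R sketch under commonly used definitions, which are assumptions here and not necessarily the exact definitions behind the reported numbers: the Hamming distance counts node pairs that are adjacent in one graph but not in the other, and the structural Hamming distance additionally counts pairs whose edge differs in orientation. The function names are hypothetical, and the adjacency convention (amat[i, j] = 1 for i -> j, both directions for an undirected edge) is assumed. The pcalg package also provides an shd() function for graph objects.

```r
# Hamming distance: number of node pairs adjacent in exactly one of the graphs
hamming_distance <- function(a, b) {
  skel_a <- (a != 0) | t(a != 0)
  skel_b <- (b != 0) | t(b != 0)
  sum((skel_a != skel_b)[upper.tri(skel_a)])
}

# Structural Hamming distance: number of node pairs whose edge differs
# in presence or in orientation
structural_hamming_distance <- function(a, b) {
  diff_entry <- (a != 0) != (b != 0)
  diff_pair  <- diff_entry | t(diff_entry)
  sum(diff_pair[upper.tri(diff_pair)])
}

nodes <- c("a", "b", "c", "d")
g1 <- matrix(0, 4, 4, dimnames = list(nodes, nodes))
g2 <- g1

# g1: a -> b, b - c
g1["a", "b"] <- 1
g1["b", "c"] <- 1
g1["c", "b"] <- 1

# g2: b -> a (reversed), b - c, c -> d (additional edge)
g2["b", "a"] <- 1
g2["b", "c"] <- 1
g2["c", "b"] <- 1
g2["c", "d"] <- 1

hd  <- hamming_distance(g1, g2)             # 1: only the c-d adjacency differs
shd <- structural_hamming_distance(g1, g2)  # 2: reversed a-b edge plus added c-d edge
```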

CCG based on test-wise deletion

g.twd <- tpc(suffStat = data.with.missing.values,
            indepTest = flexCItwd,
          skel.method = "stable.parallel",
                alpha = 0.05,
            forbEdges = fg,
               labels = colnames(fg),
                tiers = c(rep(1, 3), rep(2, 7), rep(3, 13), rep(4, 13), rep(5, 15)),
             numCores = detectCores()-1)

Fig.2 CCG based on test-wise deletion. Each conditional independence test between two variables given a set of variables was computed using all observations that are complete on these variables.
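The idea of test-wise deletion can be illustrated in base R with simulated data. This sketch only shows which rows would enter a single test. It is not the implementation used by flexCItwd, and the helper twd_rows is a hypothetical name.

```r
set.seed(1)
n   <- 200
dat <- data.frame(x = rnorm(n), y = rnorm(n), z = rnorm(n), w = rnorm(n))
dat$w[1:80]  <- NA   # many missing values in a variable not involved in the test
dat$z[81:90] <- NA   # a few missing values in the conditioning variable

# Rows that are complete on the variables of one test: x, y and conditioning set S
twd_rows <- function(data, x, y, S = character(0)) {
  complete.cases(data[, c(x, y, S), drop = FALSE])
}

used <- twd_rows(dat, "x", "y", "z")
sum(used)                  # 190 rows enter the test of x and y given z
sum(complete.cases(dat))   # 110 rows would remain under list-wise deletion
```

Because the set of usable rows is determined separately for each test, different tests in the same run can be based on different subsets of the data.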

| Graph characteristics | Main graph | MI, α = 0.1 | TWD |
|---|---|---|---|
| Number of selected edges | 104 | 113 | 138 |
| Number of undirected edges | 12 | 13 | 5 |
| Avg. number of outgoing edges | 2.4 | 2.5 | 2.8 |
| Hamming distance | - | 19 | 96 |
| Structural Hamming distance | - | 34 | 110 |

Structural EM algorithm

library(bnlearn)
sem <- structural.em(data.with.missing.values,
                     maximize = "hc",
                     maximize.args = list(blacklist = bl))
                    # bl is a matrix of forbidden directed edges of dimension
                    # "number of forbidden arrows" x 2 (columns: from, to)

Fig.5 The DAG was estimated using the structural EM algorithm with the score-based hill-climbing algorithm in the maximization step.
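A two-column blacklist of this shape can be derived from a logical matrix of forbidden edges. The sketch below is one possible conversion in base R, not the code used for the analysis. It uses a toy matrix with hypothetical variable names and assumes that fg[i, j] = TRUE forbids the directed edge from i to j.

```r
# Toy forbidden-edge matrix (hypothetical variables)
# Assumption: fg[i, j] = TRUE means the directed edge i -> j is forbidden
vars <- c("age", "pa", "isced")
fg   <- matrix(FALSE, 3, 3, dimnames = list(vars, vars))
fg["pa", "age"]    <- TRUE
fg["isced", "age"] <- TRUE
fg["pa", "isced"]  <- TRUE

# One row per forbidden arrow, with the columns "from" and "to"
idx <- which(fg, arr.ind = TRUE)
bl  <- cbind(from = rownames(fg)[idx[, "row"]],
             to   = colnames(fg)[idx[, "col"]])
bl
```

The resulting character matrix has one row per forbidden arrow, which is the "number of forbidden arrows" x 2 layout described in the comment of the call above.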

| Graph characteristics | Main graph | MI, α = 0.1 | TWD | SEM |
|---|---|---|---|---|
| Number of selected edges | 104 | 113 | 138 | 157 |
| Number of undirected edges | 12 | 13 | 5 | 0 |
| Avg. number of outgoing edges | 2.4 | 2.5 | 2.8 | 3.1 |
| Hamming distance | - | 19 | 96 | 117 |
| Structural Hamming distance | - | 34 | 110 | 131 |

Bootstrap CCGs

For each bootstrap sample, the data were imputed once using mice with random forest imputation. The following CCGs are based on 100 bootstrap replications.

Fig.3 Bootstrapped CCG containing edges with a frequency of at least 44 %

Fig.4 Bootstrapped CCG containing edges with a frequency of at least 75 %
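The aggregation step behind such bootstrapped graphs can be sketched in base R. In this illustration the graph estimated on each bootstrap sample is replaced by a randomly generated adjacency matrix with hypothetical edge probabilities, so only the frequency computation and the thresholding at 44 % and 75 % are shown, not the resampling, imputation, or tPC estimation.

```r
set.seed(42)
nodes <- c("a", "b", "c", "d")
B     <- 100   # number of bootstrap replications

# Hypothetical probabilities with which each directed edge appears in a replication
p_edge <- matrix(0, 4, 4, dimnames = list(nodes, nodes))
p_edge["a", "b"] <- 0.9
p_edge["b", "c"] <- 0.6
p_edge["c", "d"] <- 0.2

# Stand-in for "estimate a graph on one bootstrap sample"
boot_graphs <- replicate(
  B,
  matrix(rbinom(16, 1, p_edge), 4, 4, dimnames = list(nodes, nodes)),
  simplify = FALSE
)

# Relative frequency of each edge across the replications
edge_freq <- Reduce(`+`, boot_graphs) / B

# Keep edges that appear in at least 44 % or at least 75 % of the replications
bg44 <- edge_freq >= 0.44
bg75 <- edge_freq >= 0.75
```

By construction every edge in the 75 % graph is also contained in the 44 % graph, which is consistent with the smaller number of selected edges reported for the stricter threshold.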

| Graph characteristics | Main CCG | MI, α = 0.1 | TWD | SEM | MI, BG44 | MI, BG75 |
|---|---|---|---|---|---|---|
| Number of selected edges | 104 | 113 | 138 | 157 | 104 | 46 |
| Number of undirected edges | 12 | 13 | 5 | 0 | 3 | 0 |
| Avg. number of outgoing edges | 2.4 | 2.5 | 2.8 | 3.1 | 2.1 | 0.9 |
| Hamming distance | - | 19 | 96 | 117 | 56 | 70 |
| Structural Hamming distance | - | 34 | 110 | 131 | 73 | 86 |