High-dimensional Adversarial Random Forest (ARF) for omics and clinical data.
Source:R/h_arf.R
h_arf.RdThe algorithm extends (ARF) to high-dimensional settings. It partitions high-dimensional data into isolated regions and fits Adversarial Random Forest (ARF) models within each region and on a latent space representing region joint distribution to capture within and between region feature dependencies.
Usage
h_arf(
omx_data,
cli_lab_data = NULL,
target = NULL,
feature_ordering = NULL,
omx_onset_data = NULL,
num_trees = 10,
min_node_size = 5,
correlation_method = "spearman",
correlation_mat = NULL,
num_btwn_pcs = 2,
num_onset_pcs = 2,
chunk_size = 10,
oob = FALSE,
family = "truncnorm",
finite_bounds = "no",
alpha = 0,
epsilon = 0,
parallel = FALSE,
export_cor_mat = FALSE,
verbose = FALSE
)Arguments
- omx_data
A data.frame containing omics data where rows represent samples (e.g. patients or cells) and columns represent features (e.g., gene or protein expressions).
- cli_lab_data
A data.frame of clinical or laboratory data. For example, this may be a data.frame where rows represent cell types (for single cell data or additional clinical patient information, e.g. disease status, age, sex, BMI, etc).
- target
Optional name of the target variable in
cli_lab_datafor supervised downstream analysis. If NULL, training is conducted for unsupervised downstream analysis.- feature_ordering
Optional vector of feature names specifying the order of features in the synthesized data. If NULL, features are ordered according to their original order in
omx_data, followed by order inclin_lab_data.- omx_onset_data
Optional data.frame of conditional onset omics features. This can be used to condition the synthesis on specific omics features (e.g. expression of a gene of interest). If provided, dimension reduction will be performed on these features and the resulting components will be included as additional meta features for training the meta adversarial model.
- num_trees
Number of trees to grow in each Adversarial Random Forest (ARF) model.
- min_node_size
Minimum number of samples required to split an internal node in the ARF model.
- correlation_method
Methods to compute correlation between features. Options include "pearson", "spearman", and "kendall". Default is "spearman".
- correlation_mat
Optional pre-computed correlation matrix between features. If provided, this will be used instead of computing correlations from
omx_data. This can be usefull for tuning hyperparameters of the clustering step without having to recompute the correlation matrix each time.- num_btwn_pcs
Number of principal components to use for between cluster variability. Default is 2.
- num_onset_pcs
Number of principal components to use for onset features. Default is 2.
- chunk_size
Size of feature chunks to process at a time. Default is 10.
- oob
Logical indicating whether to use out-of-bag samples for density estimation. Default is FALSE. Also see
forde- family
Distribution to use for density estimation of continuous features. See
fordefor options.- finite_bounds
Impose finite bounds on all continuous variables? If "local", infinite bounds are set to empirical extrema within leaves. If "global", infinite bounds are set to global empirical extrema. if "no" (the default), infinite bounds are left unchanged. See
fordefor more details.- alpha
Optional pseudocount for Laplace smoothing in density estimation. See
fordefor more details.- epsilon
Optional slack parameter on empirical bounds. See
fordefor more details.- parallel
Logical indicating whether to use parallel processing. Default is TRUE.
- export_cor_mat
Logical indicating whether to export the correlation matrix used for clustering. Default is FALSE.
- verbose
Logical indicating whether to print progress messages. Default is FALSE.
Examples
if (FALSE) { # \dontrun{
data(single_cell)
harf_model <- h_arf(
omx_data = single_cell[ , - which(colnames(single_cell) == "cell_type")],
cli_lab_data = data.frame(cell_type = single_cell$cell_type)
)
# Unconditional sampling from harf_model
set.seed(123)
synth_single_cell <- h_forge(
harf_obj = harf_model,
n_synth = nrow(single_cell)
)
# Conditional resampling from harf_model
set.seed(142)
lung_single_cell <- h_forge(
harf_obj = harf_model,
n_synth = sum(single_cell$cell_type == "lung"),
evidence = data.frame(cell_type = "lung")
)
} # }