HarmonizR is a framework around the popular batch effect reduction
algorithms ComBat
and limma
, who, on their
own, are unable to deal with missing data points within the data.
Missing values are not uncommon in biological data, especially in the
field of proteomics, for which the tool has originally been developed.
HarmonizR uses a matrix dissection approach to circumvent the
problematic existence of missing values and still apply established
batch effect reduction strategies. Recent updates to the HarmonizR
algorithm include but are not limited to increased computational
efficiency and more reliable rescue of features (e.g. proteins/genes).
For a full overview we would like to direct the reader to our paper
published in 2022 in Nature Communications: https://doi.org/10.1038/s41467-022-31007-x
All information regarding the upcoming sections can also be found within our SOP on Github, which is found under the following link: https://www.github.com/HSU-HPC/HarmonizR/tree/HarmonizR-1.1/HarmonizR_SOP.pdf.
Please have a look over there as well.
The HarmonizR implementation is 100 % written in the programming
language R
. The easiest way to install it is using the
package devtools
that can be installed from CRAN via
install.packages("devtools")
while in the R
environment. For further information, please refer to the
devtools
documentation.
Installation from Github Repository:
The HarmonizR package (https://github.com/HSU-HPC/HarmonizR/tree/HarmonizR-1.1
leads to the package as well as example data) can be installed directly
from GitHub via the command
devtools::install_github("HSU-HPC/HarmonizR/[email protected]")
,
while in the R
software environment.
Installation from HarmonizR.zip file:
Please make sure to have devtools
installed. Download
the code via the green Code button (https://github.com/HSU-HPC/HarmonizR/tree/HarmonizR-1.1).
Unzip the downloaded .zip file. The HarmonizR package is the folder
called HarmonizR, which was within the .zip file. While in the
R
environment and while in the parent directory of said
HarmonizR, enter in the command line:
devtools::install("HarmonizR")
to install the package.
Alternatively, while inside the HarmonizR folder, enter
devtools::install()
.
Once installed, the HarmonizR package can be used via:
For this example, we create a simple dataframe containing 3 features and 6 samples:
# create a dataframe with 3 rows and 6 columns filled with random numbers
df <- data.frame(matrix(rnorm(n = 3*6), ncol = 6))
# set the column names
colnames(df) <- c("A", "B", "C", "D", "E", "F")
# create a vector of row names
row_names <- c("F1", "F2", "F3")
# set the row names
rownames(df) <- row_names
# this is what it looks like:
df
#> A B C D E F
#> F1 1.3655242 0.9189169 1.3997269 -1.1011716 1.251652 -0.9490417
#> F2 -0.7564413 -2.1574364 -0.3110256 0.5076546 -2.378861 0.1073165
#> F3 -0.6936057 0.3612722 2.2154759 -1.1157130 -1.171649 -1.0670647
Now we create a fitting description, which assigns 2 samples to a batch (3 batches total):
# create a vector of batch numbers
batch <- rep(1:3, each = 2)
# create a dataframe with 6 rows and 3 columns
des <- data.frame(ID = colnames(df), sample = 1:6, batch = batch)
# this is what it looks like:
des
#> ID sample batch
#> 1 A 1 1
#> 2 B 2 1
#> 3 C 3 2
#> 4 D 4 2
#> 5 E 5 3
#> 6 F 6 3
HarmonizR usage requires a single function call of the
harmonizR()
function. Here, we pass the created dataframes
directly for a sequential run of HarmonizR. Alternatively, a path to the
respective input files can be passed.
# use the harmonizR() function; turning off creation of an output .tsv file
result <- harmonizR(df, des, output_file = FALSE, cores = 1)
#> Initializing HarmonizR...
#> Reading the files...
#> Preparing...
#> Splitting the data using ComBat adjustment...
#> Found3batches
#> Adjusting for0covariate(s) or covariate level(s)
#> Standardizing Data across genes
#> Fitting L/S model and finding priors
#> Finding parametric adjustments
#> Adjusting the Data
#> Rebuilding...
#> Termination.
The result mirrors the input data.frame
closely, yet the
batch-effect has been reduced by either ComBat
or
limma
:
result
#> A B C D E F
#> F1 1.106921 0.4644636 0.8790055 -0.6218661 1.3816182 -0.19040608
#> F2 -0.301682 -1.9837688 -0.8836108 -0.3444818 -1.5780118 0.09101886
#> F3 -0.931131 0.5063345 0.8360306 -1.0914659 -0.5366926 -0.45330502
Alternatively, S4 data may be used as input. In this case, no
description file is needed as long as a batch description is included
within the SummarizedExperiment
. An example input may look
like this:
nrows <- 20
ncols <- 8
counts <- matrix(runif(nrows * ncols, 1, 1e4), nrows)
colData <- data.frame(Batch=c(1,1,1,1,2,2,2,2))
SummExp = SummarizedExperiment::SummarizedExperiment(
assays=list(counts=counts),
colData=colData)
It may be passed instead of data_as_input
(following
parameter explanation).
HarmonizR expects 2 mandatory and a total of nine optional arguments. First, the mandatory ones:
data_as_input
description_as_input
The first argument data_as_input
, is the path to the raw
data, the second argument, description_as_input
, is the
path to the description file. Both input files can be given via their
file path and do not have to be read in separately by hand. This method
is recommended, as it ensures correct operation if the notes regarding
Input above are followed. Alternatively, both parameters can be passed
as dataframes or matrices as long as they are fitting the expected input
layout.
Next will be four optional arguments found also within the already published HarmonizR:
algorithm
ComBat_mode
plot
output_file
The first optional argument is the algorithm of choice.
ComBat
will be used by default, but using the parameter
algorithm
, either "ComBat"
or
"limma"
can be chosen for data adjustment.
"ComBat"
serves as the default. The second optional
argument is ComBat_mode
. This parameter is only valid once
ComBat
is chosen for the adjustment.
The ComBat
mode is abbreviated for simplicity by using
integers:
ComBat_mode Corresponding ComBat Arguments
1 (default) par.prior = TRUE, mean.only = FALSE
2 par.prior = TRUE, mean.only = TRUE
3 par.prior = FALSE, mean.only = FALSE
4 par.prior = FALSE, mean.only = TRUE
Please refer to the ComBat
documentation for further
details.
The third optional parameter is plot
. plot
can be set to either "samplemeans"
,
"featuremeans"
or "CV"
and will show a boxplot
with a box for each batch depicting the chosen method. This plot will
also be saved to a .pdf. This will be either the mean for all samples,
the mean for all features or the coefficient of variation. There will be
a separate plot for the original, unaltered input dataset and for the
ComBat
/limma
adjusted dataset. By default,
this parameter is turned off. Since a log transformation is assumed,
this will also be accounted for before plotting. Trying to plot data
that has not been log transformed prior may lead to an unplottable
result.
The fourth optional parameter is output_file
. Setting
this parameter will grant the user the ability to choose the name of
their output .tsv file. Also, a path can be set. A string is expected as
input. This parameter will default to "cured_data"
,
yielding a file called cured_data.tsv
.
Further, the five new parameters will be explained one-by-one:
sort
sort
takes one of three available sorting algorithms as
input. Either "sparcity_sort"
, for a sparcity-based
sorting, "seriation_sort"
, using the seriation
package and "jaccard_sort"
, using a Jaccard-index-based
sorting approach. Sorting happens prior to the adjustment and may change
the way blocking is executed on the data. Sorting does not yield any
benefit when the block
parameter is unused.
block
block
takes an integer as input which dictates, how many
batches should be packed together during matrix dissection. This
parameter may greatly reduce the amount of sub-dataframes produced and
therefore decrease the algorithm’s runtime.
verbosity
verbosity
takes an integer with the lowest accepted
integer being 0
. The higher the number, the more feedback
will the HarmonizR provide in the command line. A verbosity of
1
is the default and should be sufficient. 0
or "mute"
will prevent HarmonizR from showing anything in
the command line.
cores
cores
gives the user the ability to control the number
of cores used by their machine during HarmonizR execution. By default,
all available cores will be used.
ur
ur
, short for unique removal, can be set to either
TRUE
or FALSE
to toggle the newly implemented
removal of unique combinations for increased feature rescue. By default,
this feature is applied, and it is strongly suggested to not turn it
off. The parameter is available for result reproducibility.
sessionInfo()
#> R Under development (unstable) (2024-10-21 r87258)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.21-bioc/R/lib/libRblas.so
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_GB LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: America/New_York
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] HarmonizR_1.5.0
#>
#> loaded via a namespace (and not attached):
#> [1] KEGGREST_1.47.0 SummarizedExperiment_1.37.0
#> [3] xfun_0.48 bslib_0.8.0
#> [5] Biobase_2.67.0 lattice_0.22-6
#> [7] vctrs_0.6.5 tools_4.5.0
#> [9] stats4_4.5.0 parallel_4.5.0
#> [11] AnnotationDbi_1.69.0 RSQLite_2.3.7
#> [13] blob_1.2.4 Matrix_1.7-1
#> [15] S4Vectors_0.45.0 lifecycle_1.0.4
#> [17] GenomeInfoDbData_1.2.13 compiler_4.5.0
#> [19] Biostrings_2.75.0 statmod_1.5.0
#> [21] codetools_0.2-20 GenomeInfoDb_1.43.0
#> [23] htmltools_0.5.8.1 sass_0.4.9
#> [25] yaml_2.3.10 crayon_1.5.3
#> [27] jquerylib_0.1.4 BiocParallel_1.41.0
#> [29] DelayedArray_0.33.0 cachem_1.1.0
#> [31] limma_3.63.0 iterators_1.0.14
#> [33] abind_1.4-8 foreach_1.5.2
#> [35] nlme_3.1-166 sva_3.55.0
#> [37] genefilter_1.89.0 locfit_1.5-9.10
#> [39] digest_0.6.37 splines_4.5.0
#> [41] fastmap_1.2.0 grid_4.5.0
#> [43] SparseArray_1.7.0 cli_3.6.3
#> [45] S4Arrays_1.7.0 XML_3.99-0.17
#> [47] survival_3.7-0 edgeR_4.5.0
#> [49] UCSC.utils_1.3.0 bit64_4.5.2
#> [51] rmarkdown_2.28 XVector_0.47.0
#> [53] httr_1.4.7 matrixStats_1.4.1
#> [55] bit_4.5.0 png_0.1-8
#> [57] memoise_2.0.1 evaluate_1.0.1
#> [59] knitr_1.48 GenomicRanges_1.59.0
#> [61] IRanges_2.41.0 doParallel_1.0.17
#> [63] mgcv_1.9-1 rlang_1.1.4
#> [65] Rcpp_1.0.13 xtable_1.8-4
#> [67] DBI_1.2.3 BiocGenerics_0.53.0
#> [69] annotate_1.85.0 jsonlite_1.8.9
#> [71] R6_2.5.1 plyr_1.8.9
#> [73] MatrixGenerics_1.19.0 zlibbioc_1.53.0