RcwlPipelines is a Bioconductor package that manages a collection of commonly used bioinformatics tools and pipelines based on Rcwl. These pre-built and pre-tested tools and pipelines are highly modularized and easy to customize to meet different bioinformatics data analysis needs.

Rcwl and RcwlPipelines together form a Bioconductor toolchain for the use and development of reproducible bioinformatics pipelines in Common Workflow Language (CWL). The project also aims to develop a community-driven platform for open source, open development, and open review of best-practice CWL bioinformatics pipelines.

1 Installation

  1. Install the package from Bioconductor.
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("RcwlPipelines")

The development version is also available to download from GitHub.

BiocManager::install("rworkflow/RcwlPipelines")
  2. Load the package into the R session.
library(RcwlPipelines)

2 Project resources

The project website https://rcwl.org/ serves as a central hub for all related resources. It provides guidance for new users and tutorials for both users and developers. Specific resources are listed below.

2.1 The R recipes and cwl scripts

The R scripts that build the CWL tools and pipelines now reside in a dedicated GitHub repository, which is intended as a community effort to collect and contribute bioinformatics tools and pipelines using Rcwl and CWL.

2.2 Tutorial book

The tutorial book provides detailed instructions for developing Rcwl tools/pipelines, and also includes examples of some commonly used tools and pipelines that cover a wide range of bioinformatics data analysis needs.

3 RcwlPipelines core functions

Here we show the usage of three core functions, cwlUpdate, cwlSearch, and cwlLoad, for updating, searching, and loading the needed tools or pipelines in R.

3.1 cwlUpdate

The cwlUpdate function syncs the current Rcwl recipes and returns a cwlHub object which contains the most updated Rcwl recipes. The mcols() function returns all related information about each available tool or pipeline.

The recipes are cached locally, so users don’t need to call cwlUpdate every time unless they want to use a tool/pipeline that was newly added to RcwlPipelines. Here we use the recipes from the Bioconductor devel version.

## For vignette use only. Users don't need to do this step.
Sys.setenv(cachePath = tempdir()) 
atls <- cwlUpdate(branch = "dev") ## sync the tools/pipelines.
atls
#> cwlHub with 177 records
#> cache path:  /tmp/Rtmp8Oqxd2/Rcwl 
#> # last modified date:  2021-08-02 
#> # cwlSearch() to query scripts
#> # cwlLoad('title') to load the script
#> # additional mcols(): rid, rpath, Type, Container, mtime, ...
#> 
#>            title                      
#>   BFC1   | pl_AnnPhaseVcf             
#>   BFC2   | pl_BaseRecal               
#>   BFC3   | pl_CombineGenotypeGVCFs    
#>   BFC4   | pl_GAlign                  
#>   BFC5   | pl_GPoN                    
#>   ...      ...                        
#>   BFC173 | tl_vcf2bed                 
#>   BFC174 | tl_vcf_expression_annotator
#>   BFC175 | tl_vcf_readcount_annotator 
#>   BFC176 | tl_vep                     
#>   BFC177 | tl_vt_decompose            
#>          Command                                                            
#>   BFC1   VCFvep+dVCFcoverage+rVCFcoverage+VCFexpression+PhaseVcf            
#>   BFC2   BaseRecalibrator+ApplyBQSR+samtools_index+samtools_flagstat+samt...
#>   BFC3   CombineGVCFs+GenotypeGVCFs                                         
#>   BFC4   fqJson+fq2ubam+ubam2bamJson+align+mvOut                            
#>   BFC5   GenomicsDB+PoN                                                     
#>   ...    ...                                                                
#>   BFC173 R function                                                         
#>   BFC174 vcf-expression-annotator                                           
#>   BFC175 vcf-readcount-annotator                                            
#>   BFC176 vep                                                                
#>   BFC177 vt decompose
table(mcols(atls)$Type)
#> 
#> pipeline     tool 
#>       37      139

Currently, we have integrated 139 command line tools and 37 pipelines.
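The metadata returned by mcols() can be filtered like any other table. The sketch below uses a small mock data frame in place of the real mcols(atls) so that it runs without syncing the recipes. The titles and commands are copied from the listing above; the Type values are assumed from the pl_/tl_ prefixes, and reading "+" as the separator between pipeline steps is also an assumption.

```r
## Mock stand-in for as.data.frame(mcols(atls)); not the real hub metadata.
## Type values are assumed from the pl_ (pipeline) and tl_ (tool) prefixes.
meta <- data.frame(
    rname   = c("pl_CombineGenotypeGVCFs", "pl_GPoN", "tl_vep", "tl_vt_decompose"),
    Type    = c("pipeline", "pipeline", "tool", "tool"),
    Command = c("CombineGVCFs+GenotypeGVCFs", "GenomicsDB+PoN", "vep", "vt decompose"),
    stringsAsFactors = FALSE
)

## Keep only the pipelines.
pipelines <- meta[meta$Type == "pipeline", ]

## Split each pipeline's Command into its steps (assuming "+" separates them).
steps <- setNames(strsplit(pipelines$Command, "+", fixed = TRUE), pipelines$rname)
steps

## Count entries per type, mirroring table(mcols(atls)$Type) above.
type_counts <- table(meta$Type)
type_counts
```

With the real hub, the same subsetting should apply to the DataFrame returned by mcols(atls).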

3.2 cwlSearch

We can use one or more keywords to search for specific tools/pipelines of interest. The function internally searches the mcols “rname”, “rpath”, “fpath”, “Command” and “Container”. Here we show how to search for the alignment tool bwa mem.

t1 <- cwlSearch(c("bwa", "mem"))
t1
#> cwlHub with 1 records
#> cache path:  /tmp/Rtmp8Oqxd2/Rcwl 
#> # last modified date:  2021-05-20 
#> # cwlSearch() to query scripts
#> # cwlLoad('title') to load the script
#> # additional mcols(): rid, rpath, Type, Container, mtime, ...
#> 
#>            title  Command
#>   BFC105 | tl_bwa bwa mem
mcols(t1)
#> DataFrame with 1 row and 14 columns
#>           rid       rname         create_time         access_time
#>   <character> <character>         <character>         <character>
#> 1      BFC105      tl_bwa 2021-10-26 22:30:30 2021-10-26 22:30:30
#>                    rpath       rtype                  fpath last_modified_time
#>              <character> <character>            <character>          <numeric>
#> 1 /tmp/Rtmp8Oqxd2/Rcwl..       local /tmp/Rtmp8Oqxd2/Rcwl..                 NA
#>          etag   expires        Type     Command              Container
#>   <character> <numeric> <character> <character>            <character>
#> 1          NA        NA        tool     bwa mem biocontainers/bwa:v0..
#>                 mtime
#>           <character>
#> 1 2021-05-20 12:15:10
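How cwlSearch combines several keywords is not spelled out here. One plausible reading, consistent with the single hit for c("bwa", "mem"), is that a record must match every keyword somewhere in the searched columns. The sketch below illustrates that idea on a mock data frame; search_meta() is a hypothetical helper and not part of RcwlPipelines, and the real implementation may differ.

```r
## Mock metadata; names and commands are taken from the listings above.
meta <- data.frame(
    rname   = c("tl_bwa", "tl_vep", "pl_GPoN"),
    Command = c("bwa mem", "vep", "GenomicsDB+PoN"),
    stringsAsFactors = FALSE
)

## search_meta() is a hypothetical helper, not part of RcwlPipelines.
search_meta <- function(meta, keywords, cols = c("rname", "Command")) {
    ## Collapse the searched columns into one string per record.
    haystack <- do.call(paste, c(meta[cols], sep = " "))
    ## A record is a hit only if every keyword matches (case-insensitive).
    hits <- Reduce(`&`, lapply(keywords, function(k) {
        grepl(k, haystack, ignore.case = TRUE)
    }))
    meta[hits, , drop = FALSE]
}

res <- search_meta(meta, c("bwa", "mem"))
res
```

Requiring all keywords to match keeps the result narrow; a search that accepted any keyword would return far more records for generic terms.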

3.3 cwlLoad

The last core function, cwlLoad, loads the Rcwl tool/pipeline into the R working environment. The code below loads the tool under the user-defined name bwa to do the read alignment.

bwa <- cwlLoad(title(t1)[1])  ## "tl_bwa"
bwa <- cwlLoad(mcols(t1)$fpath[1]) ## equivalent to the above. 
bwa
#> class: cwlProcess 
#>  cwlClass: CommandLineTool 
#>  cwlVersion: v1.0 
#>  baseCommand: bwa mem 
#> requirements:
#> - class: DockerRequirement
#>   dockerPull: biocontainers/bwa:v0.7.17-3-deb_cv1
#> inputs:
#>   threads (int): -t 
#>   RG (string?): -R 
#>   Ref (File):  
#>   FQ1 (File):  
#>   FQ2 (File?):  
#> outputs:
#> sam:
#>   type: File
#>   outputBinding:
#>     glob: '*.sam'
#> stdout: bwaOutput.sam

The bwa tool is now ready to use in R.

4 Customize a tool or pipeline

To fit users’ specific needs, an existing tool or pipeline can be easily customized. Here we use the rnaseq_Sf pipeline to demonstrate how to access and change the arguments of a specific tool inside a pipeline. This pipeline covers RNA-seq read quality summary by fastQC, alignment by STAR, quantification by featureCounts, and quality control by RSeQC.

rnaseq_Sf <- cwlLoad("pl_rnaseq_Sf")
#> fastqc loaded
#> STAR loaded
#> sortBam loaded
#> samtools_index loaded
#> samtools_flagstat loaded
#> featureCounts loaded
#> gtfToGenePred loaded
#> genePredToBed loaded
#> read_distribution loaded
#> geneBody_coverage loaded
#> STAR loaded
#> gCoverage loaded
plotCWL(rnaseq_Sf)

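The STAR tool inside the pipeline defines many default arguments, and users might want to change some of them. For example, we can change the value of the --outFilterMismatchNmax argument from 2 to 5 for longer reads. The arguments are stored as a positional list of flags and values, so confirm which flag sits at a given index before assigning to it.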

arguments(rnaseq_Sf, "STAR")[5:6]
#> [[1]]
#> [1] "--readFilesCommand"
#> 
#> [[2]]
#> [1] "zcat"
arguments(rnaseq_Sf, "STAR")[[6]] <- 5
arguments(rnaseq_Sf, "STAR")[5:6]
#> [[1]]
#> [1] "--readFilesCommand"
#> 
#> [[2]]
#> [1] "5"
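In the output above, positions 5 and 6 hold --readFilesCommand and its value rather than --outFilterMismatchNmax, so assigning by a fixed index can overwrite the wrong entry. A more defensive approach is to look the flag up by name. The sketch below shows this on a plain list standing in for arguments(rnaseq_Sf, "STAR"); set_flag_value() is a hypothetical helper, not part of Rcwl, and the layout of the mock list is an assumption.

```r
## Mock stand-in for arguments(rnaseq_Sf, "STAR"): flags and values alternate.
## Only the --readFilesCommand/zcat pair comes from the output above; the
## position of --outFilterMismatchNmax here is assumed for illustration.
star_args <- list("--outFilterMismatchNmax", "2",
                  "--readFilesCommand", "zcat")

## set_flag_value() is a hypothetical helper, not part of Rcwl.
set_flag_value <- function(args, flag, value) {
    ## Find the position of the flag itself.
    pos <- which(vapply(args, identical, logical(1), flag))
    if (length(pos) != 1L || pos == length(args)) {
        stop("flag not found exactly once with a following value: ", flag)
    }
    ## The value is the element right after the flag.
    args[[pos + 1L]] <- as.character(value)
    args
}

star_args <- set_flag_value(star_args, "--outFilterMismatchNmax", 5)
star_args
```

Assuming the real arguments are plain character strings, as the printed output suggests, the result of such a helper could be assigned back with arguments(rnaseq_Sf, "STAR") <- ....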

We can also change the docker image for a specific tool (e.g., to a specific version). First, we search for all available docker images for STAR in the biocontainers repository. The source server can be quay or dockerhub.

searchContainer("STAR", repo = "biocontainers", source = "quay")
#> DataFrame with 34 rows and 6 columns
#>                           tool                    V2               name
#>                    <character>           <character>        <character>
#> 2.7.9a--h9ee0642_0        STAR quay.io/biocontainers 2.7.9a--h9ee0642_0
#> 2.6.1d--h9ee0642_1        STAR quay.io/biocontainers 2.6.1d--h9ee0642_1
#> 2.7.8a--h9ee0642_1        STAR quay.io/biocontainers 2.7.8a--h9ee0642_1
#> 2.4.0j--h9ee0642_2        STAR quay.io/biocontainers 2.4.0j--h9ee0642_2
#> 2.6.0c--h9ee0642_3        STAR quay.io/biocontainers 2.6.0c--h9ee0642_3
#> ...                        ...                   ...                ...
#> 2.4.0j--0                 STAR quay.io/biocontainers          2.4.0j--0
#> 2.5.4a--0                 STAR quay.io/biocontainers          2.5.4a--0
#> 2.5.3a--0                 STAR quay.io/biocontainers          2.5.3a--0
#> 2.5.2b--0                 STAR quay.io/biocontainers          2.5.2b--0
#> 2.5.1b--0                 STAR quay.io/biocontainers          2.5.1b--0
#>                             last_modified        size              container
#>                               <character> <character>            <character>
#> 2.7.9a--h9ee0642_0 Tue, 11 May 2021 19:..    10134089 quay.io/biocontainer..
#> 2.6.1d--h9ee0642_1 Fri, 26 Mar 2021 15:..    11646389 quay.io/biocontainer..
#> 2.7.8a--h9ee0642_1 Fri, 26 Mar 2021 15:..     9956698 quay.io/biocontainer..
#> 2.4.0j--h9ee0642_2 Fri, 26 Mar 2021 15:..     7066519 quay.io/biocontainer..
#> 2.6.0c--h9ee0642_3 Fri, 26 Mar 2021 15:..    11634304 quay.io/biocontainer..
#> ...                                   ...         ...                    ...
#> 2.4.0j--0          Tue, 06 Mar 2018 12:..     4734325 quay.io/biocontainer..
#> 2.5.4a--0          Fri, 26 Jan 2018 21:..     9225952 quay.io/biocontainer..
#> 2.5.3a--0          Sat, 18 Mar 2017 11:..     9119736 quay.io/biocontainer..
#> 2.5.2b--0          Tue, 06 Sep 2016 07:..     9086803 quay.io/biocontainer..
#> 2.5.1b--0          Wed, 11 May 2016 08:..    11291827 quay.io/biocontainer..

Then, we can change the STAR version to 2.7.8a (tag name: 2.7.8a--0).

requirements(rnaseq_Sf, "STAR")[[1]]
#> $class
#> [1] "DockerRequirement"
#> 
#> $dockerPull
#> [1] "quay.io/biocontainers/star:2.7.9a--h9ee0642_0"
requirements(rnaseq_Sf, "STAR")[[1]] <- requireDocker(
    docker = "quay.io/biocontainers/star:2.7.8a--0")
requirements(rnaseq_Sf, "STAR")[[1]]
#> $class
#> [1] "DockerRequirement"
#> 
#> $dockerPull
#> [1] "quay.io/biocontainers/star:2.7.8a--0"
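The image string passed to requireDocker can also be assembled from the search results instead of being typed by hand. The sketch below works on a character vector copied from a few rows of the name column above; with the real result, the same filtering would apply to that column. The quay.io/biocontainers/star prefix is taken from the dockerPull value shown above.

```r
## A few tag names copied from the searchContainer() output above.
tags <- c("2.7.9a--h9ee0642_0", "2.6.1d--h9ee0642_1", "2.7.8a--h9ee0642_1",
          "2.4.0j--0", "2.5.4a--0")

## Keep the tags that belong to the wanted STAR version.
wanted <- tags[startsWith(tags, "2.7.8a--")]

## Build the full image name in the form used by dockerPull.
image <- paste0("quay.io/biocontainers/star:", wanted)
image
```

Several builds of the same version can exist (here the visible rows only show one for 2.7.8a), so it is worth checking that exactly one tag was selected before using it.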

5 Run a tool or pipeline

Once the tool or pipeline is ready, we only need to assign values to each of the input parameters and then submit it using one of the functions runCWL, runCWLBatch, or cwlShiny. For more detailed usage and examples, refer to the Rcwl vignette.

To run the tool or pipeline successfully, users need either to have all required command line tools pre-installed locally, or to use the docker/singularity runtime by specifying the docker = TRUE or docker = "singularity" argument in the runCWL or runCWLBatch function. Since the Bioconductor build machine doesn’t have all the tools installed, nor does it support the docker runtime, here we use some pseudo-code to demonstrate the tool/pipeline execution.

inputs(rnaseq_Sf)
rnaseq_Sf$in_seqfiles <- list("sample_R1.fq.gz",
                              "sample_R2.fq.gz")
rnaseq_Sf$in_prefix <- "sample"
rnaseq_Sf$in_genomeDir <- "genome_STAR_index_Dir"
rnaseq_Sf$in_GTFfile <- "GENCODE_version.gtf"

runCWL(rnaseq_Sf, outdir = "output/sample", docker = TRUE)
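The file and directory names in the example above are placeholders. Before submitting a real run, it can help to confirm that every input path exists. The sketch below is one way to do that in base R; missing_inputs() is a hypothetical helper, not part of Rcwl, and the demonstration paths are created under tempdir() purely for illustration.

```r
## missing_inputs() is a hypothetical helper, not part of Rcwl.
## It flattens a (possibly nested) list of paths and returns those
## that do not exist on disk.
missing_inputs <- function(inputs) {
    paths <- unlist(inputs, use.names = FALSE)
    paths[!file.exists(paths)]
}

## Demonstration: one real temporary file and two paths that do not exist.
gtf <- tempfile(fileext = ".gtf")
file.create(gtf)
inputs <- list(
    in_seqfiles = list(file.path(tempdir(), "no_such_R1.fq.gz"),
                       file.path(tempdir(), "no_such_R2.fq.gz")),
    in_GTFfile  = gtf
)

absent <- missing_inputs(inputs)
basename(absent)
```

file.exists() also returns TRUE for directories, so the same check covers an input such as in_genomeDir.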

Users can also submit parallel jobs to an HPC cluster for multiple samples using the runCWLBatch function. Different cluster job managers, such as “multicore”, “sge” and “slurm”, are supported through BiocParallel::BatchtoolsParam.

library(BiocParallel)
bpparam <- BatchtoolsParam(workers = 2, cluster = "sge",
                           template = batchtoolsTemplate("sge"))

inputList <- list(in_seqfiles = list(sample1 = list("sample1_R1.fq.gz",
                                                    "sample1_R2.fq.gz"),
                                     sample2 = list("sample2_R1.fq.gz",
                                                    "sample2_R2.fq.gz")),
                  in_prefix = list(sample1 = "sample1",
                                   sample2 = "sample2"))

paramList <- list(in_genomeDir = "genome_STAR_index_Dir",
                  in_GTFfile = "GENCODE_version.gtf",
                  in_runThreadN = 16)

runCWLBatch(rnaseq_Sf, outdir = "output",
            inputList, paramList,
            BPPARAM = bpparam)
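For more than a handful of samples, writing inputList by hand becomes error-prone. The sketch below builds the same nested structure from a vector of sample names in base R. It assumes the <sample>_R1.fq.gz / <sample>_R2.fq.gz naming pattern used in the example above; real file names may follow a different scheme.

```r
## Sample names; the file-name pattern below follows the example above.
samples <- c("sample1", "sample2")

## One list of paired FASTQ files per sample, named by sample.
seqfiles <- lapply(samples, function(s) {
    list(paste0(s, "_R1.fq.gz"), paste0(s, "_R2.fq.gz"))
})
names(seqfiles) <- samples

## One output prefix per sample, with matching names.
prefixes <- as.list(setNames(samples, samples))

inputList <- list(in_seqfiles = seqfiles, in_prefix = prefixes)
str(inputList)
```

Keeping the element names identical across all entries of inputList makes it easy to see which files and which prefix belong to the same job.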

6 SessionInfo

sessionInfo()
#> R version 4.1.1 (2021-08-10)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 20.04.3 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.14-bioc/R/lib/libRblas.so
#> LAPACK: /home/biocbuild/bbs-3.14-bioc/R/lib/libRlapack.so
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB              LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#> [1] RcwlPipelines_1.10.0 BiocFileCache_2.2.0  dbplyr_2.1.1        
#> [4] Rcwl_1.10.0          S4Vectors_0.32.0     BiocGenerics_0.40.0 
#> [7] yaml_2.2.1           BiocStyle_2.22.0    
#> 
#> loaded via a namespace (and not attached):
#>  [1] httr_1.4.2           tidyr_1.1.4          sass_0.4.0          
#>  [4] bit64_4.0.5          jsonlite_1.7.2       R.utils_2.11.0      
#>  [7] bslib_0.3.1          shiny_1.7.1          assertthat_0.2.1    
#> [10] BiocManager_1.30.16  blob_1.2.2           base64url_1.4       
#> [13] progress_1.2.2       pillar_1.6.4         RSQLite_2.2.8       
#> [16] backports_1.2.1      lattice_0.20-45      glue_1.4.2          
#> [19] reticulate_1.22      digest_0.6.28        RColorBrewer_1.1-2  
#> [22] promises_1.2.0.1     checkmate_2.0.0      htmltools_0.5.2     
#> [25] httpuv_1.6.3         Matrix_1.3-4         R.oo_1.24.0         
#> [28] pkgconfig_2.0.3      dir.expiry_1.2.0     bookdown_0.24       
#> [31] DiagrammeR_1.0.6.1   purrr_0.3.4          xtable_1.8-4        
#> [34] brew_1.0-6           later_1.3.0          BiocParallel_1.28.0 
#> [37] git2r_0.28.0         tibble_3.1.5         generics_0.1.1      
#> [40] ellipsis_0.3.2       cachem_1.0.6         withr_2.4.2         
#> [43] magrittr_2.0.1       crayon_1.4.1         mime_0.12           
#> [46] memoise_2.0.0        evaluate_0.14        R.methodsS3_1.8.1   
#> [49] fansi_0.5.0          tools_4.1.1          data.table_1.14.2   
#> [52] prettyunits_1.1.1    hms_1.1.1            lifecycle_1.0.1     
#> [55] basilisk.utils_1.6.0 stringr_1.4.0        compiler_4.1.1      
#> [58] jquerylib_0.1.4      rlang_0.4.12         debugme_1.1.0       
#> [61] grid_4.1.1           rstudioapi_0.13      rappdirs_0.3.3      
#> [64] htmlwidgets_1.5.4    visNetwork_2.1.0     igraph_1.2.7        
#> [67] rmarkdown_2.11       basilisk_1.6.0       codetools_0.2-18    
#> [70] curl_4.3.2           DBI_1.1.1            R6_2.5.1            
#> [73] knitr_1.36           dplyr_1.0.7          bit_4.0.4           
#> [76] fastmap_1.1.0        utf8_1.2.2           filelock_1.0.2      
#> [79] stringi_1.7.5        parallel_4.1.1       Rcpp_1.0.7          
#> [82] vctrs_0.3.8          png_0.1-7            batchtools_0.9.15   
#> [85] tidyselect_1.1.1     xfun_0.27