systemPipeR 2.13.0
This is the BLAST workflow template of the systemPipeRdata package, a companion package to systemPipeR (H Backman and Girke 2016). Like other workflow templates, it can be loaded with a single command. Users can use the template as is or modify it as needed. More in-depth information can be found in the main vignette of systemPipeRdata. The BLAST workflow template serves as a starting point for sequence similarity search routines. It uses NCBI’s BLAST software as an illustrative example, enabling users to search a sequence database for entries that share sequence similarity with one or multiple query sequences. The search results can be presented in a concise tabular summary format, or the corresponding pairwise alignments can be included. To use this workflow, users must download and install the BLAST software from NCBI’s website and ensure that it is added to their system’s PATH environment variable.
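Before generating the workflow environment, it can be useful to confirm that the BLAST executables are visible from the current R session. The following is a minimal base R sketch of such a check. The helper `find_blast_tools` is a hypothetical name introduced here and is not part of systemPipeR; the tool names `blastn` and `makeblastdb` are assumed from the CWL files used later in this workflow.

```r
# Hypothetical helper (not part of systemPipeR): report which BLAST
# executables can be found on the PATH of the current R session.
find_blast_tools <- function(tools = c("blastn", "makeblastdb")) {
    # Sys.which() returns a named character vector; "" means not found
    paths <- Sys.which(tools)
    data.frame(tool = tools, path = unname(paths),
        found = unname(nzchar(paths)),
        row.names = NULL, stringsAsFactors = FALSE)
}

blast_tools <- find_blast_tools()
print(blast_tools)
if (!all(blast_tools$found)) {
    message("Some BLAST executables were not found; check your PATH.")
}
```

If any tool is reported as not found, the PATH seen by R may differ from the one in an interactive shell, which is worth checking before running the workflow.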
The Rmd file (SPblast.Rmd) associated with this vignette serves a dual purpose.
It acts both as a template for executing the workflow and as a template for
generating a reproducible scientific analysis report. Thus, users will want to
customize the text (and/or code) of this or other systemPipeR workflow
vignettes to describe their experimental design and analysis results. This
typically involves deleting the instructions on how to work with this workflow
and customizing the text describing experimental designs, other metadata and
analysis results.
The following data analysis steps are included in this workflow template:
The topology graph of the BLAST workflow is shown in Figure 1.
The environment of the chosen workflow is generated with the genWorkenvir
function. After this, the user’s R session needs to be directed into the
resulting directory (here SPblast).
systemPipeRdata::genWorkenvir(workflow = "SPblast", mydirname = "SPblast")
setwd("SPblast")
The SPRproject function initializes a new workflow project instance. This
function call creates an empty SAL workflow container and, at the same time, a
linked project log directory (default name .SPRproject) that acts as a
flat-file database of the workflow. For additional details, see the
corresponding section in systemPipeR's main vignette.
library(systemPipeR)
sal <- SPRproject()
sal
The importWF function imports all the workflow steps outlined in the source
Rmd file of this vignette into a SAL (SYSargsList) workflow container. Once
imported, the entire workflow can be executed from start to finish using the
runWF function. More details on this process are provided in the corresponding
section of systemPipeR's main vignette.
sal <- importWF(sal, "SPblast.Rmd")
sal <- runWF(sal)
Next, the systemPipeR package needs to be loaded in a workflow.
appendStep(sal) <- LineWise(code = {
    library(systemPipeR)
}, step_name = "load_packages")
The following step is optional. It tests the availability of the BLAST software on the user’s system.
appendStep(sal) <- LineWise(code = {
    # If you have a module system, then enable the following line
    # moduleload("ncbi-blast")
    blast_check <- tryCMD("blastn", silent = TRUE)
    if (blast_check == "error")
        stop("Check your BLAST installation path.")
}, step_name = "test_blast", dependency = "load_packages")
This step creates an indexed sequence database that can be searched with BLAST.
The sample sequences used for creating the databases are stored in a file named
tair10.fasta under the data directory of the workflow environment. The exact
command-line (CL) call used for creating the indexed database can be returned
with cmdlist(sal, step=3).
appendStep(sal) <- SYSargsList(step_name = "build_genome_db",
    dir = FALSE, targets = NULL, wf_file = "blast/makeblastdb.cwl",
    input_file = "blast/makeblastdb.yml", dir_path = "param/cwl",
    dependency = "test_blast")
Next, the BLASTable database is searched with a set of query sequences to find sequences in the
database that share sequence similarity with the queries. As above, the exact CL
call used in this step can be returned with cmdlist(sal, step=4).
appendStep(sal) <- SYSargsList(step_name = "blast_genome", dir = FALSE,
    targets = "targets_blast.txt", wf_file = "blast/blastn.cwl",
    input_file = "blast/blastn.yml", inputvars = c(FileName = "_query_file_",
        SampleName = "_SampleName_"), dir_path = "param/cwl",
    dependency = "build_genome_db")
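The inputvars argument in the step above links columns of the targets file (FileName, SampleName) to placeholder strings (_query_file_, _SampleName_) in the input yml file, so that one command is generated per targets row. The base R sketch below only illustrates this substitution idea; it is not systemPipeR's implementation. The targets rows, the template string, the file paths and the helper `fill_template` are all hypothetical.

```r
# Hypothetical targets table; the real one is read from targets_blast.txt.
targets <- data.frame(
    FileName = c("data/query1.fasta", "data/query2.fasta"),
    SampleName = c("query1", "query2"),
    stringsAsFactors = FALSE
)

# Same mapping as in the step above: targets column -> placeholder
inputvars <- c(FileName = "_query_file_", SampleName = "_SampleName_")

# Hypothetical parameter template containing both placeholders
template <- "query: _query_file_; out: results/_SampleName_.blast"

# Hypothetical helper: fill every placeholder with the value from one row
fill_template <- function(template, row, inputvars) {
    for (col in names(inputvars)) {
        template <- gsub(inputvars[[col]], row[[col]], template, fixed = TRUE)
    }
    template
}

# One filled parameter string per targets row
filled <- vapply(seq_len(nrow(targets)), function(i) {
    fill_template(template, targets[i, ], inputvars)
}, character(1))
print(filled)
```

Each row of the targets table yields its own filled parameter set, which is why a single step definition can cover any number of query files.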
This step displays the top hits identified by the BLAST search in the previous step.
The e_value and bit_score columns can be used to rank the BLAST results by sequence similarity.
appendStep(sal) <- LineWise(code = {
    # Get the output file path from a SYSargs step using `getColumn`
    tbl_tair10 <- read.delim(getColumn(sal, step = "blast_genome")[1],
        header = FALSE, stringsAsFactors = FALSE)
    # Column names of the tabular BLAST output
    names(tbl_tair10) <- c("query", "subject", "identity", "alignment_length",
        "mismatches", "gap_openings", "q_start", "q_end", "s_start",
        "s_end", "e_value", "bit_score")
    print(head(tbl_tair10, n = 20))
}, step_name = "display_hits", dependency = "blast_genome")
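Ranking by e_value and bit_score can be sketched in base R as follows. This example is not part of the workflow: it uses a small toy table with invented values that has the same column names as tbl_tair10, and the helper `rank_hits` is a hypothetical name. Hits are sorted within each query by ascending e-value, with ties broken by descending bit score.

```r
# Toy table with the same column names as tbl_tair10; the values are
# invented for illustration and are not results of this workflow.
toy_hits <- data.frame(
    query = c("q1", "q1", "q1", "q2"),
    subject = c("s3", "s1", "s2", "s1"),
    e_value = c(1e-20, 1e-50, 1e-50, 1e-05),
    bit_score = c(95, 180, 200, 40),
    stringsAsFactors = FALSE
)

# Hypothetical helper: within each query, sort hits by ascending e-value
# and break ties by descending bit score.
rank_hits <- function(tbl) {
    ord <- order(tbl$query, tbl$e_value, -tbl$bit_score)
    ranked <- tbl[ord, , drop = FALSE]
    rownames(ranked) <- NULL
    ranked
}

ranked_hits <- rank_hits(toy_hits)
print(ranked_hits)

# Best hit per query: the first row of each query in the ranked table
best_hits <- ranked_hits[!duplicated(ranked_hits$query), ]
print(best_hits)
```

The same helper could be applied to tbl_tair10 after the display_hits step has run, since it only relies on the query, e_value and bit_score columns.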
appendStep(sal) <- LineWise(code = {
    sessionInfo()
}, step_name = "wf_session", dependency = "display_hits")
Once the above workflow steps have been loaded into sal from the source Rmd
file of this vignette, the workflow can be ex