Seqpac provides functions and workflows for analysis of short sequenced reads. It was originally developed for small RNA (sRNA) analysis, but can be implemented on any sequencing raw data (provided as a fastq-file), where the unit of measurement is counts of unique sequences. The core of the Seqpac strategy is the generation and subsequence annotation/analysis/ visualization of a standardized object called PAC. Using an innovative target system, Seqpac process, analyze and visualize group differences using the PAC object. Seqpac is particularly useful in generating sequence coverage profiles across reference sequences (such as tRNA or gene coverage) without loosing the integrity of the original fastq sequences. Thus, candidate sequences are easily validated by for example a BLAST at a genome browser. This guide will introduce you to the basics of our recommended workflow for sRNA analysis, both fastq-processing (e.g. adapter trimming and generating sequence counts) and post-fastq analysis (e.g filtering, genome and sRNA alignment, differential expression and visualization). To assure package quality, Seqpac has been optimized for Bioconducter. Seqpac package version: 1.6.0
Stable versions of Seqpac is available on Bioconductor, and development branches can be obtained at github:
## Installation Bioconductor:
if (!require("BiocManager"))
install.packages("BiocManager")
BiocManager::install("seqpac")
## Installation github:
devtools::install_github("Danis102/seqpac",
dependencies=TRUE, ref="development")
Seqpac contains an extensive toolbox for the analysis of short sequenced reads. The package was originally developed for small RNA (sRNA) analysis, but can be applied on any type of data generated by large scale nucleotide sequencing, where the user wish to maintain sequence integrity throughout the whole analysis. This involves, for example, regular RNA-seq with long RNA. But note, however, that it currently does not automatically support paired-end sequencing.
To preserve sequence integrity, and thereby guarantee a clean data linage,
Seqpac applies sequence-based counting. This can be contrasted against
feature-based counting, where reads are counted over known genomic features,
like protein coding genes or sRNA. Such strategy will result in a count table
where each count is made per feature (e.g. number of read counts mapping to a
gene). In Seqpac, such annotations are initially disregarded, as a count table
is generated solely by counting unique sequences in the raw sequence files
(fastq). Annotations against known reference(s) (such as databases containing
sequences of full genomes, miRNA, tRNA, protein coding genes, or genomic
coordinates of such features) are done after the count table has been produced.
The advantage of using sequence-based counting, compared to feature-based
counting, is that for example mismatches in the classification of reads, can be
applied any time during the analysis. In feature-based counting, allowing a
mismatch in the mapping will result in a loss of information (loss of sequence
integrity).
For example, say that you have 506 read counts for miR-185 (a miRNA) when
allowing 3 mismatches against the reference. In feature-based analysis this will
result in 1 entry in the count table. Unless you have saved the exact alignments
elsewhere, you have lost the information about what sequences that are included
in that count. The only thing you know is that the sequence may differ from the
reference sequence by any 3 mismatches. In sequence-based counting, reads are
counted as unique sequences and each unique sequence is then mapped against the
target reference databases. Information about whether a sequence was aligning
against a reference is maintained in a linked annotation table that can contain
both perfect alignments and alignments allowing mismatches. Thus, in any stage
of the analysis mismatches and classifications into features can be dynamically
controlled, meaning that you can easily observe the effect of for example
allowing more mismatches or changes in hierarchical classifications (often
applied in sRNA analysis when a sequence align with multiple classes of sRNA).
But the best with sequence-based counting where you preserve sequence
integrity, is probably that you will be able blast your candidate sequences at
your favorite database, to verify and find out more about them.
The input file format for Seqpac is fastq, which can be generated from most sequencing platforms. Two types of objects sustain the core of the Seqpac workflow:
The PAC object stores the Phenotypic information (P), Annotations (A) and Counts
(C) needed for all downstream analysis, while the targeting object is used for
accessing specific subsets of data within the PAC object.
Seqpac provides two versions of the PAC object. In its simplest form it is a S3
list containing 3 data.frames (Pheno, Anno and Counts). As default, however, an
S4 PAC object is generated. By using the coercion method as(pac-object, "list")
, you can turn an S4 PAC into an S3 list. Similarly, by using
as.PAC(pac-object)
you can turn an S3 PAC into an S4. Plaese see ?as.PAC
an
?PAC
for more details and examples.
The targeting object is always a list of 2 character objects.
The workflow is roughly divided into four steps:
Here is a quick simple workflow using Seqpac. Remember, however, that in Seqpac you can create much more advanced workflows. Especially when applying the reanno workflow, as will be described below.
# Setup
library(seqpac)
sys_path = system.file("extdata", package = "seqpac", mustWork = TRUE)
fastq <- list.files(path = sys_path, pattern = "fastq", all.files = FALSE,
full.names = TRUE)
ref_tRNA <- system.file("extdata/trna", "tRNA.fa",
package = "seqpac", mustWork = TRUE)
# Trim NEB adapter, generate counts and create PAC-object (pheno, anno, counts)
count_list <- make_counts(fastq, plot = FALSE,
parse="default_neb", trimming="seqpac")
pac <- make_PAC(count_list$counts)
pheno(pac)$groups <- c(rep("gr1",3), rep("gr2",3)) #Add groups to pheno table
# Preprocess PAC-object and creat means
pac <- PAC_norm(pac) # Default normalization is cpm
pac <- PAC_filter(pac, size=c(15,60)) # Here, filter on sequence length
pac <- PAC_summary(pac, norm = "cpm", type = "means",
pheno_target=list("groups", unique(pheno(pac)$groups)))
# Quickly align (annotate) and plot PAC-object
map_tRNA <- PAC_mapper(pac, ref=ref_tRNA, override=TRUE)
plts <- PAC_covplot(pac, map_tRNA, summary_target=list("cpmMeans_groups"))
plts[[13]]
Contained within this package are 6 sRNA fastq file. These files were originally generated by extracting sRNA from a single fruit fly embryo. The original fastq file was acquired by Illumina NextSeq 500 sequencer using a High Output 75 cycles flow cell (product no: 20024906) and NEBNext® Small RNA Library Prep Set for Illumina (E7330). Due to package size restriction, this fastq was randomly down-sampled to a fraction of its original size into 6 much much smaller files.
There is also a prepared example PAC-object that were similarly generated
from randomly down-sampled fastq files (but here we used 9 unique embryos). In
addition, there are a few fasta reference file. These files are used in
examples of this vignette and in the function manuals.
Seqpac works best with merged fastq, meaning that if sequencing was done in separate lanes on the flow cell, where the same sample was present in all lanes, lane files must be merged for each unique sample.
Conveniently, Seqpac contains a function that can be used to merge lane files:
merge_lanes
. Please see, ?merge_lanes
for examples on how to merge
fastq-files. Specifically, this function searches the input folder for similar
file names and append files with similar names. As long as the user uses unique
sample names (e.g. sample1_lane1.fastq.gz, sample1_lane2.fastq.gz,
sample2_lane1.fastq.gz, sample2_lane2.fastq.gz etc) then merge_lanes
will
merge lane files into the same sample (e.g. sample1.fastq.gz, sample2.fastq.gz).
Count tables are the core components in most sequence analysis. Seqpac uses sequence-based counts (see 1.1 Overview). Thus, we need to generate a count table with the number of unique sequences in each sample.
There are three ways of generating a count table in Seqpac.
The two first involves adapter trimming, while the last involves already
trimmed reads. All input options are contained within the make_counts
.
function.
Generating a count table from untrimmed fastq files using nothing but R packages
(internal) depends on the make_trim
function. This function goes through a
series of trimming cycles using the trimLRPatterns
function in the
Biostrings package to generate highly comparable adapter trimming
as generated by popular fastq trimming software, such as cutadapt.
Seqpac trimming is praticularly efficient when trimming many fastq files at the
same time. It parallelizes trimming on multiple processor cores/threads by
applying functions in the foreach package. Thus, if threads=12
and you input 12 fastq files, all files will be trimmed simultaneously in
parallel. Note, however, that each thread will make a substantial impact on your
computers memory performance. Thus, if you know that you are low in ram memory
use fewer number of threads.
When make_trim
is used on its own it results in trimmed fastq files stored at
a user defined location. When trimming="Seqpac"
is set in the make_counts
function, this function parses settings to make_trim
to generate a count table
from temporary stored trimmed fastq files.
## Using default settings for NEB type adapter
# NEB= NEBNext® Small RNA Library Prep Set for Illumina (E7300/E7330)
# For illumina type adapters, 'parse="default_illumina"' may be used
# To speed things up we use 4 threads.
count_list <- make_counts(input=fastq, trimming="seqpac", threads=4,
parse="default_neb")
As an alternative to internal trimming, two external software: cutadapt and fastq_quality_filter can be used to trim the adapter sequence and remove low quality reads prior of making a count table. By setting `trimming=“cutadapt” make_counts will check if these software is installed on your system.
Setting trimming=NULL
in the make_counts
function will pass
the adapter trimming and a count table will be generated directly from the fastq
files. Thus, this option can be used if you already have trimmed your fastq
files.
In the make_counts
function users may choose to process data on-disk and in
chunks instead of processing data in-memory. Thus, users on low-end computers
with little RAM may choose to process smaller chunks and save data to disk
from time to time on the expense of performance. This is controlled by the
optimize
option. Here, two panic filters are also available. These filters will
only be applied if the unfiltered data fails to complete due to memory
shortage. If such filter is applied to a fastq file this is clearly stated in
the progress report, which give you the choice of what to do with such a sample.
See the manuals for make_counts
for more details.
Besides the counts table, make_counts
automatically generates a
progress report for the trimming and evidence filter, as well as bar graphs
showing the impact of the evidence filter (if plots=TRUE).
Users may already when making the counts table with make_counts
markedly reduce the noise in the data by specifying the number of independent
evidence that a specific sequence must reach to be included. This is controlled
by the evidence
argument. As default, evidence=c(experiment=2, sample=1)
will include all sequences that have >=1 counts in at least 2
independent fastq files.
Importantly, we routinely observe that 95-99% of all reads are maintained when
applying the default evidence filter, while >50% of the unique sequences will be
dropped because they can not be replicated across two independent samples. For
an example of how a saturated experiment may look like, see the original paper,
Skog et al. from 2021.
As you may have guessed, the ‘experiment’ argument controls the number of
independent fastq evidence needed across the whole experiment. Note, however,
that ‘sample’ does not control the number of counts needed in each sample. The
evidence filter will always use >=1 counts in X number of fastq files. Instead
‘sample’ controls when a sequence should be included despite not reaching the
‘experiment’ threshold. Thus, if evidence=c(experiment=2, sample=10)
,
sequences that reach 10 counts in a single sample will also be included.
Here is an example on how to use the ‘sample’ argument:
###---------------------------------------------------------------------
## Evidence over two independent samples, saving single sample sequences
## reaching 3 counts
test <- make_counts(input=fastq, trimming="seqpac",
parse="default_neb", threads=3,
evidence=c(experiment=2, sample=3))
extras <- apply(test$counts, 1, function(x){sum(!x==0)})
test$counts[extras==1,] # A few still have 3 alone
Lastly, if evidence=NULL
all unique sequences will be saved. Remember,
however, that the count table will become larger in well saturated experiments
that may lead to performance issues on some systems, since each unique sequence
occupies a row in counts(PAC)
and anno(PAC)
. Apply other filters prior to
annotating your sequences may solve such problems.
Next, we generate a Pheno table (P) and merge it with the progress report that
was stored from the make_counts
output. The Pheno table contains
sample specific information such as sample name and possible interventions, with
each sample as a row.
The make_pheno
function can import a Pheno table both externally by providing
a path to a .cvs file, as well as internally by just passing a data.frame. In
common for both options, is that a unique column named “Sample_ID” needs to be
present, and must contain the sample names (column names) present in the counts
table generated by make_counts
. Now, make_pheno
will attempt to match
similar names, but to avoid problems use identical names.
In addition, make_pheno
may add the progress report generated by
make_counts
. Lets have a look at how this works:
###---------------------------------------------------------------------
## Generate a Pheno table using the file names of test fastq
Sample_ID <- colnames(count_list$counts)
pheno_fl <- data.frame(Sample_ID=Sample_ID,
Treatment=c(rep("heat", times=1),
rep("control", times=2)),
Batch=rep(c("1", "2", "3"), times=1))
pheno <- make_pheno(pheno=pheno_fl,
progress_report=count_list$progress_report,
counts=count_list$counts)
Finally, we put everything together as a master PAC object using the make_PAC
function. Conveniently, skipping the anno=
argument will automatically add a
very simple annotation table to the PAC-object that can be filled later.
###---------------------------------------------------------------------
## Generate PAC object (default S4)
pac_master <- make_PAC(pheno=pheno, counts=count_list$counts)
pac_master
## If TRUE the PAC object is ok
PAC_check(pac_master)
Seqpac uses a simple targeting system to divide and focus the analysis of a PAC object to certain sample groups (Pheno) or specific sequences (Anno). If not specified otherwise in the function manual, a targeting object is always a two item list, where the first object is naming a column in a specific table in the PAC and the second object tells the function which categories in that column that should be targeted.
pheno_target
: extracts samples from a column in pheno(PAC)
anno_target
: extracts sequences from a column in anno(PAC)
While pheno_target
and anno_target
are the most common targeting objects,
there are other targeting object. Common for all is that they target a specific
sub-table in the PAC.
Now, lets take a look at the provided example PAC-object in Seqpac. As noted by
the name, this PAC has already been filtered and annotated, but it will still
show you the principle of preprocessing using targeting objects.
###---------------------------------------------------------------------
## Load the PAC-object and inspect the columns in Pheno and Anno
library(seqpac)
load(system.file("extdata", "drosophila_sRNA_pac_filt_anno.Rdata",
package = "seqpac", mustWork = TRUE))
pheno(pac)[,1:5]
anno(pac)[1:5,1:5]
counts(pac)[1:5,1:5]
## This PAC is an S4 object, but you can easily turn it to an S3 list with as().
## using as.PAC() will turn it back into S4:
isS4(pac)
pac_s3 <- as(pac, "list")
lapply(pac_s3[1:3], head)
isS4(pac_s3)
pac <- as.PAC(pac_s3)
methods(class="PAC") #all S4 methods
A pheno_target object that will exclusively target stage 1 and 5 samples would
look like this: pheno_target=list("stage", c("Stage1", "Stage5"))
While an anno_target that will target sequences of 22 nt in length would look
like this: anno_target=list("Size", "22")
Notice that both targeting objects are lists holding two secondary objects: one
character string targeting a specific column, and one character vector targeting
specific categories within the target column. The only difference being that
each targeting object target different tables in the PAC object.
Important, if the second object is set to NULL or is left out this will
automatically target all samples in the specified column. If the anno_target is
a character vector (not a list) some function will attempt to extract matches in
the row names of either Pheno or Anno (see example with PAC_filtsep
below).For
more, please refer to the original Seqpac publication: Skog et al. 2021.
A master PAC object will contain sequences in your original fastq files that passed the trimming and the initial evidence filter. Now, in many experiments this still involves millions of unique sequences, making the master PAC difficult to work with. Thus, it is often wise to further reduce noise prior to normalization and annotation.
Seqpac contains two functions for this purpose:
PAC_filter
PAC_filtsep
While PAC_filter
gives a variety of choices to subset the PAC object according
to for example sequence size or minimum counts in a certain proportion of the
samples,PAC_filtsep
extracts the sequence names that pass a filter within
sample groups.
Here are some examples:
###---------------------------------------------------------------------
## Extracts all sequences between 20-30 nt in length with at least 5 counts
## in 20% of all samples.
pac_filt <- PAC_filter(pac, size=c(20,30), threshold=5,
coverage=20, norm = "counts",
pheno_target=NULL, anno_target=NULL)
##
## -- Size filter will retain: 3416 of 9131 seqs.
## -- Count filter was specified.
## -- The chosen filters will retain: 826 of 9131 seqs.
###---------------------------------------------------------------------
## Optionally, a simple coverage/threshold graph at different filter depths
## can be plotted with stat=TRUE (but it will promt you for a question).
pac_filt <- PAC_filter(pac, threshold=5, coverage=20,
norm = "counts", stat=TRUE,
pheno_target=NULL, anno_target=NULL)
###---------------------------------------------------------------------
## Extracts all sequences with 22 nt size and the samples in Batch1 and Batch2.
pac_filt <- PAC_filter(pac, subset_only = TRUE,
pheno_target=list("batch", c("Batch1", "Batch2")),
anno_target=list("Size", "22"))
##
## -- Pheno target was specified, will retain: 6 of 9 samples.
## -- Anno target was specified, will retain: 378 of 9131 seqs.
###---------------------------------------------------------------------
## Extracts sequences with >=5 counts in 100% of samples within each stage
filtsep <- PAC_filtsep(pac, norm="counts", threshold=5,
coverage=100, pheno_target= list("stage"))
pac_filt <- PAC_filter(pac, subset_only = TRUE,
anno_target= unique(do.call("c", as.list(filtsep))))
##
## -- Anno target was specified, will retain: 1093 of 9131 seqs.
Notice that PAC_filtsep
only extracts the sequence names and stores them in a
data.frame. Converted into a character vector with unique names (using unique
+ do.call
) they can be applied as an anno_target that targets sequence names
in Anno.
Hint, the data.frame output of PAC_filtsep
has been designed for the purpose
of generating Venn and Upset diagrams that visualize overlaps across groups:
###---------------------------------------------------------------------
## Venn diagram using the venneuler package
olap <- reshape2::melt(filtsep,
measure.vars = c("Stage1", "Stage3", "Stage5"),
na.rm=TRUE)
plot(venneuler::venneuler(olap[,c(2,1)]))
###---------------------------------------------------------------------
## Setting output="binary", prepares data for upset plots (UpSetR package):
filtsep_bin <- PAC_filtsep(pac, norm="counts", threshold=5,
coverage=100, pheno_target= list("stage"),
output="binary")
UpSetR::upset(filtsep_bin, sets = colnames(filtsep_bin),
mb.ratio = c(0.55, 0.45), order.by = "freq",
keep.order=TRUE)
While the simplest PAC only contains three data.frame objects (P, A and C), additional ‘folders’ will be stored in the PAC object as the analysis progresses. In fact, if using a S3 PAC (list), you may choose to store any number of objects in your PAC as long as they do not conflict with the names of the standard folders, which are:
Pheno (P): data.frame
Anno (A): data.frame
Counts (C): data.frame
norm: list of data.frames
summary: list of data.frames
The norm folder will be used to store counts that have been normalized in
different ways, while the summary folder will contain the data that has been
summarized across groups of samples. In both cases, applying a filter using
PAC_filter
will automatically subset not only P, A, and C, but also all tables
stored in norm(PAC)
and summary(PAC)
.
The built in normalization methods in Seqpac are handled by the PAC_norm
function. In the current version three methods are available:
More specifically, cpm normalize the dataset in relation to the total number of
reads, while vst and rlog uses both mean dispersion and size factor to produce
homoskedastic values with functions available in the DESeq2
package.
It is easy to generate your own normalized count table using your
method of choice, and import it as a data.frame to the norm folder in the PAC
object. Just remember that row and column names must be identical to the
original Counts table. Use PAC_check
to verify.
Lets generate two normalized count tables with cpm and vst:
###---------------------------------------------------------------------
## Example normalization in Seqpac
load(system.file("extdata", "drosophila_sRNA_pac_filt_anno.Rdata",
package = "seqpac", mustWork = TRUE))
# PAC_norm works on both
pac <- PAC_norm(pac, norm="cpm")
pac <- PAC_norm(pac, norm="vst")
pac
From this, it may sometimes be wise to run further filtering steps to focus on reads with high cpm values.
###---------------------------------------------------------------------
## Filtering using cpm instead of raw counts
## filter >=100 cpm in 100% of samples in >= 1 full group
filtsep <- PAC_filtsep(pac, norm="cpm", threshold=100, coverage=100,
pheno_target= list("stage"))
pac_cpm_filt <- PAC_filter(pac, subset_only = TRUE,
anno_target= unique(do.call("c", as.list(filtsep))))
From the start a PAC object contains a very limited Anno table such as this:
load(system.file("extdata", "drosophila_sRNA_pac_filt_anno.Rdata",
package = "seqpac", mustWork = TRUE))
anno(pac) <- anno(pac)[,1, drop = FALSE]
head(anno(pac))
To facilitate easy adoption of Seqpac to any species, we provide a fast and flexible workflow for mapping PAC sequences against user defined reference databases. We call this reannotation, since we map the sequences in Anno object sequentially over one or multiple references.
Important, Seqpac relies on the popular bowtie aligner Rbowtie
for mapping: http://bowtie-bio.sourceforge.net/manual.shtml. This means that
references must be indexed using the bowtie_build
function before they are
compatible with the reannotation workflow (but see the PAC_mapper
function for
an index free alternative).
Small RNA annotation/mapping depends heavily on choosing your references wisely and not to be too greedy if possible. If your hypothesis target a specific type of class (like tRNA), it might for instance be wiser to target such references only.
Many databases provide their references in the fasta file format. Seqpac can use any correctly formated fasta file as input in the reannotation workflow. However, we distinguish two kinds of fasta references:
You can download the files manually from the source database. Below are some helpful example links where to retrieve useful fasta references. Note that all compressed files must be uncompressed to work with the bowtie aligner.
Reference genomes:
Ensembl Animals
Ensembl Plants
UCSC
Hint: Unmasked top level fasta files are best for most purposes, e.g. Ensembl
file name *.dna.toplevel.fa.gz.
Other specialized references:
miRNA
miRNA
piRNA
many ncRNA Animals
many ncRNA Plants
tRNA
repeats
There are also several packages in Bioconductor/R such as r Biocpkg("BSgenome")
and biomaRt that allow you to download
reference genomes and other specialized references directly into r. Actually,
using the download.file
function makes most urls (https/http/ftp etc)
available for download. Custom fasta files can also be used, and you can easily
merge, add and remove features to the references by functions in the r Biocpkg("Biostrings")
package, such as readDNAStringSet
(read fasta) and
writeXStringSet
(save fasta).
As default, bowtie only reports the names up to the first white space. Sometimes
the sequence names need to be modified or rearranged for the bowtie aligner to
best report the names in the mapping output. This can be done using the
Biostrings package to read, rearrange and save the fasta. A
useful expression to verify that you have caught the relevant info before the
the first white space is: do.call("rbind", strsplit(names(<your_DNAStringSet>), " "))[,1]
You may also want to add tRNAs from the mitochrondrial genome to your tRNA
reference, which is not always provided in the GtRNAdb fasta. This can be done
by moving them from Ensembl ncRNA fasta or scan the mitochondrial genome
manually at tRNAscan-SE. The
mitochonridal genome is found in the reference genome fasta.
## Indexing the fasta references
Before we can use the fasta references we need to index them for the bowtie
aligner. You can do this by using the bowtie_build
function in the Rbowtie
package or by running bowtie externally, outside R. For more information see
?Rbowtie::bowtie, Rbowtie::bowtie_build_usage() and
http://bowtie-bio.sourceforge.net/manual.shtml.
Important, the bowtie prefix must have the same basename (prefix=
) and the
indexes must be saved in the same directory (outdir=
) as the original fasta
file. Since the bowtie_build
naming is often confusing for newbies, here
are some examples:
###---------------------------------------------------------------------
## Examples of generating bowtie indexes
path <- "/some/path/to/your/folder"
#Genome
mycoplasma_path <- system.file("extdata/mycoplasma_genome", "mycoplasma.fa",
package = "seqpac", mustWork = TRUE)
Rbowtie::bowtie_build(mycoplasma_path,
outdir=gsub("mycoplasma.fa", "", mycoplasma_path),
prefix="mycoplasma", force=TRUE)
#tRNA
trna_path <- system.file("extdata/trna", "tRNA.fa",
package = "seqpac", mustWork = TRUE)
Rbowtie::bowtie_build(trna_path,
outdir=gsub("tRNA.fa", "", trna_path),
prefix="tRNA", force=TRUE)
#rRNA
rrna_path <- system.file("extdata/rrna", "rRNA.fa",
package = "seqpac", mustWork = TRUE)
Rbowtie::bowtie_build(rrna_path,
outdir=gsub("rRNA.fa", "", rrna_path),
prefix="rRNA", force=TRUE)
Building indexes may take some time for large fasta, such as a reference genome.
You only need to generate the index once for every fasta reference, however.
After the sequences have been mapped against a reference genome they can also acquire annotations from genomic feature files, such as the gtf and gff formats. These files contains the coordinates for different genomic features associated with a specific reference genome, such as repetitive regions, exons and CpG islands etc.
GTF files can often be acquired at the same places as the fasta reference files
(see above). The biomartr package also have functions for
downloading some gtf files (e.g. getGTF
).In addition, you may create your own
by downloading/importing different tables, for example using r Biocpkg("rtracklayer")
(see readGFF
, getTable
, ucscTableQuery
) and then
create a GRanges object (GenomicRanges package) where you add
your preferred table info. Finally you may save it using rtracklayer again
(export(<GRanges object>, <destination_path>, format="gtf"
)
The procedure to annotate sequences in Seqpac is called reannotation.
The PAC reannotation family of functions carries out the following tasks:
map_reanno
To accommodate most users on most platforms, Seqpac provides two alternatives
for calling bowtie using the map_reanno
function: internally from within R
(type="internal"
) or externally from outside R (type="external"
). In the
internal mode, Seqpac uses the bowtie
function in the Rbowtie
package, while in the external mode bowtie is called by a system command to a
locally installed version of bowtie. Both options gives identical results, but
since the internal Rbowtie package is rarely updated, we provide the external
option.
Since the input commands for external bowtie and internal Rbowtie differs
slightly, map_reanno
has two parse options: parse_internal
and
parse_external
. These can be used to control the two versions of bowtie
as if they were run from the console or command line, respectively.
While bowtie has its own way to handle mismatches in the alignments, Seqpac
disregards this and instead runs bowtie sequentially, increasing the number of
mismatches for each cycle. After each cycle all sequences that received a match
in any of the provided references will be subtracted. Therefore, in the next
cycle only sequences without a match will be ‘reannotated’ but allowing one
additional mismatch. The map_reanno
function controls this by sequentially
increasing the number of mismatches after subtracting the matches.
import_reanno
This function is called by map_reanno
to import the bowtie results after
each cycle. If you would run import_reanno
separately it simply
provides you with options on how to import bowtie results into R. For example,
when aligning against a reference genome the mapping coordinates is important,
while mapping against more specialized references used for creating sRNA
classifications (rRNA, tRNA, miRNA etc), ‘hit’ or ‘no hit’ might be enough, and
will be a much faster option. The map_reanno
function, therefore, has two
default import modes, import=“genome” and import=“biotype”, which automatically
configure import_reanno
for genome or class annotation.
make_reanno
After each annotation cycle, map_reanno
will store an .Rdata file
(Full_reanno_mis0/1/2/3.Rdata) containing a list with all alignment hits for
each PAC-sequence-to-fasta-reference mismatch cycle. Calling make_reanno
will
search for these .Rdata files in a given folder and create a reanno object in R.
This involves extracting, ordering and preparing annotations in a format that
can be merged with the existing Anno table in a PAC object. It will also
generate an overview table, that summarizes the annotations if multiple fasta
references were used by map_reanno
.
add_reanno
Before merging the new annotations with a PAC object, unless we have provided a
fasta reference with only one type (e.g. miRNA for mirBase), classification not
only between fasta references but also within references might be necessary
(e.g. snoRNA, tRNA, miRNA, rRNA in the Ensembl_ncRNA fasta). Classification of
the fasta reference names, which was first saved by bowtie and then caught in
our reanno object as an annotation, is done with the add_reanno
function.
Again, there are two primary modes, either “genome” or “biotype”. When
type=“genome” a standardized workflow anticipating genomic coordinates are
applied. When type=“biotype” a list of search term vectors directed against each
fasta reference needs to be provided. Matches between search term (written as
regular expressions) and text strings in the fasta reference names will result
in a classification of the PAC sequence. Finally, add_reanno
generates an
annotation table that can either be saved separately or merged with the original
PAC object.
simplify_reanno
While add_reanno
starts the process of simplifying the reference name
annotations into useful classifications, it will generate multiple
classifications for each counted sequence if it matches multiple search terms.
The simplify_reanno
function boils down multi-classifications into one
classification for each sequence. By doing so, an hierarchy must be set, where
priority of some classifications over others is done (e.g. miRNA over piRNA).
While this is a controversial topic, using such hierarchies is still the only
way to provide overview statistics in experiments targeting highly multimapping
sequences, such as small RNA. Nonetheless, the Seqpac workflow with its sequence
based counts, where hierarchical classification is done in the very end of the
annotation process, makes it easy to generate classifications using alternative
hierarchies and observe the consequences of your choices in for example a pie
chart.
In summary:
map_reanno
to import bowtie outputmap_reanno
output into a reanno objectAlready at this point it is good to know that Seqpac provides a ‘backdoor’
function, PAC_mapper
, that carries out many of the steps in the reanno
workflow using smaller references and pac objects. Please see ‘6.1 Advanced
alignment analysis’ or ?PAC_mapper for more details.
Lets observe the functions in action using the drosophila test dataset.
First we map against a reference genome, using map_reanno
. Note that
import=“genome” must be set to catch the genomic coordinates. Since
map_reanno
can handle multiple fasta references. Remember, each
fasta must have bowtie indexes (see 4.2).
# Lets reload our pac
library(seqpac)
load(system.file("extdata", "drosophila_sRNA_pac_filt_anno.Rdata",
package = "seqpac", mustWork = TRUE))
anno(pac) <- anno(pac)[,1, drop = FALSE] # Remove previous Anno
###---------------------------------------------------------------------
## Genome mapping
# Path to your output folder
outpath_genome <- "/some/path/to/reanno_folder"
# or a temp folder
outpath_genome <- paste0(tempdir(), "/seqpac_reanno/reanno_genome")
# Look up the your own Bowtie indexed reference genomes (fasta)
ref_paths <- list(myco="<some/path/to/your/genome.fa>")
# For testing, you can use a small mycoplasma genome provided with Seqpac
# But remember to provide single genomes as a named list anyway.
# The name will be used to keep track of your genome.
myco_path <- system.file("extdata/mycoplasma_genome", "mycoplasma.fa",
package = "seqpac", mustWork = TRUE)
ref_paths <- list(myco=myco_path)
# Run map_reanno
map_reanno(PAC=pac, ref_paths=ref_paths, output_path=outpath_genome,
type="internal", mismatches=1, import="genome",
threads=8, keep_temp=TRUE)
Similarly, we can map against other specialized references, that can be used for
classifying each sequence into biotypes. Note, unlike genome mapping, at this
stage we are not interested in where a sequence matches a reference for the
specialized references, only if it matches or not. Thus, we use
import="biotype"
.
###---------------------------------------------------------------------
## sRNA mapping using internal bowtie
# Path to output folder
outpath_biotype <- "/some/path/to/reanno_biotype"
# or a temp folder
outpath_biotype <- paste0(tempdir(), "/seqpac_reanno/reanno_biotype")
# Seqpac contains two fasta for tRNA and rRNA we can use for testing
trna_path <- system.file("extdata/trna", "tRNA.fa",
package = "seqpac", mustWork = TRUE)
rrna_path <- system.file("extdata/rrna", "rRNA.fa",
package = "seqpac", mustWork = TRUE)
# Provide paths to bowtie indexed fasta references as a list
ref_paths <- list(tRNA=trna_path,
rRNA=rrna_path)
map_reanno(pac, ref_paths=ref_paths, output_path=outpath_biotype,
type="internal", mismatches=1, import="biotype",
threads=6, keep_temp=TRUE)
Now, the output .Rdata files for all mismatch cycles should have been stored in
the output folders. Lets import everything into R, generate a reanno objects and
plot some pie charts from the Overview table.
###---------------------------------------------------------------------
## Generate a reanno object with make_reanno
reanno_genome <- make_reanno(outpath_genome, PAC=pac, mis_fasta_check = TRUE)
reanno_biotype <- make_reanno(outpath_biotype, PAC=pac, mis_fasta_check = TRUE)
# Note, setting mis_fasta_check=TRUE will double check that the number of
# sequences that failed to receive an alignment in the last mismatch cycle
# agrees with the number sequences in the reanno object without an annotation.
# (these sequences are stored in mis_fasta_x.txt where x is max mismatches+1)
# List structure
str(reanno_genome, max.level = 3, give.attr = FALSE)
str(reanno_biotype, max.level = 3, give.attr = FALSE)
# Simple pie charts using the Overview table
pie(table(overview(reanno_genome)$myco)) # Very few mycoplasma with 0 mismatch
pie(table(overview(reanno_biotype)$Any)) # Many hits for either rRNA or tRNA
Next step is to reorganize and classify using add_reanno
. For mapping against
a reference genome this is easy. Just set type=“genome” and set the maximum
number of alignments to be reported by genome_max
(Warning: genome_max="all"
will give you all alignments).
###---------------------------------------------------------------------
### Genomic coordinates using add_reanno
# Output as separate table
anno_genome <- add_reanno(reanno_genome, type="genome",
genome_max=10, mismatches=1)
# Output merged with provided PAC object
pac <- add_reanno(reanno_genome, type="genome",
genome_max=10, mismatches=1, merge_pac=pac)
# Example of original reference name annotations
head(full(reanno_genome)$mis1$myco)
# Finished genome annotation
head(anno_genome)
head(anno(pac))
For specialized references, type="biotype"
should be used. This allows for
classification based on match or no match between the search terms provided in
the bio_search
input and the reference names annotated with each counted
sequence. Correct classification is all about finding the best search terms that
discriminates between the different names in the original fasta reference files
(up to the first white space; see 4.1).
With the bio_perfect
option you may control how conservative the matching
between search term and annotation should be. With bio_perfect=FALSE
,
every reference hit that fail to match your search terms, will be classified as
‘other’. Using bio_perfect=TRUE
will instead guarantee that your search
terms will cover all reference hits.
Hint: See ?add_reanno
for a trick on how to succeed with bio_perfect=TRUE
.
###---------------------------------------------------------------------
## Classify sequences using add_reanno
# Lets start by exploring the names in the original fasta reference:
ref_path <- "/some/path/to/fasta_reference"
# For testing use the the fasta reference for tRNA provided with Seqpac:
trna_path <- system.file("extdata/trna", "tRNA.fa",
package = "seqpac", mustWork = TRUE)
# Read fasta names
fasta <- names(Biostrings::readDNAStringSet(trna_path))
# Different naming standards
table(substr(fasta, 1, 10))
# Starting with FBgn discriminate mitochondrial tRNA
fasta[grepl("FBgn", fasta)]
# The other are nuclear
head(fasta[!grepl("FBgn", fasta)])
# We can use as 'mt:tRNA' as search term to catch mitochondrial tRNA
# and '_tRNA' to catch nuclear.
# A useful expression for extracting strings up to 1st white space is
# fasta <- do.call("rbind", strsplit(fasta, " "))[,1]
# Similarly load the rrna reference provided with seqpaq
rrna_path <- system.file("extdata/rrna", "rRNA.fa",
package = "seqpac", mustWork = TRUE)
# Read fasta names
fasta <- names(Biostrings::readDNAStringSet(rrna_path))
# Many rRNA subtypes
table(substr(fasta, 1, 10))
# Lets try two search term lists directed against each reference and written as
# regular expressions:
# Names in the search list needs to be the same as in the reanno object
head(overview(reanno_biotype)) # 'rRNA' and 'tRNA' with some capital letters
# Generate a search list with search terms
bio_search <- list(
rRNA=c("5S", "5.8S", "12S", "16S", "18S", "28S", "pre_S45"),
tRNA =c("_tRNA", "mt:tRNA")
)
test <- add_reanno(reanno_biotype, bio_search=bio_search,
type="biotype", bio_perfect=FALSE,
mismatches = 1, merge_pac=pac)
# Throws an error because perfect matching was required:
anno_temp <- add_reanno(reanno_biotype, bio_search=bio_search,
type="biotype", bio_perfect=TRUE, mismatches = 1)
# References with no search term hits are classified as "Other":
<div id="refs"></div>
# (APPENDIX) Appendix {-}
anno_temp <- add_reanno(reanno_biotype, bio_search=bio_search,
type="biotype", bio_perfect=FALSE, mismatches = 1)
# Increasing number of hits allowing mismatches
table(anno_temp$mis0)
table(anno_temp$mis1)
# You may also add your new annotaion with a PAC object using the 'merge_pac'
# option. For even more options see ?add_reanno.
pac <- add_reanno(reanno_biotype, bio_search=bio_search,
type="biotype", bio_perfect=FALSE, mismatches = 1,
merge_pac=pac)
head(anno(pac))
To make overview statistics and graphs we need to boil down the classifications
to a factor column with only a few unique biotypes per factor. This is what
simplify_reanno
does. Just like add_reanno
in the type="biotype"
mode,
simplify_reanno
needs a list of search terms, an hierarchy
. This time,
however, the list is order sensitive and targets the output table from
add_reanno
that contains the columns with new classifications (e.g.
“mis0_bio”, “mis1_bio”, “mis2_bio” etc). If a match occurs between the first
search term and a classification, the counted sequence will be annotated to this
classification, no matter if other search terms matches further down the list.
This is done sequentially, allowing one additional mismatch until a maximum
(specified in mismatches
) has been reached. Thus, if a match occurs in tRNA
with 0 mismatch, and rRNA with 1 mismatch, it will be reported as tRNA even
though rRNA where higher in the hierarchy
. A better match will always be
prioritized over the hierarchy.
###---------------------------------------------------------------------
## Hierarchical classification with simplify_reanno
# Currently classification does not discriminate
table(anno(pac)$mis3_bio)
# Set the hierarchy and remember that it is order sensitive.
# Here: S5_rRNA >> Other_rRNA >> mt:tRNA >> tRNA
# Remember to use 'regular expressions' if you wish to catch all:
hierarchy_1 <- list(rRNA_5S="rRNA_5S",
Other_rRNA="5.8S|12S|16S|18S|28S|pre_S45|rRNA_other",
Mt_tRNA="tRNA_mt:tRNA",
tRNA="tRNA__tRNA"
)
# What happens if you don't catch all:
hierarchy_2 <- list(rRNA_5S="rRNA_5S",
Mt_tRNA="tRNA_mt:tRNA",
tRNA="tRNA__tRNA")
# No mismatch allowed using hierarchy_1
test <- simplify_reanno(input=pac, hierarchy=hierarchy_1, mismatches=0,
bio_name="Biotypes_mis0", merge_pac=FALSE)
table(test) # All remaining rRNA are classified as 'Other_rRNA'
# Instead using hierarchy_2
test <- simplify_reanno(input=pac, hierarchy=hierarchy_2, mismatches=0,
bio_name="Biotypes_mis0", merge_pac=FALSE)
table(test)# Non-hits are classified as 'other'
# Note, setting 'merge_pac=FALSE' returns a one-column data.frame only
# containing the new hierarchical classifications.
# Lets increase number of mismatches an merge it with the PAC
# (How many mismatches depends on what you allowed previously in the workflow)
colnames(anno(pac)) # mis0-1_bio = Upto 1 mismatches are available
pac_test <- simplify_reanno(input=pac, hierarchy=hierarchy_1, mismatches=1,
bio_name="Biotypes_mis1", merge_pac=TRUE)
# Now we also have some mitochondrial tRNA
table(anno(pac_test)$Biotypes_mis1)
There are several functions in Seqpac that handle statistical analysis and visualization. We will only present a few of them here, so please refer to Seqpac’s reference manual for a complete list. Just like many of the preprocessing functions, these functions are named PAC_xxx to clearly show that they are applied on a PAC object.
Hint: If you are unsure if your manually constructed/manipulated PAC object are
compatible, you can always run PAC_check
.
To make simple summaries over column categories/groups in the pheno(PAC)
table
such as means, standard deviation (SD), standard error (SE), and log2 fold
changes (log2FC) etc, Seqpac has a convenient function called PAC_summary
.
This function uses a target_pheno object to generate a summary over raw
counts or normalized counts across sample groups. Processed data will be stored
in the summary(PAC)
folder.
Important, summarized data in the summary(PAC)
folder will always have the
same rownames (unique sequences) as counts(PAC)
and anno(PAC)
, while the
column names will be derived from the pheno(PAC)
table. Summarize using an
anno_target object is not allowed, because that would disrupt sequence names
(loss of sequence integrity). Summaries over sequences (such as generating means
over sRNA classes) is instead handled by each plot/statistical function, and
are often stored in the output from these functions. Nonetheless, user may
always generate their own derived tables, using functions like aggregate
and
reshape::melt
.
Lets have some examples on how PAC_summary works:
load(system.file("extdata", "drosophila_sRNA_pac_filt_anno.Rdata",
package = "seqpac", mustWork = TRUE))
###---------------------------------------------------------------------
## PAC_summary in Seqpac
# Make means of counts over stages and return a data.frame
tab <- PAC_summary(pac, norm = "counts", type = "means",
pheno_target=list("stage"), merge_pac=FALSE)
##
## -- Pheno target was specified, will retain: 9 of 9 samples.
# When merge_pac=TRUE the table is added to the summary(PAC) 'folder'
pac_test <- PAC_summary(pac, norm = "counts", type = "means",
pheno_target=list("stage"), merge_pac=TRUE)
##
## -- Pheno target was specified, will retain: 9 of 9 samples.
pac # Structure of PAC before PAC_summary (S4)
## PAC object with:
## 9 samples
## 9131 sequences
## mean total counts: 26554 (min:20353/max:38275)
## best sequence: 1998 mean counts
## worst sequence: 0 mean counts
## normalized tables: 1
## cpm
pac_test # Structure of PAC after PAC_summary (S4)
## PAC object with:
## 9 samples
## 9131 sequences
## mean total counts: 26554 (min:20353/max:38275)
## best sequence: 1998 mean counts
## worst sequence: 0 mean counts
## normalized tables: 1
## cpm
## summarized tables: 1
## countsMeans_stage
names(summary(pac_test))
## [1] "countsMeans_stage"
head(summary(pac_test)$countsMeans_stage)
## Stage1 Stage3 Stage5
## TATTGCACTTGAGACGGCCTGAAAA 15.666667 2.666667 9.0000000
## CATGAGGACTGTGCT 8.333333 16.333333 6.0000000
## TGCTTGGACTACATATGGTTGAGGGTTGTA 2203.666667 378.333333 74.0000000
## CTGCTTGGACTACATATGGTTGAGGGTTGTA 1790.666667 456.000000 57.6666667
## CAATGAGGGACCAGTACATGAGGACTCTGCT 6.000000 16.000000 0.6666667
## CATGAGGACTGTGCC 7.333333 15.333333 2.0000000
# You may want to use normalized counts
pac_test <- PAC_summary(pac_test, norm = "cpm", type = "means",
pheno_target=list("stage"), merge_pac=TRUE)
# Maybe only include a subset of the samples
pac_test <- PAC_summary(pac_test, norm = "cpm", type = "means",
pheno_target=list("batch", c("Batch1", "Batch2")),
merge_pac=TRUE)
# Generate standard errors
pac_test <- PAC_summary(pac_test, norm = "cpm", type = "se",
pheno_target=list("stage"), merge_pac=TRUE)
# log2FC
pac_test <- PAC_summary(pac_test, norm = "cpm", type = "log2FC",
pheno_target=list("stage"), merge_pac=TRUE)
# log2FC generated from a grand mean over all samples
pac_test <- PAC_summary(pac_test, norm = "cpm", type = "log2FCgrand",
pheno_target=list("stage"), merge_pac=TRUE)
# All summarized tables have identical row names that can be merged
names(summary(pac_test))
lapply(summary(pac_test), function(x){
identical(rownames(x), rownames(summary(pac_test)[[1]]))
})
head(do.call("cbind", summary(pac_test)))
Seqpac has a useful function, PAC_deseq
, that lets you make omic scale
expressional analysis on PAC objects using the popular DESeq2
package. This involves fitting generalized linear models with negative binomial
distributions and subsequent correction for multiple testing using the PAC
tables. For more information (such as how to correctly build models), please see
the DESeq2 vingette DESeq2. In addition to preparing a PAC object
for DESeq2, PAC_deseq
will organize and visualize the output. Lets have an
example using the test dataset:
load(system.file("extdata", "drosophila_sRNA_pac_filt_anno.Rdata",
package = "seqpac", mustWork = TRUE))
###---------------------------------------------------------------------
## Differential expression in Seqpac
# Simple model testing stages against using Wald test with local fit (default)
table(pheno(pac)$stage)
##
## Stage1 Stage3 Stage5
## 3 3 3
output_deseq <- PAC_deseq(pac, model= ~stage, threads=6)
##
##
## ** stage Stage3 vs Stage5 **
##
## out of 9131 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up) : 820, 9%
## LFC < 0 (down) : 1975, 22%
## outliers [1] : 3, 0.033%
## low counts [2] : 2125, 23%
## (mean count < 1)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
##
## NULL
# More complicated, but still graphs will be generated from 'stage' since it is
# first in model
output_deseq <- PAC_deseq(pac, model= ~stage + batch, threads=6)
## Warning: 'IS_BIOC_BUILD_MACHINE' environment variable detected, setting
## BiocParallel workers to 4 (was 6)
## Warning: Removed 6019 rows containing missing values or values outside the scale range
## (`geom_point()`).
# Using pheno_target we can change the focus to batch
# (no batch effect)
output_deseq <- PAC_deseq(pac, model= ~stage + batch,
pheno_target=list("batch"))
# With pheno_target we can also change the direction of the comparison
output_deseq <- PAC_deseq(pac, model= ~stage,
pheno_target=list("stage", c("Stage5", "Stage3")))
## Warning: Removed 3 rows containing non-finite outside the scale range
## (`stat_bin()`).
## Warning: Removed 2128 rows containing missing values or values outside the scale range
## (`geom_point()`).
## In the output you find PAC merged results, target plots and output_deseq
names(output_deseq)
head(output_deseq$result)
The PAC_pca
function uses the FactoMineR package to make a
principle component analysis, from which scatter plots are plotted using the
ggplot2 package. Here are some examples on how to use PAC_pca:
###---------------------------------------------------------------------
## PCA analysis in Seqpac
# As simple as possible
output_pca <- PAC_pca(pac)
# Two 'folders' in the output
names(output_pca)
# Using pheno_target
output_pca <- PAC_pca(pac, pheno_target =list("stage"))
# Using pheno_target with sample labels
output_pca <- PAC_pca(pac, pheno_target =list("stage"),
label=pheno(pac)$sample)