Contents




Stable versions of Seqpac is available on Bioconductor, and development branches can be obtained at github:

## Installation Bioconductor:
if (!require("BiocManager"))
    install.packages("BiocManager")
BiocManager::install("seqpac")

## Installation github:
devtools::install_github("Danis102/seqpac", 
                         dependencies=TRUE, ref="development")



1 Introduction


1.1 Overview

Seqpac contains an extensive toolbox for the analysis of short sequenced reads. The package was originally developed for small RNA (sRNA) analysis, but can be applied on any type of data generated by large scale nucleotide sequencing, where the user wish to maintain sequence integrity throughout the whole analysis. This involves, for example, regular RNA-seq with long RNA. But note, however, that it currently does not automatically support paired-end sequencing.


To preserve sequence integrity, and thereby guarantee a clean data linage, Seqpac applies sequence-based counting. This can be contrasted against feature-based counting, where reads are counted over known genomic features, like protein coding genes or sRNA. Such strategy will result in a count table where each count is made per feature (e.g. number of read counts mapping to a gene). In Seqpac, such annotations are initially disregarded, as a count table is generated solely by counting unique sequences in the raw sequence files (fastq). Annotations against known reference(s) (such as databases containing sequences of full genomes, miRNA, tRNA, protein coding genes, or genomic coordinates of such features) are done after the count table has been produced.


The advantage of using sequence-based counting, compared to feature-based counting, is that for example mismatches in the classification of reads, can be applied any time during the analysis. In feature-based counting, allowing a mismatch in the mapping will result in a loss of information (loss of sequence integrity).


For example, say that you have 506 read counts for miR-185 (a miRNA) when allowing 3 mismatches against the reference. In feature-based analysis this will result in 1 entry in the count table. Unless you have saved the exact alignments elsewhere, you have lost the information about what sequences that are included in that count. The only thing you know is that the sequence may differ from the reference sequence by any 3 mismatches. In sequence-based counting, reads are counted as unique sequences and each unique sequence is then mapped against the target reference databases. Information about whether a sequence was aligning against a reference is maintained in a linked annotation table that can contain both perfect alignments and alignments allowing mismatches. Thus, in any stage of the analysis mismatches and classifications into features can be dynamically controlled, meaning that you can easily observe the effect of for example allowing more mismatches or changes in hierarchical classifications (often applied in sRNA analysis when a sequence align with multiple classes of sRNA).


But the best with sequence-based counting where you preserve sequence integrity, is probably that you will be able blast your candidate sequences at your favorite database, to verify and find out more about them.


1.2 The Seqpac workflow

The input file format for Seqpac is fastq, which can be generated from most sequencing platforms. Two types of objects sustain the core of the Seqpac workflow:

  • The PAC object
  • The Targeting objects


The PAC object stores the Phenotypic information (P), Annotations (A) and Counts (C) needed for all downstream analysis, while the targeting object is used for accessing specific subsets of data within the PAC object.


Seqpac provides two versions of the PAC object. In its simplest form it is a S3 list containing 3 data.frames (Pheno, Anno and Counts). As default, however, an S4 PAC object is generated. By using the coercion method as(pac-object, "list"), you can turn an S4 PAC into an S3 list. Similarly, by using as.PAC(pac-object) you can turn an S3 PAC into an S4. Plaese see ?as.PAC an ?PAC for more details and examples.


The targeting object is always a list of 2 character objects.


The workflow is roughly divided into four steps:

  • Generate a PAC object from fastq-files
  • Preprocess the PAC object
  • Annotate the PAC object
  • Analyze the PAC object.


1.3 Quick start

Here is a quick simple workflow using Seqpac. Remember, however, that in Seqpac you can create much more advanced workflows. Especially when applying the reanno workflow, as will be described below.

# Setup
 library(seqpac)
 
 sys_path = system.file("extdata", package = "seqpac", mustWork = TRUE)
 fastq <- list.files(path = sys_path, pattern = "fastq", all.files = FALSE,
                 full.names = TRUE)
 ref_tRNA <- system.file("extdata/trna", "tRNA.fa",
                         package = "seqpac", mustWork = TRUE)
 
 # Trim NEB adapter, generate counts and create PAC-object (pheno, anno, counts)
 count_list  <- make_counts(fastq, plot = FALSE, 
                             parse="default_neb", trimming="seqpac")
 pac <- make_PAC(count_list$counts)
 pheno(pac)$groups <- c(rep("gr1",3), rep("gr2",3)) #Add groups to pheno table
 
 # Preprocess PAC-object and creat means
 pac <- PAC_norm(pac)                        # Default normalization is cpm
 pac <- PAC_filter(pac, size=c(15,60))       # Here, filter on sequence length
 pac <- PAC_summary(pac, norm = "cpm", type = "means", 
                    pheno_target=list("groups", unique(pheno(pac)$groups)))
 
 # Quickly align (annotate) and plot PAC-object
 map_tRNA <- PAC_mapper(pac, ref=ref_tRNA, override=TRUE)
 plts <- PAC_covplot(pac, map_tRNA, summary_target=list("cpmMeans_groups"))
 plts[[13]]




1.4 Example data

Contained within this package are 6 sRNA fastq file. These files were originally generated by extracting sRNA from a single fruit fly embryo. The original fastq file was acquired by Illumina NextSeq 500 sequencer using a High Output 75 cycles flow cell (product no: 20024906) and NEBNext® Small RNA Library Prep Set for Illumina (E7330). Due to package size restriction, this fastq was randomly down-sampled to a fraction of its original size into 6 much much smaller files.


There is also a prepared example PAC-object that were similarly generated from randomly down-sampled fastq files (but here we used 9 unique embryos). In addition, there are a few fasta reference file. These files are used in examples of this vignette and in the function manuals.



2 Making a PAC object


2.1 Before making a PAC object (merge_lanes)

Seqpac works best with merged fastq, meaning that if sequencing was done in separate lanes on the flow cell, where the same sample was present in all lanes, lane files must be merged for each unique sample.


Conveniently, Seqpac contains a function that can be used to merge lane files: merge_lanes. Please see, ?merge_lanes for examples on how to merge fastq-files. Specifically, this function searches the input folder for similar file names and append files with similar names. As long as the user uses unique sample names (e.g. sample1_lane1.fastq.gz, sample1_lane2.fastq.gz, sample2_lane1.fastq.gz, sample2_lane2.fastq.gz etc) then merge_lanes will merge lane files into the same sample (e.g. sample1.fastq.gz, sample2.fastq.gz).


2.2 Generating a count table (make_counts)

Count tables are the core components in most sequence analysis. Seqpac uses sequence-based counts (see 1.1 Overview). Thus, we need to generate a count table with the number of unique sequences in each sample.


There are three ways of generating a count table in Seqpac.

  • Trim raw fastq files using internal functions in Seqpac/R.
  • Trim raw fastq files using external functions outside R.
  • Import reads from already trimmed fastq files.


The two first involves adapter trimming, while the last involves already trimmed reads. All input options are contained within the make_counts. function.


2.2.1 Internal trimming

Generating a count table from untrimmed fastq files using nothing but R packages (internal) depends on the make_trim function. This function goes through a series of trimming cycles using the trimLRPatterns function in the Biostrings package to generate highly comparable adapter trimming as generated by popular fastq trimming software, such as cutadapt.


Seqpac trimming is praticularly efficient when trimming many fastq files at the same time. It parallelizes trimming on multiple processor cores/threads by applying functions in the foreach package. Thus, if threads=12 and you input 12 fastq files, all files will be trimmed simultaneously in parallel. Note, however, that each thread will make a substantial impact on your computers memory performance. Thus, if you know that you are low in ram memory use fewer number of threads.


When make_trim is used on its own it results in trimmed fastq files stored at a user defined location. When trimming="Seqpac" is set in the make_counts function, this function parses settings to make_trim to generate a count table from temporary stored trimmed fastq files.

## Using default settings for NEB type adapter 
# NEB= NEBNext® Small RNA Library Prep Set for Illumina (E7300/E7330)
# For illumina type adapters, 'parse="default_illumina"' may be used
# To speed things up we use 4 threads.

count_list <- make_counts(input=fastq, trimming="seqpac", threads=4,
                          parse="default_neb")


2.2.2 External trimming

As an alternative to internal trimming, two external software: cutadapt and fastq_quality_filter can be used to trim the adapter sequence and remove low quality reads prior of making a count table. By setting `trimming=“cutadapt” make_counts will check if these software is installed on your system.


2.2.3 Already trimmed fastq

Setting trimming=NULL in the make_counts function will pass the adapter trimming and a count table will be generated directly from the fastq files. Thus, this option can be used if you already have trimmed your fastq files.


2.2.4 Optimizing for low-end systems and challenging data

In the make_countsfunction users may choose to process data on-disk and in chunks instead of processing data in-memory. Thus, users on low-end computers with little RAM may choose to process smaller chunks and save data to disk from time to time on the expense of performance. This is controlled by the optimizeoption. Here, two panic filters are also available. These filters will only be applied if the unfiltered data fails to complete due to memory shortage. If such filter is applied to a fastq file this is clearly stated in the progress report, which give you the choice of what to do with such a sample.


See the manuals for make_counts for more details.


2.2.5 Outputs from make_counts

Besides the counts table, make_counts automatically generates a progress report for the trimming and evidence filter, as well as bar graphs showing the impact of the evidence filter (if plots=TRUE).


2.2.6 Evidence filtering

Users may already when making the counts table with make_counts markedly reduce the noise in the data by specifying the number of independent evidence that a specific sequence must reach to be included. This is controlled by the evidence argument. As default, evidence=c(experiment=2, sample=1) will include all sequences that have >=1 counts in at least 2 independent fastq files.


Importantly, we routinely observe that 95-99% of all reads are maintained when applying the default evidence filter, while >50% of the unique sequences will be dropped because they can not be replicated across two independent samples. For an example of how a saturated experiment may look like, see the original paper, Skog et al. from 2021.


As you may have guessed, the ‘experiment’ argument controls the number of independent fastq evidence needed across the whole experiment. Note, however, that ‘sample’ does not control the number of counts needed in each sample. The evidence filter will always use >=1 counts in X number of fastq files. Instead ‘sample’ controls when a sequence should be included despite not reaching the ‘experiment’ threshold. Thus, if evidence=c(experiment=2, sample=10), sequences that reach 10 counts in a single sample will also be included. Here is an example on how to use the ‘sample’ argument:

###--------------------------------------------------------------------- 
## Evidence over two independent samples, saving single sample sequences 
## reaching 3 counts
test <- make_counts(input=fastq, trimming="seqpac", 
                    parse="default_neb", threads=3,   
                    evidence=c(experiment=2, sample=3))
extras <- apply(test$counts, 1, function(x){sum(!x==0)})
test$counts[extras==1,] # A few still have 3 alone


Lastly, if evidence=NULL all unique sequences will be saved. Remember, however, that the count table will become larger in well saturated experiments that may lead to performance issues on some systems, since each unique sequence occupies a row in counts(PAC) and anno(PAC). Apply other filters prior to annotating your sequences may solve such problems.


2.3 Generate the phenotype table (make_pheno)

Next, we generate a Pheno table (P) and merge it with the progress report that was stored from the make_counts output. The Pheno table contains sample specific information such as sample name and possible interventions, with each sample as a row.


The make_pheno function can import a Pheno table both externally by providing a path to a .cvs file, as well as internally by just passing a data.frame. In common for both options, is that a unique column named “Sample_ID” needs to be present, and must contain the sample names (column names) present in the counts table generated by make_counts. Now, make_pheno will attempt to match similar names, but to avoid problems use identical names.


In addition, make_pheno may add the progress report generated by make_counts. Lets have a look at how this works:

###--------------------------------------------------------------------- 
## Generate a Pheno table using the file names of test fastq 
Sample_ID <- colnames(count_list$counts)
pheno_fl <- data.frame(Sample_ID=Sample_ID,
                    Treatment=c(rep("heat", times=1), 
                                rep("control", times=2)),
                    Batch=rep(c("1", "2", "3"), times=1))

pheno <- make_pheno(pheno=pheno_fl, 
                    progress_report=count_list$progress_report, 
                    counts=count_list$counts)

2.4 make_PAC


Finally, we put everything together as a master PAC object using the make_PAC function. Conveniently, skipping the anno= argument will automatically add a very simple annotation table to the PAC-object that can be filled later.

###--------------------------------------------------------------------- 
## Generate PAC object (default S4)
pac_master <- make_PAC(pheno=pheno, counts=count_list$counts)
pac_master

## If TRUE the PAC object is ok 
PAC_check(pac_master)



3 Preprocessing the PAC object

3.1 Targeting objects

Seqpac uses a simple targeting system to divide and focus the analysis of a PAC object to certain sample groups (Pheno) or specific sequences (Anno). If not specified otherwise in the function manual, a targeting object is always a two item list, where the first object is naming a column in a specific table in the PAC and the second object tells the function which categories in that column that should be targeted.

  • pheno_target: extracts samples from a column in pheno(PAC)
  • anno_target: extracts sequences from a column in anno(PAC)

While pheno_target and anno_target are the most common targeting objects, there are other targeting object. Common for all is that they target a specific sub-table in the PAC.


Now, lets take a look at the provided example PAC-object in Seqpac. As noted by the name, this PAC has already been filtered and annotated, but it will still show you the principle of preprocessing using targeting objects.

###--------------------------------------------------------------------- 
## Load the PAC-object  and inspect the columns in Pheno and Anno
library(seqpac)
load(system.file("extdata", "drosophila_sRNA_pac_filt_anno.Rdata", 
                 package = "seqpac", mustWork = TRUE))

pheno(pac)[,1:5]
anno(pac)[1:5,1:5]
counts(pac)[1:5,1:5]

## This PAC is an S4 object, but you can easily turn it to an S3 list with as().
## using as.PAC() will turn it back into S4:
isS4(pac)
pac_s3 <- as(pac, "list")
lapply(pac_s3[1:3], head) 
isS4(pac_s3)
pac <- as.PAC(pac_s3)

methods(class="PAC") #all S4 methods


A pheno_target object that will exclusively target stage 1 and 5 samples would look like this: pheno_target=list("stage", c("Stage1", "Stage5"))


While an anno_target that will target sequences of 22 nt in length would look like this: anno_target=list("Size", "22")


Notice that both targeting objects are lists holding two secondary objects: one character string targeting a specific column, and one character vector targeting specific categories within the target column. The only difference being that each targeting object target different tables in the PAC object.


Important, if the second object is set to NULL or is left out this will automatically target all samples in the specified column. If the anno_target is a character vector (not a list) some function will attempt to extract matches in the row names of either Pheno or Anno (see example with PAC_filtsep below).For more, please refer to the original Seqpac publication: Skog et al. 2021.


3.2 Filtering

A master PAC object will contain sequences in your original fastq files that passed the trimming and the initial evidence filter. Now, in many experiments this still involves millions of unique sequences, making the master PAC difficult to work with. Thus, it is often wise to further reduce noise prior to normalization and annotation.


Seqpac contains two functions for this purpose:

  • PAC_filter
  • PAC_filtsep


While PAC_filter gives a variety of choices to subset the PAC object according to for example sequence size or minimum counts in a certain proportion of the samples,PAC_filtsep extracts the sequence names that pass a filter within sample groups.


Here are some examples:

 ###--------------------------------------------------------------------- 
## Extracts all sequences between 20-30 nt in length with at least 5 counts 
## in 20% of all samples.
pac_filt <- PAC_filter(pac, size=c(20,30), threshold=5, 
                          coverage=20, norm = "counts",  
                          pheno_target=NULL, anno_target=NULL)
## 
## -- Size filter will retain: 3416 of 9131 seqs.
## -- Count filter was specified.
## -- The chosen filters will retain: 826 of 9131 seqs.
###--------------------------------------------------------------------- 
## Optionally, a simple coverage/threshold graph at different filter depths 
## can be plotted with stat=TRUE (but it will promt you for a question).
pac_filt <- PAC_filter(pac, threshold=5, coverage=20, 
                       norm = "counts", stat=TRUE,  
                       pheno_target=NULL, anno_target=NULL)
###--------------------------------------------------------------------- 
## Extracts all sequences with 22 nt size and the samples in Batch1 and Batch2.
pac_filt <- PAC_filter(pac, subset_only = TRUE,
                       pheno_target=list("batch", c("Batch1", "Batch2")), 
                       anno_target=list("Size", "22"))
## 
## -- Pheno target was specified, will retain: 6 of 9 samples.
## -- Anno target was specified, will retain: 378 of 9131 seqs.
###--------------------------------------------------------------------- 
## Extracts  sequences with >=5 counts in 100% of samples within each stage
filtsep <- PAC_filtsep(pac, norm="counts", threshold=5, 
                       coverage=100, pheno_target= list("stage"))

pac_filt <- PAC_filter(pac, subset_only = TRUE,
                       anno_target= unique(do.call("c", as.list(filtsep))))
## 
## -- Anno target was specified, will retain: 1093 of 9131 seqs.

Notice that PAC_filtsep only extracts the sequence names and stores them in a data.frame. Converted into a character vector with unique names (using unique + do.call) they can be applied as an anno_target that targets sequence names in Anno.

Hint, the data.frame output of PAC_filtsep has been designed for the purpose of generating Venn and Upset diagrams that visualize overlaps across groups:

###--------------------------------------------------------------------- 
## Venn diagram using the venneuler package
olap <- reshape2::melt(filtsep, 
                       measure.vars = c("Stage1", "Stage3", "Stage5"), 
                       na.rm=TRUE)
plot(venneuler::venneuler(olap[,c(2,1)]))

###--------------------------------------------------------------------- 
## Setting output="binary", prepares data for upset plots (UpSetR package): 
filtsep_bin <- PAC_filtsep(pac, norm="counts", threshold=5, 
                           coverage=100, pheno_target= list("stage"), 
                           output="binary")

UpSetR::upset(filtsep_bin, sets = colnames(filtsep_bin), 
                           mb.ratio = c(0.55, 0.45), order.by = "freq", 
                           keep.order=TRUE)


3.3 Storage of processed data in PAC

While the simplest PAC only contains three data.frame objects (P, A and C), additional ‘folders’ will be stored in the PAC object as the analysis progresses. In fact, if using a S3 PAC (list), you may choose to store any number of objects in your PAC as long as they do not conflict with the names of the standard folders, which are:

Pheno  (P):    data.frame
Anno   (A):    data.frame
Counts (C):    data.frame
norm:          list of data.frames
summary:       list of data.frames


The norm folder will be used to store counts that have been normalized in different ways, while the summary folder will contain the data that has been summarized across groups of samples. In both cases, applying a filter using PAC_filter will automatically subset not only P, A, and C, but also all tables stored in norm(PAC) and summary(PAC).


3.4 Normalization

The built in normalization methods in Seqpac are handled by the PAC_norm function. In the current version three methods are available:

  • Counts per million reads (cpm)
  • Variance-Stabalizing Transformation (vst)
  • Regularized log transformation (rlog)


More specifically, cpm normalize the dataset in relation to the total number of reads, while vst and rlog uses both mean dispersion and size factor to produce homoskedastic values with functions available in the DESeq2 package.


It is easy to generate your own normalized count table using your method of choice, and import it as a data.frame to the norm folder in the PAC object. Just remember that row and column names must be identical to the original Counts table. Use PAC_check to verify.

Lets generate two normalized count tables with cpm and vst:

###--------------------------------------------------------------------- 
## Example normalization in Seqpac
load(system.file("extdata", "drosophila_sRNA_pac_filt_anno.Rdata", 
                 package = "seqpac", mustWork = TRUE))

# PAC_norm works on both
pac <- PAC_norm(pac, norm="cpm")
pac <- PAC_norm(pac, norm="vst")

pac

From this, it may sometimes be wise to run further filtering steps to focus on reads with high cpm values.

###---------------------------------------------------------------------
## Filtering using cpm instead of raw counts
## filter >=100 cpm in 100% of samples in >= 1 full group
filtsep <- PAC_filtsep(pac, norm="cpm", threshold=100, coverage=100, 
                       pheno_target= list("stage")) 
pac_cpm_filt <- PAC_filter(pac, subset_only = TRUE,
                       anno_target= unique(do.call("c", as.list(filtsep))))



4 Generate annotations

From the start a PAC object contains a very limited Anno table such as this:

load(system.file("extdata", "drosophila_sRNA_pac_filt_anno.Rdata", 
                 package = "seqpac", mustWork = TRUE))
anno(pac) <- anno(pac)[,1, drop = FALSE]
head(anno(pac))

To facilitate easy adoption of Seqpac to any species, we provide a fast and flexible workflow for mapping PAC sequences against user defined reference databases. We call this reannotation, since we map the sequences in Anno object sequentially over one or multiple references.

Important, Seqpac relies on the popular bowtie aligner Rbowtie for mapping: http://bowtie-bio.sourceforge.net/manual.shtml. This means that references must be indexed using the bowtie_build function before they are compatible with the reannotation workflow (but see the PAC_mapper function for an index free alternative).


4.1 Fasta references

Small RNA annotation/mapping depends heavily on choosing your references wisely and not to be too greedy if possible. If your hypothesis target a specific type of class (like tRNA), it might for instance be wiser to target such references only.

Many databases provide their references in the fasta file format. Seqpac can use any correctly formated fasta file as input in the reannotation workflow. However, we distinguish two kinds of fasta references:

  • Reference genomes.
  • Other specialized references.

You can download the files manually from the source database. Below are some helpful example links where to retrieve useful fasta references. Note that all compressed files must be uncompressed to work with the bowtie aligner.


Reference genomes:
Ensembl Animals
Ensembl Plants
UCSC


Hint: Unmasked top level fasta files are best for most purposes, e.g. Ensembl file name *.dna.toplevel.fa.gz.

Other specialized references:
miRNA
miRNA
piRNA
many ncRNA Animals
many ncRNA Plants
tRNA
repeats


There are also several packages in Bioconductor/R such as r Biocpkg("BSgenome") and biomaRt that allow you to download reference genomes and other specialized references directly into r. Actually, using the download.file function makes most urls (https/http/ftp etc) available for download. Custom fasta files can also be used, and you can easily merge, add and remove features to the references by functions in the r Biocpkg("Biostrings") package, such as readDNAStringSet (read fasta) and writeXStringSet (save fasta).


As default, bowtie only reports the names up to the first white space. Sometimes the sequence names need to be modified or rearranged for the bowtie aligner to best report the names in the mapping output. This can be done using the Biostrings package to read, rearrange and save the fasta. A useful expression to verify that you have caught the relevant info before the the first white space is: do.call("rbind", strsplit(names(<your_DNAStringSet>), " "))[,1]


You may also want to add tRNAs from the mitochrondrial genome to your tRNA reference, which is not always provided in the GtRNAdb fasta. This can be done by moving them from Ensembl ncRNA fasta or scan the mitochondrial genome manually at tRNAscan-SE. The mitochonridal genome is found in the reference genome fasta.


## Indexing the fasta references Before we can use the fasta references we need to index them for the bowtie aligner. You can do this by using the bowtie_build function in the Rbowtie package or by running bowtie externally, outside R. For more information see ?Rbowtie::bowtie, Rbowtie::bowtie_build_usage() and http://bowtie-bio.sourceforge.net/manual.shtml.


Important, the bowtie prefix must have the same basename (prefix=) and the indexes must be saved in the same directory (outdir=) as the original fasta file. Since the bowtie_build naming is often confusing for newbies, here are some examples:

###---------------------------------------------------------------------
## Examples of generating bowtie indexes

path <- "/some/path/to/your/folder"

#Genome
mycoplasma_path <- system.file("extdata/mycoplasma_genome", "mycoplasma.fa", 
                               package = "seqpac", mustWork = TRUE)
Rbowtie::bowtie_build(mycoplasma_path, 
                      outdir=gsub("mycoplasma.fa", "", mycoplasma_path), 
                      prefix="mycoplasma", force=TRUE)
#tRNA
trna_path <- system.file("extdata/trna", "tRNA.fa", 
                           package = "seqpac", mustWork = TRUE)
Rbowtie::bowtie_build(trna_path, 
                        outdir=gsub("tRNA.fa", "", trna_path), 
                        prefix="tRNA", force=TRUE)
  
#rRNA
rrna_path <- system.file("extdata/rrna", "rRNA.fa", 
                           package = "seqpac", mustWork = TRUE)
Rbowtie::bowtie_build(rrna_path, 
                        outdir=gsub("rRNA.fa", "", rrna_path), 
                        prefix="rRNA", force=TRUE) 


Building indexes may take some time for large fasta, such as a reference genome. You only need to generate the index once for every fasta reference, however.


4.2 GTF feature coordinates

After the sequences have been mapped against a reference genome they can also acquire annotations from genomic feature files, such as the gtf and gff formats. These files contains the coordinates for different genomic features associated with a specific reference genome, such as repetitive regions, exons and CpG islands etc.


GTF files can often be acquired at the same places as the fasta reference files (see above). The biomartr package also have functions for downloading some gtf files (e.g. getGTF).In addition, you may create your own by downloading/importing different tables, for example using r Biocpkg("rtracklayer") (see readGFF, getTable, ucscTableQuery) and then create a GRanges object (GenomicRanges package) where you add your preferred table info. Finally you may save it using rtracklayer again (export(<GRanges object>, <destination_path>, format="gtf")


4.3 The reannotation workflow

The procedure to annotate sequences in Seqpac is called reannotation.


The PAC reannotation family of functions carries out the following tasks:

  • Maps sequences against fasta references
  • Stores the output on hard drive
  • Reads selected parts into R
  • Creates a reanno object
  • Generates an overview from the reanno object
  • Classifies and simplifies annotations
  • Adds classifications to a PAC object


4.3.1 The functions in the reanno workflow


map_reanno
To accommodate most users on most platforms, Seqpac provides two alternatives for calling bowtie using the map_reanno function: internally from within R (type="internal") or externally from outside R (type="external"). In the internal mode, Seqpac uses the bowtie function in the Rbowtie package, while in the external mode bowtie is called by a system command to a locally installed version of bowtie. Both options gives identical results, but since the internal Rbowtie package is rarely updated, we provide the external option.


Since the input commands for external bowtie and internal Rbowtie differs slightly, map_reanno has two parse options: parse_internal and parse_external. These can be used to control the two versions of bowtie as if they were run from the console or command line, respectively.


While bowtie has its own way to handle mismatches in the alignments, Seqpac disregards this and instead runs bowtie sequentially, increasing the number of mismatches for each cycle. After each cycle all sequences that received a match in any of the provided references will be subtracted. Therefore, in the next cycle only sequences without a match will be ‘reannotated’ but allowing one additional mismatch. The map_reanno function controls this by sequentially increasing the number of mismatches after subtracting the matches.


import_reanno
This function is called by map_reanno to import the bowtie results after each cycle. If you would run import_reanno separately it simply provides you with options on how to import bowtie results into R. For example, when aligning against a reference genome the mapping coordinates is important, while mapping against more specialized references used for creating sRNA classifications (rRNA, tRNA, miRNA etc), ‘hit’ or ‘no hit’ might be enough, and will be a much faster option. The map_reanno function, therefore, has two default import modes, import=“genome” and import=“biotype”, which automatically configure import_reanno for genome or class annotation.


make_reanno
After each annotation cycle, map_reanno will store an .Rdata file (Full_reanno_mis0/1/2/3.Rdata) containing a list with all alignment hits for each PAC-sequence-to-fasta-reference mismatch cycle. Calling make_reanno will search for these .Rdata files in a given folder and create a reanno object in R. This involves extracting, ordering and preparing annotations in a format that can be merged with the existing Anno table in a PAC object. It will also generate an overview table, that summarizes the annotations if multiple fasta references were used by map_reanno.


add_reanno
Before merging the new annotations with a PAC object, unless we have provided a fasta reference with only one type (e.g. miRNA for mirBase), classification not only between fasta references but also within references might be necessary (e.g. snoRNA, tRNA, miRNA, rRNA in the Ensembl_ncRNA fasta). Classification of the fasta reference names, which was first saved by bowtie and then caught in our reanno object as an annotation, is done with the add_reanno function. Again, there are two primary modes, either “genome” or “biotype”. When type=“genome” a standardized workflow anticipating genomic coordinates are applied. When type=“biotype” a list of search term vectors directed against each fasta reference needs to be provided. Matches between search term (written as regular expressions) and text strings in the fasta reference names will result in a classification of the PAC sequence. Finally, add_reanno generates an annotation table that can either be saved separately or merged with the original PAC object.


simplify_reanno
While add_reanno starts the process of simplifying the reference name annotations into useful classifications, it will generate multiple classifications for each counted sequence if it matches multiple search terms. The simplify_reanno function boils down multi-classifications into one classification for each sequence. By doing so, an hierarchy must be set, where priority of some classifications over others is done (e.g. miRNA over piRNA). While this is a controversial topic, using such hierarchies is still the only way to provide overview statistics in experiments targeting highly multimapping sequences, such as small RNA. Nonetheless, the Seqpac workflow with its sequence based counts, where hierarchical classification is done in the very end of the annotation process, makes it easy to generate classifications using alternative hierarchies and observe the consequences of your choices in for example a pie chart.


In summary:

  • map_reanno Bowtie reference mapping over mismatch cycles
  • import_reanno Used by map_reanno to import bowtie output
  • make_reanno Organizes map_reanno output into a reanno object
  • add_reanno Classifies annotations according to reference name
  • simplify_reanno Hierarchical classification


Already at this point it is good to know that Seqpac provides a ‘backdoor’ function, PAC_mapper, that carries out many of the steps in the reanno workflow using smaller references and pac objects. Please see ‘6.1 Advanced alignment analysis’ or ?PAC_mapper for more details.


4.3.2 Reannotation in action

Lets observe the functions in action using the drosophila test dataset.


First we map against a reference genome, using map_reanno. Note that import=“genome” must be set to catch the genomic coordinates. Since map_reanno can handle multiple fasta references. Remember, each fasta must have bowtie indexes (see 4.2).


# Lets reload our pac
library(seqpac)
load(system.file("extdata", "drosophila_sRNA_pac_filt_anno.Rdata", 
                 package = "seqpac", mustWork = TRUE))
anno(pac) <- anno(pac)[,1, drop = FALSE] # Remove previous Anno

###---------------------------------------------------------------------
## Genome mapping

# Path to your output folder 
outpath_genome <- "/some/path/to/reanno_folder" 
# or a temp folder
outpath_genome <- paste0(tempdir(), "/seqpac_reanno/reanno_genome")

# Look up the your own Bowtie indexed reference genomes (fasta)
ref_paths <- list(myco="<some/path/to/your/genome.fa>")

# For testing, you can use a small mycoplasma genome provided with Seqpac
# But remember to provide single genomes as a named list anyway.
# The name will be used to keep track of your genome.
myco_path <- system.file("extdata/mycoplasma_genome", "mycoplasma.fa", 
                         package = "seqpac", mustWork = TRUE)
ref_paths <- list(myco=myco_path)

# Run map_reanno
map_reanno(PAC=pac, ref_paths=ref_paths, output_path=outpath_genome, 
           type="internal", mismatches=1, import="genome", 
           threads=8, keep_temp=TRUE)


Similarly, we can map against other specialized references, that can be used for classifying each sequence into biotypes. Note, unlike genome mapping, at this stage we are not interested in where a sequence matches a reference for the specialized references, only if it matches or not. Thus, we use import="biotype".

###---------------------------------------------------------------------
## sRNA mapping using internal bowtie

# Path to output folder 
outpath_biotype <- "/some/path/to/reanno_biotype" 
# or a temp folder
outpath_biotype <- paste0(tempdir(), "/seqpac_reanno/reanno_biotype")


# Seqpac contains two fasta for tRNA and rRNA we can use for testing
trna_path <- system.file("extdata/trna", "tRNA.fa", 
                         package = "seqpac", mustWork = TRUE)  
rrna_path <- system.file("extdata/rrna", "rRNA.fa", 
                           package = "seqpac", mustWork = TRUE)

# Provide paths to bowtie indexed fasta references as a list
ref_paths <- list(tRNA=trna_path,
                  rRNA=rrna_path)

map_reanno(pac, ref_paths=ref_paths, output_path=outpath_biotype, 
           type="internal", mismatches=1,  import="biotype", 
           threads=6, keep_temp=TRUE)


Now, the output .Rdata files for all mismatch cycles should have been stored in the output folders. Lets import everything into R, generate a reanno objects and plot some pie charts from the Overview table.

###---------------------------------------------------------------------
## Generate a reanno object with make_reanno 
reanno_genome <- make_reanno(outpath_genome, PAC=pac, mis_fasta_check = TRUE)
reanno_biotype <- make_reanno(outpath_biotype, PAC=pac, mis_fasta_check = TRUE)

# Note, setting mis_fasta_check=TRUE will double check that the number of
# sequences that failed to receive an alignment in the last mismatch cycle
# agrees with the number sequences in the reanno object without an annotation.
# (these sequences are stored in mis_fasta_x.txt where x is max mismatches+1)

# List structure
str(reanno_genome, max.level = 3, give.attr = FALSE)
str(reanno_biotype, max.level = 3, give.attr = FALSE)

# Simple pie charts using the Overview table
pie(table(overview(reanno_genome)$myco)) # Very few mycoplasma with 0 mismatch 
pie(table(overview(reanno_biotype)$Any))  # Many hits for either rRNA or tRNA


Next step is to reorganize and classify using add_reanno. For mapping against a reference genome this is easy. Just set type=“genome” and set the maximum number of alignments to be reported by genome_max (Warning: genome_max="all" will give you all alignments).

###---------------------------------------------------------------------
### Genomic coordinates using add_reanno 

# Output as separate table
anno_genome <- add_reanno(reanno_genome, type="genome", 
                          genome_max=10, mismatches=1)

# Output merged with provided PAC object
pac <- add_reanno(reanno_genome, type="genome", 
                  genome_max=10, mismatches=1, merge_pac=pac)

# Example of original reference name annotations 
head(full(reanno_genome)$mis1$myco)

# Finished genome annotation 
head(anno_genome)
head(anno(pac))


For specialized references, type="biotype" should be used. This allows for classification based on match or no match between the search terms provided in the bio_search input and the reference names annotated with each counted sequence. Correct classification is all about finding the best search terms that discriminates between the different names in the original fasta reference files (up to the first white space; see 4.1).


With the bio_perfect option you may control how conservative the matching between search term and annotation should be. With bio_perfect=FALSE, every reference hit that fail to match your search terms, will be classified as ‘other’. Using bio_perfect=TRUE will instead guarantee that your search terms will cover all reference hits.


Hint: See ?add_reanno for a trick on how to succeed with bio_perfect=TRUE.

###---------------------------------------------------------------------
## Classify sequences using add_reanno 

# Lets start by exploring the names in the original fasta reference: 
ref_path <- "/some/path/to/fasta_reference" 
# For testing use the the fasta reference for tRNA provided with Seqpac:
trna_path <- system.file("extdata/trna", "tRNA.fa", 
                         package = "seqpac", mustWork = TRUE)  

#  Read fasta names
fasta <- names(Biostrings::readDNAStringSet(trna_path))
# Different naming standards  
table(substr(fasta, 1, 10))
# Starting with FBgn discriminate mitochondrial tRNA
fasta[grepl("FBgn", fasta)]
# The other are nuclear
head(fasta[!grepl("FBgn", fasta)])

# We can use as 'mt:tRNA' as search term to catch mitochondrial tRNA
# and '_tRNA' to catch nuclear. 

# A useful expression for extracting strings up to 1st white space is
# fasta <- do.call("rbind", strsplit(fasta, " "))[,1]
    
# Similarly load the rrna reference provided with seqpaq
rrna_path <- system.file("extdata/rrna", "rRNA.fa", 
                         package = "seqpac", mustWork = TRUE)
#  Read fasta names
fasta <- names(Biostrings::readDNAStringSet(rrna_path))
# Many rRNA subtypes  
table(substr(fasta, 1, 10))

# Lets try two search term lists directed against each reference and written as
# regular expressions:

# Names in the search list needs to be the same as in the reanno object
head(overview(reanno_biotype)) # 'rRNA' and 'tRNA' with some capital letters

# Generate a search list with  search terms 
bio_search <- list(
                rRNA=c("5S", "5.8S", "12S", "16S", "18S", "28S", "pre_S45"),
                tRNA =c("_tRNA", "mt:tRNA")
                    )
 
test <- add_reanno(reanno_biotype, bio_search=bio_search, 
                      type="biotype", bio_perfect=FALSE, 
                      mismatches = 1, merge_pac=pac)                
# Throws an error because perfect matching was required:
anno_temp <- add_reanno(reanno_biotype, bio_search=bio_search, 
                        type="biotype", bio_perfect=TRUE, mismatches = 1)
# References with no search term hits are classified as "Other":
<div id="refs"></div>

# (APPENDIX) Appendix {-}

anno_temp <- add_reanno(reanno_biotype, bio_search=bio_search, 
                        type="biotype", bio_perfect=FALSE, mismatches = 1)

# Increasing number of hits allowing mismatches  
table(anno_temp$mis0) 
table(anno_temp$mis1)

# You may also add your new annotaion with a PAC object using the 'merge_pac' 
# option. For even more options see ?add_reanno.
pac <- add_reanno(reanno_biotype, bio_search=bio_search, 
                        type="biotype", bio_perfect=FALSE, mismatches = 1,
                        merge_pac=pac)
head(anno(pac))


To make overview statistics and graphs we need to boil down the classifications to a factor column with only a few unique biotypes per factor. This is what simplify_reanno does. Just like add_reanno in the type="biotype" mode, simplify_reanno needs a list of search terms, an hierarchy. This time, however, the list is order sensitive and targets the output table from add_reanno that contains the columns with new classifications (e.g. “mis0_bio”, “mis1_bio”, “mis2_bio” etc). If a match occurs between the first search term and a classification, the counted sequence will be annotated to this classification, no matter if other search terms matches further down the list. This is done sequentially, allowing one additional mismatch until a maximum (specified in mismatches) has been reached. Thus, if a match occurs in tRNA with 0 mismatch, and rRNA with 1 mismatch, it will be reported as tRNA even though rRNA where higher in the hierarchy. A better match will always be prioritized over the hierarchy.


###---------------------------------------------------------------------
## Hierarchical classification with simplify_reanno

#  Currently classification does not discriminate
table(anno(pac)$mis3_bio)

# Set the hierarchy and remember that it is order sensitive.
# Here: S5_rRNA >> Other_rRNA >> mt:tRNA >> tRNA
# Remember to use 'regular expressions' if you wish to catch all:
hierarchy_1 <- list(rRNA_5S="rRNA_5S",
                    Other_rRNA="5.8S|12S|16S|18S|28S|pre_S45|rRNA_other",
                    Mt_tRNA="tRNA_mt:tRNA",
                    tRNA="tRNA__tRNA"
                    )

# What happens if you don't catch all:
hierarchy_2 <- list(rRNA_5S="rRNA_5S",
                    Mt_tRNA="tRNA_mt:tRNA",
                    tRNA="tRNA__tRNA")

# No mismatch allowed using hierarchy_1
test <- simplify_reanno(input=pac, hierarchy=hierarchy_1, mismatches=0, 
                       bio_name="Biotypes_mis0", merge_pac=FALSE)
table(test) # All remaining rRNA are classified as 'Other_rRNA'

# Instead using hierarchy_2
test <- simplify_reanno(input=pac, hierarchy=hierarchy_2, mismatches=0, 
                       bio_name="Biotypes_mis0", merge_pac=FALSE)
table(test)# Non-hits are classified as 'other'

# Note, setting 'merge_pac=FALSE' returns a one-column data.frame only
# containing the new hierarchical classifications.

# Lets increase number of mismatches an merge it with the PAC
# (How many mismatches depends on what you allowed previously in the workflow) 
colnames(anno(pac))  # mis0-1_bio = Upto 1 mismatches are available

pac_test <- simplify_reanno(input=pac, hierarchy=hierarchy_1, mismatches=1, 
                        bio_name="Biotypes_mis1", merge_pac=TRUE)

# Now we also have some mitochondrial tRNA 
table(anno(pac_test)$Biotypes_mis1)



5 Analysis and visualization

There are several functions in Seqpac that handle statistical analysis and visualization. We will only present a few of them here, so please refer to Seqpac’s reference manual for a complete list. Just like many of the preprocessing functions, these functions are named PAC_xxx to clearly show that they are applied on a PAC object.

Hint: If you are unsure if your manually constructed/manipulated PAC object are compatible, you can always run PAC_check.

5.1 Summarize data over groups

To make simple summaries over column categories/groups in the pheno(PAC) table such as means, standard deviation (SD), standard error (SE), and log2 fold changes (log2FC) etc, Seqpac has a convenient function called PAC_summary. This function uses a target_pheno object to generate a summary over raw counts or normalized counts across sample groups. Processed data will be stored in the summary(PAC) folder.

Important, summarized data in the summary(PAC) folder will always have the same rownames (unique sequences) as counts(PAC) and anno(PAC), while the column names will be derived from the pheno(PAC) table. Summarize using an anno_target object is not allowed, because that would disrupt sequence names (loss of sequence integrity). Summaries over sequences (such as generating means over sRNA classes) is instead handled by each plot/statistical function, and are often stored in the output from these functions. Nonetheless, user may always generate their own derived tables, using functions like aggregate and reshape::melt.

Lets have some examples on how PAC_summary works:

load(system.file("extdata", "drosophila_sRNA_pac_filt_anno.Rdata", 
                 package = "seqpac", mustWork = TRUE))

###---------------------------------------------------------------------
## PAC_summary in Seqpac 

# Make means of counts over stages and return a data.frame
tab <- PAC_summary(pac, norm = "counts", type = "means", 
                   pheno_target=list("stage"), merge_pac=FALSE)
## 
## -- Pheno target was specified, will retain: 9 of 9 samples.
# When merge_pac=TRUE the table is added to the summary(PAC) 'folder'  
pac_test <- PAC_summary(pac, norm = "counts", type = "means", 
                        pheno_target=list("stage"), merge_pac=TRUE)
## 
## -- Pheno target was specified, will retain: 9 of 9 samples.
pac       # Structure of PAC before PAC_summary (S4)
## PAC object with: 
##    9 samples
##    9131 sequences
##    mean total counts: 26554 (min:20353/max:38275)
##    best sequence: 1998 mean counts
##    worst sequence: 0 mean counts
## normalized tables: 1 
## cpm
pac_test  # Structure of PAC after PAC_summary  (S4)
## PAC object with: 
##    9 samples
##    9131 sequences
##    mean total counts: 26554 (min:20353/max:38275)
##    best sequence: 1998 mean counts
##    worst sequence: 0 mean counts
## normalized tables: 1 
## cpm 
## summarized tables: 1 
## countsMeans_stage
names(summary(pac_test))
## [1] "countsMeans_stage"
head(summary(pac_test)$countsMeans_stage)
##                                      Stage1     Stage3     Stage5
## TATTGCACTTGAGACGGCCTGAAAA         15.666667   2.666667  9.0000000
## CATGAGGACTGTGCT                    8.333333  16.333333  6.0000000
## TGCTTGGACTACATATGGTTGAGGGTTGTA  2203.666667 378.333333 74.0000000
## CTGCTTGGACTACATATGGTTGAGGGTTGTA 1790.666667 456.000000 57.6666667
## CAATGAGGGACCAGTACATGAGGACTCTGCT    6.000000  16.000000  0.6666667
## CATGAGGACTGTGCC                    7.333333  15.333333  2.0000000
# You may want to use normalized counts   
pac_test <- PAC_summary(pac_test, norm = "cpm", type = "means", 
                        pheno_target=list("stage"), merge_pac=TRUE)

# Maybe only include a subset of the samples
pac_test <- PAC_summary(pac_test, norm = "cpm", type = "means", 
                        pheno_target=list("batch", c("Batch1", "Batch2")), 
                        merge_pac=TRUE)

# Generate standard errors
pac_test <- PAC_summary(pac_test, norm = "cpm", type = "se", 
                        pheno_target=list("stage"), merge_pac=TRUE)

# log2FC
pac_test <- PAC_summary(pac_test, norm = "cpm", type = "log2FC",  
                        pheno_target=list("stage"), merge_pac=TRUE)

# log2FC generated from a grand mean over all samples
pac_test <- PAC_summary(pac_test, norm = "cpm", type = "log2FCgrand",  
                        pheno_target=list("stage"), merge_pac=TRUE)

# All summarized tables have identical row names that can be merged
names(summary(pac_test))
lapply(summary(pac_test), function(x){
  identical(rownames(x), rownames(summary(pac_test)[[1]]))
  })
head(do.call("cbind", summary(pac_test)))

5.2 Differential Expression using PAC object

Seqpac has a useful function, PAC_deseq, that lets you make omic scale expressional analysis on PAC objects using the popular DESeq2 package. This involves fitting generalized linear models with negative binomial distributions and subsequent correction for multiple testing using the PAC tables. For more information (such as how to correctly build models), please see the DESeq2 vingette DESeq2. In addition to preparing a PAC object for DESeq2, PAC_deseq will organize and visualize the output. Lets have an example using the test dataset:

load(system.file("extdata", "drosophila_sRNA_pac_filt_anno.Rdata", 
                 package = "seqpac", mustWork = TRUE))

###---------------------------------------------------------------------
## Differential expression in Seqpac 

# Simple model testing stages against using Wald test with local fit (default)
table(pheno(pac)$stage)
## 
## Stage1 Stage3 Stage5 
##      3      3      3
output_deseq <- PAC_deseq(pac, model= ~stage, threads=6)
## 
## 
## ** stage Stage3 vs Stage5 **
## 
## out of 9131 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up)       : 820, 9%
## LFC < 0 (down)     : 1975, 22%
## outliers [1]       : 3, 0.033%
## low counts [2]     : 2125, 23%
## (mean count < 1)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
## 
## NULL

# More complicated, but still graphs will be generated from 'stage' since it is
# first in model
output_deseq <- PAC_deseq(pac, model= ~stage + batch, threads=6)
## Warning:   'IS_BIOC_BUILD_MACHINE' environment variable detected, setting
##   BiocParallel workers to 4 (was 6)
## Warning: Removed 6019 rows containing missing values or values outside the scale range
## (`geom_point()`).
# Using pheno_target we can change the focus to batch
# (no batch effect) 
output_deseq <- PAC_deseq(pac, model= ~stage + batch, 
                          pheno_target=list("batch")) 

# With pheno_target we can also change the direction of the comparison 
output_deseq <- PAC_deseq(pac, model= ~stage, 
                          pheno_target=list("stage", c("Stage5", "Stage3")))
## Warning: Removed 3 rows containing non-finite outside the scale range
## (`stat_bin()`).
## Warning: Removed 2128 rows containing missing values or values outside the scale range
## (`geom_point()`).
## In the output you find PAC merged results, target plots and output_deseq   
names(output_deseq)
head(output_deseq$result)

5.3 Principal component analysis (PAC_pca)

The PAC_pca function uses the FactoMineR package to make a principle component analysis, from which scatter plots are plotted using the ggplot2 package. Here are some examples on how to use PAC_pca:

###---------------------------------------------------------------------
## PCA analysis in Seqpac 

# As simple as possible 
output_pca <- PAC_pca(pac)
# Two 'folders' in the output
names(output_pca)  

# Using pheno_target
output_pca <- PAC_pca(pac, pheno_target =list("stage"))
# Using pheno_target with sample labels
output_pca <- PAC_pca(pac, pheno_target =list("stage"), 
                      label=pheno(pac)$sample)