1 Simple, flexible and reusable annotation for metaseqR2 pipeline

When using the first version of metaseqR, one had to either embed the annotation to the gene/exon/3’ UTR counts, or to download and construct on-the-fly the required annotation when starting from BAM files. Although the counting and gene model construction results could (anf still can) be saved and re-used with other analysis parameters changed (e.g. statistical algorithms), one could not easily add for example new data to an existing dataset without re-running the whole pipeline and re-downloading annotation. On top of that, many times, the main Ensembl servers (when using Ensembl annotations) do not respond well to biomaRt calls, so the whole pipeline may stall until the servers are back.

Another main issue with the annotation used by metaseqR was that there was no straightforward way, provided by metaseqR, to archive and version the annotation used by a specific analysis and was up to the user to take care of reproducibility at this level. Furthermore, there was no straightforward way for a user to plugin own annotation elements (e.g. in the form of a GTF file) and use it in the same manner as standard annotations supported by metaseqR, e.g. when analyzing data from not-so-often studied organisms such as insects. Plugging-in own annotation was possible but usually a painful procedure, which has become now very easy.

The annotation database builder for metaseqR2 remedies the above situations. The buildAnnotationDatabase function should be run once with the organisms one requires to have locally to work with and then that’s it! Of course you can manage your database by adding and removing specific annotations (and you even can play with an SQLite browser, although not advised, as the database structure is rather simple). Furthermore, you can use the metaseqR2 annotation database and management mechanism for any other type of analysis where you require to have a simple tab-delimited annotation file, acquired with very little effort.

2 Supported organisms

The following organisms (essentially genome versions) are supported for automatic database builds:

  • Human (Homo sapiens) genome version hg38
  • Human (Homo sapiens) genome version hg19
  • Human (Homo sapiens) genome version hg18
  • Mouse (Mus musculus) genome version mm10
  • Mouse (Mus musculus) genome version mm9
  • Rat (Rattus norvegicus) genome version rn6
  • Rat (Rattus norvegicus) genome version rn5
  • Fruitfly (Drosophila melanogaster) genome version dm6
  • Fruitfly (Drosophila melanogaster) genome version dm3
  • Zebrafish (Danio rerio) genome version danRer7
  • Zebrafish (Danio rerio) genome version danRer10
  • Zebrafish (Danio rerio) genome version danRer11
  • Chimpanzee (Pan troglodytes) genome version panTro4
  • Chimpanzee (Pan troglodytes) genome version panTro5
  • Pig (Sus scrofa) genome version susScr3
  • Pig (Sus scrofa) genome version susScr11
  • Horse (Equus cabalus) genome version equCab2
  • Arabidopsis (Arabidobsis thaliana) genome version TAIR10

3 Using the local database

3.1 Installation of metaseqR2

To install the metaseqR2 package, start R and enter:

if(!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("metaseqR2")

3.2 Setup the database

By default, the database file will be written in the system.file(package="metaseqR2") directory. You can specify another prefered destination for it using the db argument in the function call, but if you do that, you will have to supply the localDb argument pointing to the SQLite database file you created to every metaseqr2 call you perform, otherwise, the pipeline will download and use annotations on-the-fly.

In this vignette, we will build a minimal database comprising only the mouse mm9 genome version from Ensembl. The database will be build in a temporary directory inside session tempdir().

Important note: As the annotation build function makes use of Kent utilities for creating 3’UTR annotations from RefSeq and UCSC, the latter cannot be built in Windows. Therefore it is advised to either build the annotation database in a Linux system or use our pre-built databases.

library(metaseqR2)

buildDir <- file.path(tempdir(),"test_anndb")
dir.create(buildDir)

# The location of the custom database
myDb <- file.path(buildDir,"testann.sqlite")

# Since we are using Ensembl, we can also ask for a version
organisms <- list(mm9=67)
sources <- ifelse(.Platform$OS.type=="unix",c("ensembl","refseq"),"ensembl")

# If the example is not running in a multicore system, rc is ignored
buildAnnotationDatabase(organisms,sources,forceDownload=FALSE,db=myDb,rc=0.5)
## Opening metaseqR2 SQLite database /tmp/Rtmpu5V6f6/test_anndb/testann.sqlite
## Retrieving genome information for mm9 from ensembl
## Retrieving gene annotation for mm9 from ensembl version 67
## Using Ensembl host http://may2012.archive.ensembl.org
## Retrieving transcript annotation for mm9 from ensembl version 67
## Using Ensembl host http://may2012.archive.ensembl.org
## Merging transcripts for mm9 from ensembl version 67
## Retrieving 3' UTR annotation for mm9 from ensembl version 67
## Using Ensembl host http://may2012.archive.ensembl.org
## Merging gene 3' UTRs for mm9 from ensembl version 67
## Merging transcript 3' UTRs for mm9 from ensembl version 67
## Retrieving exon annotation for mm9 from ensembl version 67
## Using Ensembl host http://may2012.archive.ensembl.org
## Merging exons for mm9 from ensembl version 67

3.3 Use the database

Now, that a small database is in place, let’s retrieve some data. Remember that since the built database is not in the default location, we need to pass the database file in each data retrieval function. The annotation is retrieved as a GRanges object by default.

# Load standard annotation based on gene body coordinates
genes <- loadAnnotation(genome="mm9",refdb="ensembl",level="gene",type="gene",
    db=myDb)
genes
## GRanges object with 37583 ranges and 4 metadata columns:
##                      seqnames          ranges strand |            gene_id
##                         <Rle>       <IRanges>  <Rle> |        <character>
##   ENSMUSG00000090025     chr1 3044314-3044814      + | ENSMUSG00000090025
##   ENSMUSG00000064842     chr1 3092097-3092206      + | ENSMUSG00000064842
##   ENSMUSG00000051951     chr1 3195982-3661579      - | ENSMUSG00000051951
##   ENSMUSG00000089699     chr1 3456668-3503634      + | ENSMUSG00000089699
##   ENSMUSG00000088390     chr1 3668961-3669024      - | ENSMUSG00000088390
##                  ...      ...             ...    ... .                ...
##   ENSMUSG00000052831     chrY 2086590-2097768      + | ENSMUSG00000052831
##   ENSMUSG00000069031     chrY 2118049-2129045      + | ENSMUSG00000069031
##   ENSMUSG00000071960     chrY 2156899-2168120      + | ENSMUSG00000071960
##   ENSMUSG00000091987     chrY 2390390-2398856      + | ENSMUSG00000091987
##   ENSMUSG00000090600     chrY 2550262-2552957      + | ENSMUSG00000090600
##                      gc_content   gene_name        biotype
##                       <numeric> <character>    <character>
##   ENSMUSG00000090025      48.30     Gm16088     pseudogene
##   ENSMUSG00000064842      36.36          U6          snRNA
##   ENSMUSG00000051951      38.51        Xkr4 protein_coding
##   ENSMUSG00000089699      39.27      Gm1992      antisense
##   ENSMUSG00000088390      34.38          U7          snRNA
##                  ...        ...         ...            ...
##   ENSMUSG00000052831      36.60     Rbmy1a1 protein_coding
##   ENSMUSG00000069031      37.06     Gm10256 protein_coding
##   ENSMUSG00000071960      37.12     Gm10352 protein_coding
##   ENSMUSG00000091987      34.84      Gm3376 protein_coding
##   ENSMUSG00000090600      44.40      Gm3395 protein_coding
##   -------
##   seqinfo: 21 sequences from mm9 genome
# Load standard annotation based on 3' UTR coordinates
utrs <- loadAnnotation(genome="mm9",refdb="ensembl",level="gene",type="utr",
    db=myDb)
utrs
## GRanges object with 161406 ranges and 4 metadata columns:
##                      seqnames          ranges strand |      transcript_id
##                         <Rle>       <IRanges>  <Rle> |        <character>
##   ENSMUST00000160944     chr1 3044815-3045092      + | ENSMUST00000160944
##   ENSMUST00000082908     chr1 3092207-3092484      + | ENSMUST00000082908
##   ENSMUST00000162897     chr1 3195704-3195981      - | ENSMUST00000162897
##   ENSMUST00000159265     chr1 3196326-3196603      - | ENSMUST00000159265
##   ENSMUST00000070533     chr1 3204285-3204562      - | ENSMUST00000070533
##                  ...      ...             ...    ... .                ...
##   ENSMUST00000078383     chrY 2167760-2168120      + | ENSMUST00000078383
##   ENSMUST00000078383     chrY 2168121-2168398      + | ENSMUST00000078383
##   ENSMUST00000166474     chrY 2398496-2398856      + | ENSMUST00000166474
##   ENSMUST00000166474     chrY 2398857-2399134      + | ENSMUST00000166474
##   ENSMUST00000172100     chrY 2552084-2552957      + | ENSMUST00000172100
##                                 gene_id   gene_name        biotype
##                             <character> <character>    <character>
##   ENSMUST00000160944 ENSMUSG00000090025     Gm16088     pseudogene
##   ENSMUST00000082908 ENSMUSG00000064842          U6          snRNA
##   ENSMUST00000162897 ENSMUSG00000051951        Xkr4 protein_coding
##   ENSMUST00000159265 ENSMUSG00000051951        Xkr4 protein_coding
##   ENSMUST00000070533 ENSMUSG00000051951        Xkr4 protein_coding
##                  ...                ...         ...            ...
##   ENSMUST00000078383 ENSMUSG00000071960     Gm10352 protein_coding
##   ENSMUST00000078383 ENSMUSG00000071960     Gm10352 protein_coding
##   ENSMUST00000166474 ENSMUSG00000091987      Gm3376 protein_coding
##   ENSMUST00000166474 ENSMUSG00000091987      Gm3376 protein_coding
##   ENSMUST00000172100 ENSMUSG00000090600      Gm3395 protein_coding
##   -------
##   seqinfo: 21 sequences from mm9 genome
# Load summarized exon annotation based used with RNA-Seq analysis
sumEx <- loadAnnotation(genome="mm9",refdb="ensembl",level="gene",type="exon",
    summarized=TRUE,db=myDb)
sumEx
## GRanges object with 247808 ranges and 4 metadata columns:
##                            seqnames          ranges strand |
##                               <Rle>       <IRanges>  <Rle> |
##   ENSMUSG00000090025_MEX_1     chr1 3044314-3044814      + |
##   ENSMUSG00000064842_MEX_1     chr1 3092097-3092206      + |
##   ENSMUSG00000051951_MEX_1     chr1 3195982-3197398      - |
##   ENSMUSG00000051951_MEX_2     chr1 3203520-3207049      - |
##   ENSMUSG00000051951_MEX_3     chr1 3411783-3411982      - |
##                        ...      ...             ...    ... .
##   ENSMUSG00000091987_MEX_5     chrY 2394921-2395029      + |
##   ENSMUSG00000091987_MEX_6     chrY 2396398-2396478      + |
##   ENSMUSG00000091987_MEX_7     chrY 2397909-2397997      + |
##   ENSMUSG00000091987_MEX_8     chrY 2398179-2398856      + |
##   ENSMUSG00000090600_MEX_1     chrY 2550262-2552957      + |
##                                             exon_id            gene_id
##                                         <character>        <character>
##   ENSMUSG00000090025_MEX_1 ENSMUSG00000090025_MEX_1 ENSMUSG00000090025
##   ENSMUSG00000064842_MEX_1 ENSMUSG00000064842_MEX_1 ENSMUSG00000064842
##   ENSMUSG00000051951_MEX_1 ENSMUSG00000051951_MEX_1 ENSMUSG00000051951
##   ENSMUSG00000051951_MEX_2 ENSMUSG00000051951_MEX_2 ENSMUSG00000051951
##   ENSMUSG00000051951_MEX_3 ENSMUSG00000051951_MEX_3 ENSMUSG00000051951
##                        ...                      ...                ...
##   ENSMUSG00000091987_MEX_5 ENSMUSG00000091987_MEX_5 ENSMUSG00000091987
##   ENSMUSG00000091987_MEX_6 ENSMUSG00000091987_MEX_6 ENSMUSG00000091987
##   ENSMUSG00000091987_MEX_7 ENSMUSG00000091987_MEX_7 ENSMUSG00000091987
##   ENSMUSG00000091987_MEX_8 ENSMUSG00000091987_MEX_8 ENSMUSG00000091987
##   ENSMUSG00000090600_MEX_1 ENSMUSG00000090600_MEX_1 ENSMUSG00000090600
##                              gene_name        biotype
##                            <character>    <character>
##   ENSMUSG00000090025_MEX_1     Gm16088     pseudogene
##   ENSMUSG00000064842_MEX_1          U6          snRNA
##   ENSMUSG00000051951_MEX_1        Xkr4 protein_coding
##   ENSMUSG00000051951_MEX_2        Xkr4 protein_coding
##   ENSMUSG00000051951_MEX_3        Xkr4 protein_coding
##                        ...         ...            ...
##   ENSMUSG00000091987_MEX_5      Gm3376 protein_coding
##   ENSMUSG00000091987_MEX_6      Gm3376 protein_coding
##   ENSMUSG00000091987_MEX_7      Gm3376 protein_coding
##   ENSMUSG00000091987_MEX_8      Gm3376 protein_coding
##   ENSMUSG00000090600_MEX_1      Gm3395 protein_coding
##   -------
##   seqinfo: 21 sequences from mm9 genome
# Load standard annotation based on gene body coordinates from RefSeq
if (.Platform$OS.type=="unix") {
    refGenes <- loadAnnotation(genome="mm9",refdb="refseq",level="gene",
        type="gene",db=myDb)
    refGenes
}
## Getting latest annotation on the fly for mm9 from refseq
## Retrieving genome information for mm9 from refseq
## Retrieving latest gene annotation for mm9 from refseq
## Loading required namespace: RMySQL
## Converting annotation to GenomicRanges object...
## 
## Attaching package: 'Biostrings'
## The following object is masked from 'package:base':
## 
##     strsplit
## Getting DNA sequences...
## Getting GC content...
## GRanges object with 20956 ranges and 4 metadata columns:
##                seqnames          ranges strand |      gene_id gc_content
##                   <Rle>       <IRanges>  <Rle> |  <character>  <numeric>
##   NM_001011874     chr1 3204562-3661579      - | NM_001011874      38.56
##      NM_011283     chr1 4333587-4350395      - |    NM_011283      38.73
##      NM_011441     chr1 4481008-4487435      - |    NM_011441      49.75
##   NM_001177658     chr1 4763278-4775807      - | NM_001177658      42.59
##      NM_008866     chr1 4797903-4836816      + |    NM_008866      40.99
##            ...      ...             ...    ... .          ...        ...
##      NM_148943     chrY   635403-796225      - |    NM_148943      36.39
##      NM_009571     chrY 1362122-1426357      - |    NM_009571      36.58
##      NM_011564     chrY 1918380-1919568      - |    NM_011564      55.17
##   NM_001177569     chrY 1976248-1976584      - | NM_001177569      49.55
##   NM_001166384     chrY 2156898-2168120      + | NM_001166384      37.13
##                  gene_name        biotype
##                <character>    <character>
##   NM_001011874        Xkr4 protein_coding
##      NM_011283         Rp1 protein_coding
##      NM_011441       Sox17 protein_coding
##   NM_001177658      Mrpl15 protein_coding
##      NM_008866      Lypla1 protein_coding
##            ...         ...            ...
##      NM_148943       Usp9y protein_coding
##      NM_009571        Zfy2 protein_coding
##      NM_011564         Sry protein_coding
##   NM_001177569      H2al2c protein_coding
##   NM_001166384        Rbmy protein_coding
##   -------
##   seqinfo: 21 sequences (21 circular) from mm9 genome

Or as a data frame if you prefer using asdf=TRUE. The data frame however does not contain metadata like Seqinfo to be used for any susequent validations:

# Load standard annotation based on gene body coordinates
genes <- loadAnnotation(genome="mm9",refdb="ensembl",level="gene",type="gene",
    db=myDb,asdf=TRUE)
head(genes)
##   chromosome   start     end            gene_id gc_content strand gene_name
## 1       chr1 3044314 3044814 ENSMUSG00000090025      48.30      +   Gm16088
## 2       chr1 3092097 3092206 ENSMUSG00000064842      36.36      +        U6
## 3       chr1 3195982 3661579 ENSMUSG00000051951      38.51      -      Xkr4
## 4       chr1 3456668 3503634 ENSMUSG00000089699      39.27      +    Gm1992
## 5       chr1 3668961 3669024 ENSMUSG00000088390      34.38      -        U7
## 6       chr1 3773818 3773879 ENSMUSG00000089420      38.71      -        U7
##          biotype
## 1     pseudogene
## 2          snRNA
## 3 protein_coding
## 4      antisense
## 5          snRNA
## 6          snRNA

3.4 Add a custom annotation

Apart from the supported organisms and databases, you can add a custom annotation. Such an annotation can be:

  • A non-supported organism (e.g. an insect or another mammal e.g. dog)
  • A modification or further curation you have done to existing/supported annotations
  • A supported organism but from a different source
  • Any other case where the provided annotations are not adequate

This can be achieved through the usage of GTF files, along with some simple metadata that you have to provide for proper import to the annotation database. This can be achieved through the usage of the buildCustomAnnotation function. Details on required metadata can be found in the function’s help page.

Important note: Please note that importing a custom genome annotation directly from UCSC (UCSC SQL database dumps) is not supported in Windows as the process involves using the genePredToGtf which is not available for Windows.

Let’s try a couple of exammples. The first one is a custom annotation for the Ebola virus from UCSC:

# Setup a temporary directory to download files etc.
customDir <- file.path(tempdir(),"test_custom")
dir.create(customDir)

# Convert from GenePred to GTF - Unix/Linux only!
if (.Platform$OS.type == "unix" && !grepl("^darwin",R.version$os)) {
    # Download data from UCSC
    goldenPath="http://hgdownload.cse.ucsc.edu/goldenPath/"
    # Gene annotation dump
    download.file(paste0(goldenPath,"eboVir3/database/ncbiGene.txt.gz"),
        file.path(customDir,"eboVir3_ncbiGene.txt.gz"))
    # Chromosome information
    download.file(paste0(goldenPath,"eboVir3/database/chromInfo.txt.gz"),
        file.path(customDir,"eboVir3_chromInfo.txt.gz"))

    # Prepare the build
    chromInfo <- read.delim(file.path(customDir,"eboVir3_chromInfo.txt.gz"),
        header=FALSE)
    chromInfo <- chromInfo[,1:2]
    rownames(chromInfo) <- as.character(chromInfo[,1])
    chromInfo <- chromInfo[,2,drop=FALSE]
    
    # Coversion from genePred to GTF
    genePredToGtf <- file.path(customDir,"genePredToGtf")
    if (!file.exists(genePredToGtf)) {
        download.file(
        "http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/genePredToGtf",
            genePredToGtf
        )
        system(paste("chmod 775",genePredToGtf))
    }
    gtfFile <- file.path(customDir,"eboVir3.gtf")
    tmpName <- file.path(customDir,paste(format(Sys.time(),"%Y%m%d%H%M%S"),
        "tgtf",sep="."))
    command <- paste0(
        "zcat ",file.path(customDir,"eboVir3_ncbiGene.txt.gz"),
        " | ","cut -f2- | ",genePredToGtf," file stdin ",tmpName,
        " -source=eboVir3"," -utr && grep -vP '\t\\.\t\\.\t' ",tmpName," > ",
        gtfFile
    )
    system(command)

    # Build with the metadata list filled (you can also provide a version)
    buildCustomAnnotation(
        gtfFile=gtfFile,
        metadata=list(
            organism="eboVir3_test",
            source="ucsc_test",
            chromInfo=chromInfo
        ),
        db=myDb
    )

    # Try to retrieve some data
    eboGenes <- loadAnnotation(genome="eboVir3_test",refdb="ucsc_test",
        level="gene",type="gene",db=myDb)
    eboGenes
}
## Opening metaseqR2 SQLite database /tmp/Rtmpu5V6f6/test_anndb/testann.sqlite
##   Importing GTF /tmp/Rtmpu5V6f6/test_custom/eboVir3.gtf as GTF to make id map
##   Making id map
##   Importing GTF /tmp/Rtmpu5V6f6/test_custom/eboVir3.gtf as TxDb
## Import genomic features from the file as a GRanges object ... OK
## Prepare the 'metadata' data frame ... OK
## Make the TxDb object ... OK
## Retrieving gene annotation for eboVir3_test from ucsc_test version 20200430 from /tmp/Rtmpu5V6f6/test_custom/eboVir3.gtf
## Retrieving transcript annotation for eboVir3_test from ucsc_test version 20200430
## Retrieving summarized transcript annotation for eboVir3_test from ucsc_test version 20200430
## Retrieving 3' UTR annotation for eboVir3_test from ucsc_test version 20200430
## Retrieving summarized 3' UTR annotation per gene for eboVir3_test from ucsc_test version 20200430
##   summarizing UTRs per gene for imported GTF
## Retrieving summarized 3' UTR annotation per transcript for eboVir3_test from ucsc_test version 20200430
##   summarizing UTRs per gene for imported GTF
## Retrieving exon annotation for eboVir3_test from ucsc_test version 20200430
## Retrieving summarized exon annotation for eboVir3_test from ucsc_test version 20200430
##   summarizing exons per gene for imported GTF
## GRanges object with 9 ranges and 4 metadata columns:
##          seqnames      ranges strand |     gene_id gc_content   gene_name
##             <Rle>   <IRanges>  <Rle> | <character>  <numeric> <character>
##     NP KM034562v1     56-3026      + |          NP         50          NP
##   VP35 KM034562v1   3032-4407      + |        VP35         50        VP35
##   VP40 KM034562v1   4390-5894      + |        VP40         50        VP40
##     GP KM034562v1   5900-8305      + |          GP         50          GP
##    sGP KM034562v1   5900-8305      + |         sGP         50         sGP
##   ssGP KM034562v1   5900-8305      + |        ssGP         50        ssGP
##   VP30 KM034562v1   8288-9740      + |        VP30         50        VP30
##   VP24 KM034562v1  9885-11518      + |        VP24         50        VP24
##      L KM034562v1 11501-18282      + |           L         50           L
##            biotype
##        <character>
##     NP        gene
##   VP35        gene
##   VP40        gene
##     GP        gene
##    sGP        gene
##   ssGP        gene
##   VP30        gene
##   VP24        gene
##      L        gene
##   -------
##   seqinfo: 1 sequence from eboVir3_test genome

Another example, the Atlantic cod from UCSC. The same things apply for the operating system.

if (.Platform$OS.type == "unix") {
    # Gene annotation dump
    download.file(paste0(goldenPath,"gadMor1/database/augustusGene.txt.gz"),
        file.path(customDir,"gadMori1_augustusGene.txt.gz"))
    # Chromosome information
    download.file(paste(goldenPath,"gadMor1/database/chromInfo.txt.gz",sep=""),
        file.path(customDir,"gadMori1_chromInfo.txt.gz"))

    # Prepare the build
    chromInfo <- read.delim(file.path(customDir,"gadMori1_chromInfo.txt.gz"),
        header=FALSE)
    chromInfo <- chromInfo[,1:2]
    rownames(chromInfo) <- as.character(chromInfo[,1])
    chromInfo <- chromInfo[,2,drop=FALSE]
    
    # Coversion from genePred to GTF
    genePredToGtf <- file.path(customDir,"genePredToGtf")
    if (!file.exists(genePredToGtf)) {
        download.file(
        "http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/genePredToGtf",
            genePredToGtf
        )
        system(paste("chmod 775",genePredToGtf))
    }
    gtfFile <- file.path(customDir,"gadMori1.gtf")
    tmpName <- file.path(customDir,paste(format(Sys.time(),"%Y%m%d%H%M%S"),
        "tgtf",sep="."))
    command <- paste0(
        "zcat ",file.path(customDir,"gadMori1_augustusGene.txt.gz"),
        " | ","cut -f2- | ",genePredToGtf," file stdin ",tmpName,
        " -source=gadMori1"," -utr && grep -vP '\t\\.\t\\.\t' ",tmpName," > ",
        gtfFile
    )
    system(command)

    # Build with the metadata list filled (you can also provide a version)
    buildCustomAnnotation(
        gtfFile=gtfFile,
        metadata=list(
            organism="gadMor1_test",
            source="ucsc_test",
            chromInfo=chromInfo
        ),
        db=myDb
    )

    # Try to retrieve some data
    gadGenes <- loadAnnotation(genome="gadMor1_test",refdb="ucsc_test",
        level="gene",type="gene",db=myDb)
    gadGenes
}

Another example, Armadillo from Ensembl. This should work irrespectively of operating system. We are downloading chromosomal information from UCSC.

# Gene annotation dump from Ensembl
download.file(paste0("ftp://ftp.ensembl.org/pub/release-98/gtf/",
    "dasypus_novemcinctus/Dasypus_novemcinctus.Dasnov3.0.98.gtf.gz"),
    file.path(customDir,"Dasypus_novemcinctus.Dasnov3.0.98.gtf.gz"))

# Chromosome information will be provided from the following BAM file
# available from Ensembl. We have noticed that when using Windows as the OS,
# a remote BAM files cannot be opened by scanBamParam, so for this example,
# chromosome length information will not be available when running in Windows.
bamForInfo <- NULL
if (.Platform$OS.type == "unix")
    bamForInfo <- paste0("ftp://ftp.ensembl.org/pub/release-98/bamcov/",
        "dasypus_novemcinctus/genebuild/Dasnov3.broad.Ascending_Colon_5.1.bam")

# Build with the metadata list filled (you can also provide a version)
buildCustomAnnotation(
    gtfFile=file.path(customDir,"Dasypus_novemcinctus.Dasnov3.0.98.gtf.gz"),
    metadata=list(
        organism="dasNov3_test",
        source="ensembl_test",
        chromInfo=bamForInfo
    ),
    db=myDb
)
## Opening metaseqR2 SQLite database /tmp/Rtmpu5V6f6/test_anndb/testann.sqlite
##   Importing GTF /tmp/Rtmpu5V6f6/test_custom/Dasypus_novemcinctus.Dasnov3.0.98.gtf.gz as GTF to make id map
##   Making id map
##   Importing GTF /tmp/Rtmpu5V6f6/test_custom/Dasypus_novemcinctus.Dasnov3.0.98.gtf.gz as TxDb
## Import genomic features from the file as a GRanges object ... OK
## Prepare the 'metadata' data frame ... OK
## Make the TxDb object ... OK
## Retrieving gene annotation for dasNov3_test from ensembl_test version 20200430 from /tmp/Rtmpu5V6f6/test_custom/Dasypus_novemcinctus.Dasnov3.0.98.gtf.gz
## Retrieving transcript annotation for dasNov3_test from ensembl_test version 20200430
## Retrieving summarized transcript annotation for dasNov3_test from ensembl_test version 20200430
## Retrieving 3' UTR annotation for dasNov3_test from ensembl_test version 20200430
## Retrieving summarized 3' UTR annotation per gene for dasNov3_test from ensembl_test version 20200430
##   summarizing UTRs per gene for imported GTF
## Retrieving summarized 3' UTR annotation per transcript for dasNov3_test from ensembl_test version 20200430
##   summarizing UTRs per gene for imported GTF
## Retrieving exon annotation for dasNov3_test from ensembl_test version 20200430
## Retrieving summarized exon annotation for dasNov3_test from ensembl_test version 20200430
##   summarizing exons per gene for imported GTF
# Try to retrieve some data
dasGenes <- loadAnnotation(genome="dasNov3_test",refdb="ensembl_test",
    level="gene",type="gene",db=myDb)
dasGenes
## GRanges object with 33374 ranges and 4 metadata columns:
##                            seqnames      ranges strand |            gene_id
##                               <Rle>   <IRanges>  <Rle> |        <character>
##   ENSDNOG00000040597 AAGV03000462.1    666-1525      + | ENSDNOG00000040597
##   ENSDNOG00000040886 AAGV03001531.1      3-1269      - | ENSDNOG00000040886
##   ENSDNOG00000049893 AAGV03002204.1     21-1190      - | ENSDNOG00000049893
##   ENSDNOG00000042161 AAGV03003272.1      56-996      - | ENSDNOG00000042161
##   ENSDNOG00000037889 AAGV03003599.1      30-296      + | ENSDNOG00000037889
##                  ...            ...         ...    ... .                ...
##   ENSDNOG00000040274    NC_001821.1 13570-14097      - | ENSDNOG00000040274
##   ENSDNOG00000033793    NC_001821.1 14100-14168      - | ENSDNOG00000033793
##   ENSDNOG00000048587    NC_001821.1 14171-15310      + | ENSDNOG00000048587
##   ENSDNOG00000037257    NC_001821.1 15312-15379      + | ENSDNOG00000037257
##   ENSDNOG00000041926    NC_001821.1 15382-15452      - | ENSDNOG00000041926
##                      gc_content          gene_name        biotype
##                       <numeric>        <character>    <character>
##   ENSDNOG00000040597         50 ENSDNOG00000040597 protein_coding
##   ENSDNOG00000040886         50 ENSDNOG00000040886 protein_coding
##   ENSDNOG00000049893         50 ENSDNOG00000049893        lincRNA
##   ENSDNOG00000042161         50 ENSDNOG00000042161 protein_coding
##   ENSDNOG00000037889         50 ENSDNOG00000037889     pseudogene
##                  ...        ...                ...            ...
##   ENSDNOG00000040274         50                ND6 protein_coding
##   ENSDNOG00000033793         50 ENSDNOG00000033793        Mt_tRNA
##   ENSDNOG00000048587         50               CYTB protein_coding
##   ENSDNOG00000037257         50 ENSDNOG00000037257        Mt_tRNA
##   ENSDNOG00000041926         50 ENSDNOG00000041926        Mt_tRNA
##   -------
##   seqinfo: 4823 sequences from dasNov3_test genome

3.5 A complete build

A quite complete build (with latest versions of Ensembl annotations) would look like (supposing the default annotation database location):

organisms <- list(
    hg18=67,
    hg19=75,
    hg38=97:98,
    mm9=67,
    mm10=97:98,
    rn5=79,
    rn6=97:98,
    dm3=78,
    dm6=97:98,
    danrer7=79,
    danrer10=91,
    danrer11=97:98,
    pantro4=90,
    pantro5=97:98,
    susscr3=89,
    susscr11=97:98,
    equcab2=97:98
)

sources <- c("ensembl","ucsc","refseq")

buildAnnotationDatabase(organisms,sources,forceDownload=FALSE,rc=0.5)

The aforementioned complete built can be found here Complete builts will become available from time to time (e.g. with every new Ensembl relrase) for users who do not wish to create annotation databases on their own. Root access may be required (depending on the metaseqR2 library location) to place it in the default location where it can be found automatically.

4 Annotations on-the-fly

If for some reason you do not want to build and use an annotation database for metaseqR2 analyses (not recommended) or you wish to perform an analysis with an organism that does not yet exist in the database, the loadAnnotation function will perform all required actions (download and create a GRanges object) on-the-fly as long as there is an internet connection.

However, the above function does not handle custom annotations in GTF files. In a scenario where you want to use a custom annotation only once, you should supply the annotation argument to the metaseqr2 function, which is almost the same as the metadata argument used in buildCustomAnnotation, actually augmented by a list member for the GTF file, that is:

annotation <- list(
    gtf="PATH_TO_GTF",
    organism="ORGANISM_NAME",
    source="SOURCE_NAME",
    chromInfo="CHROM_INFO"
)

The above argument can be passed to the metaseqr2 call in the respective position.

For further details about custom annotations on the fly, please check buildCustomAnnotation and importCustomAnnotation functions.

5 Session Info

sessionInfo()
## R version 4.0.0 (2020-04-24)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.4 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.11-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.11-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
##  [1] splines   parallel  stats4    stats     graphics  grDevices utils    
##  [8] datasets  methods   base     
## 
## other attached packages:
##  [1] BSgenome.Mmusculus.UCSC.mm9_1.4.0 BSgenome_1.56.0                  
##  [3] rtracklayer_1.48.0                Biostrings_2.56.0                
##  [5] XVector_0.28.0                    metaseqR2_1.0.11                 
##  [7] locfit_1.5-9.4                    limma_3.44.1                     
##  [9] DESeq2_1.28.0                     SummarizedExperiment_1.18.0      
## [11] DelayedArray_0.14.0               matrixStats_0.56.0               
## [13] Biobase_2.48.0                    GenomicRanges_1.40.0             
## [15] GenomeInfoDb_1.24.0               IRanges_2.22.1                   
## [17] S4Vectors_0.26.0                  BiocGenerics_0.34.0              
## [19] BiocStyle_2.16.0                 
## 
## loaded via a namespace (and not attached):
##   [1] R.utils_2.9.2             tidyselect_1.0.0         
##   [3] heatmaply_1.1.0           RSQLite_2.2.0            
##   [5] AnnotationDbi_1.50.0      htmlwidgets_1.5.1        
##   [7] grid_4.0.0                TSP_1.1-10               
##   [9] BiocParallel_1.22.0       baySeq_2.22.0            
##  [11] DESeq_1.40.0              munsell_0.5.0            
##  [13] codetools_0.2-16          preprocessCore_1.50.0    
##  [15] DT_0.13                   colorspace_1.4-1         
##  [17] knitr_1.28                GenomeInfoDbData_1.2.3   
##  [19] hwriter_1.3.2             bit64_0.9-7              
##  [21] rhdf5_2.32.0              vctrs_0.2.4              
##  [23] lambda.r_1.2.4            xfun_0.13                
##  [25] BiocFileCache_1.12.0      EDASeq_2.22.0            
##  [27] gclus_1.3.2               R6_2.4.1                 
##  [29] rmdformats_0.3.7          seriation_1.2-8          
##  [31] NBPSeq_0.3.0              bitops_1.0-6             
##  [33] assertthat_0.2.1          scales_1.1.0             
##  [35] bsseq_1.24.0              gtable_0.3.0             
##  [37] affy_1.66.0               log4r_0.3.2              
##  [39] rlang_0.4.5               genefilter_1.70.0        
##  [41] lazyeval_0.2.2            DSS_2.36.0               
##  [43] BiocManager_1.30.10       yaml_2.2.1               
##  [45] reshape2_1.4.4            abind_1.4-5              
##  [47] GenomicFeatures_1.40.0    qvalue_2.20.0            
##  [49] RMySQL_0.10.20            tools_4.0.0              
##  [51] lava_1.6.7                bookdown_0.18            
##  [53] ggplot2_3.3.0             affyio_1.58.0            
##  [55] ellipsis_0.3.0            gplots_3.0.3             
##  [57] RColorBrewer_1.1-2        Rcpp_1.0.4.6             
##  [59] plyr_1.8.6                progress_1.2.2           
##  [61] zlibbioc_1.34.0           purrr_0.3.4              
##  [63] RCurl_1.98-1.2            prettyunits_1.1.1        
##  [65] openssl_1.4.1             viridis_0.5.1            
##  [67] zoo_1.8-7                 cluster_2.1.0            
##  [69] magrittr_1.5              data.table_1.12.8        
##  [71] futile.options_1.0.1      survcomp_1.38.0          
##  [73] aroma.light_3.18.0        hms_0.5.3                
##  [75] evaluate_0.14             xtable_1.8-4             
##  [77] XML_3.99-0.3              VennDiagram_1.6.20       
##  [79] jpeg_0.1-8.1              gridExtra_2.3            
##  [81] compiler_4.0.0            biomaRt_2.44.0           
##  [83] tibble_3.0.1              KernSmooth_2.23-17       
##  [85] crayon_1.3.4              R.oo_1.23.0              
##  [87] htmltools_0.4.0           tidyr_1.0.2              
##  [89] geneplotter_1.66.0        DBI_1.1.0                
##  [91] SuppDists_1.1-9.5         formatR_1.7              
##  [93] corrplot_0.84             dbplyr_1.4.3             
##  [95] MASS_7.3-51.6             rappdirs_0.3.1           
##  [97] ShortRead_1.46.0          Matrix_1.2-18            
##  [99] rmeta_3.0                 permute_0.9-5            
## [101] vsn_3.56.0                R.methodsS3_1.8.0        
## [103] gdata_2.18.0              pkgconfig_2.0.3          
## [105] GenomicAlignments_1.24.0  registry_0.5-1           
## [107] plotly_4.9.2.1            foreach_1.5.0            
## [109] annotate_1.66.0           webshot_0.5.2            
## [111] prodlim_2019.11.13        stringr_1.4.0            
## [113] digest_0.6.25             rmarkdown_2.1            
## [115] dendextend_1.13.4         edgeR_3.30.0             
## [117] DelayedMatrixStats_1.10.0 curl_4.3                 
## [119] Rsamtools_2.4.0           gtools_3.8.2             
## [121] NOISeq_2.32.0             lifecycle_0.2.0          
## [123] jsonlite_1.6.1            ABSSeq_1.42.0            
## [125] Rhdf5lib_1.10.0           survivalROC_1.0.3        
## [127] futile.logger_1.4.3       viridisLite_0.3.0        
## [129] askpass_1.1               pillar_1.4.3             
## [131] lattice_0.20-41           httr_1.4.1               
## [133] survival_3.1-12           glue_1.4.0               
## [135] png_0.1-7                 iterators_1.0.12         
## [137] pander_0.6.3              bit_1.1-15.2             
## [139] stringi_1.4.6             HDF5Array_1.16.0         
## [141] bootstrap_2019.6          blob_1.2.1               
## [143] latticeExtra_0.6-29       caTools_1.18.0           
## [145] memoise_1.1.0             dplyr_0.8.5