ensembldb
annotation packages to retrieve specific annotationsEnsDb
packages with UCSC based annotationsshiny
web appensembldb
and Gviz
EnsDb
objects in the AnnotationDbi
frameworkPackage: ensembldb
Authors: Johannes Rainer johannes.rainer@eurac.edu, Tim Triche tim.triche@usc.edu
Modified: 12 September, 2016
Compiled: Wed Nov 16 19:52:05 2016
The ensembldb
package provides functions to create and use transcript centric annotation databases/packages. The annotation for the databases are directly fetched from Ensembl 1 using their Perl API. The functionality and data is similar to that of the TxDb
packages from the GenomicFeatures
package, but, in addition to retrieve all gene/transcript models and annotations from the database, the ensembldb
package provides also a filter framework allowing to retrieve annotations for specific entries like genes encoded on a chromosome region or transcript models of lincRNA genes. In the databases, along with the gene and transcript models and their chromosomal coordinates, additional annotations including the gene name (symbol) and NCBI Entrezgene identifiers as well as the gene and transcript biotypes are stored too (see Section 11 for the database layout and an overview of available attributes/columns).
Another main goal of this package is to generate versioned annotation packages, i.e. annotation packages that are build for a specific Ensembl release, and are also named according to that (e.g. EnsDb.Hsapiens.v75
for human gene definitions of the Ensembl code database version 75). This ensures reproducibility, as it allows to load annotations from a specific Ensembl release also if newer versions of annotation packages/releases are available. It also allows to load multiple annotation packages at the same time in order to e.g. compare gene models between Ensembl releases.
In the example below we load an Ensembl based annotation package for Homo sapiens, Ensembl version 75. The connection to the database is bound to the variable EnsDb.Hsapiens.v75
.
library(EnsDb.Hsapiens.v75)
## Making a "short cut"
edb <- EnsDb.Hsapiens.v75
## print some informations for this package
edb
## EnsDb for Ensembl:
## |Backend: SQLite
## |Db type: EnsDb
## |Type of Gene ID: Ensembl Gene ID
## |Supporting package: ensembldb
## |Db created by: ensembldb package from Bioconductor
## |script_version: 0.1.3
## |Creation time: Thu Sep 15 13:16:58 2016
## |ensembl_version: 75
## |ensembl_host: localhost
## |Organism: homo_sapiens
## |genome_build: GRCh37
## |DBSCHEMAVERSION: 1.0
## | No. of genes: 64102.
## | No. of transcripts: 215647.
## for what organism was the database generated?
organism(edb)
## [1] "Homo sapiens"
ensembldb
annotation packages to retrieve specific annotationsThe ensembldb
package provides a set of filter objects allowing to specify which entries should be fetched from the database. The complete list of filters, which can be used individually or can be combined, is shown below (in alphabetical order):
ExonidFilter
: allows to filter the result based on the (Ensembl) exon identifiers.ExonrankFilter
: filter results on the rank (index) of an exon within the transcript model. Exons are always numbered from 5’ to 3’ end of the transcript, thus, also on the reverse strand, the exon 1 is the most 5’ exon of the transcript.EntrezidFilter
: allows to filter results based on NCBI Entrezgene identifiers of the genes.GenebiotypeFilter
: allows to filter for the gene biotypes defined in the Ensembl database; use the listGenebiotypes
method to list all available biotypes.GeneidFilter
: allows to filter based on the Ensembl gene IDs.GenenameFilter
: allows to filter based on the names (symbols) of the genes.SymbolFilter
: allows to filter on gene symbols; note that no database columns symbol is available in an EnsDb
database and hence the gene name is used for filtering.GRangesFilter
: allows to retrieve all features (genes, transcripts or exons) that are either within (setting condition
to “within”) or partially overlapping (setting condition
to “overlapping”) the defined genomic region/range. Note that, depending on the called method (genes
, transcripts
or exons
) the start and end coordinates of either the genes, transcripts or exons are used for the filter. For methods exonsBy
, cdsBy
and txBy
the coordinates of by
are used.SeqendFilter
: filter based on the chromosomal end coordinate of the exons, transcripts or genes (correspondingly set =feature = “exon”=, =feature = “tx”= or =feature = “gene”=).SeqnameFilter
: filter by the name of the chromosomes the genes are encoded on.SeqstartFilter
: filter based on the chromosomal start coordinates of the exons, transcripts or genes (correspondingly set =feature = “exon”=, =feature = “tx”= or =feature = “gene”=).SeqstrandFilter
: filter for the chromosome strand on which the genes are encoded.TxbiotypeFilter
: filter on the transcript biotype defined in Ensembl; use the listTxbiotypes
method to list all available biotypes.TxidFilter
: filter on the Ensembl transcript identifiers.Each of the filter classes can take a single value or a vector of values (with the exception of the SeqendFilter
and SeqstartFilter
) for comparison. In addition, it is possible to specify the condition for the filter, e.g. setting condition
to = to retrieve all entries matching the filter value, to != to negate the filter or setting condition = "like"= to allow partial matching. The =condition
parameter for SeqendFilter
and SeqendFilter
can take the values = , >, >=, < and <= (since these filters base on numeric values).
A simple example would be to get all transcripts for the gene BCL2L11. To this end we specify a GenenameFilter
with the value BCL2L11. As a result we get a GRanges
object with start
, end
, strand
and seqname
of the GRanges
object being the start coordinate, end coordinate, chromosome name and strand for the respective transcripts. All additional annotations are available as metadata columns. Alternatively, by setting return.type
to “DataFrame”, or “data.frame” the method would return a DataFrame
or data.frame
object.
Tx <- transcripts(edb, filter = list(GenenameFilter("BCL2L11")))
Tx
## GRanges object with 17 ranges and 7 metadata columns:
## seqnames ranges strand | tx_id
## <Rle> <IRanges> <Rle> | <character>
## ENST00000432179 2 [111876955, 111881689] + | ENST00000432179
## ENST00000308659 2 [111878491, 111922625] + | ENST00000308659
## ENST00000357757 2 [111878491, 111919016] + | ENST00000357757
## ENST00000393253 2 [111878491, 111909428] + | ENST00000393253
## ENST00000337565 2 [111878491, 111886423] + | ENST00000337565
## ... ... ... ... . ...
## ENST00000452231 2 [111881323, 111921808] + | ENST00000452231
## ENST00000361493 2 [111881323, 111921808] + | ENST00000361493
## ENST00000431217 2 [111881323, 111921929] + | ENST00000431217
## ENST00000439718 2 [111881323, 111922220] + | ENST00000439718
## ENST00000438054 2 [111881329, 111903861] + | ENST00000438054
## tx_biotype tx_cds_seq_start tx_cds_seq_end
## <character> <integer> <integer>
## ENST00000432179 protein_coding 111881323 111881689
## ENST00000308659 protein_coding 111881323 111921808
## ENST00000357757 protein_coding 111881323 111919016
## ENST00000393253 protein_coding 111881323 111909428
## ENST00000337565 protein_coding 111881323 111886328
## ... ... ... ...
## ENST00000452231 nonsense_mediated_decay 111881323 111919016
## ENST00000361493 nonsense_mediated_decay 111881323 111887812
## ENST00000431217 nonsense_mediated_decay 111881323 111902078
## ENST00000439718 nonsense_mediated_decay 111881323 111909428
## ENST00000438054 protein_coding 111881329 111902068
## gene_id tx_name gene_name
## <character> <character> <character>
## ENST00000432179 ENSG00000153094 ENST00000432179 BCL2L11
## ENST00000308659 ENSG00000153094 ENST00000308659 BCL2L11
## ENST00000357757 ENSG00000153094 ENST00000357757 BCL2L11
## ENST00000393253 ENSG00000153094 ENST00000393253 BCL2L11
## ENST00000337565 ENSG00000153094 ENST00000337565 BCL2L11
## ... ... ... ...
## ENST00000452231 ENSG00000153094 ENST00000452231 BCL2L11
## ENST00000361493 ENSG00000153094 ENST00000361493 BCL2L11
## ENST00000431217 ENSG00000153094 ENST00000431217 BCL2L11
## ENST00000439718 ENSG00000153094 ENST00000439718 BCL2L11
## ENST00000438054 ENSG00000153094 ENST00000438054 BCL2L11
## -------
## seqinfo: 1 sequence from GRCh37 genome
## as this is a GRanges object we can access e.g. the start coordinates with
head(start(Tx))
## [1] 111876955 111878491 111878491 111878491 111878491 111878506
## or extract the biotype with
head(Tx$tx_biotype)
## [1] "protein_coding" "protein_coding" "protein_coding" "protein_coding"
## [5] "protein_coding" "protein_coding"
The parameter columns
of the exons
, genes
and transcripts
method allows to specify which database attributes (columns) should be retrieved. The exons
method returns by default all exon-related columns, the transcripts
all columns from the transcript database table and the genes
all from the gene table. Note however that in the example above we got also a column gene_name
although this column is not present in the transcript database table. By default the methods return also all columns that are used by any of the filters submitted with the filter
argument (thus, because a GenenameFilter
was used, the column gene_name
is also returned). Setting returnFilterColumns(edb) <- FALSE
disables this option and only the columns specified by the columns
parameter are retrieved.
To get an overview of database tables and available columns the function listTables
can be used. The method listColumns
on the other hand lists columns for the specified database table.
## list all database tables along with their columns
listTables(edb)
## $gene
## [1] "gene_id" "gene_name" "entrezid"
## [4] "gene_biotype" "gene_seq_start" "gene_seq_end"
## [7] "seq_name" "seq_strand" "seq_coord_system"
## [10] "symbol"
##
## $tx
## [1] "tx_id" "tx_biotype" "tx_seq_start"
## [4] "tx_seq_end" "tx_cds_seq_start" "tx_cds_seq_end"
## [7] "gene_id" "tx_name"
##
## $tx2exon
## [1] "tx_id" "exon_id" "exon_idx"
##
## $exon
## [1] "exon_id" "exon_seq_start" "exon_seq_end"
##
## $chromosome
## [1] "seq_name" "seq_length" "is_circular"
##
## $metadata
## [1] "name" "value"
## list columns from a specific table
listColumns(edb, "tx")
## [1] "tx_id" "tx_biotype" "tx_seq_start"
## [4] "tx_seq_end" "tx_cds_seq_start" "tx_cds_seq_end"
## [7] "gene_id" "tx_name"
Thus, we could retrieve all transcripts of the biotype nonsense_mediated_decay (which, according to the definitions by Ensembl are transcribed, but most likely not translated in a protein, but rather degraded after transcription) along with the name of the gene for each transcript. Note that we are changing here the return.type
to DataFrame
, so the method will return a DataFrame
with the results instead of the default GRanges
.
Tx <- transcripts(edb,
columns = c(listColumns(edb , "tx"), "gene_name"),
filter = TxbiotypeFilter("nonsense_mediated_decay"),
return.type = "DataFrame")
nrow(Tx)
## [1] 13812
Tx
## DataFrame with 13812 rows and 9 columns
## tx_id tx_biotype tx_seq_start tx_seq_end
## <character> <character> <integer> <integer>
## 1 ENST00000495251 nonsense_mediated_decay 64085 69409
## 2 ENST00000462860 nonsense_mediated_decay 64085 69452
## 3 ENST00000483390 nonsense_mediated_decay 65739 68764
## 4 ENST00000538848 nonsense_mediated_decay 66411 68843
## 5 ENST00000567466 nonsense_mediated_decay 97578 99521
## ... ... ... ... ...
## 13808 ENST00000496411 nonsense_mediated_decay 249149927 249153217
## 13809 ENST00000483223 nonsense_mediated_decay 249150714 249152728
## 13810 ENST00000533647 nonsense_mediated_decay 249151472 249152523
## 13811 ENST00000528141 nonsense_mediated_decay 249151590 249153284
## 13812 ENST00000530986 nonsense_mediated_decay 249151668 249153284
## tx_cds_seq_start tx_cds_seq_end gene_id tx_name
## <integer> <integer> <character> <character>
## 1 68052 68789 ENSG00000234769 ENST00000495251
## 2 68052 68789 ENSG00000234769 ENST00000462860
## 3 66428 68764 ENSG00000234769 ENST00000483390
## 4 67418 68789 ENSG00000234769 ENST00000538848
## 5 98546 98893 ENSG00000261456 ENST00000567466
## ... ... ... ... ...
## 13808 249152153 249152508 ENSG00000171163 ENST00000496411
## 13809 249152153 249152508 ENSG00000171163 ENST00000483223
## 13810 249152153 249152508 ENSG00000171163 ENST00000533647
## 13811 249152203 249152508 ENSG00000171163 ENST00000528141
## 13812 249152203 249152508 ENSG00000171163 ENST00000530986
## gene_name
## <character>
## 1 WASH4P
## 2 WASH4P
## 3 WASH4P
## 4 WASH4P
## 5 TUBB8
## ... ...
## 13808 ZNF692
## 13809 ZNF692
## 13810 ZNF692
## 13811 ZNF692
## 13812 ZNF692
For protein coding transcripts, we can also specifically extract their coding region. In the example below we extract the CDS for all transcripts encoded on chromosome Y.
yCds <- cdsBy(edb, filter = SeqnameFilter("Y"))
yCds
## GRangesList object of length 160:
## $ENST00000155093
## GRanges object with 7 ranges and 3 metadata columns:
## seqnames ranges strand | seq_name exon_id
## <Rle> <IRanges> <Rle> | <character> <character>
## [1] Y [2821978, 2822038] + | Y ENSE00002223884
## [2] Y [2829115, 2829687] + | Y ENSE00003645989
## [3] Y [2843136, 2843285] + | Y ENSE00003548678
## [4] Y [2843552, 2843695] + | Y ENSE00003611496
## [5] Y [2844711, 2844863] + | Y ENSE00001649504
## [6] Y [2845981, 2846121] + | Y ENSE00001777381
## [7] Y [2846851, 2848034] + | Y ENSE00001368923
## exon_rank
## <integer>
## [1] 2
## [2] 3
## [3] 4
## [4] 5
## [5] 6
## [6] 7
## [7] 8
##
## $ENST00000215473
## GRanges object with 6 ranges and 3 metadata columns:
## seqnames ranges strand | seq_name exon_id
## [1] Y [4924865, 4925500] + | Y ENSE00001436852
## [2] Y [4966256, 4968748] + | Y ENSE00001640924
## [3] Y [5369098, 5369296] + | Y ENSE00001803775
## [4] Y [5483308, 5483316] + | Y ENSE00001731866
## [5] Y [5491131, 5491145] + | Y ENSE00001711324
## [6] Y [5605313, 5605983] + | Y ENSE00001779807
## exon_rank
## [1] 1
## [2] 2
## [3] 3
## [4] 4
## [5] 5
## [6] 6
##
## $ENST00000215479
## GRanges object with 5 ranges and 3 metadata columns:
## seqnames ranges strand | seq_name exon_id
## [1] Y [6740596, 6740649] - | Y ENSE00001671586
## [2] Y [6738047, 6738094] - | Y ENSE00001645681
## [3] Y [6736773, 6736817] - | Y ENSE00000652250
## [4] Y [6736078, 6736503] - | Y ENSE00001667251
## [5] Y [6734114, 6734119] - | Y ENSE00001494454
## exon_rank
## [1] 2
## [2] 3
## [3] 4
## [4] 5
## [5] 6
##
## ...
## <157 more elements>
## -------
## seqinfo: 1 sequence from GRCh37 genome
Using a GRangesFilter
we can retrieve all features from the database that are either within or overlapping the specified genomic region. In the example below we query all genes that are partially overlapping with a small region on chromosome 11. The filter restricts to all genes for which either an exon or an intron is partially overlapping with the region.
## Define the filter
grf <- GRangesFilter(GRanges("11", ranges = IRanges(114000000, 114000050),
strand = "+"), condition = "overlapping")
## Query genes:
gn <- genes(edb, filter = grf)
gn
## GRanges object with 1 range and 6 metadata columns:
## seqnames ranges strand | gene_id
## <Rle> <IRanges> <Rle> | <character>
## ENSG00000109906 11 [113930315, 114121398] + | ENSG00000109906
## gene_name entrezid gene_biotype seq_coord_system
## <character> <character> <character> <character>
## ENSG00000109906 ZBTB16 7704 protein_coding chromosome
## symbol
## <character>
## ENSG00000109906 ZBTB16
## -------
## seqinfo: 1 sequence from GRCh37 genome
## Next we retrieve all transcripts for that gene so that we can plot them.
txs <- transcripts(edb, filter = GenenameFilter(gn$gene_name))
plot(3, 3, pch = NA, xlim = c(start(gn), end(gn)), ylim = c(0, length(txs)),
yaxt = "n", ylab = "")
## Highlight the GRangesFilter region
rect(xleft = start(grf), xright = end(grf), ybottom = 0, ytop = length(txs),
col = "red", border = "red")
for(i in 1:length(txs)) {
current <- txs[i]
rect(xleft = start(current), xright = end(current), ybottom = i-0.975,
ytop = i-0.125, border = "grey")
text(start(current), y = i-0.5, pos = 4, cex = 0.75, labels = current$tx_id)
}
As we can see, 4 transcripts of the gene ZBTB16 are also overlapping the region. Below we fetch these 4 transcripts. Note, that a call to exons
will not return any features from the database, as no exon is overlapping with the region.
transcripts(edb, filter = grf)
## GRanges object with 4 ranges and 6 metadata columns:
## seqnames ranges strand | tx_id
## <Rle> <IRanges> <Rle> | <character>
## ENST00000335953 11 [113930315, 114121398] + | ENST00000335953
## ENST00000541602 11 [113930447, 114060486] + | ENST00000541602
## ENST00000392996 11 [113931229, 114121374] + | ENST00000392996
## ENST00000539918 11 [113935134, 114118066] + | ENST00000539918
## tx_biotype tx_cds_seq_start tx_cds_seq_end
## <character> <integer> <integer>
## ENST00000335953 protein_coding 113934023 114121277
## ENST00000541602 retained_intron <NA> <NA>
## ENST00000392996 protein_coding 113934023 114121277
## ENST00000539918 nonsense_mediated_decay 113935134 113992549
## gene_id tx_name
## <character> <character>
## ENST00000335953 ENSG00000109906 ENST00000335953
## ENST00000541602 ENSG00000109906 ENST00000541602
## ENST00000392996 ENSG00000109906 ENST00000392996
## ENST00000539918 ENSG00000109906 ENST00000539918
## -------
## seqinfo: 1 sequence from GRCh37 genome
The GRangesFilter
supports also GRanges
defining multiple regions and a query will return all features overlapping any of these regions. Besides using the GRangesFilter
it is also possible to search for transcripts or exons overlapping genomic regions using the exonsByOverlaps
or transcriptsByOverlaps
known from the GenomicFeatures
package. Note that the implementation of these methods for EnsDb
objects supports also to use filters to further fine-tune the query.
To get an overview of allowed/available gene and transcript biotype the functions listGenebiotypes
and listTxbiotypes
can be used.
## Get all gene biotypes from the database. The GenebiotypeFilter
## allows to filter on these values.
listGenebiotypes(edb)
## [1] "protein_coding" "pseudogene"
## [3] "processed_transcript" "antisense"
## [5] "lincRNA" "polymorphic_pseudogene"
## [7] "IG_V_pseudogene" "IG_V_gene"
## [9] "sense_overlapping" "sense_intronic"
## [11] "TR_V_gene" "misc_RNA"
## [13] "snRNA" "miRNA"
## [15] "snoRNA" "rRNA"
## [17] "Mt_tRNA" "Mt_rRNA"
## [19] "IG_C_gene" "IG_J_gene"
## [21] "TR_J_gene" "TR_C_gene"
## [23] "TR_V_pseudogene" "TR_J_pseudogene"
## [25] "IG_D_gene" "IG_C_pseudogene"
## [27] "TR_D_gene" "IG_J_pseudogene"
## [29] "3prime_overlapping_ncrna" "processed_pseudogene"
## [31] "LRG_gene"
## Get all transcript biotypes from the database.
listTxbiotypes(edb)
## [1] "protein_coding"
## [2] "processed_transcript"
## [3] "retained_intron"
## [4] "nonsense_mediated_decay"
## [5] "unitary_pseudogene"
## [6] "non_stop_decay"
## [7] "unprocessed_pseudogene"
## [8] "processed_pseudogene"
## [9] "transcribed_unprocessed_pseudogene"
## [10] "antisense"
## [11] "lincRNA"
## [12] "polymorphic_pseudogene"
## [13] "transcribed_processed_pseudogene"
## [14] "miRNA"
## [15] "pseudogene"
## [16] "IG_V_pseudogene"
## [17] "snoRNA"
## [18] "IG_V_gene"
## [19] "sense_overlapping"
## [20] "sense_intronic"
## [21] "TR_V_gene"
## [22] "snRNA"
## [23] "misc_RNA"
## [24] "rRNA"
## [25] "Mt_tRNA"
## [26] "Mt_rRNA"
## [27] "IG_C_gene"
## [28] "IG_J_gene"
## [29] "TR_J_gene"
## [30] "TR_C_gene"
## [31] "TR_V_pseudogene"
## [32] "TR_J_pseudogene"
## [33] "IG_D_gene"
## [34] "IG_C_pseudogene"
## [35] "TR_D_gene"
## [36] "IG_J_pseudogene"
## [37] "3prime_overlapping_ncrna"
## [38] "translated_processed_pseudogene"
## [39] "LRG_gene"
Data can be fetched in an analogous way using the exons
and genes
methods. In the example below we retrieve gene_name
, entrezid
and the gene_biotype
of all genes in the database which names start with “BCL2”.
## We're going to fetch all genes which names start with BCL. To this end
## we define a GenenameFilter with partial matching, i.e. condition "like"
## and a % for any character/string.
BCLs <- genes(edb,
columns = c("gene_name", "entrezid", "gene_biotype"),
filter = list(GenenameFilter("BCL%", condition = "like")),
return.type = "DataFrame")
nrow(BCLs)
## [1] 25
BCLs
## DataFrame with 25 rows and 4 columns
## gene_name entrezid gene_biotype gene_id
## <character> <character> <character> <character>
## 1 BCL10 8915 protein_coding ENSG00000142867
## 2 BCL11A 53335 protein_coding ENSG00000119866
## 3 BCL11B 64919 protein_coding ENSG00000127152
## 4 BCL2 596 protein_coding ENSG00000171791
## 5 BCL2A1 597 protein_coding ENSG00000140379
## ... ... ... ... ...
## 21 BCL7C 9274 protein_coding ENSG00000099385
## 22 BCL9 607 protein_coding ENSG00000116128
## 23 BCL9 607 protein_coding ENSG00000266095
## 24 BCL9L 283149 protein_coding ENSG00000186174
## 25 BCLAF1 9774 protein_coding ENSG00000029363
Sometimes it might be useful to know the length of genes or transcripts (i.e. the total sum of nucleotides covered by their exons). Below we calculate the mean length of transcripts from protein coding genes on chromosomes X and Y as well as the average length of snoRNA, snRNA and rRNA transcripts encoded on these chromosomes.
## determine the average length of snRNA, snoRNA and rRNA genes encoded on
## chromosomes X and Y.
mean(lengthOf(edb, of = "tx",
filter = list(GenebiotypeFilter(c("snRNA", "snoRNA", "rRNA")),
SeqnameFilter(c("X", "Y")))))
## [1] 116.3046
## determine the average length of protein coding genes encoded on the same
## chromosomes.
mean(lengthOf(edb, of = "tx",
filter = list(GenebiotypeFilter("protein_coding"),
SeqnameFilter(c("X", "Y")))))
## [1] 1920
Not unexpectedly, transcripts of protein coding genes are longer than those of snRNA, snoRNA or rRNA genes.
At last we extract the first two exons of each transcript model from the database.
## Extract all exons 1 and (if present) 2 for all genes encoded on the
## Y chromosome
exons(edb, columns = c("tx_id", "exon_idx"),
filter = list(SeqnameFilter("Y"),
ExonrankFilter(3, condition = "<")))
## GRanges object with 1287 ranges and 3 metadata columns:
## seqnames ranges strand | tx_id
## <Rle> <IRanges> <Rle> | <character>
## ENSE00002088309 Y [2652790, 2652894] + | ENST00000516032
## ENSE00001494622 Y [2654896, 2655740] - | ENST00000383070
## ENSE00002323146 Y [2655049, 2655069] - | ENST00000525526
## ENSE00002201849 Y [2655075, 2655644] - | ENST00000525526
## ENSE00002214525 Y [2655145, 2655168] - | ENST00000534739
## ... ... ... ... . ...
## ENSE00001632993 Y [28737695, 28737748] - | ENST00000456738
## ENSE00001616687 Y [28772667, 28773306] - | ENST00000435741
## ENSE00001638296 Y [28779492, 28779578] - | ENST00000435945
## ENSE00001797328 Y [28780670, 28780799] - | ENST00000435945
## ENSE00001794473 Y [59001391, 59001635] + | ENST00000431853
## exon_idx exon_id
## <integer> <character>
## ENSE00002088309 1 ENSE00002088309
## ENSE00001494622 1 ENSE00001494622
## ENSE00002323146 2 ENSE00002323146
## ENSE00002201849 1 ENSE00002201849
## ENSE00002214525 2 ENSE00002214525
## ... ... ...
## ENSE00001632993 1 ENSE00001632993
## ENSE00001616687 1 ENSE00001616687
## ENSE00001638296 2 ENSE00001638296
## ENSE00001797328 1 ENSE00001797328
## ENSE00001794473 1 ENSE00001794473
## -------
## seqinfo: 1 sequence from GRCh37 genome
For the feature counting step of an RNAseq experiment, the gene or transcript models (defined by the chromosomal start and end positions of their exons) have to be known. To extract these from an Ensembl based annotation package, the exonsBy
, genesBy
and transcriptsBy
methods can be used in an analogous way as in TxDb
packages generated by the GenomicFeatures
package. However, the transcriptsBy
method does not, in contrast to the method in the GenomicFeatures
package, allow to return transcripts by “cds”. While the annotation packages built by the ensembldb
contain the chromosomal start and end coordinates of the coding region (for protein coding genes) they do not assign an ID to each CDS.
A simple use case is to retrieve all genes encoded on chromosomes X and Y from the database.
TxByGns <- transcriptsBy(edb, by = "gene",
filter = list(SeqnameFilter(c("X", "Y")))
)
TxByGns
## GRangesList object of length 2908:
## $ENSG00000000003
## GRanges object with 3 ranges and 6 metadata columns:
## seqnames ranges strand | tx_id
## <Rle> <IRanges> <Rle> | <character>
## [1] X [99888439, 99894988] - | ENST00000494424
## [2] X [99883667, 99891803] - | ENST00000373020
## [3] X [99887538, 99891686] - | ENST00000496771
## tx_biotype tx_cds_seq_start tx_cds_seq_end gene_id
## <character> <integer> <integer> <character>
## [1] processed_transcript <NA> <NA> ENSG00000000003
## [2] protein_coding 99885795 99891691 ENSG00000000003
## [3] processed_transcript <NA> <NA> ENSG00000000003
## tx_name
## <character>
## [1] ENST00000494424
## [2] ENST00000373020
## [3] ENST00000496771
##
## $ENSG00000000005
## GRanges object with 2 ranges and 6 metadata columns:
## seqnames ranges strand | tx_id
## [1] X [99839799, 99854882] + | ENST00000373031
## [2] X [99848621, 99852528] + | ENST00000485971
## tx_biotype tx_cds_seq_start tx_cds_seq_end gene_id
## [1] protein_coding 99840016 99854714 ENSG00000000005
## [2] processed_transcript <NA> <NA> ENSG00000000005
## tx_name
## [1] ENST00000373031
## [2] ENST00000485971
##
## $ENSG00000001497
## GRanges object with 6 ranges and 6 metadata columns:
## seqnames ranges strand | tx_id
## [1] X [64732463, 64754655] - | ENST00000484069
## [2] X [64732462, 64754636] - | ENST00000374811
## [3] X [64732463, 64754636] - | ENST00000374804
## [4] X [64732463, 64754636] - | ENST00000312391
## [5] X [64732462, 64754634] - | ENST00000374807
## [6] X [64740309, 64743497] - | ENST00000469091
## tx_biotype tx_cds_seq_start tx_cds_seq_end
## [1] nonsense_mediated_decay 64744901 64754595
## [2] protein_coding 64732655 64754595
## [3] protein_coding 64732655 64754595
## [4] protein_coding 64744901 64754595
## [5] protein_coding 64732655 64754595
## [6] protein_coding 64740535 64743497
## gene_id tx_name
## [1] ENSG00000001497 ENST00000484069
## [2] ENSG00000001497 ENST00000374811
## [3] ENSG00000001497 ENST00000374804
## [4] ENSG00000001497 ENST00000312391
## [5] ENSG00000001497 ENST00000374807
## [6] ENSG00000001497 ENST00000469091
##
## ...
## <2905 more elements>
## -------
## seqinfo: 2 sequences from GRCh37 genome
Since Ensembl contains also definitions of genes that are on chromosome variants (supercontigs), it is advisable to specify the chromosome names for which the gene models should be returned.
In a real use case, we might thus want to retrieve all genes encoded on the standard chromosomes. In addition it is advisable to use a GeneidFilter
to restrict to Ensembl genes only, as also LRG (Locus Reference Genomic) genes2 are defined in the database, which are partially redundant with Ensembl genes.
## will just get exons for all genes on chromosomes 1 to 22, X and Y.
## Note: want to get rid of the "LRG" genes!!!
EnsGenes <- exonsBy(edb, by = "gene",
filter = list(SeqnameFilter(c(1:22, "X", "Y")),
GeneidFilter("ENSG%", "like")))
The code above returns a GRangesList
that can be used directly as an input for the summarizeOverlaps
function from the GenomicAlignments
package 3.
Alternatively, the above GRangesList
can be transformed to a data.frame
in SAF format that can be used as an input to the featureCounts
function of the Rsubread
package 4.
## Transforming the GRangesList into a data.frame in SAF format
EnsGenes.SAF <- toSAF(EnsGenes)
Note that the ID by which the GRangesList
is split is used in the SAF formatted data.frame
as the GeneID
. In the example below this would be the Ensembl gene IDs, while the start, end coordinates (along with the strand and chromosomes) are those of the the exons.
In addition, the disjointExons
function (similar to the one defined in GenomicFeatures
) can be used to generate a GRanges
of non-overlapping exon parts which can be used in the DEXSeq
package.
## Create a GRanges of non-overlapping exon parts.
DJE <- disjointExons(edb,
filter = list(SeqnameFilter(c(1:22, "X", "Y")),
GeneidFilter("ENSG%", "like")))
The methods to retrieve exons, transcripts and genes (i.e. exons
, transcripts
and genes
) return by default GRanges
objects that can be used to retrieve sequences using the getSeq
method e.g. from BSgenome packages. The basic workflow is thus identical to the one for TxDb
packages, however, it is not straight forward to identify the BSgenome package with the matching genomic sequence. Most BSgenome packages are named according to the genome build identifier used in UCSC which does not (always) match the genome build name used by Ensembl. Using the Ensembl version provided by the EnsDb
, the correct genomic sequence can however be retrieved easily from the AnnotationHub
using the getGenomeFaFile
. If no Fasta file matching the Ensembl version is available, the function tries to identify a Fasta file with the correct genome build from the closest Ensembl release and returns that instead.
In the code block below we retrieve first the FaFile
with the genomic DNA sequence, extract the genomic start and end coordinates for all genes defined in the package, subset to genes encoded on sequences available in the FaFile
and extract all of their sequences. Note: these sequences represent the sequence between the chromosomal start and end coordinates of the gene.
library(EnsDb.Hsapiens.v75)
library(Rsamtools)
edb <- EnsDb.Hsapiens.v75
## Get the FaFile with the genomic sequence matching the Ensembl version
## using the AnnotationHub package.
Dna <- getGenomeFaFile(edb)
## Get start/end coordinates of all genes.
genes <- genes(edb)
## Subset to all genes that are encoded on chromosomes for which
## we do have DNA sequence available.
genes <- genes[seqnames(genes) %in% seqnames(seqinfo(Dna))]
## Get the gene sequences, i.e. the sequence including the sequence of
## all of the gene's exons and introns.
geneSeqs <- getSeq(Dna, genes)
To retrieve the (exonic) sequence of transcripts (i.e. without introns) we can use directly the extractTranscriptSeqs
method defined in the GenomicFeatures
on the EnsDb
object, eventually using a filter to restrict the query.
## get all exons of all transcripts encoded on chromosome Y
yTx <- exonsBy(edb, filter = SeqnameFilter("Y"))
## Retrieve the sequences for these transcripts from the FaFile.
library(GenomicFeatures)
yTxSeqs <- extractTranscriptSeqs(Dna, yTx)
yTxSeqs
## Extract the sequences of all transcripts encoded on chromosome Y.
yTx <- extractTranscriptSeqs(Dna, edb, filter = SeqnameFilter("Y"))
## Along these lines, we could use the method also to retrieve the coding sequence
## of all transcripts on the Y chromosome.
cdsY <- cdsBy(edb, filter = SeqnameFilter("Y"))
extractTranscriptSeqs(Dna, cdsY)
Note: in the next section we describe how transcript sequences can be retrieved from a BSgenome
package that is based on UCSC, not Ensembl.
EnsDb
packages with UCSC based annotationsSometimes it might be useful to combine (Ensembl based) annotations from EnsDb
packages/objects with annotations from other Bioconductor packages, that might base on UCSC annotations. To support such an integration of annotations, the ensembldb
packages implements the seqlevelsStyle
and seqlevelsStyle<-
from the GenomeInfoDb
package that allow to change the style of chromosome naming. Thus, sequence/chromosome names other than those used by Ensembl can be used in, and are returned by, the queries to EnsDb
objects as long as a mapping for them is provided by the GenomeInfoDb
package (which provides a mapping mostly between UCSC, NCBI and Ensembl chromosome names for the main chromosomes).
In the example below we change the seqnames style to UCSC.
## Change the seqlevels style form Ensembl (default) to UCSC:
seqlevelsStyle(edb) <- "UCSC"
## Now we can use UCSC style seqnames in SeqnameFilters or GRangesFilter:
genesY <- genes(edb, filter = SeqnameFilter("chrY"))
## The seqlevels of the returned GRanges are also in UCSC style
seqlevels(genesY)
## [1] "chrY"
Note that in most instances no mapping is available for sequences not corresponding to the main chromosomes (i.e. contigs, patched chromosomes etc). What is returned in cases in which no mapping is available can be specified with the global ensembldb.seqnameNotFound
option. By default (with ensembldb.seqnameNotFound
set to “ORIGINAL”), the original seqnames (i.e. the ones from Ensembl) are returned. With ensembldb.seqnameNotFound
“MISSING” each time a seqname can not be found an error is thrown. For all other cases (e.g. ensembldb.seqnameNotFound = NA
) the value of the option is returned.
seqlevelsStyle(edb) <- "UCSC"
## Getting the default option:
getOption("ensembldb.seqnameNotFound")
## [1] "ORIGINAL"
## Listing all seqlevels in the database.
seqlevels(edb)[1:30]
## Warning in .formatSeqnameByStyleFromQuery(x, sn, ifNotFound): More than 5
## seqnames with seqlevels style of the database (Ensembl) could not be mapped
## to the seqlevels style: UCSC!) Returning the orginal seqnames for these.
## [1] "chr1" "chr10" "chr11" "chr12" "chr13"
## [6] "chr14" "chr15" "chr16" "chr17" "chr18"
## [11] "chr19" "chr2" "chr20" "chr21" "chr22"
## [16] "chr3" "chr4" "chr5" "chr6" "chr7"
## [21] "chr8" "chr9" "GL000191.1" "GL000192.1" "GL000193.1"
## [26] "GL000194.1" "GL000195.1" "GL000196.1" "GL000199.1" "GL000201.1"
## Setting the option to NA, thus, for each seqname for which no mapping is available,
## NA is returned.
options(ensembldb.seqnameNotFound=NA)
seqlevels(edb)[1:30]
## Warning in .formatSeqnameByStyleFromQuery(x, sn, ifNotFound): More than 5
## seqnames with seqlevels style of the database (Ensembl) could not be mapped
## to the seqlevels style: UCSC!) Returning NA for these.
## [1] "chr1" "chr10" "chr11" "chr12" "chr13" "chr14" "chr15" "chr16"
## [9] "chr17" "chr18" "chr19" "chr2" "chr20" "chr21" "chr22" "chr3"
## [17] "chr4" "chr5" "chr6" "chr7" "chr8" "chr9" NA NA
## [25] NA NA NA NA NA NA
## Resetting the option.
options(ensembldb.seqnameNotFound = "ORIGINAL")
Next we retrieve transcript sequences from genes encoded on chromosome Y using the BSGenome
package for the human genome from UCSC. The specified version hg19
matches the genome build of Ensembl version 75, i.e. GRCh37
. Note that while we changed the style of the seqnames to UCSC we did not change the naming of the genome release.
library(BSgenome.Hsapiens.UCSC.hg19)
bsg <- BSgenome.Hsapiens.UCSC.hg19
## Get the genome version
unique(genome(bsg))
## [1] "hg19"
unique(genome(edb))
## [1] "GRCh37"
## Although differently named, both represent genome build GRCh37.
## Extract the full transcript sequences.
yTxSeqs <- extractTranscriptSeqs(bsg, exonsBy(edb, "tx", filter = SeqnameFilter("chrY")))
yTxSeqs
## A DNAStringSet instance of length 731
## width seq names
## [1] 5239 GCCTAGTGCGCGCGCAGTAA...AAATGTTTACTTGTATATG ENST00000155093
## [2] 4023 ATGTTTAGGGTTGGCTTCTT...GGAAACACATCCCTTGTAA ENST00000215473
## [3] 802 AGAGGACCAAGCCTCCCTGT...TAAAATGTTTTAAAAATCA ENST00000215479
## [4] 910 TGTCTGTCAGAGCTGTCAGC...ACACTGGTATATTTCTGTT ENST00000250776
## [5] 1305 TTCCAGGATATGAACTCTAC...ATCCTGTGGCTGTAGGAAA ENST00000250784
## ... ... ...
## [727] 333 ATGGATGAAGAAGAGAAAAC...TGAACTTTCTAGATTGCAT ENST00000604924
## [728] 1247 CATGGCGGGGTTCCTGCCTT...TTTGGAGTAATGTCTTAGT ENST00000605584
## [729] 199 CAGTTCTCGCTCCTGTGCAG...GGTCTGGGTGGCTTCTGGA ENST00000605663
## [730] 276 GCCCCAGGAGGAAAGGGGGA...AATAAAGAACAGCGCATTC ENST00000606439
## [731] 444 ATGGGAGCCACTGGGCTTGG...CGTTCATGAAGAAGACTAA ENST00000607210
## Extract just the CDS
Test <- cdsBy(edb, "tx", filter = SeqnameFilter("chrY"))
yTxCds <- extractTranscriptSeqs(bsg, cdsBy(edb, "tx", filter = SeqnameFilter("chrY")))
yTxCds
## A DNAStringSet instance of length 160
## width seq names
## [1] 2406 ATGGATGAAGATGAATTTGA...AGAAGTTGGTCTGCCCTAA ENST00000155093
## [2] 4023 ATGTTTAGGGTTGGCTTCTT...GGAAACACATCCCTTGTAA ENST00000215473
## [3] 579 ATGGGGACCTGGATTTTGTT...GCAGGAGGAAGTGGATTAA ENST00000215479
## [4] 792 ATGGCCCGGGGCCCCAAGAA...CAAACAGAGCAGTGGCTAA ENST00000250784
## [5] 378 ATGAGTCCAAAGCCGAGAGC...TACTCCCCTATCTCCCTGA ENST00000250823
## ... ... ...
## [156] 63 CGCAAGGATTTAAAAGAGAT...ACCCTGTTGGCCAGGCTAG ENST00000601700
## [157] 42 CTTGATACAAAGAATCAATTTAATTTTAAGATTGTCTATCTT ENST00000601705
## [158] 33 ATGATGACGCTTGTCCCCAGAGCCAGGACACGT ENST00000602680
## [159] 33 ATGATGACGCTTGTCCCCAGAGCCAGGACACGT ENST00000602732
## [160] 33 ATGATGACGCTTGTCCCCAGAGCCAGGACACGT ENST00000602770
At last changing the seqname style to the default value =“Ensembl”=.
seqlevelsStyle(edb) <- "Ensembl"
shiny
web appIn addition to the genes
, transcripts
and exons
methods it is possibly to search interactively for gene/transcript/exon annotations using the internal, shiny
based, web application. The application can be started with the runEnsDbApp()
function. The search results from this app can also be returned to the R workspace either as a data.frame
or GRanges
object.
ensembldb
and Gviz
The Gviz
package provides functions to plot genes and transcripts along with other data on a genomic scale. Gene models can be provided either as a data.frame
, GRanges
, TxDB
database, can be fetched from biomart and can also be retrieved from ensembldb
.
Below we generate a GeneRegionTrack
fetching all transcripts from a certain region on chromosome Y.
Note that if we want in addition to work also with BAM files that were aligned against DNA sequences retrieved from Ensembl or FASTA files representing genomic DNA sequences from Ensembl we should change the ucscChromosomeNames
option from Gviz
to FALSE
(i.e. by calling options(ucscChromosomeNames = FALSE)
). This is not necessary if we just want to retrieve gene models from an EnsDb
object, as the ensembldb
package internally checks the ucscChromosomeNames
option and, depending on that, maps Ensembl chromosome names to UCSC chromosome names.
## Loading the Gviz library
library(Gviz)
library(EnsDb.Hsapiens.v75)
edb <- EnsDb.Hsapiens.v75
## Retrieving a Gviz compatible GRanges object with all genes
## encoded on chromosome Y.
gr <- getGeneRegionTrackForGviz(edb, chromosome = "Y",
start = 20400000, end = 21400000)
## Define a genome axis track
gat <- GenomeAxisTrack()
## We have to change the ucscChromosomeNames option to FALSE to enable Gviz usage
## with non-UCSC chromosome names.
options(ucscChromosomeNames = FALSE)
plotTracks(list(gat, GeneRegionTrack(gr)))
options(ucscChromosomeNames = TRUE)
Above we had to change the option ucscChromosomeNames
to FALSE
in order to use it with non-UCSC chromosome names. Alternatively, we could however also change the seqnamesStyle
of the EnsDb
object to UCSC
. Note that we have to use now also chromosome names in the UCSC style in the SeqnameFilter
(i.e. “chrY” instead of Y
).
seqlevelsStyle(edb) <- "UCSC"
## Retrieving the GRanges objects with seqnames corresponding to UCSC chromosome names.
gr <- getGeneRegionTrackForGviz(edb, chromosome = "chrY",
start = 20400000, end = 21400000)
seqnames(gr)
## factor-Rle of length 218 with 1 run
## Lengths: 218
## Values : chrY
## Levels(1): chrY
## Define a genome axis track
gat <- GenomeAxisTrack()
plotTracks(list(gat, GeneRegionTrack(gr)))
We can also use the filters from the ensembldb
package to further refine what transcripts are fetched, like in the example below, in which we create two different gene region tracks, one for protein coding genes and one for lincRNAs.
protCod <- getGeneRegionTrackForGviz(edb, chromosome = "chrY",
start = 20400000, end = 21400000,
filter = GenebiotypeFilter("protein_coding"))
lincs <- getGeneRegionTrackForGviz(edb, chromosome = "chrY",
start = 20400000, end = 21400000,
filter = GenebiotypeFilter("lincRNA"))
plotTracks(list(gat, GeneRegionTrack(protCod, name = "protein coding"),
GeneRegionTrack(lincs, name = "lincRNAs")), transcriptAnnotation = "symbol")
## At last we change the seqlevels style again to Ensembl
seqlevelsStyle <- "Ensembl"
EnsDb
objects in the AnnotationDbi
frameworkMost of the methods defined for objects extending the basic annotation package class AnnotationDbi
are also defined for EnsDb
objects (i.e. methods columns
, keytypes
, keys
, mapIds
and select
). While these methods can be used analogously to basic annotation packages, the implementation for EnsDb
objects also support the filtering framework of the ensembldb
package.
In the example below we first evaluate all the available columns and keytypes in the database and extract then the gene names for all genes encoded on chromosome X.
library(EnsDb.Hsapiens.v75)
edb <- EnsDb.Hsapiens.v75
## List all available columns in the database.
columns(edb)
## [1] "ENTREZID" "EXONID" "EXONIDX" "EXONSEQEND"
## [5] "EXONSEQSTART" "GENEBIOTYPE" "GENEID" "GENENAME"
## [9] "GENESEQEND" "GENESEQSTART" "ISCIRCULAR" "SEQCOORDSYSTEM"
## [13] "SEQLENGTH" "SEQNAME" "SEQSTRAND" "SYMBOL"
## [17] "TXBIOTYPE" "TXCDSSEQEND" "TXCDSSEQSTART" "TXID"
## [21] "TXNAME" "TXSEQEND" "TXSEQSTART"
## Note that these do *not* correspond to the actual column names
## of the database that can be passed to methods like exons, genes,
## transcripts etc. These column names can be listed with the listColumns
## method.
listColumns(edb)
## [1] "seq_name" "seq_length" "is_circular"
## [4] "exon_id" "exon_seq_start" "exon_seq_end"
## [7] "gene_id" "gene_name" "entrezid"
## [10] "gene_biotype" "gene_seq_start" "gene_seq_end"
## [13] "seq_name" "seq_strand" "seq_coord_system"
## [16] "symbol" "name" "value"
## [19] "tx_id" "tx_biotype" "tx_seq_start"
## [22] "tx_seq_end" "tx_cds_seq_start" "tx_cds_seq_end"
## [25] "gene_id" "tx_name" "tx_id"
## [28] "exon_id" "exon_idx"
## List all of the supported key types.
keytypes(edb)
## [1] "ENTREZID" "EXONID" "GENEBIOTYPE" "GENEID" "GENENAME"
## [6] "SEQNAME" "SEQSTRAND" "SYMBOL" "TXBIOTYPE" "TXID"
## [11] "TXNAME"
## Get all gene ids from the database.
gids <- keys(edb, keytype = "GENEID")
length(gids)
## [1] 64102
## Get all gene names for genes encoded on chromosome Y.
gnames <- keys(edb, keytype = "GENENAME", filter = SeqnameFilter("Y"))
head(gnames)
## [1] "KDM5D" "DDX3Y" "ZFY" "TBL1Y" "PCDH11Y" "AMELY"
In the next example we retrieve specific information from the database using the select
method. First we fetch all transcripts for the genes BCL2 and BCL2L11. In the first call we provide the gene names, while in the second call we employ the filtering system to perform a more fine-grained query to fetch only the protein coding transcripts for these genes.
## Use the /standard/ way to fetch data.
select(edb, keys = c("BCL2", "BCL2L11"), keytype = "GENENAME",
columns = c("GENEID", "GENENAME", "TXID", "TXBIOTYPE"))
## GENEID GENENAME TXID TXBIOTYPE
## 1 ENSG00000171791 BCL2 ENST00000398117 protein_coding
## 2 ENSG00000171791 BCL2 ENST00000333681 protein_coding
## 3 ENSG00000171791 BCL2 ENST00000590515 processed_transcript
## 4 ENSG00000171791 BCL2 ENST00000589955 protein_coding
## 5 ENSG00000171791 BCL2 ENST00000444484 protein_coding
## 6 ENSG00000153094 BCL2L11 ENST00000432179 protein_coding
## 7 ENSG00000153094 BCL2L11 ENST00000308659 protein_coding
## 8 ENSG00000153094 BCL2L11 ENST00000393256 protein_coding
## 9 ENSG00000153094 BCL2L11 ENST00000393252 protein_coding
## 10 ENSG00000153094 BCL2L11 ENST00000433098 nonsense_mediated_decay
## 11 ENSG00000153094 BCL2L11 ENST00000405953 protein_coding
## 12 ENSG00000153094 BCL2L11 ENST00000415458 nonsense_mediated_decay
## 13 ENSG00000153094 BCL2L11 ENST00000436733 nonsense_mediated_decay
## 14 ENSG00000153094 BCL2L11 ENST00000437029 nonsense_mediated_decay
## 15 ENSG00000153094 BCL2L11 ENST00000452231 nonsense_mediated_decay
## 16 ENSG00000153094 BCL2L11 ENST00000361493 nonsense_mediated_decay
## 17 ENSG00000153094 BCL2L11 ENST00000431217 nonsense_mediated_decay
## 18 ENSG00000153094 BCL2L11 ENST00000439718 nonsense_mediated_decay
## 19 ENSG00000153094 BCL2L11 ENST00000438054 protein_coding
## 20 ENSG00000153094 BCL2L11 ENST00000357757 protein_coding
## 21 ENSG00000153094 BCL2L11 ENST00000393253 protein_coding
## 22 ENSG00000153094 BCL2L11 ENST00000337565 protein_coding
## Use the filtering system of ensembldb
select(edb, keys = list(GenenameFilter(c("BCL2", "BCL2L11")),
TxbiotypeFilter("protein_coding")),
columns = c("GENEID", "GENENAME", "TXID", "TXBIOTYPE"))
## Note: ordering of the results might not match ordering of keys!
## GENEID GENENAME TXID TXBIOTYPE
## 1 ENSG00000171791 BCL2 ENST00000398117 protein_coding
## 2 ENSG00000171791 BCL2 ENST00000333681 protein_coding
## 3 ENSG00000171791 BCL2 ENST00000589955 protein_coding
## 4 ENSG00000171791 BCL2 ENST00000444484 protein_coding
## 5 ENSG00000153094 BCL2L11 ENST00000432179 protein_coding
## 6 ENSG00000153094 BCL2L11 ENST00000308659 protein_coding
## 7 ENSG00000153094 BCL2L11 ENST00000393256 protein_coding
## 8 ENSG00000153094 BCL2L11 ENST00000393252 protein_coding
## 9 ENSG00000153094 BCL2L11 ENST00000405953 protein_coding
## 10 ENSG00000153094 BCL2L11 ENST00000438054 protein_coding
## 11 ENSG00000153094 BCL2L11 ENST00000357757 protein_coding
## 12 ENSG00000153094 BCL2L11 ENST00000393253 protein_coding
## 13 ENSG00000153094 BCL2L11 ENST00000337565 protein_coding
Finally, we use the mapIds
method to establish a mapping between ids and values. In the example below we fetch transcript ids for the two genes from the example above.
## Use the default method, which just returns the first value for multi mappings.
mapIds(edb, keys = c("BCL2", "BCL2L11"), column = "TXID", keytype = "GENENAME")
## BCL2 BCL2L11
## "ENST00000398117" "ENST00000432179"
## Alternatively, specify multiVals="list" to return all mappings.
mapIds(edb, keys = c("BCL2", "BCL2L11"), column = "TXID", keytype = "GENENAME",
multiVals = "list")
## $BCL2
## [1] "ENST00000398117" "ENST00000333681" "ENST00000590515" "ENST00000589955"
## [5] "ENST00000444484"
##
## $BCL2L11
## [1] "ENST00000432179" "ENST00000308659" "ENST00000393256"
## [4] "ENST00000393252" "ENST00000433098" "ENST00000405953"
## [7] "ENST00000415458" "ENST00000436733" "ENST00000437029"
## [10] "ENST00000452231" "ENST00000361493" "ENST00000431217"
## [13] "ENST00000439718" "ENST00000438054" "ENST00000357757"
## [16] "ENST00000393253" "ENST00000337565"
## And, just like before, we can use filters to map only to protein coding transcripts.
mapIds(edb, keys = list(GenenameFilter(c("BCL2", "BCL2L11")),
TxbiotypeFilter("protein_coding")), column = "TXID",
multiVals = "list")
## Warning in .mapIds(x = x, keys = keys, column = column, keytype =
## keytype, : Got 2 filter objects. Will use the keys of the first for the
## mapping!
## Note: ordering of the results might not match ordering of keys!
## $BCL2
## [1] "ENST00000398117" "ENST00000333681" "ENST00000589955" "ENST00000444484"
##
## $BCL2L11
## [1] "ENST00000432179" "ENST00000308659" "ENST00000393256" "ENST00000393252"
## [5] "ENST00000405953" "ENST00000438054" "ENST00000357757" "ENST00000393253"
## [9] "ENST00000337565"
Note that, if the filters are used, the ordering of the result does no longer match the ordering of the genes.
These notes might explain eventually unexpected results (and, more importantly, help avoiding them):
The ordering of the results returned by the genes
, exons
, transcripts
methods can be specified with the order.by
parameter. The ordering of the results does however not correspond to the ordering of values in submitted filter objects. The exception is the select
method. If a character vector of values or a single filter is passed with argument keys
the ordering of results of this method matches the ordering of the key values or the values of the filter.
Results of exonsBy
, transcriptsBy
are always ordered by the by
argument.
The CDS provided by EnsDb
objects always includes both, the start and the stop codon.
Transcripts with multiple CDS are at present not supported by EnsDb
.
At present, EnsDb
support only genes/transcripts for which all of their exons are encoded on the same chromosome and the same strand.
The code in this section is not supposed to be automatically executed when the vignette is built, as this would require a working installation of the Ensembl Perl API, which is not expected to be available on each system. Also, building EnsDb
from alternative sources, like GFF or GTF files takes some time and thus also these examples are not directly executed when the vignette is build.
The fetchTablesFromEnsembl
function of the package uses the Ensembl Perl API to retrieve the required annotations from an Ensembl database (e.g. from the main site ensembldb.ensembl.org). Thus, to use the functionality to built databases, the Ensembl Perl API needs to be installed (see 5 for details).
Alternatively, the ensDbFromAH
, ensDbFromGff
, ensDbFromGRanges
and ensDbFromGtf
functions allow to build EnsDb SQLite files from a GRanges
object or GFF/GTF files from Ensembl (either provided as files or via AnnotationHub
). These functions do not depend on the Ensembl Perl API, but require a working internet connection to fetch the chromosome lengths from Ensembl as these are not provided within GTF or GFF files.
The functions below use the Ensembl Perl API to fetch the required data directly from the Ensembl core databases. Thus, the path to the Perl API specific for the desired Ensembl version needs to be added to the PERL5LIB
environment variable.
An annotation package containing all human genes for Ensembl version 75 can be created using the code in the block below.
library(ensembldb)
## get all human gene/transcript/exon annotations from Ensembl (75)
## the resulting tables will be stored by default to the current working
## directory
fetchTablesFromEnsembl(75, species = "human")
## These tables can then be processed to generate a SQLite database
## containing the annotations (again, the function assumes the required
## txt files to be present in the current working directory)
DBFile <- makeEnsemblSQLiteFromTables()
## and finally we can generate the package
makeEnsembldbPackage(ensdb = DBFile, version = "0.99.12",
maintainer = "Johannes Rainer <johannes.rainer@eurac.edu>",
author = "J Rainer")
The generated package can then be build using R CMD build EnsDb.Hsapiens.v75
and installed with R CMD INSTALL EnsDb.Hsapiens.v75*
. Note that we could directly generate an EnsDb
instance by loading the database file, i.e. by calling edb <- EnsDb(DBFile)
and work with that annotation object.
To fetch and build annotation packages for plant genomes (e.g. arabidopsis thaliana), the Ensembl genomes should be specified as a host, i.e. setting host
to “mysql-eg-publicsql.ebi.ac.uk”, port
to 4157
and species
to e.g. “arabidopsis thaliana”.
In the next example we create an EnsDb
database using the AnnotationHub
package and load also the corresponding genomic DNA sequence matching the Ensembl version. We thus first query the AnnotationHub
package for all resources available for Mus musculus
and the Ensembl release 77. Next we create the EnsDb
object from the appropriate AnnotationHub
resource. We then use the getGenomeFaFile
method on the EnsDb
to directly look up and retrieve the correct or best matching FaFile
with the genomic DNA sequence. At last we retrieve the sequences of all exons using the getSeq
method.
## Load the AnnotationHub data.
library(AnnotationHub)
ah <- AnnotationHub()
## Query all available files for Ensembl release 77 for
## Mus musculus.
query(ah, c("Mus musculus", "release-77"))
## Get the resource for the gtf file with the gene/transcript definitions.
Gtf <- ah["AH28822"]
## Create a EnsDb database file from this.
DbFile <- ensDbFromAH(Gtf)
## We can either generate a database package, or directly load the data
edb <- EnsDb(DbFile)
## Identify and get the FaFile object with the genomic DNA sequence matching
## the EnsDb annotation.
Dna <- getGenomeFaFile(edb)
library(Rsamtools)
## We next retrieve the sequence of all exons on chromosome Y.
exons <- exons(edb, filter = SeqnameFilter("Y"))
exonSeq <- getSeq(Dna, exons)
## Alternatively, look up and retrieve the toplevel DNA sequence manually.
Dna <- ah[["AH22042"]]
In the example below we load a GRanges
containing gene definitions for genes encoded on chromosome Y and generate a EnsDb SQLite database from that information.
## Generate a sqlite database from a GRanges object specifying
## genes encoded on chromosome Y
load(system.file("YGRanges.RData", package = "ensembldb"))
Y
## GRanges object with 7155 ranges and 16 metadata columns:
## seqnames ranges strand | source
## <Rle> <IRanges> <Rle> | <factor>
## [1] Y [2652790, 2652894] + | snRNA
## [2] Y [2652790, 2652894] + | snRNA
## [3] Y [2652790, 2652894] + | snRNA
## [4] Y [2654896, 2655740] - | protein_coding
## [5] Y [2654896, 2655740] - | protein_coding
## ... ... ... ... . ...
## [7151] Y [28772667, 28773306] - | processed_pseudogene
## [7152] Y [28772667, 28773306] - | processed_pseudogene
## [7153] Y [59001391, 59001635] + | pseudogene
## [7154] Y [59001391, 59001635] + | processed_pseudogene
## [7155] Y [59001391, 59001635] + | processed_pseudogene
## type score phase gene_id gene_name
## <factor> <numeric> <integer> <character> <character>
## [1] gene <NA> <NA> ENSG00000251841 RNU6-1334P
## [2] transcript <NA> <NA> ENSG00000251841 RNU6-1334P
## [3] exon <NA> <NA> ENSG00000251841 RNU6-1334P
## [4] gene <NA> <NA> ENSG00000184895 SRY
## [5] transcript <NA> <NA> ENSG00000184895 SRY
## ... ... ... ... ... ...
## [7151] transcript <NA> <NA> ENSG00000231514 FAM58CP
## [7152] exon <NA> <NA> ENSG00000231514 FAM58CP
## [7153] gene <NA> <NA> ENSG00000235857 CTBP2P1
## [7154] transcript <NA> <NA> ENSG00000235857 CTBP2P1
## [7155] exon <NA> <NA> ENSG00000235857 CTBP2P1
## gene_source gene_biotype transcript_id transcript_name
## <character> <character> <character> <character>
## [1] ensembl snRNA <NA> <NA>
## [2] ensembl snRNA ENST00000516032 RNU6-1334P-201
## [3] ensembl snRNA ENST00000516032 RNU6-1334P-201
## [4] ensembl_havana protein_coding <NA> <NA>
## [5] ensembl_havana protein_coding ENST00000383070 SRY-001
## ... ... ... ... ...
## [7151] havana pseudogene ENST00000435741 FAM58CP-001
## [7152] havana pseudogene ENST00000435741 FAM58CP-001
## [7153] havana pseudogene <NA> <NA>
## [7154] havana pseudogene ENST00000431853 CTBP2P1-001
## [7155] havana pseudogene ENST00000431853 CTBP2P1-001
## transcript_source exon_number exon_id tag
## <character> <numeric> <character> <character>
## [1] <NA> <NA> <NA> <NA>
## [2] ensembl <NA> <NA> <NA>
## [3] ensembl 1 ENSE00002088309 <NA>
## [4] <NA> <NA> <NA> <NA>
## [5] ensembl_havana <NA> <NA> CCDS
## ... ... ... ... ...
## [7151] havana <NA> <NA> <NA>
## [7152] havana 1 ENSE00001616687 <NA>
## [7153] <NA> <NA> <NA> <NA>
## [7154] havana <NA> <NA> <NA>
## [7155] havana 1 ENSE00001794473 <NA>
## ccds_id protein_id
## <character> <character>
## [1] <NA> <NA>
## [2] <NA> <NA>
## [3] <NA> <NA>
## [4] <NA> <NA>
## [5] CCDS14772 <NA>
## ... ... ...
## [7151] <NA> <NA>
## [7152] <NA> <NA>
## [7153] <NA> <NA>
## [7154] <NA> <NA>
## [7155] <NA> <NA>
## -------
## seqinfo: 1 sequence from GRCh37 genome
DB <- ensDbFromGRanges(Y, path = tempdir(), version = 75,
organism = "Homo_sapiens")
## Warning in ensDbFromGRanges(Y, path = tempdir(), version = 75, organism
## = "Homo_sapiens"): I'm missing column(s): 'entrezid'. The corresponding
## database column(s) will be empty!
edb <- EnsDb(DB)
edb
## EnsDb for Ensembl:
## |Backend: SQLite
## |Db type: EnsDb
## |Type of Gene ID: Ensembl Gene ID
## |Supporting package: ensembldb
## |Db created by: ensembldb package from Bioconductor
## |script_version: 0.0.1
## |Creation time: Wed Nov 16 19:52:30 2016
## |ensembl_version: 75
## |ensembl_host: unknown
## |Organism: Homo_sapiens
## |genome_build: GRCh37
## |DBSCHEMAVERSION: 1.0
## |source_file: GRanges object
## | No. of genes: 495.
## | No. of transcripts: 731.
## As shown in the example below, we could make an EnsDb package on
## this DB object using the makeEnsembldbPackage function.
Alternatively we can build the annotation database using the ensDbFromGtf
ensDbFromGff
functions, that extracts most of the required data from a GTF respectively GFF (version 3) file which can be downloaded from Ensembl (e.g. from ftp://ftp.ensembl.org/pub/release-75/gtf/homo_sapiens for human gene definitions from Ensembl version 75; for plant genomes etc files can be retrieved from ftp://ftp.ensemblgenomes.org). All information except the chromosome lengths and the NCBI Entrezgene IDs can be extracted from these GTF files. The function also tries to retrieve chromosome length information automatically from Ensembl.
Below we create the annotation from a gtf file that we fetch directly from Ensembl.
library(ensembldb)
## the GTF file can be downloaded from
## ftp://ftp.ensembl.org/pub/release-75/gtf/homo_sapiens/
gtffile <- "Homo_sapiens.GRCh37.75.gtf.gz"
## generate the SQLite database file
DB <- ensDbFromGtf(gtf = gtffile)
## load the DB file directly
EDB <- EnsDb(DB)
## alternatively, build the annotation package
## and finally we can generate the package
makeEnsembldbPackage(ensdb = DB, version = "0.99.12",
maintainer = "Johannes Rainer <johannes.rainer@eurac.edu>",
author = "J Rainer")
The database consists of the following tables and attributes (the layout is also shown in Figure 115):
gene_id
: the Ensembl ID of the gene.gene_name
: the name (symbol) of the gene.entrezid
: the NCBI Entrezgene ID(s) of the gene. Note that this can be a ;
separated list of IDs for genes that are mapped to more than one Entrezgene.gene_biotype
: the biotype of the gene.gene_seq_start
: the start coordinate of the gene on the sequence (usually a chromosome).gene_seq_end
: the end coordinate of the gene on the sequence.seq_name
: the name of the sequence (usually the chromosome name).seq_strand
: the strand on which the gene is encoded.seq_coord_system
: the coordinate system of the sequence.tx_name
column is available in this database column, all methods to retrieve data from the database support also this column. The returned values are however the ID of the transcripts.
tx_id
: the Ensembl transcript ID.tx_biotype
: the biotype of the transcript.tx_seq_start
: the start coordinate of the transcript.tx_seq_end
: the end coordinate of the transcript.tx_cds_seq_start
: the start coordinate of the coding region of the transcript (NULL for non-coding transcripts).tx_cds_seq_end
: the end coordinate of the coding region of the transcript.gene_id
: the gene to which the transcript belongs.exon_id
: the Ensembl exon ID.exon_seq_start
: the start coordinate of the exon.exon_seq_end
: the end coordinate of the exon.tx_id
: the Ensembl transcript ID.exon_id
: the Ensembl exon ID.exon_idx
: the index of the exon in the corresponding transcript, always from 5’ to 3’ of the transcript.seq_name
: the name of the sequence/chromosome.seq_length
: the length of the sequence.is_circular
: whether the sequence in circular.key
value
symbol
: the database does not have such a database column, but it is still possible to use it in the columns
parameter. This column is symlinked to the gene_name
column.tx_name
: similar to the symbol
column, this column is symlinked to the tx_id
column.