Contents


rfaRm

Lara Selles Vidal, Rafael Ayala, Guy-Bart Stan, Rodrigo Ledesma-Amaro

April 7, 2020


1 Abstract

The Rfam database is a collection of families of non-coding RNA and other structured RNA elements. Each family is defined by a multiple sequence alignment of family members, a consensus secondary structure and a covariance model, which integrates both multiple sequence alignment information and secondary structure information, and are related to the hidden Markov models employed by Pfam. “rfaRm” is a R package providing a client-side interface to the Rfam database. The package allows the search of the Rfam database by keywords or sequences, as well as the retrieval of all available information on specific Rfam families, such as member sequences, multiple sequence alignments, secondary structures and covariance models.

2 Introduction

The Rfam database (Kalvari et al. 2017) is a collection of RNA sequences classified into different families of non-coding RNA, cis-regulatory elements and other structured RNA. It is designed to be analogous to the Pfam database of protein families (El-Gebali et al. 2018). The initial step for identifying an Rfam family is to generate a manually curated seed alignment of a set of known RNA sequences, ideally including secondary structure information. The Infernal software package (Nawrocki & Eddy 2013) is then used to build a covariance model for the family (a probabilistic profile of the sequence and secondary structure of the family), and to search for new homologs to add from a sequence database. When new homologs are added, the covariance model is updated, and the process is repeated until no new homologs are found.

Much of the information stored in Rfam could be useful if integrated within existing genomics and transcriptomics workflows and with functionalities of packages already available in BioConductor. rfaRm aims to facilitate the integration of the information provided by Rfam in existing workflows by providing an R client-side interface to the Rfam database.

3 Software features

rfaRm allows to search the Rfam database by keywords and sequence. In the search by keywords, the supplied keywodrs are matched against the fields of each entry, and accessions for the matching families are returned. In the search by sequence, the supplied sequence is used by Infernal to be searched against the Rfam library of covariance models, returning high-scoring hits which allow the identification of regions in the query sequence that belong to an Rfam family. Searches can be used to find RNA families of interest.

rfaRm also allows the retrieval of the information available in Rfam for each RNA family by providing a valid Rfam family identifier. The types of retrievable information for each Rfam family include:

The results of each query are returned as R objects that can be used to perform further analysis on the retrieved data with other R packages. Additionally, when appropriate they can also be saved to standard file formats, which can be read by commonly used Bioinformatics software.

4 Example

4.1 Identify Rfam families of interest

Rfam families can be searched by keywords with the “rfamTextSearchFamilyAccession” function. The function returns a character vector where each element is a string representing the Rfam accession for an Rfam family that matched the searched keyword. Rfam accessions are always of the “RFXXXXX” structure, where X is any digit from 0 to 9. For example, in order to search for Rfam families corresponding to riboswitches, the keyword “riboswitch” can be passed as the query to the search function. The function will output a vector with the Rfam accessions of the matching families:

rfamTextSearchFamilyAccession("riboswitch")
##  [1] "RF00174" "RF00059" "RF00162" "RF01051" "RF00504" "RF00050" "RF00379"
##  [8] "RF00167" "RF00168" "RF03167" "RF01734" "RF01750" "RF00634" "RF01055"
## [15] "RF00380"

It is also possible to perform a sequence-based search with the “rfamSequenceSearch” function. The supplied query sequence is matched against the covariance models of the Rfam families by the Infernal software in order to identify regions in the query sequence that belong to one of the Rfam families. It should be noted that sequence-based searches can occasionally take long times to finish, sometimes up to several hours. The supplied string should be an RNA sequence, containing only the standard RNA nucleotides (A, U, G and C). The function will return a nested list where each top-level element represents a high-scoring hit, and is in itself a list including information such as the Rfam family with which the hit was found, and an alignment between the query sequence and the consensus sequence of the Rfam family. In the following example, only one high-scoring hit is identified in the provided sequence: an FMN riboswitch. The alignment with the consensus sequence of the Rfam family is saved to a text file, together with its associated secondary structure annotation.

sequenceSearch <- rfamSequenceSearch("GGAUCUUCGGGGCAGGGUGAAAUUCCCGACCGGUGGUAUAGUCCACGAAAGUAUUUGCUUUGAUUUGGUGAAAUUCCAAAACCGACAGUAGAGUCUGGAUGAGAGAAGAUUC")

length(sequenceSearch)
## [1] 1
names(sequenceSearch)
## [1] "FMN"
## Save the aligned consensus sequence and query sequence to a text file, together
## with secondary structure annotation

writeLines(c(sequenceSearch$FMN$alignmentQuerySequence, sequenceSearch$FMN$alignmentMatch, 
    sequenceSearch$FMN$alignmentHitSequence, sequenceSearch$FMN$alignmentSecondaryStructure), 
    con = "searchAlignment.txt")

4.2 Convert Rfam accession to ID and vice versa

Each Rfam family has two main identifiers: its accession (a string with the RFXXXX format), and its ID (a keyword related to the Rfam family functionality or RNA type). Both can be used interchangeably with all of the query functions available in rfaRm. It is possible to obtain the ID corresponding to an accession with the “rfamFamilyAccessionToID” function, and the accession corresponding to an ID with the “rfamFamilyIDToAccession” function.

rfamFamilyAccessionToID("RF00174")
## [1] "Cobalamin"
rfamFamilyIDToAccession("FMN")
## [1] "RF00050"

4.3 Query the Rfam database for data for a specific Rfam family

When the accession or ID of the Rfam family of interest is known, it is possible to query the database to retrieve different types of data about the family.

4.3.1 Obtain a descriptive summary

A summary with a short functional description of an Rfam family can be obtained with the “rfamFamilySummary” function.

rfamFamilySummary("RF00174")
## $rfamReleaseNumber
## [1] "14.3"
## 
## $numberSequencesSeedAlignment
## [1] "430"
## 
## $sourceSeedAlignment
## [1] "Barrick JE, Breaker RR"
## 
## $numberSpecies
## [1] "5174"
## 
## $RNAType
## [1] "Cis-reg; riboswitch;"
## 
## $numberSequencesAll
## [1] "13781"
## 
## $structureSource
## [1] "Predicted; PFOLD"
## 
## $description
## [1] "Cobalamin riboswitch"
## 
## $rfamAccession
## [1] "RF00174"
## 
## $rfamID
## [1] "Cobalamin"
## 
## $comment
## [1] "Riboswitches are metabolite binding domains within certain messenger RNAs that serve as precision sensors for their corresponding targets. Allosteric rearrangement of mRNA structure is mediated by ligand binding, and this results in modulation of gene expression [1]. This family represents a cobalamin riboswitch (B12-element) which is widely distributed in 5' UTRs of vitamin B(12)-related genes in bacteria. Cobalamin in the form of adenosylcobalamin (Ado-CBL) is known to repress expression of genes for vitamin B(12) biosynthesis and be transported by a post-transcriptional regulatory mechanism, which involves direct binding of Ado-CBL to 5' UTRs [2]."
# The summary includes, amongst other data, a brief paragraph describing the
# family

rfamFamilySummary("RF00174")$comment
## [1] "Riboswitches are metabolite binding domains within certain messenger RNAs that serve as precision sensors for their corresponding targets. Allosteric rearrangement of mRNA structure is mediated by ligand binding, and this results in modulation of gene expression [1]. This family represents a cobalamin riboswitch (B12-element) which is widely distributed in 5' UTRs of vitamin B(12)-related genes in bacteria. Cobalamin in the form of adenosylcobalamin (Ado-CBL) is known to repress expression of genes for vitamin B(12) biosynthesis and be transported by a post-transcriptional regulatory mechanism, which involves direct binding of Ado-CBL to 5' UTRs [2]."

4.3.2 Obtain the consensus secondary structure and consensus sequence

The consensus secondary structure for an Rfam family can be retrieved, together with its consensus sequence, with the “rfamConsensusSecondaryStructure” function. RNA Secondary structure notation can be output in the WUSS or extended Dot-Bracket notations. In the example below, the consensus secondary structure of the Rfam family with accession RF00174 is retrieved in both notations, and saved to a file with the extended Dot-Bracket notation.

rfamConsensusSecondaryStructure("RF00174", format = "WUSS")
## [1] "uuaaauugaaacgaugauGGUu..ccccuuu......................aaagu....gaaggguu......AAaa..GGGAA.c.cc.G.G.UGa...............................aAaUCCgg.gGCuGcCC.CC.gCaACuGUAA..gc.Gg..........agagcaccccccAauAa............................GCCACUggcccgcaa..........................................................ggg.ccGGGAAGGC...ggg..g.gg..aaggaaugac..................................................................cCgc.gAGcCaGGAGACCuGCCaucaguuuuugaaucucc"
## [2] ":::::::::::::::[[[[[[,..<<<____......................_____....___>>>,,......,,,(..((,,,.<.<<.<.<.___...............................____>>>>.>,,<<<__.__.>>>,<<<---..<<.<<..........------<<<<<<-----............................<<<-<<<<<<_____.........................................................._>>.>>>>--->>>...>>>..>.>>..----------..................................................................>>>>.---->>>,,,,)))]]]]]]:::::::::::::::"
rfamConsensusSecondaryStructure("RF00174", format = "DB")
## [1] "uuaaauugaaacgaugauGGUu..ccccuuu......................aaagu....gaaggguu......AAaa..GGGAA.c.cc.G.G.UGa...............................aAaUCCgg.gGCuGcCC.CC.gCaACuGUAA..gc.Gg..........agagcaccccccAauAa............................GCCACUggcccgcaa..........................................................ggg.ccGGGAAGGC...ggg..g.gg..aaggaaugac..................................................................cCgc.gAGcCaGGAGACCuGCCaucaguuuuugaaucucc"
## [2] "...............[[[[[[...<<<......................................>>>...........(..((....<.<<.<.<.......................................>>>>.>..<<<......>>>.<<<.....<<.<<................<<<<<<.................................<<<.<<<<<<................................................................>>.>>>>...>>>...>>>..>.>>..............................................................................>>>>.....>>>....)))]]]]]]..............."
rfamConsensusSecondaryStructure("RF00174", filename = "RF00174secondaryStructure.txt", 
    format = "DB")
## [1] "uuaaauugaaacgaugauGGUu..ccccuuu......................aaagu....gaaggguu......AAaa..GGGAA.c.cc.G.G.UGa...............................aAaUCCgg.gGCuGcCC.CC.gCaACuGUAA..gc.Gg..........agagcaccccccAauAa............................GCCACUggcccgcaa..........................................................ggg.ccGGGAAGGC...ggg..g.gg..aaggaaugac..................................................................cCgc.gAGcCaGGAGACCuGCCaucaguuuuugaaucucc"
## [2] "...............[[[[[[...<<<......................................>>>...........(..((....<.<<.<.<.......................................>>>>.>..<<<......>>>.<<<.....<<.<<................<<<<<<.................................<<<.<<<<<<................................................................>>.>>>>...>>>...>>>..>.>>..............................................................................>>>>.....>>>....)))]]]]]]..............."

The generated files with secondary structure in the extended Dot-Bracket notation can be read by the R4RNA package (Lai et al. 2012), and used for further analysis and plotting of the corresponding RNA.

secondaryStructureTable <- readVienna("RF00174secondaryStructure.txt")

plotHelix(secondaryStructureTable)

It is also possible to retrieve an SVG file containing a representation of the consensus secondary structure of an Rfam family. Different types of secondary structure representations can be generated, including simple base pairing plots, plots with sequence conservation and plots with basepair conservation, amongst other types. The SVG file can be stored as a character string directly readable by the rsvg and magick packages, or saved to a file which can be read by external software.

## Retrieve an SVG with a normal representation of the secondary structure of Rfam
## family RF00174 and store it into a character string, which can be read by rsvg
## after conversion from SVG string to raw data

normalSecondaryStructureSVG = rfamSecondaryStructureXMLSVG("RF00174", plotType = "norm")
rsvg(charToRaw(normalSecondaryStructureSVG))

## Retrieve an SVG with a sequence conservation representation added to the
## secondary structure of Rfam family RF00174, and save it to an SVG file

rfamSecondaryStructureXMLSVG("RF00174", filename = "RF00174SSsequenceConservation.svg", 
    plotType = "cons")

It is also possible to directly plot the retrieved graphics to R’s graphics display, and save it to an image file such as PNG, JPEG or GIF.

## Plot a normal representation of the secondary structure of Rfam family RF00174

rfamSecondaryStructurePlot("RF00174", plotType = "norm")

Smiley face

## Save to a PNG file a plot of a representation of the secondary structure of
## Rfam family with ID FMN with sequence conservation annotation

rfamSecondaryStructurePlot("RF00050", filename = "RF00050SSsequenceConservation.png", 
    plotType = "cons")

Smiley face

4.3.3 Obtain the seed alignment used to define the Rfam family

The seed multiple sequence alignment that was manually curated and used to define the Rfam family can be retrieved with the “rfamSeedAlignment” function. The function returns the seed alignment as a Biostrings MultipleAlignment object with the aligned RNA sequences. The alignment can also be written to a file in Stockholm or FASTA format. In the example below, the seed alignment for Rfam family RF00174 is written to two files, in Stockholm and FASTA format, respectively.

rfamSeedAlignment("RF00174", filename = "RF00174seedAlignment.stk", format = "stockholm")
## RNAMultipleAlignment with 430 rows and 441 columns
##        aln                                                  names               
##   [1] GGCACCUUCGCGGCAGAUGGUU--C...CCUGCCAUCAGCGUCAUCAACCGC AF010496.1/39869-...
##   [2] CCACUCAGGGCGGGCGCUGGUU--U...CCUGCCAGCGCAUGGAUUUCGGGC AF010496.1/105318...
##   [3] GCUACUCCAACAGGCGAUGGU---U...CCUGCCAUCGCUCUGGCGUCGCAA AF010496.1/116971...
##   [4] GCAAUGAGGAAGGAUUAAGGUU--C...CCUGCCUUAAACCAAGUCAUCCAC AF193754.1/4343-4142
##   [5] GGAAAUUUUUUUGCAUAGGGU---U...CCGACCCUAUGUAAUCGUUCCACG AF193754.1/24966-...
##   [6] AGGCUGACCGGUGCAGCUGGUU--C...CCUGCCCGCUGCCCGCACGCGACC AF263012.1/9229-9015
##   [7] GAUAAUCCAAGUCGUCGAGGUU--C...CCGGCCCCGACAAUAUAUUGGUCC AF306632.1/382-181
##   [8] UCAAACAGCAACAGUAAAGGU---G...CCUGUCUUUAUUGUGAAGUUUCUA AJ000758.1/1191-1370
##   [9] CAUAUCGUGCAAAAAAAAGGUG--C...CCAGCCUUUUUCUAAUGCUAAGCU AJ000758.1/1489-1668
##   ... ...
## [422] ACUAUAUGUGGUGUUCAAGGUU--C...CCUGCCUUGAGCGCAAAUGUCCAC AE007869.2/70421-...
## [423] GGCCAUAUGCCGCCGUCAGGUG--C...CCGGCCUGGCGAGAUAGACCGGCC AL591688.1/199976...
## [424] GUAGCCUUGCGCUUUCGAGGUU--C...CCUGCCUCGUCACAGAUUCUCACU CP000094.2/355264...
## [425] UUACUGAAAUGGUUAUGUCGU---G...CCAGCCAAAUCGGUAAAAUAUAUU AALF02000008.1/20...
## [426] GCCAUUCAGCGAAGAGAUGGUU--C...CCUGCCAUCGCGUCGUUCGUCCGG CP002156.1/230652...
## [427] UUUUUUAUAUAUCGAAAUGGUC--A...CCAGCCAGAUCGGUAAAAGAUAUU AALC02000021.1/24...
## [428] AUAUAGUAUGCGAGUAUUGGU---G...CCUGCCAAUACUUGAAGUUUCUAU CP003056.1/112605...
## [429] CUUCAAGAAUAAAAGGCAUGUU--A...CCAGCCCGCUCAGUAAAAGAUUUU AALE02000017.1/13...
## [430] AUUUGUAGUUACUGAUAGGGUC--A...CCAGCCAGAUCGGUAAAAGAUAUU AALD02000002.1/89...
rfamSeedAlignment("RF00174", filename = "RF00174seedAlignment.stk", format = "fasta")
## RNAMultipleAlignment with 430 rows and 441 columns
##        aln                                                  names               
##   [1] GGCACCUUCGCGGCAGAUGGUU--C...CCUGCCAUCAGCGUCAUCAACCGC AF010496.1/39869-...
##   [2] CCACUCAGGGCGGGCGCUGGUU--U...CCUGCCAGCGCAUGGAUUUCGGGC AF010496.1/105318...
##   [3] GCUACUCCAACAGGCGAUGGU---U...CCUGCCAUCGCUCUGGCGUCGCAA AF010496.1/116971...
##   [4] GCAAUGAGGAAGGAUUAAGGUU--C...CCUGCCUUAAACCAAGUCAUCCAC AF193754.1/4343-4142
##   [5] GGAAAUUUUUUUGCAUAGGGU---U...CCGACCCUAUGUAAUCGUUCCACG AF193754.1/24966-...
##   [6] AGGCUGACCGGUGCAGCUGGUU--C...CCUGCCCGCUGCCCGCACGCGACC AF263012.1/9229-9015
##   [7] GAUAAUCCAAGUCGUCGAGGUU--C...CCGGCCCCGACAAUAUAUUGGUCC AF306632.1/382-181
##   [8] UCAAACAGCAACAGUAAAGGU---G...CCUGUCUUUAUUGUGAAGUUUCUA AJ000758.1/1191-1370
##   [9] CAUAUCGUGCAAAAAAAAGGUG--C...CCAGCCUUUUUCUAAUGCUAAGCU AJ000758.1/1489-1668
##   ... ...
## [422] ACUAUAUGUGGUGUUCAAGGUU--C...CCUGCCUUGAGCGCAAAUGUCCAC AE007869.2/70421-...
## [423] GGCCAUAUGCCGCCGUCAGGUG--C...CCGGCCUGGCGAGAUAGACCGGCC AL591688.1/199976...
## [424] GUAGCCUUGCGCUUUCGAGGUU--C...CCUGCCUCGUCACAGAUUCUCACU CP000094.2/355264...
## [425] UUACUGAAAUGGUUAUGUCGU---G...CCAGCCAAAUCGGUAAAAUAUAUU AALF02000008.1/20...
## [426] GCCAUUCAGCGAAGAGAUGGUU--C...CCUGCCAUCGCGUCGUUCGUCCGG CP002156.1/230652...
## [427] UUUUUUAUAUAUCGAAAUGGUC--A...CCAGCCAGAUCGGUAAAAGAUAUU AALC02000021.1/24...
## [428] AUAUAGUAUGCGAGUAUUGGU---G...CCUGCCAAUACUUGAAGUUUCUAU CP003056.1/112605...
## [429] CUUCAAGAAUAAAAGGCAUGUU--A...CCAGCCCGCUCAGUAAAAGAUUUU AALE02000017.1/13...
## [430] AUUUGUAGUUACUGAUAGGGUC--A...CCAGCCAGAUCGGUAAAAGAUAUU AALD02000002.1/89...

4.3.4 Obtain the phylogenetic tree associated with a seed alignment

The phylogenetic tree associated to the seed alignment can be retrieved with the “rfamSeedTree” function. The tree is ouput in the New Hampshire extended format (NHX). The resulting tree is returned as a string and saved to a file, which can be directly read by the treeio package (Wang et al. 2019) or other external software.

seedTreeNHXString <- rfamSeedTree("RF00050", filename = "RF00050tree.nhx")

treeioTree <- read.nhx("RF00050tree.nhx")

A plot of the phylogenetic tree can also be directly retrieved with the “rfamSeedTreeImage” function, labeled with either species names or sequence accessions.

rfamSeedTreeImage("RF00164", filename = "RF00164seedTreePlot.png", label = "species")