Original version: 31 October, 2023
This vignette illustrates how to display AlphaMissense predictions on AlphaFold predicted protein structure.
Visualization makes use of CRAN packages bio3d and r3dmol. Install these (if necessary) with
pkgs <- c("bio3d", "r3dmol")
pkgs_to_install <- pkgs[!pkgs %in% rownames(installed.packages())]
if (length(pkgs_to_install))
BiocManager::install(pkgs_to_install)
Start by loading the AlphaMissenseR library.
Visit the summary of available AlphaMissense datasets
am_available()
#> # A tibble: 7 × 6
#> record key size cached filename link
#> <chr> <chr> <dbl> <lgl> <chr> <chr>
#> 1 10813168 gene_hg38 253636 FALSE AlphaMissense_gene… http…
#> 2 10813168 isoforms_hg38 1177361934 FALSE AlphaMissense_isof… http…
#> 3 10813168 isoforms_aa_substitutions 2461351945 FALSE AlphaMissense_isof… http…
#> 4 10813168 hg38 642961469 FALSE AlphaMissense_hg38… http…
#> 5 10813168 hg19 622293310 FALSE AlphaMissense_hg19… http…
#> 6 10813168 gene_hg19 243943 FALSE AlphaMissense_gene… http…
#> 7 10813168 aa_substitutions 1207278510 FALSE AlphaMissense_aa_s… http…
This vignette uses the aa_substitutions
and
hg38
data resources; make sure that these have been cached
locally.
am_data("aa_substitutions")
#> * [16:08:59][info] retrieving file name 'AlphaMissense_aa_substitutions.tsv.gz' (1.1 Gb)
#> * [16:08:59][info] data licensed under 'CC-BY-4.0'
#> * [16:09:00][info] downloading or finding local file
#> adding rname 'AlphaMissense_aa_substitutions.tsv.gz'
#> * [16:21:40][info] creating database table 'aa_substitutions'
#> * [16:21:40][info] disconnecting all registered connections
#> # Source: table<aa_substitutions> [?? x 4]
#> # Database: DuckDB v1.1.1 [biocbuild@Linux 6.8.0-47-generic:R 4.5.0//home/biocbuild/.cache/R/BiocFileCache/2f7e82265d846_2f7e82265d846]
#> uniprot_id protein_variant am_pathogenicity am_class
#> <chr> <chr> <dbl> <chr>
#> 1 A0A024R1R8 M1A 0.467 ambiguous
#> 2 A0A024R1R8 M1C 0.383 ambiguous
#> 3 A0A024R1R8 M1D 0.827 pathogenic
#> 4 A0A024R1R8 M1E 0.524 ambiguous
#> 5 A0A024R1R8 M1F 0.275 benign
#> 6 A0A024R1R8 M1G 0.548 ambiguous
#> 7 A0A024R1R8 M1H 0.552 ambiguous
#> 8 A0A024R1R8 M1I 0.321 benign
#> 9 A0A024R1R8 M1K 0.288 benign
#> 10 A0A024R1R8 M1L 0.175 benign
#> # ℹ more rows
am_data("hg38")
#> * [16:22:10][info] retrieving file name 'AlphaMissense_hg38.tsv.gz' (613.2 Mb)
#> * [16:22:10][info] data licensed under 'CC-BY-4.0'
#> * [16:22:10][info] downloading or finding local file
#> adding rname 'AlphaMissense_hg38.tsv.gz'
#> * [16:28:54][info] creating database table 'hg38'
#> * [16:28:54][info] disconnecting all registered connections
#> * [16:29:32][info] renaming '#CHROM' to 'CHROM' in table 'hg38'
#> # Source: table<hg38> [?? x 10]
#> # Database: DuckDB v1.1.1 [biocbuild@Linux 6.8.0-47-generic:R 4.5.0//home/biocbuild/.cache/R/BiocFileCache/2f7e82265d846_2f7e82265d846]
#> CHROM POS REF ALT genome uniprot_id transcript_id protein_variant
#> <chr> <dbl> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 chr1 69094 G T hg38 Q8NH21 ENST00000335137.4 V2L
#> 2 chr1 69094 G C hg38 Q8NH21 ENST00000335137.4 V2L
#> 3 chr1 69094 G A hg38 Q8NH21 ENST00000335137.4 V2M
#> 4 chr1 69095 T C hg38 Q8NH21 ENST00000335137.4 V2A
#> 5 chr1 69095 T A hg38 Q8NH21 ENST00000335137.4 V2E
#> 6 chr1 69095 T G hg38 Q8NH21 ENST00000335137.4 V2G
#> 7 chr1 69097 A G hg38 Q8NH21 ENST00000335137.4 T3A
#> 8 chr1 69097 A C hg38 Q8NH21 ENST00000335137.4 T3P
#> 9 chr1 69097 A T hg38 Q8NH21 ENST00000335137.4 T3S
#> 10 chr1 69098 C A hg38 Q8NH21 ENST00000335137.4 T3N
#> # ℹ more rows
#> # ℹ 2 more variables: am_pathogenicity <dbl>, am_class <chr>
AlphaMissense predictions on pathogenicity of amino acid changes can be combined with AlphaFold (or other) predictions of protein structure.
Figure 3F of the AlphaMissense publication visualizes mean pathogenicity for UniProt id P35557. Filter amino acid data for that identifier
and visualization median pathogenicity with
The image is interactive, including rotation and zoom. The following sections explore this visualization in more detail.
Both AlphaMissense and AlphaFold use UniProt identifiers. Find all
AlphaMissense amino acid substitutions with UniProt identifiers starting
with P3555
; the choice of this identifier is so that
results can be compared with Figure 3F of the AlphaMissense
publication.
uniprot_ids <-
am_data("aa_substitutions") |>
dplyr::filter(uniprot_id %like% "P3555%") |>
dplyr::distinct(uniprot_id) |>
pull(uniprot_id)
uniprot_ids
#> [1] "P35558" "P35556" "P35557" "P35555"
The AlphaMissenseR
package includes several functions that facilitate interaction with AlphaFold; these functions start
with af_*()
. Use af_predictions()
to discover
AlphaFold predictions (via the AlphaFold API) associated with UniProt
identifiers.
prediction <- af_predictions(uniprot_ids)
#> * [16:29:37][info] 2 of 4 uniprot accessions not found
#> 'P35556', 'P35555'
glimpse(prediction)
#> Rows: 2
#> Columns: 25
#> $ entryId <chr> "AF-P35558-F1", "AF-P35557-F1"
#> $ gene <chr> "PCK1", "GCK"
#> $ sequenceChecksum <chr> "78D309E0845CC181", "094D4A2F78096724"
#> $ sequenceVersionDate <chr> "2006-03-07", "1994-06-01"
#> $ uniprotAccession <chr> "P35558", "P35557"
#> $ uniprotId <chr> "PCKGC_HUMAN", "HXK4_HUMAN"
#> $ uniprotDescription <chr> "Phosphoenolpyruvate carboxykinase, cytosolic […
#> $ taxId <int> 9606, 9606
#> $ organismScientificName <chr> "Homo sapiens", "Homo sapiens"
#> $ uniprotStart <int> 1, 1
#> $ uniprotEnd <int> 622, 465
#> $ uniprotSequence <chr> "MPPQLQNGLNLSAKVVQGSLDSLPQAVREFLENNAELCQPDHIHIC…
#> $ modelCreatedDate <chr> "2022-06-01", "2022-06-01"
#> $ latestVersion <int> 4, 4
#> $ allVersions <list> [1, 2, 3, 4], [1, 2, 3, 4]
#> $ isReviewed <lgl> TRUE, TRUE
#> $ isReferenceProteome <lgl> TRUE, TRUE
#> $ cifUrl <chr> "https://alphafold.ebi.ac.uk/files/AF-P35558-F1…
#> $ bcifUrl <chr> "https://alphafold.ebi.ac.uk/files/AF-P35558-F1…
#> $ pdbUrl <chr> "https://alphafold.ebi.ac.uk/files/AF-P35558-F1…
#> $ paeImageUrl <chr> "https://alphafold.ebi.ac.uk/files/AF-P35558-F1…
#> $ paeDocUrl <chr> "https://alphafold.ebi.ac.uk/files/AF-P35558-F1…
#> $ amAnnotationsUrl <chr> "https://alphafold.ebi.ac.uk/files/AF-P35558-F1…
#> $ amAnnotationsHg19Url <chr> "https://alphafold.ebi.ac.uk/files/AF-P35558-F1…
#> $ amAnnotationsHg38Url <chr> "https://alphafold.ebi.ac.uk/files/AF-P35558-F1…
Note the message indicating that some UniProt identifiers
(accessions) are not found in the AlphaFold database. The query returns
a tibble containing columns with information on organism and UniProt
characteristics (including protein sequence) , as well as URLs for files
representing three-dimensional protein structure. We will use
pdbUrl
.
Focus on a particular UniProt identifier and the PDB url.
Cache the PDB file using BiocFileCache, and read the PDB file using bio3d.
pdb_file <- BiocFileCache::bfcrpath(rnames = basename(pdb_url), fpath = pdb_url)
pdb <- bio3d::read.pdb(pdb_file)
pdb
#>
#> Call: bio3d::read.pdb(file = pdb_file)
#>
#> Total Models#: 1
#> Total Atoms#: 3642, XYZs#: 10926 Chains#: 1 (values: A)
#>
#> Protein Atoms#: 3642 (residues/Calpha atoms#: 465)
#> Nucleic acid Atoms#: 0 (residues/phosphate atoms#: 0)
#>
#> Non-protein/nucleic Atoms#: 0 (residues: 0)
#> Non-protein/nucleic resid values: [ none ]
#>
#> Protein sequence:
#> MLDDRARMEAAKKEKVEQILAEFQLQEEDLKKVMRRMQKEMDRGLRLETHEEASVKMLPT
#> YVRSTPEGSEVGDFLSLDLGGTNFRVMLVKVGEGEEGQWSVKTKHQMYSIPEDAMTGTAE
#> MLFDYISECISDFLDKHQMKHKKLPLGFTFSFPVRHEDIDKGILLNWTKGFKASGAEGNN
#> VVGLLRDAIKRRGDFEMDVVAMVNDTVATMISCYYEDHQCEVGMI...<cut>...MLGQ
#>
#> + attr: atom, xyz, seqres, calpha, call
Visualize the protein using r3dmol, using the ‘cartoon’ style.