Contents

1 Introduction

When working on your own genome project or when using publicly available genomes for comparative analyses, it is critical to assess the quality of your data. Over the past years, several tools have been developed and several metrics have been proposed to assess the quality of a genome assembly and annotation. cogeqc helps users interpret their genome assembly statistics by comparing them with statistics on publicly available genomes on the NCBI. Additionally, cogeqc also provides an interface to BUSCO (Simão et al. 2015), a popular tool to assess gene space completeness. Graphical functions are available to make publication-ready plots that summarize the results of quality control.

2 Installation

You can install cogeqc from Bioconductor with the following code:

if(!requireNamespace('BiocManager', quietly = TRUE))
  install.packages('BiocManager')
BiocManager::install("cogeqc")
# Load package after installation
library(cogeqc)

3 Assessing genome assembly quality: statistics in a context

When analyzing and interpreting genome assembly statistics, it is often useful to place your stats in a context by comparing them with stats from genomes of closely-related or even the same species. cogeqc provides users with an interface to the NCBI Datasets API, which can be used to retrieve summary stats for genomes on NCBI. In this section, we will guide you on how to retrieve such information and use it as a reference to interpret your data.

3.1 Obtaining assembly statistics for NCBI genomes

To obtain a data frame of summary statistics for NCBI genomes of a particular taxon, you will use the function get_genome_stats(). In the taxon parameter, you must specify the taxon from which data will be extracted. This can be done either by passing a character scalar with taxon name or by passing a numeric scalar with NCBI Taxonomy ID. For example, the code below demonstrates two ways of extracting stats on maize (Zea mays) genomes on NCBI:

# Example 1: get stats for all maize genomes using taxon name
maize_stats <- get_genome_stats(taxon = "Zea mays")
head(maize_stats)
#>         accession  source species_taxid         species_name
#> 1 GCA_000005005.6 GENBANK          4577             Zea mays
#> 2 GCA_000223545.1 GENBANK          4577             Zea mays
#> 3 GCA_000275765.1 GENBANK          4577             Zea mays
#> 4 GCA_001644905.2 GENBANK        381124 Zea mays subsp. mays
#> 5 GCA_001984235.2 GENBANK        381124 Zea mays subsp. mays
#> 6 GCA_001990705.1 GENBANK        381124 Zea mays subsp. mays
#>   species_common_name species_ecotype species_strain species_isolate
#> 1               maize            <NA>             NA            <NA>
#> 2               maize            <NA>             NA            <NA>
#> 3               maize            <NA>             NA            <NA>
#> 4               maize            <NA>             NA            <NA>
#> 5               maize            <NA>             NA            <NA>
#> 6               maize            <NA>             NA            <NA>
#>               species_cultivar assembly_level assembly_status
#> 1                          B73     Chromosome         current
#> 2 Palomero Toluqueno EDMX-2231       Scaffold         current
#> 3                          B73         Contig         current
#> 4                          W22     Chromosome         current
#> 5                          EP1     Chromosome         current
#> 6                           F7     Chromosome         current
#>                 assembly_name assembly_type submission_date
#> 1               B73 RefGen_v4       haploid      2017-02-07
#> 2     ZeaMays_PT_EDMX2233_1.0       haploid      2011-08-16
#> 3            ZmaysB73_wgs_1.0       haploid      2012-07-03
#> 4 Zm-W22-REFERENCE-NRGENE-2.0       haploid      2017-02-02
#> 5    Zm-EP1-REFERENCE-TUM-1.0       haploid      2017-02-09
#> 6     Zm-F7-REFERENCE-TUM-1.0       haploid      2017-02-09
#>                                                                              submitter
#> 1                                                                        maizesequence
#> 2 Laboratorio Nacional de Genómica para la Biodiversidad (LANGEBIO) CINVESTAV Irapuato
#> 3                                                                        maizesequence
#> 4                                                            W22 Sequencing Consortium
#> 5                                       Technical University of Munich, Plant Breeding
#> 6                                       Technical University of Munich, Plant Breeding
#>            sequencing_technology atypical refseq_category chromosome_count
#> 1                         PacBio    FALSE            <NA>               10
#> 2 454 GS20; 454 Titanium; Sanger     TRUE            <NA>               NA
#> 3                     454 GS FLX     TRUE            <NA>               NA
#> 4   Illumina HiSeq; 10x Genomics    FALSE            <NA>               10
#> 5                       Illumina    FALSE            <NA>               10
#> 6                       Illumina    FALSE            <NA>               10
#>   sequence_length ungapped_length contig_count contig_N50 contig_L50
#> 1      2134373047      2103640169         2787    1279870        506
#> 2       177051422       177051324       196705       1129      44576
#> 3         1335069         1335069         1844        949        434
#> 4      2133880228      2093255169        68134      72426       8765
#> 5      2455259639      2429778437       130125      82504       8782
#> 6      2392801755      2367184004       117311      96966       7318
#>   scaffold_count scaffold_N50 scaffold_L50 GC_percent annotation_provider
#> 1            596     10679170           62       46.5       maizesequence
#> 2         196697         1129        44576       45.5                <NA>
#> 3             NA           NA           NA       45.0                <NA>
#> 4            191    222590201            5       46.5                <NA>
#> 5          60567    249966684            5       46.5                <NA>
#> 6          62610    237276974            5       46.5                <NA>
#>   annotation_release_date gene_count_total gene_count_coding
#> 1              2017-02-07            39320             39320
#> 2                    <NA>               NA                NA
#> 3                    <NA>               NA                NA
#> 4                    <NA>               NA                NA
#> 5                    <NA>               NA                NA
#> 6                    <NA>               NA                NA
#>   gene_count_noncoding gene_count_pseudogene gene_count_other CC_ratio
#> 1                   NA                    NA               NA    278.7
#> 2                   NA                    NA               NA       NA
#> 3                   NA                    NA               NA       NA
#> 4                   NA                    NA               NA   6813.4
#> 5                   NA                    NA               NA  13012.5
#> 6                   NA                    NA               NA  11731.1
str(maize_stats)
#> 'data.frame':    95 obs. of  36 variables:
#>  $ accession              : chr  "GCA_000005005.6" "GCA_000223545.1" "GCA_000275765.1" "GCA_001644905.2" ...
#>  $ source                 : chr  "GENBANK" "GENBANK" "GENBANK" "GENBANK" ...
#>  $ species_taxid          : int  4577 4577 4577 381124 381124 381124 381124 381124 4579 381124 ...
#>  $ species_name           : chr  "Zea mays" "Zea mays" "Zea mays" "Zea mays subsp. mays" ...
#>  $ species_common_name    : chr  "maize" "maize" "maize" "maize" ...
#>  $ species_ecotype        : chr  NA NA NA NA ...
#>  $ species_strain         : logi  NA NA NA NA NA NA ...
#>  $ species_isolate        : chr  NA NA NA NA ...
#>  $ species_cultivar       : chr  "B73" "Palomero Toluqueno EDMX-2231" "B73" "W22" ...
#>  $ assembly_level         : Factor w/ 4 levels "Complete","Chromosome",..: 2 3 4 2 2 2 2 2 3 3 ...
#>  $ assembly_status        : chr  "current" "current" "current" "current" ...
#>  $ assembly_name          : chr  "B73 RefGen_v4" "ZeaMays_PT_EDMX2233_1.0" "ZmaysB73_wgs_1.0" "Zm-W22-REFERENCE-NRGENE-2.0" ...
#>  $ assembly_type          : chr  "haploid" "haploid" "haploid" "haploid" ...
#>  $ submission_date        : chr  "2017-02-07" "2011-08-16" "2012-07-03" "2017-02-02" ...
#>  $ submitter              : chr  "maizesequence" "Laboratorio Nacional de Genómica para la Biodiversidad (LANGEBIO) CINVESTAV Irapuato" "maizesequence" "W22 Sequencing Consortium" ...
#>  $ sequencing_technology  : chr  "PacBio" "454 GS20; 454 Titanium; Sanger" "454 GS FLX" "Illumina HiSeq; 10x Genomics" ...
#>  $ atypical               : logi  FALSE TRUE TRUE FALSE FALSE FALSE ...
#>  $ refseq_category        : chr  NA NA NA NA ...
#>  $ chromosome_count       : int  10 NA NA 10 10 10 10 10 NA NA ...
#>  $ sequence_length        : num  2.13e+09 1.77e+08 1.34e+06 2.13e+09 2.46e+09 ...
#>  $ ungapped_length        : num  2.10e+09 1.77e+08 1.34e+06 2.09e+09 2.43e+09 ...
#>  $ contig_count           : int  2787 196705 1844 68134 130125 117311 405730 182074 140894 102656 ...
#>  $ contig_N50             : int  1279870 1129 949 72426 82504 96966 10873 39757 31764 71122 ...
#>  $ contig_L50             : int  506 44576 434 8765 8782 7318 42688 16302 7912 8373 ...
#>  $ scaffold_count         : int  596 196697 NA 191 60567 62610 43299 3538 107418 48268 ...
#>  $ scaffold_N50           : int  10679170 1129 NA 222590201 249966684 237276974 215148664 2707071 107689 2995073 ...
#>  $ scaffold_L50           : int  62 44576 NA 5 5 5 5 233 499 168 ...
#>  $ GC_percent             : num  46.5 45.5 45 46.5 46.5 46.5 46 46.5 45.5 46 ...
#>  $ annotation_provider    : chr  "maizesequence" NA NA NA ...
#>  $ annotation_release_date: chr  "2017-02-07" NA NA NA ...
#>  $ gene_count_total       : int  39320 NA NA NA NA NA NA NA NA NA ...
#>  $ gene_count_coding      : int  39320 NA NA NA NA NA NA NA NA NA ...
#>  $ gene_count_noncoding   : int  NA NA NA NA NA NA NA NA NA NA ...
#>  $ gene_count_pseudogene  : int  NA NA NA NA NA NA NA NA NA NA ...
#>  $ gene_count_other       : int  NA NA NA NA NA NA NA NA NA NA ...
#>  $ CC_ratio               : num  279 NA NA 6813 13012 ...

# Example 2: get stats for all maize genomes using NCBI Taxonomy ID
maize_stats2 <- get_genome_stats(taxon = 4577)

# Checking if objects are the same
identical(maize_stats, maize_stats2)
#> [1] TRUE

As you can see, there are 95 maize genomes on the NCBI. You can also include filters in your searches by passing a list of key-value pairs with keys in list names and values in elements. For instance, to obtain only chromosome-scale and annotated maize genomes, you would run:

# Get chromosome-scale maize genomes with annotation
## Create list of filters
filt <- list(
    filters.has_annotation = "true",
    filters.assembly_level = "chromosome"
)
filt
#> $filters.has_annotation
#> [1] "true"
#> 
#> $filters.assembly_level
#> [1] "chromosome"

## Obtain data
filtered_maize_genomes <- get_genome_stats(taxon = "Zea mays", filters = filt)
dim(filtered_maize_genomes)
#> [1]  4 36

For a full list of filtering parameters and possible arguments, see the API documentation.

3.2 Comparing custom stats with NCBI stats

Now, suppose you sequenced a genome, obtained assembly and annotation stats, and want to compare them to NCBI genomes to identify potential issues. Examples of situations you may encounter include:

  • The genome you assembled is huge and you think there might be a problem with your assembly.

  • Your gene annotation pipeline predicted n genes, but you are not sure if this number is reasonable compared to other assemblies of the same species or closely-related species.

To compare user-defined summary stats with NCBI stats, you will use the function compare_genome_stats(). This function will include the values you observed for each statistic into a distribution (based on NCBI stats) and return the percentile and rank of your observed values in each distribution.

As an example, let’s go back to our maize stats we obtained in the previous section. Suppose you sequenced a new maize genome and observed the following values:

  1. Genome size = 2.4 Gb
  2. Number of genes = 50,000
  3. CC ratio = 21 Note: The CC ratio is the ratio of the number of contigs to the number of chromosome pairs, and it has been proposed in Wang and Wang (2022) as a measurement of contiguity that compensates for the flaws of N50 and allows cross-species comparisons.

To compare your observed values with those for publicly available maize genomes, you need to store them in a data frame. The column accession is mandatory, and any other column will be matched against columns in the data frame obtained with get_genome_stats(). Thus, make sure column names in your data frame match column names in the reference data frame. Then, you can compare both data frames as below:

# Check column names in the data frame of stats for maize genomes on the NCBI
names(maize_stats)
#>  [1] "accession"               "source"                 
#>  [3] "species_taxid"           "species_name"           
#>  [5] "species_common_name"     "species_ecotype"        
#>  [7] "species_strain"          "species_isolate"        
#>  [9] "species_cultivar"        "assembly_level"         
#> [11] "assembly_status"         "assembly_name"          
#> [13] "assembly_type"           "submission_date"        
#> [15] "submitter"               "sequencing_technology"  
#> [17] "atypical"                "refseq_category"        
#> [19] "chromosome_count"        "sequence_length"        
#> [21] "ungapped_length"         "contig_count"           
#> [23] "contig_N50"              "contig_L50"             
#> [25] "scaffold_count"          "scaffold_N50"           
#> [27] "scaffold_L50"            "GC_percent"             
#> [29] "annotation_provider"     "annotation_release_date"
#> [31] "gene_count_total"        "gene_count_coding"      
#> [33] "gene_count_noncoding"    "gene_count_pseudogene"  
#> [35] "gene_count_other"        "CC_ratio"

# Create a simulated data frame of stats for a maize genome
my_stats <- data.frame(
    accession = "my_lovely_maize",
    sequence_length = 2.4 * 1e9,
    gene_count_total = 50000,
    CC_ratio = 2
)

# Compare stats
compare_genome_stats(ncbi_stats = maize_stats, user_stats = my_stats)
#>         accession         variable percentile rank
#> 1 my_lovely_maize  sequence_length 0.97916667    3
#> 2 my_lovely_maize gene_count_total 1.00000000    1
#> 3 my_lovely_maize         CC_ratio 0.02898551    2

3.3 Visualizing summary assembly statistics

To have a visual representation of the summary stats obtained with get_genome_stats(), you will use the function plot_genome_stats().

# Summarize genome stats in a plot
plot_genome_stats(ncbi_stats = maize_stats)

Finally, you can pass your data frame of observed stats to highlight your values (as red points) in the distributions.

plot_genome_stats(ncbi_stats = maize_stats, user_stats = my_stats)