1 Introduction

The GBScleanR package has been developed to conduct error correction on genotype data obtained via NGS-based genotyping methods such as RAD-seq and GBS (Miller et al. 2007; Elshire et al. 2011). It is designed to estimate true genotypes along chromosomes from given allele read counts in the VCF file generated by SNP callers like GATK and TASSEL-GBS (McKenna et al. 2010; Glaubitz et al. 2014). The current implementation supports genotype data of a mapping population derived from two or more diploid parents followed by selfings, sibling crosses, or random crosses. e.g. F\(_2\) and 8-way RILs. Our method supposes markers to be biallelic and ordered along chromosomes by mapping reads on a reference genome sequence. To support smooth access to large size genotype data, every input VCF file is first converted to a Genomic Data Structure (GDS) file (Zheng et al. 2012). The current implementation does not allow non-biallelic markers, and those markers in an input VCF file will be automatically removed from a resultant GDS file. GBScleanR also provides functions for data visualization, filtering, and loading/writing a VCF file. Furthermore, the data structure of the GDS file created via this package is compatible with those used in the SNPRelate, GWASTools and GENESIS packages those are designed to handle large variant data and conduct downstream analyses including regression analysis (Zheng et al. 2012; Stephanie M. Gogarten et al. 2012; Stephanie M. Gogarten et al. 2019). In this document, we first walk through the utility functions implemented in GBScleanR to introduce a basic usage. Then, the core function of GBScleanR which estimates true genotypes for error correction will be introduced.

2 New Functions in GBScleanR Version 2

The latest release of GBScleanR is version 2. The major update in this version is that GBScleanR supports polyploid populations. GBScleanR tries to estimate the haplotype phases of founder and offspring samples based on given read counts. To process your data as of a polyploid population, set the ploidy argument in the loadGDS() function.

gds <- loadGDS(x = "/path/to/your/gds", ploidy = 4) # For tetraploid
gds <- loadGDS(x = "/path/to/your/gds", ploidy = 6) # For hexaploid

You can use any other functions in the GBScleanR package for your polyploid population in the same way for diploids.  

In addition, the setDominantMarkers() function enables you to force GBScleanR to treat some markers as dominant markers. In the iterative parameter optimization of the genotype estimation step, GBScleanR calculates the marker-wise reference-read-bias. We can assume that reads of either reference or alternative allele only can be observed at dominant markers and the marker-wise reference-read-bias must be 1 or 0 for dominant markers that give only reference or alternative allele reads, respectively.  

As another important change in estGeno() since version 2, the function now requires the node argument to explicitly specify whether raw or filtered read counts should be used for genotype estimation. If setCallFilter() has not been executed, the user-specified node argument will be ignored, and the “raw” node will be used by default.  

The data visualization by GBScleanR was also enhanced by new plot functions: plotReadRatio() and plotDosage(). See the “Plot Read Ratio and Dosage” section for the details.

3 Prerequisites

This package internally uses the following packages.
- ggplot2
- dplyr
- tidyr
- expm
- gdsfmt
- SeqArray
 

4 Installation

You can install GBScleanR from the Bioconductor repository with the following code.

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("GBScleanR")

 

The code below let you install the package from the github repository. The package released in the github usually get frequent update more than that in Bioconductor due to the release schedule.

if (!requireNamespace("devtools", quietly = TRUE))
    install.packages("devtools")
devtools::install_github("tomoyukif/GBScleanR", build_vignettes = TRUE)

5 Error while Handling a GDS File

You may face to the following error message or similar one if you killed the process that was accessing a GDS file.

Stream Read Error, need 12 byte(s) but receive 0

This error message indicates the corruption of the GDS file and you cannot access it anymore. In this case, please remake a GDS file using the gbsrVCF2GDS() function.

6 Loading Data

The main class of the GBScleanR package is GbsrGenotypData which inherits the GenotypeData class in the SeqArray package. The gbsrGenotypeData class object has three slots: sample, marker, and scheme. The data slot holds genotype data as a gds.class object which is defined in the gdsfmt package while snpAnnot and scanAnnot contain objects storing annotation information of SNPs and samples, which are the SnpAnnotationDataFrame and ScanAnnotationDataFrame objects defined in the GWASTools package. See the vignette of GWASTools for more detail. GBScleanR follows the way of GWASTools in which a unique genotyping instance (genotyped sample) is called “scan”.  

Load the package.

library("GBScleanR")

 

GBScleanR only supports a VCF file as input. As an example data, we use sample genotype data for a biparental F2 population derived from inbred parents.

vcf_fn <- system.file("extdata", "sample.vcf", package = "GBScleanR")
gds_fn <- tempfile("sample", fileext = ".gds")

 

As mentioned above, the GbsrGenotypeData class requires genotype data in the gds.class object which enable us quick access to the genotype data without loading the whole data on RAM. At the beginning of the processing, we need to convert data format of our genotype data from VCF to GDS. This conversion can be achieved using gbsrVCF2GDS() as shown below. A compressed VCF file (.vcf.gz) is also acceptable.

# `force = TRUE` allow the function to over write the GDS file,
# even if a GDS file exists at `out_fn`.
gbsrVCF2GDS(vcf_fn = vcf_fn, out_fn = gds_fn, force = TRUE, verbose = FALSE)
## [1] "/tmp/RtmpQ7MJxU/sample1b14b163350509.gds"

 

Once we converted the VCF to the GDS, we can create a GbsrGenotypeData instance for our data.

gds <- loadGDS(gds_fn, verbose = FALSE)

The first time to load a newly produced GDS file will take long time due to data reformatting for quick access.

7 Utility Methods

7.1 Getters

Getter functions allow you to retrieve basic information of genotype data, e.g. the number of SNPs and samples, chromosome names, physical position of SNPs and alleles.

# Number of samples
nsam(gds)
## [1] 102

 

# Number of SNPs
nmar(gds) 
## [1] 242

 

# Indices of chromosome ID of all markers
head(getChromosome(gds)) 
## [1] "1" "1" "1" "1" "1" "1"

 

# Chromosome names of all markers
head(getChromosome(gds)) 
## [1] "1" "1" "1" "1" "1" "1"

 

# Position (bp) of all markers
head(getPosition(gds)) 
## [1] 522289 571177 577905 720086 735019 841286

 

# Reference allele of all markers
head(getAllele(gds)) 
## [1] "A,G" "A,G" "C,T" "G,C" "C,T" "G,T"

 

# SNP IDs
head(getMarID(gds)) 
## [1] 1 2 3 4 5 6

 

# sample IDs
head(getSamID(gds)) 
## [1] "F2_1054" "F2_1055" "F2_1059" "F2_1170" "F2_1075" "F2_1070"

 

The function getGenotype() returns genotype call data in which integer numbers 0, 1, and 2 indicate the number of reference allele.

geno <- getGenotype(gds)

 

The function getRead() returns read count data as a list with two elements ref and alt containing reference read counts and alternative read counts, respectively.

geno <- getRead(gds)

7.2 Data Summarization

7.2.1 Collect Basic Summary Statistics

countGenotype() and countRead() are class methods of GbsrGenotypeData and they summarize genotype counts and read counts per marker and per sample.

gds <- countGenotype(gds)
gds <- countRead(gds)

 

7.2.2 Visualize Basic Summary Statistics

These summary statistics can be visualized via plotting functions. With the values obtained via countGenotype(), we can plot histograms of missing rate , heterozygosity, reference allele frequency as shown below.

# Histgrams of missing rate
histGBSR(gds, stats = "missing") 

Missing rate per marker and per sample.  

# Histgrams of heterozygosity
histGBSR(gds, stats = "het") 

Heterozygosity per marker and per sample.  

# Histgrams of reference allele frequency
histGBSR(gds, stats = "raf")