.
.
.
The SeqArray package is designed for R programming environment, and enables high-performance computing in the multi-core symmetric multiprocessing and loosely coupled computer cluster framework. The features of SeqArray are extended with other existing R packages for WGS data analyses, and the R codes for demonstration are available in the package vignette R Integration.
Figure 1: SeqArray framework and flowchart. The SeqArray format is built on top of the Genomic Data Structure (GDS) format, and GDS is a generic data container with hierarchical structure for storing multiple array-oriented data sets. A high-level R interface to GDS files is provided in the gdsfmt package with a C++ library, and the SeqArray package offers functionalities specific to sequencing data. At a minimum a SeqArray file contains sample and variant identifiers, position, chromosome, reference and alternate alleles for each variant. Parallel computing environments, like multi-core computer clusters, are enabled with SeqArray. The functionality of SeqArray is extended by SeqVarTools, SNPRelate, GENESIS and other R/Bioconductor packages for WGS analyses.
.
library(SeqArray)
## Loading required package: gdsfmt
# open a SeqArray file in the package (1000 Genomes Phase1, chromosome 22)
file <- seqOpen(seqExampleFileName("KG_Phase1"))
seqSummary(file)
## File: /tmp/Rtmpaw2XyJ/Rinst2ec79d362f461d/SeqArray/extdata/1KG_phase1_release_v3_chr22.gds
## Format Version: v1.0
## Reference: GRCh37
## Ploidy: 2
## Number of samples: 1,092
## Number of variants: 19,773
## Chromosomes:
## chr22: 19773
## Alleles:
## DEL, Deletion
## tabulation: 2, 19773(100.0%)
## Annotation, Quality:
## Min: 0, 1st Qu: 100, Median: 100, Mean: 110.146536457016, 3rd Qu: 100, Max: 3002, NA's: 10
## Annotation, FILTER:
## PASS, , 19773(100.0%)
## Annotation, INFO variable(s):
## <None>
## Annotation, FORMAT variable(s):
## GT, 1, String, Genotype
## Annotation, sample variable(s):
## Family.ID, String, <NA>
## Population, String, <NA>
## Gender, String, <NA>
.
.
.
.
.
.
.
Table 1: The key functions in the SeqArray package.
Function | Description |
---|---|
seqVCF2GDS | Reformat VCF files. » |
seqSetFilter | Define a data subset of samples or variants. » |
seqGetData | Get data from a SeqArray file with a defined filter. » |
seqApply | Apply a user-defined function over array margins. » |
seqParallel | Apply functions in parallel. » |
Genotypic data and annotations are stored in an array-oriented manner, providing efficient data access using the R programming language. Table 1 lists five key functions provided in the SeqArray package and many data analyses can be done using just these functions.
seqVCF2GDS()
converts VCF files to SeqArray format.
Multiple cores in an SMP architecture within one or more compute nodes
in a compute cluster can be used to simultaneously reformat the data.
seqVCF2GDS()
utilizes R’s connection interface to read VCF
files incrementally, and it is able to import data from http/ftp texts
and the standard output of a command-line tool via a pipe.
seqSetFilter()
and seqGetData()
can be used
together to retrieve data for a selected set of samples from a defined
genomic region. GRanges
and GRangesList
objects defined in the Bioconductor core packages are supported via
seqSetFilter()
(Gentleman et al.
2004; Lawrence et al. 2013).
seqApply()
applies a user-defined function to array
margins of genotypes and annotations. The function that is applied can
be defined in R as is typical, or via C/C++ code using the Rcpp package
(Eddelbuettel et al. 2011).
seqParallel()
utilizes the facilities in the parallel
package (Rossini, Tierney, and Li 2007; R Core
Team 2016) to perform calculations on a SeqArray file in
parallel.
We illustrate the SeqArray functions by implementing an example to
calculate the frequency of reference allele across all chromosomes. If a
genomic region is specified via seqSetFilter()
, the
calculation is performed within the region instead of using all
variants. seqApply()
enables applying a user-defined
function to the margin of genotypes, and the R code is shown as
follows:
af <- seqApply(file, "genotype", as.is="double", margin="by.variant",
FUN=function(x) { mean(x==0L, na.rm=TRUE) })
head(af)
## [1] 0.6950549 0.9432234 0.9995421 0.9995421 0.9386447 0.9990842
where file
is a SeqArray file, as.is
indicates returning a numeric vector, margin
is specified
for applying the function by variant. The variable x
in the
user-defined function is an allele-by-sample integer matrix at a site
and 0L
denotes the reference allele where the suffix
L
indicates the number is an integer.
The Rcpp package simplifies integration of compiled C++ code with R (Eddelbuettel et al. 2011), and the function can be dynamically defined with inlined C/C++ codes:
library(Rcpp)
cppFunction("
double CalcAlleleFreq(IntegerVector x)
{
int len=x.size(), n=0, n0=0;
for (int i=0; i < len; i++)
{
int g = x[i];
if (g != NA_INTEGER)
{
n++;
if (g == 0) n0++;
}
}
return double(n0) / n;
}")
where IntegerVector indicates the input variable
x
is an integer vector, NA_INTEGER
is missing
value and the function counts how many zeros and non-missing values for
calculating frequency. The name CalcAlleleFreq can be passed to
seqApply()
directly:
af <- seqApply(file, "genotype", as.is="double", margin="by.variant", FUN=CalcAlleleFreq)
head(af)
## [1] 0.6950549 0.9432234 0.9995421 0.9995421 0.9386447 0.9990842
The C++ integration is several times faster than the R implementation, suggesting an efficient approach with C/C++ when real-time performance is required.
It is able to run the calculation in parallel. The genotypes of a SeqArray file are automatically split into non-overlapping parts according to different variants or samples, and the results from client processes collected internally:
af <- seqApply(file, "genotype", as.is="double",
margin="by.variant", FUN=function(x) { mean(x==0L, na.rm=TRUE) }, parallel=2)
head(af)
## [1] 0.6950549 0.9432234 0.9995421 0.9995421 0.9386447 0.9990842
Here parallel
specifies the number of cores.
Principal Component Analysis (PCA) is a common tool used in
exploratory data analysis for high-dimensional data. PCA is often
involved with the calculation of covariance matrix, and the following R
code implements the calculation proposed in (Patterson, Price, and Reich 2006). The
user-defined function computes the covariance matrix for each variant
and adds up to a total matrix s
. The argument
.progress=TRUE
enables the display of progress information
during the calculation.
# covariance variable with an initial value
s <- 0
seqApply(file, "$dosage", function(x)
{
p <- 0.5 * mean(x, na.rm=TRUE) # allele frequency
g <- (x - 2*p) / sqrt(p*(1-p)) # normalized by allele frequency
g[is.na(g)] <- 0 # correct missing values
s <<- s + (g %o% g) # update the cov matrix s in the parent environment
}, margin="by.variant", .progress=TRUE)
# scaled by the number of samples over the trace
s <- s * (nrow(s) / sum(diag(s)))
# eigen-decomposition
eig <- eigen(s)
[..................................................] 0%, ETC: ---
[==================>...............................] 36%, ETC: 4.1m
...
[==================================================] 100%, completed in 14.3m
# covariance variable with an initial value
s <- 0
seqBlockApply(file, "$dosage", function(x)
{
p <- 0.5 * colMeans(x, na.rm=TRUE) # allele frequencies (a vector)
g <- (t(x) - 2*p) / sqrt(p*(1-p)) # normalized by allele frequency
g[is.na(g)] <- 0 # correct missing values
s <<- s + crossprod(g) # update the cov matrix s in the parent environment
}, margin="by.variant", .progress=TRUE)
## [..................................................] 0%, ETC: --- [==================================================] 100%, completed, 20s
# scaled by the number of samples over the trace
s <- s * (nrow(s) / sum(diag(s)))
# eigen-decomposition
eig <- eigen(s, symmetric=TRUE)
seqParallel()
utilizes the facilities offered by the R
parallel package to perform calculations within a cluster or SMP
environment, and the genotypes are automatically split into
non-overlapping parts. The parallel implementation with R is shown as
follows, and the C optimized function is also available in the SNPRelate
package.
# the datasets are automatically split into four non-overlapping parts
genmat <- seqParallel(2, file, FUN = function(f)
{
s <- 0 # covariance variable with an initial value
seqBlockApply(f, "$dosage", function(x)
{
p <- 0.5 * colMeans(x, na.rm=TRUE) # allele frequencies (a vector)
g <- (t(x) - 2*p) / sqrt(p*(1-p)) # normalized by allele frequency
g[is.na(g)] <- 0 # correct missing values
s <<- s + crossprod(g) # update the cov matrix s in the parent environment
}, margin="by.variant")
s # output
}, .combine = "+", # sum "s" of different processes together
split = "by.variant")
# scaled by the number of samples over the trace
genmat <- genmat * (nrow(genmat) / sum(diag(genmat)))
# eigen-decomposition
eig <- eigen(genmat, symmetric=TRUE)
# figure
plot(eig$vectors[,1], eig$vectors[,2], xlab="PC 1", ylab="PC 2")
More examples can be found: SeqArray Data Format and Access
The default setting for the analysis functions in the SeqArray
package is serial implementation, but users can setup a cluster
computing environment manually via seqParallelSetup()
and
distribute the calculations to multiple cores or even more than 100
cluster nodes.
# use 2 cores for demonstration
seqParallelSetup(2)
## Enable the computing cluster with 2 forked R processes.
# numbers of distinct alleles per site
table(seqNumAllele(file))
##
## 2
## 19773
# reference allele frequencies
summary(seqAlleleFreq(file, ref.allele=0L))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.9725 0.9963 0.9286 0.9991 1.0000
# close the cluster environment
seqParallelSetup(FALSE)
## Stop the computing cluster.
.
.
.
.
.
.
.
In this section, we illustrate how to work with Bioconductor core
packages for performing common queries to retrieve data from a SeqArray
file. The GRanges
and GRangesList
classes
manipulate genomic range data and can be used in the function
seqSetFilter()
to define a data subset. For example, the
annotation information of each exon, the coding range and transcript ID
are stored in the TxDb.Hsapiens.UCSC.hg19.knownGene
object
for the UCSC known gene annotations on hg19.
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
# get the exons grouped by gene
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
txs <- exonsBy(txdb, "gene")
where exonsBy()
returns a GRangesList
object for all known genes in the database.
seqSetFilter(file, txs) # define an exon filter
## # of selected variants: 1,050
# VCF export with exon variants
seqGDS2VCF(file, "exons.vcf.gz")
## Tue Jan 3 15:43:55 2017
## VCF Export: exons.vcf.gz
## 1,092 samples, 1,050 variants
## INFO Field: <none>
## FORMAT Field: <none>
##
[..................................................] 0%, ETC: ---
[==================================================] 100%, completed in 0s
## Tue Jan 3 15:43:55 2017 Done.
If random-access memory is sufficiently large, users could load all
exon variants via seqGetData(file, "genotype")
; otherwise,
data have to be loaded by chunk or a user-defined function is applied
over variants by seqApply()
.
SeqArray can also export data with selected variants and samples as a
VCF
object for use with the VariantAnnotation package (Obenchain et al. 2014):
library(VariantAnnotation)
# select a region [10Mb, 30Mb] on chromosome 22
seqSetFilterChrom(file, 22, from.bp=10000000, to.bp=30000000)
## # of selected variants: 7,066
vcf <- seqAsVCF(file, chr.prefix="chr")
vcf
## class: CollapsedVCF
## dim: 7066 1092
## rowRanges(vcf):
## GRanges with 9 metadata columns: ID, REF, ALT, QUAL, FILTER, REF, ALT, QUAL, FILTER
## info(vcf):
## DataFrame with 0 columns:
## geno(vcf):
## List of length 1: GT
## geno(header(vcf)):
## Number Type Description
## GT 1 String Genotype
locateVariants(vcf, txdb, CodingVariants())
## GRanges object with 524 ranges and 9 metadata columns:
## seqnames ranges strand | LOCATION LOCSTART LOCEND QUERYID TXID CDSID
## <Rle> <IRanges> <Rle> | <factor> <integer> <integer> <integer> <character> <IntegerList>
## 1 chr22 [17071862, 17071862] - | coding 1579 1579 128 74436 216505
## 2 chr22 [17073170, 17073170] - | coding 271 271 129 74436 216505
## 3 chr22 [17589225, 17589225] + | coding 1116 1116 377 73481 214034
## 4 chr22 [17601466, 17601466] - | coding 552 552 382 74444 216522
## 5 chr22 [17629357, 17629357] - | coding 424 424 394 74446 216528
## ... ... ... ... . ... ... ... ... ... ...
## 520 chr22 [29913278, 29913278] - | coding 1567 1567 7023 74771 217273
## 521 chr22 [29924156, 29924156] - | coding 977 977 7030 74768 217279
## 522 chr22 [29924156, 29924156] - | coding 977 977 7030 74769 217279
## 523 chr22 [29924156, 29924156] - | coding 977 977 7030 74770 217279
## 524 chr22 [29924156, 29924156] - | coding 977 977 7030 74771 217279
## GENEID PRECEDEID FOLLOWID
## <character> <CharacterList> <CharacterList>
## 1 150160
## 2 150160
## 3 23765
## 4 27439
## 5 27440
## ... ... ... ...
## 520 8563
## 521 8563
## 522 8563
## 523 8563
## 524 8563
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
.
.
.
.
.
.
.
The SeqVarTools package is available on Bioconductor, which defines S4 classes and methods for other common operations and analyses on SeqArray datasets. The vignette of SeqVarTools is http://www.bioconductor.org/packages/release/bioc/vignettes/SeqVarTools/inst/doc/SeqVarTools.pdf.
The SeqVarTools package extends SeqArray by providing methods for many tasks common to quality control and analysis of sequence data. Methods include: transition/transversion ratio, heterozygosity and homozygosity rates, singleton counts, Hardy-Weinberg equilibrium, Mendelian error checking, and linear and logistic regression. Additionally, SeqVarTools defines a new class to link the information present in the SeqArray file to additional sample and variant annotation provided by the user, such as sex and phenotype. One could select a subset of samples in a file and run a linear regression on all variants:
library(Biobase)
library(SeqVarTools)
data(KG_P1_SampData)
KG_P1_SampData
## An object of class 'AnnotatedDataFrame'
## rowNames: 1 2 ... 1092 (1092 total)
## varLabels: sample.id sex age phenotype
## varMetadata: labelDescription
head(pData(KG_P1_SampData)) # show KG_P1_SampData
## sample.id sex age phenotype
## 1 HG00096 male 55 -0.65582105
## 2 HG00097 female 34 1.28337670
## 3 HG00099 female 39 0.05563847
## 4 HG00100 female 62 0.11139003
## 5 HG00101 male 60 0.34933331
## 6 HG00102 female 28 0.36536723
# link sample data to SeqArray file
seqData <- SeqVarData(file, sample.data)
# set sample and variant filters
female <- sampleData(seqData)$sex == "female"
seqSetFilter(seqData, sample.sel=female)
# run linear regression
res <- regression(seqData, outcome="phenotype", covar="age")
head(res)
## variant.id n freq Est SE Wald.Stat Wald.Pval
## 1 1 567 0.6887125 -0.090555715 0.0699074 1.677974724 0.19519378
## 2 2 567 0.9400353 0.009685877 0.1321824 0.005369459 0.94158602
## 3 3 567 0.9991182 -0.378945215 1.0238102 0.136997920 0.71128392
## 4 4 567 1.0000000 NA NA NA NA
## 5 5 567 0.9356261 -0.009732930 0.1281883 0.005764880 0.93947733
## 6 6 567 0.9982363 -1.379486822 0.7233651 3.636804912 0.05651529
.
.
.
.
.
.
.
Parallel implementations of relatedness and principal component analysis with SeqArray format are enabled in the package SNPRelate, to detect and adjust for population structure and cryptic relatedness in association studies. The kernel of SNPRelate was optimized with SIMD instructions and multi-thread algorithms, and it was designed for bi-allelic SNP data originally (Zheng et al. 2012). In order to analyze sequence variant calls, and SNPRelate has been rewritten to take the dosages of reference alleles as an input genotype matrix with distinct integers 0, 1, 2 and NA for SeqArray files. Therefore no format conversion is required for WGS analyses.
Principal component analysis is implemented in the SNPRelate function
snpgdsPCA()
, and the exact and new randomized algorithms
are both provided (Patterson, Price, and Reich
2006; Galinsky et al. 2016). The randomized matrix algorithm is
designed to reduce the running time for large number of study
individuals (e.g., greater than 10,000 samples). Relatedness analyses
include PLINK method of moment (MoM), KING robust methods, GCTA genetic
relationship matrix (GRM) and individual-perspective beta estimator
(Purcell et al. 2007; Manichaikul et al. 2010;
“GCTA: A Tool for Genome-Wide Complex Trait
Analysis.” 2011; Bruce S. Weir and Zheng 2015), and these
algorithms are computationally efficient and optimized with SIMD
instructions. In addition, fixation index (\(F_\text{st}\)) has been widely used to
measure the genetic difference between populations, and the calculations
of moment estimators are available in the SNPRelate package with all
variants or a sliding window (B. S. Weir and
Cockerham 1984; B. S. Weir and Hill 2002; Bruce S. Weir et al.
2005).
It is suggested to perform marker pruning before running PCA and IBD analyses on WGS variant data, to reduce the influence of linkage disequilibrium and rare variants.
library(SNPRelate)
## SNPRelate -- supported by Streaming SIMD Extensions 2 (SSE2)
set.seed(1000)
# may try different LD thresholds for sensitivity analysis
snpset <- snpgdsLDpruning(file, ld.threshold=0.2, maf=0.01)
## SNV pruning based on LD:
## Calculating allele counts/frequencies ...
## [..................................................] 0%, ETC: --- [==================================================] 100%, completed, 0s
## # of selected variants: 7,028
## Excluding 12,745 SNVs (monomorphic: TRUE, MAF: 0.01, missing rate: 0.05)
## # of samples: 1,092
## # of SNVs: 7,028
## using 1 thread
## sliding window: 500,000 basepairs, Inf SNPs
## |LD| threshold: 0.2
## method: composite
## Chrom 22: |====================|====================|
## 10.19%, 2,014 / 19,773 (Sun Jan 12 19:07:45 2025)
## 2,014 markers are selected in total.
names(snpset)
## [1] "chr22"
head(snpset$chr22) # variant.id
## [1] 1 8 9 10 12 16
# get all selected variant id
snpset.id <- unlist(snpset)
# Run PCA
pca <- snpgdsPCA(file, snp.id=snpset.id, num.thread=2)
## Principal Component Analysis (PCA) on genotypes:
## Calculating allele counts/frequencies ...
## [..................................................] 0%, ETC: --- [==================================================] 100%, completed, 0s (process 1/2)
## # of selected variants: 2,014
## # of samples: 1,092
## # of SNVs: 2,014
## using 2 threads
## # of principal components: 32
## CPU capabilities: Double-Precision SSE2
## Sun Jan 12 19:07:46 2025 (internal increment: 2844)
## [..................................................] 0%, ETC: --- [==================================================] 100%, completed, 0s
## Sun Jan 12 19:07:46 2025 Begin (eigenvalues and eigenvectors)
## Sun Jan 12 19:07:47 2025 Done.
# variance proportion (%)
pc.percent <- pca$varprop*100
head(round(pc.percent, 2))
## [1] 4.17 2.65 0.70 0.60 0.47 0.41
# plot the first 4 eigenvectors with character=20 and size=0.5
plot(pca, eig=1:4, pch=20, cex=0.5)
Population information are available:
pop.code <- factor(seqGetData(file, "sample.annotation/Population"))
head(pop.code)
## [1] GBR GBR GBR GBR GBR GBR
## Levels: ASW CEU CHB CHS CLM FIN GBR IBS JPT LWK MXL PUR TSI YRI
popgroup <- list(
EastAsia = c("CHB", "JPT", "CHS", "CDX", "KHV", "CHD"),
European = c("CEU", "TSI", "GBR", "FIN", "IBS"),
African = c("ASW", "ACB", "YRI", "LWK", "GWD", "MSL", "ESN"),
SouthAmerica = c("MXL", "PUR", "CLM", "PEL"),
India = c("GIH", "PJL", "BEB", "STU", "ITU"))
colors <- sapply(levels(pop.code), function(x) {
for (i in 1:length(popgroup)) {
if (x %in% popgroup[[i]])
return(names(popgroup)[i])
}
NA
})
colors <- as.factor(colors)
legend.text <- sapply(levels(colors), function(x) paste(levels(pop.code)[colors==x], collapse=","))
legend.text
## African EastAsia European SouthAmerica
## "ASW,LWK,YRI" "CHB,CHS,JPT" "CEU,FIN,GBR,IBS,TSI" "CLM,MXL,PUR"
# make a data.frame
tab <- data.frame(sample.id = pca$sample.id,
EV1 = pca$eigenvect[,1], # the first eigenvector
EV2 = pca$eigenvect[,2], # the second eigenvector
Population = pop.code, stringsAsFactors = FALSE)
head(tab)
## sample.id EV1 EV2 Population
## 1 HG00096 0.01300336 0.03432677 GBR
## 2 HG00097 0.01194089 0.02977460 GBR
## 3 HG00099 0.01109778 0.02788650 GBR
## 4 HG00100 0.01563469 0.03185198 GBR
## 5 HG00101 0.01031886 0.02887336 GBR
## 6 HG00102 0.01510305 0.03243198 GBR
# draw
plot(pca, pch=20, cex=0.75, main="1KG Phase 1, chromosome 22", col=colors[tab$Population])
legend("topright", legend=legend.text, col=1:length(legend.text), pch=19, cex=0.75)
For \(n\) study individuals,
snpgdsIBS()
can be used to create a \(n \times n\) matrix of genome-wide average
IBS pairwise identities. To perform cluster analysis on the \(n \times n\) matrix of genome-wide IBS
pairwise distances, and determine the groups by a permutation score:
set.seed(1000)
ibs.hc <- snpgdsHCluster(snpgdsIBS(file, snp.id=snpset.id, num.thread=2))
## Identity-By-State (IBS) analysis on genotypes:
## Calculating allele counts/frequencies ...
## [..................................................] 0%, ETC: --- [==================================================] 100%, completed, 0s (process 1/2)
## # of selected variants: 2,014
## # of samples: 1,092
## # of SNVs: 2,014
## using 2 threads
## Sun Jan 12 19:07:49 2025 (internal increment: 65536)
## [..................................................] 0%, ETC: --- [==================================================] 100%, completed, 2s
## Sun Jan 12 19:07:51 2025 Done.
Here is the population information we have known:
# Determine groups of individuals by population information
rv <- snpgdsCutTree(ibs.hc, samp.group=as.factor(colors[pop.code]))
## Create 4 groups.
plot(rv$dendrogram, leaflab="none", main="1KG Phase 1, chromosome 22",
edgePar=list(col=rgb(0.5,0.5,0.5,0.75), t.col="black"))
legend("bottomleft", legend=legend.text, col=1:length(legend.text), pch=19, cex=0.75, ncol=4)
Fixation index (Fst) has been widely used to measure the genetic difference between populations, and the calculations of moment estimators are available in the SNPRelate package with all variants or a sliding window.
# sliding windows (window size: 500kb)
sw <- snpgdsSlidingWindow(file, winsize=500000, shift=100000,
FUN="snpgdsFst", as.is="numeric", population=pop.code)
## Sliding Window Analysis:
## Calculating allele counts/frequencies ...
## [..................................................] 0%, ETC: --- [==================================================] 100%, completed, 1s
## # of selected variants: 19,651
## Excluding 122 SNVs (monomorphic: TRUE, MAF: NaN, missing rate: NaN)
## # of samples: 1,092
## # of SNVs: 19,651
## using 1 thread
## window size: 500000, shift: 100000 (basepair)
## Method: Weir & Cockerham, 1984
## # of Populations: 14
## ASW (61), CEU (85), CHB (97), CHS (100), CLM (60), FIN (93), GBR (89), IBS (14), JPT (89), LWK (97), MXL (66), PUR (55), TSI (98), YRI (88)
## Chromosome Set: 22
## Sun Jan 12 19:07:54 2025, Chromosome 22 (19651 SNPs), 348 windows
## [..................................................] 0%, ETC: --- [==================================================] 100%, completed, 11s
## Sun Jan 12 19:08:05 2025 Done.
plot(sw$chr22.pos/1000, sw$chr22.val, xlab="genome coordinate (kb)", ylab="population-average Fst",
main="1KG Phase 1, chromosome 22")
abline(h=mean(sw$chr22.val), lty=3, col="red", lwd=2)
.
.
.
.
.
.
.
The GENESIS package offers methodology for estimating and accounting for population and pedigree structure in genetic analyses. The current implementation provides functions to perform PC-AiR and PC-Relate (Conomos, Miller, and Thornton 2015; Conomos et al. 2016). PC-AiR performs PCA on whole-genome genotypes taking into account known or cryptic relatedness in the study sample. PC-Relate uses ancestry representative principal components to estimate measures of recent genetic relatedness. In addition, GENESIS includes support for SeqArray files in mixed model association testing and aggregate tests of rare variants like burden and SKAT tests.
.
.
.
.
.
.
.
seqClose(file)
sessionInfo()
## R Under development (unstable) (2024-10-21 r87258)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.21-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_GB
## [4] LC_COLLATE=C LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C
## [10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] SNPRelate_1.41.0 VariantAnnotation_1.53.1 Rsamtools_2.23.1
## [4] Biostrings_2.75.3 XVector_0.47.2 SummarizedExperiment_1.37.0
## [7] Biobase_2.67.0 GenomicRanges_1.59.1 GenomeInfoDb_1.43.2
## [10] IRanges_2.41.2 S4Vectors_0.45.2 MatrixGenerics_1.19.1
## [13] matrixStats_1.5.0 BiocGenerics_0.53.3 generics_0.1.3
## [16] Rcpp_1.0.14 SeqArray_1.47.2 gdsfmt_1.43.1
##
## loaded via a namespace (and not attached):
## [1] KEGGREST_1.47.0 rjson_0.2.23 xfun_0.50 bslib_0.8.0
## [5] lattice_0.22-6 vctrs_0.6.5 tools_4.5.0 bitops_1.0-9
## [9] curl_6.1.0 parallel_4.5.0 AnnotationDbi_1.69.0 RSQLite_2.3.9
## [13] blob_1.2.4 Matrix_1.7-1 BSgenome_1.75.0 lifecycle_1.0.4
## [17] GenomeInfoDbData_1.2.13 compiler_4.5.0 codetools_0.2-20 htmltools_0.5.8.1
## [21] sass_0.4.9 RCurl_1.98-1.16 yaml_2.3.10 crayon_1.5.3
## [25] jquerylib_0.1.4 BiocParallel_1.41.0 DelayedArray_0.33.3 cachem_1.1.0
## [29] abind_1.4-8 digest_0.6.37 restfulr_0.0.15 fastmap_1.2.0
## [33] grid_4.5.0 cli_3.6.3 SparseArray_1.7.2 S4Arrays_1.7.1
## [37] GenomicFeatures_1.59.1 XML_3.99-0.18 UCSC.utils_1.3.0 bit64_4.5.2
## [41] rmarkdown_2.29 httr_1.4.7 bit_4.5.0.1 png_0.1-8
## [45] memoise_2.0.1 evaluate_1.0.3 knitr_1.49 BiocIO_1.17.1
## [49] rtracklayer_1.67.0 rlang_1.1.4 DBI_1.2.3 jsonlite_1.8.9
## [53] R6_2.5.1 GenomicAlignments_1.43.0 zlibbioc_1.53.0