.

.

.

1 Overview

The SeqArray file format is built on top of the Genomic Data Structure (GDS) format (Zheng et al. 2012). GDS is a flexible and scalable data container with a hierarchical structure that can store multiple array-oriented data sets. It is intended for use with large-scale genomic datasets, especially for those which are much larger than the available main memory. GDS provides memory and performance efficient operations specifically designed for integers of less than 8 bits, since a diploid genotype usually occupies fewer bits than a byte. Data compression and decompression are available with relatively efficient random access. Data compression and decompression are available with relatively efficient random access. GDS is implemented using an optimized C++ library (CoreArray, http://corearray.sourceforge.net) and a high-level R interface is provided by the platform-independent R package gdsfmt (http://bioconductor.org/packages/gdsfmt). Figure 1 shows the relationship between a SeqArray file and the underlying infrastructure upon which it is built. The minimum data fields required in a SeqArray file are sample and variant identifiers, variant chromosome, position and allele values and the value of the variant itself. SeqArray is a Bioconductor package (Gentleman et al. 2004; R Core Team 2016) available at http://www.bioconductor.org/packages/SeqArray under the GNU General Public License v3.

.