encodeSequences {DropletUtils}    R Documentation
Encode short nucleotide sequences into integers with a 2-bit encoding.
encodeSequences(sequences)
sequences    A character vector of short nucleotide sequences, e.g., UMIs or cell barcodes.
Each pair of bits encodes one nucleotide: 00 is A, 01 is C, 10 is G and 11 is T. The least significant bits hold the 3'-most nucleotide, and any unused high-order bits are set to zero. Thus, the sequence “CGGACT” is converted to the binary form:
01 10 10 00 01 11
... which corresponds to the integer 1671.
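The encoding described above can be reproduced in plain R as a rough check. This is only an illustrative sketch: the helper name encode_2bit is hypothetical, it is not how the package implements encodeSequences, and it assumes the input contains only upper-case A, C, G and T.

```r
# Hypothetical pure-R illustration of the 2-bit encoding (not the package's code).
encode_2bit <- function(sequences) {
    codes <- c(A = 0L, C = 1L, G = 2L, T = 3L)
    vapply(sequences, function(s) {
        bases <- strsplit(s, "", fixed = TRUE)[[1]]
        # Assumption: only A/C/G/T, and at most 15 nt so the result fits in 32 bits.
        stopifnot(length(bases) <= 15L, all(bases %in% names(codes)))
        # Walk from 5' to 3': shift the running value left by two bits
        # (multiply by 4), then add the code of the next base.
        Reduce(function(acc, b) acc * 4L + codes[[b]], bases, 0L)
    }, integer(1), USE.NAMES = FALSE)
}

encode_2bit("CGGACT")  # 1671
```

Because the 3'-most base is added last, it ends up in the two least significant bits, which matches the layout shown above.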
Because R uses 32-bit integers, no element of sequences can be more than 15 nt long.
Otherwise, integer overflow will occur.
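The 15 nt limit follows from simple arithmetic, which can be checked as below. The variable names are illustrative only; the values are computed as doubles so that the check itself does not overflow.

```r
# Largest encoded value for an n-nt sequence is 4^n - 1 (all bases T).
max_15 <- 4^15 - 1   # 1073741823, needs 30 bits
max_16 <- 4^16 - 1   # 4294967295, needs 32 bits

# R integers are signed 32-bit, so the largest representable value is 2^31 - 1.
max_15 <= .Machine$integer.max  # TRUE
max_16 <= .Machine$integer.max  # FALSE
```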
An integer vector containing the encoded sequences.
Aaron Lun
10X Genomics (2017). Molecule info. https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/output/molecule_info
encodeSequences("CGGACT")