1 Introduction

1.1 Why use ACE?

You want to know the percentage of tumor cells in your sample(s) and you have (preferably whole-genome) NGS data. You want pretty copy number profiles of your samples. You want to know how many copies are present of a certain chromosomal segment, or even gene, or mutation!

1.2 How ACE works

ACE is an absolute copy number estimator that scales copy number data to fit with integer copy numbers. For this it uses segmented data from the QDNAseq package, which in turn uses a number of dependencies. Note: make sure QDNAseq fetches the bin annotations from the same genome build as the one used for aligning the sequencing data!

On with ACE! In brief: ACE will run QDNAseq or use its output rds-file(s) of segmented data. It will subsequently run through all samples in the object(s), for which it will create individual subdirectories. For each sample, it will calculate how well the segments fit integer copy numbers (the relative error) for each percentage of “tumor cells” (cells with divergent segments). Note that it does not test percentages below 5. ACE will output a graph with relative errors (all errors relative to the largest error), which can be used to quickly identify the most likely fit. ACE selects all “minima” and saves the corresponding copy number plots. The “best fit” (lowest error) is not by definition the most likely fit!

ACE will run models for a general tumor ploidy of 2N, but you can expand this to include any ploidy of your choosing. The program needs to make one assumption: the median bin segment value corresponds with the tumor’s general ploidy. If the median bin segment value of a sample (the “standard”) lies on a segment that happens to be 3N, but you only ran ACE on 2N, you can run that sample individually using the singlemodel function with argument ploidy = 3 (see section “examining single samples”). If the standard happens to be on a subclonal segment, you either have to manually change the standard, or use the squaremodel function (again, see below).

The output of ACE is designed in such a way that it is “easy” to quickly analyze multiple samples. Bear in mind that it is absolutely necessary to manually select the most likely models! I have made a conscious decision not to let ACE “autopick” the best model. See below for instructions on how to pick the most likely models from your output as efficiently as possible. Let’s get started!
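To make the idea of a relative error per cellularity more concrete, here is a minimal base R sketch. It is not ACE's own code. The mixture formula, the mean absolute distance as error measure, the maximum of 12 copies and all names (expected_value, segments, and so on) are assumptions made for illustration; the segment values are simulated.

```r
# Toy sketch, NOT ACE's implementation.
# Assumption: a segment with n copies in tumor cells, mixed with normal 2N
# cells, has this expected value relative to the "standard".
expected_value <- function(n, cellularity, ploidy = 2, standard = 1) {
  standard * (cellularity * n + 2 * (1 - cellularity)) /
    (cellularity * ploidy + 2 * (1 - cellularity))
}

# Simulated segment values for a 2N tumor at 40% tumor cells
segments <- expected_value(c(1, 2, 2, 3, 2, 4), cellularity = 0.40)

# Test every percentage from 5 to 100, as described in the text
cellularities <- seq(5, 100) / 100
errors <- vapply(cellularities, function(cell) {
  # Expected values for 0 to 12 copies (the upper limit is an assumption)
  candidates <- expected_value(0:12, cell)
  # Mean distance of each segment to its nearest integer copy number level
  mean(vapply(segments, function(s) min(abs(s - candidates)), numeric(1)))
}, numeric(1))

# All errors relative to the largest error
relative_errors <- errors / max(errors)

# Cellularities at which the segments fit (numerically) perfectly
zero_error_fits <- cellularities[errors < 1e-8]
print(zero_error_fits)
```

In this toy data set both 0.20 and 0.40 fit perfectly, because every segment value at 40% can also be explained by a larger copy number change at 20%. This is one way to see why the lowest error alone does not tell you which model is the most likely.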

2 Running ACE

2.1 Installation

ACE is available through Bioconductor. In R, enter the commands:

if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
# BiocManager::install()
BiocManager::install("ACE")

The development version of ACE will be available on GitHub and can be installed as follows:

# install.packages("devtools")
devtools::install_github("tgac-vumc/ACE")

2.2 Getting started

The ACE package includes segmented data derived from low-coverage whole genome sequencing, which will be used throughout this vignette. The mapped sequencing data has been processed through the QDNAseq package. Users of ACE, however, are most likely to start from their own bam-files, not the pre-processed segmented data. runACE, the core functionality of ACE, will run a default set of QDNAseq functions if bam-files are provided as the data source (make sure the genome builds correspond, I cannot stress this enough). If you wish to run the code below, make sure the file paths are correct. To get started, I recommend using a directory that only contains a few bam-files. The function runACE is designed to automatically analyze all samples in a directory. Input should be either segmented QDNAseq-objects or aligned bam-files. For details on all arguments, consult the runACE documentation. Let’s get started!

# specify the directory containing your bam-files
userpath <- tempdir()
# if you do not want the output in the same directory, use argument outputdir
runACE(userpath, filetype='bam', binsizes = c(100, 1000), 
       ploidies = c(2,4), imagetype='png')

If you do not have aligned bam-files ready to go, you can use the data provided in the package:

data("copyNumbersSegmented")
# specify the directory in which to create the output
userpath <- tempdir()
# save the example object as an rds-file so runACE can pick it up
saveRDS(copyNumbersSegmented, file.path(userpath,"copyNumbersSegmented.rds"))
runACE(userpath, filetype='rds', ploidies = c(2,4), imagetype='png')

2.3 ACE output

2.3.1 rds-file

This is the segmented QDNAseq object; obviously not created when using rds-file as input. It can be used if you want to run ACE again with slightly different parameters. More importantly, you can use this file to examine individual samples in downstream analyses.

2.3.2 rds subdirectories

ACE creates a subdirectory for each rds-file. In case of bam-files as input, the subdirectories have the names of the binsizes.

2.3.3 ploidies subdirectories

For each analyzed tumor ploidy, ACE makes a subdirectory. In this case: 2N and 4N.

2.3.4 summaries

summary_errors: error lists of all the models

summary_likelyfits: copy number plots of the best fit and the last minimum of each sample, with the corresponding error list plots

I would recommend using the likelyfits for model selection. Summary files can become very large, depending on the number of samples and the bin size. See below how to deal with this.

2.3.5 likelyfits subdirectory

This subdirectory contains the individual copy number graphs of the likelyfits.

2.3.6 individual sample subdirectories

These subdirectories have a summary file with all the fits for the corresponding sample and the error list plot. Individual copy number graphs are available in the subdirectory “graphs”.

2.3.7 fitpicker tables

This tab-delimited file can be used during selection of most likely models. Especially handy when analyzing a large number of samples. More instructions below.

2.4 Model selection

Having this massive pile of output can be daunting. How do you make sense of it all? First of all, if you have multiple bin sizes, you generally only need a single bin size for your model selection. I recommend using a relatively large bin size: file sizes are smaller and segmentation is often more robust. You can probably find the corresponding model in the smaller bin sizes as well, if you prefer to use those for copy number graphs. Cellularities of corresponding models made with different bin sizes are usually very similar, but some fits may be “missed”.

The most likely fit of a tumor is generally 1) the fit with the smallest error, or 2) the fit at the highest cellularity, so those two are presented in the summary_likelyfits file. When you open this file, you see three plots in a row for each sample (with imagetype = ‘pdf’, each plot gets its own page). The first two of the three are copy number plots of the best fit (lowest relative error) and the last minimum, respectively. The third plot is the error list, which shows the relative error of the fit at each tested “cellularity” (tumor cell percentage). In most cases you will be able to pick a good fit from these graphs. The individual graph is then available in the likelyfits subdirectory.

If none of the models fit well, or you think there might be a better fit possible, you can look at all fits in the summary file, available in the subdirectory of that sample. If there are still no good fits (e.g. the tumor has a ploidy of 3N and you did not include this ploidy), you have to do model fitting on this sample separately (see below).
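The two candidates shown in summary_likelyfits, the best fit and the last minimum, can be picked from an error list roughly as in the sketch below. This is a simplified base R illustration, not the code ACE runs: the error values are invented, and only interior local minima are considered.

```r
# Invented error list: relative error at a handful of cellularities
cellularity <- c(0.30, 0.35, 0.40, 0.45, 0.50, 0.55,
                 0.60, 0.65, 0.70, 0.75, 0.80, 0.85)
rerror      <- c(0.60, 0.35, 0.20, 0.45, 0.70, 0.90,
                 1.00, 0.80, 0.50, 0.30, 0.25, 0.40)

# A point is a local minimum if it is lower than both of its neighbours
n <- length(rerror)
inner <- 2:(n - 1)
is_minimum <- c(FALSE,
                rerror[inner] < rerror[inner - 1] &
                  rerror[inner] < rerror[inner + 1],
                FALSE)
minima <- cellularity[is_minimum]

# "Best fit": the minimum with the lowest relative error
bestfit <- minima[which.min(rerror[is_minimum])]
# "Last minimum": the minimum at the highest cellularity
lastminimum <- tail(minima, 1)

print(minima)
print(c(bestfit = bestfit, lastminimum = lastminimum))
```

Here the best fit is at 0.40 and the last minimum at 0.80. Which of the two is the most likely model still has to be judged from the copy number plots.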

You might want to go through your samples a bit more systematically, especially if you have many samples. You can open the fitpicker.tsv file, go through the fits of the summary file and note in the likely_fit column of the picker table which fit you chose. Just leave open any samples of which you are not sure. The handy thing about the fitpicker files is that they have the list of your sample names and they have the list of cellularities.
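If you prefer to fill in the picker table from R rather than a spreadsheet program, it could look something like the sketch below. It builds a small stand-in file in a temporary directory, so it does not touch real output. Only the likely_fit column is named in the text above; the other column names (sample, fit_1, fit_2) and the file name are hypothetical, and some of the values are made up.

```r
# Hypothetical stand-in for a fitpicker file (layout is an assumption)
picker_file <- file.path(tempdir(), "fitpicker_demo.tsv")
toy <- data.frame(sample = c("sample1", "sample2"),
                  likely_fit = NA_real_,
                  fit_1 = c(0.79, 0.38),   # cellularities of candidate fits
                  fit_2 = c(0.41, 0.20))   # made-up values
write.table(toy, picker_file, sep = "\t", quote = FALSE, row.names = FALSE)

# Read the table and note the chosen fit for the samples you are sure about
picker <- read.delim(picker_file, stringsAsFactors = FALSE)
picker$likely_fit[picker$sample == "sample1"] <- 0.79

# Samples without a choice stay open (NA) and can be revisited later
undecided <- picker$sample[is.na(picker$likely_fit)]
print(undecided)

# Write the table back as a tab-delimited file, leaving open cells empty
write.table(picker, picker_file, sep = "\t", quote = FALSE,
            row.names = FALSE, na = "")
```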

Another feature that can help when going through a lot of samples is the “penalty” parameter in the runACE call. This parameter penalizes fits at lower cellularities. Doing so greatly improves the chance that the best fit is also the most likely fit, but it comes at the cost of precision at high cellularities and of sensitivity at low cellularities (though only at the lowest end of the spectrum, let’s say below 10%). More about this later.
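As a rough illustration of what such a penalty does, the sketch below assumes that each error is divided by the cellularity raised to the power of the penalty. That formula is an assumption here, so check the runACE documentation for the exact definition. The error values are invented.

```r
# Invented errors for three candidate fits
cellularities <- c(0.20, 0.40, 0.80)
errors <- c(0.010, 0.012, 0.050)

# Assumed form of the penalty: divide by cellularity^penalty
penalty <- 0.5
penalized <- errors / cellularities^penalty

# Without penalty the fit at the lowest cellularity wins by a small margin
best_unpenalized <- cellularities[which.min(errors)]
# With penalty the fit at 0.40 becomes the best fit
best_penalized <- cellularities[which.min(penalized)]

print(c(unpenalized = best_unpenalized, penalized = best_penalized))
```

With a penalty of 0 the errors are unchanged, so the larger the penalty, the stronger the push away from low cellularities.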

In the two examples given below, it is quite obvious what the most likely appropriate models are: for sample 1 it is the “lastminimum” with a general ploidy of 2N and cellularity of 0.79, and for sample 2 it is the “bestfit” with a general ploidy of 2N and cellularity of 0.38. The error plots of these samples show how obvious the right model can be to the trained eye, while it is deceptively tricky to rely on a simple computer algorithm to make the pick.

To reproduce some of the output created by runACE (which is only written to file), I have included the following code. The functions used will be explained in more detail later:

data("copyNumbersSegmented")
object <- copyNumbersSegmented
model1 <- singlemodel(object, QDNAseqobjectsample = 1)
bestfit1 <- model1$minima[tail(which(model1$rerror==min(model1$rerror)), 1)]
besterror1 <- min(model1$rerror)
lastfit1 <- tail(model1$minima, 1)
lasterror1 <- tail(model1$rerror, 1)
singleplot(object, QDNAseqobjectsample = 1, cellularity = bestfit1, 
           error = besterror1, standard = model1$standard, 
           title = "sample1 - binsize 1000 kbp - 379776 reads - 2N fit 1")

singleplot(object, QDNAseqobjectsample = 1, cellularity = lastfit1, 
           error = lasterror1, standard = model1$standard, 
           title = "sample1 - binsize 1000 kbp - 379776 reads - 2N fit 7")