--- title: "RaMWAS Parameters" date: "`r Sys.Date()`" output: BiocStyle::html_document: toc: true # table of content true vignette: > %\VignetteIndexEntry{6. RaMWAS parameters} %\VignetteEngine{knitr::rmarkdown} editor_options: chunk_output_type: console --- ```{r loadKnitr, echo=FALSE} # library("knitr") # opts_chunk$set(eval=FALSE) library(pander) panderOptions("digits", 3) ``` # Initializing RaMWAS parameters There are several ways to initialize the parameters for calling RaMWAS pipeline functions. The parameters can be stored in an R list like this: ```{r eval=FALSE} param = ramwasParameters( dirproject = ".", dirbam = "bams", filebamlist = "bam_list.txt", filecpgset = "Simulated_chromosome.rds", cputhreads = 2, scoretag = "MAPQ", minscore = 4, minfragmentsize = 50, maxfragmentsize = 250, filecovariates = "covariates.txt", modelcovariates = NULL, modeloutcome = "age", modelPCs = 0, toppvthreshold = 1e-5, cvnfolds = 10, mmalpha = 0, mmncpgs = c(5, 10, 50, 100, 500, 1000, 5000, 10000) ) ``` Alternatively, the parameters can be set in a separate R code file, which is processed into a list as above by `parametersFromFile` function. The R code file can contain lines like this: ```{r eval=FALSE} ### R parameter file dirbam = "/ramwas_project/bams/" dirproject = "/ramwas_project/" filebamlist = "/ramwas_project/000_list_of_files.txt" scoretag = "AS" minscore = 100 ### platform dependent part if(.Platform$OS.type == "windows"){ filecpgset="C:/RaMWAS/CpG_set/cpgset_hg19_SNPS_at_MAF_0.05.rds" } else { filecpgset="/computing_cluster/ramwas/cpgset_hg19_SNPS_at_MAF_0.05.rds" } ``` # Explanation of all parameters ## Parameters pointing to directories The project directory parameter is `dirproject`. Files specified by `file*` parameters are looked for here, unless they have the full path specified. By default `dirproject` is set to the current directory. The `dirbam` directory is the location where RaMWAS expects to find BAM files. If it is not an absolute path, it is considered to be relative to the `dirproject`. The `dirfilter` directory is, by default, the same as `dirproject`. All files created by RaMWAS are created within this directory. If the user wants to test different read filtering rules, they can set `dirfilter` to `TRUE`. This will set it to something like "Filter_MAPQ_4", there "MAPQ" is the BAM field used for filtering and "4" is the threshold. The `dirrbam` parameter is the location where RaMWAS saves RaMWAS raw data files (read start locations) after scanning BAMs. It is "rds_rbam" by default and is located in `dirfilter`. The `dirrqc` parameter is the location where RaMWAS saves QC files in R format after scanning BAMs. It is "rds_qc" by default and is located in `dirfilter`. The `dirqc` parameter is the location where RaMWAS saves QC plots and text files (BAM QC info) after scanning BAMs. It is "qc" by default and is located in `dirfilter`. The `dircoveragenorm` parameter is the sub-directory where RaMWAS saves coverage matrix at Step 3 of the pipeline. It is "coverage_norm_123" by default (123 is the number of samples) and is located in `dirfilter`. The `dirtemp` parameter is the directory where RaMWAS stores temporary files during construction of coverage matrix at Step 3 of the pipeline. It is "temp" by default and is located in `dircoveragenorm`. For better performance it can be set to a location on a different hard drive than `dircoveragenorm`. The `dirpca` parameter is the sub-directory where RaMWAS saves results of PCA analysis at Step 4 of the pipeline. It is "PCA_12_cvrts_0b0a0c" by default (12 is the number of covariates regressed out and 0b0a0c is a unique code to differentiate different sets of 12 covariates) and is located in `dircoveragenorm`. The `dirmwas` parameter is the sub-directory where RaMWAS saves results of MWAS analysis at Step 5. It is "Testing_age_7_PCs" by default (age is the phenotype being tested and 7 is number of top PCs included in the model) and is located in `dirpca`. The `dircv` parameter is the sub-directory where RaMWAS saves results of Methylation Risk Score analysis at Step 7. It is "CV_10_folds" by default (10 is number of folds in N-fold cross validation) and is located in `dirmwas`. ## Parameters pointing to files ### BAM names Parameter `filebamlist`, if defined, must point to a text file with one BAM file name per line. BAM file names can include path, relative to `dirbam` or absolute. Such file may looks like this. ``` batch1/b1sample1.bam batch1/b1sample2.bam batch2/b2sample1.bam batch2/b2sample2.bam batch2/b2sample3.bam batch4/sample4.bam ``` This file is then loaded into `bamnames` parameter, with ".bam" extension stripped. **Note:** BAM file names must be different. For example, the list of BAMS below is **NOT** allowed, as it contains "sample1.bam" twice: ``` batch1/sample1.bam batch1/sample2.bam batch2/sample1.bam ``` ### BAM to sample matching {#bam2sample} The `filebam2sample` parameter lets RaMWAS know the BAM to sample correspondence. It provides information on how BAMs from the same sample are to be combined. Each line in `filebam2sample` must have information for one sample. If sample1 contains reads from bam1, bam2 and bam3, the line should be > sample1=bam1,bam2,bam3 If the sample name matches the bam name, the line can simply contain that name > sample2 The `filebam2sample` file is scanned into `bam2sample` list. The elements of the list are bam names, and their names are sample names. For example: ```{r bam2sample} bam2sample = list( sample1 = c("bam1","bam2","bam3"), sample2 = "sample2" ) ``` ### CpG locations RaMWAS calculates CpG scores and performs further analyses at a set of CpGs (or locations in general) defined by the user via `filecpgset` parameter. The `filecpgset` parameter must point to an .rds file (a file saved using `saveRDS` function), with the set of locations stored as a `list` with one sorted vector of CpG locations per chromosome. ```{r CpGsetExample} cpgset = list( chr1 = c(12L, 57L, 123L), chr2 = c(45L, 95L, 99L, 111L), chr3 = c(22L, 40L, 199L, 211L) ) ``` In practice, the set should depend on the reference genome and can include CpGs created by common SNPs. Optionally, parameter `filenoncpgset`, can point to a file storing vetted locations away from any CpGs. For more on CpG sets see [the CpG set vignette](RW2_CpG_sets.html) ### File with covariates The parameter `filecovariates`, if defined, must point to a file containing phenotype information and covariates for the available samples. If the file has extension ".csv", it is assumed to be comma separated, otherwise - tab separated. It must have a header and the first column must have sample names as defined by `bam2sample` parameter ([see above](#bam2sample)). The data in `filecovariates` is read into the `covariates` parameter. ## Multithreading Many parts of RaMWAS are parallelized. The `cputhreads` parameter determines the maximum number of CPU intensive tasks running in parallel. By default `cputhreads` is set to the number of CPU cores. Some tasks are disk intensive. The maximum number of such tasks running in parallel is set by the `diskthreads` parameter. By default `diskthreads` value is 2. Higher values can be beneficial on machine with lots of RAM. On some systems the performance is better if different jobs are prevented from simultaneous access to files. To enforce this for filematrices set `usefilelock=TRUE`. ## Read filtering The reads are filtered by `scoretag` parameter, which is usually the "MAPQ" field or "AS" tag in the BAM file ([BAM file format](https://samtools.github.io/hts-specs/SAMv1.pdf)). The `minscore` parameter defines the minimum admissible score, reads with scores below that are excluded. If there the are more than `maxrepeats` read with the same start position, this excess is assumed to be the result of template preparation or amplification artifacts and count is reset to `maxrepeaets` (it is set to 3 by default). ## Coverage matrix The CpGs in CpG set defined by `filecpgset` are filtered based on their coverage. * A CpG must have average equal or greater than `minavgcpgcoverage` (default is 0.3). * A CpG must have at least `minnonzerosamples` proportion of samples with nonzero coverage\ (default is 0.3, i.e. a CpG is preserved if at least 30\% of samples have non-zero coverage). The file operations in this step can be performed faster if done in large blocks. To set the block size use `buffersize` parameter. Be default it is set to 1 GB (`buffersize = 1e9`). Numerical values take 8 bytes is stored with full precision. The coverage matrix does not need such precision and can safely be stored with 4 bytes per value (single precision). The value size is set by `doublesize` parameter, which is 4 by default. ## PCA and MWAS Both PCA and MWAS correct for variation explained by selected covariates set by `modelcovariates`. The `modelcovariates` parameter must name variables in `filecovariates`/`covariates`. By default, the tested linear model includes a constant. To exclude it, set `modelhasconstant` parameter to `FALSE`. MWAS tests for association of normalized CpG coverage with `modeloutcome`, accounting for variation of top `modelPCs` principal components. MWAS produces a QQ-plot in `dirmwas`. The title of the QQ-plot can be changed by the `qqplottitle` parameter. To exclude the title set `qqplottitle=""`. Top MWAS results are saved in a text file `Top_tests.txt`. Parameter `toppvthreshold` defines p-value threshold for selection of top results. Alternatively, it can define the number of top results, if it is set to a value larger than 1. ## Annotation of top findings {#biomart} The annotation is done using [`biomaRt`](https://bioconductor.org/packages/biomaRt/). package. The parameters include: * `bihost` -- BioMart host site. Default is `grch37.ensembl.org`. * `bimart` -- BioMart database name, see listMarts(). Default is `ENSEMBL_MART_ENSEMBL`. * `bidataset` -- BioMart data set, see listDatasets(). * `biattributes` -- are attributes of interest, see listAttributes(). Default is `c("hgnc_symbol","entrezgene","strand")`. * `bifilters` -- lists filters (if any), see listFilters(). * `biflank` -- indicates the maximum allowed distance from the CpG to the annotation element. Here is an example on how to select a custom biomart annotation track: ```{r marts, eval=FALSE} library(biomaRt) library(ramwas) # First pick a host. bihost = "grch37.ensembl.org" # First we list databases listOfMarts = listMarts(host = bihost) pander(head(listOfMarts, 10)) # Pick a database bimart = "ENSEMBL_MART_ENSEMBL" # Connect to the database mart = useMart(biomart = bimart, host = bihost) # List the data sets in the database listOfDatasets = listDatasets(mart = mart) pander(head(listOfDatasets, 10)) # Pisk a data set bidataset = "hsapiens_gene_ensembl" # Connect to the data set mart = useMart(biomart = bimart, dataset = bidataset, host = bihost) # List the attributes listOfAttributes = listAttributes(mart) pander(head(listOfAttributes, 10)) # Pick attributes biattributes = c("hgnc_symbol", "entrezgene", "strand") listOfFilters = listFilters(mart) pander(head(listOfFilters, 20)) # Pick a filter bifilters = list(with_hgnc_trans_name=TRUE) # Test a location chr = "chr1" pos = 15975530 param = ramwasParameters( bihost = bihost, bimart = bimart, bidataset = bidataset, biattributes = biattributes, bifilters = bifilters, biflank = 0) anno = ramwasAnnotateLocations(param, chr, pos) pander(anno) ``` ## Methylation risk score RaMWAS predicts the outcome variable (`modeloutcomes` parameter) using top `mmncpgs` CpGs from the MWAS. This prediction is done for each fold in k-fold cross validation and the prediction performance is measured via correlations and (for binary outcomes) ROC curves. To run the procedure for multiple number of top CpGs, The parameter `mmncpgs` can be set to a vector of multiple values. The elastic net mixing parameter alpha can be set via `mmalpha` parameter. The number of folds `cvnfolds` in the K-fold cross validation is 10 by default. The split into folds is random. The random seed can be set with the `randseed` parameter, which is set to `18090212` by default for consistency across runs. ### Choosing the number of folds `cvnfolds` in the cross validation When selecting the number of folds, K, in K-fold cross validation a researcher faces a trade off. On one hand, larger K allows the training set [of size approximately $\frac{K-1}{K}N$ to better match the size of the complete data set. On the other hand, the computational complexity of cross validation grows linearly with K. As a balance, $K = 5$ or $10$ is often chosen. The most extreme case of $K = N$ is known as the leave-one-out cross validation procedure. ## Joint analysis with genotype data The joint analysis of methylation and genotype data is described in the corresponding [vignette](RW4_SNPs.html). The SNP data must be stored in a filematrix with dimensions matching the CpG score matrix. Its name must be defined by `fileSNPs` parameter, with absolute path or relative to `dircoveragenorm`. The results of the joint analysis are stored in `dirSNPs` directory. By default, the directory is created within `dircoveragenorm` directory.