get_genome_gtf {ORFik} | R Documentation |
This function automatically downloads (if the files do not already exist)
genomes and contaminants specified for genome alignment.
It creates an R transcript database (TxDb object) from the annotation
and also indexes the genome for you.
If you misspelled something or the run crashed, delete the wrong files and
run again.
Set remake = TRUE to redo everything from scratch.
get_genome_gtf(GTF, output.dir, organism, assembly_type, db, gunzip, genome, optimize = FALSE)
GTF |
logical, default: TRUE. Download the gtf of the organism specified
in the "organism" argument. If FALSE, check whether the downloaded
file already exists. If you want to use a custom gtf from your hard drive,
set GTF = FALSE,
and assign: |
output.dir |
directory to save downloaded data |
organism |
scientific name of the organism, e.g. Homo sapiens,
Danio rerio, Mus musculus, etc. See |
assembly_type |
a character string specifying the assembly type
of the genome to retrieve (ensembl only, otherwise this argument is ignored):
Default is
|
db |
database to use for genome and GTF. Recommended default: "ensembl" (remember to set assembly_type to "primary_assembly", otherwise the genome will contain haplotypes and be a very large file). Alternatives: "refseq" (primary assembly) and "genbank" (mix) |
gunzip |
logical, default TRUE. Uncompress downloaded files that are zipped; should be TRUE! |
genome |
character path, default NULL. Path to the fasta genome corresponding to the gtf. It must be indexed (a .fai file must exist there). Set this if you want to make sure that the chromosome naming of the GTF matches the genome and that the seqlengths are correct. If the value is NULL or FALSE, it is ignored. |
optimize |
logical, default FALSE. If TRUE, create a folder within the folder of the gtf that holds optimized objects, which speeds up loading of annotation regions from up to 15 seconds on the human genome down to 0.1 seconds. ORFik will then load these optimized objects instead. Currently optimizes filterTranscript() and loadRegion(). |
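The argument rules described above can be checked before any download is started. The following is a minimal sketch in base R and is not part of ORFik: `check_reference_args` is a hypothetical helper name, the accepted `db` values are taken only from the descriptions above, and the index is assumed to sit next to the fasta file as `<genome>.fai`.

```r
# Hypothetical helper (not an ORFik function): collect notes about
# argument combinations before calling get_genome_gtf().
check_reference_args <- function(db = "ensembl",
                                 assembly_type = "primary_assembly",
                                 genome = NULL) {
  # Databases named in the 'db' description; errors on anything else
  db <- match.arg(db, c("ensembl", "refseq", "genbank"))
  notes <- character(0)
  if (db == "ensembl" && assembly_type != "primary_assembly") {
    notes <- c(notes,
               "ensembl without primary_assembly may include haplotypes (very large file)")
  }
  if (db != "ensembl") {
    notes <- c(notes, "assembly_type is ignored for non-ensembl databases")
  }
  # NULL or FALSE means the genome argument is ignored, so only
  # character paths are checked. Assumption: index is '<genome>.fai'.
  if (is.character(genome) && !file.exists(paste0(genome, ".fai"))) {
    notes <- c(notes, "genome has no .fai index next to it")
  }
  notes
}

# Returns a character vector of notes, empty when nothing stands out
check_reference_args(db = "ensembl", assembly_type = "toplevel")
```

Returning notes instead of stopping keeps the helper usable in interactive sessions, where a warning about haplotypes may be acceptable.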
If you want a custom genome or gtf from your hard drive, assign it
after you run this function, like this:
annotation <- getGenomeAndAnnotation(GTF = FALSE, genome = FALSE)
annotation["genome"] = "path/to/genome.fasta"
annotation["gtf"] = "path/to/gtf.gtf"
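When paths are assigned by hand like this, it can help to confirm that the files exist before passing the vector on. A minimal base-R sketch, assuming the same "gtf" and "genome" names as above; `missing_reference_files` is a hypothetical helper, not an ORFik function, and the paths are placeholders.

```r
# Hypothetical helper: return the entries of a named path vector
# whose files are not found on disk.
missing_reference_files <- function(annotation) {
  stopifnot(is.character(annotation), !is.null(names(annotation)))
  annotation[!file.exists(annotation)]
}

# Placeholder paths, replace with your own files
annotation <- c(gtf = "path/to/gtf.gtf", genome = "path/to/genome.fasta")

# Any entry returned here points to a file that does not exist
missing_reference_files(annotation)
```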
A named character vector of paths to the downloaded genomes and gtf, plus additional contaminants if used. If merge_contaminants is TRUE, the individual contaminant fasta files are not returned, only the merged one.
Other STAR: STAR.align.folder(), STAR.align.single(), STAR.allsteps.multiQC(), STAR.index(), STAR.install(), STAR.multiQC(), STAR.remove.crashed.genome(), install.fastp()
## Get Saccharomyces cerevisiae genome and gtf (create txdb for R)
#getGenomeAndAnnotation("Saccharomyces cerevisiae", tempdir(), assembly_type = "toplevel")
## Get Danio rerio genome and gtf (create txdb for R)
#getGenomeAndAnnotation("Danio rerio", tempdir())
output.dir <- "/Bio_data/references/zebrafish"
## Get Danio rerio and Phix contaminants to deplete during alignment
#getGenomeAndAnnotation("Danio rerio", output.dir, phix = TRUE)
## Optimize for ORFik (speed up for large annotations like human or zebrafish)
#getGenomeAndAnnotation("Danio rerio", tempdir(), optimize = TRUE)
## How to save malformed refseq gffs:
## First run function and let it crash:
#annotation <- getGenomeAndAnnotation(organism = "Arabidopsis thaliana", output.dir = "~/Desktop/test_plant/",
#  assembly_type = "primary_assembly", db = "refseq")
## Then apply a fix (example for linux, too long rows):
# system("cat ~/Desktop/test_plant/Arabidopsis_thaliana_genomic_refseq.gff | awk '{ if (length($0) < 32768) print }' > ~/Desktop/test_plant/Arabidopsis_thaliana_genomic_refseq_trimmed.gff")
## Then update the arguments:
annotation <- c("~/Desktop/test_plant/Arabidopsis_thaliana_genomic_refseq_trimmed.gff",
  "~/Desktop/test_plant/Arabidopsis_thaliana_genomic_refseq.fna")
names(annotation) <- c("gtf", "genome")
# Make the txdb (for faster R use)
# makeTxdbFromGenome(annotation["gtf"], annotation["genome"], organism = "Arabidopsis thaliana")
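The awk step in the example above keeps only rows shorter than 32768 characters. On systems where awk is not available, roughly the same filter can be sketched in base R. This is an illustration and not part of ORFik: `trim_long_gff_rows` is a hypothetical helper, it reads the whole file into memory, and it counts bytes, which may differ from awk's character count in multibyte locales.

```r
# Hypothetical helper: copy gff_in to gff_out, dropping every line of
# max_bytes bytes or more (mirrors awk's `length($0) < 32768`).
trim_long_gff_rows <- function(gff_in, gff_out, max_bytes = 32768) {
  lines <- readLines(gff_in, warn = FALSE)
  # type = "bytes" avoids errors on invalid multibyte strings
  keep <- nchar(lines, type = "bytes") < max_bytes
  writeLines(lines[keep], gff_out)
  invisible(sum(!keep))  # number of rows dropped
}

# Usage with a small temporary file standing in for a real gff
gff_in <- tempfile(fileext = ".gff")
gff_out <- tempfile(fileext = ".gff")
writeLines(c("chr1\tsrc\tgene\t1\t100", strrep("A", 40000)), gff_in)
n_dropped <- trim_long_gff_rows(gff_in, gff_out)
```

The trimmed file can then be used as the "gtf" entry of the annotation vector, as in the example above.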