In home maintenance, shims are little wedges of wood that you stick into wobbly entities to make them more stable.
We need things like this to deal with diverse data resources in genomics. Here’s an example of the problem:
##          eQTL    FP         HS
## chrom     chr   chr      chrom
## start snp_pos start chromStart
## end   snp_pos   end   chromEnd
The row names of this data frame are target annotation terms for features of GRanges: chrom, start, end. The columns are different assay types. Entry i,j is the source notation for feature i on assay type j. Thus for eQTL data both the start and the end are determined from the source attribute ‘snp_pos’, while for footprints (FP) the footprint end is denoted ‘end’ and for hotspots (HS) the hotspot end is denoted ‘chromEnd’.
To leave the data in its original state while simplifying
downstream integration, we use shims that
map attribute names to a common vocabulary.
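The shim idea can be sketched in base R using the mapping table shown above. The names `field_map` and `to_common` are hypothetical and chosen only for illustration; they are not functions of the package, and the two example records reuse the chr and snp_pos values from the eQTL output shown later in this document.

```r
# Mapping table from the text: rows are target terms, columns are assay types.
field_map <- data.frame(
  eQTL = c("chr", "snp_pos", "snp_pos"),
  FP   = c("chr", "start", "end"),
  HS   = c("chrom", "chromStart", "chromEnd"),
  row.names = c("chrom", "start", "end"),
  stringsAsFactors = FALSE
)

# Hypothetical shim: return the records in the common vocabulary
# (chrom, start, end) without modifying the source records.
to_common <- function(records, type, map = field_map) {
  src <- map[[type]]                       # source field names for this assay type
  out <- lapply(src, function(f) records[[f]])
  names(out) <- rownames(map)              # relabel with the target terms
  as.data.frame(out, stringsAsFactors = FALSE)
}

# Two eQTL-style records; the field names follow the eQTL column of the table.
eqtl <- data.frame(chr = c("17", "17"), snp_pos = c(36551, 36718),
                   stringsAsFactors = FALSE)
common <- to_common(eqtl, "eQTL")
print(common)
```

For eQTL records the single position supplies both start and end, so the two columns of the result are equal; the source data frame is left untouched.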
We use RaggedMongoExpt instances to work with the contents of
a remote MongoDB that holds large volumes of genomic annotation.
## class: RaggedMongoExpt
## dim: 2640 2640
## assays(0):
## rownames: NULL
## colnames(2640): Adipose_Subcutaneous_allpairs_v7_eQTL
##   Adipose_Visceral_Omentum_allpairs_v7_eQTL ...
##   iPS_19_11_DS15153_hg19_FP vHMEC_DS18406_hg19_FP
## colData names(6): base type ... type mid
rme0 holds a reference to a MongoDB database, coordinated
with the colData component. (The package includes
a unit test for correspondence between collection names in the
txregnet database and the colData element names.)
We’ll step back for a moment to give a sense of basic motivations. We want to use MongoDB to manage data about eQTL, DNase I hypersensitive regions and so forth, without curating the related file contents. Here’s an illustration of the basic functionality for eQTL:
##             gene_id       variant_id tss_distance ma_samples ma_count      maf
## 1 ENSG00000272636.1 17_36551_T_C_b37         5124        286      367 0.479112
## 2 ENSG00000272636.1 17_36718_A_G_b37         5291        246      305 0.398172
##   pval_nominal    slope  slope_se     qvalue chr snp_pos A1 A2 build
## 1  0.000892921 0.176774 0.0527027 0.06317724  17   36551  T  C   b37
## 2  0.001601720 0.170962 0.0537081 0.09719138  17   36718  A  G   b37
We’ll need different
q components for assays of different
types, because the internal notation used for chromosomes
differs between the assay types. Other aspects of diversified
annotation can emerge, and the shim concept helps deliver
a more unified interface to the user.
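To make the need for type-specific q components concrete, here is a minimal base-R sketch that writes a JSON query for one genomic interval in each assay type's notation. The helper `make_query` is hypothetical and is not what the package generates. The field names come from the mapping table above. The "chr" prefix for FP and HS chromosomes and the storage of positions as numbers are assumptions; `$gte` and `$lte` are standard MongoDB comparison operators.

```r
# Hypothetical helper: JSON query text selecting records that overlap
# [start, end] on chromosome `chrom` (given without a "chr" prefix).
make_query <- function(type, chrom, start, end) {
  switch(type,
    # eQTL: one position per record, chromosome stored without prefix
    eQTL = sprintf('{"chr":"%s","snp_pos":{"$gte":%.0f,"$lte":%.0f}}',
                   chrom, start, end),
    # FP: interval overlap means record start <= end and record end >= start
    # (assumed "chr" prefix on chromosome names)
    FP = sprintf('{"chr":"chr%s","start":{"$lte":%.0f},"end":{"$gte":%.0f}}',
                 chrom, end, start),
    # HS: same overlap logic with BED-style field names (assumed "chr" prefix)
    HS = sprintf('{"chrom":"chr%s","chromStart":{"$lte":%.0f},"chromEnd":{"$gte":%.0f}}',
                 chrom, end, start),
    stop("unknown assay type: ", type)
  )
}

q_eqtl <- make_query("eQTL", "17", 38.06e6, 38.15e6)
q_hs   <- make_query("HS",   "17", 38.06e6, 38.15e6)
cat(q_eqtl, q_hs, sep = "\n")
```

The same interval yields structurally different queries, which is the burden a shim is meant to hide from the user.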
At present, the main workhorse for retrieving assay
results from RaggedMongoExpt instances is
sbov, an approach to subsetting by overlaps with a GRanges query.
We’ll illustrate this with extractions from lung-related
eQTL, DNase hotspot, and digital genomic footprinting assays.
lname_eqtl = "Lung_allpairs_v7_eQTL"
lname_dhs = "ENCFF001SSA_hg19_HS" # see dnmeta
lname_fp = "fLung_DS14724_hg19_FP"
si17 = GenomeInfoDb::Seqinfo(genome="hg19")["chr17"]
si17n = si17
GenomeInfoDb::seqlevelsStyle(si17n) = "NCBI"
s1 = sbov(rme0[,lname_eqtl], GRanges("17", IRanges(38.06e6, 38.15e6), seqinfo=si17n))
In principle we could avoid the
manual seqlevelsStyle adjustment
by checking assay type within
sbov, but at the moment the
user must shoulder this responsibility.
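As a rough illustration of what such a check might look like, the base-R sketch below reads the assay type from the suffix of a collection name and chooses the chromosome label accordingly. Both helpers, `assay_type` and `chrom_label`, are hypothetical and not part of the package. The sketch assumes that every collection name ends in `_eQTL`, `_FP` or `_HS`, as the names shown in this document do, and that only eQTL collections use unprefixed chromosome names.

```r
# Hypothetical helper: the assay type is taken to be the text after the
# last underscore of the collection name.
assay_type <- function(name) sub(".*_", "", name)

# Hypothetical helper: pick the chromosome label a collection expects.
# Assumption: eQTL collections use "17", all others use "chr17".
chrom_label <- function(name, chrom = "17") {
  if (assay_type(name) == "eQTL") chrom else paste0("chr", chrom)
}

lnames <- c("Lung_allpairs_v7_eQTL", "ENCFF001SSA_hg19_HS", "fLung_DS14724_hg19_FP")
labels <- vapply(lnames, chrom_label, character(1))
print(labels)
```

A check of this kind inside the retrieval function would let the caller supply one query range regardless of assay type.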
To see more about how to work with
sbov outputs, check the main