BioC 2017: Where Software and Biology Connect

July 27-28, 2017 (Developer Day: July 26)
Dana-Farber Cancer Institute, Boston, MA

This conference highlights current developments within and beyond Bioconductor. Morning scientific talks and afternoon workshops provide conference participants with insights and tools required for the analysis and comprehension of high-throughput genomic data. ‘Developer Day’ precedes the main conference on July 26, providing developers and would-be developers an opportunity to gain insights into project direction and software development best practices.

To launch an Amazon Machine Image (AMI) for this conference:

Schedule

Developer Day

Wednesday, July 26

  • 9:00 - 9:40 Welcome

  • 9:40 - 10:00 Spotlight Talk

  • 10:00 - 10:30 Lightning talks I

  • 10:30 - 11:00 BREAK

  • 11:00 - 12:00 Birds-of-a-feather I / Workshops I

    • Birds-of-a-feather: Interfaces for programming with data: making Bioconductor software more analyst-friendly (Michael Lawrence) [Slides]
    • Introduction to new package development and submission (Lori Shepherd) [MakeAPackage, Vignette]
    • Cloud and other creative approaches to working with big data (Vincent Carey, Sean Davis) [BigData]
  • 12:00 - 1:00 Lunch

  • 1:00 - 1:30 Lightning talks II

    • Track I
      • Semantically rich containers for cloud scale genomics (Shweta Gopaulakrishnan, Samuela Pollock)
      • Multiomic regulatory network construction with txRegQuery (BJ Stubbs, Vince Carey) [Slides]
      • Extending Bioconductor To Exposome Data Analysis (Carles Hernandez-Ferrer) [Slides]
      • MIRA: an R Package for Inferring Regulatory Activity from DNA Methylation Data (John Lawson) [Slides]
      • curatedMetagenomicData: Human Microbiome Data Lightning Fast (Lucas Schiffer) [curated metagenomic data]
    • Track II
      • MEAL “2”: story of a software package (Carlos Ruiz-Arenas) [Slides]
      • Prototype meta-analysis demonstration for ImmuneSpaceR, using designated S4 objects (Dror Berel) [Slides]
      • Overview of a few new packages we developed recently including ATACseqQC and NADfinder (Lihua Julie Zhu) [Slides]
      • Shrinkage estimator of log fold change in differential expression analysis of RNA-Seq using “apeglm” (Anqi Zhu) [Slides]
      • R on Supercomputers (Pramod Gupta) [Slides]
  • 1:30 - 2:30 Birds-of-a-feather II / Workshops II

    • Birds-of-a-feather: Infrastructure for efficient storage and processing of large-scale single-cell genomics data (Davide Risso) [Slides, Notes, Discussion]
    • Git with the program: new Bioconductor version control (Nitesh Turaga) [Slides]
    • Birds-of-a-feather: Authoring modern vignettes and living papers with Rmarkdown, BiocStyle, and BiocWorkflowTools (Wolfgang Huber) [Notes]
    • Developer best practices (Martin Morgan, Kasper Hansen) [Unit Tests, Efficiency, Web Query, Coding Style, BiocCommon]
  • 2:30 - 3:00 Break

  • 3:00 - 3:30 Lightning talks III

    • Track I
      • Bioconductor Dockers / AMIs (Lori Shepherd) [Slides]
      • GOexpress: supervised classification and visualisation of (gene) expression data (Kevin Rue-Albrecht) [Slides]
      • Varying-Censoring Aware Matrix Factorization for Single-Cell RNA-Seq (Will Townes) [Slides]
      • bcbioSinglecell: Import and analyze bcbio single-cell RNA-seq data (Michael Steinbaugh) [Slides]
      • magrittr: Plumbing and Programming at the Same Time (Lucas Schiffer) [Slides, magrittr]
    • Track II
      • Single sample gene set analysis - Application to TCGA panCancer study (Aedin Culhane) [Slides]
      • Fun for all the family with BiocStickers (Matt Ritchie) [Slides]
      • ELMER, FunciVAR and more: Tools and pipelines for functional annotation and analysis (Dennis Hazelett) [Slides]
      • BiocFileCache: Local file management (Lori Shepherd) [Slides]
  • 3:30 - 4:30 Workshop III

  • 4:30 - 5:00 Panel discussion: project directions and opportunities

Main Conference

Thursday, July 27

  • 8:00 - 8:30. REGISTRATION

  • Invited Speakers and Community Presentations

    • 8:30 - 9:15 Christina Kendziorski (University of Wisconsin). Statistical methods for single-cell RNA-seq data. [Slides]
    • 9:15 - 10:00 Rahul Satija (New York Genome Center, NYU). Challenges and opportunities in analysis of single cell transcriptomics.
    • 10:00 - 10:30 Break
    • 10:30 - 11:15 James Lindsay (Dana-Farber Cancer Institute). Knowledge Systems at Dana Farber. [Slides]
    • 11:15 - 11:30 Stephanie Hicks (Dana-Farber Cancer Institute). Estimating cell type composition in whole blood using differentially methylated regions. [pdf Slides, Slide Links]
    • 11:30 - 11:45 Stephen Piccolo (Brigham Young University). Comprehensive benchmark of supervised-learning algorithms for predicting cancer states. [Slides]
    • 11:45 - 12:00 Nils Gehlenborg (Harvard Medical School). Relaxation Techniques for the Upset Data Scientist. [Slides]
  • 12:00 - 1:00 Lunch / Birds-of-a-feather

    • Infrastructure for efficient storage and processing of large-scale single-cell genomics data. Davide Risso (Division of Biostatistics and Epidemiology, Weill Cornell Medical College) et al. [Slides, Notes, Discussion]
    • Teaching / training using Bioconductor packages. Jenny Drnevich (University of Illinois), Radhika Khetani (Harvard T.H. Chan School of Public Health) [Notes]
  • 1:00 - 2:45 Workshops Session I

  • 3:15 - 5:00 Workshops Session II

    • Differential Gene Expression analysis using R / Bioconductor. Radhika Khetani (Harvard T. H. Chan School of Public Health) et al. [Workflow]
    • Interactive visualization and Data Analysis with epiviz web components. Jayaram Kancherla (University of Maryland, College Park) et al. [Github]
    • Functional enrichment analysis of high-throughput omics data in Bioconductor. Ludwig Geistlinger, Levi Waldron (CUNY School of Public Health) [Slides, Github]
    • Learn to leverage 70,000 human RNA-seq samples for your projects. Leonardo Collado Torres (Lieber Institute for Brain Development) [Slides, Github, Workshop, recount package]
  • 5:30 - 7:30 Contributed posters (Residence Inn Fenway)

Friday, July 28

  • 8:00 - 8:30. REGISTRATION

  • Invited Speakers and Community Presentations

    • 8:30 - 9:15 Elham Azizi (Memorial Sloan Kettering). Bayesian inference for single-cell clustering and imputing. [Slides]
    • 9:15 - 10:00 Martin Aryee (Harvard). Analyzing 3-dimensional genome topology data. [Slides]
    • 10:00 - 10:30 Break
    • 10:30 - 11:00 Davis McCarthy (EBI, Cambridge, UK). Using R and Bioconductor to explore genetic effects on single-cell gene expression. [Slides]
    • 11:00 - 11:30 Raphael Gottardo (Fred Hutchinson Cancer Research Center). Statistical methods for the analysis of flow cytometry data.
    • 11:30 - 11:45 Keegan Korthauer (Dana-Farber Cancer Institute). An R package for detection and inference of differentially methylated regions from bisulfite sequencing. [Slides]
    • 11:45 - 12:00 Jeff Gentry (Broad Institute). Cromwell & WDL: Bioinformatics pipelines at any scale. [Slides]
  • 12:00 - 1:00 Lunch / Birds-of-a-feather

    • Analyzing publicly available cancer genomics data from GTEx and TCGA with Bioconductor. Sonali Arora (Fred Hutchinson Cancer Research Center, Seattle) [Notes, Discussion]
  • 1:00 - 2:45 Workshops Session I

    • Ensemble gene set enrichment analysis with EGSEA. Matt Ritchie (Walter and Eliza Hall Institute of Medical Research) [Github, Workflow]
    • Bioconductor workflow for single-cell RNA-seq data analysis: dimensionality reduction, clustering, and pseudotime ordering. Fanny Perraudeau (Division of Biostatistics, School of Public Health, University of California, Berkeley, CA, USA) et al. [Github, Workshop, Slides]
    • Integrative analysis workshop with TCGAbiolinks and ELMER. Tiago Chedraoui Silva (University of São Paulo / Cedars-Sinai) et al. [Workshop Link, Github]
    • CyTOF workflow: differential discovery in high-throughput high-dimensional cytometry datasets. Malgorzata Nowicka (University of Zurich) et al. [PDF Slides, Github, Workflow]
  • 3:15 - 5:00 Workshops Session II

    • Understanding Bioconductor Annotation Packages. James MacDonald (University of Washington) [Github, Workshop]
    • Variant Annotation Workshop with FunciVAR, StateHub and MotifBreakR. Dennis J Hazelett (Cedars-Sinai Medical Center), Simon G Coetzee [Workshop Link]
    • Integrated analysis and visualization of ChIP-seq data using ChIPpeakAnno, GeneNetworkBuilder and TrackViewer. Jianhong Ou and Lihua Julie Zhu (University of Massachusetts Medical School) et al. [Slides, Github, Workflow, Step by Step Guide]
    • Multi-omics data representation and analysis with MultiAssayExperiment. Marcel Ramos (CUNY School of Public Health) et al. [Github, MAE Lab]
  • 5:00 - 7:00 Closing Reception

Birds-of-a-feather abstracts

  • Interfaces for programming with data: making Bioconductor software more analyst-friendly. Audience: package developers and users. Bioconductor packages have multiple types of users. Most use packages to analyze their data. Some of them access packages through GUIs like Shiny applications, but most access is programmatic. Developers are also users, in that they develop packages on top of existing packages. Thus, the API of a package has to serve two roles: data analysis and software integration. Mostly for the sake of integration, we formally define our data structures in specialized classes. The benefits of an integrated, semantically-rich platform justify the complexity of specialization, abstraction, polymorphism, etc. However, this forces the analyst to think more like an engineer and less like an analyst. Beyond APIs and GUIs, is there room for a third type of interface, a data programming interface (DPI)? How would that relate to the fluent APIs of the so-called tidyverse? Packages: S4Vectors, IRanges, GenomicRanges. Michael Lawrence (Genentech).

  • Teaching / training using Bioconductor packages. Audience: people who teach others to use BioC packages. Do you teach others to use Bioconductor packages, either in workshops or classes? If so, please come join us to network, share ideas and discuss issues related to teaching. Do you write all your own material, primarily use Bioconductor’s Course and Conference materials, or a mix of both? What topics/packages do you teach, to what depth, and who is the intended audience? How is your time for preparing and teaching compensated? We will have an online survey to collate everyone’s information that we will share amongst ourselves, and talk about publicly posting it on Bioconductor’s website. We can also have a short discussion about the feasibility of developing an Intro to Bioconductor module for Data Carpentry’s Genomics Workshop. http://bioconductor.org/help/ Jenny Drnevich (University of Illinois), Radhika Khetani (Harvard T.H. Chan School of Public Health).

  • Analyzing publicly available cancer genomics data from GTEx and TCGA with Bioconductor. Audience: researchers. This birds-of-a-feather will address the following: a) where to get GTEx and TCGA data (websites, Bioconductor packages, annotation, …); b) reading data into R; c) subsetting the data to what you want (e.g., a specific cancer type from a SummarizedExperiment); d) what questions can be asked now. Example: a common question is which genes are differentially expressed in normal prostate vs. prostate cancer samples. i) Are there any batch effects, since we are getting data from two sources? If so, correct for them. ii) Differential expression analysis of normal and cancer samples in prostate cancer. iii) Using the differentially expressed genes for GO enrichment analysis and pathway enrichment analysis with Bioconductor tools. Packages: getting GTEx/TCGA data, yarn / recount, TCGAbiolinks; annotations, AnnotationHub; DE analysis, DESeq2; enrichment analysis, clusterProfiler. Sonali Arora (Fred Hutchinson Cancer Research Center, Seattle).

  • Infrastructure for efficient storage and processing of large-scale single-cell genomics data. Audience: Developers and users of single-cell packages. Emerging high-throughput technologies in single-cell transcriptomics have allowed expression profiles to be rapidly generated for each of thousands of cells in a sample. This provides unprecedented resolution to investigate cellular heterogeneity within complex populations, for studying biological processes such as cell fate choice, immune activation and tumour diversity. Rigorous analysis of these data requires the application of appropriate statistical methodologies, many of which are available in packages from the Bioconductor project. However, the computational analyses are often complicated by a number of factors. The first is the suboptimal interoperability between packages that are currently available for single-cell RNA-seq data analysis, as each package defines its own S4 classes to be used for further processing. Another problem is the size of the data sets involved: even a simple experiment contains expression values for thousands of genes in thousands of cells. Finally, there is little support for multi-omics analyses of single-cell data, relevant to situations where multiple types of data (e.g., transcriptomics, genomics and methylation) are available for each cell. This birds-of-a-feather session will address these issues by (i) proposing a common S4 class for storing single-cell transcriptomics data, which extends existing Bioconductor classes with slots specific to single-cell studies; (ii) developing a C++ API for efficient handling of large single-cell data sets, using sparse and disk-backed matrices; and (iii) investigating avenues through which multi-omics data can be handled for integrative analyses. Packages: scater, scran, MAST, scone, monocle.
Davide Risso (Division of Biostatistics and Epidemiology, Weill Cornell Medical College), Aaron Lun (CRUK Cambridge Institute), Davis McCarthy (EMBL-EBI), Peter Hickey (Johns Hopkins University), Stephanie Hicks (Dana-Farber Cancer Institute, Harvard T.H. Chan School of Public Health), Andrew McDavid (University of Rochester Medical Center).

Workshop abstracts

Each entry includes title, intended audience, description, and contributors.

  • Introduction to R and Bioconductor. Beginner. The Introduction to R / Bioconductor workshop is designed as a brief overview of Bioconductor and some of the core packages. Depending on participants’ background and experience with R and Bioconductor, the workshop will touch upon basic R concepts, give an overview of what Bioconductor is, and briefly summarize some standard Bioconductor packages like Biostrings, GenomicRanges, GenomicAlignments, VariantAnnotation, and annotation resources like org, TxDb, and AnnotationHub. This workshop will give a taste of what Bioconductor has to offer. Lori Shepherd (Roswell Park Cancer Institute).

  • A Journey of Discovery through the GenomicRanges Infrastructure. Intermediate. We will introduce the fundamental concepts underlying the GenomicRanges package and related infrastructure. After a structured introduction, we will freely explore the infrastructure, from the central data structures of GRanges and SummarizedExperiment to the murky depths, driven by attendee questions and interests. Topics will include data import/export, computing and summarizing data on genomic features, overlap detection, integration with reference annotations, scaling strategies, and visualization. By the end of the journey, we hope to arrive at a more holistic understanding of the underpinnings of Bioconductor. Michael Lawrence (Genentech).

  • Microbiome Data Analysis. Intermediate. Bioconductor provides significant resources for microbiome data acquisition, analysis, and visualization. This workshop introduces ExperimentHub, a recent platform for cloud-based distribution of curated experimental data to the Bioconductor session, and curatedMetagenomicData, a resource providing uniformly processed taxonomic and metabolic function profiles for more than 5,000 whole metagenome shotgun sequencing samples from 26 publicly available studies, including the Human Microbiome Project, along with curated participant data. It demonstrates analysis of these data using the dada2, phyloseq, and metagenomeSeq packages for denoising, estimating differential abundance, alpha and beta diversities, ordinations, and other aspects of microbiome data analysis, and the metavizr package for browsing and visualizing microbiome profiles. Together, these packages provide easily linked components for data acquisition and flexible analysis of 16S rRNA and whole metagenome shotgun microbiome profiles. At the end of this workshop, users will be able to access publicly available metagenomic data and to perform common statistical analyses of these and other data in Bioconductor. Levi Waldron (CUNY School of Public Health), Susan Holmes (Stanford University), Paul J. McMurdie (University of Washington), Edoardo Pasolli (University of Trento), Joe Paulson (Dana-Farber Cancer Institute), Lucas Schiffer (CUNY School of Public Health), Justin Wagner (University of Maryland).

  • Interactive visualization and Data Analysis with epiviz web components. Intermediate. This workshop will go over how to include interactive visualizations of genomic data in R markdown pages using epiviz web components. Epiviz web components are a new addition to the epiviz framework to support visualization of genomic data across various platforms and applications. We will demonstrate how to use our components to visualize data using plots or track-based chart components, to visualize data from multiple regions at the same time on the same page using the navigation component, and to enable brushing across all the charts using the environment component. At the end of the workshop, users will be able to set up the libraries required to use epiviz web components, load data from R, and generate interactive HTML pages from R markdown with epiviz components. We will also introduce users to setting up and using the epiviz desktop application with the R / Bioconductor epivizr package. This workshop is for intermediate users who want to perform exploratory data analysis using existing Bioconductor infrastructure, quickly visualize genomic data, and share their visualizations with other users. Jayaram Kancherla (University of Maryland, College Park), Hector Corrada Bravo (UMD), Brian Gottfried (UMD).

  • Differential Gene Expression analysis using R / Bioconductor. Intermediate. We will cover basic steps in gene-level differential expression analysis starting with count data generated from an RNA-seq experiment. These analysis steps will be performed using Bioconductor packages, and will include exploratory analysis, visualization and differential expression testing. Radhika Khetani, Meeta Mistry, Mary Piper (Harvard T. H. Chan School of Public Health)

  • CyTOF workflow: differential discovery in high-throughput high-dimensional cytometry datasets. Intermediate. High dimensional mass and flow cytometry (HDCyto) experiments have become a method of choice for high throughput interrogation and characterization of cell populations. Here, we present an R-based pipeline for differential analyses of HDCyto data, largely based on Bioconductor packages. We computationally define cell populations using FlowSOM clustering, and facilitate an optional but reproducible strategy for manual merging of algorithm-generated clusters. Our workflow offers different analysis paths, including association of cell type abundance with a phenotype or changes in signaling markers within specific subpopulations, or differential analyses of aggregated signals. Importantly, the differential analyses we show are based on regression frameworks where the HDCyto data is the response; thus, we are able to model arbitrary experimental designs, such as those with batch effects, paired designs and so on. In particular, we apply generalized linear mixed models to analyses of cell population abundance or cell-population-specific analyses of signaling markers, allowing overdispersion in cell count or aggregated signals across samples to be appropriately modeled. To support the formal statistical analyses, we encourage exploratory data analysis at every step, including quality control (e.g. multi-dimensional scaling plots), reporting of clustering results (dimensionality reduction, heatmaps with dendrograms) and differential analyses (e.g. plots of aggregated signals). This workflow was recently published as a preprint in the F1000Research Bioconductor gateway. Expected outcomes: Following this workshop, participants will be able to perform complete R-based differential analyses of HDCyto (CyTOF and high-dimensional flow cytometry) data, making use of several Bioconductor packages. 
Key steps include pre-processing and transformation, clustering, dimensionality reduction, and differential testing. Throughout the pipeline, we emphasize exploratory data analysis and visualization techniques to facilitate quality control and interpretation. R code for the full workflow will be provided and demonstrated on an example data set during the workshop. This will ensure that, after the workshop, participants can adapt and extend the code and workflow to analyze their own experimental data sets. Interactive exercises: The workshop will contain optional interactive exercises to step through the workflow from the paper. Users will be asked to perform small tasks such as changes to some of the plots, transformation adjustment, dimensionality reduction e.g. principal component analysis (PCA), and clustering into varying number of groups. The exercises aim to show how robust the proposed methods are and how the different analysis strategies may affect the results. Prerequisites: Participants should have intermediate familiarity with R and Bioconductor. In particular, participants should know how to install and load Bioconductor packages; import and export data from saved files (e.g. in .csv, .txt, or .fcs format); and create plots (preferably using “ggplot2”). Participants should also have basic familiarity with HDCyto (CyTOF and/or high-dimensional flow cytometry) experiments and data analysis techniques (e.g. have previously analyzed a CyTOF or FACS data set using either gating or automated methods). Intermediate familiarity with statistical methods (especially mixed models) will also be useful, but is not required. Malgorzata Nowicka, Lukas M. Weber, Mark D. Robinson (University of Zurich).

  • Functional enrichment analysis of high-throughput omics data in Bioconductor. Intermediate. This workshop will give an overview of existing methods and implementations for enrichment analysis of functional gene groups such as gene sets, pathways, and networks. Participants will be introduced to the statistical theory of frequently used gene set testing methods, emphasizing underlying differences in hypotheses and individual limitations. Hands-on examples will be carried out using a selection of established Bioconductor packages shown to work well in practice. This workshop will equip participants with functionality for data preparation, preprocessing, differential expression analysis, set- and network-based enrichment analysis, along with visualization and exploration of results for gene expression data. Finally, the workshop will provide an outlook on current developments to extend gene set enrichment analysis for data derived from multiple high-throughput omics assays. Specific prior knowledge is not needed, but basic skills in data manipulation with R are beneficial. Ludwig Geistlinger, Levi Waldron (CUNY School of Public Health).

  • Learn to leverage 70,000 human RNA-seq samples for your projects. Beginner. The recount2 project re-processed RNA sequencing (RNA-seq) data on over 70,000 human RNA-seq samples spanning a variety of tissues, cell types and disease conditions. Researchers can easily access these data via the recount Bioconductor package, and can quickly import gene, exon, exon-exon junction and base-pair coverage data for uniformly processed data from the SRA, GTEx and TCGA projects in R for analysis. This workshop will cover different use cases of the recount package, including downloading and normalizing data, processing and cleaning relevant phenotype data, performing differential expression (DE) analyses, and creating reports for exploring the results using other Bioconductor packages. The workshop will also cover how to use the base-pair coverage data for annotation-agnostic DE analyses and for visualizing coverage data for features of interest. After taking this workshop, attendees will be ready to enhance their analyses by leveraging RNA-seq data from 70,000 human samples. Resources: recount2 website, recount package, paper. Topic: RNA-seq. Expected outcomes: Learn how to search projects, download data, explore the metadata, add more phenotype information, and prepare the data for a DE analysis. Then perform the DE analysis with DESeq2 and explore the results using regionReport. Participant prerequisites: basic familiarity with packages such as GenomicRanges and DESeq2. Functions from those packages used in the workshop will be briefly described. Leonardo Collado Torres (Lieber Institute for Brain Development).

  • Understanding Bioconductor Annotation Packages. Intermediate. There are various annotation packages provided by the Bioconductor project that can be used to incorporate additional information into results from high-throughput experiments. This can range from something as simple as mapping Ensembl IDs to corresponding HUGO gene symbols to much more complex queries involving multiple data sources. In this workshop we will cover the various classes of annotation packages, what they contain, and how to use them efficiently. Participants are expected to have some familiarity with R and Bioconductor. James MacDonald (University of Washington).

  • Bioconductor workflow for single-cell RNA-seq data analysis: dimensionality reduction, clustering, and pseudotime ordering. Intermediate. Single-cell RNA sequencing (scRNA-seq) is a powerful and promising class of high-throughput assays that enables researchers to measure genome-wide transcription levels for individual cells. To properly account for features specific to scRNA-seq, such as zero inflation and high levels of technical noise, several novel statistical methods have been developed. The aim of this workshop is to walk participants through a scRNA-seq data analysis workflow, from the raw gene-level count data to clustering and the inference of cell lineages. The workflow has three main steps, where, for each of the steps, we use a different R package (to be submitted or already on Bioconductor).

    First, we present Zero-Inflated Negative Binomial-based Wanted Variation Extraction (ZINB-WaVE), a general and flexible framework for normalization, dimensionality reduction, and differential expression analysis (R software package zinbwave). The method is based on a ZINB model that accounts specifically for zero inflation (dropouts), over-dispersion, and the count nature of the data. The inclusion of both known and unknown cell-level covariates in the model for the ZI probability and the NB mean allows for supervised and unsupervised normalization. Second, we present a resampling-based sequential ensemble clustering (RSEC) method (Bioconductor R package clusterExperiment) for identifying stable and tight cell clusters. The method aggregates multiple clustering results obtained from different base clustering algorithms and applications of a given algorithm to resampled versions of the learning set. Third, we demonstrate how to use the R software package slingshot to infer branching lineages and order cells by developmental progression. We connect the clusters identified by RSEC with a minimum spanning tree to learn the global lineage structure. Then we refine this structure and order cells using highly stable simultaneous principal curves to infer smooth, branching lineages.

    The workflow will be illustrated using data from a scRNA-seq study of stem cell differentiation in the mouse olfactory epithelium. Fanny Perraudeau (Division of Biostatistics, School of Public Health, University of California, Berkeley, CA, USA), Kelly Street (Division of Biostatistics, School of Public Health, University of California, Berkeley, CA, USA), Davide Risso (Division of Biostatistics and Epidemiology, Department of Healthcare Policy and Research, Weill Cornell Medicine, New York, NY, USA), Sandrine Dudoit (Division of Biostatistics, School of Public Health, University of California, Berkeley, CA, USA and Department of Statistics, University of California, Berkeley, CA, USA), Elizabeth Purdom (Department of Statistics, University of California, Berkeley, CA, USA).
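
    The third step above connects clusters with a minimum spanning tree. The sketch below illustrates that general idea only: it is plain Python using Prim's algorithm, not the slingshot R package's implementation, and the centroid coordinates, the Euclidean distance, and the function name are assumptions made for illustration.

```python
import math


def minimum_spanning_tree(centroids):
    """Return MST edges (i, j, distance) over cluster centroids (Prim's algorithm)."""
    n = len(centroids)
    if n == 0:
        return []
    in_tree = {0}  # start growing the tree from cluster 0
    edges = []
    while len(in_tree) < n:
        best = None
        # Find the shortest edge linking a cluster in the tree to one outside it.
        for i in in_tree:
            for j in range(n):
                if j in in_tree:
                    continue
                d = math.dist(centroids[i], centroids[j])
                if best is None or d < best[2]:
                    best = (i, j, d)
        edges.append(best)
        in_tree.add(best[1])
    return edges


# Hypothetical 2-D centroids for four clusters (illustrative values only).
centroids = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (1.0, 1.5)]
tree = minimum_spanning_tree(centroids)
for i, j, d in tree:
    print(f"cluster {i} -- cluster {j} (distance {d:.2f})")
```

    In this toy layout cluster 1 ends up linked to three neighbours, which is the kind of branch point a lineage-inference method would then refine.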

  • Integrative analysis Workshop with TCGAbiolinks and ELMER. Intermediate. The Cancer Genome Atlas (TCGA), The Encyclopedia of DNA Elements (ENCODE), The NIH Roadmap Epigenomics Mapping Consortium (Roadmap), and other organized international consortia have led the explosion of sequencing-based biological data and thereby have provided unprecedented access to the largest publicly available genomic, transcriptomic and epigenomic data to date. These projects have provided amazing opportunities for researchers to interrogate the epigenome of cultured cancer cell lines, normal and tumor fresh tissues with high genomic resolution. However, the use of such data in analyses involves the arduous task of searching, downloading and processing them in a reproducible manner. Furthermore, most bioinformatics tools are designed for specific data types (e.g. expression, epigenetics, genomics), which provide only a partial view of the biological process that takes place. Performing an integrated analysis of molecular datasets along with clinical information has been shown to improve the prognostic and predictive accuracy for cancer phenotypes compared to clinical features alone. This workshop will focus on helping researchers to perform an integrative analysis of both molecular and clinical data available through the TCGA by harnessing open source packages within the Bioconductor platform. Participants will learn to search and download DNA methylation (epigenetic) and gene expression (transcription) data from the newly created NCI Genomic Data Commons (GDC) portal and prepare them into a SummarizedExperiment object. We will introduce the workflow using our recently developed TCGAbiolinks and, if time permits, we will highlight the graphical user interface version (TCGAbiolinksGUI).
Another Bioconductor package, ELMER, will also be introduced. It allows one to identify DNA methylation changes in distal regulatory regions and correlate these signatures with expression of nearby genes to identify transcriptional targets associated with cancer. For these distal probes correlated with a gene, a transcription factor motif analysis is performed, followed by expression analysis of transcription factors to infer upstream regulators. We expect that participants of this workshop will understand the integrative analysis performed by using TCGAbiolinks + ELMER, as well as be able to execute it from the data acquisition process to the final interpretation of the results. The workshop assumes users with an intermediate level of familiarity with R, and basic understanding of tumor biology. Tiago Chedraoui Silva (University of São Paulo / Cedars-Sinai), Houtan Noushmehr (University of São Paulo / Henry Ford Health System), Benjamin Berman (Cedars-Sinai).

  • Design and evaluate guide RNAs for CRISPR-Cas9 genome-editing using CRISPRseek and GUIDEseq. Beginner. The most recently developed genome editing system, CRISPR-Cas9, has greater inherent flexibility than prior programmable nuclease platforms. Because of its simplicity and efficacy, this technology is revolutionizing biological studies and holds tremendous promise for therapeutic applications. However, imperfect cleavage specificity of CRISPR-Cas9 nuclease within the genome is a cause for concern for its therapeutic application. To facilitate the adoption and improvement of this technology, we have developed CRISPRseek for designing target-specific gRNAs, and GUIDEseq for identifying genome-wide off-target sites from GUIDE-seq experiments to assess the precision of engineered CRISPR-Cas9 nucleases. In this workshop, I will give an introduction to CRISPR genome editing and GUIDE-seq technology, followed by a practical hands-on session using CRISPRseek and GUIDEseq. By the end of the workshop, the participants should be able to design target-specific gRNAs for various Cas9 nucleases and genomes using CRISPRseek, and analyze GUIDE-seq data using GUIDEseq. Lihua Julie Zhu (University of Massachusetts Medical School).

  • Ensemble gene set enrichment analysis with EGSEA. Beginner. Many tools exist for gene set testing, and choosing the best method for a given RNA-sequencing/microarray data set is a challenge. The EGSEA software overcomes this problem by performing gene set testing using an ensemble approach that combines results from many different algorithms. EGSEA uses the voom method to process RNA-sequencing data and, where applicable, a linear model analysis with pair-wise comparisons of interest is applied. It then combines the results obtained from the many methods available in the Bioconductor project, which include camera, roast, fry, safe, gage, padog, plage, zscore, gsva, ssgsea, globaltest and ora, to get a consensus. This workshop will provide an overview of the functionality of the EGSEA package via a workflow that demonstrates how EGSEA can be applied to both RNA-sequencing and microarray data profiling mouse mammary gland epithelial cell populations. It will introduce the large collection of gene signatures EGSEA can easily test and explore the detailed reporting options for gene set analysis results EGSEA provides via an html report that can be shared with collaborators. The biological relevance of the results obtained using EGSEA’s ensemble approach will also be highlighted. Matt Ritchie (Walter and Eliza Hall Institute of Medical Research).

  • Variant Annotation Workshop with FunciVAR, StateHub and MotifBreakR. Intermediate. Variant annotation is a critical step in evaluating the potential function of variants identified from whole-genome sequencing studies, GWAS and other next-generation sequencing technologies. In the proposed workshop, we will survey and review the capabilities of a suite of tools from the Bioconductor universe. During the workshop, participants will learn how to identify public datasets from ROADMAP, IHEC, Blueprint or ENCODE, download them and produce integrated chromatin state annotations using StateHub and StatePaintR. Participants will also be exposed to creating their own annotation models using the StateHub resource. Then, we will use the FunciVAR package to integrate a set of biologically interesting variants and assess them for biological enrichment, including some discussion of how to pick an appropriate background distribution. Finally, participants will use MotifBreakR to predict potential motif disruptions with a set of variants. Participants need basic knowledge of R and Bioconductor data structures, and some working knowledge of ChIP-seq or general NGS experiments and data, as these topics will not be covered. Dennis J Hazelett (Cedars-Sinai Medical Center), Simon G Coetzee.

  • Integrated analysis and visualization of ChIP-seq data using ChIPpeakAnno, GeneNetworkBuilder and TrackViewer. Intermediate. Chromatin immuno-precipitation followed by DNA sequencing (ChIP-seq) has become the most prevalent high throughput technology for the genome-wide identification of the binding sites of transcription factors and histone modifications. In this workshop, attendees will gain knowledge and hands-on experience analyzing ChIP-seq datasets using several Bioconductor packages such as ChIPpeakAnno, GeneNetworkBuilder and TrackViewer. Participants are welcome to analyze their own dataset or a published ChIP-seq data set. Participants must know how to use R / Bioconductor, including how to install R and Bioconductor packages, and should be familiar with the format of deep-sequencing files, such as fastq, sam and bam. Jianhong Ou, Lihua Julie Zhu (University of Massachusetts Medical School).

  • Multi-omics data representation and analysis with MultiAssayExperiment. Intermediate. The MultiAssayExperiment software package provides for the coordinated representation of, storage of, and operation on multiple diverse genomics data. It integrates an open-ended set of single-assay data classes, while abstracting the complexity of back-end data objects through a sufficient set of data manipulation, extraction, and re-shaping methods to interface with most R / Bioconductor data analysis and visualization tools. This workshop will introduce the data class and essential operations on it, then will lead users through a complete workflow including construction and several statistical analyses of a multi-omics dataset. Users should have some familiarity with basic Bioconductor data structures such as SummarizedExperiment. Marcel Ramos (CUNY School of Public Health), Vince Carey (Dana Farber), and Levi Waldron (CUNY School of Public Health).

Contributed poster descriptions

  • Anqi Zhu, University of North Carolina Chapel Hill; Joseph Ibrahim, University of North Carolina Chapel Hill; Michael Love, University of North Carolina Chapel Hill – An Empirical Bayes Approach for Differential Expression Analysis of RNA-Seq Data In RNA-seq differential expression analysis, investigators aim to detect those genes with changes in expression level across different experimental conditions, despite technical and biological variability in the observations. A fundamental challenge is to accurately estimate the effect size, often in terms of a logarithmic fold change (LFC) across conditions. When the counts of sequenced reads are small in either or both conditions, the estimated LFC has high variance, leading to some high estimated LFCs, which do not represent true differences in expression. Current methods introduce arbitrary filtering thresholds and pseudocounts to exclude or moderate the estimated LFC from genes that have small read counts. These methods may result in loss of genes from the analysis with true differences across conditions. Here, we propose an empirical Bayes procedure with a wide-tailed prior on effect sizes, which avoids defining arbitrary filter thresholds or pseudocounts. We show that our new estimator for LFC is efficient to calculate and has lower bias than previously proposed shrinkage estimators, while still reducing variance for those genes with little statistical information.

  • Lauren Blake, University of Chicago; Samantha M. Thomas, University of Chicago; John D. Blischak, University of Chicago; Chiaowen Joyce Hsiao, University of Chicago; Claudia Chavarria, University of Chicago; Marsha Myrthil, University of Chicago; Yoav Gilad, University of Chicago; Bryan J. Pavlovic, University of Chicago – A comparative study of endoderm differentiation in humans and chimpanzees There is substantial interest in the evolutionary forces that shaped the regulatory framework that is established in early human development. Progress in this area has been slow because it is difficult to obtain relevant biological samples. Induced pluripotent stem cells (iPSCs) provide the ability to establish in vitro models of early human and non-human primate developmental stages. Using matched iPSC panels from humans and chimpanzees, we comparatively characterized gene regulatory changes through a four-day timecourse differentiation of iPSCs (day 0) into primitive streak (day 1), endoderm progenitors (day 2), and definitive endoderm (day 3). As might be expected, we found that differentiation stage is the major driver of variation in gene expression levels, followed by species. Using the Bioconductor packages edgeR and limma (Robinson et al. 2010, Ritchie et al. 2015), we identified thousands of differentially expressed genes between humans and chimpanzees in each differentiation stage. Yet, when we utilized the R/Bioconductor package Cormotif (Wei et al. 2015) to consider gene-specific dynamic regulatory trajectories throughout the timecourse, we found that 75% of genes, including nearly all known endoderm developmental markers, have similar trajectories in the two species. Interestingly, we observed a marked reduction of both intra- and inter-species variation in gene expression levels in primitive streak samples compared to the iPSCs, with a recovery of regulatory variation in endoderm progenitors.
The reduction of variation in gene expression levels at a specific developmental stage, paired with an overall high degree of conservation of temporal gene regulation, is consistent with the dynamics of developmental canalization. Overall, we conclude that endoderm development in iPSC-based models is highly conserved and canalized between humans and our closest evolutionary relative.

  • James Ashmore, MRC Centre for Regenerative Medicine; Dr Luca Tosti, MRC Centre for Regenerative Medicine; Dr Nicholas Tan, MRC Centre for Regenerative Medicine; Dr Simon Tomlinson, MRC Centre for Regenerative Medicine; Prof Keisuke Kaji, MRC Centre for Regenerative Medicine – Mapping transcription factor occupancy using minimal numbers of cells in vitro and in vivo The identification of transcription factor (TF) binding sites in the genome is critical to understanding gene regulatory networks (GRNs). While ChIP-seq is commonly used to identify TF targets, it requires specific ChIP-grade antibodies and high cell numbers, often limiting its applicability. DNA adenine methyltransferase identification (DamID), developed and widely used in Drosophila, is a distinct technology to investigate protein-DNA interactions. Unlike ChIP-seq, it does not require antibodies, precipitation steps or chemical protein-DNA cross-linking. Here we describe an optimised DamID-seq protocol with an accompanying data analysis package, and demonstrate the identification of OCT4 binding sites in as few as 1,000 embryonic stem cells (ESCs). Furthermore, we have applied this technique in vivo for the first time in mammals, and have successfully identified multiple OCT4 binding sites in the gastrulating mouse embryo at 7.5 days post coitum (dpc).

  • Nan Xiao, Seven Bridges Genomics; Teng-Fei Yin, Seven Bridges Genomics; Miao-Zhu Li, Duke University – DockFlow: Bioconductor Workflow Containerization and Orchestration with liftr We have accumulated numerous excellent software packages for analyzing large-scale biomedical data on the way to delivering on the promise of human genomics. Bioconductor workflows illustrated the feasibility of organizing and demonstrating such software collections in a reproducible and human-readable way. Going forward, implementing fully automatic workflow execution and persistently reproducible report compilation on an industrial scale becomes challenging from the engineering perspective. For example, the software tools across workflows usually require drastically different system dependencies and execution environments and thus need to be isolated completely. As one of the first efforts exploring the possibility of bioinformatics workflow containerization and orchestration using Docker, the DockFlow project aims to containerize every single existing Bioconductor workflow in a clean, smooth, and scalable way. We will show that with the help of our R package liftr, it is possible to achieve the goal of persistent reproducible workflow containerization by simply creating and managing a YAML configuration file for each workflow. We will also share our experience and the pitfalls encountered during such containerization efforts, which may offer some best practices and valuable references for creating similar bioinformatics workflows in the future. The DockFlow project website: https://dockflow.org.

  • Chiaowen Joyce Hsiao, University of Chicago; Matthew Stephens, Department of Statistics and Department of Human Genetics, University of Chicago; Kushal K Dey, Department of Statistics, University of Chicago – Visualizing the structure of RNA-seq expression data using grade of membership models Grade of membership models, also known as “admixture models”, “topic models” or “Latent Dirichlet Allocation”, are a generalization of cluster models that allow each sample to have membership in multiple clusters. These models are widely used in population genetics to model admixed individuals who have ancestry from multiple “populations”, and in natural language processing to model documents having words from multiple “topics”. Here we illustrate the potential for these models to cluster samples of RNA-seq gene expression data, measured on either bulk samples or single cells. We also provide methods to help interpret the clusters, by identifying genes that are distinctively expressed in each cluster. By applying these methods to several example RNA-seq applications we demonstrate their utility in identifying and summarizing structure and heterogeneity. Applied to data from the GTEx project on 53 human tissues, the approach highlights similarities among biologically-related tissues and identifies distinctively-expressed genes that recapitulate known biology. Applied to single-cell expression data from mouse preimplantation embryos, the approach highlights both discrete and continuous variation through early embryonic development stages, and highlights genes involved in a variety of relevant processes—from germ cell development, through compaction and morula formation, to the formation of inner cell mass and trophoblast at the blastocyst stage. The methods are implemented in the Bioconductor package CountClust.

  • Lihua Julie Zhu, UMass Medical School; Anastassiia Vertii, UMass Medical School; Jianhong Ou, UMass Medical School; Timothy D. Matheson, UMass Medical School; Jun Yu, UMass Medical School; Paul Kaufman, UMass Medical School – Genomic and bioinformatic analyses of mouse Nucleolar Associated Domains In interphase eukaryotic cells, heterochromatin is largely partitioned between the nucleolar periphery and regions adjacent to the nuclear lamina, thus defining Nucleolus-Associated Domains (NADs) and Lamina-Associated Domains (LADs). Previous studies indicate that human cell LADs and NADs are largely interchangeable because heterochromatin is generally stochastically localized in each daughter cell after mitosis. Here, we identified NADs in mouse embryonic fibroblasts via deep sequencing of isolated nucleoli, finding similar results in samples from crosslinked and non-crosslinked cells, and we developed a Bioconductor package NADfinder for bioinformatic analysis of the large NAD peaks (~0.3 Mb on average). Our analyses suggest that murine nucleoli associate with two different classes of peaks, which resemble facultative or constitutive heterochromatin. That is, these classes differ in their replication time, enrichment of H3K9me3 and H3K27me3, and overlap with LADs. Examples of nucleolar associations with both classes of NADs were confirmed using single cell fluorescent in situ hybridization experiments. These data are surprising given that human cell NADs are more heavily weighted towards LAD-associated heterochromatin.

  • Xengie Doan, Stowers Institute for Medical Research; Jennifer Gerton, Institute for Medical Research; Karen Miga, University of California, Santa Cruz – Centromeric satellite DNA changes during cancer Centromeres play critical roles in cells but have largely been omitted from genomic cancer studies due to the difficulty of mapping long arrays of highly similar tandem repeats, or the satellite DNA that comprises centromeres. Centromeres recruit the kinetochore, the binding site for microtubules, to facilitate chromosome segregation. Defects in centromere function can lead to genetic instability and aneuploidy. Missegregation of chromosomes is common in many cancers. However, whether changes in satellite DNA affect centromere function in cancer is largely unknown. In this study I focus on alpha satellites (AS) and higher order repeats (HORs), which make up the core of human centromeres. Centromeric AS are highly homogeneous and are primarily localized in the centromere as tandemly repeated multi-AS units, or HOR units. These HOR units are themselves repeated in multimegabase arrays, which are more unique across chromosomes. Since tandem repeats have a high mutation rate and can be targeted by transposable elements (TEs), we ask two questions using computational approaches: 1) are HOR units expanding or contracting during stress such as cancer; and 2) are TEs altering AS units with mobile element insertions. Using NCBI dbGaP whole genome sequence tumor and normal paired samples from esophageal cancer individuals, I count subsets of HOR units to characterize gain or loss of repeats in the centromere. Also, I use mobile element locator tools to detect structural variants in AS repeats. Initial results suggest HOR copy number variation in cancer and transposable element insertion presence in some centromeres. Methods and analytical challenges will also be presented.

  • Azfar Basunia, SUNY Upstate/DFCI; Aedin Culhane, DFCI; Meng Chen, Technical University of Munich – Using a multi-omics latent variable gene set and bi-clustering approach to discover immune subtypes in the TCGA PanCancer data Objectives: To discover latent immune molecular subtypes across The Cancer Genome Atlas tumors using a multi-omics computational framework. Methods and Materials: A MultiAssayExperiment (MAE) R object was used to assemble, store and manage publicly available genomics datasets from The Cancer Genome Atlas (TCGA). These data included RNASeq, RPPA, gistica and gistict for 6469 patients from 30 tumor types. Data were integrated using multiple factor analysis (moGSA R package) and 77 principal components (PCs) representing 70% variance coverage were extracted. Gene set tables representing immune (Bindea, c7-MSigDB), oncogenic (c6-MSigDB) and curated (c2-MSigDB) pathways were projected onto PCs to generate pathway scores in each tumor. The pathway by tumor p-value table was discretized and iterative binary bi-clustering (iBBiG) was applied (10 iterations) to discover robust bi-clusters that spanned cancer types. These bi-clusters were evaluated for mutational load (ML), leukocyte fraction (LF) and survival. Results: 16 ensemble bi-clusters with multiple tumor memberships were generated. Bi-clusters were divided into 5 groups – strong LF (6), strong ML-weak LF (2), weak ML (2), weak LF (2) and broad (4). Conclusions: The MAE-moGSA-iBBiG framework can be used to uncover and classify latent immune subtypes across tumor types. Impact and Significance: Immune subtypes can illuminate differences in tumor-immune response across tumor types, which can be therapeutically targeted.

  • Will Townes, Harvard Biostatistics; Stephanie Hicks, Dana Farber; Martin Aryee, MGH; Rafael Irizarry, Dana Farber – Varying-Censoring Aware Matrix Factorization for Single-Cell RNA-Seq Single cell RNA-Seq (scRNA-Seq) measures gene expression across individual cells, potentially facilitating identification of novel cell types. Typically, this is accomplished by dimensionality reduction from thousands of genes to a small number of factors, followed by unsupervised clustering. The scRNA-Seq data are characterized by large numbers of zeros that occur preferentially for lowly expressed genes (censoring). The rate of censoring often varies across cells and experimental platforms and may be due to technical batch effects rather than biology. This variable censoring distorts factors inferred by currently used methods such as Principal Components Analysis (PCA), t-distributed Stochastic Neighbor Embedding (t-SNE) and Zero Inflated Factor Analysis (ZIFA). Here, we present Varying-Censoring Aware Matrix Factorization (VAMF), a novel method which separates variability due to censoring from the inferred latent factors. VAMF removes batch effects in a real dataset without using labels and detects biological groups despite variable censoring in simulated data.

  • Renato Umeton, Dana-Farber Cancer Institute; Navin R. Mahadevan, Brigham and Women’s Hospital, Department of Pathology; Adem Albayrak, Clinical & Translational Informatics Division, Informatics Department, Dana-Farber Cancer Institute; Anika E. Adeni, Harvard Medical School; Peter Hammerman, Lowe Center for Thoracic Oncology, Dana-Farber Cancer Institute, Belfer Institute for Applied Cancer Science and Harvard Medical School; Mark Awad, Lowe Center for Thoracic Oncology, Dana-Farber Cancer Institute and Harvard Medical School; Leena Gandhi, Lowe Center for Thoracic Oncology, Dana-Farber Cancer Institute and Harvard Medical School; Lynette M. Sholl, Brigham and Women’s Hospital, Department of Pathology and Harvard Medical School – Applying the Data Science Process to Cancer Research: Reported Tumor Mutation Load in NSCLC is Associated with Durable Clinical Response to Immunotherapy Background: Recent evidence indicates that efficacy and durability of responses to immune checkpoint inhibitors in lung carcinomas correlate with increased nonsynonymous mutation (NSM) burden, putative neoantigen number, and in some tumor types, PD-L1 protein expression. In this study, we retrospectively analyzed the relationship of lung carcinoma mutation burden, PD-L1 expression and immune infiltrates with clinical response in patients receiving immune checkpoint blockade (S.L. Topalian, et al. N Engl J Med. 2013; Garofalo A. et al. Genome Med. 2016). Methods: The entire work-flow of analyses was embedded in our data science process; this methodology aims at ensuring full reproducibility of the results and complete scalability of the analyses when new data or new insights become available (V. Stodden et al. Science. 2016). Much as source code is versioned, we version analyses and data: from raw file processing to figure generation, everything is embedded in our data science process.
Tumor mutation load data were derived from clinical targeted next generation sequencing (E.P. Garcia, et al. Archives of Pathology & Laboratory Medicine. 2017) of lung carcinomas from 94 patients treated with immune checkpoint inhibitors and were correlated with clinical outcomes, including durable clinical benefit (DCB; >6 months partial or stable response) and progression-free survival (PFS). PD-L1 immunohistochemistry (clone E1L3N, Cell Signaling Technology, Envision+ detection, Dako) was considered positive if ≥1% of tumor cells and/or tumor-infiltrating immune cells (IC) stained. PU.1, CD3, and FOXP3 immunohistochemistry was used to highlight tumor-associated macrophages and non-regulatory and regulatory T cell populations, which were manually quantified per mm². Results: The mean patient age was 62 years (range: 32-91 years). Lung tumor types included 67 adenocarcinomas, 11 squamous cell carcinomas, and 5 other/combined histology. Therapies included PD-1 inhibitors (73), a PD-L1 inhibitor (5) and multiple agents (5). Across all tumor types, patients with DCB had a significantly higher mutation load than patients who showed no durable benefit (NDB) [p < 0.01]. Patients with greater than the median tumor mutation load had significantly longer PFS than others (p < 0.05). Increasing smoking history correlated with higher mutation load (p < 0.05) and smokers had a longer PFS than never smokers. Expression of PD-L1 in either tumor cells or immune cells was not associated with mutation load or PFS. PD-L1 expression in the tumor microenvironment was associated with increased numbers of non-regulatory and regulatory T cells (p < 0.05 for both), and tended to be associated with greater numbers of tumor-associated macrophages, though this trend did not reach statistical significance.
Conclusion: The tumor mutation load in lung non-small cell carcinoma as assessed by targeted next generation sequencing is associated with increased PFS and durable clinical benefit to immune checkpoint inhibitors. In this limited cohort, PD-L1 expression using clone E1L3N does not predict response to these therapies. We add to growing evidence that increased somatic mutations in carcinomas influence response to immune checkpoint blockade.

  • Tamas Schauer, Bioinformatics Core Facility, Biomedical Center, Faculty of Medicine, LMU Munich, Germany; Tobias Straub, Bioinformatics Core Facility, Biomedical Center, Faculty of Medicine, LMU Munich, Germany; Peter B Becker, Bioinformatics Core Facility, Biomedical Center, Faculty of Medicine, LMU Munich, Germany – RNA-seq normalization strategies of Drosophila single embryo transcriptomes Normalization is a key step in transcriptomic data analysis. Here, we explore various normalization strategies using ribosome-depleted total RNA sequencing from single Drosophila embryos. We compare relative normalization on read counts aligned to genic or consensus transposon sequences as well as absolute normalization using ERCC spike-in RNAs. Size factors calculated by the DESeq package do not always give reliable normalization, for example, when fewer than half of the genes are detected in early stages. Genic library size and ERCC-based normalization correctly scale the transcriptomes, as with these approaches known house-keeping genes (e.g. ribosomal protein genes) are invariant between stages. Reads aligned to consensus transposon sequences also require external normalization, because most of the transposons are up-regulated during development. A disadvantage of ERCC normalization is the increased variance across samples and replicates. Therefore, ERCC normalization can only be used for relatively large effect sizes, as is indeed the case during Drosophila embryonic development.

  • Matt Ritchie, The Walter and Eliza Hall Institute of Medical Research; Su Shian, The Walter and Eliza Hall Institute of Medical Research; Charity Law, The Walter and Eliza Hall Institute of Medical Research – Glimma: getting greater graphics for your genes RNA-sequencing is a popular technology for studying changes in gene expression across tens of thousands of transcripts simultaneously. To make exploration of gene expression data easier, we developed Glimma, an R package which generates interactive plots for gene expression analyses. Glimma plots connect the many layers of information in a single html page using d3.js. For example, a Glimma-style mean-difference plot allows one to select a point from a display of summary statistics to reveal the sample-wise expression levels alongside the original plot. This feature enables researchers to interrogate the data more easily by allowing searches for genes or samples of interest and zooming for better resolution. Unlike the traditional multi-dimensional scaling (MDS) plot, Glimma’s MDS plot shows several dimensions and group combinations on the same page. Results from Glimma can be easily shared between bioinformaticians and biologists, enhancing reporting capabilities while maintaining reproducibility. Besides bulk RNA-sequencing data, Glimma can also handle data from microarray, single-cell RNA-sequencing and methylation experiments.

  • Myriam Maumy-Bertrand, University of Strasbourg & CNRS; Frederic Bertrand, University of Strasbourg & CNRS – plsRglm: powerup Generalized PLS Regressions Using GPU Processes based on the so-called Partial Least Squares (PLS) regression, which has recently gained much attention in the analysis of high-dimensional genomic datasets, were developed to perform variable selection. Most of these processes rely on some tuning parameters that are usually determined by Cross-Validation (CV), which has a very high computational cost. We developed GPU-based R functions to speed up our existing package plsRglm. The aim of the plsRglm package is to deal with complete and incomplete datasets through several new techniques or, at least, some which were not yet implemented in R. Indeed, not only does it make available the extension of the PLS regression to the generalized linear regression models, but also bootstrap techniques, leave-one-out and repeated k-fold cross-validation. In addition, graphical displays help the user to assess the significance of the predictors when using bootstrap techniques.

  • Frederic Bertrand, University of Strasbourg & CNRS; Myriam Maumy-Bertrand, University of Strasbourg & CNRS – randABC: a package for Approximate Bayesian Computation Elucidating gene regulatory networks is an important step towards understanding normal cell physiology and complex pathological phenotypes. Reverse-engineering consists of using gene expression over time or over different experimental conditions to discover the structure of the gene network in a targeted cellular process. The fact that gene expression data are usually noisy, highly correlated, and have high dimensionality explains the need for specific statistical methods to reverse engineer the underlying network. Among known methods, Approximate Bayesian Computation (ABC) algorithms have not been very well studied. Due to the computational overhead, their application is also limited to a small number of genes. In this work we have developed a new multi-level ABC approach with a lower computational cost. At the first level, the method captures the global properties of the network, such as scale-freeness and clustering coefficients, whereas the second level is targeted to capture local properties, including the probability of each couple of genes being linked.

  • Jimmy Breen, Robinson Research Institute, University of Adelaide; Benjamin T Mayne, Robinson Research Institute, University of Adelaide; Shalem Leemaqz, Robinson Research Institute, University of Adelaide; Sam Buckberry, University of Western Australia; Carlos Rodriguez Lopez, University of Adelaide; Claire T Roberts, Robinson Research Institute, University of Adelaide; Tina Bianco-Miotto, Robinson Research Institute, University of Adelaide – msgbsR: an R package for analysing methylation-sensitive genotyping-by-sequencing data Genotyping-by-sequencing (GBS) is a practical and cost effective method for analysing large genomes from high diversity species. This method of sequencing, coupled with methylation-sensitive enzymes, is an effective tool to study DNA methylation in parts of the genome that are inaccessible in other sequencing techniques or are not annotated in microarray technologies. Current software tools do not accommodate all experimental GBS assays, such as those with methylation-sensitive restriction enzymes. Here we present msgbsR, an R package that contains tools for the analysis of methylation-sensitive genotyping-by-sequencing (msGBS) experiments. msgbsR contains functions for identifying and quantifying read counts at methylated sites directly from BAM files. It also enables verification of cut sites with the correct recognition sequence of the restriction enzyme. In addition, it contains functions to test for differential methylation and create genomic plots of the cut site locations. Furthermore, msgbsR is fully documented and available freely online as a Bioconductor package (https://bioconductor.org/packages/msgbsR).

  • Dror Berel, Fred Hutchinson Cancer Research Center; Raphael Gottardo – Prototype meta-analysis demonstration for ImmuneSpaceR, using designated S4 objects ImmuneSpace is a powerful management and analysis engine web-portal for integrative modeling of human immunological data. Currently, it contains data from 66 human immunology studies, covering a total of 4,084 participants. Each study is comprised of multiple data types including microarray, flow cytometry, hemagglutination inhibition assay, among others. Some of these assays contain thousands of biological measures (e.g. mRNA gene transcripts, FACS analytes / markers). The data is standardized and annotated for both explanatory clinical outcomes and biological ontologies. In addition to the web-portal, the entire ImmuneSpace databases are also accessible for direct download via ImmuneSpaceR, an API R/Bioconductor package. Aggregating such comprehensive data across all studies, all subjects and all variables (both dependent and independent) is an exhaustive task. Here we demonstrate an R analysis pipeline to accomplish this task, and demonstrate a meta-analysis using a specific hypothesis of interest. The R/Bioconductor MultiAssayExperiment (MAE) package provides a designated S4 class for integrative omic data from multiple assays. All data from a study is converted and stored in a single MAE object, which is a non-atomic R object. A tibble R class is used to systematically access multiple non-atomic objects in a fashion reminiscent of the canonical R data.frames. Ten of the 66 ImmuneSpace studies include complete data for both microarray (gene markers) and clinical outcome that is derived from the Hemagglutination assay (HAI). For each study, association between the clinical outcome and each of the genes (separately) is modeled via a logistic regression, and summarized as an odds-ratio (OR) estimation.
For each gene, a ‘meta’ OR across all single-studies is calculated by taking into account the relative weight for each of the study’s effect sizes. A forest plot summarizes all single-studies and ‘meta’ ORs. The combination of such well-annotated standardized data, and designated tools for accessing it, enables modification and extension of this prototype analysis into broader pipelines of meta-analysis hypothesis testing.
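
As an illustration of the weighted ‘meta’ OR described in the abstract above, here is a minimal Python sketch. It assumes a fixed-effect, inverse-variance weighting of the per-study log odds ratios; the abstract does not say which weighting scheme the authors used, and the function name and the per-study numbers are hypothetical.

```python
import math


def pooled_odds_ratio(odds_ratios, standard_errors):
    """Fixed-effect inverse-variance pooling of per-study odds ratios.

    standard_errors are the standard errors of the log odds ratios.
    Returns (pooled OR, lower 95% bound, upper 95% bound).
    Assumption: this weighting scheme is illustrative, not the authors' method.
    """
    if not odds_ratios or len(odds_ratios) != len(standard_errors):
        raise ValueError("need one standard error per odds ratio")
    # Pool on the log scale, where the estimates are approximately normal.
    log_ors = [math.log(o) for o in odds_ratios]
    # Each study is weighted by the inverse of its variance.
    weights = [1.0 / se ** 2 for se in standard_errors]
    total = sum(weights)
    pooled_log = sum(w * b for w, b in zip(weights, log_ors)) / total
    pooled_se = math.sqrt(1.0 / total)
    z = 1.959963984540054  # 97.5th percentile of the standard normal
    return (
        math.exp(pooled_log),
        math.exp(pooled_log - z * pooled_se),
        math.exp(pooled_log + z * pooled_se),
    )


# Hypothetical per-study estimates for a single gene (made-up values).
study_ors = [1.8, 1.2, 2.5]
study_ses = [0.40, 0.25, 0.60]
meta_or, ci_low, ci_high = pooled_odds_ratio(study_ors, study_ses)
print(f"meta OR = {meta_or:.2f} (95% CI {ci_low:.2f} to {ci_high:.2f})")
```

Studies with smaller standard errors pull the pooled estimate towards their own OR, which is the sense in which each study's effect size carries a relative weight. A random-effects model would be a natural alternative when the studies are heterogeneous.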