1 Overview

Geneplast is designed for large-scale evolutionary analysis of orthologous groups (OGs), assessing the distribution of orthologous genes in a given species tree. Figure 1 illustrates the distribution of two hypothetical OGs in 13 species. Diversity (Hα) and abundance (Dα) are useful metrics to describe the distribution of orthologous genes in a species tree. Geneplast calculates Hα as the normalized Shannon's diversity (Castro et al. 2008). High Hα represents a homogeneous distribution (Figure 1a), while low Hα indicates that a few species concentrate most of the observed orthologous genes (Figure 1b). The abundance Dα is given by the number of orthologous genes divided by the number of species in the tree. Geneplast uses Hα and Dα to calculate an evolutionary plasticity index (EPI) as defined by Dalmolin et al. (2011).

Figure 1. Toy examples illustrating the distribution of orthologous genes in a given species tree. (a) OG of low abundance (Dα) and high diversity (Hα). This hypothetical OG comprises orthologous genes observed in all species of the tree, without apparent deletions or duplications. (b) Example of an OG observed in many species, but not all. Numbers in parentheses represent the orthologous genes in each species.
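
The two metrics described above can be illustrated with a minimal Python sketch. It is not Geneplast's own code (Geneplast is an R package), and it rests on two assumptions: Hα is the Shannon entropy of the per-species gene proportions divided by the logarithm of the number of species in the tree, and Dα is the mean number of orthologous genes per species, as stated in the text. The function names and the gene counts are hypothetical and are not taken from Figure 1.

```python
import math


def abundance(counts):
    """D-alpha: total orthologous genes divided by the number of species in the tree."""
    return sum(counts) / len(counts)


def normalized_diversity(counts):
    """H-alpha: Shannon entropy of per-species proportions, divided by ln(n species).

    Assumption: normalization uses the number of species in the tree, so the
    value ranges from 0 (all genes in one species) to 1 (perfectly even).
    """
    total = sum(counts)
    n_species = len(counts)
    if total == 0 or n_species < 2:
        return 0.0
    entropy = -sum(
        (c / total) * math.log(c / total) for c in counts if c > 0
    )
    return entropy / math.log(n_species)


# Hypothetical gene counts per species for a 13-species tree.
og_even = [1] * 13                   # one ortholog in every species
og_skewed = [10, 1, 1] + [0] * 10    # a few species hold most of the genes

h_even, d_even = normalized_diversity(og_even), abundance(og_even)
h_skewed, d_skewed = normalized_diversity(og_skewed), abundance(og_skewed)

print(f"even OG:   H = {h_even:.3f}, D = {d_even:.3f}")
print(f"skewed OG: H = {h_skewed:.3f}, D = {d_skewed:.3f}")
```

Under these assumptions the evenly distributed group reaches the maximum diversity, while the skewed group scores lower because three species account for all of its genes.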

To interrogate the evolutionary root of a given gene, Geneplast implements a new algorithm called Bridge, which assesses the probability that an ortholog is present in each Last Common Ancestor (LCA) of a species in a given species tree. The method is designed to handle large-scale queries, for example to interrogate all genes annotated in a network (please refer to Castro et al. (2008) for additional examples).

To illustrate the rooting inference, consider the evolutionary scenarios presented in Figure 2 for the same hypothetical OGs from Figure 1. These OGs comprise a number of orthologous genes distributed among 13 species, and the pattern of presence or absence is indicated by green and grey colours, respectively. Observe that in Figure 2a at least one ortholog is present in all extant species. To explain this common genetic trait, a possible evolutionary scenario could assume that the ortholog was present in the LCA of all extant species and was genetically transmitted to its descendants. For this scenario, the evolutionary root might be placed at the bottom of the species tree (i.e. node j). A similar interpretation could be made for the OG in Figure 2b, but with the evolutionary root placed at node f. The Bridge algorithm infers the most consistent rooting scenario for the observed orthologs in a given species tree, computing a consistency score called Dscore and an associated empirical p-value. The Dscore is an estimate of the stability of the inferred roots, and the empirical p-value is computed by permutation analysis.
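
The simple scenario described above, in which the root is placed at the common ancestor of every species carrying the ortholog, can be sketched in a few lines of Python. This is a naive heuristic for illustration only, not the Bridge algorithm: it ignores gene losses, does not compute a Dscore, and performs no permutation analysis. The species names, the LCA indices, and the function `naive_root` are hypothetical. Each species is assigned the index of its LCA with a reference species, with larger indices meaning older ancestors.

```python
def naive_root(lca_with_reference, present):
    """Return the oldest LCA index shared with any species carrying the ortholog.

    lca_with_reference: maps each species to the index of its LCA with the
        reference species (0 = the reference itself, larger = older).
    present: maps each species to True if at least one ortholog is observed.
    Returns None when the ortholog is absent from every species.
    """
    depths = [
        lca_with_reference[species]
        for species, has_ortholog in present.items()
        if has_ortholog
    ]
    return max(depths) if depths else None


# Hypothetical tree: seven species grouped by their LCA with the reference "ref".
lca_with_reference = {
    "ref": 0, "sp1": 1, "sp2": 1, "sp3": 2, "sp4": 3, "sp5": 3, "sp6": 4,
}

# Scenario 1: ortholog observed in every species, so the root is the oldest LCA.
present_all = {species: True for species in lca_with_reference}

# Scenario 2: ortholog missing from the three most distant species.
present_some = dict(present_all, sp4=False, sp5=False, sp6=False)

root_all = naive_root(lca_with_reference, present_all)
root_some = naive_root(lca_with_reference, present_some)

print("root (present everywhere):", root_all)
print("root (absent in distant species):", root_some)
```

A heuristic of this kind is sensitive to a single spurious ortholog in a distant species, which would pull the root to the oldest LCA; this is the sort of instability that a consistency score and a permutation-based p-value are meant to expose.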