Many human diseases have genetic origins, and studying model organisms (like mice, fruit flies, or yeast) can provide insights into human gene function.
Imagine a manufacturer that produces several car models. If a certain part (a gene) is broken in a complex, high-end model (human), we can study the equivalent part in a simpler model that is easier to disassemble (mouse) to understand how it works and how to fix it. Orthology tells us which parts are truly "equivalent" across different models.
These projects leverage similarity-based inference methods, a powerful approach that employs graph theory to model and analyze evolutionary relationships. In this framework, nodes represent biological entities such as genes or proteins, with distinct colors indicating their respective species of origin. The edges connecting these nodes signify the intricate relationships between them, allowing us to visualize the complex web of evolutionary connections. This mathematical model forms the foundation for exploring the world of orthologs—genes in different species that originated from a common ancestral gene through speciation.
Similarity-based orthology inference typically follows a well-defined pipeline, starting from raw sequence data and ending with the identification of orthology clusters. This process, illustrated in the image below, begins with all-versus-all sequence similarity comparisons that quantify the relatedness of every pair of genes or proteins. These comparisons then inform the construction of a similarity graph, where edges are established based on predefined similarity thresholds. The final phase is the cluster analysis of this graph.
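The three phases can be sketched on toy data. This is a minimal illustration, not the code used in our projects: the gene names, scores, and threshold are made up, and connected components stand in for the clustering phase as a deliberately naive placeholder.

```python
# Hypothetical toy data: gene -> species ("color"). All names are made up.
species = {"a1": "human", "a2": "human", "b1": "mouse", "c1": "fly"}

# Phase 1 output (assumed): symmetric pairwise similarity scores.
scores = {
    ("a1", "b1"): 0.92,
    ("a1", "c1"): 0.81,
    ("b1", "c1"): 0.78,
    ("a1", "a2"): 0.40,
    ("a2", "b1"): 0.35,
    ("a2", "c1"): 0.30,
}


def build_similarity_graph(genes, scores, threshold):
    """Phase 2: connect two genes when their score reaches the threshold."""
    graph = {g: set() for g in genes}
    for (u, v), s in scores.items():
        if s >= threshold:
            graph[u].add(v)
            graph[v].add(u)
    return graph


def connected_components(graph):
    """Phase 3 placeholder: treat each connected component as one cluster."""
    seen, clusters = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        clusters.append(component)
    return clusters


graph = build_similarity_graph(species, scores, threshold=0.5)
clusters = connected_components(graph)
print(clusters)
```

With a threshold of 0.5, the weakly similar gene a2 ends up alone, while a1, b1, and c1 form one cluster with one gene per species.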
Our first project, "Best Match Graphs", refines the second phase by investigating, for each gene, its evolutionarily closest relatives within the similarity graph. Our second project, "Colorful Orthology Clustering", contributes to the third phase by developing methods for grouping orthologs into meaningful clusters.
Project type: Theoretical Bioinformatics • Graph Theory • Simulation Design
Role: Algorithm and Simulation design, visualization editor
Associated paper: Best match graphs and reconciliation of gene trees with species trees
Published in: Journal of Mathematical Biology, 2020
Orthology—the identification of genes in different species that originated from a common ancestor—is key to understanding gene function and evolution. One of the most common heuristics for this is the concept of reciprocal best matches (RBMs): genes that are each other’s best hits across species. The results of this analysis build up the edges of the similarity graph. But how accurate is this method, and under what conditions can it give misleading inferences?
This project aimed to rigorously study the relationship between RBMs, orthology, and evolutionary events like duplications and losses. We also wanted to understand and correct errors arising from the use of RBMs, especially false positive orthology assignments.
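To make the RBM heuristic concrete, here is a minimal sketch that extracts reciprocal best matches from pairwise scores. It assumes that a similarity score is used as a proxy for evolutionary relatedness; the function names, gene names, and scores are hypothetical and are not taken from the paper.

```python
# Hypothetical toy data: gene -> species, plus symmetric similarity scores.
species = {"a1": "A", "a2": "A", "b1": "B"}
sim = {
    frozenset(("a1", "b1")): 0.9,
    frozenset(("a2", "b1")): 0.6,
}


def score(x, y):
    """Look up the similarity of two genes (0.0 when no hit was recorded)."""
    return sim.get(frozenset((x, y)), 0.0)


def best_matches(gene, target_species):
    """All genes of target_species that are maximally similar to `gene`."""
    candidates = [g for g in species if species[g] == target_species]
    if not candidates:
        return set()
    top = max(score(gene, g) for g in candidates)
    return {g for g in candidates if score(gene, g) == top}


def reciprocal_best_matches():
    """Pairs of genes from different species that are each other's best match."""
    pairs = set()
    genes = sorted(species)
    for i, x in enumerate(genes):
        for y in genes[i + 1:]:
            if species[x] == species[y]:
                continue
            if y in best_matches(x, species[y]) and x in best_matches(y, species[x]):
                pairs.add((x, y))
    return pairs


rbm = reciprocal_best_matches()
print(rbm)
```

In this toy case b1 is the best match of a2, but the best match of b1 in species A is a1, so only the pair (a1, b1) is reciprocal.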
As an undergraduate, I designed and implemented a simulation framework to test how well RBMs reflect true orthology relations in various evolutionary scenarios. My contributions included:
I developed a recursive algorithm to generate “ground truth” orthology graphs from simulated gene trees under duplication-loss models and realistic species trees. This provided the foundation for all experimental results in the paper.
A theoretical contribution of the paper is that “good quartets” (specific subgraphs on four genes) can identify false-positive orthology edges in RBM graphs. By analyzing degree sequences of directed graphs, together with vertex coloring rules and graph motifs, I provided a shortcut to detect and correct these patterns efficiently, connecting them to results from Cloteaux et al. (2014).
I produced heatmaps (see Figs. 10 & 13 in the publication) showing how good quartets reduced false positives across a wide range of duplication and loss scenarios. Additionally, they allowed us to see that this error reduction algorithm cannot be applied when horizontal gene transfer is present. These figures became central to the paper’s message.
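The idea behind the recursive ground-truth construction mentioned in the first contribution can be sketched as follows. This is a simplified illustration, not the implementation from the paper: it assumes a gene tree whose internal nodes are already labeled as "speciation" or "duplication" and from which lost genes have been pruned, and it uses the standard definition that two genes are orthologs when their last common ancestor is a speciation. The tuple-based tree encoding is a hypothetical choice.

```python
from itertools import combinations, product


def orthology_edges(tree):
    """Return (leaves, edges) for an event-labeled gene tree.

    A tree is either a leaf name (str) or a tuple (event, children),
    where event is "speciation" or "duplication".
    """
    if isinstance(tree, str):
        return [tree], set()
    event, children = tree
    leaves, edges, per_child = [], set(), []
    for child in children:
        child_leaves, child_edges = orthology_edges(child)
        per_child.append(child_leaves)
        leaves.extend(child_leaves)
        edges |= child_edges
    if event == "speciation":
        # Genes in different subtrees of a speciation node are orthologs.
        for left, right in combinations(per_child, 2):
            for x, y in product(left, right):
                edges.add(frozenset((x, y)))
    return leaves, edges


# Toy gene tree: a duplication followed by a speciation in each copy.
gene_tree = ("duplication", [
    ("speciation", ["a1", "b1"]),
    ("speciation", ["a2", "b2"]),
])
leaves, edges = orthology_edges(gene_tree)
print(sorted(tuple(sorted(e)) for e in edges))
```

Here a1 and b2 are paralogs because their last common ancestor is the duplication at the root, so the only orthology edges are a1-b1 and a2-b2.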
I began contributing to peer-reviewed research while still a bachelor’s student. Since then, I’ve emphasized transparent, well-documented methodology: even before I had a GitHub account, I kept my work accessible through academic theses and supplementary materials.
Theoretical Insight: We proved that the orthology graph is always a subgraph of the reciprocal best match graph (RBMG), and that if the RBMG is a cograph, the orthology inference is accurate.
Simulation Insight: Even in the absence of horizontal gene transfer, RBMs may contain many false positives—but good quartets can correct most of them.
Methodological Impact: Our results suggest a principled way to post-process RBM-based orthology predictions using local graph patterns.
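The cograph condition in the theoretical insight can be checked directly on small graphs, because a graph is a cograph exactly when it contains no induced path on four vertices (P4). The sketch below is a brute-force check for illustration only; it is not the procedure used in the paper, and linear-time cograph recognition algorithms exist for larger inputs. The adjacency-dictionary encoding is an assumption.

```python
from itertools import permutations


def has_induced_p4(adj):
    """True if some vertices a-b-c-d form a path with no additional edges."""
    for a, b, c, d in permutations(adj, 4):
        path_edges = b in adj[a] and c in adj[b] and d in adj[c]
        no_chords = c not in adj[a] and d not in adj[a] and d not in adj[b]
        if path_edges and no_chords:
            return True
    return False


def is_cograph(adj):
    """A graph is a cograph if and only if it has no induced P4."""
    return not has_induced_p4(adj)


# A path on four vertices is the smallest graph that is not a cograph.
path = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
# A cycle on four vertices has no induced P4, so it is a cograph.
cycle = {1: {2, 4}, 2: {1, 3}, 3: {2, 4}, 4: {1, 3}}

print(is_cograph(path), is_cograph(cycle))
```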
This project was my first hands-on experience connecting theory, simulation, and graph-based modeling in bioinformatics. It taught me how biological insight can emerge from carefully designed abstractions—and how simulation can validate and extend theoretical work.
This early exposure shaped my approach to algorithm design: building tools that are not only correct on paper but also usable and testable in practice.
Project type: Research · Theoretical Bioinformatics · Graph Algorithms · Cluster Analysis
Role: First author · Algorithm design · Simulation analysis
Associated paper:
López Sánchez, A. and Lafond, M.
Colorful orthology clustering in bounded-degree similarity graphs, Journal of Bioinformatics and Computational Biology, 2021.
PMID: 34775924 · DOI: 10.1142/S0219720021400102
This project marks my first publication as first author and a pivotal moment in my journey as a bioinformatics researcher. We developed a clustering method tailored for orthology inference that’s not just accurate, but fixed-parameter tractable (FPT)—meaning it remains computationally feasible when we bound key biological parameters like the number of species (k) and potential orthologs per species (d). The result is a model that balances biological realism with algorithmic efficiency.
While many orthology clustering methods ignore species information and rely on reciprocal hits, our approach introduces Colorful Cluster Editing, a variant of the classical cluster editing problem in which each node carries a species "color" and no cluster may contain two nodes from the same species. This adds biological validity and reduces paralog contamination.
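The objective can be illustrated with a small sketch that evaluates a candidate clustering: it checks the colorful constraint and counts the edge insertions and deletions needed to turn the similarity graph into disjoint cliques matching the clusters. This only scores a given solution and is not our FPT algorithm; the function names and the toy graph are hypothetical.

```python
from itertools import combinations

# Hypothetical toy instance: gene -> species color, and similarity edges.
color = {"a1": "A", "a2": "A", "b1": "B", "c1": "C"}
edges = {
    frozenset(("a1", "b1")),
    frozenset(("b1", "c1")),
    frozenset(("a2", "b1")),
}


def is_colorful(cluster, color):
    """A cluster is colorful when no species color appears twice in it."""
    colors = [color[v] for v in cluster]
    return len(colors) == len(set(colors))


def editing_cost(edges, clusters):
    """Edge insertions inside clusters plus edge deletions between clusters."""
    cluster_of = {v: i for i, c in enumerate(clusters) for v in c}
    insertions = sum(
        1
        for c in clusters
        for u, v in combinations(sorted(c), 2)
        if frozenset((u, v)) not in edges
    )
    deletions = sum(1 for e in edges if len({cluster_of[v] for v in e}) == 2)
    return insertions + deletions


clustering = [{"a1", "b1", "c1"}, {"a2"}]
print(all(is_colorful(c, color) for c in clustering))
print(editing_cost(edges, clustering))
```

For this clustering, one edge (a1-c1) must be inserted and one edge (a2-b1) deleted, giving a cost of 2. Putting all four genes in one cluster would violate the colorful constraint because a1 and a2 share a species.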
Combining exact within-cluster orthology and HyPPO for between-cluster predictions.
A new theoretical framework to limit the search space and enable efficient enumeration.
Benchmarked our method 🐢 against spectral clustering 🐰 using simulated datasets—demonstrating higher performance in realistic evolutionary scenarios.
Made the code available through GitHub, contributing to reproducibility and accessibility in bioinformatics research.