Inferring Orthologous Relationships

Why orthology?

Many human diseases have genetic origins, and studying model organisms (like mice, fruit flies, or yeast) can provide insights into human gene function.

Imagine different car models from the same manufacturer. If a certain part (a gene) is broken in a complex, high-end model (human), you might study the equivalent part in a simpler, easier-to-disassemble model (mouse) to understand how it works and how to fix it. Orthology tells us which parts are truly "equivalent" across models.

Understanding Orthology

The projects below rely on similarity-based inference methods, which use graph theory to model and analyze evolutionary relationships. In this framework, nodes represent biological entities such as genes or proteins, and each node is colored by its species of origin. Edges connect entities that are related, for example by sequence similarity, so the graph makes the evolutionary connections among genes explicit. This mathematical model is the foundation for studying orthologs: genes in different species that originated from a common ancestral gene through speciation.

From Raw Data to Orthology Clusters: Our Project Focus

Similarity-based orthology inference typically follows a well-defined pipeline that starts from raw sequence data and ends with orthology clusters. This process, illustrated in the image below, begins with all-versus-all sequence similarity comparisons that quantify the relatedness of every pair of genes or proteins. These comparisons then inform the construction of a similarity graph, in which two entities are connected when their similarity passes a predefined threshold. The final phase is the cluster analysis of this graph.
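The three phases described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the gene names, the scores, the 0.8 threshold and the helper names `build_similarity_graph` and `connected_components` are all hypothetical, and connected components stand in for the cluster analysis phase as the simplest possible placeholder.

```python
from itertools import combinations

def build_similarity_graph(species_of, scores, threshold):
    """Phase 2: connect two genes when their similarity reaches the threshold."""
    adjacency = {gene: set() for gene in species_of}
    for a, b in combinations(sorted(species_of), 2):
        # Pairs without a recorded score are treated as dissimilar (assumption).
        if scores.get(frozenset((a, b)), 0.0) >= threshold:
            adjacency[a].add(b)
            adjacency[b].add(a)
    return adjacency

def connected_components(adjacency):
    """Phase 3 placeholder: group genes that are linked by a chain of edges."""
    seen, clusters = set(), []
    for start in sorted(adjacency):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        clusters.append(component)
    return clusters

# Phase 1 output (hypothetical): pairwise similarity scores between genes.
species_of = {"h1": "human", "h2": "human", "m1": "mouse", "m2": "mouse"}
scores = {
    frozenset(("h1", "m1")): 0.92,
    frozenset(("h2", "m2")): 0.88,
    frozenset(("h1", "m2")): 0.35,
    frozenset(("h1", "h2")): 0.40,
}

graph = build_similarity_graph(species_of, scores, threshold=0.8)
clusters = connected_components(graph)
print(clusters)
```

In practice the scores would come from a sequence comparison tool, and the clustering step would be replaced by a method that uses the species colors.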

Our first project, "Best Match Graphs", focuses on refining the second phase by investigating which entities in the similarity graph are evolutionarily closest to one another. Our second project, "Colorful Orthology Clustering", contributes to the third phase by developing methods for grouping orthologs into meaningful clusters.

Projects

Best Match Graphs: Algorithms and Simulations for Orthology Inference

Project type: Theoretical Bioinformatics • Graph Theory • Simulation Design

Role: Algorithm and Simulation design, visualization editor

Associated paper: Best match graphs and reconciliation of gene trees with species trees

Published in: Journal of Mathematical Biology, 2020

The Challenge

Orthology—the identification of genes in different species that originated from a common ancestor—is key to understanding gene function and evolution. One of the most common heuristics for this is the concept of reciprocal best matches (RBMs): genes that are each other’s best hits across species. The results of this analysis build up the edges of the similarity graph. But how accurate is this method, and under what conditions can it give misleading inferences?
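To make the heuristic concrete, here is a minimal sketch of reciprocal best matches, assuming a symmetric similarity score is already available for every cross-species gene pair. The gene names, the `species_of` map and the score values are hypothetical, and ties are kept as joint best hits, which is one possible convention rather than the one used in the paper.

```python
def best_hits(species_of, score):
    """For each gene and each other species, collect the top-scoring genes."""
    hits = {}
    for x in species_of:
        by_species = {}
        for y in species_of:
            if species_of[y] != species_of[x]:
                by_species.setdefault(species_of[y], []).append(y)
        for sp, candidates in by_species.items():
            top = max(score(x, y) for y in candidates)
            hits[(x, sp)] = {y for y in candidates if score(x, y) == top}
    return hits

def reciprocal_best_matches(species_of, score):
    """Keep the pairs {x, y} in which each gene is a best hit of the other."""
    hits = best_hits(species_of, score)
    rbm = set()
    for (x, _sp), ys in hits.items():
        for y in ys:
            if x in hits[(y, species_of[x])]:
                rbm.add(frozenset((x, y)))
    return rbm

# Hypothetical data: two genes in species A, one gene in species B.
species_of = {"a1": "A", "a2": "A", "b1": "B"}
scores = {frozenset(("a1", "b1")): 0.9, frozenset(("a2", "b1")): 0.7}

def score(x, y):
    return scores[frozenset((x, y))]

rbm = reciprocal_best_matches(species_of, score)
print(rbm)
```

Here `b1` is the best hit of both `a1` and `a2`, but only `a1` is the best hit of `b1`, so only the pair `a1`, `b1` becomes an edge.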

This project aimed to rigorously study the relationship between RBMs, orthology, and evolutionary events like duplications and losses. We also wanted to understand and correct errors arising from the use of RBMs, especially false positive orthology assignments.

My Contribution

As an undergraduate, I designed and implemented a simulation framework to test how well RBMs reflect true orthology relations in various evolutionary scenarios. My contributions included:

Simulation Engine

I developed a recursive algorithm to generate “ground truth” orthology graphs from simulated gene trees under duplication-loss models and realistic species trees. This provided the foundation for all experimental results in the paper.
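The idea of deriving a ground truth orthology graph from an event-labeled gene tree can be sketched as follows. This is not the simulation engine from the paper; it is a small illustration that assumes the standard definition in which two genes are orthologs when their last common ancestor is a speciation event. The nested-tuple tree encoding and the function name `orthology_edges` are hypothetical, and lost genes are simply absent from the tree.

```python
from itertools import combinations, product

def orthology_edges(tree):
    """Return (leaves, edges) for a gene tree given as nested tuples.

    A leaf is a gene name (str); an inner node is (event, [children]) with
    event equal to "speciation" or "duplication".
    """
    if isinstance(tree, str):
        return [tree], set()
    event, children = tree
    leaves, edges, child_leaves = [], set(), []
    for child in children:
        sub_leaves, sub_edges = orthology_edges(child)  # recurse into subtree
        child_leaves.append(sub_leaves)
        leaves.extend(sub_leaves)
        edges |= sub_edges
    if event == "speciation":
        # Genes in different child subtrees have this node as their last
        # common ancestor, so a speciation label makes them orthologs.
        for left, right in combinations(child_leaves, 2):
            for x, y in product(left, right):
                edges.add(frozenset((x, y)))
    return leaves, edges

# Hypothetical gene tree: one duplication followed by a speciation per copy.
gene_tree = ("duplication", [
    ("speciation", ["a1", "b1"]),
    ("speciation", ["a2", "b2"]),
])
leaves, edges = orthology_edges(gene_tree)
print(sorted(tuple(sorted(e)) for e in edges))
```

In this example `a1` and `b2` are separated by the duplication at the root, so they are paralogs and no edge joins them.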

Graph Correction via Good Quartets

A theoretical contribution of the paper is that “good quartets” (specific subgraphs of four genes) can identify false positive orthology edges in RBM graphs. By analyzing degree sequences of directed graphs, and using vertex coloring rules and graph motifs, I provided a shortcut to detect and correct these patterns efficiently, connecting them to results from Cloteaux et al. (2014).

Result Visualization

I produced heatmaps (see Figs. 10 & 13 in the publication) showing how good quartets reduced false positives across a wide range of duplication and loss scenarios. Additionally, they allowed us to see that this error reduction algorithm cannot be applied when horizontal gene transfer is present. These figures became central to the paper’s message.

I began contributing to peer-reviewed research while still a bachelor’s student. Since then, I’ve emphasized transparent, well-documented methodology—even before I had a GitHub, I kept my work accessible through academic theses and supplementary materials.

Download my bachelor's thesis

Download supplementary material

Key findings

Theoretical Insight: We proved that the orthology graph is always a subgraph of the reciprocal best match graph (RBMG), and that if the RBMG is a cograph, the orthology inference is accurate.

Simulation Insight: Even in the absence of horizontal gene transfer, RBMs may contain many false positives—but good quartets can correct most of them.

Methodological Impact: Our results suggest a principled way to post-process RBM-based orthology predictions using local graph patterns.
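The cograph condition in the theoretical insight can be checked directly, since a cograph is a graph with no induced path on four vertices (P4). The sketch below is a brute-force test meant only for small examples; the function names and the toy graphs are hypothetical, and faster recognition algorithms exist.

```python
from itertools import permutations

def find_induced_p4(adjacency):
    """Return an induced path a-b-c-d if one exists, otherwise None."""
    for a, b, c, d in permutations(sorted(adjacency), 4):
        has_path = b in adjacency[a] and c in adjacency[b] and d in adjacency[c]
        no_chords = (c not in adjacency[a] and d not in adjacency[a]
                     and d not in adjacency[b])
        if has_path and no_chords:
            return (a, b, c, d)
    return None

def is_cograph(adjacency):
    """A graph is a cograph exactly when it has no induced P4."""
    return find_induced_p4(adjacency) is None

# Hypothetical undirected graphs given as adjacency sets.
path = {"w": {"x"}, "x": {"w", "y"}, "y": {"x", "z"}, "z": {"y"}}
clique = {v: {u for u in "wxyz" if u != v} for v in "wxyz"}

print(is_cograph(path), is_cograph(clique))
```

The path on four vertices is itself a P4 and fails the test, while the complete graph on the same vertices passes.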

Why It Matters

This project was my first hands-on experience connecting theory, simulation, and graph-based modeling in bioinformatics. It taught me how biological insight can emerge from carefully designed abstractions—and how simulation can validate and extend theoretical work.

This early exposure shaped my approach to algorithm design: building tools that are not only correct on paper but also usable and testable in practice.

Colorful Orthology Clustering

Project type: Research · Theoretical Bioinformatics · Graph Algorithms · Cluster Analysis

Role: First author · Algorithm design · Simulation analysis

Associated paper:

López Sánchez, A. and Lafond, M.

Colorful orthology clustering in bounded-degree similarity graphs, Journal of Bioinformatics and Computational Biology, 2021.

PMID: 34775924 · DOI: 10.1142/S0219720021400102

The Challenge

This project marks my first publication as first author and a pivotal moment in my journey as a bioinformatics researcher. We developed a clustering method tailored for orthology inference that’s not just accurate, but fixed-parameter tractable (FPT)—meaning it remains computationally feasible when we bound key biological parameters like the number of species (k) and potential orthologs per species (d). The result is a model that balances biological realism with algorithmic efficiency.

While many orthology clustering methods ignore species information and rely on reciprocal hits, our approach introduces Colorful Cluster Editing, a variant of the classical cluster editing problem in which each node carries a species "color" and each cluster may contain at most one node per species. This adds biological validity and reduces paralog contamination.
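The constraint and the objective described above can be illustrated with a short sketch. It assumes the usual cluster editing cost (edges deleted between clusters plus edges inserted inside clusters) and reads the color constraint as "at most one node per species in each cluster". The helper names and the toy graph are hypothetical, and this only evaluates a given clustering; it is not the FPT algorithm from the paper.

```python
from itertools import combinations

def is_colorful(clusters, species_of):
    """True when no cluster contains two genes from the same species."""
    for cluster in clusters:
        colors = [species_of[gene] for gene in cluster]
        if len(colors) != len(set(colors)):
            return False
    return True

def editing_cost(edges, clusters):
    """Edge deletions plus insertions needed to turn each cluster into a clique."""
    inside = {frozenset(pair)
              for cluster in clusters
              for pair in combinations(sorted(cluster), 2)}
    deletions = sum(1 for edge in edges if edge not in inside)
    insertions = sum(1 for pair in inside if pair not in edges)
    return deletions + insertions

# Hypothetical similarity graph with species colors.
species_of = {"h1": "human", "h2": "human", "m1": "mouse", "f1": "fly"}
edges = {frozenset(("h1", "m1")), frozenset(("m1", "f1")), frozenset(("h2", "m1"))}

colorful = [{"h1", "m1", "f1"}, {"h2"}]
single = [{"h1", "h2", "m1", "f1"}]

print(is_colorful(colorful, species_of), editing_cost(edges, colorful))
print(is_colorful(single, species_of))
```

The first clustering is colorful and costs two edits (delete `h2`-`m1`, insert `h1`-`f1`); the second puts both human genes in one cluster and is rejected regardless of cost.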

My Contributions

Co-developed the two-step inference pipeline

Combining exact within-cluster orthology and HyPPO for between-cluster predictions.

Developed the concept of (k,d)-Best Neighbor Graphs

A new theoretical framework to limit search space and enable efficient enumeration.

The Tortoise and the Hare - Clustering Style

Benchmarked our method 🐢 against spectral clustering 🐰 using simulated datasets—demonstrating higher performance in realistic evolutionary scenarios.

Designed and implemented an FPT algorithm that finds the best colorful cluster in a graph.

Made the code available through GitHub, contributing to reproducibility and accessibility in bioinformatics research.

Selected Publications
