
Predicting the Effect of Mutations on PPIs Using AlphaFold

The human interactome, the complete set of protein-protein interactions (PPIs) in the human body, may comprise up to 600,000 interactions.

Predicting how a disease-causing mutation affects the interactome might seem like a Herculean task—but it’s more tractable than you might expect, especially when you give a University of Waterloo co-op student free access to a beefy GPU cluster, world-class mentorship, and the freedom to pursue any approach.

I used the gradient-boosting library XGBoost, DeepMind’s cutting-edge AlphaFold-Multimer (AF-M), and over 47,000 SLURM jobs to build a multi-class model that predicts the effects of missense mutations on PPIs with a micro-averaged AUC of 0.91.

Multi-class ROC curve and AUC (By Author)

In this article, I’ll walk through:

- the background and research question,
- the data pipeline and feature engineering,
- the machine learning model,
- the results and what they mean biologically, and
- an application to novel variants.

If you’re involved in bioinformatics or molecular biology and want to start incorporating machine learning into your research—this article is for you.

Let’s jump into it!

Background #

In this section, I’ll cover:

- why protein-protein interactions matter, and
- why missense variants are the right mutation type to study.

Not so hot take: protein-protein interactions are important.

I might be biased because I studied them for eight months at The Center for Applied Genomics at SickKids—but I’ve got good reason to think so.

So, if we can understand how and why a given mutation impacts PPIs, there’s a lot we can do.

With DeepMind’s release of AlphaFold-Multimer (AF-M), a version of AlphaFold trained with protein complexes in mind, came an opportunity and a question:

Can AF-M capture the effect of missense variants on PPIs?

You might be wondering: why missense variants specifically, as opposed to other kinds of mutations?

Well…

  1. AF-M can’t predict macro-level structural changes from point mutations.
  2. However, missense variants can produce measurable in silico changes at the interface of a PPI (observable through a program like ChimeraX).

It’s a research question that lets us take advantage of current technology.

Now—how do we go about answering this question? I’ll answer that in the next section.

Methodology & Approach #

In this section, I’ll cover:

- what 3D structures uniquely offer, and
- the wildtype-vs-missense comparison at the core of the approach.

3D structures — like the kind AF-M generates — have observable properties you can’t get anywhere else.

Most of these are structural features, like interface area, shape complementarity, and interface solvation energy.

This is exciting data we can only observe from PDB structures. Given the power of a 3D structure, we arrived at the following approach:

  1. Take a wildtype protein complex (just a regular old homo- or hetero-dimer).
  2. Now, take the same complex, except this time with a missense variant in one of the proteins. Effectively, a ‘missense complex.’

If we can detect the difference between the structural features of these complexes and relate it to the specific missense variant, we can find a relationship:

Impact on PPI ~ (structural features).

In other words, impact on PPI as a function of structural features.

Illustration of wildtype and missense complex comparison

But there’s another opportunity here: what if I enriched that structural data with variant annotations, too? Examples include pathogenicity predictions and population allele frequencies.

We’d have many more features to parse a relationship from, developing a more comprehensive function:

Impact on PPI ~ f(x₁, x₂, …, xₙ)

During my data collection process, I was able to add 40+ non-structural features on top of the structural ones. This made for excellent training data for an XGBoost classifier model that could predict the impact on PPI of a given variant as one of four classes:

- Class 0: decrease in interaction strength
- Class 1: disruption of interaction
- Class 2: increase in interaction strength
- Class 3: no effect on interaction

You’re probably wondering—how did I get all this data matching missense variants to PPI effects? And why did I choose XGBoost as my classifier model?

I’ll explain in the following sections.

Data Acquisition & Processing #

In this section, I’ll cover:

- the IntAct Mutations database, and
- how I turned variants into structures and features.

IntAct Mutations Database #

I gathered my training data from the IntAct Mutations database (available under license CC0). It’s a massive annotated record with thousands of missense variants involved in binary protein interactions. Every variant is annotated with its effect on PPI, encoded under the ‘Feature type’ column: increase, decrease, no effect, etc.:

Data encoding in IntAct Mutations database

However, you’ll notice this dataset does NOT include structural feature information or other variant annotations. That’s where I (by which I mean AF-M) came in.

The database contains ~12,000 eligible data points. To save compute and ensure balanced data classes, I randomly subsampled variants until I had ~1,000 for each class. Curious about how random subsampling works?

Random subsampling is when I randomly pull data points from a given class until I have enough (roughly like the sketch below). This is one approach to ensuring your classes are weighted equally in your training data set so your classifier doesn’t bias towards any single one. Of course, you could also simply assign different weights to each class instead, but random subsampling means I have fewer protein complexes to fold—saving on compute.
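Here’s roughly what that subsampling looks like in pandas. This is a sketch: the file name is illustrative, though ‘Feature type’ is the IntAct column mentioned above.

```python
import pandas as pd

# Load the IntAct-derived variant table (file name is illustrative).
df = pd.read_csv("intact_mutations.tsv", sep="\t")

# Cap each class at 1,000 examples (or the size of the smallest class),
# so the classifier sees balanced classes.
n_per_class = min(1000, df["Feature type"].value_counts().min())
balanced = df.groupby("Feature type").sample(n=n_per_class, random_state=42)

print(balanced["Feature type"].value_counts())  # equal counts per class
```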

Overview of data aggregation pipeline

I then wrote a script that generated FASTA files for the wildtype and missense complexes before feeding them into AF-M to produce PDB structures. Due to time constraints, the final dataset had about 250 structures per class.

Creating a wildtype and missense complex

Wondering how I went from UniProt IDs to PDB structures? A relatively simple Bash script pulled UniProt IDs from the IntAct Mutations file and downloaded the appropriate FASTAs via cURL. We create a wildtype FASTA and then ‘mutate’ a copy to make a missense version. These are the inputs for AF-M.
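Here’s a minimal Python sketch of that idea (the original was a Bash script with cURL; the UniProt accession and variant below are placeholders, and the endpoint is UniProt’s public FASTA API):

```python
import re
import urllib.request

def fetch_fasta(uniprot_id: str) -> str:
    """Download the canonical FASTA for a UniProt accession."""
    url = f"https://rest.uniprot.org/uniprotkb/{uniprot_id}.fasta"
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode()

def apply_missense(fasta: str, variant: str) -> str:
    """Apply a variant like 'Q234R': ref residue, 1-based position, alt residue."""
    ref, pos, alt = re.fullmatch(r"([A-Z])(\d+)([A-Z])", variant).groups()
    header, body = fasta.split("\n", 1)
    seq = "".join(body.split())  # flatten the wrapped sequence lines
    i = int(pos) - 1
    # Sanity check: the reference residue must match the wildtype sequence.
    assert seq[i] == ref, f"expected {ref} at position {pos}, found {seq[i]}"
    return f"{header} {variant}\n{seq[:i]}{alt}{seq[i + 1:]}\n"

# Wildtype FASTA, then a 'mutated' copy -- both become AF-M inputs.
wt = fetch_fasta("P12345")   # placeholder accession, not one from the project
mut = apply_missense(wt, "Q234R")  # placeholder variant
```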

This leaves us with two PDB versions of each protein complex: a wildtype and a missense variant (see image above). Multiply that across roughly a thousand binary interactions and 47,000 SLURM jobs, and we have ~20 TB of data to work with.

Feature Extraction & Engineering #

All I had left to do was extract structural and variant feature information:

Overview of data aggregation pipeline, final steps

The AF-M pipeline I used, AlphaPulldown, simulates a pulldown assay by parallelizing the ‘folding’ of several protein complexes. It also includes a feature extraction pipeline that generates several key structural metrics, like shape complementarity and interface solvation energy.

I also added a few features of my own, with annotations from Ensembl’s Variant Effect Predictor (available under the Apache 2.0 license):

Pathogenicity predictions #

I used two pathogenicity prediction scores from VEP’s annotations.

Why I included them: a variant predicted to be pathogenic is more likely to perturb its protein’s function, and by extension its interactions.

gnomAD frequencies #

gnomAD (the Genome Aggregation Database from the Broad Institute) contains population allele frequencies for variants across many ancestry groups.

Why I included frequencies: variants that are common in the population tend not to be strongly deleterious, so allele frequency serves as a useful prior on a variant’s functional impact.

Relative ratios #

I engineered simple ratio features, like the interface area divided by the total surface area, the difference in a given feature from its wildtype value, and so on.

Why I included relative ratios: raw structural values scale with the size of the complex, so normalized ratios and deltas from the wildtype are more comparable across complexes.
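A minimal pandas sketch of this feature engineering, with illustrative column names (the real tables come from the extraction pipeline above):

```python
import pandas as pd

# Per-complex feature tables; file and column names here are illustrative.
wt = pd.read_csv("wildtype_features.csv", index_col="complex_id")
mut = pd.read_csv("missense_features.csv", index_col="complex_id")

features = pd.DataFrame(index=mut.index)
# Ratio: what fraction of the complex's surface is buried in the interface?
features["interface_frac"] = mut["interface_area"] / mut["total_surface_area"]

# Deltas from wildtype: the *change* a variant causes is often more
# informative than the raw value itself.
for col in ["interface_area", "interface_solvation_energy", "sc"]:
    features[f"delta_{col}"] = mut[col] - wt[col]
```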

Free energy #

Using EvoEF2 (available under the MIT license), I obtained thermodynamic data on protein complexes, comparing the ΔG (the difference in Gibbs free energy between the folded and unfolded protein states) of the wildtype and mutant variants; the difference between those two values is the ΔΔG.

Why free energy: a large ΔΔG suggests the mutation destabilizes the folded complex, which plausibly weakens or disrupts the interaction.

By the end of it all, I had a tabular dataset that looked a bit like this:

Screenshots by Author

One thing about bioinformatics tools, however—they don’t always work. Metrics for a given variant will often be empty or missing. For example, not every variant in IntAct has an associated gnomAD frequency.

Wondering why missing values make ML challenging? It’s intuitive that missing values are problematic in any machine learning task: given an incomplete picture of the dataset, it can be hard to trust the output of your model. Luckily, there are methods for dealing with our trust issues.

Missing values are pretty annoying whenever you do machine learning, so I had to pick a model with the right chops. Luckily, XGBoost had what it took—let me explain how.

The Machine Learning Model #

In this section, I’ll break down:

- how XGBoost works (and why it suits this task), and
- how I tuned it with k-fold cross-validation.

XGBoost #

XGBoost is a gradient-boosted decision tree model.

Decision trees are graphs that split data at threshold-based nodes. Here’s a simple model of deciding whether or not to play outside today:

Decision tree diagram

Chaining several of these trees together in sequence allows each one to correct the errors of the ones before it: every new tree is fit to the residual error (the gradient of the loss) of the ensemble so far. This is what makes it ‘gradient-boosted.’
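Here’s a toy sketch of that chaining idea for squared loss, where the ‘gradient’ is simply the residual. XGBoost adds regularization and second-order information, but the principle is the same:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy regression data: learn a noisy sine wave.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.1, 200)

# Each shallow tree is fit to the residuals (the negative gradient of the
# squared loss) of the ensemble built so far.
learning_rate, trees = 0.3, []
pred = np.zeros_like(y)
for _ in range(50):
    residuals = y - pred                     # what the ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    trees.append(tree)
    pred += learning_rate * tree.predict(X)  # nudge predictions toward the target

# Final prediction for new inputs: the sum of all the trees' scaled outputs.
predict = lambda X_new: sum(learning_rate * t.predict(X_new) for t in trees)
```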

XGBoost is a lightweight, fast-training model with several advantages for this particular task: it excels on tabular data, it trains quickly enough to make hyperparameter tuning affordable, and it handles missing values natively.

How does XGBoost handle missing data? The model treats missing values as essentially their own kind of value. At a splitting node, the algorithm determines which direction (left/right) leads to the highest gain (best reduction of the loss function) and encodes that decision into its training. When it encounters a missing value at that node in the test set, it’ll make the same decision.
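A tiny demonstration on toy data (note that no imputation step is needed):

```python
import numpy as np
from xgboost import XGBClassifier

# NaNs go straight into training -- no imputation required.
X = np.array([[0.8, 1.2], [0.5, np.nan], [np.nan, 0.3], [0.9, 0.7]])
y = np.array([0, 1, 1, 0])

model = XGBClassifier(n_estimators=10, max_depth=2)
model.fit(X, y)

# At inference, missing values follow each split's learned default branch.
print(model.predict(np.array([[np.nan, 0.4]])))
```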

Since the dataset had reasonably sized dimensions and XGBoost trains so quickly, I decided to include a hyperparameter tuning step using k-fold cross-validation:

Hyperparameter tuning & k-fold cross-validation #

Any machine learning model will have internal parameters—the weights and biases it ascribes to its internal classifications.

However, ML models also have hyperparameters, which are higher-level structural settings for the overall model. These parameters are set before training and can significantly impact accuracy outcomes.

I focused on a handful of structural hyperparameters; the tuning sketch below shows a typical grid.

We determine the optimal hyperparameters by testing several combinations and selecting the one with the best results—in other words, by tuning.

To add even more robustness to the model training, I implemented a k-fold cross-validation step. What is that exactly?

Let’s break it down:

  1. Split the training data into k equal folds.
  2. Train on k − 1 folds and validate on the held-out fold.
  3. Rotate which fold is held out and repeat, so every fold serves as the validation set exactly once.
  4. Average the k validation scores for a more robust estimate of performance.

By MBanuelos22 from Wikimedia Commons, CC BY-SA 4.0

We do this for every hyperparameter combination and keep the one that scores best.
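Here’s a sketch of that tuning loop using scikit-learn’s GridSearchCV. The data and hyperparameter grid below are stand-ins, not the project’s exact values:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from xgboost import XGBClassifier

# Stand-in data with four classes; in the project, X/y are the tabular
# structural + variant features and the IntAct-derived labels.
X, y = make_classification(n_samples=400, n_features=20, n_informative=8,
                           n_classes=4, random_state=0)

# Illustrative grid -- the exact hyperparameters and values searched
# in the project aren't reproduced here.
param_grid = {
    "max_depth": [3, 5, 7],
    "learning_rate": [0.05, 0.1, 0.3],
    "n_estimators": [100, 300],
}

search = GridSearchCV(
    XGBClassifier(objective="multi:softprob"),
    param_grid,
    scoring="roc_auc_ovr",  # one-vs-rest AUC across the four classes
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```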

Now that we’ve got a model trained and tested, how good is it? What can it teach us about biology? We’ll cover that next:

Results, Model Accuracy & Feature Importances #

In this section, I’ll cover:

Confusion matrix & ROC curve #

To assess the quality of a multi-classifier, you’ll typically use a confusion matrix:

Figure 1. Confusion matrix

The confusion matrix visualizes the test set. Along the x-axis, we have what the model predicted for each data point (a protein complex), and along the y-axis, we have each data point’s actual class. The lighter the colour along the diagonal, the more accurate the model.
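To make the layout concrete, here’s a toy example with scikit-learn (the numbers are illustrative, not the project’s):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels: rows of the result are actual classes, columns are predictions.
y_true = np.array([0, 0, 1, 1, 2, 2, 3, 3])
y_pred = np.array([0, 1, 1, 1, 2, 0, 3, 3])

cm = confusion_matrix(y_true, y_pred)
print(cm)  # correctly classified counts sit on the diagonal
```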

The confusion matrix is just one way of assessing the model’s accuracy. We can also use an ROC curve:

Figure 2. Multi-class ROC curve and AUC

A receiver operating characteristic (ROC) curve plots the true positive rate (TPR) against the false positive rate (FPR):

- TPR = TP / (TP + FN), the fraction of actual positives the model correctly identifies.
- FPR = FP / (FP + TN), the fraction of actual negatives the model incorrectly flags.

The points along the curve indicate different threshold settings. The threshold refers to the cutoff point where we distinguish between a positive and negative case.

In a multi-class ROC curve, the positive case is a given class, and the negative case is all the others (one-vs-all). The diagonal (dashed line) refers to random chance, where TPR = FPR at all thresholds. One way to think of accuracy is that the further we are from this line, the better the model.

The ideal classifier has a bowed-out shape, where we have a high TPR even at low FPRs. We see this for most of our curves.

Microaveraging the curves involves summing each class’s TPs, FPs, FNs, and TNs (seen above).

This gives us an overall picture of our model’s performance. To standardize our comparison, we calculate the area under the curve. Closer to 1 = more accurate. As seen in Figure 2, we achieve a micro average AUC of 0.91.
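For the curious, here’s how the micro-average and per-class AUCs can be computed with scikit-learn; the scores below are random stand-ins for the model’s outputs:

```python
import numpy as np
from sklearn.metrics import auc, roc_auc_score, roc_curve
from sklearn.preprocessing import label_binarize

# Toy stand-ins: y_test holds true classes, y_score holds per-class
# predicted probabilities (shape: n_samples x 4).
rng = np.random.default_rng(0)
y_test = rng.integers(0, 4, 100)
y_score = rng.dirichlet(np.ones(4), size=100)

# Micro-average: pool every (class, sample) binary decision, then compute
# a single ROC curve over the pooled counts.
y_bin = label_binarize(y_test, classes=[0, 1, 2, 3])
fpr, tpr, _ = roc_curve(y_bin.ravel(), y_score.ravel())
print("micro-average AUC:", auc(fpr, tpr))

# One-vs-rest AUC per class, matching the per-class curves in Figure 2.
for c in range(4):
    print(f"class {c} AUC:", roc_auc_score(y_bin[:, c], y_score[:, c]))
```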

You can also see that the curves for classes 0 and 1 reflect what we saw in the confusion matrix. Overall, we have a reasonably accurate model. But now comes the fun part—what can it tell us about biology?

SHAP values #

Using SHAP values can help us determine how much the features in a model contributed to each prediction.

What are SHAP values? SHapley Additive exPlanations measure the marginal contribution of each feature to a class prediction. In short, they measure feature importance. SHAP values are useful because they’re model-agnostic: they work with any supervised machine learning model.

They’re determined by:

  1. Taking the average prediction for a given class.
  2. Iteratively adding each feature in (in all possible orders) and evaluating the output.
  3. Calculating the marginal effect of a feature on the model output’s distance from average.

SHAP values are biologically useful because they can tell us which structural features require more study, or which pathogenicity predictions are most accurate.
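In practice, SHAP for a tree model looks something like this. The data here is a stand-in; in the project, the input is the structural + variant feature table:

```python
import shap
from sklearn.datasets import make_classification
from xgboost import XGBClassifier

# Stand-in data; four classes as in the project.
X, y = make_classification(n_samples=300, n_features=10, n_informative=6,
                           n_classes=4, random_state=0)
model = XGBClassifier(n_estimators=100, max_depth=4).fit(X, y)

# TreeExplainer computes SHAP values efficiently for tree ensembles. For a
# multi-class model it returns one set of values per class (a list or a
# 3-D array, depending on your SHAP version).
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)
class0 = shap_values[0] if isinstance(shap_values, list) else shap_values[..., 0]

# Mean |SHAP| bar chart (global importance) and a beeswarm, both for
# class 0 (decrease in interaction strength).
shap.summary_plot(class0, X, plot_type="bar")
shap.summary_plot(class0, X)
```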

Here are the top ten feature importances for the entire model:

Figure 3. Feature importance

From my analysis, I saw some interesting results.

Realistically, finding one or two features globally predictive of PPI effects is a fruitless exercise. We’re better off looking at feature importance class by class using SHAP values:

Class 0: Decrease in interaction strength #

Figure 4 (left), feature importances plot. Figure 5 (right), beeswarm plot.

On the left, I have a feature importance plot showing the average impact of a feature on the model’s output. This graph is relatively simple to interpret; the bigger the bar, the greater the magnitude of impact.

On the right, we have a beeswarm plot. These kinds of plots map two factors: each point’s feature value (the colour) and that feature’s impact on the model’s output (the position along the x-axis).

We can combine these two factors to further our understanding of the feature effects. I’ll use the Resulting_sequence_A feature as an example. (Background: this is a dummy-encoded variable, so its values are either 0 or 1.)

In plain language: having alanine as the resulting missense residue strongly pushes the model toward predicting a decrease in interaction strength. This tracks intuitively—alanine is a small amino acid with only a methyl group for a side chain.

Alanine chemical structure

This limits the kinds of interactions it can participate in. So, if it replaces a cysteine in a crucial disulfide bridge or eliminates a charged interaction, it makes sense that it would increase the likelihood of decreased interaction strength.

Combining wet and computational biology for a better model. While there is a biological case to be made that this finding reflects reality, it’s also true that scientists often conduct alanine-scanning studies, in which residues in a protein sequence are sequentially replaced with alanine to determine which ones are important for folding, interaction, etc.

It’s possible the results from these studies are overrepresented in the IntAct database, biasing the model into thinking this feature is more important than it is. Interestingly, it was crucial that I had a background in ‘wet’ biology to catch issues like this. Otherwise, I may not have realized this was a potential problem.

We see some unexpected results with the shape complementarity (sc) feature. sc measures how well the two proteins’ surfaces fit together; we’d expect low sc to increase the likelihood of decreased interaction strength between the two proteins. However, the model learned the opposite: higher sc pushes predictions toward decreased interaction strength, and vice versa.

In the following sections, I’ll only call out a few features I found interesting:

Class 1: Disruption of interaction #

Fig. 6 Fig. 7

Class 2: Increase in interaction strength #

Fig. 8 Fig. 9

Class 3: No effect on interaction #

Fig. 10 Fig. 11

All in all, some interesting results. Now, let’s apply the model to some novel protein complexes to see if we can learn anything new:

Applying the Model to Novel Variants #

In this section, I’ll apply the model to coding missense variants from the MSSNG database and interpret its predictions.

This is all fun and interesting, but what’s the point of machine learning in molecular biology if we haven’t learned anything new?

So, for the application part of this project, I used the model to analyze a few coding missense variants from the MSSNG database, a repository of whole-genome sequencing and annotation data designed to study autism spectrum disorder (ASD). (The Scherer Lab specializes in studying the genomics of ASD.)

MSSNG is a vast body of data, so I refined my analysis to only a subset that met the following criteria:

Here are two of the standout variants from this subset:

The model predicts Class 1 (a disrupted interaction) for both variants. Let’s take a closer look:

Androgen receptor & tyrosine protein kinase (AR:YES1.p.Q234R) #

By Author

Here are the facts:

So here’s one potential story:

Unfortunately, there are a few issues here:

The bottom line is that there are no conclusive results from the model. For your interest, here are the SHAP values that led to this prediction. Note the comparatively high difference in interface solvation energy from the wildtype.

By Author

Transcriptional coactivator & zinc finger protein (YAP1:GLIS3.p.G448D) #

By Author

Once again, here are the facts:

One possible story here: Disrupted interaction between these two proteins may expedite the progression of neuropathologies that result in autism.

However, gleaning a more specific mechanism from this research is a challenge.

Once again, here are its SHAP values:

By Author

So far we’ve trained a reasonably strong model, interpreted its results, and used it to learn more about biology. What are the takeaways? And what’s next?

Final Thoughts and Lessons #

Future Directions #

I spent about four months at TCAG working on this project, much of which was spent collecting training data. (Not to mention the four months I spent there previously getting accustomed to AF-M).

Given more time, here are steps I’d take to expand the project:

However, I’m quite proud of what I achieved and learned in this timeframe.

Conclusions & Key Takeaways #

I can’t imagine many students are given free rein over such an exciting research question, access to world-class mentors, and one of Canada’s largest compute clusters.

But I did.

So, I’m incredibly thankful to everyone involved (Dr. Brett Trost, Dr. Richard Wintle, and Dr. Steve Scherer, to name a few).

Working in the Scherer Lab showed me research doesn’t have to be stale, slow, and tedious. With the right team and vision, it can be fast, dynamic, and cutting-edge.

On a macro-scale, I’ve learned how crucial AI/ML will be to the future of computational biology.

What are your thoughts? #

I’m happy to get your feedback, questions, compliments, and curses. Thanks for reading!
