DAMIA: Disease-Associated Missense Interactome Analysis
Overview #
As part of my work at SickKids, I needed to compare wildtype protein complexes predicted using AlphaFold-Multimer (AFM) to ones with missense mutations introduced.
In conducting this analysis, I created several tools to obtain very specific kinds of information:
- Tabular representations of interacting residues and chains
- Average distances between amino acid carbons (alpha & beta)
- Jaccard similarity between the interacting residues in wildtype and missense complexes
Many of these scripts came together to form a pipeline I called DAMIA (Disease-Associated Missense Interactome Analysis) that helps analyze the similarities and differences between wildtype and missense complexes as predicted by AFM.
Here’s an overview of each tool, drawing heavily on each one’s documentation.
- `distance`: Distance calculation between alpha and beta carbons
- `interface`: Interface determination for chain-residue pairs
- `similarity`: Jaccard index similarity calculation
- `avgfeatures`: Averages of PI-score output
- `interactions`: Interaction search and MFA creation
- `whitney`: U-test between WT and missense interfaces
Distance #
This script calculates the distances between the alpha- and beta-carbons of interface residue pairs in a PDB file, using PyMOL.
The rationale for this tool is that if AFM truly captures the effects of a mutation, we might expect a significant difference in the average carbon distances between interacting residues in missense complexes.
The input is an `.ic` file containing the list of residue pairs, and the output is a TSV file containing the calculated distances. This provides some information about interface quality and proximity.
To run the script from the command line, use the following format:
python distance.py <ic_file> <pdb_file> [-o <output_tsv>]
Where:
- `<ic_file>`: Path to the `.ic` file containing the residue pairs.
- `<pdb_file>`: Path to the PDB file containing the protein structure.
- `<output_tsv>`: (Optional) Path to the output TSV file where the distances will be saved. Defaults to `output.tsv` in the current directory.
Example:
python distance.py input.ic protein.pdb -o distances.tsv
Input
The input `.ic` file should contain a list of residue pairs, one pair per line. Each line should contain the following columns, separated by whitespace:
- Residue type of the first residue (e.g., “GLY”).
- Residue number of the first residue (e.g., “42”).
- Chain identifier of the first residue (e.g., “A”).
- Residue type of the second residue (e.g., “LEU”).
- Residue number of the second residue (e.g., “87”).
- Chain identifier of the second residue (e.g., “B”).
For example:
GLY 42 A LEU 87 B
Output
The output TSV file will contain the following columns:
- Index: The index of the residue pair from the `.ic` file.
- Chain1: Chain identifier of the first residue.
- Res1: Residue type of the first residue.
- Res1_num: Residue number of the first residue.
- Chain2: Chain identifier of the second residue.
- Res2: Residue type of the second residue.
- Res2_num: Residue number of the second residue.
- Alpha_distance: Distance between the alpha-carbons of the two residues.
- Beta_distance: Distance between the beta-carbons of the two residues, or “N/A” if one of the residues is glycine.
Example output:
index chain1 res1 res1_num chain2 res2 res2_num alpha_distance beta_distance
1 A GLY 42 B LEU 87 12.34 N/A
Code Overview
The script is divided into the following functions:
- `calculate_distance`: Calculates the distance between the alpha- and beta-carbons of two residues in a PDB file.
- `main`: Reads the input `.ic` file, calculates distances for all residue pairs, and writes the results to the output TSV file.
The script uses the `argparse` library to parse command-line arguments and the `pymol` library to load PDB files and calculate distances.
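To give a sense of how the measurement works, here is a minimal sketch of the per-pair calculation using PyMOL’s Python API. The function name matches the one above, but the body is my illustrative reconstruction, not the exact DAMIA code:

```python
# Illustrative sketch, not the exact DAMIA implementation.
from pymol import cmd

def calculate_distance(chain1, resi1, chain2, resi2, carbon="CA"):
    """Distance (in angstroms) between one atom of each residue.

    carbon is "CA" for alpha-carbons or "CB" for beta-carbons;
    glycine has no beta-carbon, so callers should emit "N/A" for it.
    """
    sel1 = f"chain {chain1} and resi {resi1} and name {carbon}"
    sel2 = f"chain {chain2} and resi {resi2} and name {carbon}"
    return cmd.get_distance(sel1, sel2)

cmd.load("protein.pdb")  # load the structure once, then measure each pair
alpha = calculate_distance("A", 42, "B", 87, carbon="CA")
```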
Interface
#
This Python script analyzes occurrences of a chain-residue pair across multiple TSV files and calculates a consensus score based on their presence. The script also calculates the average distance (either alpha or beta) between the chain-residue pairs in the provided TSV files. `interface.py` aims to show how often a given residue (at a specified chain) appears in interfaces and, when it does, how far it is, on average, from the other residues it interacts with.
The rationale for this tool’s existence is to observe interface residues as they appear across AFM predictions (the algorithm can produce up to 5 per run, per sequence).
Usage
python interface.py <file_paths> <chain> <residue> [-d <distance_type>]
Arguments
- `file_paths`: Paths to the TSV files (space-separated).
- `chain`: Chain (a letter A-Z).
- `residue`: Residue (an integer).
- `-d`, `--distance_type`: (Optional) Distance type, either “alpha” or “beta”. Default is “beta”.
Example
Suppose we have two TSV files, `file1.tsv` and `file2.tsv`. We want to analyze the occurrences of the chain-residue pair A-42 with alpha distance. The command would look like:
python interface.py file1.tsv file2.tsv A 42 -d alpha
Output
The script prints the consensus score and total occurrences of the chain-residue pair to the console. Additionally, it generates two output files:
- `occurrences.tsv`: A TSV file containing the occurrences and average distances for each input TSV file.
- `avg_distance.txt`: A plain text file containing the overall average distance.
Functions
read_tsv(file_path: str) -> pd.DataFrame
Reads a TSV file and returns a pandas DataFrame.
find_chain_residue(df: pd.DataFrame, chain: str, residue: int, distance_type: str) -> (bool, int, float)
Finds the occurrences of a chain-residue pair in the given DataFrame and calculates the average distance based on the specified distance type. Returns a tuple containing a boolean indicating presence, the number of occurrences, and the average distance.
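As a rough illustration, the search could look like the following sketch, assuming the column layout shown in the example below; this is a reconstruction, not the exact implementation:

```python
# Illustrative sketch assuming the TSV columns shown in the example below.
import pandas as pd

def find_chain_residue(df, chain, residue, distance_type="beta"):
    col = f"{distance_type}_distance"
    # The target residue may appear on either side of a pair.
    mask = ((df["chain1"] == chain) & (df["res1_num"] == residue)) | \
           ((df["chain2"] == chain) & (df["res2_num"] == residue))
    hits = df[mask]
    distances = pd.to_numeric(hits[col], errors="coerce").dropna()
    average = distances.mean() if not distances.empty else float("nan")
    return (not hits.empty, len(hits), average)
```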
Example Input TSV File
The input TSV files should have the following format (this is the enriched output of `distance.py`, built from PRODIGY’s `.ic` files):
index chain1 res1 res1_num chain2 res2 res2_num alpha_distance beta_distance
1 A ALA 15 B GLY 32 12.34 10.56
2 A VAL 42 B MET 29 7.92 6.34
3 B ALA 15 A GLY 32 11.87 9.42
Example Output TSV File
The output TSV file `occurrences.tsv` will have the following format:
TSV Occurrences Average Distance
file1.tsv 2 6.45
file2.tsv 1 7.92
Example Output TXT File
The output plain text file `avg_distance.txt` will have the following content:
Average Distance: 6.94
Similarity #
This script calculates the Jaccard Index similarity between multiple TSV files, each containing interaction data from a complex. Users can provide options to keep only the residue numbers, remove duplicate residues, or both. The script saves the pairwise similarity scores in a TSV file and writes a summary of the results to a text file.
`similarity.py` tells us how similar the interfaces of different complexes are to each other.
Command-Line Arguments
- `file_paths` (str): Paths to the TSV files to be compared.
- `--residues_only` (bool, optional): If provided, keep only the residue number columns in the input DataFrames.
- `--remove_duplicate_residues` (bool, optional): If provided, remove duplicate residues from the input DataFrames.
Example Usage
Here is an example of how to run the script from the command line:
python similarity.py file1.tsv file2.tsv file3.tsv --residues_only --remove_duplicate_residues
In this example, the script will:
- Read the contents of `file1.tsv`, `file2.tsv`, and `file3.tsv`.
- Preprocess each DataFrame by keeping only the residue number columns (`res1_num` and `res2_num`) and removing duplicate residues if the `--residues_only` and `--remove_duplicate_residues` flags are provided.
- Calculate the pairwise Jaccard Index similarity between each pair of tables.
- Save the pairwise similarity scores in a TSV file named `pairwise_similarity.tsv`.
- Calculate the average, median, minimum, and maximum similarity scores and save the summary in a text file named `summary.txt`.
The output files `pairwise_similarity.tsv` and `summary.txt` will be generated in the same directory as the script.
Functions
`read_tsv(file_path: str) -> pd.DataFrame`: Reads a TSV file and returns a pandas DataFrame.
- `file_path` (str): The file path of the TSV file.

`preprocess_table(df: pd.DataFrame, residues_only: bool, remove_duplicate_residues: bool) -> pd.DataFrame`: Preprocesses a DataFrame by keeping only the residue number columns or removing unnecessary columns and duplicate rows.
- `df` (pd.DataFrame): The input DataFrame.
- `residues_only` (bool): If True, keep only the residue number columns.
- `remove_duplicate_residues` (bool): If True, remove duplicate residues.

`jaccard_index(df1: pd.DataFrame, df2: pd.DataFrame) -> float`: Calculates the Jaccard Index similarity between two DataFrames.
- `df1` (pd.DataFrame): The first DataFrame.
- `df2` (pd.DataFrame): The second DataFrame.

`multi_table_jaccard(file_paths: List[str], residues_only: bool, remove_duplicate_residues: bool) -> Tuple[pd.DataFrame, pd.Series]`: Calculates pairwise Jaccard Index similarity between multiple TSV files and returns a DataFrame with the results and a summary Series.
- `file_paths` (List[str]): A list of file paths to the TSV files.
- `residues_only` (bool): If True, keep only the residue number columns in the input DataFrames.
- `remove_duplicate_residues` (bool): If True, remove duplicate residues from the input DataFrames.
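For intuition, here is a minimal sketch of the Jaccard calculation, assuming rows are compared as whole tuples after preprocessing (an illustration, not the exact code):

```python
# Illustrative sketch: rows are treated as set elements after preprocessing.
import pandas as pd

def jaccard_index(df1: pd.DataFrame, df2: pd.DataFrame) -> float:
    set1 = set(map(tuple, df1.itertuples(index=False)))
    set2 = set(map(tuple, df2.itertuples(index=False)))
    union = set1 | set2
    # |intersection| / |union|; an empty union means nothing to compare.
    return len(set1 & set2) / len(union) if union else 0.0
```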
Output Files
- `pairwise_similarity.tsv`: A TSV file containing the pairwise similarity scores.
- `summary.txt`: A text file containing the average, median, minimum, and maximum similarity scores.
Example Input:
Let’s consider three example TSV files, `file1.tsv`, `file2.tsv`, and `file3.tsv`.

`file1.tsv`:
index chain1 res1 res1_num chain2 res2 res2_num alpha_distance beta_distance
1 A ALA 15 B GLY 32 12.34 10.56
2 A VAL 42 B MET 29 7.92 6.34
3 B ALA 15 A GLY 32 11.87 9.42
`file2.tsv`:
index chain1 res1 res1_num chain2 res2 res2_num alpha_distance beta_distance
1 A ALA 15 B GLY 32 13.45 11.32
2 B VAL 42 A MET 29 8.01 6.47
4 A ARG 8 B LEU 21 10.29 8.61
`file3.tsv`:
index chain1 res1 res1_num chain2 res2 res2_num alpha_distance beta_distance
1 A ALA 15 B GLY 32 12.78 10.89
2 A VAL 42 B MET 29 7.95 6.38
3 B ALA 15 A GLY 32 11.89 9.44
5 A ARG 8 B LEU 21 10.31 8.63
Command:
python similarity.py file1.tsv file2.tsv file3.tsv --residues_only --remove_duplicate_residues
Example Output:
`pairwise_similarity.tsv`:
Pairwise Matchup Similarity Score
Table 1 - Table 2 0.6
Table 1 - Table 3 0.75
Table 2 - Table 3 0.6
`summary.txt`:
Average Score: 0.65
Median Score: 0.6
Minimum Score: 0.6
Maximum Score: 0.75
In this example, the pairwise similarity scores between the tables are 0.6 and 0.75, indicating that the residue pairs in the tables overlap substantially but are not identical after preprocessing.
AvgFeatures #
This Python script reads a CSV file containing PI-Score features and calculates the averages, standard deviations, and standard errors for each numeric feature. The results are saved in a TSV file.
Usage
python avgfeatures.py <csv_path> [-o output_file]
Arguments
- `csv_path`: Path to the input CSV file.
- `output_file`: (Optional) Output file name. Default is “averages.tsv”.
Example
Suppose we have a CSV file `input.csv`. We want to calculate the averages, standard deviations, and standard errors for the numeric features and save the results in a file called “results.tsv”. The command would look like:
python avgfeatures.py input.csv -o results.tsv
Output
The script generates an output TSV file containing the averages, standard deviations, and standard errors for each numeric feature in the input CSV file.
Functions
process_csv_file(filename: str, output_filename: str)
Reads a CSV file, calculates the averages, standard deviations, and standard errors for each numeric feature, and saves the results in a TSV file.
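A minimal sketch of that logic with pandas might look like this. Note that I use a population standard deviation (`ddof=0`) so a single-row input yields zeros, as in the example below; the real script’s choice may differ:

```python
# Illustrative sketch; ddof=0 matches the zeros in the single-row example below.
import pandas as pd

def process_csv_file(filename: str, output_filename: str = "averages.tsv"):
    df = pd.read_csv(filename)
    numeric = df.select_dtypes(include="number")
    summary = pd.DataFrame({
        "Feature": numeric.columns,
        "Average": numeric.mean().values,
        "Std. Deviation": numeric.std(ddof=0).values,
        "Std. Error": numeric.sem(ddof=0).values,
    })
    summary.to_csv(output_filename, sep="\t", index=False)
```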
Example Input CSV File
The input CSV files should have the following format:
pdb,interface,Num_intf_residues,Polar,Hydrophobhic,Charged,conserved_interface, contact_pairs, sc, hb, sb, int_solv_en, int_area, pvalue
ranked_0,B_A,54,0.278,0.259,0.37,NA,55,0.629,21,6,-8.41,1770.08,0.63
Example Output TSV File
The output TSV file will have the following format:
Feature Average Std. Deviation Std. Error
Num_intf_residues 54.000000 0.000000 0.000000
Polar 0.278000 0.000000 0.000000
Hydrophobhic 0.259000 0.000000 0.000000
Charged 0.370000 0.000000 0.000000
Note that the example output file contains only a single row of data, as there is only one input row in the example input file. In practice, the input CSV file will have multiple rows, and the output file will show the calculated values based on all input rows.
Interactions #
This script fetches protein interaction data from MINT or BioGRID and generates multi-FASTA files for each pair of interaction partners. Its purpose is to create MFA inputs for AFM complex predictions.
Usage
python interactions.py <uniprot_ids> --source <mint|biogrid> [--output_folder <output_folder>] [--access_key <access_key>]
Arguments
- `uniprot_ids`: List of UniProt IDs to fetch data for.
- `--source`: Choose the data source (mint or biogrid). Only MINT is currently working.
- `--output_folder` (optional): Output folder for interaction data (default: interactions).
- `--access_key` (optional): Access key for the BioGRID API (required if source is biogrid).
Output
- The script creates a folder with interaction data files for each UniProt ID in the specified `output_folder` (default: interactions).
- A JSON file, `interaction_partners.json`, is created in the `output_folder`, containing a dictionary with UniProt IDs as keys and lists of interaction partner UniProt IDs as values.
- A folder named `complexes` is created with multi-FASTA files for each pair of interaction partners. The file names are in the format `key:value[0].fa`, `key:value[1].fa`, etc.
Limitations
- The `.fa` files in the `complexes` folder are automatically generated and do not account for other proteins that may be required to properly model a complex. They also assume all complexes are heterodimeric, which may not be the case.
- Some of the complexes may have exact experimental structures in the PDB, which would be preferred for most analyses.
Functions
get_mint_data(output_folder, uniprot_ids)
Fetches interaction data from MINT and saves it as TSV files in the specified output folder.
extract_interaction_partners_mint(uniprot_ids, output_folder)
Extracts interaction partner UniProt IDs from MINT data and returns a dictionary with UniProt IDs as keys and a list of interaction partner UniProt IDs as values.
get_biogrid_data(output_folder, uniprot_ids, access_key)
Fetches interaction data from BioGRID and saves it as TSV files in the specified output folder. Also produces `interaction_partners.json`.
extract_interaction_partners_biogrid(uniprot_ids, output_folder)
Converts BioGRID gene names to UniProt IDs and returns a dictionary with UniProt IDs as keys and a list of interaction partner UniProt IDs as values.
create_mfa(interaction_partners: dict)
Generates multi-FASTA files for each pair of interaction partners and saves them in a folder named `complexes`.
main()
The main function of the script. It parses command-line arguments, calls the appropriate data fetching and processing functions depending on the selected data source, and generates the JSON file with the interaction partners dictionary.
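To illustrate the MFA-generation step, here is a rough sketch of what `create_mfa` could look like, assuming sequences are fetched from UniProt’s REST API; the `fetch_sequence` helper and its details are my own reconstruction, not DAMIA’s code:

```python
# Illustrative reconstruction; fetch_sequence is a hypothetical helper.
import os
import requests

def fetch_sequence(uniprot_id: str) -> str:
    resp = requests.get(f"https://rest.uniprot.org/uniprotkb/{uniprot_id}.fasta")
    resp.raise_for_status()
    return "".join(resp.text.splitlines()[1:])  # drop the FASTA header line

def create_mfa(interaction_partners: dict, out_dir: str = "complexes"):
    os.makedirs(out_dir, exist_ok=True)
    for key, partners in interaction_partners.items():
        key_seq = fetch_sequence(key)
        for partner in partners:
            # One heterodimer per file, named key:partner.fa as described above.
            with open(os.path.join(out_dir, f"{key}:{partner}.fa"), "w") as fh:
                fh.write(f">{key}\n{key_seq}\n>{partner}\n{fetch_sequence(partner)}\n")
```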
Example Usage
To fetch MINT interaction data for UniProt IDs Q15113, P20908, and Q9BQB4, and save the output in the “interactions” folder, run the following command:
python interactions.py Q15113 P20908 Q9BQB4 --source mint
This will create the following files and folders:
- `interactions/Q15113.tsv`: MINT interaction data for Q15113
- `interactions/P20908.tsv`: MINT interaction data for P20908
- `interactions/Q9BQB4.tsv`: MINT interaction data for Q9BQB4
- `interactions/interaction_partners.json`: A JSON file containing the dictionary of interaction partners
- `complexes/`: A folder containing multi-FASTA files for each pair of interaction partners
Similarly for BioGRID, run the following command:
python interactions.py Q15113 P20908 Q9BQB4 --source biogrid --access_key "accesskey"
You can obtain an access key from the BioGRID website.
Notes
- Output folder issue fixed (thank you, Kritika Grover).
TODO
- The `extract_interaction_partners_{mint, biogrid}` functions use much of the same code and can probably be combined.
- Include a means of checking the PDB for an existing experimental model.
Whitney #
This Python script performs a Mann-Whitney U-test on specified columns from multiple TSV files to determine if the difference between interaction data of two sets of models is significant.
Usage
python whitney.py -a <sample_a_files> -b <sample_b_files> -c <column_name>
Arguments
- `-a`, `--sample_a`: List of TSV files for sample A (space-separated).
- `-b`, `--sample_b`: List of TSV files for sample B (space-separated).
- `-c`, `--column_name`: Column name to extract data from.
Example
Suppose we have four TSV files: `sample_a1.tsv`, `sample_a2.tsv`, `sample_b1.tsv`, and `sample_b2.tsv`. We want to perform a Mann-Whitney U-test on the “interaction_strength” column. The command would look like:
python whitney.py -a sample_a1.tsv sample_a2.tsv -b sample_b1.tsv sample_b2.tsv -c interaction_strength
Output
The script prints the Mann-Whitney U-test results, including the U statistic and P-value, to the console. Additionally, it generates an output text file called “whitney.txt” containing the same information.
Functions
read_and_extract_column(filepaths: List[str], column_name: str) -> List[float]
Reads the specified column from each TSV file in the list of file paths, ignoring rows with “N/A” in the specified column, and returns the combined data as a list of floats.
main(sample_a_files: List[str], sample_b_files: List[str], column_name: str)
Performs a Mann-Whitney U-test on the specified columns from the given TSV files for samples A and B, prints the results, and writes the results to an output file.
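The core of this is a direct call to SciPy; here is a minimal sketch under that assumption (an illustration, not the exact script):

```python
# Illustrative sketch using SciPy's two-sample Mann-Whitney U-test.
from typing import List
import pandas as pd
from scipy.stats import mannwhitneyu

def read_and_extract_column(filepaths: List[str], column_name: str) -> List[float]:
    values: List[float] = []
    for path in filepaths:
        col = pd.read_csv(path, sep="\t")[column_name]
        numeric = pd.to_numeric(col, errors="coerce")  # "N/A" becomes NaN
        values.extend(numeric.dropna().tolist())
    return values

sample_a = read_and_extract_column(["sample_a1.tsv", "sample_a2.tsv"], "interaction_strength")
sample_b = read_and_extract_column(["sample_b1.tsv", "sample_b2.tsv"], "interaction_strength")
u_stat, p_value = mannwhitneyu(sample_a, sample_b, alternative="two-sided")
```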
Input TSV File Format
The input TSV files should have the following format:
pdb_id interaction_strength
model1 0.65
model2 0.72
model3 0.61
Example Output TXT File
The output text file whitney.txt
will have the following content:
Mann-Whitney U-test results:
U statistic: 15.0
P-value: 0.0389249472108
Pipeline #
I want to be clear: the ultimate vision for DAMIA is yet to be realized. It’s not an autonomous pipeline just yet, though all of its scripts can already be chained by hand into a pipeline of sorts. The goal is for DAMIA to be a simple-to-use pipeline where all you need to provide is a UniProt ID (or several, in a high-throughput analysis):
- Start with a UniProt ID input into `interactions.py` to get the MFAs we need (wildtype and missense).
- Predict their structures using AFM.
- Run the PDBs through PRODIGY to get an `.ic` file.
- Run the `.ic` file through `distance.py` to obtain an enriched TSV version with additional carbon distances.
- Run the TSV files through `whitney.py` to see if there’s a statistically significant difference between the wildtype and missense complexes.
- Run the TSV files through `interface.py` to see the average distance between the interface residues and how often certain residues occur.
- If the missense residue being analyzed is in the interface:
  - Use `partners.py` to find that residue’s interaction partners.
  - Use `avgdistances.py` to find average distances between interface residues.
  - Run the complex through PI-score and `avgfeatures.py` to get the average of all features from that pipeline.
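Until that vision is realized, the steps above can be glued together by hand. Here’s a minimal sketch of what that could look like; the file names (`wt.ic`, `wt.pdb`, etc.) are hypothetical placeholders, and the AFM and PRODIGY steps run outside DAMIA:

```python
# Hypothetical glue script; file names are placeholders, and the AFM and
# PRODIGY steps run outside DAMIA, so they appear here only as a comment.
import subprocess

def run(cmd):
    subprocess.run(cmd, check=True)

run(["python", "interactions.py", "Q15113", "--source", "mint"])
# ...predict structures with AFM, then score interfaces with PRODIGY to get .ic files...
run(["python", "distance.py", "wt.ic", "wt.pdb", "-o", "wt.tsv"])
run(["python", "distance.py", "mut.ic", "mut.pdb", "-o", "mut.tsv"])
run(["python", "whitney.py", "-a", "wt.tsv", "-b", "mut.tsv", "-c", "beta_distance"])
```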
Lessons & Takeaways #
- Save your work as you go: This started as a mostly exploratory project. But since I saved all my scripts, I ultimately realized I could package up my workflow into something useful.
- Plan your pipeline and analysis ahead of time: If these scripts seem a little tricky to follow, it’s because they are. Much of DAMIA was built by going back and forth in the research process (something I like to call Agile Research). While this is good for quickly iterating on findings, it doesn’t necessarily translate well to novel pipelines like this one.
The work I did on this project, culminating in DAMIA, was what led to my contributions to the Nature Genetics paper I write about here, so it means a lot to me!