DAMIA: Disease-Associated Missense Interactome Analysis
Overview #
As part of my work at SickKids, I needed to compare wildtype protein complexes predicted using AlphaFold-Multimer (AFM) to ones with missense mutations introduced.
In conducting this analysis, I created several tools to obtain very specific kinds of information:
- Tabular representations of interacting residues and chains
- Average distances between amino acid carbons (alpha & beta)
- Jaccard similarity between the interacting residues in wildtype and missense complexes
Many of these scripts came together to form a pipeline I called DAMIA (Disease-Associated Missense Interactome Analysis) that helps analyze the similarities and differences between wildtype and missense complexes as predicted by AFM.
Here’s an overview of each tool, drawing heavily on each one’s documentation.
- `distance`: Distance calculation between alpha and beta carbons
- `interface`: Interface determination for chain-residue pairs
- `similarity`: Jaccard index similarity calculation
- `avgfeatures`: Averages of PI-score output
- `interactions`: Interaction search and MFA creation
- `whitney`: U-test between WT and missense interfaces
Distance #
This script calculates the distances between the alpha- and beta-carbons of interface residue pairs in a PDB file, using PyMOL.
The rationale for this tool is that if AFM truly captures the effects of a mutation, we might expect a significant difference in the average carbon distances between interacting residues in missense complexes.
The input is an `.ic` file containing the list of residue pairs, and the output is a TSV file containing the calculated distances. This provides some information about interface quality and proximity.
To run the script from the command line, use the following format:
python distance.py <ic_file> <pdb_file> [-o <output_tsv>]
Where:
- `<ic_file>`: Path to the `.ic` file containing the residue pairs.
- `<pdb_file>`: Path to the PDB file containing the protein structure.
- `<output_tsv>`: (Optional) Path to the output TSV file where the distances will be saved. Defaults to `output.tsv` in the current directory.
Example:
python distance.py input.ic protein.pdb -o distances.tsv
Input
The input `.ic` file should contain a list of residue pairs, one pair per line. Each line should contain the following columns, separated by whitespace:
- Residue type of the first residue (e.g., “GLY”).
- Residue number of the first residue (e.g., “42”).
- Chain identifier of the first residue (e.g., “A”).
- Residue type of the second residue (e.g., “LEU”).
- Residue number of the second residue (e.g., “87”).
- Chain identifier of the second residue (e.g., “B”).
For example:
GLY 42 A LEU 87 B
Output
The output TSV file will contain the following columns:
- Index: The index of the residue pair from the `.ic` file.
- Chain1: Chain identifier of the first residue.
- Res1: Residue type of the first residue.
- Res1_num: Residue number of the first residue.
- Chain2: Chain identifier of the second residue.
- Res2: Residue type of the second residue.
- Res2_num: Residue number of the second residue.
- Alpha_distance: Distance between the alpha-carbons of the two residues.
- Beta_distance: Distance between the beta-carbons of the two residues, or “N/A” if one of the residues is glycine.
Example output:
index chain1 res1 res1_num chain2 res2 res2_num alpha_distance beta_distance
1 A GLY 42 B LEU 87 12.34 N/A
Code Overview
The script is divided into the following functions:
- `calculate_distance`: Calculates the distance between the alpha- and beta-carbons of two residues in a PDB file.
- `main`: Reads the input `.ic` file, calculates distances for all residue pairs, and writes the results to the output TSV file.
The script uses the `argparse` library to parse command-line arguments and the `pymol` library to load PDB files and calculate distances.
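To give a sense of how the measurement works, here is a minimal sketch of the per-pair calculation using PyMOL’s Python API. The function name matches the one above, but the body is my illustrative reconstruction, not the exact DAMIA code:

```python
# Illustrative sketch, not the exact DAMIA implementation.
from pymol import cmd

def calculate_distance(chain1, resi1, chain2, resi2, carbon="CA"):
    """Distance (in angstroms) between one atom of each residue.

    carbon is "CA" for alpha-carbons or "CB" for beta-carbons;
    glycine has no beta-carbon, so callers should emit "N/A" for it.
    """
    sel1 = f"chain {chain1} and resi {resi1} and name {carbon}"
    sel2 = f"chain {chain2} and resi {resi2} and name {carbon}"
    return cmd.get_distance(sel1, sel2)

cmd.load("protein.pdb")  # load the structure once, then measure each pair
alpha = calculate_distance("A", 42, "B", 87, carbon="CA")
```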
Interface
#
This Python script analyzes occurrences of a chain-residue pair across multiple TSV files and calculates a consensus score based on their presence. The script also calculates the average distance (either alpha or beta) between the chain-residue pairs in the provided TSV files. `interface.py` aims to show how often a given residue (at a specified chain) appears in interfaces and, when it does, how far it is, on average, from the other residues it interacts with.
The rationale for this tool’s existence is to observe interface residues as they appear across AFM predictions (the algorithm can produce up to 5 per run, per sequence).
Usage
python interface.py <file_paths> <chain> <residue> [-d <distance_type>]
Arguments
- `file_paths`: Paths to the TSV files (space-separated).
- `chain`: Chain (a letter A-Z).
- `residue`: Residue (an integer).
- `-d`, `--distance_type`: (Optional) Distance type, either “alpha” or “beta”. Default is “beta”.
Example
Suppose we have two TSV files, `file1.tsv` and `file2.tsv`. We want to analyze the occurrences of the chain-residue pair A-42 with alpha distance. The command would look like:
python interface.py file1.tsv file2.tsv A 42 -d alpha
Output
The script prints the consensus score and total occurrences of the chain-residue pair to the console. Additionally, it generates two output files:
- `occurrences.tsv`: A TSV file containing the occurrences and average distances for each input TSV file.
- `avg_distance.txt`: A plain text file containing the overall average distance.
Functions
read_tsv(file_path: str) -> pd.DataFrame
Reads a TSV file and returns a pandas DataFrame.
find_chain_residue(df: pd.DataFrame, chain: str, residue: int, distance_type: str) -> (bool, int, float)
Finds the occurrences of a chain-residue pair in the given DataFrame and calculates the average distance based on the specified distance type. Returns a tuple containing a boolean indicating presence, the number of occurrences, and the average distance.
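As a rough illustration, the search could look like the following sketch, assuming the column layout shown in the example below; this is a reconstruction, not the exact implementation:

```python
# Illustrative sketch assuming the TSV columns shown in the example below.
import pandas as pd

def find_chain_residue(df, chain, residue, distance_type="beta"):
    col = f"{distance_type}_distance"
    # The target residue may appear on either side of a pair.
    mask = ((df["chain1"] == chain) & (df["res1_num"] == residue)) | \
           ((df["chain2"] == chain) & (df["res2_num"] == residue))
    hits = df[mask]
    distances = pd.to_numeric(hits[col], errors="coerce").dropna()
    average = distances.mean() if not distances.empty else float("nan")
    return (not hits.empty, len(hits), average)
```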
Example Input TSV File
The input TSV files should have the following format (this is the enriched output of `distance.py`, built from PRODIGY’s `.ic` files):
index chain1 res1 res1_num chain2 res2 res2_num alpha_distance beta_distance
1 A ALA 15 B GLY 32 12.34 10.56
2 A VAL 42 B MET 29 7.92 6.34
3 B ALA 15 A GLY 32 11.87 9.42
Example Output TSV File
The output TSV file `occurrences.tsv` will have the following format:
TSV Occurrences Average Distance
file1.tsv 2 6.45
file2.tsv 1 7.92
Example Output TXT File
The output plain text file `avg_distance.txt` will have the following content:
Average Distance: 6.94
Similarity #
This script calculates the Jaccard Index similarity between multiple TSV files, each containing interaction data from a complex. Users can provide options to keep only the residue numbers, remove duplicate residues, or both. The script saves the pairwise similarity scores in a TSV file and writes a summary of the results to a text file.
`similarity.py` tells us how similar the interfaces of different complexes are to each other.
Command-Line Arguments
- `file_paths` (str): Paths to the TSV files to be compared.
- `--residues_only` (bool, optional): If provided, keep only the residue number columns in the input DataFrames.
- `--remove_duplicate_residues` (bool, optional): If provided, remove duplicate residues from the input DataFrames.
Example Usage
Here is an example of how to run the script from the command line:
python similarity.py file1.tsv file2.tsv file3.tsv --residues_only --remove_duplicate_residues
In this example, the script will:
- Read the contents of `file1.tsv`, `file2.tsv`, and `file3.tsv`.
- Preprocess each DataFrame by keeping only the residue number columns (`res1_num` and `res2_num`) and removing duplicate residues if the `--residues_only` and `--remove_duplicate_residues` flags are provided.
- Calculate the pairwise Jaccard Index similarity between each pair of tables.
- Save the pairwise similarity scores in a TSV file named `pairwise_similarity.tsv`.
- Calculate the average, median, minimum, and maximum similarity scores and save the summary in a text file named `summary.txt`.
The output files `pairwise_similarity.tsv` and `summary.txt` will be generated in the same directory as the script.
Functions
`read_tsv(file_path: str) -> pd.DataFrame`: Reads a TSV file and returns a pandas DataFrame.
- `file_path` (str): The file path of the TSV file.

`preprocess_table(df: pd.DataFrame, residues_only: bool, remove_duplicate_residues: bool) -> pd.DataFrame`: Preprocesses a DataFrame by keeping only the residue number columns or removing unnecessary columns and duplicate rows.
- `df` (pd.DataFrame): The input DataFrame.
- `residues_only` (bool): If True, keep only the residue number columns.
- `remove_duplicate_residues` (bool): If True, remove duplicate residues.

`jaccard_index(df1: pd.DataFrame, df2: pd.DataFrame) -> float`: Calculates the Jaccard Index similarity between two DataFrames.
- `df1` (pd.DataFrame): The first DataFrame.
- `df2` (pd.DataFrame): The second DataFrame.

`multi_table_jaccard(file_paths: List[str], residues_only: bool, remove_duplicate_residues: bool) -> Tuple[pd.DataFrame, pd.Series]`: Calculates pairwise Jaccard Index similarity between multiple TSV files and returns a DataFrame with the results and a summary Series.
- `file_paths` (List[str]): A list of file paths to the TSV files.
- `residues_only` (bool): If True, keep only the residue number columns in the input DataFrames.
- `remove_duplicate_residues` (bool): If True, remove duplicate residues from the input DataFrames.
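For intuition, here is a minimal sketch of the Jaccard calculation, assuming rows are compared as whole tuples after preprocessing (an illustration, not the exact code):

```python
# Illustrative sketch: rows are treated as set elements after preprocessing.
import pandas as pd

def jaccard_index(df1: pd.DataFrame, df2: pd.DataFrame) -> float:
    set1 = set(map(tuple, df1.itertuples(index=False)))
    set2 = set(map(tuple, df2.itertuples(index=False)))
    union = set1 | set2
    # |intersection| / |union|; an empty union means nothing to compare.
    return len(set1 & set2) / len(union) if union else 0.0
```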
Output Files
- `pairwise_similarity.tsv`: A TSV file containing the pairwise similarity scores.
- `summary.txt`: A text file containing the average, median, minimum, and maximum similarity scores.
Example Input:
Let’s consider three example TSV files, `file1.tsv`, `file2.tsv`, and `file3.tsv`.

`file1.tsv`:
index chain1 res1 res1_num chain2 res2 res2_num alpha_distance beta_distance
1 A ALA 15 B GLY 32 12.34 10.56
2 A VAL 42 B MET 29 7.92 6.34
3 B ALA 15 A GLY 32 11.87 9.42
`file2.tsv`:
index chain1 res1 res1_num chain2 res2 res2_num alpha_distance beta_distance
1 A ALA 15 B GLY 32 13.45 11.32
2 B VAL 42 A MET 29 8.01 6.47
4 A ARG 8 B LEU 21 10.29 8.61
`file3.tsv`:
index chain1 res1 res1_num chain2 res2 res2_num alpha_distance beta_distance
1 A ALA 15 B GLY 32 12.78 10.89
2 A VAL 42 B MET 29 7.95 6.38
3 B ALA 15 A GLY 32 11.89 9.44
5 A ARG 8 B LEU 21 10.31 8.63
Command:
python similarity.py file1.tsv file2.tsv file3.tsv --residues_only --remove_duplicate_residues
Example Output:
`pairwise_similarity.tsv`:
Pairwise Matchup Similarity Score
Table 1 - Table 2 0.6
Table 1 - Table 3 0.75
Table 2 - Table 3 0.6
`summary.txt`:
Average Score: 0.65
Median Score: 0.6
Minimum Score: 0.6
Maximum Score: 0.75
In this example, the pairwise similarity scores between the tables are 0.6 and 0.75, indicating that the residue pairs in the tables overlap substantially but are not identical after preprocessing.
AvgFeatures #
This Python script reads a CSV file containing PI-Score features and calculates the averages, standard deviations, and standard errors for each numeric feature. The results are saved in a TSV file.
Usage
python avgfeatures.py <csv_path> [-o output_file]
Arguments
- `csv_path`: Path to the input CSV file.
- `output_file`: (Optional) Output file name. Default is “averages.tsv”.
Example
Suppose we have a CSV file `input.csv`. We want to calculate the averages, standard deviations, and standard errors for the numeric features and save the results in a file called “results.tsv”. The command would look like:
python avgfeatures.py input.csv -o results.tsv
Output
The script generates an output TSV file containing the averages, standard deviations, and standard errors for each numeric feature in the input CSV file.
Functions
process_csv_file(filename: str, output_filename: str)
Reads a CSV file, calculates the averages, standard deviations, and standard errors for each numeric feature, and saves the results in a TSV file.
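A minimal sketch of that logic with pandas might look like this. Note that I use a population standard deviation (`ddof=0`) so a single-row input yields zeros, as in the example below; the real script’s choice may differ:

```python
# Illustrative sketch; ddof=0 matches the zeros in the single-row example below.
import pandas as pd

def process_csv_file(filename: str, output_filename: str = "averages.tsv"):
    df = pd.read_csv(filename)
    numeric = df.select_dtypes(include="number")
    summary = pd.DataFrame({
        "Feature": numeric.columns,
        "Average": numeric.mean().values,
        "Std. Deviation": numeric.std(ddof=0).values,
        "Std. Error": numeric.sem(ddof=0).values,
    })
    summary.to_csv(output_filename, sep="\t", index=False)
```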
Example Input CSV File
The input CSV files should have the following format:
pdb,interface,Num_intf_residues,Polar,Hydrophobhic,Charged,conserved_interface, contact_pairs, sc, hb, sb, int_solv_en, int_area, pvalue
ranked_0,B_A,54,0.278,0.259,0.37,NA,55,0.629,21,6,-8.41,1770.08,0.63
Example Output TSV File
The output TSV file will have the following format:
Feature Average Std. Deviation Std. Error
Num_intf_residues 54.000000 0.000000 0.000000
Polar 0.278000 0.000000 0.000000
Hydrophobhic 0.259000 0.000000 0.000000
Charged 0.370000 0.000000 0.000000
Note that the example output file contains only a single row of data, as there is only one input row in the example input file. In practice, the input CSV file will have multiple rows, and the output file will show the calculated values based on all input rows.
Interactions #
This script fetches protein interaction data from MINT or BioGRID and generates multi-FASTA files for each pair of interaction partners. Its purpose is to create MFA inputs for AFM complex predictions.
Usage
python interactions.py <uniprot_ids> --source <mint|biogrid> [--output_folder <output_folder>] [--access_key <access_key>]
Arguments
- `uniprot_ids`: List of UniProt IDs to fetch data for.
- `--source`: Choose the data source (mint or biogrid). Only MINT is currently working.
- `--output_folder` (optional): Output folder for interaction data (default: interactions).
- `--access_key` (optional): Access key for the BioGRID API (required if source is biogrid).
Output
- The script creates a folder with interaction data files for each UniProt ID in the specified `output_folder` (default: interactions).
- A JSON file, `interaction_partners.json`, is created in the `output_folder`, containing a dictionary with UniProt IDs as keys and lists of interaction partner UniProt IDs as values.
- A folder named `complexes` is created with multi-FASTA files for each pair of interaction partners. The file names are in the format `key:value[0].fa`, `key:value[1].fa`, etc.
Limitations
- The `.fa` files in the `complexes` folder are automatically generated and do not account for other proteins that may be required to properly model a complex. They also assume all complexes are heterodimeric, which may not be the case.
- Some of the complexes may have exact experimental structures in the PDB, which would be preferred for most analyses.
Functions
get_mint_data(output_folder, uniprot_ids)
Fetches interaction data from MINT and saves it as TSV files in the specified output folder.
extract_interaction_partners_mint(uniprot_ids, output_folder)
Extracts interaction partner UniProt IDs from MINT data and returns a dictionary with UniProt IDs as keys and a list of interaction partner UniProt IDs as values.
get_biogrid_data(output_folder, uniprot_ids, access_key)
Fetches interaction data from BioGRID and saves it as TSV files in the specified output folder. Also produces `interaction_partners.json`.
extract_interaction_partners_biogrid(uniprot_ids, output_folder)
Converts BioGRID gene names to UniProt IDs and returns a dictionary with UniProt IDs as keys and a list of interaction partner UniProt IDs as values.
create_mfa(interaction_partners: dict)
Generates multi-FASTA files for each pair of interaction partners and saves them in a folder named `complexes`.
main()
The main function of the script. It parses command-line arguments, calls the appropriate data fetching and processing functions depending on the selected data source, and generates the JSON file with the interaction partners dictionary.
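To illustrate the MFA-generation step, here is a rough sketch of what `create_mfa` could look like, assuming sequences are fetched from UniProt’s REST API; the `fetch_sequence` helper and its details are my own reconstruction, not DAMIA’s code:

```python
# Illustrative reconstruction; fetch_sequence is a hypothetical helper.
import os
import requests

def fetch_sequence(uniprot_id: str) -> str:
    resp = requests.get(f"https://rest.uniprot.org/uniprotkb/{uniprot_id}.fasta")
    resp.raise_for_status()
    return "".join(resp.text.splitlines()[1:])  # drop the FASTA header line

def create_mfa(interaction_partners: dict, out_dir: str = "complexes"):
    os.makedirs(out_dir, exist_ok=True)
    for key, partners in interaction_partners.items():
        key_seq = fetch_sequence(key)
        for partner in partners:
            # One heterodimer per file, named key:partner.fa as described above.
            with open(os.path.join(out_dir, f"{key}:{partner}.fa"), "w") as fh:
                fh.write(f">{key}\n{key_seq}\n>{partner}\n{fetch_sequence(partner)}\n")
```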
Example Usage
To fetch MINT interaction data for UniProt IDs Q15113, P20908, and Q9BQB4, and save the output in the “interactions” folder, run the following command:
python interactions.py Q15113 P20908 Q9BQB4 --source mint
This will create the following files and folders:
- `interactions/Q15113.tsv`: MINT interaction data for Q15113
- `interactions/P20908.tsv`: MINT interaction data for P20908
- `interactions/Q9BQB4.tsv`: MINT interaction data for Q9BQB4
- `interactions/interaction_partners.json`: A JSON file containing the dictionary of interaction partners
- `complexes/`: A folder containing multi-FASTA files for each pair of interaction partners
Similarly for BioGRID, run the following command:
python interactions.py Q15113 P20908 Q9BQB4 --source biogrid --access_key "accesskey"
You can obtain an access key from the BioGRID website.
Notes
- Output folder issue fixed (thank you, Kritika Grover).
TODO
- The `extract_interaction_partners_{mint, biogrid}` functions use much of the same code and can probably be combined.
- Include a means of checking the PDB for an existing experimental model.
Whitney #
This Python script performs a Mann-Whitney U-test on specified columns from multiple TSV files to determine if the difference between interaction data of two sets of models is significant.
Usage
python whitney.py -a <sample_a_files> -b <sample_b_files> -c <column_name>
Arguments
- `-a`, `--sample_a`: List of TSV files for sample A (space-separated).
- `-b`, `--sample_b`: List of TSV files for sample B (space-separated).
- `-c`, `--column_name`: Column name to extract data from.
Example
Suppose we have four TSV files: `sample_a1.tsv`, `sample_a2.tsv`, `sample_b1.tsv`, and `sample_b2.tsv`. We want to perform a Mann-Whitney U-test on the “interaction_strength” column. The command would look like:
python whitney.py -a sample_a1.tsv sample_a2.tsv -b sample_b1.tsv sample_b2.tsv -c interaction_strength
Output
The script prints the Mann-Whitney U-test results, including the U statistic and P-value, to the console. Additionally, it generates an output text file called “whitney.txt” containing the same information.
Functions
read_and_extract_column(filepaths: List[str], column_name: str) -> List[float]
Reads the specified column from each TSV file in the list of file paths, ignoring rows with “N/A” in the specified column, and returns the combined data as a list of floats.
main(sample_a_files: List[str], sample_b_files: List[str], column_name: str)
Performs a Mann-Whitney U-test on the specified columns from the given TSV files for samples A and B, prints the results, and writes the results to an output file.
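The core of this is a direct call to SciPy; here is a minimal sketch under that assumption (an illustration, not the exact script):

```python
# Illustrative sketch using SciPy's two-sample Mann-Whitney U-test.
from typing import List
import pandas as pd
from scipy.stats import mannwhitneyu

def read_and_extract_column(filepaths: List[str], column_name: str) -> List[float]:
    values: List[float] = []
    for path in filepaths:
        col = pd.read_csv(path, sep="\t")[column_name]
        numeric = pd.to_numeric(col, errors="coerce")  # "N/A" becomes NaN
        values.extend(numeric.dropna().tolist())
    return values

sample_a = read_and_extract_column(["sample_a1.tsv", "sample_a2.tsv"], "interaction_strength")
sample_b = read_and_extract_column(["sample_b1.tsv", "sample_b2.tsv"], "interaction_strength")
u_stat, p_value = mannwhitneyu(sample_a, sample_b, alternative="two-sided")
```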
Input TSV File Format
The input TSV files should have the following format:
pdb_id interaction_strength
model1 0.65
model2 0.72
model3 0.61
Example Output TXT File
The output text file whitney.txt
will have the following content:
Mann-Whitney U-test results:
U statistic: 15.0
P-value: 0.0389249472108
Pipeline #
I want to be clear: the ultimate vision for DAMIA is yet to be realized. It’s not an autonomous pipeline just yet, though all of its scripts can already be chained by hand into a pipeline of sorts. The goal is for DAMIA to be a simple-to-use pipeline where all you need to provide is a UniProt ID (or several, in a high-throughput analysis):
- Start with a UniProt ID input into `interactions.py` to get the MFAs we need (wildtype and missense).
- Predict their structures using AFM.
- Run the PDBs through PRODIGY to get an `.ic` file.
- Run the `.ic` file through `distance.py` to obtain an enriched TSV version with additional carbon distances.
- Run the TSV files through `whitney.py` to see if there’s a statistically significant difference between the wildtype and missense complexes.
- Run the TSV files through `interface.py` to see the average distance between the interface residues and how often certain residues occur.
- If the missense residue being analyzed is in the interface:
  - Use `partners.py` to find that residue’s interaction partners.
  - Use `avgdistances.py` to find average distances between interface residues.
  - Run the complex through PI-score and `avgfeatures.py` to get the average of all features from that pipeline.
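Until that vision is realized, the steps above can be glued together by hand. Here’s a minimal sketch of what that could look like; the file names (`wt.ic`, `wt.pdb`, etc.) are hypothetical placeholders, and the AFM and PRODIGY steps run outside DAMIA:

```python
# Hypothetical glue script; file names are placeholders, and the AFM and
# PRODIGY steps run outside DAMIA, so they appear here only as a comment.
import subprocess

def run(cmd):
    subprocess.run(cmd, check=True)

run(["python", "interactions.py", "Q15113", "--source", "mint"])
# ...predict structures with AFM, then score interfaces with PRODIGY to get .ic files...
run(["python", "distance.py", "wt.ic", "wt.pdb", "-o", "wt.tsv"])
run(["python", "distance.py", "mut.ic", "mut.pdb", "-o", "mut.tsv"])
run(["python", "whitney.py", "-a", "wt.tsv", "-b", "mut.tsv", "-c", "beta_distance"])
```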
Lessons & Takeaways #
- Save your work as you go: This started as a mostly exploratory project. But since I saved all my scripts, I ultimately realized I could package up my workflow into something useful.
- Plan your pipeline and analysis ahead of time: If these scripts seem a little tricky to follow, it’s because they are. Much of DAMIA was built by going back and forth in the research process (something I like to call Agile Research). While this is good for quickly iterating on findings, it doesn’t necessarily translate well to novel pipelines like this one.
The work I did on this project, culminating in DAMIA, was what led to my contributions to the Nature Genetics paper I write about here, so it means a lot to me!