PepBDB-ML
Peptide-protein datasets from PepBDB, designed to be ML & DL-friendly for compbio tasks. Read more on GitHub & Medium.
This project aims to generate and provide an enriched dataset from the PepBDB database for machine learning and computational biology research.
PepBDB-ML is a tabular dataset suitable for analysis with random forests, XGBoost, etc.
Each row is labeled as either a binding residue (1) or non-binding residue (0).
- The data is based on the PepBDB dataset, which contains 3D structures of peptide-protein complexes.
- The GitHub repo lets you regenerate this dataset on your own machine.
- Generation requires PSI-BLAST, Prodigy, AAindex, BioPython, and more.
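
Because each row is one residue with a 0/1 label, the table can be split into a feature matrix and a label vector before it is handed to a model such as a random forest or XGBoost. The following is a minimal sketch, not the project's own loading code. It assumes the data is exported as CSV with headers matching the column names listed in this README. The rows in `SAMPLE_CSV` are made-up placeholder values, and `split_features_labels` is a hypothetical helper.

```python
import csv
import io

# Toy stand-in for the real table. These values are made up for illustration
# only and are NOT taken from PepBDB-ML.
SAMPLE_CSV = """AA,Hydrophobicity,ASA,Binding Indices
A,0.61,0.25,0
K,1.15,0.80,1
L,1.53,0.05,0
D,0.46,0.65,1
"""

LABEL_COLUMN = "Binding Indices"  # assumed header, taken from the column list
CATEGORICAL = {"AA"}              # non-numeric column, left out of X in this sketch


def split_features_labels(handle):
    """Return (feature_names, X, y) from a CSV file-like object."""
    reader = csv.DictReader(handle)
    feature_names = [
        name for name in reader.fieldnames
        if name != LABEL_COLUMN and name not in CATEGORICAL
    ]
    X, y = [], []
    for row in reader:
        X.append([float(row[name]) for name in feature_names])
        y.append(int(row[LABEL_COLUMN]))
    return feature_names, X, y


feature_names, X, y = split_features_labels(io.StringIO(SAMPLE_CSV))

# Fraction of residues labeled as binding; worth checking before training,
# since a skewed label distribution affects how a classifier should be evaluated.
binding_fraction = sum(y) / len(y)
```

To use it on a real export, pass an open file handle instead of the `io.StringIO` wrapper. The `AA` column is dropped here only to keep the sketch short; it could instead be one-hot encoded.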
Each column is as follows:
- AA: Amino acid, single letter code.
- Hydrophobicity: Hydrophobicity index of the residue, from AAindex1. (Unitless).
- Steric Parameter: Steric parameter of the residue, from AAindex1. (Unitless).
- Volume: Spatial volume of residue, from AAindex1. (Ų).
- Polarizability: How easily an amino acid's electron cloud forms dipoles in response to external forces, from AAindex1. (Unitless).
- Helix Probability: Relative helix residue probability in 47 proteins, from AAindex1. (Unitless).
- Beta Probability: Relative beta sheet residue probability in 47 proteins, from AAindex1. (Unitless).
- Isoelectric Point: Isoelectric point of residue, from AAindex1. (pH).
- HSE Up: The number of Cα atoms in the up-facing hemisphere. (Unitless, normalized integer counts).
- HSE Down: The number of Cα atoms in the down-facing hemisphere. (Unitless, normalized integer counts).
- Pseudo Angles: The angle between three adjacent Cα atoms. (Degrees).
- ASA: Accessible surface area, i.e. how accessible the residue is to the surrounding solvent, normalized by the maximum possible ASA. (Unitless after normalization; raw ASA is in Ų).
- Phi: Angle of rotation about the (N)-(Cα) bond. (Degrees).
- Psi: Angle of rotation about the (Cα)-(carbonyl carbon) bond. (Degrees).
- SS {H...-}: Secondary structure assignments from DSSP.

  | Code | Structure |
  | --- | --- |
  | H | Alpha helix (4-12) |
  | B | Isolated beta-bridge residue |
  | E | Strand |
  | G | 3-10 helix |
  | I | Pi helix |
  | T | Turn |
  | S | Bend |
  | - | None |

- A...V: Normalized substitution matrix score for the residue in the sequence. (Unitless).
- Binding Indices: Label for the residue, 1 for binding and 0 for non-binding.
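
The Pseudo Angles column above is described as the angle between three adjacent Cα atoms. The sketch below shows one way such an angle can be computed from coordinates. It assumes the angle is measured at the middle Cα, between the vectors pointing to its two neighbours. `pseudo_angle` is a hypothetical helper written for illustration, not the function used to generate the dataset.

```python
import math


def pseudo_angle(ca_prev, ca_mid, ca_next):
    """Angle in degrees at ca_mid formed by three consecutive C-alpha positions.

    Each argument is an (x, y, z) tuple. Assumption: the angle is taken at the
    middle atom, between the vectors to the previous and next atoms.
    """
    v1 = [a - b for a, b in zip(ca_prev, ca_mid)]
    v2 = [a - b for a, b in zip(ca_next, ca_mid)]
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(a * a for a in v2))
    # Clamp to [-1, 1] so floating-point rounding cannot push acos out of range.
    cos_theta = max(-1.0, min(1.0, dot / (norm1 * norm2)))
    return math.degrees(math.acos(cos_theta))


# Two simple geometric checks: a right angle and a straight line.
right_angle = pseudo_angle((1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 1.0, 0.0))
straight = pseudo_angle((-1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (1.0, 0.0, 0.0))
```

The first and last residues of a chain lack one neighbour, so the angle is undefined there under this definition. How the dataset fills those positions is not stated here.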