Proteinloop - PepPI Datasets

PepBDB-ML

Peptide-protein datasets from PepBDB, designed to be ML & DL-friendly for compbio tasks. Read more on GitHub & Medium.

This project generates and provides an enriched dataset derived from the PepBDB database for machine learning and computational biology research. PepBDB-ML is a tabular dataset suitable for analysis with random forests, XGBoost, and similar models. Each row describes one residue and is labeled as either binding (1) or non-binding (0).
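
Because each row is one residue with a 0/1 label, preparing the table for a classifier mostly means separating the label column from the feature columns. The following is a minimal sketch, assuming the rows have been read as dictionaries (for example with csv.DictReader) and that the label column is named "Binding Indices" as in the column list. The helper name split_features_labels and the two toy rows are invented for illustration and are not taken from the dataset.

```python
import csv
import io

LABEL_COLUMN = "Binding Indices"  # assumed to match the header in the file


def split_features_labels(rows, label_column=LABEL_COLUMN, drop=("AA",)):
    """Return (feature_names, X, y) from an iterable of dict rows.

    Columns listed in `drop` (non-numeric ones such as the residue letter)
    are left out of X; every other column is converted to float.
    """
    rows = list(rows)
    if not rows:
        return [], [], []
    feature_names = [c for c in rows[0] if c != label_column and c not in drop]
    X = [[float(row[c]) for c in feature_names] for row in rows]
    y = [int(float(row[label_column])) for row in rows]
    return feature_names, X, y


# Toy rows with invented values, only to show the expected shape.
toy_csv = io.StringIO(
    "AA,Hydrophobicity,ASA,Binding Indices\n"
    "A,0.5,0.25,1\n"
    "G,0.1,0.75,0\n"
)
names, X, y = split_features_labels(csv.DictReader(toy_csv))
print(names, X, y)
```

The resulting X and y can then be handed to any tabular learner, such as a random forest or XGBoost model.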

The data is based on the PepBDB dataset, which contains 3D structures of peptide-protein complexes.
The GitHub repo lets you regenerate this dataset on your own machine.
Generation requires PSI-BLAST, Prodigy, AAindex, BioPython, and other dependencies.

The columns are as follows:

AA: Amino acid, single letter code.
Hydrophobicity: Hydrophobicity index of the residue, from AAindex1. (Unitless).
Steric Parameter: Steric parameter of residue from AAindex1. (Unitless).
Volume: Spatial volume of the residue, from AAindex1. (Å³).
Polarizability: How easily the residue's electron cloud forms dipoles in response to external forces, from AAindex1. (Unitless).
Helix Probability: Relative helix residue probability in 47 proteins, from AAindex1. (Unitless).
Beta Probability: Relative beta sheet residue probability in 47 proteins, from AAindex1. (Unitless).
Isoelectric Point: Isoelectric point of residue, from AAindex1. (pH).
HSE Up: Half-sphere exposure, the number of Cα atoms in the up-facing hemisphere. (Unitless, normalized integer counts).
HSE Down: Half-sphere exposure, the number of Cα atoms in the down-facing hemisphere. (Unitless, normalized integer counts).
Pseudo Angles: The angle formed by three adjacent Cα atoms. (Degrees).
ASA: Accessible surface area, i.e. how accessible the residue is to the surrounding solvent, normalized by the maximum possible ASA. (Unitless after normalization; raw ASA is in Å²).
Phi: Angle of rotation about the (N)-(Cα) bond. (Degrees).
Psi: Angle of rotation about the (Cα)-(carbonyl carbon) bond. (Degrees).
SS {H...-}: Secondary structure assignments from DSSP.

Code	Structure
H	Alpha helix (4-12)
B	Isolated beta-bridge residue
E	Strand
G	3-10 helix
I	Pi helix
T	Turn
S	Bend
-	None
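
The SS {H...-} notation suggests one indicator column per DSSP code in the table above. The sketch below illustrates that one-hot scheme. The column names ("SS H", "SS B", and so on) and their order are assumptions made for illustration and should be checked against the actual file header; the function names are hypothetical.

```python
# The eight DSSP codes listed in the table above.
DSSP_CODES = ("H", "B", "E", "G", "I", "T", "S", "-")


def one_hot_ss(code):
    """Map a single DSSP code to a dict of 0/1 indicator columns."""
    if code not in DSSP_CODES:
        raise ValueError(f"unknown DSSP code: {code!r}")
    return {f"SS {c}": int(c == code) for c in DSSP_CODES}


def decode_ss(row):
    """Recover the DSSP code from a row holding the indicator columns."""
    active = [c for c in DSSP_CODES if int(float(row[f"SS {c}"])) == 1]
    if len(active) != 1:
        raise ValueError("expected exactly one active SS column")
    return active[0]


encoded = one_hot_ss("E")
print(encoded)
print(decode_ss(encoded))
```

Decoding back to a single categorical column can be handy for models or plots that prefer one secondary-structure field over eight indicators.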

A...V: One column per amino acid, each holding the normalized substitution matrix score for the residue at this position in the sequence. (Unitless).
Binding Indices: Label for the residue, 1 for binding and 0 for non-binding.
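
Phi and Psi are periodic: -180 and 180 degrees describe the same rotation even though the raw numbers are far apart. Tree-based models usually tolerate this, but distance-based or gradient-based models may benefit from a sine/cosine encoding. This is a common preprocessing step offered here as a sketch, not something the dataset itself applies; the function names are hypothetical.

```python
import math


def encode_angle(degrees):
    """Return (sin, cos) of an angle given in degrees."""
    radians = math.radians(degrees)
    return math.sin(radians), math.cos(radians)


def encoded_distance(a_degrees, b_degrees):
    """Euclidean distance between two angles after sin/cos encoding."""
    sin_a, cos_a = encode_angle(a_degrees)
    sin_b, cos_b = encode_angle(b_degrees)
    return math.hypot(sin_a - sin_b, cos_a - cos_b)


# -180 and 180 degrees coincide after encoding; 0 and 180 are maximally apart.
print(encoded_distance(-180.0, 180.0))
print(encoded_distance(0.0, 180.0))
```

Each angle column becomes two columns under this scheme, so the feature count grows slightly in exchange for a representation without the wrap-around jump.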