PepBDB-ML

Peptide-protein datasets from PepBDB, designed to be ML & DL-friendly for compbio tasks. Read more on GitHub & Medium.

This project generates and provides an enriched dataset derived from the PepBDB database for machine learning and computational biology research. PepBDB-ML is a tabular dataset suitable for models such as random forests and XGBoost. Each row describes a single residue and is labeled as either binding (1) or non-binding (0).
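
As a rough illustration of the row layout described above, the sketch below parses a tiny in-memory table and separates the feature columns from the label. The three rows and their values are invented for illustration, only a handful of columns are shown, and the exact header spellings in the released files are an assumption. `split_features_labels` is a hypothetical helper, not part of the project.

```python
import csv
import io

# Toy stand-in for the real table. The values are invented, and the header
# spellings are assumed to match the column names listed in this README.
TOY_CSV = """AA,Hydrophobicity,ASA,Binding Indices
A,0.62,0.15,0
K,-1.50,0.80,1
L,1.06,0.05,0
"""

LABEL = "Binding Indices"


def split_features_labels(text, label=LABEL, categorical=("AA",)):
    """Return (feature_names, X, y) from CSV text with one residue per row."""
    reader = csv.DictReader(io.StringIO(text))
    # Keep only numeric columns; the label and categorical columns are excluded.
    names = [c for c in reader.fieldnames if c != label and c not in categorical]
    X, y = [], []
    for row in reader:
        X.append([float(row[c]) for c in names])
        y.append(int(row[label]))
    return names, X, y


names, X, y = split_features_labels(TOY_CSV)
print(names)  # feature column names
print(X)      # one list of floats per residue
print(y)      # 1 = binding, 0 = non-binding
```

The resulting `X` and `y` can then be converted to arrays and handed to a tree-based classifier of your choice.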

  • The data is based on the PepBDB dataset, which contains 3D structures of peptide-protein complexes.
  • The GitHub repo lets you regenerate this dataset on your own machine.
  • Generation requires PSI-BLAST, Prodigy, AAindex, BioPython, and more.

Each column is as follows:

  • AA: Amino acid, single letter code.
  • Hydrophobicity: Hydrophobicity index of the residue, from AAindex1. (Unitless).
  • Steric Parameter: Steric parameter of residue from AAindex1. (Unitless).
  • Volume: Spatial volume of residue, from AAindex1. (Å³).
  • Polarizability: How easily an amino acid's electron cloud forms dipoles in response to external forces, from AAindex1. (Unitless).
  • Helix Probability: Relative helix residue probability in 47 proteins, from AAindex1. (Unitless).
  • Beta Probability: Relative beta sheet residue probability in 47 proteins, from AAindex1. (Unitless).
  • Isoelectric Point: Isoelectric point of residue, from AAindex1. (pH).
  • HSE Up: The number of Cα atoms in the up-facing hemisphere. (Unitless, normalized integer counts).
  • HSE Down: The number of Cα atoms in the down-facing hemisphere. (Unitless, normalized integer counts).
  • Pseudo Angles: The angle formed by three consecutive Cα atoms. (Degrees).
  • ASA: Accessible surface area, i.e. how exposed the residue is to the surrounding solvent, normalized by the maximum possible ASA. (Unitless after normalization; raw ASA is in Ų).
  • Phi: Angle of rotation about the (N)-(Cα) bond. (Degrees).
  • Psi: Angle of rotation about the (Cα)-(carbonyl carbon) bond. (Degrees).
  • SS {H...-}: Secondary structure assignments from DSSP.
    • Code Structure
      H Alpha helix (4-12)
      B Isolated beta-bridge residue
      E Strand
      G 3-10 helix
      I Pi helix
      T Turn
      S Bend
      - None
  • A...V: Normalized substitution matrix score for the residue in the sequence. (Unitless).
  • Binding Indices: Label for the residue, 1 for binding and 0 for non-binding.
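
To make two of the structural columns above more concrete, here is a minimal sketch of how a pseudo angle and a secondary-structure encoding could be computed. It is not taken from the project's generation code: `pseudo_angle` and `one_hot_ss` are hypothetical helpers, the angle is assumed to be measured at the middle Cα atom, and treating the SS columns as a one-hot encoding over the eight DSSP codes is an assumption based on the column name.

```python
import math

# DSSP codes in the order listed in the table above.
DSSP_CODES = "HBEGITS-"


def pseudo_angle(ca_prev, ca_curr, ca_next):
    """Angle in degrees at ca_curr formed with its two neighbouring C-alpha atoms.

    Each argument is an (x, y, z) coordinate. Assumes the angle is taken at
    the middle atom, which may differ from the dataset's exact definition.
    """
    v1 = [p - c for p, c in zip(ca_prev, ca_curr)]
    v2 = [n - c for n, c in zip(ca_next, ca_curr)]
    dot = sum(a * b for a, b in zip(v1, v2))
    norm = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    cos_theta = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cos_theta))


def one_hot_ss(code):
    """Encode a single DSSP code as a list of eight 0/1 indicators."""
    if code not in DSSP_CODES:
        raise ValueError(f"unknown DSSP code: {code!r}")
    return [1 if code == c else 0 for c in DSSP_CODES]


# Three C-alpha atoms placed at a right angle, purely for illustration.
angle = pseudo_angle((1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 1.0, 0.0))
strand = one_hot_ss("E")
print(angle, strand)
```

Clamping the cosine before `math.acos` avoids a domain error when three atoms are nearly collinear.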