Imbuing machine learning with a hint of biophysics.
Incorporating prior knowledge into machine-learned models can increase their statistical power. We combined structural and biophysical insights with machine learning to infer a model of protein-protein interactions.
Machine learning in scientific domains presents an opportunity (and challenge) typically absent in non-scientific problems: how best to incorporate rich prior knowledge, sometimes accumulated over decades. Rely too much on (imperfect) priors and little is left to be learned; rely too little and much valuable insight is left unutilized, reducing statistical power. In our recent paper (Cunningham et al., Nat. Meth. 2020), we tackle this problem in the context of protein-protein interactions involving peptide-binding domains (PBDs). PBDs are modular protein domains that bind short (<15 residues) peptidic regions in their interacting partners. Metazoans contain many PBD families (e.g. Src Homology 2 domains), and each family can have hundreds of variants in a given organism that are related in structure and sequence. PBDs underlie a substantial fraction of protein-protein interactions involved in cellular signaling and are often targets for cancer drugs (e.g. receptor tyrosine kinases).
A major challenge in modeling PBD-mediated interactions is the wide variation in available data across PBD families: some families enjoy an abundance of binding data (100,000s of measurements) while others are poorly characterized (<10,000 measurements). Our objective was to inject aspects of PBD structural biology and biophysics into a machine-learned model to facilitate information sharing between data-rich and data-poor families. We did this by structuring our model to reflect two priors that we believed to be true based on our analysis of PBD-peptide structures.
Superposition of PBD-peptide complexes (α-carbon traces) for three families: Src-homology 2 (SH2), Src-homology 3 (SH3) and PDZ. Peptides are colored pink. Domains are colored grey (α-helices are green, β-sheets yellow). Supplementary Fig. 1a in Cunningham et al., Nat. Meth. 2020.
First, when we aligned many structures of a given PBD family, it became readily apparent that most peptides spatially superpose, indicating a common binding mode within each family. This suggested a simple model in which all PBD and peptide sequences are aligned, and the energy terms associated with a PBD-peptide interaction are defined in terms of the residue positions of the alignments. In this way we implicitly assume that an aligned position in the PBD or peptide presents roughly the same steric and chemical environment across the alignment. This model, which we call HSM for Independent Domains (HSM/ID), performs well and is a direct generalization of a previous model we published specific to SH2 domains (AlQuraishi et al., Nat. Genet. 2014). However, it enables information sharing only within a PBD family and not across families, and sharing across families was our initial objective.
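To make the idea of position-defined energy terms concrete, here is a minimal sketch, not the published HSM/ID implementation. It assumes pre-aligned domain and peptide sequences, a pairwise energy tensor `W` indexed by (domain position, peptide position, domain residue, peptide residue), and a logistic link from energy to binding probability. The names `one_hot`, `interaction_energy` and `binding_probability`, the pairwise form of the energy, and the logistic link are all illustrative assumptions.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def one_hot(aligned_seq):
    """Encode an aligned sequence as (length, 20); gaps and unknowns stay all-zero."""
    x = np.zeros((len(aligned_seq), len(AMINO_ACIDS)))
    for pos, aa in enumerate(aligned_seq):
        if aa in AA_INDEX:
            x[pos, AA_INDEX[aa]] = 1.0
    return x

def interaction_energy(domain_seq, peptide_seq, W, bias=0.0):
    """Sum position-specific pairwise energy terms for one domain-peptide pair.

    W has shape (domain_len, peptide_len, 20, 20) and is specific to one
    PBD family (assumed parameterization, for illustration only).
    """
    d = one_hot(domain_seq)   # (domain_len, 20)
    p = one_hot(peptide_seq)  # (peptide_len, 20)
    # Sum over positions i, j and residues a, b of d[i,a] * W[i,j,a,b] * p[j,b].
    return float(np.einsum("ia,ijab,jb->", d, W, p)) + bias

def binding_probability(energy):
    """Assumed logistic link: lower energy means higher binding probability."""
    return 1.0 / (1.0 + np.exp(energy))

# Toy family with 2 aligned domain positions and 2 aligned peptide positions.
W = np.zeros((2, 2, 20, 20))
W[0, 1, AA_INDEX["A"], AA_INDEX["C"]] = -2.0  # one favorable contact
energy = interaction_energy("AG", "KC", W)
prob = binding_probability(energy)
```

Because `W` is indexed by aligned positions, every domain in a family reuses the same parameters, which is how a model of this shape shares information within a family but not across families.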
Energy terms (from AlQuraishi et al., Nat. Genet. 2014) embedded into two dimensions (left) using t-SNE. A color wheel is superimposed over the data (centered at the center of the embedding) and energy terms are colored accordingly. When mapped to aligned positions (right), patterns of similar energy terms (i.e. similar colors) emerge. From Supplementary Figure 2 in Cunningham et al., Nat. Meth. 2020.
To address this shortcoming, we incorporated an observation from our analysis of energy terms learned by the first approach: distinct residue positions, often quite spatially separated, can behave similarly in terms of their energetic profiles (e.g. amino acid preferences). This makes sense! For example, areas that preferentially bind hydrophobic parts of peptides should behave similarly. To incorporate this observation, we changed the HSM/ID formulation so that energy terms are not directly defined in terms of residue positions, but are instead drawn from a pool of global energy potentials, which are assigned (as a weighted mixture) to specific residue positions. This construction implicitly captures the idea that any small protein surface patch can be represented as a mixture of more primitive structural contexts, and it is consistent with other emerging research in the field. Furthermore, it allows energy potentials to be shared across different PBD families, enabling information transfer from data-rich to data-poor PBDs.
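The weighted-mixture construction can be sketched as follows. This is a hedged illustration under stated assumptions rather than the paper's implementation: the pool size `K`, the softmax normalization of the mixture weights, the tensor shapes, and the names `position_energies`, `pool`, `sh2_logits` and `pdz_logits` are all hypothetical.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def position_energies(mixture_logits, potential_pool):
    """Build position-specific energy terms as mixtures of shared potentials.

    mixture_logits: (domain_len, peptide_len, K), specific to one family.
    potential_pool: (K, 20, 20), shared across all families.
    Returns an array of shape (domain_len, peptide_len, 20, 20).
    """
    # Assumption: weights are normalized with a softmax; the text only
    # says "weighted mixture".
    weights = softmax(mixture_logits, axis=-1)
    return np.einsum("ijk,kab->ijab", weights, potential_pool)

rng = np.random.default_rng(0)
K = 4                                    # hypothetical pool size
pool = rng.normal(size=(K, 20, 20))      # global potentials, shared
sh2_logits = rng.normal(size=(5, 3, K))  # family-specific mixture logits
pdz_logits = rng.normal(size=(6, 4, K))

W_sh2 = position_energies(sh2_logits, pool)
W_pdz = position_energies(pdz_logits, pool)
```

In this sketch both families read from the same `pool`, so gradient updates driven by a data-rich family would also refine the potentials used by a data-poor one; only the comparatively small mixture weights remain family-specific.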
We used the model to gain structural insights into the basis of PBD-peptide binding across a wide range of families, as well as into how the topology of PBD-containing proteins is organized in human signaling networks. We are most excited, however, by what this model enables in terms of investigating the signaling logic of protein complexes and how these complexes are rewired in common human diseases. We look forward to investigating these questions in the future.