Quality vs quantity in proteome-wide cross-linking mass spectrometry
We systematically demonstrate that the existing validation approach for proteome-wide cross-linking mass spectrometry suffers from several limitations. Additionally, we propose a set of four reliable quality assessment metrics to address this issue.
Protein-protein interactions play central roles in almost all biological functions. It is also well known that the three-dimensional structural features of proteins are among the determining factors of their functions. Hence, charting all physiological protein-protein interactions and their structural features provides a wealth of crucial information for understanding the functional landscape of proteomes.
Cross-linking mass spectrometry (XL-MS) is a powerful and increasingly popular tool to identify novel interactions and to elucidate structural dynamics on a proteome scale. In addition to the primary quality estimate, the false discovery rate (FDR), complementary validation procedures are necessary to confirm the quality of identified cross-links. Robust quality control is critical for ensuring the data quality of any large-scale study.
Typically, researchers have relied on a structure-based validation approach. Specifically, the identified cross-links are mapped to the three-dimensional (3D) structure of a representative protein complex (such as the ribosome or proteasome). Then, among the cross-links that can be mapped to the structure, the fraction that falls within the maximum distance constraint of the cross-linker (potential true positives) is reported.
While such a validation approach might be useful for studies performed on purified proteins and protein complexes (where the cross-link search is performed against those specific sequence databases), we note that it drastically underestimates false positives in proteome-wide XL-MS studies (Figure 1).
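The structure-based check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the coordinates, the residue identifiers, and the 30 Å cutoff are hypothetical placeholders, and the actual cutoff depends on the cross-linker used.

```python
import math

# Hypothetical C-alpha coordinates (in angstroms), keyed by (chain, residue number).
# In practice these would be read from a structure file of the reference complex.
CA_COORDS = {
    ("A", 10): (0.0, 0.0, 0.0),
    ("A", 50): (12.0, 5.0, 0.0),
    ("B", 7): (40.0, 0.0, 0.0),
}

# Assumed maximum distance constraint; the real value depends on the cross-linker.
MAX_DISTANCE = 30.0


def fraction_within_cutoff(cross_links, coords, max_distance):
    """Return (fraction of mapped cross-links within the cutoff, number mapped).

    Cross-links with an end that is absent from the structure are skipped,
    which is exactly why this estimate only reflects the mappable subset.
    """
    mapped = [(a, b) for a, b in cross_links if a in coords and b in coords]
    if not mapped:
        return None, 0
    satisfied = sum(
        1 for a, b in mapped if math.dist(coords[a], coords[b]) <= max_distance
    )
    return satisfied / len(mapped), len(mapped)


# Three hypothetical cross-links; the last one has an end outside the structure.
links = [
    (("A", 10), ("A", 50)),
    (("A", 10), ("B", 7)),
    (("A", 10), ("C", 3)),
]
fraction, n_mapped = fraction_within_cutoff(links, CA_COORDS, MAX_DISTANCE)
print(fraction, n_mapped)
```

Note that the third cross-link never enters the calculation, because one of its ends cannot be placed on the structure.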
As a thought experiment, let us consider a reference protein complex structure consisting of 100 subunits. Because a false positive cross-link can be detected between any two random proteins within the proteome (~20,000 proteins for human proteome-wide experiments), for a given false positive with one of its ends mapped to the reference complex, the probability that the second end also maps to this complex by random chance is 5×10⁻³ (100/20,000). This probability would be even lower for the commonly used ribosome and proteasome complexes. However, these probabilities only hold for random peptide pairs (derived from false positive cross-links); true positive cross-links are much more likely to map to existing 3D structures. As a result, the cross-links that can be mapped to a structure are heavily enriched for true positives, and an error rate estimated from them is not representative of the whole dataset. In our paper, we demonstrated this limitation using raw data from three published XL-MS studies, including our own previous study. Our results revealed that the best and worst quality datasets look virtually identical in terms of their error rates estimated using this approach.
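The arithmetic of this thought experiment can be written out as a short Python sketch. The counts of true and false cross-links below are purely hypothetical numbers chosen for illustration, and the sketch makes two simplifying assumptions that are not claims from the paper: every true cross-link maps to the structure and satisfies the distance constraint, and every mapped false positive violates it.

```python
def chance_both_ends_map(n_subunits, proteome_size):
    """Probability that the second end of a random (false) cross-link
    lands in the reference complex, assuming every protein is equally likely."""
    return n_subunits / proteome_size


def apparent_vs_actual_error(n_true, n_false, p_map):
    """Compare the error rate seen on the structure with the underlying one.

    n_true:  true cross-links involving the complex (assumed to all map)
    n_false: false cross-links with one end in the complex
    p_map:   probability that the second end of a false cross-link also maps
    """
    mapped_false = n_false * p_map  # expected false positives on the structure
    apparent = mapped_false / (n_true + mapped_false)
    actual = n_false / (n_true + n_false)
    return apparent, actual


# Values from the thought experiment: 100 subunits, ~20,000 human proteins.
p = chance_both_ends_map(100, 20_000)

# Hypothetical dataset: 200 true and 1000 false cross-links touching the complex.
apparent, actual = apparent_vs_actual_error(n_true=200, n_false=1000, p_map=p)
print(f"p = {p:.4f}, apparent error = {apparent:.1%}, actual error = {actual:.1%}")
```

With these illustrative numbers, only about 5 of the 1000 false positives are expected to map to the structure, so the structure-based estimate comes out at roughly 2% even though about 83% of the cross-links touching the complex are false.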
To address this limitation, we proposed a set of four quality assessment metrics that efficiently differentiate datasets based on their true underlying quality. First, we designed an improved structure-based metric, the fraction of structure-corroborating identifications (FSI), that overcomes the shortcomings of the existing validation approach. Second, our fraction of misidentifications (FMI) metric uses the search space from an unrelated organism to estimate the underlying error rate independently of FDR filtering. Third, the fraction of interprotein cross-links from known interactions (FKI) leverages the knowledge of known interactions to provide a relative quality estimate. Finally, to ensure reliable data quality in proteome-wide XL-MS studies, we strongly suggest employing an orthogonal experimental assay to validate a random subset of the identified novel interactions. From the resulting fraction of validated novel interactions, one can derive an absolute quality estimate for a given XL-MS dataset. Overall, our metrics are based on diverse principles and complement each other well.
Given the broad use of and interest in XL-MS datasets, an unexpectedly high number of false positives will severely hinder the development of the field and the utility of these datasets. Thus, it is of utmost importance to bring this problem to the attention of the field. Going forward, a comprehensive and reliable quality assessment framework such as the one proposed in this work needs to be adopted to aid the rapid advancement of the field.
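To make the idea behind one of these metrics concrete, here is a minimal Python sketch of an FKI-style calculation: the fraction of interprotein cross-links whose protein pair appears in a set of known interactions. The protein identifiers and the known-interaction set are hypothetical, and the exact procedure used in the paper (for example, which interaction databases are used and how cross-links are counted) may differ from this simplification.

```python
def fraction_known_interactions(interprotein_links, known_pairs):
    """Fraction of interprotein cross-links that connect a known interacting pair.

    interprotein_links: iterable of (protein_a, protein_b) with protein_a != protein_b
    known_pairs:        iterable of (protein_a, protein_b) from a reference set
    Pairs are treated as unordered, so (A, B) and (B, A) are equivalent.
    """
    known = {frozenset(pair) for pair in known_pairs}
    links = list(interprotein_links)
    if not links:
        return None
    hits = sum(1 for a, b in links if frozenset((a, b)) in known)
    return hits / len(links)


# Hypothetical identifiers, for illustration only.
links = [("P1", "P2"), ("P2", "P1"), ("P3", "P4"), ("P5", "P6")]
known = [("P1", "P2"), ("P4", "P3")]
print(fraction_known_interactions(links, known))
```

Because the set of known interactions is itself incomplete, a value like this is best read as a relative estimate for comparing datasets rather than as an absolute error rate.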
Link to our paper: https://www.nature.com/articles/s41592-020-0959-9 (Yugandhar et al. Structure-based validation can drastically underestimate error rate in proteome-wide cross-linking mass spectrometry studies. Nat Methods 17, 985–988 (2020))