Cryo-electron microscopy (cryo-EM) is a powerful technique in structural biology for determining three-dimensional (3D) structures of large biological macromolecules. Its ability to resolve large macromolecular assemblies complements conventional structural biology techniques such as X-ray crystallography and nuclear magnetic resonance (NMR), bridging atomic-detail structures of individual molecules and higher-level structural information about molecular machinery in the cell. Biomolecular structures are being determined by cryo-EM at an increasing pace, with over 8,000 entries now accumulated in the public database EMDB.
The structural information that cryo-EM yields for a biomolecule, a density map, often has a resolution too low for researchers to identify the atomic-detail structure of the molecules in the map. Roughly speaking, if the resolution of a map is 3 Angstroms or better, an atomic-detail structure can usually be built with conventional structure modeling tools. For maps at 3-5 Angstroms, tracing the main chain of a protein becomes more difficult and requires specialized software, such as MAINMAST, developed in our group. In the next range, 5-10 Angstroms, which we call intermediate resolution, one can only occasionally detect regularly patterned structures in proteins, such as alpha helices and beta sheets.
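The rough ranges above can be written down as a small lookup. This is only an illustrative sketch: the function name `resolution_regime` and the returned labels are made up for this example, and assigning exactly 3 and 5 Angstroms to the better-resolved range is an assumption, since the boundaries are approximate.

```python
def resolution_regime(resolution_angstrom: float) -> str:
    """Return the rough modeling regime for a map resolution given in Angstroms.

    Thresholds follow the approximate ranges described in the text; the
    handling of the exact boundary values (3 and 5) is an assumption.
    """
    if resolution_angstrom <= 0:
        raise ValueError("resolution must be a positive number of Angstroms")
    if resolution_angstrom <= 3.0:
        return "atomic model building with conventional tools"
    if resolution_angstrom <= 5.0:
        return "main-chain tracing with specialized software"
    if resolution_angstrom <= 10.0:
        return "intermediate resolution: secondary structure detection"
    return "outside the ranges discussed here"
```

For example, `resolution_regime(7.5)` falls in the intermediate range, the one our work targets.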
We are an interdisciplinary research group affiliated with both the biology and computer science (CS) departments. Because our lab is physically located in the structural biology building, we have observed numerous successful cryo-EM projects over the past few years. At the same time, our lab has many CS students who are eager to apply recent machine learning methods, particularly deep learning, to important problems. It was therefore natural for us to come up with the idea of applying deep learning to better detect local structures in intermediate-resolution EM maps.
We started applying 3D convolutional neural networks (CNNs) to intermediate-resolution EM maps about two years ago. CNNs were a reasonable choice because they have been very successful in image recognition tasks. From the beginning, CNNs worked very well on computationally simulated EM maps generated from known atomic-detail biomolecular structures, because simulated maps have no noise. Structure detection in experimentally determined cryo-EM density maps was more difficult, because real maps are not as “clean” as simulated ones: they have, for example, uneven local resolution and varying noise levels. To perform well on real maps, we needed to train our CNN on real maps, which itself indicates that simulated and real maps differ in nature. As shown in the paper, our method worked relatively well even for real maps at close to 10 Angstrom resolution.
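How a map is presented to a 3D CNN is not spelled out above. One common approach, assumed here purely for illustration, is to cut the density grid into small cubes and classify each cube on its own. In the sketch below, the name `iter_subvolumes` and the `size` and `stride` parameters are hypothetical placeholders, not the settings used in Emap2Sec.

```python
from typing import Iterator, List, Tuple

Grid = List[List[List[float]]]  # density values indexed as grid[x][y][z]


def iter_subvolumes(
    density: Grid, size: int, stride: int
) -> Iterator[Tuple[Tuple[int, int, int], Grid]]:
    """Yield (origin, cube) pairs by sliding a size^3 window over the map.

    Only cubes that fit entirely inside the grid are produced.
    """
    if size <= 0 or stride <= 0:
        raise ValueError("size and stride must be positive")
    nx, ny, nz = len(density), len(density[0]), len(density[0][0])
    for x in range(0, nx - size + 1, stride):
        for y in range(0, ny - size + 1, stride):
            for z in range(0, nz - size + 1, stride):
                # Slice the window along each axis in turn.
                cube = [
                    [row[z:z + size] for row in plane[y:y + size]]
                    for plane in density[x:x + size]
                ]
                yield (x, y, z), cube
```

Each yielded cube would then be one input sample for the network, with its origin kept so that the prediction can be mapped back onto the grid.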
The architecture of our network is standard for CNNs, except that it chains two networks in sequence. The first is a standard CNN, which captures the density features of local structures. The second takes the raw probability values output by the first and “smooths” the predictions, so that nonphysical local structure assignments, e.g. abrupt changes of secondary structure type, are removed. This architecture is inspired by a classical neural network-based tool for predicting protein secondary structure from sequence (PSI-Pred). In this way, our method inherited knowledge from bioinformatics.
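To make the role of the second stage concrete, here is a minimal sketch of the smoothing idea. It is not our trained network: the learned second network is replaced by a plain average over neighboring voxels, and the label set `CLASSES`, the function name `smooth_predictions`, and the `radius` parameter are all hypothetical.

```python
from typing import Dict, List, Tuple

Coord = Tuple[int, int, int]
CLASSES = ["helix", "sheet", "other"]  # hypothetical label set for this sketch


def smooth_predictions(
    probs: Dict[Coord, List[float]], radius: int = 1
) -> Dict[Coord, str]:
    """Relabel each voxel from the mean class probabilities of its neighborhood.

    `probs` maps a voxel coordinate to the first stage's raw class
    probabilities. Averaging is a stand-in for the learned second stage.
    """
    offsets = range(-radius, radius + 1)
    smoothed: Dict[Coord, str] = {}
    for (x, y, z) in probs:
        totals = [0.0] * len(CLASSES)
        count = 0
        for dx in offsets:
            for dy in offsets:
                for dz in offsets:
                    neighbor = probs.get((x + dx, y + dy, z + dz))
                    if neighbor is None:
                        continue  # neighbor has no first-stage prediction
                    for i, p in enumerate(neighbor):
                        totals[i] += p
                    count += 1
        # count >= 1 because the voxel itself is always in its neighborhood.
        means = [t / count for t in totals]
        smoothed[(x, y, z)] = CLASSES[means.index(max(means))]
    return smoothed
```

With a run of voxels that the first stage labels as helix, a single isolated sheet call in the middle is overruled by its neighbors, which is the kind of abrupt change the second stage is meant to remove.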
This is how our tool, Emap2Sec, was built: it is the outcome of active daily discussions among lab members with different backgrounds and training. We are thankful to all the lab members, and to the interdisciplinary arrangement made possible by the College of Science at Purdue University.