Thousands of proteins in our body may contribute to disease, but one of the most challenging problems is figuring out what drugs can target them. Testing pairs of proteins and drugs in a laboratory setting is time-consuming and expensive, and computational simulations require massive computers and complex computations. “That doesn’t scale to levels where you can scan an entire genome or massive [drug] compound libraries,” said Rohit Singh, a computational biologist at the Massachusetts Institute of Technology.
These challenges motivated Singh and Samuel Sledzieski, a fellow computational biologist at the Massachusetts Institute of Technology, to develop a simpler computational method to predict whether drugs and proteins bind. Their approach, called ConPlex, was recently published in the Proceedings of the National Academy of Sciences.1 Unlike more complicated methods that use 3D protein structure models, ConPlex requires only the sequences of the proteins and simple descriptions of the candidate drugs.
The researchers first fed protein sequences into a protein language model inspired by increasingly common text-generating algorithms such as autofill or ChatGPT.2 “[Text algorithms] are basically just predicting what the next thing should be based on what has come before,” Sledzieski said. “These properties of these algorithms apply really nicely to proteins because they are also a linear chain.” While text algorithms use large amounts of text data to predict the rest of a sentence or answer questions, protein language models use information about millions of protein sequences to identify key features that can predict a protein’s properties.3
Then, the researchers built ConPlex, a machine learning algorithm that can be used by other scientists to predict whether a drug will bind to a protein based on key features extracted by the protein language model and a set of known protein-drug interactions. ConPlex also incorporates information about drugs that are known not to bind to proteins, despite looking similar to drugs that do bind, so that the model can identify subtle features that might promote binding. The researchers found that ConPlex was fast and accurate, even when predicting the binding of new drugs or proteins that the model hadn’t encountered before.
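To make the idea concrete, here is a minimal sketch of how a contrastive approach of this kind might score binding. It is an illustration under stated assumptions, not the ConPlex code: the function names, the toy feature vectors, the linear projection, and the margin value are all hypothetical. The sketch assumes that protein and drug features are each projected into a shared space, that similarity in that space stands in for predicted binding, and that a triplet-style loss penalizes the model when a look-alike non-binder (a decoy) sits as close to the protein as a true binder does.

```python
import math


def project(features, weights):
    """Map a feature vector into the shared space with a linear layer (hypothetical)."""
    return [sum(w * x for w, x in zip(row, features)) for row in weights]


def cosine_similarity(a, b):
    """Cosine similarity between two vectors; used here as the binding score."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def triplet_margin_loss(protein, binder, decoy, margin=0.5):
    """Contrastive (triplet) loss on cosine distance.

    The loss is zero once the true binder is closer to the protein than the
    decoy by at least `margin`; otherwise it grows with the violation.
    """
    d_pos = 1.0 - cosine_similarity(protein, binder)
    d_neg = 1.0 - cosine_similarity(protein, decoy)
    return max(0.0, d_pos - d_neg + margin)


# Toy inputs (made up): in practice the protein features would come from a
# protein language model and the drug features from a chemical descriptor.
protein_features = [0.9, 0.1, 0.4]
binder_features = [0.8, 0.2, 0.5]
decoy_features = [0.7, 0.3, 0.6]  # resembles the binder but does not bind

# Identity projections keep the example easy to follow; a real model would
# learn these weights during training.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
protein_vec = project(protein_features, identity)
binder_vec = project(binder_features, identity)
decoy_vec = project(decoy_features, identity)

binding_score = cosine_similarity(protein_vec, binder_vec)
loss = triplet_margin_loss(protein_vec, binder_vec, decoy_vec)
```

During training, minimizing a loss like this would pull true binders toward their target in the shared space and push similar-looking non-binders away, which is one way a model could learn the subtle features that distinguish the two.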
In future iterations, Singh and Sledzieski hope to incorporate additional elements into the model, such as how multiple drugs might interact and the effect of mutations on drug-target binding. Ozlem Garibay, a computer scientist at the University of Central Florida who was not involved in the study, agreed that more details about the proteins could further improve performance. “Simplicity can be a strength,” she said. “But it may be limiting here because [proteins] are three-dimensional structures.”
The researchers have made ConPlex freely available online for scientists to use to find new drugs that target a protein or to identify existing drugs that can be repurposed to target proteins in other diseases. According to Sledzieski, while ConPlex will not offer the final word on whether a drug will work, it can prioritize promising candidates for further study.
ConPlex may even have a role to play in clinical trials because it can predict potential off-target binding that could lead to unwanted side effects. “The failure rate for drugs [in clinical trials] is very high,” Singh said. “The earlier you can model off-target effects into your computational pipeline, the earlier you can say ‘This drug looks interesting, but it is just not a good idea.’”
References
1. Singh R, Sledzieski S, et al. Contrastive learning in protein language space predicts interactions between drugs and protein targets. PNAS. 120(24), e2220778120 (2023).
2. Brandes N, et al. ProteinBERT: a universal deep-learning model of protein sequence and function. Bioinformatics. 38(8), 2102-2110 (2022).
3. Bepler T, Berger B. Learning the protein language: Evolution, structure, and function. Cell Syst. 12(6), 654-669.e3 (2021).