Scientists investigate disease targets by studying gene expression data, which are often obtained by assaying entire cell populations. For instance, researchers used bulk RNA sequencing to discover druggable cancer-associated protein targets1 and to uncover potential blood-based biomarkers for the early diagnosis of Alzheimer’s disease.2
More recently, scientists have turned to single-cell RNA sequencing (scRNA-seq), which reveals how gene expression varies between individual cells.3 Scientists typically analyze scRNA-seq data using machine learning tools that were each built from scratch to carry out one specific task.
Bo Wang, a computational biologist, and his team of computer scientists and cell biologists at the University of Toronto have built a new artificial intelligence (AI) model called single-cell generative pretrained transformer, or scGPT, which can be fine-tuned to carry out a diverse range of tasks using scRNA-seq data. These tasks include predicting the effects of manipulating specific genes and merging distinct batches of data to reveal otherwise undetectable cell types.
scGPT is a foundational AI tool because the core model can be built upon and tweaked into distinct versions that carry out a range of downstream tasks. The increasingly popular AI known as ChatGPT works much the same way; while the chatbot generates the next words in a sentence, scGPT predicts the expression levels of genes in a cell.
According to Wang, employing a single base model for many downstream tasks is beneficial because using separate computational models for different tasks can introduce inconsistencies when the results of those analyses are compared. Depending on how it was built, each computational approach might make different assumptions about the structure of the same data, and this can lead to less accurate conclusions.
In their recent preprint study, Wang’s team showed that scGPT analyzes scRNA-seq data better than standard approaches.4 They first trained scGPT for four days by feeding the model scRNA-seq data from more than 10.3 million blood and bone marrow cells, spanning more than 50 cell types. This allowed the model to learn fundamental links between the expression of genes within and across cells. Because not all genes are expressed in a given cell, and some genes are expressed at levels undetectable by current sequencing technology, each cell provided information on only a few thousand of the roughly 20,000 genes in the human genome. Taken together, however, the cells exposed the model to nearly all of the genes in the genome.
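The coverage argument above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the per-cell detection count of 3,000 genes is a hypothetical stand-in for "a few thousand," and genes are sampled uniformly at random, which real cells do not do because expression depends on cell type. It is not how scGPT was trained; it only shows why pooling many sparse cells can cover most of the genome.

```python
import random

def detected_genes(n_genes, n_detected, rng):
    """Return the set of gene indices detected in one simulated cell."""
    return set(rng.sample(range(n_genes), n_detected))

def coverage(cells, n_genes):
    """Fraction of all genes detected in at least one of the given cells."""
    seen = set().union(*cells)
    return len(seen) / n_genes

rng = random.Random(0)   # fixed seed so the simulation is repeatable
N_GENES = 20_000         # approximate gene count quoted in the article
N_DETECTED = 3_000       # ASSUMPTION: "a few thousand" genes detected per cell

cells = [detected_genes(N_GENES, N_DETECTED, rng) for _ in range(50)]

single_cell_coverage = coverage(cells[:1], N_GENES)
pooled_coverage = coverage(cells, N_GENES)
print(f"one cell: {single_cell_coverage:.2%}, 50 cells: {pooled_coverage:.2%}")
```

Under these assumptions, a single cell covers 15 percent of the genes, while a few dozen pooled cells already cover nearly all of them. Real data would need far more cells, because many genes are active only in particular cell types.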
One task that the team fine-tuned the foundational model to achieve was merging 10 distinct batches of scRNA-seq data that were previously collected from human immune cells. Using a portion of the data from each batch, they taught the model to categorize the same cell types across the datasets into common clusters. scGPT also learned to adjust for any differences between batches caused by nonbiological factors, such as the day the experiment was carried out or how the cells were collected. Pooling datasets in this way, a process known as batch integration, boosts the amount of data on each cell type, allowing scientists to better detect and characterize rare cell types that could play a role in healthy or disease states.
The researchers then tested how well the fine-tuned version of scGPT and three of the most popular methods for this task merged the remaining, previously unseen data. scGPT grouped cell types from different batches together five percent more effectively than the standard models and corrected for nonbiological effects about as well as the widely used methods.
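To make the idea of a nonbiological batch effect concrete, here is a deliberately simple sketch. It is not scGPT's method, which learns the correction during fine-tuning. Instead it uses per-batch mean centering, a basic stand-in technique that only works when batches contain similar mixes of cell types. The data, batch names, and the +5 offset are all hypothetical.

```python
from statistics import mean

# Hypothetical toy data: (batch, cell_type, expression of one gene).
# Batch "b2" carries an additive technical offset of about +5 on every cell.
cells = [
    ("b1", "T cell", 1.0), ("b1", "T cell", 1.2),
    ("b1", "B cell", 4.0), ("b1", "B cell", 4.2),
    ("b2", "T cell", 6.1), ("b2", "T cell", 5.9),
    ("b2", "B cell", 9.0), ("b2", "B cell", 9.2),
]

def center_per_batch(records):
    """Subtract each batch's mean expression from the cells in that batch."""
    batch_means = {
        b: mean(x for bb, _, x in records if bb == b)
        for b in {b for b, _, _ in records}
    }
    return [(b, t, x - batch_means[b]) for b, t, x in records]

def type_mean(records, batch, cell_type):
    """Mean expression of one cell type within one batch."""
    return mean(x for b, t, x in records if b == batch and t == cell_type)

corrected = center_per_batch(cells)

# How far apart the same cell type sits in the two batches, before and after.
gap_before = abs(type_mean(cells, "b1", "T cell") - type_mean(cells, "b2", "T cell"))
gap_after = abs(type_mean(corrected, "b1", "T cell") - type_mean(corrected, "b2", "T cell"))
print(f"T cell gap between batches: {gap_before:.2f} -> {gap_after:.2f}")
```

Before correction, T cells from the two batches look like different populations. After centering, they nearly overlap, which is the property that lets cells of the same type cluster together across datasets.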
The team also tested how well a separately fine-tuned version of scGPT and a standard model called GEARS predicted the effects of perturbing more than 80 genes, either alone or in pairs, on the activity of other genes.5 By focusing on the expression of the 20 genes that were most affected by each genetic manipulation, Wang and his colleagues found that scGPT came out on top.
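The evaluation described above, scoring predictions only on the genes that a perturbation changes most, can be sketched as follows. The article does not name the error metric, so mean squared error is an assumption here, and all expression values and model names are hypothetical. The toy example uses the top 2 of 6 genes rather than the top 20.

```python
def top_k_affected(control, perturbed, k):
    """Indices of the k genes with the largest absolute expression change."""
    changes = [abs(p - c) for c, p in zip(control, perturbed)]
    return sorted(range(len(changes)), key=lambda i: changes[i], reverse=True)[:k]

def mse_on_genes(predicted, observed, gene_idx):
    """Mean squared error restricted to the selected genes (ASSUMED metric)."""
    return sum((predicted[i] - observed[i]) ** 2 for i in gene_idx) / len(gene_idx)

# Hypothetical expression values for six genes (arbitrary units).
control  = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]   # unperturbed cells
observed = [1.1, 5.0, 3.0, 0.5, 5.1, 6.0]   # measured after the perturbation
model_a  = [1.0, 4.5, 3.0, 1.0, 5.0, 6.0]   # prediction from one hypothetical model
model_b  = [1.0, 2.5, 3.0, 3.5, 5.0, 6.0]   # prediction from another hypothetical model

top = top_k_affected(control, observed, k=2)
score_a = mse_on_genes(model_a, observed, top)
score_b = mse_on_genes(model_b, observed, top)
print(f"top genes: {sorted(top)}, model A error: {score_a:.3f}, model B error: {score_b:.3f}")
```

Restricting the score to the most affected genes keeps the comparison from being dominated by the many genes that barely change, on which almost any model would look accurate.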
“Do these improvements really result in additional biological knowledge? Are they useful in generating new hypotheses?” questioned Ahmed Mahfouz, a computational biologist at Leiden University Medical Center in the Netherlands who was not involved in the study.
While the findings are promising, Mahfouz cautioned that these models have millions of parameters and require a lot of data to train. As a result, they use a lot of energy and have a huge carbon footprint. Given this high energy demand during training, and because researchers will need some familiarity with machine learning to supervise the fine-tuning process, it is unclear how widely used scGPT could become among cell biologists.
Nevertheless, “the fine-tuning is extremely efficient,” said Wang. “For a dataset of let’s say 10,000 or 20,000 cells, you only need five to ten minutes.” The team hopes that this will make scGPT widely accessible. “We have made the code and model available to everyone, and we are working really hard to create educational websites, providing lots of tutorials with concrete examples for every task it can solve,” he said.
Wang’s team plans to continue working on scGPT. While the original version of the model is useful for analyses of bone marrow and immune cells, the team recently released an updated version of scGPT that was trained on 33 million cells including brain, blood, pancreas, lung, heart, kidney, cancer, and gut cells.6
Other foundational models similar to scGPT have also been released recently, and it remains to be seen which, if any, will gain traction in research.7,8,9 Mahfouz thinks that models such as scGPT will likely provide answers to important biological questions in the near future, although only time will tell. “It is an exciting time. By the end of the year, I think you will have a very different picture than what we see now,” said Mahfouz.
References
1. Stransky N, et al. The landscape of kinase fusions in cancer. Nat Commun. 2014;5:4846.
2. Shigemizu D, et al. Identification of potential blood biomarkers for early diagnosis of Alzheimer’s disease through RNA sequencing analysis. Alzheimers Res Ther. 2020;12(1):87.
3. Li X, Wang C. From bulk, single-cell to spatial RNA sequencing. Int J Oral Sci. 2021;13(1):36.
4. Cui H, et al. scGPT: Towards building a foundation model for single-cell multi-omics using generative AI. bioRxiv. 2023.
5. Roohani Y, et al. GEARS: Predicting transcriptional outcomes of novel multi-gene perturbations. bioRxiv. 2022.
6. Cui H, et al. scGPT: Towards building a foundation model for single-cell multi-omics using generative AI. bioRxiv. 2023.
7. Theodoris CV, et al. Transfer learning enables predictions in network biology. Nature. 2023;618:616-624.
8. Yang F, et al. scBERT as a large-scale pretrained deep language model for cell type annotation of single-cell RNA-seq data. Nat Mach Intell. 2022;4:852-866.
9. Shen H, et al. Generative pretraining from large-scale transcriptomes for single-cell deciphering. iScience. 2023;26(5):106536.
Note: This article was updated to better represent Ahmed Mahfouz's profession.