© BRYAN SATALINO

In 2002, a group of plant researchers studying legumes at the Max Planck Institute for Plant Breeding Research in Cologne, Germany, discovered that a 679-nucleotide RNA believed to function in a noncoding capacity was in fact a protein-coding messenger RNA (mRNA).1 It had been classified as a long (or large) noncoding RNA (lncRNA) by virtue of being more than 200 nucleotides in length. The RNA, transcribed from a gene called early nodulin 40 (ENOD40), contained short open reading frames (ORFs), putative protein-coding sequences bookended by start and stop codons, but the ORFs were so short that they had previously been overlooked. When the Cologne collaborators examined the RNA more closely, however, they found that two of the ORFs did indeed encode tiny peptides: one of 12 and one of 24 amino acids. Sampling the legumes confirmed that these micropeptides were made in the plant, where...

Five years later, another ORF-containing mRNA that had been posing as a lncRNA was discovered in Drosophila.2,3 After performing a screen of fly embryos to find lncRNAs, Yuji Kageyama, then of the National Institute for Basic Biology in Okazaki, Japan, suppressed each transcript’s expression. “Only one showed a clear phenotype,” says Kageyama, now at Kobe University. Because embryos missing this particular RNA lacked certain cuticle features, giving them the appearance of smooth rice grains, the researchers named the RNA “polished rice” (pri).

Turning his attention to how the RNA functioned, Kageyama thought he should first rule out the possibility that it encoded proteins. But he couldn’t. “We actually found it was a protein-coding gene,” he says. “It was an accident—we are RNA people!” The pri gene turned out to encode four tiny peptides—three of 11 amino acids and one of 32—that Kageyama and colleagues showed are important for activating a key developmental transcription factor.4

Since then, a handful of other lncRNAs have switched to the mRNA ranks after being found to harbor micropeptide-encoding short ORFs (sORFs)—those less than 300 nucleotides in length. And given the vast number of documented lncRNAs—most of which have no known function—the chance of finding others that contain micropeptide codes seems high.

The hunt for these tiny treasures is now on, but it’s a challenging quest. After all, there are good reasons why these itty-bitty peptides and their codes went unnoticed for so long.

Overlooked ORFs

From the late 1990s into the 21st century, as species after species had their genomes sequenced and deposited in databases, the search for novel genes and their associated mRNAs duly followed. With millions or even billions of nucleotides to sift through, researchers devised computational shortcuts to hunt for canonical gene and mRNA features, such as promoter regions, exon/intron splice sites, and, of course, ORFs.

ORFs can exist in practically any stretch of RNA sequence by chance, but many do not encode actual proteins. Because the chance that an ORF encodes a protein increases with its length, most ORF-finding algorithms had a size cut-off of 300 nucleotides—translating to 100 amino acids. This allowed researchers to “filter out garbage—that is, meaningless ORFs that exist randomly in RNAs,” says Eric Olson of the University of Texas Southwestern Medical Center in Dallas.

Of course, by excluding all ORFs less than 300 nucleotides in length, such algorithms inevitably missed those encoding genuine small peptides. “I’m sure that the people who came up with [the cut-off] understood that this rule would have to miss anything that was shorter than 100 amino acids,” says Nicholas Ingolia of the University of California, Berkeley. “As people applied this rule more and more, they sort of lost track of that caveat.” Essentially, sORFs were thrown out with the computational trash and forgotten.
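The effect of such a cut-off can be illustrated with a minimal sketch of an ORF scanner. This is not any published gene-finding tool: the function name `find_orfs`, the forward-strand-only scan, the choice to report the first ATG in each frame, and the decision to measure length without the stop codon are all assumptions made for illustration. Only the 300-nucleotide threshold comes from the text above.

```python
# Hypothetical sketch: scan the forward strand of a DNA sequence for ORFs
# and apply a minimum-length cut-off, as early ORF finders are described as doing.

STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_nt=300):
    """Return (start, end, coding_nt) for each ATG...stop ORF in the three forward frames.

    start is the index of the A in ATG, end is one past the stop codon,
    and coding_nt is the ORF length excluding the stop codon (an assumption).
    ORFs with coding_nt below min_nt are discarded.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None  # position of the first ATG seen since the last stop
        for pos in range(frame, len(seq) - 2, 3):
            codon = seq[pos:pos + 3]
            if codon == "ATG" and start is None:
                start = pos
            elif codon in STOP_CODONS and start is not None:
                coding_nt = pos - start
                if coding_nt >= min_nt:
                    orfs.append((start, pos + 3, coding_nt))
                start = None
    return orfs


# A toy 11-codon sORF (33 coding nucleotides) and a toy 101-codon ORF.
short_orf = "CC" + "ATG" + "GCT" * 10 + "TAA" + "G"
long_orf = "ATG" + "GCT" * 100 + "TGA"

print(find_orfs(short_orf))            # default 300-nt cut-off
print(find_orfs(short_orf, min_nt=0))  # no cut-off
print(find_orfs(long_orf))
```

With the default cut-off the toy sORF is silently dropped, while the same call with `min_nt=0` reports it. That is the trade-off Ingolia describes: the filter removes chance ORFs and genuine short ones alike.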

Aside from statistical practicality and human oversight, there were also technical reasons that contributed to sORFs and their encoded micropeptides being missed. Because of their small size, sORFs in model organisms such as mice, flies, and fish are less likely to be hit in random mutagenesis screens than larger ORFs, meaning their functions are less likely to be revealed. Also, many important proteins are identified based on their conservation across species, says Andrea Pauli of the Research Institute of Molecular Pathology in Vienna, but “the shorter [the ORF], the harder it gets to find and align this region to other genomes and to know that this is actually conserved.”

As for the proteins themselves, the standard practice of using electrophoresis to separate peptides by size often meant micropeptides would be lost, notes Doug Anderson, a postdoc in Olson’s lab. “A lot of times we run the smaller things off the bottom of our gels,” he says. Standard protein mass spectrometry was also problematic for identifying small peptides, says Gerben Menschaert of Ghent University in Belgium, because “there is a washout step in the protocol so that only larger proteins are retained.”

But as researchers take a deeper dive into the function of the thousands of lncRNAs believed to exist in genomes, they continue to uncover surprise micropeptides. In February 2014, for example, Pauli, then a postdoc in Alex Schier’s lab at Harvard University, discovered a hidden code in a zebrafish lncRNA. She had been hunting for lncRNAs involved in zebrafish development because “we hadn’t really anticipated that there would be any coding regions out there that had not been discovered—at least not something that is essential,” she says. But one lncRNA she identified actually encoded a 58-amino-acid micropeptide, which she called Toddler, that functioned as a signaling protein necessary for cell movements that shape the early embryo.5

Then, last year, Anderson and his colleagues reported another. Since joining Olson’s lab in 2010, Anderson had been searching for lncRNAs expressed in the heart and skeletal muscles of mouse embryos. He discovered a number of candidates, but one stood out for its high level of sequence conservation, suggesting to Anderson that it might have an important function. He was right that the RNA was important, but for a reason that neither Anderson nor Olson had considered: it was in fact an mRNA encoding a 46-amino-acid micropeptide.6

“When we zeroed in on the conserved region [of the gene], Doug found that it began with an ATG [start] codon and it terminated with a stop codon,” Olson says. “That’s when he looked at whether it might encode a peptide and found that indeed it did.” The researchers dubbed the peptide myoregulin, and found that it functioned as a critical calcium pump regulator for muscle relaxation.

With more and more overlooked peptides now being revealed, the big question is how many are left to be discovered. “Were there going to be dozens of [micropeptides]? Were there going to be hundreds, like there are hundreds of microRNAs?” says Ingolia. “We just didn’t know.”

Olson suspects the number is quite large. The fact that “myoregulin went below the radar screen for all these years . . . really told us that there’s likely to be a gold mine of undiscovered micropeptides out there,” he says. “So we are aggressively mining that right now.”

Hunting for hidden peptides

In the mid-2000s, Menschaert was working on mass spectrometry protocols to enrich small peptides, which at that time were believed to be cleaved from larger proteins, when he read the papers about the polished rice sORFs. If there is one example of sORF-encoded micropeptides, he thought, there are bound to be others.

FOLLOWING THE CODE: With the advent of genome sequencing technologies, researchers began combing genomes for open reading frames (ORFs). To enrich for genuine protein-coding ORFs and to eliminate those random sequences that by chance were bookended by start and stop codons, most ORF-finding algorithms ignored any stretches shorter than 300 nucleotides. Unfortunately, this also meant that many short ORFs encoding micropeptides were missed. Now, new techniques are helping scientists identify tiny ORFs within what were presumed to be long noncoding RNAs.

To find out if his hunch was correct, Menschaert performed a lot of RNA sequencing to identify sORFs, and a lot of mass spectrometry to find the putative peptides. But it was a slow and painstaking endeavor, as he could only survey a small number of sORFs at a time. Then, in 2009, researchers developed a new, rapid, genome-wide approach called ribosome profiling, which enabled the translation of all ORFs, large and small, to be assessed en masse using next-generation sequencing of ribosome-associated RNA.

The technique was an update of another method called ribosome footprinting, in which researchers would isolate ribosome-associated RNAs, digest them with a nuclease, and then recover and sequence the short fragments of RNA protected from digestion by the bound ribosomes. Mass spec was still required to confirm that the proteins generated from these RNAs actually existed in the cell; even truly noncoding RNAs can sometimes associate with ribosomes by chance. But ribosome footprinting was a straightforward way to identify RNAs that, at the very least, associated with the translation machinery.

Until the past decade of advances in sequencing technology, however, this too was a time-consuming process, says Ingolia. “People had used ribosome footprinting on single, specific messages, but you couldn’t apply it to everything that was going on in a cell.” Then next-gen sequencing was developed, giving researchers the power to “read hundreds of millions of these footprints at once,” says Jonathan Weissman of the University of California, San Francisco.

So he, Ingolia—then a postdoc in his lab—and their colleagues turned ribosome footprinting into ribosome profiling to obtain a global snapshot of translation events across the entire transcriptome. In 2011, the researchers reported that in mouse embryonic stem cells, the majority of lncRNAs transcribed from apparently noncoding regions of the genome were in fact associated with ribosomes.7 “Very early on . . . we could see that we were getting signals outside of the canonical open reading frames,” says Weissman.

“That paper was really a milestone in terms of showing that there is a lot of translation outside of [known] coding regions,” says Pauli.

But just how much is still unclear. While Ingolia and Weissman’s findings could have pointed to a transcriptome littered with micropeptide-encoding sORFs, they also found some fully characterized lncRNAs with well-known nuclear functions to be associated with ribosomes in their analysis. Classical noncoding RNAs such as telomerase RNA, which acts as a template for telomeric DNA replication, for example, and small nuclear RNAs known to be involved in splicing “come up as very highly translated” in ribosome profiling assays, says Caltech’s Mitch Guttman. “That’s what originally clued us in to the fact that . . . this ribosome-occupancy measure is not [always] indicative of real translation.”

Some ORFs may associate with ribosomes as part of translation regulatory mechanisms, Guttman says, or simply as random interactions; these latter associations might even produce small nonfunctional peptides that, it’s thought, would be unstable and thus rapidly degraded. To distinguish ribosome profiles that reflect true translation from those that don’t, Guttman joined forces with Ingolia and Weissman to create a metric, called the ribosome release score, based on the distribution of ribosome-bound fragments recovered from a particular mRNA. When ribosomes translating a genuine ORF come to the stop codon, they are released from the mRNA. Truly translated RNAs, then, should display a greater proportion of ribosome footprint fragments from their coding region than from the downstream untranslated region. “For bona fide peptides, you see a very clear drop [after the stop codon],” says Guttman, “while for classic noncoding RNAs you do not.” (See illustration above.)
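The idea behind the score can be sketched in a few lines. What follows is a simplified illustration, not the published ribosome release score, whose normalization details are not described here. It compares footprint density inside a putative ORF with footprint density downstream of its stop codon. The function name `release_ratio`, the use of footprint start coordinates, and the pseudocount are assumptions.

```python
# Hypothetical sketch of a release-style ratio: footprint density in the ORF
# divided by footprint density in the region downstream of the stop codon.

def release_ratio(footprint_starts, orf_start, orf_end, transcript_len, pseudocount=1.0):
    """Return ORF footprint density divided by downstream footprint density.

    Coordinates are 0-based; orf_end is one past the stop codon.
    The pseudocount (an assumption) avoids division by zero when no
    footprints fall downstream of the stop codon.
    """
    orf_len = orf_end - orf_start
    downstream_len = transcript_len - orf_end
    if orf_len <= 0 or downstream_len <= 0:
        raise ValueError("ORF and downstream region must both have positive length")

    in_orf = sum(orf_start <= p < orf_end for p in footprint_starts)
    downstream = sum(orf_end <= p < transcript_len for p in footprint_starts)

    orf_density = (in_orf + pseudocount) / orf_len
    downstream_density = (downstream + pseudocount) / downstream_len
    return orf_density / downstream_density


# Toy transcript of 400 nt with a putative ORF at positions 100-249.
coding_like = list(range(100, 250, 3)) + [300, 350]  # footprints drop after the stop
noncoding_like = list(range(100, 400, 5))            # footprints continue past the stop

print(release_ratio(coding_like, 100, 250, 400))
print(release_ratio(noncoding_like, 100, 250, 400))
```

In this toy data the coding-like transcript gives a ratio well above 1, reflecting the "very clear drop" after the stop codon, while the transcript with evenly spread footprints gives a ratio of 1.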

Applying this metric to Ingolia and Weissman’s 2011 mouse embryonic stem cell data, the researchers found that the vast majority of intergenic lncRNAs can still be considered noncoding.8 But not all of them. About five percent of supposed lncRNAs have ribosome release scores akin to protein-coding transcripts, says Guttman. “Five percent is a huge number if you think about the fact that there are tens of thousands of lncRNAs,” he says. “[It] still creates a huge number of possible micropeptides. So that’s very interesting and worthy of exploration.”

ALGORITHMS FOR ASSESSING sORF CODING POTENTIAL

Genomes contain countless sORFs, but most do not produce functional proteins. To help identify the true protein-coding needles in the nonsense haystacks, scientists have devised methods and metrics to calculate sORFs’ coding potential based on their sequences and ribosome profiling characteristics.

Ribosome Release Score (RRS): After a ribosome reaches the stop codon of a true protein-coding mRNA, the ribosome’s association with the transcript ceases. The distribution of ribosome-bound fragments for those RNAs would thus show a dramatic reduction following the putative stop codon. (Cell, 154:240-51, 2013)

Fragment Length Organization Similarity Score (FLOSS): This metric distinguishes RNAs that have ribosome profiling fragment sizes clustered tightly in the 30–32 nucleotide range—the size protected by a eukaryotic ribosome—from those that have more varied fragment sizes, which might indicate protection by contaminating nonribosomal proteins. (Cell Rep, 8:1365-79, 2014)

ORF Regression Algorithm for Translation Evaluation of RPFs (ribosome-protected mRNA fragments) (ORF-RATER): This algorithm determines the likelihood that an ORF is translated based on its similarity to known protein-coding ORFs in terms of ribosome-occupancy pattern—that is, the distribution of ribosome profiling fragments across the ORF. For example, true protein-coding ORFs tend to exhibit peaks in the number of fragments at the start and stop codons where ribosomes are built and dismantled, and their fragments show a three-nucleotide periodicity in the expected reading frame—the ribosome appears to jump along three nucleotides (one codon) at a time. (Mol Cell, 60:816-27, 2015)

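Phylogenetic Codon Substitution Frequencies (PhyloCSF): This metric examines conservation of a sORF across species, asking whether the pattern of codon substitutions in aligned sequences resembles that of protein-coding regions. (Bioinformatics, 27:i275-i282, 2011)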
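Two of the footprint-based ideas in the box above, fragment-length clustering and three-nucleotide periodicity, can be sketched as follows. These are illustrations under stated assumptions, not the published FLOSS or ORF-RATER implementations. The length comparison uses half the summed absolute difference between two length distributions, which is one simple choice. The periodicity check assumes footprint positions have already been assigned to a single nucleotide per fragment. All function names are hypothetical.

```python
# Hypothetical sketches: (1) how far a transcript's footprint-length distribution
# departs from a reference set, and (2) how footprints distribute across the
# three reading frames of a putative ORF.

from collections import Counter


def length_distribution(lengths):
    """Return the fraction of fragments at each length."""
    if not lengths:
        raise ValueError("need at least one fragment length")
    counts = Counter(lengths)
    total = sum(counts.values())
    return {length: n / total for length, n in counts.items()}


def length_dissimilarity(lengths, reference_lengths):
    """Return 0.0 for identical length distributions and 1.0 for disjoint ones."""
    f = length_distribution(lengths)
    ref = length_distribution(reference_lengths)
    all_lengths = set(f) | set(ref)
    return 0.5 * sum(abs(f.get(n, 0.0) - ref.get(n, 0.0)) for n in all_lengths)


def frame_fractions(positions, orf_start, orf_end):
    """Return the fraction of in-ORF footprint positions falling in frames 0, 1 and 2."""
    counts = [0, 0, 0]
    for p in positions:
        if orf_start <= p < orf_end:
            counts[(p - orf_start) % 3] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0, 0.0, 0.0]


# Reference fragments cluster at 30-32 nt, the range given in the box above.
reference = [30] * 40 + [31] * 40 + [32] * 20
tight = [30] * 8 + [31] * 8 + [32] * 4                # ribosome-like
scattered = [22] * 5 + [27] * 5 + [36] * 5 + [41] * 5  # more varied sizes

print(length_dissimilarity(tight, reference))
print(length_dissimilarity(scattered, reference))
print(frame_fractions(list(range(100, 160, 3)), 100, 160))
```

A low dissimilarity and a strong bias toward one reading frame are the kinds of signals these metrics look for. Neither alone shows that a peptide is made, which is why mass spectrometry is still needed for confirmation.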

To aid in the verification of sORF translation and the identification of the micropeptides produced, new metrics and algorithms (based on ribosome footprint patterns, sequence conservation, synonymous mutation frequency, and other features) continue to be developed.9,10 (See the box above.) And in a study published online last November, Menschaert and colleagues established a searchable sORF database called sORFs.org with the aim of accumulating and centralizing data on sORFs and their translation potential.11

For now, the researchers have included all sORFs identified during ribosome profiling studies in mice, Drosophila, and humans—“with no filtering whatsoever,” says Menschaert. “The idea was to include everything.” All told, the database currently contains a whopping 266,342 sORFs, but screening with assorted metrics can narrow down this vast list. Stringent filtering of the human sORFs, for example, reduces the list to about 400 or so strong candidates, says Menschaert, who is systematically performing mass spectrometry experiments to determine whether the putative peptides are actually expressed in cells.

Once a new micropeptide is identified, it’s back to the molecular biology bench to interrogate its function. “That’s the slow bit,” says Menschaert. But several of the scientists interviewed for this article indicated that they have new micropeptides in their sights. In January of this year, for example, Olson and his colleagues reported their discovery of a second lncRNA-concealed muscle-specific micropeptide—a 34-amino-acid peptide they named dwarf open reading frame (DWORF).12 The team found evidence that DWORF acts as a regulator of muscle contractility, is abundantly expressed in the mouse heart, and is suppressed in ischemic human heart tissue, suggesting a possible link with heart failure.

Other such small peptides might be immunogenic, says Weissman, who has found that micropeptides encoded by sORFs in a human-infecting cytomegalovirus lncRNA were capable of producing T-cell responses in previously infected patients’ cells.10 “I’m sure there will be some that are important for certain diseases,” agrees Pauli.

And as researchers continue to more carefully comb small snippets of genomes, it’s likely that even more cellular roles for micropeptides will be uncovered. Their diminutive size may have caused these peptides to be overlooked, their sORFs to be buried in statistical noise, and their RNAs to be miscategorized, but it does not prevent them from serving important, often essential functions, as the micropeptides characterized to date demonstrate.

In short, size isn’t everything. Indeed, says Pauli, the only reason researchers haven’t identified more peptide-encoding sORFs to date is “because one just didn’t know that these things existed.” 

NEW PEPTIDES ON THE BLOCK
Species | Gene name | Size of encoded proteins (number of amino acids) | Function | Reference
Soybean (Glycine max) | ENOD40-1 | 12 and 24 | Associate with a subunit of sucrose synthase in root nodules | 1
Fruit fly (Drosophila melanogaster) | Polished rice (pri), also known as tarsal-less (tal) | Three of 11; one of 32 | Activate an essential transcription factor, driving formation of cuticle structures during embryo development | 2,3,4
Fruit fly | pgc | 71 | Prevents phosphorylation of RNA polymerase II in germline progenitor cells | 14
Fruit fly | SclA and SclB | 28 and 29 | Involved in calcium handling in muscle cells | 15
Red flour beetle (Tribolium castaneum) | mlpt | 10, 11, 15, 23 | Ortholog of pri involved in the development of abdominal segments | 16
Zebrafish (Danio rerio) | Toddler | 58 | Activates a G protein–coupled receptor to promote migration of mesendodermal cells in the developing embryo | 5
Mouse (Mus musculus) | myoregulin | 46 | Interacts with and inhibits a calcium pump in muscle cells, interfering with muscle relaxation | 6
Mouse | DWORF | 34 | Interacts with and enhances calcium pump activity in muscle cells | 12

References

  1. H. Röhrig et al., “Soybean ENOD40 encodes two peptides that bind to sucrose synthase,” PNAS, 99:1915-20, 2002.
  2. T. Kondo et al., “Small peptide regulators of actin-based cell morphogenesis encoded by a polycistronic mRNA,” Nature Cell Biol, 9:660-65, 2007.
  3. M.I. Galindo et al., “Peptides encoded by short ORFs control development and define a new eukaryotic gene family,” PLOS Biol, 5:e106, 2007.
  4. T. Kondo et al., “Small peptides switch the transcriptional activity of Shavenbaby during Drosophila embryogenesis,” Science, 329:336-39, 2010.
  5. A. Pauli et al., “Toddler: An embryonic signal that promotes cell movement via Apelin receptors,” Science, 343:1248636, 2014.
  6. D.M. Anderson et al., “A micropeptide encoded by a putative long noncoding RNA regulates muscle performance,” Cell, 160:595-606, 2015.
  7. N.T. Ingolia et al., “Ribosome profiling of mouse embryonic stem cells reveals the complexity and dynamics of mammalian proteomes,” Cell, 147:789-802, 2011.
  8. M. Guttman et al., “Ribosome profiling provides evidence that large noncoding RNAs do not encode proteins,” Cell, 154:240-51, 2013.
  9. N.T. Ingolia et al., “Ribosome profiling reveals pervasive translation outside of annotated protein-coding genes,” Cell Rep, 8:1365-79, 2014.
  10. A.P. Fields et al., “A regression-based analysis of ribosome-profiling data reveals a conserved complexity to mammalian translation,” Mol Cell, 60:816-27, 2015.
  11. V. Olexiouk et al., “sORFs.org: A repository of small ORFs identified by ribosome profiling,” Nucleic Acids Res, 44:D324-29, 2016.
  12. B.R. Nelson et al., “A peptide encoded by a transcript annotated as long noncoding RNA enhances SERCA activity in muscle,” Science, 351:271-75, 2016.
  13. M.F. Lin et al., “PhyloCSF: A comparative genomics method to distinguish protein coding and non-coding regions,” Bioinformatics, 27:i275-i282, 2011.
  14. K. Hanyu-Nakamura et al., “Drosophila Pgc protein inhibits P-TEFb recruitment to chromatin in primordial germ cells,” Nature, 451:730-33, 2008.
  15. E.G. Magny et al., “Conserved regulation of cardiac calcium uptake by peptides encoded in small open reading frames,” Science, 341:1116-20, 2013.
  16. J. Savard et al., “A segmentation gene in Tribolium produces a polycistronic mRNA that codes for multiple conserved peptides,” Cell, 126:559-69, 2006.
