The team released the 15-billion-parameter ESM-2 transformer-based model and a database of protein structure predictions, called the ESM Metagenomic Atlas, on Tuesday. This database includes protein shapes that have not yet been observed by scientists.

Proteins are complex biological molecules built from up to 20 types of amino acids, and they perform all kinds of biological functions in organisms. They typically fold into intricate three-dimensional structures, and that shape is crucial to how they function. Knowing a protein's shape helps scientists understand how it works, and from there find ways to mimic, change, or counter that behavior.

Unfortunately, you can't just take the amino acid formula and immediately work out the final structure. You can run simulations or experiments to figure it out, but that's time consuming. These days, though, you can give properly trained machine-learning software the chemical composition of a protein, and the model will quickly and fairly accurately predict the structure.

Indeed, DeepMind demonstrated as much with its AlphaFold model, which won the biennial international CASP protein-structure-prediction contest in 2020. Given an input sequence of amino acids, AlphaFold and other machine-learning software can generate the corresponding 3D structure. Researchers at London-based DeepMind have since refined their system to predict the structure of more than 200 million proteins known to science.

The latest ESM system from Meta goes even further, predicting hundreds of millions more structures after being trained on millions of protein sequences. A preprint paper from the Meta team – Lin et al – explaining the design of ESM-2 can be found here.

Interestingly enough, according to the researchers, the system is actually a large language model designed to "learn evolutionary patterns and generate accurate end-to-end structure predictions directly from a protein's sequence." AlphaFold, for one, is not a language model and uses a different approach.

As the boffins note in their paper, these large language models can be used for much more than manipulating human languages:

"Modern language models containing tens to hundreds of billions of parameters develop capabilities such as few-shot language translation, common-sense reasoning, and mathematical problem solving, all without explicit supervision.

"These observations raise the possibility that a parallel form of emergence may be exhibited by language models trained on protein sequences."

The result is ESM-2, which, though a language model, has been taught to predict the physical shape of a protein from a text string representing its amino acids. ESM-2 is the largest model of its kind and apparently predicts structures faster than similar systems: it's up to 60 times faster than previous state-of-the-art systems such as AlphaFold and RoseTTAFold, which can take over ten minutes to generate an output, according to Meta.

That speed is what allowed the model to generate the ESM Metagenomic Atlas, predicting over 600 million structures from the MGnify90 protein database in just two weeks on 2,000 GPUs. On a single Nvidia V100 GPU, it takes just 14.2 seconds to predict the structure of a 384-amino-acid protein.

Per the paper, Meta said its system mostly, though not completely, matched AlphaFold on accuracy; its speed is the key advantage, letting it cover far more proteins.
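For those who want to kick the tires, Meta has published the code and model weights on GitHub. The following is a minimal sketch of sequence-in, structure-out inference, assuming the fair-esm Python package from github.com/facebookresearch/esm is installed with its ESMFold extras; the sequence below is an arbitrary placeholder, not one from Meta's examples:

```python
import torch
import esm  # pip install "fair-esm[esmfold]" -- github.com/facebookresearch/esm

# Load the pretrained ESMFold model; weights are fetched on first use.
model = esm.pretrained.esmfold_v1()
model = model.eval()
if torch.cuda.is_available():
    model = model.cuda()  # inference is far faster on a GPU

# An arbitrary placeholder sequence, one letter per amino acid residue.
sequence = "MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG"

# Predict the 3D structure directly from the sequence string.
with torch.no_grad():
    pdb_string = model.infer_pdb(sequence)

# Write the result as a standard PDB file for viewing in, say, PyMOL.
with open("prediction.pdb", "w") as f:
    f.write(pdb_string)
```

On a decent GPU a short sequence like this should come back in seconds rather than minutes, which is the speed advantage Meta is leaning on.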
"With current state-of-the-art computational tools, predicting structures for hundreds of millions of protein sequences in a practical timeframe could take years, even using the resources of a major research institution. To make predictions at the scale of metagenomics, a breakthrough in prediction speed is critical," said the Facebook owner.

Meta hopes ESM-2 and the ESM Metagenomic Atlas will help advance science by aiding researchers studying evolutionary history or tackling disease and climate change.

"To extend this work even further, we are studying how language models can be used to design new proteins and help solve challenges in health, disease, and the environment," the biz concluded. ®