Nicholas Sujecki
Machine Learning Engineer
Protein structure prediction and design play pivotal roles across many scientific and industrial fields, impacting drug discovery, enzyme engineering, and biotechnology. Traditionally, these endeavours have been hindered by the complexity of accurately predicting how amino acid sequences fold into functional three-dimensional structures. This challenge stems from the vast conformational space proteins can adopt and the subtle interactions governing their stability and function. ESM-3 is a new approach that leverages advanced ML techniques to unify sequence, structure and function prediction. If these concepts are unfamiliar to you and you’d like further clarity, there’s a good overview here. Past models have only managed this unification to a limited extent, and have instead had to specialise within a single domain to reach top performance. ESM-3 not only enhances our understanding of protein biology but also holds promise for accelerating the discovery and design of novel proteins with tailored functions.
ESM3 is the newest development in the realm of deep learning and protein science. It stands for “Evolutionary Scale Modeling version 3,” a state-of-the-art language model designed by researchers at EvolutionaryScale to predict and generate protein structures.
I’m sure we’re all familiar enough with evolution as a general concept — yeah, that thing the guy with the crazy hair kept preaching that we’re all related to monkeys or something — well the E in ESM has all this and a little more going for it.
Evolutionary data informs protein design by being baked into the model through its training data. In less vague terms: functionally important parts of proteins are conserved through evolution, so by exposing the model to known evolutionary patterns during training, it learns to recognise and preserve the patterns critical for a protein's function and structure.
In many of your ‘E’-flavoured models, this evolutionary signal comes from Multiple Sequence Alignments (MSAs) of related amino acid sequences, used as training data or as an input at inference time. As with all things there are other ways, and the ESM line takes one of them: earlier models such as ESM-1b and ESM-2 were groundbreaking precisely because they learned these evolutionary patterns implicitly, by training on hundreds of millions of unaligned sequences rather than explicit MSAs. ESM3 has continued the tradition, and stands true as a shining beacon of the E in its name.
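However the evolutionary signal reaches the model, the underlying idea is the same: the positions that matter are the ones evolution refuses to change. Here's a toy sketch of what that conservation looks like in an MSA; the alignment and the column_conservation helper are made up purely for illustration and have nothing to do with ESM's actual data pipeline.

```python
import numpy as np

# Four toy aligned sequences (rows) over ten columns; entirely made up.
msa = np.array([
    list("MKTAYIAKQR"),
    list("MKTAYIAKQL"),
    list("MKSAYIGKQR"),
    list("MKTAYLAKQR"),
])

def column_conservation(msa: np.ndarray) -> np.ndarray:
    """Fraction of sequences sharing the most common residue in each column."""
    n_seqs, n_cols = msa.shape
    scores = np.empty(n_cols)
    for j in range(n_cols):
        _, counts = np.unique(msa[:, j], return_counts=True)
        scores[j] = counts.max() / n_seqs
    return scores

# Columns scoring 1.0 are perfectly conserved: the positions evolution won't touch.
print(column_conservation(msa))
```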
ESM-3 has three separate input tracks for sequence, structure and function, allowing each to be prompted partially or in full. These tracks are processed separately before being combined into a single latent space, not only making the model capable of answering all three tasks, but also granting it a view of the protein domain that is uniquely its own.
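To make that concrete, here's a minimal sketch of the fusion idea, assuming placeholder vocabulary sizes and dimensions rather than ESM-3's real values: each track gets its own embedding, and the embeddings are summed position-wise into one stream for a shared transformer trunk.

```python
import torch
import torch.nn as nn

class MultiTrackEmbedding(nn.Module):
    """Embed sequence / structure / function tokens separately, then fuse per position."""
    def __init__(self, d_model=512, seq_vocab=32, struct_vocab=4096, func_vocab=1024):
        super().__init__()
        self.seq_embed = nn.Embedding(seq_vocab, d_model)
        self.struct_embed = nn.Embedding(struct_vocab, d_model)
        self.func_embed = nn.Embedding(func_vocab, d_model)

    def forward(self, seq_tokens, struct_tokens, func_tokens):
        # One embedding per track, summed position-wise into a single latent stream
        # that a shared transformer trunk can then process.
        return (self.seq_embed(seq_tokens)
                + self.struct_embed(struct_tokens)
                + self.func_embed(func_tokens))

embed = MultiTrackEmbedding()
L = 128  # residues in our imaginary protein
fused = embed(torch.randint(0, 32, (1, L)),
              torch.randint(0, 4096, (1, L)),
              torch.randint(0, 1024, (1, L)))
print(fused.shape)  # torch.Size([1, 128, 512])
```

Summing is just one simple fusion choice; the point is that after this step the trunk no longer cares which track a given piece of information came from.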
ESM3 uses bidirectional transformers as the backbone of its architecture. The advantage of bidirectional attention is that it allows information to flow between all positions in the sequence, both backwards and forwards, so ESM-3 can contextualise each amino acid using both preceding and succeeding residues, giving it a more holistic view of secondary and tertiary structural elements.
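In code terms, "bidirectional" mostly means "no causal mask". A small sketch of the contrast, with toy shapes that are illustrative rather than ESM-3's actual dimensions:

```python
import torch
import torch.nn.functional as F

L, d = 6, 16
q = k = v = torch.randn(1, L, d)

# GPT-style causal attention: position i can only look at positions <= i.
allowed = ~torch.triu(torch.ones(L, L, dtype=torch.bool), diagonal=1)
causal_out = F.scaled_dot_product_attention(q, k, v, attn_mask=allowed)

# Bidirectional attention: no mask at all, so every residue is contextualised
# by both its preceding and succeeding neighbours in a single pass.
bidirectional_out = F.scaled_dot_product_attention(q, k, v)
```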
Attention may be all that you need, but it never hurts to have a little more, right? ESM3 implements a Geometric Attention layer, which is stacked with sequence attention to produce contextual embeddings that account for both sequence context and 3D geometry. In essence, amino acids are represented as sets of three-dimensional coordinates and turned into pairwise distance matrices. Amino acids closer to one another get higher attention scores, allowing the model to selectively attend to spatially proximal residues in the protein structure and their potential interactions.
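Below is a heavily simplified stand-in for that idea, not the actual ESM-3 layer (which also reasons over per-residue orientation frames): it simply biases ordinary attention scores with pairwise Cα distances so that spatially close residues attend to each other more strongly. All names, shapes and the length_scale here are invented for illustration.

```python
import torch
import torch.nn.functional as F

def distance_biased_attention(q, k, v, coords, length_scale=10.0):
    """Ordinary attention plus a bias that favours spatially close residues."""
    dist = torch.cdist(coords, coords)            # (L, L) pairwise Cα distances
    bias = -dist / length_scale                   # nearer pairs get a larger (less negative) bias
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    attn = F.softmax(scores + bias, dim=-1)
    return attn @ v

L, d = 64, 32
q, k, v = (torch.randn(L, d) for _ in range(3))
coords = torch.randn(L, 3) * 10                   # fake backbone coordinates
out = distance_biased_attention(q, k, v, coords)  # (L, d) geometry-aware context vectors
```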
In generation, ESM-3 first defines a backbone and then works from an iterative-refinement approach that alternates between the sequence and structure tracks. This keeps the generated sequence and structure consistent with one another, and reflects the model's joint understanding of the protein across sequence, structure and function.
Where traditional methods treat a protein's structure as a single, continuous input, ESM-3 discretises it into tokens describing small local arrangements of atoms and residues. By breaking structural complexity down into discrete tokens, ESM3 is better able to capture fine-grained details such as atomic interactions, spatial arrangements, and geometric features.
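ESM-3 learns this tokeniser (a VQ-VAE-style autoencoder over local structure); the sketch below is just the nearest-codebook-entry idea at its crudest, with a random codebook and made-up feature sizes, to show how continuous geometry can become discrete tokens a language model can chew on.

```python
import torch

# Assumed sizes: a 4096-entry codebook of 128-d codes. Here the codebook is random;
# in a learned tokeniser, nearby codes would describe similar local geometry.
codebook = torch.randn(4096, 128)

def structure_tokens(local_geometry: torch.Tensor) -> torch.Tensor:
    """Snap each residue's local-geometry feature vector to its nearest codebook entry."""
    # local_geometry: (L, 128), one vector per residue summarising its atomic neighbourhood.
    dists = torch.cdist(local_geometry, codebook)  # (L, 4096)
    return dists.argmin(dim=-1)                    # discrete structure token per residue

features = torch.randn(200, 128)     # fake per-residue geometry features
tokens = structure_tokens(features)  # (200,) integer structure tokens
```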
As mentioned before, the resulting architecture makes ESM-3 capable of sequence, structure and function prediction. It is a generative masked language model and will iteratively sample masked positions until they’re all unmasked.
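Here's one way that decoding loop can look, sketched with a hypothetical model forward pass and a confidence-based unmasking schedule; the exact order and schedule ESM-3 uses may well differ, so treat this as the general pattern rather than the paper's algorithm.

```python
import torch

def iterative_unmask(model, tokens, mask_id, n_steps=8):
    """Start from (partially) masked tokens and commit predictions over several steps."""
    tokens = tokens.clone()
    for step in range(n_steps):
        masked = tokens == mask_id
        if not masked.any():
            break
        logits = model(tokens)             # (L, vocab); `model` is a stand-in forward pass
        conf, pred = logits.softmax(-1).max(-1)
        conf[~masked] = -1.0               # only still-masked positions are candidates
        # Unmask a slice of the remaining positions each step, most confident first.
        n_unmask = max(1, int(masked.sum()) // (n_steps - step))
        idx = conf.topk(n_unmask).indices
        tokens[idx] = pred[idx]
    return tokens
```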
Most paragons in their field, such as AlphaFold, have focused on one output type, in that case structures, to address that pesky old folding problem. It's only more recently, with the use and fine-tuning of Large Language Models (LLMs), that we've seen models able to receive and return a mix of input types. ProGen2 from 2022, for example, can accept sequences with some functional annotations as inputs and return the same. As impressive as this is, it still only combines sequence with limited functional capability. With this in mind, the value of ESM3's flexible 3-in-1 multi-input/multi-output design cannot be overstated.
As an example, the paper cites a fluorescent protein — esmGFP, which our remaining-anonymous head of bio described only as a “really bad example” (I didn’t ask why) — generated by ESM3 with only 58% sequence identity to the nearest known fluorescent protein. While 58% may seem like a pretty unassuming number, the important part here is that a low similarity is good, showcasing ESM3’s ability to plausibly generate ‘distant’ proteins.
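For the unfamiliar, sequence identity is simply the fraction of aligned positions carrying the same residue. A toy version, under the assumptions that the two sequences are already aligned and that gapped columns are ignored (conventions vary, and finding the "nearest known" protein additionally requires a database search such as BLAST):

```python
def sequence_identity(a: str, b: str) -> float:
    """Identity over non-gap columns of an already-aligned pair of sequences."""
    assert len(a) == len(b), "sequences must already be aligned"
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    matches = sum(x == y for x, y in pairs)
    return matches / len(pairs)

# Toy aligned pair: 6 matching residues out of 9 non-gap columns -> ~0.67
print(sequence_identity("MKTAYIAK-R", "MKSAYLGKQR"))
```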
Where previous models have been constrained by the bounds of known protein sequences, ESM3 pushes generative possibilities into unknown territory, opening a vast design space that previous methods haven’t been able to access. This not only has the potential to expand our understanding of protein structure-function relationships but also opens new avenues for designing novel biomolecules with tailored functionalities. Through such capabilities, ESM3 may just provide us with the innovative solutions required for complex biomedical challenges.
You can ignore the upcoming numbers if you aren’t familiar; they’re just to appease the acronym lovers hiding in the wings. ESM-3 is capable of producing high-confidence structure generations (pTM > 0.8, pLDDT > 0.8) that share low sequence and structural similarity with the training set (sequence identity < 20%, TM-score 0.52 ± 0.10). Below is a full table of results for structure prediction, showing that ESM-3 models largely outperform ESMFold while still falling short of AlphaFold2.
ESM3’s generated sequences produce structures with realistic foldability (TM-score 0.52 ± 0.10) that are also confident at the per-residue level (mean pLDDT 0.84), and they come from a diverse range of sequences that share little similarity with each other (mean pairwise sequence identity of 0.155). While not directly compared in the paper, it is worth noting that ESM3’s TM-scores are much lower than some of its predecessors’, such as the aforementioned ProGen2 with a mean TM-score of 0.72. This may, however, stem from ESM3’s tendency to generate new proteins that are ‘distant’ from known ones, which would naturally pull its TM-scores down.
Functionally, ESM3’s generated proteins maintain coordination and active sites with high designability (pTM > 0.8, scTM 0.96 ± 0.04), and the model shows a strong ability to follow complex prompts spanning sequence, structure coordinates, secondary structure (SS8), solvent-accessible surface area (SASA), and function keywords. The table below shows how well ESM-3 predicts each modality, given the others as inputs.
Results, schmesults, amiright? ESM3 may not perform as well across all benchmarks as models specifically tuned for a given task, and it’s hard to expect it to with everything else it does. There’s also the question of size, where outperformance of ESMFold only begins at the price of 7B parameters. Many models tuned to a particular domain, such as ESM-2, are still really, really good at their one task, and they tend to do it with a hell of a lot fewer parameters. Unless there’s a specific need for built-in structure generation and function prediction as well, it makes one wonder whether 98B parameters are justifiable over 650M.
And there is the question of access… At the moment EvolutionaryScale are keeping ESM-3 close to their chests: it’s only available under a non-commercial license, which means hardly anyone can actually use it. With our hands all over ESM-3 we could get a better understanding of its capabilities and limitations, but for now we’ll just have to go with what they’ve given us — it appears to be a broadly capable model across the sequence, structure and function domains, without a particular expertise in any.
Let’s hope that in future our benevolent overlords at EvolutionaryScale change their minds and open their models to the masses. At the end of the day, what have they got to lose?
TL;DR, ESM-3 offers us sequence, structure and function design all unified in a single model. It appears performant across all three domains, yet remains under an elusive non-commercial license.
Only time will tell if the proteins ESM3 makes are a little too distant from those known to us to be practically useful. Has mother nature evolved this way for a reason, or is she simply missing the point? Until ESM3 unlocks the secrets to reverse ageing and cure every disease on Earth, we’ll never know. To ensure you’re as up to date as can be with all that is Machine Learning, make sure to stay tuned, and if you’re interested in any of our work you can find it here.
For more information, or if you think you could’ve done a better job and want to tell me why, you can find me here: nicholas.sujecki@ml6.eu.
Good luck out there.