An AI model that can decode and design living organisms

The Arc Institute’s new Evo 2 model does for DNA what ChatGPT did for words. From the Future of Being Human Substack.

Last week the Arc Institute, in collaboration with NVIDIA and researchers from Stanford, UC Berkeley, and UC San Francisco, published a frontier AI model capable of generating plausible and seemingly functional DNA sequences, much as mainstream generative AI platforms generate text or code. And while it’s still early days for general-purpose AI models that speak the language of DNA, the implications of the new model are profound.

On one level, the emergence of generative AI models that “speak” DNA rather than natural language or computer code seems inevitable. After all, DNA as it appears in biological organisms is just another type of language that connects a sequence of symbols with functional outcomes.

Yet the translation of DNA sequences into outcomes — such as protein synthesis and cellular functions, all the way up to influencing what organisms look like and how they behave — is fiendishly complex. And as a result, it’s largely remained out of reach of large language models that excel with text — until now.

Evo 2, the new model from the Arc Institute, builds on previous work on AI genome models. But the sheer scale of its training set and context window (the amount of information it can work with) places it in a new category of model.

Like a more conventional generative AI model, Evo 2 is, in the words of the just-released paper that describes it, “fundamentally a generative model trained to predict the next base pair in a sequence.”

In other words, just as ChatGPT, DeepSeek, or other models predict the most likely words, sentences and paragraphs that follow the prompt you give them, Evo 2 predicts the most likely sequence of DNA base pairs that follow a DNA “prompt.”
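To make the idea of next-base prediction concrete, here is a deliberately tiny sketch in Python. It is not how Evo 2 works internally: it swaps in a simple counting model that looks only at the previous few bases, and the function names (`train_kmer_model`, `generate`) and the toy training sequence are made up for illustration. What it shares with the real thing is the basic loop of taking a DNA "prompt", predicting the most likely next base, appending it, and repeating.

```python
from collections import Counter, defaultdict


def train_kmer_model(sequences, k=3):
    """Count which base follows each k-base context in the training sequences.

    Hypothetical helper for illustration only; a real genome model learns
    these statistics with a neural network over far longer contexts.
    """
    counts = defaultdict(Counter)
    for seq in sequences:
        for i in range(len(seq) - k):
            context = seq[i:i + k]
            next_base = seq[i + k]
            counts[context][next_base] += 1
    return counts


def generate(counts, prompt, length, k=3):
    """Extend a DNA prompt by repeatedly appending the most frequent next base."""
    seq = prompt
    for _ in range(length):
        context = seq[-k:]
        if context not in counts:
            # The model never saw this context during training, so stop.
            break
        next_base = counts[context].most_common(1)[0][0]
        seq += next_base
    return seq


# Toy training data (an assumption, not real genomic data).
model = train_kmer_model(["ACGTACGTACGT"], k=3)
print(generate(model, "ACG", 5, k=3))
```

With this toy training sequence the model simply continues the repeating pattern it has seen. The interesting part of a model at Evo 2's scale is that the same predict-and-append loop, learned from vastly more data and far longer contexts, can produce sequences that look biologically plausible rather than merely repetitive.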

But while a text-based large language model is trained on billions of pages of (mainly) human-generated text, Evo 2 is trained on trillions of DNA base pairs (9.3 trillion, to be precise) spanning over 128,000 complete genomes covering bacteria, archaea, phages, plants, and other single-celled and multicellular species, including humans …

Andrew Maynard

Director, ASU Future of Being Human initiative