Researchers from AIRI have trained a large language model on DNA. This is not the first time NLP techniques have been applied to this field: models such as DNABERT already existed. But DNABERT was released in 2021, and many LLM tricks have been proposed since then.
For example, sparse attention let the researchers handle far longer inputs than DNABERT, which matters a great deal for this kind of data: DNABERT accepts 512 so-called base pairs, while the new model takes up to 4.5 thousand with the usual attention mechanism and up to 36 thousand with sparse attention.
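To see why that helps, here is a minimal sketch (my own illustration, not the authors' implementation) of a sliding-window sparse attention mask: each token attends only to its local neighbourhood, so the number of attention scores grows linearly with input length instead of quadratically.

```python
# Minimal sketch of a sliding-window sparse attention mask (illustrative only).
# With full attention the score matrix grows quadratically in sequence length;
# with a local window each token keeps only ~2*window+1 scores.
import numpy as np

def local_attention_mask(seq_len: int, window: int) -> np.ndarray:
    """mask[i, j] is True if token i is allowed to attend to token j."""
    idx = np.arange(seq_len)
    return np.abs(idx[:, None] - idx[None, :]) <= window

full = local_attention_mask(seq_len=8, window=8)    # dense: everything attends to everything
sparse = local_attention_mask(seq_len=8, window=2)  # banded: only local context
print(sparse.astype(int))
print("kept scores:", sparse.sum(), "of", full.sum())
```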
Tokenization uses Byte Pair Encoding, a classic of NLP, which turns out to extract meaningful DNA tokens just as it does with text. This is what accounts for the jump in the number of base pairs the model can process: with the same 512-token input that DNABERT uses, the model now covers 4.5 thousand base pairs, because a single token can span several of them.
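As a toy illustration of the idea (an assumed setup built on the Hugging Face `tokenizers` library, not the released tokenizer), one can train BPE directly on DNA strings and watch frequent subsequences merge into single multi-base-pair tokens:

```python
# Toy illustration: train BPE on DNA strings so that frequent subsequences
# merge into single tokens, each covering several base pairs.
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Hypothetical mini-corpus of DNA fragments (real training would use genomes).
dna_corpus = [
    "ATGCGTATATATCGGCTAGCTAGGCTA",
    "TATATATAGCGCGCGCATATGCATGCA",
    "GGCGCGCATTATATATAGGCCTAGCTA",
]

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()  # DNA strings have no spaces, so each stays one "word"
trainer = BpeTrainer(vocab_size=64, special_tokens=["[UNK]", "[CLS]", "[SEP]"])
tokenizer.train_from_iterator(dna_corpus, trainer)

encoded = tokenizer.encode("TATATATAGGCCTAGCTAGCGCGC")
print(encoded.tokens)  # tokens that each span several base pairs
```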
The result: metrics on DNA-specific tasks that are comparable to, and sometimes better than, existing models. The researchers have also open-sourced a whole zoo of models.
If you (like me) have no intuition for what a textual representation of DNA looks like, you should definitely take a look here.