What types of omics data (e.g., genomic, transcriptomic, epigenomic) are being used as inputs for AI models, and what high-throughput sequencing technologies are used to generate them?

19 papers analyzed

Decoding the Genome: A Survey of Omics Data Inputs for AI Models

Created by: Zifeng | Last Updated: January 05, 2026

TL;DR: The field is rapidly converging on transformer-based foundation models that ingest a broadening array of omics data—primarily DNA sequences but increasingly transcriptomics and epigenomics—to predict complex biological functions, with a clear trajectory towards unified, multimodal architectures for a holistic view of genome regulation.

Keywords: #Genomics #FoundationModels #OmicsData #Transformers #MultimodalAI #DeepLearning #Bioinformatics

❓ The Big Questions

The intersection of artificial intelligence and genomics is a hotbed of innovation, driven by a set of fundamental challenges. The research surveyed here collectively seeks to answer several big questions:

  1. How can we translate the diverse languages of biology into a format AI can understand? The genome is not just a string of letters; it is a complex, multi-layered system. A central challenge is representing diverse omics data—from raw DNA sequences and chromatin accessibility profiles (ATAC-seq) to transcriptomic profiles (RNA-seq) and interaction networks—in a way that preserves biological meaning. This has sparked a vibrant debate around tokenization strategies (Dotan et al., 2024; Testagrose & Boucher, 2025) and graph-based representations (Schulte-Sasse et al., 2023).

  2. Which AI architectures are best suited for decoding genomic complexity? While early work leveraged CNNs for local pattern detection, the field has overwhelmingly embraced Transformer architectures for their ability to model long-range dependencies, akin to understanding grammar in language (Consens et al., 2025). However, the conversation is expanding to include Graph Neural Networks (GNNs) for modeling interaction data (Zhang et al., 2022) and novel, more scalable architectures like Hyena to overcome the computational cost of Transformers on genome-length sequences (Consens et al., 2023).

  3. Can we build a single, general-purpose "foundation model" for biology? Inspired by the success of large language models in NLP, a grand ambition has emerged: to create a universal model pre-trained on vast biological data. Papers like Cui et al. (2025) and Zhang et al. (2025) conceptualize and build towards general AI models that integrate multiple omics modalities (DNA, ATAC-seq, RNA-seq) to predict a wide range of biological phenomena, from gene expression to chromatin organization, across different tissues and even species.

  4. How do we move from prediction to genuine biological understanding? A correct prediction is useful, but an interpretable one is revolutionary. A recurring theme is the critical need to open the "black box" of these deep learning models. Researchers are leveraging attention mechanisms and feature importance techniques to pinpoint the specific DNA motifs, regulatory elements, or cellular pathways that drive model predictions, turning AI into a hypothesis-generation engine for experimental biologists (Dalla-Torre et al., 2024; Sanabria et al., 2024).
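To make the interpretability workflow in question 4 concrete, below is a minimal sketch of attention-based position scoring. It uses a single randomly initialised PyTorch attention layer as a stand-in for a trained DNA language model, so the scores here are noise; with a pretrained gLM (e.g., Nucleotide Transformer or GROVER), the same aggregation over attention maps is what lets researchers nominate candidate motifs for follow-up. The vocabulary, embedding size, and head count are illustrative assumptions.

```python
# Toy sketch: ranking positions in a DNA sequence by the attention they receive.
# The "model" is a single randomly initialised attention layer, NOT a trained
# genomic language model, so the ranking is meaningless; the aggregation pattern
# is what a real interpretability analysis would apply to pretrained attention maps.
import torch
import torch.nn as nn

VOCAB = {"A": 0, "C": 1, "G": 2, "T": 3}

def encode(seq: str) -> torch.Tensor:
    """Map a nucleotide string to a (1, L) tensor of token ids."""
    return torch.tensor([[VOCAB[b] for b in seq]])

seq = "ACGTGCGTATAAAAGGCCGGTACGT"          # toy 25-bp input window
ids = encode(seq)                           # shape (1, L)

embed = nn.Embedding(len(VOCAB), 32)        # per-nucleotide embeddings (toy size)
attn = nn.MultiheadAttention(embed_dim=32, num_heads=4, batch_first=True)

x = embed(ids)                              # (1, L, 32)
_, weights = attn(x, x, x, need_weights=True, average_attn_weights=True)

# Sum the attention each position receives across all query positions (and heads);
# in a trained model, high-scoring positions are candidate motif hypotheses.
position_scores = weights[0].sum(dim=0)     # (L,)
top = torch.topk(position_scores, k=5).indices.tolist()
print("Top-attended positions:", sorted(top))
```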

🔬 The Ecosystem

The rapid progress in this field is driven by a dynamic ecosystem of academic labs, research institutes, and industry players. The list below is not exhaustive, but the surveyed papers highlight several key contributors:

  • Key Researchers & Institutions:

    • DeepMind (Google): A major force pushing the boundaries of large-scale models with projects like AlphaGenome (Avsec & Latysheva, 2025), which builds on the group's AlphaFold legacy to predict regulatory effects from long DNA sequences of up to one megabase.
    • Instadeep/BioNTech: Their collaboration produced the Nucleotide Transformer (Dalla-Torre et al., 2024), a foundational work in benchmarking DNA language models and demonstrating the power of pre-training on diverse species.
    • Bo Wang (University of Toronto) & Fabian J. Theis (Helmholtz Munich): Their groups are at the forefront of reviewing and conceptualizing the application of LLMs to genomics, as seen in the comprehensive review "To Transformers and Beyond" (Consens et al., 2023).
    • Academic Consortia: Many of the models are built upon massive public datasets generated by consortia like ENCODE, TCGA, GTEx, and the Human Cell Atlas, which provide the essential ground-truth labels for training and validation (Chen & Gao, 2024).
  • Pivotal Papers & Concepts:

    • The Rise of Foundation Models: A clear trend is the shift towards pre-trained foundation models. Papers like GROVER (Sanabria et al., 2024) and Nucleotide Transformer (Dalla-Torre et al., 2024) show that models trained on raw DNA sequences can learn fundamental "grammatical rules" of the genome.
    • The Push for Multimodality: The conceptualization of a "Super Transformer" (Cui et al., 2025) and the development of models like EMGNN (Schulte-Sasse et al., 2023) and the general AI model by Zhang et al. (2025) epitomize the field's move towards integrating disparate data types (e.g., DNA sequence, chromatin accessibility, gene expression) for a more holistic understanding; a toy dual-input sketch appears after this list.
    • The Importance of Representation: The work on tokenization (Dotan et al., 2024; Testagrose & Boucher, 2025) highlights that how data is fed to the model is as important as the model architecture itself, forming a critical sub-field of research.
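To ground the multimodality trend, the sketch below shows a toy two-branch network that consumes a one-hot DNA window alongside a matched ATAC-seq coverage track and fuses them to predict a single scalar. The branch design, layer sizes, and regression head are illustrative assumptions; this is not the architecture of Cui et al. (2025) or Zhang et al. (2025), only the dual-input pattern those works share.

```python
# Minimal two-branch sketch: one-hot DNA sequence + ATAC-seq coverage as inputs,
# fused to predict a single scalar (e.g., expression of a nearby gene).
# All hyperparameters are illustrative toys; real multimodal genomic models are far larger.
import torch
import torch.nn as nn

class DualInputModel(nn.Module):
    def __init__(self, channels: int = 32):
        super().__init__()
        # Sequence branch: 1D convolutions over 4-channel one-hot DNA.
        self.seq_branch = nn.Sequential(
            nn.Conv1d(4, channels, kernel_size=15, padding=7), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),
        )
        # Accessibility branch: 1D convolutions over a single ATAC-seq coverage track.
        self.atac_branch = nn.Sequential(
            nn.Conv1d(1, channels, kernel_size=15, padding=7), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),
        )
        # Fusion head: concatenate both embeddings and regress one value.
        self.head = nn.Sequential(nn.Linear(2 * channels, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, dna_onehot: torch.Tensor, atac_track: torch.Tensor) -> torch.Tensor:
        z = torch.cat([self.seq_branch(dna_onehot), self.atac_branch(atac_track)], dim=-1)
        return self.head(z)

model = DualInputModel()
dna = torch.randint(0, 2, (8, 4, 1000)).float()   # batch of 8 binary 1-kb windows (toy, not strict one-hot)
atac = torch.rand(8, 1, 1000)                      # matched ATAC-seq coverage (toy)
print(model(dna, atac).shape)                      # torch.Size([8, 1])
```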

🎯 Who Should Care & Why

The implications of this research extend far beyond the computational biology community.

  • Computational Biologists & Bioinformaticians: This survey is a direct roadmap to the state-of-the-art. It provides an overview of dominant architectures (Transformers, GNNs), critical data representation techniques (tokenization, graphs), and emerging challenges (scalability, interpretability). It is essential for anyone developing new computational tools for genomic analysis.

  • Molecular & Cellular Biologists: AI models are becoming powerful "in silico" laboratories. Instead of costly and time-consuming experiments, biologists can use models like AlphaGenome to predict the functional impact of a genetic variant or use EMGNN to prioritize candidate cancer genes for further study. This accelerates the cycle of hypothesis generation and experimental validation (a toy sketch of the ref-vs-alt scoring workflow follows this list).

  • Clinicians & Precision Medicine Researchers: The integration of multi-modal AI holds the key to true personalized medicine. As discussed by Zhuang (2025), these models can integrate a patient's genomic data with clinical records and imaging to predict disease risk, stratify patients for clinical trials, and recommend optimal treatments.

  • AI & Machine Learning Researchers: Genomics presents a unique and formidable challenge for AI. The sheer scale of the data (billions of base pairs), the need for long-range dependency modeling, the inherent multimodality, and the high stakes of interpretability make it a fertile ground for developing novel architectures, self-supervised learning techniques, and explainability methods that can have an impact beyond biology.
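Picking up the biologist bullet above, here is a minimal sketch of the ref-vs-alt scoring pattern that underlies in silico variant-effect prediction. The scorer is an untrained toy CNN, so the delta it prints is meaningless; the workflow itself (encode both alleles, run the same model, compare outputs) is what tools like AlphaGenome perform at far greater scale with trained weights.

```python
# Toy variant-effect scoring: run reference and alternate sequences through the
# same model and report the change in predicted activity. The model is untrained,
# so the numbers are placeholders; only the workflow is the point.
import torch
import torch.nn as nn

BASES = "ACGT"

def one_hot(seq: str) -> torch.Tensor:
    """(1, 4, L) one-hot encoding of a DNA string."""
    x = torch.zeros(1, 4, len(seq))
    for i, b in enumerate(seq):
        x[0, BASES.index(b), i] = 1.0
    return x

scorer = nn.Sequential(                      # stand-in for a trained sequence model
    nn.Conv1d(4, 16, kernel_size=8), nn.ReLU(),
    nn.AdaptiveMaxPool1d(1), nn.Flatten(), nn.Linear(16, 1),
)

ref = "ACGTTTGACCGGATATAAGGCCGT"             # toy reference window around the variant
pos, alt_base = 10, "A"                      # toy variant: substitute an A at position 10
alt = ref[:pos] + alt_base + ref[pos + 1:]

with torch.no_grad():
    delta = scorer(one_hot(alt)) - scorer(one_hot(ref))
print(f"Predicted effect of {ref[pos]}>{alt_base} at position {pos}: {delta.item():+.4f}")
```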

✍️ My Take

This body of work paints a clear picture of a field in the midst of a paradigm shift. We are moving away from bespoke, single-task models and hurtling towards a future dominated by large-scale, pre-trained genomic foundation models. The central thesis is that DNA, and the regulatory layers that control it, can be treated as a complex language. By applying architectures born from NLP, we can learn the grammar and syntax of this language to predict its meaning—function.

Key Patterns & Debates:

  1. From Unimodal to Multimodal: The most significant trend is the acknowledgment that DNA sequence alone is insufficient. True biological function arises from the interplay between the static genome and its dynamic epigenomic and transcriptomic states. The frontier is no longer just reading the DNA sequence but integrating it with data on chromatin accessibility (ATAC-seq), gene expression (RNA-seq), and 3D structure (Hi-C). Early models took the DNA sequence as their only input; leading-edge models like those proposed by Cui et al. (2025) and Zhang et al. (2025) explicitly use ATAC-seq and DNA sequence as dual inputs to predict a multitude of outputs, a far more sophisticated approach.

  2. The Architecture Wars: While the Transformer is the current champion, its reign is not absolute. Its quadratic complexity makes it computationally expensive for chromosome-scale analysis. This has opened the door to competitors like GNNs, which excel at modeling the explicit relationships in protein-protein or gene-gene interaction networks (Schulte-Sasse et al., 2023; Zhang et al., 2022). Furthermore, novel architectures like Hyena are being explored as more efficient alternatives for capturing long-range dependencies (Consens et al., 2023). The future is likely hybrid, combining the strengths of different architectures for different data types and scales.

  3. The Tokenization Dilemma: An undercurrent in the literature is the critical importance of data representation. How do you tokenize a genome? Simple k-mers? Biologically-informed units like codons? Or data-driven subwords learned via Byte-Pair Encoding (BPE)? As Dotan et al. (2024) and Testagrose & Boucher (2025) demonstrate, this choice has profound impacts on model performance, efficiency, and interpretability. There is no one-size-fits-all answer, and developing biologically-aware tokenization strategies is a key open problem.
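To illustrate the two ends of the tokenization spectrum in point 3, the sketch below contrasts fixed, non-overlapping k-mers with a tiny byte-pair-encoding loop trained on a toy corpus. Real gLM vocabularies (e.g., GROVER's BPE tokens or the Nucleotide Transformer's 6-mers) are learned from genome-scale data; everything here is deliberately small and illustrative.

```python
# Two toy tokenizers for DNA: fixed k-mers vs. a minimal byte-pair-encoding (BPE)
# trainer that greedily merges the most frequent adjacent pair. Illustrative only.
from collections import Counter

def kmer_tokenize(seq: str, k: int = 6) -> list[str]:
    """Non-overlapping k-mer tokenization, a common genome-LM baseline (drops any trailing partial k-mer)."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, k)]

def train_bpe(seqs: list[str], num_merges: int = 10) -> list[list[str]]:
    """Greedy BPE over single nucleotides; returns the merged (tokenized) corpus."""
    corpus = [list(s) for s in seqs]
    for _ in range(num_merges):
        pairs = Counter(p for toks in corpus for p in zip(toks, toks[1:]))
        if not pairs:
            break
        a, b = pairs.most_common(1)[0][0]        # most frequent adjacent pair
        for toks in corpus:
            merged, i = [], 0
            while i < len(toks):
                if i + 1 < len(toks) and (toks[i], toks[i + 1]) == (a, b):
                    merged.append(a + b)
                    i += 2
                else:
                    merged.append(toks[i])
                    i += 1
            toks[:] = merged                     # apply the merge in place
    return corpus

seqs = ["ACGTACGTTATAAAGGCCGGTT", "TATAAAGGCCACGTACGTACGT"]
print(kmer_tokenize(seqs[0]))                    # fixed-size units
print(train_bpe(seqs, num_merges=8)[0])          # data-driven, variable-size units
```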

Future Directions:

  • True Scalability: The next milestone is a model that can process an entire human chromosome, or even a whole genome, in a single pass. This will require architectural innovations beyond the standard Transformer to handle sequences of billions of units (see the back-of-envelope calculation after this list).
  • Causal Inference: Current models are masters of correlation. The next frontier is to imbue them with a sense of causality. Can we design models and training regimes that allow us to ask "what if" questions and predict the outcome of specific interventions (e.g., the precise effect of a CRISPR-based edit)?
  • Standardization and Benchmarking: As models become more complex and multimodal, comparing them becomes difficult. The field desperately needs standardized, large-scale benchmark datasets and evaluation protocols, following the example of Dalla-Torre et al. (2024), to ensure reproducible progress.
  • From Lab to Clinic: For these models to impact human health, we must address the "last mile" challenges: ensuring model robustness and fairness, navigating data privacy and regulatory hurdles, and developing truly interpretable outputs that a clinician can trust and act upon (Zhuang, 2025; Lim, 2025).
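The scalability point above can be made concrete with a back-of-envelope calculation, assuming one token per base and fp16 storage of a single dense attention matrix. Both assumptions are illustrative simplifications, but they show why sub-quadratic architectures are a prerequisite for chromosome-scale inputs.

```python
# Back-of-envelope: memory for one dense self-attention matrix at genomic lengths,
# assuming one token per base and 2 bytes (fp16) per attention entry.
lengths = {
    "typical gLM context (6 kb)": 6_000,
    "1 Mb regulatory window": 1_000_000,
    "human chromosome 1 (~248 Mb)": 248_000_000,
}
for name, n in lengths.items():
    gib = n * n * 2 / 2**30          # n^2 entries, 2 bytes each
    print(f"{name:<30} {gib:,.1f} GiB")
```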

In conclusion, the era of AI in genomics is just beginning. The papers surveyed here lay the groundwork for a future where AI models act as indispensable partners in our quest to understand the blueprint of life.

📚 The Reference List

| Paper | Author(s) | Year | Data Used | Method Highlight | Core Contribution |
|---|---|---|---|---|---|
| Multimodal foundation transformer models for multiscale genomics | Cui, H., Khan, S. A., Tegner, J., et al. | 2025 | Genomic sequence, scRNA-seq, spatial transcriptomics, ATAC-seq | Multimodal Transformer | Proposes a modular "Super Transformer" architecture to integrate heterogeneous omics data. |
| Transformers and genome language models | Consens, M. E., Dufault, C., Wainberg, M., et al. | 2025 | DNA sequence, ChIP-seq, ATAC-seq, Hi-C | Transformer (gLMs) | Reviews the application of transformer-based genome language models (gLMs) in genomics. |
| Foundation models for bioinformatics: recent advances and future perspectives | Chen, Z. J., & Gao, G. | 2024 | Genomics, transcriptomics, proteomics, single-cell data | Foundation models (Transformers) | Reviews recent developments in foundation models (FMs) applied across bioinformatics. |
| Benchmarking DNA foundation models for genomic and genetic tasks | Dalla-Torre, H., et al. | 2024 | Whole-genome DNA sequences (human and other species) | Transformer (Nucleotide Transformer) | Presents an extensive benchmark of large-scale DNA foundation models for diverse genomic tasks. |
| A comprehensive review of foundation models in bioinformatics: applications, challenges, and future directions | Lee et al. | 2023 | DNA/RNA/protein sequences, scRNA-seq, knowledge graphs | Foundation models (Transformers) | Explores the broad application of foundation models (FMs) in bioinformatics. |
| To Transformers and Beyond: Large Language Models for the Genome | Consens, M. E., et al. | 2023 | DNA sequence, ATAC-seq, Hi-C, RNA-seq | Transformers, Hyena layers | Reviews the application of LLMs to genomics, including emerging architectures beyond transformers. |
| Graph neural network approaches for drug-target interactions | Zhang, Z., et al. | 2022 | Molecular structures, protein interaction networks | Graph Neural Network (GNN) | Reviews the application of GNNs in predicting drug-target interactions using graph data. |
| Integration of multiomics data with graph convolutional networks to identify new cancer genes and their associated molecular mechanisms | Schulte-Sasse, R., et al. | 2023 | Multi-omics (SNV, CNA, methylation, expression), gene networks | Multilayer Graph Neural Network (EMGNN) | Introduces EMGNN, a GNN that integrates multiple gene networks and multi-omics data for cancer gene prediction. |
| Gene expression prediction based on neighbour connection neural network utilizing gene interaction graphs | Li, X., et al. | 2023 | Gene expression (microarray, RNA-seq), gene interaction graphs | Neighbour Connection Neural Network (NCNN) | Introduces NCNN, a GNN variant that uses gene interaction graphs to improve gene expression prediction. |
| Effect of tokenization on transformers for biological sequences | Dotan, E., et al. | 2024 | Protein and nucleotide sequences | Transformer (BERT-based) | Investigates how different tokenization algorithms impact transformer model performance on biological sequences. |
| Tokenization and deep learning architectures in genomics: A comprehensive review | Testagrose, C., & Boucher, C. | 2025 | DNA sequencing data | Survey (Transformers, CNNs, hybrids) | Surveys current deep learning models and tokenization strategies applied to genomic data. |
| Developing a general AI model for integrating diverse genomic modalities and comprehensive genomic knowledge | Zhang, Z., et al. | 2025 | DNA sequence, ATAC-seq | Multi-task CNN-Transformer | Presents a multi-task AI model integrating genomic and epigenomic data to predict multiple modalities. |
| DNA language model GROVER learns sequence context in the human genome | Sanabria, M., et al. | 2024 | Human genome sequence (hg19) | Transformer (BERT-like) | Introduces GROVER, a foundation model trained on the human genome using byte-pair encoding. |
| Genome language modeling (GLM): a beginner's cheat sheet | Tyagi, N., et al. | 2025 | DNA/genomic sequences | Transformer (GLMs) | Provides an overview and guide to applying NLP techniques for genome language modeling. |
| AlphaGenome: AI for better understanding the genome | Avsec, Ž., & Latysheva, N. | 2025 | Long DNA sequences (up to 1 Mb) | CNN-Transformer hybrid | Introduces AlphaGenome, an AI model for predicting regulatory effects from long DNA sequences. |
| Web content on genomic data formats in AI models | N/A | 2023 | Genomic file formats (FASTQ, BAM, VCF) | Survey (data format review) | Provides an overview of input and output formats used in AI models processing genomic data. |
| How genomics and multi-modal AI are reshaping precision medicine | Zhuang, H. | 2025 | Genomics, clinical data, imaging | Multimodal AI (Transformers, GNNs) | Discusses the integration of genomics and multi-modal AI to advance precision medicine. |
| AI for Genomics: The 2025 Revolution | Lifebit (unattributed) | 2023 | Genomic sequencing data, expression data | CNN, RNN, Transformer | Discusses the transformative impact of AI on genomics, from data analysis to personalized medicine. |
| Genomics AI: A Comprehensive Guide for 2025 | Lim, V. | 2025 | DNA/RNA sequencing, epigenetic markers | CNN, Transformer, GNN | Provides an overview of AI applications in genomics, including workflows and best practices. |