Interpretable Deep Learning Models Reveal Regulatory Elements Predicting Gene Expression in Diverse Plant Species

Gene expression regulation is a highly complex process involving multiple factors, including interactions between transcription factors and cis-regulatory elements (CREs). Traditional molecular biology techniques struggle to fully elucidate these mechanisms, whereas convolutional neural networks (CNNs) have the potential to systematically investigate sequence-to-regulation relationships. In this study, researchers from the Leibniz Institute of Plant Genetics and Crop Plant Research (IPK), the Institute of Bio- and Geosciences (IBG-4), the Cluster of Excellence on Plant Sciences (CEPLAS), South Westphalia University of Applied Sciences, the University of Göttingen, and the Center of Integrated Breeding Research (CiBreed) developed interpretable deep learning models to predict high and low gene expression from gene flanking regions in four plant species: Arabidopsis thaliana, Solanum lycopersicum, Sorghum bicolor, and Zea mays. The single-species models achieved over 80% accuracy and were instrumental in identifying regulatory elements in the flanking regions that influence gene expression levels. Multi-species models performed well across species and identified both conserved and species-specific regulatory sequence features. Because this deep learning approach allows automated motif extraction from raw sequences, the researchers demonstrated its applicability by revealing causal relationships between genetic variations and gene expression changes in tomato genomes. Furthermore, the models accurately predicted genotype-specific expression of key functional gene groups, highlighting known phenotypic and metabolic differences between a pair of domesticated and wild tomato species. This study underscores the potential of deep learning in exploring gene regulation and genetic variation, offering a powerful tool for functional genomics and phenotypic trait prediction.
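To make the sequence-to-regulation idea more concrete, the sketch below shows how a DNA flanking sequence is commonly turned into a one-hot matrix and how a single convolutional filter scores every position for a motif. It is a minimal illustration under stated assumptions, not the architecture or code used in the study: the function names `one_hot` and `scan_filter`, the toy sequence, and the TATA-like filter are all hypothetical, and a trained CNN would learn many such filters from data rather than use a hand-written one.

```python
import numpy as np

BASES = "ACGT"


def one_hot(seq: str) -> np.ndarray:
    """Encode a DNA string as a (length, 4) array in A, C, G, T column order.

    Bases outside ACGT (for example N) are left as all-zero rows.
    """
    index = {base: i for i, base in enumerate(BASES)}
    encoded = np.zeros((len(seq), 4), dtype=np.float32)
    for pos, base in enumerate(seq.upper()):
        col = index.get(base)
        if col is not None:
            encoded[pos, col] = 1.0
    return encoded


def scan_filter(encoded: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Slide a (width, 4) filter along the sequence (stride 1, no padding).

    Each output value is the sum of element-wise products between the filter
    and one sequence window, which is what one 1D convolutional filter computes
    before any bias or activation is applied.
    """
    width = kernel.shape[0]
    n_positions = encoded.shape[0] - width + 1
    return np.array(
        [np.sum(encoded[i:i + width] * kernel) for i in range(n_positions)]
    )


# Hypothetical toy data: a short "flanking region" containing a TATA-like motif.
flank = "GGCTATAAAGGC"
tata_filter = one_hot("TATAAA")  # hand-written filter, assumed for illustration
scores = scan_filter(one_hot(flank), tata_filter)
best = int(np.argmax(scores))  # window start with the strongest filter response
```

In a real model, the position-wise scores from many learned filters would feed further layers that output a high or low expression class, and interpretation methods would then trace that prediction back to the sequence positions that contributed most.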

SorghumBase examples:

Figure 1: Gene expression views of ABI3 orthologs in maize and sorghum. These transcriptomic views are based on experimental RNA-seq data from multiple experiments. The genomic sequence for each gene model was taken from the same source that supplies the SorghumBase genetic sequence views and information. The researchers used that sequence information to train the CNN model and compared the expression predictions of their in silico approach to the ground-truth expression data shown in the two figure panels. Eventually, CNN-produced gene expression data that have not been determined experimentally could be included as part of the views and data hosted by SorghumBase.

Reference:

Peleke FF, Zumkeller SM, Gültas M, Schmitt A, Szymański J. Deep learning the cis-regulatory code for gene expression in selected model plants. Nat Commun. 2024 Apr 25;15(1):3488. PMID: 38664394. doi: 10.1038/s41467-024-47744-0.

Related Project Websites:

Image 1: Schematic representation of the inputs and outputs of the machine learning pipeline trained to: a) classify genes as highly or lowly expressed based on their regulatory sequence; b) highlight sequence features (motifs) contributing to the classification. Photo credit Jedrzej Szymanski.
Image 2: Szymanski Lab on a retreat in Burg Falkenstein, Harz, DE. Photo credit Jedrzej Szymanski.
Image 3: Fritz Forbang Peleke presenting his results at the Translational Biology Conference 2022, VIB, Ghent, BE. Photo credit Jedrzej Szymanski.