HPC-GVCW: Accelerating SNP Detection in Major Crops through High-Performance Computing and Parallelized Workflows

The exponential increase in sequencing data and the need for rapid processing present challenges for efficient SNP (single-nucleotide polymorphism) detection in major crops, such as sorghum, maize and soybean. SNP detection is significant for trait discovery and genomic selection, necessitating the development of faster and more efficient workflows. The researchers, from King Abdullah University of Science and Technology, University of Arizona, Huazhong Agricultural University, International Rice Research Institute, Cold Spring Harbor Laboratory, Crop Science Research Center and the USDA-ARS, introduce HPC-GVCW, an open-source high-performance computing genome variant calling workflow, designed to address these challenges. This workflow employs high-performance computing (HPC) architectures and a novel algorithm called “Genome Index splitter” to parallelize genome processing and reduce execution times significantly. By dividing genomes into smaller chunks, HPC-GVCW achieves a remarkable speedup, enabling the rapid calling of SNPs from large datasets, such as resequenced rice accessions, in a fraction of the time previously required.

The paper demonstrates the utility of HPC-GVCW by running it on a dataset comprising 25 crop genomes and calling millions of SNPs for rice, sorghum, maize and soybean. Analysis of these SNPs reveals a substantial number of novel variations, particularly in a 16-genome rice reference panel, which could significantly contribute to molecular-assisted selection breeding programs and functional genomics studies. The authors emphasize the reproducibility of their results through containerization and highlight the scalability of HPC-GVCW across various computational platforms. They also discuss future directions, including integrating machine learning-based tools like “DeepVariant” into the workflow and expanding pan-genome strategies to other species, with the goal of enhancing genetic studies and breeding programs in major crops. Overall, HPC-GVCW represents a significant advancement in SNP detection, offering a faster and more scalable solution for exploring genetic diversity in crop plants.

Our High-performance Computing (HPC) workflow was specifically crafted to expedite variant identification in key crop species, enabling scientists to dedicate their efforts primarily to downstream analysis utilizing pan-genomic variant data. The vast potential and ample opportunities within this dataset warrant thorough exploration. While the logical structure and demonstration of the workflow was straightforward, its implementation on the HPC platform presented numerous obstacles. By collaborating closely with a team of computer scientists, we successfully navigated these challenges. Such interdisciplinary collaborations will become increasingly vital in the future. – Zhou

We are excited that our hpc workflow can be wildly adapted to rapidly call SNPs on multiple crops species on genomes that range over 10 fold in size (e.g. rice GS = 400 Mb, sorghum GS = 700 Mb, and maize GS = 2400 Mb). – Wing

SorghumBase examples: 

An example gene SbiRTX2783.04G076200 in the TX2783 genome is highly conserved across sorghum varieties and other crops including rice and maize,  as shown in Homology view Figure 1. Homology view from the search result page on SorghumBase provides quick information on the function of this gene as ABC transporter based on homology to an arabidopsis ABC transporter with 685 identity and showing 24 paralogs in Tx2783 

Figure 1: Homology view from the search result page on SorghumBase provides quick information on the function of the gene based on homology to an arabidopsis ABC transporter and the number of paralogs 15, in Tx2783http://sorghumbase.org/genes?idList=SbiRTX2783.04G076200
Figure 2: Ensembl gene tree view
Figure 3: The inset shows the 4_6047465_C_T variant that was also shown in Figure S7.
Figure 4: Functional consequences of the 4_6047465_C_T variant.
Figure 5: Comparison of the genetic variants at the 3’-end of the orthologous genes: SbiRTX2783 (in sorghum Tx2783) and SORBI_3004G073600 (in sorghum BTx623). The inset pops up when mousing over the last variant (R highlighted in green) at the end of cDNA sequence for the respective gene. Several other variants are visibly conserved between these two genomes, and because both variants are part of the Sorghum Association Panel (SAP), their genotypes may be corroborated by looking at the “Sample genotypes” view (the standard identifiers for these corresponding germplasm are PI564163 for BTx623 and PI656001 for Tx2783).

Reference:

Zhou Y, Kathiresan N, Yu Z, Rivera LF, Yang Y, Thimma M, Manickam K, Chebotarov D, Mauleon R, Chougule K, Wei S, Gao T, Green CD, Zuccolo A, Xie W, Ware D, Zhang J, McNally KL, Wing RA. A high-performance computational workflow to accelerate GATK SNP detection across a 25-genome dataset. BMC Biol. 2024 Jan 25;22(1):13. PMID: 38273258. doi: 10.1186/s12915-024-01820-5. Read more

 

Related Project Websites:

Image 1: From left to right, Yong Zhou, Nagarajan Kathiresan, Zhichao Yu, Luis Rivera Serna, Doreen Ware, Jianwei Zhang, Kenneth McNally, Rod A. Wing.