CN-122024855-A - Complex trait gene mining method and system based on a large language model and multi-modal data
Abstract
The application provides a complex trait gene mining method and system based on a large language model and multi-modal data, belonging to the technical field of gene data analysis. The method comprises: obtaining a statistical data set of multi-modal data covering different bean populations; annotating all genes in the statistical data set based on a reference genome to obtain a whole-genome semantic map; if the large language model receives a locus of a target significant trait, calculating a composite score for each gene at the locus using the large language model based on the whole-genome semantic map and the locus, and determining complex trait genes based on the composite scores of the genes; and if the large language model does not receive a locus of the target significant trait, determining candidate genes using the large language model based on the target significant trait, and determining the complex trait genes from the candidate genes using the large language model based on the whole-genome semantic map. The application can thereby locate the genes that most fundamentally affect the target significant trait.
Inventors
- TIAN ZHIXI
- ZHU ZHOU
- LIU SHULIN
- MA YANMING
- ZHOU GUOAN
Assignees
- Yazhouwan National Laboratory (崖州湾国家实验室)
Dates
- Publication Date: 2026-05-12
- Application Date: 2026-03-23
Claims (10)
- 1. A complex trait gene mining method based on a large language model and multi-modal data, characterized by comprising the following steps: performing integrated analysis on a genetic variation map using a genome-wide association study (GWAS) method to obtain a statistical data set containing multi-modal data, wherein the genetic variation map comprises gene data and trait data of different bean populations; determining a reference genome, annotating all genes in the statistical data set using a dedicated database based on the reference genome to obtain gene expression data, and converting the gene expression data into spatiotemporal feature data based on the growth stage and tissue structure of beans; constructing a whole-genome semantic map based on all the genes and the spatiotemporal feature data; if the large language model receives a locus of a target significant trait, calculating a composite score for each gene at the locus using the large language model based on the whole-genome semantic map and the locus, and determining complex trait genes based on the composite scores of the genes; if the large language model does not receive a locus of the target significant trait, determining candidate genes using the large language model based on the target significant trait, and determining the complex trait genes from the candidate genes using the large language model based on the whole-genome semantic map, wherein the complex trait genes are causal genes influencing the target significant trait.
- 2. The method of claim 1, wherein calculating a composite score for each gene at the locus using the large language model based on the whole-genome semantic map and the locus of the target significant trait comprises: traversing the whole-genome semantic map for the gene to obtain annotation data corresponding to the gene and association data between the gene and the other genes among all the genes; inputting the annotation data corresponding to the gene and the association data between the gene and the other genes into the large language model, and controlling the large language model, based on a preset prompt, to score the gene on at least one of statistical significance, spatiotemporal consistency of trait expression, and functional plausibility of homologous protein domains, so as to obtain the composite score of the gene.
- 3. The method of claim 2, wherein controlling the large language model, based on the preset prompt, to score the gene on at least one of statistical significance, spatiotemporal consistency of trait expression, and functional plausibility of homologous protein domains, so as to obtain the composite score of the gene, comprises: controlling the large language model, based on the preset prompt, to score the gene separately on statistical significance, spatiotemporal consistency of trait expression, and functional plausibility of homologous protein domains, so as to obtain a statistical significance score, a spatiotemporal consistency score, and a functional plausibility score; and computing a weighted sum of the statistical significance score, the spatiotemporal consistency score, and the functional plausibility score to obtain the composite score of the gene.
- 4. The method of claim 2, wherein the preset prompt includes a task purpose, task evaluation dimensions, and an output format.
- 5. The method of claim 1, wherein determining the complex trait genes based on the composite scores of the genes comprises: sorting the composite scores of the genes in descending order and determining the top N genes as the complex trait genes, wherein N is greater than or equal to 1.
- 6. The method of claim 1, wherein performing integrated analysis on the genetic variation map using the genome-wide association study (GWAS) method to obtain the statistical data set containing multi-modal data comprises: independently analyzing the genetic variation map corresponding to each bean population using a mixed linear model within the GWAS method to obtain trait association feature data for each bean population; and performing integrated analysis on the trait association feature data to obtain the statistical data set containing multi-modal data.
- 7. The method of claim 1, wherein determining the complex trait genes from the candidate genes using the large language model based on the whole-genome semantic map comprises: calculating a semantic priority index for each candidate gene using the large language model based on the whole-genome semantic map to obtain a priority index ranking table, and obtaining the complex trait genes based on the priority index ranking table.
- 8. The method of claim 7, wherein calculating the semantic priority index for each candidate gene comprises: calculating a statistical significance P value of the candidate gene based on the whole-genome semantic map; calculating a semantic similarity between the candidate gene and a trait vector corresponding to the target significant trait based on the whole-genome semantic map; and calculating the semantic priority index of the candidate gene based on the statistical significance P value and the semantic similarity.
- 9. The method of claim 8, wherein calculating the semantic priority index of the candidate gene based on the statistical significance P value and the semantic similarity comprises: calculating the semantic priority index of the candidate gene based on the statistical significance P value, the semantic similarity, and the following formula: [formula and its symbols not reproduced in this text], whose terms are: the semantic priority index; the statistical significance P value; the semantic similarity between the candidate gene and the trait vector t; an indicator function for expression-level gating; the maximum expression level of the candidate gene in the tissue; a preset expression-level threshold; and two different weight coefficients.
- 10. A complex trait gene mining system based on a large language model and multi-modal data, comprising: a data analysis module, configured to perform integrated analysis on a genetic variation map using a genome-wide association study (GWAS) method to obtain a statistical data set containing multi-modal data, wherein the genetic variation map comprises gene data and trait data of different bean populations; a feature conversion module, configured to determine a reference genome, annotate all genes in the statistical data set using a dedicated database based on the reference genome to obtain gene expression data, and convert the gene expression data into spatiotemporal feature data based on the growth stage and tissue structure of beans; a semantic map construction module, configured to construct a whole-genome semantic map based on all the genes and the spatiotemporal feature data; and a selection module, configured to: if the large language model receives a locus of a target significant trait, calculate a composite score for each gene at the locus using the large language model based on the whole-genome semantic map and the locus, and determine complex trait genes based on the composite scores of the genes; and if the large language model does not receive a locus of the target significant trait, determine candidate genes using the large language model based on the target significant trait, and determine the complex trait genes from the candidate genes using the large language model based on the whole-genome semantic map, wherein the complex trait genes are causal genes influencing the target significant trait.
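The prompt structure of claim 4, the weighted summation of claim 3, and the top-N selection of claim 5 can be illustrated with a minimal Python sketch. The claims do not give weight values, score ranges, or prompt wording, so `DEFAULT_WEIGHTS`, the `GeneScores` fields, and the function names below are hypothetical and chosen only for illustration; the three per-dimension scores are assumed to have already been returned by the large language model.

```python
from dataclasses import dataclass
from typing import Mapping, Sequence

# Hypothetical weights: the claims only require a weighted sum, not these values.
DEFAULT_WEIGHTS: Mapping[str, float] = {
    "statistical": 0.4,
    "spatiotemporal": 0.3,
    "functional": 0.3,
}


@dataclass(frozen=True)
class GeneScores:
    """Per-dimension scores for one gene (assumed to come from the LLM)."""
    gene_id: str
    statistical: float     # statistical significance score
    spatiotemporal: float  # spatiotemporal consistency score
    functional: float      # functional plausibility score


def build_prompt(purpose: str, dimensions: Sequence[str], output_format: str) -> str:
    """Assemble a prompt from the three parts named in claim 4."""
    dims = "\n".join(f"- {d}" for d in dimensions)
    return (
        f"Task purpose: {purpose}\n"
        f"Evaluation dimensions:\n{dims}\n"
        f"Output format: {output_format}"
    )


def composite_score(s: GeneScores, weights: Mapping[str, float] = DEFAULT_WEIGHTS) -> float:
    """Weighted sum of the three scores (claim 3)."""
    return (
        weights["statistical"] * s.statistical
        + weights["spatiotemporal"] * s.spatiotemporal
        + weights["functional"] * s.functional
    )


def top_n_genes(
    scores: Sequence[GeneScores],
    n: int = 1,
    weights: Mapping[str, float] = DEFAULT_WEIGHTS,
) -> list[str]:
    """Sort by composite score in descending order and keep the top N (claim 5)."""
    if n < 1:
        raise ValueError("N must be greater than or equal to 1")
    ranked = sorted(scores, key=lambda s: composite_score(s, weights), reverse=True)
    return [s.gene_id for s in ranked[:n]]
```

Keeping the weights in a separate mapping lets the relative importance of the three dimensions be tuned without touching the ranking logic.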
Description
Complex trait gene mining method and system based on a large language model and multi-modal data
Technical Field
The application belongs to the technical field of gene data analysis, and particularly relates to a complex trait gene mining method and system based on a large language model and multi-modal data.
Background
Bean genetics research has entered the big-data era, and the genome-wide association study (GWAS) has become the mainstream means of dissecting complex quantitative traits such as bean yield and quality. However, the quantitative trait loci located by GWAS in the prior art do not reflect causal relationships among genes, and a statistically significant P value alone cannot pinpoint the underlying causal genes. There is therefore a need for a method that can locate causal genes based on trait characteristics.
Disclosure of Invention
The application aims to provide a complex trait gene mining method and system based on a large language model and multi-modal data, so as to determine causal genes based on a whole-genome semantic map and effectively improve the screening accuracy of causal genes.
In a first aspect of the embodiments of the present application, a complex trait gene mining method based on a large language model and multi-modal data is provided, comprising: performing integrated analysis on a genetic variation map using a genome-wide association study (GWAS) method to obtain a statistical data set containing multi-modal data, wherein the genetic variation map comprises gene data and trait data of different bean populations; determining a reference genome, annotating all genes in the statistical data set using a dedicated database based on the reference genome to obtain gene expression data, and converting the gene expression data into spatiotemporal feature data based on the growth stage and tissue structure of beans; constructing a whole-genome semantic map based on all the genes and the spatiotemporal feature data; if the large language model receives a locus of a target significant trait, calculating a composite score for each gene at the locus using the large language model based on the whole-genome semantic map and the locus, and determining complex trait genes based on the composite scores of the genes; and if the large language model does not receive a locus of the target significant trait, determining candidate genes using the large language model based on the target significant trait, and determining the complex trait genes from the candidate genes using the large language model based on the whole-genome semantic map, wherein the complex trait genes are causal genes affecting the target significant trait.
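The two branches of the method, with and without a received locus, can be sketched as a simple dispatch in Python. This is only an illustration under stated assumptions: the callables `genes_at_locus`, `score_gene`, `propose_candidates`, and `prioritize` are hypothetical stand-ins for the steps backed by the large language model and the semantic map, and taking the first `top_n` entries of the ranking in the no-locus branch is an assumption rather than something the text specifies.

```python
from typing import Callable, Optional, Sequence


def mine_complex_trait_genes(
    trait: str,
    locus: Optional[str],
    genes_at_locus: Callable[[str], Sequence[str]],      # locus -> gene IDs at that locus
    score_gene: Callable[[str, str], float],             # (gene ID, locus) -> composite score
    propose_candidates: Callable[[str], Sequence[str]],  # trait -> candidate gene IDs
    prioritize: Callable[[Sequence[str]], Sequence[str]],  # candidates -> ranked gene IDs
    top_n: int = 1,
) -> list[str]:
    """Dispatch between the locus-driven and the trait-driven branch."""
    if top_n < 1:
        raise ValueError("top_n must be at least 1")

    if locus is not None:
        # Branch 1: a locus was received, so score every gene at that locus
        # and keep the highest-scoring ones.
        genes = genes_at_locus(locus)
        ranked = sorted(genes, key=lambda g: score_gene(g, locus), reverse=True)
        return list(ranked[:top_n])

    # Branch 2: no locus was received, so candidates are proposed from the
    # trait alone and then ranked (assumed here: best candidates first).
    candidates = propose_candidates(trait)
    return list(prioritize(candidates)[:top_n])
```

Passing the model-backed steps in as callables keeps the control flow testable with plain stub functions before any model is wired in.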
In a second aspect of the embodiments of the present application, a complex trait gene mining system based on a large language model and multi-modal data is provided, comprising: a data analysis module, configured to perform integrated analysis on a genetic variation map using a genome-wide association study (GWAS) method to obtain a statistical data set containing multi-modal data, wherein the genetic variation map comprises gene data and trait data of different bean populations; a feature conversion module, configured to determine a reference genome, annotate all genes in the statistical data set using a dedicated database based on the reference genome to obtain gene expression data, and convert the gene expression data into spatiotemporal feature data based on the growth stage and tissue structure of beans; a semantic map construction module, configured to construct a whole-genome semantic map based on all the genes and the spatiotemporal feature data; and a selection module, configured to: if the large language model receives a locus of a target significant trait, calculate a composite score for each gene at the locus using the large language model based on the whole-genome semantic map and the locus, and determine complex trait genes based on the composite scores of the genes; and if the large language model does not receive a locus of the target significant trait, determine candidate genes using the large language model based on the target significant trait, and determine the complex trait genes from the candidate genes using the large language model based on the whole-genome semantic map, wherein the complex trait genes are causal genes affecting the target significant trait.
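As a rough illustration of what the feature conversion and semantic map construction modules might produce, the Python sketch below groups flat expression records by growth stage and tissue and stores them, together with annotations and gene-to-gene associations, in a small in-memory graph. The text does not specify the map's data model, so the `GeneNode` fields, the record layout, and the relation labels are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class GeneNode:
    """One gene in the (hypothetical) semantic map."""
    gene_id: str
    annotation: str = ""
    # (growth stage, tissue) -> expression level
    expression: dict[tuple[str, str], float] = field(default_factory=dict)
    # neighbouring gene ID -> relation label
    associations: dict[str, str] = field(default_factory=dict)


def to_spatiotemporal(records):
    """Group flat (gene, stage, tissue, level) records into per-gene features."""
    features = defaultdict(dict)
    for gene_id, stage, tissue, level in records:
        features[gene_id][(stage, tissue)] = level
    return dict(features)


def build_semantic_map(annotations, features, edges):
    """Create one node per gene and attach undirected association edges."""
    gene_ids = annotations.keys() | features.keys()
    nodes = {
        g: GeneNode(g, annotations.get(g, ""), dict(features.get(g, {})))
        for g in gene_ids
    }
    for a, b, relation in edges:
        if a in nodes and b in nodes:
            nodes[a].associations[b] = relation
            nodes[b].associations[a] = relation
    return nodes


def traverse(semantic_map, gene_id):
    """Return a gene's annotation and its associations with the other genes."""
    node = semantic_map[gene_id]
    return node.annotation, dict(node.associations)
```

The pair returned by `traverse` corresponds to the annotation data and association data that would be handed to the large language model when scoring a gene.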
In a third aspect of the embodiments of the present application, an electronic device is provided, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the above complex trait gene mining method based on a large language model and multi-modal data when the computer prog