CN-122019780-A - Scientific literature innovation evaluation method based on large model Agent technology


Abstract

The application provides a scientific literature innovation evaluation method based on large model Agent technology. The method comprises: classifying each citation source in a citation data set hierarchically by discipline and research branch, and determining discipline field codes and specific research branch codes according to a discipline classification system, to obtain citation distribution data containing complete classification labels; counting the number and share of citations from each research branch in the citation distribution data, and calculating the crossing degree of a document among different research branches; comprehensively evaluating, by the large model Agent, the innovation value of the documents based on the crossing degree among branches and the total number of citations, and adjusting the weights of different documents according to the crossing degree among branches, to generate an innovation value scoring data set; and cross-verifying, by the large model Agent, a candidate document list, confirming the validity of its cross-domain breakthrough characteristics using the number and share of citations from each branch, completing the final labeling, and identifying breakthrough innovation documents.

Inventors

SHEN ZHIRUI
Peng Feiling

Assignees

广州市奇之信息技术有限公司

Dates

Publication Date: 20260512
Application Date: 20260409

Claims (9)

1. A scientific literature innovation evaluation method based on large model Agent technology, characterized by comprising the following steps: constructing an Agent large model, acquiring complete citation records of a target document through a document database interface, including the total number of citations and detailed information on each citation source, and storing the acquired citation records in the long-term memory of the Agent large model to form a citation data set; classifying each citation source in the citation data set hierarchically by discipline and research branch, and determining its discipline field code and specific research branch code according to a discipline classification system, to obtain citation distribution data containing complete classification labels; counting the number and share of citations from each research branch in the citation distribution data, and calculating the crossing degree of the document among different research branches; comprehensively evaluating, by the large model Agent, the innovation value of the documents based on the crossing degree among branches and the total number of citations, and adjusting the weights of different documents according to the crossing degree among branches, to generate an innovation value scoring data set; screening, by the large model Agent, the innovation value scoring data set for candidate documents that meet the dual conditions of high score and high crossing degree, according to the innovation value score and a preset crossing degree threshold, to obtain a candidate breakthrough innovation document list; and cross-verifying, by the large model Agent, the candidate document list, confirming the validity of the cross-domain breakthrough characteristics using the number and share of citations from each branch, completing the final labeling, and identifying breakthrough innovation documents.
2. The scientific literature innovation evaluation method based on large model Agent technology according to claim 1, wherein constructing the Agent large model, acquiring complete citation records of the target document through a document database interface, including the total number of citations and detailed information on each citation source, and storing the acquired citation records in the long-term memory of the Agent large model to form a citation data set comprises: initiating, by the Agent large model, a query request based on the DOI of the target document by calling an API of the Web of Science or Scopus database, acquiring the returned JSON-format citation data packet, parsing the total citation count field and the citation detail array in the data packet, and extracting the author names, publication journal, publication year and document title of each citation record, to obtain an original citation record set; performing data cleaning on the original citation record set by removing duplicate citation records and self-citation entries, screening for valid citation records dated after a target time according to their timestamps, and supplementing missing information through the Crossref database, to obtain a standardized citation record set; and generating, by the Agent large model, unique index key values from the standardized citation record set using a hash algorithm, establishing a mapping table between the citing source documents and the cited document, and storing the mapping table in a database of the long-term memory module, to form a citation data set that supports traceable queries.
3. The scientific literature innovation evaluation method based on large model Agent technology according to claim 1, wherein classifying each citation source in the citation data set hierarchically by discipline and research branch, and determining discipline field codes and specific research branch codes according to a discipline classification system, to obtain citation distribution data containing complete classification labels comprises: extracting, by the Agent large model, the author names, publication journal, publication year and document title of each citation record from the citation data set, querying the ISSN of the publication journal, extracting a document keyword set from the document title, querying the primary discipline code of the journal in the subject classification table of the Chinese Academy of Sciences literature information center, vectorizing the keywords with the TF-IDF algorithm, computing the dot product of the keyword vector with each pre-constructed research branch feature vector, and selecting the branch with the highest similarity as the preliminary branch attribution; for the preliminary branch attribution, when the similarity between the keyword vector of a citation record and the feature vectors of a plurality of research branches exceeds a preset cross-discipline judgment threshold, judging the record to be a cross-discipline document and assigning it a main branch code and an auxiliary branch code, and otherwise keeping a single branch code, to obtain a hierarchical classification label for each citation record; and, according to the hierarchical classification labels, organizing the citation records into three levels of discipline category, primary discipline and research branch, counting the number and share of citations under each level node, assigning cross-discipline documents different statistical weights for their main branches and auxiliary branches, and generating a structured data table containing discipline code, branch code, citation count and percentage share fields, to form the citation distribution data with complete classification labels.
4. The scientific literature innovation evaluation method based on large model Agent technology according to claim 3, wherein classifying each citation source in the citation data set hierarchically by discipline and research branch further comprises: extracting the full journal name and the abstract text of each citation record from the citation data set, extracting noun phrases from the abstract text with a natural language processing tool as a research topic vocabulary set, and querying the subject classification code of the journal in the SCI journal partition table, to obtain initial journal classification information; using a preset discipline classification vocabulary list, counting the words that co-occur in the research topic vocabulary set and the keywords of each discipline, and, if the count exceeds a preset threshold, judging that the citation source belongs to that discipline field, to obtain a discipline field label; and, according to the discipline field label, locating the corresponding set of secondary branches, string-matching the research topic vocabulary against the feature words of each secondary branch, counting the successfully matched feature words, and selecting the branch with the most matches as the research branch label, thereby determining the discipline field label and the research branch label of the citation source.
5. The scientific literature innovation evaluation method based on large model Agent technology according to claim 1, wherein counting the number and share of citations from each research branch in the citation distribution data and calculating the crossing degree of the document among different research branches comprises: extracting the citation record list of each research branch from the citation distribution data, counting the citations contained in each branch, calculating the percentage of the branch's citations in the total number of citations, and ranking the branches by citation count from high to low to form a branch citation ranking list; identifying branches whose citation share exceeds a preset minimum share threshold as effective branches, calculating the standard deviation and the average of the citation counts of all effective branches, and judging the distribution to be balanced when the standard deviation is less than a preset proportion of the average, and otherwise judging it to be concentrated; and, according to the distribution type and the total number of effective branches, if the distribution is balanced, taking the crossing degree as the number of effective branches divided by the total number of research branches, and otherwise taking it as the number of effective branches divided by the total number of research branches, thereby determining the crossing degree of the document among different research branches.
6. The scientific literature innovation evaluation method based on large model Agent technology according to claim 5, wherein calculating the crossing degree of the document among different research branches further comprises: extracting the complete citation records of each research branch from the citation distribution data, obtaining the publication journals, author institutions and keyword sets of the citing documents, determining the primary discipline code of each citation through a discipline classification mapping table, and counting the number of distinct discipline fields spanned; extracting the core keyword set of each branch, converting the keywords into high-dimensional vector representations with a word vector algorithm, calculating the center vector of each branch's keyword vector set, and calculating the topic similarity between branches with the cosine similarity algorithm; and dividing the citation count of each branch by the number of sub-fields contained in the branch to obtain a citation density, identifying branches whose citation density exceeds the average density as core influence branches, taking the number of core influence branches as the influence breadth value, taking the coefficient of variation of the citation counts of the core influence branches as the influence depth value, normalizing the influence breadth value and the influence depth value to the range from 0 to 1, and obtaining the comprehensive crossing degree as the weighted sum of the influence breadth value and the influence depth value according to preset weight coefficients.
7. The scientific literature innovation evaluation method based on large model Agent technology according to claim 1, wherein the large model Agent comprehensively evaluating the innovation value of the documents based on the crossing degree among branches and the total number of citations, and adjusting the weights of different documents according to the crossing degree among branches, to generate an innovation value scoring data set comprises: obtaining, by the large model Agent, the branch crossing degree value and the total number of citations of each document from a document database, adding one to the total number of citations and taking the natural logarithm to obtain a citation base, and multiplying the crossing degree value by a preset amplification factor and limiting the product to the range from 0 to 1 to obtain an adjustment coefficient, wherein the adjustment coefficient takes its maximum value when the crossing degree exceeds a preset high crossing threshold and is otherwise calculated proportionally; extracting the publication year of each document, calculating the difference between the current year and the publication year, calculating a timeliness weight with a negative exponential decay formula, querying the importance value of the document's discipline in a preset discipline weight configuration table as the field weight, and multiplying the citation base, the adjustment coefficient, the timeliness weight and the field weight to obtain a raw innovation score; and calculating the average and standard deviation of the scores of documents in the same discipline, subtracting the average from the raw score and dividing by the standard deviation to obtain a standardized score, classifying the documents into three grades according to the standardized score, with scores greater than a preset high innovation threshold marked as high innovation, scores between the preset high innovation threshold and a preset low innovation threshold marked as medium innovation, and scores smaller than the preset low innovation threshold marked as ordinary innovation, constructing a structured data table containing the five fields of document identification code, crossing degree value, total number of citations, innovation score and innovation grade, and sorting the table by innovation score in descending order, to generate the innovation value scoring data set.
8. The scientific literature innovation evaluation method based on large model Agent technology according to claim 1, wherein the large model Agent screening the innovation value scoring data set for candidate documents that meet the dual conditions of high score and high crossing degree, according to the innovation value score and a preset crossing degree threshold, to obtain a candidate breakthrough innovation document list comprises: reading, by the large model Agent, the innovation score and crossing degree value of each document from the innovation value scoring data set, screening out the set of documents whose score is higher than a preset innovation score threshold, and screening out the set of documents whose crossing degree is higher than the preset crossing degree threshold; intersecting the set of documents whose score is higher than the threshold with the set of documents whose crossing degree is higher than the threshold, and extracting the identification codes, innovation scores and crossing degree values of the document records present in both sets; and calculating the weighted sum of the innovation score and the crossing degree value of each document as a comprehensive score, sorting the documents by comprehensive score in descending order, marking the top-ranked preset proportion of documents as core breakthrough candidates, and marking the remaining documents as general breakthrough candidates, to obtain the candidate breakthrough innovation document list.
9. The scientific literature innovation evaluation method based on large model Agent technology according to claim 1, wherein the large model Agent cross-verifying the candidate document list, confirming the validity of the cross-domain breakthrough characteristics using the number and share of citations from each branch, completing the final labeling, and identifying breakthrough innovation documents comprises: extracting, by the large model Agent, the branch citation distribution data of each document from the candidate breakthrough innovation document list, calculating the percentage of each research branch's citations in the total number of citations, counting the effective branches whose share exceeds a preset minimum threshold, and judging that verification fails if the number of effective branches is less than a preset number; and, according to the preliminary verification result, calculating the citation distribution of the documents that pass the preliminary verification, marking documents whose citation distribution is smaller than a preset balance threshold as breakthrough innovation documents, and marking documents whose citation distribution is larger than the threshold as non-breakthrough innovation documents, thereby completing the final labeling and identifying the breakthrough innovation documents.
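As an illustration only, the calculations recited in claims 5, 7 and 8 might be sketched in Python as follows. Every function name, parameter name and default value here (min_share, cv_threshold, amplification, high_cross, decay, w_score, w_cross, core_fraction) is a hypothetical placeholder for a "preset" value that the claims leave open, and the use of the population standard deviation is likewise an assumption. Because claim 5 as worded gives the same ratio for both distribution types, the sketch reports the distribution label separately instead of guessing at a different formula.

```python
import math
from statistics import mean, pstdev


def crossing_degree(branch_counts, total_branches, min_share=0.05, cv_threshold=0.5):
    """Claim 5 sketch: effective-branch ratio plus a distribution label."""
    total = sum(branch_counts.values())
    if total == 0 or total_branches == 0:
        return 0.0, "concentrated"
    # Effective branches: share of all citations above the minimum share.
    effective = [n for n in branch_counts.values() if n / total > min_share]
    # Balanced when the standard deviation is below a fraction of the mean.
    # Assumption: a single effective branch is never treated as balanced.
    balanced = len(effective) > 1 and pstdev(effective) < cv_threshold * mean(effective)
    # The claim as worded uses the same ratio for both distribution types.
    degree = len(effective) / total_branches
    return degree, "balanced" if balanced else "concentrated"


def raw_innovation_score(total_citations, degree, pub_year, current_year,
                         field_weight, amplification=1.5, high_cross=0.8, decay=0.1):
    """Claim 7 sketch: citation base x adjustment x timeliness x field weight."""
    base = math.log(total_citations + 1)  # ln(citations + 1)
    if degree > high_cross:
        adjust = 1.0  # maximum value above the high crossing threshold
    else:
        adjust = min(1.0, max(0.0, degree * amplification))  # clamp to [0, 1]
    timeliness = math.exp(-decay * (current_year - pub_year))  # negative exponential decay
    return base * adjust * timeliness * field_weight


def standardize(scores):
    """Claim 7 sketch: (score - mean) / standard deviation within one discipline."""
    mu, sigma = mean(scores), pstdev(scores)
    return [0.0 if sigma == 0 else (s - mu) / sigma for s in scores]


def screen_candidates(records, score_threshold, cross_threshold,
                      w_score=0.5, w_cross=0.5, core_fraction=0.2):
    """Claim 8 sketch: intersect the two threshold sets, rank, label core/general."""
    high_score = {r["id"] for r in records if r["score"] > score_threshold}
    high_cross = {r["id"] for r in records if r["crossing"] > cross_threshold}
    keep = high_score & high_cross
    ranked = sorted(
        (r for r in records if r["id"] in keep),
        key=lambda r: w_score * r["score"] + w_cross * r["crossing"],
        reverse=True,
    )
    n_core = math.ceil(core_fraction * len(ranked))
    return [(r["id"], "core" if i < n_core else "general") for i, r in enumerate(ranked)]
```

Each record passed to screen_candidates is assumed to be a dictionary with "id", "score" and "crossing" keys, which mirrors three of the five fields of the scoring data table in claim 7.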

Description

Scientific literature innovation evaluation method based on large model Agent technology

Technical Field

The invention relates to the field of information technology, and in particular to a scientific literature innovation evaluation method based on large model Agent technology.

Background

In the evaluation of innovation in scientific and technological literature, accurately judging the true innovation value of a document directly influences the choice of research directions, the allocation of resources and the evaluation of talent, and is therefore of key importance. Traditionally, the number of times a document is cited is treated as the core indicator of innovation: more citations are generally taken to mean greater influence and a more pronounced innovative contribution. However, this approach has significant drawbacks in practice. It is susceptible to mutual citation within a particular research community, so that documents which circulate widely only within a narrow branch receive excessively high scores, while documents that truly drive progress in multiple fields may be underestimated. A further problem is that the citation count reflects overall impact but does not reveal how the citation sources are distributed. When a large number of citations come from the same research branch, the document merely deepens and continues work within that branch even if its total citation count is very high, and it can hardly be said to make a breakthrough innovation contribution. Conversely, if the citation sources are distributed across multiple different research branches, then even with a relatively small total citation count the document may provide a general idea or approach that crosses branches, and thus has a higher innovation value.
Most existing methods consider only the total citation count and neglect how the citations are dispersed among different research branches, so that the evaluation result deviates systematically from the actual innovation value. This deviation is particularly pronounced in specific evaluations. For example, one document is cited 200 times, but 180 of those citations come from follow-up work in the same research branch and only 20 are spread over other branches, so its innovativeness is easily overstated by the total count; another document is cited only 80 times, but the citations are evenly distributed over five mutually independent research branches, about 16 per branch, and its cross-branch influence clearly represents a significant breakthrough. Existing evaluation systems can hardly capture this difference in the dispersion of citation sources across branches and cannot adjust the innovation score accordingly, so they face serious difficulties in identifying breakthrough documents that truly have cross-domain influence. Therefore, how to evaluate the innovation of a document by considering not only the total number of citations but also the degree to which the citation sources are dispersed among different research branches, and by dynamically adjusting the scoring weight according to that dispersion, has become a key problem in accurately evaluating the innovation value of scientific literature.
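The arithmetic of the two-document example above can be checked with a short Python sketch. The function names and the 10% minimum share are hypothetical choices made for illustration, and the split of the 20 remaining citations of the first document over four other branches is an assumption, since the text does not say how they are spread.

```python
def branch_shares(counts):
    """Return each branch's share of the document's total citations."""
    total = sum(counts.values())
    return {branch: n / total for branch, n in counts.items()}


def branches_above(counts, min_share=0.10):
    """List branches whose citation share exceeds min_share (assumed value)."""
    return [b for b, share in branch_shares(counts).items() if share > min_share]


# Document A: 200 citations, 180 from one branch. The split of the other 20
# across four branches is an assumption; the text does not specify it.
doc_a = {"home_branch": 180, "other_1": 5, "other_2": 5, "other_3": 5, "other_4": 5}
# Document B: 80 citations spread evenly over five branches (16 each).
doc_b = {f"branch_{i}": 16 for i in range(1, 6)}

for name, doc in (("A", doc_a), ("B", doc_b)):
    top_share = max(branch_shares(doc).values())
    # Prints: document, total citations, largest branch share, branches above min_share
    print(name, sum(doc.values()), round(top_share, 2), len(branches_above(doc)))
```

Under these assumptions the first document has a largest branch share of 0.9 and one branch above the minimum share, while the second has a largest branch share of 0.2 and five such branches, even though its total citation count is less than half as large.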
Disclosure of Invention

In order to solve the technical problems described in the Background, the invention discloses a scientific literature innovation evaluation method based on large model Agent technology, which comprises the following steps: constructing an Agent large model, acquiring complete citation records of a target document through a document database interface, including the total number of citations and detailed information on each citation source, and storing the acquired citation records in the long-term memory of the Agent large model to form a citation data set; classifying each citation source in the citation data set hierarchically by discipline and research branch, and determining its discipline field code and specific research branch code according to a discipline classification system, to obtain citation distribution data containing complete classification labels; counting the number and share of citations from each research branch in the citation distribution data, and calculating the crossing degree of the document among different research branches; comprehensively evaluating, by the large model Agent, the innovation value of the documents based on the crossing degree among branches and the total number of citations, and adjusting the weights of different documents according to the crossing degree among branches, to generate an innovation value scoring data set; Screening candidate d