CN-121996777-A - Method and device for searching interpretable academic paper


Abstract

The invention provides an interpretable academic paper retrieval method and device, relating to the technical field of information retrieval and natural language processing. The method first uses a knowledge extraction model to extract multiple academic element dimensions from each paper's title and abstract, such as the research object, research method, and research field, and then builds a dual-mode index (text content and vector) for each academic element. A user query is decomposed along the same academic element dimensions; keyword retrieval and vector retrieval are executed for each query element; results within the same dimension are fused and results across dimensions are aggregated; the candidate papers are re-ranked with a semantic ranking model; and finally an interpretability description is generated from the matching information of each academic element dimension. By introducing multi-dimensional academic element decomposition and a hybrid retrieval mechanism, the invention significantly improves the accuracy, relevance, and interpretability of academic paper retrieval and provides academic researchers with an efficient and transparent literature retrieval tool.
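
The per-dimension decomposition described above can be pictured with a small data model. The following is a minimal sketch, not the patented implementation: the class names, the `DIMENSIONS` tuple, and the helpers `to_elements` and `build_mapping` are hypothetical, and the extraction model itself is left out (its output is assumed to arrive as a plain dictionary).

```python
from dataclasses import dataclass, field

# Hypothetical subset of the academic element dimensions named in the abstract.
DIMENSIONS = ("research_object", "research_method", "research_field")


@dataclass
class Element:
    type: str     # which academic element dimension this element belongs to
    content: str  # extracted text; an empty string means "not mentioned"


@dataclass
class Paper:
    paper_id: str
    title: str
    abstract: str
    elements: list[Element] = field(default_factory=list)


def to_elements(extracted: dict[str, str]) -> list[Element]:
    """Normalize extractor output to exactly one Element per dimension."""
    return [Element(dim, extracted.get(dim, "").strip()) for dim in DIMENSIONS]


def build_mapping(papers: list[Paper]) -> dict[tuple[str, str], list[str]]:
    """Map (dimension, element text) back to the ids of the papers containing it."""
    mapping: dict[tuple[str, str], list[str]] = {}
    for paper in papers:
        for el in paper.elements:
            if el.content:  # skip dimensions the paper does not cover
                mapping.setdefault((el.type, el.content), []).append(paper.paper_id)
    return mapping
```

A query can be normalized with the same `to_elements` helper, so that dimensions the query does not mention carry empty content and are skipped at retrieval time.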

Inventors

ZHAI PANPAN
LIU YU
LEI JINYUAN
LONG CUNYU
YU HUAJUAN

Assignees

Central China Normal University (华中师范大学)

Dates

Publication Date: 20260508
Application Date: 20260114

Claims (10)

1. An interpretable academic paper retrieval method, comprising: S1, defining n academic element dimensions, performing semantic analysis on the titles and abstracts of papers to be retrieved, and extracting the n academic element dimensions respectively to form an academic element set; S2, vectorizing the text content of each academic element to obtain a vector representation, and establishing a mapping relation from academic elements to papers based on the text content and the vector representation; S3, receiving a user query, performing semantic analysis on the query, extracting along the n academic element dimensions, and outputting a query element set, wherein if the query does not involve a given academic element dimension, the text content of that dimension is left empty; S4, traversing the query element set, fusing the retrieval results within each academic element dimension, aggregating the retrieval results of all non-empty academic element dimensions, and outputting a candidate paper set; S5, re-ranking the candidate paper set and outputting a ranked paper result set; S6, extracting, for each paper in the paper result set, the retrieval score under each academic element dimension and the query element set, and generating interpretable text based on the retrieval scores and the query elements to obtain an interpretable academic paper retrieval result.
2. The interpretable academic paper retrieval method according to claim 1, wherein S1 specifically comprises: acquiring the title Title and the abstract Abstract of each paper to be retrieved as the input Input = {Title, Abstract}, and defining n academic element dimensions, wherein the academic element dimensions comprise research objects, research methods, research fields, data sets/experimental objects, key technologies, research problems, application scenarios, and evaluation indexes; performing semantic analysis on the input text using a first knowledge extraction model and extracting the n defined academic element dimensions respectively, wherein the first knowledge extraction model comprises a large language model, a sequence labeling model, an information extraction model, and a named entity recognition model, and the extraction results form an academic element set Elements = {E_1, E_2, ..., E_n}, wherein E_i = {Type, Content} (i = 1, 2, ..., n) represents the type and text content of the i-th academic element dimension; and combining the title, the abstract, and the academic element set Elements to output a structured paper representation Paper = {Title, Abstract, Elements}.
3. The interpretable academic paper retrieval method according to claim 2, wherein S2 specifically comprises: storing the text content Text(E_i) = Content(E_i) of each academic element E_i (i = 1, 2, ..., n) in the academic element set Elements; vectorizing the text content of each academic element using an embedding model Embedding to obtain a vector representation Vec(E_i) = Embedding(Text(E_i)) (i = 1, 2, ..., n); constructing a keyword index Index_keyword based on the text content, and constructing a vector index Index_vector based on the vector representation; and establishing a mapping relation Mapping from academic elements to papers, recording the paper to which each academic element E_i (i = 1, 2, ..., n) belongs and its academic element type Type(E_i), so as to support locating the complete paper from a retrieved academic element.
4. The interpretable academic paper retrieval method according to claim 3, wherein S3 specifically comprises: receiving a query Query input by a user, wherein the query comprises a natural language question, a keyword combination, and a research problem description; and performing semantic analysis on Query using a second knowledge extraction model and extracting the n academic element dimensions defined in step S1 respectively, wherein the second knowledge extraction model is the same as or different from the first knowledge extraction model, and the extraction results form a query element set QueryElements = {QE_1, QE_2, ..., QE_n}, wherein QE_i = {Type, Content} (i = 1, 2, ..., n) represents the query element type and text content of the i-th academic element dimension, and if the query does not involve a given academic element dimension, the text content of that dimension is empty.
5. The interpretable academic paper retrieval method according to claim 4, wherein S4 specifically comprises: traversing the query element set QueryElements = {QE_1, QE_2, ..., QE_n}, and performing keyword retrieval and vector retrieval respectively for each query element QE_i (i = 1, 2, ..., n) whose text content is not empty; fusing the keyword retrieval result and the vector retrieval result of the same query element QE_i (i = 1, 2, ..., n) to obtain the retrieval result Results(QE_i) of that academic element dimension; and aggregating the retrieval results of all academic element dimensions with non-empty text content, de-duplicating and merging by paper identifier, to obtain the candidate paper set CANDIDATEPAPERS = Aggregate({Results(QE_i) | Content(QE_i) ≠ ∅, i = 1, 2, ..., n}) and a retrieval score score_dim(Paper_j, QE_i) for each candidate paper in each academic element dimension.
6. The interpretable academic paper retrieval method according to claim 5, wherein the keyword retrieval is Results_keyword(QE_i) = KeywordSearch(Content(QE_i), Index_keyword), with the keyword retrieval score calculated based on the BM25 algorithm or the TF-IDF algorithm; and the vector retrieval is Results_vector(QE_i) = VectorSearch(Vec(Content(QE_i)), Index_vector), with the semantic retrieval score calculated based on vector similarity, where Vec(Content(QE_i)) = Embedding(Content(QE_i)).
7. The interpretable academic paper retrieval method according to claim 5, wherein S5 specifically comprises: calculating a semantic similarity score between Query and each paper in the candidate paper set CANDIDATEPAPERS using a semantic ranking model Ranker, expressed as SemanticScore(Paper_j) = Ranker(Query, Paper_j) (j = 1, 2, ..., |CANDIDATEPAPERS|), wherein Paper_j can be semantically matched using the title, the abstract, or the full text of the paper; and ranking the candidate papers by the semantic similarity score SemanticScore from high to low, selecting the k highest-scoring papers, and outputting a ranked paper result set RANKEDPAPERS = {Paper_1, Paper_2, ..., Paper_k}, wherein Paper_1 is the highest-scoring paper, Paper_k is the paper with the k-th highest score, and k is a preset number of returned results.
8. The interpretable academic paper retrieval method according to claim 7, wherein S6 specifically comprises: for each Paper_j (j = 1, 2, ..., k) in the paper result set, extracting the retrieval score score_dim(Paper_j, QE_i) (i = 1, 2, ..., n) under each academic element dimension; identifying the academic element dimensions with a non-zero retrieval score to form a matching dimension set MATCHEDDIMS = {i | score_dim(Paper_j, QE_i) > 0}; extracting the query dimensions with non-empty text content in the query element set QueryElements = {QE_1, QE_2, ..., QE_n} together with their text content; generating interpretable text Explanation(Paper_j) (j = 1, 2, ..., k) based on the retrieval score score_dim(Paper_j, QE_i) and the query element set QueryElements; and combining the ranked paper result set with the corresponding interpretable text to output the interpretable academic paper retrieval result: FinalResults = {(Paper_1, Explanation(Paper_1)), (Paper_2, Explanation(Paper_2)), ..., (Paper_k, Explanation(Paper_k))}.
9. The interpretable academic paper retrieval method according to claim 8, wherein the manner of generating the interpretable text Explanation(Paper_j) (j = 1, 2, ..., k) includes: generation by a large language model, in which a prompt Prompt is constructed taking the academic element set Elements of Paper_j, the query element set QueryElements, and the retrieval score score_dim(Paper_j, QE_i) of each academic element dimension as input, and the large language model LLM is called to generate natural and fluent interpretable text Explanation(Paper_j) = LLM(Prompt); and generation from predefined templates, in which a corresponding template Template(Type(QE_i)) is selected according to the type of each query element in the matching dimension set MATCHEDDIMS, and the text content Content(QE_i) of the query element, the text content Content(E_i') of the corresponding academic element of Paper_j, and the retrieval score score_dim(Paper_j, QE_i) are filled into the placeholders of the template to generate structured interpretable text.
10. An interpretable academic paper retrieval device, comprising: an extraction unit, configured to define a plurality of academic element dimensions, perform semantic analysis on the titles and abstracts of papers to be retrieved, and extract the n academic element dimensions respectively to form an academic element set; an establishing unit, configured to vectorize the text content of each academic element to obtain a vector representation, and establish a mapping relation from academic elements to papers based on the text content and the vector representation; a query unit, configured to receive a user query, perform semantic analysis on the query, extract along the n academic element dimensions, and output a query element set, wherein if the query does not involve a given academic element dimension, the text content of that dimension is empty; a retrieval unit, configured to traverse the query element set, fuse the retrieval results within each academic element dimension, aggregate the retrieval results of all non-empty academic element dimensions, and output a candidate paper set; a re-ranking unit, configured to re-rank the candidate paper set and output a ranked paper result set; and an output unit, configured to extract, for each paper in the paper result set, the retrieval score under each academic element dimension and the query element set, and generate interpretable text based on the retrieval scores and the query elements to obtain an interpretable academic paper retrieval result.
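
Steps S4 to S6 of the claims can be sketched as follows. This is an illustrative sketch under stated assumptions, not the claimed implementation: the claims do not specify how keyword and vector results are fused, so max-normalization followed by a weighted sum (weight `alpha`) is assumed here; the keyword and vector backends are assumed to have already returned `paper_id -> score` dictionaries; the ranking model is passed in as an arbitrary callable; and all function names and the explanation template are hypothetical.

```python
from typing import Callable


def normalize(scores: dict[str, float]) -> dict[str, float]:
    """Scale scores to [0, 1] by dividing by the maximum (assumed scheme)."""
    top = max(scores.values(), default=0.0)
    return {pid: s / top for pid, s in scores.items()} if top > 0 else {}


def fuse(keyword: dict[str, float], vector: dict[str, float],
         alpha: float = 0.5) -> dict[str, float]:
    """Fuse keyword and vector results of ONE dimension by a weighted sum."""
    kw, vec = normalize(keyword), normalize(vector)
    return {pid: alpha * kw.get(pid, 0.0) + (1 - alpha) * vec.get(pid, 0.0)
            for pid in kw.keys() | vec.keys()}


def aggregate(per_dim: dict[str, dict[str, float]]) -> dict[str, dict[str, float]]:
    """Merge per-dimension results by paper id, keeping score_dim per dimension."""
    candidates: dict[str, dict[str, float]] = {}
    for dim, results in per_dim.items():
        for pid, score in results.items():
            candidates.setdefault(pid, {})[dim] = score
    return candidates


def rerank(candidates: dict[str, dict[str, float]],
           ranker: Callable[[str], float], k: int) -> list[str]:
    """Order candidate paper ids by the ranker's score and keep the top k."""
    return sorted(candidates, key=ranker, reverse=True)[:k]


def explain(pid: str, score_dim: dict[str, float],
            query_elements: dict[str, str],
            paper_elements: dict[str, str]) -> str:
    """Template-based explanation over the matched dimensions (score > 0)."""
    parts = [
        f"{dim}: query '{query_elements[dim]}' matched '{paper_elements[dim]}' "
        f"(score {score:.2f})"
        for dim, score in sorted(score_dim.items()) if score > 0
    ]
    return f"{pid} matched on " + "; ".join(parts)
```

In use, `fuse` would be called once for each query dimension with non-empty content, the fused dictionaries collected under their dimension names and passed to `aggregate`, and the resulting per-paper `score_dim` dictionaries handed to `explain` for each paper returned by `rerank`.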

Description

Method and device for searching interpretable academic paper

Technical Field

The invention relates to the technical field of information retrieval and natural language processing, and in particular to an interpretable academic paper retrieval method and device based on multi-dimensional academic element extraction.

Background

The number of academic papers is growing exponentially, and quickly and accurately retrieving relevant research results from a massive body of literature is a major challenge for researchers. Existing academic paper retrieval methods fall into two main categories, keyword-based retrieval and semantic-embedding-based retrieval, and both have limitations in practice.

Keyword-based retrieval methods (e.g., BM25) rely on exact vocabulary matching and rank relevance by computing how well the query terms match the words in a document. They are fast and highly interpretable, but suffer from an obvious vocabulary gap: when the query and the paper use different but semantically related terms, no match is found (for example, a query for "deep learning" does not match a paper that uses the term "neural network"). They are also insensitive to synonyms, abbreviations, and variants of academic terms, and because they attend only to word matching without understanding the semantic intent of the query, retrieval recall is low.

Semantic-embedding-based retrieval methods map queries and papers into a high-dimensional vector space and retrieve by vector similarity.
This approach alleviates the vocabulary gap to some extent and captures semantic similarity, but it has the following shortcomings. First, it encodes the whole paper or the whole query into a single vector (holistic embedding), which is coarse-grained matching that cannot distinguish a paper's relevance along different research dimensions; for example, two papers may be similar in research method but completely different in research object, so their overall similarity may be high while their actual relevance is low. Second, the retrieval process is a black box: the user cannot tell why a given paper was retrieved, so interpretability is lacking. Third, it ignores the structural characteristics of academic papers and does not make full use of a paper's semantic information in dimensions such as research object, research method, and research field.

More importantly, existing methods generally fail to consider the particularities of the academic paper retrieval scenario. Academic paper retrieval differs from general document retrieval in essential ways. Academic queries often focus on specific dimensions; for example, a user may care only about papers "using a certain research method" or "targeting a certain research object" rather than papers that are similar overall. Academic papers have highly structured semantic features: elements such as research objects, research methods, research fields, data sets, and key technologies form the core semantic framework of a paper, yet this structured information is not fully exploited. Researchers also need to understand the basis on which a retrieval result matched in order to judge a paper's relevance quickly, but existing methods cannot explain their retrieval results.
In addition, keyword-based and embedding-based methods each have strengths and weaknesses: the former is precise but has low recall, while the latter has high recall but may introduce irrelevant results. How to combine the advantages of the two is also a problem to be solved.

In recent years, knowledge extraction technology, particularly information extraction based on pre-trained language models, has made significant progress and can accurately extract structured information such as entities and relations from unstructured text, providing a technical foundation for fine-grained semantic analysis of academic papers. However, a systematic technical solution is still lacking for applying knowledge extraction to academic paper retrieval, constructing a multi-dimensional retrieval framework, and providing interpretable retrieval results.

Disclosure of Invention

The invention aims to provide an interpretable academic paper retrieval method and device that address the insufficient semantic understanding, lack of interpretable results, and weak fine-grained matching capability of existing academic paper retrieval methods. By introducing a multi-dimensional academic element decomposition and hybrid retrieval mechanism, the accuracy, relevance, and interpretability of academic paper retrieval can be significantly improved, so that an ef