
CN-122019760-A - Hybrid retrieval and reordering method for medical question answering


Abstract

The invention relates to a hybrid retrieval and reordering method for medical question answering. The method comprises: obtaining an input Chinese medical question; performing sparse retrieval on a medical knowledge retrieval base according to the Chinese medical question to obtain the Top-100 initial candidate segments, wherein the medical knowledge retrieval base is constructed by collecting multi-source medical texts and preprocessing them; performing dense retrieval on the Chinese medical question and the knowledge segments to obtain the Top-100 semantically related candidate segments; fusing the initial candidate segments and the semantically related candidate segments to obtain the Top-50 final candidate segments; concatenating the Chinese medical question and the final candidate segments into a single sequence that is input to an encoder, which outputs a relevance score; and outputting the Top-K medical knowledge segments according to the relevance score. The invention can effectively improve the retrieval of relevant content in medical question-answering tasks.
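
The staged flow described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function `hybrid_retrieve`, the callable signatures, and the toy stand-ins are hypothetical names introduced here; only the cutoffs (Top-100, Top-100, Top-50, Top-K) come from the text.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical interfaces for the three pluggable stages.
Retriever = Callable[[str, int], List[str]]              # (question, top_n) -> ranked segment IDs
Fuser = Callable[[Sequence[List[str]], int], List[str]]  # (rankings, top_n) -> fused segment IDs
Scorer = Callable[[str, str], float]                     # (question, segment ID) -> relevance score


def hybrid_retrieve(question: str, sparse: Retriever, dense: Retriever,
                    fuse: Fuser, score: Scorer,
                    top_k: int = 5) -> List[Tuple[str, float]]:
    """Run sparse and dense retrieval, fuse, then reorder by relevance score."""
    sparse_hits = sparse(question, 100)               # Top-100 initial candidates
    dense_hits = dense(question, 100)                 # Top-100 semantically related candidates
    candidates = fuse([sparse_hits, dense_hits], 50)  # Top-50 final candidates
    scored = [(seg_id, score(question, seg_id)) for seg_id in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]                             # Top-K by relevance score


def toy_fuse(rankings: Sequence[List[str]], top_n: int) -> List[str]:
    """Toy stand-in: concatenate the lists, dropping duplicates in order."""
    merged: List[str] = []
    for ranking in rankings:
        for seg_id in ranking:
            if seg_id not in merged:
                merged.append(seg_id)
    return merged[:top_n]


# Toy stand-ins for the retrievers and the reordering encoder.
toy_scores = {"a": 0.2, "b": 0.1, "c": 0.9}
result = hybrid_retrieve(
    "example question",
    sparse=lambda q, n: ["a", "c"][:n],
    dense=lambda q, n: ["c", "b"][:n],
    fuse=toy_fuse,
    score=lambda q, seg_id: toy_scores[seg_id],
    top_k=2,
)
print(result)  # [('c', 0.9), ('a', 0.2)]
```

Passing the stages in as callables keeps the sketch independent of any particular index or model; in a real system they would wrap the sparse index, the dense index, and the reordering encoder.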

Inventors

  • LIU MEI
  • LU JIAPENG
  • ZHANG JIANFENG
  • LUO PENGFEI
  • WANG PENG

Assignees

  • Guangdong University of Petrochemical Technology (广东石油化工学院)

Dates

Publication Date
20260512
Application Date
20260212

Claims (8)

  1. A medical question-answering oriented hybrid retrieval and reordering method, characterized by comprising the following steps: acquiring an input Chinese medical question; performing, according to the Chinese medical question, sparse retrieval on a medical knowledge retrieval base to obtain the Top-100 initial candidate segments, wherein the medical knowledge retrieval base is constructed by collecting multi-source medical texts and preprocessing them; performing dense retrieval on the Chinese medical question and the knowledge segments to obtain the Top-100 semantically related candidate segments; fusing the initial candidate segments and the semantically related candidate segments to obtain the Top-50 final candidate segments; and concatenating the Chinese medical question and the final candidate segments into a single sequence that is input to an encoder, outputting a relevance score, and outputting the Top-K medical knowledge segments according to the relevance score.
  2. The medical question-answering oriented hybrid retrieval and reordering method of claim 1, wherein constructing the medical knowledge retrieval base by collecting multi-source medical texts and preprocessing them comprises: segmenting the multi-source medical texts with a fixed-window segmentation strategy; cleaning, denoising, unifying the format of, and standardizing the terminology of the segmented text to obtain medical knowledge segments; and building a sparse inverted index and a dense vector index from the medical knowledge segments to form the medical knowledge retrieval base, wherein the sparse inverted index is built from the term frequency and document frequency of terms in the documents, and the dense vector index is built by vectorizing each medical knowledge segment and mapping it to a low-dimensional dense semantic vector.
  3. The medical question-answering oriented hybrid retrieval and reordering method of claim 1, wherein performing sparse retrieval on the medical knowledge retrieval base according to the Chinese medical question comprises: using a BM25 model as the sparse retrieval module and retrieving from the medical knowledge retrieval base, according to the Chinese medical question, with preset hyperparameters.
  4. The medical question-answering oriented hybrid retrieval and reordering method according to claim 1, wherein performing dense retrieval on the Chinese medical question and the knowledge segments comprises: using BGE-base-zh as the semantic encoding model; adding a medical question template to the encoder input; mapping the Chinese medical question and the medical knowledge segments into the same vector space; computing semantic relevance by cosine similarity; and retrieving from a FAISS vector index.
  5. The medical question-answering oriented hybrid retrieval and reordering method of claim 1, wherein fusing the initial candidate segments and the semantically related candidate segments comprises: setting a smoothing parameter and fusing the initial candidate segments and the semantically related candidate segments by a reciprocal rank fusion method.
  6. The medical question-answering oriented hybrid retrieval and reordering method according to claim 1, wherein the reciprocal rank fusion method is: $\mathrm{RRF}(d) = \sum_{i \in N} \frac{1}{k + \mathrm{rank}_i(d)}$; wherein d is a candidate document to be fused, N is the set of retrieval models participating in the fusion, $\mathrm{rank}_i(d)$ is the rank of document d in the result list returned by the i-th retrieval model, and k is a constant that smooths rank differences.
  7. The medical question-answering oriented hybrid retrieval and reordering method of claim 3, wherein the encoder is a BGE-Reranker model, and the BGE-Reranker model outputs the relevance score through Transformer joint encoding and a multi-layer perceptron.
  8. A medical question-answering oriented hybrid retrieval and reordering system for implementing the medical question-answering oriented hybrid retrieval and reordering method according to any of claims 1-7, comprising: a knowledge base construction module for constructing a medical knowledge retrieval base by collecting multi-source medical texts and preprocessing them; a sparse retrieval module for performing sparse retrieval on the medical knowledge retrieval base according to the Chinese medical question to obtain the Top-100 initial candidate segments; a dense retrieval module for performing dense retrieval on the Chinese medical question and the knowledge segments to obtain the Top-100 semantically related candidate segments; a ranking fusion module for fusing the initial candidate segments and the semantically related candidate segments to obtain the Top-50 final candidate segments; and a reordering module for concatenating the Chinese medical question and the final candidate segments into a single sequence that is input to an encoder, outputting a relevance score, and outputting the Top-K medical knowledge segments according to the relevance score.
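
Claims 5 and 6 describe fusing the sparse and dense result lists by reciprocal rank fusion. As an illustration only, the sketch below assumes the standard form of that method, in which each candidate d is scored as the sum over the participating result lists of 1 / (k + rank_i(d)), with ranks starting at 1. The function name, the segment IDs, and the value k = 60 are assumptions chosen for the example; the claims state only that a smoothing parameter is set.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def reciprocal_rank_fusion(ranked_lists: Sequence[List[str]],
                           k: int = 60,
                           top_n: int = 50) -> List[Tuple[str, float]]:
    """Fuse ranked lists of segment IDs; return the top_n (ID, score) pairs."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        # Ranks are 1-based; a segment missing from a list contributes nothing.
        for rank, seg_id in enumerate(ranking, start=1):
            scores[seg_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by segment ID for determinism.
    fused = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return fused[:top_n]


# Toy result lists standing in for the Top-100 sparse and dense candidates.
sparse_top = ["s3", "s1", "s7"]
dense_top = ["s1", "s9", "s3"]
fused = reciprocal_rank_fusion([sparse_top, dense_top], k=60, top_n=50)
print([seg_id for seg_id, _ in fused])  # ['s1', 's3', 's9', 's7']
```

Because the score depends only on ranks, the BM25 scores and cosine similarities never need to be put on a common scale. Segments found by both retrievers (here `s1` and `s3`) rise above those found by only one.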

Description

Hybrid retrieval and reordering method for medical question answering

Technical Field

The invention relates to the technical field of intelligent medical services, and in particular to a medical question-answering oriented hybrid retrieval and reordering method.

Background

A medical question-answering system aims to retrieve and return trustworthy medical knowledge in response to medical questions posed by users, and is an important basic capability of intelligent medical services. With the rapid growth in demand for online medical consultation and popular medical education, accurately and comprehensively recalling relevant content from a large-scale medical knowledge base has become one of the key factors affecting the quality of medical question answering. However, medical text is characterized by specialized terminology, complex structure, and rich implicit cross-sentence relations, so there are marked differences in expression between user questions and medical knowledge segments. If the retrieval stage fails to recall the critical medical information, the downstream reasoning or generation module has difficulty obtaining reliable input, which affects the accuracy of diagnostic advice or medical explanations; this has been emphasized repeatedly in medical NLP studies.

Existing research indicates that medical corpora are highly specialized: traditional retrieval methods are prone to lexical mismatch, while purely semantic models may be limited by domain-shift bias and imperfect understanding of terminology. Studies such as the work of Luo et al. on MedQA-USMLE data, the medical literature question-answering experiments of Bienvenu et al., and the systematic analysis of clinical term ambiguity by Blagec all indicate that medical scenarios place far higher demands on recall completeness and semantic sensitivity than general question-answering tasks.
Therefore, it is difficult to obtain stable performance by relying only on single-path sparse or dense retrieval, and how to combine the two kinds of information effectively has become an important direction for medical question-answer retrieval.

Furthermore, several studies indicate that ranking quality in medical question-answering scenarios also directly affects the credibility of the final answer. Cheng et al. showed in a clinical decision support task that ranking errors may push critical evidence segments toward the end of the list and thereby affect subsequent diagnostic inference; DeYoung et al. further showed in the Evidence Inference task that high-quality ranking can significantly improve how efficiently a model uses medical evidence. Therefore, merely raising recall coverage is still insufficient to meet the requirements of medical question answering; a reordering mechanism capable of fine-grained semantic judgment is also indispensable.

Disclosure of Invention

The invention aims to provide a medical question-answering oriented hybrid retrieval and reordering method that effectively improves the retrieval of relevant content in medical question-answering tasks through a staged retrieval and reordering strategy.
In order to achieve the above object, the present invention provides the following solution:

A medical question-answering oriented hybrid retrieval and reordering method comprises the following steps: acquiring an input Chinese medical question; performing, according to the Chinese medical question, sparse retrieval on a medical knowledge retrieval base to obtain the Top-100 initial candidate segments, wherein the medical knowledge retrieval base is constructed by collecting multi-source medical texts and preprocessing them; performing dense retrieval on the Chinese medical question and the knowledge segments to obtain the Top-100 semantically related candidate segments; fusing the initial candidate segments and the semantically related candidate segments to obtain the Top-50 final candidate segments; and concatenating the Chinese medical question and the final candidate segments into a single sequence that is input to an encoder, outputting a relevance score, and outputting the Top-K medical knowledge segments according to the relevance score.

Optionally, constructing the medical knowledge retrieval base by collecting multi-source medical texts and preprocessing them comprises: segmenting the multi-source medical texts with a fixed-window segmentation strategy; cleaning, denoising, unifying the format of, and standardizing the terminology of the segmented text to obtain medical knowledge segments; and building a sparse inverted index and a dense vector index from the medical knowledge segments to form the medical knowledge retrieval base, wherein the sparse inverted index is built from the term frequency and document frequency of terms in the documents,