CN-122019795-A - Document retrieval method, apparatus, computer device, storage medium, and computer program product
Abstract
The present application relates to a document retrieval method, apparatus, computer device, storage medium, and computer program product. The method comprises: receiving a query word and segmenting it into a plurality of sub-words; querying a document library to obtain candidate documents that match any of the sub-words, together with the sub-word sequence of each candidate document; constructing a plurality of query sub-word sequences, each consisting of consecutive sub-words of the query word; for any candidate document, determining, from the sub-word sequence of the candidate document, the segments that match each query sub-word sequence, and determining a relevance score of the candidate document based on the inverse document frequency of each sub-word covered in each segment and the length of each segment; and sorting the candidate documents based on their relevance scores and outputting the sorted candidate documents. With this method, the ranking of the documents better reflects their relevance to the query word.
Inventors
- LIU ZHONGYU
- DING ZHENGSHENG
Assignees
- 杭州亿格云科技有限公司
Dates
- Publication Date
- 20260512
- Application Date
- 20251227
Claims (10)
- 1. A document retrieval method, the method comprising: receiving a query word, segmenting the query word into a plurality of sub-words, and querying a document library to obtain candidate documents that match any of the sub-words, together with the sub-word sequence of each candidate document; constructing a plurality of query sub-word sequences, wherein the sub-words in each query sub-word sequence are consecutive sub-words of the query word; for any candidate document, determining, from the sub-word sequence of the candidate document, the segments that match each of the query sub-word sequences, and determining a relevance score of the candidate document based on the inverse document frequency of each sub-word covered in each segment and the length of each segment; and sorting the candidate documents based on their relevance scores, and outputting the sorted candidate documents.
- 2. The method of claim 1, wherein sorting the candidate documents based on their relevance scores comprises: sorting the candidate documents based on a coverage factor of each candidate document and/or a length penalty factor of each candidate document, together with the relevance score of each candidate document; wherein the coverage factor is determined according to the number of sub-words of the query word that are matched in the candidate document, and the length penalty factor is determined based on the sub-word sequence length of the candidate document.
- 3. The method according to claim 2, wherein the method further comprises: determining the non-repeated sub-words among the sub-words, and determining the number of non-repeated sub-words; and, for any candidate document, determining the number of non-repeated sub-words that are matched in the candidate document, and determining the coverage factor of the candidate document according to the ratio of the number of matched sub-words to the number of non-repeated sub-words; wherein the coverage factor has a super-linear relationship with the ratio.
- 4. The method according to claim 2, wherein the method further comprises: determining a global average sub-word sequence length according to the sub-word sequence length of each document in the document library; and, for any candidate document, determining the length penalty factor of the candidate document according to the ratio of the sub-word sequence length of the candidate document to the global average sub-word sequence length; wherein the length penalty factor is inversely related to the ratio.
- 5. The method of claim 1, wherein determining, from the sub-word sequence of the candidate document, the segments that match each of the query sub-word sequences comprises: traversing the query sub-word sequences from longest to shortest, and searching the sub-word sequence of the candidate document for the currently traversed query sub-word sequence to obtain a segment that matches it; and adding the segment as a matching segment if no existing matching segment covers it, or discarding the segment if an existing matching segment covers it, until all query sub-word sequences have been traversed.
- 6. The method of claim 1, wherein determining the relevance score of the candidate document based on the inverse document frequency of each sub-word covered in each segment and the length of each segment comprises: for any one of the segments, determining a relevance score of the segment based on the inverse document frequency of each sub-word covered in the segment and the length of the segment, wherein the relevance score of the segment has a sub-linear relationship with the inverse document frequency of each sub-word and a super-linear relationship with the length of the segment; and determining the relevance score of the candidate document based on the relevance scores of the segments contained in the candidate document.
- 7. A document retrieval apparatus, the apparatus comprising: a segmentation module, configured to receive a query word, segment the query word into a plurality of sub-words, and query a document library to obtain candidate documents that match any of the sub-words, together with the sub-word sequence of each candidate document; a construction module, configured to construct a plurality of query sub-word sequences, wherein the sub-words in each query sub-word sequence are consecutive sub-words of the query word; a first determining module, configured to determine, for any candidate document, the segments that match each of the query sub-word sequences from the sub-word sequence of the candidate document, and to determine a relevance score of the candidate document based on the inverse document frequency of each sub-word covered in each segment and the length of each segment; and a sorting module, configured to sort the candidate documents based on their relevance scores and output the sorted candidate documents.
- 8. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any of claims 1 to 6 when the computer program is executed.
- 9. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 6.
- 10. A computer program product comprising a computer program, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 6.
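The matching and scoring steps recited in claims 1, 5 and 6 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the applicant's implementation: the function names, the exponent `alpha`, the use of `log1p` to damp inverse document frequency, and the reading of "covers" as full containment are all hypothetical choices. The claims fix only that a segment's score is sub-linear in inverse document frequency and super-linear in segment length.

```python
import math
from typing import Dict, List, Tuple


def query_subsequences(subwords: List[str]) -> List[Tuple[str, ...]]:
    """All distinct runs of consecutive query sub-words, longest first."""
    n = len(subwords)
    runs = {tuple(subwords[i:i + k])
            for k in range(1, n + 1)
            for i in range(n - k + 1)}
    return sorted(runs, key=lambda r: (-len(r), r))


def match_segments(doc: List[str],
                   runs: List[Tuple[str, ...]]) -> List[Tuple[int, int]]:
    """Half-open (start, end) spans of `doc` matched by `runs`.

    `runs` must be ordered long to short. A span is discarded when an
    already kept span fully contains it (assumed meaning of "covers").
    """
    kept: List[Tuple[int, int]] = []
    for run in runs:
        k = len(run)
        for start in range(len(doc) - k + 1):
            if tuple(doc[start:start + k]) != run:
                continue
            end = start + k
            if any(s <= start and end <= e for s, e in kept):
                continue  # covered by a longer match: discard
            kept.append((start, end))
    return sorted(kept)


def segment_score(segment: List[str], idf: Dict[str, float],
                  alpha: float = 1.5) -> float:
    """Assumed form: length**alpha times the mean damped IDF.

    log1p makes the score sub-linear in IDF; alpha > 1 makes it
    super-linear in segment length.
    """
    damped = [math.log1p(idf.get(w, 0.0)) for w in segment]
    return (len(segment) ** alpha) * (sum(damped) / len(segment))


def document_score(doc: List[str], query_subwords: List[str],
                   idf: Dict[str, float]) -> float:
    """Sum of the scores of all kept segments (assumed aggregation)."""
    runs = query_subsequences(query_subwords)
    return sum(segment_score(doc[s:e], idf)
               for s, e in match_segments(doc, runs))
```

Under these assumptions, a document that contains the query sub-words as one contiguous run scores higher than a document that contains the same sub-words scattered, because the single long segment is rewarded super-linearly while the isolated one-sub-word segments are not.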
Description
Document retrieval method, apparatus, computer device, storage medium, and computer program product
Technical Field
The present application relates to the field of computer technology, and in particular to a document retrieval method, apparatus, computer device, storage medium, and computer program product.
Background
Sparse retrieval is a class of retrieval algorithms that segments each document into words, maps the document into a high-dimensional space spanned by the vocabulary, and, at retrieval time, selects the best-matching documents by computing the similarity between a query vector and each document vector. Owing to its efficiency and strong interpretability, sparse retrieval still occupies a core position in modern information retrieval systems. Among sparse retrieval ranking algorithms, BM25 is widely regarded as the best performing and is the most widely used.
However, BM25 relies on a language-specific word segmenter. When the segmenter encounters words that are not in its dictionary, such as newly coined words, internet buzzwords, or brand and model names, it is prone to segmentation errors. These errors seriously reduce retrieval recall, and the documents returned to the user are not necessarily ordered by how well they objectively match the query word, so retrieval precision is poor.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a document retrieval method, apparatus, computer device, storage medium, and computer program product.
In a first aspect, the present application provides a document retrieval method, the method comprising: receiving a query word, segmenting the query word into a plurality of sub-words, and querying a document library to obtain candidate documents that match any of the sub-words, together with the sub-word sequence of each candidate document; constructing a plurality of query sub-word sequences, wherein the sub-words in each query sub-word sequence are consecutive sub-words of the query word; for any candidate document, determining, from the sub-word sequence of the candidate document, the segments that match each of the query sub-word sequences, and determining a relevance score of the candidate document based on the inverse document frequency of each sub-word covered in each segment and the length of each segment; and sorting the candidate documents based on their relevance scores, and outputting the sorted candidate documents.
In one embodiment, sorting the candidate documents based on their relevance scores includes: sorting the candidate documents based on a coverage factor of each candidate document and/or a length penalty factor of each candidate document, together with the relevance score of each candidate document. The coverage factor is determined according to the number of sub-words of the query word that are matched in the candidate document. The length penalty factor is determined based on the sub-word sequence length of the candidate document.
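The coverage factor and length penalty factor described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the exponent `beta`, the BM25-style normalization with parameter `b`, the multiplicative combination of the three quantities, and all function names are hypothetical. The application states only that the coverage factor is super-linear in the matched ratio (claim 3) and that the length penalty factor is inversely related to the ratio of document length to the global average length (claim 4).

```python
from typing import Dict, List


def coverage_factor(query_subwords: List[str], doc_subwords: List[str],
                    beta: float = 2.0) -> float:
    """Matched fraction of the non-repeated query sub-words, raised to beta.

    beta > 1 is an assumed way to obtain a super-linear relationship.
    """
    unique = set(query_subwords)
    if not unique:
        return 0.0
    ratio = len(unique & set(doc_subwords)) / len(unique)
    return ratio ** beta


def length_penalty(doc_len: int, avg_len: float, b: float = 0.5) -> float:
    """Assumed BM25-style normalization: shrinks as doc_len / avg_len grows."""
    if avg_len <= 0:
        return 1.0
    return 1.0 / (1.0 - b + b * doc_len / avg_len)


def rank(query_subwords: List[str],
         candidates: Dict[str, List[str]],
         relevance: Dict[str, float],
         avg_len: float) -> List[str]:
    """Order candidate ids by relevance * coverage * length penalty.

    `avg_len` is the global average sub-word sequence length of the
    document library; multiplying the factors is an assumption.
    """
    def final_score(doc_id: str) -> float:
        doc = candidates[doc_id]
        return (relevance[doc_id]
                * coverage_factor(query_subwords, doc)
                * length_penalty(len(doc), avg_len))

    return sorted(candidates, key=final_score, reverse=True)
```

With equal relevance scores, a short candidate that contains every query sub-word ranks ahead of a long candidate that contains only some of them, which is the behavior the two factors are meant to produce.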
In one embodiment, the method further comprises: determining the non-repeated sub-words among the sub-words, and determining the number of non-repeated sub-words; and, for any candidate document, determining the number of non-repeated sub-words that are matched in the candidate document, and determining the coverage factor of the candidate document according to the ratio of the number of matched sub-words to the number of non-repeated sub-words; wherein the coverage factor has a super-linear relationship with the ratio.
In one embodiment, the method further comprises: determining a global average sub-word sequence length according to the sub-word sequence length of each document in the document library; and, for any candidate document, determining the length penalty factor of the candidate document according to the ratio of the sub-word sequence length of the candidate document to the global average sub-word sequence length; wherein the length penalty factor is inversely related to the ratio.
In one embodiment, determining, from the sub-word sequence of the candidate document, the segments that match each of the query sub-word sequences includes: traversing the query sub-word sequences from longest to shortest, and searching the sub-word sequence of the candidate document for the currently traversed query sub-word sequence to obtain a segment that matches it; and adding the segment as a matching segment if no existing matching segment covers it, or discarding the segment if an existing matching segment covers it, until all query sub-word sequences have been traversed.
In one embodiment, determining the relevance score of the candidate document based on the inverse document frequency of each sub-word covered in each segment and the length of each segment includes: Det