CN-121980009-A - Data retrieval method and device

Abstract

Embodiments of the application provide a data retrieval method, a data retrieval apparatus, a computer device, a computer-readable storage medium, and a computer program product, belonging to the technical field of data processing. The method comprises: obtaining a question to be queried; converting the question to be queried into a query object, wherein the query object comprises a query character string; searching in the hybrid retrieval system using the query object as a query condition to obtain the candidate result set returned by each search library; and fusing the candidate result sets returned by the search libraries to obtain a query result corresponding to the question to be queried. The application can make full use of the technical strengths of the different search libraries, compensate for the shortcomings of any single retrieval technique, effectively reduce the result bias of a single search library, and significantly improve the comprehensiveness and reliability of the retrieval results.

Inventors

  • XUE JINHU

Assignees

  • 上海幻电信息科技有限公司

Dates

Publication Date
20260505
Application Date
20260122

Claims (14)

  1. A data retrieval method applied to a hybrid retrieval system, characterized in that the hybrid retrieval system comprises at least two of a semantic search library, a keyword search library, a locality-sensitive hash search library, and a fuzzy matching search library, the method comprising: acquiring a question to be queried; converting the question to be queried into a query object, wherein the query object comprises a query character string; searching in the hybrid retrieval system using the query object as a query condition to obtain the candidate result set returned by each search library; and fusing the candidate result sets returned by the search libraries to obtain a query result corresponding to the question to be queried.
  2. The method of claim 1, wherein the hybrid retrieval system comprises a semantic search library, the query object further comprises a filter condition type, and searching in the semantic search library using the query object as a query condition to obtain the candidate result set returned by the semantic search library comprises: vectorizing the query character string in the query object to obtain a query vector; calculating the similarity between the query vector and each of a plurality of candidate text vectors stored in the semantic search library; selecting the candidate texts corresponding to the candidate text vectors whose similarity is greater than a first preset threshold to form an initial candidate result set; and filtering the candidate texts in the initial candidate result set based on the filter condition type in the query object to obtain a target candidate result set.
  3. The method of claim 2, wherein the semantic search library is a vector database constructed based on a hierarchical navigable small world (HNSW) algorithm, and calculating the similarity between the query vector and each of the plurality of candidate text vectors stored in the semantic search library comprises: calculating the similarity between the query vector and each of the plurality of candidate text vectors stored in the semantic search library using the HNSW algorithm.
  4. The method of claim 1, wherein the hybrid retrieval system comprises a keyword search library, the query object further comprises a filter condition type, and searching in the keyword search library using the query object as a query condition to obtain the candidate result set returned by the keyword search library comprises: performing word segmentation on the query character string in the query object to obtain at least one keyword; filtering a plurality of candidate texts stored in the keyword search library based on the filter condition type in the query object to obtain a candidate text set; calculating the matching degree between the at least one keyword and each of the candidate texts in the candidate text set; and selecting the candidate texts whose matching degree is greater than a second preset threshold to form a candidate result set.
  5. The method of claim 4, wherein the keyword search library is a search library constructed based on a term frequency-inverse document frequency (TF-IDF) algorithm, the keyword search library comprises a classification index dictionary constructed according to a preset filter condition type, the classification index dictionary comprises word segmentation lists of a plurality of candidate texts, and filtering the plurality of candidate texts stored in the keyword search library based on the filter condition type in the query object to obtain a candidate text set comprises: looking up, in the keyword search library, a target classification index dictionary matching the filter condition type in the query object, wherein the plurality of candidate texts associated with the target classification index dictionary form the candidate text set; and calculating the matching degree between the at least one keyword and each of the candidate texts in the candidate text set comprises: calculating, using the TF-IDF algorithm and based on the target classification index dictionary and the at least one keyword, TF-IDF scores between the at least one keyword and the plurality of candidate texts in the candidate text set.
  6. The method of claim 1, wherein the hybrid retrieval system comprises a locality-sensitive hash search library, the locality-sensitive hash search library is a search library constructed based on a locality-sensitive hashing technique and comprises a locality-sensitive hash bucket index constructed based on that technique, and searching in the locality-sensitive hash search library using the query object as a query condition to obtain the candidate result set returned by the locality-sensitive hash search library comprises: performing signature processing on the query character string in the query object to obtain signature data; performing index matching on the locality-sensitive hash buckets based on the signature data to obtain target buckets; taking the candidate texts associated with the target buckets as a candidate text set; and extracting the unique values from (deduplicating) the plurality of candidate texts in the candidate text set to obtain a candidate result set.
  7. The method of claim 6, wherein performing signature processing on the query character string in the query object to obtain signature data comprises: detecting whether the length of the query character string is greater than a preset length; when the length of the query character string is less than or equal to the preset length, performing signature processing on the query character string using a MinHash signature algorithm to obtain the signature data; and when the length of the query character string is greater than the preset length, splitting the query character string into a plurality of character strings using an n-gram technique, and performing signature processing on the plurality of split character strings using the MinHash signature algorithm to obtain the signature data.
  8. The method of claim 1, wherein the hybrid retrieval system comprises a fuzzy matching search library, and searching in the fuzzy matching search library using the query object as a query condition to obtain the candidate result set returned by the fuzzy matching search library comprises: calculating, using a fuzzy matching algorithm, the matching degree between the query character string in the query object and each of a plurality of candidate texts stored in the fuzzy matching search library; and selecting the candidate texts whose matching degree is greater than a third preset threshold to form a candidate result set.
  9. The method of claim 8, wherein the fuzzy matching algorithm adopted by the fuzzy matching search library is a longest common subsequence algorithm, each candidate text in the fuzzy matching search library comprises an original candidate text and field expansion content, the field expansion content is obtained by performing content expansion on target field data in the original candidate text, and calculating, using the fuzzy matching algorithm, the matching degree between the query character string in the query object and each of the plurality of candidate texts stored in the fuzzy matching search library comprises: calculating, using the longest common subsequence algorithm, the similarity between the query character string in the query object and each of a plurality of pieces of field expansion content stored in the fuzzy matching search library.
  10. The method of any one of claims 1 to 9, wherein fusing the candidate result sets returned by the search libraries to obtain a query result corresponding to the question to be queried comprises: fusing the candidate results in the candidate result sets returned by the search libraries using a reciprocal rank fusion algorithm to obtain a ranking score for each candidate result; and selecting the candidate results with the top N ranking scores as the query result corresponding to the question to be queried, wherein N is an integer greater than or equal to 1.
  11. A data retrieval apparatus, the apparatus comprising: an acquisition module configured to acquire a question to be queried; a conversion module configured to convert the question to be queried into a query object, wherein the query object comprises a query character string; a search module configured to search in the hybrid retrieval system using the query object as a query condition to obtain the candidate result set returned by each search library; and a fusion module configured to fuse the candidate result sets returned by the search libraries to obtain a query result corresponding to the question to be queried.
  12. A computer device, comprising: at least one processor; and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1 to 10.
  13. A computer-readable storage medium having stored therein computer instructions which, when executed by a processor, implement the method of any one of claims 1 to 10.
  14. A computer program product comprising a computer program, characterized in that the computer program, when executed by a processor, implements the steps of the method of any one of claims 1 to 10.
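
The fusion step of claim 10 can be illustrated with a minimal Python sketch of reciprocal rank fusion. This is an illustration under stated assumptions, not the applicant's implementation: the function name `reciprocal_rank_fusion`, the smoothing constant `k=60` (a commonly used default that the claims do not specify), the tie-breaking rule, and the sample document ids are all hypothetical.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60, top_n=3):
    """Fuse several ranked candidate lists with reciprocal rank fusion.

    Each candidate scores sum(1 / (k + rank)) over the lists that contain
    it, with rank starting at 1. k=60 is an assumed smoothing constant.
    """
    scores = defaultdict(float)
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by id so the order is deterministic.
    ordered = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ordered[:top_n]


# Hypothetical candidate result sets from three search libraries, best match first.
semantic_hits = ["doc_a", "doc_b", "doc_c"]
keyword_hits = ["doc_b", "doc_a", "doc_d"]
fuzzy_hits = ["doc_b", "doc_e"]

fused = reciprocal_rank_fusion([semantic_hits, keyword_hits, fuzzy_hits], top_n=2)
for doc_id, score in fused:
    print(doc_id, round(score, 5))
```

Because the score depends only on rank positions, the libraries do not need comparable raw scores, which is one reason rank-based fusion suits a mix of vector, keyword, hash, and fuzzy matchers.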

Description

Data retrieval method and device

Technical Field

Embodiments of the present application relate to the field of data processing technology, and in particular to a data retrieval method, apparatus, computer device, computer-readable storage medium, and computer program product.

Background

In an era of rapidly developing big data and artificial intelligence technology, data retrieval, as a core means of information acquisition, is widely used in fields such as intelligent question answering, document retrieval, and e-commerce recommendation. The current mainstream retrieval technology is based on semantic vectors: it converts the question to be queried and the data resources into high-dimensional semantic vectors and measures content relevance by vector similarity. This effectively captures the deep semantic information of text, but the retrieval precision is strongly affected by vector dimensionality and data scale.

It should be noted that the foregoing is not necessarily prior art and is not intended to limit the scope of protection of the present application.

Disclosure of Invention

Embodiments of the present application provide a data retrieval method, apparatus, computer device, computer-readable storage medium, and computer program product to solve or alleviate one or more of the technical problems set forth above.

An aspect of the embodiments of the present application provides a data retrieval method, comprising: acquiring a question to be queried; converting the question to be queried into a query object, wherein the query object comprises a query character string; searching in the hybrid retrieval system using the query object as a query condition to obtain the candidate result set returned by each search library; and fusing the candidate result sets returned by the search libraries to obtain a query result corresponding to the question to be queried.
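
The four steps above (acquire, convert, search, fuse) can be sketched in Python as follows. Everything named here is an assumption made for illustration: the `QueryObject` fields, the normalization done in `to_query_object`, the two toy search libraries, the tiny `CORPUS`, and the vote-count fusion in `fuse_by_votes`, which is a simple placeholder and not the fusion algorithm the application specifies.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class QueryObject:
    query_string: str
    filter_type: Optional[str] = None  # optional filter condition type


# A search library is modeled as a callable returning candidate ids.
SearchLibrary = Callable[[QueryObject], List[str]]

# Hypothetical candidate texts shared by both toy libraries.
CORPUS = {
    "doc_1": "reset your account password",
    "doc_2": "change the account email address",
    "doc_3": "password strength requirements",
}


def to_query_object(question: str, filter_type: Optional[str] = None) -> QueryObject:
    """Convert the raw question into a query object (here: trim and lowercase)."""
    return QueryObject(query_string=question.strip().lower(), filter_type=filter_type)


def keyword_library(query: QueryObject) -> List[str]:
    """Return texts sharing at least one whitespace-separated word with the query."""
    words = set(query.query_string.split())
    return [doc_id for doc_id, text in CORPUS.items() if words & set(text.split())]


def substring_library(query: QueryObject) -> List[str]:
    """Return texts that contain the whole query string."""
    return [doc_id for doc_id, text in CORPUS.items() if query.query_string in text]


def fuse_by_votes(candidate_sets: List[List[str]]) -> List[str]:
    """Placeholder fusion: order candidates by how many libraries returned them."""
    votes: Dict[str, int] = {}
    for candidates in candidate_sets:
        for doc_id in candidates:
            votes[doc_id] = votes.get(doc_id, 0) + 1
    return sorted(votes, key=lambda doc_id: (-votes[doc_id], doc_id))


def retrieve(question: str, libraries: Dict[str, SearchLibrary], fuse,
             filter_type: Optional[str] = None) -> List[str]:
    if len(libraries) < 2:
        raise ValueError("a hybrid retrieval system needs at least two search libraries")
    query = to_query_object(question, filter_type)
    # Every library receives the same query object and returns its own candidates.
    candidate_sets = [library(query) for library in libraries.values()]
    return fuse(candidate_sets)


result = retrieve(
    "  Account Password ",
    {"keyword": keyword_library, "substring": substring_library},
    fuse_by_votes,
)
print(result)
```

Passing the fusion function as a parameter keeps the retrieval loop independent of how the candidate sets are merged, so a different fusion strategy can be swapped in without touching the libraries.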
Optionally, the hybrid retrieval system includes a semantic search library, the query object further includes a filter condition type, and searching in the semantic search library using the query object as a query condition to obtain the candidate result set returned by the semantic search library includes: vectorizing the query character string in the query object to obtain a query vector; calculating the similarity between the query vector and each of a plurality of candidate text vectors stored in the semantic search library; selecting the candidate texts corresponding to the candidate text vectors whose similarity is greater than a first preset threshold to form an initial candidate result set; and filtering the candidate texts in the initial candidate result set based on the filter condition type in the query object to obtain a target candidate result set. Optionally, the semantic search library is a vector database constructed based on a hierarchical navigable small world (HNSW) algorithm, and calculating the similarity between the query vector and each of the plurality of candidate text vectors stored in the semantic search library includes: calculating the similarity between the query vector and each of the plurality of candidate text vectors stored in the semantic search library using the HNSW algorithm.
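
The semantic retrieval steps above (vectorize, score, threshold, then filter by type) can be sketched in Python. Two substitutions are made for brevity and should be read as assumptions: a bag-of-words count vector stands in for a real embedding model, and an exhaustive cosine-similarity scan stands in for the graph-based approximate search the text describes. The `STORE` contents, the `threshold` value of 0.3, and all function names are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[term] * v[term] for term in u if term in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical store: (document id, filter condition type, candidate text).
STORE = [
    ("doc_1", "faq", "how to reset a password"),
    ("doc_2", "manual", "reset a password from the settings page"),
    ("doc_3", "faq", "update billing address"),
]
# Candidate text vectors are computed once, ahead of query time.
INDEX = [(doc_id, kind, embed(text)) for doc_id, kind, text in STORE]


def semantic_search(query_string, filter_type, threshold=0.3):
    # Vectorize the query character string.
    query_vec = embed(query_string)
    # Score every stored vector (exhaustive scan instead of an HNSW index).
    scored = [(doc_id, kind, cosine(query_vec, vec)) for doc_id, kind, vec in INDEX]
    # Keep candidates above the first preset threshold: the initial candidate set.
    initial = [hit for hit in scored if hit[2] > threshold]
    # Filter the initial set by filter condition type: the target candidate set.
    target = [(doc_id, score) for doc_id, kind, score in initial if kind == filter_type]
    return sorted(target, key=lambda hit: -hit[1])


hits = semantic_search("reset password", "faq")
print(hits)
```

Note the ordering in this variant: the type filter runs after the similarity threshold, so a candidate of the wrong type can enter the initial set and is only removed in the last step.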
Optionally, the hybrid retrieval system includes a keyword search library, the query object further includes a filter condition type, and searching in the keyword search library using the query object as a query condition to obtain the candidate result set returned by the keyword search library includes: performing word segmentation on the query character string in the query object to obtain at least one keyword; filtering a plurality of candidate texts stored in the keyword search library based on the filter condition type in the query object to obtain a candidate text set; calculating the matching degree between the at least one keyword and each of the candidate texts in the candidate text set; and selecting the candidate texts whose matching degree is greater than a second preset threshold to form a candidate result set. Optionally, the keyword search library is a search library constructed based on a term frequency-inverse document frequency (TF-IDF) algorithm, the keyword search library includes a classification index dictionary constructed according to a preset filter condition type, the classification index dictionary includes word segmentation lists of a plurality of candidate texts, and filtering the plurality of candidate texts stored in the keyword search library based on the filter condition type in the query object to obtain a candidate text set includes: looking up, in the keyword search library, a target classification index dictionary matching the filter condition type in the query object, wherein a plurality of candidate texts associated with