CN-116226221-B - Optimization method, optimization device and storage medium for fuzzy search

Abstract

The application discloses an optimization method, an optimization device and a storage medium for fuzzy search. The method comprises: acquiring an input query string and a plurality of document strings stored in a knowledge base; performing word segmentation on the query string to obtain a plurality of query substrings; performing word segmentation on any one of the document strings to obtain a plurality of document substrings of that document string; for any document substring, judging whether the query string has a target query substring that has no fuzzy similarity with that document substring; and, where such a target query substring exists, replacing it with null and setting the edit distance between the target query substring and that document substring to a preset value. The application reduces the cases in which fuzzy search returns wrong results and can improve the efficiency of fuzzy search.

Inventors

  • ZHOU YANG
  • Liao Deng
  • ZHOU ZHIZHONG
  • TONG XING
  • ZHANG ZEQUN

Assignees

  • 中联重科股份有限公司 (Zoomlion Heavy Industry Science and Technology Co., Ltd.)
  • 中科云谷科技有限公司 (Zhongke Yungu Technology Co., Ltd.)

Dates

Publication Date
20260505
Application Date
20221201

Claims (8)

  1. An optimization method for fuzzy search, characterized in that the optimization method comprises: acquiring an input query string and a plurality of document strings stored in a knowledge base; performing word segmentation on the query string to obtain a plurality of query substrings; performing word segmentation on any document string of the plurality of document strings to obtain a plurality of document substrings of that document string; for any document substring, judging whether the query string has a target query substring that has no fuzzy similarity with that document substring; and, where the query string has a target query substring that has no fuzzy similarity with the document substring, replacing the target query substring with null and setting the edit distance between the target query substring and the document substring to a preset value; wherein the plurality of query substrings and the plurality of document substrings of the document string each comprise a word segmentation array and a part-of-speech array, the part-of-speech array corresponding one-to-one with the word segmentation array; and wherein judging whether the query string has a target query substring that has no fuzzy similarity with the document substring comprises at least one of: judging, according to the word segmentation arrays and part-of-speech arrays of the plurality of query substrings and of the plurality of document substrings of the document string, whether the query string has a query substring that matches the characters of the document substring but has a different part of speech, and, where it does, determining that the query string has a target query substring that has no fuzzy similarity with the document substring; or acquiring entity sets of the query string and the document string, judging, according to the part-of-speech array and the entity sets, whether the query string has a query substring that matches the characters of the document substring and has the same part of speech but a different entity type, and, where it does, determining that the query string has a target query substring that has no fuzzy similarity with the document substring; or acquiring entity sets of the query string and the document string and a synonym table of the knowledge base, judging, according to the part-of-speech array, the entity sets and the synonym table, whether the query string has a query substring that matches the characters of the document substring, has the same part of speech and the same entity type, but is not a synonym, and, where it does, determining that the query string has a target query substring that has no fuzzy similarity with the document substring.
  2. The optimization method of claim 1, wherein judging, according to the word segmentation arrays and part-of-speech arrays of the plurality of query substrings and of the plurality of document substrings of the document string, whether the query string has a query substring that matches the characters of the document substring but has a different part of speech comprises: judging, according to the word segmentation arrays of the plurality of query substrings and of the plurality of document substrings of the document string, whether the query string has characters that intersect with the document substring; where the query string has characters that intersect with the document substring, acquiring a similar query substring and a similar document substring; judging, according to the part-of-speech arrays of the plurality of query substrings and of the plurality of document substrings of the document string, whether the similar query substring and the similar document substring have the same part of speech; and, where their parts of speech differ, determining that the query string has a query substring that matches the characters of the document substring but has a different part of speech.
  3. The optimization method of claim 1, wherein judging, according to the part-of-speech array and the entity sets, whether the query string has a query substring that matches the characters of the document substring and has the same part of speech but a different entity type comprises: judging, according to the word segmentation arrays of the plurality of query substrings and of the plurality of document substrings of the document string, whether the query string has characters that intersect with the document substring; where the query string has characters that intersect with the document substring, acquiring a similar query substring and a similar document substring; judging, according to the part-of-speech arrays of the plurality of query substrings and of the plurality of document substrings of the document string, whether the similar query substring and the similar document substring have the same part of speech; where they have the same part of speech, judging, according to the entity sets, whether the similar query substring and the similar document substring have the same entity type; and, where their entity types differ, determining that the query string has a query substring that matches the characters of the document substring and has the same part of speech but a different entity type.
  4. The optimization method of claim 1, wherein judging, according to the part-of-speech array, the entity sets and the synonym table, whether the query string has a query substring that matches the characters of the document substring, has the same part of speech and the same entity type, but is not a synonym comprises: judging, according to the word segmentation arrays of the plurality of query substrings and of the plurality of document substrings of the document string, whether the query string has characters that intersect with the document substring; where the query string has characters that intersect with the document substring, acquiring a similar query substring and a similar document substring; judging, according to the part-of-speech arrays of the plurality of query substrings and of the plurality of document substrings of the document string, whether the similar query substring and the similar document substring have the same part of speech; where they have the same part of speech, judging, according to the entity sets, whether the similar query substring and the similar document substring have the same entity type; where they have the same entity type, judging, according to the synonym table, whether the similar query substring and the similar document substring are synonyms; and, where they are not synonyms, determining that the query string has a query substring that matches the characters of the document substring, has the same part of speech and the same entity type, but is not a synonym.
  5. The optimization method according to claim 1, characterized in that the optimization method further comprises: obtaining synonyms and professional vocabulary of a target field; and constructing the knowledge base according to the synonyms and the professional vocabulary of the target field.
  6. The optimization method according to claim 5, characterized in that the optimization method further comprises: updating the knowledge base upon receiving newly entered synonyms and/or professional vocabulary.
  7. An optimization device for fuzzy search, comprising: a memory configured to store instructions; and a processor configured to invoke the instructions from the memory and, when executing the instructions, to implement the optimization method for fuzzy search according to any one of claims 1 to 6.
  8. A machine-readable storage medium having stored thereon instructions for causing a machine to perform the optimization method for fuzzy search according to any one of claims 1 to 6.
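
The cascade of checks recited in claims 1 to 4 can be sketched in Python roughly as follows. This is an illustration only, not the patented implementation: the `Token` class, the function names, the character-set intersection test and the dictionary form of the synonym table are all assumptions made for the example, as is the choice to treat identical text as synonymous.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Set


@dataclass(frozen=True)
class Token:
    """One segmented word (hypothetical structure, not from the patent)."""
    text: str
    pos: str                           # part-of-speech tag, e.g. "n" or "v"
    entity_type: Optional[str] = None  # entity type from the entity set, if any


def chars_intersect(q: Token, d: Token) -> bool:
    # Assumption: "characters intersect" means the two words share a character.
    return bool(set(q.text) & set(d.text))


def are_synonyms(q: Token, d: Token, synonym_table: Dict[str, Set[str]]) -> bool:
    # Assumption: identical text counts as synonymous; lookup is symmetric.
    return (
        q.text == d.text
        or d.text in synonym_table.get(q.text, set())
        or q.text in synonym_table.get(d.text, set())
    )


def no_fuzzy_similarity(q: Token, d: Token,
                        synonym_table: Dict[str, Set[str]]) -> bool:
    """Return True when q is a 'target' query substring with respect to d."""
    if not chars_intersect(q, d):
        return False  # the checks only concern character-matching pairs
    if q.pos != d.pos:
        return True   # claim 2: characters match, part of speech differs
    if q.entity_type != d.entity_type:
        return True   # claim 3: same part of speech, entity type differs
    # claim 4: same part of speech and entity type, but not synonyms
    return not are_synonyms(q, d, synonym_table)
```

Ordering the checks from cheapest to most specific mirrors the step order in claims 2 to 4; claim 1 only requires at least one of the three checks, so a real system could apply any subset.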

Description

Optimization method, optimization device and storage medium for fuzzy search

Technical Field

The present application relates to the field of computer technologies, and in particular to an optimization method, an optimization device and a storage medium for fuzzy search.

Background

Search engine technology has become one of the important means of acquiring information resources. Searches can be broadly divided into "fuzzy searches" and "exact searches". In a fuzzy search, the search system automatically matches on synonyms of the keywords entered by the user or on string similarity, so as to obtain more search results.

Prior-art fuzzy search techniques generally include the steps of querying, edit distance calculation, sorting and outputting the results. For example, the query string is first divided according to the length of the strings in a paragraph to obtain a set of query substrings. When a string in the paragraph matches a string in the query string, the length of that string is added to the matching degree of the original string corresponding to the string's index. When the matching degree of a string is greater than a preset upper limit and the position list has no repeated elements, the string is added to the result set; otherwise, edit distance verification is performed on the string. When the matching degree is smaller than a preset lower limit, the string is filtered out directly; when it lies between the preset lower limit and the preset upper limit, edit distance verification is performed on the string.

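
For readers unfamiliar with the edit distance step, the sketch below shows the standard Levenshtein distance together with the three-way threshold decision described above. The function names and the `lower`, `upper` and `has_repeats` parameters are hypothetical; the source does not give concrete thresholds or an implementation.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed with a rolling one-row DP table."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def triage(match_degree: int, lower: int, upper: int, has_repeats: bool) -> str:
    """Decide what to do with a candidate string (thresholds are hypothetical)."""
    if match_degree < lower:
        return "filter"  # below the lower limit: drop directly
    if match_degree > upper and not has_repeats:
        return "accept"  # clearly matching: add to the result set
    return "verify"      # otherwise: run edit distance verification
```

Each "verify" outcome costs a full dynamic-programming pass proportional to the product of the two string lengths, which is why the number of candidates reaching that branch matters for efficiency.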
In the prior art, the edit distance is calculated directly on the related substrings found by the query, without semantic information, so strings with identical characters but completely different meanings may be matched. As a result, a large number of inaccurate results are returned, the edit distance computation load is larger, and fuzzy search efficiency is lower.

Disclosure of Invention

The embodiments of the application aim to provide an optimization method, an optimization device and a storage medium for fuzzy search, to solve the prior-art problem that fuzzy search may return a large number of inaccurate results and is therefore inefficient.

To achieve the above object, a first aspect of the present application provides an optimization method for fuzzy search, the optimization method comprising: acquiring an input query string and a plurality of document strings stored in a knowledge base; performing word segmentation on the query string to obtain a plurality of query substrings; performing word segmentation on any document string of the plurality of document strings to obtain a plurality of document substrings of that document string; for any document substring, judging whether the query string has a target query substring that has no fuzzy similarity with that document substring; and, where the query string has a target query substring that has no fuzzy similarity with the document substring, replacing the target query substring with null and setting the edit distance between the target query substring and the document substring to a preset value.

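
The final step of the first aspect (replace with null, assign a preset distance) can be sketched as below. This is one possible reading under stated assumptions: `score_pairs`, the `dissimilar` and `distance` callables, the use of an empty string for "null", and the value of `PRESET_DISTANCE` are all hypothetical, since the source only says "a preset value".

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical preset; the source does not state the actual value.
PRESET_DISTANCE = 99


def score_pairs(
    query_tokens: List[str],
    doc_tokens: List[str],
    dissimilar: Callable[[str, str], bool],
    distance: Callable[[str, str], int],
) -> Tuple[List[str], Dict[Tuple[int, int], int]]:
    """Score every (query substring, document substring) pair.

    Pairs judged to have no fuzzy similarity skip the edit distance
    computation and receive PRESET_DISTANCE instead.
    """
    pruned = list(query_tokens)  # copy; the caller's list is left untouched
    distances: Dict[Tuple[int, int], int] = {}
    for qi, q in enumerate(query_tokens):
        for di, d in enumerate(doc_tokens):
            if dissimilar(q, d):
                pruned[qi] = ""  # assumption: "null" is an empty string
                distances[(qi, di)] = PRESET_DISTANCE
            else:
                distances[(qi, di)] = distance(q, d)
    return pruned, distances
```

The saving comes from the `dissimilar` branch: every pair caught there avoids one call to the comparatively expensive `distance` function.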
In an embodiment of the application, the plurality of query substrings and the plurality of document substrings of any document string each comprise a word segmentation array and a part-of-speech array, the part-of-speech array corresponding one-to-one with the word segmentation array.

In an embodiment of the application, judging whether the query string has a target query substring that has no fuzzy similarity with any document substring comprises at least one of the following: judging, according to the word segmentation arrays and part-of-speech arrays of the plurality of query substrings and of the plurality of document substrings of the document string, whether the query string has a query substring that matches the characters of the document substring but has a different part of speech, and, where it does, determining that the query string has a target query substring that has no fuzzy similarity with the document substring; or The method comprises the steps of ob