CN-122019843-A - Query base determination method and device

CN122019843ACN 122019843 ACN122019843 ACN 122019843ACN-122019843-A

Abstract

The embodiment of the application provides a query base determining method and device, relates to the technical field of databases, and can reduce space overhead when determining a query base. The method comprises the steps of determining the largest target suffix in suffix data of target data, and determining the query base of the target data based on the target suffix, a suffix array of character strings in a database and a target function for indicating character distribution conditions.

Inventors

ZHAN YIRUI
NIE WEN
GAO JUN

Assignees

华为技术有限公司
北京大学

Dates

Publication Date: 20260512
Application Date: 20241111

Claims (20)

1. A query base determination method, comprising: determining a target suffix of target data, wherein the target suffix is the maximum suffix in suffix data generated based on the target data; And determining a query base of the target data based on the target suffix, a suffix array and a target function, wherein the suffix array is obtained based on suffix data of character strings in a database, the target function is a character accumulation distribution function of the suffix array and is used for indicating the distribution condition of characters in the suffix array, and the query base is used for indicating the occurrence times of the target data in the database.
2. The method of claim 1, wherein the suffix array includes a first-order F-number column, wherein the objective function is used to indicate a relationship between a number of occurrences of each character in a last-order L-number column and a position in the L-number column, wherein the F-number column and the L-number column are obtained by performing a robert-wheatstone change BWT on a character string in the database, and wherein characters in the F-number column are in one-to-one correspondence with characters in the L-number column.
3. The method of claim 2, wherein the determining the query cardinality of the target data based on the target suffix, suffix array, and target function comprises: determining the interval position of the nth character of the target suffix in the F number column, wherein n is a positive integer, and the initial value of n is the length of the target suffix; Determining the interval position of the L number column corresponding to the nth character based on the interval position of the nth character in the F number column; If n is greater than 2, determining a section position of an n-1 th character in the F number row based on the objective function and the section position of the L number row corresponding to the n-th character, and returning to the step of determining the section position of the L number row corresponding to the n-th character based on the section position of the n-th character in the F number row by making n=n-1; If n is less than or equal to 2, determining the occurrence number of the n-1 th character in the interval position of the L number sequence corresponding to the n-th character based on the objective function, and determining the occurrence number of the n-th character as the query base of the target data.
4. The method of claim 2, wherein prior to determining the query cardinality of the target data based on the target suffix, suffix array, and target function, the method further comprises: Determining a distribution function of each character in the L number sequence according to the occurrence frequency of each character in the L number sequence, wherein the distribution function is used for indicating the relation between the occurrence frequency of the character and the position in the L number sequence; And fitting the distribution function based on the error condition to obtain the objective function.
5. The method of claim 4, wherein said fitting the distribution function based on the error condition results in the objective function comprising: Determining a mean square error of the distribution function; and fitting the distribution function based on the mean square error, and taking the distribution function under the condition of minimum mean square error as the target function.
6. The method of claim 4, wherein fitting the distribution function based on the error condition results in the objective function, comprising: Determining the slope of a sub-function between a starting node and an ith node of the distribution function, wherein i is a positive integer, and each node of the distribution function corresponds to one position in the L number series and the occurrence number of characters; Returning to the step of determining the slope between the start node and the i-th node of the distribution function, with i=i+1, in the case where the slope is smaller than a slope threshold and the i-th node is not the end node of the distribution function; When the slope is greater than or equal to the slope threshold and the i node is not the ending node of the distribution function, taking the i-1 node as the ending node of the sub-function, taking the i node as the starting node of the next sub-function, enabling i=i+1, and returning to the step of determining the slope between the starting node of the distribution function and the i node; all sub-functions are taken as the objective function.
7. The method according to claim 2, wherein the method further comprises: traversing the cyclic suffix of each character string in the database, and establishing a suffix tree corresponding to the database, wherein the suffix tree comprises a plurality of nodes, each node is used for indicating one character of the cyclic suffix in the database, and each node corresponds to a section of position interval in the L number sequence; Pruning the nodes aiming at any node in the character suffix tree under the condition that the node meets pruning conditions to obtain a pruned suffix tree; And replacing the F number sequence with the pruned suffix tree.
8. The method of claim 7, wherein the pruning condition includes at least one of a height of a suffix tree corresponding to a node being less than a height threshold and a section size of a location section corresponding to the node being less than a section threshold, the height of the suffix tree corresponding to the node being a height of the suffix tree in which the node is located, the section size being used to indicate a section size of an L number of columns of location sections corresponding to the node.
9. The method of claim 7 or 8, wherein the determining a query radix of the target data based on the target suffix, suffix array, and target function comprises: Determining the longest sub-suffix of the target suffix in the pruned suffix tree and the target length of the longest sub-suffix; if the target length is equal to the length of the target suffix, determining a query base of the target data based on node parameters of the pruned suffix tree; If the target length is smaller than the length of the target suffix, determining the interval position of the L number columns corresponding to the longest sub-suffix based on the node parameters of the pruning suffix tree, and determining the query base of the target data based on the interval position of the L number columns corresponding to the longest sub-suffix.
10. The method of claim 2, wherein the database includes a plurality of character strings therein, the method further comprising: Determining at least one cyclic suffix corresponding to each character string in the plurality of character strings to obtain a plurality of cyclic suffixes; If all characters of the first cyclic suffix are different from characters at the same position in the second cyclic suffix, determining an order between the first cyclic suffix and the second cyclic suffix based on dictionary order, wherein the length of the second cyclic suffix is greater than that of the first cyclic suffix; If all characters of a first cyclic suffix are the same as characters at the same position in a second cyclic suffix, determining an order between the first cyclic suffix and the second cyclic suffix based on remaining characters of the first cyclic suffix and the second cyclic suffix compared to the first cyclic suffix; And ordering the plurality of cyclic suffixes based on an order among the plurality of cyclic suffixes to obtain the F-number sequence.
11. A method according to claim 3, characterized in that the method further comprises: determining a change character string in the database, wherein the change character string refers to a character string changed in the database after the objective function is obtained; Determining a change suffix tree corresponding to the change character string based on the cyclic suffix corresponding to the change character string, wherein the change suffix tree is used for indicating the occurrence times of each suffix in the change character string; and correcting the query base of the target data based on the changed suffix tree.
12. The method of claim 11, wherein the change string comprises an add string and/or a delete string, wherein the modifying the query base of the target data based on the change suffix tree comprises: Determining a base correction parameter of the target data based on the target suffix and the changed suffix tree; When the changed character string is a newly added character string, taking the sum of the base correction parameter and the query base as the query base of the target data; and when the change character string is a deletion character string, taking the difference value between the base correction parameter and the query base as the query base of the target data.
13. The method of claim 11, wherein the method further comprises: And under the condition that the change character string meets the target condition, the target function is redetermined based on all character strings in the database.
14. A query base determination device, comprising: The suffix determining module is used for determining a target suffix of target data, wherein the target suffix is the maximum suffix in suffix data generated based on the target data; the base number determining module is used for determining a query base number of the target data based on the target suffix, a suffix array and a target function, wherein the suffix array is obtained based on suffix data of character strings in a database, the target function is a character cumulative distribution function of the suffix array and is used for indicating the distribution condition of characters in the suffix array, and the query base number is used for indicating the occurrence times of the target data in the database.
15. The apparatus of claim 14, wherein the suffix array comprises a first-order F-number column, wherein the objective function is configured to indicate a relationship between a number of occurrences of each character in a last-order L-number column and a position in the L-number column, wherein the F-number column and the L-number column are obtained by performing a robert-wheatstone change BWT on a character string in the database, and wherein the characters in the F-number column are in one-to-one correspondence with the characters in the L-number column.
16. The apparatus of claim 15, wherein the cardinality determination module is configured to: determining the interval position of the nth character of the target suffix in the F number column, wherein n is a positive integer, and the initial value of n is the length of the target suffix; Determining the interval position of the L number column corresponding to the nth character based on the interval position of the nth character in the F number column; If n is greater than 2, determining a section position of an n-1 th character in the F number row based on the objective function and the section position of the L number row corresponding to the n-th character, and returning to the step of determining the section position of the L number row corresponding to the n-th character based on the section position of the n-th character in the F number row by making n=n-1; If n is less than or equal to 2, determining the occurrence number of the n-1 th character in the interval position of the L number sequence corresponding to the n-th character based on the objective function, and determining the occurrence number of the n-th character as the query base of the target data.
17. The apparatus of claim 15, wherein the cardinality determination module is further configured to: Determining a distribution function of each character in the L number sequence according to the occurrence frequency of each character in the L number sequence, wherein the distribution function is used for indicating the relation between the occurrence frequency of the character and the position in the L number sequence; And fitting the distribution function based on the error condition to obtain the objective function.
18. The apparatus of claim 17, wherein the cardinality determination module is configured to: Determining a mean square error of the distribution function; and fitting the distribution function based on the mean square error, and taking the distribution function under the condition of minimum mean square error as the target function.
19. The apparatus of claim 17, wherein the cardinality determination module is configured to: Determining the slope of a sub-function between a starting node and an ith node of the distribution function, wherein i is a positive integer, and each node of the distribution function corresponds to one position in the L number series and the occurrence number of characters; Returning to the step of determining the slope between the start node and the i-th node of the distribution function, with i=i+1, in the case where the slope is smaller than a slope threshold and the i-th node is not the end node of the distribution function; When the slope is greater than or equal to the slope threshold and the i node is not the ending node of the distribution function, taking the i-1 node as the ending node of the sub-function, taking the i node as the starting node of the next sub-function, enabling i=i+1, and returning to the step of determining the slope between the starting node of the distribution function and the i node; all sub-functions are taken as the objective function.
20. The apparatus of claim 15, wherein the cardinality determination module is further configured to: traversing the cyclic suffix of each character string in the database, and establishing a suffix tree corresponding to the database, wherein the suffix tree comprises a plurality of nodes, each node is used for indicating one character of the cyclic suffix in the database, and each node corresponds to a section of position interval in the L number sequence; Pruning the nodes aiming at any node in the character suffix tree under the condition that the node meets pruning conditions to obtain a pruned suffix tree; And replacing the F number sequence with the pruned suffix tree.

Description

Query base determination method and device Technical Field The present application relates to the field of database technologies, and in particular, to a method and an apparatus for determining a query base. Background With the development of database technology, the amount of data that can be stored in a database is increasing. In the database, the user can quickly search out data such as character string data meeting specific conditions by adopting query sentences such as like sentences. In consideration of the fact that the number of query results is different, the optimal strategy for executing the query is different, if a certain query condition only involves a small number of results, the nested loop connection can be directly adopted for query, otherwise, a complex index structure may be needed. Based on the above, in order to further optimize the query performance, when query retrieval is performed, the number of results meeting the query condition can be estimated by determining the query base of the query instruction, so that a query mode with higher efficiency is selected for query. In order to quickly determine the query radix, the data of the database needs to be processed, for example, a suffix array of the data of the database is established and stored, so that the determination of the query radix is realized through the suffix array of the database. However, since databases typically contain large amounts of data, suffix arrays for data can take up a significant amount of space overhead. Disclosure of Invention The application provides a query base determining method and device, which can reduce space overhead for determining the query base to a certain extent. In order to achieve the above purpose, the application adopts the following technical scheme: In a first aspect, an embodiment of the present application provides a method for determining a query radix, including determining a target suffix of target data, and determining a query radix of the target data based on the target suffix, a suffix array, and a target function. The target suffix is the maximum suffix in suffix data generated based on target data, the suffix array is obtained based on suffix data of character strings in a database, the target function is a character cumulative distribution function of the suffix array and is used for indicating the distribution condition of characters in the suffix array, and the query base is used for indicating the occurrence times of the target data in the database. In the implementation manner, the character distribution condition in the suffix array is indicated by introducing the objective function, and under the condition that the query base number of the objective data in the database is determined based on the objective suffix, the suffix data of all character strings in the database is not required to be stored, the occurrence times of the objective suffix in the database can be determined based on the suffix array and the objective function, the query base number of the objective data is obtained, and a large amount of space expenditure can be saved. Optionally, the suffix array includes a first-order F-number-column, and the objective function is used to indicate a relationship between the number of occurrences of each character in a last-order L-number-column and a position in the L-number-column, where the F-number-column and the L-number-column are obtained by performing a robusthuiller change BWT on a character string in the database, and the characters in the F-number-column are in one-to-one correspondence with the characters in the L-number-column. Alternatively, the interval position of the nth character of the target suffix in the F-number column may be determined. And determining the interval position of the L number column corresponding to the nth character based on the interval position of the nth character in the F number column. If n is greater than 2, determining the section position of the n-1 character in the F number row based on the objective function and the section position of the L number row corresponding to the n-th character, and returning to the step of determining the section position of the L number row corresponding to the n-th character based on the section position of the n-th character in the F number row by making n=n-1. If n is less than or equal to 2, determining the occurrence number of the n-1 th character in the interval position of the L number sequence corresponding to the n-th character based on the objective function, and determining the occurrence number of the n-th character as the query base of the objective data. Wherein n is a positive integer, and the initial value of n is the length of the target suffix. In the above implementation, the determination of the number of occurrences of the target suffix in the database is implemented by means of the BWT algorithm through the F-number columns and the L-number columns, and the L-number col