CN-121979887-A - Length-independent set similarity query method based on B+ tree and application thereof

CN121979887ACN 121979887 ACN121979887 ACN 121979887ACN-121979887-A

Abstract

The application relates to the technical field of database information retrieval, in particular to a length-independent set similarity query method based on a B+ tree and application thereof. Through pre-constructing length partitions, key generation and index construction, in the query stage, length partition indexes are utilized, length filtering, key boundary generation and combined bucket difference filtering are used for directly generating key boundaries, a large number of unlikely similar sets are filtered, candidate sets are reduced, and accordingly query efficiency is improved. The method aims at solving the problem of how to reduce the candidate set so as to reduce the calculation cost in the multi-candidate set scene.

Inventors

JIA LIANYIN
ZHAO YONGXUE
LI MENGJUAN
DING JIAMAN
ZHOU HAIHE
ZENG RUI

Assignees

昆明理工大学

Dates

Publication Date: 20260505
Application Date: 20260120

Claims (10)

1. A length independent set similarity query method based on a b+ tree, the method comprising the steps of: S10, determining the length of each set in the data set, and dividing the sets with the same length into the same partition; S20, determining bucket vectors of each set, sequentially connecting values of the bucket vectors, and generating integer keys; S30, constructing a B+ tree index for each partition by taking the integer key as a key value; S40, when a set to be queried is received, selecting a target B+ tree index corresponding to a target partition in a length interval of the set to be queried, determining an upper limit of symmetry difference of the target partition, determining a key boundary according to the upper limit of symmetry difference, searching a target set meeting a preset rule in the key boundary in the target B+ tree index, and outputting corresponding data in the target set as a query result.
2. The method of claim 1, wherein in S40, determining an upper boundary of a symmetry difference of the target partition, and determining a key boundary based on the upper boundary of the symmetry difference comprises: s41, determining an upper limit of symmetry difference of the target partition, wherein the upper limit of symmetry difference is expressed as follows: ; In the formula, For the preset similarity threshold value, Q is the length of a set to be queried, and L is the vector length of the set in the target partition; s42, judging the integer key of the set to be queried The first bucket vector in (1) Subtracting the upper limit of the symmetry difference Whether or not it is negative, if not, from Is subtracted by If yes, then Set to 0 and recursively set the remaining portions Assigned to subsequent bucket vectors by the same rule to obtain key lower bounds ; S43, judging the upper limit of the symmetry difference Integer keys with the set to be queried The first bucket vector in (1) If the addition is greater than the preset upper threshold value, if not, then Added to Otherwise, the bucket vector is calculated Is arranged as And recursively combine the remaining portions Assigned to subsequent bucket vectors by the same rule to obtain key upper bounds ; S44, lower bound of key And key upper bound As the key boundary.
3. The method as claimed in claim 2, wherein in S40, searching the target b+ tree index for a target set meeting a preset rule in the key boundary, and outputting the data corresponding to the target set as a query result includes: S45, determining the target key in the key boundary Corresponding integer key And integer keys corresponding to the set Q to be queried ; S46, determining integer key Integer key Whether the accumulated value of the difference between the previous j term bucket vectors is greater than the upper bound of the symmetry difference between the target key K and the set Q to be queried ; S47, if not, selecting a target set R associated with the target key K and calling corresponding data in the target set R to be output as a query result, and if so, filtering the target key K.
4. The method of claim 3, further comprising, after S47: s48, if a plurality of target sets R exist, determining the similarity between each target set R and the set Q to be queried; And S49, outputting the target set R with the similarity larger than a similarity threshold value as a query result.
5. The method according to any one of claims 1 to 4, wherein in S40, an upper bound of the length interval is a product of a length of the set to be queried and a preset similarity threshold, and a lower bound is a ratio of the length of the set to be queried and the preset similarity threshold.
6. The method of claim 1, wherein in S20, determining the bucket vector for each of the sets comprises: S21, statistics data set The frequency number of each element is arranged, and all elements are arranged in descending order; s22, dividing all elements into elements by using greedy division strategy The barrel comprises: S23, leading The elements are divided into A plurality of buckets, elements being assigned to a bucket while recording the cumulative frequency value of the bucket at that time; s24, sequentially distributing the rest elements to the barrel with the smallest current accumulated frequency value until the elements are completely divided; S25, according to the mapping relation between the elements and the barrel, a data set Each set of (3) Mapping to obtain bucket vectors ; Wherein, the Representation of Is assigned to the first element The number of the individual barrels is set to be equal, Maximum use The binary representation of the bits.
7. The method as claimed in claim 6, wherein the step of generating an integer key by sequentially concatenating values of the respective bucket vectors in S20 comprises: s27, connecting according to the generation sequence of the barrel vectors Integer key of each vector value ; Wherein the shaping key The maximum aggregate length of (2) is 。
8. The method of claim 1, wherein S30 comprises: s31, arranging all the set mapping keys in each partition in an ascending order; S32, for each partition, constructing a B+ tree index by taking the mapped key as a key, and marking as And for each key in leaf nodes of the B+ tree index New array To save the mapping as Is set in the database, and is set in the database.
9. Use of a length independent set similarity query method based on a b+ tree according to any of claims 1 to 8 in data queries.
10. A computer readable storage medium, characterized in that it has stored thereon a computer program which, when executed by a processor, implements the steps of the b+ tree-based length independent set similarity query method according to any of claims 1 to 6.

Description

Length-independent set similarity query method based on B+ tree and application thereof Technical Field The application relates to the technical field of database information retrieval, in particular to a length-independent set similarity query method based on a B+ tree and application thereof. Background A set similarity query is a query that finds all sets from a dataset that satisfy similarity with a given set of queries. In the related technical scheme, the MBQ adopts k-means clustering to partition a data set, an independent B+ tree index is constructed for each partition, in the query process, an Upper function is called from the root downwards to nodes to filter nodes, and if the Upper similarity bound of a certain node and a query set is lower than a similarity threshold, the whole subtree taking the node as the root can be safely filtered. However, the algorithm for calculating the similarity upper bound node by node has the problem of too high calculation cost when the candidate set is large, so that the query efficiency is low. Although the prior art adopts a deep learning method to partition the data set, the partitioning process still needs higher preprocessing cost and cannot be well adapted to the scene of frequent data updating. In view of this, the application proposes a set similarity query method different from the traditional node-by-node calculation similarity upper bound, which aims to reduce the calculation cost in a multi-candidate set scene and improve the query efficiency. Disclosure of Invention The application mainly aims to provide a length-independent set similarity query method based on a B+ tree, which aims to solve the problem of how to reduce candidate sets so as to reduce calculation cost in a multi-candidate set scene. In order to achieve the above object, the present application provides a length-independent set similarity query method based on a b+ tree, the method comprising: S10, determining the length of each set in the data set, and dividing the sets with the same length into the same partition; S20, determining bucket vectors of each set, sequentially connecting values of the bucket vectors, and generating integer keys; S30, constructing a B+ tree index for each partition by taking the integer key as a key value; S40, when a set to be queried is received, selecting a target B+ tree index corresponding to a target partition in a length interval of the set to be queried, determining an upper limit of symmetry difference of the target partition, determining a key boundary according to the upper limit of symmetry difference, searching a target set meeting a preset rule in the key boundary in the target B+ tree index, and outputting corresponding data in the target set as a query result. Optionally, in S40, determining an upper boundary of symmetry difference of the target partition, and determining a key boundary according to the upper boundary of symmetry difference includes: s41, determining an upper limit of symmetry difference of the target partition, wherein the upper limit of symmetry difference is expressed as follows: In the formula, For the preset similarity threshold value, Q is the length of a set to be queried, and L is the vector length of the set in the target partition; s42, judging the integer key of the set to be queried The first bucket vector in (1)Subtracting the upper limit of the symmetry differenceWhether or not it is negative, if not, fromIs subtracted byIf yes, thenSet to 0 and recursively set the remaining portionsAssigned to subsequent bucket vectors by the same rule to obtain key lower bounds; S43, judging the upper limit of the symmetry differenceInteger keys with the set to be queriedThe first bucket vector in (1)If the addition is greater than the preset upper threshold value, if not, thenAdded toOtherwise, the bucket vector is calculatedIs arranged asAnd recursively combine the remaining portionsAssigned to subsequent bucket vectors by the same rule to obtain key upper bounds; S44, lower bound of keyAnd key upper boundAs the key boundary. Optionally, in S40, searching the target b+ tree index for a target set meeting a preset rule in the key boundary, and outputting the data corresponding to the target set as a query result includes: S45, determining the target key in the key boundary Corresponding integer keyAnd integer keys corresponding to the set Q to be queried; S46, determining integer keyInteger keyWhether the accumulated value of the difference between the previous j term bucket vectors is greater than the upper bound of the symmetry difference between the target key K and the set Q to be queried; S47, if not, selecting a target set R associated with the target key K and calling corresponding data in the target set R to be output as a query result, and if so, filtering the target key K. Optionally, after S47, the method further includes: s48, if a plurality of target sets R exist, determining the similarity between each target se