US-20260127151-A1 - VECTOR DATABASE INDEXING

US20260127151A1US 20260127151 A1US20260127151 A1US 20260127151A1US-20260127151-A1

Abstract

Techniques for vector database indexing are described. In an example, a plurality of unstructured data is obtained. Vector embeddings are performed on each unstructured data from amongst the plurality of unstructured data to generate a plurality of vectors. Each vector from amongst the plurality of vectors is indicative of attributes of the unstructured data from amongst the plurality of unstructured data. The plurality of unstructured data and the plurality of vectors is stored in the vector database. A vector index corresponding to the plurality of vectors based on a first indexing mechanism is generated where the vector index includes at least one cluster of vectors with similar attributes. An index mapping is performed to store the vector index into a relational database.

Inventors

Pratyush Goel
Nimar Arora
Ritesh Goru

Assignees

DevRev, Inc.

Dates

Publication Date: 20260507
Application Date: 20241101

Claims (20)

1 . A method for implementation of a vector database, the method comprising: obtaining a plurality of unstructured data; performing vector embedding on each unstructured data from amongst the plurality of unstructured data to generate a plurality of vectors, wherein each vector from amongst the plurality of vectors is indicative of attributes of the unstructured data from amongst the plurality of unstructured data; storing the plurality of unstructured data and the plurality of vectors in the vector database; generating a vector index corresponding to the plurality of vectors based on a first indexing mechanism, wherein the vector index comprises at least one cluster of vectors with similar attributes; and performing an index mapping to store the vector index into a relational database.
2 . The method of claim 1 , wherein performing the index mapping comprises creating a first table structure comprising the plurality of vectors and metadata corresponding to the plurality of vectors, and the metadata corresponding to each of the plurality of vectors comprises at least one of a vector ID, a tenant ID, an index name, and an identifier of a maximum layer at which each of the plurality of vectors is found.
3 . The method of claim 2 , wherein performing the index mapping comprises creating a second table structure comprising metadata corresponding to a first vector from amongst the plurality of vectors and a list of neighbour vectors corresponding to the first vector, and the metadata corresponding to the first vector comprises at least one of a first vector ID, a first tenant ID, a first index name, and a first identifier of a layer comprising the first vector.
4 . The method of claim 3 , wherein performing the index mapping comprises creating a third table structure comprising at least one entry point to the vector index and metadata corresponding to the first vector, wherein the entry point corresponds to a top layer of the vector index and the metadata corresponding to the first vector comprises at least one of the first tenant ID, the first index name, and default and construction parameters corresponding to the first vector.
5 . The method of claim 1 , the method further comprises sharding of the relational database for a plurality of tenants.
6 . The method of claim 4 , wherein the method further comprises accessing at least one unstructured data from the plurality of unstructured data, wherein the accessing comprises: receiving a query vector; searching the vector index stored on the relational database using the query vector to identify a nearest neighbour vector corresponding to the query vector; and identifying the unstructured data using the nearest neighbour vector.
7 . The method of claim 6 , wherein identifying the nearest neighbour vector corresponding to the query vector comprises: retrieving the at least one entry point for the vector index; utilizing the at least one entry point for searching the top layer to identify a nearest neighbour vector corresponding to the query vector in the top layer; updating the nearest neighbour vector as an entry point to a second layer next to the top layer; for each layer starting from the layer next to the top layer to a last layer in the vector index: searching a current layer with the query vector to identify the nearest neighbour vector corresponding to the query vector in the current layer; updating the nearest neighbour vector as an entry point to a layer subsequent to the current layer; and finding nearest neighbour vectors to the query vector; and returning the nearest neighbour vectors to the query vector.
8 . An Index Mapping System (IMS) for a vector database, the IMS comprising: at least one processor and at least one memory storing instructions that, when executed by the at least one processor, cause the IMS at least to: obtain a plurality of unstructured data; perform vector embedding on each unstructured data from amongst the plurality of unstructured data to generate a plurality of vectors, wherein each vector from amongst the plurality of vectors is indicative of attributes of the unstructured data from amongst the plurality of unstructured data; store the plurality of unstructured data and the plurality of vectors in the vector database; generate a vector index corresponding to the plurality of vectors based on a first indexing mechanism, wherein the vector index comprises at least one cluster of vectors with similar attributes; and perform an index mapping to store the vector index into a relational database.
9 . The IMS of claim 8 , wherein to perform the index mapping, the at least one processor causes the IMS to create a first table structure comprising the plurality of vectors and metadata corresponding to the plurality of vectors, wherein the metadata corresponding to each of the plurality of vectors comprises at least one of a vector ID, a tenant ID, an index name, and an identifier of a maximum layer at which each of the plurality of vectors is found.
10 . The IMS of claim 9 , wherein to perform the index mapping, the at least one processor causes the IMS to create a second table structure comprising metadata corresponding to a first vector from amongst the plurality of vectors and a list of neighbour vectors corresponding to the first vector, and the metadata corresponding to the first vector comprises at least one of a first vector ID, a first tenant ID, a first index name, and a first identifier of a layer comprising the first vector.
11 . The IMS of claim 10 , wherein to perform the index mapping, the at least one processor causes the IMS to create a third table structure comprising at least one entry point to the vector index, the entry point corresponding to a top layer of the vector index, and the metadata corresponding to the first vector comprises at least one of the first tenant ID, the first index name, and default and construction parameters corresponding to the first vector.
12 . The IMS as claimed in claim 8 , wherein the first indexing mechanism is Hierarchical Navigable Small World (HNSW) indexing mechanism.
13 . The IMS as claimed in claim 8 , wherein the at least one processor causes the IMS to further access at least one unstructured data from the plurality of unstructured data, and wherein to access the at least one unstructured data, the IMS: receives a query vector; searches the vector index stored on the relational database using the query vector to identify a nearest neighbour vector corresponding to the query vector; and identifies the unstructured data using the nearest neighbour vector.
14 . The IMS as claimed in claim 11 , wherein the at least one processor causes the IMS to further insert a new vector in the vector database, and wherein to insert the new vector, the IMS is to: retrieve the at least one entry point from the third table structure; assign a new maximum layer to the new vector; for each layer from the top layer to the new maximum layer, update an entry point corresponding to each layer by searching each layer with the new vector based on the at least one entry point search for neighbour vectors for each layer from the new maximum layer to a last layer of the vector index, wherein the IMS is to: search each layer with the new vector and the entry point corresponding to each layer to identify a plurality of nearest neighbour vectors corresponding to the new vector; and identify a predetermined number of nearest neighbour vectors from amongst the plurality of nearest neighbour vectors corresponding to the new vector; create an edge between the new vector and each vector from amongst the predetermined number of nearest vectors; set the entry points to newly identified vectors; update the entry point to be the new vector, if the new maximum layer is greater than the top layer; and increment the top layer count.
15 . The IMS as claimed in claim 14 , wherein the IMS is to: determine that the new maximum layer is greater than the top layer; and set the new maximum layer to one level higher than top layer.
16 . The IMS as claimed in claim 14 , wherein the IMS is to: determine that a list of neighbour vectors comprising the plurality of nearest neighbour vectors exceeds a maximum length (M); and delete at least one existing edge.
17 . A non-transitory computer-readable medium comprising instructions for implementation of a vector database, the instructions being executable by a processing resource to: obtain a plurality of unstructured data; perform vector embedding on each unstructured data from amongst the plurality of unstructured data to generate a plurality of vectors, wherein each vector from amongst the plurality of vectors is indicative of attributes of the unstructured data from amongst the plurality of unstructured data; store the plurality of unstructured data and the plurality of vectors in the vector database; generate a vector index corresponding to the plurality of vectors based on a first indexing mechanism, wherein the vector index comprises at least one cluster of vectors with similar attributes; and perform an index mapping to store the vector index into a relational database.
18 . The non-transitory computer-readable medium of claim 17 , wherein performing the index mapping comprises creating a first table structure comprising the plurality of vectors and metadata corresponding to the plurality of vectors; and the metadata corresponding to each of the plurality of vectors comprises at least one of a vector ID, a tenant ID, an index name, and an identifier of a maximum layer at which each of the plurality of vectors is found.
19 . The non-transitory computer-readable medium of claim 17 , wherein performing the index mapping comprises creating a second table structure comprising metadata corresponding to a first vector from amongst the plurality of vectors and a list of neighbour vectors corresponding to the first vector; and the metadata corresponding to the first vector comprises at least one of a first vector ID, a first tenant ID, a first index name, and a first identifier of a layer comprising the first vector.
20 . The non-transitory computer-readable medium of claim 19 , wherein performing the index mapping comprises creating a third table structure comprising an entry point for a query vector corresponding to the first vector and metadata corresponding to the first vector wherein the entry point corresponding to a top layer of the vector index; and the metadata corresponding to the first vector comprises at least one of the first tenant ID, the first index name, and default and construction parameters corresponding to the first vector.

Description

BACKGROUND A database is a systematic way of storing information so that data can be accessed, analysed, transformed, updated and transferred with efficiency. A database is usually managed by a database management system (DBMS). The data, the DBMS, and applications associated with the DBMS are referred to as a database system. There are different types of databases, where one such type is a relational database that manages structured or tabular data. These types of databases are referred to as ‘relational databases’ because data items within such databases have pre-determined relationships with one another. Thus, the relational database allows access to data which is related to one another. A relational database stores numbers, strings and other types of scalar data in the form of rows and columns. Another type of database is a vector database that allows storage and management of unstructured data, such as text, documents, images, audio, and video. The unstructured data is stored in the vector database in the form of vectors, where a vector is a mathematical representation of attributes of unstructured data. The vectors are generated by applying an embedding function to the unstructured data, where the embedding function is based on different algorithms or models, such as feature extraction algorithms, machine learning models, and word embedding techniques. The vector database allows fast and accurate similarity search and retrieval of unstructured data based on vector distance similarity. Therefore, the vector database is used for effective storage and retrieval of unstructured data while providing similar results based on semantic or contextual meaning. BRIEF DESCRIPTION OF DRAWINGS FIG. 1 illustrates an Index Mapping System (IMS) for implementing a vector database, in accordance with an example of the present subject matter; FIG. 2 illustrates the IMS for implementing a vector database, in accordance with another example of the present subject matter; FIGS. 3, 4, and 5 illustrate example tables. FIG. 6 illustrates a method for implementing a vector database, in accordance with an example of the present subject matter; FIG. 7 illustrates a non-transitory computer-readable medium for implementing a vector database, in accordance with an example of the present subject matter. Throughout the drawings, identical reference numbers designate similar, but not necessarily identical elements. The drawings provide examples and/or implementations consistent with the description; however, the description is not limited to the examples and/or implementations provided in the drawings. DETAILED DESCRIPTION In vector databases, a combination of different algorithms is used that allows Approximate Nearest Neighbour (ANN) search. Such algorithms optimize the search through hashing, quantization, or graph-based search. Vector databases can be used to find the most similar data based on the semantic or contextual meaning of the data, instead of using conventional methods of querying databases based on exact matches or predefined criteria. To ensure effective storage and search of the unstructured data, the unstructured data stored in a vector database is indexed. The indexing of the vector database includes performing vector embedding on unstructured data to generate vectors corresponding to the unstructured data, clustering the vectors based on the attributes of the vectors, and storing the vectors in the vector index. The mechanics of indexing the vectors has a direct bearing on the efficiency and latency experienced by the database in managing a data retrieval query, such as a nearest neighbour search. Generally, for a nearest neighbour query, the vector database is pre-processed to create an index for querying. For a given query vector, the index is used to identify a set of vectors that are likely to be close to the query vector. When a vector database receives a query, the vector database compares the indexed vectors to the query vector to determine the nearest vector neighbour. To establish nearest neighbour, the vector database may rely on mathematical methods, such as similarity measures. Similarity measures include Cosine similarity to establish similarity by measuring the cosine of the angle between two vectors in a vector space, Euclidean distance to establish similarity by measuring the straight-line distance between vectors, and Dot product to establish similarity by measuring the product of the magnitude of two vectors and the cosine of the angle between them. The nearest neighbour vectors are then retrieved from the indexed vector database and returned to a user as the search results. Conventional approaches for implementation of the vector database store the index in fast access memories, such as Random Access Memory (RAM) or Cache memory, to provide a quick response to a nearest neighbour query. However, such approaches do not scale well in multi-tenant environments. The non-scalability of the conventional