KR-20260066798-A - TRANSACTIONALLY-CONSISTENT HNSW INDEX
Abstract
Techniques for providing transaction-consistent Hierarchical Navigable Small Worlds (HNSW) indexes are described. In one technique, an HNSW index for multiple vectors is stored. In response to receiving a set of changes for multiple vectors, the set of changes is stored in a shared journal instead of applying it to the HNSW index. In response to receiving a vector query containing a query vector, a subset of the set of changes in the shared journal is identified based on the query vector. Additionally, a subset of multiple vectors is identified based on the query vector and the HNSW index. A result of the vector query is generated based on the subset of the set of changes and the subset of multiple vectors.
Inventors
- 라히리, 티르탄카
- 브루그나라, 마틴
- 알라지아니스, 요아니스
- 하프리안, 블라드 요안
- 데인즈, 로렌트 필립
- 공, 웨이웨이
- 로아이자, 주안 알.
- 바리아, 프라나브
- 보세, 슈브하
- 크리쉬나파, 친마이
- 차반, 샤샹크 키산
- 미쉬라, 아우로시쉬
- 펜제, 수카다 샤샹크
- 사하, 아그니보
- 아가르왈, 로한
- 왕, 샨치
- 브리토, 베니타 캐슬린
- 아르나볼디, 마르코
Assignees
- 오라클 인터내셔날 코포레이션
Dates
- Publication Date
- 20260512
- Application Date
- 20240916
- Priority Date
- 20240914
Claims (15)
- As a method: A step of storing multiple snapshots of a Hierarchical Navigable Small Worlds (HNSW) index generated based on a vector table containing an initial set of vectors generated by one or more machine learning embedding models—each of the multiple snapshots has a different build time and is also associated with a shared journal storing multiple changes to multiple vectors within the vector table—; A step of receiving a vector query associated with a specific timestamp; In response to receiving the above vector query: A step of selecting a specific snapshot from among a plurality of snapshots of the HNSW index based on the specific timestamp above; A step of identifying the shared journal associated with the specific snapshot; and The method includes the step of executing the vector query targeting the specific snapshot and the shared journal; The above method is performed by one or more computing devices.
- In paragraph 1, The above multiple changes are: A first set of changes associated with timestamps prior to the aforementioned specific timestamp; It includes a second set of changes associated with timestamps after the aforementioned specific timestamp; A method comprising the step of executing the vector query, wherein the step of executing the vector query is not aimed at a second set of changes but at a first set of changes.
- In paragraph 1, A method further comprising the step of retrieving the oldest snapshot in response to a determination that no pending vector queries are targeting the oldest snapshot among the multiple snapshots of the HNSW index.
- In paragraph 1, A step of identifying the latest snapshot among the plurality of snapshots above; A step of identifying the above shared journal; A method further comprising the step of determining whether to generate a new snapshot based on the latest snapshot and the shared journal based on one or more criteria.
- In Paragraph 4, the above one or more criteria are: Number of changes within the shared journal mentioned above; Average rate of incoming changes to the above vector table; Number of scans in the above HNSW index; Latency of one or more vector queries targeting the most recent snapshot; or A method based on the amount of memory required to store the above plurality of snapshots.
- In paragraph 4, In response to the decision to create a new snapshot of the above HNSW index: A method further comprising the step of creating the new snapshot by applying one or more changes that occurred after the build timestamp of the latest snapshot and are indicated in the shared journal to the latest snapshot.
- In paragraph 6, A step of determining the number of deletions of vectors indicated in one or more first changes that occurred after the build timestamp of the latest snapshot and indicated in the shared journal; and It further includes a step of determining whether the number of deletions of the above vectors is greater than a specific threshold number; A method comprising applying one or more of the above changes by applying the deletions of the vectors to the latest snapshot using a first application technique when the number of deletions of the vectors is less than or equal to the specific threshold, or using a second application technique when the number of deletions of the vectors is greater than the specific threshold.
- In Paragraph 7, Applying the above changes includes applying the deletions of the vectors to the latest snapshot using the first application technique, and A method of using the first application technique described above, comprising, for each deletion indicated in the shared journal, storing in the new snapshot a note indicating that the vertex corresponding to each deletion has been deleted while maintaining a list of neighbors of the vertex in the new snapshot.
- In Paragraph 7, Applying the above changes includes applying the deletions of the vectors to the latest snapshot using the second application technique; Using the above second application technique for each deletion indicated in the above shared journal, Identifying vertices within the latest snapshot corresponding to each of the above deletions; Identifying one or more neighbor lists including the vertex in the above latest snapshot; A method further comprising, for the new snapshot, updating the one or more neighbor lists by removing the vertex from the one or more neighbor lists.
- In paragraph 4, A step of identifying a proximal graph based on the above shared journal, comprising: (1) a first set of vertices in the latest snapshot - said first set of vertices is deleted based on changes indicated in the shared journal after the build timestamp of said latest snapshot -; and (2) a second set of vertices in said latest snapshot - said vertices are each connected to at least one vertex in said first set of vertices -; A method further comprising the step of deciding to rebuild the HNSW index without using the latest snapshot among the plurality of snapshots, based on the size of the above proximal graph.
- In paragraph 4, In response to the decision to create a new snapshot of the above HNSW index: A step of identifying a first set of vertices whose neighbor lists have not changed since the previous snapshot among the plurality of snapshots in the latest snapshot among the plurality of snapshots; A step of identifying a second set of vertices whose neighbor lists have changed since the previous snapshot among the plurality of snapshots in the latest snapshot among the plurality of snapshots; For the new snapshot, the step of sharing a first memory for the latest snapshot, storing a first set of neighbor lists of the vertices; and A method further comprising the step of copying a second memory for the latest snapshot, storing a second set of neighbor lists of the vertices for the new snapshot.
- In paragraph 1, In response to the decision to generate a vector index based on the above vector table, a step of identifying the number of vectors in an initial set of the above vectors; A step of identifying an HNSW type vector index among multiple type vector indices, at least based on the number of vectors in the initial set of the above vectors; A method further comprising the step of generating the HNSW index for the above vector table.
- In paragraph 1, A step of receiving an instruction to create a vector index for an initial set of vectors stored in a vector database that is communically coupled to multiple nodes within a cluster; In response to receiving the above instructions: The step of generating the HNSW index based on an initial set of the above vectors; and A step of causing a copy of the above HNSW index to be stored in the memory of each of the plurality of nodes; In response to receiving a first vector query, the first node among the plurality of nodes processes the first vector query against a first copy of the HNSW index; A method further comprising the step of processing the second vector query against a second copy of the HNSW index by the second node among the plurality of nodes in response to receiving the second vector query.
- One or more storage media, wherein the storage media stores instructions that cause the execution of any one of claims 1 to 13 when executed by one or more computing devices.
- As a system: One or more computing devices; A system comprising one or more storage media storing instructions that cause the execution of a method described in any one of claims 1 to 13 when executed by one or more of the above-mentioned computing devices.
Description
Transactionally Consistent HNSW Index The present disclosure generally relates to vector embeddings, and more specifically, to database systems that support a specific type of vector index on multiple nodes within a cluster. A vector is a fixed-length sequence composed of numbers, typically floating-point numbers, such as the 5-dimensional vector [21.4, 45.2, 675.34, 19.4, 83.24]. Embeddings are a means of representing objects (e.g., text, images, and audio) as points within a continuous vector space, where the locations of these points within the space are semantically meaningful to one or more machine learning (ML) algorithms. Embeddings are often represented as vectors. Generally, vector embeddings represent points within an N-dimensional space. Vector embeddings are intended to capture important "features" of the data they represent (or embed). The data represented by a vector embedding can be one of many types of data, such as documents, emails, images, or videos. Examples of features include color, size, category, location, texture, semantics, and concepts. Each feature is represented by one or more numbers (dimensions) within a vector embedding. Hereinafter, "vector embedding" is referred to as "vector". Today, vectors are often generated by machine learning models (e.g., neural networks), and the features they represent are often difficult for humans to understand. One way vectors are produced by neural networks is to capture the outputs of neurons within the penultimate layer—that is, the outputs of the neural network immediately before the final processing layer. Distance between vectors An important property of vectors is that the distance between two vectors serves as a good proxy for the similarity of the objects they represent. Two vectors representing similar data must be close to each other in vector space. The opposite is also true: dissimilar data is represented by vectors that are far apart in vector space. For example, the distance between the vector for the word "cat" and the vector for the word "dog" must be smaller than the distance between the vector for the word "cat" and the vector for the word "plant". The distance between two vectors is often calculated by summing the squares of the differences between the numbers at their respective positions: The property that vector distance indicates object similarity allows similar data to be found using a vector database. For example, when a vector representing a picture of a dog is searched within a vector database, the closest vectors will be those representing other dogs, rather than vectors representing plants. Vector processing workloads Vector processing workloads (not to be confused with SIMD vector processing) have been used in Natural Language Processing (NLP), image recognition, recommendation systems, and the like. Vector processing workloads have two subcategories that require separate optimization strategies: indexing and searching. Regarding indexing, vector embeddings (or simply vectors) are indexed using approximate indexing techniques. Unlike B-tree indexes, vector indexes return many matching values ranked by similarity. Index creation and rebuilding tend to be CPU-intensive tasks and are optimized for throughput. In relation to search, the stored vectors are retrieved using a class of algorithms known as "Similarity Search" or "Approximate Nearest Neighbor (ANN)" to find the vectors closest to the query vector. The search is designed to minimize CPU usage to minimize response time. Vector processing patterns Vector similarity search is similar to online transaction processing (OLTP) in that end users submit vector queries and expect immediate responses. Vector similarity search requires millisecond response times to find nearby (representing similar data) vectors, even if the database storing the vectors holds billions of vectors. One exemplary query is "Find products similar to this photo [refer to digital image]." Another exemplary query is "Find in-house documents that conceptually match this natural language prompt: [NL prompt]." Providing fast response times requires the use of specialized vector indexes and high-speed algorithms to calculate distances between vectors. In some use cases, it is necessary to combine vector similarity search with relational data. For example, a query might request data on homes that match a natural language prompt, are valued at over $1 million, are located in zip code 94070, and whose owners have recently declared bankruptcy. Additionally, it may be necessary to have the capability to insert new vectors into a database, delete vectors from a database, and index vectors in real time. Vector databases Early vector workloads often used flat files or object stores to store vectors. Applications would load vectors from their backend repositories into memory and then perform vector processing using third-party libraries such as FAISS. Generative artificial intelligence (AI) has significantly