US-20260127178-A1 - Database System Having Distributed Memory and Processing of Data Objects
Abstract
A set of computing devices of a database system is operable to receive a dataset that includes a plurality of data cells organized by rows and columns. The set is further operable to process the dataset to produce long-term storage (LTS) data units. The set is further operable to store, in accordance with a storage model, the LTS data units in the distributed memory, where a first cell correlates to a first LTS data unit, and where the first LTS data unit is stored at one or more data block size addressable memory spaces of the data block size addressable memory spaces. The set is further operable to generate a file that records storing of the LTS data units within the distributed storage and that records correlation of the data cells to the LTS data units. The set is further operable to store the file for subsequent retrieval of data of the dataset.
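The flow summarized in the abstract (cells in, LTS data units out, units placed in block-size addressable spaces, and a file recording both the placement and the cell-to-unit correlation) can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration and is not taken from the application: one LTS data unit per column, zlib compression standing in for the LTS protocol, a dictionary standing in for distributed memory, a 16-byte block size, JSON as the file format, and all function and field names.

```python
import json
import zlib

BLOCK_SIZE = 16  # hypothetical block size, kept tiny so units span several blocks


def process_to_lts_units(dataset):
    """Turn rows of cells into LTS data units (assumption: one unit per column)."""
    units = {}
    correlation = {}  # "row:col" -> which unit holds the cell and where
    for col in range(len(dataset[0])):
        values = [row[col] for row in dataset]
        unit_id = f"unit-{col}"
        # zlib is only a stand-in for whatever the LTS protocol actually does.
        units[unit_id] = zlib.compress(json.dumps(values).encode("utf-8"))
        for row_id in range(len(dataset)):
            correlation[f"{row_id}:{col}"] = {"unit": unit_id, "offset": row_id}
    return units, correlation


def store_units(units, memory):
    """Write each unit into block-size addressable spaces; return unit -> addresses."""
    placement = {}
    next_addr = len(memory)
    for unit_id, payload in units.items():
        addresses = []
        for start in range(0, len(payload), BLOCK_SIZE):
            memory[next_addr] = payload[start:start + BLOCK_SIZE]
            addresses.append(next_addr)
            next_addr += 1
        placement[unit_id] = addresses
    return placement


def read_cell(manifest, memory, row_id, col_id):
    """Use the recorded file to locate a cell's unit, reassemble it, and decode it."""
    entry = manifest["cells"][f"{row_id}:{col_id}"]
    payload = b"".join(memory[addr] for addr in manifest["units"][entry["unit"]])
    return json.loads(zlib.decompress(payload))[entry["offset"]]


dataset = [["alice", 30], ["bob", 41]]
memory = {}  # stand-in for distributed memory: address -> block bytes
units, correlation = process_to_lts_units(dataset)
placement = store_units(units, memory)
# The "file": serialized and reloaded to show it is self-describing.
manifest = json.loads(json.dumps({"cells": correlation, "units": placement}))
```

With this manifest, `read_cell(manifest, memory, 1, 0)` recovers the cell at row 1, column 0 without consulting the original dataset, which is the role the abstract assigns to the stored file.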
Inventors
- S. Christopher Gladwin
- George Kondiles
- Jason Arnold
- Greg R. Dhuse
- Joseph Jablonski
- Ian Michael Drury
- Sarah Kate Schieferstein
- Andrew Park
Assignees
- Ocient Holdings LLC
Dates
- Publication Date
- 20260507
- Application Date
- 20260106
Claims (20)
- 1 . A database system comprises: a plurality of computing device clusters, wherein a computing device cluster includes a plurality of computing devices, wherein a computing device of the pluralities of computing devices includes a plurality of computing nodes, wherein a computing node of the pluralities of computing nodes includes a plurality of processing core resources, wherein a processing core resource of the pluralities of processing core resources includes a plurality of memory devices, wherein a memory device of the pluralities of memory devices includes a plurality of data block size addressable memory spaces, and wherein the pluralities of memory devices provide distributed memory for the database system; wherein a set of computing devices of the pluralities of computing devices is operable to: receive a dataset that includes a plurality of data cells organized by rows and columns, wherein a row of data cells includes a plurality of columns of data, and wherein a data cell of the plurality of data cells is identifiable by a unique combination of a row identifier and a column identifier; and process the dataset in accordance with a long-term storage (LTS) protocol to produce a plurality of LTS data units; store, in accordance with a storage model, the plurality of LTS data units in the distributed memory, wherein a first cell of the plurality of cells correlates to a first LTS data unit of the plurality of LTS data units, wherein the first LTS data unit is stored at one or more data block size addressable memory spaces of the plurality of data block size addressable memory spaces; generate a file that records storing of the plurality of LTS data units within the distributed storage and that records correlation of the plurality of data cells to the plurality of LTS data units; and store the file for subsequent retrieval of data of the dataset.
- 2 . The database system of claim 1 , wherein the set of computing devices is further operable to process the dataset in accordance with the LTS protocol by one or more of: performing dictionary compression on select variable length data cells of the plurality of data cells to produce dictionary compressed data cells; performing data compression on data cells of the plurality of data cells to produce compressed data cells; and erasure encoding the dictionary compressed data cells, the compressed data cells, and remaining data cells of the plurality of data cells to produce erasure encoded data cells.
- 3 . The database system of claim 1 , wherein the set of computing devices is further operable to, when the storage model is object storage: identify the dataset as a data object; generate a unique object storage identifier for the dataset; generate first metadata regarding the plurality of data cells; generate second metadata regarding the correlation of the plurality of data cells to the plurality of LTS data units; generate third metadata regarding storage of the plurality of LTS data units in the distributed memory; and include the unique object storage identifier, the first metadata, the second metadata, and the third metadata in the file.
- 4 . The database system of claim 1 , wherein the set of computing devices is further operable to, when the storage model is block storage: generate a file name for the dataset in accordance with a file system associated with the block storage; partition the plurality of LTS data units into a plurality of data blocks having a data size corresponding to the data block size addressable memory spaces; assign a plurality of addresses to the plurality of data blocks; create metadata that maps the plurality of addresses to the file name; and store the file name and the metadata in the file.
- 5 . The database system of claim 1 , wherein the set of computing devices is further operable to process the dataset in accordance with the LTS protocol by: temporarily storing the data of the dataset before LTS processing the data of the dataset.
- 6 . The database system of claim 1 further comprises: wherein a first sub-set of computing devices of the plurality of computing devices is operable to: receive the dataset; and process the dataset in accordance with a long-term storage (LTS) protocol to produce a plurality of LTS data units; wherein a second sub-set of computing devices of the plurality of computing devices is operable to: store, in accordance with a storage model, the plurality of LTS data units in the distributed storage; wherein a third sub-set of computing devices of the plurality of computing devices is operable to: generate the file; and store the file.
- 7 . The database system of claim 1 further comprises: the dataset is identifiable by a unique dataset identifier; and wherein the set of computing devices is further operable to: include, in the file, the unique dataset identifier for the plurality of LTS data units and for the correlation of the plurality of data cells to the plurality of LTS data units.
- 8 . The database system of claim 7 further comprises: the dataset is divided into a plurality of data partitions, a data partition of the plurality of data partitions is divided into a plurality of segment groups, a segment group of the pluralities of segment groups is divided into a plurality of segments, wherein a segment of the pluralities of segments includes a set of rows of the plurality of rows, wherein the segment has a unique segment identifier, wherein the segment group has a unique segment group identifier, wherein the data partition has a unique data partition identifier, and wherein the first cell is further identifiable based on the unique segment identifier, the unique segment group identifier, and the unique data partition identifier.
- 9 . The database system of claim 1 , wherein the set of computing devices is further operable to store the plurality of LTS data units in the distributed memory by: determining virtual memory space for storing the plurality of LTS data units in distributed memory; and mapping the virtual memory space to the data block size addressable memory spaces of the distributed memory.
- 10 . The database system of claim 1 further comprises: receive a request regarding data of the dataset; access the file to: identify storage locations of the data of the dataset; identify a set of LTS data units of the plurality of LTS data units based on the identified storage locations; identify processing core resources associated with the distributed memory; and process, by the processing core resources associated with the distributed memory, the set of LTS data units to recover a set of data cells of the plurality of data cells that correspond to the data of the dataset.
- 11 . A computer readable memory device comprises: a first memory that stores operational instructions that, when executed by a set of computing devices of pluralities of computing devices of a database system, causes the set of computing devices to: receive a dataset that includes a plurality of data cells organized by rows and columns, wherein a row of data cells includes a plurality of columns of data, and wherein a data cell of the plurality of data cells is identifiable by a unique combination of a row identifier and a column identifier, and process the dataset in accordance with a long-term storage (LTS) protocol to produce a plurality of LTS data units; a second memory that stores operational instructions that, when executed by the set of computing devices, causes the set of computing devices to: store, in accordance with a storage model, the plurality of LTS data units in distributed memory, wherein a first cell of the plurality of cells correlates to a first LTS data unit of the plurality of LTS data units, wherein the first LTS data unit is stored at one or more data block size addressable memory spaces of the plurality of data block size addressable memory spaces; a third memory that stores operational instructions that, when executed by the set of computing devices, causes the set of computing devices to: generate a file that records storing of the plurality of LTS data units within the distributed storage and that records correlation of the plurality of data cells to the plurality of LTS data units; and store the file for subsequent retrieval of data of the dataset; wherein the database system includes a plurality of computing device clusters, wherein a computing device cluster includes a plurality of computing devices, wherein a computing device of the pluralities of computing devices includes a plurality of computing nodes, wherein a computing node of the pluralities of computing nodes includes a plurality of processing core resources, wherein a processing core resource of the pluralities of processing core resources includes a plurality of memory devices, wherein a memory device of the pluralities of memory devices includes a plurality of data block size addressable memory spaces, and wherein the pluralities of memory devices provide the distributed memory for the database system.
- 12 . The computer readable memory device of claim 11 , wherein the first memory further stores operational instructions that, when executed by the set of computing devices, causes the set of computing devices to process the dataset in accordance with the LTS protocol by one or more of: performing dictionary compression on select variable length data cells of the plurality of data cells to produce dictionary compressed data cells; performing data compression on data cells of the plurality of data cells to produce compressed data cells; and erasure encoding the dictionary compressed data cells, the compressed data cells, and remaining data cells of the plurality of data cells to produce erasure encoded data cells.
- 13 . The computer readable memory device of claim 11 , wherein the first, second, and/or third memory further stores operational instructions that, when executed by the set of computing devices, causes the set of computing devices to, when the storage model is object storage: identify the dataset as a data object; generate a unique object storage identifier for the dataset; generate first metadata regarding the plurality of data cells; generate second metadata regarding the correlation of the plurality of data cells to the plurality of LTS data units; generate third metadata regarding storage of the plurality of LTS data units in the distributed memory; and include the unique object storage identifier, the first metadata, the second metadata, and the third metadata in the file.
- 14 . The computer readable memory device of claim 11 , wherein the first, second, and/or third memory further stores operational instructions that, when executed by the set of computing devices, causes the set of computing devices to, when the storage model is block storage: generate a file name for the dataset in accordance with a file system associated with the block storage; partition the plurality of LTS data units into a plurality of data blocks having a data size corresponding to the data block size addressable memory spaces; assign a plurality of addresses to the plurality of data blocks; create metadata that maps the plurality of addresses to the file name; and store the file name and the metadata in the file.
- 15 . The computer readable memory device of claim 11 , wherein the first memory further stores operational instructions that, when executed by the set of computing devices, causes the set of computing devices to process the dataset in accordance with the LTS protocol by: temporarily storing the data of the dataset before LTS processing the data of the dataset.
- 16 . The computer readable memory device of claim 11 further comprises: wherein a first sub-set of computing devices of the plurality of computing devices is operable to: receive the dataset; and process the dataset in accordance with a long-term storage (LTS) protocol to produce a plurality of LTS data units; wherein a second sub-set of computing devices of the plurality of computing devices is operable to: store, in accordance with a storage model, the plurality of LTS data units in the distributed storage; wherein a third sub-set of computing devices of the plurality of computing devices is operable to: generate the file; and store the file.
- 17 . The computer readable memory device of claim 11 , wherein the first and/or third memory further stores operational instructions that, when executed by the set of computing devices, causes the set of computing devices to: obtain a unique dataset identifier for the dataset; and include, in the file, the unique dataset identifier for the plurality of LTS data units and for the correlation of the plurality of data cells to the plurality of LTS data units.
- 18 . The computer readable memory device of claim 17 further comprises: the dataset is divided into a plurality of data partitions, a data partition of the plurality of data partitions is divided into a plurality of segment groups, a segment group of the pluralities of segment groups is divided into a plurality of segments, wherein a segment of the pluralities of segments includes a set of rows of the plurality of rows, wherein the segment has a unique segment identifier, wherein the segment group has a unique segment group identifier, wherein the data partition has a unique data partition identifier, and wherein the first cell is further identifiable based on the unique segment identifier, the unique segment group identifier, and the unique data partition identifier.
- 19 . The computer readable memory device of claim 11 , wherein the first, second, and/or third memory further stores operational instructions that, when executed by the set of computing devices, causes the set of computing devices to store the plurality of LTS data units in the distributed memory by: determining virtual memory space for storing the plurality of LTS data units in distributed memory; and mapping the virtual memory space to the data block size addressable memory spaces of the distributed memory.
- 20 . The computer readable memory device of claim 11 , wherein the first, second, and/or third memory further stores operational instructions that, when executed by the set of computing devices, causes the set of computing devices to: receive a request regarding data of the dataset; access the file to: identify storage locations of the data of the dataset; identify a set of LTS data units of the plurality of LTS data units based on the identified storage locations; identify processing core resources associated with the distributed memory; and process, by the processing core resources associated with the distributed memory, the set of LTS data units to recover a set of data cells of the plurality of data cells that correspond to the data of the dataset.
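Two of the steps recited in the claims above lend themselves to a short illustration: dictionary compression of variable length data cells (claims 2 and 12) and, for the block storage model, partitioning LTS data into blocks matching the addressable block size, assigning addresses, and creating metadata that maps those addresses to a file name (claims 4 and 14). The Python sketch below is one possible reading under stated assumptions, not the applicant's implementation: the function names, the zero-byte padding of the last block, the recorded payload length, and the sample file name are all hypothetical, and erasure encoding is not shown.

```python
def dictionary_compress(cells):
    """Replace each variable length value with a small integer code."""
    dictionary = []  # code -> value
    index = {}       # value -> code
    codes = []
    for value in cells:
        if value not in index:
            index[value] = len(dictionary)
            dictionary.append(value)
        codes.append(index[value])
    return dictionary, codes


def dictionary_decompress(dictionary, codes):
    """Recover the original cell values from the codes."""
    return [dictionary[code] for code in codes]


def partition_into_blocks(payload, block_size, base_address=0):
    """Split a payload into fixed-size blocks and assign consecutive addresses."""
    blocks = {}
    for i, start in enumerate(range(0, len(payload), block_size)):
        chunk = payload[start:start + block_size]
        # Assumption: a short final chunk is zero-padded to the block size.
        blocks[base_address + i] = chunk.ljust(block_size, b"\x00")
    return blocks


def build_block_metadata(file_name, blocks, payload_length):
    """Metadata mapping the assigned addresses to the file name."""
    return {
        "file_name": file_name,
        "addresses": sorted(blocks),
        "length": payload_length,  # needed to strip the padding on read
    }


def reassemble(metadata, blocks):
    """Read the blocks back in address order and drop the padding."""
    data = b"".join(blocks[addr] for addr in metadata["addresses"])
    return data[:metadata["length"]]


states = ["IL", "NY", "IL", "IL"]
dictionary, codes = dictionary_compress(states)
payload = bytes(codes)  # one byte per code; fine for fewer than 256 distinct values
blocks = partition_into_blocks(payload, block_size=3, base_address=100)
metadata = build_block_metadata("states.lts", blocks, len(payload))
```

Recording the unpadded length in the metadata is a design choice of this sketch: because the padding byte is also a valid code value, the length is what lets `reassemble` return exactly the original payload.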
Description
CROSS-REFERENCE TO RELATED APPLICATIONS
The present U.S. Utility Patent Application claims priority pursuant to 35 U.S.C. § 120 as a continuation-in-part of U.S. Utility application Ser. No. 18/402,954, entitled, “FILTERING RECORDS INCLUDED IN OBJECTS OF AN OBJECT STORAGE SYSTEM BASED ON APPLYING A RECORD IDENTIFICATION PIPELINE”, filed on Jan. 3, 2024, issuing as U.S. Pat. No. 12,524,407 on Jan. 13, 2026, which claims priority pursuant to 35 U.S.C. § 119(e) to U.S. Provisional Application No. 63/482,485, entitled “QUERY PROCESSING APPLIED TO OBJECTS OF AN OBJECT STORAGE SYSTEM”, filed Jan. 31, 2023; U.S. Provisional Application No. 63/482,497, entitled “QUERY EXECUTION VIA INDEXING OBJECTS OF AN OBJECT STORAGE SYSTEM”, filed Jan. 31, 2023; and U.S. Provisional Application No. 63/482,504, entitled “QUERY EXECUTION VIA COMMUNICATION WITH AN OBJECT STORAGE SYSTEM”, filed Jan. 31, 2023, all of which are hereby incorporated herein by reference in their entirety and made part of the present U.S. Utility Patent Application for all purposes.
The present U.S. Utility Patent Application also claims priority pursuant to 35 U.S.C. § 120 as a continuation-in-part of U.S. Utility application Ser. No. 18/768,288, entitled, “DATABASE SYSTEM WITH PUSH CO-LITERAL FILTERING AND METHODS FOR USE THEREWITH”, filed on Jul. 10, 2024, which claims priority pursuant to 35 U.S.C. § 120 as a continuation of U.S. Utility application Ser. No. 18/309,897, entitled “OPTIMIZING AN OPERATOR FLOW FOR PERFORMING FILTERING BASED ON NEW COLUMNS VALUES VIA A DATABASE SYSTEM”, filed May 1, 2023, issued as U.S. Pat. No. 12,072,887 on Aug. 27, 2024, all of which are hereby incorporated herein by reference in their entirety and made part of the present U.S. Utility Patent Application for all purposes.
The present U.S. Utility Patent Application also claims priority pursuant to 35 U.S.C. § 120 as a continuation-in-part of U.S. Utility application Ser. No. 19/032,973, entitled, “FILTERING RECORDS INCLUDED IN FILES OF A DATA LAKEHOUSE PLATFORM BASED ON APPLYING A RECORD IDENTIFICATION PIPELINE”, filed Jan. 21, 2025, which claims priority pursuant to 35 U.S.C. § 119(e) to U.S. Provisional Application No. 63/730,041, entitled “FILTERING RECORDS INCLUDED IN FILES OF A DATA LAKEHOUSE PLATFORM BASED ON APPLYING A RECORD IDENTIFICATION PIPELINE”, filed Dec. 10, 2024, all of which are hereby incorporated herein by reference in their entirety and made part of the present U.S. Utility Patent Application for all purposes.
U.S. Utility patent application Ser. No. 19/032,973 also claims priority pursuant to 35 U.S.C. § 120 as a continuation-in-part of U.S. Utility application Ser. No. 18/403,002, entitled “QUERY EXECUTION VIA COMMUNICATION WITH AN OBJECT STORAGE SYSTEM VIA AN OBJECT STORAGE COMMUNICATION PROTOCOL”, filed Jan. 3, 2024, issued as U.S. Pat. No. 12,271,381 on Apr. 8, 2025, which claims priority pursuant to 35 U.S.C. § 119(e) to U.S. Provisional Application No. 63/482,485, entitled “QUERY PROCESSING APPLIED TO OBJECTS OF AN OBJECT STORAGE SYSTEM”, filed Jan. 31, 2023; U.S. Provisional Application No. 63/482,497, entitled “QUERY EXECUTION VIA INDEXING OBJECTS OF AN OBJECT STORAGE SYSTEM”, filed Jan. 31, 2023; and U.S. Provisional Application No. 63/482,504, entitled “QUERY EXECUTION VIA COMMUNICATION WITH AN OBJECT STORAGE SYSTEM”, filed Jan. 31, 2023, all of which are hereby incorporated herein by reference in their entirety and made part of the present U.S. Utility Patent Application for all purposes.
STATEMENT REGARDING FEDERALLY SPONSORED RESEARCH OR DEVELOPMENT
Not Applicable.
INCORPORATION-BY-REFERENCE OF MATERIAL SUBMITTED ON A COMPACT DISC
Not Applicable.
BACKGROUND OF THE INVENTION
Technical Field of the Invention
This invention relates generally to computer networking and more particularly to database system and operation.
Description of Related Art
Computing devices are known to communicate data, process data, and/or store data. Such computing devices range from wireless smart phones, laptops, tablets, personal computers (PC), work stations, and video game devices, to data centers that support millions of web searches, stock trades, or on-line purchases every day. In general, a computing device includes a central processing unit (CPU), a memory system, user input/output interfaces, peripheral device interfaces, and an interconnecting bus structure. As is further known, a computer may effectively extend its CPU by using “cloud computing” to perform one or more computing functions (e.g., a service, an application, an algorithm, an arithmetic logic function, etc.) on behalf of the computer. Further, for large services, applications, and/or functions, cloud computing may be performed by multiple cloud computing resources in a distributed manner to improve the response time for completion of the service, application, and/or function. Of the many applications a computer can perform, a database system is one of t