EP-3679478-B1 - SCALABLE STORAGE SYSTEM
Inventors
- HALLAK, RENEN
- LEVY, ASAF
- GOREN, AVI
- HOREV, ALON
Dates
- Publication Date: 2026-05-06
- Application Date: 2017-11-06
Claims (14)
- A large-scale storage system (100), comprising: a plurality of compute nodes (110); a plurality of storage nodes (120), the plurality of storage nodes including a plurality of non-volatile random-access memories (NVRAMs) (223); and a communication fabric (130) for providing a communication infrastructure between the plurality of compute nodes (110) and the plurality of storage nodes (120); wherein each compute node of the plurality of compute nodes (110) is configured to independently perform at least a storage operation on any of the storage nodes (120) in a persistent manner, and wherein each storage node of the plurality of storage nodes (120) provides physical storage space of the large-scale storage system (100) and has the plurality of NVRAMs (223) as write buffers; wherein each compute node of the plurality of compute nodes (110) is configured to: receive a write request from a client, wherein the write request includes a data element; determine a location for writing the data element, wherein the location is determined by determining a shard allocated to the received data element and an NVRAM of the plurality of NVRAMs (223) that the determined shard allocated to the data element resides on; and shard data on the determined NVRAM according to an element handle of the received data element and according to offset ranges, namespace, or both, that a respective storage node of the plurality of storage nodes (120) is adapted to handle. (An illustrative sketch of this location-determination step follows the claims.)
- The system of claim 1, wherein the large-scale storage system (100) is connected to a network (150) to allow access of client devices to the large-scale storage system (100), wherein each of the plurality of compute nodes (110) is configured to receive, from a client device (140), a request over a first protocol and translate the request into a command for the at least a storage operation, to be communicated to the storage node (120) over a second protocol of the communication fabric (130).
- The system of claim 1, wherein each of the plurality of storage nodes (120) comprises: a plurality of solid state persistent drives (210), wherein each of the plurality of solid state persistent drives (210) is a consumer grade solid state persistent drive; the plurality of NVRAMs (223), wherein at least one NVRAM is for temporarily holding data to be written to the plurality of solid state persistent drives (210), thereby reducing write amplification of each solid state persistent drive (210); and at least one interface module (220) configured to control the plurality of solid state persistent drives (210) and non-volatile random-access memory (223) and communicate with the plurality of compute nodes (110).
- The system of claim 3, wherein the at least one interface module (220) further comprises: a network interface card (222) for interfacing with the plurality of compute nodes (110) over the communication fabric (130); and a switch (224) to allow connectivity to the plurality of solid state persistent drives.
- The system of claim 4, wherein the compute node of the plurality of compute nodes (110) is further configured to perform on the at least one NVRAM (223) a garbage collection process by aggregating a plurality of write requests until a complete data block is ready to be written to at least one solid state persistent drive of the plurality of solid state persistent drives (210).
- The system of claim 1, wherein each compute node of the plurality of compute nodes (110) accesses an entire namespace of the large-scale storage system (100), and wherein each storage node (120) accesses a predefined range of the namespace.
- The system of claim 1, wherein the compute node (110) is further configured to: determine data blocks of a data element to be read; determine a location of the data blocks, wherein the location is in at least one storage node (120); and access the determined location to retrieve the data blocks of the requested element. (An illustrative sketch of this read path follows the claims.)
- The system of claim 1, wherein the at least a storage operation includes performing a write operation, wherein the compute node (110) is further configured to: receive a first write request, wherein the first write request includes at least a first data element of the data element to be written; determine a location to write the received first data element, wherein the location is in at least one storage node (120); write the first data element to the NVRAM at the determined location; and receive an acknowledgment upon writing the first data element to the NVRAM.
- The system of claim 8, wherein the location is determined using a hash table mapping a handler of the first data element to a physical storage destination data block and metadata. (An illustrative sketch of this hash-table mapping follows the claims.)
- A method for performing a write request in a large-scale storage system (100), comprising: receiving, by a compute node of a plurality of compute nodes (110) of the large-scale storage system (100), a write request, wherein the write request includes at least a first data element to be written, wherein the large-scale storage system (100) includes the plurality of compute nodes (110) and a plurality of storage nodes (120), wherein the plurality of storage nodes (120) includes a plurality of non-volatile random-access memories (NVRAMs) (223) as write buffers; determining a location to write the received first data element, wherein the location is in at least one storage node (120) of the large-scale storage system (100); writing the first data element to a write buffer at the determined location; and receiving, at the compute node (110), an acknowledgment upon writing the first data element to the write buffer; wherein each compute node of the plurality of compute nodes (110) of the large-scale storage system (100) performs the steps of: receiving the write request from a client, wherein the write request includes a data element; determining a location for writing the data element, wherein the location is determined by determining a shard allocated to the received data element and an NVRAM of the plurality of NVRAMs (223) that the determined shard allocated to the data element resides on; and sharding data on the determined NVRAM according to an element handle of the received data element and according to offset ranges, namespace, or both, that a respective storage node of the plurality of storage nodes (120) is adapted to handle.
- The method of claim 10, wherein determining the location further comprises: mapping, using a hash table, a handler of the first data element to a physical storage destination data block and metadata.
- The method of claim 10, further comprising: controlling the NVRAM (223) to aggregate a complete data block; and writing the complete data block to at least one solid state persistent drive included in the at least one storage node (120). (An illustrative sketch of this aggregation follows the claims.)
- A non-transitory computer readable medium having stored thereon instructions which, when executed by a processing circuitry, cause the processing circuitry to execute the method according to claim 10.
- The large-scale storage system (100) of claim 1, wherein each compute node of the plurality of compute nodes (110) is further configured to: control aggregation of the write request until a complete data block is ready in the determined NVRAM of the plurality of NVRAMs (223) in order to simplify an operation of a garbage collection process.
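The sketches below illustrate mechanisms recited in the claims. They are hedged, minimal examples: every identifier (`WriteTarget`, `shard_to_nvram`, `NUM_SHARDS`, and so on) is a hypothetical stand-in, not the patented implementation. First, claim 1's location-determination step: the shard allocated to a data element is derived from its element handle, and an assumed mapping resolves that shard to the storage node and NVRAM write buffer it resides on.

```python
import hashlib
from dataclasses import dataclass

NUM_SHARDS = 1024  # assumed shard count, for illustration only


@dataclass(frozen=True)
class WriteTarget:
    shard_id: int      # shard allocated to the data element
    storage_node: int  # storage node whose offset range/namespace covers the shard
    nvram_id: int      # NVRAM write buffer the shard resides on


def shard_for_handle(element_handle: bytes) -> int:
    """Derive the shard allocated to a data element from its element handle."""
    digest = hashlib.sha256(element_handle).digest()
    return int.from_bytes(digest[:4], "big") % NUM_SHARDS


def locate_write_target(element_handle: bytes, shard_to_nvram: dict) -> WriteTarget:
    """Resolve the shard to the storage node and NVRAM it resides on."""
    shard_id = shard_for_handle(element_handle)
    storage_node, nvram_id = shard_to_nvram[shard_id]  # assumed shard-to-NVRAM map
    return WriteTarget(shard_id, storage_node, nvram_id)
```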
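Claim 7 has the compute node determine the data blocks of an element to be read, locate them on the storage nodes, and retrieve them. A sketch under the same assumptions, where `block_map` and `fetch_block` are hypothetical stand-ins for the system's metadata lookup and fabric read:

```python
def read_element(element_handle: bytes, offset: int, length: int,
                 block_size: int, block_map: dict, fetch_block) -> bytes:
    """Determine the element's data blocks, locate them, and retrieve them."""
    first = offset // block_size
    last = (offset + length - 1) // block_size
    data = bytearray()
    for block_no in range(first, last + 1):
        # block_map resolves (handle, block number) to a storage-node location.
        node_id, physical_addr = block_map[(element_handle, block_no)]
        data += fetch_block(node_id, physical_addr)  # access the determined location
    start = offset - first * block_size
    return bytes(data[start:start + length])
```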
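Claims 8 and 9 describe writing a first data element to NVRAM at a location found through a hash table that maps the element's handler to a physical storage destination and metadata, with an acknowledgment returned once the NVRAM write lands. A sketch with assumed structures (`Destination`, `nvram_write`):

```python
from dataclasses import dataclass, field


@dataclass
class Destination:
    storage_node: int
    nvram_offset: int
    metadata: dict = field(default_factory=dict)


# Hash table keyed by the element's handler (a plain dict, for illustration).
location_table: dict = {}


def write_first_element(handler: bytes, payload: bytes, nvram_write) -> None:
    """Write to NVRAM at the hash-table-determined location; require an ack."""
    dest = location_table[handler]  # handler -> physical destination + metadata
    acked = nvram_write(dest.storage_node, dest.nvram_offset, payload)
    if not acked:  # the write counts as persistent only once NVRAM acknowledges it
        raise IOError("no acknowledgment from the NVRAM write buffer")
```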
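Claims 5, 12 and 14 aggregate incoming writes in the NVRAM write buffer until a complete data block is ready, and only then write that block to a solid state persistent drive. A minimal sketch, with `flush_to_ssd` as an assumed callback:

```python
class NvramAggregator:
    """Aggregates small writes until a complete data block is ready (sketch)."""

    def __init__(self, block_size: int, flush_to_ssd):
        self.block_size = block_size
        self.flush_to_ssd = flush_to_ssd  # assumed: writes one full block to an SSD
        self.buffer = bytearray()         # stands in for the NVRAM write buffer

    def append(self, payload: bytes) -> None:
        """Buffer a write; flush only complete blocks to the drive."""
        self.buffer += payload
        while len(self.buffer) >= self.block_size:
            self.flush_to_ssd(bytes(self.buffer[:self.block_size]))
            del self.buffer[:self.block_size]  # buffered space is reclaimed
```

Because the drive sees only aligned, full-block writes, it never has to relocate partially valid blocks, which is what reduces write amplification and keeps garbage collection simple.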
Description
TECHNICAL FIELD

The present disclosure generally relates to the field of data storage, and more particularly to large-scale storage systems.

BACKGROUND

A data center is a large group of networked computer servers typically used by organizations for the remote storage, processing, or distribution of large amounts of data. Traditionally, a data center is arranged using four different networks: a wide area network (WAN) providing connectivity to and from the data center, a local area network (LAN) providing connectivity among the servers of the data center, a storage area network (SAN) providing connectivity between the servers and the storage system, and an internal storage fabric connecting the various storage elements (e.g., disks). With the advancements in networking technologies, the traditional arrangement of data centers may not provide optimal performance. For example, using two different connectivity types between components of the data center is an inefficient configuration.

Advanced solid state persistent drive (SSD) technologies, such as Flash and NVRAM, provide reliable and faster alternatives to traditional magnetic hard drives. The disadvantage of SSDs is their price; thus, such persistent media is not typically used for backup or archiving applications. Further, the servers and storage systems installed in data centers utilize enterprise grade SSDs, which ensure a high number of write-erase cycles. Such enterprise grade SSDs are relatively expensive.

To keep up with the demand for storage and performance in data centers, software defined storage (SDS) has been introduced. Software defined storage refers to computer data storage technologies which separate the storage hardware from the software that manages the storage infrastructure. The software implements policy management for operations including deduplication, replication, snapshots, and backup. With software defined storage technologies, the requirement of flexible adjustment of the infrastructure can be fulfilled.

In a typical arrangement of a software defined storage solution, the storage drives are attached directly to the servers executing the storage software (logic). This is inefficient from a physical space perspective, because servers are transitioning to smaller form factors and have less room to house storage drives. Further, a server attached to multiple storage drives is a single point of failure: its failure causes inaccessibility to all the drives attached to it. Another major disadvantage of software defined storage solutions is that the computing and storage resources are coupled. That is, increasing the computing resources to achieve better performance would require increasing the number of storage drives (e.g., as part of the server); similarly, increasing the storage drives to increase the available storage would require increasing the number of servers.

US 2015/0248366 A1 teaches an approach for accessing multiple storage devices from multiple hosts without the use of remote direct memory access. As taught by US 2015/0248366 A1, a data store switch fabric enabling data communications between a data storage access system and a plurality of compute nodes is provided. Each compute node has integrated compute capabilities, data storage, and a network interface controller. A plurality of physical data storage devices is provided.
A host bus adapter is provided in data communication with the plurality of physical data storage devices and the plurality of compute nodes via the data store switch fabric. The host bus adapter includes at least one submission queue and a corresponding shadow queue. An input/output (I/O) request is received from the plurality of compute nodes, an element of the I/O request is added to the at least one submission queue, and additional information related to the element of the at least one submission queue is added to the corresponding shadow queue.

It would therefore be advantageous to provide a storage system operable as a storage solution that would overcome the deficiencies noted above.

US 8527699 B2 discloses a distributed RAID system comprising a plurality of hosts and a plurality of data banks. The hosts are coupled to the data banks by means of switches. Each data bank comprises a write cache that may be stored in some form of non-volatile memory. A write command may be routed from the host issuing the command to an appropriate data bank, and a received write command can be placed in the write cache before a response is sent to the host.

SUMMARY

The invention is defined in independent claims 1, 10 and 13.

BRIEF DESCRIPTION OF THE DRAWINGS

The subject matter disclosed herein is particularly pointed out and distinctly claimed in the claims at the conclusion of the specification. The foregoing and other objects, features, and advantages of the disclosed embodiments will be apparent from the following detailed description taken in conjunction with the accompanying drawings.