Search

US-12625862-B2 - Adaptive updates to data items

US12625862B2US 12625862 B2US12625862 B2US 12625862B2US-12625862-B2

Abstract

Examples relate to updating data items in a data store. In some examples, a batch of data items are received at a data update node associated with the data store. In accordance with a determination that the batch of data items corresponds to a real-time data update, for each data item, the data item in the data store are updated and supplemental data indicating a historical generation time of the data item is stored in a local map of the data update node. In accordance with a determination that the batch of data items corresponds to a full data refresh, a subset of the batch of data items are adaptively updated based on the supplemental data stored in the local map.

Inventors

  • Menkae Jeng
  • Sharat Narayan Hegde
  • Kushagra Dar
  • Ruchir Choudhry

Assignees

  • WALMART APOLLO, LLC

Dates

Publication Date
20260512
Application Date
20241011

Claims (20)

  1. 1 . A method for managing data storage, comprising: receiving, at a first data update node of a data store of a data system, a first batch of data items, the data store including a plurality of data update nodes, the first data update node configured to be updated via a real-time data pipeline and a full refresh data pipeline that is parallel to the real-time data pipeline; implementing a real-time data update of the first batch of data items via the real-time data pipeline, including, for each respective data item in the first batch of data items: updating the respective data item in the data store of a data system; and storing, in a first local map associated with the first data update node, supplemental data related to the respective data item including a historical time stamp indicating a historical generation time of the respective data item; and implementing a full data refresh of the first batch of data items via the full refresh data pipeline, including: adaptively updating, in the data store, a subset of the first batch of data items based on the supplemental data stored in the first local map; while the first batch of data items is being updated in the data store, receiving, at the data store, a second batch of data items via the real-time data pipeline; and determining that the second batch of data items corresponds to the real-time data update; in response to a determination that the second batch of data items corresponds to the real-time data update, deterring erroneous data overwriting between the full refresh data pipeline and the real-time data pipeline, including: determining that the second batch of data items and the first batch of data items share a common item; and suspending an update of the second batch of data items until an update of the first batch of data items is completed.
  2. 2 . The method of claim 1 , wherein adaptively updating the subset of the first batch of data items further comprises, for a first data item of the subset of the first batch of data items: extracting, from the first batch of data items, a current time stamp of the first data item indicating a current generation time of the first data item; determining that the first local map stores a corresponding historical time stamp of the first data item indicating the historical generation time of the first data item; comparing the historical time stamp and the current time stamp of the first data item; determining that the historical generation time of the first data item is earlier than the current generation time of the first data item; and updating the first data item in the data store.
  3. 3 . The method of claim 1 , wherein adaptively updating the subset of the first batch of data items further comprises, for a first data item of the subset of the first batch of data items: determining that the first local map does not include the historical time stamp of the first data item; and updating the first data item in the data store.
  4. 4 . The method of claim 1 , wherein the first batch of data items includes a skipped data item distinct from the subset of the first batch of data items, and the method further comprises: determining that the first batch of data items corresponds to the full data refresh; determining that the first local map stores the historical time stamp indicating the historical generation time of the skipped data item; extracting, from the first batch of data items, a current time stamp of the skipped data item indicating a current generation time of the skipped data item; comparing the historical time stamp and the current time stamp of the skipped data item; determining that the historical generation time is subsequent to the current generation time of the respective data item indicated by the current time stamp of the skipped data item; and aborting updating the skipped data item in the data store.
  5. 5 . The method of claim 1 , further comprising: determining that, the first batch of data items corresponds to the real-time data update and that the second batch of data items corresponds to the full data refresh; and adaptively updating a subset of the second batch of data items based on the supplemental data stored in the first local map.
  6. 6 . The method of claim 5 , further comprising: in accordance with a determination that an update of the first batch of data items in the data store is completed: reinitiating the update of the second batch of data items; and adaptively updating the common item based on distinct generation times of the common item in the first batch of data items and the second batch of data items.
  7. 7 . The method of claim 1 , further comprising: receiving a plurality of data update requests including a plurality of data items for updating information of a plurality of physical items in the data store; identifying a plurality of messaging partitions each corresponding to a respective subset of physical items and a respective data update node associated with the data store; and in response to the plurality of data update requests, assigning the plurality of data items to the plurality of messaging partitions.
  8. 8 . The method of claim 7 , wherein assigning the plurality of data items to the plurality of messaging partitions comprises: determining that the first batch of data items are associated with a first subset of physical items, wherein a first messaging partition of the plurality of messaging partitions is associated with the first subset of physical items; assigning the first batch of data items to the first messaging partition; and wherein the information of the plurality of physical items stored in the data store is updated based on the first batch of data items assigned to the first messaging partition.
  9. 9 . The method of claim 7 , wherein the plurality of data update requests includes one of a full refresh data request and a real-time data request for updating a set of data items, the method further comprising: determining that a subset of the set of data items corresponds to a respective subset of physical items associated with a first messaging partition; assigning the subset of data items to the first messaging partition; and generating plurality of batches including the first batch of data items.
  10. 10 . The method of claim 7 , wherein the plurality of data update requests includes a plurality of real-time data requests for updating a set of data items, the method further comprising: determining that a subset of data items corresponds to a respective subset of physical items associated with a first messaging partition; assigning the subset of data items to the first messaging partition; and consolidating the subset of data items into the first batch of data items.
  11. 11 . The method of claim 1 , wherein the supplemental data of each respective data item includes a data item identifier uniquely identifying the respective data item and the historical time stamp.
  12. 12 . The method of claim 1 , wherein: the data store is configured to store a collection of data items including the first batch of data items; the full data refresh is configured to update the collection of data items according to a predefined schedule, and the real-time data update is configured to update different subsets of the collection of data items without a fixed schedule; and each of the subsets of the collection of data items includes less than all of the collection of data items.
  13. 13 . A system, comprising: a data store associated with a plurality of data update nodes including a first data update node, the first data update node configured to be updated via a real-time data pipeline and a full refresh data pipeline that is parallel to the real-time data pipeline; a processor; and a memory having instructions that is configured to be executed by the processor and cause the processor to: receive, at the first data update node, a first batch of data items; implement a real-time data update of the first batch of data items via the real-time data pipeline, including, for each respective data item in the first batch of data items: updating the respective data item in the data store; and storing, in a first local map associated with the first data update node, supplemental data related to the respective data item including a historical time stamp indicating a historical generation time of the respective data item; and implement a full data refresh of the first batch of data items via the full refresh data pipeline, including: adaptively updating, in the data store, a subset of the first batch of data items based on the supplemental data stored in the first local map; while the first batch of data items is being updated in the data store, receiving, at the data store, a second batch of data items via the real-time data pipeline; determining that the second batch of data items corresponds to the real-time data update; and in response to a determination that the second batch of data items corresponds to the real-time data update, deterring erroneous data overwriting between the full refresh data pipeline and the real-time data pipeline, including: determining that the second batch of data items and the first batch of data items share a common item; and suspending an update of the second batch of data items until an update of the first batch of data items is completed.
  14. 14 . The system of claim 13 , wherein adaptively updating the subset of the first batch of data items further comprises, for a first data item of the subset of the first batch of data items: extracting, from the first batch of data items, a current time stamp of the first data item indicating a current generation time of the first data item; determining that the first local map stores a corresponding historical time stamp of the first data item indicating the historical generation time of the first data item; comparing the historical time stamp and the current time stamp of the first data item; determining that the historical generation time of the first data item is earlier than the current generation time of the first data item; and updating the first data item in the data store.
  15. 15 . The system of claim 13 , wherein adaptively updating the subset of the first batch of data items further comprises, for a first data item of the subset of the first batch of data items: determining that the first local map does not include the historical time stamp of the first data item; and updating the first data item in the data store.
  16. 16 . The system of claim 13 , wherein the first batch of data items includes a skipped data item distinct from the subset of the first batch of data items, and the processor further reads the instructions to: determine that the first batch of data items corresponds to the full data refresh; determine that the first local map stores the historical time stamp indicating the historical generation time of the skipped data item; extracting, from the first batch of data items, a current time stamp of the skipped data item indicating a current generation time of the skipped data item; compare the historical time stamp and the current time stamp of the skipped data item; determine that the historical generation time is subsequent to the current generation time of the respective data item indicated by the current time stamp of the skipped data item; and abort updating the skipped data item in the data store.
  17. 17 . A non-transitory computer-readable storage medium, having instructions stored thereon, which is configured to be executed by one or more processors and cause the processors to: receive, at a first data update node, a first batch of data items, wherein the data store is associated with a plurality of data update nodes including the first data update node, the first data update node configured to be updated via a real-time data pipeline and a full refresh data pipeline that is parallel to the real-time data pipeline; implement a real-time data update of the first batch of data items via the real-time data pipeline, including, for each respective data item in the first batch of data items: updating the respective data item in the data store; and storing, in a first local map associated with the first data update node, supplemental data related to the respective data item including a historical time stamp indicating a historical generation time of the respective data item; and implement a full data refresh of the first batch of data items via the full refresh data pipeline, including: adaptively updating, in the data store, a subset of the first batch of data items based on the supplemental data stored in the first local map; while the first batch of data items is being updated in the data store, receiving, at the data store, a second batch of data items via the real-time data pipeline; determining that the second batch of data items corresponds to the real-time data update; and in response to a determination that the second batch of data items corresponds to the real-time data update, deterring erroneous data overwriting between the full refresh data pipeline and the real-time data pipeline, including: determining that the second batch of data items and the first batch of data items share a common item; and suspending an update of the second batch of data items until an update of the first batch of data items is completed.
  18. 18 . The non-transitory computer-readable storage medium of claim 17 , wherein the instructions further comprise instructions for: determining that the first batch of data items corresponds to the real-time data update and that the second batch of data items corresponds to the full data refresh; and adaptively updating a subset of the second batch of data items based on the supplemental data stored in the first local map.
  19. 19 . The non-transitory computer-readable storage medium of claim 17 , wherein the instructions further comprise instructions for: receiving a plurality of data update requests including a plurality of data items for updating information of a plurality of physical items in the data store; identifying a plurality of messaging partitions each corresponding to a respective subset of physical items and a respective data update node associated with the data store; and in response to the plurality of data update requests, assigning the plurality of data items to the plurality of messaging partitions.
  20. 20 . The non-transitory computer-readable storage medium of claim 17 , wherein the supplemental data of each respective data item includes a data item identifier uniquely identifying the respective data item and the historical time stamp.

Description

TECHNICAL FIELD This disclosure relates generally to information processing and, more particularly, to systems and methods for controlling data loss and recovering lost data for data systems that update data via multiple data pipelines. BACKGROUND Organizations may provide data systems for users to browse and search data. Data systems that support applications with time-sensitive data, such as flash sales and inventory changes, may utilize a process to handle real-time data changes to maintain up-to-date information. Data in systems also may need to be fully refreshed for reasons such as adding new data fields or updating data fields that do not have real-time change events, among others. However, data loss issues can arise between full refresh and real-time refresh processes in data systems. Full refreshes involve replacing large data sets at scheduled intervals, which can inadvertently overwrite recent real-time data changes if not carefully managed, leading to loss of critical updates. SUMMARY In various embodiments, a system including a data store, a non-transitory memory configured to store instructions thereon, and at least one processor is disclosed. The data store is associated with a plurality of data update nodes including a first data update node. The at least one processor is configured to receive, at the first data update node, a first batch of data items. The at least one processor is further configured to, in accordance with a determination that the first batch of data items corresponds to a real-time data update, for each respective data item in the first batch of data items, update the respective data item in the data store and store, in a first local map associated with the first data update node, supplemental data related to the respective data item including a historical time stamp indicating a historical generation time of the respective data item. The at least one processor is further configured to, in accordance with a determination that the first batch of data items corresponds to a full data refresh, adaptively update, in the data store, a subset of the first batch of data items based on the supplemental data stored in the first local map. In various embodiments, a computer-implemented method is disclosed. The computer-implemented method is performed at a system having a data store associated with a plurality of data update nodes that further includes a first data update node. The computer-implemented method includes steps of receiving, at the first data update node, a first batch of data items; in accordance with a determination that the first batch of data items corresponds to a real-time data update, for each respective data item in the first batch of data items, updating the respective data item in the data store and storing, in a first local map associated with the first data update node, supplemental data related to the respective data item including a historical time stamp indicating a historical generation time of the respective data item; and in accordance with a determination that the first batch of data items corresponds to a full data refresh, adaptively updating, in the data store, a subset of the first batch of data items based on the supplemental data stored in the first local map. In various embodiments, a non-transitory computer readable medium having instructions stored thereon is disclosed. The instructions, when executed by at least one processor, cause at least one device to perform operations including receiving, at a first data update node, a first batch of data items; in accordance with a determination that the first batch of data items corresponds to a real-time data update, for each respective data item in the first batch of data items, updating the respective data item in a data store and storing, in a first local map associated with the first data update node, supplemental data related to the respective data item including a historical time stamp indicating a historical generation time of the respective data item; and in accordance with a determination that the first batch of data items corresponds to a full data refresh, adaptively updating, in the data store, a subset of the first batch of data items based on the supplemental data stored in the first local map. In various embodiments, a system including a non-transitory memory configured to store instructions thereon and at least one processor is disclosed. The at least one processor is configured to receive one or more real-time data update requests to update a plurality of data items stored at a data store. The data store is configured to update the plurality of data items via a plurality of data update nodes, and each data update node has a respective local map. The at least one processor is further configured to, in response to the one or more real-time data update requests and for each respective data item in the plurality of data items, assign the respective data item to a respective data update n