US-12625729-B2 - Packet processing computations utilizing a pre-allocated memory function
Abstract
The present disclosure relates to systems, methods, and computer-readable media for utilizing a new memory allocation function library called PmemMalloc. For example, the PmemMalloc library allocates pre-allocated, partitioned, and fixed shared memory blocks. In addition, by utilizing the PmemMalloc library, the memory allocation system described herein overcomes problems with persistence and enumeration that encumber existing malloc libraries. Indeed, the PmemMalloc library enables the memory allocation system to perform servicing computations in parallel across multiple CPU cores/threads, distribute computation equally among threads, and prioritize servicing, among other improvements. Notably, the PmemMalloc library provides major constructs (e.g., persistence, enumeration, and debuggability) not available in existing malloc libraries. Additionally, as detailed in this disclosure, the PmemMalloc library migrates various computations out of application-based packet processing to memory block-based deferred enumeration, which improves both packet processing and efficient use of CPU cores on a computing device.
Inventors
- Janardhana Reddy NAREDULA
- Naresh Kumar BADE
Assignees
- MICROSOFT TECHNOLOGY LICENSING, LLC
Dates
- Publication Date
- 20260512
- Application Date
- 20220524
Claims (20)
- 1 . A computer-implemented method for managing memory, comprising: creating shared memory blocks that are pre-allocated, partitioned, and fixed in size, the shared memory blocks being accessible by multiple applications; subsequent to completion of creating the shared memory blocks, receiving packet data at a network device that comprises an application that processes the packet data; in response to receiving the packet data, processing the packet data by creating a memory heap within the shared memory blocks upon receiving the packet data, wherein the shared memory blocks are pre-allocated within a memory region, partitioned in the memory region, and fixed in size within the memory region, and wherein the shared memory blocks are accessible by multiple applications, wherein the memory heap persists across rebooting or restarting the application and does not involve operating system integration; generating allocation metadata for each block of the shared memory blocks, wherein the allocation metadata includes an object type of packet data portions stored in each block of the shared memory blocks; attaching the allocation metadata to each corresponding block of the shared memory blocks; and performing one or more memory operations to perform packet processing operations on the packet data and perform enumeration operations across the shared memory blocks based on the allocation metadata.
- 2 . The computer-implemented method of claim 1 , further comprising generating the allocation metadata for each block of the shared memory blocks before attaching the allocation metadata to corresponding blocks of the shared memory blocks.
- 3 . The computer-implemented method of claim 2 , wherein the allocation metadata for each block of the shared memory blocks comprises a timestamp and a port number.
- 4 . The computer-implemented method of claim 3 , wherein the enumeration operations include zero blocking, servicing with data structure change, object expiry, port delete, and memory leak detection.
- 5 . The computer-implemented method of claim 4 , wherein performing the one or more memory operations across the shared memory blocks is further based on priority levels of the enumeration operations comprising a high-priority bulk operation, a low-priority operation, and an optional-priority operation.
- 6 . The computer-implemented method of claim 5 , wherein performing the one or more memory operations across the shared memory blocks further comprises processing high-priority bulk operations while deferring low-priority bulk operations until CPU idle time increases beyond a threshold level.
- 7 . The computer-implemented method of claim 4 , wherein performing the one or more memory operations across the shared memory blocks comprises: filtering memory blocks from the shared memory blocks based on object types; and performing an enumeration operation on the filtered memory blocks having a same object type.
- 8 . The computer-implemented method of claim 1 , wherein performing the one or more memory operations across the shared memory blocks comprises: sharding the packet data across the shared memory blocks; and performing both the packet processing operations and the enumeration operations across multiple CPU data threads utilizing the shared memory blocks.
- 9 . The computer-implemented method of claim 1 , wherein the memory heap of the shared memory blocks comprises memory that persists across rebooting the application.
- 10 . The computer-implemented method of claim 1 , wherein the memory heap of the shared memory blocks is pre-allocated without operating system integration.
- 11 . The computer-implemented method of claim 1 , further comprising receiving the packet data corresponding to the application from a network interface card before flows are converted into new objects.
- 12 . The computer-implemented method of claim 1 , wherein generating the memory heap of the shared memory blocks comprises pre-allocating and partitioning the shared memory blocks across a fixed number of data threads, a fixed number of CPUs, and a fixed amount of memory.
- 13 . The computer-implemented method of claim 1 , wherein CPU computations and the shared memory blocks occur on a data plane development server device.
- 14 . The computer-implemented method of claim 1 , wherein performing the one or more memory operations across the shared memory blocks based on the allocation metadata comprises filtering and selecting one or more shared memory blocks that include a same metadata object type for combined processing.
- 15 . A system comprising: at least one processor; and a non-transitory computer memory comprising instructions that, when executed by the at least one processor, cause the system to: create shared memory blocks that are pre-allocated, partitioned, and fixed in size, the shared memory blocks being accessible by multiple applications; subsequent to completion of creating the shared memory blocks, receive packet data at a network device that comprises an application that processes the packet data; in response to receiving the packet data, process the packet data by creating a memory heap within the shared memory blocks upon receiving the packet data, wherein the shared memory blocks are pre-allocated within a memory region, partitioned in the memory region, and fixed in size within the memory region, and wherein the shared memory blocks are accessible by multiple applications, wherein the memory heap persists across rebooting or restarting the application and does not involve operating system integration; generate allocation metadata for each block of the shared memory blocks, wherein the allocation metadata includes an object type of packet data portions stored in each block of the shared memory blocks; attach the allocation metadata to each corresponding block of the shared memory blocks; and perform one or more memory operations to perform packet processing operations on the packet data and perform enumeration operations across the shared memory blocks based on the allocation metadata.
- 16 . The system of claim 15 , wherein performing the one or more memory operations across the shared memory blocks is further based on priority levels of the enumeration operations that comprise a high-priority bulk operation and a low-priority operation.
- 17 . The system of claim 15 , further comprising additional instructions that, when executed by the at least one processor, cause the system to receive the packet data corresponding to the application from a network interface card before flows are converted into new objects.
- 18 . The system of claim 15 , wherein generating the memory heap of the shared memory blocks comprises pre-allocating and partitioning the shared memory blocks across a fixed number of data threads, a fixed number of CPUs, and a fixed amount of memory.
- 19 . A non-transitory computer-readable storage medium comprising instructions that, when executed by at least one processor, cause a computing device to: create shared memory blocks that are pre-allocated, partitioned, and fixed in size, the shared memory blocks being accessible by multiple applications; subsequent to completion of creating the shared memory blocks, receive packet data at a network device that comprises an application that processes the packet data; in response to receiving the packet data, process the packet data by creating a memory heap within the shared memory blocks upon receiving the packet data, wherein the shared memory blocks are pre-allocated within a memory region, partitioned in the memory region, and fixed in size within the memory region, and wherein the shared memory blocks are accessible by multiple applications, wherein the memory heap persists across rebooting or restarting the application and does not involve operating system integration; generate allocation metadata for each block of the shared memory blocks, wherein the allocation metadata includes an object type of packet data portions stored in each block of the shared memory blocks; attach the allocation metadata to each corresponding block of the shared memory blocks; and perform one or more memory operations to perform packet processing operations on the packet data and perform enumeration operations across the shared memory blocks based on the allocation metadata.
- 20 . The non-transitory computer-readable storage medium of claim 19 , further comprising additional instructions that, when executed by the at least one processor, cause the computing device to generate the allocation metadata for each block of the shared memory blocks before attaching the allocation metadata to corresponding blocks of the shared memory blocks, and wherein: the allocation metadata for each block of the shared memory blocks comprises a timestamp and a port number; and performing the one or more memory operations across the shared memory blocks is further based on priority levels of the enumeration operations comprising a high-priority bulk operation, a low-priority operation, and an optional-priority operation.
Description
BACKGROUND

Recent years have seen significant advancements in hardware and software platforms that implement network devices, such as those found in cloud computing systems. For instance, network devices often provide functions or services by receiving and processing incoming data packets. In some cases, one or more network applications (or simply “applications”) on a network device process the incoming data packets. In many instances, these applications need to request memory from the network device to process the incoming data packets. Currently, applications like these rely on existing memory allocation function (“malloc”) libraries, such as Glibc, jemalloc, and tcmalloc, to allocate memory for packet data processing.

However, existing malloc libraries face several technical shortcomings that result in inefficient, inaccurate, and inflexible operations. To elaborate, existing malloc libraries provide neither persistent memory support nor enumeration support, resulting in numerous problems. Indeed, existing malloc libraries are not designed to store heaps (i.e., heap memory) on persistent memory; rather, they dynamically allocate memory that is integrated with an Operating System (OS) and subject to OS interactions. As a result, memory and processing inaccuracies occur when data stored in memory is lost due to application restarts, when the OS overwrites or otherwise interferes with allocated data, or when other issues arise, as noted below.

Additionally, existing malloc libraries do not provide enumeration support, which increases processing time and reduces computational efficiency. This lack of enumeration support often results in inefficient serial processing of packet data, unequal CPU distribution, and memory locks. To elaborate, a server or virtual machine using existing malloc libraries allocates a number (e.g., four, eight, etc.) of CPU cores (e.g., worker threads) for processing packet data.
For example, in a four-CPU core system, when an application calls an existing malloc library, existing malloc libraries horizontally split the CPU cores, allocating three of the cores to data processing while the fourth is used by the application for enumeration operations. Accordingly, in these cases, enumeration operations (or simply “enumeration”) are rigidly restricted to a single CPU core and prevented from operating in parallel on the other CPU cores, even when those cores are underutilized. Additionally, because enumeration uses a single CPU core, memory locks, lock contentions, and memory access errors frequently occur with existing malloc libraries, as memory blocks are owned by cores and/or heaps other than those accessing them.

As additional examples of inefficiencies, existing malloc libraries suffer from unequal CPU utilization and poor distribution caused by enumeration. As noted above, enumeration often requires its own CPU cores, which cannot be used for packet processing. Thus, even if the enumeration core is idle or not fully engaged, existing malloc libraries do not allow packet processing to occur on idle CPU cores allocated to enumeration (and vice versa).

Further, many existing malloc libraries are inflexibly limited to passive sync functions (e.g., a synchronous programming model), which process packet data as it arrives regardless of processing importance. Because servicing computations are done synchronously, existing malloc libraries require O(N) computations due to serialization and de-serialization flows (e.g., in current servicing, all flows and objects are converted into new objects at O(N)). As an example, with existing malloc libraries, processing 1 million packet data flows takes 40 seconds on average. Indeed, under existing malloc libraries, enumerations are performed serially on a single core and cannot be deferred; as a result, processing throughput is greatly reduced.
These and other problems result in significant inefficiencies, inaccuracies, and inflexibilities of existing malloc libraries.

BRIEF DESCRIPTION OF THE DRAWINGS

The detailed description provides one or more implementations with additional specificity and detail through the use of the accompanying drawings, as briefly described below.

FIGS. 1A-1B illustrate a diagram of a computing system environment where a memory allocation system is implemented in accordance with one or more implementations.

FIG. 2 illustrates an example workflow for utilizing a new memory allocation function library called PmemMalloc to process packet data in accordance with one or more implementations.

FIG. 3 illustrates a diagram of processing packet data with the PmemMalloc library in accordance with one or more implementations.

FIGS. 4A-4B illustrate an example architecture of a PmemMalloc function in accordance with one or more implementations.

FIG. 5 illustrates graphs comparing CPU utilization between existing malloc libraries and the PmemMalloc library in accordance with one or more implementations.

FIG. 6 illustrates