Optimisations in Memory Management

Ravi Sri Ram Chowdary
17 min read · Oct 9, 2022

Design principles to optimise various aspects of Memory Management.

Co-authors: Akshat Singh Rathore, Anindya, ANURAT BHATTACHARYA, Ravi Sri Ram Chowdary

Introduction

At warehouse scale, memory allocation falls under the Datacenter Tax: the total time spent on common service overheads like RPC communication, serialisation etc. Because we often cannot optimise every single application enough to significantly enhance total system efficiency, we instead focus on these shared components to improve the efficiency of an entire class of applications.

Page-based implementations of virtual memory, coupled with a Translation Lookaside Buffer (TLB) for fast address lookup, are ubiquitous in modern hardware. TLB misses, caused by the limited TLB size, are a dominant performance overhead on modern systems.

We focus here on various methods to decrease TLB misses, such as introducing Redundant Memory Mappings outside the CPU’s critical path, and using modern-processor features like hugepages together with allocators that increase hugepage coverage.

We will also see ML techniques that predict lifetimes for unobserved allocation sites, pack objects with similar lifetimes together in the blocks of a huge page, track actual lifetimes, and use them to correct for mispredictions.

Finally, we discuss a novel memory management system, Cost-Benefit Memory Management (CBMM), which uses cost-benefit analysis to make policy decisions that prioritize behavioral consistency and debuggability.

CBMM: Financial Advice for Kernel Memory Managers

Kernel memory management (MM) faces new challenges as a result of data center workloads. MM encompasses a large collection of kernel mechanisms and policies to allocate and map physical memory. Together, they make up a difficult set of tradeoffs that, if not handled carefully, can result in subpar performance or unexpected behavior. Current MM designs frequently display inconsistent, opaque behavior that is difficult to reproduce, understand, or fix, for three reasons: a lack of high-quality data for policymaking, a lack of cost awareness in many current MM policies, and opaque, distributed policy implementations that are difficult to debug. For example, the cost of invoking reclamation or compaction almost certainly outweighs any benefit of using a huge page, yet Linux’s fallback algorithms do exactly this and cause inconsistent system behavior.

Idea

All kernel MM operations have a cost and a benefit to userspace, and CBMM estimates both using empirically grounded cost-benefit models to drive MM policy. Because it explicitly models cost and benefit, CBMM makes fewer pathologically poor policy decisions than current designs. The fundamental tenet of policymaking is that userspace benefits must outweigh userspace costs. Simply disabling overly aggressive policies, such as Linux’s THP or defragmentation policies, hurts workloads that need many huge pages. Additionally, because existing policies are frequently implemented across multiple files, as with Linux’s huge page policies, it is challenging to alter them to match new performance goals. By explicitly stating costs and benefits and centralizing policymaking, CBMM reduces all three issues.

Cost-Benefit Memory Management (CBMM)

In CBMM, policy decisions follow the guiding principle that userspace benefits must outweigh userspace costs. By applying this principle uniformly, CBMM significantly improves consistency over Linux and HawkEye, while matching their performance.

The Estimator

The estimator is a new kernel component that CBMM adds. When a policy choice is required, it calculates the benefits and costs of a specific MM operation; the kernel carries out the operation only if the benefits outweigh the costs. The MM subsystem calls the estimator at decision points, locations in the code where policy judgments must be made. This centralizes decision-making and identifies the key decision points in the policy. When a decision point calls the estimator, it passes information about the type and parameters of the operation. The estimator acts as a black box, returning an estimate of the costs and benefits of the given MM operation and parameters. Costs and benefits are expressed in terms of userspace time saved or lost, which closely reflects user objectives. Given that many data center workloads run continuously, CBMM specifically uses the rate of time saved or lost over a fixed horizon.
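
To make the flow concrete, here is a minimal C++ sketch of how a decision point might consult such an estimator. The names (EstimatorQuery, estimate_cost_benefit, should_perform) and the fixed numbers in the stub are illustrative assumptions, not CBMM’s actual kernel interface.

```cpp
#include <cstdint>

// Hypothetical description of one MM operation (for example, promoting a
// region to a huge page) passed from a decision point to the estimator.
struct EstimatorQuery {
    uint64_t pid;         // process the operation applies to
    uint64_t virt_addr;   // virtual address involved in the operation
    enum class Op { HugePagePromotion, Compaction, Reclamation } op;
};

// Costs and benefits are expressed as a rate of userspace time saved or
// lost over a fixed horizon, as described above.
struct Estimate {
    double benefit_rate;  // expected userspace time saved per unit time
    double cost_rate;     // expected userspace time lost per unit time
};

// Black-box estimator. A real implementation would combine decision-point
// input, ambient kernel state, and preloaded profiles; this placeholder
// just returns fixed illustrative numbers.
Estimate estimate_cost_benefit(const EstimatorQuery& q) {
    (void)q;
    return Estimate{/*benefit_rate=*/2.0, /*cost_rate=*/1.0};
}

// A decision point: carry out the operation only if benefits outweigh costs.
bool should_perform(const EstimatorQuery& q) {
    Estimate e = estimate_cost_benefit(q);
    return e.benefit_rate > e.cost_rate;
}
```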

Cost and Benefit Models

The estimator internally consists of a collection of cost and benefit models for the various MM operations. Each (sub)model is an independent, self-contained black box that takes input from the decision point, mixes it with the ambient kernel state and preloaded profiles (files loaded into the kernel that supply knowledge about application-specific behavior), and generates an estimate. The additional input from the kernel state and preloaded profiles makes the models more context-aware and lets them use better data about workload behavior. Unlike current heuristics, CBMM confines policies to specific cost and benefit models; their inputs and outputs can therefore be observed and improved in a single location, making performance debugging easier in CBMM than in Linux. Model development in CBMM can proceed iteratively, starting with a basic model and refining it as necessary.

In designing the models, it was found that benefits tend to be application-specific, whereas costs tend to be system-specific.

Figure: CBMM model inputs and outputs

Preloaded Profiles

Applications react to MM policies differently, and kernels currently lack high-quality data to forecast workload behavior. Preloaded profiles are files loaded into the kernel when a process is started (for example, by a cluster manager) to give the models details about the behavior of that process. They let the estimator make use of offline processing for specific policy decisions, whereas prior techniques rely on expensive and imprecise proxy information such as page fault counts or page access bits. Preloaded profiles in CBMM specifically offer spatial, per-process data, i.e., data about portions of a single address space at arbitrary granularity, as fine as a 4KB page.
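
As an illustration of what “spatial, per-process data” could look like, here is a hedged C++ sketch of a profile that maps virtual address ranges (down to 4KB granularity) to an expected huge-page benefit. The structure and field names are assumptions for exposition, not CBMM’s actual profile format.

```cpp
#include <cstdint>
#include <map>
#include <optional>

// One profile entry: a half-open virtual address range [start, end)
// annotated with the expected benefit of backing it with huge pages.
struct ProfileEntry {
    uint64_t start;
    uint64_t end;                 // multiple of 4KB; can be as small as one page
    double   huge_page_benefit;   // e.g. estimated userspace time saved per second
};

// A per-process preloaded profile, keyed by range start for lookup.
class PreloadedProfile {
public:
    void add(ProfileEntry e) { entries_[e.start] = e; }

    // Find the entry covering a faulting address, if any.
    std::optional<ProfileEntry> lookup(uint64_t addr) const {
        auto it = entries_.upper_bound(addr);
        if (it == entries_.begin()) return std::nullopt;
        --it;
        if (addr >= it->second.start && addr < it->second.end) return it->second;
        return std::nullopt;
    }

private:
    std::map<uint64_t, ProfileEntry> entries_;
};
```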

Evaluation

Compared to Linux or HawkEye, CBMM improves tail latency; it significantly lowers the occurrence of slow page faults across different workloads, especially when memory is fragmented. CBMM’s strengths are competitive performance and more comprehensible behavior. Most of the time, CBMM performs as well as or better than Linux, and due to its emphasis on consistent behavior it frequently performs significantly better than Linux or HawkEye in fragmented environments. Additionally, CBMM is simple to debug and tune by modifying profiles and/or models. Despite having the most consistent behavior and occasionally higher performance, CBMM frequently uses substantially fewer huge pages than other methods: because it is cost- and context-aware, it is more targeted in its use of huge pages, although in certain circumstances the profiles undervalue their usefulness.

Figure: Runtime of CBMM workloads when enabling more models, normalized to Linux with THP without fragmentation (lower is better).
Figure: Soft page fault tail latency impact of different models.

Redundant Memory Mappings for fast access to large memory

Virtual memory gives each process the illusion of private and huge address space improving security and programmer productivity. Page-based implementations of virtual memory coupled with a Translation Lookaside Buffer (TLB) for fast address lookup are ubiquitous in modern hardware.

However, with the increasing size of physical memory and the demanding workloads of modern applications, the performance of paging suffers for the following reasons:

  • TLB size is stagnant and can’t be increased arbitrarily as it lies on the CPU’s critical path.
  • On a TLB miss, the system must walk the page table, which can increase execution time overhead to nearly 50% in modern workloads.

To address this issue, Redundant Memory Mappings (RMM) were introduced: a novel hardware/software co-designed implementation of virtual memory. The key idea of RMM is:

Diverse workloads exhibit contiguity in the virtual as well as the physical address space. RMM exploits this contiguity to represent memory segments as ranges, giving a succinct representation of allocated memory, and uses range memory mappings to map huge chunks of virtual memory to physical memory.

As the name suggests, RMM only adds redundant range mappings on top of the traditional paging architecture. A range mapping/translation maps contiguous virtual pages to contiguous physical pages; a minimum threshold (ThRng) on the number of pages is required to form a range. A range translation is identified by a BASE and a LIMIT address along with an OFFSET, which is the start address of the corresponding physical address range minus the BASE. The RMM architecture is analogous to the paging architecture:

RMM hardware support consists primarily of a range TLB that is accessed in parallel with the last-level page TLB, backed by a per-process range table in memory that the OS manages based on the application's memory management operations.
  • Range TLB: The range TLB is a fully associative cache that holds BASE, LIMIT, OFFSET, and protection bits for fast range translation. It is looked up in parallel with the last-level page TLB, and on a hit it generates the corresponding entry for the previous-level page TLB. This is done by testing BASE ≤ Virtual Page Number < LIMIT for all entries in parallel and, on a hit, returning the Virtual Page Number plus the entry's OFFSET (see the sketch after this list). On a miss, a walker similar to a page table walker walks the range table to find the required range translation.
  • Range Table: Similar to the page table, the range table maintains the range translations per process. It can be implemented as a cache-friendly B-Tree with (BASE, LIMIT) as key and OFFSET and protection bits as values, with logarithmic insert, update, search and delete operations. Each node can store up to 4 translations and point to up to 5 children, which makes it a dense representation of ranges: in only 3 levels it stores 124 range translations, sufficient for most workloads. A CR-RT register points to the root node of the table.
    Now let's see what happens when a TLB miss occurs in both the last-level page TLB and the range TLB. Since ranges are redundant, we can just bring in the missing PTE and continue execution while the range table walker hardware updates the range TLB in the background, keeping it off the CPU's critical path and adding no extra overhead.
    The walker hardware simply traverses the tree from the root, checks whether the address is present in any range translation, and updates the range TLB. Additionally, it uses a range bit in the PTEs to verify that the page is part of a range at all before traversal.
  • Eager Paging: As stated earlier, RMM depends heavily on contiguity in both the physical and the virtual address space. Traditional demand paging does not provide this, since only one page is allocated on the first access. To ensure maximum contiguity, eager paging allocates consecutive physical pages to consecutive virtual pages eagerly at allocation time, rather than lazily on demand at access time.
    It uses the traditional buddy allocator to allocate pages. If the number of pages crosses ThRng, RMM creates range translations for that allocation.
    We note that demand paging replaced eager paging in early systems. However, one motivation for demand paging was to limit unnecessary swapping in multi-programmed workloads, which modern large memories make less common.
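
The range TLB hit test mentioned above reduces to checking BASE ≤ VPN < LIMIT against every cached range and, on a hit, adding the OFFSET to the virtual page number. Below is a minimal C++ model of that lookup; real hardware tests all entries in parallel, and the struct layout here is only illustrative.

```cpp
#include <cstdint>
#include <optional>
#include <vector>

// One range translation, as described above: contiguous virtual pages
// [base, limit) map to physical pages starting at base + offset.
struct RangeTranslation {
    uint64_t base;     // first virtual page number in the range
    uint64_t limit;    // one past the last virtual page number
    int64_t  offset;   // physical start page number minus base
    uint8_t  prot;     // protection bits
};

// Software model of the fully associative range TLB lookup. Real hardware
// tests every entry in parallel; here we scan sequentially for clarity.
std::optional<uint64_t> range_tlb_lookup(
        const std::vector<RangeTranslation>& range_tlb, uint64_t vpn) {
    for (const auto& r : range_tlb) {
        if (r.base <= vpn && vpn < r.limit) {
            // Hit: physical page number is VPN + OFFSET.
            return static_cast<uint64_t>(static_cast<int64_t>(vpn) + r.offset);
        }
    }
    // Miss: the range table walker takes over, off the CPU's critical path.
    return std::nullopt;
}
```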

Advantages of RMM

  • RMM performs at par with, or better than, other solutions (e.g. Direct Segments, Multipage Mappings, etc.) across all kinds of workloads.
  • RMM builds on top of the traditional paging architecture and hence requires minimal changes in hardware and software. Due to its simplicity, implementation and debugging are not difficult.
  • RMM’s eager paging efficiently allocates consecutive pages, keeping the range table small and adding little memory overhead over traditional paging.

Learning-based Memory Allocation for C++ Server Workloads

Overview

The paper uses ML techniques to deal effectively with fragmentation in C++ server workloads, which is induced by long-lived objects allocated at peak heap size. Language models (LSTMs) are used to predict lifetimes for unobserved allocation sites. The Learned Lifetime-Aware Memory Allocator (LLAMA) is a novel memory manager that organises the heap using huge pages and lifetime classes. It packs objects of similar lifetimes together in the blocks of a huge page, tracks actual lifetimes, and uses them to correct for mispredictions. Fragmentation is limited by filling gaps with shorter-lived objects.

Process of Lifetime Prediction

Multiple executions are sampled to ensure proper coverage. A given application is monitored for a sample period, and lifetimes for a small fraction of all allocations during this period are recorded. Since sampling may not observe all allocation calling contexts, samples from heterogeneous sets of different software versions are combined. A simple lookup table is therefore infeasible, which is why ML is used to predict lifetimes from context. The C++ memory allocator then uses those predictions to organise the heap by lifetime.

Another challenge is overhead. As seen in Table 1 of the paper, the TCMalloc fast path (8.3 ns) is far too short given that obtaining a full stack trace alone takes about 400 ns. One way to mitigate this is a hashing-based mechanism that identifies previously seen contexts so the model is not called every time, giving a better amortised cost per allocation.
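
Here is a hedged C++ sketch of such a hashing-based fast path. The model callback is a hypothetical stand-in for the LSTM prediction, and hashing raw program counters with FNV-1a is an assumption for illustration rather than the paper’s exact mechanism.

```cpp
#include <cstdint>
#include <functional>
#include <unordered_map>
#include <utility>
#include <vector>

// Caches lifetime-class predictions keyed by a hash of the raw program
// counters on the stack, so the expensive model runs only for new contexts.
class LifetimePredictionCache {
public:
    // `model` is a hypothetical stand-in for the LSTM prediction call.
    explicit LifetimePredictionCache(
        std::function<int(const std::vector<uint64_t>&)> model)
        : model_(std::move(model)) {}

    int predict(const std::vector<uint64_t>& stack_pcs) {
        uint64_t key = hash_pcs(stack_pcs);
        auto it = cache_.find(key);
        if (it != cache_.end()) return it->second;  // amortised fast path
        int lifetime_class = model_(stack_pcs);     // slow path, taken rarely
        cache_.emplace(key, lifetime_class);
        return lifetime_class;
    }

private:
    static uint64_t hash_pcs(const std::vector<uint64_t>& pcs) {
        uint64_t h = 14695981039346656037ull;       // FNV-1a folded over the PCs
        for (uint64_t pc : pcs) {
            h ^= pc;
            h *= 1099511628211ull;
        }
        return h;
    }

    std::function<int(const std::vector<uint64_t>&)> model_;
    std::unordered_map<uint64_t, int> cache_;
};
```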

Sampling

Samples of a subset of memory allocations are collected periodically. Each sample includes a stack trace, the object size, and the address at allocation and deallocation time. This approach is integrated with the existing TCMalloc heap profiling mechanism, which identifies long-lived objects by producing a list of sampled objects at the end of the application’s execution, most of which are long-lived. However, it misses short-lived objects that are no longer live at the end of the program, so the process is extended to record deallocations as well. This is done using hooks that are called periodically based on the number of allocated bytes.
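
A rough C++ sketch of byte-count-driven sampling hooks of this kind is shown below; the hook names, the sampling policy, and the data recorded per sample are simplified assumptions, not TCMalloc’s actual heap-profiling implementation.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

struct AllocationSample {
    std::vector<uint64_t> stack_pcs;  // stack trace at allocation
    size_t   size;
    uint64_t alloc_time_ns;
    uint64_t free_time_ns = 0;        // filled in at deallocation; 0 while live
};

class LifetimeSampler {
public:
    explicit LifetimeSampler(size_t sample_interval_bytes)
        : interval_(sample_interval_bytes) {}

    // Called from the allocation path.
    void on_alloc(void* ptr, size_t size, std::vector<uint64_t> stack,
                  uint64_t now_ns) {
        bytes_since_sample_ += size;
        if (bytes_since_sample_ < interval_) return;  // sample only every N bytes
        bytes_since_sample_ = 0;
        live_[ptr] = AllocationSample{std::move(stack), size, now_ns};
    }

    // Called from the deallocation path; records lifetimes of sampled objects.
    void on_free(void* ptr, uint64_t now_ns) {
        auto it = live_.find(ptr);
        if (it == live_.end()) return;                // not a sampled object
        it->second.free_time_ns = now_ns;
        finished_.push_back(std::move(it->second));
        live_.erase(it);
    }

    const std::vector<AllocationSample>& samples() const { return finished_; }

private:
    size_t interval_;
    size_t bytes_since_sample_ = 0;
    std::unordered_map<void*, AllocationSample> live_;
    std::vector<AllocationSample> finished_;
};
```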

Data preprocessing

Sample data is pre-processed using a distributed computation framework. Inputs are grouped by allocation site, and the distribution of observed lifetimes for each site is calculated. The 95th percentile of this distribution is used to assign each site a label L_i ∈ {1, . . . , 7, ∞}.

This creates lifetime classes of 10 ms, 100 ms, 1 s, 10 s, 100 s, 1000 s, >1000 s, and ∞ (for objects never observed to be freed). The ML model classifies stack traces according to these labels.
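
A minimal sketch of the labelling step, assuming lifetimes are recorded in milliseconds; the bucket boundaries follow the classes listed above, while the handling of empty sites and immortal objects is an assumption for illustration.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Map a lifetime in milliseconds to one of the classes listed above:
// 1 = <=10 ms, 2 = <=100 ms, ..., 6 = <=1000 s, 7 = >1000 s.
// Immortal objects (never observed freed) would get the separate label "infinity".
int lifetime_class_ms(double lifetime_ms) {
    const double bounds_ms[] = {1e1, 1e2, 1e3, 1e4, 1e5, 1e6};  // 10 ms ... 1000 s
    for (int i = 0; i < 6; ++i) {
        if (lifetime_ms <= bounds_ms[i]) return i + 1;
    }
    return 7;
}

// Label one allocation site by the 95th percentile of its observed lifetimes.
int label_site(std::vector<double> lifetimes_ms) {
    if (lifetimes_ms.empty()) return 7;  // no data: assume long-lived (an assumption)
    std::sort(lifetimes_ms.begin(), lifetimes_ms.end());
    std::size_t idx =
        static_cast<std::size_t>(std::ceil(0.95 * lifetimes_ms.size())) - 1;
    return lifetime_class_ms(lifetimes_ms[idx]);
}
```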

ML Model

Each frame in the stack trace is treated as a string and tokenized by splitting on special characters. The most common tokens are kept, and a table is created mapping each of them to an ID. An LSTM recurrent neural network is then used for sequence prediction: LSTMs capture long-term sequential dependencies by applying a recursive computation to every element in a sequence and outputting a prediction based on the final step.
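
A hedged sketch of that tokenization step: frame strings are split on non-alphanumeric characters and mapped to integer IDs from a fixed vocabulary of the most common tokens. The unknown-token fallback (ID 0) is an assumption for illustration.

```cpp
#include <cctype>
#include <string>
#include <unordered_map>
#include <vector>

// Split one stack-frame string (e.g. "proto2::Arena::CreateMessage") into
// tokens by breaking on any non-alphanumeric character.
std::vector<std::string> tokenize_frame(const std::string& frame) {
    std::vector<std::string> tokens;
    std::string cur;
    for (char c : frame) {
        if (std::isalnum(static_cast<unsigned char>(c))) {
            cur.push_back(c);
        } else if (!cur.empty()) {
            tokens.push_back(cur);
            cur.clear();
        }
    }
    if (!cur.empty()) tokens.push_back(cur);
    return tokens;
}

// Map tokens to IDs using a fixed vocabulary of the most common tokens;
// anything unseen falls back to a shared "unknown" ID (0 here by assumption).
std::vector<int> to_ids(const std::vector<std::string>& tokens,
                        const std::unordered_map<std::string, int>& vocab) {
    std::vector<int> ids;
    for (const auto& t : tokens) {
        auto it = vocab.find(t);
        ids.push_back(it == vocab.end() ? 0 : it->second);
    }
    return ids;
}
```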

Lifetime Aware Allocator Design

LLAMA organises the heap into huge pages to increase TLB reach. Each active huge page is assigned one of N lifetime classes (LCs). The global allocator manages huge pages and their blocks: on acquiring huge pages from the OS, it bump-pointer allocates them in their initial LC regions. Scalability on multicore hardware is achieved by using mostly unsynchronised thread-local allocation for small objects.

Each huge page has three states: open, active, and free. Open and active huge pages are live. The first allocation into a huge page makes it open and determines its LC; only one huge page per LC is open at a time. Once a huge page is open, LLAMA assigns its blocks to the same LC. A huge page transitions from open to active once all its constituent blocks have been filled for the first time, and from then on it remains active for the rest of its lifetime. A huge page becomes free when all its blocks are free, and it is immediately returned to the OS.

All blocks on a huge page are free or live; open or closed for allocation; and residual or non-residual. All blocks are initially free. When the global allocator returns a block span to a local allocator, that span is marked open for allocation; if it is on an open huge page, it is also marked residual. Residual blocks match the LC of their huge page. An active huge page may also contain other live, non-residual blocks, but these hold objects of shorter lifetime classes. Huge pages thus contain live residual blocks and free blocks of the same LC; LLAMA limits fragmentation by aggressively recycling such free blocks for objects in shorter LCs.
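
To make the state machine above easier to follow, here is a minimal C++ sketch of the per-hugepage and per-block bookkeeping it implies. The field names, the block count, and the helper functions are illustrative assumptions, not LLAMA’s actual data structures.

```cpp
#include <array>
#include <cstddef>

// Illustrative per-hugepage and per-block bookkeeping for the state machine
// described above; names and block count are assumptions, not LLAMA's internals.
enum class HugePageState { Free, Open, Active };
constexpr std::size_t kBlocksPerHugePage = 16;  // assumed block count per huge page

struct Block {
    bool live = false;         // free vs. live
    bool open = false;         // open vs. closed for allocation
    bool residual = false;     // residual blocks share the huge page's lifetime class
    bool ever_filled = false;  // needed for the open -> active transition
    int  lifetime_class = 0;   // LC of the objects packed into this block
};

struct HugePage {
    HugePageState state = HugePageState::Free;
    int lifetime_class = 0;    // set by the first allocation, which opens the page
    std::array<Block, kBlocksPerHugePage> blocks;

    // Open -> Active once every block has been filled at least once.
    void maybe_activate() {
        if (state != HugePageState::Open) return;
        for (const Block& b : blocks) {
            if (!b.ever_filled) return;
        }
        state = HugePageState::Active;
    }

    // A huge page whose blocks are all free is returned to the OS immediately.
    bool all_blocks_free() const {
        for (const Block& b : blocks) {
            if (b.live) return false;
        }
        return true;
    }
};
```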

Hugepage-Aware Memory Allocator — TEMERAIRE

Beyond Malloc Efficiency to Fleet Efficiency

This work demonstrates that the overall efficiency of an application can be greatly enhanced by employing a memory allocator that is itself not necessarily the most efficient.

Instead of reducing the cycles spent in the allocator code, the allocator is designed so that there are fewer stall cycles in the main application code.

The goal here is to improve the application’s hugepage coverage and reduce fragmentation overhead, thereby reducing CPU overheads.

Co-ordinating Hugepages

For many applications, the TLB covers only a tiny portion of the memory footprint, resulting in TLB misses. An entire aligned hugepage (typically 2 MiB) occupies just one TLB entry, so a single entry covers as much memory as 512 ordinary 4 KiB pages. Hugepage coverage therefore reduces TLB misses by increasing the effective capacity of the TLB, and it also reduces the stall time of a miss, as there is one fewer level to traverse in the page table.

However, these TLB-efficient hugepages are prone to a high level of fragmentation, because allocations must be aligned to hugepage boundaries.

Overview of TCMALLOC

TCMALLOC is a memory allocator that is commonly used in large-scale applications. Objects are classified according to their size, and memory is divided into (multi-page) spans that are aligned to the TCMALLOC page size. Object sizes are chosen and metadata is organised so as to reduce space overhead and fragmentation.

A simple pageheap allocates the spans, keeps track of all unused pages, uses a best-fit algorithm for allocation, and releases memory to the OS.

TCMALLOC structure

TCMALLOC’s pageheap has a simple interface for managing memory: allocating a span, returning a span to the allocator, and releasing memory to the OS.

For allocation: TCMALLOC first tries to serve allocations from local per-thread caches, which store a list of free objects for each sizeclass. When the local cache has no objects of the appropriate sizeclass, the request is routed to a single central cache for that sizeclass, which has two components: a transfer cache and a central free list containing every span assigned to that sizeclass.
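
A rough sketch of that lookup cascade (per-thread cache, then transfer cache, then central free list) is shown below; the class names and container choices are simplified stand-ins for TCMALLOC’s internals, and the real central free list carves objects out of spans obtained from the pageheap.

```cpp
#include <cstddef>
#include <deque>
#include <mutex>
#include <vector>

// Simplified stand-ins for TCMALLOC's per-sizeclass structures.
struct CentralFreeList {            // holds objects from every span of this sizeclass
    std::mutex mu;
    std::deque<void*> objects;
};

struct TransferCache {              // batches objects between threads and the central list
    std::mutex mu;
    std::deque<void*> objects;
};

struct ThreadCache {                // per-thread free list, no locking needed
    std::vector<void*> objects;
};

void* allocate(ThreadCache& local, TransferCache& transfer, CentralFreeList& central) {
    // 1. Fast path: unsynchronised per-thread cache.
    if (!local.objects.empty()) {
        void* obj = local.objects.back();
        local.objects.pop_back();
        return obj;
    }
    // 2. Transfer cache for this sizeclass.
    {
        std::lock_guard<std::mutex> lock(transfer.mu);
        if (!transfer.objects.empty()) {
            void* obj = transfer.objects.front();
            transfer.objects.pop_front();
            return obj;
        }
    }
    // 3. Central free list; in real TCMALLOC this may request new spans
    //    from the pageheap (TEMERAIRE in this work).
    std::lock_guard<std::mutex> lock(central.mu);
    if (central.objects.empty()) return nullptr;  // would fall through to the pageheap
    void* obj = central.objects.front();
    central.objects.pop_front();
    return obj;
}
```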

For release: Ideally, TCMALLOC would return all memory that user code will not need soon, but memory demand varies unpredictably, which makes this challenging. Returning all memory eagerly is also undesirable, since it incurs extra syscalls and page faults.

TEMERAIRE’s approach

TEMERAIRE is designed to pack allocations densely onto highly-used hugepages and simultaneously form entirely unused hugepages that can be returned to the OS. A bad placement means paying the time cost of a broken-up hugepage for a long time, so it is worth slowing down the allocator if doing so lets it make better decisions.

Emphasizes smart work over speed.

Based on the size of the requested allocation, the work is delegated to several subcomponents.

TEMERAIRE’s components

Overall Algorithm

Our goal is to minimize the slack we generate (the gap between an allocation’s requested size and the hugepage-aligned space reserved for it; for example, a 2.2 MiB request backed by two 2 MiB hugepages leaves 1.8 MiB of slack) and, when we do generate slack, to reuse it for other allocations.

Components of TEMERAIRE

  • HugeAllocator: It provides other components with unbacked memory that they can back and pass on.
  • HugeCache: Maintains a cache of empty hugepages and handles allocations that are large enough that slack is immaterial.
    It tracks periodicity in demand over a 2-second sliding window to decide whether returned hugepages should be released to the OS.
  • HugeFiller: Keeps a list of partially filled hugepages so that they can be densely filled. Small requests (≤ 1 MiB) are always served here.

Nearly-empty hugepages (important for fullness) and hugepages with long free ranges (important for defragmentation) are precious, so TEMERAIRE avoids placing new allocations on them when possible.

  • Statistics tracked per hugepage (a selection sketch based on these follows below):
  1. The longest free range (L), which promotes defragmentation
  2. The total number of allocations (A), which promotes fullness
  3. The total number of used pages (U)
  • This approach outperforms a global best-fit algorithm, which places a request in the single gap, across all hugepages, that is closest in size to the request.

Figure: Various bin-packing strategies. Efficiency: best-fit < fullness < longest free range (LFR).
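
Here is a hedged C++ sketch of one plausible selection rule built on the statistics above: serve a request from the hugepage whose longest free range is the smallest one that still fits (preserving long free ranges), breaking ties in favour of more allocations (avoiding nearly-empty pages). This mirrors the stated priorities but is not TEMERAIRE’s exact implementation.

```cpp
#include <cstddef>
#include <vector>

// Per-hugepage statistics tracked by the filler, as listed above.
struct HugePageStats {
    size_t longest_free_range;  // L: longest run of free small pages
    size_t allocations;         // A: number of live allocations on this hugepage
    size_t used_pages;          // U: total used small pages
};

// Pick an index into `pages` for a request of `need` small pages, or -1 if none fits.
// Preference: smallest sufficient longest-free-range, then the most allocations.
int pick_hugepage(const std::vector<HugePageStats>& pages, size_t need) {
    int best = -1;
    for (int i = 0; i < static_cast<int>(pages.size()); ++i) {
        if (pages[i].longest_free_range < need) continue;   // cannot fit the request
        if (best == -1 ||
            pages[i].longest_free_range < pages[best].longest_free_range ||
            (pages[i].longest_free_range == pages[best].longest_free_range &&
             pages[i].allocations > pages[best].allocations)) {
            best = i;
        }
    }
    return best;
}
```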

HugeRegion:

  • Implemented for certain allocation patterns where bin-packing allocations along hugepage boundaries would be inefficient; in these cases allocations are made across hugepage boundaries to minimize that overhead.
  • A large fixed-size region (currently 1 GiB) is tracked at small-page granularity. This prevents this slack-heavy, uncommon allocation pattern from bloating memory use.

Memory Release

  • TEMERAIRE normally just frees ranges from the HugeCache and possibly reduces its limit. If the HugeCache is unable to release enough, the HugeFiller releases the free (small) pages on the least-filled hugepage. This is the last resort for reducing the memory footprint.
  • The HugeFiller separates the subreleased hugepages, and no allocations are made from them unless no other viable hugepage is available.

Evaluation of TEMERAIRE

Application-level performance metrics are obtained to measure workload productivity.

Case studies: TEMERAIRE was first used on applications that use roughly 10% of cycles and 15% of RAM. With periodic release, an average CPU improvement of 7.7% and an average RAM reduction of 2.4% was observed.

Fleet experiment: 1% of the machines in the WSC were selected, and TEMERAIRE was enabled for all applications running on them.

Figure: Percentage of heap memory backed by hugepages.

Full rollout trajectories: Gradually expanded to 100% of our workloads. Cycles stalled on TLB misses were reduced from 21.6% to 20.3% (a 6% reduction) and pageheap overhead was reduced from 14.3% to 10.6% (a 26% reduction).

The observed gains come from accelerating user code, not from malloc itself; indeed, malloc cycles increased from 2.7% to 3.5%.

Conclusion

The four papers discuss various memory management ideas and techniques for datacenter workloads to deal effectively with fragmentation, memory overhead, kernel consistency, and CPU overhead.

While some focus on software aspects, others act directly at the hardware level.

We have seen various memory management ideas: some leverage hardware features already provided by modern processors, like hugepages in TCMALLOC and TEMERAIRE, while others need additional hardware support, as in RMM. We also observed how existing architecture can be combined with new software techniques, by introducing a cost-benefit model as in CBMM, or by using ML models to predict allocation lifetimes as in LLAMA.

References

[1] Mark Mansi, Bijan Tabatabai, and Michael M. Swift, University of Wisconsin - Madison. CBMM: Financial Advice for Kernel Memory Managers. This paper is included in the Proceedings of the 2022 USENIX Annual Technical Conference. July 11–13, 2022 • Carlsbad, CA, USA 978-1-939133-29-8.

https://www.usenix.org/conference/atc22/presentation/mansi

[2] Vasileios Karakostas, Jayneel Gandhi, Furkan Ayar, Adrián Cristal, Mark D. Hill, Kathryn S. McKinley, Mario Nemirovsky, Michael M. Swift, and Osman Ünsal. 2015. Redundant Memory Mappings for Fast Access to Large Memories. In Proc. of the 42nd Annual International Symposium on Computer Architecture (ISCA ’15). 66–78.

https://dl.acm.org/doi/10.1145/2749469.2749471

[3] Martin Maas, David G. Andersen, Michael Isard, Mohammad Mahdi Javanmard, Kathryn S. McKinley, and Colin Raffel. 2020. Learning-based Memory Allocation for C++ Server Workloads. In Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '20). Association for Computing Machinery, New York, NY, USA, 541–556.

https://doi.org/10.1145/3373376.3378525

[4] A.H. Hunter, Jane Street Capital; Chris Kennelly, Paul Turner, Darryl Gove, Tipp Moseley, and Parthasarathy Ranganathan, Google. Beyond malloc efficiency to fleet efficiency: a hugepage-aware memory allocator. This paper is included in the Proceedings of the 15th USENIX Symposium on Operating Systems Design and Implementation. July 14–16, 2021 978-1-939133-22-9.

https://www.usenix.org/conference/osdi21/presentation/hunter

This blog is made as part of the first reading assignment for the course Advances in Operating Systems Design.
