public inbox for libc-alpha@sourceware.org
From: Cupertino Miranda <cupertino.miranda@oracle.com>
To: Florian Weimer <fweimer@redhat.com>
Cc: Cupertino Miranda via Libc-alpha <libc-alpha@sourceware.org>,
	"Jose E. Marchesi" <jose.marchesi@oracle.com>,
	Elena Zannoni <elena.zannoni@oracle.com>,
	Cupertino Miranda <cupertinomiranda@gmail.com>
Subject: Re: [RFC] Stack allocation, hugepages and RSS implications
Date: Thu, 09 Mar 2023 14:29:56 +0000	[thread overview]
Message-ID: <875yba3sm3.fsf@oracle.com> (raw)
In-Reply-To: <87bkl2b3f1.fsf@oldenburg.str.redhat.com>

[-- Attachment #1: Type: text/plain, Size: 4737 bytes --]


Hi Florian,

>> Hi everyone,
>>
>> For performance reasons, one of our in-house applications requires
>> enabling the TRANSPARENT_HUGEPAGE=always option in the Linux kernel,
>> which makes the kernel back all sufficiently large and aligned memory
>> allocations with hugepages.  I believe the reason behind this decision
>> is to have more control over data location.
>>
>> For stack allocation, hugepages seem to make the resident set size
>> (RSS) increase significantly, and without any apparent benefit: the
>> huge page is split into small pages even before the glibc stack
>> allocation code returns.
>>
>> As an example, this is what happens in the case of a pthread_create
>> with a 2MB stack size:
>>  1. mmap request for the 2MB allocation with PROT_NONE;
>>       a huge page is "registered" by the kernel
>>  2. the thread descriptor is written at the end of the stack.
>>       this triggers a page fault in the kernel, which performs the actual
>>       memory allocation of the 2MB.
>>  3. an mprotect changes protection on the guard (one of the small pages of the
>>     allocated space):
>>       at this point the kernel needs to break the 2MB page into many small pages
>>       in order to change the protection on that memory region.
>>       This not only eliminates any benefit of having huge pages for stack
>>       allocation, but also causes RSS to increase by 2MB even though nothing
>>       was written to most of the small pages.
>>
>> As an exercise I added __madvise(..., MADV_NOHUGEPAGE) right after the
>> __mmap in nptl/allocatestack.c. As expected, RSS was significantly
>> reduced for the application.
>
> Interesting.  I did not expect to get hugepages right out of mmap.  I
> would have expected subsequent coalescing by khugepaged, taking actual
> stack usage into account.  But over-allocating memory might be
> beneficial, see below.
It is probably not getting the hugepages at mmap time. Still, RSS
grows as if it did.
>
> (Something must be happening between step 1 & 2 to make the writes
> possible.)
Totally right.
I could have explained it better. There is a call to setup_stack_prot
that, I believe, changes the protection of the single small page that
holds the stack-related values.

The write happens right after, when the stack-related values are
written.
This is the critical point where RSS grows by the hugepage size.

>
>> In any case, I wonder if there is an actual use case where a hugepage
>> would survive glibc stack allocation and bring an actual benefit.
>
> It can reduce TLB misses.  The first-level TLB might only have 64
> entries for 4K pages, for example.  If the working set on the stack
> (including the TCB) needs more than a couple of pages, it might be
> beneficial to use a 2M page and use just one TLB entry.
Indeed. It may only fail to make sense when (guardsize > 0), as is the
case in the example.
I think that in this case you can never get a hugepage, since the guard
pages will be write-protected and will have different protection from
the rest of the stack pages.
At least if you don't plan to allocate more than 2 hugepages.

I believe allocating 2M+4k was considered but it made it hard to control
data location.

> In your case, if your stacks are quite small, maybe you can just
> allocate slightly less than 2 MiB?
>
> The other question is whether the reported RSS is real, or if the kernel
> will recover zero stack pages on memory pressure.
It's a good point. I have no idea whether the kernel is able to recover
the zero stack pages in this particular case. Is there any way to
trigger such a recovery?

In our example (attached), there is a significant difference in
reported RSS when we madvise the kernel.
Reported RSS is collected from /proc/self/statm.

# LD_LIBRARY_PATH=${HOME}/glibc_example/lib ./tststackalloc 1
Page size: 4 kB, 2 MB huge pages
Will attempt to align allocations to make stacks eligible for huge pages
pid: 2458323 (/proc/2458323/smaps)
Creating 128 threads...
RSS: 65888 pages (269877248 bytes = 257 MB)

After the madvise is added right before the writes to the stack-related
values (patch below):

# LD_LIBRARY_PATH=${HOME}/glibc_example/lib ./tststackalloc 1
Page size: 4 kB, 2 MB huge pages
Will attempt to align allocations to make stacks eligible for huge pages
pid: 2463199 (/proc/2463199/smaps)
Creating 128 threads...
RSS: 448 pages (1835008 bytes = 1 MB)

Thanks,
Cupertino

>
> Thanks,
> Florian

@@ -397,6 +397,7 @@ allocate_stack (const struct pthread_attr *attr, struct pthread **pdp,
                }
            }

+         __madvise(mem, size, MADV_NOHUGEPAGE);
          /* Remember the stack-related values.  */
          pd->stackblock = mem;
          pd->stackblock_size = size;


[-- Attachment #2: tststackalloc.c --]
[-- Type: text/x-csrc, Size: 4600 bytes --]

// Compile & run:
//    gcc -Wall -g -o tststackalloc tststackalloc.c -lpthread
//    ./tststackalloc 1     # Attempt to use huge pages for stacks -> RSS bloat
//    ./tststackalloc 0     # Do not attempt to use huge pages -> No RSS bloat

#define _GNU_SOURCE
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <inttypes.h>
#include <sys/mman.h>
#include <fcntl.h>

// Number of threads to create
#define NOOF_THREADS (128)

// Size of a small page (hard-coded)
#define SMALL_PAGE_SIZE (4*1024)

// Size of a huge page (hard-coded)
#define HUGE_PAGE_SIZE (2*1024*1024)

// Total size of the thread stack, including the guard page(s)
#define STACK_SIZE_TOTAL (HUGE_PAGE_SIZE)

// Size of the guard page(s)
#define GUARD_SIZE (SMALL_PAGE_SIZE)

//#define PRINT_STACK_RANGES
//#define PRINT_PROC_SMAPS

// When enabled (set to non-zero), tries to align thread stacks on
// huge page boundaries, making them eligible for huge pages
static int huge_page_align_stacks;

static volatile int exit_thread = 0;

#if defined(PRINT_STACK_RANGES)
static void print_stack_range(void) {
  pthread_attr_t attr;
  void* bottom;
  size_t size;
  int err;

  err = pthread_getattr_np(pthread_self(), &attr);
  if (err != 0) {
    fprintf(stderr, "Error looking up attr\n");
    exit(1);
  }

  err = pthread_attr_getstack(&attr, &bottom, &size);
  if (err != 0) {
    fprintf(stderr, "Cannot locate current stack attributes!\n");
    exit(1);
  }

  pthread_attr_destroy(&attr);

  fprintf(stderr, "Stack: %p-%p (0x%zx/%zd)\n", bottom, bottom + size, size, size);
}
#endif

static void* start(void* arg) {
#if defined(PRINT_STACK_RANGES)
  print_stack_range();
#endif

  while(!exit_thread) {
    sleep(1);
  }
  return NULL;
}

#if defined(PRINT_PROC_SMAPS)
static void print_proc_file(const char* file) {
  char path[128];
  snprintf(path, sizeof(path), "/proc/self/%s", file);
  int fd = open(path, O_RDONLY);
  if (fd < 0) {
    perror(path);
    return;
  }
  char buf[4096];
  int x;
  while ((x = read(fd, buf, sizeof(buf))) > 0) {
    write(1, buf, x);
  }
  close(fd);
}
#endif

static size_t get_rss(void) {
  FILE* stat = fopen("/proc/self/statm", "r");
  long rss = 0;
  if (stat != NULL) {
    if (fscanf(stat, "%*d %ld", &rss) != 1)
      rss = 0;
    fclose(stat);
  }
  return rss;
}

uintptr_t align_down(uintptr_t value, uintptr_t alignment) {
  return value & ~(alignment - 1);
}

// Do a series of small, single-page mmap calls to attempt to set
// everything up so that the next mmap call (glibc allocating the
// stack) returns a 2MB-aligned range. The kernel "expands" vmas from
// higher to lower addresses (subsequent calls return ranges starting
// at lower addresses), so this function keeps calling mmap until a
// huge page aligned address is returned. The next range (the stack)
// will then end on that same address.
static void align_next_on(uintptr_t alignment) {
  uintptr_t p;
  do {
    p = (uintptr_t)mmap(NULL, SMALL_PAGE_SIZE, PROT_NONE, MAP_ANONYMOUS|MAP_PRIVATE|MAP_NORESERVE, -1, 0);
    if (p == (uintptr_t)MAP_FAILED) {
      perror("mmap");
      exit(1);
    }
  } while (p != align_down(p, alignment));
}

int main(int argc, char* argv[]) {
  pthread_t t[NOOF_THREADS];
  pthread_attr_t attr;
  int i;

  if (argc != 2) {
    printf("Usage: %s <huge page stacks>\n", argv[0]);
    printf("    huge page stacks = 1 - attempt to use huge pages for stacks\n");
    exit(1);
  }
  huge_page_align_stacks = atoi(argv[1]);

  void* dummy = malloc(1024);
  free(dummy);

  fprintf(stderr, "Page size: %d kB, %d MB huge pages\n", SMALL_PAGE_SIZE / 1024, HUGE_PAGE_SIZE / (1024 * 1024));
  if (huge_page_align_stacks) {
    fprintf(stderr, "Will attempt to align allocations to make stacks eligible for huge pages\n");
  }
  pid_t pid = getpid();
  fprintf(stderr, "pid: %d (/proc/%d/smaps)\n", pid, pid);

  size_t guard_size = GUARD_SIZE;
  size_t stack_size = STACK_SIZE_TOTAL;
  pthread_attr_init(&attr);
  pthread_attr_setstacksize(&attr, stack_size);
  pthread_attr_setguardsize(&attr, guard_size);

  fprintf(stderr, "Creating %d threads...\n", NOOF_THREADS);
  for (i = 0; i < NOOF_THREADS; i++) {
    if (huge_page_align_stacks) {
      // align (next) allocation on huge page boundary
      align_next_on(HUGE_PAGE_SIZE);
    }
    pthread_create(&t[i], &attr, start, NULL);
  }
  sleep(1);

#if defined(PRINT_PROC_SMAPS)
  print_proc_file("smaps");
#endif

  size_t rss = get_rss();
  fprintf(stderr, "RSS: %zd pages (%zd bytes = %zd MB)\n", rss, rss * SMALL_PAGE_SIZE, rss * SMALL_PAGE_SIZE / 1024 / 1024);

  fprintf(stderr, "Press enter to exit...\n");
  getchar();

  exit_thread = 1;
  for (i = 0; i < NOOF_THREADS; i++) {
    pthread_join(t[i], NULL);
  }
  return 0;
}

Thread overview: 12+ messages
     [not found] <87pm9j4azf.fsf@oracle.com>
2023-03-08 14:17 ` Cupertino Miranda
2023-03-08 14:53   ` Cristian Rodríguez
2023-03-08 15:12     ` Cupertino Miranda
2023-03-08 17:19   ` Adhemerval Zanella Netto
2023-03-09  9:38     ` Cupertino Miranda
2023-03-09 17:11       ` Adhemerval Zanella Netto
2023-03-09 18:11         ` Cupertino Miranda
2023-03-09 18:15           ` Adhemerval Zanella Netto
2023-03-09 19:01             ` Cupertino Miranda
2023-03-09 19:11               ` Adhemerval Zanella Netto
2023-03-09 10:54   ` Florian Weimer
2023-03-09 14:29     ` Cupertino Miranda [this message]
