public inbox for libc-alpha@sourceware.org
From: Cupertino Miranda <cupertino.miranda@oracle.com>
To: Adhemerval Zanella Netto <adhemerval.zanella@linaro.org>
Cc: Wilco Dijkstra <Wilco.Dijkstra@arm.com>,
	'GNU C Library' <libc-alpha@sourceware.org>
Subject: Re: [PATCH v5 1/1] Created tunable to force small pages on stack allocation.
Date: Fri, 14 Apr 2023 12:28:53 +0100	[thread overview]
Message-ID: <878reud7nu.fsf@oracle.com> (raw)
In-Reply-To: <7fd2f7b1-73d3-1bc3-cabf-c67d1930cefc@linaro.org>

[-- Attachment #1: Type: text/plain, Size: 984 bytes --]


Hi Adhemerval,

If my memory does not betray me, I believe the reason why huge pages
would not be relevant was that the mprotect of the guard region would
enforce a different protection on part of the huge page.

In that sense you could test whether the guard region is part of the
first huge page.  If it is, the page would be broken into smaller parts,
and we should advise for smaller pages.

While debugging I also realized that you were checking the alignment of
"mem"; however, for the common case of _STACK_GROWS_DOWN, the relevant
alignment to check is the start of the stack (the end of the allocated
memory).
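
Roughly, the two checks I have in mind look like the sketch below (just an
illustration, not the attached patch itself; it assumes _STACK_GROWS_DOWN,
thpsize and guardpos are obtained the same way as in the patch, and the
function name is made up):

static int
advise_thp_sketch (void *mem, size_t size, char *guardpos,
                   unsigned long int thpsize)
{
  /* The stack starts at the high end of the allocation, so that is the
     address whose THP alignment matters, not "mem" itself.  */
  char *stack_start_addr = (char *) mem + size;
  if (((uintptr_t) stack_start_addr % thpsize) != 0)
    return 0;  /* THP would not back this stack anyway.  */

  /* If the guard region starts in the same huge page as the beginning of
     the allocation, the mprotect of the guard would split that page, so
     ask the kernel for small pages up front.  */
  if (((uintptr_t) mem / thpsize) != ((uintptr_t) guardpos / thpsize))
    return 0;

  return __madvise (mem, size, MADV_NOHUGEPAGE);
}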

I have included your patch with some changes which I think give a
reasonable estimate.
In any case, I would like to keep the tunable, as I think it is harmless
and makes sense.

BTW, in a previous email you mentioned that you would install the
tunable patch.  Maybe you forgot.
I am waiting for it to do some internal backporting. ;-)
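
For reference, on our side the intended usage is simply something like the
following (assuming the tunable keeps the name from the v5 patch,
glibc.pthread.stack_hugetlb, where 0 forces small pages):

  GLIBC_TUNABLES=glibc.pthread.stack_hugetlb=0 ./application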

Regards,
Cupertino


[-- Attachment #2: adhemerval_1.patch --]
[-- Type: text/x-diff, Size: 5579 bytes --]

diff --git a/nptl/allocatestack.c b/nptl/allocatestack.c
index c7adbccd6f..ad718cfb4a 100644
--- a/nptl/allocatestack.c
+++ b/nptl/allocatestack.c
@@ -33,6 +33,7 @@
 #include <nptl-stack.h>
 #include <libc-lock.h>
 #include <tls-internal.h>
+#include <malloc-hugepages.h>
 
 /* Default alignment of stack.  */
 #ifndef STACK_ALIGN
@@ -206,6 +207,38 @@ advise_stack_range (void *mem, size_t size, uintptr_t pd, size_t guardsize)
 #endif
 }
 
+/* If Transparent Huge Pages (THP) are set to 'always' and the resulting
+   address and stack size are multiples of the THP size, the kernel may use
+   THP for the thread stack.  However, if the guard page size is not a
+   multiple of the THP size, once it is mprotect'ed the allocated range can
+   no longer be served with THP and the kernel falls back to the default
+   page size.
+
+   Moreover, the kernel might not keep track of which offsets within the THP
+   have been touched and need to stay resident in memory.  It would then
+   keep all the small pages, using much more memory than required.  In this
+   scenario it is better to madvise the kernel not to use huge pages.  */
+static __always_inline int
+advise_thp (void *mem, size_t size, char *guardpos)
+{
+  enum malloc_thp_mode_t thpmode = __malloc_thp_mode ();
+  if (thpmode != malloc_thp_mode_always)
+    return 0;
+
+  unsigned long int thpsize = __malloc_default_thp_pagesize ();
+#if _STACK_GROWS_DOWN
+  char *stack_start_addr = mem + size;
+#elif _STACK_GROWS_UP
+  char *stack_start_addr = mem;
+#endif
+
+  if ((((uintptr_t) stack_start_addr) % thpsize) != 0
+      || (((uintptr_t) mem / thpsize) != ((uintptr_t) guardpos / thpsize)))
+    return 0;
+
+  return __madvise (mem, size, MADV_NOHUGEPAGE);
+}
+
 /* Returns a usable stack for a new thread either by allocating a
    new stack or reusing a cached stack of sufficient size.
    ATTR must be non-NULL and point to a valid pthread_attr.
@@ -385,6 +418,12 @@ allocate_stack (const struct pthread_attr *attr, struct pthread **pdp,
 				   - TLS_PRE_TCB_SIZE);
 #endif
 
+	  char *guardpos = guard_position (mem, size, guardsize,
+					   pd, pagesize_m1);
+	  int r = advise_thp (mem, size, guardpos);
+	  if (r != 0)
+	    return r;
+
 	  /* Now mprotect the required region excluding the guard area.  */
 	  if (__glibc_likely (guardsize > 0))
 	    {
diff --git a/sysdeps/generic/malloc-hugepages.h b/sysdeps/generic/malloc-hugepages.h
index d68b85630c..21d4844bc4 100644
--- a/sysdeps/generic/malloc-hugepages.h
+++ b/sysdeps/generic/malloc-hugepages.h
@@ -26,6 +26,7 @@ unsigned long int __malloc_default_thp_pagesize (void) attribute_hidden;
 
 enum malloc_thp_mode_t
 {
+  malloc_thp_mode_unknown,
   malloc_thp_mode_always,
   malloc_thp_mode_madvise,
   malloc_thp_mode_never,
diff --git a/sysdeps/unix/sysv/linux/malloc-hugepages.c b/sysdeps/unix/sysv/linux/malloc-hugepages.c
index 740027ebfb..15b862b0bf 100644
--- a/sysdeps/unix/sysv/linux/malloc-hugepages.c
+++ b/sysdeps/unix/sysv/linux/malloc-hugepages.c
@@ -22,19 +22,33 @@
 #include <not-cancel.h>
 #include <sys/mman.h>
 
+/* __malloc_default_thp_pagesize is called only in single-thread mode,
+   either in malloc initialization or during pthread creation.  */
+static unsigned long int thp_pagesize = -1;
+
 unsigned long int
 __malloc_default_thp_pagesize (void)
 {
+  unsigned long int size = atomic_load_relaxed (&thp_pagesize);
+  if (size != -1)
+    return size;
+
   int fd = __open64_nocancel (
     "/sys/kernel/mm/transparent_hugepage/hpage_pmd_size", O_RDONLY);
   if (fd == -1)
-    return 0;
+    {
+      atomic_store_relaxed (&thp_pagesize, 0);
+      return 0;
+    }
 
   char str[INT_BUFSIZE_BOUND (unsigned long int)];
   ssize_t s = __read_nocancel (fd, str, sizeof (str));
   __close_nocancel (fd);
   if (s < 0)
-    return 0;
+    {
+      atomic_store_relaxed (&thp_pagesize, 0);
+      return 0;
+    }
 
   unsigned long int r = 0;
   for (ssize_t i = 0; i < s; i++)
@@ -44,16 +58,28 @@ __malloc_default_thp_pagesize (void)
       r *= 10;
       r += str[i] - '0';
     }
+  atomic_store_relaxed (&thp_pagesize, r);
   return r;
 }
 
+/* __malloc_thp_mode is called only in single-thread mode, either in
+   malloc initialization or during pthread creation.  */
+static enum malloc_thp_mode_t thp_mode = malloc_thp_mode_unknown;
+
 enum malloc_thp_mode_t
 __malloc_thp_mode (void)
 {
+  enum malloc_thp_mode_t mode = atomic_load_relaxed (&thp_mode);
+  if (mode != malloc_thp_mode_unknown)
+    return mode;
+
   int fd = __open64_nocancel ("/sys/kernel/mm/transparent_hugepage/enabled",
 			      O_RDONLY);
   if (fd == -1)
-    return malloc_thp_mode_not_supported;
+    {
+      atomic_store_relaxed (&thp_mode, malloc_thp_mode_not_supported);
+      return malloc_thp_mode_not_supported;
+    }
 
   static const char mode_always[]  = "[always] madvise never\n";
   static const char mode_madvise[] = "always [madvise] never\n";
@@ -67,13 +93,19 @@ __malloc_thp_mode (void)
   if (s == sizeof (mode_always) - 1)
     {
       if (strcmp (str, mode_always) == 0)
-	return malloc_thp_mode_always;
+	mode = malloc_thp_mode_always;
       else if (strcmp (str, mode_madvise) == 0)
-	return malloc_thp_mode_madvise;
+	mode = malloc_thp_mode_madvise;
       else if (strcmp (str, mode_never) == 0)
-	return malloc_thp_mode_never;
+	mode = malloc_thp_mode_never;
+      else
+	mode = malloc_thp_mode_not_supported;
     }
-  return malloc_thp_mode_not_supported;
+  else
+    mode = malloc_thp_mode_not_supported;
+
+  atomic_store_relaxed (&thp_mode, mode);
+  return mode;
 }
 
 static size_t

[-- Attachment #3: Type: text/plain, Size: 2696 bytes --]


Adhemerval Zanella Netto writes:

> On 13/04/23 13:23, Cupertino Miranda wrote:
>>
>> Hi Wilco,
>>
>> Exactly my remark on the patch. ;)
>>
>> I think the tunable is beneficial when we want to allocate huge pages for
>> malloc, etc., but still be able to force small pages for the stack.
>>
>> Imagine a scenario where you create lots of threads. Most threads
>> barely use any stack; however, there is one that somehow requires a lot
>> of it to do some crazy recursion. :)
>>
>> Most likely the heuristic would detect that huge pages would be useful
>> based on the stack size requirement, but it would never predict that they
>> only bring a benefit to 1% of the threads created.
>
> The problem is not finding when huge pages are beneficial, but rather
> when using them will incur falling back to default pages.  And after
> re-reading the THP kernel docs and some experimentation, I am not sure it
> is really possible to come up with good heuristics to do so (not without
> poking at khugepaged stats).
>
> For instance, if the guard size is 0, THP will still back the thread stack.
> However, if you force stack alignment by issuing multiple mmaps, the
> khugepaged won't have an available VMA and thus won't use THP (using your
> example of forcing the mmap alignment in thread creation).
>
> I think my proposal will end up with a very limited and complicated
> heuristic (especially because khugepaged has various tunables itself), so
> I agree that the tunable is a better strategy.
>
>>
>> Regards,
>> Cupertino
>>
>>
>>
>> Wilco Dijkstra writes:
>>
>>> Hi Adhemerval,
>>>
>>> I agree doing this automatically sounds like a better solution.
>>> However:
>>>
>>> +static __always_inline int
>>> +advise_thp (void *mem, size_t size, size_t guardsize)
>>> +{
>>> +  enum malloc_thp_mode_t thpmode = __malloc_thp_mode ();
>>> +  if (thpmode != malloc_thp_mode_always)
>>> +    return 0;
>>> +
>>> +  unsigned long int thpsize = __malloc_default_thp_pagesize ();
>>> +  if ((uintptr_t) mem % thpsize != 0
>>> +      || size % thpsize != 0
>>> +      || (size - guardsize) % thpsize != 0)
>>> +    return 0;
>>>
>>> Isn't the last part always true currently, given that the guard page size is
>>> based on the standard page size? IIRC the issue was that the mmap succeeds but
>>> the guard page is taken from the original mmap, which then causes the decomposition.
>>>
>>> So you'd need something like:
>>>
>>> || guardsize % thpsize == 0)
>>>
>>> I.e. we return without the madvise if the size or alignment is wrong for a huge
>>> page, or if they are correct and the guardsize is a multiple of a huge page (in
>>> which case it shouldn't decompose).
>>>
>>> +  return __madvise (mem, size, MADV_NOHUGEPAGE);
>>> +}
>>>
>>> Cheers,
>>> Wilco
