public inbox for libc-alpha@sourceware.org
 help / color / mirror / Atom feed
* [PATCH v5 0/1] *** Created tunable to force small pages on stack allocation.
@ 2023-03-28 15:22 Cupertino Miranda
  2023-03-28 15:22 ` [PATCH v5 1/1] " Cupertino Miranda
  2023-04-10  8:59 ` [PING] Re: [PATCH v5 0/1] *** " Cupertino Miranda
  0 siblings, 2 replies; 22+ messages in thread
From: Cupertino Miranda @ 2023-03-28 15:22 UTC (permalink / raw)
  To: libc-alpha; +Cc: jose.marchesi, elena.zannoni, Cupertino Miranda

Hi,

Fixes in tunables.texi, as suggested by Adhemerval.
Thanks for the improvements.

Regards,
Cupertino

Cupertino Miranda (1):
  Created tunable to force small pages on stack allocation.

 manual/tunables.texi          | 15 +++++++++++++++
 nptl/allocatestack.c          |  6 ++++++
 nptl/nptl-stack.c             |  1 +
 nptl/nptl-stack.h             |  3 +++
 nptl/pthread_mutex_conf.c     |  8 ++++++++
 sysdeps/nptl/dl-tunables.list |  6 ++++++
 6 files changed, 39 insertions(+)

-- 
2.38.1


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH v5 1/1] Created tunable to force small pages on stack allocation.
  2023-03-28 15:22 [PATCH v5 0/1] *** Created tunable to force small pages on stack allocation Cupertino Miranda
@ 2023-03-28 15:22 ` Cupertino Miranda
  2023-04-11 19:56   ` Adhemerval Zanella Netto
  2023-04-10  8:59 ` [PING] Re: [PATCH v5 0/1] *** " Cupertino Miranda
  1 sibling, 1 reply; 22+ messages in thread
From: Cupertino Miranda @ 2023-03-28 15:22 UTC (permalink / raw)
  To: libc-alpha; +Cc: jose.marchesi, elena.zannoni, Cupertino Miranda

Created tunable glibc.pthread.stack_hugetlb to control whether hugepages
can be used for stack allocation.
If THP is enabled and glibc.pthread.stack_hugetlb is set to 0, glibc
will madvise the kernel not to use hugepages for stack allocations.
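
For example, to disable huge pages for thread stacks at run time (a usage
sketch; the program name is only illustrative, the invocation mirrors the
test runs later in this thread):

  GLIBC_TUNABLES=glibc.pthread.stack_hugetlb=0 ./tststackalloc 1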

Changes from v1:
 - Removed the __malloc_thp_mode calls to check if hugetlb is
   enabled.

Changes from v2:
 - Added an entry in manual/tunables.texi.
 - Fixed the tunable default to match the description.
 - Code style corrections.

Changes from v3:
 - Improved tunables.texi.

Changes from v4:
 - Improved the text in tunables.texi as suggested by Adhemerval.
---
 manual/tunables.texi          | 15 +++++++++++++++
 nptl/allocatestack.c          |  6 ++++++
 nptl/nptl-stack.c             |  1 +
 nptl/nptl-stack.h             |  3 +++
 nptl/pthread_mutex_conf.c     |  8 ++++++++
 sysdeps/nptl/dl-tunables.list |  6 ++++++
 6 files changed, 39 insertions(+)

diff --git a/manual/tunables.texi b/manual/tunables.texi
index 70dd2264c5..130f94b2bc 100644
--- a/manual/tunables.texi
+++ b/manual/tunables.texi
@@ -459,6 +459,21 @@ registration on behalf of the application.
 Restartable sequences are a Linux-specific extension.
 @end deftp
 
+@deftp Tunable glibc.pthread.stack_hugetlb
+This tunable controls whether to use Huge Pages in the stacks created by
+@code{pthread_create}.  This tunable only affects the stacks created by
+@theglibc{}; it has no effect on stacks assigned with
+@code{pthread_attr_setstack}.
+
+The default is @samp{1}, in which case the system default value is used.
+Setting its value to @code{0} enables the use of @code{madvise} with
+@code{MADV_NOHUGEPAGE} after stack creation with @code{mmap}.
+
+This is a memory utilization optimization, since the internal glibc setup of
+either the thread descriptor or the guard page might force the kernel to move
+the thread stack originally backed by Huge Pages to default pages.
+@end deftp
+
 @node Hardware Capability Tunables
 @section Hardware Capability Tunables
 @cindex hardware capability tunables
diff --git a/nptl/allocatestack.c b/nptl/allocatestack.c
index c7adbccd6f..f9d8cdfd08 100644
--- a/nptl/allocatestack.c
+++ b/nptl/allocatestack.c
@@ -369,6 +369,12 @@ allocate_stack (const struct pthread_attr *attr, struct pthread **pdp,
 	  if (__glibc_unlikely (mem == MAP_FAILED))
 	    return errno;
 
+	  /* Do madvise in case the tunable glibc.pthread.stack_hugetlb is
+	     set to 0, disabling hugetlb.  */
+	  if (__glibc_unlikely (__nptl_stack_hugetlb == 0)
+	      && __madvise (mem, size, MADV_NOHUGEPAGE) != 0)
+	    return errno;
+
 	  /* SIZE is guaranteed to be greater than zero.
 	     So we can never get a null pointer back from mmap.  */
 	  assert (mem != NULL);
diff --git a/nptl/nptl-stack.c b/nptl/nptl-stack.c
index 5eb7773575..e829711cb5 100644
--- a/nptl/nptl-stack.c
+++ b/nptl/nptl-stack.c
@@ -21,6 +21,7 @@
 #include <pthreadP.h>
 
 size_t __nptl_stack_cache_maxsize = 40 * 1024 * 1024;
+int32_t __nptl_stack_hugetlb = 1;
 
 void
 __nptl_stack_list_del (list_t *elem)
diff --git a/nptl/nptl-stack.h b/nptl/nptl-stack.h
index 34f8bbb15e..cf90b27c2b 100644
--- a/nptl/nptl-stack.h
+++ b/nptl/nptl-stack.h
@@ -27,6 +27,9 @@
 /* Maximum size of the cache, in bytes.  40 MiB by default.  */
 extern size_t __nptl_stack_cache_maxsize attribute_hidden;
 
+/* Should allow stacks to use hugetlb. (1) is default.  */
+extern int32_t __nptl_stack_hugetlb;
+
 /* Check whether the stack is still used or not.  */
 static inline bool
 __nptl_stack_in_use (struct pthread *pd)
diff --git a/nptl/pthread_mutex_conf.c b/nptl/pthread_mutex_conf.c
index 329c4cbb8f..60ef9095aa 100644
--- a/nptl/pthread_mutex_conf.c
+++ b/nptl/pthread_mutex_conf.c
@@ -45,6 +45,12 @@ TUNABLE_CALLBACK (set_stack_cache_size) (tunable_val_t *valp)
   __nptl_stack_cache_maxsize = valp->numval;
 }
 
+static void
+TUNABLE_CALLBACK (set_stack_hugetlb) (tunable_val_t *valp)
+{
+  __nptl_stack_hugetlb = (int32_t) valp->numval;
+}
+
 void
 __pthread_tunables_init (void)
 {
@@ -52,5 +58,7 @@ __pthread_tunables_init (void)
                TUNABLE_CALLBACK (set_mutex_spin_count));
   TUNABLE_GET (stack_cache_size, size_t,
                TUNABLE_CALLBACK (set_stack_cache_size));
+  TUNABLE_GET (stack_hugetlb, int32_t,
+	       TUNABLE_CALLBACK (set_stack_hugetlb));
 }
 #endif
diff --git a/sysdeps/nptl/dl-tunables.list b/sysdeps/nptl/dl-tunables.list
index bd1ddb121d..4cde9500b6 100644
--- a/sysdeps/nptl/dl-tunables.list
+++ b/sysdeps/nptl/dl-tunables.list
@@ -33,5 +33,11 @@ glibc {
       maxval: 1
       default: 1
     }
+    stack_hugetlb {
+      type: INT_32
+      minval: 0
+      maxval: 1
+      default: 1
+    }
   }
 }
-- 
2.38.1


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PING] Re: [PATCH v5 0/1] *** Created tunable to force small pages on stack allocation.
  2023-03-28 15:22 [PATCH v5 0/1] *** Created tunable to force small pages on stack allocation Cupertino Miranda
  2023-03-28 15:22 ` [PATCH v5 1/1] " Cupertino Miranda
@ 2023-04-10  8:59 ` Cupertino Miranda
  1 sibling, 0 replies; 22+ messages in thread
From: Cupertino Miranda @ 2023-04-10  8:59 UTC (permalink / raw)
  To: libc-alpha; +Cc: jose.marchesi, elena.zannoni, Cupertino Miranda


Cupertino Miranda writes:

> Hi,
>
> Fixes in tunables.texi by suggestion of Adhemerval.
> Thanks for the improvements.
>
> Regards,
> Cupertino
>
> Cupertino Miranda (1):
>   Created tunable to force small pages on stack allocation.
>
>  manual/tunables.texi          | 15 +++++++++++++++
>  nptl/allocatestack.c          |  6 ++++++
>  nptl/nptl-stack.c             |  1 +
>  nptl/nptl-stack.h             |  3 +++
>  nptl/pthread_mutex_conf.c     |  8 ++++++++
>  sysdeps/nptl/dl-tunables.list |  6 ++++++
>  6 files changed, 39 insertions(+)

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v5 1/1] Created tunable to force small pages on stack allocation.
  2023-03-28 15:22 ` [PATCH v5 1/1] " Cupertino Miranda
@ 2023-04-11 19:56   ` Adhemerval Zanella Netto
  2023-04-12  8:53     ` Cupertino Miranda
  0 siblings, 1 reply; 22+ messages in thread
From: Adhemerval Zanella Netto @ 2023-04-11 19:56 UTC (permalink / raw)
  To: Cupertino Miranda, libc-alpha, Florian Weimer
  Cc: jose.marchesi, elena.zannoni



On 28/03/23 12:22, Cupertino Miranda via Libc-alpha wrote:
> Created tunable glibc.pthread.stack_hugetlb to control when hugepages
> can be used for stack allocation.
> In case THP are enabled and glibc.pthread.stack_hugetlb is set to
> 0, glibc will madvise the kernel not to use allow hugepages for stack
> allocations.
> 
> Changed from v1:
>  - removed the __malloc_thp_mode calls to check if hugetlb is
>    enabled.
> 
> Changed from v2:
>  - Added entry in manual/tunables.texi
>  - Fixed tunable default to description
>  - Code style corrections.
> 
> Changes from v3:
>  - Improve tunables.texi.
> 
> Changes from v4:
>  - Improved text in tunables.texi by suggestion of Adhemerval.

Florian has raised a concern [1] that the reported RSS increase is not
technically correct, because once the kernel needs to split the Huge
Page it does not need to keep all of the small pages (only the ones that
actually generate a soft fault).

However, this is not what I see with the previous testcase, which creates
a lot of threads to force THP usage, while checking
/proc/self/smaps_rollup.  The resulting 'Private_Dirty' still accounts
for *all* the default small pages once the kernel decides to split the
page, and it seems to be the same outcome as in a recent OpenJDK thread [2].

Afaiu the kernel does not keep track of which small pages from the
THP have already been touched when the guard page is mprotected (which
forces the split), so when the kernel reverts back to using default pages
it keeps all of them.  There is also a recent kernel discussion with a
similar conclusion [3].
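
As a reference, a minimal standalone sketch of how this accounting can be
observed (the /proc file and field names are real; the program itself is
only an illustration, not part of the patch or the testcase):

/* Print the Private_Dirty accounting of the current process, as
   reported by /proc/self/smaps_rollup.  Illustration only.  */
#include <stdio.h>
#include <string.h>

int
main (void)
{
  FILE *f = fopen ("/proc/self/smaps_rollup", "r");
  if (f == NULL)
    return 1;
  char line[256];
  while (fgets (line, sizeof line, f) != NULL)
    if (strncmp (line, "Private_Dirty:", strlen ("Private_Dirty:")) == 0)
      fputs (line, stdout);
  fclose (f);
  return 0;
}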

So this patch is LGTM, and I will install this shortly.  

I also discussed on the same call whether it would be better to make the
madvise the *default* behavior if the pthread stack usage will always end
up requiring the kernel to split it into default pages, i.e.:

  1. THP (/sys/kernel/mm/transparent_hugepage/enabled) is set to
     'always'.

  2. The stack size is a multiple of the THP size
     (/sys/kernel/mm/transparent_hugepage/hpage_pmd_size).

  3. The stack size minus the guard size is still a multiple of the THP
     size ((stack_size - guard_size) % thp_size == 0).

It does not mean that the stack will automatically be backed by THP, but
it does mean that, depending on the process VMAs, it might generate some
RSS waste once the kernel decides to use THP for the stack.  It should
also make the tunable unnecessary.

[1] https://sourceware.org/glibc/wiki/PatchworkReviewMeetings
[2] https://bugs.openjdk.org/browse/JDK-8303215?page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel&showAll=true
[3] https://lore.kernel.org/linux-mm/278ec047-4c5d-ab71-de36-094dbed4067c@redhat.com/T/

> ---
>  manual/tunables.texi          | 15 +++++++++++++++
>  nptl/allocatestack.c          |  6 ++++++
>  nptl/nptl-stack.c             |  1 +
>  nptl/nptl-stack.h             |  3 +++
>  nptl/pthread_mutex_conf.c     |  8 ++++++++
>  sysdeps/nptl/dl-tunables.list |  6 ++++++
>  6 files changed, 39 insertions(+)
> 
> diff --git a/manual/tunables.texi b/manual/tunables.texi
> index 70dd2264c5..130f94b2bc 100644
> --- a/manual/tunables.texi
> +++ b/manual/tunables.texi
> @@ -459,6 +459,21 @@ registration on behalf of the application.
>  Restartable sequences are a Linux-specific extension.
>  @end deftp
>  
> +@deftp Tunable glibc.pthread.stack_hugetlb
> +This tunable controls whether to use Huge Pages in the stacks created by
> +@code{pthread_create}.  This tunable only affects the stacks created by
> +@theglibc{}, it has no effect on stack assigned with
> +@code{pthread_attr_setstack}.
> +
> +The default is @samp{1} where the system default value is used.  Setting
> +its value to @code{0} enables the use of @code{madvise} with
> +@code{MADV_NOHUGEPAGE} after stack creation with @code{mmap}.
> +
> +This is a memory utilization optimization, since internal glibc setup of either
> +the thread descriptor and the guard page might force the kernel to move the
> +thread stack originally backup by Huge Pages to default pages.
> +@end deftp
> +
>  @node Hardware Capability Tunables
>  @section Hardware Capability Tunables
>  @cindex hardware capability tunables
> diff --git a/nptl/allocatestack.c b/nptl/allocatestack.c
> index c7adbccd6f..f9d8cdfd08 100644
> --- a/nptl/allocatestack.c
> +++ b/nptl/allocatestack.c
> @@ -369,6 +369,12 @@ allocate_stack (const struct pthread_attr *attr, struct pthread **pdp,
>  	  if (__glibc_unlikely (mem == MAP_FAILED))
>  	    return errno;
>  
> +	  /* Do madvise in case the tunable glibc.pthread.stack_hugetlb is
> +	     set to 0, disabling hugetlb.  */
> +	  if (__glibc_unlikely (__nptl_stack_hugetlb == 0)
> +	      && __madvise (mem, size, MADV_NOHUGEPAGE) != 0)
> +	    return errno;
> +
>  	  /* SIZE is guaranteed to be greater than zero.
>  	     So we can never get a null pointer back from mmap.  */
>  	  assert (mem != NULL);
> diff --git a/nptl/nptl-stack.c b/nptl/nptl-stack.c
> index 5eb7773575..e829711cb5 100644
> --- a/nptl/nptl-stack.c
> +++ b/nptl/nptl-stack.c
> @@ -21,6 +21,7 @@
>  #include <pthreadP.h>
>  
>  size_t __nptl_stack_cache_maxsize = 40 * 1024 * 1024;
> +int32_t __nptl_stack_hugetlb = 1;
>  
>  void
>  __nptl_stack_list_del (list_t *elem)
> diff --git a/nptl/nptl-stack.h b/nptl/nptl-stack.h
> index 34f8bbb15e..cf90b27c2b 100644
> --- a/nptl/nptl-stack.h
> +++ b/nptl/nptl-stack.h
> @@ -27,6 +27,9 @@
>  /* Maximum size of the cache, in bytes.  40 MiB by default.  */
>  extern size_t __nptl_stack_cache_maxsize attribute_hidden;
>  
> +/* Should allow stacks to use hugetlb. (1) is default.  */
> +extern int32_t __nptl_stack_hugetlb;
> +
>  /* Check whether the stack is still used or not.  */
>  static inline bool
>  __nptl_stack_in_use (struct pthread *pd)
> diff --git a/nptl/pthread_mutex_conf.c b/nptl/pthread_mutex_conf.c
> index 329c4cbb8f..60ef9095aa 100644
> --- a/nptl/pthread_mutex_conf.c
> +++ b/nptl/pthread_mutex_conf.c
> @@ -45,6 +45,12 @@ TUNABLE_CALLBACK (set_stack_cache_size) (tunable_val_t *valp)
>    __nptl_stack_cache_maxsize = valp->numval;
>  }
>  
> +static void
> +TUNABLE_CALLBACK (set_stack_hugetlb) (tunable_val_t *valp)
> +{
> +  __nptl_stack_hugetlb = (int32_t) valp->numval;
> +}
> +
>  void
>  __pthread_tunables_init (void)
>  {
> @@ -52,5 +58,7 @@ __pthread_tunables_init (void)
>                 TUNABLE_CALLBACK (set_mutex_spin_count));
>    TUNABLE_GET (stack_cache_size, size_t,
>                 TUNABLE_CALLBACK (set_stack_cache_size));
> +  TUNABLE_GET (stack_hugetlb, int32_t,
> +	       TUNABLE_CALLBACK (set_stack_hugetlb));
>  }
>  #endif
> diff --git a/sysdeps/nptl/dl-tunables.list b/sysdeps/nptl/dl-tunables.list
> index bd1ddb121d..4cde9500b6 100644
> --- a/sysdeps/nptl/dl-tunables.list
> +++ b/sysdeps/nptl/dl-tunables.list
> @@ -33,5 +33,11 @@ glibc {
>        maxval: 1
>        default: 1
>      }
> +    stack_hugetlb {
> +      type: INT_32
> +      minval: 0
> +      maxval: 1
> +      default: 1
> +    }
>    }
>  }

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v5 1/1] Created tunable to force small pages on stack allocation.
  2023-04-11 19:56   ` Adhemerval Zanella Netto
@ 2023-04-12  8:53     ` Cupertino Miranda
  2023-04-12 14:10       ` Adhemerval Zanella Netto
  2023-04-14 11:41       ` Adhemerval Zanella Netto
  0 siblings, 2 replies; 22+ messages in thread
From: Cupertino Miranda @ 2023-04-12  8:53 UTC (permalink / raw)
  To: Adhemerval Zanella Netto
  Cc: libc-alpha, Florian Weimer, jose.marchesi, elena.zannoni


Hi Adhemerval, everyone,

Thanks for the approval, the detailed analysis and the time spent on the topic.

Best regards,
Cupertino

Adhemerval Zanella Netto writes:

> On 28/03/23 12:22, Cupertino Miranda via Libc-alpha wrote:
>> Created tunable glibc.pthread.stack_hugetlb to control when hugepages
>> can be used for stack allocation.
>> In case THP are enabled and glibc.pthread.stack_hugetlb is set to
>> 0, glibc will madvise the kernel not to use allow hugepages for stack
>> allocations.
>>
>> Changed from v1:
>>  - removed the __malloc_thp_mode calls to check if hugetlb is
>>    enabled.
>>
>> Changed from v2:
>>  - Added entry in manual/tunables.texi
>>  - Fixed tunable default to description
>>  - Code style corrections.
>>
>> Changes from v3:
>>  - Improve tunables.texi.
>>
>> Changes from v4:
>>  - Improved text in tunables.texi by suggestion of Adhemerval.
>
> Florian has raised some concern [1] that reported RSS increase is not
> technically correct because the once the kernel need to split the Huge
> Page, it does not need keep all of them (only the one that actually
> generate the soft fault).
>
> However this is not what I see using the previous testcase that creates
> lot of threads to force the THP usage and checking the
> /proc/self/smaps_rollout.  The resulting 'Private_Dirty' still accounts
> for *all* the default smalls pages once kernel decides to split the
> page, and it seems to be same outcome from a recent OpenJDK thread [2].
>
> Afaiu the kernel does not keep track which possible small pages from the
> THP has been already hit when the guard page is mprotect (which forces
> the split), so when the kernel reverts back to using default pages it
> keeps all the pages.  This is also a recent kernel discussion which
> similar conclusion [3].
>
> So this patch is LGTM, and I will install this shortly.
>
> I also discussed on the same call if it would be better to make the m
> advise the *default* behavior if the pthread stack usage will always ended
> up requiring the kernel to split up to use default pages, i.e:
>
>   1. THP (/sys/kernel/mm/transparent_hugepage/enabled) is set to
>      'always'.
>
>   2. The stack size is multiple of THP size
>      (/sys/kernel/mm/transparent_hugepage/hpage_pmd_size).
>
>   3. And if stack size minus guard pages is still multiple of THP
>      size ((stack_size - guard_size) % thp_size == 0).
>
> It does not mean that the stack will automatically backup by THP, but
> it also means that depending of the process VMA it might generate some
> RSS waste once kernel decides to use THP for the stack.  And it should
> also make the tunables not required.
>
> [1] https://sourceware.org/glibc/wiki/PatchworkReviewMeetings
> [2] https://bugs.openjdk.org/browse/JDK-8303215?page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel&showAll=true
> [3] https://lore.kernel.org/linux-mm/278ec047-4c5d-ab71-de36-094dbed4067c@redhat.com/T/
>
>> ---
>>  manual/tunables.texi          | 15 +++++++++++++++
>>  nptl/allocatestack.c          |  6 ++++++
>>  nptl/nptl-stack.c             |  1 +
>>  nptl/nptl-stack.h             |  3 +++
>>  nptl/pthread_mutex_conf.c     |  8 ++++++++
>>  sysdeps/nptl/dl-tunables.list |  6 ++++++
>>  6 files changed, 39 insertions(+)
>>
>> diff --git a/manual/tunables.texi b/manual/tunables.texi
>> index 70dd2264c5..130f94b2bc 100644
>> --- a/manual/tunables.texi
>> +++ b/manual/tunables.texi
>> @@ -459,6 +459,21 @@ registration on behalf of the application.
>>  Restartable sequences are a Linux-specific extension.
>>  @end deftp
>>
>> +@deftp Tunable glibc.pthread.stack_hugetlb
>> +This tunable controls whether to use Huge Pages in the stacks created by
>> +@code{pthread_create}.  This tunable only affects the stacks created by
>> +@theglibc{}, it has no effect on stack assigned with
>> +@code{pthread_attr_setstack}.
>> +
>> +The default is @samp{1} where the system default value is used.  Setting
>> +its value to @code{0} enables the use of @code{madvise} with
>> +@code{MADV_NOHUGEPAGE} after stack creation with @code{mmap}.
>> +
>> +This is a memory utilization optimization, since internal glibc setup of either
>> +the thread descriptor and the guard page might force the kernel to move the
>> +thread stack originally backup by Huge Pages to default pages.
>> +@end deftp
>> +
>>  @node Hardware Capability Tunables
>>  @section Hardware Capability Tunables
>>  @cindex hardware capability tunables
>> diff --git a/nptl/allocatestack.c b/nptl/allocatestack.c
>> index c7adbccd6f..f9d8cdfd08 100644
>> --- a/nptl/allocatestack.c
>> +++ b/nptl/allocatestack.c
>> @@ -369,6 +369,12 @@ allocate_stack (const struct pthread_attr *attr, struct pthread **pdp,
>>  	  if (__glibc_unlikely (mem == MAP_FAILED))
>>  	    return errno;
>>
>> +	  /* Do madvise in case the tunable glibc.pthread.stack_hugetlb is
>> +	     set to 0, disabling hugetlb.  */
>> +	  if (__glibc_unlikely (__nptl_stack_hugetlb == 0)
>> +	      && __madvise (mem, size, MADV_NOHUGEPAGE) != 0)
>> +	    return errno;
>> +
>>  	  /* SIZE is guaranteed to be greater than zero.
>>  	     So we can never get a null pointer back from mmap.  */
>>  	  assert (mem != NULL);
>> diff --git a/nptl/nptl-stack.c b/nptl/nptl-stack.c
>> index 5eb7773575..e829711cb5 100644
>> --- a/nptl/nptl-stack.c
>> +++ b/nptl/nptl-stack.c
>> @@ -21,6 +21,7 @@
>>  #include <pthreadP.h>
>>
>>  size_t __nptl_stack_cache_maxsize = 40 * 1024 * 1024;
>> +int32_t __nptl_stack_hugetlb = 1;
>>
>>  void
>>  __nptl_stack_list_del (list_t *elem)
>> diff --git a/nptl/nptl-stack.h b/nptl/nptl-stack.h
>> index 34f8bbb15e..cf90b27c2b 100644
>> --- a/nptl/nptl-stack.h
>> +++ b/nptl/nptl-stack.h
>> @@ -27,6 +27,9 @@
>>  /* Maximum size of the cache, in bytes.  40 MiB by default.  */
>>  extern size_t __nptl_stack_cache_maxsize attribute_hidden;
>>
>> +/* Should allow stacks to use hugetlb. (1) is default.  */
>> +extern int32_t __nptl_stack_hugetlb;
>> +
>>  /* Check whether the stack is still used or not.  */
>>  static inline bool
>>  __nptl_stack_in_use (struct pthread *pd)
>> diff --git a/nptl/pthread_mutex_conf.c b/nptl/pthread_mutex_conf.c
>> index 329c4cbb8f..60ef9095aa 100644
>> --- a/nptl/pthread_mutex_conf.c
>> +++ b/nptl/pthread_mutex_conf.c
>> @@ -45,6 +45,12 @@ TUNABLE_CALLBACK (set_stack_cache_size) (tunable_val_t *valp)
>>    __nptl_stack_cache_maxsize = valp->numval;
>>  }
>>
>> +static void
>> +TUNABLE_CALLBACK (set_stack_hugetlb) (tunable_val_t *valp)
>> +{
>> +  __nptl_stack_hugetlb = (int32_t) valp->numval;
>> +}
>> +
>>  void
>>  __pthread_tunables_init (void)
>>  {
>> @@ -52,5 +58,7 @@ __pthread_tunables_init (void)
>>                 TUNABLE_CALLBACK (set_mutex_spin_count));
>>    TUNABLE_GET (stack_cache_size, size_t,
>>                 TUNABLE_CALLBACK (set_stack_cache_size));
>> +  TUNABLE_GET (stack_hugetlb, int32_t,
>> +	       TUNABLE_CALLBACK (set_stack_hugetlb));
>>  }
>>  #endif
>> diff --git a/sysdeps/nptl/dl-tunables.list b/sysdeps/nptl/dl-tunables.list
>> index bd1ddb121d..4cde9500b6 100644
>> --- a/sysdeps/nptl/dl-tunables.list
>> +++ b/sysdeps/nptl/dl-tunables.list
>> @@ -33,5 +33,11 @@ glibc {
>>        maxval: 1
>>        default: 1
>>      }
>> +    stack_hugetlb {
>> +      type: INT_32
>> +      minval: 0
>> +      maxval: 1
>> +      default: 1
>> +    }
>>    }
>>  }

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v5 1/1] Created tunable to force small pages on stack allocation.
  2023-04-12  8:53     ` Cupertino Miranda
@ 2023-04-12 14:10       ` Adhemerval Zanella Netto
  2023-04-13 16:13         ` Cupertino Miranda
  2023-04-14 11:41       ` Adhemerval Zanella Netto
  1 sibling, 1 reply; 22+ messages in thread
From: Adhemerval Zanella Netto @ 2023-04-12 14:10 UTC (permalink / raw)
  To: Cupertino Miranda
  Cc: libc-alpha, Florian Weimer, jose.marchesi, elena.zannoni



On 12/04/23 05:53, Cupertino Miranda wrote:
>>
>> So this patch is LGTM, and I will install this shortly.
>>
>> I also discussed on the same call if it would be better to make the m
>> advise the *default* behavior if the pthread stack usage will always ended
>> up requiring the kernel to split up to use default pages, i.e:
>>
>>   1. THP (/sys/kernel/mm/transparent_hugepage/enabled) is set to
>>      'always'.
>>
>>   2. The stack size is multiple of THP size
>>      (/sys/kernel/mm/transparent_hugepage/hpage_pmd_size).
>>
>>   3. And if stack size minus guard pages is still multiple of THP
>>      size ((stack_size - guard_size) % thp_size == 0).
>>
>> It does not mean that the stack will automatically backup by THP, but
>> it also means that depending of the process VMA it might generate some
>> RSS waste once kernel decides to use THP for the stack.  And it should
>> also make the tunables not required.
>>
>> [1] https://sourceware.org/glibc/wiki/PatchworkReviewMeetings
>> [2] https://bugs.openjdk.org/browse/JDK-8303215?page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel&showAll=true
>> [3] https://lore.kernel.org/linux-mm/278ec047-4c5d-ab71-de36-094dbed4067c@redhat.com/T/

I implemented my idea above, which should cover the issue you brought up
without the need for the extra tunable.  It seems that if the kernel cannot
keep track of the touched 'subpages' once THP is used for the stack
allocation, it should always be an improvement to madvise (MADV_NOHUGEPAGE).

What do you think?

---

[PATCH] nptl: Disable THP on thread stack if it incurs in large RSS usage

If Transparent Huge Pages (THP) are set to 'always' and the resulting
address and the stack size are multiples of the THP size, the kernel
may use THP for the thread stack.  However, if the guard page size is
not a multiple of the THP size, once it is mprotected the allocated
range can no longer be served with THP and the kernel will revert back
to using default page sizes.

However, the kernel might also not keep track of which offsets within
the THP have been touched and need to reside in memory.  It will then
keep all the small pages, thus using much more memory than required.
In this scenario, it is better to just madvise that huge pages not be
used and avoid the memory bloat.

__malloc_default_thp_pagesize and __malloc_thp_mode now cache the
obtained value to avoid having to read and parse the kernel
information on each thread creation (if the system changes its setting,
the process will not be able to adjust to it).

Checked on x86_64-linux-gnu.
---
 nptl/allocatestack.c                       | 32 +++++++++++++++
 sysdeps/generic/malloc-hugepages.h         |  1 +
 sysdeps/unix/sysv/linux/malloc-hugepages.c | 46 ++++++++++++++++++----
 3 files changed, 72 insertions(+), 7 deletions(-)

diff --git a/nptl/allocatestack.c b/nptl/allocatestack.c
index c7adbccd6f..d197edf2e9 100644
--- a/nptl/allocatestack.c
+++ b/nptl/allocatestack.c
@@ -33,6 +33,7 @@
 #include <nptl-stack.h>
 #include <libc-lock.h>
 #include <tls-internal.h>
+#include <malloc-hugepages.h>
 
 /* Default alignment of stack.  */
 #ifndef STACK_ALIGN
@@ -206,6 +207,33 @@ advise_stack_range (void *mem, size_t size, uintptr_t pd, size_t guardsize)
 #endif
 }
 
+/* If Transparent Huge Pages (THP) are set to 'always' and the resulting
+   address and the stack size are multiples of the THP size, the kernel may
+   use THP for the thread stack.  However, if the guard page size is not a
+   multiple of the THP size, once it is mprotected the allocated range can no
+   longer be served with THP and the kernel will revert back to default pages.
+
+   However, the kernel might also not keep track of which offsets within the
+   THP have been touched and need to reside in memory.  It will then keep all
+   the small pages, thus using much more memory than required.  In this
+   scenario, it is better to just madvise that huge pages not be used and
+   avoid the memory bloat.  */
+static __always_inline int
+advise_thp (void *mem, size_t size, size_t guardsize)
+{
+  enum malloc_thp_mode_t thpmode = __malloc_thp_mode ();
+  if (thpmode != malloc_thp_mode_always)
+    return 0;
+
+  unsigned long int thpsize = __malloc_default_thp_pagesize ();
+  if ((uintptr_t) mem % thpsize != 0
+      || size % thpsize != 0
+      || (size - guardsize) % thpsize != 0)
+    return 0;
+
+  return __madvise (mem, size, MADV_NOHUGEPAGE);
+}
+
 /* Returns a usable stack for a new thread either by allocating a
    new stack or reusing a cached stack of sufficient size.
    ATTR must be non-NULL and point to a valid pthread_attr.
@@ -373,6 +401,10 @@ allocate_stack (const struct pthread_attr *attr, struct pthread **pdp,
 	     So we can never get a null pointer back from mmap.  */
 	  assert (mem != NULL);
 
+	  int r = advise_thp (mem, size, guardsize);
+	  if (r != 0)
+	    return r;
+
 	  /* Place the thread descriptor at the end of the stack.  */
 #if TLS_TCB_AT_TP
 	  pd = (struct pthread *) ((((uintptr_t) mem + size)
diff --git a/sysdeps/generic/malloc-hugepages.h b/sysdeps/generic/malloc-hugepages.h
index d68b85630c..21d4844bc4 100644
--- a/sysdeps/generic/malloc-hugepages.h
+++ b/sysdeps/generic/malloc-hugepages.h
@@ -26,6 +26,7 @@ unsigned long int __malloc_default_thp_pagesize (void) attribute_hidden;
 
 enum malloc_thp_mode_t
 {
+  malloc_thp_mode_unknown,
   malloc_thp_mode_always,
   malloc_thp_mode_madvise,
   malloc_thp_mode_never,
diff --git a/sysdeps/unix/sysv/linux/malloc-hugepages.c b/sysdeps/unix/sysv/linux/malloc-hugepages.c
index 683d68c327..5954dd13f6 100644
--- a/sysdeps/unix/sysv/linux/malloc-hugepages.c
+++ b/sysdeps/unix/sysv/linux/malloc-hugepages.c
@@ -22,19 +22,33 @@
 #include <not-cancel.h>
 #include <sys/mman.h>
 
+/* The __malloc_thp_mode is called only in single-thread mode, either in
+   malloc initialization or pthread creation.  */
+static unsigned long int thp_pagesize = -1;
+
 unsigned long int
 __malloc_default_thp_pagesize (void)
 {
+  unsigned long int size = atomic_load_relaxed (&thp_pagesize);
+  if (size != -1)
+    return size;
+
   int fd = __open64_nocancel (
     "/sys/kernel/mm/transparent_hugepage/hpage_pmd_size", O_RDONLY);
   if (fd == -1)
-    return 0;
+    {
+      atomic_store_relaxed (&thp_pagesize, 0);
+      return 0;
+    }
 
   char str[INT_BUFSIZE_BOUND (unsigned long int)];
   ssize_t s = __read_nocancel (fd, str, sizeof (str));
   __close_nocancel (fd);
   if (s < 0)
-    return 0;
+    {
+      atomic_store_relaxed (&thp_pagesize, 0);
+      return 0;
+    }
 
   unsigned long int r = 0;
   for (ssize_t i = 0; i < s; i++)
@@ -44,16 +58,28 @@ __malloc_default_thp_pagesize (void)
       r *= 10;
       r += str[i] - '0';
     }
+  atomic_store_relaxed (&thp_pagesize, r);
   return r;
 }
 
+/* The __malloc_thp_mode is called only in single-thread mode, either in
+   malloc initialization or pthread creation.  */
+static enum malloc_thp_mode_t thp_mode = malloc_thp_mode_unknown;
+
 enum malloc_thp_mode_t
 __malloc_thp_mode (void)
 {
+  enum malloc_thp_mode_t mode = atomic_load_relaxed (&thp_mode);
+  if (mode != malloc_thp_mode_unknown)
+    return mode;
+
   int fd = __open64_nocancel ("/sys/kernel/mm/transparent_hugepage/enabled",
 			      O_RDONLY);
   if (fd == -1)
-    return malloc_thp_mode_not_supported;
+    {
+      atomic_store_relaxed (&thp_mode, malloc_thp_mode_not_supported);
+      return malloc_thp_mode_not_supported;
+    }
 
   static const char mode_always[]  = "[always] madvise never\n";
   static const char mode_madvise[] = "always [madvise] never\n";
@@ -66,13 +92,19 @@ __malloc_thp_mode (void)
   if (s == sizeof (mode_always) - 1)
     {
       if (strcmp (str, mode_always) == 0)
-	return malloc_thp_mode_always;
+	mode = malloc_thp_mode_always;
       else if (strcmp (str, mode_madvise) == 0)
-	return malloc_thp_mode_madvise;
+	mode = malloc_thp_mode_madvise;
       else if (strcmp (str, mode_never) == 0)
-	return malloc_thp_mode_never;
+	mode = malloc_thp_mode_never;
+      else
+	mode = malloc_thp_mode_not_supported;
     }
-  return malloc_thp_mode_not_supported;
+  else
+    mode = malloc_thp_mode_not_supported;
+
+  atomic_store_relaxed (&thp_mode, mode);
+  return mode;
 }
 
 static size_t

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v5 1/1] Created tunable to force small pages on stack allocation.
  2023-04-12 14:10       ` Adhemerval Zanella Netto
@ 2023-04-13 16:13         ` Cupertino Miranda
  0 siblings, 0 replies; 22+ messages in thread
From: Cupertino Miranda @ 2023-04-13 16:13 UTC (permalink / raw)
  To: Adhemerval Zanella Netto
  Cc: libc-alpha, Florian Weimer, jose.marchesi, elena.zannoni


Hi Adhemerval,

Sorry for taking a while to reply to you.

I compiled your patch with the particular example that led me to propose
the tunable; it does not seem to optimize that particular case.

Looking at the code, I realized that the condition would most likely
never allow it to be madvised unless you had guardsize == thpsize,
which appears unrealistic.

+  if ((uintptr_t) mem % thpsize != 0           (is not aligned)
+      || size % thpsize != 0                   (1)
+      || (size - guardsize) % thpsize != 0)    (2)
+    return 0;

If (1) is not true then (2) is likely to be.
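
For instance, with the values from the attached test program later in this
thread (assuming thpsize = 2 MiB, size = 2 MiB and guardsize = 4 KiB):
size % thpsize == 0, so (1) does not trigger, but (size - guardsize) is
2 MiB - 4 KiB, which is not a multiple of 2 MiB, so (2) triggers and
advise_thp returns 0 without calling madvise.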


I agree that for the scenarios where huge pages would never be used,
we should advise the kernel not to bloat the RSS.
In any case, I am not fully sure you can always predict that.

Lastly, I would say that the tunable would still be useful.
For example, when the application creates short-lived threads but
allocates hugepage-sized stacks to control alignment.

Regards,
Cupertino


Adhemerval Zanella Netto writes:

> On 12/04/23 05:53, Cupertino Miranda wrote:
>>>
>>> So this patch is LGTM, and I will install this shortly.
>>>
>>> I also discussed on the same call if it would be better to make the m
>>> advise the *default* behavior if the pthread stack usage will always ended
>>> up requiring the kernel to split up to use default pages, i.e:
>>>
>>>   1. THP (/sys/kernel/mm/transparent_hugepage/enabled) is set to
>>>      'always'.
>>>
>>>   2. The stack size is multiple of THP size
>>>      (/sys/kernel/mm/transparent_hugepage/hpage_pmd_size).
>>>
>>>   3. And if stack size minus guard pages is still multiple of THP
>>>      size ((stack_size - guard_size) % thp_size == 0).
>>>
>>> It does not mean that the stack will automatically backup by THP, but
>>> it also means that depending of the process VMA it might generate some
>>> RSS waste once kernel decides to use THP for the stack.  And it should
>>> also make the tunables not required.
>>>
>>> [1] https://sourceware.org/glibc/wiki/PatchworkReviewMeetings
>>> [2] https://bugs.openjdk.org/browse/JDK-8303215?page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel&showAll=true
>>> [3] https://lore.kernel.org/linux-mm/278ec047-4c5d-ab71-de36-094dbed4067c@redhat.com/T/
>
> I implemented my idea above, which should cover the issue you brought without
> the need of the extra tunable.  It seems that if the kernel can not keep track
> of the touch 'subpages' once THP is used on the stack allocation, it should be
> always an improvement to madvise (MADV_NOHUGEPAGE).
>
> What do you think?
>
> ---
>
> [PATCH] nptl: Disable THP on thread stack if it incurs in large RSS usage
>
> If the Transparent Huge Page (THP) is set as 'always', the resulting
> address and the stack size are multiple of THP size, kernel may use
> THP for the thread stack.  However, if the guard page size is not
> multiple of THP, once it is mprotect the allocate range could no
> longer be served with THP and then kernel will revert back using
> default page sizes.
>
> However, the kernel might also not keep track of the offsets within
> the THP that has been touched and need to reside on the memory.  It
> will then keep all the small pages, thus using much more memory than
> required.  In this scenario, it is better to just madvise that not
> use huge pages and avoid the memory bloat.
>
> The __malloc_default_thp_pagesize and __malloc_thp_mode now cache
> the obtained value to avoid require read and parse the kernel
> information on each thread creation (if system change its setting,
> the process will not be able to adjust it).
>
> Checked on x86_64-linux-gnu.
> ---
>  nptl/allocatestack.c                       | 32 +++++++++++++++
>  sysdeps/generic/malloc-hugepages.h         |  1 +
>  sysdeps/unix/sysv/linux/malloc-hugepages.c | 46 ++++++++++++++++++----
>  3 files changed, 72 insertions(+), 7 deletions(-)
>
> diff --git a/nptl/allocatestack.c b/nptl/allocatestack.c
> index c7adbccd6f..d197edf2e9 100644
> --- a/nptl/allocatestack.c
> +++ b/nptl/allocatestack.c
> @@ -33,6 +33,7 @@
>  #include <nptl-stack.h>
>  #include <libc-lock.h>
>  #include <tls-internal.h>
> +#include <malloc-hugepages.h>
>
>  /* Default alignment of stack.  */
>  #ifndef STACK_ALIGN
> @@ -206,6 +207,33 @@ advise_stack_range (void *mem, size_t size, uintptr_t pd, size_t guardsize)
>  #endif
>  }
>
> +/* If the Transparent Huge Page (THP) is set as 'always', the resulting
> +   address and the stack size are multiple of THP size, kernel may use THP for
> +   the thread stack.  However, if the guard page size is not multiple of THP,
> +   once it is mprotect the allocate range could no longer be served with THP
> +   and then kernel will revert back using default page sizes.
> +
> +   However, the kernel might also not keep track of the offsets within the THP
> +   that has been touched and need to reside on the memory.  It will then keep
> +   all the small pages, thus using much more memory than required.  In this
> +   scenario, it is better to just madvise that not use huge pages and avoid
> +   the memory bloat.  */
> +static __always_inline int
> +advise_thp (void *mem, size_t size, size_t guardsize)
> +{
> +  enum malloc_thp_mode_t thpmode = __malloc_thp_mode ();
> +  if (thpmode != malloc_thp_mode_always)
> +    return 0;
> +
> +  unsigned long int thpsize = __malloc_default_thp_pagesize ();
> +  if ((uintptr_t) mem % thpsize != 0
> +      || size % thpsize != 0
> +      || (size - guardsize) % thpsize != 0)
> +    return 0;
> +
> +  return __madvise (mem, size, MADV_NOHUGEPAGE);
> +}
> +
>  /* Returns a usable stack for a new thread either by allocating a
>     new stack or reusing a cached stack of sufficient size.
>     ATTR must be non-NULL and point to a valid pthread_attr.
> @@ -373,6 +401,10 @@ allocate_stack (const struct pthread_attr *attr, struct pthread **pdp,
>  	     So we can never get a null pointer back from mmap.  */
>  	  assert (mem != NULL);
>
> +	  int r = advise_thp (mem, size, guardsize);
> +	  if (r != 0)
> +	    return r;
> +
>  	  /* Place the thread descriptor at the end of the stack.  */
>  #if TLS_TCB_AT_TP
>  	  pd = (struct pthread *) ((((uintptr_t) mem + size)
> diff --git a/sysdeps/generic/malloc-hugepages.h b/sysdeps/generic/malloc-hugepages.h
> index d68b85630c..21d4844bc4 100644
> --- a/sysdeps/generic/malloc-hugepages.h
> +++ b/sysdeps/generic/malloc-hugepages.h
> @@ -26,6 +26,7 @@ unsigned long int __malloc_default_thp_pagesize (void) attribute_hidden;
>
>  enum malloc_thp_mode_t
>  {
> +  malloc_thp_mode_unknown,
>    malloc_thp_mode_always,
>    malloc_thp_mode_madvise,
>    malloc_thp_mode_never,
> diff --git a/sysdeps/unix/sysv/linux/malloc-hugepages.c b/sysdeps/unix/sysv/linux/malloc-hugepages.c
> index 683d68c327..5954dd13f6 100644
> --- a/sysdeps/unix/sysv/linux/malloc-hugepages.c
> +++ b/sysdeps/unix/sysv/linux/malloc-hugepages.c
> @@ -22,19 +22,33 @@
>  #include <not-cancel.h>
>  #include <sys/mman.h>
>
> +/* The __malloc_thp_mode is called only in single-thread mode, either in
> +   malloc initialization or pthread creation.  */
> +static unsigned long int thp_pagesize = -1;
> +
>  unsigned long int
>  __malloc_default_thp_pagesize (void)
>  {
> +  unsigned long int size = atomic_load_relaxed (&thp_pagesize);
> +  if (size != -1)
> +    return size;
> +
>    int fd = __open64_nocancel (
>      "/sys/kernel/mm/transparent_hugepage/hpage_pmd_size", O_RDONLY);
>    if (fd == -1)
> -    return 0;
> +    {
> +      atomic_store_relaxed (&thp_pagesize, 0);
> +      return 0;
> +    }
>
>    char str[INT_BUFSIZE_BOUND (unsigned long int)];
>    ssize_t s = __read_nocancel (fd, str, sizeof (str));
>    __close_nocancel (fd);
>    if (s < 0)
> -    return 0;
> +    {
> +      atomic_store_relaxed (&thp_pagesize, 0);
> +      return 0;
> +    }
>
>    unsigned long int r = 0;
>    for (ssize_t i = 0; i < s; i++)
> @@ -44,16 +58,28 @@ __malloc_default_thp_pagesize (void)
>        r *= 10;
>        r += str[i] - '0';
>      }
> +  atomic_store_relaxed (&thp_pagesize, r);
>    return r;
>  }
>
> +/* The __malloc_thp_mode is called only in single-thread mode, either in
> +   malloc initialization or pthread creation.  */
> +static enum malloc_thp_mode_t thp_mode = malloc_thp_mode_unknown;
> +
>  enum malloc_thp_mode_t
>  __malloc_thp_mode (void)
>  {
> +  enum malloc_thp_mode_t mode = atomic_load_relaxed (&thp_mode);
> +  if (mode != malloc_thp_mode_unknown)
> +    return mode;
> +
>    int fd = __open64_nocancel ("/sys/kernel/mm/transparent_hugepage/enabled",
>  			      O_RDONLY);
>    if (fd == -1)
> -    return malloc_thp_mode_not_supported;
> +    {
> +      atomic_store_relaxed (&thp_mode, malloc_thp_mode_not_supported);
> +      return malloc_thp_mode_not_supported;
> +    }
>
>    static const char mode_always[]  = "[always] madvise never\n";
>    static const char mode_madvise[] = "always [madvise] never\n";
> @@ -66,13 +92,19 @@ __malloc_thp_mode (void)
>    if (s == sizeof (mode_always) - 1)
>      {
>        if (strcmp (str, mode_always) == 0)
> -	return malloc_thp_mode_always;
> +	mode = malloc_thp_mode_always;
>        else if (strcmp (str, mode_madvise) == 0)
> -	return malloc_thp_mode_madvise;
> +	mode = malloc_thp_mode_madvise;
>        else if (strcmp (str, mode_never) == 0)
> -	return malloc_thp_mode_never;
> +	mode = malloc_thp_mode_never;
> +      else
> +	mode = malloc_thp_mode_not_supported;
>      }
> -  return malloc_thp_mode_not_supported;
> +  else
> +    mode = malloc_thp_mode_not_supported;
> +
> +  atomic_store_relaxed (&thp_mode, mode);
> +  return mode;
>  }
>
>  static size_t

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v5 1/1] Created tunable to force small pages on stack allocation.
  2023-04-12  8:53     ` Cupertino Miranda
  2023-04-12 14:10       ` Adhemerval Zanella Netto
@ 2023-04-14 11:41       ` Adhemerval Zanella Netto
  2023-04-14 12:27         ` Cupertino Miranda
  1 sibling, 1 reply; 22+ messages in thread
From: Adhemerval Zanella Netto @ 2023-04-14 11:41 UTC (permalink / raw)
  To: Cupertino Miranda
  Cc: libc-alpha, Florian Weimer, jose.marchesi, elena.zannoni

On 12/04/23 05:53, Cupertino Miranda wrote:
> 
> Hi Adhemerval, everyone,
> 
> Thanks the approval, detailed analysis and time spent on the topic.
> 
> Best regards,
> Cupertino
> 

Cupertino, could you also add a NEWS entry?  I recall that every tunable
is also mentioned in NEWS.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v5 1/1] Created tunable to force small pages on stack allocation.
  2023-04-14 11:41       ` Adhemerval Zanella Netto
@ 2023-04-14 12:27         ` Cupertino Miranda
  2023-04-14 13:06           ` Adhemerval Zanella Netto
  0 siblings, 1 reply; 22+ messages in thread
From: Cupertino Miranda @ 2023-04-14 12:27 UTC (permalink / raw)
  To: Adhemerval Zanella Netto
  Cc: libc-alpha, Florian Weimer, jose.marchesi, elena.zannoni


Hi Adhemerval,

I came up with the following text in the patch below.
Do you want me to do a new version, or is this good enough?

Regards,
Cupertino


diff --git a/NEWS b/NEWS
index c54af824e0..941808049a 100644
--- a/NEWS
+++ b/NEWS
@@ -21,6 +21,9 @@ Major new features:

 * PRIb* and PRIB* macros from C2X have been added to <inttypes.h>.

+* A new tunable, glibc.pthread.stack_hugetlb, can be used to disable
+  Transparent Huge Pages (THP) in pthread_create stack allocation.
+
 Deprecated and removed features, and other changes affecting compatibility:

 * In the Linux kernel for the hppa/parisc architecture some of the


Adhemerval Zanella Netto writes:

> On 12/04/23 05:53, Cupertino Miranda wrote:
>>
>> Hi Adhemerval, everyone,
>>
>> Thanks the approval, detailed analysis and time spent on the topic.
>>
>> Best regards,
>> Cupertino
>>
>
> Cupertino, could you also add a NEWS entry? I recall that every tunable
> is also mentioned on NEWS.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v5 1/1] Created tunable to force small pages on stack allocation.
  2023-04-14 12:27         ` Cupertino Miranda
@ 2023-04-14 13:06           ` Adhemerval Zanella Netto
  2023-04-14 14:33             ` Cupertino Miranda
  0 siblings, 1 reply; 22+ messages in thread
From: Adhemerval Zanella Netto @ 2023-04-14 13:06 UTC (permalink / raw)
  To: Cupertino Miranda
  Cc: libc-alpha, Florian Weimer, jose.marchesi, elena.zannoni



On 14/04/23 09:27, Cupertino Miranda wrote:
> 
> Hi Adhemerval,
> 
> I came up with the following text in patch below.
> Do you want me to do a new version or is this good enough ?

Yes, so I can install it by pulling from patchwork.

> 
> Regards,
> Cupertino
> 
> 
> diff --git a/NEWS b/NEWS
> index c54af824e0..941808049a 100644
> --- a/NEWS
> +++ b/NEWS
> @@ -21,6 +21,9 @@ Major new features:
> 
>  * PRIb* and PRIB* macros from C2X have been added to <inttypes.h>.
> 
> +* A new tunable, glibc.pthread.stack_hugetlb, can be used to disable
> +  Transparent Huge Pages (THP) in pthread_create stack allocation.
> +

Ok.

>  Deprecated and removed features, and other changes affecting compatibility:
> 
>  * In the Linux kernel for the hppa/parisc architecture some of the
> 
> 
> Adhemerval Zanella Netto writes:
> 
>> On 12/04/23 05:53, Cupertino Miranda wrote:
>>>
>>> Hi Adhemerval, everyone,
>>>
>>> Thanks the approval, detailed analysis and time spent on the topic.
>>>
>>> Best regards,
>>> Cupertino
>>>
>>
>> Cupertino, could you also add a NEWS entry? I recall that every tunable
>> is also mentioned on NEWS.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v5 1/1] Created tunable to force small pages on stack allocation.
  2023-04-14 13:06           ` Adhemerval Zanella Netto
@ 2023-04-14 14:33             ` Cupertino Miranda
  0 siblings, 0 replies; 22+ messages in thread
From: Cupertino Miranda @ 2023-04-14 14:33 UTC (permalink / raw)
  To: Adhemerval Zanella Netto
  Cc: libc-alpha, Florian Weimer, jose.marchesi, elena.zannoni


Sent!

Adhemerval Zanella Netto writes:

> On 14/04/23 09:27, Cupertino Miranda wrote:
>>
>> Hi Adhemerval,
>>
>> I came up with the following text in patch below.
>> Do you want me to do a new version or is this good enough ?
>
> Yes, so I can install by pulling from patchwork.
>
>>
>> Regards,
>> Cupertino
>>
>>
>> diff --git a/NEWS b/NEWS
>> index c54af824e0..941808049a 100644
>> --- a/NEWS
>> +++ b/NEWS
>> @@ -21,6 +21,9 @@ Major new features:
>>
>>  * PRIb* and PRIB* macros from C2X have been added to <inttypes.h>.
>>
>> +* A new tunable, glibc.pthread.stack_hugetlb, can be used to disable
>> +  Transparent Huge Pages (THP) in pthread_create stack allocation.
>> +
>
> Ok.
>
>>  Deprecated and removed features, and other changes affecting compatibility:
>>
>>  * In the Linux kernel for the hppa/parisc architecture some of the
>>
>>
>> Adhemerval Zanella Netto writes:
>>
>>> On 12/04/23 05:53, Cupertino Miranda wrote:
>>>>
>>>> Hi Adhemerval, everyone,
>>>>
>>>> Thanks the approval, detailed analysis and time spent on the topic.
>>>>
>>>> Best regards,
>>>> Cupertino
>>>>
>>>
>>> Cupertino, could you also add a NEWS entry? I recall that every tunable
>>> is also mentioned on NEWS.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v5 1/1] Created tunable to force small pages on stack allocation.
  2023-04-14 16:35               ` Cupertino Miranda
@ 2023-04-14 23:10                 ` Wilco Dijkstra
  0 siblings, 0 replies; 22+ messages in thread
From: Wilco Dijkstra @ 2023-04-14 23:10 UTC (permalink / raw)
  To: Cupertino Miranda; +Cc: Adhemerval Zanella Netto, 'GNU C Library'

Hi Cupertino,

> I had a test application demonstrating the problem.

So I don't see the problem discussed of huge pages being
split into small pages due to the guard page.  If you ask for
N huge pages of stack, you actually get either N-1 or N huge
pages, irrespective of the size of the guard page.

If you get N-1 huge pages, the first/last part of the stack consists
of small pages.  This depends, by accident, on the alignment of the
previous mmap, since the stack allocation code does not request huge
pages (i.e. you get transparent huge pages if they happen to fit in
the mmap range).

Even in your basic example you always get exactly 1 huge page
for all threads when the mmap is aligned.  Interestingly, in the
unaligned mmap case (i.e. N-1 huge pages), the RSS size can
actually end up twice as large as the RSS size of aligned huge
pages (e.g. if you use ~2 MBytes of stack, it will fit exactly in one
huge page if aligned, so 2MB RSS, but if not aligned it might use
1.9MB of small pages with the rest falling into the next huge
page, making RSS ~4MB).

So the underlying problem is that the stack allocation code doesn't
really support huge pages.  It's bad to get random huge pages in
stack allocations based on the alignment of unrelated mmap calls -
it affects the RSS size randomly, as well as TLB misses etc.  Overall
it seems best to disable huge pages in stacks until this is fixed.

Cheers,
Wilco

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v5 1/1] Created tunable to force small pages on stack allocation.
  2023-04-14 16:03             ` Wilco Dijkstra
@ 2023-04-14 16:35               ` Cupertino Miranda
  2023-04-14 23:10                 ` Wilco Dijkstra
  0 siblings, 1 reply; 22+ messages in thread
From: Cupertino Miranda @ 2023-04-14 16:35 UTC (permalink / raw)
  To: Wilco Dijkstra; +Cc: Adhemerval Zanella Netto, 'GNU C Library'

[-- Attachment #1: Type: text/plain, Size: 828 bytes --]


Hi Wilco,

I had a test application demonstrating the problem; I include it as an
attachment.
From my particular interest in the tunable, this is the difference:

# GLIBC_TUNABLES=glibc.pthread.stack_hugetlb=0 ./tststackalloc 1
Page size: 4 kB, 2 MB huge pages
Will attempt to align allocations to make stacks eligible for huge pages
pid: 3482023 (/proc/3482023/smaps)
stack_size = 2097152, 0x200000
Creating 128 threads...
RSS: 448 pages (1835008 bytes = 1 MB)
Press enter to exit...

# GLIBC_TUNABLES=glibc.pthread.stack_hugetlb=1 ./tststackalloc 1
Page size: 4 kB, 2 MB huge pages
Will attempt to align allocations to make stacks eligible for huge pages
pid: 3482254 (/proc/3482254/smaps)
stack_size = 2097152, 0x200000
Creating 128 threads...
RSS: 65891 pages (269889536 bytes = 257 MB)
Press enter to exit...

Regards,
Cupertino


[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: tststackalloc.c --]
[-- Type: text/x-csrc, Size: 4666 bytes --]

// Compile & run:
//    gcc -Wall -g -o tststackalloc tststackalloc.c -lpthread
//    ./tststackalloc 1     # Attempt to use huge pages for stacks -> RSS bloat
//    ./tststackalloc 0     # Do not attempt to use huge pages -> No RSS bloat

#define _GNU_SOURCE
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <inttypes.h>
#include <sys/mman.h>
#include <fcntl.h>

// Number of threads to create
#define NOOF_THREADS (128)

// Size of a small page (hard-coded)
#define SMALL_PAGE_SIZE (4*1024)

// Size of a huge page (hard-coded)
#define HUGE_PAGE_SIZE (2*1024*1024)

// Total size of the thread stack, including the guard page(s)
#define STACK_SIZE_TOTAL (HUGE_PAGE_SIZE)

// Size of the guard page(s)
#define GUARD_SIZE (SMALL_PAGE_SIZE)

//#define PRINT_STACK_RANGES
//#define PRINT_PROC_SMAPS

// When enabled (set to non-zero), tries to align thread stacks on
// huge page boundaries, making them eligible for huge pages
static int huge_page_align_stacks;

static volatile int exit_thread = 0;

#if defined(PRINT_STACK_RANGES)
static void print_stack_range(void) {
  pthread_attr_t attr;
  void* bottom;
  size_t size;
  int err;

  err = pthread_getattr_np(pthread_self(), &attr);
  if (err != 0) {
    fprintf(stderr, "Error looking up attr\n");
    exit(1);
  }

  err = pthread_attr_getstack(&attr, &bottom, &size);
  if (err != 0) {
    fprintf(stderr, "Cannot locate current stack attributes!\n");
    exit(1);
  }

  pthread_attr_destroy(&attr);

  fprintf(stderr, "Stack: %p-%p (0x%zx/%zd)\n", bottom, bottom + size, size, size);
}
#endif

static void* start(void* arg) {
#if defined(PRINT_STACK_RANGES)
  print_stack_range();
#endif

  while(!exit_thread) {
    sleep(1);
  }
  exit(2);
}

#if defined(PRINT_PROC_SMAPS)
static void print_proc_file(const char* file) {
  char path[128];
  snprintf(path, sizeof(path), "/proc/self/%s", file);
  int smap = open(path, O_RDONLY);
  char buf[4096];
  int x = 0;
  while ((x = read(smap, buf, sizeof(buf))) > 0) {
    write(1, buf, x);
  }
  close(smap);
}
#endif

static size_t get_rss(void) {
  FILE* stat = fopen("/proc/self/statm", "r");
  long rss = 0;
  if (stat != NULL) {
    // The second field of /proc/self/statm is the resident set size in pages.
    if (fscanf(stat, "%*d %ld", &rss) != 1)
      rss = 0;
    fclose(stat);
  }
  return rss;
}

uintptr_t align_down(uintptr_t value, uintptr_t alignment) {
  return value & ~(alignment - 1);
}

// Do a series of small, single page mmap calls to attempt to set
// everything up so that the next mmap call (glibc allocating the
// stack) returns a 2MB aligned range. The kernel "expands" vmas from
// higher to lower addresses (subsequent calls return ranges starting
// at lower addresses), so this function keeps calling mmap until it a
// huge page aligned address is returned. The next range (the stack)
// will then end on that same address.
static void align_next_on(uintptr_t alignment) {
  uintptr_t p;
  do {
    p = (uintptr_t)mmap(NULL, SMALL_PAGE_SIZE, PROT_NONE, MAP_ANONYMOUS|MAP_PRIVATE|MAP_NORESERVE, -1, 0);
  } while (p != align_down(p, HUGE_PAGE_SIZE));
}

int main(int argc, char* argv[]) {
  pthread_t t[NOOF_THREADS];
  pthread_attr_t attr;
  int i;

  if (argc != 2) {
    printf("Usage: %s <huge page stacks>\n", argv[0]);
    printf("    huge page stacks = 1 - attempt to use huge pages for stacks\n");
    exit(1);
  }
  huge_page_align_stacks = atoi(argv[1]);

  void* dummy = malloc(1024);
  free(dummy);

  fprintf(stderr, "Page size: %d kB, %d MB huge pages\n", SMALL_PAGE_SIZE / 1024, HUGE_PAGE_SIZE / (1024 * 1024));
  if (huge_page_align_stacks) {
    fprintf(stderr, "Will attempt to align allocations to make stacks eligible for huge pages\n");
  }
  pid_t pid = getpid();
  fprintf(stderr, "pid: %d (/proc/%d/smaps)\n", pid, pid);

  size_t guard_size = GUARD_SIZE;
  size_t stack_size = STACK_SIZE_TOTAL;
  pthread_attr_init(&attr);
  pthread_attr_setstacksize(&attr, stack_size);
  pthread_attr_setguardsize(&attr, guard_size);

  fprintf(stderr, "stack_size = %d, 0x%x\n", stack_size, stack_size);
  fprintf(stderr, "Creating %d threads...\n", NOOF_THREADS);
  for (i = 0; i < NOOF_THREADS; i++) {
    if (huge_page_align_stacks) {
      // align (next) allocation on huge page boundary
      align_next_on(HUGE_PAGE_SIZE);
    }
    pthread_create(&t[i], &attr, start, NULL);
  }
  sleep(1);

#if defined(PRINT_PROC_SMAPS)
  print_proc_file("smaps");
#endif

  size_t rss = get_rss();
  fprintf(stderr, "RSS: %zd pages (%zd bytes = %zd MB)\n", rss, rss * SMALL_PAGE_SIZE, rss * SMALL_PAGE_SIZE / 1024 / 1024);

  fprintf(stderr, "Press enter to exit...\n");
  getchar();

  exit_thread = 1;
  for (i = 0; i < NOOF_THREADS; i++) {
    pthread_join(t[i], NULL);
  }
  return 0;
}

[-- Attachment #3: Type: text/plain, Size: 460 bytes --]



Wilco Dijkstra writes:

> Hi,
>
>> The next question is whether the splitting of one huge page causes the whole
>> stack mmap to use small pages too.
>
> Btw if it wasn't obvious, you can trivially check this by creating threads that use
> 3MB of stack. Then the RSS per thread should be either 3MB (only small pages,
> no RSS loss!), 4MB (2 large pages, no RSS loss), 5 MB (3MB small pages, 2MB loss)
> or 6MB (2 large pages, 2MB loss).
>
> Cheers,
> Wilco

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v5 1/1] Created tunable to force small pages on stack allocation.
  2023-04-14 15:32           ` Wilco Dijkstra
  2023-04-14 16:03             ` Wilco Dijkstra
@ 2023-04-14 16:27             ` Cupertino Miranda
  1 sibling, 0 replies; 22+ messages in thread
From: Cupertino Miranda @ 2023-04-14 16:27 UTC (permalink / raw)
  To: Wilco Dijkstra; +Cc: Adhemerval Zanella Netto, 'GNU C Library'


> Hi Cupertino,
>
>> To contextualize the tunable was proposed to reduce stack memory usage while
>> still having THP enabled in the kernel for all other allocation types.
>> We identified a scenario which THP will be initially used for stack
>> allocation, but once protection gets set for the guard region, it will
>> split the huge page into 4k pages, when the 2M page is already marked
>> dirty, not really making use of THP benefits, but bloating RSS as all
>> the small pages are assumed as touch.
>
> So after the guard region is reserved, we lose 2MB of real memory that will
> never be used (as only the top of the stack is used), right?

Right

> The next question is whether the splitting of one huge page causes the whole
> stack mmap to use small pages too. If so then it never makes sense to use huge
> pages for the stack if the guard page is smaller than a huge page (since you lose
> 2MB of memory and never use huge pages anyway).

If one wants a guard size of 2MB, they can specify that in the
pthread_create attributes. That is already controllable through the
pthread API.

My tunable is really an optimization to reduce RSS, i.e. always avoid
THP being used for stack allocation. Even if you want to allocate big
stack spaces, those would use small pages to reduce RSS.

Adhemerval's suggestion is more of a heuristic to detect when, although
you prefer THP to be used, the configuration of the stack would not
allow a huge page to be assigned anyway; instead of having the kernel bloat
the RSS by 2MB at allocation time, one would advise the kernel not to
attempt THP there.
When glibc changes the protection mode for a small guard space that
fits in the same huge page as the rest of the stack, the huge
page is split into 4k pages, but the RSS is still kept at 2MB, since the
huge page was already written to and is marked as dirty.
This is what Adhemerval's patch tries to avoid by
madvise-ing in advance.
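
As a minimal illustration outside of glibc (a sketch only, assuming a
2MB THP size; the function name is made up), the essential point is that
the madvise has to happen right after the mmap, before any page of the
mapping is touched:

#include <stddef.h>
#include <sys/mman.h>

#define STACK_SIZE (8UL * 1024 * 1024)

/* Map a thread-stack-sized region and tell the kernel not to back it
   with huge pages, mimicking what the tunable does.  */
static void *
alloc_stack_without_thp (void)
{
  void *mem = mmap (NULL, STACK_SIZE, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
  if (mem == MAP_FAILED)
    return NULL;
  /* Advise before the first write; once a huge page has been faulted in
     and later split by the guard mprotect, its RSS is already counted.  */
  (void) madvise (mem, STACK_SIZE, MADV_NOHUGEPAGE);
  return mem;
}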

> Alternatively, if the rest of the stack can still use huge pages, we actually lose 4MB
> of RSS (since the top of the stack is a large page and is obviously used). As I
> explained, huge pages for stack can still make sense for applications that need
> a large amount of stack. For these cases you could just increase the guard size
> to avoid wasting the extra 2MB of memory.
>
>> I think you are suggesting the opposite, i.e. it would increase RSS.
>
> Where did I suggest increasing RSS?
Wouldn't increasing the guard size to 2MB increase RSS by 2MB?

>
>> Also, the intention of Adhemerval's patch is not to detect when THP for stack
>> allocation makes sense, but rather the opposite. It is to detect when THP
>> will not bring any benefit.
>
> It doesn't matter which way around you do it - it is absolutely trivial either way.

Regarding the condition changes I made to Adhemerval's patch, it is likely
that they are not complete. In any case, the approach is more about
detecting when THP is not going to be possible, rather than changing
allocation parameters to make it possible.

Cheers,
Cupertino

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v5 1/1] Created tunable to force small pages on stack allocation.
  2023-04-14 15:32           ` Wilco Dijkstra
@ 2023-04-14 16:03             ` Wilco Dijkstra
  2023-04-14 16:35               ` Cupertino Miranda
  2023-04-14 16:27             ` Cupertino Miranda
  1 sibling, 1 reply; 22+ messages in thread
From: Wilco Dijkstra @ 2023-04-14 16:03 UTC (permalink / raw)
  To: Cupertino Miranda; +Cc: Adhemerval Zanella Netto, 'GNU C Library'

Hi,

> The next question is whether the splitting of one huge page causes the whole
> stack mmap to use small pages too. 

Btw if it wasn't obvious, you can trivially check this by creating threads that use
3MB of stack. Then the RSS per thread should be either 3MB (only small pages,
no RSS loss!), 4MB (2 large pages, no RSS loss), 5 MB (3MB small pages, 2MB loss)
or 6MB (2 large pages, 2MB loss).

Cheers,
Wilco

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v5 1/1] Created tunable to force small pages on stack allocation.
  2023-04-14 14:49         ` Cupertino Miranda
@ 2023-04-14 15:32           ` Wilco Dijkstra
  2023-04-14 16:03             ` Wilco Dijkstra
  2023-04-14 16:27             ` Cupertino Miranda
  0 siblings, 2 replies; 22+ messages in thread
From: Wilco Dijkstra @ 2023-04-14 15:32 UTC (permalink / raw)
  To: Cupertino Miranda; +Cc: Adhemerval Zanella Netto, 'GNU C Library'

Hi Cupertino,

> To contextualize, the tunable was proposed to reduce stack memory usage while
> still having THP enabled in the kernel for all other allocation types.
> We identified a scenario in which THP will initially be used for stack
> allocation, but once protection gets set for the guard region, it will
> split the huge page into 4k pages when the 2M page is already marked
> dirty, not really making use of THP benefits, but bloating RSS as all
> the small pages are assumed to be touched.

So after the guard region is reserved, we lose 2MB of real memory that will
never be used (as only the top of the stack is used), right?

The next question is whether the splitting of one huge page causes the whole
stack mmap to use small pages too. If so then it never makes sense to use huge
pages for the stack if the guard page is smaller than a huge page (since you lose
2MB of memory and never use huge pages anyway).

Alternatively, if the rest of the stack can still use huge pages, we actually lose 4MB
of RSS (since the top of the stack is a large page and is obviously used). As I
explained, huge pages for stack can still make sense for applications that need
a large amount of stack. For these cases you could just increase the guard size
to avoid wasting the extra 2MB of memory.
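
For example, an application that really wants huge-page-backed stacks
could request both a large stack and a huge-page-sized guard, something
like this (illustrative sketch only, assuming a 2MB huge page size;
thread_fn is a placeholder):

pthread_t tid;
pthread_attr_t attr;
pthread_attr_init (&attr);
pthread_attr_setstacksize (&attr, 64 * 1024 * 1024);  /* large stack */
pthread_attr_setguardsize (&attr, 2 * 1024 * 1024);   /* one huge page */
pthread_create (&tid, &attr, thread_fn, NULL);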

> I think you are suggesting the opposite, i.e. it would increase RSS.

Where did I suggest increasing RSS?

> Also, the intention of Adhemerval's patch is not to detect when THP for stack
> allocation makes sense, but rather the opposite. It is to detect when THP
> will not bring any benefit.

It doesn't matter which way around you do it - it is absolutely trivial either way.

Cheers,
Wilco

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v5 1/1] Created tunable to force small pages on stack allocation.
  2023-04-14 13:24       ` Wilco Dijkstra
@ 2023-04-14 14:49         ` Cupertino Miranda
  2023-04-14 15:32           ` Wilco Dijkstra
  0 siblings, 1 reply; 22+ messages in thread
From: Cupertino Miranda @ 2023-04-14 14:49 UTC (permalink / raw)
  To: Wilco Dijkstra; +Cc: Adhemerval Zanella Netto, 'GNU C Library'


Hi Wilco,

To contextualize, the tunable was proposed to reduce stack memory usage while
still having THP enabled in the kernel for all other allocation
types.
We identified a scenario in which THP will initially be used for stack
allocation, but once protection gets set for the guard region, it will
split the huge page into 4k pages when the 2M page is already marked
dirty, not really making use of THP benefits, but bloating RSS as all
the small pages are assumed to be touched.

I think you are suggesting the opposite, i.e. it would increase RSS.

Also, the intention of Adhemerval's patch is not to detect when THP for stack
allocation makes sense, but rather the opposite. It is to detect when THP
will not bring any benefit.

Regards,
Cupertino


Wilco Dijkstra writes:

> Hi Cupertino,
>
>> If my memory does not betray me, I believe the reason why the hugepages
>> would not be relevant was because the mprotect to the guard region would
>> enforce different protection on part of the huge page.
>
> Yes so with a default stack size of 8MB and a 2MB huge page, this implies that
> it splits the bottom page of the stack and wastes 2 MB of memory that most
> programs will never use (since it is the bottom of the stack).
>
>> In that sense, you could test whether the guard region is part of the
>> first page. If it is, it would mean that the page would be broken
>> into smaller parts and it should advise for smaller pages.
>
> Yes that's the test I suggested.
>
>> While debugging I also realized you were checking for alignment of
>> "mem"; however, for the common case STACK_GROWS_DOWN, the relevant
>> alignment to check is the start of the stack (the end of the memory allocated).
>
> This shouldn't matter given the size must also be a multiple of huge page
> size, and so start/end alignment are the same.
>
>> I have included your patch with some changes which I think make a
>> reasonable estimation.
>
> +  if (thpmode != malloc_thp_mode_always)
> +    return 0;
> +  if ((((uintptr_t) stack_start_addr) % thpsize) != 0
>
> I think checking mem and size is better.
>
> +      || (((uintptr_t) mem / thpsize) != ((uintptr_t) guardpos / thpsize)))
>
> I'm not sure what that is trying to test. The issue is not the position of the
> guard page, but the guard page *size* being smaller than a huge page.
> We could test that guardpos % thpsize == 0 as well since the guard may
> be in different places within the mmap, but testing guardsize % thpsize == 0
> is essential.
>
> +    return 0;
> +
> +  return __madvise (mem, size, MADV_NOHUGEPAGE);
>
>> In any case, I would like to keep the tunable, as I think it is harmless
>> and makes sense.
>
> Having a tunable to override the default behaviour is perfectly fine.
> However I believe the default behaviour should be to block huge pages
> unless a heuristic says it is safe.
>
> We could automatically increase guard size to be a huge page to avoid any
> splitting (a much larger guard doesn't cost anything). The heuristic could
> allow huge pages if the stacksize is much larger than the default size.
> This would help Fortran code built with -Ofast where it allocates huge arrays
> on the stack to avoid slow malloc/free.
>
> This should do the right thing for the majority of cases without needing
> to set a tunable.
>
> Cheers,
> Wilco

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v5 1/1] Created tunable to force small pages on stack allocation.
  2023-04-14 11:28     ` Cupertino Miranda
@ 2023-04-14 13:24       ` Wilco Dijkstra
  2023-04-14 14:49         ` Cupertino Miranda
  0 siblings, 1 reply; 22+ messages in thread
From: Wilco Dijkstra @ 2023-04-14 13:24 UTC (permalink / raw)
  To: Cupertino Miranda, Adhemerval Zanella Netto; +Cc: 'GNU C Library'

Hi Cupertino,

> If my memory does not betray me, I believe the reason why the hugepages
> would not be relevant was because the mprotect to the guard region would
> enforce different protection on part of the huge page.

Yes so with a default stack size of 8MB and a 2MB huge page, this implies that
it splits the bottom page of the stack and wastes 2 MB of memory that most
programs will never use (since it is the bottom of the stack).

> In that sense, you could test whether the guard region is part of the
> first page. If it is, it would mean that the page would be broken
> into smaller parts and it should advise for smaller pages.

Yes that's the test I suggested.

> While debugging I also realized you were checking for alignment of
> "mem"; however, for the common case STACK_GROWS_DOWN, the relevant
> alignment to check is the start of the stack (the end of the memory allocated).

This shouldn't matter given the size must also be a multiple of huge page
size, and so start/end alignment are the same.

> I have included your patch with some changes which I think make a
> reasonable estimation.

+  if (thpmode != malloc_thp_mode_always)
+    return 0;
+  if ((((uintptr_t) stack_start_addr) % thpsize) != 0

I think checking mem and size is better.

+      || (((uintptr_t) mem / thpsize) != ((uintptr_t) guardpos / thpsize)))

I'm not sure what that is trying to test. The issue is not the position of the
guard page, but the guard page *size* being smaller than a huge page.
We could test that guardpos % thpsize == 0 as well since the guard may
be in different places within the mmap, but testing guardsize % thpsize == 0
is essential.

+    return 0;
+
+  return __madvise (mem, size, MADV_NOHUGEPAGE);

> In any case, I would like to keep the tunable, as I think it is harmless
> and makes sense.

Having a tunable to override the default behaviour is perfectly fine.
However I believe the default behaviour should be to block huge pages
unless a heuristic says it is safe.

We could automatically increase guard size to be a huge page to avoid any
splitting (a much larger guard doesn't cost anything). The heuristic could
allow huge pages if the stacksize is much larger than the default size.
This would help Fortran code built with -Ofast where it allocates huge arrays
on the stack to avoid slow malloc/free.

This should do the right thing for the majority of cases without needing
to set a tunable.
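
Roughly something like this at stack allocation time (just a sketch of
the idea, not tested; default_stacksize, the 4x threshold and ALIGN_UP
stand in for whatever we would actually use):

if (thpmode == malloc_thp_mode_always
    && size % thpsize == 0
    && size >= 4 * default_stacksize)
  /* Stack is much larger than the default: keep THP, but round the
     guard up to a whole huge page so it cannot split one.  */
  guardsize = ALIGN_UP (guardsize, thpsize);
else
  /* Default: block huge pages for this stack.  */
  __madvise (mem, size, MADV_NOHUGEPAGE);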

Cheers,
Wilco

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v5 1/1] Created tunable to force small pages on stack allocation.
  2023-04-13 17:48   ` Adhemerval Zanella Netto
@ 2023-04-14 11:28     ` Cupertino Miranda
  2023-04-14 13:24       ` Wilco Dijkstra
  0 siblings, 1 reply; 22+ messages in thread
From: Cupertino Miranda @ 2023-04-14 11:28 UTC (permalink / raw)
  To: Adhemerval Zanella Netto; +Cc: Wilco Dijkstra, 'GNU C Library'

[-- Attachment #1: Type: text/plain, Size: 984 bytes --]


Hi Adhemerval,

If my memory does not betray me, I believe the reason why the hugepages
would not be relevant was because the mprotect to the guard region would
enforce different protection on part of the huge page.

In that sense, you could test whether the guard region is part of the
first page. If it is, it would mean that the page would be broken
into smaller parts and it should advise for smaller pages.

While debugging I also realized you were checking for alignment of
"mem"; however, for the common case STACK_GROWS_DOWN, the relevant
alignment to check is the start of the stack (the end of the memory allocated).

I have included your patch with some changes which I think make a
reasonable estimation.
In any case, I would like to keep the tunable, as I think it is harmless
and makes sense.

BTW, you mentioned that you would install the tunable patch in a
previous email. Maybe you forgot.
I am waiting for it to do some internal backporting. ;-)

Regards,
Cupertino


[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: adhemerval_1.patch --]
[-- Type: text/x-diff, Size: 5579 bytes --]

diff --git a/nptl/allocatestack.c b/nptl/allocatestack.c
index c7adbccd6f..ad718cfb4a 100644
--- a/nptl/allocatestack.c
+++ b/nptl/allocatestack.c
@@ -33,6 +33,7 @@
 #include <nptl-stack.h>
 #include <libc-lock.h>
 #include <tls-internal.h>
+#include <malloc-hugepages.h>
 
 /* Default alignment of stack.  */
 #ifndef STACK_ALIGN
@@ -206,6 +207,38 @@ advise_stack_range (void *mem, size_t size, uintptr_t pd, size_t guardsize)
 #endif
 }
 
+/* If the Transparent Huge Page (THP) is set as 'always', the resulting
+   address and the stack size are multiple of THP size, kernel may use THP for
+   the thread stack.  However, if the guard page size is not multiple of THP,
+   once it is mprotect the allocate range could no longer be served with THP
+   and then kernel will revert back using default page sizes.
+
+   However, the kernel might also not keep track of the offsets within the THP
+   that has been touched and need to reside on the memory.  It will then keep
+   all the small pages, thus using much more memory than required.  In this
+   scenario, it is better to just madvise that not use huge pages and avoid
+   the memory bloat.  */
+static __always_inline int
+advise_thp (void *mem, size_t size, char *guardpos)
+{
+  enum malloc_thp_mode_t thpmode = __malloc_thp_mode ();
+  unsigned long int thpsize = __malloc_default_thp_pagesize ();
+
+#if _STACK_GROWS_DOWN
+  char *stack_start_addr = mem + size;
+#elif _STACK_GROWS_UP
+  char *stack_start_addr = mem;
+#endif
+
+  if (thpmode != malloc_thp_mode_always)
+    return 0;
+  if ((((uintptr_t) stack_start_addr) % thpsize) != 0
+      || (((uintptr_t) mem / thpsize) != ((uintptr_t) guardpos / thpsize)))
+    return 0;
+
+  return __madvise (mem, size, MADV_NOHUGEPAGE);
+}
+
 /* Returns a usable stack for a new thread either by allocating a
    new stack or reusing a cached stack of sufficient size.
    ATTR must be non-NULL and point to a valid pthread_attr.
@@ -385,6 +418,13 @@ allocate_stack (const struct pthread_attr *attr, struct pthread **pdp,
 				   - TLS_PRE_TCB_SIZE);
 #endif
 
+	  char *guardpos = guard_position (mem, size, guardsize,
+			  				pd, pagesize_m1);
+	  int r = advise_thp (mem, size, guardpos);
+	  if (r != 0)
+	    return r;
+
+
 	  /* Now mprotect the required region excluding the guard area.  */
 	  if (__glibc_likely (guardsize > 0))
 	    {
diff --git a/sysdeps/generic/malloc-hugepages.h b/sysdeps/generic/malloc-hugepages.h
index d68b85630c..21d4844bc4 100644
--- a/sysdeps/generic/malloc-hugepages.h
+++ b/sysdeps/generic/malloc-hugepages.h
@@ -26,6 +26,7 @@ unsigned long int __malloc_default_thp_pagesize (void) attribute_hidden;
 
 enum malloc_thp_mode_t
 {
+  malloc_thp_mode_unknown,
   malloc_thp_mode_always,
   malloc_thp_mode_madvise,
   malloc_thp_mode_never,
diff --git a/sysdeps/unix/sysv/linux/malloc-hugepages.c b/sysdeps/unix/sysv/linux/malloc-hugepages.c
index 740027ebfb..15b862b0bf 100644
--- a/sysdeps/unix/sysv/linux/malloc-hugepages.c
+++ b/sysdeps/unix/sysv/linux/malloc-hugepages.c
@@ -22,19 +22,33 @@
 #include <not-cancel.h>
 #include <sys/mman.h>
 
+/* The __malloc_thp_mode is called only in single-thread mode, either in
+   malloc initialization or pthread creation.  */
+static unsigned long int thp_pagesize = -1;
+
 unsigned long int
 __malloc_default_thp_pagesize (void)
 {
+  unsigned long int size = atomic_load_relaxed (&thp_pagesize);
+  if (size != -1)
+    return size;
+
   int fd = __open64_nocancel (
     "/sys/kernel/mm/transparent_hugepage/hpage_pmd_size", O_RDONLY);
   if (fd == -1)
-    return 0;
+    {
+      atomic_store_relaxed (&thp_pagesize, 0);
+      return 0;
+    }
 
   char str[INT_BUFSIZE_BOUND (unsigned long int)];
   ssize_t s = __read_nocancel (fd, str, sizeof (str));
   __close_nocancel (fd);
   if (s < 0)
-    return 0;
+    {
+      atomic_store_relaxed (&thp_pagesize, 0);
+      return 0;
+    }
 
   unsigned long int r = 0;
   for (ssize_t i = 0; i < s; i++)
@@ -44,16 +58,28 @@ __malloc_default_thp_pagesize (void)
       r *= 10;
       r += str[i] - '0';
     }
+  atomic_store_relaxed (&thp_pagesize, r);
   return r;
 }
 
+/* The __malloc_thp_mode is called only in single-thread mode, either in
+   malloc initialization or pthread creation.  */
+static enum malloc_thp_mode_t thp_mode = malloc_thp_mode_unknown;
+
 enum malloc_thp_mode_t
 __malloc_thp_mode (void)
 {
+  enum malloc_thp_mode_t mode = atomic_load_relaxed (&thp_mode);
+  if (mode != malloc_thp_mode_unknown)
+    return mode;
+
   int fd = __open64_nocancel ("/sys/kernel/mm/transparent_hugepage/enabled",
 			      O_RDONLY);
   if (fd == -1)
-    return malloc_thp_mode_not_supported;
+    {
+      atomic_store_relaxed (&thp_mode, malloc_thp_mode_not_supported);
+      return malloc_thp_mode_not_supported;
+    }
 
   static const char mode_always[]  = "[always] madvise never\n";
   static const char mode_madvise[] = "always [madvise] never\n";
@@ -67,13 +93,19 @@ __malloc_thp_mode (void)
   if (s == sizeof (mode_always) - 1)
     {
       if (strcmp (str, mode_always) == 0)
-	return malloc_thp_mode_always;
+	mode = malloc_thp_mode_always;
       else if (strcmp (str, mode_madvise) == 0)
-	return malloc_thp_mode_madvise;
+	mode = malloc_thp_mode_madvise;
       else if (strcmp (str, mode_never) == 0)
-	return malloc_thp_mode_never;
+	mode = malloc_thp_mode_never;
+      else
+	mode = malloc_thp_mode_not_supported;
     }
-  return malloc_thp_mode_not_supported;
+  else
+    mode = malloc_thp_mode_not_supported;
+
+  atomic_store_relaxed (&thp_mode, mode);
+  return mode;
 }
 
 static size_t

[-- Attachment #3: Type: text/plain, Size: 2696 bytes --]


Adhemerval Zanella Netto writes:

> On 13/04/23 13:23, Cupertino Miranda wrote:
>>
>> Hi Wilco,
>>
>> Exactly my remark on the patch. ;)
>>
>> I think the tunable is beneficial when we care to allocate hugepages for
>> malloc, etc., but still want to be able to force small pages for the stack.
>>
>> Imagine a scenario where you create lots of threads. Most threads
>> barely use any stack; however, there is one that somehow requires a lot
>> of it to do some crazy recursion. :)
>>
>> Most likely the heuristic would detect that hugepages would be useful
>> based on the stack size requirement, but it would never predict that it
>> only brings any benefit to 1% of the threads created.
>
> The problem is not finding when hugepages are beneficial, but rather when
> using them will incur falling back to default pages.  And re-reading the
> THP kernel docs and after some experiments, I am not sure it is really
> possible to come up with good heuristics to do so (not without poking
> into khugepaged stats).
>
> For instance, if the guard size is 0, THP will still back up the thread stack.
> However, if you force stack alignment by issuing multiple mmaps, the
> khugepaged won't have an available VMA and thus won't use THP (using your
> example of forcing the mmap alignment in thread creation).
>
> I think my proposal would end up with a very limited and complicated
> heuristic (especially because khugepaged has various tunables itself),
> so I agree that the tunable is a better strategy.
>
>>
>> Regards,
>> Cupertino
>>
>>
>>
>> Wilco Dijkstra writes:
>>
>>> Hi Adhemerval,
>>>
>>> I agree doing this automatically sounds like a better solution.
>>> However:
>>>
>>> +static __always_inline int
>>> +advise_thp (void *mem, size_t size, size_t guardsize)
>>> +{
>>> +  enum malloc_thp_mode_t thpmode = __malloc_thp_mode ();
>>> +  if (thpmode != malloc_thp_mode_always)
>>> +    return 0;
>>> +
>>> +  unsigned long int thpsize = __malloc_default_thp_pagesize ();
>>> +  if ((uintptr_t) mem % thpsize != 0
>>> +      || size % thpsize != 0
>>> +      || (size - guardsize) % thpsize != 0)
>>> +    return 0;
>>>
>>> Isn't the last part always true currently given the guard page size is based on
>>> the standard page size? IIRC the issue was the mmap succeeds but the guard
>>> page is taken from the original mmap which then causes the decomposition.
>>>
>>> So you'd need something like:
>>>
>>> || guardsize % thpsize == 0)
>>>
>>> Ie. we return without the madvise if the size and alignment is wrong for a huge
>>> page or it is correct and the guardsize is a multiple of a huge page (in which
>>> case it shouldn't decompose).
>>>
>>> +  return __madvise (mem, size, MADV_NOHUGEPAGE);
>>> +}
>>>
>>> Cheers,
>>> Wilco

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v5 1/1] Created tunable to force small pages on stack allocation.
  2023-04-13 16:23 ` Cupertino Miranda
@ 2023-04-13 17:48   ` Adhemerval Zanella Netto
  2023-04-14 11:28     ` Cupertino Miranda
  0 siblings, 1 reply; 22+ messages in thread
From: Adhemerval Zanella Netto @ 2023-04-13 17:48 UTC (permalink / raw)
  To: Cupertino Miranda, Wilco Dijkstra; +Cc: 'GNU C Library'



On 13/04/23 13:23, Cupertino Miranda wrote:
> 
> Hi Wilco,
> 
> Exactly my remark on the patch. ;)
> 
> I think the tunable is beneficial when we care to allocate hugepages for
> malloc, etc., but still want to be able to force small pages for the stack.
> 
> Imagine a scenario where you create lots of threads. Most threads
> barely use any stack; however, there is one that somehow requires a lot
> of it to do some crazy recursion. :)
> 
> Most likely the heuristic would detect that hugepages would be useful
> based on the stack size requirement, but it would never predict that it
> only brings any benefit to 1% of the threads created.

The problem is not finding when hugepages are beneficial, but rather when
using them will incur falling back to default pages.  And re-reading the
THP kernel docs and after some experiments, I am not sure it is really
possible to come up with good heuristics to do so (not without poking
into khugepaged stats).

For instance, if the guard size is 0, THP will still back up the thread stack.
However, if you force stack alignment by issuing multiple mmaps, the
khugepaged won't have an available VMA and thus won't use THP (using your
example of forcing the mmap alignment in thread creation).
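
(For reference, by "force stack alignment" I mean something along these
lines, similar to the align_next_on trick in your test program; just an
illustrative sketch, assuming a 2MB THP size and no error handling:)

size_t thpsize = 2UL * 1024 * 1024;   /* assumed THP size */
size_t size = 8UL * 1024 * 1024;      /* stack size, multiple of THP */
/* Over-allocate by one THP, then trim so the mapping starts on a THP
   boundary.  */
char *raw = mmap (NULL, size + thpsize, PROT_READ | PROT_WRITE,
                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
uintptr_t start = ((uintptr_t) raw + thpsize - 1) & ~(thpsize - 1);
if (start > (uintptr_t) raw)
  munmap (raw, start - (uintptr_t) raw);              /* unaligned head */
munmap ((void *) (start + size),
        (uintptr_t) raw + thpsize - start);           /* leftover tail */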

I think my proposal would end up with a very limited and complicated
heuristic (especially because khugepaged has various tunables itself),
so I agree that the tunable is a better strategy.

> 
> Regards,
> Cupertino
> 
> 
> 
> Wilco Dijkstra writes:
> 
>> Hi Adhemerval,
>>
>> I agree doing this automatically sounds like a better solution.
>> However:
>>
>> +static __always_inline int
>> +advise_thp (void *mem, size_t size, size_t guardsize)
>> +{
>> +  enum malloc_thp_mode_t thpmode = __malloc_thp_mode ();
>> +  if (thpmode != malloc_thp_mode_always)
>> +    return 0;
>> +
>> +  unsigned long int thpsize = __malloc_default_thp_pagesize ();
>> +  if ((uintptr_t) mem % thpsize != 0
>> +      || size % thpsize != 0
>> +      || (size - guardsize) % thpsize != 0)
>> +    return 0;
>>
>> Isn't the last part always true currently given the guard page size is based on
>> the standard page size? IIRC the issue was the mmap succeeds but the guard
>> page is taken from the original mmap which then causes the decomposition.
>>
>> So you'd need something like:
>>
>> || guardsize % thpsize == 0)
>>
>> Ie. we return without the madvise if the size and alignment is wrong for a huge
>> page or it is correct and the guardsize is a multiple of a huge page (in which
>> case it shouldn't decompose).
>>
>> +  return __madvise (mem, size, MADV_NOHUGEPAGE);
>> +}
>>
>> Cheers,
>> Wilco

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v5 1/1] Created tunable to force small pages on stack allocation.
  2023-04-13 15:43 [PATCH v5 1/1] " Wilco Dijkstra
@ 2023-04-13 16:23 ` Cupertino Miranda
  2023-04-13 17:48   ` Adhemerval Zanella Netto
  0 siblings, 1 reply; 22+ messages in thread
From: Cupertino Miranda @ 2023-04-13 16:23 UTC (permalink / raw)
  To: Wilco Dijkstra; +Cc: Adhemerval Zanella, 'GNU C Library'


Hi Wilco,

Exactly my remark on the patch. ;)

I think the tunable is beneficial when we care to allocate hugepages for
malloc, etc., but still want to be able to force small pages for the stack.

Imagine a scenario where you create lots of threads. Most threads
barely use any stack; however, there is one that somehow requires a lot
of it to do some crazy recursion. :)

Most likely the heuristic would detect that hugepages would be useful
based on the stack size requirement, but it would never predict that it
only brings any benefit to 1% of the threads created.

Regards,
Cupertino



Wilco Dijkstra writes:

> Hi Adhemerval,
>
> I agree doing this automatically sounds like a better solution.
> However:
>
> +static __always_inline int
> +advise_thp (void *mem, size_t size, size_t guardsize)
> +{
> +  enum malloc_thp_mode_t thpmode = __malloc_thp_mode ();
> +  if (thpmode != malloc_thp_mode_always)
> +    return 0;
> +
> +  unsigned long int thpsize = __malloc_default_thp_pagesize ();
> +  if ((uintptr_t) mem % thpsize != 0
> +      || size % thpsize != 0
> +      || (size - guardsize) % thpsize != 0)
> +    return 0;
>
> Isn't the last part always true currently given the guard page size is based on
> the standard page size? IIRC the issue was the mmap succeeds but the guard
> page is taken from the original mmap which then causes the decomposition.
>
> So you'd need something like:
>
> || guardsize % thpsize == 0)
>
> Ie. we return without the madvise if the size and alignment is wrong for a huge
> page or it is correct and the guardsize is a multiple of a huge page (in which
> case it shouldn't decompose).
>
> +  return __madvise (mem, size, MADV_NOHUGEPAGE);
> +}
>
> Cheers,
> Wilco

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH v5 1/1] Created tunable to force small pages on stack allocation.
@ 2023-04-13 15:43 Wilco Dijkstra
  2023-04-13 16:23 ` Cupertino Miranda
  0 siblings, 1 reply; 22+ messages in thread
From: Wilco Dijkstra @ 2023-04-13 15:43 UTC (permalink / raw)
  To: Adhemerval Zanella; +Cc: 'GNU C Library', Cupertino Miranda

Hi Adhemerval,

I agree doing this automatically sounds like a better solution.
However:

+static __always_inline int
+advise_thp (void *mem, size_t size, size_t guardsize)
+{
+  enum malloc_thp_mode_t thpmode = __malloc_thp_mode ();
+  if (thpmode != malloc_thp_mode_always)
+    return 0;
+
+  unsigned long int thpsize = __malloc_default_thp_pagesize ();
+  if ((uintptr_t) mem % thpsize != 0
+      || size % thpsize != 0
+      || (size - guardsize) % thpsize != 0)
+    return 0;

Isn't the last part always true currently given the guard page size is based on
the standard page size? IIRC the issue was the mmap succeeds but the guard
page is taken from the original mmap which then causes the decomposition.

So you'd need something like:

|| guardsize % thpsize == 0)

Ie. we return without the madvise if the size and alignment is wrong for a huge
page or it is correct and the guardsize is a multiple of a huge page (in which
case it shouldn't decompose). 

+  return __madvise (mem, size, MADV_NOHUGEPAGE);
+}
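
So the hunk would then read roughly like this (only a sketch of the
suggestion, reusing your names):

+  enum malloc_thp_mode_t thpmode = __malloc_thp_mode ();
+  if (thpmode != malloc_thp_mode_always)
+    return 0;
+
+  unsigned long int thpsize = __malloc_default_thp_pagesize ();
+  if ((uintptr_t) mem % thpsize != 0
+      || size % thpsize != 0
+      || guardsize % thpsize == 0)
+    return 0;
+
+  return __madvise (mem, size, MADV_NOHUGEPAGE);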

Cheers,
Wilco

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2023-04-14 23:11 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-03-28 15:22 [PATCH v5 0/1] *** Created tunable to force small pages on stack allocation Cupertino Miranda
2023-03-28 15:22 ` [PATCH v5 1/1] " Cupertino Miranda
2023-04-11 19:56   ` Adhemerval Zanella Netto
2023-04-12  8:53     ` Cupertino Miranda
2023-04-12 14:10       ` Adhemerval Zanella Netto
2023-04-13 16:13         ` Cupertino Miranda
2023-04-14 11:41       ` Adhemerval Zanella Netto
2023-04-14 12:27         ` Cupertino Miranda
2023-04-14 13:06           ` Adhemerval Zanella Netto
2023-04-14 14:33             ` Cupertino Miranda
2023-04-10  8:59 ` [PING] Re: [PATCH v5 0/1] *** " Cupertino Miranda
2023-04-13 15:43 [PATCH v5 1/1] " Wilco Dijkstra
2023-04-13 16:23 ` Cupertino Miranda
2023-04-13 17:48   ` Adhemerval Zanella Netto
2023-04-14 11:28     ` Cupertino Miranda
2023-04-14 13:24       ` Wilco Dijkstra
2023-04-14 14:49         ` Cupertino Miranda
2023-04-14 15:32           ` Wilco Dijkstra
2023-04-14 16:03             ` Wilco Dijkstra
2023-04-14 16:35               ` Cupertino Miranda
2023-04-14 23:10                 ` Wilco Dijkstra
2023-04-14 16:27             ` Cupertino Miranda

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).