Message-ID: <645838ea-afc0-0289-233e-50fa22f126c1@linaro.org>
Date: Wed, 12 Apr 2023 11:10:54 -0300
Subject: Re: [PATCH v5 1/1] Created tunable to force small pages on stack allocation.
From: Adhemerval Zanella Netto
Organization: Linaro
To: Cupertino Miranda
Cc: libc-alpha@sourceware.org, Florian Weimer, jose.marchesi@oracle.com, elena.zannoni@oracle.com
In-Reply-To: <873555cwh5.fsf@oracle.com>
References: <20230328152258.54844-1-cupertino.miranda@oracle.com> <20230328152258.54844-2-cupertino.miranda@oracle.com> <8f313a5d-f16a-d682-1d78-f216c446099f@linaro.org> <873555cwh5.fsf@oracle.com>

On 12/04/23 05:53, Cupertino Miranda wrote:
>>
>> So this patch is LGTM, and I will install it shortly.
>>
>> I also discussed on the same call whether it would be better to make the
>> madvise the *default* behavior if the pthread stack usage would always
>> end up requiring the kernel to split the THP back into default pages,
>> i.e.:
>>
>> 1. THP (/sys/kernel/mm/transparent_hugepage/enabled) is set to
>>    'always'.
>>
>> 2. The stack size is a multiple of the THP size
>>    (/sys/kernel/mm/transparent_hugepage/hpage_pmd_size).
>>
>> 3. The stack size minus the guard size is still a multiple of the THP
>>    size ((stack_size - guard_size) % thp_size == 0).
>>
>> This does not mean that the stack will automatically be backed by THP,
>> but it does mean that, depending on the process VMAs, it might generate
>> some RSS waste once the kernel decides to use THP for the stack.  It
>> should also make the tunable unnecessary.
>>
>> [1] https://sourceware.org/glibc/wiki/PatchworkReviewMeetings
>> [2] https://bugs.openjdk.org/browse/JDK-8303215?page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel&showAll=true
>> [3] https://lore.kernel.org/linux-mm/278ec047-4c5d-ab71-de36-094dbed4067c@redhat.com/T/

I implemented my idea above, which should cover the issue you brought up
without the need for the extra tunable.  If the kernel cannot keep track
of which 'subpages' of a THP have been touched once THP is used for the
stack allocation, it should always be an improvement to madvise
(MADV_NOHUGEPAGE).  What do you think?

---

[PATCH] nptl: Disable THP on thread stacks if it incurs large RSS usage

If Transparent Huge Pages (THP) are set to 'always' and the resulting
address and stack size are multiples of the THP size, the kernel may use
THP for the thread stack.  However, if the guard page size is not a
multiple of the THP size, once it is mprotect'ed the allocated range can
no longer be served by THP and the kernel reverts back to default page
sizes.

The kernel might also not keep track of which offsets within the THP have
been touched and need to stay resident in memory.  It will then keep all
the small pages, thus using much more memory than required.  In this
scenario, it is better to simply madvise that huge pages not be used and
avoid the memory bloat.

__malloc_default_thp_pagesize and __malloc_thp_mode now cache the value
they obtain, to avoid reading and parsing the kernel information on each
thread creation (if the system changes its setting, the process will not
be able to adjust to it).

Checked on x86_64-linux-gnu.
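
To make the RSS waste concrete, here is a small standalone sketch
(illustrative only, not part of the patch).  It assumes a 2 MiB
hpage_pmd_size and that the kernel happens to place the mapping
PMD-aligned; with THP set to 'always', touching a single byte can pull a
whole huge page into RSS, while MADV_NOHUGEPAGE keeps it to one small
page:

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

/* Read VmRSS (reported in kB) from /proc/self/status.  */
static long
rss_kb (void)
{
  FILE *f = fopen ("/proc/self/status", "r");
  if (f == NULL)
    return -1;
  char line[128];
  long kb = -1;
  while (fgets (line, sizeof line, f) != NULL)
    if (sscanf (line, "VmRSS: %ld kB", &kb) == 1)
      break;
  fclose (f);
  return kb;
}

int
main (void)
{
  /* Assumed PMD size; the real value comes from
     /sys/kernel/mm/transparent_hugepage/hpage_pmd_size.  */
  size_t thpsize = 2UL * 1024 * 1024;

  char *mem = mmap (NULL, thpsize, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (mem == MAP_FAILED)
    return 1;

  /* Uncomment to opt out of THP, as the patch does for thread stacks:
     madvise (mem, thpsize, MADV_NOHUGEPAGE);  */

  long before = rss_kb ();
  mem[0] = 1;   /* Touch a single byte of the mapping.  */
  long after = rss_kb ();

  /* With THP 'always' and a PMD-aligned mapping the delta tends to be
     ~2048 kB; with MADV_NOHUGEPAGE it is typically ~4 kB.  */
  printf ("RSS delta after touching 1 byte: %ld kB\n", after - before);

  munmap (mem, thpsize);
  return 0;
}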
---
 nptl/allocatestack.c                       | 32 +++++++++++++++
 sysdeps/generic/malloc-hugepages.h         |  1 +
 sysdeps/unix/sysv/linux/malloc-hugepages.c | 46 ++++++++++++++++++----
 3 files changed, 72 insertions(+), 7 deletions(-)

diff --git a/nptl/allocatestack.c b/nptl/allocatestack.c
index c7adbccd6f..d197edf2e9 100644
--- a/nptl/allocatestack.c
+++ b/nptl/allocatestack.c
@@ -33,6 +33,7 @@
 #include <nptl-stack.h>
 #include <libc-lock.h>
 #include <tls-internal.h>
+#include <malloc-hugepages.h>
 
 /* Default alignment of stack.  */
 #ifndef STACK_ALIGN
@@ -206,6 +207,33 @@ advise_stack_range (void *mem, size_t size, uintptr_t pd, size_t guardsize)
 #endif
 }
 
+/* If Transparent Huge Pages (THP) are set to 'always' and the resulting
+   address and stack size are multiples of the THP size, the kernel may use
+   THP for the thread stack.  However, if the guard page size is not a
+   multiple of the THP size, once it is mprotect'ed the allocated range can
+   no longer be served by THP and the kernel reverts to default page sizes.
+
+   The kernel might also not keep track of which offsets within the THP have
+   been touched and need to stay resident in memory.  It will then keep all
+   the small pages, thus using much more memory than required.  In this
+   scenario, it is better to simply madvise that huge pages not be used and
+   avoid the memory bloat.  */
+static __always_inline int
+advise_thp (void *mem, size_t size, size_t guardsize)
+{
+  enum malloc_thp_mode_t thpmode = __malloc_thp_mode ();
+  if (thpmode != malloc_thp_mode_always)
+    return 0;
+
+  unsigned long int thpsize = __malloc_default_thp_pagesize ();
+  if ((uintptr_t) mem % thpsize != 0
+      || size % thpsize != 0
+      || (size - guardsize) % thpsize != 0)
+    return 0;
+
+  return __madvise (mem, size, MADV_NOHUGEPAGE);
+}
+
 /* Returns a usable stack for a new thread either by allocating a
    new stack or reusing a cached stack of sufficient size.
    ATTR must be non-NULL and point to a valid pthread_attr.
@@ -373,6 +401,10 @@ allocate_stack (const struct pthread_attr *attr, struct pthread **pdp,
 	     So we can never get a null pointer back from mmap.  */
 	  assert (mem != NULL);
 
+	  int r = advise_thp (mem, size, guardsize);
+	  if (r != 0)
+	    return r;
+
 	  /* Place the thread descriptor at the end of the stack.  */
 #if TLS_TCB_AT_TP
 	  pd = (struct pthread *) ((((uintptr_t) mem + size)
diff --git a/sysdeps/generic/malloc-hugepages.h b/sysdeps/generic/malloc-hugepages.h
index d68b85630c..21d4844bc4 100644
--- a/sysdeps/generic/malloc-hugepages.h
+++ b/sysdeps/generic/malloc-hugepages.h
@@ -26,6 +26,7 @@ unsigned long int __malloc_default_thp_pagesize (void) attribute_hidden;
 
 enum malloc_thp_mode_t
 {
+  malloc_thp_mode_unknown,
   malloc_thp_mode_always,
   malloc_thp_mode_madvise,
   malloc_thp_mode_never,
diff --git a/sysdeps/unix/sysv/linux/malloc-hugepages.c b/sysdeps/unix/sysv/linux/malloc-hugepages.c
index 683d68c327..5954dd13f6 100644
--- a/sysdeps/unix/sysv/linux/malloc-hugepages.c
+++ b/sysdeps/unix/sysv/linux/malloc-hugepages.c
@@ -22,19 +22,33 @@
 #include <not-cancel.h>
 #include <sys/mman.h>
 
+/* __malloc_default_thp_pagesize is called only in single-thread mode,
+   either at malloc initialization or at pthread creation.  */
+static unsigned long int thp_pagesize = -1;
+
 unsigned long int
 __malloc_default_thp_pagesize (void)
 {
+  unsigned long int size = atomic_load_relaxed (&thp_pagesize);
+  if (size != -1)
+    return size;
+
   int fd = __open64_nocancel (
     "/sys/kernel/mm/transparent_hugepage/hpage_pmd_size", O_RDONLY);
   if (fd == -1)
-    return 0;
+    {
+      atomic_store_relaxed (&thp_pagesize, 0);
+      return 0;
+    }
 
   char str[INT_BUFSIZE_BOUND (unsigned long int)];
   ssize_t s = __read_nocancel (fd, str, sizeof (str));
   __close_nocancel (fd);
   if (s < 0)
-    return 0;
+    {
+      atomic_store_relaxed (&thp_pagesize, 0);
+      return 0;
+    }
 
   unsigned long int r = 0;
   for (ssize_t i = 0; i < s; i++)
@@ -44,16 +58,28 @@ __malloc_default_thp_pagesize (void)
       r *= 10;
       r += str[i] - '0';
     }
+  atomic_store_relaxed (&thp_pagesize, r);
   return r;
 }
 
+/* __malloc_thp_mode is called only in single-thread mode, either at
+   malloc initialization or at pthread creation.  */
+static enum malloc_thp_mode_t thp_mode = malloc_thp_mode_unknown;
+
 enum malloc_thp_mode_t
 __malloc_thp_mode (void)
 {
+  enum malloc_thp_mode_t mode = atomic_load_relaxed (&thp_mode);
+  if (mode != malloc_thp_mode_unknown)
+    return mode;
+
   int fd = __open64_nocancel ("/sys/kernel/mm/transparent_hugepage/enabled",
			      O_RDONLY);
   if (fd == -1)
-    return malloc_thp_mode_not_supported;
+    {
+      atomic_store_relaxed (&thp_mode, malloc_thp_mode_not_supported);
+      return malloc_thp_mode_not_supported;
+    }
 
   static const char mode_always[]  = "[always] madvise never\n";
   static const char mode_madvise[] = "always [madvise] never\n";
@@ -66,13 +92,19 @@ __malloc_thp_mode (void)
   if (s == sizeof (mode_always) - 1)
     {
       if (strcmp (str, mode_always) == 0)
-	return malloc_thp_mode_always;
+	mode = malloc_thp_mode_always;
       else if (strcmp (str, mode_madvise) == 0)
-	return malloc_thp_mode_madvise;
+	mode = malloc_thp_mode_madvise;
       else if (strcmp (str, mode_never) == 0)
-	return malloc_thp_mode_never;
+	mode = malloc_thp_mode_never;
+      else
+	mode = malloc_thp_mode_not_supported;
     }
-  return malloc_thp_mode_not_supported;
+  else
+    mode = malloc_thp_mode_not_supported;
+
+  atomic_store_relaxed (&thp_mode, mode);
+  return mode;
 }
 
 static size_t
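
As a usage note (a hedged sketch, not part of the patch): the new
advise_thp call sits in allocate_stack's mmap path, so it only covers
glibc-allocated stacks.  An application that provides its own stack via
pthread_attr_setstack can apply the equivalent opt-out by hand.  The
8 MiB size below is illustrative and assumed to be a multiple of the THP
size; build with -pthread:

#define _GNU_SOURCE
#include <pthread.h>
#include <stdio.h>
#include <sys/mman.h>

static void *
thread_fn (void *arg)
{
  return arg;
}

int
main (void)
{
  /* A stack size that is a multiple of the assumed 2 MiB THP size, the
     case where THP backing (and the RSS bloat) can kick in.  */
  size_t size = 8UL * 1024 * 1024;
  void *stack = mmap (NULL, size, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
  if (stack == MAP_FAILED)
    return 1;

  /* Manual equivalent of what advise_thp now does automatically for
     glibc-allocated stacks: opt the range out of THP up front.  */
  if (madvise (stack, size, MADV_NOHUGEPAGE) != 0)
    perror ("madvise");

  pthread_attr_t attr;
  pthread_attr_init (&attr);
  pthread_attr_setstack (&attr, stack, size);

  pthread_t thr;
  if (pthread_create (&thr, &attr, thread_fn, NULL) == 0)
    pthread_join (thr, NULL);

  pthread_attr_destroy (&attr);
  munmap (stack, size);   /* The caller owns a pthread_attr_setstack stack.  */
  return 0;
}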