Re: [RFC] Stack allocation, hugepages and RSS implications

public inbox for libc-alpha@sourceware.org
 help / color / mirror / Atom feed

From: Adhemerval Zanella Netto <adhemerval.zanella@linaro.org>
To: Cupertino Miranda <cupertino.miranda@oracle.com>
Cc: libc-alpha@sourceware.org,
	"Jose E. Marchesi" <jose.marchesi@oracle.com>,
	Elena Zannoni <elena.zannoni@oracle.com>,
	Cupertino Miranda <cupertinomiranda@gmail.com>
Subject: Re: [RFC] Stack allocation, hugepages and RSS implications
Date: Thu, 9 Mar 2023 15:15:51 -0300	[thread overview]
Message-ID: <843f1062-562a-c455-c6b1-c767f2e4417f@linaro.org> (raw)
In-Reply-To: <87y1o53icm.fsf@oracle.com>



On 09/03/23 15:11, Cupertino Miranda wrote:
> 
> Adhemerval Zanella Netto writes:
> 
>> On 09/03/23 06:38, Cupertino Miranda wrote:
>>>
>>> Adhemerval Zanella Netto writes:
>>>
>>>> On 08/03/23 11:17, Cupertino Miranda via Libc-alpha wrote:
>>>>>
>>>>> Hi everyone,
>>>>>
>>>>> For performance purposes, one of ours in-house applications requires to enable
>>>>> TRANSPARENT_HUGEPAGES_ALWAYS option in linux kernel, actually making the
>>>>> kernel to force all of the big enough and alligned memory allocations to
>>>>> reside in hugepages.  I believe the reason behind this decision is to
>>>>> have more control on data location.
>>>>
>>>> He have, since 2.35, the glibc.malloc.hugetlb tunables, where setting to 1
>>>> enables MADV_HUGEPAGE madvise for mmap allocated pages if mode is set as
>>>> 'madvise' (/sys/kernel/mm/transparent_hugepage/enabled).  One option would
>>>> to use it instead of 'always' and use glibc.malloc.hugetlb=1.
>>>>
>>>> The main drawback of this strategy is this system wide setting, so it
>>>> might affect other user/programs as well.
>>>>
>>>>>
>>>>> For stack allocation, it seems that hugepages make resident set size
>>>>> (RSS) increase significantly, and without any apparent benefit, as the
>>>>> huge page will be split in small pages even before leaving glibc stack
>>>>> allocation code.
>>>>>
>>>>> As an example, this is what happens in case of a pthread_create with 2MB
>>>>> stack size:
>>>>>  1. mmap request for the 2MB allocation with PROT_NONE;
>>>>>       a huge page is "registered" by the kernel
>>>>>  2. the thread descriptor is writen in the end of the stack.
>>>>>       this will trigger a page exception in the kernel which will make the actual
>>>>>       memory allocation of the 2MB.
>>>>>  3. an mprotect changes protection on the guard (one of the small pages of the
>>>>>     allocated space):
>>>>>       at this point the kernel needs to break the 2MB page into many small pages
>>>>>       in order to change the protection on that memory region.
>>>>>       This will eliminate any benefit of having small pages for stack allocation,
>>>>>       but also makes RSS to be increaded by 2MB even though nothing was
>>>>>       written to most of the small pages.
>>>>>
>>>>> As an exercise I added __madvise(..., MADV_NOHUGEPAGE) right after
>>>>> the __mmap in nptl/allocatestack.c. As expected, RSS was significantly reduced for
>>>>> the application.
>>>>>
>>>>> At this point I am very much confident that there is a real benefit in our
>>>>> particular use case to enforce stacks not ever to use hugepages.
>>>>>
>>>>> This RFC is to understand if I have missed some option in glibc that would
>>>>> allow to better control stack allocation.
>>>>> If not, I am tempted to propose/submit a change, in the form of a tunable, to
>>>>> enforce NOHUGEPAGES for stacks.
>>>>>
>>>>> In any case, I wonder if there is an actual use case where an hugepage would
>>>>> survive glibc stack allocation and will bring an actual benefit.
>>>>>
>>>>> Looking forward for your comments.
>>>>
>>>> Maybe also a similar strategy on pthread stack allocation, where if transparent
>>>> hugepages is 'always' and glibc.malloc.hugetlb is 3 we set MADV_NOHUGEPAGE on
>>>> internal mmaps.  So value of '3' means disable THP, which might be confusing
>>>> but currently we have '0' as 'use system default'.  It can be also another
>>>> tunable, like glibc.hugetlb to decouple from malloc code.
>>>>
>>> The intent would not be to disable hugepages on all internal mmaps, as I
>>> think you said, but rather just do it for stack allocations.
>>> Although more work, I would say if we add this to a tunable then maybe
>>> we should move it from malloc namespace.
>>
>> I was thinking on mmap allocation where internal usage might trigger this
>> behavior.  If I understood what is happening, since the initial stack is
>> aligned to the hugepage size (assuming x86 2MB hugepage and 8MB default
>> stack size) and 'always' is set a the policy, the stack will be always
>> backed up by hugepages.  And then, when the guard page is set at
>> setup_stack_prot, it will force the kernel to split and move the stack
>> to default pages.
> Yes for the most part I think. Actually I think the kernel makes the
> split at the the first write.
> At the setup_stack_prot, it could sort of get to the conclusion that the
> pages would need to be split, but it does not do it. Only when the write
> and page exception occurs it realizes that it needs to split, and it
> materializes all of the pages as if the hugepage was already dirty.
> In my madvise experiments, only when I madvise after the write it gets
> RSS to bloat.

Yes, I expect that COW semantic will actually trigger the page migration.

> 
>> It seems to be a pthread specific problem, since I think alloc_new_heap
>> already mprotect if hugepage it is used.
>>
>> And I agree with Florian that backing up thread stack with hugepage it might
>> indeed reduce TLB misses.  However, if you want to optimize to RSS maybe you
>> can force the total thread stack size to not be multiple of hugepages:
> Considering the default 8MB stack size, there is nothing to think about,
> it definetely is a requirement.

The 8MB come in fact from ulimit -s, but I agree that my suggestion was more
like a hack.


>>
>> $ cat /sys/kernel/mm/transparent_hugepage/enabled
>> [always] madvise never
>> $ grep -w STACK_SIZE_TOTAL tststackalloc.c
>> #define STACK_SIZE_TOTAL (3 * (HUGE_PAGE_SIZE)) / 4
>>   size_t stack_size = STACK_SIZE_TOTAL;
>> $ ./testrun.sh ./tststackalloc 1
>> Page size: 4 kB, 2 MB huge pages
>> Will attempt to align allocations to make stacks eligible for huge pages
>> pid: 342503 (/proc/342503/smaps)
>> Creating 128 threads...
>> RSS: 537 pages (2199552 bytes = 2 MB)
>> Press enter to exit...
>>
>> $ ./testrun.sh ./tststackalloc 0
>> Page size: 4 kB, 2 MB huge pages
>> pid: 342641 (/proc/342641/smaps)
>> Creating 128 threads...
>> RSS: 536 pages (2195456 bytes = 2 MB)
>> Press enter to exit...
>>
>> But I think a tunable to force it for all stack sizes might be useful indeed.
>>
>>> If moving it out of malloc is not Ok for backcompatibility reasons, then
>>> I would say create a new tunable specific for the purpose, like
>>> glibc.stack_nohugetlb ?
>>
>> We don't enforce tunable compatibility, but we have the glibc.pthread namespace
>> already.  Maybe we can use glibc.pthread.stack_hugetlb, with 0 to use the default
>> and 1 to avoid by call mprotect (we might change this semantic).
> Will work on the patch right away. I would swap the 0 and the 1,
> otherwise it looks in reverse logic. 0 to enable and 1 to disable.

It works as well.

next prev parent reply	other threads:[~2023-03-09 18:15 UTC|newest]

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <87pm9j4azf.fsf@oracle.com>
2023-03-08 14:17 ` Cupertino Miranda
2023-03-08 14:53   ` Cristian Rodríguez
2023-03-08 15:12     ` Cupertino Miranda
2023-03-08 17:19   ` Adhemerval Zanella Netto
2023-03-09  9:38     ` Cupertino Miranda
2023-03-09 17:11       ` Adhemerval Zanella Netto
2023-03-09 18:11         ` Cupertino Miranda
2023-03-09 18:15           ` Adhemerval Zanella Netto [this message]
2023-03-09 19:01             ` Cupertino Miranda
2023-03-09 19:11               ` Adhemerval Zanella Netto
2023-03-09 10:54   ` Florian Weimer
2023-03-09 14:29     ` Cupertino Miranda

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=843f1062-562a-c455-c6b1-c767f2e4417f@linaro.org \
    --to=adhemerval.zanella@linaro.org \
    --cc=cupertino.miranda@oracle.com \
    --cc=cupertinomiranda@gmail.com \
    --cc=elena.zannoni@oracle.com \
    --cc=jose.marchesi@oracle.com \
    --cc=libc-alpha@sourceware.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).