Message-ID: <843f1062-562a-c455-c6b1-c767f2e4417f@linaro.org>
Date: Thu, 9 Mar 2023 15:15:51 -0300
Subject: Re: [RFC] Stack allocation, hugepages and RSS implications
From: Adhemerval Zanella Netto (Linaro)
To: Cupertino Miranda
Cc: libc-alpha@sourceware.org, "Jose E. Marchesi", Elena Zannoni, Cupertino Miranda
In-Reply-To: <87y1o53icm.fsf@oracle.com>
References: <87pm9j4azf.fsf@oracle.com> <87mt4n49ak.fsf@oracle.com> <06a84799-3a73-2bff-e157-281eed68febf@linaro.org> <87edpy464g.fsf@oracle.com> <8f22594a-145a-a358-7ae0-dbbe16d709e8@linaro.org> <87y1o53icm.fsf@oracle.com>

On 09/03/23 15:11, Cupertino Miranda wrote:
>
> Adhemerval Zanella Netto writes:
>
>> On 09/03/23 06:38, Cupertino Miranda wrote:
>>>
>>> Adhemerval Zanella Netto writes:
>>>
>>>> On 08/03/23 11:17, Cupertino Miranda via Libc-alpha wrote:
>>>>>
>>>>> Hi everyone,
>>>>>
>>>>> For performance reasons, one of our in-house applications requires the
>>>>> TRANSPARENT_HUGEPAGES_ALWAYS option to be enabled in the Linux kernel,
>>>>> which makes the kernel force all sufficiently large and aligned memory
>>>>> allocations to reside in hugepages. I believe the reason behind this
>>>>> decision is to have more control over data location.
>>>>
>>>> We have had, since 2.35, the glibc.malloc.hugetlb tunable, where setting
>>>> it to 1 enables the MADV_HUGEPAGE madvise for mmap-allocated pages if
>>>> the mode is set to 'madvise'
>>>> (/sys/kernel/mm/transparent_hugepage/enabled). One option would be to
>>>> use 'madvise' instead of 'always', together with glibc.malloc.hugetlb=1.
>>>>
>>>> The main drawback of your current strategy is that 'always' is a
>>>> system-wide setting, so it might affect other users/programs as well.
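
The per-process alternative described above can be sketched as follows. This is a sketch, not a tested recipe: `./your-app` is a placeholder binary name, and the tunable requires glibc >= 2.35.

```shell
# Switch the system THP policy from 'always' to 'madvise' (needs root):
#   echo madvise > /sys/kernel/mm/transparent_hugepage/enabled
# Then opt only this process's malloc mmap()s into hugepages via the
# glibc tunable; other programs on the system are unaffected.
GLIBC_TUNABLES=glibc.malloc.hugetlb=1 ./your-app
```

With 'madvise' as the system policy, only regions explicitly marked with MADV_HUGEPAGE are eligible for THP, so the setting no longer leaks into unrelated processes.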
>>>>
>>>>> For stack allocation, it seems that hugepages make the resident set
>>>>> size (RSS) increase significantly, and without any apparent benefit,
>>>>> since the huge page will be split into small pages even before leaving
>>>>> glibc's stack allocation code.
>>>>>
>>>>> As an example, this is what happens in the case of a pthread_create
>>>>> with a 2MB stack size:
>>>>> 1. mmap request for the 2MB allocation with PROT_NONE;
>>>>>    a huge page is "registered" by the kernel.
>>>>> 2. The thread descriptor is written at the end of the stack.
>>>>>    This triggers a page fault in the kernel, which performs the actual
>>>>>    memory allocation of the 2MB.
>>>>> 3. An mprotect changes the protection on the guard page (one of the
>>>>>    small pages of the allocated space): at this point the kernel needs
>>>>>    to break the 2MB page into many small pages in order to change the
>>>>>    protection on that memory region.
>>>>>    This eliminates any benefit of having huge pages for stack
>>>>>    allocation, but it also makes RSS increase by 2MB even though
>>>>>    nothing was written to most of the small pages.
>>>>>
>>>>> As an exercise I added __madvise(..., MADV_NOHUGEPAGE) right after the
>>>>> __mmap in nptl/allocatestack.c. As expected, RSS was significantly
>>>>> reduced for the application.
>>>>>
>>>>> At this point I am quite confident that, in our particular use case,
>>>>> there is a real benefit to enforcing that stacks never use hugepages.
>>>>>
>>>>> This RFC is to find out whether I have missed some option in glibc
>>>>> that would allow better control over stack allocation.
>>>>> If not, I am tempted to propose/submit a change, in the form of a
>>>>> tunable, to enforce NOHUGEPAGE for stacks.
>>>>>
>>>>> In any case, I wonder if there is an actual use case where a hugepage
>>>>> would survive glibc stack allocation and bring an actual benefit.
>>>>>
>>>>> Looking forward to your comments.
>>>>
>>>> Maybe also apply a similar strategy to pthread stack allocation, where
>>>> if transparent hugepages is 'always' and glibc.malloc.hugetlb is 3 we
>>>> set MADV_NOHUGEPAGE on internal mmaps. So a value of '3' would mean
>>>> "disable THP", which might be confusing, but currently we have '0' as
>>>> 'use system default'. It could also be another tunable, like
>>>> glibc.hugetlb, to decouple it from the malloc code.
>>>>
>>> The intent would not be to disable hugepages on all internal mmaps, as
>>> I think you said, but rather to do it just for stack allocations.
>>> Although it is more work, I would say that if we add this as a tunable
>>> then maybe we should move it out of the malloc namespace.
>>
>> I was thinking of the mmap allocations where internal usage might
>> trigger this behavior. If I understood what is happening, since the
>> initial stack is aligned to the hugepage size (assuming the x86 2MB
>> hugepage and the 8MB default stack size) and 'always' is set as the
>> policy, the stack will always be backed by hugepages. And then, when the
>> guard page is set at setup_stack_prot, it will force the kernel to split
>> and move the stack to default pages.
> Yes, for the most part I think so. Actually, I think the kernel makes
> the split at the first write.
> At setup_stack_prot it could, in principle, come to the conclusion that
> the pages will need to be split, but it does not do it. Only when the
> write and the page fault occur does it realize that it needs to split,
> and it materializes all of the pages as if the hugepage were already
> dirty.
> In my madvise experiments, RSS only bloats when I madvise after the
> write.

Yes, I expect that the COW semantics will actually trigger the page
migration.

>
>> It seems to be a pthread-specific problem, since I think alloc_new_heap
>> already calls mprotect if a hugepage is used.
>>
>> And I agree with Florian that backing thread stacks with hugepages
>> might indeed reduce TLB misses.
>> However, if you want to optimize for RSS, maybe you can force the total
>> thread stack size not to be a multiple of the hugepage size:
> Considering the default 8MB stack size, there is nothing to think about;
> it definitely is a requirement.

The 8MB in fact comes from ulimit -s, but I agree that my suggestion was
more of a hack.

>>
>> $ cat /sys/kernel/mm/transparent_hugepage/enabled
>> [always] madvise never
>> $ grep -w STACK_SIZE_TOTAL tststackalloc.c
>> #define STACK_SIZE_TOTAL (3 * (HUGE_PAGE_SIZE)) / 4
>>   size_t stack_size = STACK_SIZE_TOTAL;
>> $ ./testrun.sh ./tststackalloc 1
>> Page size: 4 kB, 2 MB huge pages
>> Will attempt to align allocations to make stacks eligible for huge pages
>> pid: 342503 (/proc/342503/smaps)
>> Creating 128 threads...
>> RSS: 537 pages (2199552 bytes = 2 MB)
>> Press enter to exit...
>>
>> $ ./testrun.sh ./tststackalloc 0
>> Page size: 4 kB, 2 MB huge pages
>> pid: 342641 (/proc/342641/smaps)
>> Creating 128 threads...
>> RSS: 536 pages (2195456 bytes = 2 MB)
>> Press enter to exit...
>>
>> But I think a tunable to force this for all stack sizes might indeed be
>> useful.
>>
>>> If moving it out of malloc is not OK for backward-compatibility
>>> reasons, then I would say create a new tunable specific to this
>>> purpose, like glibc.stack_nohugetlb?
>>
>> We don't enforce tunable compatibility, but we already have the
>> glibc.pthread namespace. Maybe we can use glibc.pthread.stack_hugetlb,
>> with 0 to use the default and 1 to avoid it by calling mprotect (we
>> might change this semantic).
> Will work on the patch right away. I would swap the 0 and the 1;
> otherwise it looks like reverse logic. 0 to enable and 1 to disable.

That works as well.
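
For completeness, the proposed knob would be used like any other tunable. Both the name glibc.pthread.stack_hugetlb and the meaning of its 0/1 values are still under discussion in this thread, so everything below is an assumption (it follows the swapped logic, where 0 disables hugepages for thread stacks), and `./your-threaded-app` is a placeholder binary name.

```shell
# Hypothetical invocation of the tunable discussed above:
# run the app with hugepages disabled for thread stacks only.
GLIBC_TUNABLES=glibc.pthread.stack_hugetlb=0 ./your-threaded-app
# Then compare RSS with and without the tunable, e.g.:
#   grep -m1 '^Rss:' /proc/<pid>/smaps_rollup
```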