Message-ID: <843f1062-562a-c455-c6b1-c767f2e4417f@linaro.org>
Date: Thu, 9 Mar 2023 15:15:51 -0300
Subject: Re: [RFC] Stack allocation, hugepages and RSS implications
From: Adhemerval Zanella Netto (Linaro)
To: Cupertino Miranda
Cc: libc-alpha@sourceware.org, "Jose E. Marchesi", Elena Zannoni, Cupertino Miranda
In-Reply-To: <87y1o53icm.fsf@oracle.com>
References: <87pm9j4azf.fsf@oracle.com> <87mt4n49ak.fsf@oracle.com> <06a84799-3a73-2bff-e157-281eed68febf@linaro.org> <87edpy464g.fsf@oracle.com> <8f22594a-145a-a358-7ae0-dbbe16d709e8@linaro.org> <87y1o53icm.fsf@oracle.com>

On 09/03/23 15:11, Cupertino Miranda wrote:
>
> Adhemerval Zanella Netto writes:
>
>> On 09/03/23 06:38, Cupertino Miranda wrote:
>>>
>>> Adhemerval Zanella Netto writes:
>>>
>>>> On 08/03/23 11:17, Cupertino Miranda via Libc-alpha wrote:
>>>>>
>>>>> Hi everyone,
>>>>>
>>>>> For performance reasons, one of our in-house applications requires the
>>>>> TRANSPARENT_HUGEPAGES_ALWAYS option to be enabled in the Linux kernel,
>>>>> which makes the kernel force all sufficiently large and aligned memory
>>>>> allocations to reside in hugepages. I believe the reason behind this
>>>>> decision is to have more control over data location.
>>>>
>>>> We have had, since 2.35, the glibc.malloc.hugetlb tunable, where setting
>>>> it to 1 enables the MADV_HUGEPAGE madvise for mmap-allocated pages if
>>>> the mode is set to 'madvise'
>>>> (/sys/kernel/mm/transparent_hugepage/enabled). One option would be to
>>>> use 'madvise' instead of 'always', together with glibc.malloc.hugetlb=1.
>>>>
>>>> The main drawback of your current strategy is that 'always' is a
>>>> system-wide setting, so it might affect other users/programs as well.
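
The per-process alternative described above can be sketched as follows. This is a sketch, not a tested recipe: `./your-app` is a placeholder binary name, and the tunable requires glibc >= 2.35.

```shell
# Switch the system THP policy from 'always' to 'madvise' (needs root):
#   echo madvise > /sys/kernel/mm/transparent_hugepage/enabled
# Then opt only this process's malloc mmap()s into hugepages via the
# glibc tunable; other programs on the system are unaffected.
GLIBC_TUNABLES=glibc.malloc.hugetlb=1 ./your-app
```

With 'madvise' as the system policy, only regions explicitly marked with MADV_HUGEPAGE are eligible for THP, so the setting no longer leaks into unrelated processes.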
>>>>
>>>>> For stack allocation, it seems that hugepages make the resident set
>>>>> size (RSS) increase significantly, and without any apparent benefit,
>>>>> since the huge page will be split into small pages even before leaving
>>>>> glibc's stack allocation code.
>>>>>
>>>>> As an example, this is what happens in the case of a pthread_create
>>>>> with a 2MB stack size:
>>>>> 1. mmap request for the 2MB allocation with PROT_NONE;
>>>>>    a huge page is "registered" by the kernel.
>>>>> 2. The thread descriptor is written at the end of the stack.
>>>>>    This triggers a page fault in the kernel, which performs the actual
>>>>>    memory allocation of the 2MB.
>>>>> 3. An mprotect changes the protection on the guard page (one of the
>>>>>    small pages of the allocated space): at this point the kernel needs
>>>>>    to break the 2MB page into many small pages in order to change the
>>>>>    protection on that memory region.
>>>>>    This eliminates any benefit of having huge pages for stack
>>>>>    allocation, but it also makes RSS increase by 2MB even though
>>>>>    nothing was written to most of the small pages.
>>>>>
>>>>> As an exercise I added __madvise(..., MADV_NOHUGEPAGE) right after the
>>>>> __mmap in nptl/allocatestack.c. As expected, RSS was significantly
>>>>> reduced for the application.
>>>>>
>>>>> At this point I am quite confident that, in our particular use case,
>>>>> there is a real benefit to enforcing that stacks never use hugepages.
>>>>>
>>>>> This RFC is to find out whether I have missed some option in glibc
>>>>> that would allow better control over stack allocation.
>>>>> If not, I am tempted to propose/submit a change, in the form of a
>>>>> tunable, to enforce NOHUGEPAGE for stacks.
>>>>>
>>>>> In any case, I wonder if there is an actual use case where a hugepage
>>>>> would survive glibc stack allocation and bring an actual benefit.
>>>>>
>>>>> Looking forward to your comments.
>>>>
>>>> Maybe also apply a similar strategy to pthread stack allocation, where
>>>> if transparent hugepages is 'always' and glibc.malloc.hugetlb is 3 we
>>>> set MADV_NOHUGEPAGE on internal mmaps. So a value of '3' would mean
>>>> "disable THP", which might be confusing, but currently we have '0' as
>>>> 'use system default'. It could also be another tunable, like
>>>> glibc.hugetlb, to decouple it from the malloc code.
>>>>
>>> The intent would not be to disable hugepages on all internal mmaps, as
>>> I think you said, but rather to do it just for stack allocations.
>>> Although it is more work, I would say that if we add this as a tunable
>>> then maybe we should move it out of the malloc namespace.
>>
>> I was thinking of the mmap allocations where internal usage might
>> trigger this behavior. If I understood what is happening, since the
>> initial stack is aligned to the hugepage size (assuming the x86 2MB
>> hugepage and the 8MB default stack size) and 'always' is set as the
>> policy, the stack will always be backed by hugepages. And then, when the
>> guard page is set at setup_stack_prot, it will force the kernel to split
>> and move the stack to default pages.
> Yes, for the most part I think so. Actually, I think the kernel makes
> the split at the first write.
> At setup_stack_prot it could, in principle, come to the conclusion that
> the pages will need to be split, but it does not do it. Only when the
> write and the page fault occur does it realize that it needs to split,
> and it materializes all of the pages as if the hugepage were already
> dirty.
> In my madvise experiments, RSS only bloats when I madvise after the
> write.

Yes, I expect that the COW semantics will actually trigger the page
migration.

>
>> It seems to be a pthread-specific problem, since I think alloc_new_heap
>> already calls mprotect if a hugepage is used.
>>
>> And I agree with Florian that backing thread stacks with hugepages
>> might indeed reduce TLB misses.
>> However, if you want to optimize for RSS, maybe you can force the total
>> thread stack size not to be a multiple of the hugepage size:
> Considering the default 8MB stack size, there is nothing to think about;
> it definitely is a requirement.

The 8MB in fact comes from ulimit -s, but I agree that my suggestion was
more of a hack.

>>
>> $ cat /sys/kernel/mm/transparent_hugepage/enabled
>> [always] madvise never
>> $ grep -w STACK_SIZE_TOTAL tststackalloc.c
>> #define STACK_SIZE_TOTAL (3 * (HUGE_PAGE_SIZE)) / 4
>>   size_t stack_size = STACK_SIZE_TOTAL;
>> $ ./testrun.sh ./tststackalloc 1
>> Page size: 4 kB, 2 MB huge pages
>> Will attempt to align allocations to make stacks eligible for huge pages
>> pid: 342503 (/proc/342503/smaps)
>> Creating 128 threads...
>> RSS: 537 pages (2199552 bytes = 2 MB)
>> Press enter to exit...
>>
>> $ ./testrun.sh ./tststackalloc 0
>> Page size: 4 kB, 2 MB huge pages
>> pid: 342641 (/proc/342641/smaps)
>> Creating 128 threads...
>> RSS: 536 pages (2195456 bytes = 2 MB)
>> Press enter to exit...
>>
>> But I think a tunable to force this for all stack sizes might indeed be
>> useful.
>>
>>> If moving it out of malloc is not OK for backward-compatibility
>>> reasons, then I would say create a new tunable specific to this
>>> purpose, like glibc.stack_nohugetlb?
>>
>> We don't enforce tunable compatibility, but we already have the
>> glibc.pthread namespace. Maybe we can use glibc.pthread.stack_hugetlb,
>> with 0 to use the default and 1 to avoid it by calling mprotect (we
>> might change this semantic).
> Will work on the patch right away. I would swap the 0 and the 1;
> otherwise it looks like reverse logic. 0 to enable and 1 to disable.

That works as well.
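
For completeness, the proposed knob would be used like any other tunable. Both the name glibc.pthread.stack_hugetlb and the meaning of its 0/1 values are still under discussion in this thread, so everything below is an assumption (it follows the swapped logic, where 0 disables hugepages for thread stacks), and `./your-threaded-app` is a placeholder binary name.

```shell
# Hypothetical invocation of the tunable discussed above:
# run the app with hugepages disabled for thread stacks only.
GLIBC_TUNABLES=glibc.pthread.stack_hugetlb=0 ./your-threaded-app
# Then compare RSS with and without the tunable, e.g.:
#   grep -m1 '^Rss:' /proc/<pid>/smaps_rollup
```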