Hi Florian,

>> Hi everyone,
>>
>> For performance purposes, one of our in-house applications requires the
>> TRANSPARENT_HUGEPAGES_ALWAYS option to be enabled in the Linux kernel,
>> which makes the kernel back all sufficiently large and aligned memory
>> allocations with huge pages. I believe the reason behind this decision
>> is to have more control over data location.
>>
>> For stack allocation, it seems that huge pages make the resident set
>> size (RSS) increase significantly, and without any apparent benefit,
>> as the huge page will be split into small pages even before leaving
>> glibc stack allocation code.
>>
>> As an example, this is what happens in the case of a pthread_create
>> with a 2 MB stack size:
>> 1. mmap request for the 2 MB allocation with PROT_NONE;
>>    a huge page is "registered" by the kernel.
>> 2. The thread descriptor is written at the end of the stack.
>>    This triggers a page fault in the kernel, which performs the actual
>>    memory allocation of the 2 MB.
>> 3. An mprotect changes the protection on the guard (one of the small
>>    pages of the allocated space): at this point the kernel needs to
>>    break the 2 MB page into many small pages in order to change the
>>    protection on that memory region.
>> This eliminates any benefit of having huge pages for stack allocation,
>> but it also makes RSS grow by 2 MB even though nothing was written to
>> most of the small pages.
>>
>> As an exercise I added __madvise(..., MADV_NOHUGEPAGE) right after the
>> __mmap in nptl/allocatestack.c. As expected, RSS was significantly
>> reduced for the application.
>
> Interesting.  I did not expect to get hugepages right out of mmap.  I
> would have expected subsequent coalescing by khugepaged, taking actual
> stack usage into account.  But over-allocating memory might be
> beneficial, see below.

It is probably not getting the huge pages right out of mmap; still, RSS
grows as if it did. (A standalone sketch reproducing the fault sequence
is at the end of this message.)

> (Something must be happening between step 1 & 2 to make the writes
> possible.)

You are right; I could have explained it better. There is a call to
setup_stack_prot that, I believe, changes the protection of the single
small page that holds the stack-related values. The writes happen right
after that, when the stack-related values are initialized. This is the
critical point where RSS grows by the huge page size.

>> In any case, I wonder if there is an actual use case where a huge page
>> would survive glibc stack allocation and bring an actual benefit.
>
> It can reduce TLB misses.  The first-level TLB might only have 64
> entries for 4K pages, for example.  If the working set on the stack
> (including the TCB) needs more than a couple of pages, it might be
> beneficial to use a 2M page and use just one TLB entry.

Indeed; it may only fail to make sense when (guardsize > 0), as is the
case in the example. I think that in this case you can never keep a huge
page, since the guard pages will be write-protected and therefore have
different protection from the rest of the stack pages, at least if you
don't plan to allocate more than two huge pages. I believe allocating
2M+4k was considered, but it made it hard to control data location.

> In your case, if your stacks are quite small, maybe you can just
> allocate slightly less than 2 MiB?
>
> The other question is whether the reported RSS is real, or if the
> kernel will recover zero stack pages on memory pressure.

That is a good point. I have no idea whether the kernel is able to
recover the zero stack pages in this particular case. Is there any way
to trigger such a recovery?
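For reference, the RSS growth is easy to reproduce outside of glibc.
Below is a minimal, hypothetical sketch (not the attached tststackalloc
program) that mimics the allocation sequence: it carves a 2 MiB-aligned
window out of a PROT_NONE mapping, makes it accessible, writes a single
small page the way nptl writes the thread descriptor, and reports the
RSS delta from /proc/self/statm. It assumes a 2 MiB huge page size and
THP set to "always"; uncommenting the madvise call should mirror the
patch at the end of this message.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

#define HUGE_SIZE (2UL * 1024 * 1024)  /* assumed huge page size */

/* Resident set size in small pages, from the second field of
   /proc/self/statm.  */
static long
rss_pages (void)
{
  long size, resident;
  FILE *f = fopen ("/proc/self/statm", "r");
  if (f == NULL || fscanf (f, "%ld %ld", &size, &resident) != 2)
    abort ();
  fclose (f);
  return resident;
}

int
main (void)
{
  /* Over-allocate so a 2 MiB-aligned window can be carved out, making
     the mapping eligible for a transparent huge page (step 1).  */
  char *raw = mmap (NULL, 2 * HUGE_SIZE, PROT_NONE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (raw == MAP_FAILED)
    abort ();
  char *mem = (char *) (((uintptr_t) raw + HUGE_SIZE - 1)
                        & ~(HUGE_SIZE - 1));

  /* Uncomment to opt out of THP, as the patch below does:
     madvise (mem, HUGE_SIZE, MADV_NOHUGEPAGE);  */

  long before = rss_pages ();

  /* Step 2: make the window accessible and touch one small page at its
     top, as nptl does for the thread descriptor.  With THP set to
     "always" this single write can fault in the whole 2 MiB.  */
  if (mprotect (mem, HUGE_SIZE, PROT_READ | PROT_WRITE) != 0)
    abort ();
  mem[HUGE_SIZE - 1] = 1;

  printf ("RSS grew by %ld small pages\n", rss_pages () - before);
  return 0;
}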
In our example (attached), there is a significant difference in the
reported RSS when we madvise the kernel. Reported RSS is collected from
/proc/self/statm.

# LD_LIBRARY_PATH=${HOME}/glibc_example/lib ./tststackalloc 1
Page size: 4 kB, 2 MB huge pages
Will attempt to align allocations to make stacks eligible for huge pages
pid: 2458323 (/proc/2458323/smaps)
Creating 128 threads...
RSS: 65888 pages (269877248 bytes = 257 MB)

After the madvise is added right before the writes to the stack-related
values (patch below):

# LD_LIBRARY_PATH=${HOME}/glibc_example/lib ./tststackalloc 1
Page size: 4 kB, 2 MB huge pages
Will attempt to align allocations to make stacks eligible for huge pages
pid: 2463199 (/proc/2463199/smaps)
Creating 128 threads...
RSS: 448 pages (1835008 bytes = 1 MB)

Thanks,
Cupertino

> Thanks,
> Florian

@@ -397,6 +397,7 @@ allocate_stack (const struct pthread_attr *attr, struct pthread **pdp,
 	    }
 	}
 
+      __madvise(mem, size, MADV_NOHUGEPAGE);
       /* Remember the stack-related values.  */
       pd->stackblock = mem;
       pd->stackblock_size = size;
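For completeness, here is one possible way to get a similar effect today
without patching glibc: allocate the thread stack in the application and
mark it MADV_NOHUGEPAGE before handing it to nptl via
pthread_attr_setstack. This is only an illustrative sketch, with the
caveats that glibc neither frees a caller-supplied stack nor installs a
guard page for it.

#include <pthread.h>
#include <stdio.h>
#include <sys/mman.h>

static void *
worker (void *arg)
{
  return arg;
}

int
main (void)
{
  size_t size = 2UL * 1024 * 1024;
  void *stack = mmap (NULL, size, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
  if (stack == MAP_FAILED)
    return 1;

  /* Opt this mapping out of THP before any page of it is faulted in.  */
  if (madvise (stack, size, MADV_NOHUGEPAGE) != 0)
    perror ("madvise");

  pthread_attr_t attr;
  pthread_attr_init (&attr);
  pthread_attr_setstack (&attr, stack, size);

  pthread_t thr;
  if (pthread_create (&thr, &attr, worker, NULL) != 0)
    return 1;
  pthread_join (thr, NULL);
  pthread_attr_destroy (&attr);
  return 0;
}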