malloc: Optimize the number of arenas for better application performance

public inbox for libc-alpha@sourceware.org
 help / color / mirror / Atom feed

* malloc: Optimize the number of arenas for better application performance
@ 2022-06-28  9:40 Yang Yanchao
  2022-06-28 11:18 ` Florian Weimer
                   ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: Yang Yanchao @ 2022-06-28  9:40 UTC (permalink / raw)
  To: libc-alpha
  Cc: adhemerval.zanella, glebfm, ldv, Carlos O'Donell, dj,
	siddhesh, linfeilong, liqingqing3

At Kunpeng920 platform, tpcc-mysql scores decreased by about 11.2% 
between glibc-2.36 and glibc2.28.
Comparing the code, I find that the two commits causes performance 
degradation.
11a02b035b46 (misc: Add __get_nprocs_sched)
97ba273b5057 (linux: __get_nprocs_sched: do not feed CPU_COUNT_S with 
garbage [BZ #28850])

These two patches modify the default behavior.
However, my machine is 96 cores and I have 91 cores bound.
It means that perhaps the current way of computing arenas is not optimal.
So I roll back some of the code submitted by 11a02b035(misc: Add 
__get_nprocs_sched).

---
  malloc/arena.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/malloc/arena.c b/malloc/arena.c
index 0a684a720d..a1ee7928d3 100644
--- a/malloc/arena.c
+++ b/malloc/arena.c
@@ -937,7 +937,7 @@ arena_get2 (size_t size, mstate avoid_arena)
              narenas_limit = mp_.arena_max;
            else if (narenas > mp_.arena_test)
              {
-              int n = __get_nprocs_sched ();
+              int n = __get_nprocs ();

                if (n >= 1)
                  narenas_limit = NARENAS_FROM_NCORES (n);
-- 
2.33.0

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: malloc: Optimize the number of arenas for better application performance
  2022-06-28  9:40 malloc: Optimize the number of arenas for better application performance Yang Yanchao
@ 2022-06-28 11:18 ` Florian Weimer
  2022-06-28 12:38   ` Siddhesh Poyarekar
  2022-06-28 13:35 ` Adhemerval Zanella
  2022-06-28 18:56 ` DJ Delorie
  2 siblings, 1 reply; 9+ messages in thread
From: Florian Weimer @ 2022-06-28 11:18 UTC (permalink / raw)
  To: Yang Yanchao via Libc-alpha; +Cc: Yang Yanchao, ldv, linfeilong, siddhesh

* Yang Yanchao via Libc-alpha:

> At Kunpeng920 platform, tpcc-mysql scores decreased by about 11.2% 
> between glibc-2.36 and glibc2.28.
> Comparing the code, I find that the two commits causes performance 
> degradation.
> 11a02b035b46 (misc: Add __get_nprocs_sched)
> 97ba273b5057 (linux: __get_nprocs_sched: do not feed CPU_COUNT_S with 
> garbage [BZ #28850])
>
> These two patches modify the default behavior.
> However, my machine is 96 cores and I have 91 cores bound.
> It means that perhaps the current way of computing arenas is not optimal.
> So I roll back some of the code submitted by 11a02b035(misc: Add 
> __get_nprocs_sched).
>
> ---
>   malloc/arena.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/malloc/arena.c b/malloc/arena.c
> index 0a684a720d..a1ee7928d3 100644
> --- a/malloc/arena.c
> +++ b/malloc/arena.c
> @@ -937,7 +937,7 @@ arena_get2 (size_t size, mstate avoid_arena)
>               narenas_limit = mp_.arena_max;
>             else if (narenas > mp_.arena_test)
>               {
> -              int n = __get_nprocs_sched ();
> +              int n = __get_nprocs ();
>
>                 if (n >= 1)
>                   narenas_limit = NARENAS_FROM_NCORES (n);

How many threads does tpcc-mysql create?

I wonder if all threads get their own arena with the larger count, and
there is some arena sharing with the smaller count.

Thanks,
Florian


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: malloc: Optimize the number of arenas for better application performance
  2022-06-28 11:18 ` Florian Weimer
@ 2022-06-28 12:38   ` Siddhesh Poyarekar
  0 siblings, 0 replies; 9+ messages in thread
From: Siddhesh Poyarekar @ 2022-06-28 12:38 UTC (permalink / raw)
  To: Florian Weimer, Yang Yanchao via Libc-alpha; +Cc: Yang Yanchao, ldv, linfeilong

On 28/06/2022 16:48, Florian Weimer wrote:
> * Yang Yanchao via Libc-alpha:
> 
>> At Kunpeng920 platform, tpcc-mysql scores decreased by about 11.2%
>> between glibc-2.36 and glibc2.28.
>> Comparing the code, I find that the two commits causes performance
>> degradation.
>> 11a02b035b46 (misc: Add __get_nprocs_sched)
>> 97ba273b5057 (linux: __get_nprocs_sched: do not feed CPU_COUNT_S with
>> garbage [BZ #28850])
>>
>> These two patches modify the default behavior.
>> However, my machine is 96 cores and I have 91 cores bound.
>> It means that perhaps the current way of computing arenas is not optimal.
>> So I roll back some of the code submitted by 11a02b035(misc: Add
>> __get_nprocs_sched).
>>
>> ---
>>    malloc/arena.c | 2 +-
>>    1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/malloc/arena.c b/malloc/arena.c
>> index 0a684a720d..a1ee7928d3 100644
>> --- a/malloc/arena.c
>> +++ b/malloc/arena.c
>> @@ -937,7 +937,7 @@ arena_get2 (size_t size, mstate avoid_arena)
>>                narenas_limit = mp_.arena_max;
>>              else if (narenas > mp_.arena_test)
>>                {
>> -              int n = __get_nprocs_sched ();
>> +              int n = __get_nprocs ();
>>
>>                  if (n >= 1)
>>                    narenas_limit = NARENAS_FROM_NCORES (n);
> 
> How many threads does tpcc-mysql create?
> 
> I wonder if all threads get their own arena with the larger count, and
> there is some arena sharing with the smaller count.

A simple test to determine this could be to repeat the test with 
different values of arena_max to see if there's a trend.

Siddhesh

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: malloc: Optimize the number of arenas for better application performance
  2022-06-28  9:40 malloc: Optimize the number of arenas for better application performance Yang Yanchao
  2022-06-28 11:18 ` Florian Weimer
@ 2022-06-28 13:35 ` Adhemerval Zanella
  2022-06-28 18:56 ` DJ Delorie
  2 siblings, 0 replies; 9+ messages in thread
From: Adhemerval Zanella @ 2022-06-28 13:35 UTC (permalink / raw)
  To: Yang Yanchao
  Cc: libc-alpha, glebfm, ldv, Carlos O'Donell, dj, siddhesh,
	linfeilong, liqingqing3



> On 28 Jun 2022, at 06:40, Yang Yanchao <yangyanchao6@huawei.com> wrote:
> 
> At Kunpeng920 platform, tpcc-mysql scores decreased by about 11.2% between glibc-2.36 and glibc2.28.
> Comparing the code, I find that the two commits causes performance degradation.
> 11a02b035b46 (misc: Add __get_nprocs_sched)
> 97ba273b5057 (linux: __get_nprocs_sched: do not feed CPU_COUNT_S with garbage [BZ #28850])
> 
> These two patches modify the default behavior.
> However, my machine is 96 cores and I have 91 cores bound.
> It means that perhaps the current way of computing arenas is not optimal.
> So I roll back some of the code submitted by 11a02b035(misc: Add __get_nprocs_sched).
> 
> ---
> malloc/arena.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/malloc/arena.c b/malloc/arena.c
> index 0a684a720d..a1ee7928d3 100644
> --- a/malloc/arena.c
> +++ b/malloc/arena.c
> @@ -937,7 +937,7 @@ arena_get2 (size_t size, mstate avoid_arena)
>             narenas_limit = mp_.arena_max;
>           else if (narenas > mp_.arena_test)
>             {
> -              int n = __get_nprocs_sched ();
> +              int n = __get_nprocs ();
> 
>               if (n >= 1)
>                 narenas_limit = NARENAS_FROM_NCORES (n);
> -- 
> 2.33.0

In fact 11a02b035b46 only changed __get_nprocs_sched to call __get_nproc, 
33099d72e41c was the one that actually changed __get_nproc to use 
sched_getaffinity.

I think it makes sense to get back to old behavior, since this change
was motivated mainly to avoid the malloc call in arena initialization.
The __get_nprocs does not call malloc through opendir anymore, so it should
be safe. 
 

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: malloc: Optimize the number of arenas for better application performance
  2022-06-28  9:40 malloc: Optimize the number of arenas for better application performance Yang Yanchao
  2022-06-28 11:18 ` Florian Weimer
  2022-06-28 13:35 ` Adhemerval Zanella
@ 2022-06-28 18:56 ` DJ Delorie
  2022-06-28 19:17   ` Adhemerval Zanella
  2 siblings, 1 reply; 9+ messages in thread
From: DJ Delorie @ 2022-06-28 18:56 UTC (permalink / raw)
  To: Yang Yanchao
  Cc: libc-alpha, adhemerval.zanella, glebfm, ldv, carlos, siddhesh,
	linfeilong, liqingqing3

Yang Yanchao <yangyanchao6@huawei.com> writes:
> However, my machine is 96 cores and I have 91 cores bound.

One benchmark on one uncommon configuration is not sufficient reason to
change a core tunable.  What about other platforms?  Other benchmarks?
Other percentages of cores scheduled?

I would reject this patch based solely on the lack of data backing up
your claims.

> -              int n = __get_nprocs_sched ();
> +              int n = __get_nprocs ();

I've heard complaints about how our code leads to hundreds of arenas on
processes scheduled on only two CPUs.  I think using the number of
*schedulable* cores makes more sense than using the number of *unusable*
cores.

I think this change warrants more research.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: malloc: Optimize the number of arenas for better application performance
  2022-06-28 18:56 ` DJ Delorie
@ 2022-06-28 19:17   ` Adhemerval Zanella
       [not found]     ` <1a8f10e034e7489c8e9f090e9c90b396@huawei.com>
  0 siblings, 1 reply; 9+ messages in thread
From: Adhemerval Zanella @ 2022-06-28 19:17 UTC (permalink / raw)
  To: DJ Delorie
  Cc: Yang Yanchao, libc-alpha, glebfm, ldv, carlos, siddhesh,
	linfeilong, liqingqing3



> On 28 Jun 2022, at 15:56, DJ Delorie <dj@redhat.com> wrote:
> 
> Yang Yanchao <yangyanchao6@huawei.com> writes:
>> However, my machine is 96 cores and I have 91 cores bound.
> 
> One benchmark on one uncommon configuration is not sufficient reason to
> change a core tunable.  What about other platforms?  Other benchmarks?
> Other percentages of cores scheduled?
> 
> I would reject this patch based solely on the lack of data backing up
> your claims.
> 
>> -              int n = __get_nprocs_sched ();
>> +              int n = __get_nprocs ();
> 
> I've heard complaints about how our code leads to hundreds of arenas on
> processes scheduled on only two CPUs.  I think using the number of
> *schedulable* cores makes more sense than using the number of *unusable*
> cores.
> 
> I think this change warrants more research.

I think this patch make sense mainly because we changed to use the 
schedulable cores without much though either.  Maybe we can revert
to previous semantic and investigate that using the schedulable 
number makes more sense.

^ permalink raw reply	[flat|nested] 9+ messages in thread

[parent not found: <1a8f10e034e7489c8e9f090e9c90b396@huawei.com>]

* Re: 转发: malloc: Optimize the number of arenas for better application performance
       [not found]     ` <1a8f10e034e7489c8e9f090e9c90b396@huawei.com>
@ 2022-06-29  2:37       ` Qingqing Li
  2022-06-29  5:25         ` Siddhesh Poyarekar
  0 siblings, 1 reply; 9+ messages in thread
From: Qingqing Li @ 2022-06-29  2:37 UTC (permalink / raw)
  To: Adhemerval Zanella, DJ Delorie, Yang Yanchao
  Cc: carlos, libc-alpha, glebfm, ldv, linfeilong, Qingqing Li

>> On 28 Jun 2022, at 15:56, DJ Delorie <dj@redhat.com> wrote:
>>
>> Yang Yanchao <yangyanchao6@huawei.com> writes:
>>> However, my machine is 96 cores and I have 91 cores bound.
>>
>> One benchmark on one uncommon configuration is not sufficient reason to
>> change a core tunable.  What about other platforms?  Other benchmarks?
>> Other percentages of cores scheduled?
>>
>> I would reject this patch based solely on the lack of data backing up
>> your claims.
>>
>>> -              int n = __get_nprocs_sched ();
>>> +              int n = __get_nprocs ();
>>
>> I've heard complaints about how our code leads to hundreds of arenas on
>> processes scheduled on only two CPUs.  I think using the number of
>> *schedulable* cores makes more sense than using the number of *unusable*
>> cores.
>>
>> I think this change warrants more research.
> 
> I think this patch make sense mainly because we changed to use the 
> schedulable cores without much though either.  Maybe we can revert
> to previous semantic and investigate that using the schedulable 
> number makes more sense.
> 
Agreed, the variable narenas_limit just Initialize once, if there has a scenario that we dynamic adjust the cpu affnity,
__get_nprocs_sched is not a good choice. my opinion first use  __get_nprocs as a static valule(the old default behavior),
and user use glibc.TUNABLE to adjust arana number.
Also, we can do more research and test to optimize the default arena number.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: 转发: malloc: Optimize the number of arenas for better application performance
  2022-06-29  2:37       ` 转发: " Qingqing Li
@ 2022-06-29  5:25         ` Siddhesh Poyarekar
  2022-06-29  8:05           ` [PATCH] malloc: Optimize the number of arenas for better application performance [BZ# 29296] Yang Yanchao
  0 siblings, 1 reply; 9+ messages in thread
From: Siddhesh Poyarekar @ 2022-06-29  5:25 UTC (permalink / raw)
  To: Qingqing Li, Adhemerval Zanella, DJ Delorie, Yang Yanchao
  Cc: libc-alpha, ldv, linfeilong

On 29/06/2022 08:07, Qingqing Li via Libc-alpha wrote:
>>> On 28 Jun 2022, at 15:56, DJ Delorie <dj@redhat.com> wrote:
>>>
>>> Yang Yanchao <yangyanchao6@huawei.com> writes:
>>>> However, my machine is 96 cores and I have 91 cores bound.
>>>
>>> One benchmark on one uncommon configuration is not sufficient reason to
>>> change a core tunable.  What about other platforms?  Other benchmarks?
>>> Other percentages of cores scheduled?
>>>
>>> I would reject this patch based solely on the lack of data backing up
>>> your claims.
>>>
>>>> -              int n = __get_nprocs_sched ();
>>>> +              int n = __get_nprocs ();
>>>
>>> I've heard complaints about how our code leads to hundreds of arenas on
>>> processes scheduled on only two CPUs.  I think using the number of
>>> *schedulable* cores makes more sense than using the number of *unusable*
>>> cores.
>>>
>>> I think this change warrants more research.

Agreed, but the Huawei system use case appears to suggest that at least 
that system prefers the superset, which is a data point against this 
change.  The original change[1] doesn't appear to provide any extensive 
research (as Adhemerval admitted below too), so that seems sufficient 
grounds to revert a change that has regressed performance.

Of course, if you or anyone else has data points that strongly suggest 
*better* overall performance with __get_nprocs_sched (I am aware of 
complaints about hundreds of arenas and address space usage and on many 
occasions in the past I've been able to successfully resolve them by 
showing that their code is nearly 20-30% faster because of that) then 
maybe there's scope to explore the possibility of having 
architecture-specific defaults.

>> I think this patch make sense mainly because we changed to use the
>> schedulable cores without much though either.  Maybe we can revert
>> to previous semantic and investigate that using the schedulable
>> number makes more sense.
>>
> Agreed, the variable narenas_limit just Initialize once, if there has a scenario that we dynamic adjust the cpu affnity,
> __get_nprocs_sched is not a good choice. my opinion first use  __get_nprocs as a static valule(the old default behavior),
> and user use glibc.TUNABLE to adjust arana number.
> Also, we can do more research and test to optimize the default arena number.

I'm not strongly opposed to reverting (see above for more nuance) but 
would like us to use this opportunity to at least track a project to 
improve the arenas heuristic.  Could you or Adhemerval please add a bug 
report on sourceware to study the performance impact of arena counts and 
their relationship to core counts on different architectures/systems?

Thanks,
Siddhesh

[1] 
https://patchwork.sourceware.org/project/glibc/patch/20210907122259.79800-2-adhemerval.zanella@linaro.org/

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] malloc: Optimize the number of arenas for better application performance [BZ# 29296]
  2022-06-29  5:25         ` Siddhesh Poyarekar
@ 2022-06-29  8:05           ` Yang Yanchao
  0 siblings, 0 replies; 9+ messages in thread
From: Yang Yanchao @ 2022-06-29  8:05 UTC (permalink / raw)
  To: Siddhesh Poyarekar, Qingqing Li, Adhemerval Zanella, DJ Delorie
  Cc: libc-alpha, ldv, linfeilong

On 2022/6/29 13:25, Siddhesh Poyarekar wrote:
> I'm not strongly opposed to reverting (see above for more nuance) but 
> would like us to use this opportunity to at least track a project to 
> improve the arenas heuristic.  Could you or Adhemerval please add a bug 
> report on sourceware to study the performance impact of arena counts and 
> their relationship to core counts on different architectures/systems?

Thanks to the attention paid to this issue, I created a bug report on 
sourceware(https://sourceware.org/bugzilla/show_bug.cgi?id=29296).

Based on the openEuler 2203 LTS, we tested in different application 
scenarios(ceph, Hbase, Hibench, spark sql, hive sql). Only tpcc-mysql 
has this performance degradation because only tpcc-mysql has core 
binding configured.

I'm using different arena counts and their relationship to core counts, 
based on x86_64 and aarch64. tpcc-mysql tests are slow to run and will 
take several days.

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2022-06-29  8:05 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-06-28  9:40 malloc: Optimize the number of arenas for better application performance Yang Yanchao
2022-06-28 11:18 ` Florian Weimer
2022-06-28 12:38   ` Siddhesh Poyarekar
2022-06-28 13:35 ` Adhemerval Zanella
2022-06-28 18:56 ` DJ Delorie
2022-06-28 19:17   ` Adhemerval Zanella
     [not found]     ` <1a8f10e034e7489c8e9f090e9c90b396@huawei.com>
2022-06-29  2:37       ` 转发: " Qingqing Li
2022-06-29  5:25         ` Siddhesh Poyarekar
2022-06-29  8:05           ` [PATCH] malloc: Optimize the number of arenas for better application performance [BZ# 29296] Yang Yanchao

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).