From mboxrd@z Thu Jan 1 00:00:00 1970
From: Thiago Macieira
To: Thomas Rodgers
CC: Thomas Rodgers
Subject: Re: C++2a synchronisation inefficient in GCC 11
Date: Mon, 1 Mar 2021 11:11:56 -0800
Message-ID: <3732407.PpFURekcsd@tjmaciei-mobl1>
Organization: Intel Corporation
In-Reply-To: <18527452.32617548.1614623662482.JavaMail.zimbra@redhat.com>
References: <1968544.UC5HiB4uFJ@tjmaciei-mobl1> <7309459.cE9rBlv1QQ@tjmaciei-mobl1> <18527452.32617548.1614623662482.JavaMail.zimbra@redhat.com>
List-Id: Libstdc++ mailing list <libstdc++@gcc.gnu.org>

On Monday, 1 March 2021 10:34:22 PST Thomas Rodgers wrote:
> > I'm worried that it is benchmarking the wrong thing. Can we benchmark
> > latch and counting_semaphore instead? Those already track contention
> > on the waitable atomic by themselves.
>
> So is your concern here that e.g. latch::count_down() will do an extra
> atomic load?

I'm concerned we're violating "Don't Pay for What You Don't Need". Waiting
algorithms should be written to track the contention by themselves. If they
say they want to wake or wait, the runtime shouldn't second-guess them.

Take a look at how glibc's sem_post and sem_wait are implemented. On systems
with 64-bit atomics, they store the number of waiters in the high 32 bits;
on systems without, they store a single bit indicating whether there's
anyone waiting. The 64-bit code summarises to:

  void acquire()
  {
      // handle big-endian and the reinterpret_cast to get at the low half
      auto &low = low_half_by_ref();
      uint64_t d = _M_counter.fetch_add(1ULL << 32);
      for (;;) {
          if ((d & SEM_VALUE_MASK) == 0) {
              low.wait(d);
              d = _M_counter.load(std::memory_order_relaxed);
          } else {
              if (_M_counter.compare_exchange_strong(d, d - 1, ...))
                  break;
          }
      }
      _M_counter.fetch_sub(1ULL << 32);
  }

  void release(int n = 1)
  {
      // handle big-endian and the reinterpret_cast to get at the low half
      auto &low = low_half_by_ref();
      uint64_t d = _M_counter.fetch_add(n);
      if (d >> SEM_NWAITERS_SHIFT)
          if (n == 1)
              low.notify_one();
          else
              low.notify_all();     // ought to be "notify_many(n)"
  }

It's even simpler in the latch, where any value different from "the last to
arrive" is "contended".
  _GLIBCXX_ALWAYS_INLINE void
  count_down(ptrdiff_t __update = 1)
  {
      auto const __old = __atomic_impl::fetch_sub(&_M_a, __update,
                                                  memory_order::release);
      if (__old == __update)
          __atomic_impl::notify_all(&_M_a);
  }

There's no need to track contention again here. The only case where it
would be useful is if there's only one thread operating on this latch, but
if that's the case, then why are you using a latch in the first place?

And again, if that's an issue, then we can use the LSB or MSB to indicate
that there are waiters waiting (we don't need a count of waiters).

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel DPG Cloud Engineering