From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <SRS0=qvbv=BR=nexgo.de=stefan.kanthak@sourceware.org>
Received: from eggs.gnu.org (eggs.gnu.org [IPv6:2001:470:142:3::10])
	by sourceware.org (Postfix) with ESMTPS id DD4273857718
	for <gcc@gcc.gnu.org>; Sun, 28 May 2023 07:51:56 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org DD4273857718
Authentication-Results: sourceware.org; dmarc=none (p=none dis=none) header.from=nexgo.de
Authentication-Results: sourceware.org; spf=fail smtp.mailfrom=nexgo.de
Received: from mr6.vodafonemail.de ([145.253.228.166])
	by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256)
	(Exim 4.90_1)
	(envelope-from <stefan.kanthak@nexgo.de>)
	id 1q3BBx-0002Ge-V4
	for gcc@gnu.org; Sun, 28 May 2023 03:51:56 -0400
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=nexgo.de;
	s=vfde-smtpout-mb-15sep; t=1685260310;
	bh=Q0vNbeYx/1xhnLb8yGut5fsAoxZS0CxSSpm9WBiawTI=;
	h=Message-ID:From:To:References:In-Reply-To:Subject:Date:
	 Content-Type:X-Mailer:From;
	b=SQUPKTqXzfzUbPeXWvU3htI8LQ7+BJ3P/ddeqpXOr087K61d/3Iw5yimYE+oQD47h
	 pkMhLtiA3KVA7qQUZg0HId2Cz5kzkqHGwP+d/DHxL0vdLd5HBdQ2xNd4e3BALtK5Rt
	 IcQbOmeZGzxSykOSxkGeh+io4J+aelDbyU2kyXzU=
Received: from smtp.vodafone.de (unknown [10.0.0.2])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
	 key-exchange X25519 server-signature RSA-PSS (2048 bits))
	(No client certificate requested)
	by mr6.vodafonemail.de (Postfix) with ESMTPS id 4QTW864hrqz1y2r;
	Sun, 28 May 2023 07:51:50 +0000 (UTC)
Received: from H270 (p5b38f631.dip0.t-ipconnect.de [91.56.246.49])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-SHA384 (256/256 bits))
	(No client certificate requested)
	by smtp.vodafone.de (Postfix) with ESMTPSA id 4QTW7z2XGjz9s9C;
	Sun, 28 May 2023 07:51:40 +0000 (UTC)
Message-ID: <866992E82E8B4D1E9B627478131B78C5@H270>
From: "Stefan Kanthak" <stefan.kanthak@nexgo.de>
To: "Andrew Pinski" <pinskia@gmail.com>
Cc: <gcc@gnu.org>
References: <23A490318B7149D88618A7CDA2CEDB14@H270> <CA+=Sn1nrDBd-LN-rY0E=a4Q_kRu5E0Gh4n-THjxhgvn3n2m0pQ@mail.gmail.com> <8DB226CF451A4430A8D7D5CBFE6B3972@H270> <CA+=Sn1nPyYf8s9Z8QtL3n_mZyH3f+xg3o5icHKMkUn_VRezkbQ@mail.gmail.com>
In-Reply-To: <CA+=Sn1nPyYf8s9Z8QtL3n_mZyH3f+xg3o5icHKMkUn_VRezkbQ@mail.gmail.com>
Subject: Re: Who cares about performance (or Intel's CPU errata)?
Date: Sun, 28 May 2023 09:47:31 +0200
Organization: Me, myself & IT
MIME-Version: 1.0
Content-Type: text/plain;
	charset="iso-8859-1"
Content-Transfer-Encoding: 7bit
X-Priority: 3
X-MSMail-Priority: Normal
X-Mailer: Microsoft Windows Mail 6.0.6002.18197
X-MimeOLE: Produced By Microsoft MimeOLE V6.1.7601.24158
X-purgate-type: clean
X-purgate: clean
X-purgate-size: 3728
X-purgate-ID: 155817::1685260306-3B7FD404-B975CCD8/0/0
Received-SPF: pass client-ip=145.253.228.166; envelope-from=stefan.kanthak@nexgo.de; helo=mr6.vodafonemail.de
X-Spam_score_int: -27
X-Spam_score: -2.8
X-Spam_bar: --
X-Spam_report: (-2.8 / 5.0 requ) BAYES_00=-1.9,DKIM_SIGNED=0.1,DKIM_VALID=-0.1,DKIM_VALID_AU=-0.1,DKIM_VALID_EF=-0.1,RCVD_IN_DNSWL_LOW=-0.7,SPF_HELO_NONE=0.001,SPF_PASS=-0.001,T_SCC_BODY_TEXT_LINE=-0.01 autolearn=ham autolearn_force=no
X-Spam_action: no action
X-Spam-Status: No, score=0.5 required=5.0 tests=BAYES_00,DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,LIKELY_SPAM_BODY,SPF_FAIL,SPF_HELO_PASS,TXREP,T_SCC_BODY_TEXT_LINE autolearn=no autolearn_force=no version=3.4.6
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org
List-Id: <gcc.gcc.gnu.org>

"Andrew Pinski" <pinskia@gmail.com> wrote:

> On Sat, May 27, 2023 at 3:54 PM Stefan Kanthak <stefan.kanthak@nexgo.de> wrote:

[...]

>> Nevertheless GCC fails to optimise code properly:
>>
>> --- .c ---
>> int ispowerof2(unsigned long long argument) {
>>     return __builtin_popcountll(argument) == 1;
>> }
>> --- EOF ---
>>
>> GCC 13.3    gcc -m32 -mpopcnt -O3
>>
>> https://godbolt.org/z/fT7a7jP4e
>> ispowerof2(unsigned long long):
>>         xor     eax, eax
>>         xor     edx, edx
>>         popcnt  eax, [esp+4]
>>         popcnt  edx, [esp+8]
>>         add     eax, edx                 # eax is less than 64!
>>         cmp     eax, 1    ->    dec eax  # 2 bytes shorter
>
> dec eax is done for -Os already. -O2 means performance, it does not
> mean decrease size. dec can be slower as it can create a false
> dependency and it requires eax register to be not alive at the end of
> the statement. and IIRC for x86 decode, it could cause 2 (not 1)
> micro-ops.

It CAN, it COULD, but it does NOT NEED to: it all depends on the target
processor. Shall I add an example with -march=<not affected processor>?

>>         sete    al

Depending on the target processor the partial register write of SETcc can
also harm the performance.
Did you forget to mention that too?

>>         movzx   eax, al                  # superfluous
>
> No it is not superfluous, well ok it is because of the context of eax
> (besides the lower 8 bits) are already zero'd

Correct.
The same holds for example for PMOVMSKB when the high(er) lane(s) of
the source [XYZ]MM register are (known to be) 0, for example after MOVQ;
that's what GCC also fails to track.

> but keeping that track is a hard problem and is turning problem really.

Aren't such problems just there to be solved?

> And I suspect it would cause another false dependency later on too.

All these quirks can be avoided with the following 6-byte code sequence
(same size as SETcc plus MOVZX) I used in one of my previous posts to
fold any non-zero value to 1:

        neg    eax
        sbb    eax, eax
        neg    eax

No partial register writes, no false dependencies, no INC/DEC subtleties.

JFTR: AMD documents that SBB with same destination and source is handled
      in the register renamer; I suspect Intel processors do it too,
      albeit not documented.
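
For illustration, a minimal C model of the NEG/SBB/NEG sequence above; it
only mirrors the architectural effect on EAX and the carry flag, not any
timing. The function name fold_nonzero is made up for this sketch.

```c
#include <stdint.h>

/* Hypothetical helper: models  neg eax / sbb eax, eax / neg eax
   on a 32-bit register; returns 1 for any non-zero input, else 0. */
static uint32_t fold_nonzero(uint32_t eax)
{
    uint32_t cf = (eax != 0);   /* neg eax sets CF iff the operand is non-zero */
    eax = 0u - eax;             /* neg eax: two's complement negation          */
    eax = eax - eax - cf;       /* sbb eax, eax: 0 or 0xFFFFFFFF (= -CF)       */
    eax = 0u - eax;             /* neg eax: 0 stays 0, 0xFFFFFFFF becomes 1    */
    return eax;
}
```

The value left by the first NEG is irrelevant; only the carry flag it
produces survives into the SBB.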

> For -Os -march=skylake (and -Oz instead of -Os) we get:
>        popcnt  rdi, rdi
>        popcnt  rsi, rsi
>        add     esi, edi
>        xor     eax, eax
>        dec     esi
>        sete    al
>
> Which is exactly what you want right?

Yes.
For -m32 -Os/-Oz, AND if CDQ breaks the dependency, it should be

         xor     eax, eax
         xor     edx, edx  ->    cdq      # 1 byte shorter
         popcnt  eax, [esp+4]
         popcnt  edx, [esp+8]
         add     eax, edx                 # eax is less than 64!
         cmp     eax, 1    ->    dec eax  # 2 bytes shorter
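
As a sketch of what this -m32 sequence computes, assuming GCC's
__builtin_popcount on a 32-bit unsigned int: the 64-bit argument is split
into two halves which are counted separately. The names model_cdq and
ispowerof2_split are made up for this illustration; model_cdq only shows
why CDQ right after XOR EAX, EAX leaves EDX = 0, it says nothing about
whether a given processor breaks the dependency.

```c
#include <stdint.h>

/* Hypothetical helper: cdq copies the sign bit of EAX into all bits of EDX. */
static uint32_t model_cdq(uint32_t eax)
{
    return (eax & 0x80000000u) ? 0xFFFFFFFFu : 0u;
}

/* Hypothetical helper: same result as ispowerof2(), computed on two halves. */
static int ispowerof2_split(uint64_t argument)
{
    uint32_t low  = (uint32_t)argument;          /* [esp+4] */
    uint32_t high = (uint32_t)(argument >> 32);  /* [esp+8] */
    unsigned count = (unsigned)__builtin_popcount(low)
                   + (unsigned)__builtin_popcount(high);  /* at most 64 */
    return count == 1;  /* cmp eax, 1 (or dec eax), then sete */
}
```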

On AMD64 DEC <r32> is a 2-byte instruction; the following alternative code
avoids its potential false dependency as well as other possible quirks,
and also suits -Ot, -O2 and -O3 on processors where the register renamer
handles the XOR:

        popcnt  rdi, rdi
        popcnt  rsi, rsi
        xor     eax, eax
        not     edi        # edi = -(edi + 1)
        sub     edi, esi   # edi = -(edi + 1 + esi)
        setz    al

For processors where the register renamer doesn't "execute" XOR, but MOV,
the following code is an alternative for -Ot, -O2 and -O3:

        popcnt  rdi, rdi
        popcnt  rsi, rsi
        mov     eax, edi
        add     eax, esi
        cmp     eax, 1
        setz    al
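
A minimal C model of this last sequence, assuming both inputs are
population counts (each at most 64): the sum fits into the low byte, so
the SETZ that overwrites AL leaves all of EAX equal to 0 or 1 without a
separate MOVZX. The function name sum_is_one is made up for this sketch.

```c
#include <stdint.h>

/* Hypothetical helper: models  mov eax, edi / add eax, esi /
   cmp eax, 1 / setz al  for inputs that are population counts. */
static uint32_t sum_is_one(uint32_t edi, uint32_t esi)
{
    uint32_t eax = edi + esi;          /* mov eax, edi ; add eax, esi      */
    uint32_t zf  = (eax == 1);         /* cmp eax, 1 sets ZF iff sum is 1  */
    eax = (eax & 0xFFFFFF00u) | zf;    /* setz al replaces only the low byte */
    return eax;                        /* upper 24 bits are 0 for sums < 256 */
}
```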

Stefan