From: Andrew Pinski
Date: Sat, 27 May 2023 16:30:50 -0700
Subject: Re: Who cares about performance (or Intel's CPU errata)?
To: Stefan Kanthak
Cc: gcc@gnu.org
References: <23A490318B7149D88618A7CDA2CEDB14@H270> <8DB226CF451A4430A8D7D5CBFE6B3972@H270>
In-Reply-To: <8DB226CF451A4430A8D7D5CBFE6B3972@H270>

On Sat, May 27, 2023 at 3:54 PM Stefan Kanthak wrote:
>
> "Andrew Pinski" wrote:
>
> > On Sat, May 27, 2023 at 2:25 PM Stefan Kanthak wrote:
> >>
> >> Just to show how
> >> SLOPPY, INCONSEQUENTIAL and INCOMPETENT GCC's developers are:
> >>
> >> --- dontcare.c ---
> >> int ispowerof2(unsigned __int128 argument) {
> >>     return __builtin_popcountll(argument) + __builtin_popcountll(argument >> 64) == 1;
> >> }
> >> --- EOF ---
> >>
> >> GCC 13.3    gcc -march=haswell -O3
> >>
> >> https://gcc.godbolt.org/z/PPzYsPzMc
> >> ispowerof2(unsigned __int128):
> >>         popcnt  rdi, rdi
> >>         popcnt  rsi, rsi
> >>         add     esi, edi
> >>         xor     eax, eax
> >>         cmp     esi, 1
> >>         sete    al
> >>         ret
> >>
> >> OOPS: what about Intel's CPU errata regarding the false dependency on POPCNTs output?
> >
> > Because the popcount is going to the same register, there is no false
> > dependency ....
> > The false dependency errata only applies if the result of the popcnt
> > is going to a different register, the processor thinks it depends on
> > the result in that register from a previous instruction but it does
> > not (which is why it is called a false dependency). In this case it
> > actually does depend on the previous result since the input is the
> > same as the output.
>
> OUCH, my fault; sorry for the confusion and the wrong accusation.
>
> Nevertheless GCC fails to optimise code properly:
>
> --- .c ---
> int ispowerof2(unsigned long long argument) {
>     return __builtin_popcountll(argument) == 1;
> }
> --- EOF ---
>
> GCC 13.3    gcc -m32 -mpopcnt -O3
>
> https://godbolt.org/z/fT7a7jP4e
> ispowerof2(unsigned long long):
>     xor    eax, eax
>     xor    edx, edx
>     popcnt eax, [esp+4]
>     popcnt edx, [esp+8]
>     add    eax, edx            # eax is less than 64!
>     cmp    eax, 1    ->    dec eax  # 2 bytes shorter

dec eax is already done for -Os. -O2 means performance; it does not
mean decreasing size. dec can be slower, as it can create a false
dependency, and it requires the eax register to not be live at the end
of the statement. And IIRC, for x86 decode, it could cause 2 (not 1)
micro-ops.

>     sete   al
>     movzx  eax, al             # superfluous

No, it is not superfluous. Well, OK, it is, because the contents of eax
(besides the lower 8 bits) are already zeroed, but keeping track of that
is a hard problem, and is a Turing problem really. And I suspect it
would cause another false dependency later on too.

For -Os -march=skylake (and -Oz instead of -Os) we get:
        popcnt  rdi, rdi
        popcnt  rsi, rsi
        add     esi, edi
        xor     eax, eax
        dec     esi
        sete    al

Which is exactly what you want, right?

Thanks,
Andrew

>     ret
>
> 5 bytes and 1 instruction saved; 5 bytes here and there accumulate to
> kilo- or even megabytes, and they can extend code to cross a cache line
> or a 16-byte alignment boundary.
>
> JFTR: same for "__builtin_popcount(argument) == 1;" and 32-bit argument
>
> JFTR: GCC is notorious for generating superfluous MOVZX instructions
>       where its optimiser SHOULD be able to see that the value is already
>       less than 256!
>
> Stefan
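
For reference, a minimal sketch of the function under discussion, next to
the classic bit-trick formulation. The bit trick is not from this thread;
it is shown only as a comparison point, and the names ispowerof2_popcnt,
ispowerof2_bits and check_agreement are made up for illustration. It
assumes GCC or Clang, since __builtin_popcountll is a compiler builtin.

```c
#include <assert.h>
#include <stddef.h>

/* Popcount form, as in the thread: true when exactly one bit is set. */
int ispowerof2_popcnt(unsigned long long x) {
    return __builtin_popcountll(x) == 1;
}

/* Comparison point (not from the thread): x & (x - 1) clears the
 * lowest set bit, so the result is zero only if at most one bit was
 * set; the x != 0 test excludes zero. */
int ispowerof2_bits(unsigned long long x) {
    return x != 0 && (x & (x - 1)) == 0;
}

/* Hypothetical self-check: both forms must agree on a few samples. */
void check_agreement(void) {
    const unsigned long long samples[] = {
        0ULL, 1ULL, 2ULL, 3ULL, 4ULL, 6ULL, 1ULL << 63, ~0ULL
    };
    for (size_t i = 0; i < sizeof samples / sizeof samples[0]; i++) {
        assert(ispowerof2_popcnt(samples[i]) == ispowerof2_bits(samples[i]));
    }
}
```

Both functions return the same value for every input, so comparing the
code generated for each (for example with -O2 versus -Os) is one way to
see which instruction sequence a given GCC version picks.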