From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <SRS0=Sizt=BP=gmail.com=nvinson234@sourceware.org>
Received: from mail-qk1-x72a.google.com (mail-qk1-x72a.google.com [IPv6:2607:f8b0:4864:20::72a])
	by sourceware.org (Postfix) with ESMTPS id 33E6C3858D39
	for <gcc@gcc.gnu.org>; Fri, 26 May 2023 11:34:29 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org 33E6C3858D39
Authentication-Results: sourceware.org; dmarc=pass (p=none dis=none) header.from=gmail.com
Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=gmail.com
Received: by mail-qk1-x72a.google.com with SMTP id af79cd13be357-75affb4d0f9so41422685a.2
        for <gcc@gcc.gnu.org>; Fri, 26 May 2023 04:34:29 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=gmail.com; s=20221208; t=1685100868; x=1687692868;
        h=content-transfer-encoding:in-reply-to:from:references:to
         :content-language:subject:user-agent:mime-version:date:message-id
         :from:to:cc:subject:date:message-id:reply-to;
        bh=yvqMXD+W0UeTnDzPnRAfXmvVdPftIoiGaw5devHK2KQ=;
        b=Jl6cl6pEEOh+bNIUHTHmYvTra4eeWl2cg//qNOorcNt7xoOFvGxSRoKAP/Fw2vf1oh
         yClVvpleR0I1KCZX+44RmIAX9M//cBFV+VJDy2bNcjFjQrXk/BvPX81kcghJ+u0K9m5M
         ZpC/hFN6O7vGz+HxkOWQJ0fdEMyj52+1lLEOXvDB9OkVnv3iV3dBWUmJUCreWjYrEspY
         +N6KUUfKgVYLRP28eEvswu3/Fe9Q/RrilGdyvyGuC6P+Alj75mJGM3JGyOYrdpJOvyt5
         hFbxGg8IV1iA270LCG8t+qup2mkLSqzsz09W4zF0gETdXQgqpErkvnPDCkzZ3BORxp4f
         ZqAQ==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20221208; t=1685100868; x=1687692868;
        h=content-transfer-encoding:in-reply-to:from:references:to
         :content-language:subject:user-agent:mime-version:date:message-id
         :x-gm-message-state:from:to:cc:subject:date:message-id:reply-to;
        bh=yvqMXD+W0UeTnDzPnRAfXmvVdPftIoiGaw5devHK2KQ=;
        b=SjUYgfn1ObdHyTLBNyIwWpvlDXBpbUbuGK/6eGb843AIfeqSQqgzIt5bRu4+q3kWOQ
         i9F7ApbsKEpFviHZlqAvR9WhXiGNHZRj17J1eMg9xqxei0u2U+BzY66G65cnSFdKnH6A
         EbTUYrnt2r1glx1fok4yiYLowS4gHTt7uEGKaB8EsK8UfsFbuzaszuh8UZjLxjognzCo
         mSTHb6xm/zzi59K23rOrnG3p9W5qN3hsLCF3xniHMKY/sWpV6+N5M3LcHtoHV3ZxdAu0
         0vzRPgZnSh3UR1QhvSEUhF6LE17jINxE9i6idU2NqVvIZcaloe7Q6QmX9tuJxZGrUgZR
         pP1w==
X-Gm-Message-State: AC+VfDzuB484UAAX2pY9bRZ0ltRjv45WJFxrE+91Aqx3+X7h2oh3B0DL
	XzdAHG2S8OVXvy9+rB1QSF8jmxdS6qA=
X-Google-Smtp-Source: ACHHUZ4+c3MHYjKGTZaOiO/ugLCMG84zms0EzITwmsN8fqG84Y15JgYU5e+HTm0OwkqMfrlMEUFcdA==
X-Received: by 2002:a05:620a:2157:b0:75b:23a0:e7ad with SMTP id m23-20020a05620a215700b0075b23a0e7admr1287650qkm.14.1685100867912;
        Fri, 26 May 2023 04:34:27 -0700 (PDT)
Received: from ?IPV6:2602:47:d92c:4400:ef17:5b26:13c3:7c6f? ([2602:47:d92c:4400:ef17:5b26:13c3:7c6f])
        by smtp.gmail.com with ESMTPSA id b8-20020a05620a126800b0074df2ac52f8sm1100318qkl.21.2023.05.26.04.34.27
        for <gcc@gcc.gnu.org>
        (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128);
        Fri, 26 May 2023 04:34:27 -0700 (PDT)
Message-ID: <c3537686-0832-ce3a-e54a-2eb936ff75df@gmail.com>
Date: Fri, 26 May 2023 07:34:26 -0400
MIME-Version: 1.0
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101
 Thunderbird/102.11.0
Subject: Re: Will GCC eventually support SSE2 or SSE4.1?
Content-Language: en-US
To: gcc@gcc.gnu.org
References: <51071A92918346ABBC6B5703179F5174@H270>
From: Nicholas Vinson <nvinson234@gmail.com>
In-Reply-To: <51071A92918346ABBC6B5703179F5174@H270>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
X-Spam-Status: No, score=0.0 required=5.0 tests=BAYES_00,BODY_8BITS,DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,FREEMAIL_ENVFROM_END_DIGIT,FREEMAIL_FROM,NICE_REPLY_A,RCVD_IN_DNSWL_NONE,SPF_HELO_NONE,SPF_PASS,TXREP,T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org
List-Id: <gcc.gcc.gnu.org>

On 5/26/23 02:46, Stefan Kanthak wrote:

> Hi,
>
> compile the following function on a system with Core2 processor
> (released January 2008) for the 32-bit execution environment:
>
> --- demo.c ---
> int ispowerof2(unsigned long long argument)
> {
>      return (argument & argument - 1) == 0;
> }
> --- EOF ---
>
> GCC 13.3: gcc -m32 -O3 demo.c
>
> NOTE: -mtune=native is the default!
>
> # https://godbolt.org/z/b43cjGdY9
> ispowerof2(unsigned long long):
>          movq    xmm1, [esp+4]
>          pcmpeqd xmm0, xmm0
>          paddq   xmm0, xmm1
>          pand    xmm0, xmm1
>          movd    edx, xmm0      #    pxor    xmm1, xmm1
>          psrlq   xmm0, 32       #    pcmpeqb xmm0, xmm1
>          movd    eax, xmm0      #    pmovmskb eax, xmm0
>          or      edx, eax       #    cmp     al, 255
>          sete    al             #    sete    al
>          movzx   eax, al        #
>          ret
>
> 11 instructions in 40 bytes # 10 instructions in 36 bytes 

You cannot delete the 'movzx eax, al' instruction. The expression
"(argument & argument - 1) == 0" must evaluate to 0 or 1, and 'sete al'
writes only the low byte of eax. The movzx is required to ensure that
the upper 24 bits of the eax register are zeroed.
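
To make the point concrete, here is a minimal sketch in C that models
the two instructions on a 32-bit value. The helper names model_sete_al
and model_movzx_eax_al are made up for this illustration; they only
mimic the register semantics and are not part of any real API.

```c
#include <assert.h>
#include <stdint.h>

/* The function under discussion, unchanged from the quoted demo.c. */
int ispowerof2(unsigned long long argument)
{
    return (argument & argument - 1) == 0;
}

/* Hypothetical model of 'sete al': only the low 8 bits of eax change,
   the upper 24 bits keep whatever was there before. */
static uint32_t model_sete_al(uint32_t eax, int zero_flag)
{
    return (eax & 0xFFFFFF00u) | (zero_flag ? 1u : 0u);
}

/* Hypothetical model of 'movzx eax, al': the low byte is kept and
   the upper 24 bits are cleared. */
static uint32_t model_movzx_eax_al(uint32_t eax)
{
    return eax & 0xFFu;
}
```

If I trace the proposed sequence correctly, pmovmskb leaves 0x0000FFFF
in eax when every byte compares equal. model_sete_al(0x0000FFFF, 1)
then yields 0x0000FF01, which a caller would not read as 1. Applying
model_movzx_eax_al to that value yields 1.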


> OOPS: why does GCC (ab)use the SSE2 alias "Willamette New Instruction Set"
>        here instead of the native SSE4.1 alias "Penryn New Instruction Set"
>        of the Core2 (and all later processors)?
>
> OUCH: why does it FAIL to REALLY use SSE2, as shown in the comments on the
> right side?

After correcting for the above error, your solution is the same length
(11 instructions) as the solution gcc generated. Therefore, the only
remaining question would be "Is your solution faster than the code gcc
produced?"

If you claim it is, I'd like to see evidence supporting that claim.

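One way to gather such evidence would be a small timing harness along
the following lines. This is only a sketch: the name
count_powers_timed is hypothetical, clock() is a coarse timer, and a
serious measurement would keep ispowerof2 in a separate translation
unit (or otherwise prevent inlining) and link each candidate
instruction sequence in turn.

```c
#include <stdint.h>
#include <time.h>

/* The function under discussion, unchanged from the quoted demo.c. */
int ispowerof2(unsigned long long argument)
{
    return (argument & argument - 1) == 0;
}

/* Hypothetical harness: calls ispowerof2 on 0 .. iterations-1, returns
   how many calls returned 1, and stores the elapsed CPU time in
   *seconds when that pointer is not NULL. */
unsigned long long count_powers_timed(unsigned long long iterations,
                                      double *seconds)
{
    /* volatile keeps the compiler from discarding the loop. */
    volatile unsigned long long hits = 0;
    clock_t start = clock();

    for (unsigned long long i = 0; i < iterations; ++i)
        hits += (unsigned long long)ispowerof2(i);

    clock_t stop = clock();
    if (seconds)
        *seconds = (double)(stop - start) / CLOCKS_PER_SEC;
    return hits;
}
```

The return value doubles as a sanity check that both variants compute
the same thing: for 1024 iterations it should be 11, namely 0 plus the
ten powers of two from 1 to 512.
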
> Now add the -mtune=core2 option to EXPLICITLY enable the NATIVE SSE4.1
> alias "Penryn New Instruction Set" of the Core2 processor:
>
> GCC 13.3: gcc -m32 -mtune=core2 -O3 demo.c
>
> # https://godbolt.org/z/svhEoYT11
> ispowerof2(unsigned long long):
>                                 #    xor      eax, eax
>          movq    xmm1, [esp+4]  #    movq     xmm1, [esp+4]
>          pcmpeqd xmm0, xmm0     #    pcmpeqq  xmm0, xmm0
>          paddq   xmm0, xmm1     #    paddq    xmm0, xmm1
>          pand    xmm0, xmm1     #    ptest    xmm0, xmm1
>          movd    edx, xmm0      #
>          psrlq   xmm0, 32       #
>          movd    eax, xmm0      #
>          or      edx, eax       #
>          sete    al             #    sete     al
>          movzx   eax, al        #
>          ret                    #    ret
>
> 11 instructions in 40 bytes    # 7 instructions in 26 bytes
>
> OUCH: GCC FAILS to use SSE4.1 as shown in the comments on the right side.
>        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

As pointed out elsewhere in this thread, you used the wrong flags. With 
the proper flags, I get

% gcc -march=x86-64 -msse4.1 -m32 -O3 -c ispowerof2.c && objdump -d ispowerof2.o


ispowerof2.o:     file format elf32-i386


Disassembly of section .text:

00000000 <ispowerof2>:
    0:   f3 0f 7e 4c 24 04       movq   0x4(%esp),%xmm1
    6:   66 0f 76 c0             pcmpeqd %xmm0,%xmm0
    a:   31 c0                   xor    %eax,%eax
    c:   66 0f d4 c1             paddq  %xmm1,%xmm0
   10:   66 0f db c1             pand   %xmm1,%xmm0
   14:   66 0f 6c c0             punpcklqdq %xmm0,%xmm0
   18:   66 0f 38 17 c0          ptest  %xmm0,%xmm0
   1d:   0f 94 c0                sete   %al
   20:   c3                      ret

So with just the SSE4.1 instruction set enabled, the output is 33 bytes
long (the final ret sits at offset 0x20).

> Last compile with -mtune=i386 for the i386 processor:
>
> GCC 13.3: gcc -m32 -mtune=i386 -O3 demo.c
>
> # https://godbolt.org/z/e76W6dsMj
> ispowerof2(unsigned long long):
>          push    ebx            #
>          mov     ecx, [esp+8]   #    mov    eax, [esp+4]
>          mov     ebx, [esp+12]  #    mov    edx, [esp+8]
>          mov     eax, ecx       #
>          mov     edx, ebx       #
>          add     eax, -1        #    add    eax, -1
>          adc     edx, -1        #    adc    edx, -1
>          and     eax, ecx       #    and    eax, [esp+4]
>          and     edx, ebx       #    and    edx, [esp+8]
>          or      eax, edx       #    or     eax, edx
>          sete    al             #    neg    eax
>          movzx   eax, al        #    sbb    eax, eax
>          pop     ebx            #    inc    eax
>          ret                    #    ret
>
> 14 instructions in 33 bytes    # 11 instructions in 32 bytes
>
> OUCH: why does GCC abuse EBX (and ECX too) and performs a superfluous
>        memory write?

At -O1 gcc produces:

% gcc -march=x86-64 -mtune=i386 -m32 -O -c ispowerof2.c && objdump -Mintel -d ispowerof2.o

ispowerof2.o:     file format elf32-i386


Disassembly of section .text:

00000000 <ispowerof2>:
    0:   8b 44 24 04             mov    eax,DWORD PTR [esp+0x4]
    4:   8b 54 24 08             mov    edx,DWORD PTR [esp+0x8]
    8:   83 c0 ff                add    eax,0xffffffff
    b:   83 d2 ff                adc    edx,0xffffffff
    e:   23 44 24 04             and    eax,DWORD PTR [esp+0x4]
   12:   23 54 24 08             and    edx,DWORD PTR [esp+0x8]
   16:   09 d0                   or     eax,edx
   18:   0f 94 c0                sete   al
   1b:   0f b6 c0                movzx  eax,al
   1e:   c3                      ret

which is 1 instruction and 1 byte shorter than your proposed solution.

However, at -O2 or -O3 it produces the code you mention above. The
reason for that is simple: reading from registers is faster than
reading from cache or RAM, and gcc takes advantage of that fact when
optimizing at -O2 or higher.

>
> Stefan Kanthak