Re: Will GCC eventually support SSE2 or SSE4.1?

public inbox for gcc@gcc.gnu.org

From: Nicholas Vinson <nvinson234@gmail.com>
To: gcc@gcc.gnu.org
Subject: Re: Will GCC eventually support SSE2 or SSE4.1?
Date: Fri, 26 May 2023 07:34:26 -0400
Message-ID: <c3537686-0832-ce3a-e54a-2eb936ff75df@gmail.com>
In-Reply-To: <51071A92918346ABBC6B5703179F5174@H270>

On 5/26/23 02:46, Stefan Kanthak wrote:

> Hi,
>
> compile the following function on a system with Core2 processor
> (released January 2008) for the 32-bit execution environment:
>
> --- demo.c ---
> int ispowerof2(unsigned long long argument)
> {
>      return (argument & argument - 1) == 0;
> }
> --- EOF ---
>
> GCC 13.3: gcc -m32 -O3 demo.c
>
> NOTE: -mtune=native is the default!
>
> # https://godbolt.org/z/b43cjGdY9
> ispowerof2(unsigned long long):
>          movq    xmm1, [esp+4]
>          pcmpeqd xmm0, xmm0
>          paddq   xmm0, xmm1
>          pand    xmm0, xmm1
>          movd    edx, xmm0      #    pxor    xmm1, xmm1
>          psrlq   xmm0, 32       #    pcmpeqb xmm0, xmm1
>          movd    eax, xmm0      #    pmovmskb eax, xmm0
>          or      edx, eax       #    cmp     al, 255
>          sete    al             #    sete    al
>          movzx   eax, al        #
>          ret
>
> 11 instructions in 40 bytes # 10 instructions in 36 bytes 

You cannot delete the 'movzx eax, al' instruction. The expression
"(argument & argument - 1) == 0" must evaluate to 0 or 1, and sete
writes only the low 8 bits of eax. The movzx is required to ensure that
the upper 24 bits of the eax register are properly zeroed.
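
To make the point concrete, here is a minimal C model of the two
instructions. It is only a sketch: the helper names (sete_al,
movzx_eax_al, result_in_eax) are hypothetical, and the stale register
value stands in for whatever an earlier instruction may have left in
the upper bits of eax.

```c
#include <stdint.h>

/* Model of `sete al`: writes 1 or 0 into the low 8 bits only;
   the upper 24 bits of the register keep their previous contents. */
static uint32_t sete_al(uint32_t eax, int zero_flag)
{
    return (eax & 0xFFFFFF00u) | (zero_flag ? 1u : 0u);
}

/* Model of `movzx eax, al`: keeps the low 8 bits, clears the rest. */
static uint32_t movzx_eax_al(uint32_t eax)
{
    return eax & 0xFFu;
}

/* Value the caller would see in eax, with or without the trailing
   movzx. stale_eax is a hypothetical leftover register value. */
uint32_t result_in_eax(uint32_t stale_eax, int zero_flag, int with_movzx)
{
    uint32_t eax = sete_al(stale_eax, zero_flag);
    return with_movzx ? movzx_eax_al(eax) : eax;
}
```

With a stale value of 0xFFFF and the zero flag set, this model yields
0xFF01 without the movzx and 1 with it, so only the second variant
returns a valid C truth value.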


> OOPS: why does GCC (ab)use the SSE2 alias "Willamette New Instruction Set"
>        here instead of the native SSE4.1 alias "Penryn New Instruction Set"
>        of the Core2 (and all later processors)?
>
> OUCH: why does it FAIL to REALLY use SSE2, as shown in the comments on the
> right side?

After correcting for the above error, your solution has the same number
of instructions as the code gcc generated (11) and is one byte shorter
(39 bytes versus 40). Therefore, the only remaining question would be
"Is your solution faster than the code gcc produced?"

If you claim it is, I'd like to see evidence supporting that claim.

> Now add the -mtune=core2 option to EXPLICITLY enable the NATIVE SSE4.1
> alias "Penryn New Instruction Set" of the Core2 processor:
>
> GCC 13.3: gcc -m32 -mtune=core2 -O3 demo.c
>
> # https://godbolt.org/z/svhEoYT11
> ispowerof2(unsigned long long):
>                                 #    xor      eax, eax
>          movq    xmm1, [esp+4]  #    movq     xmm1, [esp+4]
>          pcmpeqd xmm0, xmm0     #    pcmpeqq  xmm0, xmm0
>          paddq   xmm0, xmm1     #    paddq    xmm0, xmm1
>          pand    xmm0, xmm1     #    ptest    xmm0, xmm1
>          movd    edx, xmm0      #
>          psrlq   xmm0, 32       #
>          movd    eax, xmm0      #
>          or      edx, eax       #
>          sete    al             #    sete     al
>          movzx   eax, al        #
>          ret                    #    ret
>
> 11 instructions in 40 bytes    # 7 instructions in 26 bytes
>
> OUCH: GCC FAILS to use SSE4.1 as shown in the comments on the right side.
>        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

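As pointed out elsewhere in this thread, you used the wrong flags:
-mtune only selects the processor to tune for and does not enable any
instruction set extension. With flags that actually enable SSE4.1, I get

% gcc -march=x86-64 -msse4.1 -m32 -O3 -c ispowerof2.c && objdump -d ispowerof2.o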


ispowerof2.o:     file format elf32-i386


Disassembly of section .text:

00000000 <ispowerof2>:
    0:   f3 0f 7e 4c 24 04       movq   0x4(%esp),%xmm1
    6:   66 0f 76 c0             pcmpeqd %xmm0,%xmm0
    a:   31 c0                   xor    %eax,%eax
    c:   66 0f d4 c1             paddq  %xmm1,%xmm0
   10:   66 0f db c1             pand   %xmm1,%xmm0
   14:   66 0f 6c c0             punpcklqdq %xmm0,%xmm0
   18:   66 0f 38 17 c0          ptest  %xmm0,%xmm0
   1d:   0f 94 c0                sete   %al
   20:   c3                      ret

so with just the SSE4.1 instruction set enabled the output is 33 bytes
long (offsets 0x00 through 0x20).
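
One way to check which flags actually enable an instruction set is to
look at the macros the compiler predefines. The sketch below assumes
GCC or Clang, which define __SSE2__ and __SSE4_1__ when the respective
extensions are enabled for code generation; the function names are
hypothetical.

```c
#include <stdio.h>

/* Returns 1 if this translation unit was compiled with SSE2 enabled
   (GCC and Clang then predefine __SSE2__), otherwise 0. */
int built_with_sse2(void)
{
#ifdef __SSE2__
    return 1;
#else
    return 0;
#endif
}

/* Returns 1 if this translation unit was compiled with SSE4.1 enabled
   (GCC and Clang then predefine __SSE4_1__), otherwise 0. */
int built_with_sse41(void)
{
#ifdef __SSE4_1__
    return 1;
#else
    return 0;
#endif
}

/* Prints which of the two extensions the compiler may use. */
void report_isa(void)
{
    printf("SSE2:   %d\n", built_with_sse2());
    printf("SSE4.1: %d\n", built_with_sse41());
}
```

Calling report_isa() from a program built with -msse4.1 should report
SSE4.1 as enabled, while adding only -mtune=core2 should leave the
result unchanged from the default.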

> Last compile with -mtune=i386 for the i386 processor:
>
> GCC 13.3: gcc -m32 -mtune=i386 -O3 demo.c
>
> # https://godbolt.org/z/e76W6dsMj
> ispowerof2(unsigned long long):
>          push    ebx            #
>          mov     ecx, [esp+8]   #    mov    eax, [esp+4]
>          mov     ebx, [esp+12]  #    mov    edx, [esp+8]
>          mov     eax, ecx       #
>          mov     edx, ebx       #
>          add     eax, -1        #    add    eax, -1
>          adc     edx, -1        #    adc    edx, -1
>          and     eax, ecx       #    and    eax, [esp+4]
>          and     edx, ebx       #    and    edx, [esp+8]
>          or      eax, edx       #    or     eax, edx
>          sete    al             #    neg    eax
>          movzx   eax, al        #    sbb    eax, eax
>          pop     ebx            #    inc    eax
>          ret                    #    ret
>
> 14 instructions in 33 bytes    # 11 instructions in 32 bytes
>
> OUCH: why does GCC abuse EBX (and ECX too) and performs a superfluous
>        memory write?

At -O1 gcc produces:

% gcc -march=x86-64 -mtune=i386 -m32 -O -c ispowerof2.c && objdump -Mintel -d ispowerof2.o

ispowerof2.o:     file format elf32-i386


Disassembly of section .text:

00000000 <ispowerof2>:
    0:   8b 44 24 04             mov    eax,DWORD PTR [esp+0x4]
    4:   8b 54 24 08             mov    edx,DWORD PTR [esp+0x8]
    8:   83 c0 ff                add    eax,0xffffffff
    b:   83 d2 ff                adc    edx,0xffffffff
    e:   23 44 24 04             and    eax,DWORD PTR [esp+0x4]
   12:   23 54 24 08             and    edx,DWORD PTR [esp+0x8]
   16:   09 d0                   or     eax,edx
   18:   0f 94 c0                sete   al
   1b:   0f b6 c0                movzx  eax,al
   1e:   c3                      ret

which is 1 instruction and 1 byte shorter than your proposed solution.
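
For anyone following the listing, the add/adc pair computes the 64-bit
value argument - 1 in two 32-bit halves, and the and/or pair then tests
whether any bit survives. A rough C model of that sequence, assuming
unsigned long long is 64 bits wide (uint64_t is used here for clarity)
and with the hypothetical name ispowerof2_halves:

```c
#include <stdint.h>

/* Reference: the function from demo.c, written with a fixed-width type. */
int ispowerof2(uint64_t argument)
{
    return (argument & (argument - 1)) == 0;
}

/* Model of the -O1 listing: the same test on two 32-bit halves. */
int ispowerof2_halves(uint64_t argument)
{
    uint32_t lo = (uint32_t)argument;          /* [esp+4] */
    uint32_t hi = (uint32_t)(argument >> 32);  /* [esp+8] */

    uint32_t eax = lo + 0xFFFFFFFFu;           /* add eax, -1 */
    uint32_t carry = (eax < lo);               /* carry out: 1 iff lo != 0 */
    uint32_t edx = hi + 0xFFFFFFFFu + carry;   /* adc edx, -1 */

    eax &= lo;                                 /* and eax, [esp+4] */
    edx &= hi;                                 /* and edx, [esp+8] */

    return (eax | edx) == 0;                   /* or; sete; movzx */
}
```

Both functions should agree for every input, including 0, for which
the original expression also evaluates to 1.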

However, at -O2 or -O3 it produces the code you mention above. The 
reason for that is simple. It's faster to read from registers than it is 
to read from cache or RAM, and gcc is taking advantage of that fact when 
optimizing at -O2 or higher.

>
> Stefan Kanthak

