From: Nicholas Vinson <nvinson234@gmail.com>
To: gcc@gcc.gnu.org
Subject: Re: Will GCC eventually support SSE2 or SSE4.1?
Date: Fri, 26 May 2023 07:34:26 -0400
Message-ID: <c3537686-0832-ce3a-e54a-2eb936ff75df@gmail.com>
In-Reply-To: <51071A92918346ABBC6B5703179F5174@H270>
On 5/26/23 02:46, Stefan Kanthak wrote:
> Hi,
>
> compile the following function on a system with Core2 processor
> (released January 2008) for the 32-bit execution environment:
>
> --- demo.c ---
> int ispowerof2(unsigned long long argument)
> {
> return (argument & argument - 1) == 0;
> }
> --- EOF ---
>
> GCC 13.3: gcc -m32 -O3 demo.c
>
> NOTE: -mtune=native is the default!
>
> # https://godbolt.org/z/b43cjGdY9
> ispowerof2(unsigned long long):
> movq xmm1, [esp+4]
> pcmpeqd xmm0, xmm0
> paddq xmm0, xmm1
> pand xmm0, xmm1
> movd edx, xmm0 # pxor xmm1, xmm1
> psrlq xmm0, 32 # pcmpeqb xmm0, xmm1
> movd eax, xmm0 # pmovmskb eax, xmm0
> or edx, eax # cmp al, 255
> sete al # sete al
> movzx eax, al #
> ret
>
> 11 instructions in 40 bytes # 10 instructions in 36 bytes
You cannot delete the 'movzx eax, al' instruction. The expression
"(argument & argument - 1) == 0" must evaluate to 0 or 1 as an int, and
'sete al' writes only the low 8 bits of eax. The movzx is required to
ensure that the upper 24 bits of the eax register are zeroed.
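
To make the point concrete, here is a rough C model of the two
instructions. The function names are made up for illustration, and the
stale register value in the usage note is arbitrary; this is only a
sketch of the register semantics, not of anything gcc does internally.

```c
#include <stdint.h>

/* Models `sete al`: only the low 8 bits of the 32-bit register are
   written; the upper 24 bits keep whatever value they had before. */
uint32_t sete_al(uint32_t eax, int zero_flag)
{
    return (eax & 0xFFFFFF00u) | (zero_flag ? 1u : 0u);
}

/* Models `movzx eax, al`: the low byte is zero-extended into the
   whole register, which clears the upper 24 bits. */
uint32_t movzx_eax_al(uint32_t eax)
{
    return eax & 0xFFu;
}
```

With a stale value such as 0x12345600 in eax, sete_al(0x12345600, 1)
yields 0x12345601, which a caller reading the full int would not see as
1. Applying movzx_eax_al to that value yields 1.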
> OOPS: why does GCC (ab)use the SSE2 alias "Willamette New Instruction Set"
> here instead of the native SSE4.1 alias "Penryn New Instruction Set"
> of the Core2 (and all later processors)?
>
> OUCH: why does it FAIL to REALLY use SSE2, as shown in the comments on the
> right side?
After correcting for the above error, your solution is the same size
as the solution gcc generated. Therefore, the only remaining question
would be "Is your solution faster than the code gcc produced?"
If you claim it is, I'd like to see evidence supporting that claim.
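
One way such evidence could be gathered is a small timing harness along
the following lines. This is a minimal sketch, not a rigorous benchmark:
time_calls is a hypothetical helper, the input generator constants are
arbitrary, and the hand-written variant would have to be assembled
separately and passed in as a second function pointer.

```c
#include <time.h>

/* Function under test, as in demo.c (parentheses added for clarity). */
int ispowerof2(unsigned long long argument)
{
    return (argument & (argument - 1)) == 0;
}

/* Hypothetical helper: calls fn `iterations` times on pseudo-random
   inputs and returns the elapsed CPU time in seconds. The results are
   summed into *sink so the calls cannot be optimized away. */
double time_calls(int (*fn)(unsigned long long),
                  unsigned long iterations, unsigned *sink)
{
    unsigned acc = 0;
    unsigned long long x = 0x9E3779B97F4A7C15ULL;  /* arbitrary seed */
    clock_t start = clock();

    for (unsigned long i = 0; i < iterations; ++i) {
        /* simple linear congruential step to vary the input */
        x = x * 6364136223846793005ULL + 1442695040888963407ULL;
        acc += (unsigned)fn(x);
    }

    clock_t stop = clock();
    *sink = acc;
    return (double)(stop - start) / CLOCKS_PER_SEC;
}
```

Calling time_calls once per candidate with the same iteration count and
comparing the returned times over several runs would give a first
indication. The loop and call overhead are included in the measurement,
so small differences should be treated with caution.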
> Now add the -mtune=core2 option to EXPLICITLY enable the NATIVE SSE4.1
> alias "Penryn New Instruction Set" of the Core2 processor:
>
> GCC 13.3: gcc -m32 -mtune=core2 -O3 demo.c
>
> # https://godbolt.org/z/svhEoYT11
> ispowerof2(unsigned long long):
> # xor eax, eax
> movq xmm1, [esp+4] # movq xmm1, [esp+4]
> pcmpeqd xmm0, xmm0 # pcmpeqq xmm0, xmm0
> paddq xmm0, xmm1 # paddq xmm0, xmm1
> pand xmm0, xmm1 # ptest xmm0, xmm1
> movd edx, xmm0 #
> psrlq xmm0, 32 #
> movd eax, xmm0 #
> or edx, eax #
> sete al # sete al
> movzx eax, al #
> ret # ret
>
> 11 instructions in 40 bytes # 7 instructions in 26 bytes
>
> OUCH: GCC FAILS to use SSE4.1 as shown in the comments on the right side.
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
As pointed out elsewhere in this thread, you used the wrong flags:
-mtune only selects the tuning target, it does not enable additional
instruction sets. With the proper flags, I get:
% gcc -march=x86-64 -msse4.1 -m32 -O3 -c ispowerof2.c && objdump -d ispowerof2.o
ispowerof2.o: file format elf32-i386
Disassembly of section .text:
00000000 <ispowerof2>:
0: f3 0f 7e 4c 24 04 movq 0x4(%esp),%xmm1
6: 66 0f 76 c0 pcmpeqd %xmm0,%xmm0
a: 31 c0 xor %eax,%eax
c: 66 0f d4 c1 paddq %xmm1,%xmm0
10: 66 0f db c1 pand %xmm1,%xmm0
14: 66 0f 6c c0 punpcklqdq %xmm0,%xmm0
18: 66 0f 38 17 c0 ptest %xmm0,%xmm0
1d: 0f 94 c0 sete %al
20: c3 ret
so with just the SSE4.1 instruction set enabled, the output is 33 bytes
long (the final ret sits at offset 0x20).
> Last compile with -mtune=i386 for the i386 processor:
>
> GCC 13.3: gcc -m32 -mtune=i386 -O3 demo.c
>
> # https://godbolt.org/z/e76W6dsMj
> ispowerof2(unsigned long long):
> push ebx #
> mov ecx, [esp+8] # mov eax, [esp+4]
> mov ebx, [esp+12] # mov edx, [esp+8]
> mov eax, ecx #
> mov edx, ebx #
> add eax, -1 # add eax, -1
> adc edx, -1 # adc edx, -1
> and eax, ecx # and eax, [esp+4]
> and edx, ebx # and edx, [esp+8]
> or eax, edx # or eax, edx
> sete al # neg eax
> movzx eax, al # sbb eax, eax
> pop ebx # inc eax
> ret # ret
>
> 14 instructions in 33 bytes # 11 instructions in 32 bytes
>
> OUCH: why does GCC abuse EBX (and ECX too) and performs a superfluous
> memory write?
At -O1 gcc produces:
% gcc -march=x86-64 -mtune=i386 -m32 -O -c ispowerof2.c && objdump -Mintel -d ispowerof2.o
ispowerof2.o: file format elf32-i386
Disassembly of section .text:
00000000 <ispowerof2>:
0: 8b 44 24 04 mov eax,DWORD PTR [esp+0x4]
4: 8b 54 24 08 mov edx,DWORD PTR [esp+0x8]
8: 83 c0 ff add eax,0xffffffff
b: 83 d2 ff adc edx,0xffffffff
e: 23 44 24 04 and eax,DWORD PTR [esp+0x4]
12: 23 54 24 08 and edx,DWORD PTR [esp+0x8]
16: 09 d0 or eax,edx
18: 0f 94 c0 sete al
1b: 0f b6 c0 movzx eax,al
1e: c3 ret
which is 1 instruction and 1 byte shorter than your proposed solution.
However, at -O2 or -O3 it produces the code you mention above. The
reason for that is simple. It's faster to read from registers than it is
to read from cache or RAM, and gcc is taking advantage of that fact when
optimizing at -O2 or higher.
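
For anyone following the integer listings, the add/adc pair is a 64-bit
decrement carried out on two 32-bit halves. A rough C model of that
sequence is sketched below; ispowerof2_halves is a name invented for
this illustration, and the model only mirrors the arithmetic, not the
register allocation that differs between -O1 and -O2.

```c
#include <stdint.h>

/* Models the 32-bit instruction sequence on the low and high halves
   of the 64-bit argument. */
int ispowerof2_halves(uint32_t lo, uint32_t hi)
{
    /* add eax, -1: adding 0xFFFFFFFF carries out unless lo == 0 */
    uint32_t lo_minus = lo + 0xFFFFFFFFu;
    uint32_t carry = (lo != 0);

    /* adc edx, -1: the carry cancels the decrement of the high half */
    uint32_t hi_minus = hi + 0xFFFFFFFFu + carry;

    /* and, and, or, then sete + movzx: 1 if every result bit is clear */
    return ((lo_minus & lo) | (hi_minus & hi)) == 0;
}

/* Reference version from demo.c (parentheses added for clarity). */
int ispowerof2(unsigned long long argument)
{
    return (argument & (argument - 1)) == 0;
}
```

Both functions should agree for any input, for example
ispowerof2_halves(0, 1) and ispowerof2(0x100000000ULL) both return 1.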
>
> Stefan Kanthak