Date: Fri, 26 May 2023 07:34:26 -0400
Subject: Re: Will GCC eventually support SSE2 or SSE4.1?
To: gcc@gcc.gnu.org
From: Nicholas Vinson
References: <51071A92918346ABBC6B5703179F5174@H270>
In-Reply-To: <51071A92918346ABBC6B5703179F5174@H270>

On 5/26/23 02:46, Stefan Kanthak wrote:
> Hi,
>
> compile the following function on a system with Core2 processor
> (released January 2008) for the 32-bit execution environment:
>
> --- demo.c ---
> int ispowerof2(unsigned long long argument)
> {
>     return (argument & argument - 1) == 0;
> }
> --- EOF ---
>
> GCC 13.3: gcc -m32 -O3 demo.c
>
> NOTE: -mtune=native is the default!
>
> # https://godbolt.org/z/b43cjGdY9
> ispowerof2(unsigned long long):
>         movq    xmm1, [esp+4]
>         pcmpeqd xmm0, xmm0
>         paddq   xmm0, xmm1
>         pand    xmm0, xmm1
>         movd    edx, xmm0          # pxor     xmm1, xmm1
>         psrlq   xmm0, 32           # pcmpeqb  xmm0, xmm1
>         movd    eax, xmm0          # pmovmskb eax, xmm0
>         or      edx, eax           # cmp      al, 255
>         sete    al                 # sete     al
>         movzx   eax, al            #
>         ret
>
> 11 instructions in 40 bytes       # 10 instructions in 36 bytes

You cannot delete the 'movzx eax, al' instruction. The expression
"(argument & argument - 1) == 0" must evaluate to 0 or 1, and 'sete al'
writes only the low 8 bits of eax. The movzx is required to ensure that
the upper 24 bits of the eax register are properly zeroed.

> OOPS: why does GCC (ab)use the SSE2 alias "Willamette New Instruction Set"
>       here instead of the native SSE4.1 alias "Penryn New Instruction Set"
>       of the Core2 (and all later processors)?
>
> OUCH: why does it FAIL to REALLY use SSE2, as shown in the comments on the
>       right side?

After correcting for the above error, your solution has the same number
of instructions as the code gcc generated (11) and is one byte shorter
(39 bytes versus 40). Therefore, the only remaining question would be
"Is your solution faster than the code gcc produced?" If you claim it
is, I'd like to see evidence supporting that claim.

> Now add the -mtune=core2 option to EXPLICITLY enable the NATIVE SSE4.1
> alias "Penryn New Instruction Set" of the Core2 processor:
>
> GCC 13.3: gcc -m32 -mtune=core2 -O3 demo.c
>
> # https://godbolt.org/z/svhEoYT11
> ispowerof2(unsigned long long):
>                                    # xor     eax, eax
>         movq    xmm1, [esp+4]      # movq    xmm1, [esp+4]
>         pcmpeqd xmm0, xmm0         # pcmpeqq xmm0, xmm0
>         paddq   xmm0, xmm1         # paddq   xmm0, xmm1
>         pand    xmm0, xmm1         # ptest   xmm0, xmm1
>         movd    edx, xmm0          #
>         psrlq   xmm0, 32           #
>         movd    eax, xmm0          #
>         or      edx, eax           #
>         sete    al                 # sete    al
>         movzx   eax, al            #
>         ret                        # ret
>
> 11 instructions in 40 bytes       # 7 instructions in 26 bytes
>
> OUCH: GCC FAILS to use SSE4.1 as shown in the comments on the right side.
>       ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

As pointed out elsewhere in this thread, you used the wrong flags.
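
The flag distinction can be checked from C. In GCC, -mtune only selects
the processor to tune the generated code for, while -march and
-msse4.1 select the instructions the compiler may emit. A minimal
sketch, assuming GCC's predefined macros __SSE2__ and __SSE4_1__; the
function names sse2_enabled, sse41_enabled and report_isa are
hypothetical and not part of the original demo.c:

```c
#include <assert.h>
#include <stdio.h>

/* 1 if the compiler was allowed to emit SSE2 instructions, else 0. */
int sse2_enabled(void)
{
#ifdef __SSE2__
    return 1;
#else
    return 0;
#endif
}

/* 1 if the compiler was allowed to emit SSE4.1 instructions, else 0. */
int sse41_enabled(void)
{
#ifdef __SSE4_1__
    return 1;
#else
    return 0;
#endif
}

/* Hypothetical helper: print which instruction sets were enabled. */
void report_isa(void)
{
    /* In GCC, enabling SSE4.1 also enables SSE2, so this should hold. */
    assert(!sse41_enabled() || sse2_enabled());
    printf("SSE2: %d, SSE4.1: %d\n", sse2_enabled(), sse41_enabled());
}
```

Calling report_isa() from main in a build made with only -mtune=core2
should report SSE4.1 as 0 unless the compiler's configured default
architecture already includes it; adding -msse4.1 should change it to 1.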
With the proper flags, I get

% gcc -march=x86-64 -msse4.1 -m32 -O3 -c ispowerof2.c && objdump -d ispowerof2.o

ispowerof2.o:     file format elf32-i386

Disassembly of section .text:

00000000 <ispowerof2>:
   0:   f3 0f 7e 4c 24 04       movq   0x4(%esp),%xmm1
   6:   66 0f 76 c0             pcmpeqd %xmm0,%xmm0
   a:   31 c0                   xor    %eax,%eax
   c:   66 0f d4 c1             paddq  %xmm1,%xmm0
  10:   66 0f db c1             pand   %xmm1,%xmm0
  14:   66 0f 6c c0             punpcklqdq %xmm0,%xmm0
  18:   66 0f 38 17 c0          ptest  %xmm0,%xmm0
  1d:   0f 94 c0                sete   %al
  20:   c3                      ret

so with just the SSE4.1 instruction set the output is 9 instructions
and 33 bytes long (the final ret sits at offset 0x20).

> Last compile with -mtune=i386 for the i386 processor:
>
> GCC 13.3: gcc -m32 -mtune=i386 -O3 demo.c
>
> # https://godbolt.org/z/e76W6dsMj
> ispowerof2(unsigned long long):
>         push    ebx                #
>         mov     ecx, [esp+8]       # mov     eax, [esp+4]
>         mov     ebx, [esp+12]      # mov     edx, [esp+8]
>         mov     eax, ecx           #
>         mov     edx, ebx           #
>         add     eax, -1            # add     eax, -1
>         adc     edx, -1            # adc     edx, -1
>         and     eax, ecx           # and     eax, [esp+4]
>         and     edx, ebx           # and     edx, [esp+8]
>         or      eax, edx           # or      eax, edx
>         sete    al                 # neg     eax
>         movzx   eax, al            # sbb     eax, eax
>         pop     ebx                # inc     eax
>         ret                        # ret
>
> 14 instructions in 33 bytes       # 11 instructions in 32 bytes
>
> OUCH: why does GCC abuse EBX (and ECX too) and performs a superfluous
>       memory write?

At -O1 gcc produces:

% gcc -march=x86-64 -mtune=i386 -m32 -O -c ispowerof2.c && objdump -Mintel -d ispowerof2.o

ispowerof2.o:     file format elf32-i386

Disassembly of section .text:

00000000 <ispowerof2>:
   0:   8b 44 24 04             mov    eax,DWORD PTR [esp+0x4]
   4:   8b 54 24 08             mov    edx,DWORD PTR [esp+0x8]
   8:   83 c0 ff                add    eax,0xffffffff
   b:   83 d2 ff                adc    edx,0xffffffff
   e:   23 44 24 04             and    eax,DWORD PTR [esp+0x4]
  12:   23 54 24 08             and    edx,DWORD PTR [esp+0x8]
  16:   09 d0                   or     eax,edx
  18:   0f 94 c0                sete   al
  1b:   0f b6 c0                movzx  eax,al
  1e:   c3                      ret

which is 1 instruction and 1 byte shorter than your proposed solution.

However, at -O2 or -O3 it produces the code you mention above. The
reason for that is simple. It's faster to read from registers than it
is to read from cache or RAM, and gcc is taking advantage of that fact
when optimizing at -O2 or higher.

>
> Stefan Kanthak
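
For reference, the add/adc/and/and/or/sete sequence in the 32-bit
listings can be written out in C. This is a minimal sketch, not code
from the thread: ispowerof2 is the function from demo.c (with explicit
parentheses), while ispowerof2_halves is a hypothetical helper that
performs the same computation on the two 32-bit halves the way the
i386 code does.

```c
#include <assert.h>
#include <stdint.h>

/* The function from demo.c; the result of == is an int equal to 0 or 1. */
int ispowerof2(unsigned long long argument)
{
    return (argument & (argument - 1)) == 0;
}

/* Hypothetical helper: the same test on two 32-bit halves. */
int ispowerof2_halves(uint32_t lo, uint32_t hi)
{
    uint32_t lo_m1  = lo - 1u;      /* add eax, -1 */
    uint32_t borrow = (lo == 0u);   /* the low half borrows only when it is 0 */
    uint32_t hi_m1  = hi - borrow;  /* adc edx, -1 */

    /* and, and, or, sete: the result is 1 when both masked halves are zero */
    return ((lo_m1 & lo) | (hi_m1 & hi)) == 0u;
}

/* Hypothetical self-check comparing both versions on a few inputs. */
void check_ispowerof2(void)
{
    const unsigned long long samples[] = {
        0ULL, 1ULL, 2ULL, 6ULL, 1ULL << 32, (1ULL << 32) + 1ULL, 1ULL << 63
    };
    for (unsigned i = 0; i < sizeof samples / sizeof samples[0]; i++) {
        unsigned long long v = samples[i];
        int r = ispowerof2(v);
        assert(r == 0 || r == 1);
        assert(r == ispowerof2_halves((uint32_t)v, (uint32_t)(v >> 32)));
    }
}
```

Note that both versions return 1 for an argument of 0, exactly as the
expression in demo.c does.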