From: "rguenth at gcc dot gnu.org"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug target/111166] gcc unnecessarily creates vector operations for packing 32 bit integers into struct (x86_64)
Date: Mon, 28 Aug 2023 12:37:22 +0000

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111166

Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Depends on|                            |101926

--- Comment #4 from Richard Biener ---
Your benchmark confirms the vectorized variant is slower; on a 7900X both
the memory round trip and the gpr->xmm moves contribute.  perf shows

      | turn_into_struct():
    1 |   movd       %edi,%xmm1
    3 |   movd       %esi,%xmm4
    4 |   movd       %edx,%xmm0
   95 |   movd       %ecx,%xmm3
    6 |   punpckldq  %xmm4,%xmm1
    2 |   punpckldq  %xmm3,%xmm0
    1 |   movdqa     %xmm1,%xmm2
      |   punpcklqdq %xmm0,%xmm2
    5 |   movaps     %xmm2,-0x18(%rsp)
   63 |   mov        -0x18(%rsp),%rdi
   70 |   mov        -0x10(%rsp),%rsi
   47 |   jmp        400630

Note the situation is difficult to rectify: ideally the vectorizer would see
that we need the value in two 64-bit register pieces, but it doesn't; it only
sees that we store to memory.

I'll note the non-vectorized code is also far from optimal.  clang produces
the following, which is faster by more than the delta by which the vectorized
version is slower than the scalar GCC variant:

turn_into_struct:                       # @turn_into_struct
        .cfi_startproc
# %bb.0:
                                        # kill: def $ecx killed $ecx def $rcx
                                        # kill: def $esi killed $esi def $rsi
        shlq    $32, %rsi
        movl    %edi, %edi
        orq     %rsi, %rdi
        shlq    $32, %rcx
        movl    %edx, %esi
        orq     %rcx, %rsi
        jmp     do_smth_with_4_u32      # TAILCALL

Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101926
[Bug 101926] [meta-bug] struct/complex/other argument passing and return
should be improved
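
For context, a minimal testcase matching the code above might look like the
following.  This is a sketch: turn_into_struct and do_smth_with_4_u32 are the
names visible in the disassembly, but the struct type, its field names, and
the exact signatures are assumptions inferred from the bug title (four 32-bit
integers packed into a struct):

    #include <stdint.h>

    /* Hypothetical reconstruction; see the bug report for the real testcase.
       The struct is 16 bytes, i.e. two INTEGER eightbytes, so it is passed
       in %rdi/%rsi under the SysV x86-64 ABI, matching the register moves
       in the perf output above.  */
    struct u32x4 { uint32_t a, b, c, d; };

    extern void do_smth_with_4_u32(struct u32x4 s);

    void turn_into_struct(uint32_t a, uint32_t b, uint32_t c, uint32_t d)
    {
        struct u32x4 s = { a, b, c, d };
        do_smth_with_4_u32(s);  /* becomes a tail call in both compilers */
    }

clang's sequence amounts to forming each argument register with a shift and
an or, lo = ((uint64_t)b << 32) | a and hi = ((uint64_t)d << 32) | c,
entirely in general-purpose registers, which avoids both the xmm shuffles
and the stack round trip that GCC's vectorized code pays for.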