From: "rguenth at gcc dot gnu.org"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug target/111166] gcc unnecessarily creates vector operations for packing 32 bit integers into struct (x86_64)
Date: Mon, 28 Aug 2023 12:37:22 +0000

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111166

Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Depends on|                            |101926

--- Comment #4 from Richard Biener ---
Your benchmark confirms the vectorized variant is slower; on a 7900X both
the memory round trip and the gpr->xmm moves contribute.  perf shows

      | turn_into_struct():
    1 |   movd       %edi,%xmm1
    3 |   movd       %esi,%xmm4
    4 |   movd       %edx,%xmm0
   95 |   movd       %ecx,%xmm3
    6 |   punpckldq  %xmm4,%xmm1
    2 |   punpckldq  %xmm3,%xmm0
    1 |   movdqa     %xmm1,%xmm2
      |   punpcklqdq %xmm0,%xmm2
    5 |   movaps     %xmm2,-0x18(%rsp)
   63 |   mov        -0x18(%rsp),%rdi
   70 |   mov        -0x10(%rsp),%rsi
   47 |   jmp        400630

Note the situation is difficult to rectify: ideally the vectorizer would see
that we need the value in two 64-bit register pieces, but it doesn't; it only
sees that we store to memory.

I'll note the non-vectorized code is also far from optimal.  clang produces
the following, which is faster by more than the delta by which the vectorized
version is slower than the scalar GCC variant:

turn_into_struct:                       # @turn_into_struct
        .cfi_startproc
# %bb.0:
                                        # kill: def $ecx killed $ecx def $rcx
                                        # kill: def $esi killed $esi def $rsi
        shlq    $32, %rsi
        movl    %edi, %edi
        orq     %rsi, %rdi
        shlq    $32, %rcx
        movl    %edx, %esi
        orq     %rcx, %rsi
        jmp     do_smth_with_4_u32      # TAILCALL

Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101926
[Bug 101926] [meta-bug] struct/complex/other argument passing and return
should be improved
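
For context, a minimal testcase matching the code above might look like the
following.  This is a sketch: turn_into_struct and do_smth_with_4_u32 are the
names visible in the disassembly, but the struct type, its field names, and
the exact signatures are assumptions inferred from the bug title (four 32-bit
integers packed into a struct):

    #include <stdint.h>

    /* Hypothetical reconstruction; see the bug report for the real testcase.
       The struct is 16 bytes, i.e. two INTEGER eightbytes, so it is passed
       in %rdi/%rsi under the SysV x86-64 ABI, matching the register moves
       in the perf output above.  */
    struct u32x4 { uint32_t a, b, c, d; };

    extern void do_smth_with_4_u32(struct u32x4 s);

    void turn_into_struct(uint32_t a, uint32_t b, uint32_t c, uint32_t d)
    {
        struct u32x4 s = { a, b, c, d };
        do_smth_with_4_u32(s);  /* becomes a tail call in both compilers */
    }

clang's sequence amounts to forming each argument register with a shift and
an or, lo = ((uint64_t)b << 32) | a and hi = ((uint64_t)d << 32) | c,
entirely in general-purpose registers, which avoids both the xmm shuffles
and the stack round trip that GCC's vectorized code pays for.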