From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-bugzilla@gcc.gnu.org>
Received: by sourceware.org (Postfix, from userid 48)
	id A673A3858D28; Mon, 28 Aug 2023 11:52:32 +0000 (GMT)
DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org A673A3858D28
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gcc.gnu.org;
	s=default; t=1693223552;
	bh=RipwzbU5bEFT2oaGjyScKKKlH3hNguShMlskP510Q6M=;
	h=From:To:Subject:Date:In-Reply-To:References:From;
	b=nauF9nibw91ApdLwvwluplogmZD/2WKceG/KWZdrkDSIe/iLhAs5Slvpgzjvi7qoT
	 Y82rhHeQXQ5rpP3e8+eEzhhQ/yAzWJpoEyWHBsXTXxdv9ngv8eNf0TAjafcwHm0u7g
	 5es4gr8PoCQ1bq31ANOH5Ii2JCKUVhNW7CM/jPgY=
From: "gnu_bugzilla_gcc at catelyn dot tech" <gcc-bugzilla@gcc.gnu.org>
To: gcc-bugs@gcc.gnu.org
Subject: [Bug target/111166] gcc unnecessarily creates vector operations for
 packing 32 bit integers into struct (x86_64)
Date: Mon, 28 Aug 2023 11:52:32 +0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: changed
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: gcc
X-Bugzilla-Component: target
X-Bugzilla-Version: 13.2.1
X-Bugzilla-Keywords: missed-optimization
X-Bugzilla-Severity: normal
X-Bugzilla-Who: gnu_bugzilla_gcc at catelyn dot tech
X-Bugzilla-Status: NEW
X-Bugzilla-Resolution: 
X-Bugzilla-Priority: P3
X-Bugzilla-Assigned-To: unassigned at gcc dot gnu.org
X-Bugzilla-Target-Milestone: ---
X-Bugzilla-Flags: 
X-Bugzilla-Changed-Fields: 
Message-ID: <bug-111166-4-EUQg4gI2hM@http.gcc.gnu.org/bugzilla/>
In-Reply-To: <bug-111166-4@http.gcc.gnu.org/bugzilla/>
References: <bug-111166-4@http.gcc.gnu.org/bugzilla/>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
List-Id: <gcc-bugs.sourceware.org>

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111166

--- Comment #3 from gnu_bugzilla_gcc at catelyn dot tech ---
(In reply to Richard Biener from comment #1)
> Unless you can come up with an actual benchmark showing the vector code is
> slower I'd say it's not.  Given it's smaller it should win on the icache
> side if not executed frequently as well.

I'm not an expert in benchmarking C, so my benchmark may be incorrect, but here
is what I did. I compiled the same (attached preprocessed) file with -O2, -O3,
and -Os into an object file, and compiled a benchmarking file into an object as
well (to avoid variance caused by the benchmarking file being compiled with
different optimization levels). I added a very simple implementation for
`do_smth_with_4_u32` and ran the `turn_into_struct` function in a hot loop
with varying (pre-generated) input data, storing the result in an array. I
timed this hot loop using `(float)clock()/CLOCKS_PER_SEC;` at the start and
end, then added up the calculated results to ensure all three programs get the
same result.
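
For readers without the attachment, the shape of the timing loop described
above is roughly the following. This is only a sketch, not the attached
benchmark: the struct layout, both function signatures, the body of
`do_smth_with_4_u32`, and the `run_benchmark` helper are assumptions made for
illustration. The GCC `noinline` attribute stands in for the separate object
files so that the single-file version does not inline the function under test.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Assumed layout: the report only says four 32-bit integers are packed. */
struct four_u32 {
    uint32_t a, b, c, d;
};

/* Hypothetical "very simple implementation"; the attached one may differ.
 * noinline (a GCC attribute) stands in for the separate object file. */
__attribute__((noinline))
uint32_t do_smth_with_4_u32(struct four_u32 v)
{
    return v.a + v.b + v.c + v.d;
}

/* Assumed signature for the function under test. */
__attribute__((noinline))
uint32_t turn_into_struct(uint32_t a, uint32_t b, uint32_t c, uint32_t d)
{
    struct four_u32 v = { a, b, c, d };
    return do_smth_with_4_u32(v);
}

/* Runs the hot loop over n groups of four pre-generated inputs, stores each
 * result in out, writes the elapsed CPU time to *seconds, and returns the
 * checksum used to compare the differently optimized builds. */
uint64_t run_benchmark(const uint32_t *in, uint32_t *out, size_t n,
                       double *seconds)
{
    assert(in != NULL && out != NULL && seconds != NULL);

    clock_t start = clock();
    for (size_t i = 0; i < n; i++) {
        out[i] = turn_into_struct(in[4 * i], in[4 * i + 1],
                                  in[4 * i + 2], in[4 * i + 3]);
    }
    clock_t end = clock();
    *seconds = (double)(end - start) / CLOCKS_PER_SEC;

    uint64_t checksum = 0;
    for (size_t i = 0; i < n; i++) {
        checksum += out[i];
    }
    return checksum;
}
```

A caller would fill `in` with 4*n values before the timed region, call
`run_benchmark` once per build, and check that the returned checksum matches
across the -O2, -O3, and -Os binaries before comparing the times.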

On my machine (Ryzen 9 5900X) the -Os version takes ~0.36s, while the -O2 and
-O3 versions take ~0.43s and ~0.42s.

I tried both -O2 and -O3 to get a slightly better view of the typical variance
between program runs. Their times are very similar, but the -Os version is a
decent amount faster (around 16%, which I'd assume is significant).

I've added the preprocessed benchmark file as well, which I then compiled with
-mtune=generic and -march=x86-64 to match the system-under-test.