From: "rguenth at gcc dot gnu.org" <gcc-bugzilla@gcc.gnu.org>
To: gcc-bugs@gcc.gnu.org
Subject: [Bug tree-optimization/100076] eembc/automotive/basefp01 has 30.3% regression compare -O2 -ftree-vectorize with -O2 on CLX/Znver3
Date: Wed, 14 Apr 2021 07:08:22 +0000

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100076

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Target|                            |x86_64-*-*
                CC|                            |rguenth at gcc dot gnu.org

--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
See also PR90579.

I wonder if there's a way to tell the CPU not to attempt store forwarding
for a load - does emitting an lfence between the scalar store and the
vector load fix the issue?  ISTR that the "bad" effect is not so much the
delay of flushing the store buffers to L1 and then loading from L1, but
rather that the CPU speculates there is no conflicting (non-forwardable)
store in the store buffer, fetches a wrong value from L1, and then has to
flush and restart the pipeline once the conflict is discovered late.

Otherwise it's really hard to address this kind of issue - for doubles and
SSE vectorization we might simply vectorize all loads using scalars, but
that doesn't scale for larger VFs.  It might eventually be enough to
force-peel a single iteration of all loops, at the cost of code size (and
of performance where there is no STLF issue).

That said, CPU design folks should try to address this by making the
penalty smaller ;)

Can you share a runtime testcase?
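For experimenting with the lfence question above, a minimal sketch of such
a runtime testcase (my own illustration, not from the bug report; the
buffer name, loop count and USE_LFENCE macro are made up) could look like
this: two scalar stores immediately followed by a 16-byte vector load that
spans both of them, which is the pattern where store-to-load forwarding
fails.

/* Hypothetical STLF micro-benchmark sketch (not from the bug report).
   Build with e.g. "gcc -O2 stlf.c"; add -DUSE_LFENCE to check whether
   an lfence between the stores and the load changes the stall.  */
#include <emmintrin.h>
#include <stdio.h>

static double buf[2];

static double
kernel (double a, double b)
{
  buf[0] = a;                           /* scalar stores ...  */
  buf[1] = b;
  __asm__ volatile ("" ::: "memory");   /* keep the stores in the code */
#ifdef USE_LFENCE
  _mm_lfence ();                        /* fence between stores and load */
#endif
  __m128d v = _mm_loadu_pd (buf);       /* ... vector reload of the same
                                           bytes - STLF cannot forward
                                           two stores into one load */
  v = _mm_add_pd (v, v);
  double out[2];
  _mm_storeu_pd (out, v);
  return out[0] + out[1];
}

int
main (void)
{
  double s = 0.0;
  for (long i = 0; i < 100000000; i++)
    s += kernel ((double) i, (double) (i + 1));
  printf ("%f\n", s);
  return 0;
}

Note that on recent Intel and AMD parts lfence is execution-serializing,
so it may well just add latency rather than avoid the mispeculated
forward - which is exactly the open question above.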