From: "mjr19 at cam dot ac.uk"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug fortran/114324] New: AVX2 vectorisation performance regression with gfortran 13/14
Date: Wed, 13 Mar 2024 12:52:51 +0000

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114324

            Bug ID: 114324
           Summary: AVX2 vectorisation performance regression with gfortran 13/14
           Product: gcc
           Version: 13.1.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: fortran
          Assignee: unassigned at gcc dot gnu.org
          Reporter: mjr19 at cam dot ac.uk
  Target Milestone: ---

Created attachment 57685
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57685&action=edit
Test case of
loop showing performance regression

The attached loop, when compiled with "-Ofast -mavx2", runs over 20% slower on gfortran 13 or (pre-release) 14 than it does on 12.x. Precise versions tested: 12.3.0, 13.1.0, and GCC 14 as downloaded on 11th March. The precise slowdown depends on the CPU; tested on Haswell and Kaby Lake desktops.

Adding "-fopenmp" changes the code produced, but 12.3 still beats the later compilers. The analysis below is without -fopenmp.

It appears (to me) that 12.x is using the full width of the ymm registers, and has a loop of 17 vector instructions, plus some scalar loop control, which performs two iterations of the original Fortran loop. 13.x manages more aggressive unrolling, performing four iterations per pass, but uses about 54 vector instructions, rather than the 34 one might naively expect. More instructions do not necessarily mean slower code, but here they do.

I attach the test case to which I refer. I would be happy to add the trivial timing program to show how I have been timing it. The full code is an FFT, but the test case has been reduced to functional nonsense.

(I note that in other areas there are pleasing performance gains in gfortran 13.x. It is a pity that this regression partially cancels them.)
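
As a rough back-of-envelope check of the instruction counts quoted above, the per-iteration cost can be worked out as in the sketch below. It uses only the figures given in this report (17 vector instructions per two iterations on 12.x, about 54 per four iterations on 13.x); the helper name `per_iteration` is hypothetical, and the numbers are the reported approximations, not a fresh disassembly.

```python
# Back-of-envelope check of the vector instruction counts quoted in this report.
# The counts are the reporter's approximate figures, not a fresh disassembly.

def per_iteration(vector_instructions: int, iterations_per_pass: int) -> float:
    """Vector instructions issued per iteration of the original Fortran loop."""
    return vector_instructions / iterations_per_pass

gcc12 = per_iteration(17, 2)   # 12.x: 17 vector instructions, 2 iterations per pass
gcc13 = per_iteration(54, 4)   # 13.x: about 54 vector instructions, 4 iterations per pass
naive13 = 2 * 17               # naive expectation for 4 iterations: twice the 12.x body

print(f"12.x: {gcc12} vector instructions per iteration")
print(f"13.x: {gcc13} vector instructions per iteration")
print(f"13.x excess over the naive {naive13}: {54 - naive13} instructions per pass")
print(f"ratio 13.x / 12.x: {gcc13 / gcc12:.2f}")
```

On these figures 13.x issues roughly 59% more vector instructions per original iteration, which is a larger increase than the "over 20%" slowdown measured. That fits the remark above that instruction count alone is not a direct proxy for runtime.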