From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: by sourceware.org (Postfix, from userid 48)
	id 8C7343849AEE; Thu, 18 Apr 2024 12:13:14 +0000 (GMT)
DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 8C7343849AEE
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gcc.gnu.org;
	s=default; t=1713442394;
	bh=UpQzvVUQsLSsSV/D+zSu3ftQhxniOsUQ1v9k8hib92E=;
	h=From:To:Subject:Date:From;
	b=CgMvI09YYjfvme5GbHrEUnc8X+yEfgfc8yVafxXVeWwJAnZbDW1qSmDAWjh4gFAH3
	 cRzpyZJTka9VJnFhjmWCPnjW2yoB7E9HK5vw0Tz/OjJJAQ9TkTXzXfj5h0q8zlwdb1
	 omzNVxs6sJSPdq9mCIegNH2C94GHwS4S6ROHwiko=
From: "mjr19 at cam dot ac.uk"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug fortran/114767] New: gfortran AVX2 complex multiplication by (0d0,1d0) suboptimal
Date: Thu, 18 Apr 2024 12:13:13 +0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: new
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: gcc
X-Bugzilla-Component: fortran
X-Bugzilla-Version: 14.0
X-Bugzilla-Keywords:
X-Bugzilla-Severity: normal
X-Bugzilla-Who: mjr19 at cam dot ac.uk
X-Bugzilla-Status: UNCONFIRMED
X-Bugzilla-Resolution:
X-Bugzilla-Priority: P3
X-Bugzilla-Assigned-To: unassigned at gcc dot gnu.org
X-Bugzilla-Target-Milestone: ---
X-Bugzilla-Flags:
X-Bugzilla-Changed-Fields: bug_id short_desc product version bug_status bug_severity priority component assigned_to reporter target_milestone
Message-ID:
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
List-Id:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114767

            Bug ID: 114767
           Summary: gfortran AVX2 complex multiplication by (0d0,1d0)
                    suboptimal
           Product: gcc
           Version: 14.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: fortran
          Assignee: unassigned at gcc dot gnu.org
          Reporter: mjr19 at cam dot ac.uk
  Target Milestone: ---

Gfortran 14 shows considerable improvement over 13.1 on x86_64 AVX2 on the
test case

subroutine scale_i(c,n)
  integer :: i,n
  complex(kind(1d0)) :: c(*)

  do i=1,n
     c(i)=c(i)*(0d0,1d0)
  enddo
end subroutine scale_i

Both vectorise well, and use an xor for the multiplication by -1 -- good.

But both progress by forming one vector containing the real parts of four
consecutive complex elements, and one of the imaginary parts. The imaginary
parts are then all xor'ed to swap their signs, and further permuting and
shuffling occurs to reassemble things into the correct interleaved order.
Gfortran-14 has reduced the amount of permuting and shuffling to achieve the
same result.

I think that it should be possible to do this with the vector registers
holding the complex data in their natural order. A single xor could switch
the signs of alternate elements, leaving the real parts untouched, and a
single vpermilpd (or the more general vpermpd) could then swap pairs of
elements. This should not only be faster, but use fewer registers too.

(This could be generalised to the case where the constant is a multiple of
(0d0,1d0), say (0d0,q), either by a final multiplication with a vector
containing q repeated, or by replacing the xor by a multiplication by a
vector containing q repeated but with alternating signs.)

Compilation tests with -mavx2 -O3 -ffast-math.
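
For illustration only, the natural-order scheme suggested above can be written out by hand with AVX intrinsics in C. This is a sketch of the proposed instruction sequence, not output from any gfortran version. The function names scale_i_avx and scale_iq_avx are hypothetical. The sketch assumes the array holds interleaved re,im pairs of doubles, a GCC or Clang compiler (for the target attribute) and an AVX-capable CPU. Like the -ffast-math build in the report, it ignores the signed-zero and NaN/infinity details of a full complex multiply.

```c
#include <assert.h>
#include <immintrin.h>
#include <stddef.h>

/* Hypothetical helper: multiply n interleaved double-complex values
   (re,im,re,im,...) by (0,1) in place, keeping the natural order. */
__attribute__((target("avx")))
void scale_i_avx(double *c, size_t n)
{
    assert(c != NULL);
    /* -0.0 has only the sign bit set. _mm256_set_pd lists lanes from
       highest to lowest, so the sign bits land in lanes 1 and 3,
       i.e. on the imaginary parts. */
    const __m256d sign_im = _mm256_set_pd(-0.0, 0.0, -0.0, 0.0);
    size_t i = 0;
    for (; i + 2 <= n; i += 2) {                 /* two complex values per vector */
        __m256d v = _mm256_loadu_pd(c + 2 * i);  /* [ a0,  b0,  a1,  b1] */
        v = _mm256_xor_pd(v, sign_im);           /* [ a0, -b0,  a1, -b1] */
        v = _mm256_permute_pd(v, 0x5);           /* [-b0,  a0, -b1,  a1] (vpermilpd) */
        _mm256_storeu_pd(c + 2 * i, v);
    }
    for (; i < n; ++i) {                         /* scalar tail for odd n */
        double re = c[2 * i], im = c[2 * i + 1];
        c[2 * i] = -im;
        c[2 * i + 1] = re;
    }
}

/* Hypothetical generalisation to the constant (0,q): the xor is replaced
   by a multiplication with [q, -q, q, -q], followed by the same swap. */
__attribute__((target("avx")))
void scale_iq_avx(double *c, size_t n, double q)
{
    assert(c != NULL);
    const __m256d qv = _mm256_set_pd(-q, q, -q, q);
    size_t i = 0;
    for (; i + 2 <= n; i += 2) {
        __m256d v = _mm256_loadu_pd(c + 2 * i);  /* [   a0,    b0,    a1,    b1] */
        v = _mm256_mul_pd(v, qv);                /* [ q*a0, -q*b0,  q*a1, -q*b1] */
        v = _mm256_permute_pd(v, 0x5);           /* [-q*b0,  q*a0, -q*b1,  q*a1] */
        _mm256_storeu_pd(c + 2 * i, v);
    }
    for (; i < n; ++i) {                         /* scalar tail for odd n */
        double re = c[2 * i], im = c[2 * i + 1];
        c[2 * i] = -q * im;
        c[2 * i + 1] = q * re;
    }
}
```

Each 256-bit vector here holds two complex values and needs only the constant mask as an extra register, which is the register saving the report anticipates. Whether this is actually faster than the code gfortran emits would need measuring.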