From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: by sourceware.org (Postfix, from userid 48) id E7AFE3858408; Tue, 28 Sep 2021 13:55:47 +0000 (GMT)
DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org E7AFE3858408
From: "dwwork at gmail dot com"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug fortran/102510] Function call has unnecessary stride check
Date: Tue, 28 Sep 2021 13:55:47 +0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: changed
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: gcc
X-Bugzilla-Component: fortran
X-Bugzilla-Version: 11.2.0
X-Bugzilla-Keywords: missed-optimization
X-Bugzilla-Severity: normal
X-Bugzilla-Who: dwwork at gmail dot com
X-Bugzilla-Status: NEW
X-Bugzilla-Resolution:
X-Bugzilla-Priority: P3
X-Bugzilla-Assigned-To: unassigned at gcc dot gnu.org
X-Bugzilla-Target-Milestone: ---
X-Bugzilla-Flags:
X-Bugzilla-Changed-Fields:
Message-ID:
In-Reply-To:
References:
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
X-BeenThere: gcc-bugs@gcc.gnu.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Gcc-bugs mailing list
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Tue, 28 Sep 2021 13:55:48 -0000

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102510

--- Comment #2 from Dalon Work ---
Thanks for the information. Based on your comments, I've created 2 new
subroutines that call the "bad" function. The first places the result in a
contiguous array, while the second places the result in a strided array.
(https://godbolt.org/z/bTnWr3bMn)

The first:

    subroutine add2vecs3(a,b,c)
        real(r32), dimension(8), intent(in) :: a,b
        real(r32), dimension(8), intent(out) :: c
        c = add2vecs2(a,b)
    end subroutine

With "-O3 -mavx", this subroutine becomes fully vectorized:

    __blah_MOD_add2vecs3:
            vmovups ymm0, YMMWORD PTR [rdi]
            vaddps  ymm0, ymm0, YMMWORD PTR [rsi]
            vmovups YMMWORD PTR [rdx], ymm0
            vzeroupper
            ret

The second:

    subroutine add2vecs4(a,b,c)
        real(r32), dimension(8), intent(in) :: a,b
        real(r32), dimension(16), intent(out) :: c
        c(1:16:2) = add2vecs2(a,b)
    end subroutine

In this case the add is still a single vector instruction, but the result is
stored one element at a time:

    __blah_MOD_add2vecs4:
            vmovups ymm0, YMMWORD PTR [rsi]
            vaddps  ymm0, ymm0, YMMWORD PTR [rdi]
            vmovss  DWORD PTR [rdx], xmm0
            vextractps      DWORD PTR [rdx+8], xmm0, 1
            vextractps      DWORD PTR [rdx+16], xmm0, 2
            vextractps      DWORD PTR [rdx+24], xmm0, 3
            vextractf128    xmm0, ymm0, 0x1
            vmovss  DWORD PTR [rdx+32], xmm0
            vextractps      DWORD PTR [rdx+40], xmm0, 1
            vextractps      DWORD PTR [rdx+48], xmm0, 2
            vextractps      DWORD PTR [rdx+56], xmm0, 3
            vzeroupper
            ret

From this, it seems you are correct. The result gets passed in as a descriptor
to a block of memory, and from that the function figures out the best way to
fill in the data. Perhaps other compilers handle this differently, but there
we have it.

Changing this behavior might be difficult or impossible, as this would be an
ABI change, would it not? It's arguable whether it's even worth changing.

I guess what I assumed is that the compiler would have a contiguous block of
memory available for the return result. Any necessary striding would happen
external to the function.
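
For comparison, the behavior assumed in that last paragraph can be written out by hand at the source level: the function result goes into a contiguous temporary, and the striding happens afterwards in the caller. The following is only a minimal sketch. The body of add2vecs2 is a hypothetical stand-in (the original function is not reproduced in this comment), and the names blah_sketch, add2vecs4_tmp, tmp and demo are made up for illustration. No claim is made about what code gfortran generates for it; that would need to be checked in the assembly output.

```fortran
module blah_sketch
    use, intrinsic :: iso_fortran_env, only: r32 => real32
    implicit none
contains

    ! Hypothetical stand-in for the "bad" function from the report:
    ! an explicit-shape array result of 8 elements.
    function add2vecs2(a, b) result(c)
        real(r32), dimension(8), intent(in) :: a, b
        real(r32), dimension(8) :: c
        c = a + b
    end function

    ! Variant of add2vecs4 that spells out the assumed behavior:
    ! contiguous result first, strided copy second.
    subroutine add2vecs4_tmp(a, b, c)
        real(r32), dimension(8), intent(in) :: a, b
        real(r32), dimension(16), intent(out) :: c
        real(r32), dimension(8) :: tmp   ! contiguous block for the result

        tmp = add2vecs2(a, b)            ! function fills contiguous memory
        c(1:16:2) = tmp                  ! striding is external to the function
    end subroutine

end module

program demo
    use blah_sketch
    implicit none
    real(r32) :: a(8), b(8), c(16)
    integer :: i

    a = [(real(i, r32), i = 1, 8)]
    b = 10.0_r32
    call add2vecs4_tmp(a, b, c)

    ! Only the odd elements of c are defined after the call.
    if (any(c(1:16:2) /= a + b)) error stop "strided result mismatch"
    print *, "ok"
end program demo
```

As in add2vecs4, the even elements of c are left undefined because c is intent(out) and only the strided section is assigned.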