From: "juzhe.zhong at rivai dot ai"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug tree-optimization/113583] Main loop in 519.lbm not vectorized.
Date: Wed, 07 Feb 2024 03:39:22 +0000
X-Bugzilla-Product: gcc
X-Bugzilla-Component: tree-optimization
X-Bugzilla-Version: 14.0
X-Bugzilla-Keywords: missed-optimization
X-Bugzilla-Severity: enhancement
X-Bugzilla-Status: UNCONFIRMED
X-Bugzilla-Priority: P3
X-Bugzilla-Assigned-To: unassigned at gcc dot gnu.org

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113583

--- Comment #12 from JuzheZhong ---
Ok.
I found that even without vectorization, GCC generates worse code than Clang:

https://godbolt.org/z/addr54Gc6

GCC (15 instructions inside the loop):

        fld      fa3,0(a0)
        fld      fa5,8(a0)
        fld      fa1,16(a0)
        fsub.d   fa4,ft2,fa3
        addi     a0,a0,160
        fadd.d   fa5,fa5,fa1
        addi     a1,a1,160
        addi     a5,a5,160
        fmadd.d  fa4,fa4,fa2,fa3
        fnmsub.d fa5,fa5,ft1,ft0
        fsd      fa4,-160(a1)
        fld      fa4,-152(a0)
        fadd.d   fa4,fa4,fa0
        fmadd.d  fa5,fa5,fa2,fa4
        fsd      fa5,-160(a5)

Clang (13 instructions inside the loop):

        fld      fa1, -8(a0)
        fld      fa0, 0(a0)
        fld      ft0, 8(a0)
        fmadd.d  fa1, fa1, fa4, fa5
        fsd      fa1, 0(a1)
        fld      fa1, 0(a0)
        fadd.d   fa0, ft0, fa0
        fmadd.d  fa0, fa0, fa2, fa3
        fadd.d   fa1, fa0, fa1
        add      a4, a1, a3
        fsd      fa1, -376(a4)
        addi     a1, a1, 160
        addi     a0, a0, 160

The critical thing is the floating-point arithmetic. GCC has:

        fsub.d   fa4,ft2,fa3
        fadd.d   fa5,fa5,fa1
        fmadd.d  fa4,fa4,fa2,fa3
        fnmsub.d fa5,fa5,ft1,ft0
        fadd.d   fa4,fa4,fa0
        fmadd.d  fa5,fa5,fa2,fa4

6 floating-point operations.

Clang has:

        fmadd.d  fa1, fa1, fa4, fa5
        fadd.d   fa0, ft0, fa0
        fmadd.d  fa0, fa0, fa2, fa3
        fadd.d   fa1, fa0, fa1

Clang has 4.

The 2 extra floating-point operations are very critical to performance, I
think, since double-precision floating-point operations are usually costly on
real hardware.
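
The floating-point operation counts quoted above can be re-checked mechanically. The following is only a minimal sketch: it copies the two loop bodies from this comment and treats every mnemonic that starts with `f`, other than the loads and stores `fld`/`fsd`, as a floating-point arithmetic operation. That classification rule and the names `GCC_LOOP`, `CLANG_LOOP`, and `fp_arith_ops` are assumptions made for illustration; they are not part of GCC, Clang, or the bug report.

```python
# Loop bodies copied from the comment above (branch instruction not shown there).
GCC_LOOP = """\
fld fa3,0(a0)
fld fa5,8(a0)
fld fa1,16(a0)
fsub.d fa4,ft2,fa3
addi a0,a0,160
fadd.d fa5,fa5,fa1
addi a1,a1,160
addi a5,a5,160
fmadd.d fa4,fa4,fa2,fa3
fnmsub.d fa5,fa5,ft1,ft0
fsd fa4,-160(a1)
fld fa4,-152(a0)
fadd.d fa4,fa4,fa0
fmadd.d fa5,fa5,fa2,fa4
fsd fa5,-160(a5)
"""

CLANG_LOOP = """\
fld fa1, -8(a0)
fld fa0, 0(a0)
fld ft0, 8(a0)
fmadd.d fa1, fa1, fa4, fa5
fsd fa1, 0(a1)
fld fa1, 0(a0)
fadd.d fa0, ft0, fa0
fmadd.d fa0, fa0, fa2, fa3
fadd.d fa1, fa0, fa1
add a4, a1, a3
fsd fa1, -376(a4)
addi a1, a1, 160
addi a0, a0, 160
"""

# Assumption: in these listings the only FP memory instructions are fld/fsd.
FP_MEMORY_OPS = {"fld", "fsd"}


def fp_arith_ops(listing):
    """Return the mnemonics of floating-point arithmetic instructions."""
    ops = []
    for line in listing.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        mnemonic = fields[0]
        # Simplified rule: 'f' prefix and not a load/store.
        if mnemonic.startswith("f") and mnemonic not in FP_MEMORY_OPS:
            ops.append(mnemonic)
    return ops


if __name__ == "__main__":
    for name, listing in (("GCC", GCC_LOOP), ("Clang", CLANG_LOOP)):
        ops = fp_arith_ops(listing)
        print(f"{name}: {len(ops)} FP arithmetic ops: {' '.join(ops)}")
```

The prefix rule is deliberately crude and would misclassify other `f`-prefixed instructions such as moves or conversions, which do not occur in these two listings. It only counts instructions and says nothing about their latency on a particular core.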