From: "jamborm at gcc dot gnu.org"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug tree-optimization/94427] 456.hmmer is 8-17% slower when compiled at -Ofast than with GCC 9
Date: Tue, 31 Mar 2020 23:12:32 +0000

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94427

--- Comment #1 from Martin Jambor ---
OK, so it turns out the identified commit only allows us to shoot ourselves in
the foot - and there is one branch too few, not too many.
The hottest loop, consuming most of the time, is:

  Percent   Instructions
  ------------------------------------------------
    0.03 │ fb0:┌─+add    -0x8(%r9,%rcx,4),%eax
    5.03 │     │  mov    %eax,-0x4(%r13,%rcx,4)
    2.48 │     │  mov    -0x8(%r8,%rcx,4),%esi
    0.02 │     │  add    -0x8(%rdx,%rcx,4),%esi
    0.06 │     │  cmp    %eax,%esi
    4.49 │     │  cmovge %esi,%eax
   17.17 │     │  mov    %ecx,%esi
    0.03 │     │  cmp    $0xc521974f,%eax
    3.50 │     │  cmovl  %ebx,%eax      <----------- this used to be a branch
   21.84 │     │  mov    %eax,-0x4(%r13,%rcx,4)
    3.88 │     │  add    $0x1,%rcx
    0.00 │     │  cmp    %rdi,%rcx
    0.04 │     └──jne    fb0

where the marked conditional move was a branch one revision before, because,
after fwprop3 the IL looked like:

  [local count: 955630217]:
  # cstore_281 = PHI <[fast_algorithms.c:142:53] sc_223(14), [fast_algorithms.c:142:53] cstore_249(15)>
  [fast_algorithms.c:142:49] MEM [(void *)_72] = cstore_281;
  [fast_algorithms.c:143:13] _78 = [fast_algorithms.c:143:13] *_72;
  [fast_algorithms.c:143:10] if (_78 < -987654321)
    goto ; [50.00%]
  else
    goto ; [50.00%]

  [local count: 477815109]:

  [local count: 955630217]:
  # cstore_250 = PHI <[fast_algorithms.c:143:33] -987654321(16), [fast_algorithms.c:143:33] cstore_281(17)>
  [fast_algorithms.c:143:29] MEM [(void *)_72] = cstore_250;

The aforementioned revision turned this into more optimized code:

  [local count: 955630217]:
  # cstore_281 = PHI <[fast_algorithms.c:142:53] sc_223(14), [fast_algorithms.c:142:53] _73(15)>
  [fast_algorithms.c:143:10] if (cstore_281 < -987654321)
    goto ; [50.00%]
  else
    goto ; [50.00%]

  [local count: 477815109]:

  [local count: 955630217]:
  # cstore_250 = PHI <[fast_algorithms.c:143:33] -987654321(16), [fast_algorithms.c:143:33] cstore_281(17)>
  [fast_algorithms.c:143:29] MEM [(void *)_72] = cstore_250;

Which then phiopt3 changed to:

  cstore_248 = MAX_EXPR ;
  [fast_algorithms.c:143:29] MEM [(void *)_72] = cstore_248;

and expander apparently always expands MAX_EXPR into a conditional move if it
can(?).

When I hacked phiopt not to do the transformation for - ehm - any GIMPLE_COND
statement originating from source line 143, I recovered the original run-time
of the benchmark. On both AMD and Intel.
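
For readers without the benchmark at hand, here is a rough C sketch of the two shapes discussed above. It is not the hmmer source: the helper names, the `slot`/`sc` parameters and the `LOWER_BOUND` macro are made up for illustration, and only the constant -987654321 (0xc521974f as a 32-bit immediate) is taken from the dumps. The first helper mirrors the store / reload / compare-and-overwrite form seen after fwprop3, the second the single MAX form that phiopt3 produces. Both compute the same value, so the difference is purely in how the expander chooses to emit them (branch vs. conditional move).

```c
#include <assert.h>
#include <stdint.h>

/* Lower bound taken from the IL dumps; everything else here is hypothetical. */
#define LOWER_BOUND (-987654321)

/* Branchy shape: store, reload, then conditionally overwrite with the bound. */
static void clamp_with_branch(int32_t *slot, int32_t sc)
{
    *slot = sc;
    if (*slot < LOWER_BOUND)
        *slot = LOWER_BOUND;
}

/* MAX shape: a single store of max(sc, LOWER_BOUND). */
static void clamp_with_max(int32_t *slot, int32_t sc)
{
    *slot = sc > LOWER_BOUND ? sc : LOWER_BOUND;
}

/* Sanity check that both shapes agree around the bound and at the extremes. */
static void check_equivalence(void)
{
    const int32_t samples[] = {
        INT32_MIN, LOWER_BOUND - 1, LOWER_BOUND, LOWER_BOUND + 1, 0, INT32_MAX
    };
    const int n = (int)(sizeof samples / sizeof samples[0]);

    for (int i = 0; i < n; i++) {
        int32_t a, b;
        clamp_with_branch(&a, samples[i]);
        clamp_with_max(&b, samples[i]);
        assert(a == b);
    }

    /* The immediate in the cmp instruction is the same bound, viewed as unsigned. */
    assert((uint32_t)LOWER_BOUND == 0xc521974fu);
}
```

Whether a given compiler turns either helper into a branch or a cmov depends on the version and flags, so the sketch only demonstrates the semantic equivalence, not the code generation.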