From: "hubicka at gcc dot gnu.org"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug target/110649] [14 Regression] 25% sphinx3 spec2006 regression on Ice Lake and zen between g:acaa441a98bebc52 (2023-07-06 11:36) and g:55900189ab517906 (2023-07-07 00:23)
Date: Tue, 18 Jul 2023 14:49:56 +0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: changed
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: gcc
X-Bugzilla-Component: target
X-Bugzilla-Version: 14.0
X-Bugzilla-Keywords: missed-optimization, needs-bisection
X-Bugzilla-Severity: normal
X-Bugzilla-Who: hubicka at gcc dot gnu.org
X-Bugzilla-Status: NEW
X-Bugzilla-Priority: P3
X-Bugzilla-Assigned-To: unassigned at gcc dot gnu.org
X-Bugzilla-Target-Milestone: 14.0
X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/
Auto-Submitted: auto-generated

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110649

--- Comment #14 from Jan Hubicka ---
Chasing profile update bugs out of the two hottest functions did not solve the
regression. Moreover, the weekly testers confirm it was not noise on zens
either.

Before the change we get:

  34.58%  sphinx_livepret  [.] mgau_eval
  26.61%  sphinx_livepret  [.] vector_gautbl_eval_logs3
   8.94%  sphinx_livepret  [.] subvq_mgau_shortlist
   7.36%  sphinx_livepret  [.] logs3_add
   5.66%  sphinx_livepret  [.] approx_cont_mgau_frame_eval
   4.68%  sphinx_livepret  [.] mdef_sseq2sen_active
   3.38%  sphinx_livepret  [.] dict2pid_comsenscr
   1.66%  sphinx_livepret  [.] hmm_vit_eval_3st
   0.90%  sphinx_livepret  [.] lextree_hmm_eval
   0.73%  sphinx_livepret  [.] lextree_hmm_propagate
   0.71%  sphinx_livepret  [.] lextree_enter
   0.68%  sphinx_livepret  [.] fe_fft
   0.49%  sphinx_livepret  [.] dict2pid_comsseq2sen_active
   0.35%  sphinx_livepret  [.] lextree_ssid_active
   0.20%  sphinx_livepret  [.] vithist_rescore

So the difference seems to be mgau_eval.

Both versions of mgau_eval have almost the same code layout. The main
difference is register allocation. The old version does more spilling around
the call:

  0.01 │ and     $0xffffffffffffffe0,%rsp
  0.14 │ mov     %rcx,%rbx
  0.00 │ sub     $0xa0,%rsp
  0.04 │ mov     0x10(%rdi),%rax
  0.13 │ mov     0x8(%rdi),%r15d
  0.01 │ vmovaps %xmm3,0x80(%rsp)
  0.22 │ vmovaps %xmm2,0x90(%rsp)
  0.03 │ mov     %rdi,0x70(%rsp)
  0.05 │ lea     (%rax,%rdx,8),%r14
  0.01 │ call    log_to_logs3_factor
  1.00 │ test    %r13,%r13
  0.00 │ vxorps  %xmm4,%xmm4,%xmm4
  0.02 │ vmovsd  %xmm0,0x78(%rsp)
  0.00 │ je      433
  0.01 │ movslq  0x0(%r13),%rax
  0.02 │ mov     $0xc8000000,%edi
  0.01 │ vmovaps 0x90(%rsp),%xmm2
  0.23 │ vmovaps 0x80(%rsp),%xmm3
  0.09 │ test    %eax,%eax
  0.00 │ js      3f9

The new version is missing the spill of xmm2/3:

  0.02 │ and     $0xffffffffffffffe0,%rsp
  0.03 │ mov     %rcx,%rbx
  0.01 │ add     $0xffffffffffffff80,%rsp
  0.03 │ mov     0x10(%rdi),%rax
  0.16 │ mov     0x8(%rdi),%r15d
  0.06 │ mov     %rdi,0x50(%rsp)
  0.12 │ lea     (%rax,%rdx,8),%r14
  0.01 │ call    log_to_logs3_factor
  0.75 │ test    %r12,%r12
  0.00 │ vxorps  %xmm3,%xmm3,%xmm3
  0.01 │ vmovsd  %xmm0,0x58(%rsp)
  0.01 │ je      3f2
  0.01 │ movslq  (%r12),%rcx
  0.00 │ mov     $0xc8000000,%edi
       │ test    %ecx,%ecx
  0.14 │ js      3b8

Which looks better. log_to_logs3_factor just returns a constant:

Percent│ vmovsd  invlogB,%xmm0
       │ ret

I wonder why we no longer need to spill. log_to_logs3_factor is from another
translation unit and this is a non-LTO build. Maybe there are undefined
variables.

The new version does:

  0.29 │ vmovhps %xmm4,0x70(%rsp)
  0.11 │ vmovaps 0x70(%rsp),%xmm7

and this looks odd.
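
For anyone wanting to look at the spill question in isolation, here is a
minimal C sketch of the situation described above: two floating-point values
that stay live across a call to a function that only returns a global. It is
not the sphinx3 source. Every name ending in _sketch and the value 2.0 are
made up for illustration. Under the x86-64 System V ABI all xmm registers are
call-clobbered, so a value kept in xmm across an opaque call normally has to
be saved and reloaded, which is what the old code does with xmm2/xmm3.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for invlogB; the value is an assumption. */
double invlogB_sketch = 2.0;

/* Stand-in for log_to_logs3_factor: it only returns a global.
   noinline keeps the call visible; in a faithful experiment this
   function would live in a separate file of a non-LTO build. */
__attribute__((noinline)) double log_to_logs3_factor_sketch(void)
{
    return invlogB_sketch;
}

/* a and b are set before the call and used after it, so they are
   live across the call, like xmm2/xmm3 in the old mgau_eval code. */
double eval_sketch(const double *x, size_t n, double a, double b)
{
    assert(x != NULL || n == 0);

    double f = log_to_logs3_factor_sketch();
    double sum = 0.0;

    for (size_t i = 0; i < n; i++)
        sum += (x[i] - a) * b;

    return sum * f;
}
```

Compiling this with something like "gcc -O2 -mavx -S" and checking for
vmovaps or vmovsd stores and reloads around the call should show whether the
values are spilled. Note that with the callee in the same file the compiler
may see that it leaves most registers alone, so the result is only
comparable to the bug once the callee is moved to its own translation unit.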