From: "rguenth at gcc dot gnu.org"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug middle-end/108376] TSVC s1279 runs 40% faster with aocc than gcc at zen4
Date: Thu, 12 Jan 2023 10:34:17 +0000

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108376

Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Last reconfirmed|                            |2023-01-12
             Status|UNCONFIRMED                 |NEW
     Ever confirmed|0                           |1

--- Comment #2 from Richard Biener ---
As far as I can see a[] is all zeros.  AOCC basically preserves the loop
control flow when if (a[i] < 0.)
for all elements processed in the iteration, likewise for if (b[i] > a[i]),
but GCC if-converts this all down to combined masking of the guarded code.
I think the testcase as-is is too artificial to be relevant.  GCC has code
to do such a thing for masked stores, but in this case we are not using
masked stores or masked loads:

.L3:
        vmovaps a(%rax), %ymm3
        vmovaps b(%rax), %ymm4
        vmovaps c(%rax), %ymm7
        addq    $32, %rax
        vmovaps c-32(%rax), %ymm0
        vmovaps e-32(%rax), %ymm5
        vcmpps  $1, %ymm1, %ymm3, %k1
        vcmpps  $14, %ymm3, %ymm4, %k1{%k1}
        vfmadd231ps     d-32(%rax), %ymm5, %ymm0{%k1}
        vfmadd231ps     d-32(%rax), %ymm5, %ymm0
        vblendmps       %ymm0, %ymm7, %ymm0{%k1}
        vmovaps %ymm0, c-32(%rax)
        cmpq    $128000, %rax
        jne     .L3

I suspect that if you do a less optimal initialization of a/b then the AOCC
code will be slower.  Note GCC applies unroll-and-jam to the loop (the outer
iteration is visibly redundant, so we are possibly doing half of the work
AOCC does ;))

Confirmed for us not vectorizing control flow but if-converting.
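
For readers following along, the difference between the two strategies can be sketched in plain C. This is only an illustration under stated assumptions: the loop body is reconstructed from the two conditions quoted above and from the assembly (c[i] += d[i] * e[i] under both guards), the function names and the block width VF are hypothetical, and the block-wise variant is a guess at what "preserving control flow" amounts to, not AOCC's actual code generation.

```c
#include <assert.h>
#include <stddef.h>

#define VF 8 /* assumed vector width: 8 floats per 256-bit register */

/* Scalar loop with both guards kept as branches (reconstructed, see above). */
void s1279_branchy(size_t n, const float *a, const float *b, float *c,
                   const float *d, const float *e)
{
    for (size_t i = 0; i < n; i++)
        if (a[i] < 0.f)
            if (b[i] > a[i])
                c[i] += d[i] * e[i];
}

/* If-converted form: always load and compute, then select under the
   combined mask, similar in spirit to the vcmpps/vblendmps sequence. */
void s1279_ifconverted(size_t n, const float *a, const float *b, float *c,
                       const float *d, const float *e)
{
    for (size_t i = 0; i < n; i++) {
        int mask = (a[i] < 0.f) & (b[i] > a[i]);
        float updated = c[i] + d[i] * e[i];
        c[i] = mask ? updated : c[i];
    }
}

/* Block-wise form with a branch kept around the guarded code: if no lane
   in a block of VF elements is active, c, d and e are never touched.
   Returns the number of blocks that did any work. */
size_t s1279_blockwise(size_t n, const float *a, const float *b, float *c,
                       const float *d, const float *e)
{
    size_t active_blocks = 0;
    assert(n % VF == 0); /* sketch only: no remainder loop */
    for (size_t i = 0; i < n; i += VF) {
        int mask[VF];
        int any = 0;
        for (size_t j = 0; j < VF; j++) {
            mask[j] = (a[i + j] < 0.f) & (b[i + j] > a[i + j]);
            any |= mask[j];
        }
        if (!any)
            continue; /* whole block skipped */
        active_blocks++;
        for (size_t j = 0; j < VF; j++)
            if (mask[j])
                c[i + j] += d[i + j] * e[i + j];
    }
    return active_blocks;
}
```

With a[] all zeros the block-wise variant skips every block, while the if-converted variant still performs all loads, the multiply-add and the store. Once some lanes become active in most blocks, the block-wise variant pays for the extra branch on top of the same work, which is in line with the suspicion above that a different initialization of a/b would change the comparison.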