From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-bugzilla@gcc.gnu.org>
Received: by sourceware.org (Postfix, from userid 48)
	id 0569C3858C66; Thu, 12 Jan 2023 10:34:18 +0000 (GMT)
DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 0569C3858C66
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gcc.gnu.org;
	s=default; t=1673519659;
	bh=wVPqeQrLjJ1h50mrJ1cPlBEXN5Lim63DlxlADsmPzB4=;
	h=From:To:Subject:Date:In-Reply-To:References:From;
	b=VtRVy/j0fVhe68/aviM5Jo2wD50GkGJ4a8fzK1NhkvJ+GvCVoR9ejq3ov75Fl1EzR
	 XeD3b0CB8eQCk/1ZmStKUJeEd6BmMVV81iqiJhIw3JTwTYI1wnANooG2dzjPqzAXm9
	 VgjHQO566mRVDr/mqJQrs9cSWE53jtKKay8k5UmY=
From: "rguenth at gcc dot gnu.org" <gcc-bugzilla@gcc.gnu.org>
To: gcc-bugs@gcc.gnu.org
Subject: [Bug middle-end/108376] TSVC s1279 runs 40% faster with aocc than
 gcc at zen4
Date: Thu, 12 Jan 2023 10:34:17 +0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: changed
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: gcc
X-Bugzilla-Component: middle-end
X-Bugzilla-Version: 13.0
X-Bugzilla-Keywords: 
X-Bugzilla-Severity: normal
X-Bugzilla-Who: rguenth at gcc dot gnu.org
X-Bugzilla-Status: NEW
X-Bugzilla-Resolution: 
X-Bugzilla-Priority: P3
X-Bugzilla-Assigned-To: unassigned at gcc dot gnu.org
X-Bugzilla-Target-Milestone: ---
X-Bugzilla-Flags: 
X-Bugzilla-Changed-Fields: cf_reconfirmed_on bug_status everconfirmed
Message-ID: <bug-108376-4-bnk2BmjeaM@http.gcc.gnu.org/bugzilla/>
In-Reply-To: <bug-108376-4@http.gcc.gnu.org/bugzilla/>
References: <bug-108376-4@http.gcc.gnu.org/bugzilla/>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
List-Id: <gcc-bugs.sourceware.org>

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108376

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Last reconfirmed|                            |2023-01-12
             Status|UNCONFIRMED                 |NEW
     Ever confirmed|0                           |1
--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
As far as I can see, a[] is all zeros.  AOCC basically preserves the
loop control flow, testing if (a[i] < 0.) for all elements processed in
a vector iteration at once, likewise for if (b[i] > a[i]), whereas GCC
if-converts all of this down to combined masking of the guarded code.
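
For illustration, here is a minimal C sketch of the two strategies.  It
assumes the guarded statement is c[i] += d[i] * e[i] and a vector factor
of 8 floats.  The function names and VF are made up for this example and
are not taken from TSVC, GCC or AOCC, and the inner loops only model
what a vectorizer would emit as single vector instructions:

```c
#include <stddef.h>

#define VF 8  /* assumed vector factor: 8 floats per 256-bit vector */

/* Scalar reference: the guarded update with both conditions as branches. */
void kernel_scalar(size_t n, const float *a, const float *b,
                   float *c, const float *d, const float *e)
{
    for (size_t i = 0; i < n; i++)
        if (a[i] < 0.f)
            if (b[i] > a[i])
                c[i] += d[i] * e[i];
}

/* If-converted form: always compute the update, then select per element
   with the combined mask.  No branch on the data remains. */
void kernel_ifcvt(size_t n, const float *a, const float *b,
                  float *c, const float *d, const float *e)
{
    for (size_t i = 0; i < n; i++) {
        int mask = (a[i] < 0.f) & (b[i] > a[i]);
        float updated = c[i] + d[i] * e[i];
        c[i] = mask ? updated : c[i];
    }
}

/* Control-flow-preserving form: test the outer condition for a whole
   chunk of VF elements and skip the chunk when no element passes. */
void kernel_branchy(size_t n, const float *a, const float *b,
                    float *c, const float *d, const float *e)
{
    size_t i = 0;
    for (; i + VF <= n; i += VF) {
        int any = 0;
        for (size_t j = 0; j < VF; j++)
            any |= (a[i + j] < 0.f);
        if (!any)
            continue;  /* cheap path: nothing to do for this chunk */
        for (size_t j = 0; j < VF; j++)
            if (a[i + j] < 0.f && b[i + j] > a[i + j])
                c[i + j] += d[i + j] * e[i + j];
    }
    /* Scalar epilogue for the remaining n % VF elements. */
    for (; i < n; i++)
        if (a[i] < 0.f && b[i] > a[i])
            c[i] += d[i] * e[i];
}
```

With a[] all zeros, kernel_branchy never leaves the cheap path, while
kernel_ifcvt still loads c/d/e and computes the update for every element.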

I think the testcase as-is is too artificial to be relevant.  GCC
has code to do such a thing for masked stores, but in this case
we are not using masked stores or masked loads:

.L3:
        vmovaps a(%rax), %ymm3
        vmovaps b(%rax), %ymm4
        vmovaps c(%rax), %ymm7
        addq    $32, %rax
        vmovaps c-32(%rax), %ymm0
        vmovaps e-32(%rax), %ymm5
        vcmpps  $1, %ymm1, %ymm3, %k1
        vcmpps  $14, %ymm3, %ymm4, %k1{%k1}
        vfmadd231ps     d-32(%rax), %ymm5, %ymm0{%k1}
        vfmadd231ps     d-32(%rax), %ymm5, %ymm0
        vblendmps       %ymm0, %ymm7, %ymm0{%k1}
        vmovaps %ymm0, c-32(%rax)
        cmpq    $128000, %rax
        jne     .L3

I suspect if you do a less optimal initialization of a/b then the AOCC
code will be slower.

Note GCC applies unroll-and-jam to the loop (the outer iteration is
visibly redundant, so we end up doing half the work AOCC does ;))

Confirmed for us not vectorizing control flow but if-converting.