From: "rguenth at gcc dot gnu.org"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug tree-optimization/115531] vectorizer generates inefficient code for masked conditional update loops
Date: Tue, 18 Jun 2024 06:44:16 +0000

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115531

--- Comment #4 from Richard Biener ---

AVX512 produces

.L3:
        vmovdqu8        (%rsi), %zmm9{%k1}
        kshiftrq        $32, %k1, %k5
        kshiftrq        $48, %k1, %k4
        movl    %r9d, %eax
        vmovdqu32       128(%rcx), %zmm7{%k5}
        subl    %esi, %eax
        movl    $64, %edi
        vmovdqu32       128(%rdx), %zmm3{%k5}
        kshiftrq        $16, %k1, %k6
        addl    %r10d, %eax
        vmovdqu32       192(%rcx), %zmm8{%k4}
        cmpl    %edi, %eax
        vmovdqu32       192(%rdx), %zmm4{%k4}
        cmova   %edi, %eax
        addq    $64, %rsi
        addq    $256, %rcx
        vmovdqu32       -256(%rcx), %zmm5{%k1}
        vmovdqu32       (%rdx), %zmm1{%k1}
        vmovdqu32       -192(%rcx), %zmm6{%k6}
        vmovdqu32       64(%rdx), %zmm2{%k6}
        vpcmpb  $4, %zmm14, %zmm9, %k2
        kshiftrq        $32, %k2, %k3
        vpblendmd       %zmm7, %zmm3, %zmm10{%k3}
        kshiftrd        $16, %k3, %k3
        vpblendmd       %zmm8, %zmm4, %zmm0{%k3}
        vpblendmd       %zmm5, %zmm1, %zmm12{%k2}
        vmovdqu32       %zmm10, 128(%rdx){%k5}
        kshiftrd        $16, %k2, %k2
        vmovdqu32       %zmm0, 192(%rdx){%k4}
        vpblendmd       %zmm6, %zmm2, %zmm11{%k2}
        vpbroadcastb    %eax, %zmm0
        movl    %r9d, %eax
        vmovdqu32       %zmm12, (%rdx){%k1}
        subl    %esi, %eax
        addl    %r8d, %eax
        vmovdqu32       %zmm11, 64(%rdx){%k6}
        addq    $256, %rdx
        vpcmpub $6, %zmm13, %zmm0, %k1
        cmpl    $64, %eax
        ja      .L3

The vectorizer sees

  [local count: 955630225]:
  # i_26 = PHI <...>
  _1 = (long unsigned int) i_26;
  _2 = _1 * 4;
  _3 = c_17(D) + _2;
  res_18 = *_3;
  _4 = stride_14(D) + i_26;
  _5 = (long unsigned int) _4;
  _6 = _5 * 4;
  _7 = b_19(D) + _6;
  t_20 = *_7;
  _8 = a_21(D) + _1;
  _9 = *_8;
  _34 = _9 != 0;
  res_11 = _34 ? t_20 : res_18;
  *_3 = res_11;
  i_23 = i_26 + 1;
  if (n_16(D) > i_23)

I believe that to get proper vectorizer costing we want an optimization
phase that can take into account whether we use a masked loop or not.
Note that your intended transform relies on identifying the open-coded
"conditional store"

  int res = c[i];
  if (a[i] != 0)
    res = t;
  c[i] = res;

As Andrew says, when that's a .MASK_STORE it's going to be easier to
identify the opportunity.  So yeah, if-conversion could recognize this
pattern and produce a .MASK_STORE from it as a first step.