[Bug tree-optimization/115531] New: vectorizer generates inefficient code for masked conditional update loops

public inbox for gcc-bugs@sourceware.org

* [Bug tree-optimization/115531] New: vectorizer generates inefficient code for masked conditional update loops
@ 2024-06-18  3:14 tnfchris at gcc dot gnu.org
  2024-06-18  3:18 ` [Bug tree-optimization/115531] " pinskia at gcc dot gnu.org
                   ` (5 more replies)
  0 siblings, 6 replies; 7+ messages in thread
From: tnfchris at gcc dot gnu.org @ 2024-06-18  3:14 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115531

            Bug ID: 115531
           Summary: vectorizer generates inefficient code for masked
                    conditional update loops
           Product: gcc
           Version: 15.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: tnfchris at gcc dot gnu.org
  Target Milestone: ---

The following code:

void __attribute__((noipa))
foo (char *restrict a, int *restrict b, int *restrict c, int n, int stride)
{
  if (stride <= 1)
    return;

  for (int i = 0; i < n; i++)
    {
      int res = c[i];
      int t = b[i+stride];
      if (a[i] != 0)
        res = t;
      c[i] = res;
    }
}

generates at -O3 -g0 -mcpu=generic+sve:

.L3:
        ld1b    z29.s, p7/z, [x0, x5]
        ld1w    z31.s, p7/z, [x2, x5, lsl 2]
        ld1w    z30.s, p7/z, [x1, x5, lsl 2]
        cmpne   p15.b, p6/z, z29.b, #0
        sel     z30.s, p15, z30.s, z31.s
        st1w    z30.s, p7, [x2, x5, lsl 2]
        add     x5, x5, x4
        whilelo p7.s, w5, w3
        b.any   .L3
.L1:

This makes vectorization unprofitable until n is very large, because the
vector loop has more instructions than it needs.

Since the store is already masked, a value that is only conditionally updated
doesn't need the intermediate VEC_COND_EXPR: the condition can be used as the
store mask instead.  This loop can be vectorized as:

.L3:
        ld1b    z29.s, p7/z, [x0, x5]
        ld1w    z31.s, p7/z, [x2, x5, lsl 2]
        cmpne   p4.b, p6/z, z29.b, #0
        st1w    z31.s, p4, [x2, x5, lsl 2]
        add     x5, x5, x4
        whilelo p7.s, w5, w3
        b.any   .L3
.L1:
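
A minimal lane-by-lane sketch in C of why the blend is redundant.  It assumes
a fixed, hypothetical vector length VL and assumes that the mask of the
proposed store is the AND of the loop mask and the comparison result.  The
function names are made up for illustration and are not GCC internals.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define VL 4  /* hypothetical fixed vector length, for illustration only */

/* Shape of the current codegen: blend old and new values (sel), then
   store every lane that is active in the loop mask (st1w under p7).  */
static void
select_then_store (int *c, const int *t, const bool *cond,
                   const bool *loop_mask)
{
  int blended[VL];
  for (int l = 0; l < VL; l++)
    {
      int old = loop_mask[l] ? c[l] : 0;   /* zeroing masked load of c */
      blended[l] = cond[l] ? t[l] : old;   /* VEC_COND_EXPR */
    }
  for (int l = 0; l < VL; l++)
    if (loop_mask[l])
      c[l] = blended[l];
}

/* Shape of the proposed codegen: no load of c and no blend; store the
   new value only where both the loop mask and the condition are set.  */
static void
store_under_combined_mask (int *c, const int *t, const bool *cond,
                           const bool *loop_mask)
{
  for (int l = 0; l < VL; l++)
    if (loop_mask[l] && cond[l])
      c[l] = t[l];
}

/* Run both forms on the same input and check that memory ends up equal.  */
void
check_lane_forms (void)
{
  const int t[VL] = { 10, 20, 30, 40 };
  const bool cond[VL] = { true, false, true, true };
  const bool loop_mask[VL] = { true, true, true, false };  /* tail lane off */
  int c1[VL] = { 1, 2, 3, 4 };
  int c2[VL] = { 1, 2, 3, 4 };

  select_then_store (c1, t, cond, loop_mask);
  store_under_combined_mask (c2, t, cond, loop_mask);
  assert (memcmp (c1, c2, sizeof c1) == 0);
}
```

Lanes where the condition is false keep their old value in both forms: the
first writes the old value back, the second does not write at all.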

I have prototyped a load-to-store forwarding optimization in forwprop, but I'm
looking to move it into the vectorizer so that it can be costed properly.
However, I'm not entirely sure what the best way to do so is.

I can certainly fix it up during codegen, but to cost it I need to do so during
analysis. I could detect it during vectorizable_condition, but then the dead
load is still costed. Or I could maybe use a pattern, but I'm unsure how to
represent the mask into the load.

Is it valid to produce a pattern with .MASK_STORE (IFN_MASK_STORE)?


* [Bug tree-optimization/115531] vectorizer generates inefficient code for masked conditional update loops
From: pinskia at gcc dot gnu.org @ 2024-06-18  3:18 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115531

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Last reconfirmed|                            |2024-06-18
           See Also|                            |https://gcc.gnu.org/bugzilla/show_bug.cgi?id=20999
     Ever confirmed|0                           |1
             Status|UNCONFIRMED                 |NEW

--- Comment #1 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
I suspect PR 20999 would fix this ...
but we have to be careful since without masked stores, you could still
vectorize this unlike the transformed version.

Maybe ifcvt can produce a masked store version if this pattern ...


* [Bug tree-optimization/115531] vectorizer generates inefficient code for masked conditional update loops
From: pinskia at gcc dot gnu.org @ 2024-06-18  3:18 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115531

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Severity|normal                      |enhancement


* [Bug tree-optimization/115531] vectorizer generates inefficient code for masked conditional update loops
From: pinskia at gcc dot gnu.org @ 2024-06-18  3:22 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115531

--- Comment #2 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
(In reply to Andrew Pinski from comment #1)
> Maybe ifcvt can produce a masked store version if this pattern ...

Maybe add another argument to .MASK_STORE to say it was originally an
unconditional store? Or something like that.


* [Bug tree-optimization/115531] vectorizer generates inefficient code for masked conditional update loops
From: tnfchris at gcc dot gnu.org @ 2024-06-18  3:30 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115531

--- Comment #3 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
(In reply to Andrew Pinski from comment #1)
> I suspect PR 20999 would fix this ...
> but we have to be careful since without masked stores, you could still
> vectorize this unlike the transformed version.
> 
> Maybe ifcvt can produce a masked store version if this pattern ...

Doing so during ifcvt forces you to commit to a masked operation, so you lose
the ability to not vectorize for non-fully masked architectures.

So it's too early.  A vector pattern doesn't have this problem. This question
was mostly about the degree to which the vectorizer supports MASK_STORE as an
input. vect_get_vector_types_for_stmt seems to have support for it, so it looks
like it may work.


* [Bug tree-optimization/115531] vectorizer generates inefficient code for masked conditional update loops
From: rguenth at gcc dot gnu.org @ 2024-06-18  6:44 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115531

--- Comment #4 from Richard Biener <rguenth at gcc dot gnu.org> ---
AVX512 produces

.L3:
        vmovdqu8        (%rsi), %zmm9{%k1}
        kshiftrq        $32, %k1, %k5
        kshiftrq        $48, %k1, %k4
        movl    %r9d, %eax
        vmovdqu32       128(%rcx), %zmm7{%k5}
        subl    %esi, %eax
        movl    $64, %edi
        vmovdqu32       128(%rdx), %zmm3{%k5}
        kshiftrq        $16, %k1, %k6
        addl    %r10d, %eax
        vmovdqu32       192(%rcx), %zmm8{%k4}
        cmpl    %edi, %eax
        vmovdqu32       192(%rdx), %zmm4{%k4}
        cmova   %edi, %eax
        addq    $64, %rsi
        addq    $256, %rcx
        vmovdqu32       -256(%rcx), %zmm5{%k1}
        vmovdqu32       (%rdx), %zmm1{%k1}
        vmovdqu32       -192(%rcx), %zmm6{%k6}
        vmovdqu32       64(%rdx), %zmm2{%k6}
        vpcmpb  $4, %zmm14, %zmm9, %k2
        kshiftrq        $32, %k2, %k3
        vpblendmd       %zmm7, %zmm3, %zmm10{%k3}
        kshiftrd        $16, %k3, %k3
        vpblendmd       %zmm8, %zmm4, %zmm0{%k3}
        vpblendmd       %zmm5, %zmm1, %zmm12{%k2}
        vmovdqu32       %zmm10, 128(%rdx){%k5}
        kshiftrd        $16, %k2, %k2
        vmovdqu32       %zmm0, 192(%rdx){%k4}
        vpblendmd       %zmm6, %zmm2, %zmm11{%k2}
        vpbroadcastb    %eax, %zmm0
        movl    %r9d, %eax
        vmovdqu32       %zmm12, (%rdx){%k1}
        subl    %esi, %eax
        addl    %r8d, %eax
        vmovdqu32       %zmm11, 64(%rdx){%k6}
        addq    $256, %rdx
        vpcmpub $6, %zmm13, %zmm0, %k1
        cmpl    $64, %eax
        ja      .L3


The vectorizer sees

  <bb 4> [local count: 955630225]:
  # i_26 = PHI <i_23(9), 0(16)>
  _1 = (long unsigned int) i_26;
  _2 = _1 * 4;
  _3 = c_17(D) + _2;
  res_18 = *_3;
  _4 = stride_14(D) + i_26;
  _5 = (long unsigned int) _4;
  _6 = _5 * 4;
  _7 = b_19(D) + _6;
  t_20 = *_7;
  _8 = a_21(D) + _1;
  _9 = *_8;
  _34 = _9 != 0;
  res_11 = _34 ? t_20 : res_18;
  *_3 = res_11;
  i_23 = i_26 + 1;
  if (n_16(D) > i_23)

I believe that to get proper vectorizer costing we want to have an
optimization phase that can take into account whether we use a masked
loop or not.  Note that your intended transform relies on identifying
the open-coded "conditional store"

      int res = c[i];
      if (a[i] != 0)
        res = t;
      c[i] = res;

As Andrew says, when that's a .MASK_STORE it's going to be easier to
identify the opportunity.  So yeah, if-conversion could recognize
this pattern and produce a .MASK_STORE from it as a first step.
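
A minimal sketch in scalar C of the equivalence such a recognizer would rely
on: the open-coded form stores unconditionally but writes the old value back
when the condition is false, so it leaves memory in the same state as an
explicitly conditional store.  The function names are hypothetical, and the
stride guard from the testcase is left out for brevity.

```c
#include <assert.h>
#include <string.h>

/* Open-coded "conditional store", as in the testcase: the store is
   unconditional, but it writes back the old value when a[i] == 0.  */
void
update_open_coded (const char *restrict a, const int *restrict b,
                   int *restrict c, int n, int stride)
{
  for (int i = 0; i < n; i++)
    {
      int res = c[i];
      int t = b[i + stride];
      if (a[i] != 0)
        res = t;
      c[i] = res;
    }
}

/* The same update with the store made explicitly conditional; this is
   the scalar shape that a masked store expresses directly.  */
void
update_cond_store (const char *restrict a, const int *restrict b,
                   int *restrict c, int n, int stride)
{
  for (int i = 0; i < n; i++)
    {
      int t = b[i + stride];
      if (a[i] != 0)
        c[i] = t;
    }
}

/* Run both forms on the same input and check that c ends up equal.  */
void
check_scalar_forms (void)
{
  const char a[5] = { 1, 0, 1, 0, 0 };
  const int b[7] = { 0, 0, 100, 101, 102, 103, 104 };
  int c1[5] = { 1, 2, 3, 4, 5 };
  int c2[5] = { 1, 2, 3, 4, 5 };

  update_open_coded (a, b, c1, 5, 2);
  update_cond_store (a, b, c2, 5, 2);
  assert (memcmp (c1, c2, sizeof c1) == 0);
}
```

As noted in comment #1, the first form can still be vectorized on targets
without masked stores, while the second form cannot.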


* [Bug tree-optimization/115531] vectorizer generates inefficient code for masked conditional update loops
From: tnfchris at gcc dot gnu.org @ 2024-06-25  5:29 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115531

Tamar Christina <tnfchris at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Assignee|unassigned at gcc dot gnu.org      |tnfchris at gcc dot gnu.org
             Status|NEW                         |ASSIGNED
