public inbox for gcc-bugs@sourceware.org
* [Bug tree-optimization/107096] New: Fully masking vectorization with AVX512 ICEs gcc.dg/vect/vect-over-widen-*.c
From: rguenth at gcc dot gnu.org @ 2022-09-30 11:41 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107096

            Bug ID: 107096
           Summary: Fully masking vectorization with AVX512 ICEs
                    gcc.dg/vect/vect-over-widen-*.c
           Product: gcc
           Version: 13.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: rguenth at gcc dot gnu.org
  Target Milestone: ---

With fully masking vectorization enabled for AVX512 the gcc.dg/vect/vect-over-widen-*.c tests ICE like

/home/rguenther/src/trunk/gcc/testsuite/gcc.dg/vect/vect-over-widen-2.c:11:1:
error: conversion of register to a different size in 'view_convert_expr'
VIEW_CONVERT_EXPR<vector(8) <signed-boolean:1>>(loop_mask_11);

_164 = VIEW_CONVERT_EXPR<vector(8) <signed-boolean:1>>(loop_mask_11);
/home/rguenther/src/trunk/gcc/testsuite/gcc.dg/vect/vect-over-widen-2.c:11:1:
error: conversion of register to a different size in 'view_convert_expr'
VIEW_CONVERT_EXPR<vector(8) <signed-boolean:1>>(loop_mask_8);

_167 = VIEW_CONVERT_EXPR<vector(8) <signed-boolean:1>>(loop_mask_8);
/home/rguenther/src/trunk/gcc/testsuite/gcc.dg/vect/vect-over-widen-2.c:11:1:
error: conversion of register to a different size in 'view_convert_expr'
VIEW_CONVERT_EXPR<vector(8) <signed-boolean:1>>(loop_mask_118);

_170 = VIEW_CONVERT_EXPR<vector(8) <signed-boolean:1>>(loop_mask_118);
/home/rguenther/src/trunk/gcc/testsuite/gcc.dg/vect/vect-over-widen-2.c:11:1:
error: conversion of register to a different size in 'view_convert_expr'
VIEW_CONVERT_EXPR<vector(8) <signed-boolean:1>>(loop_mask_119);

_173 = VIEW_CONVERT_EXPR<vector(8) <signed-boolean:1>>(loop_mask_119);
during GIMPLE pass: vect
dump file: ./vect-over-widen-2.c.172t.vect
/home/rguenther/src/trunk/gcc/testsuite/gcc.dg/vect/vect-over-widen-2.c:11:1:
internal compiler error: verify_gimple failed


The issue is that in vect_get_loop_mask we do

  if (maybe_ne (TYPE_VECTOR_SUBPARTS (mask_type),
                TYPE_VECTOR_SUBPARTS (vectype)))
    {
      /* A loop mask for data type X can be reused for data type Y
         if X has N times more elements than Y and if Y's elements
         are N times bigger than X's.  In this case each sequence
         of N elements in the loop mask will be all-zero or all-one.
         We can then view-convert the mask so that each sequence of
         N elements is replaced by a single element.  */
      gcc_assert (multiple_p (TYPE_VECTOR_SUBPARTS (mask_type),
                              TYPE_VECTOR_SUBPARTS (vectype)));
      gimple_seq seq = NULL;
      mask_type = truth_type_for (vectype);
      mask = gimple_build (&seq, VIEW_CONVERT_EXPR, mask_type, mask);

but that ends up VIEW_CONVERTing vector(16) <signed-boolean:1> to
vector(8) <signed-boolean:1>, in this case a HImode mask to a QImode mask.

If I understand the comment correctly it wants to re-use a mask
for V16QI for V8HI, indicating that the mask bits should be set in pairs
in the V16QI mask.  But then I don't understand how VIEW_CONVERTing
would perform the desired conversion.  It might be OK if the mask is
a byte mask (like for classic AVX2), but for AVX512 it would require
unpacking even/odd bits.
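
To make this concrete, here's a rough scalar sketch of the difference
(purely illustrative, not GCC code; compress_pairs is a made-up helper).
With an AVX2-style byte mask the reinterpretation is a no-op on the bytes,
but with an AVX512-style bitmask we'd have to compress every even/odd bit
pair into a single bit:

#include <stdint.h>
#include <stdio.h>

/* Compress an AVX512-style 16-lane bitmask (for V16QI) into the 8-lane
   bitmask needed for V8HI by keeping one bit per pair.  */
static uint8_t
compress_pairs (uint16_t mask16)
{
  uint8_t mask8 = 0;
  for (int i = 0; i < 8; i++)
    if (mask16 & (1u << (2 * i)))
      mask8 |= (uint8_t) (1u << i);
  return mask8;
}

int
main (void)
{
  uint16_t mask16 = 0x003f;  /* 6 of the 16 QI lanes active, in pairs */
  /* What the V_C_E effectively does: reinterpret the low bits.  */
  printf ("reinterpreted: 0x%02x\n", (unsigned) (uint8_t) mask16);       /* 0x3f: 6 HI lanes - wrong */
  /* What a bitmask would actually need.  */
  printf ("compressed:    0x%02x\n", (unsigned) compress_pairs (mask16)); /* 0x07: 3 HI lanes */
  return 0;
}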

The actual error is likely in the setup of the loop masks though, thinking
we could do this re-use.


* [Bug tree-optimization/107096] Fully masking vectorization with AVX512 ICEs gcc.dg/vect/vect-over-widen-*.c
From: rguenth at gcc dot gnu.org @ 2022-09-30 12:30 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107096

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Target|                            |x86_64-*-*
                 CC|                            |rsandifo at gcc dot gnu.org

--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
Take the simplified testcase

void foo (int * __restrict dst, short *src, int n)
{
  for (int i = 0; i < n; ++i)
    dst[i] = src[2*i] + src[2*i+1];
}

here we get vect_record_loop_mask twice with nvectors == 2 and V16HImode
for the load, which populates masks[2].  Then we get V8SImode once, but
also with nvectors == 2, which leaves the recorded data unadjusted since
it looks at masks[2] as well - if it had come first we'd have recorded
a different mask vector type.

The masks seem to be constructed in a way to produce two bits per lane
(but we still apply it naively?!) and the V_C_E does actually look wrong
to me.  Huh.

With SVE I seem to get (besides a .LOAD_LANE version) permutes of the
mask vector:

  loop_mask_116 = VEC_PERM_EXPR <loop_mask_131, loop_mask_131, { 0,
POLY_INT_CST [4, 4], 1, POLY_INT_CST [5, 4], 2, POLY_INT_CST [6, 4], ... }>;
  loop_mask_117 = VEC_PERM_EXPR <loop_mask_131, loop_mask_131, { POLY_INT_CST
[2, 2], POLY_INT_CST [6, 6], POLY_INT_CST [3, 2], POLY_INT_CST [7, 6],
POLY_INT_CST [4, 2], POLY_INT_CST [8, 6], ... }>;
...
  vect__69.20_118 = .MASK_LOAD (vectp_src.18_114, 16B, loop_mask_116);
  vectp_src.18_119 = vectp_src.18_114 + POLY_INT_CST [8, 8];
  vect__69.21_120 = .MASK_LOAD (vectp_src.18_119, 16B, loop_mask_117);
...
  .MASK_STORE (vectp_dst.25_129, 32B, loop_mask_131, vect__79.24_125);

so the original mask provider here is the larger element vector type
and the smaller element masks are produced from that.  That's something
I'd expect for AVX512 as well, not sure where it goes "wrong".

See PR107093 for a patch implementing WHILE_ULT for AVX512.


* [Bug tree-optimization/107096] Fully masking vectorization with AVX512 ICEs gcc.dg/vect/vect-over-widen-*.c
From: rsandifo at gcc dot gnu.org @ 2022-09-30 13:07 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107096

--- Comment #2 from rsandifo at gcc dot gnu.org <rsandifo at gcc dot gnu.org> ---
See the comment above rgroup_controls in tree-vectorizer.h for the
current assumptions around loop predication.  If AVX512 wants something
different then some extensions will be needed :-)


* [Bug tree-optimization/107096] Fully masking vectorization with AVX512 ICEs gcc.dg/vect/vect-over-widen-*.c
From: rguenth at gcc dot gnu.org @ 2022-10-10  9:49 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107096

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |ams at gcc dot gnu.org

--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to rsandifo@gcc.gnu.org from comment #2)
> See the comment above rgroup_controls in tree-vectorizer.h for the
> current assumptions around loop predication.  If AVX512 wants something
> different then some extensions will be needed :-)

Coming back to this now.  Crucially

   If only the first three lanes are active, the masks we need are:

     f rgroup: 1 1 | 1 1 | 1 1 | 0 0
     d rgroup:  1  |  1  |  1  |  0

   Here we can use a mask calculated for f's rgroup for d's, but not
   vice versa.

that seems to assume that the space in the mask vector for the "bools"
in the d rgroup is twice as large as in that for the f rgroup.

For AVX512 there's a single bit for each lane, independent of the
width of the actual data.  So instead of

   Thus for each value of nV, it is enough to provide nV masks, with the
   mask being calculated based on the highest nL (or, equivalently, based
   on the highest nS) required by any rgroup with that nV.  We therefore
   represent the entire collection of masks as a two-level table, with the
   first level being indexed by nV - 1 (since nV == 0 doesn't exist) and
   the second being indexed by the mask index 0 <= i < nV.

we need a set of nV masks for each value of nS, and we can pick the
smallest nV for each nS and generate the corresponding larger nV
masks by a series of shifts.  In fact we can re-use the first vector
(excess bits are OK).  So for the example in tree-vectorizer.h

     float *f;
     double *d;
     for (int i = 0; i < n; ++i)
       {
         f[i * 2 + 0] += 1.0f;
         f[i * 2 + 1] += 2.0f;
         d[i] += 3.0;
       }

we'd need to perform two WHILE_ULT.  For

     float *f;
     double *d;
     for (int i = 0; i < n; ++i)
       {
         f[i] += 1.0f;
         d[i] += 3.0;
       }

we'd compute the mask for the f rgroup with a WHILE_ULT and we'll
have nV_d = 2 * nV_f and the first mask vector from f can be reused
for d (but not the other way around).  The second mask vector for
d can be obtained by kshiftr.
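
As a purely illustrative sketch (assuming 512-bit vectors, i.e. V16SF for
f and V8DF for d; the kmask is modelled as a plain integer and kshiftr as
a shift):

#include <stdint.h>
#include <stdio.h>

int
main (void)
{
  int remaining = 11;  /* scalar iterations left for the last vector iteration */

  /* WHILE_ULT-style mask for the 16 float lanes (nV_f == 1).  */
  uint16_t mask_f = remaining >= 16 ? 0xffff : (uint16_t) ((1u << remaining) - 1);

  /* The two double masks (nV_d == 2): the first re-uses mask_f directly
     (the excess high bits are ignored by an 8-lane operation), the second
     is obtained by a kshiftr by 8.  */
  uint8_t mask_d0 = (uint8_t) mask_f;
  uint8_t mask_d1 = (uint8_t) (mask_f >> 8);

  printf ("mask_f  = 0x%04x\n", (unsigned) mask_f);   /* 0x07ff */
  printf ("mask_d0 = 0x%02x\n", (unsigned) mask_d0);  /* 0xff   */
  printf ("mask_d1 = 0x%02x\n", (unsigned) mask_d1);  /* 0x07   */
  return 0;
}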

There's no other way to do N bit to two N/2 bit hi/lo (un)packing
(there's a 2x N/2 bit -> N bit operation, for whatever reason).
There's also no way to transform the d rgroup mask into the
f rgroup mask for the first example aka duplicate bits in place,
{ b0, b1, b2, ... bN } -> { b0, b0, b1, b1, b2, b2, ... bN, bN },
nor the reverse.

So in reality it seems we need a set of mask vectors for the full
set of nS * nV combinations with AVX512.  Doing full masking with
AVX2-style vectors would work with the existing rgroup control scheme.

Currently the "key" to the AVX512 behavior is the use of scalar modes
for the mask vectors but then that's also what GCN uses.  Do GCN
mask bits really map to bytes to allow the currently used rgroup scheme?


* [Bug tree-optimization/107096] Fully masking vectorization with AVX512 ICEs gcc.dg/vect/vect-over-widen-*.c
From: ams at gcc dot gnu.org @ 2022-10-10 11:01 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107096

--- Comment #4 from Andrew Stubbs <ams at gcc dot gnu.org> ---
I don't understand rgroups, but I can say that GCN masks are very simply
one-bit-one-lane. There are always 64 lanes, regardless of the type, so
V64QImode has fewer bytes and bits than V64DImode (when written to memory).

This is different to most other architectures where the bit-size remains the
same and number of lanes varies with the inner type, and has caused us some
issues with invalid assumptions in GCC (e.g. "there's no need for sign-extends
in vector registers" is not true for GCN).

However, I think it's the same as you're describing for AVX512, at least in
this respect.

Incidentally I'm on the cusp of adding multiple "virtual" vector sizes in the
GCN backend (in lieu of implementing full mask support everywhere in the
middle-end and fixing all the cost assumptions), so these VIEW_CONVERT_EXPR
issues are getting worse. I have a bunch of vec_extract patterns that fix up
some of it. Within the backend, the V32, V16, V8, V4 and V2 vectors are all
really just 64-lane vectors with the mask preset, so the mask has to remain
DImode or register allocation becomes tricky.


* [Bug tree-optimization/107096] Fully masking vectorization with AVX512 ICEs gcc.dg/vect/vect-over-widen-*.c
From: rguenth at gcc dot gnu.org @ 2022-10-10 11:33 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107096

--- Comment #5 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Andrew Stubbs from comment #4)
> I don't understand rgroups, but I can say that GCN masks are very simply
> one-bit-one-lane. There are always 64 lanes, regardless of the type, so
> V64QImode has fewer bytes and bits than V64DImode (when written to memory).
> 
> This is different to most other architectures where the bit-size remains the
> same and number of lanes varies with the inner type, and has caused us some
> issues with invalid assumptions in GCC (e.g. "there's no need for
> sign-extends in vector registers" is not true for GCN).
> 
> However, I think it's the same as you're describing for AVX512, at least in
> this respect.
> 
> Incidentally I'm on the cusp of adding multiple "virtual" vector sizes in
> the GCN backend (in lieu of implementing full mask support everywhere in the
> middle-end and fixing all the cost assumptions), so these VIEW_CONVERT_EXPR
> issues are getting worse. I have a bunch of vec_extract patterns that fix up
> some of it. Within the backend, the V32, V16, V8, V4 and V2 vectors are all
> really just 64-lane vectors with the mask preset, so the mask has to remain
> DImode or register allocation becomes tricky.

For the documentation test case GCN seems to skirt the issue by using
different "sized" vectors so it manages to get two different nV here, one
for the float and one for the double rgroup.  With AVX512 I get the
same nV and wrong code, re-using the mask of the floats:

  _51 = VIEW_CONVERT_EXPR<vector(4) <signed-boolean:1>>(loop_mask_40);

(though the verifier does not ICE here because it only checks modes).

GCN gets nV == 1 for both for

void foo (float *f, double * __restrict d, int n)
{
  for (int i = 0; i < n; ++i)
    {
      f[i] += 1.0f;
      d[i] += 3.0;
    }
}    

and here sharing the mask is OK.

So it looks like the sharing logic depends on how we get to assign
vector modes - GCN insists on handing out only 64-lane vectors.  If
you changed that and allowed mixing I guess you'd run into similar
issues as AVX512.

Only handing out 8-lane vectors would limit AVX512 quite a bit so
that doesn't sound like a viable option to us.  RVV might be in the
same situation as GCN here.

For GCN with DImode mask vectors at the point where V_C_Es would be
emitted we could assert that the number of lanes is the same (we
probably should, otherwise we'd have wrong-code).

So eventually special-casing integer mode masks might work out.


* [Bug tree-optimization/107096] Fully masking vectorization with AVX512 ICEs gcc.dg/vect/vect-over-widen-*.c
From: rsandifo at gcc dot gnu.org @ 2022-10-10 12:38 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107096

--- Comment #6 from rsandifo at gcc dot gnu.org <rsandifo at gcc dot gnu.org> ---
I don't think we should key this off whether the masks are integers
or vectors.  In principle there's no reason why a vector-based approach
couldn't work the AVX512 way, or an integer-based approach the SVE way.
(build_truth_vector_type_for_mode is currently hard-coded to 1 bit
per element for integers, but that could change.)

At the moment:

   We make the simplifying assumption that if a sequence of nV masks is
   suitable for one (nS,nL) pair, we can reuse it for (nS/2,nL/2) by
   VIEW_CONVERTing it.  This holds for all current targets that support
   fully-masked loops.

The PR means supporting targets where this assumption doesn't hold.
But I think we should test for it explicitly somehow.  For now we
can probably assume that the assumption holds when the two mask
modes have equal size.


* [Bug tree-optimization/107096] Fully masking vectorization with AVX512 ICEs gcc.dg/vect/vect-over-widen-*.c
From: rguenther at suse dot de @ 2022-10-10 12:55 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107096

--- Comment #7 from rguenther at suse dot de <rguenther at suse dot de> ---
On Mon, 10 Oct 2022, rsandifo at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107096
> 
> --- Comment #6 from rsandifo at gcc dot gnu.org <rsandifo at gcc dot gnu.org> ---
> The PR means supporting targets where this assumption doesn't hold.
> But I think we should test for it explicitly somehow.  For now we
> can probably assume that the assumption holds when the two mask
> modes have equal size.

More like precision, but x86 uses QImode for two-element, four-element
and eight-element masks (rather than two partial integer modes with
two and four bits precision).


* [Bug tree-optimization/107096] Fully masking vectorization with AVX512 ICEs gcc.dg/vect/vect-over-widen-*.c
From: rsandifo at gcc dot gnu.org @ 2022-10-10 14:20 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107096

--- Comment #8 from rsandifo at gcc dot gnu.org <rsandifo at gcc dot gnu.org> ---
(In reply to rguenther@suse.de from comment #7)
> more like precision but x86 uses QImode for two-element, four-element
> and eight-element masks (rather than two partial integer modes with
> two and four bits precision).
Ah, OK.  So yeah, maybe the precision of the vector boolean element *
the number of elements.


* [Bug tree-optimization/107096] Fully masking vectorization with AVX512 ICEs gcc.dg/vect/vect-over-widen-*.c
From: rguenth at gcc dot gnu.org @ 2022-10-11 12:43 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107096

--- Comment #9 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to rsandifo@gcc.gnu.org from comment #8)
> (In reply to rguenther@suse.de from comment #7)
> > more like precision but x86 uses QImode for two-element, four-element
> > and eight-element masks (rather than two partial integer modes with
> > two and four bits precision).
> Ah, OK.  So yeah, maybe the precision of the vector boolean element *
> the number of elements.

For SVE the following holds:

diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index 1996ecfee7a..9b24b481867 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -10097,6 +10097,12 @@ vect_get_loop_mask (gimple_stmt_iterator *gsi, vec_loop_masks *masks,
                              TYPE_VECTOR_SUBPARTS (vectype)));
       gimple_seq seq = NULL;
       mask_type = truth_type_for (vectype);
+      /* Assert that both mask types have the same total number of value
+        bits.  */
+      gcc_assert (known_eq (TYPE_PRECISION (TREE_TYPE (TREE_TYPE (mask)))
+                           * TYPE_VECTOR_SUBPARTS (TREE_TYPE (mask)),
+                           TYPE_PRECISION (TREE_TYPE (mask_type))
+                           * TYPE_VECTOR_SUBPARTS (mask_type)));
       mask = gimple_build (&seq, VIEW_CONVERT_EXPR, mask_type, mask);
       if (seq)
        gsi_insert_seq_before (gsi, seq, GSI_SAME_STMT);

for AVX512 the TYPE_PRECISION is always 1, so for unequal subparts we cannot
directly share masks.

I'm going to change LOOP_VINFO_MASKS from an array indexed by nV to
a two-dimensional table indexed by nV and bit-precision * subparts.  Well,
probably using a hash_map instead since this will be quite sparse.
Or maybe not, but at least dynamically growing as we do now is difficult
and subparts can be non-constant.
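
A rough sketch of the shape such a table could take (hypothetical, not the
actual change; constant sizes for simplicity, whereas in reality subparts
can be a poly_int):

// Loop masks indexed by both nV and the total number of mask value bits
// (element precision * subparts); the masks themselves are just ints here
// for illustration.
#include <map>
#include <utility>
#include <vector>

struct loop_masks_sketch
{
  std::map<std::pair<unsigned, unsigned>, std::vector<int>> table;

  std::vector<int> &
  get (unsigned nvectors, unsigned total_bits)
  {
    std::vector<int> &entry = table[{nvectors, total_bits}];
    entry.resize (nvectors);  // one slot per mask index 0 <= i < nV
    return entry;
  }
};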


* [Bug tree-optimization/107096] Fully masking vectorization with AVX512 ICEs gcc.dg/vect/vect-over-widen-*.c
From: crazylht at gmail dot com @ 2023-02-15  7:08 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107096

--- Comment #10 from Hongtao.liu <crazylht at gmail dot com> ---

> 
> There's no other way to do N bit to two N/2 bit hi/lo (un)packing
> (there's a 2x N/2 bit -> N bit operation, for whatever reason).
> There's also no way to transform the d rgroup mask into the
> f rgroup mask for the first example aka duplicate bits in place,
> { b0, b1, b2, ... bN } -> { b0, b0, b1, b1, b2, b2, ... bN, bN },
> nor the reverse.
> 

Can we just do the VIEW_CONVERT_EXPR on the vectype instead of the mask_type?
I.e. we can do a VCE to transform V8SI to V16HI, then use a mask_load for V16HI
with the same mask {b0, b0, b1, b1, b2, b2, ...}, then VCE it back to V8SI; that
should be ok as long as the bits are duplicated in place.  (Or VCE V16HI to
V8SI, then use the mask {b0, b1, b2, ..., bN}, and VCE V8SI back to V16HI after
the masked load/move.)
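
A plain scalar emulation of the idea (illustrative only; it just models the
masked V16HI load and the VIEW_CONVERTs with scalar code):

#include <stdint.h>
#include <stdio.h>
#include <string.h>

int
main (void)
{
  /* Memory that really holds 8 ints, viewed as 16 shorts (the VCE of the
     vectype).  */
  int32_t mem[8] = { 10, 11, 12, 13, 14, 15, 16, 17 };
  int16_t as_hi[16];
  memcpy (as_hi, mem, sizeof as_hi);

  /* The available mask with duplicated bit pairs: 6 int lanes active
     == 12 short lanes active.  */
  uint16_t mask16 = 0x0fff;

  /* Emulated V16HI .MASK_LOAD with that mask ...  */
  int16_t loaded[16] = { 0 };
  for (int i = 0; i < 16; i++)
    if (mask16 & (1u << i))
      loaded[i] = as_hi[i];

  /* ... then VIEW_CONVERT the loaded data back to V8SI.  */
  int32_t as_v8si[8];
  memcpy (as_v8si, loaded, sizeof as_v8si);

  printf ("%d %d %d\n", (int) as_v8si[0], (int) as_v8si[5], (int) as_v8si[6]);  /* 10 15 0 */
  return 0;
}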


* [Bug tree-optimization/107096] Fully masking vectorization with AVX512 ICEs gcc.dg/vect/vect-over-widen-*.c
From: crazylht at gmail dot com @ 2023-02-15  7:09 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107096

--- Comment #11 from Hongtao.liu <crazylht at gmail dot com> ---

> 
> There's no other way to do N bit to two N/2 bit hi/lo (un)packing
> (there's a 2x N/2 bit -> N bit operation, for whatever reason).
> There's also no way to transform the d rgroup mask into the
> f rgroup mask for the first example aka duplicate bits in place,
> { b0, b1, b2, ... bN } -> { b0, b0, b1, b1, b2, b2, ... bN, bN },
> nor the reverse.
> 

Can we just do the VIEW_CONVERT_EXPR on the vectype instead of the mask_type?
I.e. we can do a VCE to transform V8SI to V16HI, then use a mask_load for V16HI
with the same mask {b0, b0, b1, b1, b2, b2, ...}, then VCE it back to V8SI; that
should be ok as long as the bits are duplicated in place.  (Or VCE V16HI to
V8SI, then use the mask {b0, b1, b2, ..., bN}, and VCE V8SI back to V16HI after
the masked load/move.)


* [Bug tree-optimization/107096] Fully masking vectorization with AVX512 ICEs gcc.dg/vect/vect-over-widen-*.c
From: rguenther at suse dot de @ 2023-02-15  7:26 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107096

--- Comment #12 from rguenther at suse dot de <rguenther at suse dot de> ---
On Wed, 15 Feb 2023, crazylht at gmail dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107096
> 
> --- Comment #11 from Hongtao.liu <crazylht at gmail dot com> ---
> 
> > 
> > There's no other way to do N bit to two N/2 bit hi/lo (un)packing
> > (there's a 2x N/2 bit -> N bit operation, for whatever reason).
> > There's also no way to transform the d rgroup mask into the
> > f rgroup mask for the first example aka duplicate bits in place,
> > { b0, b1, b2, ... bN } -> { b0, b0, b1, b1, b2, b2, ... bN, bN },
> > nor the reverse.
> > 
> 
> Can we just do the VIEW_CONVERT_EXPR on the vectype instead of the mask_type?
> I.e. we can do a VCE to transform V8SI to V16HI, then use a mask_load for V16HI
> with the same mask {b0, b0, b1, b1, b2, b2, ...}, then VCE it back to V8SI; that
> should be ok as long as the bits are duplicated in place.  (Or VCE V16HI to
> V8SI, then use the mask {b0, b1, b2, ..., bN}, and VCE V8SI back to V16HI after
> the masked load/move.)

Hmm, yes, if we arrange for the larger mask to be available that would
work for loads and stores I guess.  It wouldn't work for arithmetic
cond_* IFNs though.  It's also going to be a bit tricky within the
masking framework - I'm going to see whether that works though, it might
be a nice way to avoid an excessive number of masks for integer code
at least.


* [Bug tree-optimization/107096] Fully masking vectorization with AVX512 ICEs gcc.dg/vect/vect-over-widen-*.c
From: crazylht at gmail dot com @ 2023-02-16  2:09 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107096

--- Comment #13 from Hongtao.liu <crazylht at gmail dot com> ---
(In reply to rguenther@suse.de from comment #12)
> On Wed, 15 Feb 2023, crazylht at gmail dot com wrote:
> 
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107096
> > 
> > --- Comment #11 from Hongtao.liu <crazylht at gmail dot com> ---
> > 
> > > 
> > > There's no other way to do N bit to two N/2 bit hi/lo (un)packing
> > > (there's a 2x N/2 bit -> N bit operation, for whatever reason).
> > > There's also no way to transform the d rgroup mask into the
> > > f rgroup mask for the first example aka duplicate bits in place,
> > > { b0, b1, b2, ... bN } -> { b0, b0, b1, b1, b2, b2, ... bN, bN },
> > > nor the reverse.
> > > 
> > 
> > Can we just do the VIEW_CONVERT_EXPR on the vectype instead of the mask_type?
> > I.e. we can do a VCE to transform V8SI to V16HI, then use a mask_load for V16HI
> > with the same mask {b0, b0, b1, b1, b2, b2, ...}, then VCE it back to V8SI; that
> > should be ok as long as the bits are duplicated in place.  (Or VCE V16HI to
> > V8SI, then use the mask {b0, b1, b2, ..., bN}, and VCE V8SI back to V16HI after
> > the masked load/move.)
> 
> Hmm, yes, if we arrange for the larger mask to be available that would
> work for loads and stores I guess.  It wouldn't work for arithmetic
> cond_* IFNs though.  It's also going to be a bit tricky within the
> masking framework - I'm going to see whether that works though, it might
> be a nice way to avoid an excessive number of masks for integer code
> at least.

There could be some limitation on nV (it should be a power of 2 for the VCE?),
i.e. there's no suitable vectype for a VCE of the src1 vectype to reuse the
loop mask:
void
foo (int* __restrict dest, int* src1, int* src2)
{
    for (int i = 0; i != 10000; i++)
      dest[i] = src1[3*i] + src1[3*i + 1] + src1[3*i + 2];
}


Maybe AVX512 could use a gather instruction for .MASK_LOAD_LANES to make
use of the LOOP_MASK?


* [Bug tree-optimization/107096] Fully masking vectorization with AVX512 ICEs gcc.dg/vect/vect-over-widen-*.c
From: rguenth at gcc dot gnu.org @ 2023-06-14 12:55 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107096

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |RESOLVED
           Assignee|unassigned at gcc dot gnu.org      |rguenth at gcc dot gnu.org
         Resolution|---                         |INVALID

--- Comment #14 from Richard Biener <rguenth at gcc dot gnu.org> ---
Not a bug.
