public inbox for gcc-patches@gcc.gnu.org
* [SVE] PR86753
@ 2019-08-14 15:53 Prathamesh Kulkarni
  2019-08-14 16:59 ` Richard Biener
  0 siblings, 1 reply; 41+ messages in thread
From: Prathamesh Kulkarni @ 2019-08-14 15:53 UTC (permalink / raw)
  To: gcc Patches, Richard Sandiford

[-- Attachment #1: Type: text/plain, Size: 2715 bytes --]

Hi,
The attached patch tries to fix PR86753.

For the following test:
void
f1 (int *restrict x, int *restrict y, int *restrict z)
{
  for (int i = 0; i < 100; ++i)
    x[i] = y[i] ? z[i] : 10;
}

the vect dump shows:
  vect_cst__42 = { 0, ... };
  vect_cst__48 = { 0, ... };

  vect__4.7_41 = .MASK_LOAD (vectp_y.5_38, 4B, loop_mask_40);
  _4 = *_3;
  _5 = z_12(D) + _2;
  mask__35.8_43 = vect__4.7_41 != vect_cst__42;
  _35 = _4 != 0;
  vec_mask_and_46 = mask__35.8_43 & loop_mask_40;
  vect_iftmp.11_47 = .MASK_LOAD (vectp_z.9_44, 4B, vec_mask_and_46);
  iftmp.0_13 = 0;
  vect_iftmp.12_50 = VEC_COND_EXPR <vect__4.7_41 != vect_cst__48, vect_iftmp.11_47, vect_cst__49>;

and the following code-gen:
L2:
        ld1w    z0.s, p2/z, [x1, x3, lsl 2]
        cmpne   p1.s, p3/z, z0.s, #0
        cmpne   p0.s, p2/z, z0.s, #0
        ld1w    z0.s, p0/z, [x2, x3, lsl 2]
        sel     z0.s, p1, z0.s, z1.s

We could reuse vec_mask_and_46 in vec_cond_expr since the conditions
vect__4.7_41 != vect_cst__48 and vect__4.7_41 != vect_cst__42
are equivalent, and vect_iftmp.11_47 depends on vect__4.7_41 != vect_cst__48.

I suppose in general for vec_cond_expr <C, T, E>, if T comes from a masked
load that is conditional on C, then we could reuse the mask used in the load
in the vec_cond_expr?

The patch maintains a hash_map cond_to_vec_mask from <cond, loop_mask> to
vec_mask (with the loop predicate applied).  In prepare_load_store_mask, we
record <cond, loop_mask> -> vec_mask & loop_mask, and in
vectorizable_condition we check whether <cond, loop_mask> exists in
cond_to_vec_mask; if found, the corresponding vec_mask is used as the first
operand of the vec_cond_expr.

<cond, loop_mask> is represented by cond_vmask_key, and the patch adds
tree_cond_ops to represent the condition operator and operands, coming either
from a cond_expr or from a gimple comparison stmt.  If the stmt is not a
comparison, it returns <ne_expr, lhs, 0> and inserts that into
cond_to_vec_mask.
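
Roughly, the key and the canonical condition look like this (a simplified
sketch of what the attached patch adds, with the hash/equality boilerplate
omitted, so treat the exact layout as an assumption):

  /* Canonical form of a scalar condition: operator plus its two operands.
     For a comparison stmt "lhs = a CMP b" this is <CMP, a, b>; for any
     other defining stmt it degenerates to <NE_EXPR, lhs, 0>.  */
  struct tree_cond_ops
  {
    tree_code code;
    tree op0;
    tree op1;
  };

  /* Map key: the condition together with the loop mask that was active
     when the masked load was emitted.  */
  struct cond_vmask_key
  {
    tree_cond_ops cond;
    tree loop_mask;
  };

  /* Maps <cond, loop_mask> to "vec_mask & loop_mask", i.e. the mask that
     was actually passed to .MASK_LOAD.  Hash traits omitted here.  */
  typedef hash_map<cond_vmask_key, tree> cond_vmask_map_type;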

With the patch, the redundant p1 is eliminated and sel uses p0 for the above test.

For the following test:
void
f2 (int *restrict x, int *restrict y, int *restrict z, int fallback)
{
  for (int i = 0; i < 100; ++i)
    x[i] = y[i] ? z[i] : fallback;
}

the input to the vectorizer has the operands swapped in the cond_expr:
  _36 = _4 != 0;
  iftmp.0_14 = .MASK_LOAD (_5, 32B, _36);
  iftmp.0_8 = _4 == 0 ? fallback_12(D) : iftmp.0_14;

So we need to check for the inverted condition in cond_to_vec_mask and swap
the operands of the vec_cond_expr.
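
In vectorizable_condition the lookup is roughly the following (just a sketch
of the logic; cond_vmask_key's constructor and the invert helper are made up
here for illustration):

  /* First try the condition as written: <cond, loop_mask> -> vec_mask.  */
  tree *slot = cond_to_vec_mask->get (cond_vmask_key (cond, loop_mask));
  bool inverted = false;
  if (!slot)
    {
      /* Otherwise try the inverted condition, e.g. "_4 == 0" when the
         masked load was recorded under "_4 != 0".  */
      slot = cond_to_vec_mask->get (cond_vmask_key (invert (cond), loop_mask));
      inverted = (slot != NULL);
    }
  if (slot)
    {
      vec_compare = *slot;
      if (inverted)
        /* The reused mask corresponds to the inverted condition, so the
           then/else values of the VEC_COND_EXPR must be swapped.  */
        std::swap (vec_then_clause, vec_else_clause);
    }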
Does the patch look OK so far ?

One major issue remaining with the patch is value numbering.  Currently it
value-numbers the entire function using SCCVN at the start of the vect pass,
which is too expensive since we only need block-based VN.  I am looking into
that.

Thanks,
Prathamesh

[-- Attachment #2: pr86753-5.diff --]
[-- Type: application/x-patch, Size: 11197 bytes --]


* Re: [SVE] PR86753
  2019-08-14 15:53 [SVE] PR86753 Prathamesh Kulkarni
@ 2019-08-14 16:59 ` Richard Biener
  2019-08-14 17:01   ` Richard Biener
  0 siblings, 1 reply; 41+ messages in thread
From: Richard Biener @ 2019-08-14 16:59 UTC (permalink / raw)
  To: Prathamesh Kulkarni; +Cc: gcc Patches, Richard Sandiford

On Wed, Aug 14, 2019 at 5:06 PM Prathamesh Kulkarni
<prathamesh.kulkarni@linaro.org> wrote:
>
> Hi,
> The attached patch tries to fix PR86753.
>
> For following test:
> void
> f1 (int *restrict x, int *restrict y, int *restrict z)
> {
>   for (int i = 0; i < 100; ++i)
>     x[i] = y[i] ? z[i] : 10;
> }
>
> vect dump shows:
>   vect_cst__42 = { 0, ... };
>   vect_cst__48 = { 0, ... };
>
>   vect__4.7_41 = .MASK_LOAD (vectp_y.5_38, 4B, loop_mask_40);
>   _4 = *_3;
>   _5 = z_12(D) + _2;
>   mask__35.8_43 = vect__4.7_41 != vect_cst__42;
>   _35 = _4 != 0;
>   vec_mask_and_46 = mask__35.8_43 & loop_mask_40;
>   vect_iftmp.11_47 = .MASK_LOAD (vectp_z.9_44, 4B, vec_mask_and_46);
>   iftmp.0_13 = 0;
>   vect_iftmp.12_50 = VEC_COND_EXPR <vect__4.7_41 != vect_cst__48,
> vect_iftmp.11_47, vect_cst__49>;
>
> and following code-gen:
> L2:
>         ld1w    z0.s, p2/z, [x1, x3, lsl 2]
>         cmpne   p1.s, p3/z, z0.s, #0
>         cmpne   p0.s, p2/z, z0.s, #0
>         ld1w    z0.s, p0/z, [x2, x3, lsl 2]
>         sel     z0.s, p1, z0.s, z1.s
>
> We could reuse vec_mask_and_46 in vec_cond_expr since the conditions
> vect__4.7_41 != vect_cst__48 and vect__4.7_41 != vect_cst__42
> are equivalent, and vect_iftmp.11_47 depends on vect__4.7_41 != vect_cst__48.
>
> I suppose in general for vec_cond_expr <C, T, E> if T comes from masked load,
> which is conditional on C, then we could reuse the mask used in load,
> in vec_cond_expr ?
>
> The patch maintains a hash_map cond_to_vec_mask
> from <cond, loop_mask -> vec_mask (with loop predicate applied).
> In prepare_load_store_mask, we record <cond, loop_mask> -> vec_mask & loop_mask,
> and in vectorizable_condition, we check if <cond, loop_mask> exists in
> cond_to_vec_mask
> and if found, the corresponding vec_mask is used as 1st operand of
> vec_cond_expr.
>
> <cond, loop_mask> is represented with cond_vmask_key, and the patch
> adds tree_cond_ops to represent condition operator and operands coming
> either from cond_expr
> or a gimple comparison stmt. If the stmt is not comparison, it returns
> <ne_expr, lhs, 0> and inserts that into cond_to_vec_mask.
>
> With patch, the redundant p1 is eliminated and sel uses p0 for above test.
>
> For following test:
> void
> f2 (int *restrict x, int *restrict y, int *restrict z, int fallback)
> {
>   for (int i = 0; i < 100; ++i)
>     x[i] = y[i] ? z[i] : fallback;
> }
>
> input to vectorizer has operands swapped in cond_expr:
>   _36 = _4 != 0;
>   iftmp.0_14 = .MASK_LOAD (_5, 32B, _36);
>   iftmp.0_8 = _4 == 0 ? fallback_12(D) : iftmp.0_14;
>
> So we need to check for inverted condition in cond_to_vec_mask,
> and swap the operands.
> Does the patch look OK so far ?
>
> One major issue remaining with the patch is value  numbering.
> Currently, it does value numbering for entire function using sccvn
> during start of vect pass, which is too expensive since we only need
> block based VN. I am looking into that.

Why do you need it at all?  We run VN on the if-converted loop bodies btw.

Richard.

>
> Thanks,
> Prathamesh


* Re: [SVE] PR86753
  2019-08-14 16:59 ` Richard Biener
@ 2019-08-14 17:01   ` Richard Biener
  2019-08-14 21:22     ` Richard Sandiford
  0 siblings, 1 reply; 41+ messages in thread
From: Richard Biener @ 2019-08-14 17:01 UTC (permalink / raw)
  To: Prathamesh Kulkarni; +Cc: gcc Patches, Richard Sandiford

On Wed, Aug 14, 2019 at 6:49 PM Richard Biener
<richard.guenther@gmail.com> wrote:
>
> On Wed, Aug 14, 2019 at 5:06 PM Prathamesh Kulkarni
> <prathamesh.kulkarni@linaro.org> wrote:
> >
> > Hi,
> > The attached patch tries to fix PR86753.
> >
> > For following test:
> > void
> > f1 (int *restrict x, int *restrict y, int *restrict z)
> > {
> >   for (int i = 0; i < 100; ++i)
> >     x[i] = y[i] ? z[i] : 10;
> > }
> >
> > vect dump shows:
> >   vect_cst__42 = { 0, ... };
> >   vect_cst__48 = { 0, ... };
> >
> >   vect__4.7_41 = .MASK_LOAD (vectp_y.5_38, 4B, loop_mask_40);
> >   _4 = *_3;
> >   _5 = z_12(D) + _2;
> >   mask__35.8_43 = vect__4.7_41 != vect_cst__42;
> >   _35 = _4 != 0;
> >   vec_mask_and_46 = mask__35.8_43 & loop_mask_40;
> >   vect_iftmp.11_47 = .MASK_LOAD (vectp_z.9_44, 4B, vec_mask_and_46);
> >   iftmp.0_13 = 0;
> >   vect_iftmp.12_50 = VEC_COND_EXPR <vect__4.7_41 != vect_cst__48,
> > vect_iftmp.11_47, vect_cst__49>;
> >
> > and following code-gen:
> > L2:
> >         ld1w    z0.s, p2/z, [x1, x3, lsl 2]
> >         cmpne   p1.s, p3/z, z0.s, #0
> >         cmpne   p0.s, p2/z, z0.s, #0
> >         ld1w    z0.s, p0/z, [x2, x3, lsl 2]
> >         sel     z0.s, p1, z0.s, z1.s
> >
> > We could reuse vec_mask_and_46 in vec_cond_expr since the conditions
> > vect__4.7_41 != vect_cst__48 and vect__4.7_41 != vect_cst__42
> > are equivalent, and vect_iftmp.11_47 depends on vect__4.7_41 != vect_cst__48.
> >
> > I suppose in general for vec_cond_expr <C, T, E> if T comes from masked load,
> > which is conditional on C, then we could reuse the mask used in load,
> > in vec_cond_expr ?
> >
> > The patch maintains a hash_map cond_to_vec_mask
> > from <cond, loop_mask -> vec_mask (with loop predicate applied).
> > In prepare_load_store_mask, we record <cond, loop_mask> -> vec_mask & loop_mask,
> > and in vectorizable_condition, we check if <cond, loop_mask> exists in
> > cond_to_vec_mask
> > and if found, the corresponding vec_mask is used as 1st operand of
> > vec_cond_expr.
> >
> > <cond, loop_mask> is represented with cond_vmask_key, and the patch
> > adds tree_cond_ops to represent condition operator and operands coming
> > either from cond_expr
> > or a gimple comparison stmt. If the stmt is not comparison, it returns
> > <ne_expr, lhs, 0> and inserts that into cond_to_vec_mask.
> >
> > With patch, the redundant p1 is eliminated and sel uses p0 for above test.
> >
> > For following test:
> > void
> > f2 (int *restrict x, int *restrict y, int *restrict z, int fallback)
> > {
> >   for (int i = 0; i < 100; ++i)
> >     x[i] = y[i] ? z[i] : fallback;
> > }
> >
> > input to vectorizer has operands swapped in cond_expr:
> >   _36 = _4 != 0;
> >   iftmp.0_14 = .MASK_LOAD (_5, 32B, _36);
> >   iftmp.0_8 = _4 == 0 ? fallback_12(D) : iftmp.0_14;
> >
> > So we need to check for inverted condition in cond_to_vec_mask,
> > and swap the operands.
> > Does the patch look OK so far ?
> >
> > One major issue remaining with the patch is value  numbering.
> > Currently, it does value numbering for entire function using sccvn
> > during start of vect pass, which is too expensive since we only need
> > block based VN. I am looking into that.
>
> Why do you need it at all?  We run VN on the if-converted loop bodies btw.

Also I can't trivially see the equality of the masks, and probably neither
can VN.  Is it that we just don't bother to apply loop_mask to the VEC_COND,
but there's no harm if we do?

Richard.

> Richard.
>
> >
> > Thanks,
> > Prathamesh


* Re: [SVE] PR86753
  2019-08-14 17:01   ` Richard Biener
@ 2019-08-14 21:22     ` Richard Sandiford
  2019-08-21 20:10       ` Prathamesh Kulkarni
  0 siblings, 1 reply; 41+ messages in thread
From: Richard Sandiford @ 2019-08-14 21:22 UTC (permalink / raw)
  To: Richard Biener; +Cc: Prathamesh Kulkarni, gcc Patches

Richard Biener <richard.guenther@gmail.com> writes:
> On Wed, Aug 14, 2019 at 6:49 PM Richard Biener
> <richard.guenther@gmail.com> wrote:
>>
>> On Wed, Aug 14, 2019 at 5:06 PM Prathamesh Kulkarni
>> <prathamesh.kulkarni@linaro.org> wrote:
>> >
>> > Hi,
>> > The attached patch tries to fix PR86753.
>> >
>> > For following test:
>> > void
>> > f1 (int *restrict x, int *restrict y, int *restrict z)
>> > {
>> >   for (int i = 0; i < 100; ++i)
>> >     x[i] = y[i] ? z[i] : 10;
>> > }
>> >
>> > vect dump shows:
>> >   vect_cst__42 = { 0, ... };
>> >   vect_cst__48 = { 0, ... };
>> >
>> >   vect__4.7_41 = .MASK_LOAD (vectp_y.5_38, 4B, loop_mask_40);
>> >   _4 = *_3;
>> >   _5 = z_12(D) + _2;
>> >   mask__35.8_43 = vect__4.7_41 != vect_cst__42;
>> >   _35 = _4 != 0;
>> >   vec_mask_and_46 = mask__35.8_43 & loop_mask_40;
>> >   vect_iftmp.11_47 = .MASK_LOAD (vectp_z.9_44, 4B, vec_mask_and_46);
>> >   iftmp.0_13 = 0;
>> >   vect_iftmp.12_50 = VEC_COND_EXPR <vect__4.7_41 != vect_cst__48,
>> > vect_iftmp.11_47, vect_cst__49>;
>> >
>> > and following code-gen:
>> > L2:
>> >         ld1w    z0.s, p2/z, [x1, x3, lsl 2]
>> >         cmpne   p1.s, p3/z, z0.s, #0
>> >         cmpne   p0.s, p2/z, z0.s, #0
>> >         ld1w    z0.s, p0/z, [x2, x3, lsl 2]
>> >         sel     z0.s, p1, z0.s, z1.s
>> >
>> > We could reuse vec_mask_and_46 in vec_cond_expr since the conditions
>> > vect__4.7_41 != vect_cst__48 and vect__4.7_41 != vect_cst__42
>> > are equivalent, and vect_iftmp.11_47 depends on vect__4.7_41 != vect_cst__48.
>> >
>> > I suppose in general for vec_cond_expr <C, T, E> if T comes from masked load,
>> > which is conditional on C, then we could reuse the mask used in load,
>> > in vec_cond_expr ?
>> >
>> > The patch maintains a hash_map cond_to_vec_mask
>> > from <cond, loop_mask -> vec_mask (with loop predicate applied).
>> > In prepare_load_store_mask, we record <cond, loop_mask> -> vec_mask & loop_mask,
>> > and in vectorizable_condition, we check if <cond, loop_mask> exists in
>> > cond_to_vec_mask
>> > and if found, the corresponding vec_mask is used as 1st operand of
>> > vec_cond_expr.
>> >
>> > <cond, loop_mask> is represented with cond_vmask_key, and the patch
>> > adds tree_cond_ops to represent condition operator and operands coming
>> > either from cond_expr
>> > or a gimple comparison stmt. If the stmt is not comparison, it returns
>> > <ne_expr, lhs, 0> and inserts that into cond_to_vec_mask.
>> >
>> > With patch, the redundant p1 is eliminated and sel uses p0 for above test.
>> >
>> > For following test:
>> > void
>> > f2 (int *restrict x, int *restrict y, int *restrict z, int fallback)
>> > {
>> >   for (int i = 0; i < 100; ++i)
>> >     x[i] = y[i] ? z[i] : fallback;
>> > }
>> >
>> > input to vectorizer has operands swapped in cond_expr:
>> >   _36 = _4 != 0;
>> >   iftmp.0_14 = .MASK_LOAD (_5, 32B, _36);
>> >   iftmp.0_8 = _4 == 0 ? fallback_12(D) : iftmp.0_14;
>> >
>> > So we need to check for inverted condition in cond_to_vec_mask,
>> > and swap the operands.
>> > Does the patch look OK so far ?
>> >
>> > One major issue remaining with the patch is value  numbering.
>> > Currently, it does value numbering for entire function using sccvn
>> > during start of vect pass, which is too expensive since we only need
>> > block based VN. I am looking into that.
>>
>> Why do you need it at all?  We run VN on the if-converted loop bodies btw.

This was my suggestion, but with the idea being to do the numbering
per-statement as we vectorise.  We'll then see pattern statements too.

That's important because we use pattern statements to set the right
vector boolean type (e.g. vect_recog_mask_conversion_pattern).
So some of the masks we care about don't exist after if conversion.

> Also I can't trivially see the equality of the masks and probably so
> can't VN.  Is it that we just don't bother to apply loop_mask to
> VEC_COND but there's no harm if we do?

Yeah.  The idea of the optimisation is to decide when it's more profitable
to apply the loop mask, even though doing so isn't necessary.  It would
be hard to do after vectorisation because the masks aren't equivalent.
We're relying on knowledge of how the vectoriser uses the result.

Thanks,
Richard


* Re: [SVE] PR86753
  2019-08-14 21:22     ` Richard Sandiford
@ 2019-08-21 20:10       ` Prathamesh Kulkarni
  2019-08-22 12:05         ` Richard Biener
  0 siblings, 1 reply; 41+ messages in thread
From: Prathamesh Kulkarni @ 2019-08-21 20:10 UTC (permalink / raw)
  To: Richard Biener, Prathamesh Kulkarni, gcc Patches, Richard Sandiford

[-- Attachment #1: Type: text/plain, Size: 5091 bytes --]

On Thu, 15 Aug 2019 at 01:50, Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> Richard Biener <richard.guenther@gmail.com> writes:
> > On Wed, Aug 14, 2019 at 6:49 PM Richard Biener
> > <richard.guenther@gmail.com> wrote:
> >>
> >> On Wed, Aug 14, 2019 at 5:06 PM Prathamesh Kulkarni
> >> <prathamesh.kulkarni@linaro.org> wrote:
> >> >
> >> > Hi,
> >> > The attached patch tries to fix PR86753.
> >> >
> >> > For following test:
> >> > void
> >> > f1 (int *restrict x, int *restrict y, int *restrict z)
> >> > {
> >> >   for (int i = 0; i < 100; ++i)
> >> >     x[i] = y[i] ? z[i] : 10;
> >> > }
> >> >
> >> > vect dump shows:
> >> >   vect_cst__42 = { 0, ... };
> >> >   vect_cst__48 = { 0, ... };
> >> >
> >> >   vect__4.7_41 = .MASK_LOAD (vectp_y.5_38, 4B, loop_mask_40);
> >> >   _4 = *_3;
> >> >   _5 = z_12(D) + _2;
> >> >   mask__35.8_43 = vect__4.7_41 != vect_cst__42;
> >> >   _35 = _4 != 0;
> >> >   vec_mask_and_46 = mask__35.8_43 & loop_mask_40;
> >> >   vect_iftmp.11_47 = .MASK_LOAD (vectp_z.9_44, 4B, vec_mask_and_46);
> >> >   iftmp.0_13 = 0;
> >> >   vect_iftmp.12_50 = VEC_COND_EXPR <vect__4.7_41 != vect_cst__48,
> >> > vect_iftmp.11_47, vect_cst__49>;
> >> >
> >> > and following code-gen:
> >> > L2:
> >> >         ld1w    z0.s, p2/z, [x1, x3, lsl 2]
> >> >         cmpne   p1.s, p3/z, z0.s, #0
> >> >         cmpne   p0.s, p2/z, z0.s, #0
> >> >         ld1w    z0.s, p0/z, [x2, x3, lsl 2]
> >> >         sel     z0.s, p1, z0.s, z1.s
> >> >
> >> > We could reuse vec_mask_and_46 in vec_cond_expr since the conditions
> >> > vect__4.7_41 != vect_cst__48 and vect__4.7_41 != vect_cst__42
> >> > are equivalent, and vect_iftmp.11_47 depends on vect__4.7_41 != vect_cst__48.
> >> >
> >> > I suppose in general for vec_cond_expr <C, T, E> if T comes from masked load,
> >> > which is conditional on C, then we could reuse the mask used in load,
> >> > in vec_cond_expr ?
> >> >
> >> > The patch maintains a hash_map cond_to_vec_mask
> >> > from <cond, loop_mask -> vec_mask (with loop predicate applied).
> >> > In prepare_load_store_mask, we record <cond, loop_mask> -> vec_mask & loop_mask,
> >> > and in vectorizable_condition, we check if <cond, loop_mask> exists in
> >> > cond_to_vec_mask
> >> > and if found, the corresponding vec_mask is used as 1st operand of
> >> > vec_cond_expr.
> >> >
> >> > <cond, loop_mask> is represented with cond_vmask_key, and the patch
> >> > adds tree_cond_ops to represent condition operator and operands coming
> >> > either from cond_expr
> >> > or a gimple comparison stmt. If the stmt is not comparison, it returns
> >> > <ne_expr, lhs, 0> and inserts that into cond_to_vec_mask.
> >> >
> >> > With patch, the redundant p1 is eliminated and sel uses p0 for above test.
> >> >
> >> > For following test:
> >> > void
> >> > f2 (int *restrict x, int *restrict y, int *restrict z, int fallback)
> >> > {
> >> >   for (int i = 0; i < 100; ++i)
> >> >     x[i] = y[i] ? z[i] : fallback;
> >> > }
> >> >
> >> > input to vectorizer has operands swapped in cond_expr:
> >> >   _36 = _4 != 0;
> >> >   iftmp.0_14 = .MASK_LOAD (_5, 32B, _36);
> >> >   iftmp.0_8 = _4 == 0 ? fallback_12(D) : iftmp.0_14;
> >> >
> >> > So we need to check for inverted condition in cond_to_vec_mask,
> >> > and swap the operands.
> >> > Does the patch look OK so far ?
> >> >
> >> > One major issue remaining with the patch is value  numbering.
> >> > Currently, it does value numbering for entire function using sccvn
> >> > during start of vect pass, which is too expensive since we only need
> >> > block based VN. I am looking into that.
> >>
> >> Why do you need it at all?  We run VN on the if-converted loop bodies btw.
>
> This was my suggestion, but with the idea being to do the numbering
> per-statement as we vectorise.  We'll then see pattern statements too.
>
> That's important because we use pattern statements to set the right
> vector boolean type (e.g. vect_recog_mask_conversion_pattern).
> So some of the masks we care about don't exist after if converison.
>
> > Also I can't trivially see the equality of the masks and probably so
> > can't VN.  Is it that we just don't bother to apply loop_mask to
> > VEC_COND but there's no harm if we do?
>
> Yeah.  The idea of the optimisation is to decide when it's more profitable
> to apply the loop mask, even though doing so isn't necessary.  It would
> be hard to do after vectorisation because the masks aren't equivalent.
> We're relying on knowledge of how the vectoriser uses the result.
Hi,
Sorry for the late response.  This is an updated patch that integrates
block-based VN into the vect pass.  The patch
(a) exports visit_stmt (renamed to vn_visit_stmt), plus vn_bb_init to
initialize the VN state and vn_bb_free to free it;
(b) calls vn_visit_stmt in vect_transform_stmt to value-number stmts; we're
only interested in obtaining value numbers, not in eliminating redundancies.
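
The intended flow is roughly the following (a sketch only; vn_visit_stmt's
real signature is whatever the attached patch uses):

  /* In vect_transform_loop, before transforming the stmts of a bb:  */
  vn_bb_init (bb);                     /* fresh VN state for this bb */
  loop_vinfo->cond_to_vec_mask = new cond_vmask_map_type (8);

  /* In vect_transform_stmt, for each scalar stmt being transformed:  */
  vn_visit_stmt (stmt);                /* record value numbers only */

  /* Once the bb has been transformed:  */
  vn_bb_free ();                       /* release the VN state */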
Does it look in the right direction ?
I am not sure if the initialization in vn_bb_init is entirely correct.

PS: The patch seems to regress fmla_2.c. I am looking into it.

Thanks,
Prathamesh
>
> Thanks,
> Richard

[-- Attachment #2: pr86753-6.diff --]
[-- Type: application/x-patch, Size: 14604 bytes --]


* Re: [SVE] PR86753
  2019-08-21 20:10       ` Prathamesh Kulkarni
@ 2019-08-22 12:05         ` Richard Biener
  2019-08-23 12:46           ` Prathamesh Kulkarni
  0 siblings, 1 reply; 41+ messages in thread
From: Richard Biener @ 2019-08-22 12:05 UTC (permalink / raw)
  To: Prathamesh Kulkarni; +Cc: gcc Patches, Richard Sandiford

On Wed, Aug 21, 2019 at 8:24 PM Prathamesh Kulkarni
<prathamesh.kulkarni@linaro.org> wrote:
>
> On Thu, 15 Aug 2019 at 01:50, Richard Sandiford
> <richard.sandiford@arm.com> wrote:
> >
> > Richard Biener <richard.guenther@gmail.com> writes:
> > > On Wed, Aug 14, 2019 at 6:49 PM Richard Biener
> > > <richard.guenther@gmail.com> wrote:
> > >>
> > >> On Wed, Aug 14, 2019 at 5:06 PM Prathamesh Kulkarni
> > >> <prathamesh.kulkarni@linaro.org> wrote:
> > >> >
> > >> > Hi,
> > >> > The attached patch tries to fix PR86753.
> > >> >
> > >> > For following test:
> > >> > void
> > >> > f1 (int *restrict x, int *restrict y, int *restrict z)
> > >> > {
> > >> >   for (int i = 0; i < 100; ++i)
> > >> >     x[i] = y[i] ? z[i] : 10;
> > >> > }
> > >> >
> > >> > vect dump shows:
> > >> >   vect_cst__42 = { 0, ... };
> > >> >   vect_cst__48 = { 0, ... };
> > >> >
> > >> >   vect__4.7_41 = .MASK_LOAD (vectp_y.5_38, 4B, loop_mask_40);
> > >> >   _4 = *_3;
> > >> >   _5 = z_12(D) + _2;
> > >> >   mask__35.8_43 = vect__4.7_41 != vect_cst__42;
> > >> >   _35 = _4 != 0;
> > >> >   vec_mask_and_46 = mask__35.8_43 & loop_mask_40;
> > >> >   vect_iftmp.11_47 = .MASK_LOAD (vectp_z.9_44, 4B, vec_mask_and_46);
> > >> >   iftmp.0_13 = 0;
> > >> >   vect_iftmp.12_50 = VEC_COND_EXPR <vect__4.7_41 != vect_cst__48,
> > >> > vect_iftmp.11_47, vect_cst__49>;
> > >> >
> > >> > and following code-gen:
> > >> > L2:
> > >> >         ld1w    z0.s, p2/z, [x1, x3, lsl 2]
> > >> >         cmpne   p1.s, p3/z, z0.s, #0
> > >> >         cmpne   p0.s, p2/z, z0.s, #0
> > >> >         ld1w    z0.s, p0/z, [x2, x3, lsl 2]
> > >> >         sel     z0.s, p1, z0.s, z1.s
> > >> >
> > >> > We could reuse vec_mask_and_46 in vec_cond_expr since the conditions
> > >> > vect__4.7_41 != vect_cst__48 and vect__4.7_41 != vect_cst__42
> > >> > are equivalent, and vect_iftmp.11_47 depends on vect__4.7_41 != vect_cst__48.
> > >> >
> > >> > I suppose in general for vec_cond_expr <C, T, E> if T comes from masked load,
> > >> > which is conditional on C, then we could reuse the mask used in load,
> > >> > in vec_cond_expr ?
> > >> >
> > >> > The patch maintains a hash_map cond_to_vec_mask
> > >> > from <cond, loop_mask -> vec_mask (with loop predicate applied).
> > >> > In prepare_load_store_mask, we record <cond, loop_mask> -> vec_mask & loop_mask,
> > >> > and in vectorizable_condition, we check if <cond, loop_mask> exists in
> > >> > cond_to_vec_mask
> > >> > and if found, the corresponding vec_mask is used as 1st operand of
> > >> > vec_cond_expr.
> > >> >
> > >> > <cond, loop_mask> is represented with cond_vmask_key, and the patch
> > >> > adds tree_cond_ops to represent condition operator and operands coming
> > >> > either from cond_expr
> > >> > or a gimple comparison stmt. If the stmt is not comparison, it returns
> > >> > <ne_expr, lhs, 0> and inserts that into cond_to_vec_mask.
> > >> >
> > >> > With patch, the redundant p1 is eliminated and sel uses p0 for above test.
> > >> >
> > >> > For following test:
> > >> > void
> > >> > f2 (int *restrict x, int *restrict y, int *restrict z, int fallback)
> > >> > {
> > >> >   for (int i = 0; i < 100; ++i)
> > >> >     x[i] = y[i] ? z[i] : fallback;
> > >> > }
> > >> >
> > >> > input to vectorizer has operands swapped in cond_expr:
> > >> >   _36 = _4 != 0;
> > >> >   iftmp.0_14 = .MASK_LOAD (_5, 32B, _36);
> > >> >   iftmp.0_8 = _4 == 0 ? fallback_12(D) : iftmp.0_14;
> > >> >
> > >> > So we need to check for inverted condition in cond_to_vec_mask,
> > >> > and swap the operands.
> > >> > Does the patch look OK so far ?
> > >> >
> > >> > One major issue remaining with the patch is value  numbering.
> > >> > Currently, it does value numbering for entire function using sccvn
> > >> > during start of vect pass, which is too expensive since we only need
> > >> > block based VN. I am looking into that.
> > >>
> > >> Why do you need it at all?  We run VN on the if-converted loop bodies btw.
> >
> > This was my suggestion, but with the idea being to do the numbering
> > per-statement as we vectorise.  We'll then see pattern statements too.
> >
> > That's important because we use pattern statements to set the right
> > vector boolean type (e.g. vect_recog_mask_conversion_pattern).
> > So some of the masks we care about don't exist after if converison.
> >
> > > Also I can't trivially see the equality of the masks and probably so
> > > can't VN.  Is it that we just don't bother to apply loop_mask to
> > > VEC_COND but there's no harm if we do?
> >
> > Yeah.  The idea of the optimisation is to decide when it's more profitable
> > to apply the loop mask, even though doing so isn't necessary.  It would
> > be hard to do after vectorisation because the masks aren't equivalent.
> > We're relying on knowledge of how the vectoriser uses the result.
> Hi,
> Sorry for late response. This is an updated patch, that integrates
> block-based VN into vect pass.
> The patch
> (a) Exports visit_stmt (renamed to vn_visit_stmt), vn_bb_init to
> initialize VN state, and vn_bb_free to free it.
> (b) Calls vn_visit_stmt in vect_transform_stmt for value numbering
> stmts. We're only interested in obtaining
> value numbers, not eliminating redundancies.
> Does it look in the right direction ?

It looks a bit odd to me.  I'd have expected it to work by generating
the stmts as before in the vectorizer and then, on the stmts we care about,
invoking vn_visit_stmt, which does both value-numbering and elimination.
Alternatively you could ask the VN state to generate the stmt for
you via vn_nary_build_or_lookup () (certainly that needs a bit more
work).  One complication might be availability if you don't value-number
all stmts in the block, but well.  I'm not sure constraining to a single
block is necessary - I've thought of having a "CSE"ing gimple_build
for some time (add & CSE new stmts onto a sequence), so one
should keep this mode in mind when designing the one working on
an existing BB.  Note that, as you write it, it depends on visiting the
stmts in proper order - is that guaranteed when, for example,
vectorizing SLP?
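
To be concrete about the first alternative (generate as before, then visit):
something along these lines, hand-waving since vn_visit_stmt is your new
entry point and this ignores the availability question:

  /* Generate the vector stmt exactly as the vectorizer does today ...  */
  gassign *new_stmt = gimple_build_assign (lhs, VEC_COND_EXPR, vec_compare,
                                           vec_then_clause, vec_else_clause);
  vect_finish_stmt_generation (stmt_info, new_stmt, gsi);

  /* ... and only then hand it to the VN state, which value-numbers it and,
     if an equivalent SSA name is already available, also eliminates it by
     substituting the existing leader.  */
  vn_visit_stmt (new_stmt);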

> I am not sure if the initialization in vn_bb_init is entirely correct.
>
> PS: The patch seems to regress fmla_2.c. I am looking into it.
>
> Thanks,
> Prathamesh
> >
> > Thanks,
> > Richard


* Re: [SVE] PR86753
  2019-08-22 12:05         ` Richard Biener
@ 2019-08-23 12:46           ` Prathamesh Kulkarni
  2019-08-23 13:47             ` Richard Sandiford
  0 siblings, 1 reply; 41+ messages in thread
From: Prathamesh Kulkarni @ 2019-08-23 12:46 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc Patches, Richard Sandiford

On Thu, 22 Aug 2019 at 16:44, Richard Biener <richard.guenther@gmail.com> wrote:
>
> On Wed, Aug 21, 2019 at 8:24 PM Prathamesh Kulkarni
> <prathamesh.kulkarni@linaro.org> wrote:
> >
> > On Thu, 15 Aug 2019 at 01:50, Richard Sandiford
> > <richard.sandiford@arm.com> wrote:
> > >
> > > Richard Biener <richard.guenther@gmail.com> writes:
> > > > On Wed, Aug 14, 2019 at 6:49 PM Richard Biener
> > > > <richard.guenther@gmail.com> wrote:
> > > >>
> > > >> On Wed, Aug 14, 2019 at 5:06 PM Prathamesh Kulkarni
> > > >> <prathamesh.kulkarni@linaro.org> wrote:
> > > >> >
> > > >> > Hi,
> > > >> > The attached patch tries to fix PR86753.
> > > >> >
> > > >> > For following test:
> > > >> > void
> > > >> > f1 (int *restrict x, int *restrict y, int *restrict z)
> > > >> > {
> > > >> >   for (int i = 0; i < 100; ++i)
> > > >> >     x[i] = y[i] ? z[i] : 10;
> > > >> > }
> > > >> >
> > > >> > vect dump shows:
> > > >> >   vect_cst__42 = { 0, ... };
> > > >> >   vect_cst__48 = { 0, ... };
> > > >> >
> > > >> >   vect__4.7_41 = .MASK_LOAD (vectp_y.5_38, 4B, loop_mask_40);
> > > >> >   _4 = *_3;
> > > >> >   _5 = z_12(D) + _2;
> > > >> >   mask__35.8_43 = vect__4.7_41 != vect_cst__42;
> > > >> >   _35 = _4 != 0;
> > > >> >   vec_mask_and_46 = mask__35.8_43 & loop_mask_40;
> > > >> >   vect_iftmp.11_47 = .MASK_LOAD (vectp_z.9_44, 4B, vec_mask_and_46);
> > > >> >   iftmp.0_13 = 0;
> > > >> >   vect_iftmp.12_50 = VEC_COND_EXPR <vect__4.7_41 != vect_cst__48,
> > > >> > vect_iftmp.11_47, vect_cst__49>;
> > > >> >
> > > >> > and following code-gen:
> > > >> > L2:
> > > >> >         ld1w    z0.s, p2/z, [x1, x3, lsl 2]
> > > >> >         cmpne   p1.s, p3/z, z0.s, #0
> > > >> >         cmpne   p0.s, p2/z, z0.s, #0
> > > >> >         ld1w    z0.s, p0/z, [x2, x3, lsl 2]
> > > >> >         sel     z0.s, p1, z0.s, z1.s
> > > >> >
> > > >> > We could reuse vec_mask_and_46 in vec_cond_expr since the conditions
> > > >> > vect__4.7_41 != vect_cst__48 and vect__4.7_41 != vect_cst__42
> > > >> > are equivalent, and vect_iftmp.11_47 depends on vect__4.7_41 != vect_cst__48.
> > > >> >
> > > >> > I suppose in general for vec_cond_expr <C, T, E> if T comes from masked load,
> > > >> > which is conditional on C, then we could reuse the mask used in load,
> > > >> > in vec_cond_expr ?
> > > >> >
> > > >> > The patch maintains a hash_map cond_to_vec_mask
> > > >> > from <cond, loop_mask -> vec_mask (with loop predicate applied).
> > > >> > In prepare_load_store_mask, we record <cond, loop_mask> -> vec_mask & loop_mask,
> > > >> > and in vectorizable_condition, we check if <cond, loop_mask> exists in
> > > >> > cond_to_vec_mask
> > > >> > and if found, the corresponding vec_mask is used as 1st operand of
> > > >> > vec_cond_expr.
> > > >> >
> > > >> > <cond, loop_mask> is represented with cond_vmask_key, and the patch
> > > >> > adds tree_cond_ops to represent condition operator and operands coming
> > > >> > either from cond_expr
> > > >> > or a gimple comparison stmt. If the stmt is not comparison, it returns
> > > >> > <ne_expr, lhs, 0> and inserts that into cond_to_vec_mask.
> > > >> >
> > > >> > With patch, the redundant p1 is eliminated and sel uses p0 for above test.
> > > >> >
> > > >> > For following test:
> > > >> > void
> > > >> > f2 (int *restrict x, int *restrict y, int *restrict z, int fallback)
> > > >> > {
> > > >> >   for (int i = 0; i < 100; ++i)
> > > >> >     x[i] = y[i] ? z[i] : fallback;
> > > >> > }
> > > >> >
> > > >> > input to vectorizer has operands swapped in cond_expr:
> > > >> >   _36 = _4 != 0;
> > > >> >   iftmp.0_14 = .MASK_LOAD (_5, 32B, _36);
> > > >> >   iftmp.0_8 = _4 == 0 ? fallback_12(D) : iftmp.0_14;
> > > >> >
> > > >> > So we need to check for inverted condition in cond_to_vec_mask,
> > > >> > and swap the operands.
> > > >> > Does the patch look OK so far ?
> > > >> >
> > > >> > One major issue remaining with the patch is value  numbering.
> > > >> > Currently, it does value numbering for entire function using sccvn
> > > >> > during start of vect pass, which is too expensive since we only need
> > > >> > block based VN. I am looking into that.
> > > >>
> > > >> Why do you need it at all?  We run VN on the if-converted loop bodies btw.
> > >
> > > This was my suggestion, but with the idea being to do the numbering
> > > per-statement as we vectorise.  We'll then see pattern statements too.
> > >
> > > That's important because we use pattern statements to set the right
> > > vector boolean type (e.g. vect_recog_mask_conversion_pattern).
> > > So some of the masks we care about don't exist after if converison.
> > >
> > > > Also I can't trivially see the equality of the masks and probably so
> > > > can't VN.  Is it that we just don't bother to apply loop_mask to
> > > > VEC_COND but there's no harm if we do?
> > >
> > > Yeah.  The idea of the optimisation is to decide when it's more profitable
> > > to apply the loop mask, even though doing so isn't necessary.  It would
> > > be hard to do after vectorisation because the masks aren't equivalent.
> > > We're relying on knowledge of how the vectoriser uses the result.
> > Hi,
> > Sorry for late response. This is an updated patch, that integrates
> > block-based VN into vect pass.
> > The patch
> > (a) Exports visit_stmt (renamed to vn_visit_stmt), vn_bb_init to
> > initialize VN state, and vn_bb_free to free it.
> > (b) Calls vn_visit_stmt in vect_transform_stmt for value numbering
> > stmts. We're only interested in obtaining
> > value numbers, not eliminating redundancies.
> > Does it look in the right direction ?
>
> It looks a bit odd to me.  I'd have expected it to work by generating
> the stmts as before in the vectorizer and then on the stmts we care
> invoke vn_visit_stmt that does both value-numbering and elimination.
> Alternatively you could ask the VN state to generate the stmt for
> you via vn_nary_build_or_lookup () (certainly that needs a bit more
> work).  One complication might be availability if you don't value-number
> all stmts in the block, but well.  I'm not sure constraining to a single
> block is necessary - I've thought of having a "CSE"ing gimple_build
> for some time (add & CSE new stmts onto a sequence), so one
> should keep this mode in mind when designing the one working on
> an existing BB.  Note as you write it it depends on visiting the
> stmts in proper order - is that guaranteed when for example
> vectorizing SLP?
Hi,
Indeed, I wrote the function with the assumption that stmts would be
visited in proper order.
This doesn't affect SLP currently, because the call to vn_visit_stmt in
vect_transform_stmt is conditional on cond_to_vec_mask, which is only
allocated inside vect_transform_loop.
But I agree we could make it more general.
AFAIU, the idea of constraining VN to a single block was to avoid using defs
from non-dominating scalar stmts during outer-loop vectorization.

* fmla_2.c regression with the patch:
This happens because, with the patch, forwprop4 is able to convert all 3
vec_cond_exprs to .cond_fma (), which results in 3 calls to fmla, regressing
the test-case.  If matching with the inverted condition is disabled in
vectorizable_condition, then the old behavior is preserved.

Thanks,
Prathamesh
>
> > I am not sure if the initialization in vn_bb_init is entirely correct.
> >
> > PS: The patch seems to regress fmla_2.c. I am looking into it.
> >
> > Thanks,
> > Prathamesh
> > >
> > > Thanks,
> > > Richard


* Re: [SVE] PR86753
  2019-08-23 12:46           ` Prathamesh Kulkarni
@ 2019-08-23 13:47             ` Richard Sandiford
  2019-08-23 14:30               ` Prathamesh Kulkarni
  0 siblings, 1 reply; 41+ messages in thread
From: Richard Sandiford @ 2019-08-23 13:47 UTC (permalink / raw)
  To: Prathamesh Kulkarni; +Cc: Richard Biener, gcc Patches

Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> On Thu, 22 Aug 2019 at 16:44, Richard Biener <richard.guenther@gmail.com> wrote:
>>
>> On Wed, Aug 21, 2019 at 8:24 PM Prathamesh Kulkarni
>> <prathamesh.kulkarni@linaro.org> wrote:
>> >
>> > On Thu, 15 Aug 2019 at 01:50, Richard Sandiford
>> > <richard.sandiford@arm.com> wrote:
>> > >
>> > > Richard Biener <richard.guenther@gmail.com> writes:
>> > > > On Wed, Aug 14, 2019 at 6:49 PM Richard Biener
>> > > > <richard.guenther@gmail.com> wrote:
>> > > >>
>> > > >> On Wed, Aug 14, 2019 at 5:06 PM Prathamesh Kulkarni
>> > > >> <prathamesh.kulkarni@linaro.org> wrote:
>> > > >> >
>> > > >> > Hi,
>> > > >> > The attached patch tries to fix PR86753.
>> > > >> >
>> > > >> > For following test:
>> > > >> > void
>> > > >> > f1 (int *restrict x, int *restrict y, int *restrict z)
>> > > >> > {
>> > > >> >   for (int i = 0; i < 100; ++i)
>> > > >> >     x[i] = y[i] ? z[i] : 10;
>> > > >> > }
>> > > >> >
>> > > >> > vect dump shows:
>> > > >> >   vect_cst__42 = { 0, ... };
>> > > >> >   vect_cst__48 = { 0, ... };
>> > > >> >
>> > > >> >   vect__4.7_41 = .MASK_LOAD (vectp_y.5_38, 4B, loop_mask_40);
>> > > >> >   _4 = *_3;
>> > > >> >   _5 = z_12(D) + _2;
>> > > >> >   mask__35.8_43 = vect__4.7_41 != vect_cst__42;
>> > > >> >   _35 = _4 != 0;
>> > > >> >   vec_mask_and_46 = mask__35.8_43 & loop_mask_40;
>> > > >> >   vect_iftmp.11_47 = .MASK_LOAD (vectp_z.9_44, 4B, vec_mask_and_46);
>> > > >> >   iftmp.0_13 = 0;
>> > > >> >   vect_iftmp.12_50 = VEC_COND_EXPR <vect__4.7_41 != vect_cst__48,
>> > > >> > vect_iftmp.11_47, vect_cst__49>;
>> > > >> >
>> > > >> > and following code-gen:
>> > > >> > L2:
>> > > >> >         ld1w    z0.s, p2/z, [x1, x3, lsl 2]
>> > > >> >         cmpne   p1.s, p3/z, z0.s, #0
>> > > >> >         cmpne   p0.s, p2/z, z0.s, #0
>> > > >> >         ld1w    z0.s, p0/z, [x2, x3, lsl 2]
>> > > >> >         sel     z0.s, p1, z0.s, z1.s
>> > > >> >
>> > > >> > We could reuse vec_mask_and_46 in vec_cond_expr since the conditions
>> > > >> > vect__4.7_41 != vect_cst__48 and vect__4.7_41 != vect_cst__42
>> > > >> > are equivalent, and vect_iftmp.11_47 depends on vect__4.7_41 != vect_cst__48.
>> > > >> >
>> > > >> > I suppose in general for vec_cond_expr <C, T, E> if T comes from masked load,
>> > > >> > which is conditional on C, then we could reuse the mask used in load,
>> > > >> > in vec_cond_expr ?
>> > > >> >
>> > > >> > The patch maintains a hash_map cond_to_vec_mask
>> > > >> > from <cond, loop_mask -> vec_mask (with loop predicate applied).
>> > > >> > In prepare_load_store_mask, we record <cond, loop_mask> -> vec_mask & loop_mask,
>> > > >> > and in vectorizable_condition, we check if <cond, loop_mask> exists in
>> > > >> > cond_to_vec_mask
>> > > >> > and if found, the corresponding vec_mask is used as 1st operand of
>> > > >> > vec_cond_expr.
>> > > >> >
>> > > >> > <cond, loop_mask> is represented with cond_vmask_key, and the patch
>> > > >> > adds tree_cond_ops to represent condition operator and operands coming
>> > > >> > either from cond_expr
>> > > >> > or a gimple comparison stmt. If the stmt is not comparison, it returns
>> > > >> > <ne_expr, lhs, 0> and inserts that into cond_to_vec_mask.
>> > > >> >
>> > > >> > With patch, the redundant p1 is eliminated and sel uses p0 for above test.
>> > > >> >
>> > > >> > For following test:
>> > > >> > void
>> > > >> > f2 (int *restrict x, int *restrict y, int *restrict z, int fallback)
>> > > >> > {
>> > > >> >   for (int i = 0; i < 100; ++i)
>> > > >> >     x[i] = y[i] ? z[i] : fallback;
>> > > >> > }
>> > > >> >
>> > > >> > input to vectorizer has operands swapped in cond_expr:
>> > > >> >   _36 = _4 != 0;
>> > > >> >   iftmp.0_14 = .MASK_LOAD (_5, 32B, _36);
>> > > >> >   iftmp.0_8 = _4 == 0 ? fallback_12(D) : iftmp.0_14;
>> > > >> >
>> > > >> > So we need to check for inverted condition in cond_to_vec_mask,
>> > > >> > and swap the operands.
>> > > >> > Does the patch look OK so far ?
>> > > >> >
>> > > >> > One major issue remaining with the patch is value  numbering.
>> > > >> > Currently, it does value numbering for entire function using sccvn
>> > > >> > during start of vect pass, which is too expensive since we only need
>> > > >> > block based VN. I am looking into that.
>> > > >>
>> > > >> Why do you need it at all?  We run VN on the if-converted loop bodies btw.
>> > >
>> > > This was my suggestion, but with the idea being to do the numbering
>> > > per-statement as we vectorise.  We'll then see pattern statements too.
>> > >
>> > > That's important because we use pattern statements to set the right
>> > > vector boolean type (e.g. vect_recog_mask_conversion_pattern).
>> > > So some of the masks we care about don't exist after if converison.
>> > >
>> > > > Also I can't trivially see the equality of the masks and probably so
>> > > > can't VN.  Is it that we just don't bother to apply loop_mask to
>> > > > VEC_COND but there's no harm if we do?
>> > >
>> > > Yeah.  The idea of the optimisation is to decide when it's more profitable
>> > > to apply the loop mask, even though doing so isn't necessary.  It would
>> > > be hard to do after vectorisation because the masks aren't equivalent.
>> > > We're relying on knowledge of how the vectoriser uses the result.
>> > Hi,
>> > Sorry for late response. This is an updated patch, that integrates
>> > block-based VN into vect pass.
>> > The patch
>> > (a) Exports visit_stmt (renamed to vn_visit_stmt), vn_bb_init to
>> > initialize VN state, and vn_bb_free to free it.
>> > (b) Calls vn_visit_stmt in vect_transform_stmt for value numbering
>> > stmts. We're only interested in obtaining
>> > value numbers, not eliminating redundancies.
>> > Does it look in the right direction ?
>>
>> It looks a bit odd to me.  I'd have expected it to work by generating
>> the stmts as before in the vectorizer and then on the stmts we care
>> invoke vn_visit_stmt that does both value-numbering and elimination.
>> Alternatively you could ask the VN state to generate the stmt for
>> you via vn_nary_build_or_lookup () (certainly that needs a bit more
>> work).  One complication might be availability if you don't value-number
>> all stmts in the block, but well.  I'm not sure constraining to a single
>> block is necessary - I've thought of having a "CSE"ing gimple_build
>> for some time (add & CSE new stmts onto a sequence), so one
>> should keep this mode in mind when designing the one working on
>> an existing BB.  Note as you write it it depends on visiting the
>> stmts in proper order - is that guaranteed when for example
>> vectorizing SLP?
> Hi,
> Indeed, I wrote the function with assumption that, stmts would be
> visited in proper order.
> This doesn't affect SLP currently, because call to vn_visit_stmt in
> vect_transform_stmt is
> conditional on cond_to_vec_mask, which is only allocated inside
> vect_transform_loop.
> But I agree we could make it more general.
> AFAIU, the idea of constraining VN to single block was to avoid using defs from
> non-dominating scalar stmts during outer-loop vectorization.

Maybe we could do the numbering in a separate walk immediately before
the transform phase instead.

> * fmla_2.c regression with patch:
> This happens because with patch, forwprop4 is able to convert all 3
> vec_cond_expr's
> to .cond_fma(), which results in 3 calls to fmla, regressing the
> test-case. If matching with
> inverted condition is disabled in patch in vectorizable_condition,
> then the old behavior gets preserved.

Ugh, yeah.  This all stems from the fact that we don't get rid of
the first redundant store (before or after the patch).  I think it's
just luck that we managed to CSE the FMLAs feeding the two stores.
(That wasn't what the test was added for; it was added for a
suboptimal .md choice.)

I think it'd be OK to XFAIL the current scan-assembler-times and add a
new one for 3 times instead of 2 (with a comment of course).  I've filed
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91532 for the redundant
stores.
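
For instance something like this in fmla_2.c (illustrative only - the regex
should stay whatever the test currently uses):

/* The redundant store (PR91532) means we now need three FMLAs rather
   than two, so XFAIL the old expectation and test for the new count.  */
/* { dg-final { scan-assembler-times {\tfmla\t} 2 { xfail *-*-* } } } */
/* { dg-final { scan-assembler-times {\tfmla\t} 3 } } */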

Thanks,
Richard


* Re: [SVE] PR86753
  2019-08-23 13:47             ` Richard Sandiford
@ 2019-08-23 14:30               ` Prathamesh Kulkarni
  2019-08-23 14:34                 ` Richard Sandiford
  0 siblings, 1 reply; 41+ messages in thread
From: Prathamesh Kulkarni @ 2019-08-23 14:30 UTC (permalink / raw)
  To: Prathamesh Kulkarni, Richard Biener, gcc Patches, Richard Sandiford

On Fri, 23 Aug 2019 at 18:15, Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> > On Thu, 22 Aug 2019 at 16:44, Richard Biener <richard.guenther@gmail.com> wrote:
> >>
> >> On Wed, Aug 21, 2019 at 8:24 PM Prathamesh Kulkarni
> >> <prathamesh.kulkarni@linaro.org> wrote:
> >> >
> >> > On Thu, 15 Aug 2019 at 01:50, Richard Sandiford
> >> > <richard.sandiford@arm.com> wrote:
> >> > >
> >> > > Richard Biener <richard.guenther@gmail.com> writes:
> >> > > > On Wed, Aug 14, 2019 at 6:49 PM Richard Biener
> >> > > > <richard.guenther@gmail.com> wrote:
> >> > > >>
> >> > > >> On Wed, Aug 14, 2019 at 5:06 PM Prathamesh Kulkarni
> >> > > >> <prathamesh.kulkarni@linaro.org> wrote:
> >> > > >> >
> >> > > >> > Hi,
> >> > > >> > The attached patch tries to fix PR86753.
> >> > > >> >
> >> > > >> > For following test:
> >> > > >> > void
> >> > > >> > f1 (int *restrict x, int *restrict y, int *restrict z)
> >> > > >> > {
> >> > > >> >   for (int i = 0; i < 100; ++i)
> >> > > >> >     x[i] = y[i] ? z[i] : 10;
> >> > > >> > }
> >> > > >> >
> >> > > >> > vect dump shows:
> >> > > >> >   vect_cst__42 = { 0, ... };
> >> > > >> >   vect_cst__48 = { 0, ... };
> >> > > >> >
> >> > > >> >   vect__4.7_41 = .MASK_LOAD (vectp_y.5_38, 4B, loop_mask_40);
> >> > > >> >   _4 = *_3;
> >> > > >> >   _5 = z_12(D) + _2;
> >> > > >> >   mask__35.8_43 = vect__4.7_41 != vect_cst__42;
> >> > > >> >   _35 = _4 != 0;
> >> > > >> >   vec_mask_and_46 = mask__35.8_43 & loop_mask_40;
> >> > > >> >   vect_iftmp.11_47 = .MASK_LOAD (vectp_z.9_44, 4B, vec_mask_and_46);
> >> > > >> >   iftmp.0_13 = 0;
> >> > > >> >   vect_iftmp.12_50 = VEC_COND_EXPR <vect__4.7_41 != vect_cst__48,
> >> > > >> > vect_iftmp.11_47, vect_cst__49>;
> >> > > >> >
> >> > > >> > and following code-gen:
> >> > > >> > L2:
> >> > > >> >         ld1w    z0.s, p2/z, [x1, x3, lsl 2]
> >> > > >> >         cmpne   p1.s, p3/z, z0.s, #0
> >> > > >> >         cmpne   p0.s, p2/z, z0.s, #0
> >> > > >> >         ld1w    z0.s, p0/z, [x2, x3, lsl 2]
> >> > > >> >         sel     z0.s, p1, z0.s, z1.s
> >> > > >> >
> >> > > >> > We could reuse vec_mask_and_46 in vec_cond_expr since the conditions
> >> > > >> > vect__4.7_41 != vect_cst__48 and vect__4.7_41 != vect_cst__42
> >> > > >> > are equivalent, and vect_iftmp.11_47 depends on vect__4.7_41 != vect_cst__48.
> >> > > >> >
> >> > > >> > I suppose in general for vec_cond_expr <C, T, E> if T comes from masked load,
> >> > > >> > which is conditional on C, then we could reuse the mask used in load,
> >> > > >> > in vec_cond_expr ?
> >> > > >> >
> >> > > >> > The patch maintains a hash_map cond_to_vec_mask
> >> > > >> > from <cond, loop_mask -> vec_mask (with loop predicate applied).
> >> > > >> > In prepare_load_store_mask, we record <cond, loop_mask> -> vec_mask & loop_mask,
> >> > > >> > and in vectorizable_condition, we check if <cond, loop_mask> exists in
> >> > > >> > cond_to_vec_mask
> >> > > >> > and if found, the corresponding vec_mask is used as 1st operand of
> >> > > >> > vec_cond_expr.
> >> > > >> >
> >> > > >> > <cond, loop_mask> is represented with cond_vmask_key, and the patch
> >> > > >> > adds tree_cond_ops to represent condition operator and operands coming
> >> > > >> > either from cond_expr
> >> > > >> > or a gimple comparison stmt. If the stmt is not comparison, it returns
> >> > > >> > <ne_expr, lhs, 0> and inserts that into cond_to_vec_mask.
> >> > > >> >
> >> > > >> > With patch, the redundant p1 is eliminated and sel uses p0 for above test.
> >> > > >> >
> >> > > >> > For following test:
> >> > > >> > void
> >> > > >> > f2 (int *restrict x, int *restrict y, int *restrict z, int fallback)
> >> > > >> > {
> >> > > >> >   for (int i = 0; i < 100; ++i)
> >> > > >> >     x[i] = y[i] ? z[i] : fallback;
> >> > > >> > }
> >> > > >> >
> >> > > >> > input to vectorizer has operands swapped in cond_expr:
> >> > > >> >   _36 = _4 != 0;
> >> > > >> >   iftmp.0_14 = .MASK_LOAD (_5, 32B, _36);
> >> > > >> >   iftmp.0_8 = _4 == 0 ? fallback_12(D) : iftmp.0_14;
> >> > > >> >
> >> > > >> > So we need to check for inverted condition in cond_to_vec_mask,
> >> > > >> > and swap the operands.
> >> > > >> > Does the patch look OK so far ?
> >> > > >> >
> >> > > >> > One major issue remaining with the patch is value  numbering.
> >> > > >> > Currently, it does value numbering for entire function using sccvn
> >> > > >> > during start of vect pass, which is too expensive since we only need
> >> > > >> > block based VN. I am looking into that.
> >> > > >>
> >> > > >> Why do you need it at all?  We run VN on the if-converted loop bodies btw.
> >> > >
> >> > > This was my suggestion, but with the idea being to do the numbering
> >> > > per-statement as we vectorise.  We'll then see pattern statements too.
> >> > >
> >> > > That's important because we use pattern statements to set the right
> >> > > vector boolean type (e.g. vect_recog_mask_conversion_pattern).
> >> > > So some of the masks we care about don't exist after if converison.
> >> > >
> >> > > > Also I can't trivially see the equality of the masks and probably so
> >> > > > can't VN.  Is it that we just don't bother to apply loop_mask to
> >> > > > VEC_COND but there's no harm if we do?
> >> > >
> >> > > Yeah.  The idea of the optimisation is to decide when it's more profitable
> >> > > to apply the loop mask, even though doing so isn't necessary.  It would
> >> > > be hard to do after vectorisation because the masks aren't equivalent.
> >> > > We're relying on knowledge of how the vectoriser uses the result.
> >> > Hi,
> >> > Sorry for late response. This is an updated patch, that integrates
> >> > block-based VN into vect pass.
> >> > The patch
> >> > (a) Exports visit_stmt (renamed to vn_visit_stmt), vn_bb_init to
> >> > initialize VN state, and vn_bb_free to free it.
> >> > (b) Calls vn_visit_stmt in vect_transform_stmt for value numbering
> >> > stmts. We're only interested in obtaining
> >> > value numbers, not eliminating redundancies.
> >> > Does it look in the right direction ?
> >>
> >> It looks a bit odd to me.  I'd have expected it to work by generating
> >> the stmts as before in the vectorizer and then on the stmts we care
> >> invoke vn_visit_stmt that does both value-numbering and elimination.
> >> Alternatively you could ask the VN state to generate the stmt for
> >> you via vn_nary_build_or_lookup () (certainly that needs a bit more
> >> work).  One complication might be availability if you don't value-number
> >> all stmts in the block, but well.  I'm not sure constraining to a single
> >> block is necessary - I've thought of having a "CSE"ing gimple_build
> >> for some time (add & CSE new stmts onto a sequence), so one
> >> should keep this mode in mind when designing the one working on
> >> an existing BB.  Note as you write it it depends on visiting the
> >> stmts in proper order - is that guaranteed when for example
> >> vectorizing SLP?
> > Hi,
> > Indeed, I wrote the function with assumption that, stmts would be
> > visited in proper order.
> > This doesn't affect SLP currently, because call to vn_visit_stmt in
> > vect_transform_stmt is
> > conditional on cond_to_vec_mask, which is only allocated inside
> > vect_transform_loop.
> > But I agree we could make it more general.
> > AFAIU, the idea of constraining VN to single block was to avoid using defs from
> > non-dominating scalar stmts during outer-loop vectorization.
>
> Maybe we could do the numbering in a separate walk immediately before
> the transform phase instead.
Um, sorry, I didn't understand.  Do you mean we should do DOM-based VN just
before the transform phase, or run full VN?
>
> > * fmla_2.c regression with patch:
> > This happens because with patch, forwprop4 is able to convert all 3
> > vec_cond_expr's
> > to .cond_fma(), which results in 3 calls to fmla, regressing the
> > test-case. If matching with
> > inverted condition is disabled in patch in vectorizable_condition,
> > then the old behavior gets preserved.
>
> Ugh, yeah.  This all stems from the fact that we don't get rid of
> the first redundant store (before or after the patch).  I think it's
> just luck that we managed to CSE the FMLAs feeding the two stores.
> (That wasn't what the test was added for; it was added for a
> suboptimal .md choice.)
>
> I think it'd be OK to XFAIL the current scan-assembler-times and add a
> new one for 3 times instead of 2 (with a comment of course).  I've filed
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91532 for the redundant
> stores.
Thanks for the suggestions, I will adjust the test-case in follow up patch.

Thanks,
Prathamesh
>
> Thanks,
> Richard


* Re: [SVE] PR86753
  2019-08-23 14:30               ` Prathamesh Kulkarni
@ 2019-08-23 14:34                 ` Richard Sandiford
  2019-08-26  5:59                   ` Prathamesh Kulkarni
  0 siblings, 1 reply; 41+ messages in thread
From: Richard Sandiford @ 2019-08-23 14:34 UTC (permalink / raw)
  To: Prathamesh Kulkarni; +Cc: Richard Biener, gcc Patches

Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> On Fri, 23 Aug 2019 at 18:15, Richard Sandiford
> <richard.sandiford@arm.com> wrote:
>>
>> Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
>> > On Thu, 22 Aug 2019 at 16:44, Richard Biener <richard.guenther@gmail.com> wrote:
>> >> It looks a bit odd to me.  I'd have expected it to work by generating
>> >> the stmts as before in the vectorizer and then on the stmts we care
>> >> invoke vn_visit_stmt that does both value-numbering and elimination.
>> >> Alternatively you could ask the VN state to generate the stmt for
>> >> you via vn_nary_build_or_lookup () (certainly that needs a bit more
>> >> work).  One complication might be availability if you don't value-number
>> >> all stmts in the block, but well.  I'm not sure constraining to a single
>> >> block is necessary - I've thought of having a "CSE"ing gimple_build
>> >> for some time (add & CSE new stmts onto a sequence), so one
>> >> should keep this mode in mind when designing the one working on
>> >> an existing BB.  Note as you write it it depends on visiting the
>> >> stmts in proper order - is that guaranteed when for example
>> >> vectorizing SLP?
>> > Hi,
>> > Indeed, I wrote the function with assumption that, stmts would be
>> > visited in proper order.
>> > This doesn't affect SLP currently, because call to vn_visit_stmt in
>> > vect_transform_stmt is
>> > conditional on cond_to_vec_mask, which is only allocated inside
>> > vect_transform_loop.
>> > But I agree we could make it more general.
>> > AFAIU, the idea of constraining VN to single block was to avoid using defs from
>> > non-dominating scalar stmts during outer-loop vectorization.
>>
>> Maybe we could do the numbering in a separate walk immediately before
>> the transform phase instead.
> Um sorry, I didn't understand. Do you mean we should do dom based VN
> just before transform phase
> or run full VN ?

No, I just meant that we could do a separate walk of the contents
of the basic block:

> @@ -8608,6 +8609,8 @@ vect_transform_loop (loop_vec_info loop_vinfo)
>      {
>        basic_block bb = bbs[i];
>        stmt_vec_info stmt_info;
> +      vn_bb_init (bb);
> +      loop_vinfo->cond_to_vec_mask = new cond_vmask_map_type (8);
> 

...here, rather than doing it on the fly during vect_transform_stmt
itself.  The walk should be gated on LOOP_VINFO_FULLY_MASKED_P so that
others don't have to pay the compile-time penalty.  (Same for
cond_to_vec_mask itself really.)
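
Something like this, reusing your names (just a sketch):

  basic_block bb = bbs[i];
  if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
    {
      /* Only fully-masked loops can benefit from reusing the masks,
         so only they pay for the value numbering.  */
      loop_vinfo->cond_to_vec_mask = new cond_vmask_map_type (8);
      vn_bb_init (bb);
      for (gimple_stmt_iterator si = gsi_start_bb (bb);
           !gsi_end_p (si); gsi_next (&si))
        vn_visit_stmt (gsi_stmt (si));
    }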

Thanks,
Richard


* Re: [SVE] PR86753
  2019-08-23 14:34                 ` Richard Sandiford
@ 2019-08-26  5:59                   ` Prathamesh Kulkarni
  2019-08-26 11:46                     ` Richard Biener
  0 siblings, 1 reply; 41+ messages in thread
From: Prathamesh Kulkarni @ 2019-08-26  5:59 UTC (permalink / raw)
  To: Prathamesh Kulkarni, Richard Biener, gcc Patches, Richard Sandiford

[-- Attachment #1: Type: text/plain, Size: 3100 bytes --]

On Fri, 23 Aug 2019 at 19:43, Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> > On Fri, 23 Aug 2019 at 18:15, Richard Sandiford
> > <richard.sandiford@arm.com> wrote:
> >>
> >> Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> >> > On Thu, 22 Aug 2019 at 16:44, Richard Biener <richard.guenther@gmail.com> wrote:
> >> >> It looks a bit odd to me.  I'd have expected it to work by generating
> >> >> the stmts as before in the vectorizer and then on the stmts we care
> >> >> invoke vn_visit_stmt that does both value-numbering and elimination.
> >> >> Alternatively you could ask the VN state to generate the stmt for
> >> >> you via vn_nary_build_or_lookup () (certainly that needs a bit more
> >> >> work).  One complication might be availability if you don't value-number
> >> >> all stmts in the block, but well.  I'm not sure constraining to a single
> >> >> block is necessary - I've thought of having a "CSE"ing gimple_build
> >> >> for some time (add & CSE new stmts onto a sequence), so one
> >> >> should keep this mode in mind when designing the one working on
> >> >> an existing BB.  Note as you write it it depends on visiting the
> >> >> stmts in proper order - is that guaranteed when for example
> >> >> vectorizing SLP?
> >> > Hi,
> >> > Indeed, I wrote the function with assumption that, stmts would be
> >> > visited in proper order.
> >> > This doesn't affect SLP currently, because call to vn_visit_stmt in
> >> > vect_transform_stmt is
> >> > conditional on cond_to_vec_mask, which is only allocated inside
> >> > vect_transform_loop.
> >> > But I agree we could make it more general.
> >> > AFAIU, the idea of constraining VN to single block was to avoid using defs from
> >> > non-dominating scalar stmts during outer-loop vectorization.
> >>
> >> Maybe we could do the numbering in a separate walk immediately before
> >> the transform phase instead.
> > Um sorry, I didn't understand. Do you mean we should do dom based VN
> > just before transform phase
> > or run full VN ?
>
> No, I just meant that we could do a separate walk of the contents
> of the basic block:
>
> > @@ -8608,6 +8609,8 @@ vect_transform_loop (loop_vec_info loop_vinfo)
> >      {
> >        basic_block bb = bbs[i];
> >        stmt_vec_info stmt_info;
> > +      vn_bb_init (bb);
> > +      loop_vinfo->cond_to_vec_mask = new cond_vmask_map_type (8);
> >
>
> ...here, rather than doing it on the fly during vect_transform_stmt
> itself.  The walk should be gated on LOOP_VINFO_FULLY_MASKED_P so that
> others don't have to pay the compile-time penalty.  (Same for
> cond_to_vec_mask itself really.)
Hi,
Does the attached patch look OK?
In the patch, I put the call to vn_visit_stmt in the bb loop in
vect_transform_loop, to avoid replicating the logic for processing phis
and stmts.
AFAIU, vect_transform_loop_stmt is only called from that bb loop, so the
compile-time penalty for checking cond_to_vec_mask should be pretty small?
If this is not OK, I will walk the bb immediately before the bb loop.

Thanks,
Prathamesh
>
> Thanks,
> Richard

[-- Attachment #2: pr86753-7.diff --]
[-- Type: application/x-patch, Size: 15695 bytes --]

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [SVE] PR86753
  2019-08-26  5:59                   ` Prathamesh Kulkarni
@ 2019-08-26 11:46                     ` Richard Biener
  2019-08-26 13:39                       ` Prathamesh Kulkarni
  0 siblings, 1 reply; 41+ messages in thread
From: Richard Biener @ 2019-08-26 11:46 UTC (permalink / raw)
  To: Prathamesh Kulkarni; +Cc: gcc Patches, Richard Sandiford

On Sun, Aug 25, 2019 at 11:13 PM Prathamesh Kulkarni
<prathamesh.kulkarni@linaro.org> wrote:
>
> On Fri, 23 Aug 2019 at 19:43, Richard Sandiford
> <richard.sandiford@arm.com> wrote:
> >
> > Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> > > On Fri, 23 Aug 2019 at 18:15, Richard Sandiford
> > > <richard.sandiford@arm.com> wrote:
> > >>
> > >> Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> > >> > On Thu, 22 Aug 2019 at 16:44, Richard Biener <richard.guenther@gmail.com> wrote:
> > >> >> It looks a bit odd to me.  I'd have expected it to work by generating
> > >> >> the stmts as before in the vectorizer and then on the stmts we care
> > >> >> invoke vn_visit_stmt that does both value-numbering and elimination.
> > >> >> Alternatively you could ask the VN state to generate the stmt for
> > >> >> you via vn_nary_build_or_lookup () (certainly that needs a bit more
> > >> >> work).  One complication might be availability if you don't value-number
> > >> >> all stmts in the block, but well.  I'm not sure constraining to a single
> > >> >> block is necessary - I've thought of having a "CSE"ing gimple_build
> > >> >> for some time (add & CSE new stmts onto a sequence), so one
> > >> >> should keep this mode in mind when designing the one working on
> > >> >> an existing BB.  Note as you write it it depends on visiting the
> > >> >> stmts in proper order - is that guaranteed when for example
> > >> >> vectorizing SLP?
> > >> > Hi,
> > >> > Indeed, I wrote the function with assumption that, stmts would be
> > >> > visited in proper order.
> > >> > This doesn't affect SLP currently, because call to vn_visit_stmt in
> > >> > vect_transform_stmt is
> > >> > conditional on cond_to_vec_mask, which is only allocated inside
> > >> > vect_transform_loop.
> > >> > But I agree we could make it more general.
> > >> > AFAIU, the idea of constraining VN to single block was to avoid using defs from
> > >> > non-dominating scalar stmts during outer-loop vectorization.
> > >>
> > >> Maybe we could do the numbering in a separate walk immediately before
> > >> the transform phase instead.
> > > Um sorry, I didn't understand. Do you mean we should do dom based VN
> > > just before transform phase
> > > or run full VN ?
> >
> > No, I just meant that we could do a separate walk of the contents
> > of the basic block:
> >
> > > @@ -8608,6 +8609,8 @@ vect_transform_loop (loop_vec_info loop_vinfo)
> > >      {
> > >        basic_block bb = bbs[i];
> > >        stmt_vec_info stmt_info;
> > > +      vn_bb_init (bb);
> > > +      loop_vinfo->cond_to_vec_mask = new cond_vmask_map_type (8);
> > >
> >
> > ...here, rather than doing it on the fly during vect_transform_stmt
> > itself.  The walk should be gated on LOOP_VINFO_FULLY_MASKED_P so that
> > others don't have to pay the compile-time penalty.  (Same for
> > cond_to_vec_mask itself really.)
> Hi,
> Does the attached patch look OK ?
> In patch, I put call to vn_visit stmt in bb loop in
> vect_transform_loop to avoid replicating logic for processing phi and
> stmts.
> AFAIU, vect_transform_loop_stmt is only called from bb loop, so
> compile time penalty for checking cond_to_vec_mask
> should be pretty small ?
> If this is not OK, I will walk bb immediately before the bb loop.

So if I understand correctly you never have vectorizable COND_EXPRs
in SLP mode?  Because we vectorize all SLP chains before entering
the loop in vect_transform_loop where you VN existing scalar(!) stmts.

Then all this new hash-table stuff should not be needed, since this
is what VN should provide you with.  You of course need to visit
generated condition stmts.  And condition support is weak
in VN due to it possibly having two operations in a single stmt.
Bad GIMPLE IL.  So I'm not sure VN is up to the task here or
why you even need it given you are doing your own hashing?

Richard.

> Thanks,
> Prathamesh
> >
> > Thanks,
> > Richard

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [SVE] PR86753
  2019-08-26 11:46                     ` Richard Biener
@ 2019-08-26 13:39                       ` Prathamesh Kulkarni
  2019-08-27 10:41                         ` Richard Sandiford
  0 siblings, 1 reply; 41+ messages in thread
From: Prathamesh Kulkarni @ 2019-08-26 13:39 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc Patches, Richard Sandiford

[-- Attachment #1: Type: text/plain, Size: 5654 bytes --]

On Mon, 26 Aug 2019 at 14:48, Richard Biener <richard.guenther@gmail.com> wrote:
>
> On Sun, Aug 25, 2019 at 11:13 PM Prathamesh Kulkarni
> <prathamesh.kulkarni@linaro.org> wrote:
> >
> > On Fri, 23 Aug 2019 at 19:43, Richard Sandiford
> > <richard.sandiford@arm.com> wrote:
> > >
> > > Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> > > > On Fri, 23 Aug 2019 at 18:15, Richard Sandiford
> > > > <richard.sandiford@arm.com> wrote:
> > > >>
> > > >> Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> > > >> > On Thu, 22 Aug 2019 at 16:44, Richard Biener <richard.guenther@gmail.com> wrote:
> > > >> >> It looks a bit odd to me.  I'd have expected it to work by generating
> > > >> >> the stmts as before in the vectorizer and then on the stmts we care
> > > >> >> invoke vn_visit_stmt that does both value-numbering and elimination.
> > > >> >> Alternatively you could ask the VN state to generate the stmt for
> > > >> >> you via vn_nary_build_or_lookup () (certainly that needs a bit more
> > > >> >> work).  One complication might be availability if you don't value-number
> > > >> >> all stmts in the block, but well.  I'm not sure constraining to a single
> > > >> >> block is necessary - I've thought of having a "CSE"ing gimple_build
> > > >> >> for some time (add & CSE new stmts onto a sequence), so one
> > > >> >> should keep this mode in mind when designing the one working on
> > > >> >> an existing BB.  Note as you write it it depends on visiting the
> > > >> >> stmts in proper order - is that guaranteed when for example
> > > >> >> vectorizing SLP?
> > > >> > Hi,
> > > >> > Indeed, I wrote the function with assumption that, stmts would be
> > > >> > visited in proper order.
> > > >> > This doesn't affect SLP currently, because call to vn_visit_stmt in
> > > >> > vect_transform_stmt is
> > > >> > conditional on cond_to_vec_mask, which is only allocated inside
> > > >> > vect_transform_loop.
> > > >> > But I agree we could make it more general.
> > > >> > AFAIU, the idea of constraining VN to single block was to avoid using defs from
> > > >> > non-dominating scalar stmts during outer-loop vectorization.
> > > >>
> > > >> Maybe we could do the numbering in a separate walk immediately before
> > > >> the transform phase instead.
> > > > Um sorry, I didn't understand. Do you mean we should do dom based VN
> > > > just before transform phase
> > > > or run full VN ?
> > >
> > > No, I just meant that we could do a separate walk of the contents
> > > of the basic block:
> > >
> > > > @@ -8608,6 +8609,8 @@ vect_transform_loop (loop_vec_info loop_vinfo)
> > > >      {
> > > >        basic_block bb = bbs[i];
> > > >        stmt_vec_info stmt_info;
> > > > +      vn_bb_init (bb);
> > > > +      loop_vinfo->cond_to_vec_mask = new cond_vmask_map_type (8);
> > > >
> > >
> > > ...here, rather than doing it on the fly during vect_transform_stmt
> > > itself.  The walk should be gated on LOOP_VINFO_FULLY_MASKED_P so that
> > > others don't have to pay the compile-time penalty.  (Same for
> > > cond_to_vec_mask itself really.)
> > Hi,
> > Does the attached patch look OK ?
> > In patch, I put call to vn_visit stmt in bb loop in
> > vect_transform_loop to avoid replicating logic for processing phi and
> > stmts.
> > AFAIU, vect_transform_loop_stmt is only called from bb loop, so
> > compile time penalty for checking cond_to_vec_mask
> > should be pretty small ?
> > If this is not OK, I will walk bb immediately before the bb loop.
>
> So if I understand correctly you never have vectorizable COND_EXPRs
> in SLP mode?  Because we vectorize all SLP chains before entering
> the loop in vect_transform_loop where you VN existing scalar(!) stmts.
>
> Then all this hew hash-table stuff should not be needed since this
> is what VN should provide you with.  You of course need to visit
> generated condition stmts.  And condition support is weak
> in VN due to it possibly having two operations in a single stmt.
> Bad GIMPLE IL.  So I'm not sure VN is up to the task here or
> why you even need it given you are doing your own hashing?
Well, we thought of using VN to compare operands in cases where
operand_equal_p would not work. Actually, VN seems not to be required
for the test-cases in the PR, because both conditions are _4 != 0
(_35 = _4 != 0 and the condition in the cond_expr), which
operand_equal_p can match.
Input to vectorizer is:

 <bb 3> [local count: 1063004407]:
  # i_20 = PHI <i_16(7), 0(15)>
  # ivtmp_19 = PHI <ivtmp_9(7), 100(15)>
  _1 = (long unsigned int) i_20;
  _2 = _1 * 4;
  _3 = y_11(D) + _2;
  _4 = *_3;
  _5 = z_12(D) + _2;
  _35 = _4 != 0;
  iftmp.0_13 = .MASK_LOAD (_5, 32B, _35);
  iftmp.0_8 = _4 != 0 ? iftmp.0_13 : 10;

In prepare_load_store_mask, we record (ne_expr, _4, 0) -> vec_mask in
cond_to_vec_mask, and in vectorizable_condition we look up
(ne_expr, _4, 0), which does not require VN since the operands are
the same.

Initially, I was trying to change the generated vectorized code:

  mask__35.8_43 = vect__4.7_41 != vect_cst__42;
  vect_iftmp.12_50 = VEC_COND_EXPR <vect__4.7_41 != vect_cst__48,
vect_iftmp.11_47, vect_cst__49>;

where both conditions are equivalent because vect_cst__42 and
vect_cst__48 are zero vectors, but operand_equal_p failed to catch that.

Sorry, I mixed up scalar and vector stmts there.
I wonder if we still need VN then? Perhaps there are other cases where
the operands of scalar conditions are equivalent but don't match with
operand_equal_p?
In the attached patch, changing operator== to compare using
operand_equal_p works for the tests.

Thanks,
Prathamesh
>
> Richard.
>
> > Thanks,
> > Prathamesh
> > >
> > > Thanks,
> > > Richard

[-- Attachment #2: pr86753-8.diff --]
[-- Type: application/x-patch, Size: 16031 bytes --]

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [SVE] PR86753
  2019-08-26 13:39                       ` Prathamesh Kulkarni
@ 2019-08-27 10:41                         ` Richard Sandiford
  2019-08-27 11:31                           ` Richard Biener
  0 siblings, 1 reply; 41+ messages in thread
From: Richard Sandiford @ 2019-08-27 10:41 UTC (permalink / raw)
  To: Prathamesh Kulkarni; +Cc: Richard Biener, gcc Patches

Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> On Mon, 26 Aug 2019 at 14:48, Richard Biener <richard.guenther@gmail.com> wrote:
>>
>> On Sun, Aug 25, 2019 at 11:13 PM Prathamesh Kulkarni
>> <prathamesh.kulkarni@linaro.org> wrote:
>> >
>> > On Fri, 23 Aug 2019 at 19:43, Richard Sandiford
>> > <richard.sandiford@arm.com> wrote:
>> > >
>> > > Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
>> > > > On Fri, 23 Aug 2019 at 18:15, Richard Sandiford
>> > > > <richard.sandiford@arm.com> wrote:
>> > > >>
>> > > >> Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
>> > > >> > On Thu, 22 Aug 2019 at 16:44, Richard Biener <richard.guenther@gmail.com> wrote:
>> > > >> >> It looks a bit odd to me.  I'd have expected it to work by generating
>> > > >> >> the stmts as before in the vectorizer and then on the stmts we care
>> > > >> >> invoke vn_visit_stmt that does both value-numbering and elimination.
>> > > >> >> Alternatively you could ask the VN state to generate the stmt for
>> > > >> >> you via vn_nary_build_or_lookup () (certainly that needs a bit more
>> > > >> >> work).  One complication might be availability if you don't value-number
>> > > >> >> all stmts in the block, but well.  I'm not sure constraining to a single
>> > > >> >> block is necessary - I've thought of having a "CSE"ing gimple_build
>> > > >> >> for some time (add & CSE new stmts onto a sequence), so one
>> > > >> >> should keep this mode in mind when designing the one working on
>> > > >> >> an existing BB.  Note as you write it it depends on visiting the
>> > > >> >> stmts in proper order - is that guaranteed when for example
>> > > >> >> vectorizing SLP?
>> > > >> > Hi,
>> > > >> > Indeed, I wrote the function with assumption that, stmts would be
>> > > >> > visited in proper order.
>> > > >> > This doesn't affect SLP currently, because call to vn_visit_stmt in
>> > > >> > vect_transform_stmt is
>> > > >> > conditional on cond_to_vec_mask, which is only allocated inside
>> > > >> > vect_transform_loop.
>> > > >> > But I agree we could make it more general.
>> > > >> > AFAIU, the idea of constraining VN to single block was to avoid using defs from
>> > > >> > non-dominating scalar stmts during outer-loop vectorization.
>> > > >>
>> > > >> Maybe we could do the numbering in a separate walk immediately before
>> > > >> the transform phase instead.
>> > > > Um sorry, I didn't understand. Do you mean we should do dom based VN
>> > > > just before transform phase
>> > > > or run full VN ?
>> > >
>> > > No, I just meant that we could do a separate walk of the contents
>> > > of the basic block:
>> > >
>> > > > @@ -8608,6 +8609,8 @@ vect_transform_loop (loop_vec_info loop_vinfo)
>> > > >      {
>> > > >        basic_block bb = bbs[i];
>> > > >        stmt_vec_info stmt_info;
>> > > > +      vn_bb_init (bb);
>> > > > +      loop_vinfo->cond_to_vec_mask = new cond_vmask_map_type (8);
>> > > >
>> > >
>> > > ...here, rather than doing it on the fly during vect_transform_stmt
>> > > itself.  The walk should be gated on LOOP_VINFO_FULLY_MASKED_P so that
>> > > others don't have to pay the compile-time penalty.  (Same for
>> > > cond_to_vec_mask itself really.)
>> > Hi,
>> > Does the attached patch look OK ?
>> > In patch, I put call to vn_visit stmt in bb loop in
>> > vect_transform_loop to avoid replicating logic for processing phi and
>> > stmts.
>> > AFAIU, vect_transform_loop_stmt is only called from bb loop, so
>> > compile time penalty for checking cond_to_vec_mask
>> > should be pretty small ?
>> > If this is not OK, I will walk bb immediately before the bb loop.
>>
>> So if I understand correctly you never have vectorizable COND_EXPRs
>> in SLP mode?  Because we vectorize all SLP chains before entering
>> the loop in vect_transform_loop where you VN existing scalar(!) stmts.

On the "!": the idea behind the patch is to find cases in which a
scalar condition is used in both a statement that needs to be masked
for correctness reasons and a statement that we can choose to mask
if we want to.  It also tries (opportunistically) to match the ?: order
with other conditions.

That's why it's operating on scalar values rather than vector values.
In principle it could be done as a subpass before vectorisation rather
than on the fly, when there aren't any vector stmts around.
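
In sketch form, the cache lookup is keyed on the scalar condition plus
the loop mask; the names below (scalar_cond, vec_mask) are just
placeholders, the point is only that the masked load and the COND_EXPR
ask the cache the same question and so share one "vec_mask & loop_mask":

  cond_vmask_key key (scalar_cond, loop_mask);
  if (tree *cached = loop_vinfo->cond_to_vec_mask->get (key))
    vec_mask = *cached;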

>> Then all this hew hash-table stuff should not be needed since this
>> is what VN should provide you with.  You of course need to visit
>> generated condition stmts.  And condition support is weak
>> in VN due to it possibly having two operations in a single stmt.
>> Bad GIMPLE IL.  So I'm not sure VN is up to the task here or
>> why you even need it given you are doing your own hashing?
> Well, we thought of using VN for comparing operands for cases
> operand_equal_p would not
> work. Actually, VN seems not to be required for test-cases in PR
> because both conditions
> are _4 != 0 (_35 = _4 != 0 and in cond_expr), which works to match
> with operand_equal_p.

Right, that's why I was suggesting in the earlier thread that we
treat value numbering as a follow-on.  But...

> Input to vectorizer is:
>
>  <bb 3> [local count: 1063004407]:
>   # i_20 = PHI <i_16(7), 0(15)>
>   # ivtmp_19 = PHI <ivtmp_9(7), 100(15)>
>   _1 = (long unsigned int) i_20;
>   _2 = _1 * 4;
>   _3 = y_11(D) + _2;
>   _4 = *_3;
>   _5 = z_12(D) + _2;
>   _35 = _4 != 0;
>   iftmp.0_13 = .MASK_LOAD (_5, 32B, _35);
>   iftmp.0_8 = _4 != 0 ? iftmp.0_13 : 10;
>
> In prepare_load_store_mask, we record (ne_expr, _4, 0) -> vec_mask in
> cond_to_vec_mask,
> and in vectorizable_condition, we look up (ne_expr, _4, 0) which does
> not require VN
> since operands are same.
>
> Initially, I was trying to change the generated vectorized code:
>
>   mask__35.8_43 = vect__4.7_41 != vect_cst__42;
>   vect_iftmp.12_50 = VEC_COND_EXPR <vect__4.7_41 != vect_cst__48,
> vect_iftmp.11_47, vect_cst__49>;
>
> where both conditions are equivalent because vect_cst__42 and
> vect_cst__48 are zero vectors but operand_equal_p failed to catch
> those.
>
> Sorry, I mixed up later between scalar and vector stmts.
> I wonder if we then need VN ? Perhaps there might be other cases where
> operands of scalar
> conditions may be equivalent but not match with operand_equal_p ?
> In the attached patch, changing operator==, to compare using
> operand_equal_p works for the tests.

...those are only very simple cases :-) The problem is that ifcvt
doesn't leave the code free of redundancies, and pattern stmt generation
can create redundancies too.  Of course, we could fix those instead.

E.g. for:

void
f (short *restrict x, short *restrict a, short *restrict b, short *restrict c)
{
  for (int i = 0; i < 100; ++i)
    if (a[i] >= 1)
      x[i] = b[i] >= 1 ? a[i] : c[i];
}

ifcvt produces:

  <bb 3> [local count: 1063004407]:
  # i_34 = PHI <i_30(10), 0(21)>
  # ivtmp_5 = PHI <ivtmp_6(10), 100(21)>
  _1 = (long unsigned int) i_34;
  _2 = _1 * 2;
  _3 = a_23(D) + _2;
  _4 = *_3;
  _7 = b_24(D) + _2;
  _49 = _4 > 0;
  _8 = .MASK_LOAD (_7, 16B, _49);
  _12 = _4 > 0;
  _13 = _8 > 0;
  _9 = _12 & _13;
  _10 = _4 > 0;
  _11 = _8 > 0;
  _27 = ~_11;
  _15 = _10 & _27;
  _14 = c_25(D) + _2;
  iftmp.0_26 = .MASK_LOAD (_14, 16B, _15);
  iftmp.0_19 = _9 ? _4 : iftmp.0_26;
  _17 = x_28(D) + _2;
  _50 = _4 > 0;
  .MASK_STORE (_17, 16B, _50, iftmp.0_19);
  i_30 = i_34 + 1;
  ivtmp_6 = ivtmp_5 - 1;
  if (ivtmp_6 != 0)
    goto <bb 10>; [98.99%]
  else
    goto <bb 9>; [1.01%]

  <bb 10> [local count: 1052266994]:
  goto <bb 3>; [100.00%]

which has 4 copies of _4 > 0 (a[i] > 0) and 2 copies of _8 > 0 (b[i] > 0).
Looking through the definition of an SSA name means that we can cope
with these redundancies for single comparisons, but not for comparisons
combined through & and |.  This hurts most when trying to decide whether
to invert comparisons.

But maybe value numbering won't cope with that either :-)

Thanks,
Richard

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [SVE] PR86753
  2019-08-27 10:41                         ` Richard Sandiford
@ 2019-08-27 11:31                           ` Richard Biener
  2019-08-27 12:52                             ` Richard Sandiford
  0 siblings, 1 reply; 41+ messages in thread
From: Richard Biener @ 2019-08-27 11:31 UTC (permalink / raw)
  To: Prathamesh Kulkarni, Richard Biener, gcc Patches, Richard Sandiford

On Tue, Aug 27, 2019 at 11:58 AM Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> > On Mon, 26 Aug 2019 at 14:48, Richard Biener <richard.guenther@gmail.com> wrote:
> >>
> >> On Sun, Aug 25, 2019 at 11:13 PM Prathamesh Kulkarni
> >> <prathamesh.kulkarni@linaro.org> wrote:
> >> >
> >> > On Fri, 23 Aug 2019 at 19:43, Richard Sandiford
> >> > <richard.sandiford@arm.com> wrote:
> >> > >
> >> > > Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> >> > > > On Fri, 23 Aug 2019 at 18:15, Richard Sandiford
> >> > > > <richard.sandiford@arm.com> wrote:
> >> > > >>
> >> > > >> Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> >> > > >> > On Thu, 22 Aug 2019 at 16:44, Richard Biener <richard.guenther@gmail.com> wrote:
> >> > > >> >> It looks a bit odd to me.  I'd have expected it to work by generating
> >> > > >> >> the stmts as before in the vectorizer and then on the stmts we care
> >> > > >> >> invoke vn_visit_stmt that does both value-numbering and elimination.
> >> > > >> >> Alternatively you could ask the VN state to generate the stmt for
> >> > > >> >> you via vn_nary_build_or_lookup () (certainly that needs a bit more
> >> > > >> >> work).  One complication might be availability if you don't value-number
> >> > > >> >> all stmts in the block, but well.  I'm not sure constraining to a single
> >> > > >> >> block is necessary - I've thought of having a "CSE"ing gimple_build
> >> > > >> >> for some time (add & CSE new stmts onto a sequence), so one
> >> > > >> >> should keep this mode in mind when designing the one working on
> >> > > >> >> an existing BB.  Note as you write it it depends on visiting the
> >> > > >> >> stmts in proper order - is that guaranteed when for example
> >> > > >> >> vectorizing SLP?
> >> > > >> > Hi,
> >> > > >> > Indeed, I wrote the function with assumption that, stmts would be
> >> > > >> > visited in proper order.
> >> > > >> > This doesn't affect SLP currently, because call to vn_visit_stmt in
> >> > > >> > vect_transform_stmt is
> >> > > >> > conditional on cond_to_vec_mask, which is only allocated inside
> >> > > >> > vect_transform_loop.
> >> > > >> > But I agree we could make it more general.
> >> > > >> > AFAIU, the idea of constraining VN to single block was to avoid using defs from
> >> > > >> > non-dominating scalar stmts during outer-loop vectorization.
> >> > > >>
> >> > > >> Maybe we could do the numbering in a separate walk immediately before
> >> > > >> the transform phase instead.
> >> > > > Um sorry, I didn't understand. Do you mean we should do dom based VN
> >> > > > just before transform phase
> >> > > > or run full VN ?
> >> > >
> >> > > No, I just meant that we could do a separate walk of the contents
> >> > > of the basic block:
> >> > >
> >> > > > @@ -8608,6 +8609,8 @@ vect_transform_loop (loop_vec_info loop_vinfo)
> >> > > >      {
> >> > > >        basic_block bb = bbs[i];
> >> > > >        stmt_vec_info stmt_info;
> >> > > > +      vn_bb_init (bb);
> >> > > > +      loop_vinfo->cond_to_vec_mask = new cond_vmask_map_type (8);
> >> > > >
> >> > >
> >> > > ...here, rather than doing it on the fly during vect_transform_stmt
> >> > > itself.  The walk should be gated on LOOP_VINFO_FULLY_MASKED_P so that
> >> > > others don't have to pay the compile-time penalty.  (Same for
> >> > > cond_to_vec_mask itself really.)
> >> > Hi,
> >> > Does the attached patch look OK ?
> >> > In patch, I put call to vn_visit stmt in bb loop in
> >> > vect_transform_loop to avoid replicating logic for processing phi and
> >> > stmts.
> >> > AFAIU, vect_transform_loop_stmt is only called from bb loop, so
> >> > compile time penalty for checking cond_to_vec_mask
> >> > should be pretty small ?
> >> > If this is not OK, I will walk bb immediately before the bb loop.
> >>
> >> So if I understand correctly you never have vectorizable COND_EXPRs
> >> in SLP mode?  Because we vectorize all SLP chains before entering
> >> the loop in vect_transform_loop where you VN existing scalar(!) stmts.
>
> On the "!": the idea behind the patch is to find cases in which a
> scalar condition is used in both a statement that needs to be masked
> for correctness reasons and a statement that we can choose to mask
> if we want to.  It also tries (opportunisticly) to match the ?: order
> with other conditions.
>
> That's why it's operating on scalar values rather than vector values.
> In principle it could be done as a subpass before vectorisation rather
> than on the fly, when there aren't any vector stmts around.
>
> >> Then all this hew hash-table stuff should not be needed since this
> >> is what VN should provide you with.  You of course need to visit
> >> generated condition stmts.  And condition support is weak
> >> in VN due to it possibly having two operations in a single stmt.
> >> Bad GIMPLE IL.  So I'm not sure VN is up to the task here or
> >> why you even need it given you are doing your own hashing?
> > Well, we thought of using VN for comparing operands for cases
> > operand_equal_p would not
> > work. Actually, VN seems not to be required for test-cases in PR
> > because both conditions
> > are _4 != 0 (_35 = _4 != 0 and in cond_expr), which works to match
> > with operand_equal_p.
>
> Right, that's why I was suggesting in the earlier thread that we
> treat value numbering as a follow-on.  But...
>
> > Input to vectorizer is:
> >
> >  <bb 3> [local count: 1063004407]:
> >   # i_20 = PHI <i_16(7), 0(15)>
> >   # ivtmp_19 = PHI <ivtmp_9(7), 100(15)>
> >   _1 = (long unsigned int) i_20;
> >   _2 = _1 * 4;
> >   _3 = y_11(D) + _2;
> >   _4 = *_3;
> >   _5 = z_12(D) + _2;
> >   _35 = _4 != 0;
> >   iftmp.0_13 = .MASK_LOAD (_5, 32B, _35);
> >   iftmp.0_8 = _4 != 0 ? iftmp.0_13 : 10;
> >
> > In prepare_load_store_mask, we record (ne_expr, _4, 0) -> vec_mask in
> > cond_to_vec_mask,
> > and in vectorizable_condition, we look up (ne_expr, _4, 0) which does
> > not require VN
> > since operands are same.
> >
> > Initially, I was trying to change the generated vectorized code:
> >
> >   mask__35.8_43 = vect__4.7_41 != vect_cst__42;
> >   vect_iftmp.12_50 = VEC_COND_EXPR <vect__4.7_41 != vect_cst__48,
> > vect_iftmp.11_47, vect_cst__49>;
> >
> > where both conditions are equivalent because vect_cst__42 and
> > vect_cst__48 are zero vectors but operand_equal_p failed to catch
> > those.
> >
> > Sorry, I mixed up later between scalar and vector stmts.
> > I wonder if we then need VN ? Perhaps there might be other cases where
> > operands of scalar
> > conditions may be equivalent but not match with operand_equal_p ?
> > In the attached patch, changing operator==, to compare using
> > operand_equal_p works for the tests.
>
> ...those are only very simple cases :-) The problem is that ifcvt
> doesn't leave the code free of redundancies, and pattern stmt generation
> can create redundancies too.  Of course, we could fix those instead.
>
> E.g. for:
>
> void
> f (short *restrict x, short *restrict a, short *restrict b, short *restrict c)
> {
>   for (int i = 0; i < 100; ++i)
>     if (a[i] >= 1)
>       x[i] = b[i] >= 1 ? a[i] : c[i];
> }
>
> ifcvt produces:
>
>   <bb 3> [local count: 1063004407]:
>   # i_34 = PHI <i_30(10), 0(21)>
>   # ivtmp_5 = PHI <ivtmp_6(10), 100(21)>
>   _1 = (long unsigned int) i_34;
>   _2 = _1 * 2;
>   _3 = a_23(D) + _2;
>   _4 = *_3;
>   _7 = b_24(D) + _2;
>   _49 = _4 > 0;
>   _8 = .MASK_LOAD (_7, 16B, _49);
>   _12 = _4 > 0;
>   _13 = _8 > 0;
>   _9 = _12 & _13;
>   _10 = _4 > 0;
>   _11 = _8 > 0;
>   _27 = ~_11;
>   _15 = _10 & _27;
>   _14 = c_25(D) + _2;
>   iftmp.0_26 = .MASK_LOAD (_14, 16B, _15);
>   iftmp.0_19 = _9 ? _4 : iftmp.0_26;
>   _17 = x_28(D) + _2;
>   _50 = _4 > 0;
>   .MASK_STORE (_17, 16B, _50, iftmp.0_19);
>   i_30 = i_34 + 1;
>   ivtmp_6 = ivtmp_5 - 1;
>   if (ivtmp_6 != 0)
>     goto <bb 10>; [98.99%]
>   else
>     goto <bb 9>; [1.01%]
>
>   <bb 10> [local count: 1052266994]:
>   goto <bb 3>; [100.00%]
>
> which has 4 copies of _4 > 0 (a[i] > 0) and 2 copies of _8 > 0 (b[i] > 0).

Huh.  if-conversion does

  /* Now all statements are if-convertible.  Combine all the basic
     blocks into one huge basic block doing the if-conversion
     on-the-fly.  */
  combine_blocks (loop);

  /* Delete dead predicate computations.  */
  ifcvt_local_dce (loop->header);

  /* Perform local CSE, this esp. helps the vectorizer analysis if loads
     and stores are involved.  CSE only the loop body, not the entry
     PHIs, those are to be kept in sync with the non-if-converted copy.
     ???  We'll still keep dead stores though.  */
  exit_bbs = BITMAP_ALLOC (NULL);
  bitmap_set_bit (exit_bbs, single_exit (loop)->dest->index);
  bitmap_set_bit (exit_bbs, loop->latch->index);
  todo |= do_rpo_vn (cfun, loop_preheader_edge (loop), exit_bbs);

which should remove those redundant _4 > 0 checks.  In fact when I
run this on x86_64 with -mavx512bw I see

  <bb 3> [local count: 1063004407]:
  # i_25 = PHI <i_20(9), 0(21)>
  # ivtmp_24 = PHI <ivtmp_12(9), 100(21)>
  _1 = (long unsigned int) i_25;
  _2 = _1 * 2;
  _3 = a_14(D) + _2;
  _4 = *_3;
  _5 = b_15(D) + _2;
  _49 = _4 > 0;
  _6 = .MASK_LOAD (_5, 16B, _49);
  _22 = _6 > 0;
  _28 = ~_22;
  _29 = _28 & _49;
  _7 = c_16(D) + _2;
  iftmp.0_17 = .MASK_LOAD (_7, 16B, _29);
  iftmp.0_10 = _29 ? iftmp.0_17 : _4;
  _8 = x_18(D) + _2;
  .MASK_STORE (_8, 16B, _49, iftmp.0_10);
  i_20 = i_25 + 1;
  ivtmp_12 = ivtmp_24 - 1;
  if (ivtmp_12 != 0)

after if-conversion (that should be the case already on the GCC 9 branch).

> Looking through the definition of an SSA name means that we can cope
> with these redundancies for single comparisons, but not for comparisons
> combined through & and |.  This hurts most when trying to decide whether
> to invert comparisons.
>
> But maybe value numbering won't cope with that either :-)

Not sure; it definitely doesn't do arbitrary re-association
(but commutative ops are handled).  So it cannot prove
a & (b & c) == (c & a) & b, but it can prove a & b == b & a.

Richard.

>
> Thanks,
> Richard

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [SVE] PR86753
  2019-08-27 11:31                           ` Richard Biener
@ 2019-08-27 12:52                             ` Richard Sandiford
  2019-08-27 15:55                               ` Prathamesh Kulkarni
  0 siblings, 1 reply; 41+ messages in thread
From: Richard Sandiford @ 2019-08-27 12:52 UTC (permalink / raw)
  To: Richard Biener; +Cc: Prathamesh Kulkarni, gcc Patches

Richard Biener <richard.guenther@gmail.com> writes:
> On Tue, Aug 27, 2019 at 11:58 AM Richard Sandiford
> <richard.sandiford@arm.com> wrote:
>> ifcvt produces:
>>
>>   <bb 3> [local count: 1063004407]:
>>   # i_34 = PHI <i_30(10), 0(21)>
>>   # ivtmp_5 = PHI <ivtmp_6(10), 100(21)>
>>   _1 = (long unsigned int) i_34;
>>   _2 = _1 * 2;
>>   _3 = a_23(D) + _2;
>>   _4 = *_3;
>>   _7 = b_24(D) + _2;
>>   _49 = _4 > 0;
>>   _8 = .MASK_LOAD (_7, 16B, _49);
>>   _12 = _4 > 0;
>>   _13 = _8 > 0;
>>   _9 = _12 & _13;
>>   _10 = _4 > 0;
>>   _11 = _8 > 0;
>>   _27 = ~_11;
>>   _15 = _10 & _27;
>>   _14 = c_25(D) + _2;
>>   iftmp.0_26 = .MASK_LOAD (_14, 16B, _15);
>>   iftmp.0_19 = _9 ? _4 : iftmp.0_26;
>>   _17 = x_28(D) + _2;
>>   _50 = _4 > 0;
>>   .MASK_STORE (_17, 16B, _50, iftmp.0_19);
>>   i_30 = i_34 + 1;
>>   ivtmp_6 = ivtmp_5 - 1;
>>   if (ivtmp_6 != 0)
>>     goto <bb 10>; [98.99%]
>>   else
>>     goto <bb 9>; [1.01%]
>>
>>   <bb 10> [local count: 1052266994]:
>>   goto <bb 3>; [100.00%]
>>
>> which has 4 copies of _4 > 0 (a[i] > 0) and 2 copies of _8 > 0 (b[i] > 0).
>
> Huh.  if-conversion does
>
>   /* Now all statements are if-convertible.  Combine all the basic
>      blocks into one huge basic block doing the if-conversion
>      on-the-fly.  */
>   combine_blocks (loop);
>
>   /* Delete dead predicate computations.  */
>   ifcvt_local_dce (loop->header);
>
>   /* Perform local CSE, this esp. helps the vectorizer analysis if loads
>      and stores are involved.  CSE only the loop body, not the entry
>      PHIs, those are to be kept in sync with the non-if-converted copy.
>      ???  We'll still keep dead stores though.  */
>   exit_bbs = BITMAP_ALLOC (NULL);
>   bitmap_set_bit (exit_bbs, single_exit (loop)->dest->index);
>   bitmap_set_bit (exit_bbs, loop->latch->index);
>   todo |= do_rpo_vn (cfun, loop_preheader_edge (loop), exit_bbs);
>
> which should remove those redundant _4 > 0 checks.  In fact when I
> run this on x86_64 with -mavx512bw I see
>
>   <bb 3> [local count: 1063004407]:
>   # i_25 = PHI <i_20(9), 0(21)>
>   # ivtmp_24 = PHI <ivtmp_12(9), 100(21)>
>   _1 = (long unsigned int) i_25;
>   _2 = _1 * 2;
>   _3 = a_14(D) + _2;
>   _4 = *_3;
>   _5 = b_15(D) + _2;
>   _49 = _4 > 0;
>   _6 = .MASK_LOAD (_5, 16B, _49);
>   _22 = _6 > 0;
>   _28 = ~_22;
>   _29 = _28 & _49;
>   _7 = c_16(D) + _2;
>   iftmp.0_17 = .MASK_LOAD (_7, 16B, _29);
>   iftmp.0_10 = _29 ? iftmp.0_17 : _4;
>   _8 = x_18(D) + _2;
>   .MASK_STORE (_8, 16B, _49, iftmp.0_10);
>   i_20 = i_25 + 1;
>   ivtmp_12 = ivtmp_24 - 1;
>   if (ivtmp_12 != 0)
>
> after if-conversion (that should be the case already on the GCC 9 branch).

Gah, sorry for the noise.  Turns out I still had a local change that was
trying to poke the patch into doing something wrong.  Will try to check
my facts more carefully next time.

The redundant pattern statements I was thinking of come from
vect_recog_mask_conversion_pattern, but I guess that isn't so
interesting here.

So yeah, let's drop this whole vn thing for now...

Thanks,
Richard

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [SVE] PR86753
  2019-08-27 12:52                             ` Richard Sandiford
@ 2019-08-27 15:55                               ` Prathamesh Kulkarni
  2019-08-27 17:39                                 ` Richard Sandiford
  0 siblings, 1 reply; 41+ messages in thread
From: Prathamesh Kulkarni @ 2019-08-27 15:55 UTC (permalink / raw)
  To: Richard Biener, Prathamesh Kulkarni, gcc Patches, Richard Sandiford

[-- Attachment #1: Type: text/plain, Size: 3437 bytes --]

On Tue, 27 Aug 2019 at 17:29, Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> Richard Biener <richard.guenther@gmail.com> writes:
> > On Tue, Aug 27, 2019 at 11:58 AM Richard Sandiford
> > <richard.sandiford@arm.com> wrote:
> >> ifcvt produces:
> >>
> >>   <bb 3> [local count: 1063004407]:
> >>   # i_34 = PHI <i_30(10), 0(21)>
> >>   # ivtmp_5 = PHI <ivtmp_6(10), 100(21)>
> >>   _1 = (long unsigned int) i_34;
> >>   _2 = _1 * 2;
> >>   _3 = a_23(D) + _2;
> >>   _4 = *_3;
> >>   _7 = b_24(D) + _2;
> >>   _49 = _4 > 0;
> >>   _8 = .MASK_LOAD (_7, 16B, _49);
> >>   _12 = _4 > 0;
> >>   _13 = _8 > 0;
> >>   _9 = _12 & _13;
> >>   _10 = _4 > 0;
> >>   _11 = _8 > 0;
> >>   _27 = ~_11;
> >>   _15 = _10 & _27;
> >>   _14 = c_25(D) + _2;
> >>   iftmp.0_26 = .MASK_LOAD (_14, 16B, _15);
> >>   iftmp.0_19 = _9 ? _4 : iftmp.0_26;
> >>   _17 = x_28(D) + _2;
> >>   _50 = _4 > 0;
> >>   .MASK_STORE (_17, 16B, _50, iftmp.0_19);
> >>   i_30 = i_34 + 1;
> >>   ivtmp_6 = ivtmp_5 - 1;
> >>   if (ivtmp_6 != 0)
> >>     goto <bb 10>; [98.99%]
> >>   else
> >>     goto <bb 9>; [1.01%]
> >>
> >>   <bb 10> [local count: 1052266994]:
> >>   goto <bb 3>; [100.00%]
> >>
> >> which has 4 copies of _4 > 0 (a[i] > 0) and 2 copies of _8 > 0 (b[i] > 0).
> >
> > Huh.  if-conversion does
> >
> >   /* Now all statements are if-convertible.  Combine all the basic
> >      blocks into one huge basic block doing the if-conversion
> >      on-the-fly.  */
> >   combine_blocks (loop);
> >
> >   /* Delete dead predicate computations.  */
> >   ifcvt_local_dce (loop->header);
> >
> >   /* Perform local CSE, this esp. helps the vectorizer analysis if loads
> >      and stores are involved.  CSE only the loop body, not the entry
> >      PHIs, those are to be kept in sync with the non-if-converted copy.
> >      ???  We'll still keep dead stores though.  */
> >   exit_bbs = BITMAP_ALLOC (NULL);
> >   bitmap_set_bit (exit_bbs, single_exit (loop)->dest->index);
> >   bitmap_set_bit (exit_bbs, loop->latch->index);
> >   todo |= do_rpo_vn (cfun, loop_preheader_edge (loop), exit_bbs);
> >
> > which should remove those redundant _4 > 0 checks.  In fact when I
> > run this on x86_64 with -mavx512bw I see
> >
> >   <bb 3> [local count: 1063004407]:
> >   # i_25 = PHI <i_20(9), 0(21)>
> >   # ivtmp_24 = PHI <ivtmp_12(9), 100(21)>
> >   _1 = (long unsigned int) i_25;
> >   _2 = _1 * 2;
> >   _3 = a_14(D) + _2;
> >   _4 = *_3;
> >   _5 = b_15(D) + _2;
> >   _49 = _4 > 0;
> >   _6 = .MASK_LOAD (_5, 16B, _49);
> >   _22 = _6 > 0;
> >   _28 = ~_22;
> >   _29 = _28 & _49;
> >   _7 = c_16(D) + _2;
> >   iftmp.0_17 = .MASK_LOAD (_7, 16B, _29);
> >   iftmp.0_10 = _29 ? iftmp.0_17 : _4;
> >   _8 = x_18(D) + _2;
> >   .MASK_STORE (_8, 16B, _49, iftmp.0_10);
> >   i_20 = i_25 + 1;
> >   ivtmp_12 = ivtmp_24 - 1;
> >   if (ivtmp_12 != 0)
> >
> > after if-conversion (that should be the case already on the GCC 9 branch).
>
> Gah, sorry for the noise.  Turns out I still had a local change that was
> trying to poke the patch into doing something wrong.  Will try to check
> my facts more carefully next time.
>
> The redundant pattern statements I was thinking of come from
> vect_recog_mask_conversion_pattern, but I guess that isn't so
> interesting here.
>
> So yeah, let's drop this whole vn thing for now...
The attached version drops VN and uses operand_equal_p for the comparison.
Does it look OK?

Thanks,
Prathamesh
>
> Thanks,
> Richard

[-- Attachment #2: pr86753-9.diff --]
[-- Type: application/x-patch, Size: 11290 bytes --]

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [SVE] PR86753
  2019-08-27 15:55                               ` Prathamesh Kulkarni
@ 2019-08-27 17:39                                 ` Richard Sandiford
  2019-08-27 20:10                                   ` Prathamesh Kulkarni
  0 siblings, 1 reply; 41+ messages in thread
From: Richard Sandiford @ 2019-08-27 17:39 UTC (permalink / raw)
  To: Prathamesh Kulkarni; +Cc: Richard Biener, gcc Patches

Richard should have the final say, but some comments...

Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> diff --git a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c
> index 1e2dfe5d22d..862206b3256 100644
> --- a/gcc/tree-vect-stmts.c
> +++ b/gcc/tree-vect-stmts.c
> @@ -1989,17 +1989,31 @@ check_load_store_masking (loop_vec_info loop_vinfo, tree vectype,
>  
>  static tree
>  prepare_load_store_mask (tree mask_type, tree loop_mask, tree vec_mask,
> -			 gimple_stmt_iterator *gsi)
> +			 gimple_stmt_iterator *gsi, tree mask,
> +			 cond_vmask_map_type *cond_to_vec_mask)

"scalar_mask" might be a better name.  But maybe we should key off the
vector mask after all, now that we're relying on the code having no
redundancies.

Passing the vinfo would be better than passing the cond_vmask_map_type
directly.
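
I.e. something like this rough, untested sketch, keeping the scalar key
for now and assuming cond_to_vec_mask is always allocated by the time we
get here:

static tree
prepare_load_store_mask (tree mask_type, tree loop_mask, tree vec_mask,
			 gimple_stmt_iterator *gsi, tree scalar_mask,
			 loop_vec_info loop_vinfo)
{
  gcc_assert (useless_type_conversion_p (mask_type, TREE_TYPE (vec_mask)));
  if (!loop_mask)
    return vec_mask;

  gcc_assert (TREE_TYPE (loop_mask) == mask_type);

  /* Reuse a previously-computed mask for the same scalar condition
     and loop mask, if there is one.  */
  cond_vmask_key cond (scalar_mask, loop_mask);
  tree &slot = loop_vinfo->cond_to_vec_mask->get_or_insert (cond);
  if (slot)
    return slot;

  tree and_res = make_temp_ssa_name (mask_type, NULL, "vec_mask_and");
  gimple *and_stmt = gimple_build_assign (and_res, BIT_AND_EXPR,
					  vec_mask, loop_mask);
  gsi_insert_before (gsi, and_stmt, GSI_SAME_STMT);
  slot = and_res;
  return and_res;
}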

>  {
>    gcc_assert (useless_type_conversion_p (mask_type, TREE_TYPE (vec_mask)));
>    if (!loop_mask)
>      return vec_mask;
>  
>    gcc_assert (TREE_TYPE (loop_mask) == mask_type);
> +
> +  tree *slot = 0;
> +  if (cond_to_vec_mask)

The pointer should never be null in this context.

> +    {
> +      cond_vmask_key cond (mask, loop_mask);
> +      slot = &cond_to_vec_mask->get_or_insert (cond);
> +      if (*slot)
> +	return *slot;
> +    }
> +
>    tree and_res = make_temp_ssa_name (mask_type, NULL, "vec_mask_and");
>    gimple *and_stmt = gimple_build_assign (and_res, BIT_AND_EXPR,
>  					  vec_mask, loop_mask);
>    gsi_insert_before (gsi, and_stmt, GSI_SAME_STMT);
> +
> +  if (slot)
> +    *slot = and_res;
>    return and_res;
>  }
> [...]
> @@ -9975,6 +9997,38 @@ vectorizable_condition (stmt_vec_info stmt_info, gimple_stmt_iterator *gsi,
>    /* Handle cond expr.  */
>    for (j = 0; j < ncopies; j++)
>      {
> +      tree vec_mask = NULL_TREE;
> +
> +      if (loop_vinfo && LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)

Nit: one condition per line when the whole thing doesn't fit on a single line.

> +	  && TREE_CODE_CLASS (TREE_CODE (cond_expr)) == tcc_comparison

Why restrict this to embedded comparisons?  It should work for separate
comparisons too.

> +	  && loop_vinfo->cond_to_vec_mask)

This should always be nonnull given the above.

> +	{
> +	  vec_loop_masks *masks = &LOOP_VINFO_MASKS (loop_vinfo);
> +	  if (masks)

This is never null.

> +	    {
> +	      tree loop_mask = vect_get_loop_mask (gsi, masks,
> +						   ncopies, vectype, j);
> +
> +	      cond_vmask_key cond (cond_expr, loop_mask);
> +	      tree *slot = loop_vinfo->cond_to_vec_mask->get (cond);
> +	      if (slot && *slot)
> +		vec_mask = *slot;
> +	      else
> +		{
> +		  cond.cond_ops.code
> +		    = invert_tree_comparison (cond.cond_ops.code, true);
> +		  slot = loop_vinfo->cond_to_vec_mask->get (cond);
> +		  if (slot && *slot)
> +		    {
> +		      vec_mask = *slot;
> +		      tree tmp = then_clause;
> +		      then_clause = else_clause;
> +		      else_clause = tmp;

Can use std::swap.
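
I.e. simply, in place of the temporary above:

		      std::swap (then_clause, else_clause);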

> +		    }
> +		}
> +	    }
> +	}
> +
>        stmt_vec_info new_stmt_info = NULL;
>        if (j == 0)
>  	{
> @@ -10054,6 +10108,8 @@ vectorizable_condition (stmt_vec_info stmt_info, gimple_stmt_iterator *gsi,
>  
>  	  if (masked)
>  	    vec_compare = vec_cond_lhs;
> +	  else if (vec_mask)
> +	    vec_compare = vec_mask;

If we do drop the comparison check above, this should win over "masked".

> @@ -193,6 +194,81 @@ public:
>    poly_uint64 min_value;
>  };
>  
> +struct cond_vmask_key

I'm no good at naming things, but since "vmask" doesn't occur elsewhere
in target-independent code, how about "vec_masked_cond_key"?

> +{
> +  cond_vmask_key (tree t, tree loop_mask_)
> +    : cond_ops (t), loop_mask (loop_mask_)
> +  {}
> +
> +  hashval_t hash () const
> +  {
> +    inchash::hash h;
> +    h.add_int (cond_ops.code);
> +    h.add_int (TREE_HASH (cond_ops.op0));
> +    h.add_int (TREE_HASH (cond_ops.op1));

These two need to use inchash::add_expr, since you're hashing for
operand_equal_p.
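
I.e. something like the following, assuming the default flags line up
with the flags passed to operand_equal_p:

    inchash::add_expr (cond_ops.op0, h);
    inchash::add_expr (cond_ops.op1, h);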

> +    h.add_int (TREE_HASH (loop_mask));
> +    return h.end ();
> +  }
> +
> +  void mark_empty ()
> +  {
> +    loop_mask = NULL_TREE;
> +  }
> +
> +  bool is_empty ()
> +  {
> +    return loop_mask == NULL_TREE;
> +  }
> +
> +  tree_cond_ops cond_ops;
> +  tree loop_mask;
> +};
> +
> +inline bool operator== (const cond_vmask_key& c1, const cond_vmask_key& c2)
> +{
> +  return c1.loop_mask == c2.loop_mask
> +	 && c1.cond_ops == c2.cond_ops;

Multi-line expressions should be in brackets (or just put this one on
a single line).

> +}
> +
> +struct cond_vmask_key_traits

Might as well make this:

template<>
struct default_hash_traits<cond_vmask_key>

and then you can drop the third template parameter from hash_map.
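
The map typedef then shrinks to something like:

typedef hash_map<cond_vmask_key, tree> cond_vmask_map_type;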

> +{
> +  typedef cond_vmask_key value_type;
> +  typedef cond_vmask_key compare_type;
> +
> +  static inline hashval_t hash (value_type v)
> +  {
> +    return v.hash ();
> +  }
> +
> +  static inline bool equal (value_type existing, value_type candidate)
> +  {
> +    return existing == candidate;
> +  }
> +
> +  static inline void mark_empty (value_type& v)
> +  {
> +    v.mark_empty ();
> +  }
> +
> +  static inline bool is_empty (value_type v)
> +  {
> +    return v.is_empty ();
> +  }

Making hash (), mark_empty () and is_empty () forward to cond_vmask_key
functions of the same name seems unnecessary.  I think we should just put
the implementation here and not define the functions in cond_vmask_key
itself.
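
Roughly (an untested sketch, with the hashing and the empty handling
living directly in the traits; loop_mask is compared by pointer, so
hashing it by pointer should be enough):

  static inline hashval_t hash (value_type v)
  {
    inchash::hash h;
    h.add_int (v.cond_ops.code);
    inchash::add_expr (v.cond_ops.op0, h);
    inchash::add_expr (v.cond_ops.op1, h);
    h.add_ptr (v.loop_mask);
    return h.end ();
  }

  static inline void mark_empty (value_type &v)
  {
    v.loop_mask = NULL_TREE;
  }

  static inline bool is_empty (value_type v)
  {
    return v.loop_mask == NULL_TREE;
  }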

> +
> +  static void mark_deleted (value_type&) {}
> +
> +  static inline bool is_deleted (value_type)
> +  {
> +    return false;
> +  }
> +
> +  static inline void remove (value_type &) {}
> +};
> +
> +typedef hash_map<cond_vmask_key, tree,
> +		 simple_hashmap_traits <cond_vmask_key_traits, tree> >
> +  cond_vmask_map_type;
> +
>  /* Vectorizer state shared between different analyses like vector sizes
>     of the same CFG region.  */
>  class vec_info_shared {
> @@ -255,6 +331,8 @@ public:
>    /* Cost data used by the target cost model.  */
>    void *target_cost_data;
>  
> +  cond_vmask_map_type *cond_to_vec_mask;
> +
>  private:
>    stmt_vec_info new_stmt_vec_info (gimple *stmt);
>    void set_vinfo_for_stmt (gimple *, stmt_vec_info);
> diff --git a/gcc/tree.c b/gcc/tree.c
> index 8f80012c6e8..32a8fcf1eb8 100644
> --- a/gcc/tree.c
> +++ b/gcc/tree.c
> @@ -15204,6 +15204,44 @@ max_object_size (void)
>    return TYPE_MAX_VALUE (ptrdiff_type_node);
>  }
>  
> +/* If code(T) is comparison op or def of comparison stmt,
> +   extract it's operands.
> +   Else return <NE_EXPR, T, 0>.  */
> +
> +tree_cond_ops::tree_cond_ops (tree t)
> +{
> +  if (TREE_CODE_CLASS (TREE_CODE (t)) == tcc_comparison)
> +    {
> +      this->code = TREE_CODE (t);
> +      this->op0 = TREE_OPERAND (t, 0);
> +      this->op1 = TREE_OPERAND (t, 1);
> +      return;
> +    }
> +
> +  else if (TREE_CODE (t) == SSA_NAME)

Better as just an "if", given the early return above.

> +    {
> +      gassign *stmt = dyn_cast<gassign *> (SSA_NAME_DEF_STMT (t));
> +      if (stmt)
> +	{
> +	  tree_code code = gimple_assign_rhs_code (stmt);
> +	  if (TREE_CODE_CLASS (code) == tcc_comparison)
> +	    {
> +	      this->code = code;
> +	      this->op0 = gimple_assign_rhs1 (stmt);
> +	      this->op1 = gimple_assign_rhs2 (stmt);
> +	      return;
> +	    }
> +	}
> +
> +      this->code = NE_EXPR;
> +      this->op0 = t;
> +      this->op1 = build_zero_cst (TREE_TYPE (t));

Think we should use this as the default for non-SSA_NAMEs too,
rather than assert below.
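
I.e. the tail of the constructor would become roughly (untested):

  if (TREE_CODE (t) == SSA_NAME)
    if (gassign *stmt = dyn_cast<gassign *> (SSA_NAME_DEF_STMT (t)))
      {
	tree_code code = gimple_assign_rhs_code (stmt);
	if (TREE_CODE_CLASS (code) == tcc_comparison)
	  {
	    this->code = code;
	    this->op0 = gimple_assign_rhs1 (stmt);
	    this->op1 = gimple_assign_rhs2 (stmt);
	    return;
	  }
      }

  /* Fall back to <NE_EXPR, T, 0> for everything else.  */
  this->code = NE_EXPR;
  this->op0 = t;
  this->op1 = build_zero_cst (TREE_TYPE (t));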

> +    }
> +
> +  else
> +    gcc_unreachable ();
> +}
> +
>  #if CHECKING_P
>  
>  namespace selftest {
> diff --git a/gcc/tree.h b/gcc/tree.h
> index 94dbb95a78a..e6d6e9541c3 100644
> --- a/gcc/tree.h
> +++ b/gcc/tree.h
> @@ -6141,4 +6141,25 @@ public:
>    operator location_t () const { return m_combined_loc; }
>  };
>  
> +struct tree_cond_ops
> +{
> +  tree_code code;
> +  tree op0;
> +  tree op1;
> +
> +  tree_cond_ops (tree);
> +};
> +
> +/* ??? Not sure if it's good idea to include fold-const.h
> +       only for operand_equal_p ? */

Maybe put the new stuff in fold-const.h?

> +extern bool operand_equal_p (const_tree, const_tree, unsigned int);
> +
> +inline bool
> +operator== (const tree_cond_ops& o1, const tree_cond_ops &o2)
> +{
> +  return o1.code == o2.code
> +	 && operand_equal_p (o1.op0, o2.op0, 0)
> +	 && operand_equal_p (o1.op1, o2.op1, 0);

Multi-line expression should be enclosed in brackets.

Thanks,
Richard

> +}
> +
>  #endif  /* GCC_TREE_H  */

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [SVE] PR86753
  2019-08-27 17:39                                 ` Richard Sandiford
@ 2019-08-27 20:10                                   ` Prathamesh Kulkarni
  2019-08-28  9:42                                     ` Richard Sandiford
  0 siblings, 1 reply; 41+ messages in thread
From: Prathamesh Kulkarni @ 2019-08-27 20:10 UTC (permalink / raw)
  To: Prathamesh Kulkarni, Richard Biener, gcc Patches, Richard Sandiford

[-- Attachment #1: Type: text/plain, Size: 10079 bytes --]

On Tue, 27 Aug 2019 at 21:14, Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> Richard should have the final say, but some comments...
>
> Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> > diff --git a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c
> > index 1e2dfe5d22d..862206b3256 100644
> > --- a/gcc/tree-vect-stmts.c
> > +++ b/gcc/tree-vect-stmts.c
> > @@ -1989,17 +1989,31 @@ check_load_store_masking (loop_vec_info loop_vinfo, tree vectype,
> >
> >  static tree
> >  prepare_load_store_mask (tree mask_type, tree loop_mask, tree vec_mask,
> > -                      gimple_stmt_iterator *gsi)
> > +                      gimple_stmt_iterator *gsi, tree mask,
> > +                      cond_vmask_map_type *cond_to_vec_mask)
>
> "scalar_mask" might be a better name.  But maybe we should key off the
> vector mask after all, now that we're relying on the code having no
> redundancies.
>
> Passing the vinfo would be better than passing the cond_vmask_map_type
> directly.
>
> >  {
> >    gcc_assert (useless_type_conversion_p (mask_type, TREE_TYPE (vec_mask)));
> >    if (!loop_mask)
> >      return vec_mask;
> >
> >    gcc_assert (TREE_TYPE (loop_mask) == mask_type);
> > +
> > +  tree *slot = 0;
> > +  if (cond_to_vec_mask)
>
> The pointer should never be null in this context.
Disabling the check for NULL results in a segfault with cond_arith_4.c,
because we reach prepare_load_store_mask via vect_schedule_slp, which is
called from here in vect_transform_loop:
 /* Schedule the SLP instances first, then handle loop vectorization
     below.  */
  if (!loop_vinfo->slp_instances.is_empty ())
    {
      DUMP_VECT_SCOPE ("scheduling SLP instances");
      vect_schedule_slp (loop_vinfo);
    }

which is before the bb processing loop.
>
> > +    {
> > +      cond_vmask_key cond (mask, loop_mask);
> > +      slot = &cond_to_vec_mask->get_or_insert (cond);
> > +      if (*slot)
> > +     return *slot;
> > +    }
> > +
> >    tree and_res = make_temp_ssa_name (mask_type, NULL, "vec_mask_and");
> >    gimple *and_stmt = gimple_build_assign (and_res, BIT_AND_EXPR,
> >                                         vec_mask, loop_mask);
> >    gsi_insert_before (gsi, and_stmt, GSI_SAME_STMT);
> > +
> > +  if (slot)
> > +    *slot = and_res;
> >    return and_res;
> >  }
> > [...]
> > @@ -9975,6 +9997,38 @@ vectorizable_condition (stmt_vec_info stmt_info, gimple_stmt_iterator *gsi,
> >    /* Handle cond expr.  */
> >    for (j = 0; j < ncopies; j++)
> >      {
> > +      tree vec_mask = NULL_TREE;
> > +
> > +      if (loop_vinfo && LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)
>
> Nit: one condition per line when the whole thing doesn't fit on a single line.
>
> > +       && TREE_CODE_CLASS (TREE_CODE (cond_expr)) == tcc_comparison
>
> Why restrict this to embedded comparisons?  It should work for separate
> comparisons too.
>
> > +       && loop_vinfo->cond_to_vec_mask)
>
> This should always be nonnull given the above.
>
> > +     {
> > +       vec_loop_masks *masks = &LOOP_VINFO_MASKS (loop_vinfo);
> > +       if (masks)
>
> This is never null.
>
> > +         {
> > +           tree loop_mask = vect_get_loop_mask (gsi, masks,
> > +                                                ncopies, vectype, j);
> > +
> > +           cond_vmask_key cond (cond_expr, loop_mask);
> > +           tree *slot = loop_vinfo->cond_to_vec_mask->get (cond);
> > +           if (slot && *slot)
> > +             vec_mask = *slot;
> > +           else
> > +             {
> > +               cond.cond_ops.code
> > +                 = invert_tree_comparison (cond.cond_ops.code, true);
> > +               slot = loop_vinfo->cond_to_vec_mask->get (cond);
> > +               if (slot && *slot)
> > +                 {
> > +                   vec_mask = *slot;
> > +                   tree tmp = then_clause;
> > +                   then_clause = else_clause;
> > +                   else_clause = tmp;
>
> Can use std::swap.
>
> > +                 }
> > +             }
> > +         }
> > +     }
> > +
> >        stmt_vec_info new_stmt_info = NULL;
> >        if (j == 0)
> >       {
> > @@ -10054,6 +10108,8 @@ vectorizable_condition (stmt_vec_info stmt_info, gimple_stmt_iterator *gsi,
> >
> >         if (masked)
> >           vec_compare = vec_cond_lhs;
> > +       else if (vec_mask)
> > +         vec_compare = vec_mask;
>
> If we do drop the comparison check above, this should win over "masked".
>
> > @@ -193,6 +194,81 @@ public:
> >    poly_uint64 min_value;
> >  };
> >
> > +struct cond_vmask_key
>
> I'm no good at naming things, but since "vmask" doesn't occur elsewhere
> in target-independent code, how about "vec_masked_cond_key"?
>
> > +{
> > +  cond_vmask_key (tree t, tree loop_mask_)
> > +    : cond_ops (t), loop_mask (loop_mask_)
> > +  {}
> > +
> > +  hashval_t hash () const
> > +  {
> > +    inchash::hash h;
> > +    h.add_int (cond_ops.code);
> > +    h.add_int (TREE_HASH (cond_ops.op0));
> > +    h.add_int (TREE_HASH (cond_ops.op1));
>
> These two need to use inchash::add_expr, since you're hashing for
> operand_equal_p.
>
> > +    h.add_int (TREE_HASH (loop_mask));
> > +    return h.end ();
> > +  }
> > +
> > +  void mark_empty ()
> > +  {
> > +    loop_mask = NULL_TREE;
> > +  }
> > +
> > +  bool is_empty ()
> > +  {
> > +    return loop_mask == NULL_TREE;
> > +  }
> > +
> > +  tree_cond_ops cond_ops;
> > +  tree loop_mask;
> > +};
> > +
> > +inline bool operator== (const cond_vmask_key& c1, const cond_vmask_key& c2)
> > +{
> > +  return c1.loop_mask == c2.loop_mask
> > +      && c1.cond_ops == c2.cond_ops;
>
> Multi-line expressions should be in brackets (or just put this one on
> a single line).
>
> > +}
> > +
> > +struct cond_vmask_key_traits
>
> Might as well make this:
>
> template<>
> struct default_hash_traits<cond_vmask_key>
>
> and then you can drop the third template parameter from hash_map.
>
> > +{
> > +  typedef cond_vmask_key value_type;
> > +  typedef cond_vmask_key compare_type;
> > +
> > +  static inline hashval_t hash (value_type v)
> > +  {
> > +    return v.hash ();
> > +  }
> > +
> > +  static inline bool equal (value_type existing, value_type candidate)
> > +  {
> > +    return existing == candidate;
> > +  }
> > +
> > +  static inline void mark_empty (value_type& v)
> > +  {
> > +    v.mark_empty ();
> > +  }
> > +
> > +  static inline bool is_empty (value_type v)
> > +  {
> > +    return v.is_empty ();
> > +  }
>
> Making hash (), mask_empty () and is_empty () forward to cond_vmask_key
> functions of the same name seems unnecessary.  I think we should just put
> the implementation here and not define the functions in cond_vmask_key
> itself.
>
> > +
> > +  static void mark_deleted (value_type&) {}
> > +
> > +  static inline bool is_deleted (value_type)
> > +  {
> > +    return false;
> > +  }
> > +
> > +  static inline void remove (value_type &) {}
> > +};
> > +
> > +typedef hash_map<cond_vmask_key, tree,
> > +              simple_hashmap_traits <cond_vmask_key_traits, tree> >
> > +  cond_vmask_map_type;
> > +
> >  /* Vectorizer state shared between different analyses like vector sizes
> >     of the same CFG region.  */
> >  class vec_info_shared {
> > @@ -255,6 +331,8 @@ public:
> >    /* Cost data used by the target cost model.  */
> >    void *target_cost_data;
> >
> > +  cond_vmask_map_type *cond_to_vec_mask;
> > +
> >  private:
> >    stmt_vec_info new_stmt_vec_info (gimple *stmt);
> >    void set_vinfo_for_stmt (gimple *, stmt_vec_info);
> > diff --git a/gcc/tree.c b/gcc/tree.c
> > index 8f80012c6e8..32a8fcf1eb8 100644
> > --- a/gcc/tree.c
> > +++ b/gcc/tree.c
> > @@ -15204,6 +15204,44 @@ max_object_size (void)
> >    return TYPE_MAX_VALUE (ptrdiff_type_node);
> >  }
> >
> > +/* If code(T) is comparison op or def of comparison stmt,
> > +   extract it's operands.
> > +   Else return <NE_EXPR, T, 0>.  */
> > +
> > +tree_cond_ops::tree_cond_ops (tree t)
> > +{
> > +  if (TREE_CODE_CLASS (TREE_CODE (t)) == tcc_comparison)
> > +    {
> > +      this->code = TREE_CODE (t);
> > +      this->op0 = TREE_OPERAND (t, 0);
> > +      this->op1 = TREE_OPERAND (t, 1);
> > +      return;
> > +    }
> > +
> > +  else if (TREE_CODE (t) == SSA_NAME)
>
> Better as just an "if", given the early return above.
>
> > +    {
> > +      gassign *stmt = dyn_cast<gassign *> (SSA_NAME_DEF_STMT (t));
> > +      if (stmt)
> > +     {
> > +       tree_code code = gimple_assign_rhs_code (stmt);
> > +       if (TREE_CODE_CLASS (code) == tcc_comparison)
> > +         {
> > +           this->code = code;
> > +           this->op0 = gimple_assign_rhs1 (stmt);
> > +           this->op1 = gimple_assign_rhs2 (stmt);
> > +           return;
> > +         }
> > +     }
> > +
> > +      this->code = NE_EXPR;
> > +      this->op0 = t;
> > +      this->op1 = build_zero_cst (TREE_TYPE (t));
>
> Think we should use this as the default for non-SSA_NAMEs too,
> rather than assert below.
>
> > +    }
> > +
> > +  else
> > +    gcc_unreachable ();
> > +}
> > +
> >  #if CHECKING_P
> >
> >  namespace selftest {
> > diff --git a/gcc/tree.h b/gcc/tree.h
> > index 94dbb95a78a..e6d6e9541c3 100644
> > --- a/gcc/tree.h
> > +++ b/gcc/tree.h
> > @@ -6141,4 +6141,25 @@ public:
> >    operator location_t () const { return m_combined_loc; }
> >  };
> >
> > +struct tree_cond_ops
> > +{
> > +  tree_code code;
> > +  tree op0;
> > +  tree op1;
> > +
> > +  tree_cond_ops (tree);
> > +};
> > +
> > +/* ??? Not sure if it's good idea to include fold-const.h
> > +       only for operand_equal_p ? */
>
> Maybe put the new stuff in fold-const.h?
>
> > +extern bool operand_equal_p (const_tree, const_tree, unsigned int);
> > +
> > +inline bool
> > +operator== (const tree_cond_ops& o1, const tree_cond_ops &o2)
> > +{
> > +  return o1.code == o2.code
> > +      && operand_equal_p (o1.op0, o2.op0, 0)
> > +      && operand_equal_p (o1.op1, o2.op1, 0);
>
> Multi-line expression should be enclosed in brackets.
Thanks for the suggestions, I tried addressing these in the attached patch.
Does it look OK?

Thanks,
Prathamesh
>
> Thanks,
> Richard
>
> > +}
> > +
> >  #endif  /* GCC_TREE_H  */
>

[-- Attachment #2: pr86753-10.diff --]
[-- Type: application/x-patch, Size: 10782 bytes --]

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [SVE] PR86753
  2019-08-27 20:10                                   ` Prathamesh Kulkarni
@ 2019-08-28  9:42                                     ` Richard Sandiford
  2019-08-30 12:09                                       ` Richard Biener
  0 siblings, 1 reply; 41+ messages in thread
From: Richard Sandiford @ 2019-08-28  9:42 UTC (permalink / raw)
  To: Prathamesh Kulkarni; +Cc: Richard Biener, gcc Patches

Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> On Tue, 27 Aug 2019 at 21:14, Richard Sandiford
> <richard.sandiford@arm.com> wrote:
>>
>> Richard should have the final say, but some comments...
>>
>> Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
>> > diff --git a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c
>> > index 1e2dfe5d22d..862206b3256 100644
>> > --- a/gcc/tree-vect-stmts.c
>> > +++ b/gcc/tree-vect-stmts.c
>> > @@ -1989,17 +1989,31 @@ check_load_store_masking (loop_vec_info loop_vinfo, tree vectype,
>> >
>> >  static tree
>> >  prepare_load_store_mask (tree mask_type, tree loop_mask, tree vec_mask,
>> > -                      gimple_stmt_iterator *gsi)
>> > +                      gimple_stmt_iterator *gsi, tree mask,
>> > +                      cond_vmask_map_type *cond_to_vec_mask)
>>
>> "scalar_mask" might be a better name.  But maybe we should key off the
>> vector mask after all, now that we're relying on the code having no
>> redundancies.
>>
>> Passing the vinfo would be better than passing the cond_vmask_map_type
>> directly.
>>
>> >  {
>> >    gcc_assert (useless_type_conversion_p (mask_type, TREE_TYPE (vec_mask)));
>> >    if (!loop_mask)
>> >      return vec_mask;
>> >
>> >    gcc_assert (TREE_TYPE (loop_mask) == mask_type);
>> > +
>> > +  tree *slot = 0;
>> > +  if (cond_to_vec_mask)
>>
>> The pointer should never be null in this context.
> Disabling check for NULL results in segfault with cond_arith_4.c because we
> reach prepare_load_store_mask via vect_schedule_slp, called from
> here in vect_transform_loop:
>  /* Schedule the SLP instances first, then handle loop vectorization
>      below.  */
>   if (!loop_vinfo->slp_instances.is_empty ())
>     {
>       DUMP_VECT_SCOPE ("scheduling SLP instances");
>       vect_schedule_slp (loop_vinfo);
>     }
>
> which is before bb processing loop.

We want this optimisation to be applied to SLP too though.  Especially
since non-SLP will be going away at some point.

But as Richard says, the problem with SLP is that the statements aren't
traversed in block order, so I guess we can't do the on-the-fly
redundancy elimination there...

Maybe an alternative would be to record during the analysis phase which
scalar conditions need which loop masks.  Statements that need a loop
mask currently do:

      vect_record_loop_mask (loop_vinfo, masks, ncopies, vectype);

If we also pass the scalar condition, we can maintain a hash_set of
<condition, ncopies> pairs, representing the conditions that have
loop masks applied at some point in the vectorised code.  The COND_EXPR
code can use that set to decide whether to apply the loop mask or not.
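
Concretely, something like this (only a sketch -- the hash traits are
omitted and the names are placeholders, to be adjusted as needed):

      /* The scalar condition and the number of copies it was
	 recorded for.  */
      struct scalar_cond_masked_key
      {
	tree_code code;
	tree op0, op1;
	unsigned int ncopies;
      };

      typedef hash_set<scalar_cond_masked_key> scalar_cond_masked_set_type;

and the analysis-phase call above would then become:

      vect_record_loop_mask (loop_vinfo, masks, ncopies, vectype, scalar_mask);

with vect_record_loop_mask entering <scalar_mask, ncopies> into the set
whenever scalar_mask is nonnull.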

Trying to avoid duplicate ANDs with the loop mask would then become a
separate follow-on change.  Not sure whether it's worth it on its own.

Thanks,
Richard

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [SVE] PR86753
  2019-08-28  9:42                                     ` Richard Sandiford
@ 2019-08-30 12:09                                       ` Richard Biener
  2019-08-31 16:56                                         ` Prathamesh Kulkarni
  0 siblings, 1 reply; 41+ messages in thread
From: Richard Biener @ 2019-08-30 12:09 UTC (permalink / raw)
  To: Prathamesh Kulkarni, Richard Biener, gcc Patches, Richard Sandiford

On Wed, Aug 28, 2019 at 11:02 AM Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> > On Tue, 27 Aug 2019 at 21:14, Richard Sandiford
> > <richard.sandiford@arm.com> wrote:
> >>
> >> Richard should have the final say, but some comments...
> >>
> >> Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> >> > diff --git a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c
> >> > index 1e2dfe5d22d..862206b3256 100644
> >> > --- a/gcc/tree-vect-stmts.c
> >> > +++ b/gcc/tree-vect-stmts.c
> >> > @@ -1989,17 +1989,31 @@ check_load_store_masking (loop_vec_info loop_vinfo, tree vectype,
> >> >
> >> >  static tree
> >> >  prepare_load_store_mask (tree mask_type, tree loop_mask, tree vec_mask,
> >> > -                      gimple_stmt_iterator *gsi)
> >> > +                      gimple_stmt_iterator *gsi, tree mask,
> >> > +                      cond_vmask_map_type *cond_to_vec_mask)
> >>
> >> "scalar_mask" might be a better name.  But maybe we should key off the
> >> vector mask after all, now that we're relying on the code having no
> >> redundancies.
> >>
> >> Passing the vinfo would be better than passing the cond_vmask_map_type
> >> directly.
> >>
> >> >  {
> >> >    gcc_assert (useless_type_conversion_p (mask_type, TREE_TYPE (vec_mask)));
> >> >    if (!loop_mask)
> >> >      return vec_mask;
> >> >
> >> >    gcc_assert (TREE_TYPE (loop_mask) == mask_type);
> >> > +
> >> > +  tree *slot = 0;
> >> > +  if (cond_to_vec_mask)
> >>
> >> The pointer should never be null in this context.
> > Disabling check for NULL results in segfault with cond_arith_4.c because we
> > reach prepare_load_store_mask via vect_schedule_slp, called from
> > here in vect_transform_loop:
> >  /* Schedule the SLP instances first, then handle loop vectorization
> >      below.  */
> >   if (!loop_vinfo->slp_instances.is_empty ())
> >     {
> >       DUMP_VECT_SCOPE ("scheduling SLP instances");
> >       vect_schedule_slp (loop_vinfo);
> >     }
> >
> > which is before bb processing loop.
>
> We want this optimisation to be applied to SLP too though.  Especially
> since non-SLP will be going away at some point.
>
> But as Richard says, the problem with SLP is that the statements aren't
> traversed in block order, so I guess we can't do the on-the-fly
> redundancy elimination there...

And the current patch AFAICS can generate wrong SSA for this reason.

> Maybe an alternative would be to record during the analysis phase which
> scalar conditions need which loop masks.  Statements that need a loop
> mask currently do:
>
>       vect_record_loop_mask (loop_vinfo, masks, ncopies, vectype);
>
> If we also pass the scalar condition, we can maintain a hash_set of
> <condition, ncopies> pairs, representing the conditions that have
> loop masks applied at some point in the vectorised code.  The COND_EXPR
> code can use that set to decide whether to apply the loop mask or not.

Yeah, that sounds better.

Note that I don't like the extra "helpers" in fold-const.c/h, they do not look
useful in general so put them into vectorizer private code.  The decomposing
also doesn't look too nice, instead prepare_load_store_mask could get
such decomposed representation - possibly quite natural with the suggestion
from Richard above.
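
That is, something like (just to illustrate -- the extra parameter names
are made up):

      static tree
      prepare_load_store_mask (tree mask_type, tree loop_mask, tree vec_mask,
			       gimple_stmt_iterator *gsi,
			       tree_code scalar_code, tree scalar_op0,
			       tree scalar_op1);

rather than having it pick apart a condition tree itself.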

Richard.

> Trying to avoid duplicate ANDs with the loop mask would then become a
> separate follow-on change.  Not sure whether it's worth it on its own.
>
> Thanks,
> Richard

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [SVE] PR86753
  2019-08-30 12:09                                       ` Richard Biener
@ 2019-08-31 16:56                                         ` Prathamesh Kulkarni
  2019-09-05  9:00                                           ` Richard Sandiford
  0 siblings, 1 reply; 41+ messages in thread
From: Prathamesh Kulkarni @ 2019-08-31 16:56 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc Patches, Richard Sandiford

[-- Attachment #1: Type: text/plain, Size: 7027 bytes --]

On Fri, 30 Aug 2019 at 16:15, Richard Biener <richard.guenther@gmail.com> wrote:
>
> On Wed, Aug 28, 2019 at 11:02 AM Richard Sandiford
> <richard.sandiford@arm.com> wrote:
> >
> > Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> > > On Tue, 27 Aug 2019 at 21:14, Richard Sandiford
> > > <richard.sandiford@arm.com> wrote:
> > >>
> > >> Richard should have the final say, but some comments...
> > >>
> > >> Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> > >> > diff --git a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c
> > >> > index 1e2dfe5d22d..862206b3256 100644
> > >> > --- a/gcc/tree-vect-stmts.c
> > >> > +++ b/gcc/tree-vect-stmts.c
> > >> > @@ -1989,17 +1989,31 @@ check_load_store_masking (loop_vec_info loop_vinfo, tree vectype,
> > >> >
> > >> >  static tree
> > >> >  prepare_load_store_mask (tree mask_type, tree loop_mask, tree vec_mask,
> > >> > -                      gimple_stmt_iterator *gsi)
> > >> > +                      gimple_stmt_iterator *gsi, tree mask,
> > >> > +                      cond_vmask_map_type *cond_to_vec_mask)
> > >>
> > >> "scalar_mask" might be a better name.  But maybe we should key off the
> > >> vector mask after all, now that we're relying on the code having no
> > >> redundancies.
> > >>
> > >> Passing the vinfo would be better than passing the cond_vmask_map_type
> > >> directly.
> > >>
> > >> >  {
> > >> >    gcc_assert (useless_type_conversion_p (mask_type, TREE_TYPE (vec_mask)));
> > >> >    if (!loop_mask)
> > >> >      return vec_mask;
> > >> >
> > >> >    gcc_assert (TREE_TYPE (loop_mask) == mask_type);
> > >> > +
> > >> > +  tree *slot = 0;
> > >> > +  if (cond_to_vec_mask)
> > >>
> > >> The pointer should never be null in this context.
> > > Disabling check for NULL results in segfault with cond_arith_4.c because we
> > > reach prepare_load_store_mask via vect_schedule_slp, called from
> > > here in vect_transform_loop:
> > >  /* Schedule the SLP instances first, then handle loop vectorization
> > >      below.  */
> > >   if (!loop_vinfo->slp_instances.is_empty ())
> > >     {
> > >       DUMP_VECT_SCOPE ("scheduling SLP instances");
> > >       vect_schedule_slp (loop_vinfo);
> > >     }
> > >
> > > which is before bb processing loop.
> >
> > We want this optimisation to be applied to SLP too though.  Especially
> > since non-SLP will be going away at some point.
> >
> > But as Richard says, the problem with SLP is that the statements aren't
> > traversed in block order, so I guess we can't do the on-the-fly
> > redundancy elimination there...
>
> And the current patch AFAICS can generate wrong SSA for this reason.
>
> > Maybe an alternative would be to record during the analysis phase which
> > scalar conditions need which loop masks.  Statements that need a loop
> > mask currently do:
> >
> >       vect_record_loop_mask (loop_vinfo, masks, ncopies, vectype);
> >
> > If we also pass the scalar condition, we can maintain a hash_set of
> > <condition, ncopies> pairs, representing the conditions that have
> > loop masks applied at some point in the vectorised code.  The COND_EXPR
> > code can use that set to decide whether to apply the loop mask or not.
>
> Yeah, that sounds better.
>
> Note that I don't like the extra "helpers" in fold-const.c/h, they do not look
> useful in general so put them into vectorizer private code.  The decomposing
> also doesn't look too nice, instead prepare_load_store_mask could get
> such decomposed representation - possibly quite natural with the suggestion
> from Richard above.
Hi,
Thanks for the suggestions, I have attached an updated patch that
tries to address the above suggestions.
With the patch, we manage to use the same predicate for both tests in
the PR, and the redundant AND ops are eliminated by fre4.

I have a few doubts:
1] I moved tree_cond_ops into tree-vectorizer.[ch]; I will get rid of
it in a follow-up patch.
I am not sure what to pass as the def of the scalar condition
(scalar_mask) to vect_record_loop_mask from vectorizable_store,
vectorizable_reduction and vectorizable_live_operation.  In the patch,
I just passed NULL.

2] Do the changes to vectorizable_condition and
vectorizable_condition_apply_loop_mask look OK?

3] The patch additionally regresses following tests (apart from fmla_2.c):
FAIL: gcc.target/aarch64/sve/cond_convert_1.c -march=armv8.2-a+sve
scan-assembler-not \\tsel\\t
FAIL: gcc.target/aarch64/sve/cond_convert_4.c -march=armv8.2-a+sve
scan-assembler-not \\tsel\\t
FAIL: gcc.target/aarch64/sve/cond_unary_2.c -march=armv8.2-a+sve
scan-assembler-not \\tsel\\t
FAIL: gcc.target/aarch64/sve/cond_unary_2.c -march=armv8.2-a+sve
scan-assembler-times \\tmovprfx\\t

The issue with cond_convert_1.c can be reproduced with the following test case:

#include <stdint.h>

void __attribute__((noipa))
test_int16_t(_Float16 *__restrict r, int16_t *__restrict a,
             _Float16 *__restrict b, int16_t *__restrict pred, int n) {
  for (int i = 0; i < n; ++i)
    r[i] = pred[i] ? (_Float16)a[i] : b[i];
}

Before the patch, the vect dump shows:
  mask__41.15_56 = vect__4.9_47 == vect_cst__55;
  _41 = _4 == 0;
  vec_mask_and_59 = mask__41.15_56 & loop_mask_46;
  vect_iftmp.18_60 = .MASK_LOAD (vectp_b.16_57, 2B, vec_mask_and_59);
  iftmp.0_16 = 0.0;
  vect_iftmp.19_62 = VEC_COND_EXPR <vect__4.9_47 == vect_cst__61,
vect_iftmp.18_60, vect_iftmp.14_54>;
  iftmp.0_10 = _4 == 0 ? iftmp.0_16 : iftmp.0_18;

fre4 then seems to invert the comparison code and interchange the operands
of the vec_cond_expr:
  mask__41.15_56 = vect__4.9_47 == { 0, ... };
  vec_mask_and_59 = mask__41.15_56 & loop_mask_46;
  _1 = &MEM[base: b_15(D), index: ivtmp_66, step: 2, offset: 0B];
  vect_iftmp.18_60 = .MASK_LOAD (_1, 2B, vec_mask_and_59);
  vect_iftmp.19_62 = VEC_COND_EXPR <vect__4.9_47 != { 0, ... },
vect_iftmp.14_54, vect_iftmp.18_60>;

After the patch, the vect dump shows:
  mask__41.15_56 = vect__4.9_47 == vect_cst__55;
  _41 = _4 == 0;
  vec_mask_and_59 = mask__41.15_56 & loop_mask_46;
  vect_iftmp.18_60 = .MASK_LOAD (vectp_b.16_57, 2B, vec_mask_and_59);
  iftmp.0_16 = 0.0;
  _62 = vect__4.9_47 == vect_cst__61;
  _63 = _62 & loop_mask_46;
  vect_iftmp.19_64 = VEC_COND_EXPR <_63, vect_iftmp.18_60, vect_iftmp.14_54>;
  iftmp.0_10 = _4 == 0 ? iftmp.0_16 : iftmp.0_18;

which is then cleaned up by fre4:
  mask__41.15_56 = vect__4.9_47 == { 0, ... };
  vec_mask_and_59 = mask__41.15_56 & loop_mask_46;
  _1 = &MEM[base: b_15(D), index: ivtmp_68, step: 2, offset: 0B];
  vect_iftmp.18_60 = .MASK_LOAD (_1, 2B, vec_mask_and_59);
  vect_iftmp.19_64 = VEC_COND_EXPR <vec_mask_and_59, vect_iftmp.18_60,
vect_iftmp.14_54>;

In this case, fre4 does not interchange the operands and instead reuses
vec_mask_and_59 in the vec_cond_expr, which perhaps results in the
different code-gen?
I haven't investigated the other tests so far, because they look quite
similar to cond_convert_1.c and possibly have the same issue.

Thanks,
Prathamesh

>
> Richard.
>
> > Trying to avoid duplicate ANDs with the loop mask would then become a
> > separate follow-on change.  Not sure whether it's worth it on its own.
> >
> > Thanks,
> > Richard

[-- Attachment #2: pr86753-v2-1.diff --]
[-- Type: application/x-patch, Size: 12570 bytes --]

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [SVE] PR86753
  2019-08-31 16:56                                         ` Prathamesh Kulkarni
@ 2019-09-05  9:00                                           ` Richard Sandiford
  2019-09-05 12:51                                             ` Prathamesh Kulkarni
  0 siblings, 1 reply; 41+ messages in thread
From: Richard Sandiford @ 2019-09-05  9:00 UTC (permalink / raw)
  To: Prathamesh Kulkarni; +Cc: Richard Biener, gcc Patches

Sorry for the slow reply.

Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> On Fri, 30 Aug 2019 at 16:15, Richard Biener <richard.guenther@gmail.com> wrote:
>>
>> On Wed, Aug 28, 2019 at 11:02 AM Richard Sandiford
>> <richard.sandiford@arm.com> wrote:
>> >
>> > Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
>> > > On Tue, 27 Aug 2019 at 21:14, Richard Sandiford
>> > > <richard.sandiford@arm.com> wrote:
>> > >>
>> > >> Richard should have the final say, but some comments...
>> > >>
>> > >> Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
>> > >> > diff --git a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c
>> > >> > index 1e2dfe5d22d..862206b3256 100644
>> > >> > --- a/gcc/tree-vect-stmts.c
>> > >> > +++ b/gcc/tree-vect-stmts.c
>> > >> > @@ -1989,17 +1989,31 @@ check_load_store_masking (loop_vec_info loop_vinfo, tree vectype,
>> > >> >
>> > >> >  static tree
>> > >> >  prepare_load_store_mask (tree mask_type, tree loop_mask, tree vec_mask,
>> > >> > -                      gimple_stmt_iterator *gsi)
>> > >> > +                      gimple_stmt_iterator *gsi, tree mask,
>> > >> > +                      cond_vmask_map_type *cond_to_vec_mask)
>> > >>
>> > >> "scalar_mask" might be a better name.  But maybe we should key off the
>> > >> vector mask after all, now that we're relying on the code having no
>> > >> redundancies.
>> > >>
>> > >> Passing the vinfo would be better than passing the cond_vmask_map_type
>> > >> directly.
>> > >>
>> > >> >  {
>> > >> >    gcc_assert (useless_type_conversion_p (mask_type, TREE_TYPE (vec_mask)));
>> > >> >    if (!loop_mask)
>> > >> >      return vec_mask;
>> > >> >
>> > >> >    gcc_assert (TREE_TYPE (loop_mask) == mask_type);
>> > >> > +
>> > >> > +  tree *slot = 0;
>> > >> > +  if (cond_to_vec_mask)
>> > >>
>> > >> The pointer should never be null in this context.
>> > > Disabling check for NULL results in segfault with cond_arith_4.c because we
>> > > reach prepare_load_store_mask via vect_schedule_slp, called from
>> > > here in vect_transform_loop:
>> > >  /* Schedule the SLP instances first, then handle loop vectorization
>> > >      below.  */
>> > >   if (!loop_vinfo->slp_instances.is_empty ())
>> > >     {
>> > >       DUMP_VECT_SCOPE ("scheduling SLP instances");
>> > >       vect_schedule_slp (loop_vinfo);
>> > >     }
>> > >
>> > > which is before bb processing loop.
>> >
>> > We want this optimisation to be applied to SLP too though.  Especially
>> > since non-SLP will be going away at some point.
>> >
>> > But as Richard says, the problem with SLP is that the statements aren't
>> > traversed in block order, so I guess we can't do the on-the-fly
>> > redundancy elimination there...
>>
>> And the current patch AFAICS can generate wrong SSA for this reason.
>>
>> > Maybe an alternative would be to record during the analysis phase which
>> > scalar conditions need which loop masks.  Statements that need a loop
>> > mask currently do:
>> >
>> >       vect_record_loop_mask (loop_vinfo, masks, ncopies, vectype);
>> >
>> > If we also pass the scalar condition, we can maintain a hash_set of
>> > <condition, ncopies> pairs, representing the conditions that have
>> > loop masks applied at some point in the vectorised code.  The COND_EXPR
>> > code can use that set to decide whether to apply the loop mask or not.
>>
>> Yeah, that sounds better.
>>
>> Note that I don't like the extra "helpers" in fold-const.c/h, they do not look
>> useful in general so put them into vectorizer private code.  The decomposing
>> also doesn't look too nice, instead prepare_load_store_mask could get
>> such decomposed representation - possibly quite natural with the suggestion
>> from Richard above.
> Hi,
> Thanks for the suggestions, I have an attached updated patch, that
> tries to address above suggestions.
> With patch, we manage to use same predicate for both tests in PR, and
> the redundant AND ops are eliminated
> by fre4.
>
> I have a few doubts:
> 1] I moved tree_cond_ops into tree-vectorizer.[ch], I will get rid of
> it in follow up patch.
> I am not sure what to pass as def of scalar condition (scalar_mask) to
> vect_record_loop_mask
> from vectorizable_store, vectorizable_reduction and
> vectorizable_live_operation ? In the patch,
> I just passed NULL.

For vectorizable_store this is just "mask", like for vectorizable_load.
Passing NULL looks right for the other two.  (Nit, GCC style is to use
NULL rather than 0.)
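
I.e. in vectorizable_store the existing call would simply become
something like (sketch, matching the new parameter):

      check_load_store_masking (loop_vinfo, vectype, vls_type, group_size,
				memory_access_type, &gs_info, mask);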

> 2] Do changes to vectorizable_condition and
> vectorizable_condition_apply_loop_mask look OK ?

Some comments below.

> 3] The patch additionally regresses following tests (apart from fmla_2.c):
> FAIL: gcc.target/aarch64/sve/cond_convert_1.c -march=armv8.2-a+sve
> scan-assembler-not \\tsel\\t
> FAIL: gcc.target/aarch64/sve/cond_convert_4.c -march=armv8.2-a+sve
> scan-assembler-not \\tsel\\t
> FAIL: gcc.target/aarch64/sve/cond_unary_2.c -march=armv8.2-a+sve
> scan-assembler-not \\tsel\\t
> FAIL: gcc.target/aarch64/sve/cond_unary_2.c -march=armv8.2-a+sve
> scan-assembler-times \\tmovprfx\\t
> [...]

For cond_convert_1.c, I think it would be OK to change the test to:

    for (int i = 0; i < n; ++i)					\
      {								\
	FLOAT_TYPE bi = b[i];					\
	r[i] = pred[i] ? (FLOAT_TYPE) a[i] : bi;		\
      }								\

so that only the a[i] load is conditional.  Same for the other two.

I think originally I had to write it this way precisely because
we didn't have the optimisation you're adding, so this is actually
a good sign :-)

> @@ -8313,7 +8313,7 @@ vect_double_mask_nunits (tree type)
>  
>  void
>  vect_record_loop_mask (loop_vec_info loop_vinfo, vec_loop_masks *masks,
> -		       unsigned int nvectors, tree vectype)
> +		       unsigned int nvectors, tree vectype, tree scalar_mask)
>  {
>    gcc_assert (nvectors != 0);
>    if (masks->length () < nvectors)

New parameter needs documentation.

> diff --git a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c
> index dd9d45a9547..49ea86a0680 100644
> --- a/gcc/tree-vect-stmts.c
> +++ b/gcc/tree-vect-stmts.c
> @@ -1888,7 +1888,7 @@ static void
>  check_load_store_masking (loop_vec_info loop_vinfo, tree vectype,
>  			  vec_load_store_type vls_type, int group_size,
>  			  vect_memory_access_type memory_access_type,
> -			  gather_scatter_info *gs_info)
> +			  gather_scatter_info *gs_info, tree scalar_mask)
>  {
>    /* Invariant loads need no special support.  */
>    if (memory_access_type == VMAT_INVARIANT)

Same here.

> @@ -9763,6 +9765,29 @@ vect_is_simple_cond (tree cond, vec_info *vinfo,
>    return true;
>  }
>  
> +static void
> +vectorizable_condition_apply_loop_mask (tree &vec_compare,
> +					gimple_stmt_iterator *&gsi,
> +					stmt_vec_info &stmt_info,
> +					tree loop_mask,
> +					tree vec_cmp_type)

Function needs a comment.

I think it'd be better to return the new mask and not make vec_compare
a reference.  stmt_info shouldn't need to be a reference either (it's
just a pointer type).
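
So roughly (sketch):

      static tree
      vectorizable_condition_apply_loop_mask (tree vec_compare,
					      gimple_stmt_iterator *gsi,
					      stmt_vec_info stmt_info,
					      tree loop_mask,
					      tree vec_cmp_type);

with the caller assigning the result back to vec_compare.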

> +{
> +  if (COMPARISON_CLASS_P (vec_compare))
> +    {
> +      tree tmp = make_ssa_name (vec_cmp_type);
> +      gassign *g = gimple_build_assign (tmp, TREE_CODE (vec_compare),
> +					TREE_OPERAND (vec_compare, 0),
> +					TREE_OPERAND (vec_compare, 1));
> +      vect_finish_stmt_generation (stmt_info, g, gsi);
> +      vec_compare = tmp;
> +    }
> +
> +  tree tmp2 = make_ssa_name (vec_cmp_type);
> +  gassign *g = gimple_build_assign (tmp2, BIT_AND_EXPR, vec_compare, loop_mask);
> +  vect_finish_stmt_generation (stmt_info, g, gsi);
> +  vec_compare = tmp2;
> +}
> +
>  /* vectorizable_condition.
>  
>     Check if STMT_INFO is conditional modify expression that can be vectorized.
> @@ -9975,6 +10000,36 @@ vectorizable_condition (stmt_vec_info stmt_info, gimple_stmt_iterator *gsi,
>    /* Handle cond expr.  */
>    for (j = 0; j < ncopies; j++)
>      {
> +      tree loop_mask = NULL_TREE;
> +
> +      if (loop_vinfo && LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
> +	{
> +	  scalar_cond_masked_key cond (cond_expr, ncopies);
> +          if (loop_vinfo->scalar_cond_masked_set->contains (cond))

Nit: untabified line.

> +	    {
> +	      scalar_cond_masked_key cond (cond_expr, ncopies);
> +	      if (loop_vinfo->scalar_cond_masked_set->contains (cond))

This "if" looks redundant -- isn't the condition the same as above?

> +		{
> +		  vec_loop_masks *masks = &LOOP_VINFO_MASKS (loop_vinfo);
> +		  loop_mask = vect_get_loop_mask (gsi, masks, ncopies, vectype, j);
> +		}
> +	    }
> +	  else
> +	    {
> +	      cond.cond_ops.code
> +		= invert_tree_comparison (cond.cond_ops.code, true);

Would be better to pass an HONOR_NANS value instead of "true".

> +	      if (loop_vinfo->scalar_cond_masked_set->contains (cond))
> +		{
> +		  vec_loop_masks *masks = &LOOP_VINFO_MASKS (loop_vinfo);
> +		  loop_mask = vect_get_loop_mask (gsi, masks, ncopies, vectype, j);
> +		  std::swap (then_clause, else_clause);
> +		  cond_code = cond.cond_ops.code;
> +		  cond_expr = build2 (cond_code, TREE_TYPE (cond_expr),
> +				      then_clause, else_clause);

Rather than do the swap here and build a new tree, I think it would be
better to set a boolean that indicates that the then and else are swapped.
Then we can conditionally swap them after:

          vec_then_clause = vec_oprnds2[i];
          vec_else_clause = vec_oprnds3[i];
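
i.e. roughly (flag name only for illustration):

          if (swap_cond_operands)
            std::swap (vec_then_clause, vec_else_clause);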

> [...]
> diff --git a/gcc/tree-vectorizer.c b/gcc/tree-vectorizer.c
> index dc181524744..794e65f0007 100644
> --- a/gcc/tree-vectorizer.c
> +++ b/gcc/tree-vectorizer.c
> @@ -464,6 +464,7 @@ vec_info::vec_info (vec_info::vec_kind kind_in, void *target_cost_data_in,
>      target_cost_data (target_cost_data_in)
>  {
>    stmt_vec_infos.create (50);
> +  scalar_cond_masked_set = new scalar_cond_masked_set_type ();
>  }
>  
>  vec_info::~vec_info ()
> @@ -476,6 +477,8 @@ vec_info::~vec_info ()
>  
>    destroy_cost_data (target_cost_data);
>    free_stmt_vec_infos ();
> +  delete scalar_cond_masked_set;
> +  scalar_cond_masked_set = 0;
>  }
>  
>  vec_info_shared::vec_info_shared ()

No need to assign null here, since we're at the end of the destructor.
But maybe scalar_cond_masked_set should be "scalar_cond_masked_set_type"
rather than "scalar_cond_masked_set_type *", if the object is going to
have the same lifetime as the vec_info anyway.
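
I.e. just a plain member (sketch):

      /* In vec_info:  */
      scalar_cond_masked_set_type scalar_cond_masked_set;

so the explicit new/delete in the constructor and destructor go away.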

Looks good otherwise.  I skipped over the tree_cond_ops bit given
your comment above that this was temporary.

Thanks,
Richard

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [SVE] PR86753
  2019-09-05  9:00                                           ` Richard Sandiford
@ 2019-09-05 12:51                                             ` Prathamesh Kulkarni
  2019-09-09 11:15                                               ` Richard Sandiford
  0 siblings, 1 reply; 41+ messages in thread
From: Prathamesh Kulkarni @ 2019-09-05 12:51 UTC (permalink / raw)
  To: Prathamesh Kulkarni, Richard Biener, gcc Patches, Richard Sandiford

[-- Attachment #1: Type: text/plain, Size: 12049 bytes --]

On Thu, 5 Sep 2019 at 14:29, Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> Sorry for the slow reply.
>
> Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> > On Fri, 30 Aug 2019 at 16:15, Richard Biener <richard.guenther@gmail.com> wrote:
> >>
> >> On Wed, Aug 28, 2019 at 11:02 AM Richard Sandiford
> >> <richard.sandiford@arm.com> wrote:
> >> >
> >> > Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> >> > > On Tue, 27 Aug 2019 at 21:14, Richard Sandiford
> >> > > <richard.sandiford@arm.com> wrote:
> >> > >>
> >> > >> Richard should have the final say, but some comments...
> >> > >>
> >> > >> Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> >> > >> > diff --git a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c
> >> > >> > index 1e2dfe5d22d..862206b3256 100644
> >> > >> > --- a/gcc/tree-vect-stmts.c
> >> > >> > +++ b/gcc/tree-vect-stmts.c
> >> > >> > @@ -1989,17 +1989,31 @@ check_load_store_masking (loop_vec_info loop_vinfo, tree vectype,
> >> > >> >
> >> > >> >  static tree
> >> > >> >  prepare_load_store_mask (tree mask_type, tree loop_mask, tree vec_mask,
> >> > >> > -                      gimple_stmt_iterator *gsi)
> >> > >> > +                      gimple_stmt_iterator *gsi, tree mask,
> >> > >> > +                      cond_vmask_map_type *cond_to_vec_mask)
> >> > >>
> >> > >> "scalar_mask" might be a better name.  But maybe we should key off the
> >> > >> vector mask after all, now that we're relying on the code having no
> >> > >> redundancies.
> >> > >>
> >> > >> Passing the vinfo would be better than passing the cond_vmask_map_type
> >> > >> directly.
> >> > >>
> >> > >> >  {
> >> > >> >    gcc_assert (useless_type_conversion_p (mask_type, TREE_TYPE (vec_mask)));
> >> > >> >    if (!loop_mask)
> >> > >> >      return vec_mask;
> >> > >> >
> >> > >> >    gcc_assert (TREE_TYPE (loop_mask) == mask_type);
> >> > >> > +
> >> > >> > +  tree *slot = 0;
> >> > >> > +  if (cond_to_vec_mask)
> >> > >>
> >> > >> The pointer should never be null in this context.
> >> > > Disabling check for NULL results in segfault with cond_arith_4.c because we
> >> > > reach prepare_load_store_mask via vect_schedule_slp, called from
> >> > > here in vect_transform_loop:
> >> > >  /* Schedule the SLP instances first, then handle loop vectorization
> >> > >      below.  */
> >> > >   if (!loop_vinfo->slp_instances.is_empty ())
> >> > >     {
> >> > >       DUMP_VECT_SCOPE ("scheduling SLP instances");
> >> > >       vect_schedule_slp (loop_vinfo);
> >> > >     }
> >> > >
> >> > > which is before bb processing loop.
> >> >
> >> > We want this optimisation to be applied to SLP too though.  Especially
> >> > since non-SLP will be going away at some point.
> >> >
> >> > But as Richard says, the problem with SLP is that the statements aren't
> >> > traversed in block order, so I guess we can't do the on-the-fly
> >> > redundancy elimination there...
> >>
> >> And the current patch AFAICS can generate wrong SSA for this reason.
> >>
> >> > Maybe an alternative would be to record during the analysis phase which
> >> > scalar conditions need which loop masks.  Statements that need a loop
> >> > mask currently do:
> >> >
> >> >       vect_record_loop_mask (loop_vinfo, masks, ncopies, vectype);
> >> >
> >> > If we also pass the scalar condition, we can maintain a hash_set of
> >> > <condition, ncopies> pairs, representing the conditions that have
> >> > loop masks applied at some point in the vectorised code.  The COND_EXPR
> >> > code can use that set to decide whether to apply the loop mask or not.
> >>
> >> Yeah, that sounds better.
> >>
> >> Note that I don't like the extra "helpers" in fold-const.c/h, they do not look
> >> useful in general so put them into vectorizer private code.  The decomposing
> >> also doesn't look too nice, instead prepare_load_store_mask could get
> >> such decomposed representation - possibly quite natural with the suggestion
> >> from Richard above.
> > Hi,
> > Thanks for the suggestions, I have an attached updated patch, that
> > tries to address above suggestions.
> > With patch, we manage to use same predicate for both tests in PR, and
> > the redundant AND ops are eliminated
> > by fre4.
> >
> > I have a few doubts:
> > 1] I moved tree_cond_ops into tree-vectorizer.[ch], I will get rid of
> > it in follow up patch.
> > I am not sure what to pass as def of scalar condition (scalar_mask) to
> > vect_record_loop_mask
> > from vectorizable_store, vectorizable_reduction and
> > vectorizable_live_operation ? In the patch,
> > I just passed NULL.
>
> For vectorizable_store this is just "mask", like for vectorizable_load.
> Passing NULL looks right for the other two.  (Nit, GCC style is to use
> NULL rather than 0.)
>
> > 2] Do changes to vectorizable_condition and
> > vectorizable_condition_apply_loop_mask look OK ?
>
> Some comments below.
>
> > 3] The patch additionally regresses following tests (apart from fmla_2.c):
> > FAIL: gcc.target/aarch64/sve/cond_convert_1.c -march=armv8.2-a+sve
> > scan-assembler-not \\tsel\\t
> > FAIL: gcc.target/aarch64/sve/cond_convert_4.c -march=armv8.2-a+sve
> > scan-assembler-not \\tsel\\t
> > FAIL: gcc.target/aarch64/sve/cond_unary_2.c -march=armv8.2-a+sve
> > scan-assembler-not \\tsel\\t
> > FAIL: gcc.target/aarch64/sve/cond_unary_2.c -march=armv8.2-a+sve
> > scan-assembler-times \\tmovprfx\\t
> > [...]
>
> For cond_convert_1.c, I think it would be OK to change the test to:
>
>     for (int i = 0; i < n; ++i)                                 \
>       {                                                         \
>         FLOAT_TYPE bi = b[i];                                   \
>         r[i] = pred[i] ? (FLOAT_TYPE) a[i] : bi;                \
>       }                                                         \
>
> so that only the a[i] load is conditional.  Same for the other two.
>
> I think originally I had to write it this way precisely because
> we didn't have the optimisation you're adding, so this is actually
> a good sign :-)
>
> > @@ -8313,7 +8313,7 @@ vect_double_mask_nunits (tree type)
> >
> >  void
> >  vect_record_loop_mask (loop_vec_info loop_vinfo, vec_loop_masks *masks,
> > -                    unsigned int nvectors, tree vectype)
> > +                    unsigned int nvectors, tree vectype, tree scalar_mask)
> >  {
> >    gcc_assert (nvectors != 0);
> >    if (masks->length () < nvectors)
>
> New parameter needs documentation.
>
> > diff --git a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c
> > index dd9d45a9547..49ea86a0680 100644
> > --- a/gcc/tree-vect-stmts.c
> > +++ b/gcc/tree-vect-stmts.c
> > @@ -1888,7 +1888,7 @@ static void
> >  check_load_store_masking (loop_vec_info loop_vinfo, tree vectype,
> >                         vec_load_store_type vls_type, int group_size,
> >                         vect_memory_access_type memory_access_type,
> > -                       gather_scatter_info *gs_info)
> > +                       gather_scatter_info *gs_info, tree scalar_mask)
> >  {
> >    /* Invariant loads need no special support.  */
> >    if (memory_access_type == VMAT_INVARIANT)
>
> Same here.
>
> > @@ -9763,6 +9765,29 @@ vect_is_simple_cond (tree cond, vec_info *vinfo,
> >    return true;
> >  }
> >
> > +static void
> > +vectorizable_condition_apply_loop_mask (tree &vec_compare,
> > +                                     gimple_stmt_iterator *&gsi,
> > +                                     stmt_vec_info &stmt_info,
> > +                                     tree loop_mask,
> > +                                     tree vec_cmp_type)
>
> Function needs a comment.
>
> I think it'd be better to return the new mask and not make vec_compare
> a reference.  stmt_info shouldn't need to be a reference either (it's
> just a pointer type).
>
> > +{
> > +  if (COMPARISON_CLASS_P (vec_compare))
> > +    {
> > +      tree tmp = make_ssa_name (vec_cmp_type);
> > +      gassign *g = gimple_build_assign (tmp, TREE_CODE (vec_compare),
> > +                                     TREE_OPERAND (vec_compare, 0),
> > +                                     TREE_OPERAND (vec_compare, 1));
> > +      vect_finish_stmt_generation (stmt_info, g, gsi);
> > +      vec_compare = tmp;
> > +    }
> > +
> > +  tree tmp2 = make_ssa_name (vec_cmp_type);
> > +  gassign *g = gimple_build_assign (tmp2, BIT_AND_EXPR, vec_compare, loop_mask);
> > +  vect_finish_stmt_generation (stmt_info, g, gsi);
> > +  vec_compare = tmp2;
> > +}
> > +
> >  /* vectorizable_condition.
> >
> >     Check if STMT_INFO is conditional modify expression that can be vectorized.
> > @@ -9975,6 +10000,36 @@ vectorizable_condition (stmt_vec_info stmt_info, gimple_stmt_iterator *gsi,
> >    /* Handle cond expr.  */
> >    for (j = 0; j < ncopies; j++)
> >      {
> > +      tree loop_mask = NULL_TREE;
> > +
> > +      if (loop_vinfo && LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
> > +     {
> > +       scalar_cond_masked_key cond (cond_expr, ncopies);
> > +          if (loop_vinfo->scalar_cond_masked_set->contains (cond))
>
> Nit: untabified line.
>
> > +         {
> > +           scalar_cond_masked_key cond (cond_expr, ncopies);
> > +           if (loop_vinfo->scalar_cond_masked_set->contains (cond))
>
> This "if" looks redundant -- isn't the condition the same as above?
Oops sorry, probably a copy-paste typo -;)
>
> > +             {
> > +               vec_loop_masks *masks = &LOOP_VINFO_MASKS (loop_vinfo);
> > +               loop_mask = vect_get_loop_mask (gsi, masks, ncopies, vectype, j);
> > +             }
> > +         }
> > +       else
> > +         {
> > +           cond.cond_ops.code
> > +             = invert_tree_comparison (cond.cond_ops.code, true);
>
> Would be better to pass an HONOR_NANS value instead of "true".
>
> > +           if (loop_vinfo->scalar_cond_masked_set->contains (cond))
> > +             {
> > +               vec_loop_masks *masks = &LOOP_VINFO_MASKS (loop_vinfo);
> > +               loop_mask = vect_get_loop_mask (gsi, masks, ncopies, vectype, j);
> > +               std::swap (then_clause, else_clause);
> > +               cond_code = cond.cond_ops.code;
> > +               cond_expr = build2 (cond_code, TREE_TYPE (cond_expr),
> > +                                   then_clause, else_clause);
>
> Rather than do the swap here and build a new tree, I think it would be
> better to set a boolean that indicates that the then and else are swapped.
> Then we can conditionally swap them after:
>
>           vec_then_clause = vec_oprnds2[i];
>           vec_else_clause = vec_oprnds3[i];
>
> > [...]
> > diff --git a/gcc/tree-vectorizer.c b/gcc/tree-vectorizer.c
> > index dc181524744..794e65f0007 100644
> > --- a/gcc/tree-vectorizer.c
> > +++ b/gcc/tree-vectorizer.c
> > @@ -464,6 +464,7 @@ vec_info::vec_info (vec_info::vec_kind kind_in, void *target_cost_data_in,
> >      target_cost_data (target_cost_data_in)
> >  {
> >    stmt_vec_infos.create (50);
> > +  scalar_cond_masked_set = new scalar_cond_masked_set_type ();
> >  }
> >
> >  vec_info::~vec_info ()
> > @@ -476,6 +477,8 @@ vec_info::~vec_info ()
> >
> >    destroy_cost_data (target_cost_data);
> >    free_stmt_vec_infos ();
> > +  delete scalar_cond_masked_set;
> > +  scalar_cond_masked_set = 0;
> >  }
> >
> >  vec_info_shared::vec_info_shared ()
>
> No need to assign null here, since we're at the end of the destructor.
> But maybe scalar_cond_masked_set should be "scalar_cond_masked_set_type"
> rather than "scalar_cond_masked_set_type *", if the object is going to
> have the same lifetime as the vec_info anyway.
>
> Looks good otherwise.  I skipped over the tree_cond_ops bit given
> your comment above that this was temporary.
Thanks for the suggestions, I tried addressing them in the attached patch.
Does it look OK?

With patch, the only following FAIL remains for aarch64-sve.exp:
FAIL: gcc.target/aarch64/sve/cond_unary_2.c -march=armv8.2-a+sve
scan-assembler-times \\tmovprfx\\t 6
which now contains 14.
Should I adjust the test, assuming the change isn't a regression ?

Thanks,
Prathamesh
>
> Thanks,
> Richard

[-- Attachment #2: pr86753-v2-2.diff --]
[-- Type: application/x-patch, Size: 14217 bytes --]

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [SVE] PR86753
  2019-09-05 12:51                                             ` Prathamesh Kulkarni
@ 2019-09-09 11:15                                               ` Richard Sandiford
  2019-09-09 16:37                                                 ` Prathamesh Kulkarni
  0 siblings, 1 reply; 41+ messages in thread
From: Richard Sandiford @ 2019-09-09 11:15 UTC (permalink / raw)
  To: Prathamesh Kulkarni; +Cc: Richard Biener, gcc Patches

Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> With patch, the only following FAIL remains for aarch64-sve.exp:
> FAIL: gcc.target/aarch64/sve/cond_unary_2.c -march=armv8.2-a+sve
> scan-assembler-times \\tmovprfx\\t 6
> which now contains 14.
> Should I adjust the test, assuming the change isn't a regression ?

Well, it is kind-of a regression, but it really just means that the
integer code is now consistent with the floating-point code in having
an unnecessary MOVPRFX.  So I think adjusting the count is fine.
Presumably any future fix for the existing redundant MOVPRFXs will
apply to the new ones as well.

The patch looks good to me, just some very minor nits:

> @@ -8309,11 +8309,12 @@ vect_double_mask_nunits (tree type)
>  
>  /* Record that a fully-masked version of LOOP_VINFO would need MASKS to
>     contain a sequence of NVECTORS masks that each control a vector of type
> -   VECTYPE.  */
> +   VECTYPE. SCALAR_MASK if non-null, represents the mask used for corresponding
> +   load/store stmt.  */

Should be two spaces between sentences.  Maybe:

   VECTYPE.  If SCALAR_MASK is nonnull, the fully-masked loop would AND
   these vector masks with the vector version of SCALAR_MASK.  */

since the mask isn't necessarily for a load or store statement.

> [...]
> @@ -1879,7 +1879,8 @@ static tree permute_vec_elements (tree, tree, tree, stmt_vec_info,
>     says how the load or store is going to be implemented and GROUP_SIZE
>     is the number of load or store statements in the containing group.
>     If the access is a gather load or scatter store, GS_INFO describes
> -   its arguments.
> +   its arguments. SCALAR_MASK is the scalar mask used for corresponding
> +   load or store stmt.

Maybe:

   its arguments.  If the load or store is conditional, SCALAR_MASK is the
   condition under which it occurs.

since SCALAR_MASK can be null here too.

> [...]
> @@ -9975,6 +9978,31 @@ vectorizable_condition (stmt_vec_info stmt_info, gimple_stmt_iterator *gsi,
>    /* Handle cond expr.  */
>    for (j = 0; j < ncopies; j++)
>      {
> +      tree loop_mask = NULL_TREE;
> +      bool swap_cond_operands = false;
> +
> +      if (loop_vinfo && LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
> +	{
> +	  scalar_cond_masked_key cond (cond_expr, ncopies);
> +	  if (loop_vinfo->scalar_cond_masked_set.contains (cond))
> +	    {
> +	      vec_loop_masks *masks = &LOOP_VINFO_MASKS (loop_vinfo);
> +	      loop_mask = vect_get_loop_mask (gsi, masks, ncopies, vectype, j);
> +	    }
> +	  else
> +	    {
> +	      cond.code = invert_tree_comparison (cond.code,
> +						  HONOR_NANS (TREE_TYPE (cond.op0)));

Long line.  Maybe just split it out into a separate assignment:

	      bool honor_nans = HONOR_NANS (TREE_TYPE (cond.op0));
	      cond.code = invert_tree_comparison (cond.code, honor_nans);

> +	      if (loop_vinfo->scalar_cond_masked_set.contains (cond))
> +		{
> +		  vec_loop_masks *masks = &LOOP_VINFO_MASKS (loop_vinfo);
> +		  loop_mask = vect_get_loop_mask (gsi, masks, ncopies, vectype, j);

Long line here too.

> [...]
> @@ -10090,6 +10121,26 @@ vectorizable_condition (stmt_vec_info stmt_info, gimple_stmt_iterator *gsi,
>  		    }
>  		}
>  	    }
> +
> +	  if (loop_mask)
> +	    {
> +	      if (COMPARISON_CLASS_P (vec_compare))
> +		{
> +		  tree tmp = make_ssa_name (vec_cmp_type);
> +		  gassign *g = gimple_build_assign (tmp,
> +						    TREE_CODE (vec_compare),
> +						    TREE_OPERAND (vec_compare, 0),
> +						    TREE_OPERAND (vec_compare, 1));

Two long lines.

> +		  vect_finish_stmt_generation (stmt_info, g, gsi);
> +		  vec_compare = tmp;
> +		}
> +
> +	      tree tmp2 = make_ssa_name (vec_cmp_type);
> +	      gassign *g = gimple_build_assign (tmp2, BIT_AND_EXPR, vec_compare, loop_mask);

Long line here too.

> [...]
> diff --git a/gcc/tree-vectorizer.c b/gcc/tree-vectorizer.c
> index dc181524744..c4b2d8e8647 100644
> --- a/gcc/tree-vectorizer.c
> +++ b/gcc/tree-vectorizer.c
> @@ -1513,3 +1513,39 @@ make_pass_ipa_increase_alignment (gcc::context *ctxt)
>  {
>    return new pass_ipa_increase_alignment (ctxt);
>  }
> +
> +/* If code(T) is comparison op or def of comparison stmt,
> +   extract it's operands.
> +   Else return <NE_EXPR, T, 0>.  */
> +
> +void
> +scalar_cond_masked_key::get_cond_ops_from_tree (tree t) 
> +{
> +  if (TREE_CODE_CLASS (TREE_CODE (t)) == tcc_comparison)
> +    {
> +      this->code = TREE_CODE (t);
> +      this->op0 = TREE_OPERAND (t, 0);
> +      this->op1 = TREE_OPERAND (t, 1);
> +      return;
> +    }
> +
> +  if (TREE_CODE (t) == SSA_NAME)
> +    {
> +      gassign *stmt = dyn_cast<gassign *> (SSA_NAME_DEF_STMT (t));
> +      if (stmt)
> +        {

Might as well do this as:

  if (TREE_CODE (t) == SSA_NAME)
    if (gassign *stmt = dyn_cast<gassign *> (SSA_NAME_DEF_STMT (t)))
      {

The patch (as hoped) introduces some XPASSes:

XPASS: gcc.target/aarch64/sve/cond_cnot_2.c scan-assembler-not \\tsel\\t
XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmge\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, #0\\.0\\n 21
XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmge\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, z[0-9]+\\.d\\n 42
XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmge\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, #0\\.0\\n 15
XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmge\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, z[0-9]+\\.s\\n 30
XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmgt\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, #0\\.0\\n 21
XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmgt\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, z[0-9]+\\.d\\n 42
XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmgt\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, #0\\.0\\n 15
XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmgt\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, z[0-9]+\\.s\\n 30
XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmle\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, #0\\.0\\n 21
XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmle\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, z[0-9]+\\.d\\n 42
XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmle\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, #0\\.0\\n 15
XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmle\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, z[0-9]+\\.s\\n 30
XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmlt\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, #0\\.0\\n 21
XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmlt\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, z[0-9]+\\.d\\n 42
XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmlt\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, #0\\.0\\n 15
XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmlt\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, z[0-9]+\\.s\\n 30
XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmuo\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, z[0-9]+\\.d\\n 252
XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmuo\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, z[0-9]+\\.s\\n 180
XPASS: gcc.target/aarch64/sve/vcond_5.c scan-assembler-times \\tfcmge\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, #0\\.0 21
XPASS: gcc.target/aarch64/sve/vcond_5.c scan-assembler-times \\tfcmge\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, z[0-9]+\\.d 42
XPASS: gcc.target/aarch64/sve/vcond_5.c scan-assembler-times \\tfcmge\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, #0\\.0 15
XPASS: gcc.target/aarch64/sve/vcond_5.c scan-assembler-times \\tfcmge\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, z[0-9]+\\.s 30
XPASS: gcc.target/aarch64/sve/vcond_5.c scan-assembler-times \\tfcmle\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, #0\\.0 21
XPASS: gcc.target/aarch64/sve/vcond_5.c scan-assembler-times \\tfcmle\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, z[0-9]+\\.d 42
XPASS: gcc.target/aarch64/sve/vcond_5.c scan-assembler-times \\tfcmle\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, #0\\.0 15
XPASS: gcc.target/aarch64/sve/vcond_5.c scan-assembler-times \\tfcmle\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, z[0-9]+\\.s 30

Could you remove the associated xfails (and comments above them where
appropriate)?

OK with those changes from my POV, but please give Richi a day or so
to object.

Thanks for doing this.

Richard

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [SVE] PR86753
  2019-09-09 11:15                                               ` Richard Sandiford
@ 2019-09-09 16:37                                                 ` Prathamesh Kulkarni
  2019-09-09 20:56                                                   ` Prathamesh Kulkarni
  2019-09-16 15:54                                                   ` Prathamesh Kulkarni
  0 siblings, 2 replies; 41+ messages in thread
From: Prathamesh Kulkarni @ 2019-09-09 16:37 UTC (permalink / raw)
  To: Richard Sandiford; +Cc: Richard Biener, gcc Patches

[-- Attachment #1: Type: text/plain, Size: 9303 bytes --]

On Mon, 9 Sep 2019 at 16:45, Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> > With patch, the only following FAIL remains for aarch64-sve.exp:
> > FAIL: gcc.target/aarch64/sve/cond_unary_2.c -march=armv8.2-a+sve
> > scan-assembler-times \\tmovprfx\\t 6
> > which now contains 14.
> > Should I adjust the test, assuming the change isn't a regression ?
>
> Well, it is kind-of a regression, but it really just means that the
> integer code is now consistent with the floating-point code in having
> an unnecessary MOVPRFX.  So I think adjusting the count is fine.
> Presumably any future fix for the existing redundant MOVPRFXs will
> apply to the new ones as well.
>
> The patch looks good to me, just some very minor nits:
>
> > @@ -8309,11 +8309,12 @@ vect_double_mask_nunits (tree type)
> >
> >  /* Record that a fully-masked version of LOOP_VINFO would need MASKS to
> >     contain a sequence of NVECTORS masks that each control a vector of type
> > -   VECTYPE.  */
> > +   VECTYPE. SCALAR_MASK if non-null, represents the mask used for corresponding
> > +   load/store stmt.  */
>
> Should be two spaces between sentences.  Maybe:
>
>    VECTYPE.  If SCALAR_MASK is nonnull, the fully-masked loop would AND
>    these vector masks with the vector version of SCALAR_MASK.  */
>
> since the mask isn't necessarily for a load or store statement.
>
> > [...]
> > @@ -1879,7 +1879,8 @@ static tree permute_vec_elements (tree, tree, tree, stmt_vec_info,
> >     says how the load or store is going to be implemented and GROUP_SIZE
> >     is the number of load or store statements in the containing group.
> >     If the access is a gather load or scatter store, GS_INFO describes
> > -   its arguments.
> > +   its arguments. SCALAR_MASK is the scalar mask used for corresponding
> > +   load or store stmt.
>
> Maybe:
>
>    its arguments.  If the load or store is conditional, SCALAR_MASK is the
>    condition under which it occurs.
>
> since SCALAR_MASK can be null here too.
>
> > [...]
> > @@ -9975,6 +9978,31 @@ vectorizable_condition (stmt_vec_info stmt_info, gimple_stmt_iterator *gsi,
> >    /* Handle cond expr.  */
> >    for (j = 0; j < ncopies; j++)
> >      {
> > +      tree loop_mask = NULL_TREE;
> > +      bool swap_cond_operands = false;
> > +
> > +      if (loop_vinfo && LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
> > +     {
> > +       scalar_cond_masked_key cond (cond_expr, ncopies);
> > +       if (loop_vinfo->scalar_cond_masked_set.contains (cond))
> > +         {
> > +           vec_loop_masks *masks = &LOOP_VINFO_MASKS (loop_vinfo);
> > +           loop_mask = vect_get_loop_mask (gsi, masks, ncopies, vectype, j);
> > +         }
> > +       else
> > +         {
> > +           cond.code = invert_tree_comparison (cond.code,
> > +                                               HONOR_NANS (TREE_TYPE (cond.op0)));
>
> Long line.  Maybe just split it out into a separate assignment:
>
>               bool honor_nans = HONOR_NANS (TREE_TYPE (cond.op0));
>               cond.code = invert_tree_comparison (cond.code, honor_nans);
>
> > +           if (loop_vinfo->scalar_cond_masked_set.contains (cond))
> > +             {
> > +               vec_loop_masks *masks = &LOOP_VINFO_MASKS (loop_vinfo);
> > +               loop_mask = vect_get_loop_mask (gsi, masks, ncopies, vectype, j);
>
> Long line here too.
>
> > [...]
> > @@ -10090,6 +10121,26 @@ vectorizable_condition (stmt_vec_info stmt_info, gimple_stmt_iterator *gsi,
> >                   }
> >               }
> >           }
> > +
> > +       if (loop_mask)
> > +         {
> > +           if (COMPARISON_CLASS_P (vec_compare))
> > +             {
> > +               tree tmp = make_ssa_name (vec_cmp_type);
> > +               gassign *g = gimple_build_assign (tmp,
> > +                                                 TREE_CODE (vec_compare),
> > +                                                 TREE_OPERAND (vec_compare, 0),
> > +                                                TREE_OPERAND (vec_compare, 1));
>
> Two long lines.
>
> > +               vect_finish_stmt_generation (stmt_info, g, gsi);
> > +               vec_compare = tmp;
> > +             }
> > +
> > +           tree tmp2 = make_ssa_name (vec_cmp_type);
> > +           gassign *g = gimple_build_assign (tmp2, BIT_AND_EXPR, vec_compare, loop_mask);
>
> Long line here too.
>
> > [...]
> > diff --git a/gcc/tree-vectorizer.c b/gcc/tree-vectorizer.c
> > index dc181524744..c4b2d8e8647 100644
> > --- a/gcc/tree-vectorizer.c
> > +++ b/gcc/tree-vectorizer.c
> > @@ -1513,3 +1513,39 @@ make_pass_ipa_increase_alignment (gcc::context *ctxt)
> >  {
> >    return new pass_ipa_increase_alignment (ctxt);
> >  }
> > +
> > +/* If code(T) is comparison op or def of comparison stmt,
> > +   extract it's operands.
> > +   Else return <NE_EXPR, T, 0>.  */
> > +
> > +void
> > +scalar_cond_masked_key::get_cond_ops_from_tree (tree t)
> > +{
> > +  if (TREE_CODE_CLASS (TREE_CODE (t)) == tcc_comparison)
> > +    {
> > +      this->code = TREE_CODE (t);
> > +      this->op0 = TREE_OPERAND (t, 0);
> > +      this->op1 = TREE_OPERAND (t, 1);
> > +      return;
> > +    }
> > +
> > +  if (TREE_CODE (t) == SSA_NAME)
> > +    {
> > +      gassign *stmt = dyn_cast<gassign *> (SSA_NAME_DEF_STMT (t));
> > +      if (stmt)
> > +        {
>
> Might as well do this as:
>
>   if (TREE_CODE (t) == SSA_NAME)
>     if (gassign *stmt = dyn_cast<gassign *> (SSA_NAME_DEF_STMT (t)))
>       {
>
> The patch (as hoped) introduces some XPASSes:
>
> XPASS: gcc.target/aarch64/sve/cond_cnot_2.c scan-assembler-not \\tsel\\t
> XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmge\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, #0\\.0\\n 21
> XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmge\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, z[0-9]+\\.d\\n 42
> XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmge\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, #0\\.0\\n 15
> XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmge\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, z[0-9]+\\.s\\n 30
> XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmgt\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, #0\\.0\\n 21
> XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmgt\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, z[0-9]+\\.d\\n 42
> XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmgt\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, #0\\.0\\n 15
> XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmgt\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, z[0-9]+\\.s\\n 30
> XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmle\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, #0\\.0\\n 21
> XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmle\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, z[0-9]+\\.d\\n 42
> XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmle\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, #0\\.0\\n 15
> XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmle\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, z[0-9]+\\.s\\n 30
> XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmlt\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, #0\\.0\\n 21
> XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmlt\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, z[0-9]+\\.d\\n 42
> XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmlt\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, #0\\.0\\n 15
> XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmlt\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, z[0-9]+\\.s\\n 30
> XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmuo\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, z[0-9]+\\.d\\n 252
> XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmuo\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, z[0-9]+\\.s\\n 180
> XPASS: gcc.target/aarch64/sve/vcond_5.c scan-assembler-times \\tfcmge\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, #0\\.0 21
> XPASS: gcc.target/aarch64/sve/vcond_5.c scan-assembler-times \\tfcmge\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, z[0-9]+\\.d 42
> XPASS: gcc.target/aarch64/sve/vcond_5.c scan-assembler-times \\tfcmge\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, #0\\.0 15
> XPASS: gcc.target/aarch64/sve/vcond_5.c scan-assembler-times \\tfcmge\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, z[0-9]+\\.s 30
> XPASS: gcc.target/aarch64/sve/vcond_5.c scan-assembler-times \\tfcmle\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, #0\\.0 21
> XPASS: gcc.target/aarch64/sve/vcond_5.c scan-assembler-times \\tfcmle\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, z[0-9]+\\.d 42
> XPASS: gcc.target/aarch64/sve/vcond_5.c scan-assembler-times \\tfcmle\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, #0\\.0 15
> XPASS: gcc.target/aarch64/sve/vcond_5.c scan-assembler-times \\tfcmle\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, z[0-9]+\\.s 30
>
> Could you remove the associated xfails (and comments above them where
> appropriate)?
>
> OK with those changes from my POV, but please give Richi a day or so
> to object.
>
> Thanks for doing this.
Thanks for the suggestions; I have updated the patch accordingly.
Bootstrap+test is in progress on x86_64-unknown-linux-gnu and aarch64-linux-gnu.
Richi, does the patch look OK to you?

Thanks,
Prathamesh
>
> Richard

[-- Attachment #2: pr86753-v2-3.diff --]
[-- Type: application/x-patch, Size: 24113 bytes --]

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [SVE] PR86753
  2019-09-09 16:37                                                 ` Prathamesh Kulkarni
@ 2019-09-09 20:56                                                   ` Prathamesh Kulkarni
  2019-09-10 12:20                                                     ` Richard Sandiford
  2019-09-10 13:35                                                     ` Matthew Malcomson
  2019-09-16 15:54                                                   ` Prathamesh Kulkarni
  1 sibling, 2 replies; 41+ messages in thread
From: Prathamesh Kulkarni @ 2019-09-09 20:56 UTC (permalink / raw)
  To: Richard Sandiford; +Cc: Richard Biener, gcc Patches

On Mon, 9 Sep 2019 at 22:06, Prathamesh Kulkarni
<prathamesh.kulkarni@linaro.org> wrote:
>
> On Mon, 9 Sep 2019 at 16:45, Richard Sandiford
> <richard.sandiford@arm.com> wrote:
> >
> > Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> > > With patch, the only following FAIL remains for aarch64-sve.exp:
> > > FAIL: gcc.target/aarch64/sve/cond_unary_2.c -march=armv8.2-a+sve
> > > scan-assembler-times \\tmovprfx\\t 6
> > > which now contains 14.
> > > Should I adjust the test, assuming the change isn't a regression ?
> >
> > Well, it is kind-of a regression, but it really just means that the
> > integer code is now consistent with the floating-point code in having
> > an unnecessary MOVPRFX.  So I think adjusting the count is fine.
> > Presumably any future fix for the existing redundant MOVPRFXs will
> > apply to the new ones as well.
> >
> > The patch looks good to me, just some very minor nits:
> >
> > > @@ -8309,11 +8309,12 @@ vect_double_mask_nunits (tree type)
> > >
> > >  /* Record that a fully-masked version of LOOP_VINFO would need MASKS to
> > >     contain a sequence of NVECTORS masks that each control a vector of type
> > > -   VECTYPE.  */
> > > +   VECTYPE. SCALAR_MASK if non-null, represents the mask used for corresponding
> > > +   load/store stmt.  */
> >
> > Should be two spaces between sentences.  Maybe:
> >
> >    VECTYPE.  If SCALAR_MASK is nonnull, the fully-masked loop would AND
> >    these vector masks with the vector version of SCALAR_MASK.  */
> >
> > since the mask isn't necessarily for a load or store statement.
> >
> > > [...]
> > > @@ -1879,7 +1879,8 @@ static tree permute_vec_elements (tree, tree, tree, stmt_vec_info,
> > >     says how the load or store is going to be implemented and GROUP_SIZE
> > >     is the number of load or store statements in the containing group.
> > >     If the access is a gather load or scatter store, GS_INFO describes
> > > -   its arguments.
> > > +   its arguments. SCALAR_MASK is the scalar mask used for corresponding
> > > +   load or store stmt.
> >
> > Maybe:
> >
> >    its arguments.  If the load or store is conditional, SCALAR_MASK is the
> >    condition under which it occurs.
> >
> > since SCALAR_MASK can be null here too.
> >
> > > [...]
> > > @@ -9975,6 +9978,31 @@ vectorizable_condition (stmt_vec_info stmt_info, gimple_stmt_iterator *gsi,
> > >    /* Handle cond expr.  */
> > >    for (j = 0; j < ncopies; j++)
> > >      {
> > > +      tree loop_mask = NULL_TREE;
> > > +      bool swap_cond_operands = false;
> > > +
> > > +      if (loop_vinfo && LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
> > > +     {
> > > +       scalar_cond_masked_key cond (cond_expr, ncopies);
> > > +       if (loop_vinfo->scalar_cond_masked_set.contains (cond))
> > > +         {
> > > +           vec_loop_masks *masks = &LOOP_VINFO_MASKS (loop_vinfo);
> > > +           loop_mask = vect_get_loop_mask (gsi, masks, ncopies, vectype, j);
> > > +         }
> > > +       else
> > > +         {
> > > +           cond.code = invert_tree_comparison (cond.code,
> > > +                                               HONOR_NANS (TREE_TYPE (cond.op0)));
> >
> > Long line.  Maybe just split it out into a separate assignment:
> >
> >               bool honor_nans = HONOR_NANS (TREE_TYPE (cond.op0));
> >               cond.code = invert_tree_comparison (cond.code, honor_nans);
> >
> > > +           if (loop_vinfo->scalar_cond_masked_set.contains (cond))
> > > +             {
> > > +               vec_loop_masks *masks = &LOOP_VINFO_MASKS (loop_vinfo);
> > > +               loop_mask = vect_get_loop_mask (gsi, masks, ncopies, vectype, j);
> >
> > Long line here too.
> >
> > > [...]
> > > @@ -10090,6 +10121,26 @@ vectorizable_condition (stmt_vec_info stmt_info, gimple_stmt_iterator *gsi,
> > >                   }
> > >               }
> > >           }
> > > +
> > > +       if (loop_mask)
> > > +         {
> > > +           if (COMPARISON_CLASS_P (vec_compare))
> > > +             {
> > > +               tree tmp = make_ssa_name (vec_cmp_type);
> > > +               gassign *g = gimple_build_assign (tmp,
> > > +                                                 TREE_CODE (vec_compare),
> > > +                                                 TREE_OPERAND (vec_compare, 0),
> > > +                                                 TREE_OPERAND (vec_compare, 1));
> >
> > Two long lines.
> >
> > > +               vect_finish_stmt_generation (stmt_info, g, gsi);
> > > +               vec_compare = tmp;
> > > +             }
> > > +
> > > +           tree tmp2 = make_ssa_name (vec_cmp_type);
> > > +           gassign *g = gimple_build_assign (tmp2, BIT_AND_EXPR, vec_compare, loop_mask);
> >
> > Long line here too.
> >
> > > [...]
> > > diff --git a/gcc/tree-vectorizer.c b/gcc/tree-vectorizer.c
> > > index dc181524744..c4b2d8e8647 100644
> > > --- a/gcc/tree-vectorizer.c
> > > +++ b/gcc/tree-vectorizer.c
> > > @@ -1513,3 +1513,39 @@ make_pass_ipa_increase_alignment (gcc::context *ctxt)
> > >  {
> > >    return new pass_ipa_increase_alignment (ctxt);
> > >  }
> > > +
> > > +/* If code(T) is comparison op or def of comparison stmt,
> > > +   extract it's operands.
> > > +   Else return <NE_EXPR, T, 0>.  */
> > > +
> > > +void
> > > +scalar_cond_masked_key::get_cond_ops_from_tree (tree t)
> > > +{
> > > +  if (TREE_CODE_CLASS (TREE_CODE (t)) == tcc_comparison)
> > > +    {
> > > +      this->code = TREE_CODE (t);
> > > +      this->op0 = TREE_OPERAND (t, 0);
> > > +      this->op1 = TREE_OPERAND (t, 1);
> > > +      return;
> > > +    }
> > > +
> > > +  if (TREE_CODE (t) == SSA_NAME)
> > > +    {
> > > +      gassign *stmt = dyn_cast<gassign *> (SSA_NAME_DEF_STMT (t));
> > > +      if (stmt)
> > > +        {
> >
> > Might as well do this as:
> >
> >   if (TREE_CODE (t) == SSA_NAME)
> >     if (gassign *stmt = dyn_cast<gassign *> (SSA_NAME_DEF_STMT (t)))
> >       {
> >
> > The patch (as hoped) introduces some XPASSes:
> >
> > XPASS: gcc.target/aarch64/sve/cond_cnot_2.c scan-assembler-not \\tsel\\t
> > XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmge\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, #0\\.0\\n 21
> > XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmge\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, z[0-9]+\\.d\\n 42
> > XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmge\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, #0\\.0\\n 15
> > XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmge\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, z[0-9]+\\.s\\n 30
> > XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmgt\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, #0\\.0\\n 21
> > XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmgt\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, z[0-9]+\\.d\\n 42
> > XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmgt\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, #0\\.0\\n 15
> > XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmgt\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, z[0-9]+\\.s\\n 30
> > XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmle\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, #0\\.0\\n 21
> > XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmle\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, z[0-9]+\\.d\\n 42
> > XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmle\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, #0\\.0\\n 15
> > XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmle\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, z[0-9]+\\.s\\n 30
> > XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmlt\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, #0\\.0\\n 21
> > XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmlt\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, z[0-9]+\\.d\\n 42
> > XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmlt\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, #0\\.0\\n 15
> > XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmlt\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, z[0-9]+\\.s\\n 30
> > XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmuo\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, z[0-9]+\\.d\\n 252
> > XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmuo\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, z[0-9]+\\.s\\n 180
> > XPASS: gcc.target/aarch64/sve/vcond_5.c scan-assembler-times \\tfcmge\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, #0\\.0 21
> > XPASS: gcc.target/aarch64/sve/vcond_5.c scan-assembler-times \\tfcmge\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, z[0-9]+\\.d 42
> > XPASS: gcc.target/aarch64/sve/vcond_5.c scan-assembler-times \\tfcmge\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, #0\\.0 15
> > XPASS: gcc.target/aarch64/sve/vcond_5.c scan-assembler-times \\tfcmge\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, z[0-9]+\\.s 30
> > XPASS: gcc.target/aarch64/sve/vcond_5.c scan-assembler-times \\tfcmle\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, #0\\.0 21
> > XPASS: gcc.target/aarch64/sve/vcond_5.c scan-assembler-times \\tfcmle\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, z[0-9]+\\.d 42
> > XPASS: gcc.target/aarch64/sve/vcond_5.c scan-assembler-times \\tfcmle\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, #0\\.0 15
> > XPASS: gcc.target/aarch64/sve/vcond_5.c scan-assembler-times \\tfcmle\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, z[0-9]+\\.s 30
> >
> > Could you remove the associated xfails (and comments above them where
> > appropriate)?
> >
> > OK with those changes from my POV, but please give Richi a day or so
> > to object.
> >
> > Thanks for doing this.
> Thanks for the suggestions, I have updated the patch accordingly.
> Boostrap+test in progress on x86_64-unknown-linux-gnu and aarch64-linux-gnu.
> Richi, does the patch look OK to you ?
Hi,
Bootstrap+test passes for x86_64-unknown-linux-gnu and aarch64-linux-gnu.
On x86_64, there's a "strange" failure of c-c++-common/builtins.c; the log shows:

/home/prathamesh.kulkarni/gnu-toolchain/gcc/pr86753-v2-3/gcc/gcc/test
FAIL: c-c++-common/builtins.c  -Wc++-compat  (test for excess errors)
Excess errors:
/home/prathamesh.kulkarni/gnu-toolchain/gcc/pr86753-v2-3/gcc/gcc/test

This shouldn't really happen, since the test doesn't seem relevant to the
patch and only passes -O2, which shouldn't enable the vectorizer? Manually
testing it results in PASS with:
make check-gcc RUNTESTFLAGS="dg.exp=builtins.c"
Would it be OK to ignore the FAIL during reg-test?

Thanks,
Prathamesh
>
> Thanks,
> Prathamesh
> >
> > Richard

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [SVE] PR86753
  2019-09-09 20:56                                                   ` Prathamesh Kulkarni
@ 2019-09-10 12:20                                                     ` Richard Sandiford
  2019-09-10 13:35                                                     ` Matthew Malcomson
  1 sibling, 0 replies; 41+ messages in thread
From: Richard Sandiford @ 2019-09-10 12:20 UTC (permalink / raw)
  To: Prathamesh Kulkarni; +Cc: Richard Biener, gcc Patches

Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> On Mon, 9 Sep 2019 at 22:06, Prathamesh Kulkarni
> <prathamesh.kulkarni@linaro.org> wrote:
>>
>> On Mon, 9 Sep 2019 at 16:45, Richard Sandiford
>> <richard.sandiford@arm.com> wrote:
>> >
>> > Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
>> > > With patch, the only following FAIL remains for aarch64-sve.exp:
>> > > FAIL: gcc.target/aarch64/sve/cond_unary_2.c -march=armv8.2-a+sve
>> > > scan-assembler-times \\tmovprfx\\t 6
>> > > which now contains 14.
>> > > Should I adjust the test, assuming the change isn't a regression ?
>> >
>> > Well, it is kind-of a regression, but it really just means that the
>> > integer code is now consistent with the floating-point code in having
>> > an unnecessary MOVPRFX.  So I think adjusting the count is fine.
>> > Presumably any future fix for the existing redundant MOVPRFXs will
>> > apply to the new ones as well.
>> >
>> > The patch looks good to me, just some very minor nits:
>> >
>> > > @@ -8309,11 +8309,12 @@ vect_double_mask_nunits (tree type)
>> > >
>> > >  /* Record that a fully-masked version of LOOP_VINFO would need MASKS to
>> > >     contain a sequence of NVECTORS masks that each control a vector of type
>> > > -   VECTYPE.  */
>> > > +   VECTYPE. SCALAR_MASK if non-null, represents the mask used for corresponding
>> > > +   load/store stmt.  */
>> >
>> > Should be two spaces between sentences.  Maybe:
>> >
>> >    VECTYPE.  If SCALAR_MASK is nonnull, the fully-masked loop would AND
>> >    these vector masks with the vector version of SCALAR_MASK.  */
>> >
>> > since the mask isn't necessarily for a load or store statement.
>> >
>> > > [...]
>> > > @@ -1879,7 +1879,8 @@ static tree permute_vec_elements (tree, tree, tree, stmt_vec_info,
>> > >     says how the load or store is going to be implemented and GROUP_SIZE
>> > >     is the number of load or store statements in the containing group.
>> > >     If the access is a gather load or scatter store, GS_INFO describes
>> > > -   its arguments.
>> > > +   its arguments. SCALAR_MASK is the scalar mask used for corresponding
>> > > +   load or store stmt.
>> >
>> > Maybe:
>> >
>> >    its arguments.  If the load or store is conditional, SCALAR_MASK is the
>> >    condition under which it occurs.
>> >
>> > since SCALAR_MASK can be null here too.
>> >
>> > > [...]
>> > > @@ -9975,6 +9978,31 @@ vectorizable_condition (stmt_vec_info stmt_info, gimple_stmt_iterator *gsi,
>> > >    /* Handle cond expr.  */
>> > >    for (j = 0; j < ncopies; j++)
>> > >      {
>> > > +      tree loop_mask = NULL_TREE;
>> > > +      bool swap_cond_operands = false;
>> > > +
>> > > +      if (loop_vinfo && LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
>> > > +     {
>> > > +       scalar_cond_masked_key cond (cond_expr, ncopies);
>> > > +       if (loop_vinfo->scalar_cond_masked_set.contains (cond))
>> > > +         {
>> > > +           vec_loop_masks *masks = &LOOP_VINFO_MASKS (loop_vinfo);
>> > > +           loop_mask = vect_get_loop_mask (gsi, masks, ncopies, vectype, j);
>> > > +         }
>> > > +       else
>> > > +         {
>> > > +           cond.code = invert_tree_comparison (cond.code,
>> > > +                                               HONOR_NANS (TREE_TYPE (cond.op0)));
>> >
>> > Long line.  Maybe just split it out into a separate assignment:
>> >
>> >               bool honor_nans = HONOR_NANS (TREE_TYPE (cond.op0));
>> >               cond.code = invert_tree_comparison (cond.code, honor_nans);
>> >
>> > > +           if (loop_vinfo->scalar_cond_masked_set.contains (cond))
>> > > +             {
>> > > +               vec_loop_masks *masks = &LOOP_VINFO_MASKS (loop_vinfo);
>> > > +               loop_mask = vect_get_loop_mask (gsi, masks, ncopies, vectype, j);
>> >
>> > Long line here too.
>> >
>> > > [...]
>> > > @@ -10090,6 +10121,26 @@ vectorizable_condition (stmt_vec_info stmt_info, gimple_stmt_iterator *gsi,
>> > >                   }
>> > >               }
>> > >           }
>> > > +
>> > > +       if (loop_mask)
>> > > +         {
>> > > +           if (COMPARISON_CLASS_P (vec_compare))
>> > > +             {
>> > > +               tree tmp = make_ssa_name (vec_cmp_type);
>> > > +               gassign *g = gimple_build_assign (tmp,
>> > > +                                                 TREE_CODE (vec_compare),
>> > > +                                                 TREE_OPERAND (vec_compare, 0),
>> > > +                                                 TREE_OPERAND (vec_compare, 1));
>> >
>> > Two long lines.
>> >
>> > > +               vect_finish_stmt_generation (stmt_info, g, gsi);
>> > > +               vec_compare = tmp;
>> > > +             }
>> > > +
>> > > +           tree tmp2 = make_ssa_name (vec_cmp_type);
>> > > +           gassign *g = gimple_build_assign (tmp2, BIT_AND_EXPR, vec_compare, loop_mask);
>> >
>> > Long line here too.
>> >
>> > > [...]
>> > > diff --git a/gcc/tree-vectorizer.c b/gcc/tree-vectorizer.c
>> > > index dc181524744..c4b2d8e8647 100644
>> > > --- a/gcc/tree-vectorizer.c
>> > > +++ b/gcc/tree-vectorizer.c
>> > > @@ -1513,3 +1513,39 @@ make_pass_ipa_increase_alignment (gcc::context *ctxt)
>> > >  {
>> > >    return new pass_ipa_increase_alignment (ctxt);
>> > >  }
>> > > +
>> > > +/* If code(T) is comparison op or def of comparison stmt,
>> > > +   extract it's operands.
>> > > +   Else return <NE_EXPR, T, 0>.  */
>> > > +
>> > > +void
>> > > +scalar_cond_masked_key::get_cond_ops_from_tree (tree t)
>> > > +{
>> > > +  if (TREE_CODE_CLASS (TREE_CODE (t)) == tcc_comparison)
>> > > +    {
>> > > +      this->code = TREE_CODE (t);
>> > > +      this->op0 = TREE_OPERAND (t, 0);
>> > > +      this->op1 = TREE_OPERAND (t, 1);
>> > > +      return;
>> > > +    }
>> > > +
>> > > +  if (TREE_CODE (t) == SSA_NAME)
>> > > +    {
>> > > +      gassign *stmt = dyn_cast<gassign *> (SSA_NAME_DEF_STMT (t));
>> > > +      if (stmt)
>> > > +        {
>> >
>> > Might as well do this as:
>> >
>> >   if (TREE_CODE (t) == SSA_NAME)
>> >     if (gassign *stmt = dyn_cast<gassign *> (SSA_NAME_DEF_STMT (t)))
>> >       {
>> >
>> > The patch (as hoped) introduces some XPASSes:
>> >
>> > XPASS: gcc.target/aarch64/sve/cond_cnot_2.c scan-assembler-not \\tsel\\t
>> > XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmge\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, #0\\.0\\n 21
>> > XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmge\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, z[0-9]+\\.d\\n 42
>> > XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmge\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, #0\\.0\\n 15
>> > XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmge\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, z[0-9]+\\.s\\n 30
>> > XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmgt\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, #0\\.0\\n 21
>> > XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmgt\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, z[0-9]+\\.d\\n 42
>> > XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmgt\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, #0\\.0\\n 15
>> > XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmgt\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, z[0-9]+\\.s\\n 30
>> > XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmle\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, #0\\.0\\n 21
>> > XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmle\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, z[0-9]+\\.d\\n 42
>> > XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmle\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, #0\\.0\\n 15
>> > XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmle\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, z[0-9]+\\.s\\n 30
>> > XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmlt\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, #0\\.0\\n 21
>> > XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmlt\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, z[0-9]+\\.d\\n 42
>> > XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmlt\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, #0\\.0\\n 15
>> > XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmlt\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, z[0-9]+\\.s\\n 30
>> > XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmuo\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, z[0-9]+\\.d\\n 252
>> > XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmuo\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, z[0-9]+\\.s\\n 180
>> > XPASS: gcc.target/aarch64/sve/vcond_5.c scan-assembler-times \\tfcmge\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, #0\\.0 21
>> > XPASS: gcc.target/aarch64/sve/vcond_5.c scan-assembler-times \\tfcmge\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, z[0-9]+\\.d 42
>> > XPASS: gcc.target/aarch64/sve/vcond_5.c scan-assembler-times \\tfcmge\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, #0\\.0 15
>> > XPASS: gcc.target/aarch64/sve/vcond_5.c scan-assembler-times \\tfcmge\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, z[0-9]+\\.s 30
>> > XPASS: gcc.target/aarch64/sve/vcond_5.c scan-assembler-times \\tfcmle\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, #0\\.0 21
>> > XPASS: gcc.target/aarch64/sve/vcond_5.c scan-assembler-times \\tfcmle\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, z[0-9]+\\.d 42
>> > XPASS: gcc.target/aarch64/sve/vcond_5.c scan-assembler-times \\tfcmle\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, #0\\.0 15
>> > XPASS: gcc.target/aarch64/sve/vcond_5.c scan-assembler-times \\tfcmle\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, z[0-9]+\\.s 30
>> >
>> > Could you remove the associated xfails (and comments above them where
>> > appropriate)?
>> >
>> > OK with those changes from my POV, but please give Richi a day or so
>> > to object.
>> >
>> > Thanks for doing this.
>> Thanks for the suggestions, I have updated the patch accordingly.
>> Boostrap+test in progress on x86_64-unknown-linux-gnu and aarch64-linux-gnu.
>> Richi, does the patch look OK to you ?
> Hi,
> Bootstrap+test passes for x86_64-unknown-linux-gnu and aarch64-linux-gnu.
> On x86_64, there's a "strange" failure of c-c++-common/builtins.c, log shows:
>
> /home/prathamesh.kulkarni/gnu-toolchain/gcc/pr86753-v2-3/gcc/gcc/test
> FAIL: c-c++-common/builtins.c  -Wc++-compat  (test for excess errors)
> Excess errors:
> /home/prathamesh.kulkarni/gnu-toolchain/gcc/pr86753-v2-3/gcc/gcc/test
>
> Which shouldn't really happen since the test doesn't seem relevant to patch,
> and only passes -O2 which shouldn't enable the vectorizer ? Manually
> testing it results in PASS with:
> make check-gcc RUNTESTFLAGS="dg.exp=builtins.c"
> Would it be OK to ignore the FAIL during reg-test ?

Looks like the lines got truncated by the cut-&-paste.  What was
the excess error?

Richard

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [SVE] PR86753
  2019-09-09 20:56                                                   ` Prathamesh Kulkarni
  2019-09-10 12:20                                                     ` Richard Sandiford
@ 2019-09-10 13:35                                                     ` Matthew Malcomson
  2019-09-10 21:36                                                       ` Prathamesh Kulkarni
  1 sibling, 1 reply; 41+ messages in thread
From: Matthew Malcomson @ 2019-09-10 13:35 UTC (permalink / raw)
  To: Prathamesh Kulkarni, gcc-patches; +Cc: Richard Sandiford, Richard Biener, nd

Resending because I forgot to avoid the disclaimer and hence my email 
didn't go to the gcc-patches list.



On 09/09/19 21:55, Prathamesh Kulkarni wrote:
> On Mon, 9 Sep 2019 at 22:06, Prathamesh Kulkarni
> <prathamesh.kulkarni@linaro.org> wrote:
>>
>> On Mon, 9 Sep 2019 at 16:45, Richard Sandiford
>> <richard.sandiford@arm.com> wrote:
>>>
>>>
>>> Thanks for doing this.
>> Thanks for the suggestions, I have updated the patch accordingly.
>> Boostrap+test in progress on x86_64-unknown-linux-gnu and aarch64-linux-gnu.
>> Richi, does the patch look OK to you ?
> Hi,
> Bootstrap+test passes for x86_64-unknown-linux-gnu and aarch64-linux-gnu.
> On x86_64, there's a "strange" failure of c-c++-common/builtins.c, log shows:
> 
> /home/prathamesh.kulkarni/gnu-toolchain/gcc/pr86753-v2-3/gcc/gcc/test
> FAIL: c-c++-common/builtins.c  -Wc++-compat  (test for excess errors)
> Excess errors:
> /home/prathamesh.kulkarni/gnu-toolchain/gcc/pr86753-v2-3/gcc/gcc/test
> 

Just FYI, I have seen this error come from a restriction in DejaGNU itself.
https://gcc.gnu.org/ml/gcc/2019-05/msg00066.html

The reply to that email mentions that this restriction was removed in 
later DejaGNU versions.
https://gcc.gnu.org/ml/gcc/2019-05/msg00070.html

If you see the snippet mentioned in the first email (don't continue if
you've already read greater than 512000 bytes of output) in your DejaGNU
install (the remote.exp file), and the error messages from the
"-Wc++-compat" test are greater than 512000 bytes, then the problem is
likely due to DejaGNU rather than your code.

If that is the case, then a test is to remove the `if` mentioned in the
first email and re-try the regression test.

(i.e. replace

         if { [string length $output] < 512000 } {
             exp_continue -continue_timer
         }

with

             exp_continue -continue_timer

in the "local_exec" procedure from $DEJAGNU_INSTALL/remote.exp)


> Which shouldn't really happen since the test doesn't seem relevant to patch,
> and only passes -O2 which shouldn't enable the vectorizer ? Manually
> testing it results in PASS with:
> make check-gcc RUNTESTFLAGS="dg.exp=builtins.c"
> Would it be OK to ignore the FAIL during reg-test ?
> 

This also matches the symptoms of this DejaGNU restriction -- it only
comes up when the OS read did not return all of the test's output, and
that happens a lot more when there are many parallel tests running.

> Thanks,
> Prathamesh
>>
>> Thanks,
>> Prathamesh
>>>
>>> Richard


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [SVE] PR86753
  2019-09-10 13:35                                                     ` Matthew Malcomson
@ 2019-09-10 21:36                                                       ` Prathamesh Kulkarni
  0 siblings, 0 replies; 41+ messages in thread
From: Prathamesh Kulkarni @ 2019-09-10 21:36 UTC (permalink / raw)
  To: Matthew Malcomson; +Cc: gcc-patches, Richard Sandiford, Richard Biener, nd

On Tue, 10 Sep 2019 at 19:05, Matthew Malcomson
<Matthew.Malcomson@arm.com> wrote:
>
> Resending because I forgot to avoid the disclaimer and hence my email
> didn't go to the gcc-patches list.
>
>
>
> On 09/09/19 21:55, Prathamesh Kulkarni wrote:
> > On Mon, 9 Sep 2019 at 22:06, Prathamesh Kulkarni
> > <prathamesh.kulkarni@linaro.org> wrote:
> >>
> >> On Mon, 9 Sep 2019 at 16:45, Richard Sandiford
> >> <richard.sandiford@arm.com> wrote:
> >>>
> >>>
> >>> Thanks for doing this.
> >> Thanks for the suggestions, I have updated the patch accordingly.
> >> Boostrap+test in progress on x86_64-unknown-linux-gnu and aarch64-linux-gnu.
> >> Richi, does the patch look OK to you ?
> > Hi,
> > Bootstrap+test passes for x86_64-unknown-linux-gnu and aarch64-linux-gnu.
> > On x86_64, there's a "strange" failure of c-c++-common/builtins.c, log shows:
> >
> > /home/prathamesh.kulkarni/gnu-toolchain/gcc/pr86753-v2-3/gcc/gcc/test
> > FAIL: c-c++-common/builtins.c  -Wc++-compat  (test for excess errors)
> > Excess errors:
> > /home/prathamesh.kulkarni/gnu-toolchain/gcc/pr86753-v2-3/gcc/gcc/test
> >
>
> Just FYI I have seen this error come from a restriction in DejaGNU itself.
> https://gcc.gnu.org/ml/gcc/2019-05/msg00066.html
>
> The reply to that email mentions that this restriction was removed in
> later DejaGNU versions.
> https://gcc.gnu.org/ml/gcc/2019-05/msg00070.html
>
> If you see the snippet mentioned in the first email (don't continue if
> you've already read greater than 512000 bytes of output) in your DejaGNU
> install (remote.exp file), and the error messages from the
> "-Wc++-compat" test are greater than 512000 bytes then it's likely the
> problem is because of DejaGNU rather than your code.
>
> If that is the case, then a test is to remove the `if` mentioned in the
> first email and re-trying the regression test.
>
> (i.e. replace
>
>          if { [string length $output] < 512000 } {
>          exp_continue -continue_timer
>          }
>
> with
>
>              exp_continue -continue_timer
>
> in the "local_exec" procedure from $DEJAGNU_INSTALL/remote.exp)
>
>
> > Which shouldn't really happen since the test doesn't seem relevant to patch,
> > and only passes -O2 which shouldn't enable the vectorizer ? Manually
> > testing it results in PASS with:
> > make check-gcc RUNTESTFLAGS="dg.exp=builtins.c"
> > Would it be OK to ignore the FAIL during reg-test ?
> >
>
> This also matches the symptoms of this DejaGNU restriction -- it only
> comes up when the OS read returned not all the output from the test, and
> that happens a lot more when there are many parallel tests running.
Hi Matthew,
Thanks for the clarification! I had started another bootstrap+regtest
before reading your mail, and this time there were no FAILs, so I assume
the FAIL in the previous regtest was due to the DejaGNU issue you mentioned.

Thanks,
Prathamesh
>
> > Thanks,
> > Prathamesh
> >>
> >> Thanks,
> >> Prathamesh
> >>>
> >>> Richard
>

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [SVE] PR86753
  2019-09-09 16:37                                                 ` Prathamesh Kulkarni
  2019-09-09 20:56                                                   ` Prathamesh Kulkarni
@ 2019-09-16 15:54                                                   ` Prathamesh Kulkarni
  2019-09-25 16:18                                                     ` Prathamesh Kulkarni
  1 sibling, 1 reply; 41+ messages in thread
From: Prathamesh Kulkarni @ 2019-09-16 15:54 UTC (permalink / raw)
  To: Richard Sandiford; +Cc: Richard Biener, gcc Patches

On Mon, 9 Sep 2019 at 09:36, Prathamesh Kulkarni
<prathamesh.kulkarni@linaro.org> wrote:
>
> On Mon, 9 Sep 2019 at 16:45, Richard Sandiford
> <richard.sandiford@arm.com> wrote:
> >
> > Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> > > With patch, the only following FAIL remains for aarch64-sve.exp:
> > > FAIL: gcc.target/aarch64/sve/cond_unary_2.c -march=armv8.2-a+sve
> > > scan-assembler-times \\tmovprfx\\t 6
> > > which now contains 14.
> > > Should I adjust the test, assuming the change isn't a regression ?
> >
> > Well, it is kind-of a regression, but it really just means that the
> > integer code is now consistent with the floating-point code in having
> > an unnecessary MOVPRFX.  So I think adjusting the count is fine.
> > Presumably any future fix for the existing redundant MOVPRFXs will
> > apply to the new ones as well.
> >
> > The patch looks good to me, just some very minor nits:
> >
> > > @@ -8309,11 +8309,12 @@ vect_double_mask_nunits (tree type)
> > >
> > >  /* Record that a fully-masked version of LOOP_VINFO would need MASKS to
> > >     contain a sequence of NVECTORS masks that each control a vector of type
> > > -   VECTYPE.  */
> > > +   VECTYPE. SCALAR_MASK if non-null, represents the mask used for corresponding
> > > +   load/store stmt.  */
> >
> > Should be two spaces between sentences.  Maybe:
> >
> >    VECTYPE.  If SCALAR_MASK is nonnull, the fully-masked loop would AND
> >    these vector masks with the vector version of SCALAR_MASK.  */
> >
> > since the mask isn't necessarily for a load or store statement.
> >
> > > [...]
> > > @@ -1879,7 +1879,8 @@ static tree permute_vec_elements (tree, tree, tree, stmt_vec_info,
> > >     says how the load or store is going to be implemented and GROUP_SIZE
> > >     is the number of load or store statements in the containing group.
> > >     If the access is a gather load or scatter store, GS_INFO describes
> > > -   its arguments.
> > > +   its arguments. SCALAR_MASK is the scalar mask used for corresponding
> > > +   load or store stmt.
> >
> > Maybe:
> >
> >    its arguments.  If the load or store is conditional, SCALAR_MASK is the
> >    condition under which it occurs.
> >
> > since SCALAR_MASK can be null here too.
> >
> > > [...]
> > > @@ -9975,6 +9978,31 @@ vectorizable_condition (stmt_vec_info stmt_info, gimple_stmt_iterator *gsi,
> > >    /* Handle cond expr.  */
> > >    for (j = 0; j < ncopies; j++)
> > >      {
> > > +      tree loop_mask = NULL_TREE;
> > > +      bool swap_cond_operands = false;
> > > +
> > > +      if (loop_vinfo && LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
> > > +     {
> > > +       scalar_cond_masked_key cond (cond_expr, ncopies);
> > > +       if (loop_vinfo->scalar_cond_masked_set.contains (cond))
> > > +         {
> > > +           vec_loop_masks *masks = &LOOP_VINFO_MASKS (loop_vinfo);
> > > +           loop_mask = vect_get_loop_mask (gsi, masks, ncopies, vectype, j);
> > > +         }
> > > +       else
> > > +         {
> > > +           cond.code = invert_tree_comparison (cond.code,
> > > +                                               HONOR_NANS (TREE_TYPE (cond.op0)));
> >
> > Long line.  Maybe just split it out into a separate assignment:
> >
> >               bool honor_nans = HONOR_NANS (TREE_TYPE (cond.op0));
> >               cond.code = invert_tree_comparison (cond.code, honor_nans);
> >
> > > +           if (loop_vinfo->scalar_cond_masked_set.contains (cond))
> > > +             {
> > > +               vec_loop_masks *masks = &LOOP_VINFO_MASKS (loop_vinfo);
> > > +               loop_mask = vect_get_loop_mask (gsi, masks, ncopies, vectype, j);
> >
> > Long line here too.
> >
> > > [...]
> > > @@ -10090,6 +10121,26 @@ vectorizable_condition (stmt_vec_info stmt_info, gimple_stmt_iterator *gsi,
> > >                   }
> > >               }
> > >           }
> > > +
> > > +       if (loop_mask)
> > > +         {
> > > +           if (COMPARISON_CLASS_P (vec_compare))
> > > +             {
> > > +               tree tmp = make_ssa_name (vec_cmp_type);
> > > +               gassign *g = gimple_build_assign (tmp,
> > > +                                                 TREE_CODE (vec_compare),
> > > +                                                 TREE_OPERAND (vec_compare, 0),
> > > +                                                 TREE_OPERAND (vec_compare, 1));
> >
> > Two long lines.
> >
> > > +               vect_finish_stmt_generation (stmt_info, g, gsi);
> > > +               vec_compare = tmp;
> > > +             }
> > > +
> > > +           tree tmp2 = make_ssa_name (vec_cmp_type);
> > > +           gassign *g = gimple_build_assign (tmp2, BIT_AND_EXPR, vec_compare, loop_mask);
> >
> > Long line here too.
> >
> > > [...]
> > > diff --git a/gcc/tree-vectorizer.c b/gcc/tree-vectorizer.c
> > > index dc181524744..c4b2d8e8647 100644
> > > --- a/gcc/tree-vectorizer.c
> > > +++ b/gcc/tree-vectorizer.c
> > > @@ -1513,3 +1513,39 @@ make_pass_ipa_increase_alignment (gcc::context *ctxt)
> > >  {
> > >    return new pass_ipa_increase_alignment (ctxt);
> > >  }
> > > +
> > > +/* If code(T) is comparison op or def of comparison stmt,
> > > +   extract it's operands.
> > > +   Else return <NE_EXPR, T, 0>.  */
> > > +
> > > +void
> > > +scalar_cond_masked_key::get_cond_ops_from_tree (tree t)
> > > +{
> > > +  if (TREE_CODE_CLASS (TREE_CODE (t)) == tcc_comparison)
> > > +    {
> > > +      this->code = TREE_CODE (t);
> > > +      this->op0 = TREE_OPERAND (t, 0);
> > > +      this->op1 = TREE_OPERAND (t, 1);
> > > +      return;
> > > +    }
> > > +
> > > +  if (TREE_CODE (t) == SSA_NAME)
> > > +    {
> > > +      gassign *stmt = dyn_cast<gassign *> (SSA_NAME_DEF_STMT (t));
> > > +      if (stmt)
> > > +        {
> >
> > Might as well do this as:
> >
> >   if (TREE_CODE (t) == SSA_NAME)
> >     if (gassign *stmt = dyn_cast<gassign *> (SSA_NAME_DEF_STMT (t)))
> >       {
> >
> > The patch (as hoped) introduces some XPASSes:
> >
> > XPASS: gcc.target/aarch64/sve/cond_cnot_2.c scan-assembler-not \\tsel\\t
> > XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmge\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, #0\\.0\\n 21
> > XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmge\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, z[0-9]+\\.d\\n 42
> > XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmge\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, #0\\.0\\n 15
> > XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmge\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, z[0-9]+\\.s\\n 30
> > XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmgt\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, #0\\.0\\n 21
> > XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmgt\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, z[0-9]+\\.d\\n 42
> > XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmgt\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, #0\\.0\\n 15
> > XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmgt\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, z[0-9]+\\.s\\n 30
> > XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmle\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, #0\\.0\\n 21
> > XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmle\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, z[0-9]+\\.d\\n 42
> > XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmle\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, #0\\.0\\n 15
> > XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmle\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, z[0-9]+\\.s\\n 30
> > XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmlt\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, #0\\.0\\n 21
> > XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmlt\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, z[0-9]+\\.d\\n 42
> > XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmlt\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, #0\\.0\\n 15
> > XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmlt\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, z[0-9]+\\.s\\n 30
> > XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmuo\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, z[0-9]+\\.d\\n 252
> > XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmuo\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, z[0-9]+\\.s\\n 180
> > XPASS: gcc.target/aarch64/sve/vcond_5.c scan-assembler-times \\tfcmge\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, #0\\.0 21
> > XPASS: gcc.target/aarch64/sve/vcond_5.c scan-assembler-times \\tfcmge\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, z[0-9]+\\.d 42
> > XPASS: gcc.target/aarch64/sve/vcond_5.c scan-assembler-times \\tfcmge\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, #0\\.0 15
> > XPASS: gcc.target/aarch64/sve/vcond_5.c scan-assembler-times \\tfcmge\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, z[0-9]+\\.s 30
> > XPASS: gcc.target/aarch64/sve/vcond_5.c scan-assembler-times \\tfcmle\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, #0\\.0 21
> > XPASS: gcc.target/aarch64/sve/vcond_5.c scan-assembler-times \\tfcmle\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, z[0-9]+\\.d 42
> > XPASS: gcc.target/aarch64/sve/vcond_5.c scan-assembler-times \\tfcmle\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, #0\\.0 15
> > XPASS: gcc.target/aarch64/sve/vcond_5.c scan-assembler-times \\tfcmle\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, z[0-9]+\\.s 30
> >
> > Could you remove the associated xfails (and comments above them where
> > appropriate)?
> >
> > OK with those changes from my POV, but please give Richi a day or so
> > to object.
> >
> > Thanks for doing this.
> Thanks for the suggestions, I have updated the patch accordingly.
> Boostrap+test in progress on x86_64-unknown-linux-gnu and aarch64-linux-gnu.
> Richi, does the patch look OK to you ?
ping https://gcc.gnu.org/ml/gcc-patches/2019-09/msg00573.html

Thanks,
Prathamesh
>
> Thanks,
> Prathamesh
> >
> > Richard

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [SVE] PR86753
  2019-09-16 15:54                                                   ` Prathamesh Kulkarni
@ 2019-09-25 16:18                                                     ` Prathamesh Kulkarni
  2019-10-02 23:42                                                       ` Prathamesh Kulkarni
  0 siblings, 1 reply; 41+ messages in thread
From: Prathamesh Kulkarni @ 2019-09-25 16:18 UTC (permalink / raw)
  To: Richard Sandiford; +Cc: Richard Biener, gcc Patches

On Mon, 16 Sep 2019 at 08:54, Prathamesh Kulkarni
<prathamesh.kulkarni@linaro.org> wrote:
>
> On Mon, 9 Sep 2019 at 09:36, Prathamesh Kulkarni
> <prathamesh.kulkarni@linaro.org> wrote:
> >
> > On Mon, 9 Sep 2019 at 16:45, Richard Sandiford
> > <richard.sandiford@arm.com> wrote:
> > >
> > > Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> > > > With patch, the only following FAIL remains for aarch64-sve.exp:
> > > > FAIL: gcc.target/aarch64/sve/cond_unary_2.c -march=armv8.2-a+sve
> > > > scan-assembler-times \\tmovprfx\\t 6
> > > > which now contains 14.
> > > > Should I adjust the test, assuming the change isn't a regression ?
> > >
> > > Well, it is kind-of a regression, but it really just means that the
> > > integer code is now consistent with the floating-point code in having
> > > an unnecessary MOVPRFX.  So I think adjusting the count is fine.
> > > Presumably any future fix for the existing redundant MOVPRFXs will
> > > apply to the new ones as well.
> > >
> > > The patch looks good to me, just some very minor nits:
> > >
> > > > @@ -8309,11 +8309,12 @@ vect_double_mask_nunits (tree type)
> > > >
> > > >  /* Record that a fully-masked version of LOOP_VINFO would need MASKS to
> > > >     contain a sequence of NVECTORS masks that each control a vector of type
> > > > -   VECTYPE.  */
> > > > +   VECTYPE. SCALAR_MASK if non-null, represents the mask used for corresponding
> > > > +   load/store stmt.  */
> > >
> > > Should be two spaces between sentences.  Maybe:
> > >
> > >    VECTYPE.  If SCALAR_MASK is nonnull, the fully-masked loop would AND
> > >    these vector masks with the vector version of SCALAR_MASK.  */
> > >
> > > since the mask isn't necessarily for a load or store statement.
> > >
> > > > [...]
> > > > @@ -1879,7 +1879,8 @@ static tree permute_vec_elements (tree, tree, tree, stmt_vec_info,
> > > >     says how the load or store is going to be implemented and GROUP_SIZE
> > > >     is the number of load or store statements in the containing group.
> > > >     If the access is a gather load or scatter store, GS_INFO describes
> > > > -   its arguments.
> > > > +   its arguments. SCALAR_MASK is the scalar mask used for corresponding
> > > > +   load or store stmt.
> > >
> > > Maybe:
> > >
> > >    its arguments.  If the load or store is conditional, SCALAR_MASK is the
> > >    condition under which it occurs.
> > >
> > > since SCALAR_MASK can be null here too.
> > >
> > > > [...]
> > > > @@ -9975,6 +9978,31 @@ vectorizable_condition (stmt_vec_info stmt_info, gimple_stmt_iterator *gsi,
> > > >    /* Handle cond expr.  */
> > > >    for (j = 0; j < ncopies; j++)
> > > >      {
> > > > +      tree loop_mask = NULL_TREE;
> > > > +      bool swap_cond_operands = false;
> > > > +
> > > > +      if (loop_vinfo && LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
> > > > +     {
> > > > +       scalar_cond_masked_key cond (cond_expr, ncopies);
> > > > +       if (loop_vinfo->scalar_cond_masked_set.contains (cond))
> > > > +         {
> > > > +           vec_loop_masks *masks = &LOOP_VINFO_MASKS (loop_vinfo);
> > > > +           loop_mask = vect_get_loop_mask (gsi, masks, ncopies, vectype, j);
> > > > +         }
> > > > +       else
> > > > +         {
> > > > +           cond.code = invert_tree_comparison (cond.code,
> > > > +                                               HONOR_NANS (TREE_TYPE (cond.op0)));
> > >
> > > Long line.  Maybe just split it out into a separate assignment:
> > >
> > >               bool honor_nans = HONOR_NANS (TREE_TYPE (cond.op0));
> > >               cond.code = invert_tree_comparison (cond.code, honor_nans);
> > >
> > > > +           if (loop_vinfo->scalar_cond_masked_set.contains (cond))
> > > > +             {
> > > > +               vec_loop_masks *masks = &LOOP_VINFO_MASKS (loop_vinfo);
> > > > +               loop_mask = vect_get_loop_mask (gsi, masks, ncopies, vectype, j);
> > >
> > > Long line here too.
> > >
> > > > [...]
> > > > @@ -10090,6 +10121,26 @@ vectorizable_condition (stmt_vec_info stmt_info, gimple_stmt_iterator *gsi,
> > > >                   }
> > > >               }
> > > >           }
> > > > +
> > > > +       if (loop_mask)
> > > > +         {
> > > > +           if (COMPARISON_CLASS_P (vec_compare))
> > > > +             {
> > > > +               tree tmp = make_ssa_name (vec_cmp_type);
> > > > +               gassign *g = gimple_build_assign (tmp,
> > > > +                                                 TREE_CODE (vec_compare),
> > > > +                                                 TREE_OPERAND (vec_compare, 0),
> > > > +                                                 TREE_OPERAND (vec_compare, 1));
> > >
> > > Two long lines.
> > >
> > > > +               vect_finish_stmt_generation (stmt_info, g, gsi);
> > > > +               vec_compare = tmp;
> > > > +             }
> > > > +
> > > > +           tree tmp2 = make_ssa_name (vec_cmp_type);
> > > > +           gassign *g = gimple_build_assign (tmp2, BIT_AND_EXPR, vec_compare, loop_mask);
> > >
> > > Long line here too.
> > >
> > > > [...]
> > > > diff --git a/gcc/tree-vectorizer.c b/gcc/tree-vectorizer.c
> > > > index dc181524744..c4b2d8e8647 100644
> > > > --- a/gcc/tree-vectorizer.c
> > > > +++ b/gcc/tree-vectorizer.c
> > > > @@ -1513,3 +1513,39 @@ make_pass_ipa_increase_alignment (gcc::context *ctxt)
> > > >  {
> > > >    return new pass_ipa_increase_alignment (ctxt);
> > > >  }
> > > > +
> > > > +/* If code(T) is comparison op or def of comparison stmt,
> > > > +   extract it's operands.
> > > > +   Else return <NE_EXPR, T, 0>.  */
> > > > +
> > > > +void
> > > > +scalar_cond_masked_key::get_cond_ops_from_tree (tree t)
> > > > +{
> > > > +  if (TREE_CODE_CLASS (TREE_CODE (t)) == tcc_comparison)
> > > > +    {
> > > > +      this->code = TREE_CODE (t);
> > > > +      this->op0 = TREE_OPERAND (t, 0);
> > > > +      this->op1 = TREE_OPERAND (t, 1);
> > > > +      return;
> > > > +    }
> > > > +
> > > > +  if (TREE_CODE (t) == SSA_NAME)
> > > > +    {
> > > > +      gassign *stmt = dyn_cast<gassign *> (SSA_NAME_DEF_STMT (t));
> > > > +      if (stmt)
> > > > +        {
> > >
> > > Might as well do this as:
> > >
> > >   if (TREE_CODE (t) == SSA_NAME)
> > >     if (gassign *stmt = dyn_cast<gassign *> (SSA_NAME_DEF_STMT (t)))
> > >       {
> > >
> > > The patch (as hoped) introduces some XPASSes:
> > >
> > > XPASS: gcc.target/aarch64/sve/cond_cnot_2.c scan-assembler-not \\tsel\\t
> > > XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmge\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, #0\\.0\\n 21
> > > XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmge\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, z[0-9]+\\.d\\n 42
> > > XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmge\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, #0\\.0\\n 15
> > > XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmge\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, z[0-9]+\\.s\\n 30
> > > XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmgt\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, #0\\.0\\n 21
> > > XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmgt\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, z[0-9]+\\.d\\n 42
> > > XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmgt\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, #0\\.0\\n 15
> > > XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmgt\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, z[0-9]+\\.s\\n 30
> > > XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmle\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, #0\\.0\\n 21
> > > XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmle\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, z[0-9]+\\.d\\n 42
> > > XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmle\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, #0\\.0\\n 15
> > > XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmle\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, z[0-9]+\\.s\\n 30
> > > XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmlt\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, #0\\.0\\n 21
> > > XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmlt\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, z[0-9]+\\.d\\n 42
> > > XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmlt\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, #0\\.0\\n 15
> > > XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmlt\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, z[0-9]+\\.s\\n 30
> > > XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmuo\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, z[0-9]+\\.d\\n 252
> > > XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmuo\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, z[0-9]+\\.s\\n 180
> > > XPASS: gcc.target/aarch64/sve/vcond_5.c scan-assembler-times \\tfcmge\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, #0\\.0 21
> > > XPASS: gcc.target/aarch64/sve/vcond_5.c scan-assembler-times \\tfcmge\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, z[0-9]+\\.d 42
> > > XPASS: gcc.target/aarch64/sve/vcond_5.c scan-assembler-times \\tfcmge\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, #0\\.0 15
> > > XPASS: gcc.target/aarch64/sve/vcond_5.c scan-assembler-times \\tfcmge\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, z[0-9]+\\.s 30
> > > XPASS: gcc.target/aarch64/sve/vcond_5.c scan-assembler-times \\tfcmle\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, #0\\.0 21
> > > XPASS: gcc.target/aarch64/sve/vcond_5.c scan-assembler-times \\tfcmle\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, z[0-9]+\\.d 42
> > > XPASS: gcc.target/aarch64/sve/vcond_5.c scan-assembler-times \\tfcmle\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, #0\\.0 15
> > > XPASS: gcc.target/aarch64/sve/vcond_5.c scan-assembler-times \\tfcmle\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, z[0-9]+\\.s 30
> > >
> > > Could you remove the associated xfails (and comments above them where
> > > appropriate)?
> > >
> > > OK with those changes from my POV, but please give Richi a day or so
> > > to object.
> > >
> > > Thanks for doing this.
> > Thanks for the suggestions, I have updated the patch accordingly.
> > Boostrap+test in progress on x86_64-unknown-linux-gnu and aarch64-linux-gnu.
> > Richi, does the patch look OK to you ?
> ping https://gcc.gnu.org/ml/gcc-patches/2019-09/msg00573.html
ping * 2: https://gcc.gnu.org/ml/gcc-patches/2019-09/msg00573.html

Thanks,
Prathamesh
>
> Thanks,
> Prathamesh
> >
> > Thanks,
> > Prathamesh
> > >
> > > Richard

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [SVE] PR86753
  2019-09-25 16:18                                                     ` Prathamesh Kulkarni
@ 2019-10-02 23:42                                                       ` Prathamesh Kulkarni
  2019-10-04 10:38                                                         ` Richard Biener
  0 siblings, 1 reply; 41+ messages in thread
From: Prathamesh Kulkarni @ 2019-10-02 23:42 UTC (permalink / raw)
  To: Richard Sandiford; +Cc: Richard Biener, gcc Patches

On Wed, 25 Sep 2019 at 09:17, Prathamesh Kulkarni
<prathamesh.kulkarni@linaro.org> wrote:
>
> On Mon, 16 Sep 2019 at 08:54, Prathamesh Kulkarni
> <prathamesh.kulkarni@linaro.org> wrote:
> >
> > On Mon, 9 Sep 2019 at 09:36, Prathamesh Kulkarni
> > <prathamesh.kulkarni@linaro.org> wrote:
> > >
> > > On Mon, 9 Sep 2019 at 16:45, Richard Sandiford
> > > <richard.sandiford@arm.com> wrote:
> > > >
> > > > Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> > > > > With patch, the only following FAIL remains for aarch64-sve.exp:
> > > > > FAIL: gcc.target/aarch64/sve/cond_unary_2.c -march=armv8.2-a+sve
> > > > > scan-assembler-times \\tmovprfx\\t 6
> > > > > which now contains 14.
> > > > > Should I adjust the test, assuming the change isn't a regression ?
> > > >
> > > > Well, it is kind-of a regression, but it really just means that the
> > > > integer code is now consistent with the floating-point code in having
> > > > an unnecessary MOVPRFX.  So I think adjusting the count is fine.
> > > > Presumably any future fix for the existing redundant MOVPRFXs will
> > > > apply to the new ones as well.
> > > >
> > > > The patch looks good to me, just some very minor nits:
> > > >
> > > > > @@ -8309,11 +8309,12 @@ vect_double_mask_nunits (tree type)
> > > > >
> > > > >  /* Record that a fully-masked version of LOOP_VINFO would need MASKS to
> > > > >     contain a sequence of NVECTORS masks that each control a vector of type
> > > > > -   VECTYPE.  */
> > > > > +   VECTYPE. SCALAR_MASK if non-null, represents the mask used for corresponding
> > > > > +   load/store stmt.  */
> > > >
> > > > Should be two spaces between sentences.  Maybe:
> > > >
> > > >    VECTYPE.  If SCALAR_MASK is nonnull, the fully-masked loop would AND
> > > >    these vector masks with the vector version of SCALAR_MASK.  */
> > > >
> > > > since the mask isn't necessarily for a load or store statement.
> > > >
> > > > > [...]
> > > > > @@ -1879,7 +1879,8 @@ static tree permute_vec_elements (tree, tree, tree, stmt_vec_info,
> > > > >     says how the load or store is going to be implemented and GROUP_SIZE
> > > > >     is the number of load or store statements in the containing group.
> > > > >     If the access is a gather load or scatter store, GS_INFO describes
> > > > > -   its arguments.
> > > > > +   its arguments. SCALAR_MASK is the scalar mask used for corresponding
> > > > > +   load or store stmt.
> > > >
> > > > Maybe:
> > > >
> > > >    its arguments.  If the load or store is conditional, SCALAR_MASK is the
> > > >    condition under which it occurs.
> > > >
> > > > since SCALAR_MASK can be null here too.
> > > >
> > > > > [...]
> > > > > @@ -9975,6 +9978,31 @@ vectorizable_condition (stmt_vec_info stmt_info, gimple_stmt_iterator *gsi,
> > > > >    /* Handle cond expr.  */
> > > > >    for (j = 0; j < ncopies; j++)
> > > > >      {
> > > > > +      tree loop_mask = NULL_TREE;
> > > > > +      bool swap_cond_operands = false;
> > > > > +
> > > > > +      if (loop_vinfo && LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
> > > > > +     {
> > > > > +       scalar_cond_masked_key cond (cond_expr, ncopies);
> > > > > +       if (loop_vinfo->scalar_cond_masked_set.contains (cond))
> > > > > +         {
> > > > > +           vec_loop_masks *masks = &LOOP_VINFO_MASKS (loop_vinfo);
> > > > > +           loop_mask = vect_get_loop_mask (gsi, masks, ncopies, vectype, j);
> > > > > +         }
> > > > > +       else
> > > > > +         {
> > > > > +           cond.code = invert_tree_comparison (cond.code,
> > > > > +                                               HONOR_NANS (TREE_TYPE (cond.op0)));
> > > >
> > > > Long line.  Maybe just split it out into a separate assignment:
> > > >
> > > >               bool honor_nans = HONOR_NANS (TREE_TYPE (cond.op0));
> > > >               cond.code = invert_tree_comparison (cond.code, honor_nans);
> > > >
> > > > > +           if (loop_vinfo->scalar_cond_masked_set.contains (cond))
> > > > > +             {
> > > > > +               vec_loop_masks *masks = &LOOP_VINFO_MASKS (loop_vinfo);
> > > > > +               loop_mask = vect_get_loop_mask (gsi, masks, ncopies, vectype, j);
> > > >
> > > > Long line here too.
> > > >
> > > > > [...]
> > > > > @@ -10090,6 +10121,26 @@ vectorizable_condition (stmt_vec_info stmt_info, gimple_stmt_iterator *gsi,
> > > > >                   }
> > > > >               }
> > > > >           }
> > > > > +
> > > > > +       if (loop_mask)
> > > > > +         {
> > > > > +           if (COMPARISON_CLASS_P (vec_compare))
> > > > > +             {
> > > > > +               tree tmp = make_ssa_name (vec_cmp_type);
> > > > > +               gassign *g = gimple_build_assign (tmp,
> > > > > +                                                 TREE_CODE (vec_compare),
> > > > > +                                                 TREE_OPERAND (vec_compare, 0),
> > > > > +                                                 TREE_OPERAND (vec_compare, 1));
> > > >
> > > > Two long lines.
> > > >
> > > > > +               vect_finish_stmt_generation (stmt_info, g, gsi);
> > > > > +               vec_compare = tmp;
> > > > > +             }
> > > > > +
> > > > > +           tree tmp2 = make_ssa_name (vec_cmp_type);
> > > > > +           gassign *g = gimple_build_assign (tmp2, BIT_AND_EXPR, vec_compare, loop_mask);
> > > >
> > > > Long line here too.
> > > >
> > > > > [...]
> > > > > diff --git a/gcc/tree-vectorizer.c b/gcc/tree-vectorizer.c
> > > > > index dc181524744..c4b2d8e8647 100644
> > > > > --- a/gcc/tree-vectorizer.c
> > > > > +++ b/gcc/tree-vectorizer.c
> > > > > @@ -1513,3 +1513,39 @@ make_pass_ipa_increase_alignment (gcc::context *ctxt)
> > > > >  {
> > > > >    return new pass_ipa_increase_alignment (ctxt);
> > > > >  }
> > > > > +
> > > > > +/* If code(T) is comparison op or def of comparison stmt,
> > > > > +   extract it's operands.
> > > > > +   Else return <NE_EXPR, T, 0>.  */
> > > > > +
> > > > > +void
> > > > > +scalar_cond_masked_key::get_cond_ops_from_tree (tree t)
> > > > > +{
> > > > > +  if (TREE_CODE_CLASS (TREE_CODE (t)) == tcc_comparison)
> > > > > +    {
> > > > > +      this->code = TREE_CODE (t);
> > > > > +      this->op0 = TREE_OPERAND (t, 0);
> > > > > +      this->op1 = TREE_OPERAND (t, 1);
> > > > > +      return;
> > > > > +    }
> > > > > +
> > > > > +  if (TREE_CODE (t) == SSA_NAME)
> > > > > +    {
> > > > > +      gassign *stmt = dyn_cast<gassign *> (SSA_NAME_DEF_STMT (t));
> > > > > +      if (stmt)
> > > > > +        {
> > > >
> > > > Might as well do this as:
> > > >
> > > >   if (TREE_CODE (t) == SSA_NAME)
> > > >     if (gassign *stmt = dyn_cast<gassign *> (SSA_NAME_DEF_STMT (t)))
> > > >       {
> > > >
> > > > The patch (as hoped) introduces some XPASSes:
> > > >
> > > > XPASS: gcc.target/aarch64/sve/cond_cnot_2.c scan-assembler-not \\tsel\\t
> > > > XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmge\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, #0\\.0\\n 21
> > > > XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmge\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, z[0-9]+\\.d\\n 42
> > > > XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmge\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, #0\\.0\\n 15
> > > > XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmge\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, z[0-9]+\\.s\\n 30
> > > > XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmgt\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, #0\\.0\\n 21
> > > > XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmgt\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, z[0-9]+\\.d\\n 42
> > > > XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmgt\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, #0\\.0\\n 15
> > > > XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmgt\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, z[0-9]+\\.s\\n 30
> > > > XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmle\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, #0\\.0\\n 21
> > > > XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmle\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, z[0-9]+\\.d\\n 42
> > > > XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmle\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, #0\\.0\\n 15
> > > > XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmle\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, z[0-9]+\\.s\\n 30
> > > > XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmlt\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, #0\\.0\\n 21
> > > > XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmlt\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, z[0-9]+\\.d\\n 42
> > > > XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmlt\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, #0\\.0\\n 15
> > > > XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmlt\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, z[0-9]+\\.s\\n 30
> > > > XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmuo\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, z[0-9]+\\.d\\n 252
> > > > XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmuo\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, z[0-9]+\\.s\\n 180
> > > > XPASS: gcc.target/aarch64/sve/vcond_5.c scan-assembler-times \\tfcmge\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, #0\\.0 21
> > > > XPASS: gcc.target/aarch64/sve/vcond_5.c scan-assembler-times \\tfcmge\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, z[0-9]+\\.d 42
> > > > XPASS: gcc.target/aarch64/sve/vcond_5.c scan-assembler-times \\tfcmge\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, #0\\.0 15
> > > > XPASS: gcc.target/aarch64/sve/vcond_5.c scan-assembler-times \\tfcmge\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, z[0-9]+\\.s 30
> > > > XPASS: gcc.target/aarch64/sve/vcond_5.c scan-assembler-times \\tfcmle\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, #0\\.0 21
> > > > XPASS: gcc.target/aarch64/sve/vcond_5.c scan-assembler-times \\tfcmle\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, z[0-9]+\\.d 42
> > > > XPASS: gcc.target/aarch64/sve/vcond_5.c scan-assembler-times \\tfcmle\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, #0\\.0 15
> > > > XPASS: gcc.target/aarch64/sve/vcond_5.c scan-assembler-times \\tfcmle\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, z[0-9]+\\.s 30
> > > >
> > > > Could you remove the associated xfails (and comments above them where
> > > > appropriate)?
> > > >
> > > > OK with those changes from my POV, but please give Richi a day or so
> > > > to object.
> > > >
> > > > Thanks for doing this.
> > > Thanks for the suggestions, I have updated the patch accordingly.
> > > Boostrap+test in progress on x86_64-unknown-linux-gnu and aarch64-linux-gnu.
> > > Richi, does the patch look OK to you ?
> > ping https://gcc.gnu.org/ml/gcc-patches/2019-09/msg00573.html
> ping * 2: https://gcc.gnu.org/ml/gcc-patches/2019-09/msg00573.html
ping * 3: https://gcc.gnu.org/ml/gcc-patches/2019-09/msg00573.html

Thanks,
Prathamesh
>
> Thanks,
> Prathamesh
> >
> > Thanks,
> > Prathamesh
> > >
> > > Thanks,
> > > Prathamesh
> > > >
> > > > Richard

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [SVE] PR86753
  2019-10-02 23:42                                                       ` Prathamesh Kulkarni
@ 2019-10-04 10:38                                                         ` Richard Biener
  2019-10-08  0:10                                                           ` Prathamesh Kulkarni
  0 siblings, 1 reply; 41+ messages in thread
From: Richard Biener @ 2019-10-04 10:38 UTC (permalink / raw)
  To: Prathamesh Kulkarni; +Cc: Richard Sandiford, gcc Patches

On Thu, Oct 3, 2019 at 1:42 AM Prathamesh Kulkarni
<prathamesh.kulkarni@linaro.org> wrote:
>
> On Wed, 25 Sep 2019 at 09:17, Prathamesh Kulkarni
> <prathamesh.kulkarni@linaro.org> wrote:
> >
> > On Mon, 16 Sep 2019 at 08:54, Prathamesh Kulkarni
> > <prathamesh.kulkarni@linaro.org> wrote:
> > >
> > > On Mon, 9 Sep 2019 at 09:36, Prathamesh Kulkarni
> > > <prathamesh.kulkarni@linaro.org> wrote:
> > > >
> > > > On Mon, 9 Sep 2019 at 16:45, Richard Sandiford
> > > > <richard.sandiford@arm.com> wrote:
> > > > >
> > > > > Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> > > > > > With patch, the only following FAIL remains for aarch64-sve.exp:
> > > > > > FAIL: gcc.target/aarch64/sve/cond_unary_2.c -march=armv8.2-a+sve
> > > > > > scan-assembler-times \\tmovprfx\\t 6
> > > > > > which now contains 14.
> > > > > > Should I adjust the test, assuming the change isn't a regression ?
> > > > >
> > > > > Well, it is kind-of a regression, but it really just means that the
> > > > > integer code is now consistent with the floating-point code in having
> > > > > an unnecessary MOVPRFX.  So I think adjusting the count is fine.
> > > > > Presumably any future fix for the existing redundant MOVPRFXs will
> > > > > apply to the new ones as well.
> > > > >
> > > > > The patch looks good to me, just some very minor nits:
> > > > >
> > > > > > @@ -8309,11 +8309,12 @@ vect_double_mask_nunits (tree type)
> > > > > >
> > > > > >  /* Record that a fully-masked version of LOOP_VINFO would need MASKS to
> > > > > >     contain a sequence of NVECTORS masks that each control a vector of type
> > > > > > -   VECTYPE.  */
> > > > > > +   VECTYPE. SCALAR_MASK if non-null, represents the mask used for corresponding
> > > > > > +   load/store stmt.  */
> > > > >
> > > > > Should be two spaces between sentences.  Maybe:
> > > > >
> > > > >    VECTYPE.  If SCALAR_MASK is nonnull, the fully-masked loop would AND
> > > > >    these vector masks with the vector version of SCALAR_MASK.  */
> > > > >
> > > > > since the mask isn't necessarily for a load or store statement.
> > > > >
> > > > > > [...]
> > > > > > @@ -1879,7 +1879,8 @@ static tree permute_vec_elements (tree, tree, tree, stmt_vec_info,
> > > > > >     says how the load or store is going to be implemented and GROUP_SIZE
> > > > > >     is the number of load or store statements in the containing group.
> > > > > >     If the access is a gather load or scatter store, GS_INFO describes
> > > > > > -   its arguments.
> > > > > > +   its arguments. SCALAR_MASK is the scalar mask used for corresponding
> > > > > > +   load or store stmt.
> > > > >
> > > > > Maybe:
> > > > >
> > > > >    its arguments.  If the load or store is conditional, SCALAR_MASK is the
> > > > >    condition under which it occurs.
> > > > >
> > > > > since SCALAR_MASK can be null here too.
> > > > >
> > > > > > [...]
> > > > > > @@ -9975,6 +9978,31 @@ vectorizable_condition (stmt_vec_info stmt_info, gimple_stmt_iterator *gsi,
> > > > > >    /* Handle cond expr.  */
> > > > > >    for (j = 0; j < ncopies; j++)
> > > > > >      {
> > > > > > +      tree loop_mask = NULL_TREE;
> > > > > > +      bool swap_cond_operands = false;
> > > > > > +
> > > > > > +      if (loop_vinfo && LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
> > > > > > +     {
> > > > > > +       scalar_cond_masked_key cond (cond_expr, ncopies);
> > > > > > +       if (loop_vinfo->scalar_cond_masked_set.contains (cond))
> > > > > > +         {
> > > > > > +           vec_loop_masks *masks = &LOOP_VINFO_MASKS (loop_vinfo);
> > > > > > +           loop_mask = vect_get_loop_mask (gsi, masks, ncopies, vectype, j);
> > > > > > +         }
> > > > > > +       else
> > > > > > +         {
> > > > > > +           cond.code = invert_tree_comparison (cond.code,
> > > > > > +                                               HONOR_NANS (TREE_TYPE (cond.op0)));
> > > > >
> > > > > Long line.  Maybe just split it out into a separate assignment:
> > > > >
> > > > >               bool honor_nans = HONOR_NANS (TREE_TYPE (cond.op0));
> > > > >               cond.code = invert_tree_comparison (cond.code, honor_nans);
> > > > >
> > > > > > +           if (loop_vinfo->scalar_cond_masked_set.contains (cond))
> > > > > > +             {
> > > > > > +               vec_loop_masks *masks = &LOOP_VINFO_MASKS (loop_vinfo);
> > > > > > +               loop_mask = vect_get_loop_mask (gsi, masks, ncopies, vectype, j);
> > > > >
> > > > > Long line here too.
> > > > >
> > > > > > [...]
> > > > > > @@ -10090,6 +10121,26 @@ vectorizable_condition (stmt_vec_info stmt_info, gimple_stmt_iterator *gsi,
> > > > > >                   }
> > > > > >               }
> > > > > >           }
> > > > > > +
> > > > > > +       if (loop_mask)
> > > > > > +         {
> > > > > > +           if (COMPARISON_CLASS_P (vec_compare))
> > > > > > +             {
> > > > > > +               tree tmp = make_ssa_name (vec_cmp_type);
> > > > > > +               gassign *g = gimple_build_assign (tmp,
> > > > > > +                                                 TREE_CODE (vec_compare),
> > > > > > +                                                 TREE_OPERAND (vec_compare, 0),
> > > > > > +                                                 TREE_OPERAND (vec_compare, 1));
> > > > >
> > > > > Two long lines.
> > > > >
> > > > > > +               vect_finish_stmt_generation (stmt_info, g, gsi);
> > > > > > +               vec_compare = tmp;
> > > > > > +             }
> > > > > > +
> > > > > > +           tree tmp2 = make_ssa_name (vec_cmp_type);
> > > > > > +           gassign *g = gimple_build_assign (tmp2, BIT_AND_EXPR, vec_compare, loop_mask);
> > > > >
> > > > > Long line here too.
> > > > >
> > > > > > [...]
> > > > > > diff --git a/gcc/tree-vectorizer.c b/gcc/tree-vectorizer.c
> > > > > > index dc181524744..c4b2d8e8647 100644
> > > > > > --- a/gcc/tree-vectorizer.c
> > > > > > +++ b/gcc/tree-vectorizer.c
> > > > > > @@ -1513,3 +1513,39 @@ make_pass_ipa_increase_alignment (gcc::context *ctxt)
> > > > > >  {
> > > > > >    return new pass_ipa_increase_alignment (ctxt);
> > > > > >  }
> > > > > > +
> > > > > > +/* If code(T) is comparison op or def of comparison stmt,
> > > > > > +   extract it's operands.
> > > > > > +   Else return <NE_EXPR, T, 0>.  */
> > > > > > +
> > > > > > +void
> > > > > > +scalar_cond_masked_key::get_cond_ops_from_tree (tree t)
> > > > > > +{
> > > > > > +  if (TREE_CODE_CLASS (TREE_CODE (t)) == tcc_comparison)
> > > > > > +    {
> > > > > > +      this->code = TREE_CODE (t);
> > > > > > +      this->op0 = TREE_OPERAND (t, 0);
> > > > > > +      this->op1 = TREE_OPERAND (t, 1);
> > > > > > +      return;
> > > > > > +    }
> > > > > > +
> > > > > > +  if (TREE_CODE (t) == SSA_NAME)
> > > > > > +    {
> > > > > > +      gassign *stmt = dyn_cast<gassign *> (SSA_NAME_DEF_STMT (t));
> > > > > > +      if (stmt)
> > > > > > +        {
> > > > >
> > > > > Might as well do this as:
> > > > >
> > > > >   if (TREE_CODE (t) == SSA_NAME)
> > > > >     if (gassign *stmt = dyn_cast<gassign *> (SSA_NAME_DEF_STMT (t)))
> > > > >       {
> > > > >
> > > > > The patch (as hoped) introduces some XPASSes:
> > > > >
> > > > > XPASS: gcc.target/aarch64/sve/cond_cnot_2.c scan-assembler-not \\tsel\\t
> > > > > XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmge\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, #0\\.0\\n 21
> > > > > XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmge\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, z[0-9]+\\.d\\n 42
> > > > > XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmge\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, #0\\.0\\n 15
> > > > > XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmge\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, z[0-9]+\\.s\\n 30
> > > > > XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmgt\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, #0\\.0\\n 21
> > > > > XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmgt\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, z[0-9]+\\.d\\n 42
> > > > > XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmgt\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, #0\\.0\\n 15
> > > > > XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmgt\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, z[0-9]+\\.s\\n 30
> > > > > XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmle\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, #0\\.0\\n 21
> > > > > XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmle\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, z[0-9]+\\.d\\n 42
> > > > > XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmle\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, #0\\.0\\n 15
> > > > > XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmle\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, z[0-9]+\\.s\\n 30
> > > > > XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmlt\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, #0\\.0\\n 21
> > > > > XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmlt\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, z[0-9]+\\.d\\n 42
> > > > > XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmlt\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, #0\\.0\\n 15
> > > > > XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmlt\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, z[0-9]+\\.s\\n 30
> > > > > XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmuo\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, z[0-9]+\\.d\\n 252
> > > > > XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmuo\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, z[0-9]+\\.s\\n 180
> > > > > XPASS: gcc.target/aarch64/sve/vcond_5.c scan-assembler-times \\tfcmge\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, #0\\.0 21
> > > > > XPASS: gcc.target/aarch64/sve/vcond_5.c scan-assembler-times \\tfcmge\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, z[0-9]+\\.d 42
> > > > > XPASS: gcc.target/aarch64/sve/vcond_5.c scan-assembler-times \\tfcmge\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, #0\\.0 15
> > > > > XPASS: gcc.target/aarch64/sve/vcond_5.c scan-assembler-times \\tfcmge\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, z[0-9]+\\.s 30
> > > > > XPASS: gcc.target/aarch64/sve/vcond_5.c scan-assembler-times \\tfcmle\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, #0\\.0 21
> > > > > XPASS: gcc.target/aarch64/sve/vcond_5.c scan-assembler-times \\tfcmle\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, z[0-9]+\\.d 42
> > > > > XPASS: gcc.target/aarch64/sve/vcond_5.c scan-assembler-times \\tfcmle\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, #0\\.0 15
> > > > > XPASS: gcc.target/aarch64/sve/vcond_5.c scan-assembler-times \\tfcmle\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, z[0-9]+\\.s 30
> > > > >
> > > > > Could you remove the associated xfails (and comments above them where
> > > > > appropriate)?
> > > > >
> > > > > OK with those changes from my POV, but please give Richi a day or so
> > > > > to object.
> > > > >
> > > > > Thanks for doing this.
> > > > Thanks for the suggestions, I have updated the patch accordingly.
> > > > Boostrap+test in progress on x86_64-unknown-linux-gnu and aarch64-linux-gnu.
> > > > Richi, does the patch look OK to you ?
> > > ping https://gcc.gnu.org/ml/gcc-patches/2019-09/msg00573.html
> > ping * 2: https://gcc.gnu.org/ml/gcc-patches/2019-09/msg00573.html
> ping * 3: https://gcc.gnu.org/ml/gcc-patches/2019-09/msg00573.html

It looks reasonable, but the vectorizable_condition changes totally lack
comments...

Richard.

> Thanks,
> Prathamesh
> >
> > Thanks,
> > Prathamesh
> > >
> > > Thanks,
> > > Prathamesh
> > > >
> > > > Thanks,
> > > > Prathamesh
> > > > >
> > > > > Richard

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [SVE] PR86753
  2019-10-04 10:38                                                         ` Richard Biener
@ 2019-10-08  0:10                                                           ` Prathamesh Kulkarni
  2019-10-08  7:51                                                             ` Richard Sandiford
  0 siblings, 1 reply; 41+ messages in thread
From: Prathamesh Kulkarni @ 2019-10-08  0:10 UTC (permalink / raw)
  To: Richard Biener; +Cc: Richard Sandiford, gcc Patches

[-- Attachment #1: Type: text/plain, Size: 12306 bytes --]

On Fri, 4 Oct 2019 at 16:08, Richard Biener <richard.guenther@gmail.com> wrote:
>
> On Thu, Oct 3, 2019 at 1:42 AM Prathamesh Kulkarni
> <prathamesh.kulkarni@linaro.org> wrote:
> >
> > On Wed, 25 Sep 2019 at 09:17, Prathamesh Kulkarni
> > <prathamesh.kulkarni@linaro.org> wrote:
> > >
> > > On Mon, 16 Sep 2019 at 08:54, Prathamesh Kulkarni
> > > <prathamesh.kulkarni@linaro.org> wrote:
> > > >
> > > > On Mon, 9 Sep 2019 at 09:36, Prathamesh Kulkarni
> > > > <prathamesh.kulkarni@linaro.org> wrote:
> > > > >
> > > > > On Mon, 9 Sep 2019 at 16:45, Richard Sandiford
> > > > > <richard.sandiford@arm.com> wrote:
> > > > > >
> > > > > > Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> > > > > > > With patch, the only following FAIL remains for aarch64-sve.exp:
> > > > > > > FAIL: gcc.target/aarch64/sve/cond_unary_2.c -march=armv8.2-a+sve
> > > > > > > scan-assembler-times \\tmovprfx\\t 6
> > > > > > > which now contains 14.
> > > > > > > Should I adjust the test, assuming the change isn't a regression ?
> > > > > >
> > > > > > Well, it is kind-of a regression, but it really just means that the
> > > > > > integer code is now consistent with the floating-point code in having
> > > > > > an unnecessary MOVPRFX.  So I think adjusting the count is fine.
> > > > > > Presumably any future fix for the existing redundant MOVPRFXs will
> > > > > > apply to the new ones as well.
> > > > > >
> > > > > > The patch looks good to me, just some very minor nits:
> > > > > >
> > > > > > > @@ -8309,11 +8309,12 @@ vect_double_mask_nunits (tree type)
> > > > > > >
> > > > > > >  /* Record that a fully-masked version of LOOP_VINFO would need MASKS to
> > > > > > >     contain a sequence of NVECTORS masks that each control a vector of type
> > > > > > > -   VECTYPE.  */
> > > > > > > +   VECTYPE. SCALAR_MASK if non-null, represents the mask used for corresponding
> > > > > > > +   load/store stmt.  */
> > > > > >
> > > > > > Should be two spaces between sentences.  Maybe:
> > > > > >
> > > > > >    VECTYPE.  If SCALAR_MASK is nonnull, the fully-masked loop would AND
> > > > > >    these vector masks with the vector version of SCALAR_MASK.  */
> > > > > >
> > > > > > since the mask isn't necessarily for a load or store statement.
> > > > > >
> > > > > > > [...]
> > > > > > > @@ -1879,7 +1879,8 @@ static tree permute_vec_elements (tree, tree, tree, stmt_vec_info,
> > > > > > >     says how the load or store is going to be implemented and GROUP_SIZE
> > > > > > >     is the number of load or store statements in the containing group.
> > > > > > >     If the access is a gather load or scatter store, GS_INFO describes
> > > > > > > -   its arguments.
> > > > > > > +   its arguments. SCALAR_MASK is the scalar mask used for corresponding
> > > > > > > +   load or store stmt.
> > > > > >
> > > > > > Maybe:
> > > > > >
> > > > > >    its arguments.  If the load or store is conditional, SCALAR_MASK is the
> > > > > >    condition under which it occurs.
> > > > > >
> > > > > > since SCALAR_MASK can be null here too.
> > > > > >
> > > > > > > [...]
> > > > > > > @@ -9975,6 +9978,31 @@ vectorizable_condition (stmt_vec_info stmt_info, gimple_stmt_iterator *gsi,
> > > > > > >    /* Handle cond expr.  */
> > > > > > >    for (j = 0; j < ncopies; j++)
> > > > > > >      {
> > > > > > > +      tree loop_mask = NULL_TREE;
> > > > > > > +      bool swap_cond_operands = false;
> > > > > > > +
> > > > > > > +      if (loop_vinfo && LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
> > > > > > > +     {
> > > > > > > +       scalar_cond_masked_key cond (cond_expr, ncopies);
> > > > > > > +       if (loop_vinfo->scalar_cond_masked_set.contains (cond))
> > > > > > > +         {
> > > > > > > +           vec_loop_masks *masks = &LOOP_VINFO_MASKS (loop_vinfo);
> > > > > > > +           loop_mask = vect_get_loop_mask (gsi, masks, ncopies, vectype, j);
> > > > > > > +         }
> > > > > > > +       else
> > > > > > > +         {
> > > > > > > +           cond.code = invert_tree_comparison (cond.code,
> > > > > > > +                                               HONOR_NANS (TREE_TYPE (cond.op0)));
> > > > > >
> > > > > > Long line.  Maybe just split it out into a separate assignment:
> > > > > >
> > > > > >               bool honor_nans = HONOR_NANS (TREE_TYPE (cond.op0));
> > > > > >               cond.code = invert_tree_comparison (cond.code, honor_nans);
> > > > > >
> > > > > > > +           if (loop_vinfo->scalar_cond_masked_set.contains (cond))
> > > > > > > +             {
> > > > > > > +               vec_loop_masks *masks = &LOOP_VINFO_MASKS (loop_vinfo);
> > > > > > > +               loop_mask = vect_get_loop_mask (gsi, masks, ncopies, vectype, j);
> > > > > >
> > > > > > Long line here too.
> > > > > >
> > > > > > > [...]
> > > > > > > @@ -10090,6 +10121,26 @@ vectorizable_condition (stmt_vec_info stmt_info, gimple_stmt_iterator *gsi,
> > > > > > >                   }
> > > > > > >               }
> > > > > > >           }
> > > > > > > +
> > > > > > > +       if (loop_mask)
> > > > > > > +         {
> > > > > > > +           if (COMPARISON_CLASS_P (vec_compare))
> > > > > > > +             {
> > > > > > > +               tree tmp = make_ssa_name (vec_cmp_type);
> > > > > > > +               gassign *g = gimple_build_assign (tmp,
> > > > > > > +                                                 TREE_CODE (vec_compare),
> > > > > > > +                                                 TREE_OPERAND (vec_compare, 0),
> > > > > > > +                                                 TREE_OPERAND (vec_compare, 1));
> > > > > >
> > > > > > Two long lines.
> > > > > >
> > > > > > > +               vect_finish_stmt_generation (stmt_info, g, gsi);
> > > > > > > +               vec_compare = tmp;
> > > > > > > +             }
> > > > > > > +
> > > > > > > +           tree tmp2 = make_ssa_name (vec_cmp_type);
> > > > > > > +           gassign *g = gimple_build_assign (tmp2, BIT_AND_EXPR, vec_compare, loop_mask);
> > > > > >
> > > > > > Long line here too.
> > > > > >
> > > > > > > [...]
> > > > > > > diff --git a/gcc/tree-vectorizer.c b/gcc/tree-vectorizer.c
> > > > > > > index dc181524744..c4b2d8e8647 100644
> > > > > > > --- a/gcc/tree-vectorizer.c
> > > > > > > +++ b/gcc/tree-vectorizer.c
> > > > > > > @@ -1513,3 +1513,39 @@ make_pass_ipa_increase_alignment (gcc::context *ctxt)
> > > > > > >  {
> > > > > > >    return new pass_ipa_increase_alignment (ctxt);
> > > > > > >  }
> > > > > > > +
> > > > > > > +/* If code(T) is comparison op or def of comparison stmt,
> > > > > > > +   extract it's operands.
> > > > > > > +   Else return <NE_EXPR, T, 0>.  */
> > > > > > > +
> > > > > > > +void
> > > > > > > +scalar_cond_masked_key::get_cond_ops_from_tree (tree t)
> > > > > > > +{
> > > > > > > +  if (TREE_CODE_CLASS (TREE_CODE (t)) == tcc_comparison)
> > > > > > > +    {
> > > > > > > +      this->code = TREE_CODE (t);
> > > > > > > +      this->op0 = TREE_OPERAND (t, 0);
> > > > > > > +      this->op1 = TREE_OPERAND (t, 1);
> > > > > > > +      return;
> > > > > > > +    }
> > > > > > > +
> > > > > > > +  if (TREE_CODE (t) == SSA_NAME)
> > > > > > > +    {
> > > > > > > +      gassign *stmt = dyn_cast<gassign *> (SSA_NAME_DEF_STMT (t));
> > > > > > > +      if (stmt)
> > > > > > > +        {
> > > > > >
> > > > > > Might as well do this as:
> > > > > >
> > > > > >   if (TREE_CODE (t) == SSA_NAME)
> > > > > >     if (gassign *stmt = dyn_cast<gassign *> (SSA_NAME_DEF_STMT (t)))
> > > > > >       {
> > > > > >
> > > > > > The patch (as hoped) introduces some XPASSes:
> > > > > >
> > > > > > XPASS: gcc.target/aarch64/sve/cond_cnot_2.c scan-assembler-not \\tsel\\t
> > > > > > XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmge\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, #0\\.0\\n 21
> > > > > > XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmge\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, z[0-9]+\\.d\\n 42
> > > > > > XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmge\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, #0\\.0\\n 15
> > > > > > XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmge\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, z[0-9]+\\.s\\n 30
> > > > > > XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmgt\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, #0\\.0\\n 21
> > > > > > XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmgt\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, z[0-9]+\\.d\\n 42
> > > > > > XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmgt\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, #0\\.0\\n 15
> > > > > > XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmgt\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, z[0-9]+\\.s\\n 30
> > > > > > XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmle\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, #0\\.0\\n 21
> > > > > > XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmle\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, z[0-9]+\\.d\\n 42
> > > > > > XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmle\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, #0\\.0\\n 15
> > > > > > XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmle\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, z[0-9]+\\.s\\n 30
> > > > > > XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmlt\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, #0\\.0\\n 21
> > > > > > XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmlt\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, z[0-9]+\\.d\\n 42
> > > > > > XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmlt\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, #0\\.0\\n 15
> > > > > > XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmlt\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, z[0-9]+\\.s\\n 30
> > > > > > XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmuo\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, z[0-9]+\\.d\\n 252
> > > > > > XPASS: gcc.target/aarch64/sve/vcond_4.c scan-assembler-times \\tfcmuo\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, z[0-9]+\\.s\\n 180
> > > > > > XPASS: gcc.target/aarch64/sve/vcond_5.c scan-assembler-times \\tfcmge\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, #0\\.0 21
> > > > > > XPASS: gcc.target/aarch64/sve/vcond_5.c scan-assembler-times \\tfcmge\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, z[0-9]+\\.d 42
> > > > > > XPASS: gcc.target/aarch64/sve/vcond_5.c scan-assembler-times \\tfcmge\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, #0\\.0 15
> > > > > > XPASS: gcc.target/aarch64/sve/vcond_5.c scan-assembler-times \\tfcmge\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, z[0-9]+\\.s 30
> > > > > > XPASS: gcc.target/aarch64/sve/vcond_5.c scan-assembler-times \\tfcmle\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, #0\\.0 21
> > > > > > XPASS: gcc.target/aarch64/sve/vcond_5.c scan-assembler-times \\tfcmle\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, z[0-9]+\\.d 42
> > > > > > XPASS: gcc.target/aarch64/sve/vcond_5.c scan-assembler-times \\tfcmle\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, #0\\.0 15
> > > > > > XPASS: gcc.target/aarch64/sve/vcond_5.c scan-assembler-times \\tfcmle\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, z[0-9]+\\.s 30
> > > > > >
> > > > > > Could you remove the associated xfails (and comments above them where
> > > > > > appropriate)?
> > > > > >
> > > > > > OK with those changes from my POV, but please give Richi a day or so
> > > > > > to object.
> > > > > >
> > > > > > Thanks for doing this.
> > > > > Thanks for the suggestions, I have updated the patch accordingly.
> > > > > Boostrap+test in progress on x86_64-unknown-linux-gnu and aarch64-linux-gnu.
> > > > > Richi, does the patch look OK to you ?
> > > > ping https://gcc.gnu.org/ml/gcc-patches/2019-09/msg00573.html
> > > ping * 2: https://gcc.gnu.org/ml/gcc-patches/2019-09/msg00573.html
> > ping * 3: https://gcc.gnu.org/ml/gcc-patches/2019-09/msg00573.html
>
> It looks reasonable, but the vectorizable_condition changes totally lack
> comments...
Hi Richard,
I rebased the patch on top of trunk, and added some comments to
vectorizable_condition and scalar_cond_masked_key.
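
For reference, the lookup that the new vectorizable_condition code does
boils down to something like the following standalone sketch (this is
only an illustration, not the patch code itself; find_recorded_mask is a
made-up name, while scalar_cond_masked_set_type, scalar_cond_masked_key
and invert_tree_comparison are the ones used by the patch):

  /* Sketch only: check whether the scalar condition COND, or its
     inverse, already has a loop mask recorded for it in SET.  If the
     inverse matched, the caller swaps the THEN/ELSE operands of the
     COND_EXPR.  */
  static bool
  find_recorded_mask (scalar_cond_masked_set_type &set,
                      scalar_cond_masked_key cond, bool honor_nans,
                      bool *swap_operands)
  {
    *swap_operands = false;
    if (set.contains (cond))
      return true;
    cond.code = invert_tree_comparison (cond.code, honor_nans);
    if (set.contains (cond))
      {
        *swap_operands = true;
        return true;
      }
    return false;
  }
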
Does it look OK ?

Thanks,
Prathamesh
>
> Richard.
>
> > Thanks,
> > Prathamesh
> > >
> > > Thanks,
> > > Prathamesh
> > > >
> > > > Thanks,
> > > > Prathamesh
> > > > >
> > > > > Thanks,
> > > > > Prathamesh
> > > > > >
> > > > > > Richard

[-- Attachment #2: pr86753-v2-4.diff --]
[-- Type: text/x-patch, Size: 25415 bytes --]

diff --git a/gcc/testsuite/gcc.target/aarch64/sve/cond_cnot_2.c b/gcc/testsuite/gcc.target/aarch64/sve/cond_cnot_2.c
index d689e21dc11..3df2431be38 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/cond_cnot_2.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/cond_cnot_2.c
@@ -32,4 +32,4 @@ TEST_ALL (DEF_LOOP)
 /* { dg-final { scan-assembler-not {\tmov\tz} } } */
 /* { dg-final { scan-assembler-not {\tmovprfx\t} } } */
 /* Currently we canonicalize the ?: so that !b[i] is the "false" value.  */
-/* { dg-final { scan-assembler-not {\tsel\t} { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-not {\tsel\t} } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/cond_convert_1.c b/gcc/testsuite/gcc.target/aarch64/sve/cond_convert_1.c
index dcc30768f88..86064ebfcba 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/cond_convert_1.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/cond_convert_1.c
@@ -11,7 +11,10 @@
 		   INT_TYPE *__restrict pred, int n)		\
   {								\
     for (int i = 0; i < n; ++i)					\
-      r[i] = pred[i] ? (FLOAT_TYPE) a[i] : b[i];		\
+      {								\
+	FLOAT_TYPE bi = b[i];					\
+	r[i] = pred[i] ? (FLOAT_TYPE) a[i] : bi;		\
+      }								\
   }
 
 #define TEST_ALL(T) \
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/cond_convert_4.c b/gcc/testsuite/gcc.target/aarch64/sve/cond_convert_4.c
index 7e5f2a73ed9..e3a947b2698 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/cond_convert_4.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/cond_convert_4.c
@@ -11,7 +11,10 @@
 		   INT_TYPE *__restrict pred, int n)		\
   {								\
     for (int i = 0; i < n; ++i)					\
-      r[i] = pred[i] ? (INT_TYPE) a[i] : b[i];			\
+      {								\
+	INT_TYPE bi = b[i];					\
+	r[i] = pred[i] ? (INT_TYPE) a[i] : bi;			\
+      }								\
   }
 
 #define TEST_ALL(T) \
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/cond_unary_2.c b/gcc/testsuite/gcc.target/aarch64/sve/cond_unary_2.c
index 991ccf016d1..97d1b8f5d45 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/cond_unary_2.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/cond_unary_2.c
@@ -13,7 +13,10 @@
 		      TYPE *__restrict pred, int n)		\
   {								\
     for (int i = 0; i < n; ++i)					\
-      r[i] = pred[i] ? OP (a[i]) : b[i];			\
+      {								\
+	TYPE bi = b[i];						\
+	r[i] = pred[i] ? OP (a[i]) : bi;			\
+      }								\
   }
 
 #define TEST_INT_TYPE(T, TYPE) \
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/fmla_2.c b/gcc/testsuite/gcc.target/aarch64/sve/fmla_2.c
index 5c04bcdb3f5..a1b0667dab5 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/fmla_2.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/fmla_2.c
@@ -15,5 +15,9 @@ f (double *restrict a, double *restrict b, double *restrict c,
     }
 }
 
-/* { dg-final { scan-assembler-times {\tfmla\tz[0-9]+\.d, p[0-7]/m, z[0-9]+\.d, z[0-9]+\.d\n} 2 } } */
+/* See https://gcc.gnu.org/ml/gcc-patches/2019-08/msg01644.html
+   for XFAILing the below test.  */
+
+/* { dg-final { scan-assembler-times {\tfmla\tz[0-9]+\.d, p[0-7]/m, z[0-9]+\.d, z[0-9]+\.d\n} 2 { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-times {\tfmla\tz[0-9]+\.d, p[0-7]/m, z[0-9]+\.d, z[0-9]+\.d\n} 3 } } */
 /* { dg-final { scan-assembler-not {\tfmad\t} } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/vcond_4.c b/gcc/testsuite/gcc.target/aarch64/sve/vcond_4.c
index 00d84760a19..b38f23e87ba 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/vcond_4.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/vcond_4.c
@@ -98,24 +98,24 @@ TEST_CMP (nugt)
 /* { dg-final { scan-assembler-times {\tfcmne\tp[0-9]+\.s, p[0-7]/z, z[0-9]+\.s, z[0-9]+\.s\n} 30 { xfail *-*-* } } } */
 
 /* 5 for lt, 5 for ult and 5 for nult.  */
-/* { dg-final { scan-assembler-times {\tfcmlt\tp[0-9]+\.s, p[0-7]/z, z[0-9]+\.s, #0\.0\n} 15 { xfail *-*-* } } } */
-/* { dg-final { scan-assembler-times {\tfcmlt\tp[0-9]+\.s, p[0-7]/z, z[0-9]+\.s, z[0-9]+\.s\n} 30 { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-times {\tfcmlt\tp[0-9]+\.s, p[0-7]/z, z[0-9]+\.s, #0\.0\n} 15 } } */
+/* { dg-final { scan-assembler-times {\tfcmlt\tp[0-9]+\.s, p[0-7]/z, z[0-9]+\.s, z[0-9]+\.s\n} 30 } } */
 
 /* 5 for le, 5 for ule and 5 for nule.  */
-/* { dg-final { scan-assembler-times {\tfcmle\tp[0-9]+\.s, p[0-7]/z, z[0-9]+\.s, #0\.0\n} 15 { xfail *-*-* } } } */
-/* { dg-final { scan-assembler-times {\tfcmle\tp[0-9]+\.s, p[0-7]/z, z[0-9]+\.s, z[0-9]+\.s\n} 30 { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-times {\tfcmle\tp[0-9]+\.s, p[0-7]/z, z[0-9]+\.s, #0\.0\n} 15 } } */
+/* { dg-final { scan-assembler-times {\tfcmle\tp[0-9]+\.s, p[0-7]/z, z[0-9]+\.s, z[0-9]+\.s\n} 30 } } */
 
 /* 5 for gt, 5 for ugt and 5 for nugt.  */
-/* { dg-final { scan-assembler-times {\tfcmgt\tp[0-9]+\.s, p[0-7]/z, z[0-9]+\.s, #0\.0\n} 15 { xfail *-*-* } } } */
-/* { dg-final { scan-assembler-times {\tfcmgt\tp[0-9]+\.s, p[0-7]/z, z[0-9]+\.s, z[0-9]+\.s\n} 30 { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-times {\tfcmgt\tp[0-9]+\.s, p[0-7]/z, z[0-9]+\.s, #0\.0\n} 15 } } */
+/* { dg-final { scan-assembler-times {\tfcmgt\tp[0-9]+\.s, p[0-7]/z, z[0-9]+\.s, z[0-9]+\.s\n} 30 } } */
 
 /* 5 for ge, 5 for uge and 5 for nuge.  */
-/* { dg-final { scan-assembler-times {\tfcmge\tp[0-9]+\.s, p[0-7]/z, z[0-9]+\.s, #0\.0\n} 15 { xfail *-*-* } } } */
-/* { dg-final { scan-assembler-times {\tfcmge\tp[0-9]+\.s, p[0-7]/z, z[0-9]+\.s, z[0-9]+\.s\n} 30 { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-times {\tfcmge\tp[0-9]+\.s, p[0-7]/z, z[0-9]+\.s, #0\.0\n} 15 } } */
+/* { dg-final { scan-assembler-times {\tfcmge\tp[0-9]+\.s, p[0-7]/z, z[0-9]+\.s, z[0-9]+\.s\n} 30 } } */
 
 /* { dg-final { scan-assembler-not {\tfcmuo\tp[0-9]+\.s, p[0-7]/z, z[0-9]+\.s, #0\.0\n} } } */
 /* 3 loops * 5 invocations for all 12 unordered comparisons.  */
-/* { dg-final { scan-assembler-times {\tfcmuo\tp[0-9]+\.s, p[0-7]/z, z[0-9]+\.s, z[0-9]+\.s\n} 180 { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-times {\tfcmuo\tp[0-9]+\.s, p[0-7]/z, z[0-9]+\.s, z[0-9]+\.s\n} 180 } } */
 
 /* { dg-final { scan-assembler-times {\tfcmeq\tp[0-9]+\.d, p[0-7]/z, z[0-9]+\.d, #0\.0\n} 7 { xfail *-*-* } } } */
 /* { dg-final { scan-assembler-times {\tfcmeq\tp[0-9]+\.d, p[0-7]/z, z[0-9]+\.d, z[0-9]+\.d\n} 14 { xfail *-*-* } } } */
@@ -123,19 +123,19 @@ TEST_CMP (nugt)
 /* { dg-final { scan-assembler-times {\tfcmne\tp[0-9]+\.d, p[0-7]/z, z[0-9]+\.d, #0\.0\n} 21 { xfail *-*-* } } } */
 /* { dg-final { scan-assembler-times {\tfcmne\tp[0-9]+\.d, p[0-7]/z, z[0-9]+\.d, z[0-9]+\.d\n} 42 { xfail *-*-* } } } */
 
-/* { dg-final { scan-assembler-times {\tfcmlt\tp[0-9]+\.d, p[0-7]/z, z[0-9]+\.d, #0\.0\n} 21 { xfail *-*-* } } } */
-/* { dg-final { scan-assembler-times {\tfcmlt\tp[0-9]+\.d, p[0-7]/z, z[0-9]+\.d, z[0-9]+\.d\n} 42 { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-times {\tfcmlt\tp[0-9]+\.d, p[0-7]/z, z[0-9]+\.d, #0\.0\n} 21 } } */
+/* { dg-final { scan-assembler-times {\tfcmlt\tp[0-9]+\.d, p[0-7]/z, z[0-9]+\.d, z[0-9]+\.d\n} 42 } } */
 
-/* { dg-final { scan-assembler-times {\tfcmle\tp[0-9]+\.d, p[0-7]/z, z[0-9]+\.d, #0\.0\n} 21 { xfail *-*-* } } } */
-/* { dg-final { scan-assembler-times {\tfcmle\tp[0-9]+\.d, p[0-7]/z, z[0-9]+\.d, z[0-9]+\.d\n} 42 { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-times {\tfcmle\tp[0-9]+\.d, p[0-7]/z, z[0-9]+\.d, #0\.0\n} 21 } } */
+/* { dg-final { scan-assembler-times {\tfcmle\tp[0-9]+\.d, p[0-7]/z, z[0-9]+\.d, z[0-9]+\.d\n} 42 } } */
 
-/* { dg-final { scan-assembler-times {\tfcmgt\tp[0-9]+\.d, p[0-7]/z, z[0-9]+\.d, #0\.0\n} 21 { xfail *-*-* } } } */
-/* { dg-final { scan-assembler-times {\tfcmgt\tp[0-9]+\.d, p[0-7]/z, z[0-9]+\.d, z[0-9]+\.d\n} 42 { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-times {\tfcmgt\tp[0-9]+\.d, p[0-7]/z, z[0-9]+\.d, #0\.0\n} 21 } } */
+/* { dg-final { scan-assembler-times {\tfcmgt\tp[0-9]+\.d, p[0-7]/z, z[0-9]+\.d, z[0-9]+\.d\n} 42 } } */
 
-/* { dg-final { scan-assembler-times {\tfcmge\tp[0-9]+\.d, p[0-7]/z, z[0-9]+\.d, #0\.0\n} 21 { xfail *-*-* } } } */
-/* { dg-final { scan-assembler-times {\tfcmge\tp[0-9]+\.d, p[0-7]/z, z[0-9]+\.d, z[0-9]+\.d\n} 42 { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-times {\tfcmge\tp[0-9]+\.d, p[0-7]/z, z[0-9]+\.d, #0\.0\n} 21 } } */
+/* { dg-final { scan-assembler-times {\tfcmge\tp[0-9]+\.d, p[0-7]/z, z[0-9]+\.d, z[0-9]+\.d\n} 42 } } */
 
 /* { dg-final { scan-assembler-not {\tfcmuo\tp[0-9]+\.d, p[0-7]/z, z[0-9]+\.d, #0\.0\n} } } */
 /* 3 loops * 5 invocations, with 2 invocations having ncopies == 2,
    for all 12 unordered comparisons.  */
-/* { dg-final { scan-assembler-times {\tfcmuo\tp[0-9]+\.d, p[0-7]/z, z[0-9]+\.d, z[0-9]+\.d\n} 252 { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-times {\tfcmuo\tp[0-9]+\.d, p[0-7]/z, z[0-9]+\.d, z[0-9]+\.d\n} 252 } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/vcond_5.c b/gcc/testsuite/gcc.target/aarch64/sve/vcond_5.c
index 23bfb7b2649..2f16fbff522 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/vcond_5.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/vcond_5.c
@@ -19,16 +19,16 @@
 /* { dg-final { scan-assembler-times {\tfcmlt\tp[0-9]+\.s, p[0-7]/z, z[0-9]+\.s, z[0-9]+\.s} 40 { xfail *-*-* } } } */
 
 /* 5 for le, 5 for ule and 5 for nule.  */
-/* { dg-final { scan-assembler-times {\tfcmle\tp[0-9]+\.s, p[0-7]/z, z[0-9]+\.s, #0\.0} 15 { xfail *-*-* } } } */
-/* { dg-final { scan-assembler-times {\tfcmle\tp[0-9]+\.s, p[0-7]/z, z[0-9]+\.s, z[0-9]+\.s} 30 { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-times {\tfcmle\tp[0-9]+\.s, p[0-7]/z, z[0-9]+\.s, #0\.0} 15 } } */
+/* { dg-final { scan-assembler-times {\tfcmle\tp[0-9]+\.s, p[0-7]/z, z[0-9]+\.s, z[0-9]+\.s} 30 } } */
 
 /* 5 for gt, 5 for ugt, 5 for nueq and 5 for nugt.  */
 /* { dg-final { scan-assembler-times {\tfcmgt\tp[0-9]+\.s, p[0-7]/z, z[0-9]+\.s, #0\.0} 20 { xfail *-*-* } } } */
 /* { dg-final { scan-assembler-times {\tfcmgt\tp[0-9]+\.s, p[0-7]/z, z[0-9]+\.s, z[0-9]+\.s} 40 { xfail *-*-* } } } */
 
 /* 5 for ge, 5 for uge and 5 for nuge.  */
-/* { dg-final { scan-assembler-times {\tfcmge\tp[0-9]+\.s, p[0-7]/z, z[0-9]+\.s, #0\.0} 15 { xfail *-*-* } } } */
-/* { dg-final { scan-assembler-times {\tfcmge\tp[0-9]+\.s, p[0-7]/z, z[0-9]+\.s, z[0-9]+\.s} 30 { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-times {\tfcmge\tp[0-9]+\.s, p[0-7]/z, z[0-9]+\.s, #0\.0} 15 } } */
+/* { dg-final { scan-assembler-times {\tfcmge\tp[0-9]+\.s, p[0-7]/z, z[0-9]+\.s, z[0-9]+\.s} 30 } } */
 
 /* { dg-final { scan-assembler-not {\tfcmuo\tp[0-9]+\.s, p[0-7]/z, z[0-9]+\.s, #0\.0} } } */
 /* 3 loops * 5 invocations for ordered, unordered amd ueq.  */
@@ -43,14 +43,14 @@
 /* { dg-final { scan-assembler-times {\tfcmlt\tp[0-9]+\.d, p[0-7]/z, z[0-9]+\.d, #0\.0} 28 { xfail *-*-* } } } */
 /* { dg-final { scan-assembler-times {\tfcmlt\tp[0-9]+\.d, p[0-7]/z, z[0-9]+\.d, z[0-9]+\.d} 56 { xfail *-*-* } } } */
 
-/* { dg-final { scan-assembler-times {\tfcmle\tp[0-9]+\.d, p[0-7]/z, z[0-9]+\.d, #0\.0} 21 { xfail *-*-* } } } */
-/* { dg-final { scan-assembler-times {\tfcmle\tp[0-9]+\.d, p[0-7]/z, z[0-9]+\.d, z[0-9]+\.d} 42 { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-times {\tfcmle\tp[0-9]+\.d, p[0-7]/z, z[0-9]+\.d, #0\.0} 21 } } */
+/* { dg-final { scan-assembler-times {\tfcmle\tp[0-9]+\.d, p[0-7]/z, z[0-9]+\.d, z[0-9]+\.d} 42 } } */
 
 /* { dg-final { scan-assembler-times {\tfcmgt\tp[0-9]+\.d, p[0-7]/z, z[0-9]+\.d, #0\.0} 28 { xfail *-*-* } } } */
 /* { dg-final { scan-assembler-times {\tfcmgt\tp[0-9]+\.d, p[0-7]/z, z[0-9]+\.d, z[0-9]+\.d} 56 { xfail *-*-* } } } */
 
-/* { dg-final { scan-assembler-times {\tfcmge\tp[0-9]+\.d, p[0-7]/z, z[0-9]+\.d, #0\.0} 21 { xfail *-*-* } } } */
-/* { dg-final { scan-assembler-times {\tfcmge\tp[0-9]+\.d, p[0-7]/z, z[0-9]+\.d, z[0-9]+\.d} 42 { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-times {\tfcmge\tp[0-9]+\.d, p[0-7]/z, z[0-9]+\.d, #0\.0} 21 } } */
+/* { dg-final { scan-assembler-times {\tfcmge\tp[0-9]+\.d, p[0-7]/z, z[0-9]+\.d, z[0-9]+\.d} 42 } } */
 
 /* { dg-final { scan-assembler-not {\tfcmuo\tp[0-9]+\.d, p[0-7]/z, z[0-9]+\.d, #0\.0} } } */
 /* 3 loops * 5 invocations, with 2 invocations having ncopies == 2,
diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index 3db4a5cdf78..da952645759 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -6603,7 +6603,7 @@ vectorizable_reduction (stmt_vec_info stmt_info, gimple_stmt_iterator *gsi,
 	}
       else
 	vect_record_loop_mask (loop_vinfo, masks, ncopies * vec_num,
-			       vectype_in);
+			       vectype_in, NULL);
     }
   if (dump_enabled_p ()
       && reduction_type == FOLD_LEFT_REDUCTION)
@@ -8005,7 +8005,7 @@ vectorizable_live_operation (stmt_vec_info stmt_info,
 	      gcc_assert (ncopies == 1 && !slp_node);
 	      vect_record_loop_mask (loop_vinfo,
 				     &LOOP_VINFO_MASKS (loop_vinfo),
-				     1, vectype);
+				     1, vectype, NULL);
 	    }
 	}
       return true;
@@ -8204,11 +8204,12 @@ vect_double_mask_nunits (tree type)
 
 /* Record that a fully-masked version of LOOP_VINFO would need MASKS to
    contain a sequence of NVECTORS masks that each control a vector of type
-   VECTYPE.  */
+   VECTYPE.  If SCALAR_MASK is nonnull, the fully-masked loop would AND
+   these vector masks with the vector version of SCALAR_MASK.  */
 
 void
 vect_record_loop_mask (loop_vec_info loop_vinfo, vec_loop_masks *masks,
-		       unsigned int nvectors, tree vectype)
+		       unsigned int nvectors, tree vectype, tree scalar_mask)
 {
   gcc_assert (nvectors != 0);
   if (masks->length () < nvectors)
@@ -8219,6 +8220,13 @@ vect_record_loop_mask (loop_vec_info loop_vinfo, vec_loop_masks *masks,
   unsigned int nscalars_per_iter
     = exact_div (nvectors * TYPE_VECTOR_SUBPARTS (vectype),
 		 LOOP_VINFO_VECT_FACTOR (loop_vinfo)).to_constant ();
+
+  if (scalar_mask)
+    {
+      scalar_cond_masked_key cond (scalar_mask, nvectors);
+      loop_vinfo->scalar_cond_masked_set.add (cond);
+    }
+
   if (rgm->max_nscalars_per_iter < nscalars_per_iter)
     {
       rgm->max_nscalars_per_iter = nscalars_per_iter;
diff --git a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c
index cac7410387b..4db8e24ccd1 100644
--- a/gcc/tree-vect-stmts.c
+++ b/gcc/tree-vect-stmts.c
@@ -1879,7 +1879,8 @@ static tree permute_vec_elements (tree, tree, tree, stmt_vec_info,
    says how the load or store is going to be implemented and GROUP_SIZE
    is the number of load or store statements in the containing group.
    If the access is a gather load or scatter store, GS_INFO describes
-   its arguments.
+   its arguments.  If the load or store is conditional, SCALAR_MASK is the
+   condition under which it occurs.
 
    Clear LOOP_VINFO_CAN_FULLY_MASK_P if a fully-masked loop is not
    supported, otherwise record the required mask types.  */
@@ -1888,7 +1889,7 @@ static void
 check_load_store_masking (loop_vec_info loop_vinfo, tree vectype,
 			  vec_load_store_type vls_type, int group_size,
 			  vect_memory_access_type memory_access_type,
-			  gather_scatter_info *gs_info)
+			  gather_scatter_info *gs_info, tree scalar_mask)
 {
   /* Invariant loads need no special support.  */
   if (memory_access_type == VMAT_INVARIANT)
@@ -1912,7 +1913,7 @@ check_load_store_masking (loop_vec_info loop_vinfo, tree vectype,
 	  return;
 	}
       unsigned int ncopies = vect_get_num_copies (loop_vinfo, vectype);
-      vect_record_loop_mask (loop_vinfo, masks, ncopies, vectype);
+      vect_record_loop_mask (loop_vinfo, masks, ncopies, vectype, scalar_mask);
       return;
     }
 
@@ -1936,7 +1937,7 @@ check_load_store_masking (loop_vec_info loop_vinfo, tree vectype,
 	  return;
 	}
       unsigned int ncopies = vect_get_num_copies (loop_vinfo, vectype);
-      vect_record_loop_mask (loop_vinfo, masks, ncopies, vectype);
+      vect_record_loop_mask (loop_vinfo, masks, ncopies, vectype, scalar_mask);
       return;
     }
 
@@ -1974,7 +1975,7 @@ check_load_store_masking (loop_vec_info loop_vinfo, tree vectype,
   poly_uint64 vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
   unsigned int nvectors;
   if (can_div_away_from_zero_p (group_size * vf, nunits, &nvectors))
-    vect_record_loop_mask (loop_vinfo, masks, nvectors, vectype);
+    vect_record_loop_mask (loop_vinfo, masks, nvectors, vectype, scalar_mask);
   else
     gcc_unreachable ();
 }
@@ -3436,7 +3437,9 @@ vectorizable_call (stmt_vec_info stmt_info, gimple_stmt_iterator *gsi,
 	  unsigned int nvectors = (slp_node
 				   ? SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node)
 				   : ncopies);
-	  vect_record_loop_mask (loop_vinfo, masks, nvectors, vectype_out);
+	  tree scalar_mask = gimple_call_arg (stmt_info->stmt, mask_opno);
+	  vect_record_loop_mask (loop_vinfo, masks, nvectors,
+				 vectype_out, scalar_mask);
 	}
       return true;
     }
@@ -7390,7 +7393,7 @@ vectorizable_store (stmt_vec_info stmt_info, gimple_stmt_iterator *gsi,
       if (loop_vinfo
 	  && LOOP_VINFO_CAN_FULLY_MASK_P (loop_vinfo))
 	check_load_store_masking (loop_vinfo, vectype, vls_type, group_size,
-				  memory_access_type, &gs_info);
+				  memory_access_type, &gs_info, mask);
 
       STMT_VINFO_TYPE (stmt_info) = store_vec_info_type;
       vect_model_store_cost (stmt_info, ncopies, rhs_dt, memory_access_type,
@@ -8637,7 +8640,7 @@ vectorizable_load (stmt_vec_info stmt_info, gimple_stmt_iterator *gsi,
       if (loop_vinfo
 	  && LOOP_VINFO_CAN_FULLY_MASK_P (loop_vinfo))
 	check_load_store_masking (loop_vinfo, vectype, VLS_LOAD, group_size,
-				  memory_access_type, &gs_info);
+				  memory_access_type, &gs_info, mask);
 
       STMT_VINFO_TYPE (stmt_info) = load_vec_info_type;
       vect_model_load_cost (stmt_info, ncopies, memory_access_type,
@@ -9774,6 +9777,10 @@ vect_is_simple_cond (tree cond, vec_info *vinfo,
 
    When STMT_INFO is vectorized as a nested cycle, for_reduction is true.
 
+   For COND_EXPR<C, T, E> if T comes from masked load, and is conditional
+   on C, we apply loop mask to result of vector comparison, if it's present.
+   Similarly for E, if it is conditional on !C.
+
    Return true if STMT_INFO is vectorizable in this way.  */
 
 bool
@@ -9999,6 +10006,35 @@ vectorizable_condition (stmt_vec_info stmt_info, gimple_stmt_iterator *gsi,
   /* Handle cond expr.  */
   for (j = 0; j < ncopies; j++)
     {
+      tree loop_mask = NULL_TREE;
+      bool swap_cond_operands = false;
+
+      /* Look up if there is a loop mask associated with the
+	 scalar cond, or it's inverse.  */
+
+      if (loop_vinfo && LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
+	{
+	  scalar_cond_masked_key cond (cond_expr, ncopies);
+	  if (loop_vinfo->scalar_cond_masked_set.contains (cond))
+	    {
+	      vec_loop_masks *masks = &LOOP_VINFO_MASKS (loop_vinfo);
+	      loop_mask = vect_get_loop_mask (gsi, masks, ncopies, vectype, j);
+	    }
+	  else
+	    {
+	      bool honor_nans = HONOR_NANS (TREE_TYPE (cond.op0));
+	      cond.code = invert_tree_comparison (cond.code, honor_nans);
+	      if (loop_vinfo->scalar_cond_masked_set.contains (cond))
+		{
+		  vec_loop_masks *masks = &LOOP_VINFO_MASKS (loop_vinfo);
+		  loop_mask = vect_get_loop_mask (gsi, masks, ncopies,
+						  vectype, j);
+		  cond_code = cond.code;
+		  swap_cond_operands = true;
+		}
+	    }
+	}
+
       stmt_vec_info new_stmt_info = NULL;
       if (j == 0)
 	{
@@ -10076,6 +10112,9 @@ vectorizable_condition (stmt_vec_info stmt_info, gimple_stmt_iterator *gsi,
           vec_then_clause = vec_oprnds2[i];
           vec_else_clause = vec_oprnds3[i];
 
+	  if (swap_cond_operands)
+	    std::swap (vec_then_clause, vec_else_clause);
+
 	  if (masked)
 	    vec_compare = vec_cond_lhs;
 	  else
@@ -10114,6 +10153,47 @@ vectorizable_condition (stmt_vec_info stmt_info, gimple_stmt_iterator *gsi,
 		    }
 		}
 	    }
+
+	  /* If loop mask is present, then AND it with
+	     result of vec comparison, so later passes (fre4)
+	     will reuse the same condition used in masked load.
+
+	     For example:
+	     for (int i = 0; i < 100; ++i)
+	       x[i] = y[i] ? z[i] : 10;
+
+	     results in following optimized GIMPLE: 
+
+	     mask__35.8_43 = vect__4.7_41 != { 0, ... };
+	     vec_mask_and_46 = loop_mask_40 & mask__35.8_43;
+	     _19 = &MEM[base: z_12(D), index: ivtmp_56, step: 4, offset: 0B];
+	     vect_iftmp.11_47 = .MASK_LOAD (_19, 4B, vec_mask_and_46);
+	     vect_iftmp.12_52 = VEC_COND_EXPR <vec_mask_and_46,
+					       vect_iftmp.11_47, { 10, ... }>;
+
+	     instead of recomputing vec != { 0, ... } in vec_cond_expr  */
+
+	  if (loop_mask)
+	    {
+	      if (COMPARISON_CLASS_P (vec_compare))
+		{
+		  tree tmp = make_ssa_name (vec_cmp_type);
+		  tree op0 = TREE_OPERAND (vec_compare, 0);
+		  tree op1 = TREE_OPERAND (vec_compare, 1);
+		  gassign *g = gimple_build_assign (tmp,
+						    TREE_CODE (vec_compare),
+						    op0, op1);
+		  vect_finish_stmt_generation (stmt_info, g, gsi);
+		  vec_compare = tmp;
+		}
+
+	      tree tmp2 = make_ssa_name (vec_cmp_type);
+	      gassign *g = gimple_build_assign (tmp2, BIT_AND_EXPR,
+						vec_compare, loop_mask);
+	      vect_finish_stmt_generation (stmt_info, g, gsi);
+	      vec_compare = tmp2;
+	    }
+
 	  if (reduction_type == EXTRACT_LAST_REDUCTION)
 	    {
 	      if (!is_gimple_val (vec_compare))
diff --git a/gcc/tree-vectorizer.c b/gcc/tree-vectorizer.c
index 800c99fea26..20945a39c84 100644
--- a/gcc/tree-vectorizer.c
+++ b/gcc/tree-vectorizer.c
@@ -1516,3 +1516,36 @@ make_pass_ipa_increase_alignment (gcc::context *ctxt)
 {
   return new pass_ipa_increase_alignment (ctxt);
 }
+
+/* If T is a comparison, or the SSA name holding the result of a
+   comparison stmt, record the comparison code and its operands.
+   Otherwise record <NE_EXPR, T, 0>.  */
+
+void
+scalar_cond_masked_key::get_cond_ops_from_tree (tree t)
+{
+  if (TREE_CODE_CLASS (TREE_CODE (t)) == tcc_comparison)
+    {
+      this->code = TREE_CODE (t);
+      this->op0 = TREE_OPERAND (t, 0);
+      this->op1 = TREE_OPERAND (t, 1);
+      return;
+    }
+
+  if (TREE_CODE (t) == SSA_NAME)
+    if (gassign *stmt = dyn_cast<gassign *> (SSA_NAME_DEF_STMT (t)))
+      {
+	tree_code code = gimple_assign_rhs_code (stmt);
+	if (TREE_CODE_CLASS (code) == tcc_comparison)
+	  {
+	    this->code = code;
+	    this->op0 = gimple_assign_rhs1 (stmt);
+	    this->op1 = gimple_assign_rhs2 (stmt);
+	    return;
+	  }
+      }
+
+  this->code = NE_EXPR;
+  this->op0 = t;
+  this->op1 = build_zero_cst (TREE_TYPE (t));
+}
diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index 837fb5ab525..632f12a30dc 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -26,6 +26,7 @@ typedef class _stmt_vec_info *stmt_vec_info;
 #include "tree-data-ref.h"
 #include "tree-hash-traits.h"
 #include "target.h"
+#include "hash-set.h"
 
 /* Used for naming of new temporaries.  */
 enum vect_var_kind {
@@ -177,7 +178,78 @@ public:
 #define SLP_TREE_TWO_OPERATORS(S)		 (S)->two_operators
 #define SLP_TREE_DEF_TYPE(S)			 (S)->def_type
 
+/* Key for a hash_set that records which scalar conditions have already
+   had a loop mask applied to them in a masked load or store.  The set
+   is populated by vect_record_loop_mask.  vectorizable_condition uses
+   it to check whether the scalar condition (or its inverse) has a loop
+   mask associated with it, and if so applies that loop mask to the
+   result of the vector comparison.  */
+
+struct scalar_cond_masked_key
+{
+  scalar_cond_masked_key (tree t, unsigned ncopies_)
+    : ncopies (ncopies_)
+  {
+    get_cond_ops_from_tree (t);
+  }
+
+  void get_cond_ops_from_tree (tree);
+
+  unsigned ncopies;
+  tree_code code;
+  tree op0;
+  tree op1;
+};
 
+template<>
+struct default_hash_traits<scalar_cond_masked_key>
+{
+  typedef scalar_cond_masked_key compare_type;
+  typedef scalar_cond_masked_key value_type;
+
+  static inline hashval_t
+  hash (value_type v)
+  {
+    inchash::hash h;
+    h.add_int (v.code);
+    inchash::add_expr (v.op0, h, 0);
+    inchash::add_expr (v.op1, h, 0);
+    h.add_int (v.ncopies);
+    return h.end ();
+  }
+
+  static inline bool
+  equal (value_type existing, value_type candidate)
+  {
+    return (existing.ncopies == candidate.ncopies
+	    && existing.code == candidate.code
+	    && operand_equal_p (existing.op0, candidate.op0, 0)
+	    && operand_equal_p (existing.op1, candidate.op1, 0));
+  }
+
+  static inline void
+  mark_empty (value_type &v)
+  {
+    v.ncopies = 0;
+  }
+
+  static inline bool
+  is_empty (value_type v)
+  {
+    return v.ncopies == 0;
+  }
+
+  static inline void mark_deleted (value_type &) {}
+
+  static inline bool is_deleted (const value_type &)
+  {
+    return false;
+  }
+
+  static inline void remove (value_type &) {}
+};
+
+typedef hash_set<scalar_cond_masked_key> scalar_cond_masked_set_type;
 
 /* Describes two objects whose addresses must be unequal for the vectorized
    loop to be valid.  */
@@ -258,6 +330,9 @@ public:
   /* Cost data used by the target cost model.  */
   void *target_cost_data;
 
+  /* Set of scalar conditions that have loop mask applied.  */
+  scalar_cond_masked_set_type scalar_cond_masked_set;
+
 private:
   stmt_vec_info new_stmt_vec_info (gimple *stmt);
   void set_vinfo_for_stmt (gimple *, stmt_vec_info);
@@ -1641,7 +1716,7 @@ extern void vect_gen_vector_loop_niters (loop_vec_info, tree, tree *,
 extern tree vect_halve_mask_nunits (tree);
 extern tree vect_double_mask_nunits (tree);
 extern void vect_record_loop_mask (loop_vec_info, vec_loop_masks *,
-				   unsigned int, tree);
+				   unsigned int, tree, tree);
 extern tree vect_get_loop_mask (gimple_stmt_iterator *, vec_loop_masks *,
 				unsigned int, tree, unsigned int);
 

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [SVE] PR86753
  2019-10-08  0:10                                                           ` Prathamesh Kulkarni
@ 2019-10-08  7:51                                                             ` Richard Sandiford
  2019-10-09  3:23                                                               ` Prathamesh Kulkarni
  0 siblings, 1 reply; 41+ messages in thread
From: Richard Sandiford @ 2019-10-08  7:51 UTC (permalink / raw)
  To: Prathamesh Kulkarni; +Cc: Richard Biener, gcc Patches

Leaving the main review to Richard, just some comments...

Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> @@ -9774,6 +9777,10 @@ vect_is_simple_cond (tree cond, vec_info *vinfo,
>  
>     When STMT_INFO is vectorized as a nested cycle, for_reduction is true.
>  
> +   For COND_EXPR<C, T, E> if T comes from masked load, and is conditional
> +   on C, we apply loop mask to result of vector comparison, if it's present.
> +   Similarly for E, if it is conditional on !C.
> +
>     Return true if STMT_INFO is vectorizable in this way.  */
>  
>  bool

I think this is a bit misleading.  But IMO it'd be better not to have
a comment here and just rely on the one in the main function body.
This optimisation isn't really changing the vectorisation strategy,
and the comment could easily get forgotten if things change in future.

> [...]
> @@ -9999,6 +10006,35 @@ vectorizable_condition (stmt_vec_info stmt_info, gimple_stmt_iterator *gsi,
>    /* Handle cond expr.  */
>    for (j = 0; j < ncopies; j++)
>      {
> +      tree loop_mask = NULL_TREE;
> +      bool swap_cond_operands = false;
> +
> +      /* Look up if there is a loop mask associated with the
> +	 scalar cond, or it's inverse.  */

Maybe:

   See whether another part of the vectorized code applies a loop
   mask to the condition, or to its inverse.

> +
> +      if (loop_vinfo && LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
> +	{
> +	  scalar_cond_masked_key cond (cond_expr, ncopies);
> +	  if (loop_vinfo->scalar_cond_masked_set.contains (cond))
> +	    {
> +	      vec_loop_masks *masks = &LOOP_VINFO_MASKS (loop_vinfo);
> +	      loop_mask = vect_get_loop_mask (gsi, masks, ncopies, vectype, j);
> +	    }
> +	  else
> +	    {
> +	      bool honor_nans = HONOR_NANS (TREE_TYPE (cond.op0));
> +	      cond.code = invert_tree_comparison (cond.code, honor_nans);
> +	      if (loop_vinfo->scalar_cond_masked_set.contains (cond))
> +		{
> +		  vec_loop_masks *masks = &LOOP_VINFO_MASKS (loop_vinfo);
> +		  loop_mask = vect_get_loop_mask (gsi, masks, ncopies,
> +						  vectype, j);
> +		  cond_code = cond.code;
> +		  swap_cond_operands = true;
> +		}
> +	    }
> +	}
> +
>        stmt_vec_info new_stmt_info = NULL;
>        if (j == 0)
>  	{
> @@ -10114,6 +10153,47 @@ vectorizable_condition (stmt_vec_info stmt_info, gimple_stmt_iterator *gsi,
>  		    }
>  		}
>  	    }
> +
> +	  /* If loop mask is present, then AND it with

Maybe "If we decided to apply a loop mask, ..."

> +	     result of vec comparison, so later passes (fre4)

Probably better not to name the pass -- could easily change in future.

> +	     will reuse the same condition used in masked load.

Could be a masked store, or potentially other things too.
So maybe just "will reuse the masked condition"?
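
For instance, a conditional store along the lines of the following
(purely an illustrative example, not one of the tests in the patch)
also gets its scalar condition recorded via vect_record_loop_mask,
so the reuse isn't specific to loads:

    void
    store_if_nonzero (int *restrict x, int *restrict y, int n)
    {
      for (int i = 0; i < n; ++i)
        if (y[i] != 0)  /* After if-conversion this guards a .MASK_STORE.  */
          x[i] = y[i];
    }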

> +
> +	     For example:
> +	     for (int i = 0; i < 100; ++i)
> +	       x[i] = y[i] ? z[i] : 10;
> +
> +	     results in following optimized GIMPLE: 
> +
> +	     mask__35.8_43 = vect__4.7_41 != { 0, ... };
> +	     vec_mask_and_46 = loop_mask_40 & mask__35.8_43;
> +	     _19 = &MEM[base: z_12(D), index: ivtmp_56, step: 4, offset: 0B];
> +	     vect_iftmp.11_47 = .MASK_LOAD (_19, 4B, vec_mask_and_46);
> +	     vect_iftmp.12_52 = VEC_COND_EXPR <vec_mask_and_46,
> +					       vect_iftmp.11_47, { 10, ... }>;
> +
> +	     instead of recomputing vec != { 0, ... } in vec_cond_expr  */

That's true, but gives the impression that avoiding the vec != { 0, ... }
is the main goal, whereas we could do that just by forcing a three-operand
COND_EXPR.  It's really more about making sure that vec != { 0, ... }
and its masked form aren't both live at the same time.  So maybe:

	     instead of using masked and unmasked forms of
	     vect__4.7_41 != { 0, ... } (masked in the MASK_LOAD,
	     unmasked in the VEC_COND_EXPR).  */
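
Put another way, the form we're trying to avoid -- the sketch below is
illustrative, not actual dump output -- computes an unmasked form of the
comparison for the VEC_COND_EXPR while the MASK_LOAD uses the masked
vec_mask_and_46:

    mask__35.8_43 = vect__4.7_41 != { 0, ... };
    vec_mask_and_46 = loop_mask_40 & mask__35.8_43;
    vect_iftmp.11_47 = .MASK_LOAD (_19, 4B, vec_mask_and_46);
    vect_iftmp.12_52 = VEC_COND_EXPR <vect__4.7_41 != { 0, ... },
                                      vect_iftmp.11_47, { 10, ... }>;

so both forms stay live; with the reuse, only vec_mask_and_46 does.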

Thanks,
Richard

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [SVE] PR86753
  2019-10-08  7:51                                                             ` Richard Sandiford
@ 2019-10-09  3:23                                                               ` Prathamesh Kulkarni
  2019-10-15  6:11                                                                 ` Prathamesh Kulkarni
  0 siblings, 1 reply; 41+ messages in thread
From: Prathamesh Kulkarni @ 2019-10-09  3:23 UTC (permalink / raw)
  To: Richard Sandiford; +Cc: Richard Biener, gcc Patches

[-- Attachment #1: Type: text/plain, Size: 4501 bytes --]

On Tue, 8 Oct 2019 at 13:21, Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> Leaving the main review to Richard, just some comments...
>
> Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> > @@ -9774,6 +9777,10 @@ vect_is_simple_cond (tree cond, vec_info *vinfo,
> >
> >     When STMT_INFO is vectorized as a nested cycle, for_reduction is true.
> >
> > +   For COND_EXPR<C, T, E> if T comes from masked load, and is conditional
> > +   on C, we apply loop mask to result of vector comparison, if it's present.
> > +   Similarly for E, if it is conditional on !C.
> > +
> >     Return true if STMT_INFO is vectorizable in this way.  */
> >
> >  bool
>
> I think this is a bit misleading.  But IMO it'd be better not to have
> a comment here and just rely on the one in the main function body.
> This optimisation isn't really changing the vectorisation strategy,
> and the comment could easily get forgotten if things change in future.
>
> > [...]
> > @@ -9999,6 +10006,35 @@ vectorizable_condition (stmt_vec_info stmt_info, gimple_stmt_iterator *gsi,
> >    /* Handle cond expr.  */
> >    for (j = 0; j < ncopies; j++)
> >      {
> > +      tree loop_mask = NULL_TREE;
> > +      bool swap_cond_operands = false;
> > +
> > +      /* Look up if there is a loop mask associated with the
> > +      scalar cond, or it's inverse.  */
>
> Maybe:
>
>    See whether another part of the vectorized code applies a loop
>    mask to the condition, or to its inverse.
>
> > +
> > +      if (loop_vinfo && LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
> > +     {
> > +       scalar_cond_masked_key cond (cond_expr, ncopies);
> > +       if (loop_vinfo->scalar_cond_masked_set.contains (cond))
> > +         {
> > +           vec_loop_masks *masks = &LOOP_VINFO_MASKS (loop_vinfo);
> > +           loop_mask = vect_get_loop_mask (gsi, masks, ncopies, vectype, j);
> > +         }
> > +       else
> > +         {
> > +           bool honor_nans = HONOR_NANS (TREE_TYPE (cond.op0));
> > +           cond.code = invert_tree_comparison (cond.code, honor_nans);
> > +           if (loop_vinfo->scalar_cond_masked_set.contains (cond))
> > +             {
> > +               vec_loop_masks *masks = &LOOP_VINFO_MASKS (loop_vinfo);
> > +               loop_mask = vect_get_loop_mask (gsi, masks, ncopies,
> > +                                               vectype, j);
> > +               cond_code = cond.code;
> > +               swap_cond_operands = true;
> > +             }
> > +         }
> > +     }
> > +
> >        stmt_vec_info new_stmt_info = NULL;
> >        if (j == 0)
> >       {
> > @@ -10114,6 +10153,47 @@ vectorizable_condition (stmt_vec_info stmt_info, gimple_stmt_iterator *gsi,
> >                   }
> >               }
> >           }
> > +
> > +       /* If loop mask is present, then AND it with
>
> Maybe "If we decided to apply a loop mask, ..."
>
> > +          result of vec comparison, so later passes (fre4)
>
> Probably better not to name the pass -- could easily change in future.
>
> > +          will reuse the same condition used in masked load.
>
> Could be a masked store, or potentially other things too.
> So maybe just "will reuse the masked condition"?
>
> > +
> > +          For example:
> > +          for (int i = 0; i < 100; ++i)
> > +            x[i] = y[i] ? z[i] : 10;
> > +
> > +          results in following optimized GIMPLE:
> > +
> > +          mask__35.8_43 = vect__4.7_41 != { 0, ... };
> > +          vec_mask_and_46 = loop_mask_40 & mask__35.8_43;
> > +          _19 = &MEM[base: z_12(D), index: ivtmp_56, step: 4, offset: 0B];
> > +          vect_iftmp.11_47 = .MASK_LOAD (_19, 4B, vec_mask_and_46);
> > +          vect_iftmp.12_52 = VEC_COND_EXPR <vec_mask_and_46,
> > +                                            vect_iftmp.11_47, { 10, ... }>;
> > +
> > +          instead of recomputing vec != { 0, ... } in vec_cond_expr  */
>
> That's true, but gives the impression that avoiding the vec != { 0, ... }
> is the main goal, whereas we could do that just by forcing a three-operand
> COND_EXPR.  It's really more about making sure that vec != { 0, ... }
> and its masked form aren't both live at the same time.  So maybe:
>
>              instead of using a masked and unmasked forms of
>              vect__4.7_41 != { 0, ... } (masked in the MASK_LOAD,
>              unmasked in the VEC_COND_EXPR).  */
>
Hi Richard,
Thanks for the suggestions, I have updated comments in the attached patch.

Thanks,
Prathamesh
> Thanks,
> Richard

[-- Attachment #2: pr86753-v2-5.diff --]
[-- Type: text/x-patch, Size: 24855 bytes --]

diff --git a/gcc/testsuite/gcc.target/aarch64/sve/cond_cnot_2.c b/gcc/testsuite/gcc.target/aarch64/sve/cond_cnot_2.c
index d689e21dc11..3df2431be38 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/cond_cnot_2.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/cond_cnot_2.c
@@ -32,4 +32,4 @@ TEST_ALL (DEF_LOOP)
 /* { dg-final { scan-assembler-not {\tmov\tz} } } */
 /* { dg-final { scan-assembler-not {\tmovprfx\t} } } */
 /* Currently we canonicalize the ?: so that !b[i] is the "false" value.  */
-/* { dg-final { scan-assembler-not {\tsel\t} { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-not {\tsel\t} } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/cond_convert_1.c b/gcc/testsuite/gcc.target/aarch64/sve/cond_convert_1.c
index dcc30768f88..86064ebfcba 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/cond_convert_1.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/cond_convert_1.c
@@ -11,7 +11,10 @@
 		   INT_TYPE *__restrict pred, int n)		\
   {								\
     for (int i = 0; i < n; ++i)					\
-      r[i] = pred[i] ? (FLOAT_TYPE) a[i] : b[i];		\
+      {								\
+	FLOAT_TYPE bi = b[i];					\
+	r[i] = pred[i] ? (FLOAT_TYPE) a[i] : bi;		\
+      }								\
   }
 
 #define TEST_ALL(T) \
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/cond_convert_4.c b/gcc/testsuite/gcc.target/aarch64/sve/cond_convert_4.c
index 7e5f2a73ed9..e3a947b2698 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/cond_convert_4.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/cond_convert_4.c
@@ -11,7 +11,10 @@
 		   INT_TYPE *__restrict pred, int n)		\
   {								\
     for (int i = 0; i < n; ++i)					\
-      r[i] = pred[i] ? (INT_TYPE) a[i] : b[i];			\
+      {								\
+	INT_TYPE bi = b[i];					\
+	r[i] = pred[i] ? (INT_TYPE) a[i] : bi;			\
+      }								\
   }
 
 #define TEST_ALL(T) \
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/cond_unary_2.c b/gcc/testsuite/gcc.target/aarch64/sve/cond_unary_2.c
index 991ccf016d1..97d1b8f5d45 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/cond_unary_2.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/cond_unary_2.c
@@ -13,7 +13,10 @@
 		      TYPE *__restrict pred, int n)		\
   {								\
     for (int i = 0; i < n; ++i)					\
-      r[i] = pred[i] ? OP (a[i]) : b[i];			\
+      {								\
+	TYPE bi = b[i];						\
+	r[i] = pred[i] ? OP (a[i]) : bi;			\
+      }								\
   }
 
 #define TEST_INT_TYPE(T, TYPE) \
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/fmla_2.c b/gcc/testsuite/gcc.target/aarch64/sve/fmla_2.c
index 5c04bcdb3f5..a1b0667dab5 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/fmla_2.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/fmla_2.c
@@ -15,5 +15,9 @@ f (double *restrict a, double *restrict b, double *restrict c,
     }
 }
 
-/* { dg-final { scan-assembler-times {\tfmla\tz[0-9]+\.d, p[0-7]/m, z[0-9]+\.d, z[0-9]+\.d\n} 2 } } */
+/* See https://gcc.gnu.org/ml/gcc-patches/2019-08/msg01644.html
+   for XFAILing the below test.  */
+
+/* { dg-final { scan-assembler-times {\tfmla\tz[0-9]+\.d, p[0-7]/m, z[0-9]+\.d, z[0-9]+\.d\n} 2 { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-times {\tfmla\tz[0-9]+\.d, p[0-7]/m, z[0-9]+\.d, z[0-9]+\.d\n} 3 } } */
 /* { dg-final { scan-assembler-not {\tfmad\t} } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/vcond_4.c b/gcc/testsuite/gcc.target/aarch64/sve/vcond_4.c
index 00d84760a19..b38f23e87ba 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/vcond_4.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/vcond_4.c
@@ -98,24 +98,24 @@ TEST_CMP (nugt)
 /* { dg-final { scan-assembler-times {\tfcmne\tp[0-9]+\.s, p[0-7]/z, z[0-9]+\.s, z[0-9]+\.s\n} 30 { xfail *-*-* } } } */
 
 /* 5 for lt, 5 for ult and 5 for nult.  */
-/* { dg-final { scan-assembler-times {\tfcmlt\tp[0-9]+\.s, p[0-7]/z, z[0-9]+\.s, #0\.0\n} 15 { xfail *-*-* } } } */
-/* { dg-final { scan-assembler-times {\tfcmlt\tp[0-9]+\.s, p[0-7]/z, z[0-9]+\.s, z[0-9]+\.s\n} 30 { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-times {\tfcmlt\tp[0-9]+\.s, p[0-7]/z, z[0-9]+\.s, #0\.0\n} 15 } } */
+/* { dg-final { scan-assembler-times {\tfcmlt\tp[0-9]+\.s, p[0-7]/z, z[0-9]+\.s, z[0-9]+\.s\n} 30 } } */
 
 /* 5 for le, 5 for ule and 5 for nule.  */
-/* { dg-final { scan-assembler-times {\tfcmle\tp[0-9]+\.s, p[0-7]/z, z[0-9]+\.s, #0\.0\n} 15 { xfail *-*-* } } } */
-/* { dg-final { scan-assembler-times {\tfcmle\tp[0-9]+\.s, p[0-7]/z, z[0-9]+\.s, z[0-9]+\.s\n} 30 { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-times {\tfcmle\tp[0-9]+\.s, p[0-7]/z, z[0-9]+\.s, #0\.0\n} 15 } } */
+/* { dg-final { scan-assembler-times {\tfcmle\tp[0-9]+\.s, p[0-7]/z, z[0-9]+\.s, z[0-9]+\.s\n} 30 } } */
 
 /* 5 for gt, 5 for ugt and 5 for nugt.  */
-/* { dg-final { scan-assembler-times {\tfcmgt\tp[0-9]+\.s, p[0-7]/z, z[0-9]+\.s, #0\.0\n} 15 { xfail *-*-* } } } */
-/* { dg-final { scan-assembler-times {\tfcmgt\tp[0-9]+\.s, p[0-7]/z, z[0-9]+\.s, z[0-9]+\.s\n} 30 { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-times {\tfcmgt\tp[0-9]+\.s, p[0-7]/z, z[0-9]+\.s, #0\.0\n} 15 } } */
+/* { dg-final { scan-assembler-times {\tfcmgt\tp[0-9]+\.s, p[0-7]/z, z[0-9]+\.s, z[0-9]+\.s\n} 30 } } */
 
 /* 5 for ge, 5 for uge and 5 for nuge.  */
-/* { dg-final { scan-assembler-times {\tfcmge\tp[0-9]+\.s, p[0-7]/z, z[0-9]+\.s, #0\.0\n} 15 { xfail *-*-* } } } */
-/* { dg-final { scan-assembler-times {\tfcmge\tp[0-9]+\.s, p[0-7]/z, z[0-9]+\.s, z[0-9]+\.s\n} 30 { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-times {\tfcmge\tp[0-9]+\.s, p[0-7]/z, z[0-9]+\.s, #0\.0\n} 15 } } */
+/* { dg-final { scan-assembler-times {\tfcmge\tp[0-9]+\.s, p[0-7]/z, z[0-9]+\.s, z[0-9]+\.s\n} 30 } } */
 
 /* { dg-final { scan-assembler-not {\tfcmuo\tp[0-9]+\.s, p[0-7]/z, z[0-9]+\.s, #0\.0\n} } } */
 /* 3 loops * 5 invocations for all 12 unordered comparisons.  */
-/* { dg-final { scan-assembler-times {\tfcmuo\tp[0-9]+\.s, p[0-7]/z, z[0-9]+\.s, z[0-9]+\.s\n} 180 { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-times {\tfcmuo\tp[0-9]+\.s, p[0-7]/z, z[0-9]+\.s, z[0-9]+\.s\n} 180 } } */
 
 /* { dg-final { scan-assembler-times {\tfcmeq\tp[0-9]+\.d, p[0-7]/z, z[0-9]+\.d, #0\.0\n} 7 { xfail *-*-* } } } */
 /* { dg-final { scan-assembler-times {\tfcmeq\tp[0-9]+\.d, p[0-7]/z, z[0-9]+\.d, z[0-9]+\.d\n} 14 { xfail *-*-* } } } */
@@ -123,19 +123,19 @@ TEST_CMP (nugt)
 /* { dg-final { scan-assembler-times {\tfcmne\tp[0-9]+\.d, p[0-7]/z, z[0-9]+\.d, #0\.0\n} 21 { xfail *-*-* } } } */
 /* { dg-final { scan-assembler-times {\tfcmne\tp[0-9]+\.d, p[0-7]/z, z[0-9]+\.d, z[0-9]+\.d\n} 42 { xfail *-*-* } } } */
 
-/* { dg-final { scan-assembler-times {\tfcmlt\tp[0-9]+\.d, p[0-7]/z, z[0-9]+\.d, #0\.0\n} 21 { xfail *-*-* } } } */
-/* { dg-final { scan-assembler-times {\tfcmlt\tp[0-9]+\.d, p[0-7]/z, z[0-9]+\.d, z[0-9]+\.d\n} 42 { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-times {\tfcmlt\tp[0-9]+\.d, p[0-7]/z, z[0-9]+\.d, #0\.0\n} 21 } } */
+/* { dg-final { scan-assembler-times {\tfcmlt\tp[0-9]+\.d, p[0-7]/z, z[0-9]+\.d, z[0-9]+\.d\n} 42 } } */
 
-/* { dg-final { scan-assembler-times {\tfcmle\tp[0-9]+\.d, p[0-7]/z, z[0-9]+\.d, #0\.0\n} 21 { xfail *-*-* } } } */
-/* { dg-final { scan-assembler-times {\tfcmle\tp[0-9]+\.d, p[0-7]/z, z[0-9]+\.d, z[0-9]+\.d\n} 42 { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-times {\tfcmle\tp[0-9]+\.d, p[0-7]/z, z[0-9]+\.d, #0\.0\n} 21 } } */
+/* { dg-final { scan-assembler-times {\tfcmle\tp[0-9]+\.d, p[0-7]/z, z[0-9]+\.d, z[0-9]+\.d\n} 42 } } */
 
-/* { dg-final { scan-assembler-times {\tfcmgt\tp[0-9]+\.d, p[0-7]/z, z[0-9]+\.d, #0\.0\n} 21 { xfail *-*-* } } } */
-/* { dg-final { scan-assembler-times {\tfcmgt\tp[0-9]+\.d, p[0-7]/z, z[0-9]+\.d, z[0-9]+\.d\n} 42 { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-times {\tfcmgt\tp[0-9]+\.d, p[0-7]/z, z[0-9]+\.d, #0\.0\n} 21 } } */
+/* { dg-final { scan-assembler-times {\tfcmgt\tp[0-9]+\.d, p[0-7]/z, z[0-9]+\.d, z[0-9]+\.d\n} 42 } } */
 
-/* { dg-final { scan-assembler-times {\tfcmge\tp[0-9]+\.d, p[0-7]/z, z[0-9]+\.d, #0\.0\n} 21 { xfail *-*-* } } } */
-/* { dg-final { scan-assembler-times {\tfcmge\tp[0-9]+\.d, p[0-7]/z, z[0-9]+\.d, z[0-9]+\.d\n} 42 { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-times {\tfcmge\tp[0-9]+\.d, p[0-7]/z, z[0-9]+\.d, #0\.0\n} 21 } } */
+/* { dg-final { scan-assembler-times {\tfcmge\tp[0-9]+\.d, p[0-7]/z, z[0-9]+\.d, z[0-9]+\.d\n} 42 } } */
 
 /* { dg-final { scan-assembler-not {\tfcmuo\tp[0-9]+\.d, p[0-7]/z, z[0-9]+\.d, #0\.0\n} } } */
 /* 3 loops * 5 invocations, with 2 invocations having ncopies == 2,
    for all 12 unordered comparisons.  */
-/* { dg-final { scan-assembler-times {\tfcmuo\tp[0-9]+\.d, p[0-7]/z, z[0-9]+\.d, z[0-9]+\.d\n} 252 { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-times {\tfcmuo\tp[0-9]+\.d, p[0-7]/z, z[0-9]+\.d, z[0-9]+\.d\n} 252 } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/vcond_5.c b/gcc/testsuite/gcc.target/aarch64/sve/vcond_5.c
index 23bfb7b2649..2f16fbff522 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/vcond_5.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/vcond_5.c
@@ -19,16 +19,16 @@
 /* { dg-final { scan-assembler-times {\tfcmlt\tp[0-9]+\.s, p[0-7]/z, z[0-9]+\.s, z[0-9]+\.s} 40 { xfail *-*-* } } } */
 
 /* 5 for le, 5 for ule and 5 for nule.  */
-/* { dg-final { scan-assembler-times {\tfcmle\tp[0-9]+\.s, p[0-7]/z, z[0-9]+\.s, #0\.0} 15 { xfail *-*-* } } } */
-/* { dg-final { scan-assembler-times {\tfcmle\tp[0-9]+\.s, p[0-7]/z, z[0-9]+\.s, z[0-9]+\.s} 30 { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-times {\tfcmle\tp[0-9]+\.s, p[0-7]/z, z[0-9]+\.s, #0\.0} 15 } } */
+/* { dg-final { scan-assembler-times {\tfcmle\tp[0-9]+\.s, p[0-7]/z, z[0-9]+\.s, z[0-9]+\.s} 30 } } */
 
 /* 5 for gt, 5 for ugt, 5 for nueq and 5 for nugt.  */
 /* { dg-final { scan-assembler-times {\tfcmgt\tp[0-9]+\.s, p[0-7]/z, z[0-9]+\.s, #0\.0} 20 { xfail *-*-* } } } */
 /* { dg-final { scan-assembler-times {\tfcmgt\tp[0-9]+\.s, p[0-7]/z, z[0-9]+\.s, z[0-9]+\.s} 40 { xfail *-*-* } } } */
 
 /* 5 for ge, 5 for uge and 5 for nuge.  */
-/* { dg-final { scan-assembler-times {\tfcmge\tp[0-9]+\.s, p[0-7]/z, z[0-9]+\.s, #0\.0} 15 { xfail *-*-* } } } */
-/* { dg-final { scan-assembler-times {\tfcmge\tp[0-9]+\.s, p[0-7]/z, z[0-9]+\.s, z[0-9]+\.s} 30 { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-times {\tfcmge\tp[0-9]+\.s, p[0-7]/z, z[0-9]+\.s, #0\.0} 15 } } */
+/* { dg-final { scan-assembler-times {\tfcmge\tp[0-9]+\.s, p[0-7]/z, z[0-9]+\.s, z[0-9]+\.s} 30 } } */
 
 /* { dg-final { scan-assembler-not {\tfcmuo\tp[0-9]+\.s, p[0-7]/z, z[0-9]+\.s, #0\.0} } } */
 /* 3 loops * 5 invocations for ordered, unordered amd ueq.  */
@@ -43,14 +43,14 @@
 /* { dg-final { scan-assembler-times {\tfcmlt\tp[0-9]+\.d, p[0-7]/z, z[0-9]+\.d, #0\.0} 28 { xfail *-*-* } } } */
 /* { dg-final { scan-assembler-times {\tfcmlt\tp[0-9]+\.d, p[0-7]/z, z[0-9]+\.d, z[0-9]+\.d} 56 { xfail *-*-* } } } */
 
-/* { dg-final { scan-assembler-times {\tfcmle\tp[0-9]+\.d, p[0-7]/z, z[0-9]+\.d, #0\.0} 21 { xfail *-*-* } } } */
-/* { dg-final { scan-assembler-times {\tfcmle\tp[0-9]+\.d, p[0-7]/z, z[0-9]+\.d, z[0-9]+\.d} 42 { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-times {\tfcmle\tp[0-9]+\.d, p[0-7]/z, z[0-9]+\.d, #0\.0} 21 } } */
+/* { dg-final { scan-assembler-times {\tfcmle\tp[0-9]+\.d, p[0-7]/z, z[0-9]+\.d, z[0-9]+\.d} 42 } } */
 
 /* { dg-final { scan-assembler-times {\tfcmgt\tp[0-9]+\.d, p[0-7]/z, z[0-9]+\.d, #0\.0} 28 { xfail *-*-* } } } */
 /* { dg-final { scan-assembler-times {\tfcmgt\tp[0-9]+\.d, p[0-7]/z, z[0-9]+\.d, z[0-9]+\.d} 56 { xfail *-*-* } } } */
 
-/* { dg-final { scan-assembler-times {\tfcmge\tp[0-9]+\.d, p[0-7]/z, z[0-9]+\.d, #0\.0} 21 { xfail *-*-* } } } */
-/* { dg-final { scan-assembler-times {\tfcmge\tp[0-9]+\.d, p[0-7]/z, z[0-9]+\.d, z[0-9]+\.d} 42 { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-times {\tfcmge\tp[0-9]+\.d, p[0-7]/z, z[0-9]+\.d, #0\.0} 21 } } */
+/* { dg-final { scan-assembler-times {\tfcmge\tp[0-9]+\.d, p[0-7]/z, z[0-9]+\.d, z[0-9]+\.d} 42 } } */
 
 /* { dg-final { scan-assembler-not {\tfcmuo\tp[0-9]+\.d, p[0-7]/z, z[0-9]+\.d, #0\.0} } } */
 /* 3 loops * 5 invocations, with 2 invocations having ncopies == 2,
diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index 3db4a5cdf78..da952645759 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -6603,7 +6603,7 @@ vectorizable_reduction (stmt_vec_info stmt_info, gimple_stmt_iterator *gsi,
 	}
       else
 	vect_record_loop_mask (loop_vinfo, masks, ncopies * vec_num,
-			       vectype_in);
+			       vectype_in, NULL);
     }
   if (dump_enabled_p ()
       && reduction_type == FOLD_LEFT_REDUCTION)
@@ -8005,7 +8005,7 @@ vectorizable_live_operation (stmt_vec_info stmt_info,
 	      gcc_assert (ncopies == 1 && !slp_node);
 	      vect_record_loop_mask (loop_vinfo,
 				     &LOOP_VINFO_MASKS (loop_vinfo),
-				     1, vectype);
+				     1, vectype, NULL);
 	    }
 	}
       return true;
@@ -8204,11 +8204,12 @@ vect_double_mask_nunits (tree type)
 
 /* Record that a fully-masked version of LOOP_VINFO would need MASKS to
    contain a sequence of NVECTORS masks that each control a vector of type
-   VECTYPE.  */
+   VECTYPE.  If SCALAR_MASK is nonnull, the fully-masked loop would AND
+   these vector masks with the vector version of SCALAR_MASK.  */
 
 void
 vect_record_loop_mask (loop_vec_info loop_vinfo, vec_loop_masks *masks,
-		       unsigned int nvectors, tree vectype)
+		       unsigned int nvectors, tree vectype, tree scalar_mask)
 {
   gcc_assert (nvectors != 0);
   if (masks->length () < nvectors)
@@ -8219,6 +8220,13 @@ vect_record_loop_mask (loop_vec_info loop_vinfo, vec_loop_masks *masks,
   unsigned int nscalars_per_iter
     = exact_div (nvectors * TYPE_VECTOR_SUBPARTS (vectype),
 		 LOOP_VINFO_VECT_FACTOR (loop_vinfo)).to_constant ();
+
+  if (scalar_mask)
+    {
+      scalar_cond_masked_key cond (scalar_mask, nvectors);
+      loop_vinfo->scalar_cond_masked_set.add (cond);
+    }
+
   if (rgm->max_nscalars_per_iter < nscalars_per_iter)
     {
       rgm->max_nscalars_per_iter = nscalars_per_iter;
diff --git a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c
index cac7410387b..1e98b687961 100644
--- a/gcc/tree-vect-stmts.c
+++ b/gcc/tree-vect-stmts.c
@@ -1879,7 +1879,8 @@ static tree permute_vec_elements (tree, tree, tree, stmt_vec_info,
    says how the load or store is going to be implemented and GROUP_SIZE
    is the number of load or store statements in the containing group.
    If the access is a gather load or scatter store, GS_INFO describes
-   its arguments.
+   its arguments.  If the load or store is conditional, SCALAR_MASK is the
+   condition under which it occurs.
 
    Clear LOOP_VINFO_CAN_FULLY_MASK_P if a fully-masked loop is not
    supported, otherwise record the required mask types.  */
@@ -1888,7 +1889,7 @@ static void
 check_load_store_masking (loop_vec_info loop_vinfo, tree vectype,
 			  vec_load_store_type vls_type, int group_size,
 			  vect_memory_access_type memory_access_type,
-			  gather_scatter_info *gs_info)
+			  gather_scatter_info *gs_info, tree scalar_mask)
 {
   /* Invariant loads need no special support.  */
   if (memory_access_type == VMAT_INVARIANT)
@@ -1912,7 +1913,7 @@ check_load_store_masking (loop_vec_info loop_vinfo, tree vectype,
 	  return;
 	}
       unsigned int ncopies = vect_get_num_copies (loop_vinfo, vectype);
-      vect_record_loop_mask (loop_vinfo, masks, ncopies, vectype);
+      vect_record_loop_mask (loop_vinfo, masks, ncopies, vectype, scalar_mask);
       return;
     }
 
@@ -1936,7 +1937,7 @@ check_load_store_masking (loop_vec_info loop_vinfo, tree vectype,
 	  return;
 	}
       unsigned int ncopies = vect_get_num_copies (loop_vinfo, vectype);
-      vect_record_loop_mask (loop_vinfo, masks, ncopies, vectype);
+      vect_record_loop_mask (loop_vinfo, masks, ncopies, vectype, scalar_mask);
       return;
     }
 
@@ -1974,7 +1975,7 @@ check_load_store_masking (loop_vec_info loop_vinfo, tree vectype,
   poly_uint64 vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
   unsigned int nvectors;
   if (can_div_away_from_zero_p (group_size * vf, nunits, &nvectors))
-    vect_record_loop_mask (loop_vinfo, masks, nvectors, vectype);
+    vect_record_loop_mask (loop_vinfo, masks, nvectors, vectype, scalar_mask);
   else
     gcc_unreachable ();
 }
@@ -3436,7 +3437,9 @@ vectorizable_call (stmt_vec_info stmt_info, gimple_stmt_iterator *gsi,
 	  unsigned int nvectors = (slp_node
 				   ? SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node)
 				   : ncopies);
-	  vect_record_loop_mask (loop_vinfo, masks, nvectors, vectype_out);
+	  tree scalar_mask = gimple_call_arg (stmt_info->stmt, mask_opno);
+	  vect_record_loop_mask (loop_vinfo, masks, nvectors,
+				 vectype_out, scalar_mask);
 	}
       return true;
     }
@@ -7390,7 +7393,7 @@ vectorizable_store (stmt_vec_info stmt_info, gimple_stmt_iterator *gsi,
       if (loop_vinfo
 	  && LOOP_VINFO_CAN_FULLY_MASK_P (loop_vinfo))
 	check_load_store_masking (loop_vinfo, vectype, vls_type, group_size,
-				  memory_access_type, &gs_info);
+				  memory_access_type, &gs_info, mask);
 
       STMT_VINFO_TYPE (stmt_info) = store_vec_info_type;
       vect_model_store_cost (stmt_info, ncopies, rhs_dt, memory_access_type,
@@ -8637,7 +8640,7 @@ vectorizable_load (stmt_vec_info stmt_info, gimple_stmt_iterator *gsi,
       if (loop_vinfo
 	  && LOOP_VINFO_CAN_FULLY_MASK_P (loop_vinfo))
 	check_load_store_masking (loop_vinfo, vectype, VLS_LOAD, group_size,
-				  memory_access_type, &gs_info);
+				  memory_access_type, &gs_info, mask);
 
       STMT_VINFO_TYPE (stmt_info) = load_vec_info_type;
       vect_model_load_cost (stmt_info, ncopies, memory_access_type,
@@ -9999,6 +10002,35 @@ vectorizable_condition (stmt_vec_info stmt_info, gimple_stmt_iterator *gsi,
   /* Handle cond expr.  */
   for (j = 0; j < ncopies; j++)
     {
+      tree loop_mask = NULL_TREE;
+      bool swap_cond_operands = false;
+
+      /* See whether another part of the vectorized code applies a loop
+	 mask to the condition, or to its inverse.  */
+
+      if (loop_vinfo && LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
+	{
+	  scalar_cond_masked_key cond (cond_expr, ncopies);
+	  if (loop_vinfo->scalar_cond_masked_set.contains (cond))
+	    {
+	      vec_loop_masks *masks = &LOOP_VINFO_MASKS (loop_vinfo);
+	      loop_mask = vect_get_loop_mask (gsi, masks, ncopies, vectype, j);
+	    }
+	  else
+	    {
+	      bool honor_nans = HONOR_NANS (TREE_TYPE (cond.op0));
+	      cond.code = invert_tree_comparison (cond.code, honor_nans);
+	      if (loop_vinfo->scalar_cond_masked_set.contains (cond))
+		{
+		  vec_loop_masks *masks = &LOOP_VINFO_MASKS (loop_vinfo);
+		  loop_mask = vect_get_loop_mask (gsi, masks, ncopies,
+						  vectype, j);
+		  cond_code = cond.code;
+		  swap_cond_operands = true;
+		}
+	    }
+	}
+
       stmt_vec_info new_stmt_info = NULL;
       if (j == 0)
 	{
@@ -10076,6 +10108,9 @@ vectorizable_condition (stmt_vec_info stmt_info, gimple_stmt_iterator *gsi,
           vec_then_clause = vec_oprnds2[i];
           vec_else_clause = vec_oprnds3[i];
 
+	  if (swap_cond_operands)
+	    std::swap (vec_then_clause, vec_else_clause);
+
 	  if (masked)
 	    vec_compare = vec_cond_lhs;
 	  else
@@ -10114,6 +10149,48 @@ vectorizable_condition (stmt_vec_info stmt_info, gimple_stmt_iterator *gsi,
 		    }
 		}
 	    }
+
+	  /* If we decided to apply a loop mask, AND it with the vector
+	     comparison, so later passes will reuse the masked condition.
+
+	     For example:
+	     for (int i = 0; i < 100; ++i)
+	       x[i] = y[i] ? z[i] : 10;
+
+	     results in the following optimized GIMPLE:
+
+	     mask__35.8_43 = vect__4.7_41 != { 0, ... };
+	     vec_mask_and_46 = loop_mask_40 & mask__35.8_43;
+	     _19 = &MEM[base: z_12(D), index: ivtmp_56, step: 4, offset: 0B];
+	     vect_iftmp.11_47 = .MASK_LOAD (_19, 4B, vec_mask_and_46);
+	     vect_iftmp.12_52 = VEC_COND_EXPR <vec_mask_and_46,
+					       vect_iftmp.11_47, { 10, ... }>;
+
+	     instead of using masked and unmasked forms of
+             vec != { 0, ... } (masked in the MASK_LOAD,
+             unmasked in the VEC_COND_EXPR).  */
+
+	  if (loop_mask)
+	    {
+	      if (COMPARISON_CLASS_P (vec_compare))
+		{
+		  tree tmp = make_ssa_name (vec_cmp_type);
+		  tree op0 = TREE_OPERAND (vec_compare, 0);
+		  tree op1 = TREE_OPERAND (vec_compare, 1);
+		  gassign *g = gimple_build_assign (tmp,
+						    TREE_CODE (vec_compare),
+						    op0, op1);
+		  vect_finish_stmt_generation (stmt_info, g, gsi);
+		  vec_compare = tmp;
+		}
+
+	      tree tmp2 = make_ssa_name (vec_cmp_type);
+	      gassign *g = gimple_build_assign (tmp2, BIT_AND_EXPR,
+						vec_compare, loop_mask);
+	      vect_finish_stmt_generation (stmt_info, g, gsi);
+	      vec_compare = tmp2;
+	    }
+
 	  if (reduction_type == EXTRACT_LAST_REDUCTION)
 	    {
 	      if (!is_gimple_val (vec_compare))
diff --git a/gcc/tree-vectorizer.c b/gcc/tree-vectorizer.c
index 800c99fea26..20945a39c84 100644
--- a/gcc/tree-vectorizer.c
+++ b/gcc/tree-vectorizer.c
@@ -1516,3 +1516,36 @@ make_pass_ipa_increase_alignment (gcc::context *ctxt)
 {
   return new pass_ipa_increase_alignment (ctxt);
 }
+
+/* If T is a comparison, or is an SSA_NAME defined by a comparison
+   stmt, extract its operands.
+   Otherwise record <NE_EXPR, T, 0>.  */
+
+void
+scalar_cond_masked_key::get_cond_ops_from_tree (tree t)
+{
+  if (TREE_CODE_CLASS (TREE_CODE (t)) == tcc_comparison)
+    {
+      this->code = TREE_CODE (t);
+      this->op0 = TREE_OPERAND (t, 0);
+      this->op1 = TREE_OPERAND (t, 1);
+      return;
+    }
+
+  if (TREE_CODE (t) == SSA_NAME)
+    if (gassign *stmt = dyn_cast<gassign *> (SSA_NAME_DEF_STMT (t)))
+      {
+	tree_code code = gimple_assign_rhs_code (stmt);
+	if (TREE_CODE_CLASS (code) == tcc_comparison)
+	  {
+	    this->code = code;
+	    this->op0 = gimple_assign_rhs1 (stmt);
+	    this->op1 = gimple_assign_rhs2 (stmt);
+	    return;
+	  }
+      }
+
+  this->code = NE_EXPR;
+  this->op0 = t;
+  this->op1 = build_zero_cst (TREE_TYPE (t));
+}
diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index 837fb5ab525..37367ea1305 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -26,6 +26,7 @@ typedef class _stmt_vec_info *stmt_vec_info;
 #include "tree-data-ref.h"
 #include "tree-hash-traits.h"
 #include "target.h"
+#include "hash-set.h"
 
 /* Used for naming of new temporaries.  */
 enum vect_var_kind {
@@ -177,7 +178,75 @@ public:
 #define SLP_TREE_TWO_OPERATORS(S)		 (S)->two_operators
 #define SLP_TREE_DEF_TYPE(S)			 (S)->def_type
 
+/* Key for the set that records the association between
+   scalar conditions and the corresponding loop mask, and
+   is populated by vect_record_loop_mask.  */
+ 
+struct scalar_cond_masked_key
+{
+  scalar_cond_masked_key (tree t, unsigned ncopies_)
+    : ncopies (ncopies_)
+  {
+    get_cond_ops_from_tree (t);
+  }
+
+  void get_cond_ops_from_tree (tree);
+
+  unsigned ncopies;
+  tree_code code;
+  tree op0;
+  tree op1;
+};
 
+template<>
+struct default_hash_traits<scalar_cond_masked_key>
+{
+  typedef scalar_cond_masked_key compare_type;
+  typedef scalar_cond_masked_key value_type;
+
+  static inline hashval_t
+  hash (value_type v)
+  {
+    inchash::hash h;
+    h.add_int (v.code);
+    inchash::add_expr (v.op0, h, 0);
+    inchash::add_expr (v.op1, h, 0);
+    h.add_int (v.ncopies);
+    return h.end ();
+  }
+
+  static inline bool
+  equal (value_type existing, value_type candidate)
+  {
+    return (existing.ncopies == candidate.ncopies
+	    && existing.code == candidate.code
+	    && operand_equal_p (existing.op0, candidate.op0, 0)
+	    && operand_equal_p (existing.op1, candidate.op1, 0));
+  }
+
+  static inline void
+  mark_empty (value_type &v)
+  {
+    v.ncopies = 0;
+  }
+
+  static inline bool
+  is_empty (value_type v)
+  {
+    return v.ncopies == 0;
+  }
+
+  static inline void mark_deleted (value_type &) {}
+
+  static inline bool is_deleted (const value_type &)
+  {
+    return false;
+  }
+
+  static inline void remove (value_type &) {}
+};
+
+typedef hash_set<scalar_cond_masked_key> scalar_cond_masked_set_type;
 
 /* Describes two objects whose addresses must be unequal for the vectorized
    loop to be valid.  */
@@ -258,6 +327,9 @@ public:
   /* Cost data used by the target cost model.  */
   void *target_cost_data;
 
+  /* Set of scalar conditions that have loop mask applied.  */
+  scalar_cond_masked_set_type scalar_cond_masked_set;
+
 private:
   stmt_vec_info new_stmt_vec_info (gimple *stmt);
   void set_vinfo_for_stmt (gimple *, stmt_vec_info);
@@ -1641,7 +1713,7 @@ extern void vect_gen_vector_loop_niters (loop_vec_info, tree, tree *,
 extern tree vect_halve_mask_nunits (tree);
 extern tree vect_double_mask_nunits (tree);
 extern void vect_record_loop_mask (loop_vec_info, vec_loop_masks *,
-				   unsigned int, tree);
+				   unsigned int, tree, tree);
 extern tree vect_get_loop_mask (gimple_stmt_iterator *, vec_loop_masks *,
 				unsigned int, tree, unsigned int);
 

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [SVE] PR86753
  2019-10-09  3:23                                                               ` Prathamesh Kulkarni
@ 2019-10-15  6:11                                                                 ` Prathamesh Kulkarni
  2019-10-15 11:40                                                                   ` Richard Biener
  0 siblings, 1 reply; 41+ messages in thread
From: Prathamesh Kulkarni @ 2019-10-15  6:11 UTC (permalink / raw)
  To: Richard Sandiford; +Cc: Richard Biener, gcc Patches

[-- Attachment #1: Type: text/plain, Size: 4955 bytes --]

On Wed, 9 Oct 2019 at 08:14, Prathamesh Kulkarni
<prathamesh.kulkarni@linaro.org> wrote:
>
> On Tue, 8 Oct 2019 at 13:21, Richard Sandiford
> <richard.sandiford@arm.com> wrote:
> >
> > Leaving the main review to Richard, just some comments...
> >
> > Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> > > @@ -9774,6 +9777,10 @@ vect_is_simple_cond (tree cond, vec_info *vinfo,
> > >
> > >     When STMT_INFO is vectorized as a nested cycle, for_reduction is true.
> > >
> > > +   For COND_EXPR<C, T, E> if T comes from masked load, and is conditional
> > > +   on C, we apply loop mask to result of vector comparison, if it's present.
> > > +   Similarly for E, if it is conditional on !C.
> > > +
> > >     Return true if STMT_INFO is vectorizable in this way.  */
> > >
> > >  bool
> >
> > I think this is a bit misleading.  But IMO it'd be better not to have
> > a comment here and just rely on the one in the main function body.
> > This optimisation isn't really changing the vectorisation strategy,
> > and the comment could easily get forgotten if things change in future.
> >
> > > [...]
> > > @@ -9999,6 +10006,35 @@ vectorizable_condition (stmt_vec_info stmt_info, gimple_stmt_iterator *gsi,
> > >    /* Handle cond expr.  */
> > >    for (j = 0; j < ncopies; j++)
> > >      {
> > > +      tree loop_mask = NULL_TREE;
> > > +      bool swap_cond_operands = false;
> > > +
> > > +      /* Look up if there is a loop mask associated with the
> > > +      scalar cond, or it's inverse.  */
> >
> > Maybe:
> >
> >    See whether another part of the vectorized code applies a loop
> >    mask to the condition, or to its inverse.
> >
> > > +
> > > +      if (loop_vinfo && LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
> > > +     {
> > > +       scalar_cond_masked_key cond (cond_expr, ncopies);
> > > +       if (loop_vinfo->scalar_cond_masked_set.contains (cond))
> > > +         {
> > > +           vec_loop_masks *masks = &LOOP_VINFO_MASKS (loop_vinfo);
> > > +           loop_mask = vect_get_loop_mask (gsi, masks, ncopies, vectype, j);
> > > +         }
> > > +       else
> > > +         {
> > > +           bool honor_nans = HONOR_NANS (TREE_TYPE (cond.op0));
> > > +           cond.code = invert_tree_comparison (cond.code, honor_nans);
> > > +           if (loop_vinfo->scalar_cond_masked_set.contains (cond))
> > > +             {
> > > +               vec_loop_masks *masks = &LOOP_VINFO_MASKS (loop_vinfo);
> > > +               loop_mask = vect_get_loop_mask (gsi, masks, ncopies,
> > > +                                               vectype, j);
> > > +               cond_code = cond.code;
> > > +               swap_cond_operands = true;
> > > +             }
> > > +         }
> > > +     }
> > > +
> > >        stmt_vec_info new_stmt_info = NULL;
> > >        if (j == 0)
> > >       {
> > > @@ -10114,6 +10153,47 @@ vectorizable_condition (stmt_vec_info stmt_info, gimple_stmt_iterator *gsi,
> > >                   }
> > >               }
> > >           }
> > > +
> > > +       /* If loop mask is present, then AND it with
> >
> > Maybe "If we decided to apply a loop mask, ..."
> >
> > > +          result of vec comparison, so later passes (fre4)
> >
> > Probably better not to name the pass -- could easily change in future.
> >
> > > +          will reuse the same condition used in masked load.
> >
> > Could be a masked store, or potentially other things too.
> > So maybe just "will reuse the masked condition"?
> >
> > > +
> > > +          For example:
> > > +          for (int i = 0; i < 100; ++i)
> > > +            x[i] = y[i] ? z[i] : 10;
> > > +
> > > +          results in following optimized GIMPLE:
> > > +
> > > +          mask__35.8_43 = vect__4.7_41 != { 0, ... };
> > > +          vec_mask_and_46 = loop_mask_40 & mask__35.8_43;
> > > +          _19 = &MEM[base: z_12(D), index: ivtmp_56, step: 4, offset: 0B];
> > > +          vect_iftmp.11_47 = .MASK_LOAD (_19, 4B, vec_mask_and_46);
> > > +          vect_iftmp.12_52 = VEC_COND_EXPR <vec_mask_and_46,
> > > +                                            vect_iftmp.11_47, { 10, ... }>;
> > > +
> > > +          instead of recomputing vec != { 0, ... } in vec_cond_expr  */
> >
> > That's true, but gives the impression that avoiding the vec != { 0, ... }
> > is the main goal, whereas we could do that just by forcing a three-operand
> > COND_EXPR.  It's really more about making sure that vec != { 0, ... }
> > and its masked form aren't both live at the same time.  So maybe:
> >
> >              instead of using a masked and unmasked forms of
> >              vect__4.7_41 != { 0, ... } (masked in the MASK_LOAD,
> >              unmasked in the VEC_COND_EXPR).  */
> >
> Hi Richard,
> Thanks for the suggestions, I have updated comments in the attached patch.
Hi,
The attached patch is rebased on trunk; after the PR91532 fix, the
hunk for fmla_2.c is no longer required.

Thanks,
Prathamesh
>
> Thanks,
> Prathamesh
> > Thanks,
> > Richard

[-- Attachment #2: pr86753-v2-6.diff --]
[-- Type: text/x-patch, Size: 23860 bytes --]

diff --git a/gcc/testsuite/gcc.target/aarch64/sve/cond_cnot_2.c b/gcc/testsuite/gcc.target/aarch64/sve/cond_cnot_2.c
index d689e21dc11..3df2431be38 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/cond_cnot_2.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/cond_cnot_2.c
@@ -32,4 +32,4 @@ TEST_ALL (DEF_LOOP)
 /* { dg-final { scan-assembler-not {\tmov\tz} } } */
 /* { dg-final { scan-assembler-not {\tmovprfx\t} } } */
 /* Currently we canonicalize the ?: so that !b[i] is the "false" value.  */
-/* { dg-final { scan-assembler-not {\tsel\t} { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-not {\tsel\t} } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/cond_convert_1.c b/gcc/testsuite/gcc.target/aarch64/sve/cond_convert_1.c
index dcc30768f88..86064ebfcba 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/cond_convert_1.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/cond_convert_1.c
@@ -11,7 +11,10 @@
 		   INT_TYPE *__restrict pred, int n)		\
   {								\
     for (int i = 0; i < n; ++i)					\
-      r[i] = pred[i] ? (FLOAT_TYPE) a[i] : b[i];		\
+      {								\
+	FLOAT_TYPE bi = b[i];					\
+	r[i] = pred[i] ? (FLOAT_TYPE) a[i] : bi;		\
+      }								\
   }
 
 #define TEST_ALL(T) \
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/cond_convert_4.c b/gcc/testsuite/gcc.target/aarch64/sve/cond_convert_4.c
index 7e5f2a73ed9..e3a947b2698 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/cond_convert_4.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/cond_convert_4.c
@@ -11,7 +11,10 @@
 		   INT_TYPE *__restrict pred, int n)		\
   {								\
     for (int i = 0; i < n; ++i)					\
-      r[i] = pred[i] ? (INT_TYPE) a[i] : b[i];			\
+      {								\
+	INT_TYPE bi = b[i];					\
+	r[i] = pred[i] ? (INT_TYPE) a[i] : bi;			\
+      }								\
   }
 
 #define TEST_ALL(T) \
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/cond_unary_2.c b/gcc/testsuite/gcc.target/aarch64/sve/cond_unary_2.c
index 991ccf016d1..97d1b8f5d45 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/cond_unary_2.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/cond_unary_2.c
@@ -13,7 +13,10 @@
 		      TYPE *__restrict pred, int n)		\
   {								\
     for (int i = 0; i < n; ++i)					\
-      r[i] = pred[i] ? OP (a[i]) : b[i];			\
+      {								\
+	TYPE bi = b[i];						\
+	r[i] = pred[i] ? OP (a[i]) : bi;			\
+      }								\
   }
 
 #define TEST_INT_TYPE(T, TYPE) \
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/vcond_4.c b/gcc/testsuite/gcc.target/aarch64/sve/vcond_4.c
index 00d84760a19..b38f23e87ba 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/vcond_4.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/vcond_4.c
@@ -98,24 +98,24 @@ TEST_CMP (nugt)
 /* { dg-final { scan-assembler-times {\tfcmne\tp[0-9]+\.s, p[0-7]/z, z[0-9]+\.s, z[0-9]+\.s\n} 30 { xfail *-*-* } } } */
 
 /* 5 for lt, 5 for ult and 5 for nult.  */
-/* { dg-final { scan-assembler-times {\tfcmlt\tp[0-9]+\.s, p[0-7]/z, z[0-9]+\.s, #0\.0\n} 15 { xfail *-*-* } } } */
-/* { dg-final { scan-assembler-times {\tfcmlt\tp[0-9]+\.s, p[0-7]/z, z[0-9]+\.s, z[0-9]+\.s\n} 30 { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-times {\tfcmlt\tp[0-9]+\.s, p[0-7]/z, z[0-9]+\.s, #0\.0\n} 15 } } */
+/* { dg-final { scan-assembler-times {\tfcmlt\tp[0-9]+\.s, p[0-7]/z, z[0-9]+\.s, z[0-9]+\.s\n} 30 } } */
 
 /* 5 for le, 5 for ule and 5 for nule.  */
-/* { dg-final { scan-assembler-times {\tfcmle\tp[0-9]+\.s, p[0-7]/z, z[0-9]+\.s, #0\.0\n} 15 { xfail *-*-* } } } */
-/* { dg-final { scan-assembler-times {\tfcmle\tp[0-9]+\.s, p[0-7]/z, z[0-9]+\.s, z[0-9]+\.s\n} 30 { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-times {\tfcmle\tp[0-9]+\.s, p[0-7]/z, z[0-9]+\.s, #0\.0\n} 15 } } */
+/* { dg-final { scan-assembler-times {\tfcmle\tp[0-9]+\.s, p[0-7]/z, z[0-9]+\.s, z[0-9]+\.s\n} 30 } } */
 
 /* 5 for gt, 5 for ugt and 5 for nugt.  */
-/* { dg-final { scan-assembler-times {\tfcmgt\tp[0-9]+\.s, p[0-7]/z, z[0-9]+\.s, #0\.0\n} 15 { xfail *-*-* } } } */
-/* { dg-final { scan-assembler-times {\tfcmgt\tp[0-9]+\.s, p[0-7]/z, z[0-9]+\.s, z[0-9]+\.s\n} 30 { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-times {\tfcmgt\tp[0-9]+\.s, p[0-7]/z, z[0-9]+\.s, #0\.0\n} 15 } } */
+/* { dg-final { scan-assembler-times {\tfcmgt\tp[0-9]+\.s, p[0-7]/z, z[0-9]+\.s, z[0-9]+\.s\n} 30 } } */
 
 /* 5 for ge, 5 for uge and 5 for nuge.  */
-/* { dg-final { scan-assembler-times {\tfcmge\tp[0-9]+\.s, p[0-7]/z, z[0-9]+\.s, #0\.0\n} 15 { xfail *-*-* } } } */
-/* { dg-final { scan-assembler-times {\tfcmge\tp[0-9]+\.s, p[0-7]/z, z[0-9]+\.s, z[0-9]+\.s\n} 30 { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-times {\tfcmge\tp[0-9]+\.s, p[0-7]/z, z[0-9]+\.s, #0\.0\n} 15 } } */
+/* { dg-final { scan-assembler-times {\tfcmge\tp[0-9]+\.s, p[0-7]/z, z[0-9]+\.s, z[0-9]+\.s\n} 30 } } */
 
 /* { dg-final { scan-assembler-not {\tfcmuo\tp[0-9]+\.s, p[0-7]/z, z[0-9]+\.s, #0\.0\n} } } */
 /* 3 loops * 5 invocations for all 12 unordered comparisons.  */
-/* { dg-final { scan-assembler-times {\tfcmuo\tp[0-9]+\.s, p[0-7]/z, z[0-9]+\.s, z[0-9]+\.s\n} 180 { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-times {\tfcmuo\tp[0-9]+\.s, p[0-7]/z, z[0-9]+\.s, z[0-9]+\.s\n} 180 } } */
 
 /* { dg-final { scan-assembler-times {\tfcmeq\tp[0-9]+\.d, p[0-7]/z, z[0-9]+\.d, #0\.0\n} 7 { xfail *-*-* } } } */
 /* { dg-final { scan-assembler-times {\tfcmeq\tp[0-9]+\.d, p[0-7]/z, z[0-9]+\.d, z[0-9]+\.d\n} 14 { xfail *-*-* } } } */
@@ -123,19 +123,19 @@ TEST_CMP (nugt)
 /* { dg-final { scan-assembler-times {\tfcmne\tp[0-9]+\.d, p[0-7]/z, z[0-9]+\.d, #0\.0\n} 21 { xfail *-*-* } } } */
 /* { dg-final { scan-assembler-times {\tfcmne\tp[0-9]+\.d, p[0-7]/z, z[0-9]+\.d, z[0-9]+\.d\n} 42 { xfail *-*-* } } } */
 
-/* { dg-final { scan-assembler-times {\tfcmlt\tp[0-9]+\.d, p[0-7]/z, z[0-9]+\.d, #0\.0\n} 21 { xfail *-*-* } } } */
-/* { dg-final { scan-assembler-times {\tfcmlt\tp[0-9]+\.d, p[0-7]/z, z[0-9]+\.d, z[0-9]+\.d\n} 42 { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-times {\tfcmlt\tp[0-9]+\.d, p[0-7]/z, z[0-9]+\.d, #0\.0\n} 21 } } */
+/* { dg-final { scan-assembler-times {\tfcmlt\tp[0-9]+\.d, p[0-7]/z, z[0-9]+\.d, z[0-9]+\.d\n} 42 } } */
 
-/* { dg-final { scan-assembler-times {\tfcmle\tp[0-9]+\.d, p[0-7]/z, z[0-9]+\.d, #0\.0\n} 21 { xfail *-*-* } } } */
-/* { dg-final { scan-assembler-times {\tfcmle\tp[0-9]+\.d, p[0-7]/z, z[0-9]+\.d, z[0-9]+\.d\n} 42 { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-times {\tfcmle\tp[0-9]+\.d, p[0-7]/z, z[0-9]+\.d, #0\.0\n} 21 } } */
+/* { dg-final { scan-assembler-times {\tfcmle\tp[0-9]+\.d, p[0-7]/z, z[0-9]+\.d, z[0-9]+\.d\n} 42 } } */
 
-/* { dg-final { scan-assembler-times {\tfcmgt\tp[0-9]+\.d, p[0-7]/z, z[0-9]+\.d, #0\.0\n} 21 { xfail *-*-* } } } */
-/* { dg-final { scan-assembler-times {\tfcmgt\tp[0-9]+\.d, p[0-7]/z, z[0-9]+\.d, z[0-9]+\.d\n} 42 { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-times {\tfcmgt\tp[0-9]+\.d, p[0-7]/z, z[0-9]+\.d, #0\.0\n} 21 } } */
+/* { dg-final { scan-assembler-times {\tfcmgt\tp[0-9]+\.d, p[0-7]/z, z[0-9]+\.d, z[0-9]+\.d\n} 42 } } */
 
-/* { dg-final { scan-assembler-times {\tfcmge\tp[0-9]+\.d, p[0-7]/z, z[0-9]+\.d, #0\.0\n} 21 { xfail *-*-* } } } */
-/* { dg-final { scan-assembler-times {\tfcmge\tp[0-9]+\.d, p[0-7]/z, z[0-9]+\.d, z[0-9]+\.d\n} 42 { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-times {\tfcmge\tp[0-9]+\.d, p[0-7]/z, z[0-9]+\.d, #0\.0\n} 21 } } */
+/* { dg-final { scan-assembler-times {\tfcmge\tp[0-9]+\.d, p[0-7]/z, z[0-9]+\.d, z[0-9]+\.d\n} 42 } } */
 
 /* { dg-final { scan-assembler-not {\tfcmuo\tp[0-9]+\.d, p[0-7]/z, z[0-9]+\.d, #0\.0\n} } } */
 /* 3 loops * 5 invocations, with 2 invocations having ncopies == 2,
    for all 12 unordered comparisons.  */
-/* { dg-final { scan-assembler-times {\tfcmuo\tp[0-9]+\.d, p[0-7]/z, z[0-9]+\.d, z[0-9]+\.d\n} 252 { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-times {\tfcmuo\tp[0-9]+\.d, p[0-7]/z, z[0-9]+\.d, z[0-9]+\.d\n} 252 } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/vcond_5.c b/gcc/testsuite/gcc.target/aarch64/sve/vcond_5.c
index 23bfb7b2649..2f16fbff522 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/vcond_5.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/vcond_5.c
@@ -19,16 +19,16 @@
 /* { dg-final { scan-assembler-times {\tfcmlt\tp[0-9]+\.s, p[0-7]/z, z[0-9]+\.s, z[0-9]+\.s} 40 { xfail *-*-* } } } */
 
 /* 5 for le, 5 for ule and 5 for nule.  */
-/* { dg-final { scan-assembler-times {\tfcmle\tp[0-9]+\.s, p[0-7]/z, z[0-9]+\.s, #0\.0} 15 { xfail *-*-* } } } */
-/* { dg-final { scan-assembler-times {\tfcmle\tp[0-9]+\.s, p[0-7]/z, z[0-9]+\.s, z[0-9]+\.s} 30 { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-times {\tfcmle\tp[0-9]+\.s, p[0-7]/z, z[0-9]+\.s, #0\.0} 15 } } */
+/* { dg-final { scan-assembler-times {\tfcmle\tp[0-9]+\.s, p[0-7]/z, z[0-9]+\.s, z[0-9]+\.s} 30 } } */
 
 /* 5 for gt, 5 for ugt, 5 for nueq and 5 for nugt.  */
 /* { dg-final { scan-assembler-times {\tfcmgt\tp[0-9]+\.s, p[0-7]/z, z[0-9]+\.s, #0\.0} 20 { xfail *-*-* } } } */
 /* { dg-final { scan-assembler-times {\tfcmgt\tp[0-9]+\.s, p[0-7]/z, z[0-9]+\.s, z[0-9]+\.s} 40 { xfail *-*-* } } } */
 
 /* 5 for ge, 5 for uge and 5 for nuge.  */
-/* { dg-final { scan-assembler-times {\tfcmge\tp[0-9]+\.s, p[0-7]/z, z[0-9]+\.s, #0\.0} 15 { xfail *-*-* } } } */
-/* { dg-final { scan-assembler-times {\tfcmge\tp[0-9]+\.s, p[0-7]/z, z[0-9]+\.s, z[0-9]+\.s} 30 { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-times {\tfcmge\tp[0-9]+\.s, p[0-7]/z, z[0-9]+\.s, #0\.0} 15 } } */
+/* { dg-final { scan-assembler-times {\tfcmge\tp[0-9]+\.s, p[0-7]/z, z[0-9]+\.s, z[0-9]+\.s} 30 } } */
 
 /* { dg-final { scan-assembler-not {\tfcmuo\tp[0-9]+\.s, p[0-7]/z, z[0-9]+\.s, #0\.0} } } */
 /* 3 loops * 5 invocations for ordered, unordered amd ueq.  */
@@ -43,14 +43,14 @@
 /* { dg-final { scan-assembler-times {\tfcmlt\tp[0-9]+\.d, p[0-7]/z, z[0-9]+\.d, #0\.0} 28 { xfail *-*-* } } } */
 /* { dg-final { scan-assembler-times {\tfcmlt\tp[0-9]+\.d, p[0-7]/z, z[0-9]+\.d, z[0-9]+\.d} 56 { xfail *-*-* } } } */
 
-/* { dg-final { scan-assembler-times {\tfcmle\tp[0-9]+\.d, p[0-7]/z, z[0-9]+\.d, #0\.0} 21 { xfail *-*-* } } } */
-/* { dg-final { scan-assembler-times {\tfcmle\tp[0-9]+\.d, p[0-7]/z, z[0-9]+\.d, z[0-9]+\.d} 42 { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-times {\tfcmle\tp[0-9]+\.d, p[0-7]/z, z[0-9]+\.d, #0\.0} 21 } } */
+/* { dg-final { scan-assembler-times {\tfcmle\tp[0-9]+\.d, p[0-7]/z, z[0-9]+\.d, z[0-9]+\.d} 42 } } */
 
 /* { dg-final { scan-assembler-times {\tfcmgt\tp[0-9]+\.d, p[0-7]/z, z[0-9]+\.d, #0\.0} 28 { xfail *-*-* } } } */
 /* { dg-final { scan-assembler-times {\tfcmgt\tp[0-9]+\.d, p[0-7]/z, z[0-9]+\.d, z[0-9]+\.d} 56 { xfail *-*-* } } } */
 
-/* { dg-final { scan-assembler-times {\tfcmge\tp[0-9]+\.d, p[0-7]/z, z[0-9]+\.d, #0\.0} 21 { xfail *-*-* } } } */
-/* { dg-final { scan-assembler-times {\tfcmge\tp[0-9]+\.d, p[0-7]/z, z[0-9]+\.d, z[0-9]+\.d} 42 { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-times {\tfcmge\tp[0-9]+\.d, p[0-7]/z, z[0-9]+\.d, #0\.0} 21 } } */
+/* { dg-final { scan-assembler-times {\tfcmge\tp[0-9]+\.d, p[0-7]/z, z[0-9]+\.d, z[0-9]+\.d} 42 } } */
 
 /* { dg-final { scan-assembler-not {\tfcmuo\tp[0-9]+\.d, p[0-7]/z, z[0-9]+\.d, #0\.0} } } */
 /* 3 loops * 5 invocations, with 2 invocations having ncopies == 2,
diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index de018acefad..e0ea1769d7f 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -6525,7 +6525,7 @@ vectorizable_reduction (stmt_vec_info stmt_info, slp_tree slp_node,
 	}
       else
 	vect_record_loop_mask (loop_vinfo, masks, ncopies * vec_num,
-			       vectype_in);
+			       vectype_in, NULL);
     }
   if (dump_enabled_p ()
       && reduction_type == FOLD_LEFT_REDUCTION)
@@ -7756,7 +7756,7 @@ vectorizable_live_operation (stmt_vec_info stmt_info,
 	      gcc_assert (ncopies == 1 && !slp_node);
 	      vect_record_loop_mask (loop_vinfo,
 				     &LOOP_VINFO_MASKS (loop_vinfo),
-				     1, vectype);
+				     1, vectype, NULL);
 	    }
 	}
       return true;
@@ -7955,11 +7955,12 @@ vect_double_mask_nunits (tree type)
 
 /* Record that a fully-masked version of LOOP_VINFO would need MASKS to
    contain a sequence of NVECTORS masks that each control a vector of type
-   VECTYPE.  */
+   VECTYPE.  If SCALAR_MASK is nonnull, the fully-masked loop would AND
+   these vector masks with the vector version of SCALAR_MASK.  */
 
 void
 vect_record_loop_mask (loop_vec_info loop_vinfo, vec_loop_masks *masks,
-		       unsigned int nvectors, tree vectype)
+		       unsigned int nvectors, tree vectype, tree scalar_mask)
 {
   gcc_assert (nvectors != 0);
   if (masks->length () < nvectors)
@@ -7970,6 +7971,13 @@ vect_record_loop_mask (loop_vec_info loop_vinfo, vec_loop_masks *masks,
   unsigned int nscalars_per_iter
     = exact_div (nvectors * TYPE_VECTOR_SUBPARTS (vectype),
 		 LOOP_VINFO_VECT_FACTOR (loop_vinfo)).to_constant ();
+
+  if (scalar_mask)
+    {
+      scalar_cond_masked_key cond (scalar_mask, nvectors);
+      loop_vinfo->scalar_cond_masked_set.add (cond);
+    }
+
   if (rgm->max_nscalars_per_iter < nscalars_per_iter)
     {
       rgm->max_nscalars_per_iter = nscalars_per_iter;
diff --git a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c
index e606945d536..046af80fa99 100644
--- a/gcc/tree-vect-stmts.c
+++ b/gcc/tree-vect-stmts.c
@@ -1879,7 +1879,8 @@ static tree permute_vec_elements (tree, tree, tree, stmt_vec_info,
    says how the load or store is going to be implemented and GROUP_SIZE
    is the number of load or store statements in the containing group.
    If the access is a gather load or scatter store, GS_INFO describes
-   its arguments.
+   its arguments.  If the load or store is conditional, SCALAR_MASK is the
+   condition under which it occurs.
 
    Clear LOOP_VINFO_CAN_FULLY_MASK_P if a fully-masked loop is not
    supported, otherwise record the required mask types.  */
@@ -1888,7 +1889,7 @@ static void
 check_load_store_masking (loop_vec_info loop_vinfo, tree vectype,
 			  vec_load_store_type vls_type, int group_size,
 			  vect_memory_access_type memory_access_type,
-			  gather_scatter_info *gs_info)
+			  gather_scatter_info *gs_info, tree scalar_mask)
 {
   /* Invariant loads need no special support.  */
   if (memory_access_type == VMAT_INVARIANT)
@@ -1912,7 +1913,7 @@ check_load_store_masking (loop_vec_info loop_vinfo, tree vectype,
 	  return;
 	}
       unsigned int ncopies = vect_get_num_copies (loop_vinfo, vectype);
-      vect_record_loop_mask (loop_vinfo, masks, ncopies, vectype);
+      vect_record_loop_mask (loop_vinfo, masks, ncopies, vectype, scalar_mask);
       return;
     }
 
@@ -1936,7 +1937,7 @@ check_load_store_masking (loop_vec_info loop_vinfo, tree vectype,
 	  return;
 	}
       unsigned int ncopies = vect_get_num_copies (loop_vinfo, vectype);
-      vect_record_loop_mask (loop_vinfo, masks, ncopies, vectype);
+      vect_record_loop_mask (loop_vinfo, masks, ncopies, vectype, scalar_mask);
       return;
     }
 
@@ -1974,7 +1975,7 @@ check_load_store_masking (loop_vec_info loop_vinfo, tree vectype,
   poly_uint64 vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
   unsigned int nvectors;
   if (can_div_away_from_zero_p (group_size * vf, nunits, &nvectors))
-    vect_record_loop_mask (loop_vinfo, masks, nvectors, vectype);
+    vect_record_loop_mask (loop_vinfo, masks, nvectors, vectype, scalar_mask);
   else
     gcc_unreachable ();
 }
@@ -3436,7 +3437,9 @@ vectorizable_call (stmt_vec_info stmt_info, gimple_stmt_iterator *gsi,
 	  unsigned int nvectors = (slp_node
 				   ? SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node)
 				   : ncopies);
-	  vect_record_loop_mask (loop_vinfo, masks, nvectors, vectype_out);
+	  tree scalar_mask = gimple_call_arg (stmt_info->stmt, mask_opno);
+	  vect_record_loop_mask (loop_vinfo, masks, nvectors,
+				 vectype_out, scalar_mask);
 	}
       return true;
     }
@@ -7390,7 +7393,7 @@ vectorizable_store (stmt_vec_info stmt_info, gimple_stmt_iterator *gsi,
       if (loop_vinfo
 	  && LOOP_VINFO_CAN_FULLY_MASK_P (loop_vinfo))
 	check_load_store_masking (loop_vinfo, vectype, vls_type, group_size,
-				  memory_access_type, &gs_info);
+				  memory_access_type, &gs_info, mask);
 
       STMT_VINFO_TYPE (stmt_info) = store_vec_info_type;
       vect_model_store_cost (stmt_info, ncopies, rhs_dt, memory_access_type,
@@ -8637,7 +8640,7 @@ vectorizable_load (stmt_vec_info stmt_info, gimple_stmt_iterator *gsi,
       if (loop_vinfo
 	  && LOOP_VINFO_CAN_FULLY_MASK_P (loop_vinfo))
 	check_load_store_masking (loop_vinfo, vectype, VLS_LOAD, group_size,
-				  memory_access_type, &gs_info);
+				  memory_access_type, &gs_info, mask);
 
       STMT_VINFO_TYPE (stmt_info) = load_vec_info_type;
       vect_model_load_cost (stmt_info, ncopies, memory_access_type,
@@ -10007,6 +10010,35 @@ vectorizable_condition (stmt_vec_info stmt_info, gimple_stmt_iterator *gsi,
   /* Handle cond expr.  */
   for (j = 0; j < ncopies; j++)
     {
+      tree loop_mask = NULL_TREE;
+      bool swap_cond_operands = false;
+
+      /* See whether another part of the vectorized code applies a loop
+	 mask to the condition, or to its inverse.  */
+
+      if (loop_vinfo && LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
+	{
+	  scalar_cond_masked_key cond (cond_expr, ncopies);
+	  if (loop_vinfo->scalar_cond_masked_set.contains (cond))
+	    {
+	      vec_loop_masks *masks = &LOOP_VINFO_MASKS (loop_vinfo);
+	      loop_mask = vect_get_loop_mask (gsi, masks, ncopies, vectype, j);
+	    }
+	  else
+	    {
+	      bool honor_nans = HONOR_NANS (TREE_TYPE (cond.op0));
+	      cond.code = invert_tree_comparison (cond.code, honor_nans);
+	      if (loop_vinfo->scalar_cond_masked_set.contains (cond))
+		{
+		  vec_loop_masks *masks = &LOOP_VINFO_MASKS (loop_vinfo);
+		  loop_mask = vect_get_loop_mask (gsi, masks, ncopies,
+						  vectype, j);
+		  cond_code = cond.code;
+		  swap_cond_operands = true;
+		}
+	    }
+	}
+
       stmt_vec_info new_stmt_info = NULL;
       if (j == 0)
 	{
@@ -10084,6 +10116,9 @@ vectorizable_condition (stmt_vec_info stmt_info, gimple_stmt_iterator *gsi,
           vec_then_clause = vec_oprnds2[i];
           vec_else_clause = vec_oprnds3[i];
 
+	  if (swap_cond_operands)
+	    std::swap (vec_then_clause, vec_else_clause);
+
 	  if (masked)
 	    vec_compare = vec_cond_lhs;
 	  else
@@ -10122,6 +10157,48 @@ vectorizable_condition (stmt_vec_info stmt_info, gimple_stmt_iterator *gsi,
 		    }
 		}
 	    }
+
+	  /* If we decided to apply a loop mask to result of vec
+	     comparison, so later passes will reuse the same condition.
+
+	     For example:
+	     for (int i = 0; i < 100; ++i)
+	       x[i] = y[i] ? z[i] : 10;
+
+	     results in following optimized GIMPLE:
+
+	     mask__35.8_43 = vect__4.7_41 != { 0, ... };
+	     vec_mask_and_46 = loop_mask_40 & mask__35.8_43;
+	     _19 = &MEM[base: z_12(D), index: ivtmp_56, step: 4, offset: 0B];
+	     vect_iftmp.11_47 = .MASK_LOAD (_19, 4B, vec_mask_and_46);
+	     vect_iftmp.12_52 = VEC_COND_EXPR <vec_mask_and_46,
+					       vect_iftmp.11_47, { 10, ... }>;
+
+	     instead of using a masked and unmasked forms of
+             vec != { 0, ... } (masked in the MASK_LOAD,
+             unmasked in the VEC_COND_EXPR).  */
+
+	  if (loop_mask)
+	    {
+	      if (COMPARISON_CLASS_P (vec_compare))
+		{
+		  tree tmp = make_ssa_name (vec_cmp_type);
+		  tree op0 = TREE_OPERAND (vec_compare, 0);
+		  tree op1 = TREE_OPERAND (vec_compare, 1);
+		  gassign *g = gimple_build_assign (tmp,
+						    TREE_CODE (vec_compare),
+						    op0, op1);
+		  vect_finish_stmt_generation (stmt_info, g, gsi);
+		  vec_compare = tmp;
+		}
+
+	      tree tmp2 = make_ssa_name (vec_cmp_type);
+	      gassign *g = gimple_build_assign (tmp2, BIT_AND_EXPR,
+						vec_compare, loop_mask);
+	      vect_finish_stmt_generation (stmt_info, g, gsi);
+	      vec_compare = tmp2;
+	    }
+
 	  if (reduction_type == EXTRACT_LAST_REDUCTION)
 	    {
 	      if (!is_gimple_val (vec_compare))
diff --git a/gcc/tree-vectorizer.c b/gcc/tree-vectorizer.c
index 3e8637f070d..27304181f9f 100644
--- a/gcc/tree-vectorizer.c
+++ b/gcc/tree-vectorizer.c
@@ -1516,3 +1516,36 @@ make_pass_ipa_increase_alignment (gcc::context *ctxt)
 {
   return new pass_ipa_increase_alignment (ctxt);
 }
+
+/* If code(T) is comparison op or def of comparison stmt,
+   extract it's operands.
+   Else return <NE_EXPR, T, 0>.  */
+
+void
+scalar_cond_masked_key::get_cond_ops_from_tree (tree t)
+{
+  if (TREE_CODE_CLASS (TREE_CODE (t)) == tcc_comparison)
+    {
+      this->code = TREE_CODE (t);
+      this->op0 = TREE_OPERAND (t, 0);
+      this->op1 = TREE_OPERAND (t, 1);
+      return;
+    }
+
+  if (TREE_CODE (t) == SSA_NAME)
+    if (gassign *stmt = dyn_cast<gassign *> (SSA_NAME_DEF_STMT (t)))
+      {
+	tree_code code = gimple_assign_rhs_code (stmt);
+	if (TREE_CODE_CLASS (code) == tcc_comparison)
+	  {
+	    this->code = code;
+	    this->op0 = gimple_assign_rhs1 (stmt);
+	    this->op1 = gimple_assign_rhs2 (stmt);
+	    return;
+	  }
+      }
+
+  this->code = NE_EXPR;
+  this->op0 = t;
+  this->op1 = build_zero_cst (TREE_TYPE (t));
+}
diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index 291304fe95e..d563420e926 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -177,7 +177,75 @@ public:
 #define SLP_TREE_TWO_OPERATORS(S)		 (S)->two_operators
 #define SLP_TREE_DEF_TYPE(S)			 (S)->def_type
 
+/* Key for the hash set that records which scalar conditions
+   have had a loop mask applied; the set is populated by
+   vect_record_loop_mask.  */
+
+struct scalar_cond_masked_key
+{
+  scalar_cond_masked_key (tree t, unsigned ncopies_)
+    : ncopies (ncopies_)
+  {
+    get_cond_ops_from_tree (t);
+  }
+
+  void get_cond_ops_from_tree (tree);
+
+  unsigned ncopies;
+  tree_code code;
+  tree op0;
+  tree op1;
+};
 
+template<>
+struct default_hash_traits<scalar_cond_masked_key>
+{
+  typedef scalar_cond_masked_key compare_type;
+  typedef scalar_cond_masked_key value_type;
+
+  static inline hashval_t
+  hash (value_type v)
+  {
+    inchash::hash h;
+    h.add_int (v.code);
+    inchash::add_expr (v.op0, h, 0);
+    inchash::add_expr (v.op1, h, 0);
+    h.add_int (v.ncopies);
+    return h.end ();
+  }
+
+  static inline bool
+  equal (value_type existing, value_type candidate)
+  {
+    return (existing.ncopies == candidate.ncopies
+           && existing.code == candidate.code
+           && operand_equal_p (existing.op0, candidate.op0, 0)
+           && operand_equal_p (existing.op1, candidate.op1, 0));
+  }
+
+  static inline void
+  mark_empty (value_type &v)
+  {
+    v.ncopies = 0;
+  }
+
+  static inline bool
+  is_empty (value_type v)
+  {
+    return v.ncopies == 0;
+  }
+
+  static inline void mark_deleted (value_type &) {}
+
+  static inline bool is_deleted (const value_type &)
+  {
+    return false;
+  }
+
+  static inline void remove (value_type &) {}
+};
+
+typedef hash_set<scalar_cond_masked_key> scalar_cond_masked_set_type;
 
 /* Describes two objects whose addresses must be unequal for the vectorized
    loop to be valid.  */
@@ -258,6 +326,9 @@ public:
   /* Cost data used by the target cost model.  */
   void *target_cost_data;
 
+  /* Set of scalar conditions that have loop mask applied.  */
+  scalar_cond_masked_set_type scalar_cond_masked_set;
+
 private:
   stmt_vec_info new_stmt_vec_info (gimple *stmt);
   void set_vinfo_for_stmt (gimple *, stmt_vec_info);
@@ -1642,7 +1713,7 @@ extern void vect_gen_vector_loop_niters (loop_vec_info, tree, tree *,
 extern tree vect_halve_mask_nunits (tree);
 extern tree vect_double_mask_nunits (tree);
 extern void vect_record_loop_mask (loop_vec_info, vec_loop_masks *,
-				   unsigned int, tree);
+				   unsigned int, tree, tree);
 extern tree vect_get_loop_mask (gimple_stmt_iterator *, vec_loop_masks *,
 				unsigned int, tree, unsigned int);
 extern stmt_vec_info info_for_reduction (stmt_vec_info);

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [SVE] PR86753
  2019-10-15  6:11                                                                 ` Prathamesh Kulkarni
@ 2019-10-15 11:40                                                                   ` Richard Biener
  2019-10-16 12:13                                                                     ` Richard Sandiford
  0 siblings, 1 reply; 41+ messages in thread
From: Richard Biener @ 2019-10-15 11:40 UTC (permalink / raw)
  To: Prathamesh Kulkarni; +Cc: Richard Sandiford, gcc Patches

On Tue, Oct 15, 2019 at 8:07 AM Prathamesh Kulkarni
<prathamesh.kulkarni@linaro.org> wrote:
>
> On Wed, 9 Oct 2019 at 08:14, Prathamesh Kulkarni
> <prathamesh.kulkarni@linaro.org> wrote:
> >
> > On Tue, 8 Oct 2019 at 13:21, Richard Sandiford
> > <richard.sandiford@arm.com> wrote:
> > >
> > > Leaving the main review to Richard, just some comments...
> > >
> > > Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> > > > @@ -9774,6 +9777,10 @@ vect_is_simple_cond (tree cond, vec_info *vinfo,
> > > >
> > > >     When STMT_INFO is vectorized as a nested cycle, for_reduction is true.
> > > >
> > > > +   For COND_EXPR<C, T, E> if T comes from masked load, and is conditional
> > > > +   on C, we apply loop mask to result of vector comparison, if it's present.
> > > > +   Similarly for E, if it is conditional on !C.
> > > > +
> > > >     Return true if STMT_INFO is vectorizable in this way.  */
> > > >
> > > >  bool
> > >
> > > I think this is a bit misleading.  But IMO it'd be better not to have
> > > a comment here and just rely on the one in the main function body.
> > > This optimisation isn't really changing the vectorisation strategy,
> > > and the comment could easily get forgotten if things change in future.
> > >
> > > > [...]
> > > > @@ -9999,6 +10006,35 @@ vectorizable_condition (stmt_vec_info stmt_info, gimple_stmt_iterator *gsi,
> > > >    /* Handle cond expr.  */
> > > >    for (j = 0; j < ncopies; j++)
> > > >      {
> > > > +      tree loop_mask = NULL_TREE;
> > > > +      bool swap_cond_operands = false;
> > > > +
> > > > +      /* Look up if there is a loop mask associated with the
> > > > +      scalar cond, or it's inverse.  */
> > >
> > > Maybe:
> > >
> > >    See whether another part of the vectorized code applies a loop
> > >    mask to the condition, or to its inverse.
> > >
> > > > +
> > > > +      if (loop_vinfo && LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
> > > > +     {
> > > > +       scalar_cond_masked_key cond (cond_expr, ncopies);
> > > > +       if (loop_vinfo->scalar_cond_masked_set.contains (cond))
> > > > +         {
> > > > +           vec_loop_masks *masks = &LOOP_VINFO_MASKS (loop_vinfo);
> > > > +           loop_mask = vect_get_loop_mask (gsi, masks, ncopies, vectype, j);
> > > > +         }
> > > > +       else
> > > > +         {
> > > > +           bool honor_nans = HONOR_NANS (TREE_TYPE (cond.op0));
> > > > +           cond.code = invert_tree_comparison (cond.code, honor_nans);
> > > > +           if (loop_vinfo->scalar_cond_masked_set.contains (cond))
> > > > +             {
> > > > +               vec_loop_masks *masks = &LOOP_VINFO_MASKS (loop_vinfo);
> > > > +               loop_mask = vect_get_loop_mask (gsi, masks, ncopies,
> > > > +                                               vectype, j);
> > > > +               cond_code = cond.code;
> > > > +               swap_cond_operands = true;
> > > > +             }
> > > > +         }
> > > > +     }
> > > > +
> > > >        stmt_vec_info new_stmt_info = NULL;
> > > >        if (j == 0)
> > > >       {
> > > > @@ -10114,6 +10153,47 @@ vectorizable_condition (stmt_vec_info stmt_info, gimple_stmt_iterator *gsi,
> > > >                   }
> > > >               }
> > > >           }
> > > > +
> > > > +       /* If loop mask is present, then AND it with
> > >
> > > Maybe "If we decided to apply a loop mask, ..."
> > >
> > > > +          result of vec comparison, so later passes (fre4)
> > >
> > > Probably better not to name the pass -- could easily change in future.
> > >
> > > > +          will reuse the same condition used in masked load.
> > >
> > > Could be a masked store, or potentially other things too.
> > > So maybe just "will reuse the masked condition"?
> > >
> > > > +
> > > > +          For example:
> > > > +          for (int i = 0; i < 100; ++i)
> > > > +            x[i] = y[i] ? z[i] : 10;
> > > > +
> > > > +          results in following optimized GIMPLE:
> > > > +
> > > > +          mask__35.8_43 = vect__4.7_41 != { 0, ... };
> > > > +          vec_mask_and_46 = loop_mask_40 & mask__35.8_43;
> > > > +          _19 = &MEM[base: z_12(D), index: ivtmp_56, step: 4, offset: 0B];
> > > > +          vect_iftmp.11_47 = .MASK_LOAD (_19, 4B, vec_mask_and_46);
> > > > +          vect_iftmp.12_52 = VEC_COND_EXPR <vec_mask_and_46,
> > > > +                                            vect_iftmp.11_47, { 10, ... }>;
> > > > +
> > > > +          instead of recomputing vec != { 0, ... } in vec_cond_expr  */
> > >
> > > That's true, but gives the impression that avoiding the vec != { 0, ... }
> > > is the main goal, whereas we could do that just by forcing a three-operand
> > > COND_EXPR.  It's really more about making sure that vec != { 0, ... }
> > > and its masked form aren't both live at the same time.  So maybe:
> > >
> > >              instead of using a masked and unmasked forms of
> > >              vect__4.7_41 != { 0, ... } (masked in the MASK_LOAD,
> > >              unmasked in the VEC_COND_EXPR).  */
> > >
> > Hi Richard,
> > Thanks for the suggestions, I have updated comments in the attached patch.
> Hi,
> The attached patch is rebased on trunk, and after PR91532 fix, the
> hunk for fmla_2.c is no
> longer required.

Hmm.  So we already record some mask info - you just add in addition
to that the scalar predicate representing the mask.  I wonder if you can
integrate that into the existing vec_loop_masks vector instead of
adding another data structure on the side?  Not that I am understanding
the existing fully masked code at all (or specifically what it computes
as nscalars_per_iter, etc. ... :/).  At least add the new vinfo member
right to the other masks related field.

I still fail to understand this in full, so I'm deferring to Richard,
who added all this stuff.

Richard.

> Thanks,
> Prathamesh
> >
> > Thanks,
> > Prathamesh
> > > Thanks,
> > > Richard

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [SVE] PR86753
  2019-10-15 11:40                                                                   ` Richard Biener
@ 2019-10-16 12:13                                                                     ` Richard Sandiford
  2019-10-18  5:20                                                                       ` Prathamesh Kulkarni
  0 siblings, 1 reply; 41+ messages in thread
From: Richard Sandiford @ 2019-10-16 12:13 UTC (permalink / raw)
  To: Richard Biener; +Cc: Prathamesh Kulkarni, gcc Patches

Richard Biener <richard.guenther@gmail.com> writes:
> On Tue, Oct 15, 2019 at 8:07 AM Prathamesh Kulkarni
> <prathamesh.kulkarni@linaro.org> wrote:
>>
>> On Wed, 9 Oct 2019 at 08:14, Prathamesh Kulkarni
>> <prathamesh.kulkarni@linaro.org> wrote:
>> >
>> > On Tue, 8 Oct 2019 at 13:21, Richard Sandiford
>> > <richard.sandiford@arm.com> wrote:
>> > >
>> > > Leaving the main review to Richard, just some comments...
>> > >
>> > > Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
>> > > > @@ -9774,6 +9777,10 @@ vect_is_simple_cond (tree cond, vec_info *vinfo,
>> > > >
>> > > >     When STMT_INFO is vectorized as a nested cycle, for_reduction is true.
>> > > >
>> > > > +   For COND_EXPR<C, T, E> if T comes from masked load, and is conditional
>> > > > +   on C, we apply loop mask to result of vector comparison, if it's present.
>> > > > +   Similarly for E, if it is conditional on !C.
>> > > > +
>> > > >     Return true if STMT_INFO is vectorizable in this way.  */
>> > > >
>> > > >  bool
>> > >
>> > > I think this is a bit misleading.  But IMO it'd be better not to have
>> > > a comment here and just rely on the one in the main function body.
>> > > This optimisation isn't really changing the vectorisation strategy,
>> > > and the comment could easily get forgotten if things change in future.
>> > >
>> > > > [...]
>> > > > @@ -9999,6 +10006,35 @@ vectorizable_condition (stmt_vec_info stmt_info, gimple_stmt_iterator *gsi,
>> > > >    /* Handle cond expr.  */
>> > > >    for (j = 0; j < ncopies; j++)
>> > > >      {
>> > > > +      tree loop_mask = NULL_TREE;
>> > > > +      bool swap_cond_operands = false;
>> > > > +
>> > > > +      /* Look up if there is a loop mask associated with the
>> > > > +      scalar cond, or it's inverse.  */
>> > >
>> > > Maybe:
>> > >
>> > >    See whether another part of the vectorized code applies a loop
>> > >    mask to the condition, or to its inverse.
>> > >
>> > > > +
>> > > > +      if (loop_vinfo && LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
>> > > > +     {
>> > > > +       scalar_cond_masked_key cond (cond_expr, ncopies);
>> > > > +       if (loop_vinfo->scalar_cond_masked_set.contains (cond))
>> > > > +         {
>> > > > +           vec_loop_masks *masks = &LOOP_VINFO_MASKS (loop_vinfo);
>> > > > +           loop_mask = vect_get_loop_mask (gsi, masks, ncopies, vectype, j);
>> > > > +         }
>> > > > +       else
>> > > > +         {
>> > > > +           bool honor_nans = HONOR_NANS (TREE_TYPE (cond.op0));
>> > > > +           cond.code = invert_tree_comparison (cond.code, honor_nans);
>> > > > +           if (loop_vinfo->scalar_cond_masked_set.contains (cond))
>> > > > +             {
>> > > > +               vec_loop_masks *masks = &LOOP_VINFO_MASKS (loop_vinfo);
>> > > > +               loop_mask = vect_get_loop_mask (gsi, masks, ncopies,
>> > > > +                                               vectype, j);
>> > > > +               cond_code = cond.code;
>> > > > +               swap_cond_operands = true;
>> > > > +             }
>> > > > +         }
>> > > > +     }
>> > > > +
>> > > >        stmt_vec_info new_stmt_info = NULL;
>> > > >        if (j == 0)
>> > > >       {
>> > > > @@ -10114,6 +10153,47 @@ vectorizable_condition (stmt_vec_info stmt_info, gimple_stmt_iterator *gsi,
>> > > >                   }
>> > > >               }
>> > > >           }
>> > > > +
>> > > > +       /* If loop mask is present, then AND it with
>> > >
>> > > Maybe "If we decided to apply a loop mask, ..."
>> > >
>> > > > +          result of vec comparison, so later passes (fre4)
>> > >
>> > > Probably better not to name the pass -- could easily change in future.
>> > >
>> > > > +          will reuse the same condition used in masked load.
>> > >
>> > > Could be a masked store, or potentially other things too.
>> > > So maybe just "will reuse the masked condition"?
>> > >
>> > > > +
>> > > > +          For example:
>> > > > +          for (int i = 0; i < 100; ++i)
>> > > > +            x[i] = y[i] ? z[i] : 10;
>> > > > +
>> > > > +          results in following optimized GIMPLE:
>> > > > +
>> > > > +          mask__35.8_43 = vect__4.7_41 != { 0, ... };
>> > > > +          vec_mask_and_46 = loop_mask_40 & mask__35.8_43;
>> > > > +          _19 = &MEM[base: z_12(D), index: ivtmp_56, step: 4, offset: 0B];
>> > > > +          vect_iftmp.11_47 = .MASK_LOAD (_19, 4B, vec_mask_and_46);
>> > > > +          vect_iftmp.12_52 = VEC_COND_EXPR <vec_mask_and_46,
>> > > > +                                            vect_iftmp.11_47, { 10, ... }>;
>> > > > +
>> > > > +          instead of recomputing vec != { 0, ... } in vec_cond_expr  */
>> > >
>> > > That's true, but gives the impression that avoiding the vec != { 0, ... }
>> > > is the main goal, whereas we could do that just by forcing a three-operand
>> > > COND_EXPR.  It's really more about making sure that vec != { 0, ... }
>> > > and its masked form aren't both live at the same time.  So maybe:
>> > >
>> > >              instead of using a masked and unmasked forms of
>> > >              vect__4.7_41 != { 0, ... } (masked in the MASK_LOAD,
>> > >              unmasked in the VEC_COND_EXPR).  */
>> > >
>> > Hi Richard,
>> > Thanks for the suggestions, I have updated comments in the attached patch.
>> Hi,
>> The attached patch is rebased on trunk, and after PR91532 fix, the
>> hunk for fmla_2.c is no
>> longer required.
>
> Hmm.  So we already record some mask info - you just add in addition
> to that the scalar predicate representing the mask.  I wonder if you can
> integrate that into the existing vec_loop_masks vector instead of
> adding another data structure on the side?  Not that I am understanding
> the existing fully masked code at all (or specifically what it computes
> as nscalars_per_iter, etc. ... :/).

We can AND several different scalar conditions with the same loop
mask (that's relatively common), and could even AND the same scalar
condition with different loop masks (although that's less likely).
So I think having separate info makes sense.
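
A made-up example of the first case (purely illustrative, not taken
from the PR):

  void
  f (int *restrict x, int *restrict w, int *restrict y,
     int *restrict v, int *restrict z, int *restrict u)
  {
    for (int i = 0; i < 100; ++i)
      {
        x[i] = y[i] ? z[i] : 1;
        w[i] = v[i] ? u[i] : 2;
      }
  }

Here both y[i] != 0 and v[i] != 0 get ANDed with the same loop mask,
so the set wants one entry per scalar condition.  The second case
would need something like mixed element widths in the same loop, so
that more than one set of loop masks is in play.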

> At least add the new vinfo member right to the other masks related
> field.

Agree that would be better.

> @@ -10122,6 +10157,48 @@ vectorizable_condition (stmt_vec_info stmt_info, gimple_stmt_iterator *gsi,
>  		    }
>  		}
>  	    }
> +
> +	  /* If we decided to apply a loop mask to result of vec
> +	     comparison, so later passes will reuse the same condition.

Maybe:

	  /* If we decided to apply a loop mask to the result of the vector
	     comparison, AND the comparison with the mask now.  Later passes
	     should then be able to reuse the AND results between multiple
	     vector statements.

> +	     For example:
> +	     for (int i = 0; i < 100; ++i)
> +	       x[i] = y[i] ? z[i] : 10;
> +
> +	     results in following optimized GIMPLE:
> +
> +	     mask__35.8_43 = vect__4.7_41 != { 0, ... };
> +	     vec_mask_and_46 = loop_mask_40 & mask__35.8_43;
> +	     _19 = &MEM[base: z_12(D), index: ivtmp_56, step: 4, offset: 0B];
> +	     vect_iftmp.11_47 = .MASK_LOAD (_19, 4B, vec_mask_and_46);
> +	     vect_iftmp.12_52 = VEC_COND_EXPR <vec_mask_and_46,
> +					       vect_iftmp.11_47, { 10, ... }>;
> +
> +	     instead of using a masked and unmasked forms of
> +             vec != { 0, ... } (masked in the MASK_LOAD,
> +             unmasked in the VEC_COND_EXPR).  */

The last paragraph uses spaces rather than tabs for indentation.

> +/* If code(T) is comparison op or def of comparison stmt,
> +   extract it's operands.
> +   Else return <NE_EXPR, T, 0>.  */
> +
> +void
> +scalar_cond_masked_key::get_cond_ops_from_tree (tree t)
> +{

Maybe:

/* If the condition represented by T is a comparison or the SSA name
   result of a comparison, extract the comparison's operands.  Represent
   T as NE_EXPR <T, 0> otherwise.  */
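
(So e.g. A < B would be recorded as <LT_EXPR, A, B>, an SSA name
defined by _1 = A != B as <NE_EXPR, A, B>, and any other T as
<NE_EXPR, T, 0> -- just restating the behaviour for clarity, not more
suggested comment text.)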

OK with those changes and the one Richard asked for.

Thanks,
Richard

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [SVE] PR86753
  2019-10-16 12:13                                                                     ` Richard Sandiford
@ 2019-10-18  5:20                                                                       ` Prathamesh Kulkarni
  0 siblings, 0 replies; 41+ messages in thread
From: Prathamesh Kulkarni @ 2019-10-18  5:20 UTC (permalink / raw)
  To: Richard Sandiford; +Cc: Richard Biener, gcc Patches

On Wed, 16 Oct 2019 at 04:19, Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> Richard Biener <richard.guenther@gmail.com> writes:
> > On Tue, Oct 15, 2019 at 8:07 AM Prathamesh Kulkarni
> > <prathamesh.kulkarni@linaro.org> wrote:
> >>
> >> On Wed, 9 Oct 2019 at 08:14, Prathamesh Kulkarni
> >> <prathamesh.kulkarni@linaro.org> wrote:
> >> >
> >> > On Tue, 8 Oct 2019 at 13:21, Richard Sandiford
> >> > <richard.sandiford@arm.com> wrote:
> >> > >
> >> > > Leaving the main review to Richard, just some comments...
> >> > >
> >> > > Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> >> > > > @@ -9774,6 +9777,10 @@ vect_is_simple_cond (tree cond, vec_info *vinfo,
> >> > > >
> >> > > >     When STMT_INFO is vectorized as a nested cycle, for_reduction is true.
> >> > > >
> >> > > > +   For COND_EXPR<C, T, E> if T comes from masked load, and is conditional
> >> > > > +   on C, we apply loop mask to result of vector comparison, if it's present.
> >> > > > +   Similarly for E, if it is conditional on !C.
> >> > > > +
> >> > > >     Return true if STMT_INFO is vectorizable in this way.  */
> >> > > >
> >> > > >  bool
> >> > >
> >> > > I think this is a bit misleading.  But IMO it'd be better not to have
> >> > > a comment here and just rely on the one in the main function body.
> >> > > This optimisation isn't really changing the vectorisation strategy,
> >> > > and the comment could easily get forgotten if things change in future.
> >> > >
> >> > > > [...]
> >> > > > @@ -9999,6 +10006,35 @@ vectorizable_condition (stmt_vec_info stmt_info, gimple_stmt_iterator *gsi,
> >> > > >    /* Handle cond expr.  */
> >> > > >    for (j = 0; j < ncopies; j++)
> >> > > >      {
> >> > > > +      tree loop_mask = NULL_TREE;
> >> > > > +      bool swap_cond_operands = false;
> >> > > > +
> >> > > > +      /* Look up if there is a loop mask associated with the
> >> > > > +      scalar cond, or it's inverse.  */
> >> > >
> >> > > Maybe:
> >> > >
> >> > >    See whether another part of the vectorized code applies a loop
> >> > >    mask to the condition, or to its inverse.
> >> > >
> >> > > > +
> >> > > > +      if (loop_vinfo && LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
> >> > > > +     {
> >> > > > +       scalar_cond_masked_key cond (cond_expr, ncopies);
> >> > > > +       if (loop_vinfo->scalar_cond_masked_set.contains (cond))
> >> > > > +         {
> >> > > > +           vec_loop_masks *masks = &LOOP_VINFO_MASKS (loop_vinfo);
> >> > > > +           loop_mask = vect_get_loop_mask (gsi, masks, ncopies, vectype, j);
> >> > > > +         }
> >> > > > +       else
> >> > > > +         {
> >> > > > +           bool honor_nans = HONOR_NANS (TREE_TYPE (cond.op0));
> >> > > > +           cond.code = invert_tree_comparison (cond.code, honor_nans);
> >> > > > +           if (loop_vinfo->scalar_cond_masked_set.contains (cond))
> >> > > > +             {
> >> > > > +               vec_loop_masks *masks = &LOOP_VINFO_MASKS (loop_vinfo);
> >> > > > +               loop_mask = vect_get_loop_mask (gsi, masks, ncopies,
> >> > > > +                                               vectype, j);
> >> > > > +               cond_code = cond.code;
> >> > > > +               swap_cond_operands = true;
> >> > > > +             }
> >> > > > +         }
> >> > > > +     }
> >> > > > +
> >> > > >        stmt_vec_info new_stmt_info = NULL;
> >> > > >        if (j == 0)
> >> > > >       {
> >> > > > @@ -10114,6 +10153,47 @@ vectorizable_condition (stmt_vec_info stmt_info, gimple_stmt_iterator *gsi,
> >> > > >                   }
> >> > > >               }
> >> > > >           }
> >> > > > +
> >> > > > +       /* If loop mask is present, then AND it with
> >> > >
> >> > > Maybe "If we decided to apply a loop mask, ..."
> >> > >
> >> > > > +          result of vec comparison, so later passes (fre4)
> >> > >
> >> > > Probably better not to name the pass -- could easily change in future.
> >> > >
> >> > > > +          will reuse the same condition used in masked load.
> >> > >
> >> > > Could be a masked store, or potentially other things too.
> >> > > So maybe just "will reuse the masked condition"?
> >> > >
> >> > > > +
> >> > > > +          For example:
> >> > > > +          for (int i = 0; i < 100; ++i)
> >> > > > +            x[i] = y[i] ? z[i] : 10;
> >> > > > +
> >> > > > +          results in following optimized GIMPLE:
> >> > > > +
> >> > > > +          mask__35.8_43 = vect__4.7_41 != { 0, ... };
> >> > > > +          vec_mask_and_46 = loop_mask_40 & mask__35.8_43;
> >> > > > +          _19 = &MEM[base: z_12(D), index: ivtmp_56, step: 4, offset: 0B];
> >> > > > +          vect_iftmp.11_47 = .MASK_LOAD (_19, 4B, vec_mask_and_46);
> >> > > > +          vect_iftmp.12_52 = VEC_COND_EXPR <vec_mask_and_46,
> >> > > > +                                            vect_iftmp.11_47, { 10, ... }>;
> >> > > > +
> >> > > > +          instead of recomputing vec != { 0, ... } in vec_cond_expr  */
> >> > >
> >> > > That's true, but gives the impression that avoiding the vec != { 0, ... }
> >> > > is the main goal, whereas we could do that just by forcing a three-operand
> >> > > COND_EXPR.  It's really more about making sure that vec != { 0, ... }
> >> > > and its masked form aren't both live at the same time.  So maybe:
> >> > >
> >> > >              instead of using a masked and unmasked forms of
> >> > >              vect__4.7_41 != { 0, ... } (masked in the MASK_LOAD,
> >> > >              unmasked in the VEC_COND_EXPR).  */
> >> > >
> >> > Hi Richard,
> >> > Thanks for the suggestions, I have updated comments in the attached patch.
> >> Hi,
> >> The attached patch is rebased on trunk, and after PR91532 fix, the
> >> hunk for fmla_2.c is no
> >> longer required.
> >
> > Hmm.  So we already record some mask info - you just add in addition
> > to that the scalar predicate representing the mask.  I wonder if you can
> > integrate that into the existing vec_loop_masks vector instead of
> > adding another data structure on the side?  Not that I am understanding
> > the existing fully masked code at all (or specifically what it computes
> > as nscalars_per_iter, etc. ... :/).
>
> We can AND several different scalar conditions with the same loop
> mask (that's relatively common), and could even AND the same scalar
> condition with different loop masks (although that's less likely).
> So I think having separate info makes sense.
>
> > At least add the new vinfo member right to the other masks related
> > field.
>
> Agree that would be better.
>
> > @@ -10122,6 +10157,48 @@ vectorizable_condition (stmt_vec_info stmt_info, gimple_stmt_iterator *gsi,
> >                   }
> >               }
> >           }
> > +
> > +       /* If we decided to apply a loop mask to result of vec
> > +          comparison, so later passes will reuse the same condition.
>
> Maybe:
>
>           /* If we decided to apply a loop mask to the result of the vector
>              comparison, AND the comparison with the mask now.  Later passes
>              should then be able to reuse the AND results between multiple
>              vector statements.
>
> > +          For example:
> > +          for (int i = 0; i < 100; ++i)
> > +            x[i] = y[i] ? z[i] : 10;
> > +
> > +          results in following optimized GIMPLE:
> > +
> > +          mask__35.8_43 = vect__4.7_41 != { 0, ... };
> > +          vec_mask_and_46 = loop_mask_40 & mask__35.8_43;
> > +          _19 = &MEM[base: z_12(D), index: ivtmp_56, step: 4, offset: 0B];
> > +          vect_iftmp.11_47 = .MASK_LOAD (_19, 4B, vec_mask_and_46);
> > +          vect_iftmp.12_52 = VEC_COND_EXPR <vec_mask_and_46,
> > +                                            vect_iftmp.11_47, { 10, ... }>;
> > +
> > +          instead of using a masked and unmasked forms of
> > +             vec != { 0, ... } (masked in the MASK_LOAD,
> > +             unmasked in the VEC_COND_EXPR).  */
>
> The last paragraph uses spaces rather than tabs for indentation.
>
> > +/* If code(T) is comparison op or def of comparison stmt,
> > +   extract it's operands.
> > +   Else return <NE_EXPR, T, 0>.  */
> > +
> > +void
> > +scalar_cond_masked_key::get_cond_ops_from_tree (tree t)
> > +{
>
> Maybe:
>
> /* If the condition represented by T is a comparison or the SSA name
>    result of a comparison, extract the comparison's operands.  Represent
>    T as NE_EXPR <T, 0> otherwise.  */
>
> OK with those changes and the one Richard asked for.
Thanks! Committed in r277141.

Thanks,
Prathamesh
>
> Thanks,
> Richard

^ permalink raw reply	[flat|nested] 41+ messages in thread

end of thread

Thread overview: 41+ messages
2019-08-14 15:53 [SVE] PR86753 Prathamesh Kulkarni
2019-08-14 16:59 ` Richard Biener
2019-08-14 17:01   ` Richard Biener
2019-08-14 21:22     ` Richard Sandiford
2019-08-21 20:10       ` Prathamesh Kulkarni
2019-08-22 12:05         ` Richard Biener
2019-08-23 12:46           ` Prathamesh Kulkarni
2019-08-23 13:47             ` Richard Sandiford
2019-08-23 14:30               ` Prathamesh Kulkarni
2019-08-23 14:34                 ` Richard Sandiford
2019-08-26  5:59                   ` Prathamesh Kulkarni
2019-08-26 11:46                     ` Richard Biener
2019-08-26 13:39                       ` Prathamesh Kulkarni
2019-08-27 10:41                         ` Richard Sandiford
2019-08-27 11:31                           ` Richard Biener
2019-08-27 12:52                             ` Richard Sandiford
2019-08-27 15:55                               ` Prathamesh Kulkarni
2019-08-27 17:39                                 ` Richard Sandiford
2019-08-27 20:10                                   ` Prathamesh Kulkarni
2019-08-28  9:42                                     ` Richard Sandiford
2019-08-30 12:09                                       ` Richard Biener
2019-08-31 16:56                                         ` Prathamesh Kulkarni
2019-09-05  9:00                                           ` Richard Sandiford
2019-09-05 12:51                                             ` Prathamesh Kulkarni
2019-09-09 11:15                                               ` Richard Sandiford
2019-09-09 16:37                                                 ` Prathamesh Kulkarni
2019-09-09 20:56                                                   ` Prathamesh Kulkarni
2019-09-10 12:20                                                     ` Richard Sandiford
2019-09-10 13:35                                                     ` Matthew Malcomson
2019-09-10 21:36                                                       ` Prathamesh Kulkarni
2019-09-16 15:54                                                   ` Prathamesh Kulkarni
2019-09-25 16:18                                                     ` Prathamesh Kulkarni
2019-10-02 23:42                                                       ` Prathamesh Kulkarni
2019-10-04 10:38                                                         ` Richard Biener
2019-10-08  0:10                                                           ` Prathamesh Kulkarni
2019-10-08  7:51                                                             ` Richard Sandiford
2019-10-09  3:23                                                               ` Prathamesh Kulkarni
2019-10-15  6:11                                                                 ` Prathamesh Kulkarni
2019-10-15 11:40                                                                   ` Richard Biener
2019-10-16 12:13                                                                     ` Richard Sandiford
2019-10-18  5:20                                                                       ` Prathamesh Kulkarni
