public inbox for gcc-patches@gcc.gnu.org
* [PATCH, vec-tails] Support loop epilogue vectorization
@ 2016-11-01 12:38 Yuri Rumyantsev
  2016-11-02 12:27 ` Richard Biener
  2016-11-09 10:37 ` Bin.Cheng
  0 siblings, 2 replies; 38+ messages in thread
From: Yuri Rumyantsev @ 2016-11-01 12:38 UTC (permalink / raw)
  To: Richard Biener, Jeff Law, gcc-patches, Ilya Enkovich

[-- Attachment #1: Type: text/plain, Size: 502 bytes --]

Hi All,

I am re-sending all the patches that Ilya sent earlier for review; they
support vectorization of loop epilogues and of loops with a low trip
count. We assume that only one patch, vec-tails-07-combine-tail.patch,
was not approved by Jeff.

I rebased all the patches and ran bootstrap and regression testing,
which showed no new failures. All changes affected by the new
vect_do_peeling algorithm have been updated accordingly.

Is it OK for trunk?

ChangeLog files and patches are attached.

[-- Attachment #2: vec-tails-changelogs.tgz --]
[-- Type: application/x-gzip, Size: 2676 bytes --]

[-- Attachment #3: vec-tails-patches.tgz --]
[-- Type: application/x-gzip, Size: 28807 bytes --]


* Re: [PATCH, vec-tails] Support loop epilogue vectorization
  2016-11-01 12:38 [PATCH, vec-tails] Support loop epilogue vectorization Yuri Rumyantsev
@ 2016-11-02 12:27 ` Richard Biener
  2016-11-03 12:33   ` Yuri Rumyantsev
  2016-11-05 18:35   ` Jeff Law
  2016-11-09 10:37 ` Bin.Cheng
  1 sibling, 2 replies; 38+ messages in thread
From: Richard Biener @ 2016-11-02 12:27 UTC (permalink / raw)
  To: Yuri Rumyantsev; +Cc: Jeff Law, gcc-patches, Ilya Enkovich

On Tue, 1 Nov 2016, Yuri Rumyantsev wrote:

> Hi All,
> 
> I am re-sending all the patches that Ilya sent earlier for review; they
> support vectorization of loop epilogues and of loops with a low trip
> count. We assume that only one patch, vec-tails-07-combine-tail.patch,
> was not approved by Jeff.
> 
> I rebased all the patches and ran bootstrap and regression testing,
> which showed no new failures. All changes affected by the new
> vect_do_peeling algorithm have been updated accordingly.
> 
> Is it OK for trunk?

I would have preferred that the series up to -03-nomask-tails would
_only_ contain epilogue loop vectorization changes but unfortunately
the patchset is oddly separated.

I have a comment on that part nevertheless:

@@ -1608,7 +1614,10 @@ vect_enhance_data_refs_alignment (loop_vec_info loop_vinfo)
   /* Check if we can possibly peel the loop.  */
   if (!vect_can_advance_ivs_p (loop_vinfo)
       || !slpeel_can_duplicate_loop_p (loop, single_exit (loop))
-      || loop->inner)
+      || loop->inner
+      /* Required peeling was performed in prologue and
+        is not required for epilogue.  */
+      || LOOP_VINFO_EPILOGUE_P (loop_vinfo))
     do_peeling = false;

   if (do_peeling
@@ -1888,7 +1897,10 @@ vect_enhance_data_refs_alignment (loop_vec_info loop_vinfo)

   do_versioning =
        optimize_loop_nest_for_speed_p (loop)
-       && (!loop->inner); /* FORNOW */
+       && (!loop->inner) /* FORNOW */
+        /* Required versioning was performed for the
+          original loop and is not required for epilogue.  */
+       && !LOOP_VINFO_EPILOGUE_P (loop_vinfo);

   if (do_versioning)
     {

please do that check in the single caller of this function.
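
To illustrate the request, here is a minimal sketch of moving such a
check out of the callee and into its single caller. It is not GCC code:
struct loop_info, enhance_alignment and analyze_loop are hypothetical
stand-ins for loop_vec_info, vect_enhance_data_refs_alignment and its
caller, and the real functions do far more than is shown here.

```c
#include <stdbool.h>

/* Hypothetical stand-in for loop_vec_info; only what the sketch needs.  */
struct loop_info
{
  bool is_epilogue;     /* plays the role of LOOP_VINFO_EPILOGUE_P */
  int alignment_calls;  /* counts how often the alignment step ran */
};

/* Stand-in for vect_enhance_data_refs_alignment.  It does not test
   is_epilogue itself; it just does its work and reports success.  */
static bool
enhance_alignment (struct loop_info *info)
{
  info->alignment_calls++;
  return true;
}

/* Stand-in for the single caller.  The epilogue check lives here, so
   peeling and versioning are never even considered for an epilogue.  */
static bool
analyze_loop (struct loop_info *info)
{
  if (!info->is_epilogue && !enhance_alignment (info))
    return false;
  return true;
}
```

With the check in the caller, the alignment routine needs no knowledge
of epilogues at all.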

Otherwise I still dislike the new ->aux use and I believe that simply
passing down info from the processed parent would be _much_ cleaner.
That is, here (and avoid the FOR_EACH_LOOP change):

@@ -580,12 +586,21 @@ vectorize_loops (void)
            && dump_enabled_p ())
           dump_printf_loc (MSG_OPTIMIZED_LOCATIONS, vect_location,
                            "loop vectorized\n");
-       vect_transform_loop (loop_vinfo);
+       new_loop = vect_transform_loop (loop_vinfo);
        num_vectorized_loops++;
        /* Now that the loop has been vectorized, allow it to be unrolled
           etc.  */
        loop->force_vectorize = false;

+       /* Add new loop to a processing queue.  To make it easier
+          to match loop and its epilogue vectorization in dumps
+          put new loop as the next loop to process.  */
+       if (new_loop)
+         {
+           loops.safe_insert (i + 1, new_loop->num);
+           vect_loops_num = number_of_loops (cfun);
+         }

simply dispatch to a vectorize_epilogue (loop_vinfo, new_loop)
function which will set up stuff properly (and also perform
the if-conversion of the epilogue there).
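
A rough sketch of that control flow, under stated assumptions: the
driver handles the main loop and then calls the epilogue routine
directly, handing over the parent's info as an argument instead of
stashing it in ->aux or inserting the new loop into the work queue.
Every name below is hypothetical, and halving the vectorization factor
for the epilogue is an assumption made only to have something to
compute.

```c
#include <stddef.h>

/* Hypothetical stand-in for loop_vec_info.  */
struct loop_info
{
  int vf;                          /* vectorization factor */
  const struct loop_info *parent;  /* main loop's info; NULL for a main loop */
};

/* Hypothetical vectorize_epilogue: the parent's info arrives as an
   argument rather than through loop->aux.  Returns the VF chosen for
   the epilogue, or 0 if the epilogue stays scalar.  Halving the
   parent's VF is an assumption made purely for illustration.  */
static int
vectorize_epilogue (const struct loop_info *parent, int remaining_iters)
{
  struct loop_info epilogue = { parent->vf / 2, parent };

  if (epilogue.vf < 2 || remaining_iters < epilogue.vf)
    return 0;
  return epilogue.vf;
}

/* Driver: handle the main loop, then dispatch to its epilogue at once.
   No work queue is modified and no second pass over the loops is made.  */
static int
vectorize_with_epilogue (int niters, int vf)
{
  struct loop_info info = { vf, NULL };
  int remaining = niters % info.vf;

  return remaining ? vectorize_epilogue (&info, remaining) : 0;
}
```

For example, vectorize_with_epilogue (23, 8) leaves 7 iterations for
the epilogue and picks a VF of 4 for it, while a trip count of 17
leaves a single iteration and the epilogue stays scalar.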

That said, if we can get in non-masked epilogue vectorization
separately that would be great.

I'm still torn about all the rest of the stuff and question its
usability (esp. merging the epilogue with the main vector loop).
But it has already been approved ... oh well.

Thanks,
Richard.


* Re: [PATCH, vec-tails] Support loop epilogue vectorization
  2016-11-02 12:27 ` Richard Biener
@ 2016-11-03 12:33   ` Yuri Rumyantsev
  2016-11-08 12:39     ` Richard Biener
  2016-11-05 18:35   ` Jeff Law
  1 sibling, 1 reply; 38+ messages in thread
From: Yuri Rumyantsev @ 2016-11-03 12:33 UTC (permalink / raw)
  To: Richard Biener; +Cc: Jeff Law, gcc-patches, Ilya Enkovich

Hi Richard,

I did not understand your last remark:

> That is, here (and avoid the FOR_EACH_LOOP change):
>
> @@ -580,12 +586,21 @@ vectorize_loops (void)
>           && dump_enabled_p ())
>           dump_printf_loc (MSG_OPTIMIZED_LOCATIONS, vect_location,
>                            "loop vectorized\n");
> -       vect_transform_loop (loop_vinfo);
> +       new_loop = vect_transform_loop (loop_vinfo);
>         num_vectorized_loops++;
>        /* Now that the loop has been vectorized, allow it to be unrolled
>           etc.  */
>      loop->force_vectorize = false;
>
> +       /* Add new loop to a processing queue.  To make it easier
> +          to match loop and its epilogue vectorization in dumps
> +          put new loop as the next loop to process.  */
> +       if (new_loop)
> +         {
> +           loops.safe_insert (i + 1, new_loop->num);
> +           vect_loops_num = number_of_loops (cfun);
> +         }
>
> simply dispatch to a vectorize_epilogue (loop_vinfo, new_loop)
> function which will set up stuff properly (and also perform
> the if-conversion of the epilogue there).
>
> That said, if we can get in non-masked epilogue vectorization
> separately that would be great.

Could you please clarify your proposal.

Thanks.
Yuri.



* Re: [PATCH, vec-tails] Support loop epilogue vectorization
  2016-11-02 12:27 ` Richard Biener
  2016-11-03 12:33   ` Yuri Rumyantsev
@ 2016-11-05 18:35   ` Jeff Law
  2016-11-06 11:16     ` Richard Biener
  1 sibling, 1 reply; 38+ messages in thread
From: Jeff Law @ 2016-11-05 18:35 UTC (permalink / raw)
  To: Richard Biener, Yuri Rumyantsev; +Cc: gcc-patches, Ilya Enkovich

On 11/02/2016 06:27 AM, Richard Biener wrote:
> I'm still torn about all the rest of the stuff and question its
> usability (esp. merging the epilogue with the main vector loop).
> But it has already been approved ... oh well.
Note that merging of the epilogue with the main vector loop may well be 
useful for SVE as well.  I'm hoping Richard S. will chime in on how to 
best exploit SVE.

Jeff


* Re: [PATCH, vec-tails] Support loop epilogue vectorization
  2016-11-05 18:35   ` Jeff Law
@ 2016-11-06 11:16     ` Richard Biener
  0 siblings, 0 replies; 38+ messages in thread
From: Richard Biener @ 2016-11-06 11:16 UTC (permalink / raw)
  To: Jeff Law, Yuri Rumyantsev; +Cc: gcc-patches, Ilya Enkovich

On November 5, 2016 3:40:04 AM GMT+01:00, Jeff Law <law@redhat.com> wrote:
>On 11/02/2016 06:27 AM, Richard Biener wrote:
>> I'm still torn about all the rest of the stuff and question its
>> usability (esp. merging the epilogue with the main vector loop).
>> But it has already been approved ... oh well.
>Note that merging of the epilogue with the main vector loop may well be
>useful for SVE as well.  I'm hoping Richard S. will chime in on how to
>best exploit SVE.

Possibly, but full exploitation of SVE requires a full vectorizer
rewrite.  The only thing I can see us implementing is making the vector
size fixed via versioning: let the user specify -mvecsize=N, and if the
hardware vector size turns out to differ at runtime, trap or execute a
scalar variant.
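
As a minimal sketch of that versioning idea (not an existing GCC
feature): the code is built for one fixed lane count, and a runtime
check selects either that version or a scalar fallback. COMPILED_LANES
stands in for the N of the proposed -mvecsize=N, which is treated as
hypothetical here; runtime_lanes is passed in by the caller, since how
the hardware vector length would be queried is outside the sketch.

```c
#include <stddef.h>

/* Lane count fixed at compile time; plays the role of the N in the
   hypothetical -mvecsize=N.  */
#define COMPILED_LANES 4

static void
add_scalar (float *dst, const float *a, const float *b, size_t n)
{
  for (size_t i = 0; i < n; i++)
    dst[i] = a[i] + b[i];
}

/* Written for exactly COMPILED_LANES lanes per step; a compiler could
   map the inner loop onto one fixed-width vector operation.  */
static void
add_fixed_width (float *dst, const float *a, const float *b, size_t n)
{
  size_t i = 0;

  for (; i + COMPILED_LANES <= n; i += COMPILED_LANES)
    for (size_t lane = 0; lane < COMPILED_LANES; lane++)
      dst[i + lane] = a[i + lane] + b[i + lane];
  add_scalar (dst + i, a + i, b + i, n - i);  /* leftover elements */
}

/* Versioned entry point.  Returns 1 if the fixed-width version ran and
   0 if the scalar fallback ran.  */
static int
add_versioned (float *dst, const float *a, const float *b, size_t n,
               size_t runtime_lanes)
{
  if (runtime_lanes == COMPILED_LANES)
    {
      add_fixed_width (dst, a, b, n);
      return 1;
    }
  add_scalar (dst, a, b, n);
  return 0;
}
```

The sketch takes the scalar-variant branch of the proposal; the
alternative mentioned above would trap instead of falling back.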

Richard.

>Jeff



* Re: [PATCH, vec-tails] Support loop epilogue vectorization
  2016-11-03 12:33   ` Yuri Rumyantsev
@ 2016-11-08 12:39     ` Richard Biener
  2016-11-08 14:17       ` Yuri Rumyantsev
  0 siblings, 1 reply; 38+ messages in thread
From: Richard Biener @ 2016-11-08 12:39 UTC (permalink / raw)
  To: Yuri Rumyantsev; +Cc: Jeff Law, gcc-patches, Ilya Enkovich

On Thu, 3 Nov 2016, Yuri Rumyantsev wrote:

> Hi Richard,
> 
> I did not understand your last remark:
> 
> > That is, here (and avoid the FOR_EACH_LOOP change):
> >
> > @@ -580,12 +586,21 @@ vectorize_loops (void)
> >           && dump_enabled_p ())
> >           dump_printf_loc (MSG_OPTIMIZED_LOCATIONS, vect_location,
> >                            "loop vectorized\n");
> > -       vect_transform_loop (loop_vinfo);
> > +       new_loop = vect_transform_loop (loop_vinfo);
> >         num_vectorized_loops++;
> >        /* Now that the loop has been vectorized, allow it to be unrolled
> >           etc.  */
> >      loop->force_vectorize = false;
> >
> > +       /* Add new loop to a processing queue.  To make it easier
> > +          to match loop and its epilogue vectorization in dumps
> > +          put new loop as the next loop to process.  */
> > +       if (new_loop)
> > +         {
> > +           loops.safe_insert (i + 1, new_loop->num);
> > +           vect_loops_num = number_of_loops (cfun);
> > +         }
> >
> > simply dispatch to a vectorize_epilogue (loop_vinfo, new_loop)
> > function which will set up stuff properly (and also perform
> > the if-conversion of the epilogue there).
> >
> > That said, if we can get in non-masked epilogue vectorization
> > separately that would be great.
> 
> Could you please clarify your proposal.

Once a loop has been vectorized, set things up to vectorize its
epilogue immediately; that avoids changing the loop iteration and
avoids the re-use of ->aux.

Richard.

-- 
Richard Biener <rguenther@suse.de>
SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)


* Re: [PATCH, vec-tails] Support loop epilogue vectorization
  2016-11-08 12:39     ` Richard Biener
@ 2016-11-08 14:17       ` Yuri Rumyantsev
  2016-11-10 12:34         ` Richard Biener
  0 siblings, 1 reply; 38+ messages in thread
From: Yuri Rumyantsev @ 2016-11-08 14:17 UTC (permalink / raw)
  To: Richard Biener; +Cc: Jeff Law, gcc-patches, Ilya Enkovich

[-- Attachment #1: Type: text/plain, Size: 5585 bytes --]

Richard,

Here is the updated patch 3.

I checked that all new tests related to epilogue vectorization pass with it.

Your comments will be appreciated.


[-- Attachment #2: patch.03.new --]
[-- Type: application/octet-stream, Size: 11917 bytes --]

diff --git a/gcc/tree-if-conv.c b/gcc/tree-if-conv.c
index 0a20189..0b86ffe 100644
--- a/gcc/tree-if-conv.c
+++ b/gcc/tree-if-conv.c
@@ -2734,7 +2734,7 @@ ifcvt_local_dce (basic_block bb)
    profitability analysis.  Returns non-zero todo flags when something
    changed.  */
 
-static unsigned int
+unsigned int
 tree_if_conversion (struct loop *loop)
 {
   unsigned int todo = 0;
diff --git a/gcc/tree-vect-data-refs.c b/gcc/tree-vect-data-refs.c
index 9346cfe..1fc4966 100644
--- a/gcc/tree-vect-data-refs.c
+++ b/gcc/tree-vect-data-refs.c
@@ -480,9 +480,15 @@ vect_analyze_data_ref_dependences (loop_vec_info loop_vinfo, int *max_vf)
 				LOOP_VINFO_LOOP_NEST (loop_vinfo), true))
     return false;
 
-  FOR_EACH_VEC_ELT (LOOP_VINFO_DDRS (loop_vinfo), i, ddr)
-    if (vect_analyze_data_ref_dependence (ddr, loop_vinfo, max_vf))
-      return false;
+  /* For epilogues we either have no aliases or alias versioning
+     was applied to original loop.  Therefore we may just get max_vf
+     using VF of original loop.  */
+  if (LOOP_VINFO_EPILOGUE_P (loop_vinfo))
+    *max_vf = LOOP_VINFO_ORIG_VECT_FACTOR (loop_vinfo);
+  else
+    FOR_EACH_VEC_ELT (LOOP_VINFO_DDRS (loop_vinfo), i, ddr)
+      if (vect_analyze_data_ref_dependence (ddr, loop_vinfo, max_vf))
+	return false;
 
   return true;
 }
diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c
index 6bfd332..80585ed 100644
--- a/gcc/tree-vect-loop-manip.c
+++ b/gcc/tree-vect-loop-manip.c
@@ -1611,11 +1611,13 @@ slpeel_update_phi_nodes_for_lcssa (struct loop *epilog)
 
    Note this function peels prolog and epilog only if it's necessary,
    as well as guards.
+   Returns created epilogue or NULL.
 
    TODO: Guard for prefer_scalar_loop should be emitted along with
    versioning conditions if loop versioning is needed.  */
 
-void
+
+struct loop *
 vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
 		 tree *niters_vector, int th, bool check_profitability,
 		 bool niters_no_overflow)
@@ -1631,7 +1633,7 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
 			 || LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo));
 
   if (!prolog_peeling && !epilog_peeling)
-    return;
+    return NULL;
 
   prob_vector = 9 * REG_BR_PROB_BASE / 10;
   if ((vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo)) == 2)
@@ -1639,7 +1641,7 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
   prob_prolog = prob_epilog = (vf - 1) * REG_BR_PROB_BASE / vf;
   vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
 
-  struct loop *prolog, *epilog, *loop = LOOP_VINFO_LOOP (loop_vinfo);
+  struct loop *prolog, *epilog = NULL, *loop = LOOP_VINFO_LOOP (loop_vinfo);
   struct loop *first_loop = loop;
   create_lcssa_for_virtual_phi (loop);
   update_ssa (TODO_update_ssa_only_virtuals);
@@ -1821,6 +1823,8 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
     }
   adjust_vec.release ();
   free_original_copy_tables ();
+
+  return epilog;
 }
 
 /* Function vect_create_cond_for_niters_checks.
diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index af49b8c..6eb6d00 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -49,6 +49,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "gimple-fold.h"
 #include "cgraph.h"
 #include "tree-cfg.h"
+#include "tree-if-conv.h"
 
 /* Loop Vectorization Pass.
 
@@ -1258,8 +1259,8 @@ destroy_loop_vec_info (loop_vec_info loop_vinfo, bool clean_stmts)
   destroy_cost_data (LOOP_VINFO_TARGET_COST_DATA (loop_vinfo));
   loop_vinfo->scalar_cost_vec.release ();
 
+  loop->aux = LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo);
   free (loop_vinfo);
-  loop->aux = NULL;
 }
 
 
@@ -1543,7 +1544,10 @@ vect_analyze_loop_form (struct loop *loop)
   if (! vect_analyze_loop_form_1 (loop, &loop_cond,
 				  &assumptions, &number_of_iterationsm1,
 				  &number_of_iterations, &inner_loop_cond))
-    return NULL;
+    {
+      loop->aux = NULL;
+      return NULL;
+    }
 
   loop_vec_info loop_vinfo = new_loop_vec_info (loop);
   LOOP_VINFO_NITERSM1 (loop_vinfo) = number_of_iterationsm1;
@@ -1563,6 +1567,14 @@ vect_analyze_loop_form (struct loop *loop)
       LOOP_VINFO_NITERS_ASSUMPTIONS (loop_vinfo) = assumptions;
     }
 
+  /* For epilogues we want to vectorize, aux holds the
+     loop_vec_info of the original loop.  */
+  if (loop->aux)
+    {
+      gcc_assert (LOOP_VINFO_VECTORIZABLE_P ((loop_vec_info)loop->aux));
+      LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo) = (loop_vec_info)loop->aux;
+    }
+
   if (!LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo))
     {
       if (dump_enabled_p ())
@@ -1579,7 +1591,6 @@ vect_analyze_loop_form (struct loop *loop)
     STMT_VINFO_TYPE (vinfo_for_stmt (inner_loop_cond))
       = loop_exit_ctrl_vec_info_type;
 
-  gcc_assert (!loop->aux);
   loop->aux = loop_vinfo;
   return loop_vinfo;
 }
@@ -2031,15 +2042,20 @@ start_over:
   if (!ok)
     return false;
 
-  /* This pass will decide on using loop versioning and/or loop peeling in
-     order to enhance the alignment of data references in the loop.  */
-  ok = vect_enhance_data_refs_alignment (loop_vinfo);
-  if (!ok)
+  /* Do not invoke vect_enhance_data_refs_alignment for epilogue
+     vectorization.  */
+  if (!LOOP_VINFO_EPILOGUE_P (loop_vinfo))
     {
-      if (dump_enabled_p ())
-	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
-			 "bad data alignment.\n");
-      return false;
+      /* This pass will decide on using loop versioning and/or loop peeling in
+	 order to enhance the alignment of data references in the loop.  */
+      ok = vect_enhance_data_refs_alignment (loop_vinfo);
+      if (!ok)
+	{
+	  if (dump_enabled_p ())
+	    dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+			     "bad data alignment.\n");
+	  return false;
+	}
     }
 
   if (slp)
@@ -2344,7 +2360,10 @@ vect_analyze_loop (struct loop *loop)
       if (fatal
 	  || vector_sizes == 0
 	  || current_vector_size == 0)
-	return NULL;
+	{
+	  loop->aux = NULL;
+	  return NULL;
+	}
 
       /* Try the next biggest vector size.  */
       current_vector_size = 1 << floor_log2 (vector_sizes);
@@ -6663,12 +6682,14 @@ loop_niters_no_overflow (loop_vec_info loop_vinfo)
 
    The analysis phase has determined that the loop is vectorizable.
    Vectorize the loop - created vectorized stmts to replace the scalar
-   stmts in the loop, and update the loop exit condition.  */
+   stmts in the loop, and update the loop exit condition.
+   Returns scalar epilogue loop if any.  */
 
-void
+struct loop *
 vect_transform_loop (loop_vec_info loop_vinfo)
 {
   struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
+  struct loop *epilogue = NULL;
   basic_block *bbs = LOOP_VINFO_BBS (loop_vinfo);
   int nbbs = loop->num_nodes;
   int i;
@@ -6747,8 +6768,8 @@ vect_transform_loop (loop_vec_info loop_vinfo)
   LOOP_VINFO_NITERS_UNCHANGED (loop_vinfo) = niters;
   tree nitersm1 = unshare_expr (LOOP_VINFO_NITERSM1 (loop_vinfo));
   bool niters_no_overflow = loop_niters_no_overflow (loop_vinfo);
-  vect_do_peeling (loop_vinfo, niters, nitersm1, &niters_vector, th,
-		   check_profitability, niters_no_overflow);
+  epilogue = vect_do_peeling (loop_vinfo, niters, nitersm1, &niters_vector, th,
+			      check_profitability, niters_no_overflow);
   if (niters_vector == NULL_TREE)
     {
       if (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo))
@@ -7051,6 +7072,59 @@ vect_transform_loop (loop_vec_info loop_vinfo)
   /* Clear-up safelen field since its value is invalid after vectorization
      since vectorized loop can have loop-carried dependencies.  */
   loop->safelen = 0;
+
+  /* Don't vectorize epilogue for epilogue.  */
+  if (LOOP_VINFO_EPILOGUE_P (loop_vinfo))
+    epilogue = NULL;
+  /* Scalar epilogue is not vectorized in case
+     we use combined vector epilogue.  */
+  else if (LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo))
+    epilogue = NULL;
+
+  if (epilogue)
+    {
+      if (!LOOP_VINFO_MASK_EPILOGUE (loop_vinfo))
+	{
+	  unsigned int vector_sizes
+	    = targetm.vectorize.autovectorize_vector_sizes ();
+	  vector_sizes &= current_vector_size - 1;
+
+	  if (!PARAM_VALUE (PARAM_VECT_EPILOGUES_NOMASK))
+	    epilogue = NULL;
+	  else if (!vector_sizes)
+	    epilogue = NULL;
+	  else if (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
+		   && LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo) >= 0)
+	    {
+	      int smallest_vec_size = 1 << ctz_hwi (vector_sizes);
+	      int ratio = current_vector_size / smallest_vec_size;
+	      int eiters = LOOP_VINFO_INT_NITERS (loop_vinfo)
+		- LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo);
+	      eiters = eiters % vf;
+
+	      epilogue->nb_iterations_upper_bound = eiters - 1;
+
+	      if (eiters < vf / ratio)
+		epilogue = NULL;
+	    }
+	}
+    }
+
+  if (epilogue)
+    {
+      epilogue->force_vectorize = loop->force_vectorize;
+      epilogue->safelen = loop->safelen;
+      epilogue->dont_vectorize = false;
+
+      /* We may need to if-convert epilogue to vectorize it.  */
+      if (LOOP_VINFO_SCALAR_LOOP (loop_vinfo))
+	tree_if_conversion (epilogue);
+
+      gcc_assert (!epilogue->aux);
+      epilogue->aux = loop_vinfo;
+    }
+
+  return epilogue;
 }
 
 /* The code below is trying to perform simple optimization - revert
diff --git a/gcc/tree-vectorizer.c b/gcc/tree-vectorizer.c
index 22e587a..ac782e3 100644
--- a/gcc/tree-vectorizer.c
+++ b/gcc/tree-vectorizer.c
@@ -514,6 +514,7 @@ vectorize_loops (void)
   hash_table<simd_array_to_simduid> *simd_array_to_simduid_htab = NULL;
   bool any_ifcvt_loops = false;
   unsigned ret = 0;
+  struct loop *new_loop;
 
   vect_loops_num = number_of_loops (cfun);
 
@@ -539,6 +540,7 @@ vectorize_loops (void)
 	     || loop->force_vectorize)
       {
 	loop_vec_info loop_vinfo;
+vectorize_epilogue:
 	vect_location = find_loop_location (loop);
         if (LOCATION_LOCUS (vect_location) != UNKNOWN_LOCATION
 	    && dump_enabled_p ())
@@ -580,7 +582,7 @@ vectorize_loops (void)
 	    && dump_enabled_p ())
           dump_printf_loc (MSG_OPTIMIZED_LOCATIONS, vect_location,
                            "loop vectorized\n");
-	vect_transform_loop (loop_vinfo);
+	new_loop = vect_transform_loop (loop_vinfo);
 	num_vectorized_loops++;
 	/* Now that the loop has been vectorized, allow it to be unrolled
 	   etc.  */
@@ -602,6 +604,13 @@ vectorize_loops (void)
 	    fold_loop_vectorized_call (loop_vectorized_call, boolean_true_node);
 	    ret |= TODO_cleanup_cfg;
 	  }
+	if (new_loop)
+	  {
+	    /* Epilogue of vectorized loop must be vectorized too.  */
+	    vect_loops_num = number_of_loops (cfun);
+	    loop = new_loop;
+	    goto vectorize_epilogue;
+	  }
       }
 
   vect_location = UNKNOWN_LOCATION;
diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index 99b9982..9942499 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -1063,8 +1063,8 @@ extern bool slpeel_can_duplicate_loop_p (const struct loop *, const_edge);
 struct loop *slpeel_tree_duplicate_loop_to_edge_cfg (struct loop *,
 						     struct loop *, edge);
 extern void vect_loop_versioning (loop_vec_info, unsigned int, bool);
-extern void vect_do_peeling (loop_vec_info, tree, tree,
-			     tree *, int, bool, bool);
+extern struct loop *vect_do_peeling (loop_vec_info, tree, tree,
+				     tree *, int, bool, bool);
 extern source_location find_loop_location (struct loop *);
 extern bool vect_can_advance_ivs_p (loop_vec_info);
 
@@ -1179,7 +1179,7 @@ extern loop_vec_info vect_analyze_loop (struct loop *);
 extern tree vect_build_loop_niters (loop_vec_info);
 extern void vect_gen_vector_loop_niters (loop_vec_info, tree, tree *, bool);
 /* Drive for loop transformation stage.  */
-extern void vect_transform_loop (loop_vec_info);
+extern struct loop *vect_transform_loop (loop_vec_info);
 extern loop_vec_info vect_analyze_loop_form (struct loop *);
 extern bool vectorizable_live_operation (gimple *, gimple_stmt_iterator *,
 					 slp_tree, int, gimple **);
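
For readers following the hunk in vect_transform_loop, the no-mask
epilogue check can be restated as a small standalone function.  This is
only a sketch under stated assumptions: epilogue_worth_vectorizing and
lowest_set_bit are hypothetical names that are not part of the patch,
plain integers stand in for the loop_vec_info accessors, and the patch
itself applies this test only when the iteration count is known at
compile time.

```c
#include <assert.h>
#include <stdbool.h>

/* Index of the lowest set bit of X; a stand-in for GCC's ctz_hwi.  */
int
lowest_set_bit (unsigned int x)
{
  assert (x != 0);
  int n = 0;
  while ((x & 1u) == 0)
    {
      x >>= 1;
      n++;
    }
  return n;
}

/* Hypothetical helper mirroring the check in the hunk above.
   SUPPORTED_SIZES is a bitmask of vector sizes in bytes, CUR_SIZE is
   the (power-of-two) size used for the main loop, NITERS is the known
   scalar iteration count, PEEL the iterations peeled for alignment,
   and VF the vectorization factor of the main loop.  */
bool
epilogue_worth_vectorizing (unsigned int supported_sizes,
			    unsigned int cur_size,
			    int niters, int peel, int vf)
{
  /* Keep only vector sizes strictly smaller than the current one.  */
  unsigned int smaller = supported_sizes & (cur_size - 1);
  if (smaller == 0)
    return false;

  int smallest = 1 << lowest_set_bit (smaller);
  int ratio = (int) cur_size / smallest;

  /* Iterations left over for the epilogue.  */
  int eiters = (niters - peel) % vf;

  /* The patch gives up when fewer than VF / RATIO iterations remain.  */
  return eiters >= vf / ratio;
}
```

With 32- and 16-byte vectors available, a 32-byte main loop with vf = 8
and 103 iterations leaves 7 iterations for the epilogue, which meets
the threshold of 8 / 2 = 4; with 98 iterations only 2 remain and the
epilogue stays scalar.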

* Re: [PATCH, vec-tails] Support loop epilogue vectorization
  2016-11-01 12:38 [PATCH, vec-tails] Support loop epilogue vectorization Yuri Rumyantsev
  2016-11-02 12:27 ` Richard Biener
@ 2016-11-09 10:37 ` Bin.Cheng
  2016-11-09 11:28   ` Yuri Rumyantsev
  1 sibling, 1 reply; 38+ messages in thread
From: Bin.Cheng @ 2016-11-09 10:37 UTC (permalink / raw)
  To: Yuri Rumyantsev; +Cc: Richard Biener, Jeff Law, gcc-patches, Ilya Enkovich

On Tue, Nov 1, 2016 at 12:38 PM, Yuri Rumyantsev <ysrumyan@gmail.com> wrote:
> Hi All,
>
> I re-send all patches sent by Ilya earlier for review which support
> vectorization of loop epilogues and loops with low trip count. We
> assume that the only patch - vec-tails-07-combine-tail.patch - was not
> approved by Jeff.
>
> I did re-base of all patches and performed bootstrapping and
> regression testing that did not show any new failures. Also all
> changes related to new vect_do_peeling algorithm have been changed
> accordingly.
>
> Is it OK for trunk?

Hi,
I can't approve patches, but I have some comments after going through
the implementation.

One confusing part is the cost model change, as well as the way it is
used to decide how the epilogue loop should be vectorized.  Given that
vect-tail is disabled at the moment and the cost change needs further
tuning, would it be reasonable to split this part out and get the
vectorization part reviewed/submitted independently?  For example, let
user-specified parameters make the decision for now.  Cost and
target-dependent changes should go in last; this could make the patch
easier to read.

The implementation computes and shares quite a lot of information
between main loop and epilogue loop vectorization.  Furthermore, the
variables/fields holding that information are named in a somewhat
misleading way.  For example, LOOP_VINFO_MASK_EPILOGUE gives me the
impression that it is the flag controlling whether the epilogue loop
should be vectorized with masking.  However, that is actually
controlled by exactly the same flag as whether the epilogue loop
should be combined into the main loop with masking:
@@ -7338,6 +8013,9 @@ vect_transform_loop (loop_vec_info loop_vinfo)

   slpeel_make_loop_iterate_ntimes (loop, niters_vector);

+  if (LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo))
+    vect_combine_loop_epilogue (loop_vinfo);
+
   /* Reduce loop iterations by the vectorization factor.  */
   scale_loop_profile (loop, GCOV_COMPUTE_SCALE (1, vf),
               expected_iterations / vf);

IMHO, we should decouple main loop vectorization and epilogue
vectorization as much as possible by sharing as little information as
we can.  The general idea is to handle the epilogue loop just like a
normal short-trip loop.  For example, we can rename
LOOP_VINFO_COMBINE_EPILOGUE to LOOP_VINFO_VECT_MASK (or something
else) and not differentiate its meaning between the main loop and the
epilogue (short-trip) loop.  It only indicates that the current loop
should be vectorized with masking, no matter whether it is a main loop
or an epilogue loop, and it works just like the current implementation.

After this change, we can refine vectorization and make it more
general for normal loops and epilogue (short-trip) loops.  For example,
this implementation sets LOOP_VINFO_PEELING_FOR_NITER for the epilogue
loop and uses it to control how it should be vectorized:
+  if (!LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo))
+    {
+      LOOP_VINFO_MASK_EPILOGUE (loop_vinfo) = false;
+      LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo) = false;
+    }
+  else if (LOOP_VINFO_CAN_BE_MASKED (loop_vinfo)
+       && min_profitable_combine_iters >= 0)
+    {

This works, but it is not that good for understanding or maintenance.

Thanks,
bin

* Re: [PATCH, vec-tails] Support loop epilogue vectorization
  2016-11-09 10:37 ` Bin.Cheng
@ 2016-11-09 11:28   ` Yuri Rumyantsev
  2016-11-09 11:46     ` Bin.Cheng
  2016-11-09 12:52     ` Richard Biener
  0 siblings, 2 replies; 38+ messages in thread
From: Yuri Rumyantsev @ 2016-11-09 11:28 UTC (permalink / raw)
  To: Bin.Cheng; +Cc: Richard Biener, Jeff Law, gcc-patches, Ilya Enkovich

Thanks Richard for your comments.
You proposed handling the epilogue loop just like a normal short-trip
loop, but this is not exactly true since, e.g., the epilogue must not
be peeled for alignment.

It is not clear to me what my next steps are.  Should I re-design the
patch completely or simply decompose the whole patch into different
parts?  But that means we must start the review process from the
beginning, while the release is close to its end.
Note also that I work for Intel until the end of the year and have no
idea who will continue working on this project.

Any help will be appreciated.

Thanks.
Yuri.

* Re: [PATCH, vec-tails] Support loop epilogue vectorization
  2016-11-09 11:28   ` Yuri Rumyantsev
@ 2016-11-09 11:46     ` Bin.Cheng
  2016-11-09 12:12       ` Yuri Rumyantsev
  2016-11-09 12:52     ` Richard Biener
  1 sibling, 1 reply; 38+ messages in thread
From: Bin.Cheng @ 2016-11-09 11:46 UTC (permalink / raw)
  To: Yuri Rumyantsev; +Cc: Richard Biener, Jeff Law, gcc-patches, Ilya Enkovich

On Wed, Nov 9, 2016 at 11:28 AM, Yuri Rumyantsev <ysrumyan@gmail.com> wrote:
> Thanks Richard for your comments.
> You proposed handling the epilogue loop just like a normal short-trip
> loop, but this is not exactly true since, e.g., the epilogue must not
> be peeled for alignment.
Yes, there must be some differences; my motivation is to minimize them
so that we don't need to special-case normal/epilogue loops in too many
places.
Of course, this is just my impression from going through the patch set,
and I could be wrong.

Thanks,
bin

* Re: [PATCH, vec-tails] Support loop epilogue vectorization
  2016-11-09 11:46     ` Bin.Cheng
@ 2016-11-09 12:12       ` Yuri Rumyantsev
  2016-11-09 12:40         ` Bin.Cheng
  0 siblings, 1 reply; 38+ messages in thread
From: Yuri Rumyantsev @ 2016-11-09 12:12 UTC (permalink / raw)
  To: Bin.Cheng; +Cc: Richard Biener, Jeff Law, gcc-patches, Ilya Enkovich

I am familiar with the SVE extension and understand that the
implemented approach might not be suitable for ARM.  The main point is
that only load/store instructions are masked, while all other
calculations are not (we did a special conversion for reduction
statements to implement merging predication semantics).  For SVE,
peeling for niters is not required, but that is not true for x86, where
we must determine which vectorization scheme is more profitable: loop
combining (the only one essential for SVE), or separate epilogue
vectorization using either masking or a smaller vectorization factor.
So I'd like to have the full list of required changes to our
implementation so that I can try to address them.
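
To make the three schemes concrete, here is a minimal scalar emulation
of each of them on a simple a[i] += b[i] loop.  It is only an
illustration: VF, vec_add and the three add_* functions are hypothetical
names, a "vector operation" is emulated by a short scalar loop, and
masking is modelled by touching only the first ACTIVE lanes.

```c
#include <assert.h>
#include <stddef.h>

enum { VF = 8 };	/* Illustrative vectorization factor.  */

/* One emulated vector operation over LANES elements.  Lanes at or past
   ACTIVE are masked off and left untouched.  */
void
vec_add (int *a, const int *b, size_t lanes, size_t active)
{
  assert (active <= lanes);
  for (size_t l = 0; l < active; l++)
    a[l] += b[l];
}

/* Scheme 1: the epilogue is combined into the main loop, so only the
   final iteration runs with a partial mask.  */
void
add_combined (int *a, const int *b, size_t n)
{
  for (size_t i = 0; i < n; i += VF)
    vec_add (a + i, b + i, VF, n - i < VF ? n - i : VF);
}

/* Scheme 2: unmasked main loop followed by a single masked vector
   iteration for the tail.  */
void
add_masked_epilogue (int *a, const int *b, size_t n)
{
  size_t i = 0;
  for (; i + VF <= n; i += VF)
    vec_add (a + i, b + i, VF, VF);
  if (i < n)
    vec_add (a + i, b + i, VF, n - i);
}

/* Scheme 3: unmasked main loop, then an epilogue using half the
   vectorization factor, then a scalar tail.  */
void
add_small_vf_epilogue (int *a, const int *b, size_t n)
{
  size_t i = 0;
  for (; i + VF <= n; i += VF)
    vec_add (a + i, b + i, VF, VF);
  for (; i + VF / 2 <= n; i += VF / 2)
    vec_add (a + i, b + i, VF / 2, VF / 2);
  for (; i < n; i++)
    a[i] += b[i];
}
```

All three variants compute the same result; they differ only in where
the partial work happens, which is what the cost model has to weigh on
x86.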

Thanks.
Yuri.

* Re: [PATCH, vec-tails] Support loop epilogue vectorization
  2016-11-09 12:12       ` Yuri Rumyantsev
@ 2016-11-09 12:40         ` Bin.Cheng
  0 siblings, 0 replies; 38+ messages in thread
From: Bin.Cheng @ 2016-11-09 12:40 UTC (permalink / raw)
  To: Yuri Rumyantsev; +Cc: Richard Biener, Jeff Law, gcc-patches, Ilya Enkovich

On Wed, Nov 9, 2016 at 12:12 PM, Yuri Rumyantsev <ysrumyan@gmail.com> wrote:
> I am familiar with the SVE extension and understand that the
> implemented approach might not be suitable for ARM.  The main point is
> that only load/store instructions are masked, while all other
> calculations are not (we did a special conversion for reduction
> statements to implement merging predication semantics).  For SVE,
> peeling for niters is not required, but that is not true for x86, where
> we must determine which vectorization scheme is more profitable: loop
> combining (the only one essential for SVE), or separate epilogue
> vectorization using either masking or a smaller vectorization factor.
> So I'd like to have the full list of required changes to our
> implementation so that I can try to address them.
Hmm, sorry that my comment gave the impression that I was trying to
hold back the patch; that's not what I meant by any means.  Also, it's
not related to SVE.  As a matter of fact, I haven't read any
documentation about SVE yet.  Sorry again for the false impression
conveyed by my previous messages.

Thanks,
bin

* Re: [PATCH, vec-tails] Support loop epilogue vectorization
  2016-11-09 11:28   ` Yuri Rumyantsev
  2016-11-09 11:46     ` Bin.Cheng
@ 2016-11-09 12:52     ` Richard Biener
  1 sibling, 0 replies; 38+ messages in thread
From: Richard Biener @ 2016-11-09 12:52 UTC (permalink / raw)
  To: Yuri Rumyantsev; +Cc: Bin.Cheng, Jeff Law, gcc-patches, Ilya Enkovich

On Wed, 9 Nov 2016, Yuri Rumyantsev wrote:

> Thanks Richard for your comments.
> You proposed handling the epilogue loop just like a normal short-trip
> loop, but this is not exactly true since, e.g., the epilogue must not
> be peeled for alignment.

But if we know the epilogue data-refs are aligned we should have reflected
that in the code so the vectorizer wouldn't even try to peel for
alignment.  OTOH peeling for alignment already knows that peeling
a short-trip loop is not likely profitable (maybe the condition needs
to be hardened to also work for VF/2).

> It is not clear to me what my next steps are.  Should I re-design the
> patch completely or simply decompose the whole patch into different
> parts?  But that means we must start the review process from the
> beginning, while the release is close to its end.

What I disliked about the series from the beginning is that it does
everything at once rather than first introducing vectorizing of
epilogues as an independent patch.  Lumping everything together makes
it hard to decipher the conditions under which each feature is enabled.

I'm mostly concerned about the predication part and thus if we can
get the other parts separated and committed that would be a much
smaller patch to look at and experiment with.

Note that only stage1 is at its end, we usually still accept patches
that were posted before stage1 end during stage3.

> Note also that I work for Intel until the end of the year and have no
> idea who will continue working on this project.

Noted.

Richard.

> Any help will be appreciated.
>
> Thanks.
> Yuri.
> 
> 2016-11-09 13:37 GMT+03:00 Bin.Cheng <amker.cheng@gmail.com>:
> > On Tue, Nov 1, 2016 at 12:38 PM, Yuri Rumyantsev <ysrumyan@gmail.com> wrote:
> >> Hi All,
> >>
> >> I re-send all patches sent by Ilya earlier for review which support
> >> vectorization of loop epilogues and loops with low trip count. We
> >> assume that the only patch - vec-tails-07-combine-tail.patch - was not
> >> approved by Jeff.
> >>
> >> I did re-base of all patches and performed bootstrapping and
> >> regression testing that did not show any new failures. Also all
> >> changes related to new vect_do_peeling algorithm have been changed
> >> accordingly.
> >>
> >> Is it OK for trunk?
> >
> > Hi,
> > I can't approve patches, but had some comments after going through the
> > implementation.
> >
> > One confusing part is cost model change, as well as the way it's used
> > to decide how epilogue loop should be vectorized.  Given vect-tail is
> > disabled at the moment and the cost change needs further tuning, is it
> > reasonable to split this part out and get vectorization part
> > reviewed/submitted independently?  For example, let user specified
> > parameters make the decision for now.  Cost and target dependent
> > changes should go in at last, this could make the patch easier to
> > read.
> >
> > The implementation computes/shares quite amount information from main
> > loop to epilogue loop vectorization.  Furthermore, variables/fields
> > for such information are somehow named in a misleading way.  For
> > example. LOOP_VINFO_MASK_EPILOGUE gives me the impression this is the
> > flag controlling whether epilogue loop should be vectorized with
> > masking.  However, it's actually controlled by exactly the same flag
> > as whether epilogue loop should be combined into the main loop with
> > masking:
> > @@ -7338,6 +8013,9 @@ vect_transform_loop (loop_vec_info loop_vinfo)
> >
> >    slpeel_make_loop_iterate_ntimes (loop, niters_vector);
> >
> > +  if (LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo))
> > +    vect_combine_loop_epilogue (loop_vinfo);
> > +
> >    /* Reduce loop iterations by the vectorization factor.  */
> >    scale_loop_profile (loop, GCOV_COMPUTE_SCALE (1, vf),
> >                expected_iterations / vf);
> >
> > IMHO, we should decouple main loop vectorization and epilogue
> > vectorization as much as possible by sharing as few information as we
> > can.  The general idea is to handle epilogue loop just like normal
> > short-trip loop.  For example, we can rename
> > LOOP_VINFO_COMBINE_EPILOGUE into LOOP_VINFO_VECT_MASK (or something
> > else), and we don't differentiate its meaning between main and
> > epilogue(short-trip) loop.  It only indicates the current loop should
> > be vectorized with masking no matter it's a main loop or epilogue
> > loop, and it works just like the current implementation.
> >
> > After this change, we can refine vectorization and make it more
> > general for normal loop and epilogue(short trip) loop.  For example,
> > this implementation sets LOOP_VINFO_PEELING_FOR_NITER  for epilogue
> > loop and use it to control how it should be vectorized:
> > +  if (!LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo))
> > +    {
> > +      LOOP_VINFO_MASK_EPILOGUE (loop_vinfo) = false;
> > +      LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo) = false;
> > +    }
> > +  else if (LOOP_VINFO_CAN_BE_MASKED (loop_vinfo)
> > +       && min_profitable_combine_iters >= 0)
> > +    {
> >
> > This works, but is not that good for understanding or maintaining.
> >
> > Thanks,
> > bin
> 
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH, vec-tails] Support loop epilogue vectorization
  2016-11-08 14:17       ` Yuri Rumyantsev
@ 2016-11-10 12:34         ` Richard Biener
  2016-11-10 12:36           ` Richard Biener
  0 siblings, 1 reply; 38+ messages in thread
From: Richard Biener @ 2016-11-10 12:34 UTC (permalink / raw)
  To: Yuri Rumyantsev; +Cc: Jeff Law, gcc-patches, Ilya Enkovich

On Tue, 8 Nov 2016, Yuri Rumyantsev wrote:

> Richard,
> 
> Here is the updated patch 3.
> 
> I checked that all new tests related to epilogue vectorization passed with it.
> 
> Your comments will be appreciated.

A lot better now.  Instead of the ->aux dance I now prefer to
pass the original loop's loop_vinfo to vect_analyze_loop as
optional argument (if non-NULL we analyze the epilogue of that 
loop_vinfo).  OTOH I remember we mainly use it to get at the
original vectorization factor?  So we can pass down an (optional)
forced vectorization factor as well?
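
As a rough illustration only (stand-in types, not the real GCC
declarations), the optional-argument interface could look like the
sketch below.  vinfo_sketch and analyze_loop_sketch are hypothetical
names, and reusing the parent's vectorization factor as the upper
bound is an assumption borrowed from the max_vf handling in the
posted patch.

```cpp
#include <cstddef>

/* Stand-in for loop_vec_info; only the fields this sketch needs.  */
struct vinfo_sketch
{
  int vect_factor;               /* Vectorization factor for the loop.  */
  const vinfo_sketch *orig_info; /* Parent loop's info, NULL for a main loop.  */
};

/* Hypothetical analogue of vect_analyze_loop.  DEP_MAX_VF stands for the
   bound that dependence analysis would compute.  ORIG is optional: when it
   is non-NULL, the loop being analyzed is the epilogue of ORIG's loop.  */
static vinfo_sketch
analyze_loop_sketch (int dep_max_vf, const vinfo_sketch *orig = NULL)
{
  vinfo_sketch res = { 0, orig };
  /* Assumption: for an epilogue, take the parent's factor as the bound
     instead of redoing dependence analysis.  */
  res.vect_factor = orig ? orig->vect_factor : dep_max_vf;
  return res;
}
```

A main loop would be analyzed with analyze_loop_sketch (8), and its
epilogue with analyze_loop_sketch (16, &main_info), so no state has to
travel through loop->aux.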

Richard.

> 2016-11-08 15:38 GMT+03:00 Richard Biener <rguenther@suse.de>:
> > On Thu, 3 Nov 2016, Yuri Rumyantsev wrote:
> >
> >> Hi Richard,
> >>
> >> I did not understand your last remark:
> >>
> >> > That is, here (and avoid the FOR_EACH_LOOP change):
> >> >
> >> > @@ -580,12 +586,21 @@ vectorize_loops (void)
> >> >           && dump_enabled_p ())
> >> >           dump_printf_loc (MSG_OPTIMIZED_LOCATIONS, vect_location,
> >> >                            "loop vectorized\n");
> >> > -       vect_transform_loop (loop_vinfo);
> >> > +       new_loop = vect_transform_loop (loop_vinfo);
> >> >         num_vectorized_loops++;
> >> >        /* Now that the loop has been vectorized, allow it to be unrolled
> >> >           etc.  */
> >> >      loop->force_vectorize = false;
> >> >
> >> > +       /* Add new loop to a processing queue.  To make it easier
> >> > +          to match loop and its epilogue vectorization in dumps
> >> > +          put new loop as the next loop to process.  */
> >> > +       if (new_loop)
> >> > +         {
> >> > +           loops.safe_insert (i + 1, new_loop->num);
> >> > +           vect_loops_num = number_of_loops (cfun);
> >> > +         }
> >> >
> >> > simply dispatch to a vectorize_epilogue (loop_vinfo, new_loop)
> >> > function which will set up stuff properly (and also perform
> >> > the if-conversion of the epilogue there).
> >> >
> >> > That said, if we can get in non-masked epilogue vectorization
> >> > separately that would be great.
> >>
> >> Could you please clarify your proposal.
> >
> > When a loop was vectorized set things up to immediately vectorize
> > its epilogue, avoiding changing the loop iteration and avoiding
> > the re-use of ->aux.
> >
> > Richard.
> >
> >> Thanks.
> >> Yuri.
> >>
> >> 2016-11-02 15:27 GMT+03:00 Richard Biener <rguenther@suse.de>:
> >> > On Tue, 1 Nov 2016, Yuri Rumyantsev wrote:
> >> >
> >> >> Hi All,
> >> >>
> >> >> I re-send all patches sent by Ilya earlier for review which support
> >> >> vectorization of loop epilogues and loops with low trip count. We
> >> >> assume that the only patch - vec-tails-07-combine-tail.patch - was not
> >> >> approved by Jeff.
> >> >>
> >> >> I did re-base of all patches and performed bootstrapping and
> >> >> regression testing that did not show any new failures. Also all
> >> >> changes related to new vect_do_peeling algorithm have been changed
> >> >> accordingly.
> >> >>
> >> >> Is it OK for trunk?
> >> >
> >> > I would have preferred that the series up to -03-nomask-tails would
> >> > _only_ contain epilogue loop vectorization changes but unfortunately
> >> > the patchset is oddly separated.
> >> >
> >> > I have a comment on that part nevertheless:
> >> >
> >> > @@ -1608,7 +1614,10 @@ vect_enhance_data_refs_alignment (loop_vec_info
> >> > loop_vinfo)
> >> >    /* Check if we can possibly peel the loop.  */
> >> >    if (!vect_can_advance_ivs_p (loop_vinfo)
> >> >        || !slpeel_can_duplicate_loop_p (loop, single_exit (loop))
> >> > -      || loop->inner)
> >> > +      || loop->inner
> >> > +      /* Required peeling was performed in prologue and
> >> > +        is not required for epilogue.  */
> >> > +      || LOOP_VINFO_EPILOGUE_P (loop_vinfo))
> >> >      do_peeling = false;
> >> >
> >> >    if (do_peeling
> >> > @@ -1888,7 +1897,10 @@ vect_enhance_data_refs_alignment (loop_vec_info
> >> > loop_vinfo)
> >> >
> >> >    do_versioning =
> >> >         optimize_loop_nest_for_speed_p (loop)
> >> > -       && (!loop->inner); /* FORNOW */
> >> > +       && (!loop->inner) /* FORNOW */
> >> > +        /* Required versioning was performed for the
> >> > +          original loop and is not required for epilogue.  */
> >> > +       && !LOOP_VINFO_EPILOGUE_P (loop_vinfo);
> >> >
> >> >    if (do_versioning)
> >> >      {
> >> >
> >> > please do that check in the single caller of this function.
> >> >
> >> > Otherwise I still dislike the new ->aux use and I believe that simply
> >> > passing down info from the processed parent would be _much_ cleaner.
> >> > That is, here (and avoid the FOR_EACH_LOOP change):
> >> >
> >> > @@ -580,12 +586,21 @@ vectorize_loops (void)
> >> >             && dump_enabled_p ())
> >> >            dump_printf_loc (MSG_OPTIMIZED_LOCATIONS, vect_location,
> >> >                             "loop vectorized\n");
> >> > -       vect_transform_loop (loop_vinfo);
> >> > +       new_loop = vect_transform_loop (loop_vinfo);
> >> >         num_vectorized_loops++;
> >> >         /* Now that the loop has been vectorized, allow it to be unrolled
> >> >            etc.  */
> >> >         loop->force_vectorize = false;
> >> >
> >> > +       /* Add new loop to a processing queue.  To make it easier
> >> > +          to match loop and its epilogue vectorization in dumps
> >> > +          put new loop as the next loop to process.  */
> >> > +       if (new_loop)
> >> > +         {
> >> > +           loops.safe_insert (i + 1, new_loop->num);
> >> > +           vect_loops_num = number_of_loops (cfun);
> >> > +         }
> >> >
> >> > simply dispatch to a vectorize_epilogue (loop_vinfo, new_loop)
> >> > function which will set up stuff properly (and also perform
> >> > the if-conversion of the epilogue there).
> >> >
> >> > That said, if we can get in non-masked epilogue vectorization
> >> > separately that would be great.
> >> >
> >> > I'm still torn about all the rest of the stuff and question its
> >> > usability (esp. merging the epilogue with the main vector loop).
> >> > But it has already been approved ... oh well.
> >> >
> >> > Thanks,
> >> > Richard.
> >>
> >>
> >
> > --
> > Richard Biener <rguenther@suse.de>
> > SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH, vec-tails] Support loop epilogue vectorization
  2016-11-10 12:34         ` Richard Biener
@ 2016-11-10 12:36           ` Richard Biener
  2016-11-11 11:15             ` Yuri Rumyantsev
  0 siblings, 1 reply; 38+ messages in thread
From: Richard Biener @ 2016-11-10 12:36 UTC (permalink / raw)
  To: Yuri Rumyantsev; +Cc: Jeff Law, gcc-patches, Ilya Enkovich

On Thu, 10 Nov 2016, Richard Biener wrote:

> On Tue, 8 Nov 2016, Yuri Rumyantsev wrote:
> 
> > Richard,
> > 
> > Here is the updated patch 3.
> > 
> > I checked that all new tests related to epilogue vectorization passed with it.
> > 
> > Your comments will be appreciated.
> 
> A lot better now.  Instead of the ->aux dance I now prefer to
> pass the original loops loop_vinfo to vect_analyze_loop as
> optional argument (if non-NULL we analyze the epilogue of that 
> loop_vinfo).  OTOH I remember we mainly use it to get at the
> original vectorization factor?  So we can pass down an (optional)
> forced vectorization factor as well?

Btw, I wonder if you can produce a single patch containing just
epilogue vectorization, that is combine patches 1-3 but rip out
changes only needed by later patches?

Thanks,
Richard.

> Richard.
> 
> > 2016-11-08 15:38 GMT+03:00 Richard Biener <rguenther@suse.de>:
> > > On Thu, 3 Nov 2016, Yuri Rumyantsev wrote:
> > >
> > >> Hi Richard,
> > >>
> > >> I did not understand your last remark:
> > >>
> > >> > That is, here (and avoid the FOR_EACH_LOOP change):
> > >> >
> > >> > @@ -580,12 +586,21 @@ vectorize_loops (void)
> > >> >           && dump_enabled_p ())
> > >> >           dump_printf_loc (MSG_OPTIMIZED_LOCATIONS, vect_location,
> > >> >                            "loop vectorized\n");
> > >> > -       vect_transform_loop (loop_vinfo);
> > >> > +       new_loop = vect_transform_loop (loop_vinfo);
> > >> >         num_vectorized_loops++;
> > >> >        /* Now that the loop has been vectorized, allow it to be unrolled
> > >> >           etc.  */
> > >> >      loop->force_vectorize = false;
> > >> >
> > >> > +       /* Add new loop to a processing queue.  To make it easier
> > >> > +          to match loop and its epilogue vectorization in dumps
> > >> > +          put new loop as the next loop to process.  */
> > >> > +       if (new_loop)
> > >> > +         {
> > >> > +           loops.safe_insert (i + 1, new_loop->num);
> > >> > +           vect_loops_num = number_of_loops (cfun);
> > >> > +         }
> > >> >
> > >> > simply dispatch to a vectorize_epilogue (loop_vinfo, new_loop)
> > >> > function which will set up stuff properly (and also perform
> > >> > the if-conversion of the epilogue there).
> > >> >
> > >> > That said, if we can get in non-masked epilogue vectorization
> > >> > separately that would be great.
> > >>
> > >> Could you please clarify your proposal.
> > >
> > > When a loop was vectorized set things up to immediately vectorize
> > > its epilogue, avoiding changing the loop iteration and avoiding
> > > the re-use of ->aux.
> > >
> > > Richard.
> > >
> > >> Thanks.
> > >> Yuri.
> > >>
> > >> 2016-11-02 15:27 GMT+03:00 Richard Biener <rguenther@suse.de>:
> > >> > On Tue, 1 Nov 2016, Yuri Rumyantsev wrote:
> > >> >
> > >> >> Hi All,
> > >> >>
> > >> >> I re-send all patches sent by Ilya earlier for review which support
> > >> >> vectorization of loop epilogues and loops with low trip count. We
> > >> >> assume that the only patch - vec-tails-07-combine-tail.patch - was not
> > >> >> approved by Jeff.
> > >> >>
> > >> >> I did re-base of all patches and performed bootstrapping and
> > >> >> regression testing that did not show any new failures. Also all
> > >> >> changes related to new vect_do_peeling algorithm have been changed
> > >> >> accordingly.
> > >> >>
> > >> >> Is it OK for trunk?
> > >> >
> > >> > I would have preferred that the series up to -03-nomask-tails would
> > >> > _only_ contain epilogue loop vectorization changes but unfortunately
> > >> > the patchset is oddly separated.
> > >> >
> > >> > I have a comment on that part nevertheless:
> > >> >
> > >> > @@ -1608,7 +1614,10 @@ vect_enhance_data_refs_alignment (loop_vec_info
> > >> > loop_vinfo)
> > >> >    /* Check if we can possibly peel the loop.  */
> > >> >    if (!vect_can_advance_ivs_p (loop_vinfo)
> > >> >        || !slpeel_can_duplicate_loop_p (loop, single_exit (loop))
> > >> > -      || loop->inner)
> > >> > +      || loop->inner
> > >> > +      /* Required peeling was performed in prologue and
> > >> > +        is not required for epilogue.  */
> > >> > +      || LOOP_VINFO_EPILOGUE_P (loop_vinfo))
> > >> >      do_peeling = false;
> > >> >
> > >> >    if (do_peeling
> > >> > @@ -1888,7 +1897,10 @@ vect_enhance_data_refs_alignment (loop_vec_info
> > >> > loop_vinfo)
> > >> >
> > >> >    do_versioning =
> > >> >         optimize_loop_nest_for_speed_p (loop)
> > >> > -       && (!loop->inner); /* FORNOW */
> > >> > +       && (!loop->inner) /* FORNOW */
> > >> > +        /* Required versioning was performed for the
> > >> > +          original loop and is not required for epilogue.  */
> > >> > +       && !LOOP_VINFO_EPILOGUE_P (loop_vinfo);
> > >> >
> > >> >    if (do_versioning)
> > >> >      {
> > >> >
> > >> > please do that check in the single caller of this function.
> > >> >
> > >> > Otherwise I still dislike the new ->aux use and I believe that simply
> > >> > passing down info from the processed parent would be _much_ cleaner.
> > >> > That is, here (and avoid the FOR_EACH_LOOP change):
> > >> >
> > >> > @@ -580,12 +586,21 @@ vectorize_loops (void)
> > >> >             && dump_enabled_p ())
> > >> >            dump_printf_loc (MSG_OPTIMIZED_LOCATIONS, vect_location,
> > >> >                             "loop vectorized\n");
> > >> > -       vect_transform_loop (loop_vinfo);
> > >> > +       new_loop = vect_transform_loop (loop_vinfo);
> > >> >         num_vectorized_loops++;
> > >> >         /* Now that the loop has been vectorized, allow it to be unrolled
> > >> >            etc.  */
> > >> >         loop->force_vectorize = false;
> > >> >
> > >> > +       /* Add new loop to a processing queue.  To make it easier
> > >> > +          to match loop and its epilogue vectorization in dumps
> > >> > +          put new loop as the next loop to process.  */
> > >> > +       if (new_loop)
> > >> > +         {
> > >> > +           loops.safe_insert (i + 1, new_loop->num);
> > >> > +           vect_loops_num = number_of_loops (cfun);
> > >> > +         }
> > >> >
> > >> > simply dispatch to a vectorize_epilogue (loop_vinfo, new_loop)
> > >> > function which will set up stuff properly (and also perform
> > >> > the if-conversion of the epilogue there).
> > >> >
> > >> > That said, if we can get in non-masked epilogue vectorization
> > >> > separately that would be great.
> > >> >
> > >> > I'm still torn about all the rest of the stuff and question its
> > >> > usability (esp. merging the epilogue with the main vector loop).
> > >> > But it has already been approved ... oh well.
> > >> >
> > >> > Thanks,
> > >> > Richard.
> > >>
> > >>
> > >
> > > --
> > > Richard Biener <rguenther@suse.de>
> > > SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)
> > 
> 
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH, vec-tails] Support loop epilogue vectorization
  2016-11-10 12:36           ` Richard Biener
@ 2016-11-11 11:15             ` Yuri Rumyantsev
  2016-11-11 14:15               ` Yuri Rumyantsev
  2016-11-14 12:51               ` Richard Biener
  0 siblings, 2 replies; 38+ messages in thread
From: Yuri Rumyantsev @ 2016-11-11 11:15 UTC (permalink / raw)
  To: Richard Biener; +Cc: Jeff Law, gcc-patches, Ilya Enkovich

[-- Attachment #1: Type: text/plain, Size: 8412 bytes --]

Richard,

I prepared an updated patch 3 that passes an additional argument to
vect_analyze_loop, as you proposed (untested).

You wrote:
Btw, I wonder if you can produce a single patch containing just
epilogue vectorization, that is combine patches 1-3 but rip out
changes only needed by later patches?

Did you mean that I should exclude all support for vectorization epilogues,
i.e. exclude from the 2nd patch all unrelated changes
such as

diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index 11863af..32011c1 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -1120,6 +1120,12 @@ new_loop_vec_info (struct loop *loop)
   LOOP_VINFO_PEELING_FOR_GAPS (res) = false;
   LOOP_VINFO_PEELING_FOR_NITER (res) = false;
   LOOP_VINFO_OPERANDS_SWAPPED (res) = false;
+  LOOP_VINFO_CAN_BE_MASKED (res) = false;
+  LOOP_VINFO_REQUIRED_MASKS (res) = 0;
+  LOOP_VINFO_COMBINE_EPILOGUE (res) = false;
+  LOOP_VINFO_MASK_EPILOGUE (res) = false;
+  LOOP_VINFO_NEED_MASKING (res) = false;
+  LOOP_VINFO_ORIG_LOOP_INFO (res) = NULL;

Did you also mean that the new combined patch must be a working patch,
i.e. one that can be integrated without the other patches?

Could you please look at the updated patch?

Thanks.
Yuri.

2016-11-10 15:36 GMT+03:00 Richard Biener <rguenther@suse.de>:
> On Thu, 10 Nov 2016, Richard Biener wrote:
>
>> On Tue, 8 Nov 2016, Yuri Rumyantsev wrote:
>>
>> > Richard,
>> >
>> > Here is the updated patch 3.
>> >
>> > I checked that all new tests related to epilogue vectorization passed with it.
>> >
>> > Your comments will be appreciated.
>>
>> A lot better now.  Instead of the ->aux dance I now prefer to
>> pass the original loops loop_vinfo to vect_analyze_loop as
>> optional argument (if non-NULL we analyze the epilogue of that
>> loop_vinfo).  OTOH I remember we mainly use it to get at the
>> original vectorization factor?  So we can pass down an (optional)
>> forced vectorization factor as well?
>
> Btw, I wonder if you can produce a single patch containing just
> epilogue vectorization, that is combine patches 1-3 but rip out
> changes only needed by later patches?
>
> Thanks,
> Richard.
>
>> Richard.
>>
>> > 2016-11-08 15:38 GMT+03:00 Richard Biener <rguenther@suse.de>:
>> > > On Thu, 3 Nov 2016, Yuri Rumyantsev wrote:
>> > >
>> > >> Hi Richard,
>> > >>
>> > >> I did not understand your last remark:
>> > >>
>> > >> > That is, here (and avoid the FOR_EACH_LOOP change):
>> > >> >
>> > >> > @@ -580,12 +586,21 @@ vectorize_loops (void)
>> > >> >           && dump_enabled_p ())
>> > >> >           dump_printf_loc (MSG_OPTIMIZED_LOCATIONS, vect_location,
>> > >> >                            "loop vectorized\n");
>> > >> > -       vect_transform_loop (loop_vinfo);
>> > >> > +       new_loop = vect_transform_loop (loop_vinfo);
>> > >> >         num_vectorized_loops++;
>> > >> >        /* Now that the loop has been vectorized, allow it to be unrolled
>> > >> >           etc.  */
>> > >> >      loop->force_vectorize = false;
>> > >> >
>> > >> > +       /* Add new loop to a processing queue.  To make it easier
>> > >> > +          to match loop and its epilogue vectorization in dumps
>> > >> > +          put new loop as the next loop to process.  */
>> > >> > +       if (new_loop)
>> > >> > +         {
>> > >> > +           loops.safe_insert (i + 1, new_loop->num);
>> > >> > +           vect_loops_num = number_of_loops (cfun);
>> > >> > +         }
>> > >> >
>> > >> > simply dispatch to a vectorize_epilogue (loop_vinfo, new_loop)
>> > >> > function which will set up stuff properly (and also perform
>> > >> > the if-conversion of the epilogue there).
>> > >> >
>> > >> > That said, if we can get in non-masked epilogue vectorization
>> > >> > separately that would be great.
>> > >>
>> > >> Could you please clarify your proposal.
>> > >
>> > > When a loop was vectorized set things up to immediately vectorize
>> > > its epilogue, avoiding changing the loop iteration and avoiding
>> > > the re-use of ->aux.
>> > >
>> > > Richard.
>> > >
>> > >> Thanks.
>> > >> Yuri.
>> > >>
>> > >> 2016-11-02 15:27 GMT+03:00 Richard Biener <rguenther@suse.de>:
>> > >> > On Tue, 1 Nov 2016, Yuri Rumyantsev wrote:
>> > >> >
>> > >> >> Hi All,
>> > >> >>
>> > >> >> I re-send all patches sent by Ilya earlier for review which support
>> > >> >> vectorization of loop epilogues and loops with low trip count. We
>> > >> >> assume that the only patch - vec-tails-07-combine-tail.patch - was not
>> > >> >> approved by Jeff.
>> > >> >>
>> > >> >> I did re-base of all patches and performed bootstrapping and
>> > >> >> regression testing that did not show any new failures. Also all
>> > >> >> changes related to new vect_do_peeling algorithm have been changed
>> > >> >> accordingly.
>> > >> >>
>> > >> >> Is it OK for trunk?
>> > >> >
>> > >> > I would have preferred that the series up to -03-nomask-tails would
>> > >> > _only_ contain epilogue loop vectorization changes but unfortunately
>> > >> > the patchset is oddly separated.
>> > >> >
>> > >> > I have a comment on that part nevertheless:
>> > >> >
>> > >> > @@ -1608,7 +1614,10 @@ vect_enhance_data_refs_alignment (loop_vec_info
>> > >> > loop_vinfo)
>> > >> >    /* Check if we can possibly peel the loop.  */
>> > >> >    if (!vect_can_advance_ivs_p (loop_vinfo)
>> > >> >        || !slpeel_can_duplicate_loop_p (loop, single_exit (loop))
>> > >> > -      || loop->inner)
>> > >> > +      || loop->inner
>> > >> > +      /* Required peeling was performed in prologue and
>> > >> > +        is not required for epilogue.  */
>> > >> > +      || LOOP_VINFO_EPILOGUE_P (loop_vinfo))
>> > >> >      do_peeling = false;
>> > >> >
>> > >> >    if (do_peeling
>> > >> > @@ -1888,7 +1897,10 @@ vect_enhance_data_refs_alignment (loop_vec_info
>> > >> > loop_vinfo)
>> > >> >
>> > >> >    do_versioning =
>> > >> >         optimize_loop_nest_for_speed_p (loop)
>> > >> > -       && (!loop->inner); /* FORNOW */
>> > >> > +       && (!loop->inner) /* FORNOW */
>> > >> > +        /* Required versioning was performed for the
>> > >> > +          original loop and is not required for epilogue.  */
>> > >> > +       && !LOOP_VINFO_EPILOGUE_P (loop_vinfo);
>> > >> >
>> > >> >    if (do_versioning)
>> > >> >      {
>> > >> >
>> > >> > please do that check in the single caller of this function.
>> > >> >
>> > >> > Otherwise I still dislike the new ->aux use and I believe that simply
>> > >> > passing down info from the processed parent would be _much_ cleaner.
>> > >> > That is, here (and avoid the FOR_EACH_LOOP change):
>> > >> >
>> > >> > @@ -580,12 +586,21 @@ vectorize_loops (void)
>> > >> >             && dump_enabled_p ())
>> > >> >            dump_printf_loc (MSG_OPTIMIZED_LOCATIONS, vect_location,
>> > >> >                             "loop vectorized\n");
>> > >> > -       vect_transform_loop (loop_vinfo);
>> > >> > +       new_loop = vect_transform_loop (loop_vinfo);
>> > >> >         num_vectorized_loops++;
>> > >> >         /* Now that the loop has been vectorized, allow it to be unrolled
>> > >> >            etc.  */
>> > >> >         loop->force_vectorize = false;
>> > >> >
>> > >> > +       /* Add new loop to a processing queue.  To make it easier
>> > >> > +          to match loop and its epilogue vectorization in dumps
>> > >> > +          put new loop as the next loop to process.  */
>> > >> > +       if (new_loop)
>> > >> > +         {
>> > >> > +           loops.safe_insert (i + 1, new_loop->num);
>> > >> > +           vect_loops_num = number_of_loops (cfun);
>> > >> > +         }
>> > >> >
>> > >> > simply dispatch to a vectorize_epilogue (loop_vinfo, new_loop)
>> > >> > function which will set up stuff properly (and also perform
>> > >> > the if-conversion of the epilogue there).
>> > >> >
>> > >> > That said, if we can get in non-masked epilogue vectorization
>> > >> > separately that would be great.
>> > >> >
>> > >> > I'm still torn about all the rest of the stuff and question its
>> > >> > usability (esp. merging the epilogue with the main vector loop).
>> > >> > But it has already been approved ... oh well.
>> > >> >
>> > >> > Thanks,
>> > >> > Richard.
>> > >>
>> > >>
>> > >
>> > > --
>> > > Richard Biener <rguenther@suse.de>
>> > > SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)
>> >
>>
>>
>
> --
> Richard Biener <rguenther@suse.de>
> SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)

[-- Attachment #2: patch.03.update1 --]
[-- Type: application/octet-stream, Size: 12710 bytes --]

diff --git a/gcc/tree-if-conv.c b/gcc/tree-if-conv.c
index 0a20189..0b86ffe 100644
--- a/gcc/tree-if-conv.c
+++ b/gcc/tree-if-conv.c
@@ -2734,7 +2734,7 @@ ifcvt_local_dce (basic_block bb)
    profitability analysis.  Returns non-zero todo flags when something
    changed.  */
 
-static unsigned int
+unsigned int
 tree_if_conversion (struct loop *loop)
 {
   unsigned int todo = 0;
diff --git a/gcc/tree-if-conv.h b/gcc/tree-if-conv.h
new file mode 100644
index 0000000..3a732c2
--- /dev/null
+++ b/gcc/tree-if-conv.h
@@ -0,0 +1,24 @@
+/* Copyright (C) 2016 Free Software Foundation, Inc.
+
+This file is part of GCC.
+
+GCC is free software; you can redistribute it and/or modify it under
+the terms of the GNU General Public License as published by the Free
+Software Foundation; either version 3, or (at your option) any later
+version.
+
+GCC is distributed in the hope that it will be useful, but WITHOUT ANY
+WARRANTY; without even the implied warranty of MERCHANTABILITY or
+FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
+for more details.
+
+You should have received a copy of the GNU General Public License
+along with GCC; see the file COPYING3.  If not see
+<http://www.gnu.org/licenses/>.  */
+
+#ifndef GCC_TREE_IF_CONV_H
+#define GCC_TREE_IF_CONV_H
+
+unsigned int tree_if_conversion (struct loop *);
+
+#endif  /* GCC_TREE_IF_CONV_H  */
diff --git a/gcc/tree-vect-data-refs.c b/gcc/tree-vect-data-refs.c
index 9346cfe..1fc4966 100644
--- a/gcc/tree-vect-data-refs.c
+++ b/gcc/tree-vect-data-refs.c
@@ -480,9 +480,15 @@ vect_analyze_data_ref_dependences (loop_vec_info loop_vinfo, int *max_vf)
 				LOOP_VINFO_LOOP_NEST (loop_vinfo), true))
     return false;
 
-  FOR_EACH_VEC_ELT (LOOP_VINFO_DDRS (loop_vinfo), i, ddr)
-    if (vect_analyze_data_ref_dependence (ddr, loop_vinfo, max_vf))
-      return false;
+  /* For epilogues we either have no aliases or alias versioning
+     was applied to original loop.  Therefore we may just get max_vf
+     using VF of original loop.  */
+  if (LOOP_VINFO_EPILOGUE_P (loop_vinfo))
+    *max_vf = LOOP_VINFO_ORIG_VECT_FACTOR (loop_vinfo);
+  else
+    FOR_EACH_VEC_ELT (LOOP_VINFO_DDRS (loop_vinfo), i, ddr)
+      if (vect_analyze_data_ref_dependence (ddr, loop_vinfo, max_vf))
+	return false;
 
   return true;
 }
diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c
index 6bfd332..80585ed 100644
--- a/gcc/tree-vect-loop-manip.c
+++ b/gcc/tree-vect-loop-manip.c
@@ -1611,11 +1611,13 @@ slpeel_update_phi_nodes_for_lcssa (struct loop *epilog)
 
    Note this function peels prolog and epilog only if it's necessary,
    as well as guards.
+   Returns created epilogue or NULL.
 
    TODO: Guard for prefer_scalar_loop should be emitted along with
    versioning conditions if loop versioning is needed.  */
 
-void
+
+struct loop *
 vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
 		 tree *niters_vector, int th, bool check_profitability,
 		 bool niters_no_overflow)
@@ -1631,7 +1633,7 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
 			 || LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo));
 
   if (!prolog_peeling && !epilog_peeling)
-    return;
+    return NULL;
 
   prob_vector = 9 * REG_BR_PROB_BASE / 10;
   if ((vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo)) == 2)
@@ -1639,7 +1641,7 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
   prob_prolog = prob_epilog = (vf - 1) * REG_BR_PROB_BASE / vf;
   vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
 
-  struct loop *prolog, *epilog, *loop = LOOP_VINFO_LOOP (loop_vinfo);
+  struct loop *prolog, *epilog = NULL, *loop = LOOP_VINFO_LOOP (loop_vinfo);
   struct loop *first_loop = loop;
   create_lcssa_for_virtual_phi (loop);
   update_ssa (TODO_update_ssa_only_virtuals);
@@ -1821,6 +1823,8 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
     }
   adjust_vec.release ();
   free_original_copy_tables ();
+
+  return epilog;
 }
 
 /* Function vect_create_cond_for_niters_checks.
diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index af49b8c..1804560 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -49,6 +49,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "gimple-fold.h"
 #include "cgraph.h"
 #include "tree-cfg.h"
+#include "tree-if-conv.h"
 
 /* Loop Vectorization Pass.
 
@@ -2031,15 +2032,20 @@ start_over:
   if (!ok)
     return false;
 
-  /* This pass will decide on using loop versioning and/or loop peeling in
-     order to enhance the alignment of data references in the loop.  */
-  ok = vect_enhance_data_refs_alignment (loop_vinfo);
-  if (!ok)
+  /* Do not invoke vect_enhance_data_refs_alignment for epilogue
+     vectorization.  */
+  if (!LOOP_VINFO_EPILOGUE_P (loop_vinfo))
     {
-      if (dump_enabled_p ())
-	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
-			 "bad data alignment.\n");
-      return false;
+    /* This pass will decide on using loop versioning and/or loop peeling in
+       order to enhance the alignment of data references in the loop.  */
+    ok = vect_enhance_data_refs_alignment (loop_vinfo);
+    if (!ok)
+      {
+	if (dump_enabled_p ())
+	  dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+			   "bad data alignment.\n");
+        return false;
+      }
     }
 
   if (slp)
@@ -2293,9 +2299,10 @@ again:
 
    Apply a set of analyses on LOOP, and create a loop_vec_info struct
    for it.  The different analyses will record information in the
-   loop_vec_info struct.  */
+   loop_vec_info struct.  If ORIG_LOOP_VINFO is not NULL epilogue must
+   be vectorized.  */
 loop_vec_info
-vect_analyze_loop (struct loop *loop)
+vect_analyze_loop (struct loop *loop, loop_vec_info orig_loop_vinfo)
 {
   loop_vec_info loop_vinfo;
   unsigned int vector_sizes;
@@ -2331,6 +2338,10 @@ vect_analyze_loop (struct loop *loop)
 	}
 
       bool fatal = false;
+
+      if (orig_loop_vinfo)
+	LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo) = orig_loop_vinfo;
+
       if (vect_analyze_loop_2 (loop_vinfo, fatal))
 	{
 	  LOOP_VINFO_VECTORIZABLE_P (loop_vinfo) = 1;
@@ -6663,12 +6674,14 @@ loop_niters_no_overflow (loop_vec_info loop_vinfo)
 
    The analysis phase has determined that the loop is vectorizable.
    Vectorize the loop - created vectorized stmts to replace the scalar
-   stmts in the loop, and update the loop exit condition.  */
+   stmts in the loop, and update the loop exit condition.
+   Returns scalar epilogue loop if any.  */
 
-void
+struct loop *
 vect_transform_loop (loop_vec_info loop_vinfo)
 {
   struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
+  struct loop *epilogue = NULL;
   basic_block *bbs = LOOP_VINFO_BBS (loop_vinfo);
   int nbbs = loop->num_nodes;
   int i;
@@ -6747,8 +6760,8 @@ vect_transform_loop (loop_vec_info loop_vinfo)
   LOOP_VINFO_NITERS_UNCHANGED (loop_vinfo) = niters;
   tree nitersm1 = unshare_expr (LOOP_VINFO_NITERSM1 (loop_vinfo));
   bool niters_no_overflow = loop_niters_no_overflow (loop_vinfo);
-  vect_do_peeling (loop_vinfo, niters, nitersm1, &niters_vector, th,
-		   check_profitability, niters_no_overflow);
+  epilogue = vect_do_peeling (loop_vinfo, niters, nitersm1, &niters_vector, th,
+			      check_profitability, niters_no_overflow);
   if (niters_vector == NULL_TREE)
     {
       if (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo))
@@ -7051,6 +7064,59 @@ vect_transform_loop (loop_vec_info loop_vinfo)
   /* Clear-up safelen field since its value is invalid after vectorization
      since vectorized loop can have loop-carried dependencies.  */
   loop->safelen = 0;
+
+  /* Don't vectorize epilogue for epilogue.  */
+  if (LOOP_VINFO_EPILOGUE_P (loop_vinfo))
+    epilogue = NULL;
+  /* Scalar epilogue is not vectorized in case
+     we use combined vector epilogue.  */
+  else if (LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo))
+    epilogue = NULL;
+
+  if (epilogue)
+    {
+      if (!LOOP_VINFO_MASK_EPILOGUE (loop_vinfo))
+	{
+	  unsigned int vector_sizes
+	    = targetm.vectorize.autovectorize_vector_sizes ();
+	  vector_sizes &= current_vector_size - 1;
+
+	  if (!PARAM_VALUE (PARAM_VECT_EPILOGUES_NOMASK))
+	    epilogue = NULL;
+	  else if (!vector_sizes)
+	    epilogue = NULL;
+	  else if (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
+		   && LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo) >= 0)
+	    {
+	      int smallest_vec_size = 1 << ctz_hwi (vector_sizes);
+	      int ratio = current_vector_size / smallest_vec_size;
+	      int eiters = LOOP_VINFO_INT_NITERS (loop_vinfo)
+		- LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo);
+	      eiters = eiters % vf;
+
+	      epilogue->nb_iterations_upper_bound = eiters - 1;
+
+	      if (eiters < vf / ratio)
+		epilogue = NULL;
+	    }
+	}
+    }
+
+  if (epilogue)
+    {
+      epilogue->force_vectorize = loop->force_vectorize;
+      epilogue->safelen = loop->safelen;
+      epilogue->dont_vectorize = false;
+
+      /* We may need to if-convert epilogue to vectorize it.  */
+      if (LOOP_VINFO_SCALAR_LOOP (loop_vinfo))
+	tree_if_conversion (epilogue);
+
+      gcc_assert (!epilogue->aux);
+      epilogue->aux = loop_vinfo;
+    }
+
+  return epilogue;
 }
 
 /* The code below is trying to perform simple optimization - revert
diff --git a/gcc/tree-vectorizer.c b/gcc/tree-vectorizer.c
index 22e587a..568894a 100644
--- a/gcc/tree-vectorizer.c
+++ b/gcc/tree-vectorizer.c
@@ -514,6 +514,7 @@ vectorize_loops (void)
   hash_table<simd_array_to_simduid> *simd_array_to_simduid_htab = NULL;
   bool any_ifcvt_loops = false;
   unsigned ret = 0;
+  struct loop *new_loop;
 
   vect_loops_num = number_of_loops (cfun);
 
@@ -538,7 +539,8 @@ vectorize_loops (void)
 	      && optimize_loop_nest_for_speed_p (loop))
 	     || loop->force_vectorize)
       {
-	loop_vec_info loop_vinfo;
+	loop_vec_info loop_vinfo, orig_loop_vinfo;
+vectorize_epilogue:
 	vect_location = find_loop_location (loop);
         if (LOCATION_LOCUS (vect_location) != UNKNOWN_LOCATION
 	    && dump_enabled_p ())
@@ -546,7 +548,7 @@ vectorize_loops (void)
                        LOCATION_FILE (vect_location),
 		       LOCATION_LINE (vect_location));
 
-	loop_vinfo = vect_analyze_loop (loop);
+	loop_vinfo = vect_analyze_loop (loop, orig_loop_vinfo);
 	loop->aux = loop_vinfo;
 
 	if (!loop_vinfo || !LOOP_VINFO_VECTORIZABLE_P (loop_vinfo))
@@ -580,7 +582,7 @@ vectorize_loops (void)
 	    && dump_enabled_p ())
           dump_printf_loc (MSG_OPTIMIZED_LOCATIONS, vect_location,
                            "loop vectorized\n");
-	vect_transform_loop (loop_vinfo);
+	new_loop = vect_transform_loop (loop_vinfo);
 	num_vectorized_loops++;
 	/* Now that the loop has been vectorized, allow it to be unrolled
 	   etc.  */
@@ -602,6 +604,16 @@ vectorize_loops (void)
 	    fold_loop_vectorized_call (loop_vectorized_call, boolean_true_node);
 	    ret |= TODO_cleanup_cfg;
 	  }
+
+	orig_loop_vinfo = NULL;
+	if (new_loop)
+	  {
+	    /* Epilogue of vectorized loop must be vectorized too.  */
+	    vect_loops_num = number_of_loops (cfun);
+	    loop = new_loop;
+	    orig_loop_vinfo = loop_vinfo;  /* For epilogue vectorization.  */
+	    goto vectorize_epilogue;
+	  }
       }
 
   vect_location = UNKNOWN_LOCATION;
diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index 99b9982..735097d 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -1063,8 +1063,8 @@ extern bool slpeel_can_duplicate_loop_p (const struct loop *, const_edge);
 struct loop *slpeel_tree_duplicate_loop_to_edge_cfg (struct loop *,
 						     struct loop *, edge);
 extern void vect_loop_versioning (loop_vec_info, unsigned int, bool);
-extern void vect_do_peeling (loop_vec_info, tree, tree,
-			     tree *, int, bool, bool);
+extern struct loop *vect_do_peeling (loop_vec_info, tree, tree,
+				     tree *, int, bool, bool);
 extern source_location find_loop_location (struct loop *);
 extern bool vect_can_advance_ivs_p (loop_vec_info);
 
@@ -1175,11 +1175,11 @@ extern void destroy_loop_vec_info (loop_vec_info, bool);
 extern gimple *vect_force_simple_reduction (loop_vec_info, gimple *, bool,
 					    bool *, bool);
 /* Drive for loop analysis stage.  */
-extern loop_vec_info vect_analyze_loop (struct loop *);
+extern loop_vec_info vect_analyze_loop (struct loop *, loop_vec_info);
 extern tree vect_build_loop_niters (loop_vec_info);
 extern void vect_gen_vector_loop_niters (loop_vec_info, tree, tree *, bool);
 /* Drive for loop transformation stage.  */
-extern void vect_transform_loop (loop_vec_info);
+extern struct loop *vect_transform_loop (loop_vec_info);
 extern loop_vec_info vect_analyze_loop_form (struct loop *);
 extern bool vectorizable_live_operation (gimple *, gimple_stmt_iterator *,
 					 slp_tree, int, gimple **);
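
The epilogue profitability check added to vect_transform_loop above can be
tried out in isolation. What follows is only a minimal sketch, not GCC code:
epilogue_worth_vectorizing and its plain-integer parameters are hypothetical
stand-ins for the loop_vinfo accessors, and the vector-size bitmask is assumed
to hold sizes in bytes, each a power of two. It covers only the
known-trip-count branch and does not model the nb_iterations_upper_bound
update.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-alone model of the check; not part of GCC.
   VECTOR_SIZES is a bitmask of supported vector sizes in bytes,
   CURRENT_VECTOR_SIZE the size used by the main loop, VF its
   vectorization factor, NITERS the known trip count and PEEL the
   number of iterations peeled for alignment.  */
static bool
epilogue_worth_vectorizing (unsigned vector_sizes,
                            unsigned current_vector_size,
                            int vf, int niters, int peel)
{
  assert (vf > 0 && current_vector_size > 0);

  /* Keep only sizes strictly smaller than the one already used.  */
  unsigned smaller = vector_sizes & (current_vector_size - 1);
  if (smaller == 0)
    return false;

  /* Lowest set bit, i.e. the smallest remaining vector size
     (same value as 1 << ctz (smaller)).  */
  unsigned smallest = smaller & -smaller;
  int ratio = (int) (current_vector_size / smallest);

  /* Iterations left over once prologue and main vector loop are done.  */
  int eiters = (niters - peel) % vf;

  /* Worthwhile only if the smallest vector can be filled at least once.  */
  return eiters >= vf / ratio;
}
```

With a mask of 48 (32- and 16-byte vectors), a 32-byte main loop and vf = 8,
the ratio is 2, so an epilogue is kept only when at least 4 iterations remain:
a trip count of 100 leaves 4 and qualifies, while 99 leaves 3 and does not.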

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH, vec-tails] Support loop epilogue vectorization
  2016-11-11 11:15             ` Yuri Rumyantsev
@ 2016-11-11 14:15               ` Yuri Rumyantsev
  2016-11-11 14:43                 ` Yuri Rumyantsev
  2016-11-14 12:51               ` Richard Biener
  1 sibling, 1 reply; 38+ messages in thread
From: Yuri Rumyantsev @ 2016-11-11 14:15 UTC (permalink / raw)
  To: Richard Biener; +Cc: Jeff Law, gcc-patches, Ilya Enkovich

Richard,

Sorry for the confusion, but my updated patch does not work properly, so I
need to fix it.

Yuri.

2016-11-11 14:15 GMT+03:00 Yuri Rumyantsev <ysrumyan@gmail.com>:
> Richard,
>
> I have prepared an updated patch 3 that passes an additional argument to
> vect_analyze_loop, as you proposed (untested).
>
> You wrote:
> Btw, I wonder if you can produce a single patch containing just
> epilogue vectorization, that is combine patches 1-3 but rip out
> changes only needed by later patches?
>
> Did you mean that I should exclude all support for vectorization epilogues,
> i.e. exclude from the 2nd patch all unrelated changes
> like
>
> diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
> index 11863af..32011c1 100644
> --- a/gcc/tree-vect-loop.c
> +++ b/gcc/tree-vect-loop.c
> @@ -1120,6 +1120,12 @@ new_loop_vec_info (struct loop *loop)
>    LOOP_VINFO_PEELING_FOR_GAPS (res) = false;
>    LOOP_VINFO_PEELING_FOR_NITER (res) = false;
>    LOOP_VINFO_OPERANDS_SWAPPED (res) = false;
> +  LOOP_VINFO_CAN_BE_MASKED (res) = false;
> +  LOOP_VINFO_REQUIRED_MASKS (res) = 0;
> +  LOOP_VINFO_COMBINE_EPILOGUE (res) = false;
> +  LOOP_VINFO_MASK_EPILOGUE (res) = false;
> +  LOOP_VINFO_NEED_MASKING (res) = false;
> +  LOOP_VINFO_ORIG_LOOP_INFO (res) = NULL;
>
> Did you also mean that the new combined patch must be a working patch, i.e.
> one that can be integrated without the other patches?
>
> Could you please look at updated patch?
>
> Thanks.
> Yuri.
>
> 2016-11-10 15:36 GMT+03:00 Richard Biener <rguenther@suse.de>:
>> On Thu, 10 Nov 2016, Richard Biener wrote:
>>
>>> On Tue, 8 Nov 2016, Yuri Rumyantsev wrote:
>>>
>>> > Richard,
>>> >
>>> > Here is updated 3 patch.
>>> >
>>> > I checked that all new tests related to epilogue vectorization passed with it.
>>> >
>>> > Your comments will be appreciated.
>>>
>>> A lot better now.  Instead of the ->aux dance I now prefer to
>>> pass the original loops loop_vinfo to vect_analyze_loop as
>>> optional argument (if non-NULL we analyze the epilogue of that
>>> loop_vinfo).  OTOH I remember we mainly use it to get at the
>>> original vectorization factor?  So we can pass down an (optional)
>>> forced vectorization factor as well?
>>
>> Btw, I wonder if you can produce a single patch containing just
>> epilogue vectorization, that is combine patches 1-3 but rip out
>> changes only needed by later patches?
>>
>> Thanks,
>> Richard.
>>
>>> Richard.
>>>
>>> > 2016-11-08 15:38 GMT+03:00 Richard Biener <rguenther@suse.de>:
>>> > > On Thu, 3 Nov 2016, Yuri Rumyantsev wrote:
>>> > >
>>> > >> Hi Richard,
>>> > >>
>>> > >> I did not understand your last remark:
>>> > >>
>>> > >> > That is, here (and avoid the FOR_EACH_LOOP change):
>>> > >> >
>>> > >> > @@ -580,12 +586,21 @@ vectorize_loops (void)
>>> > >> >           && dump_enabled_p ())
>>> > >> >           dump_printf_loc (MSG_OPTIMIZED_LOCATIONS, vect_location,
>>> > >> >                            "loop vectorized\n");
>>> > >> > -       vect_transform_loop (loop_vinfo);
>>> > >> > +       new_loop = vect_transform_loop (loop_vinfo);
>>> > >> >         num_vectorized_loops++;
>>> > >> >        /* Now that the loop has been vectorized, allow it to be unrolled
>>> > >> >           etc.  */
>>> > >> >      loop->force_vectorize = false;
>>> > >> >
>>> > >> > +       /* Add new loop to a processing queue.  To make it easier
>>> > >> > +          to match loop and its epilogue vectorization in dumps
>>> > >> > +          put new loop as the next loop to process.  */
>>> > >> > +       if (new_loop)
>>> > >> > +         {
>>> > >> > +           loops.safe_insert (i + 1, new_loop->num);
>>> > >> > +           vect_loops_num = number_of_loops (cfun);
>>> > >> > +         }
>>> > >> >
>>> > >> > simply dispatch to a vectorize_epilogue (loop_vinfo, new_loop)
>>> > >> > function which will set up stuff properly (and also perform
>>> > >> > the if-conversion of the epilogue there).
>>> > >> >
>>> > >> > That said, if we can get in non-masked epilogue vectorization
>>> > >> > separately that would be great.
>>> > >>
>>> > >> Could you please clarify your proposal.
>>> > >
>>> > > When a loop was vectorized set things up to immediately vectorize
>>> > > its epilogue, avoiding changing the loop iteration and avoiding
>>> > > the re-use of ->aux.
>>> > >
>>> > > Richard.
>>> > >
>>> > >> Thanks.
>>> > >> Yuri.
>>> > >>
>>> > >> 2016-11-02 15:27 GMT+03:00 Richard Biener <rguenther@suse.de>:
>>> > >> > On Tue, 1 Nov 2016, Yuri Rumyantsev wrote:
>>> > >> >
>>> > >> >> Hi All,
>>> > >> >>
>>> > >> >> I re-send all patches sent by Ilya earlier for review which support
>>> > >> >> vectorization of loop epilogues and loops with low trip count. We
>>> > >> >> assume that the only patch - vec-tails-07-combine-tail.patch - was not
>>> > >> >> approved by Jeff.
>>> > >> >>
>>> > >> >> I did re-base of all patches and performed bootstrapping and
>>> > >> >> regression testing that did not show any new failures. Also all
>>> > >> >> changes related to new vect_do_peeling algorithm have been changed
>>> > >> >> accordingly.
>>> > >> >>
>>> > >> >> Is it OK for trunk?
>>> > >> >
>>> > >> > I would have preferred that the series up to -03-nomask-tails would
>>> > >> > _only_ contain epilogue loop vectorization changes but unfortunately
>>> > >> > the patchset is oddly separated.
>>> > >> >
>>> > >> > I have a comment on that part nevertheless:
>>> > >> >
>>> > >> > @@ -1608,7 +1614,10 @@ vect_enhance_data_refs_alignment (loop_vec_info
>>> > >> > loop_vinfo)
>>> > >> >    /* Check if we can possibly peel the loop.  */
>>> > >> >    if (!vect_can_advance_ivs_p (loop_vinfo)
>>> > >> >        || !slpeel_can_duplicate_loop_p (loop, single_exit (loop))
>>> > >> > -      || loop->inner)
>>> > >> > +      || loop->inner
>>> > >> > +      /* Required peeling was performed in prologue and
>>> > >> > +        is not required for epilogue.  */
>>> > >> > +      || LOOP_VINFO_EPILOGUE_P (loop_vinfo))
>>> > >> >      do_peeling = false;
>>> > >> >
>>> > >> >    if (do_peeling
>>> > >> > @@ -1888,7 +1897,10 @@ vect_enhance_data_refs_alignment (loop_vec_info
>>> > >> > loop_vinfo)
>>> > >> >
>>> > >> >    do_versioning =
>>> > >> >         optimize_loop_nest_for_speed_p (loop)
>>> > >> > -       && (!loop->inner); /* FORNOW */
>>> > >> > +       && (!loop->inner) /* FORNOW */
>>> > >> > +        /* Required versioning was performed for the
>>> > >> > +          original loop and is not required for epilogue.  */
>>> > >> > +       && !LOOP_VINFO_EPILOGUE_P (loop_vinfo);
>>> > >> >
>>> > >> >    if (do_versioning)
>>> > >> >      {
>>> > >> >
>>> > >> > please do that check in the single caller of this function.
>>> > >> >
>>> > >> > Otherwise I still dislike the new ->aux use and I believe that simply
>>> > >> > passing down info from the processed parent would be _much_ cleaner.
>>> > >> > That is, here (and avoid the FOR_EACH_LOOP change):
>>> > >> >
>>> > >> > @@ -580,12 +586,21 @@ vectorize_loops (void)
>>> > >> >             && dump_enabled_p ())
>>> > >> >            dump_printf_loc (MSG_OPTIMIZED_LOCATIONS, vect_location,
>>> > >> >                             "loop vectorized\n");
>>> > >> > -       vect_transform_loop (loop_vinfo);
>>> > >> > +       new_loop = vect_transform_loop (loop_vinfo);
>>> > >> >         num_vectorized_loops++;
>>> > >> >         /* Now that the loop has been vectorized, allow it to be unrolled
>>> > >> >            etc.  */
>>> > >> >         loop->force_vectorize = false;
>>> > >> >
>>> > >> > +       /* Add new loop to a processing queue.  To make it easier
>>> > >> > +          to match loop and its epilogue vectorization in dumps
>>> > >> > +          put new loop as the next loop to process.  */
>>> > >> > +       if (new_loop)
>>> > >> > +         {
>>> > >> > +           loops.safe_insert (i + 1, new_loop->num);
>>> > >> > +           vect_loops_num = number_of_loops (cfun);
>>> > >> > +         }
>>> > >> >
>>> > >> > simply dispatch to a vectorize_epilogue (loop_vinfo, new_loop)
>>> > >> > function which will set up stuff properly (and also perform
>>> > >> > the if-conversion of the epilogue there).
>>> > >> >
>>> > >> > That said, if we can get in non-masked epilogue vectorization
>>> > >> > separately that would be great.
>>> > >> >
>>> > >> > I'm still torn about all the rest of the stuff and question its
>>> > >> > usability (esp. merging the epilogue with the main vector loop).
>>> > >> > But it has already been approved ... oh well.
>>> > >> >
>>> > >> > Thanks,
>>> > >> > Richard.
>>> > >>
>>> > >>
>>> > >
>>> > > --
>>> > > Richard Biener <rguenther@suse.de>
>>> > > SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)
>>> >
>>>
>>>
>>
>> --
>> Richard Biener <rguenther@suse.de>
>> SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH, vec-tails] Support loop epilogue vectorization
  2016-11-11 14:15               ` Yuri Rumyantsev
@ 2016-11-11 14:43                 ` Yuri Rumyantsev
  2016-11-14 12:56                   ` Richard Biener
  0 siblings, 1 reply; 38+ messages in thread
From: Yuri Rumyantsev @ 2016-11-11 14:43 UTC (permalink / raw)
  To: Richard Biener; +Cc: Jeff Law, gcc-patches, Ilya Enkovich

[-- Attachment #1: Type: text/plain, Size: 9223 bytes --]

Richard,

Here is the fixed version of the updated patch 3.

Any comments will be appreciated.

Thanks.
Yuri.
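
For readers following the thread, the control flow Richard asks for in the
quoted discussion below (vectorize a loop, then immediately analyze its
epilogue with the parent's information passed in as an argument rather than
through ->aux) can be modelled roughly as follows. This is a toy sketch under
stated assumptions: toy_loop, toy_analyze, toy_transform and toy_vectorize are
hypothetical names that do not exist in GCC, and the real analysis and
transformation are reduced to pointer bookkeeping.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy model; none of these names exist in GCC.  */
struct toy_loop
{
  struct toy_loop *epilogue;    /* scalar epilogue left by peeling, or NULL */
  const struct toy_loop *orig;  /* parent recorded by analysis, or NULL */
};

/* Analysis: remember the parent when LOOP is analyzed as an epilogue.  */
static void
toy_analyze (struct toy_loop *loop, const struct toy_loop *orig)
{
  assert (loop != NULL);
  loop->orig = orig;
}

/* Transformation: hand back the epilogue, but never for a loop that is
   itself an epilogue.  */
static struct toy_loop *
toy_transform (struct toy_loop *loop)
{
  return loop->orig ? NULL : loop->epilogue;
}

/* Driver: process LOOP, then immediately its epilogue if one came back.
   Returns the number of loops processed.  */
static int
toy_vectorize (struct toy_loop *loop)
{
  const struct toy_loop *orig = NULL;  /* no parent on the first pass */
  int count = 0;

  while (loop)
    {
      toy_analyze (loop, orig);
      struct toy_loop *epilogue = toy_transform (loop);
      count++;
      orig = epilogue ? loop : NULL;
      loop = epilogue;
    }
  return count;
}
```

In this model the parent pointer starts out NULL and is set only for the pass
that handles an epilogue, so a loop that is not an epilogue is never analyzed
with stale parent information.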

2016-11-11 17:15 GMT+03:00 Yuri Rumyantsev <ysrumyan@gmail.com>:
> Richard,
>
> Sorry for the confusion, but my updated patch does not work properly, so I
> need to fix it.
>
> Yuri.
>
> 2016-11-11 14:15 GMT+03:00 Yuri Rumyantsev <ysrumyan@gmail.com>:
>> Richard,
>>
>> I have prepared an updated patch 3 that passes an additional argument to
>> vect_analyze_loop, as you proposed (untested).
>>
>> You wrote:
>> Btw, I wonder if you can produce a single patch containing just
>> epilogue vectorization, that is combine patches 1-3 but rip out
>> changes only needed by later patches?
>>
>> Did you mean that I should exclude all support for vectorization epilogues,
>> i.e. exclude from the 2nd patch all unrelated changes
>> like
>>
>> diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
>> index 11863af..32011c1 100644
>> --- a/gcc/tree-vect-loop.c
>> +++ b/gcc/tree-vect-loop.c
>> @@ -1120,6 +1120,12 @@ new_loop_vec_info (struct loop *loop)
>>    LOOP_VINFO_PEELING_FOR_GAPS (res) = false;
>>    LOOP_VINFO_PEELING_FOR_NITER (res) = false;
>>    LOOP_VINFO_OPERANDS_SWAPPED (res) = false;
>> +  LOOP_VINFO_CAN_BE_MASKED (res) = false;
>> +  LOOP_VINFO_REQUIRED_MASKS (res) = 0;
>> +  LOOP_VINFO_COMBINE_EPILOGUE (res) = false;
>> +  LOOP_VINFO_MASK_EPILOGUE (res) = false;
>> +  LOOP_VINFO_NEED_MASKING (res) = false;
>> +  LOOP_VINFO_ORIG_LOOP_INFO (res) = NULL;
>>
>> Did you also mean that the new combined patch must be a working patch, i.e.
>> one that can be integrated without the other patches?
>>
>> Could you please look at updated patch?
>>
>> Thanks.
>> Yuri.
>>
>> 2016-11-10 15:36 GMT+03:00 Richard Biener <rguenther@suse.de>:
>>> On Thu, 10 Nov 2016, Richard Biener wrote:
>>>
>>>> On Tue, 8 Nov 2016, Yuri Rumyantsev wrote:
>>>>
>>>> > Richard,
>>>> >
>>>> > Here is updated 3 patch.
>>>> >
>>>> > I checked that all new tests related to epilogue vectorization passed with it.
>>>> >
>>>> > Your comments will be appreciated.
>>>>
>>>> A lot better now.  Instead of the ->aux dance I now prefer to
>>>> pass the original loops loop_vinfo to vect_analyze_loop as
>>>> optional argument (if non-NULL we analyze the epilogue of that
>>>> loop_vinfo).  OTOH I remember we mainly use it to get at the
>>>> original vectorization factor?  So we can pass down an (optional)
>>>> forced vectorization factor as well?
>>>
>>> Btw, I wonder if you can produce a single patch containing just
>>> epilogue vectorization, that is combine patches 1-3 but rip out
>>> changes only needed by later patches?
>>>
>>> Thanks,
>>> Richard.
>>>
>>>> Richard.
>>>>
>>>> > 2016-11-08 15:38 GMT+03:00 Richard Biener <rguenther@suse.de>:
>>>> > > On Thu, 3 Nov 2016, Yuri Rumyantsev wrote:
>>>> > >
>>>> > >> Hi Richard,
>>>> > >>
>>>> > >> I did not understand your last remark:
>>>> > >>
>>>> > >> > That is, here (and avoid the FOR_EACH_LOOP change):
>>>> > >> >
>>>> > >> > @@ -580,12 +586,21 @@ vectorize_loops (void)
>>>> > >> >           && dump_enabled_p ())
>>>> > >> >           dump_printf_loc (MSG_OPTIMIZED_LOCATIONS, vect_location,
>>>> > >> >                            "loop vectorized\n");
>>>> > >> > -       vect_transform_loop (loop_vinfo);
>>>> > >> > +       new_loop = vect_transform_loop (loop_vinfo);
>>>> > >> >         num_vectorized_loops++;
>>>> > >> >        /* Now that the loop has been vectorized, allow it to be unrolled
>>>> > >> >           etc.  */
>>>> > >> >      loop->force_vectorize = false;
>>>> > >> >
>>>> > >> > +       /* Add new loop to a processing queue.  To make it easier
>>>> > >> > +          to match loop and its epilogue vectorization in dumps
>>>> > >> > +          put new loop as the next loop to process.  */
>>>> > >> > +       if (new_loop)
>>>> > >> > +         {
>>>> > >> > +           loops.safe_insert (i + 1, new_loop->num);
>>>> > >> > +           vect_loops_num = number_of_loops (cfun);
>>>> > >> > +         }
>>>> > >> >
>>>> > >> > simply dispatch to a vectorize_epilogue (loop_vinfo, new_loop)
>>>> > >> > function which will set up stuff properly (and also perform
>>>> > >> > the if-conversion of the epilogue there).
>>>> > >> >
>>>> > >> > That said, if we can get in non-masked epilogue vectorization
>>>> > >> > separately that would be great.
>>>> > >>
>>>> > >> Could you please clarify your proposal.
>>>> > >
>>>> > > When a loop was vectorized set things up to immediately vectorize
>>>> > > its epilogue, avoiding changing the loop iteration and avoiding
>>>> > > the re-use of ->aux.
>>>> > >
>>>> > > Richard.
>>>> > >
>>>> > >> Thanks.
>>>> > >> Yuri.
>>>> > >>
>>>> > >> 2016-11-02 15:27 GMT+03:00 Richard Biener <rguenther@suse.de>:
>>>> > >> > On Tue, 1 Nov 2016, Yuri Rumyantsev wrote:
>>>> > >> >
>>>> > >> >> Hi All,
>>>> > >> >>
>>>> > >> >> I re-send all patches sent by Ilya earlier for review which support
>>>> > >> >> vectorization of loop epilogues and loops with low trip count. We
>>>> > >> >> assume that the only patch - vec-tails-07-combine-tail.patch - was not
>>>> > >> >> approved by Jeff.
>>>> > >> >>
>>>> > >> >> I did re-base of all patches and performed bootstrapping and
>>>> > >> >> regression testing that did not show any new failures. Also all
>>>> > >> >> changes related to new vect_do_peeling algorithm have been changed
>>>> > >> >> accordingly.
>>>> > >> >>
>>>> > >> >> Is it OK for trunk?
>>>> > >> >
>>>> > >> > I would have preferred that the series up to -03-nomask-tails would
>>>> > >> > _only_ contain epilogue loop vectorization changes but unfortunately
>>>> > >> > the patchset is oddly separated.
>>>> > >> >
>>>> > >> > I have a comment on that part nevertheless:
>>>> > >> >
>>>> > >> > @@ -1608,7 +1614,10 @@ vect_enhance_data_refs_alignment (loop_vec_info
>>>> > >> > loop_vinfo)
>>>> > >> >    /* Check if we can possibly peel the loop.  */
>>>> > >> >    if (!vect_can_advance_ivs_p (loop_vinfo)
>>>> > >> >        || !slpeel_can_duplicate_loop_p (loop, single_exit (loop))
>>>> > >> > -      || loop->inner)
>>>> > >> > +      || loop->inner
>>>> > >> > +      /* Required peeling was performed in prologue and
>>>> > >> > +        is not required for epilogue.  */
>>>> > >> > +      || LOOP_VINFO_EPILOGUE_P (loop_vinfo))
>>>> > >> >      do_peeling = false;
>>>> > >> >
>>>> > >> >    if (do_peeling
>>>> > >> > @@ -1888,7 +1897,10 @@ vect_enhance_data_refs_alignment (loop_vec_info
>>>> > >> > loop_vinfo)
>>>> > >> >
>>>> > >> >    do_versioning =
>>>> > >> >         optimize_loop_nest_for_speed_p (loop)
>>>> > >> > -       && (!loop->inner); /* FORNOW */
>>>> > >> > +       && (!loop->inner) /* FORNOW */
>>>> > >> > +        /* Required versioning was performed for the
>>>> > >> > +          original loop and is not required for epilogue.  */
>>>> > >> > +       && !LOOP_VINFO_EPILOGUE_P (loop_vinfo);
>>>> > >> >
>>>> > >> >    if (do_versioning)
>>>> > >> >      {
>>>> > >> >
>>>> > >> > please do that check in the single caller of this function.
>>>> > >> >
>>>> > >> > Otherwise I still dislike the new ->aux use and I believe that simply
>>>> > >> > passing down info from the processed parent would be _much_ cleaner.
>>>> > >> > That is, here (and avoid the FOR_EACH_LOOP change):
>>>> > >> >
>>>> > >> > @@ -580,12 +586,21 @@ vectorize_loops (void)
>>>> > >> >             && dump_enabled_p ())
>>>> > >> >            dump_printf_loc (MSG_OPTIMIZED_LOCATIONS, vect_location,
>>>> > >> >                             "loop vectorized\n");
>>>> > >> > -       vect_transform_loop (loop_vinfo);
>>>> > >> > +       new_loop = vect_transform_loop (loop_vinfo);
>>>> > >> >         num_vectorized_loops++;
>>>> > >> >         /* Now that the loop has been vectorized, allow it to be unrolled
>>>> > >> >            etc.  */
>>>> > >> >         loop->force_vectorize = false;
>>>> > >> >
>>>> > >> > +       /* Add new loop to a processing queue.  To make it easier
>>>> > >> > +          to match loop and its epilogue vectorization in dumps
>>>> > >> > +          put new loop as the next loop to process.  */
>>>> > >> > +       if (new_loop)
>>>> > >> > +         {
>>>> > >> > +           loops.safe_insert (i + 1, new_loop->num);
>>>> > >> > +           vect_loops_num = number_of_loops (cfun);
>>>> > >> > +         }
>>>> > >> >
>>>> > >> > simply dispatch to a vectorize_epilogue (loop_vinfo, new_loop)
>>>> > >> > function which will set up stuff properly (and also perform
>>>> > >> > the if-conversion of the epilogue there).
>>>> > >> >
>>>> > >> > That said, if we can get in non-masked epilogue vectorization
>>>> > >> > separately that would be great.
>>>> > >> >
>>>> > >> > I'm still torn about all the rest of the stuff and question its
>>>> > >> > usability (esp. merging the epilogue with the main vector loop).
>>>> > >> > But it has already been approved ... oh well.
>>>> > >> >
>>>> > >> > Thanks,
>>>> > >> > Richard.
>>>> > >>
>>>> > >>
>>>> > >
>>>> > > --
>>>> > > Richard Biener <rguenther@suse.de>
>>>> > > SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)
>>>> >
>>>>
>>>>
>>>
>>> --
>>> Richard Biener <rguenther@suse.de>
>>> SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)

[-- Attachment #2: patch.03.update1 --]
[-- Type: application/octet-stream, Size: 12691 bytes --]

diff --git a/gcc/tree-if-conv.c b/gcc/tree-if-conv.c
index 0a20189..0b86ffe 100644
--- a/gcc/tree-if-conv.c
+++ b/gcc/tree-if-conv.c
@@ -2734,7 +2734,7 @@ ifcvt_local_dce (basic_block bb)
    profitability analysis.  Returns non-zero todo flags when something
    changed.  */
 
-static unsigned int
+unsigned int
 tree_if_conversion (struct loop *loop)
 {
   unsigned int todo = 0;
diff --git a/gcc/tree-if-conv.h b/gcc/tree-if-conv.h
new file mode 100644
index 0000000..3a732c2
--- /dev/null
+++ b/gcc/tree-if-conv.h
@@ -0,0 +1,24 @@
+/* Copyright (C) 2016 Free Software Foundation, Inc.
+
+This file is part of GCC.
+
+GCC is free software; you can redistribute it and/or modify it under
+the terms of the GNU General Public License as published by the Free
+Software Foundation; either version 3, or (at your option) any later
+version.
+
+GCC is distributed in the hope that it will be useful, but WITHOUT ANY
+WARRANTY; without even the implied warranty of MERCHANTABILITY or
+FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
+for more details.
+
+You should have received a copy of the GNU General Public License
+along with GCC; see the file COPYING3.  If not see
+<http://www.gnu.org/licenses/>.  */
+
+#ifndef GCC_TREE_IF_CONV_H
+#define GCC_TREE_IF_CONV_H
+
+unsigned int tree_if_conversion (struct loop *);
+
+#endif  /* GCC_TREE_IF_CONV_H  */
diff --git a/gcc/tree-vect-data-refs.c b/gcc/tree-vect-data-refs.c
index 9346cfe..1fc4966 100644
--- a/gcc/tree-vect-data-refs.c
+++ b/gcc/tree-vect-data-refs.c
@@ -480,9 +480,15 @@ vect_analyze_data_ref_dependences (loop_vec_info loop_vinfo, int *max_vf)
 				LOOP_VINFO_LOOP_NEST (loop_vinfo), true))
     return false;
 
-  FOR_EACH_VEC_ELT (LOOP_VINFO_DDRS (loop_vinfo), i, ddr)
-    if (vect_analyze_data_ref_dependence (ddr, loop_vinfo, max_vf))
-      return false;
+  /* For epilogues we either have no aliases or alias versioning
+     was applied to original loop.  Therefore we may just get max_vf
+     using VF of original loop.  */
+  if (LOOP_VINFO_EPILOGUE_P (loop_vinfo))
+    *max_vf = LOOP_VINFO_ORIG_VECT_FACTOR (loop_vinfo);
+  else
+    FOR_EACH_VEC_ELT (LOOP_VINFO_DDRS (loop_vinfo), i, ddr)
+      if (vect_analyze_data_ref_dependence (ddr, loop_vinfo, max_vf))
+	return false;
 
   return true;
 }
diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c
index 6bfd332..80585ed 100644
--- a/gcc/tree-vect-loop-manip.c
+++ b/gcc/tree-vect-loop-manip.c
@@ -1611,11 +1611,13 @@ slpeel_update_phi_nodes_for_lcssa (struct loop *epilog)
 
    Note this function peels prolog and epilog only if it's necessary,
    as well as guards.
+   Returns created epilogue or NULL.
 
    TODO: Guard for prefer_scalar_loop should be emitted along with
    versioning conditions if loop versioning is needed.  */
 
-void
+
+struct loop *
 vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
 		 tree *niters_vector, int th, bool check_profitability,
 		 bool niters_no_overflow)
@@ -1631,7 +1633,7 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
 			 || LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo));
 
   if (!prolog_peeling && !epilog_peeling)
-    return;
+    return NULL;
 
   prob_vector = 9 * REG_BR_PROB_BASE / 10;
   if ((vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo)) == 2)
@@ -1639,7 +1641,7 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
   prob_prolog = prob_epilog = (vf - 1) * REG_BR_PROB_BASE / vf;
   vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
 
-  struct loop *prolog, *epilog, *loop = LOOP_VINFO_LOOP (loop_vinfo);
+  struct loop *prolog, *epilog = NULL, *loop = LOOP_VINFO_LOOP (loop_vinfo);
   struct loop *first_loop = loop;
   create_lcssa_for_virtual_phi (loop);
   update_ssa (TODO_update_ssa_only_virtuals);
@@ -1821,6 +1823,8 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
     }
   adjust_vec.release ();
   free_original_copy_tables ();
+
+  return epilog;
 }
 
 /* Function vect_create_cond_for_niters_checks.
diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index af49b8c..1804560 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -49,6 +49,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "gimple-fold.h"
 #include "cgraph.h"
 #include "tree-cfg.h"
+#include "tree-if-conv.h"
 
 /* Loop Vectorization Pass.
 
@@ -2031,15 +2032,20 @@ start_over:
   if (!ok)
     return false;
 
-  /* This pass will decide on using loop versioning and/or loop peeling in
-     order to enhance the alignment of data references in the loop.  */
-  ok = vect_enhance_data_refs_alignment (loop_vinfo);
-  if (!ok)
+  /* Do not invoke vect_enhance_data_refs_alignment for epilogue
+     vectorization.  */
+  if (!LOOP_VINFO_EPILOGUE_P (loop_vinfo))
     {
-      if (dump_enabled_p ())
-	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
-			 "bad data alignment.\n");
-      return false;
+    /* This pass will decide on using loop versioning and/or loop peeling in
+       order to enhance the alignment of data references in the loop.  */
+    ok = vect_enhance_data_refs_alignment (loop_vinfo);
+    if (!ok)
+      {
+	if (dump_enabled_p ())
+	  dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+			   "bad data alignment.\n");
+        return false;
+      }
     }
 
   if (slp)
@@ -2293,9 +2299,10 @@ again:
 
    Apply a set of analyses on LOOP, and create a loop_vec_info struct
    for it.  The different analyses will record information in the
-   loop_vec_info struct.  */
+   loop_vec_info struct.  If ORIG_LOOP_VINFO is not NULL epilogue must
+   be vectorized.  */
 loop_vec_info
-vect_analyze_loop (struct loop *loop)
+vect_analyze_loop (struct loop *loop, loop_vec_info orig_loop_vinfo)
 {
   loop_vec_info loop_vinfo;
   unsigned int vector_sizes;
@@ -2331,6 +2338,10 @@ vect_analyze_loop (struct loop *loop)
 	}
 
       bool fatal = false;
+
+      if (orig_loop_vinfo)
+	LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo) = orig_loop_vinfo;
+
       if (vect_analyze_loop_2 (loop_vinfo, fatal))
 	{
 	  LOOP_VINFO_VECTORIZABLE_P (loop_vinfo) = 1;
@@ -6663,12 +6674,14 @@ loop_niters_no_overflow (loop_vec_info loop_vinfo)
 
    The analysis phase has determined that the loop is vectorizable.
    Vectorize the loop - created vectorized stmts to replace the scalar
-   stmts in the loop, and update the loop exit condition.  */
+   stmts in the loop, and update the loop exit condition.
+   Returns scalar epilogue loop if any.  */
 
-void
+struct loop *
 vect_transform_loop (loop_vec_info loop_vinfo)
 {
   struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
+  struct loop *epilogue = NULL;
   basic_block *bbs = LOOP_VINFO_BBS (loop_vinfo);
   int nbbs = loop->num_nodes;
   int i;
@@ -6747,8 +6760,8 @@ vect_transform_loop (loop_vec_info loop_vinfo)
   LOOP_VINFO_NITERS_UNCHANGED (loop_vinfo) = niters;
   tree nitersm1 = unshare_expr (LOOP_VINFO_NITERSM1 (loop_vinfo));
   bool niters_no_overflow = loop_niters_no_overflow (loop_vinfo);
-  vect_do_peeling (loop_vinfo, niters, nitersm1, &niters_vector, th,
-		   check_profitability, niters_no_overflow);
+  epilogue = vect_do_peeling (loop_vinfo, niters, nitersm1, &niters_vector, th,
+			      check_profitability, niters_no_overflow);
   if (niters_vector == NULL_TREE)
     {
       if (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo))
@@ -7051,6 +7064,59 @@ vect_transform_loop (loop_vec_info loop_vinfo)
   /* Clear-up safelen field since its value is invalid after vectorization
      since vectorized loop can have loop-carried dependencies.  */
   loop->safelen = 0;
+
+  /* Don't vectorize epilogue for epilogue.  */
+  if (LOOP_VINFO_EPILOGUE_P (loop_vinfo))
+    epilogue = NULL;
+  /* Scalar epilogue is not vectorized in case
+     we use combined vector epilogue.  */
+  else if (LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo))
+    epilogue = NULL;
+
+  if (epilogue)
+    {
+      if (!LOOP_VINFO_MASK_EPILOGUE (loop_vinfo))
+	{
+	  unsigned int vector_sizes
+	    = targetm.vectorize.autovectorize_vector_sizes ();
+	  vector_sizes &= current_vector_size - 1;
+
+	  if (!PARAM_VALUE (PARAM_VECT_EPILOGUES_NOMASK))
+	    epilogue = NULL;
+	  else if (!vector_sizes)
+	    epilogue = NULL;
+	  else if (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
+		   && LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo) >= 0)
+	    {
+	      int smallest_vec_size = 1 << ctz_hwi (vector_sizes);
+	      int ratio = current_vector_size / smallest_vec_size;
+	      int eiters = LOOP_VINFO_INT_NITERS (loop_vinfo)
+		- LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo);
+	      eiters = eiters % vf;
+
+	      epilogue->nb_iterations_upper_bound = eiters - 1;
+
+	      if (eiters < vf / ratio)
+		epilogue = NULL;
+	    }
+	}
+    }
+
+  if (epilogue)
+    {
+      epilogue->force_vectorize = loop->force_vectorize;
+      epilogue->safelen = loop->safelen;
+      epilogue->dont_vectorize = false;
+
+      /* We may need to if-convert epilogue to vectorize it.  */
+      if (LOOP_VINFO_SCALAR_LOOP (loop_vinfo))
+	tree_if_conversion (epilogue);
+
+      gcc_assert (!epilogue->aux);
+      epilogue->aux = loop_vinfo;
+    }
+
+  return epilogue;
 }
 
 /* The code below is trying to perform simple optimization - revert
diff --git a/gcc/tree-vectorizer.c b/gcc/tree-vectorizer.c
index 22e587a..568894a 100644
--- a/gcc/tree-vectorizer.c
+++ b/gcc/tree-vectorizer.c
@@ -514,6 +514,7 @@ vectorize_loops (void)
   hash_table<simd_array_to_simduid> *simd_array_to_simduid_htab = NULL;
   bool any_ifcvt_loops = false;
   unsigned ret = 0;
+  struct loop *new_loop;
 
   vect_loops_num = number_of_loops (cfun);
 
@@ -538,7 +539,8 @@ vectorize_loops (void)
 	      && optimize_loop_nest_for_speed_p (loop))
 	     || loop->force_vectorize)
       {
-	loop_vec_info loop_vinfo;
+	loop_vec_info loop_vinfo, orig_loop_vinfo = NULL;
+vectorize_epilogue:
 	vect_location = find_loop_location (loop);
         if (LOCATION_LOCUS (vect_location) != UNKNOWN_LOCATION
 	    && dump_enabled_p ())
@@ -546,7 +548,7 @@ vectorize_loops (void)
                        LOCATION_FILE (vect_location),
 		       LOCATION_LINE (vect_location));
 
-	loop_vinfo = vect_analyze_loop (loop);
+	loop_vinfo = vect_analyze_loop (loop, orig_loop_vinfo);
 	loop->aux = loop_vinfo;
 
 	if (!loop_vinfo || !LOOP_VINFO_VECTORIZABLE_P (loop_vinfo))
@@ -580,7 +582,7 @@ vectorize_loops (void)
 	    && dump_enabled_p ())
           dump_printf_loc (MSG_OPTIMIZED_LOCATIONS, vect_location,
                            "loop vectorized\n");
-	vect_transform_loop (loop_vinfo);
+	new_loop = vect_transform_loop (loop_vinfo);
 	num_vectorized_loops++;
 	/* Now that the loop has been vectorized, allow it to be unrolled
 	   etc.  */
@@ -602,6 +604,16 @@ vectorize_loops (void)
 	    fold_loop_vectorized_call (loop_vectorized_call, boolean_true_node);
 	    ret |= TODO_cleanup_cfg;
 	  }
+
+	if (new_loop)
+	  {
+	    /* Epilogue of vectorized loop must be vectorized too.  */
+	    vect_loops_num = number_of_loops (cfun);
+	    loop = new_loop;
+	    orig_loop_vinfo = loop_vinfo;  /* For epilogue vectorization.  */
+	    goto vectorize_epilogue;
+	  }
       }
 
   vect_location = UNKNOWN_LOCATION;
diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index 99b9982..735097d 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -1063,8 +1063,8 @@ extern bool slpeel_can_duplicate_loop_p (const struct loop *, const_edge);
 struct loop *slpeel_tree_duplicate_loop_to_edge_cfg (struct loop *,
 						     struct loop *, edge);
 extern void vect_loop_versioning (loop_vec_info, unsigned int, bool);
-extern void vect_do_peeling (loop_vec_info, tree, tree,
-			     tree *, int, bool, bool);
+extern struct loop *vect_do_peeling (loop_vec_info, tree, tree,
+				     tree *, int, bool, bool);
 extern source_location find_loop_location (struct loop *);
 extern bool vect_can_advance_ivs_p (loop_vec_info);
 
@@ -1175,11 +1175,11 @@ extern void destroy_loop_vec_info (loop_vec_info, bool);
 extern gimple *vect_force_simple_reduction (loop_vec_info, gimple *, bool,
 					    bool *, bool);
 /* Drive for loop analysis stage.  */
-extern loop_vec_info vect_analyze_loop (struct loop *);
+extern loop_vec_info vect_analyze_loop (struct loop *, loop_vec_info);
 extern tree vect_build_loop_niters (loop_vec_info);
 extern void vect_gen_vector_loop_niters (loop_vec_info, tree, tree *, bool);
 /* Drive for loop transformation stage.  */
-extern void vect_transform_loop (loop_vec_info);
+extern struct loop *vect_transform_loop (loop_vec_info);
 extern loop_vec_info vect_analyze_loop_form (struct loop *);
 extern bool vectorizable_live_operation (gimple *, gimple_stmt_iterator *,
 					 slp_tree, int, gimple **);
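
For readers following the arithmetic in the vect_transform_loop hunk above,
here is a minimal standalone sketch of the non-masked epilogue check. It is
not GCC code: the function and parameter names are hypothetical, the vector
sizes are assumed to be a bitmask of power-of-two byte sizes, and the
iteration count and alignment peeling are assumed to be known at compile
time (the only case in which the patch applies this test).

```c
#include <assert.h>
#include <stdbool.h>

/* Isolate the lowest set bit of MASK, i.e. 1 << ctz (MASK).  */
static int
lowest_set_bit (unsigned int mask)
{
  return (int) (mask & (~mask + 1u));
}

/* Hypothetical model of the check: return true when the scalar epilogue
   left after the main vector loop still has enough iterations to fill at
   least one vector of the smallest remaining size.  */
bool
epilogue_worth_vectorizing (int niters, int peel, int vf,
                            unsigned int current_vector_size,
                            unsigned int supported_sizes)
{
  assert (vf > 0 && peel >= 0 && current_vector_size > 0);

  /* Keep only the vector sizes strictly smaller than the current one.  */
  unsigned int smaller = supported_sizes & (current_vector_size - 1);
  if (smaller == 0)
    return false;

  int smallest = lowest_set_bit (smaller);
  int ratio = (int) current_vector_size / smallest;

  /* Iterations left for the epilogue once peeling and the main vector
     loop have consumed their share.  */
  int eiters = (niters - peel) % vf;

  /* The smallest vector covers vf / ratio iterations per step.  */
  return eiters >= vf / ratio;
}
```

With a 32-byte main loop, a 16-byte fallback and vf = 8, the epilogue is kept
only when at least 4 iterations remain.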

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH, vec-tails] Support loop epilogue vectorization
  2016-11-11 11:15             ` Yuri Rumyantsev
  2016-11-11 14:15               ` Yuri Rumyantsev
@ 2016-11-14 12:51               ` Richard Biener
  2016-11-14 13:30                 ` Yuri Rumyantsev
  1 sibling, 1 reply; 38+ messages in thread
From: Richard Biener @ 2016-11-14 12:51 UTC (permalink / raw)
  To: Yuri Rumyantsev; +Cc: Jeff Law, gcc-patches, Ilya Enkovich

On Fri, 11 Nov 2016, Yuri Rumyantsev wrote:

> Richard,
> 
> I prepared an updated patch 3 that passes an additional argument to
> vect_analyze_loop, as you proposed (untested).
> 
> You wrote:
> Btw, I wonder if you can produce a single patch containing just
> epilogue vectorization, that is combine patches 1-3 but rip out
> changes only needed by later patches?
> 
> Did you mean that I exclude all support for vectorization epilogues,
> i.e. exclude from 2-nd patch all non-related changes
> like
> 
> diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
> index 11863af..32011c1 100644
> --- a/gcc/tree-vect-loop.c
> +++ b/gcc/tree-vect-loop.c
> @@ -1120,6 +1120,12 @@ new_loop_vec_info (struct loop *loop)
>    LOOP_VINFO_PEELING_FOR_GAPS (res) = false;
>    LOOP_VINFO_PEELING_FOR_NITER (res) = false;
>    LOOP_VINFO_OPERANDS_SWAPPED (res) = false;
> +  LOOP_VINFO_CAN_BE_MASKED (res) = false;
> +  LOOP_VINFO_REQUIRED_MASKS (res) = 0;
> +  LOOP_VINFO_COMBINE_EPILOGUE (res) = false;
> +  LOOP_VINFO_MASK_EPILOGUE (res) = false;
> +  LOOP_VINFO_NEED_MASKING (res) = false;
> +  LOOP_VINFO_ORIG_LOOP_INFO (res) = NULL;

Yes.
 
> Did you mean also that new combined patch must be working patch, i.e.
> can be integrated without other patches?

Yes.

> Could you please look at updated patch?

Will do.

Thanks,
Richard.

> Thanks.
> Yuri.
> 
> 2016-11-10 15:36 GMT+03:00 Richard Biener <rguenther@suse.de>:
> > On Thu, 10 Nov 2016, Richard Biener wrote:
> >
> >> On Tue, 8 Nov 2016, Yuri Rumyantsev wrote:
> >>
> >> > Richard,
> >> >
> >> > Here is the updated patch 3.
> >> >
> >> > I checked that all new tests related to epilogue vectorization passed with it.
> >> >
> >> > Your comments will be appreciated.
> >>
> >> A lot better now.  Instead of the ->aux dance I now prefer to
> >> pass the original loops loop_vinfo to vect_analyze_loop as
> >> optional argument (if non-NULL we analyze the epilogue of that
> >> loop_vinfo).  OTOH I remember we mainly use it to get at the
> >> original vectorization factor?  So we can pass down an (optional)
> >> forced vectorization factor as well?
> >
> > Btw, I wonder if you can produce a single patch containing just
> > epilogue vectorization, that is combine patches 1-3 but rip out
> > changes only needed by later patches?
> >
> > Thanks,
> > Richard.
> >
> >> Richard.
> >>
> >> > 2016-11-08 15:38 GMT+03:00 Richard Biener <rguenther@suse.de>:
> >> > > On Thu, 3 Nov 2016, Yuri Rumyantsev wrote:
> >> > >
> >> > >> Hi Richard,
> >> > >>
> >> > >> I did not understand your last remark:
> >> > >>
> >> > >> > That is, here (and avoid the FOR_EACH_LOOP change):
> >> > >> >
> >> > >> > @@ -580,12 +586,21 @@ vectorize_loops (void)
> >> > >> >           && dump_enabled_p ())
> >> > >> >           dump_printf_loc (MSG_OPTIMIZED_LOCATIONS, vect_location,
> >> > >> >                            "loop vectorized\n");
> >> > >> > -       vect_transform_loop (loop_vinfo);
> >> > >> > +       new_loop = vect_transform_loop (loop_vinfo);
> >> > >> >         num_vectorized_loops++;
> >> > >> >        /* Now that the loop has been vectorized, allow it to be unrolled
> >> > >> >           etc.  */
> >> > >> >      loop->force_vectorize = false;
> >> > >> >
> >> > >> > +       /* Add new loop to a processing queue.  To make it easier
> >> > >> > +          to match loop and its epilogue vectorization in dumps
> >> > >> > +          put new loop as the next loop to process.  */
> >> > >> > +       if (new_loop)
> >> > >> > +         {
> >> > >> > +           loops.safe_insert (i + 1, new_loop->num);
> >> > >> > +           vect_loops_num = number_of_loops (cfun);
> >> > >> > +         }
> >> > >> >
> >> > >> > simply dispatch to a vectorize_epilogue (loop_vinfo, new_loop)
> >> > >> > function which will set up stuff properly (and also perform
> >> > >> > the if-conversion of the epilogue there).
> >> > >> >
> >> > >> > That said, if we can get in non-masked epilogue vectorization
> >> > >> > separately that would be great.
> >> > >>
> >> > >> Could you please clarify your proposal.
> >> > >
> >> > > When a loop was vectorized set things up to immediately vectorize
> >> > > its epilogue, avoiding changing the loop iteration and avoiding
> >> > > the re-use of ->aux.
> >> > >
> >> > > Richard.
> >> > >
> >> > >> Thanks.
> >> > >> Yuri.
> >> > >>
> >> > >> 2016-11-02 15:27 GMT+03:00 Richard Biener <rguenther@suse.de>:
> >> > >> > On Tue, 1 Nov 2016, Yuri Rumyantsev wrote:
> >> > >> >
> >> > >> >> Hi All,
> >> > >> >>
> >> > >> >> I re-send all patches sent by Ilya earlier for review which support
> >> > >> >> vectorization of loop epilogues and loops with low trip count. We
> >> > >> >> assume that the only patch - vec-tails-07-combine-tail.patch - was not
> >> > >> >> approved by Jeff.
> >> > >> >>
> >> > >> >> I did re-base of all patches and performed bootstrapping and
> >> > >> >> regression testing that did not show any new failures. Also all
> >> > >> >> changes related to new vect_do_peeling algorithm have been changed
> >> > >> >> accordingly.
> >> > >> >>
> >> > >> >> Is it OK for trunk?
> >> > >> >
> >> > >> > I would have preferred that the series up to -03-nomask-tails would
> >> > >> > _only_ contain epilogue loop vectorization changes but unfortunately
> >> > >> > the patchset is oddly separated.
> >> > >> >
> >> > >> > I have a comment on that part nevertheless:
> >> > >> >
> >> > >> > @@ -1608,7 +1614,10 @@ vect_enhance_data_refs_alignment (loop_vec_info
> >> > >> > loop_vinfo)
> >> > >> >    /* Check if we can possibly peel the loop.  */
> >> > >> >    if (!vect_can_advance_ivs_p (loop_vinfo)
> >> > >> >        || !slpeel_can_duplicate_loop_p (loop, single_exit (loop))
> >> > >> > -      || loop->inner)
> >> > >> > +      || loop->inner
> >> > >> > +      /* Required peeling was performed in prologue and
> >> > >> > +        is not required for epilogue.  */
> >> > >> > +      || LOOP_VINFO_EPILOGUE_P (loop_vinfo))
> >> > >> >      do_peeling = false;
> >> > >> >
> >> > >> >    if (do_peeling
> >> > >> > @@ -1888,7 +1897,10 @@ vect_enhance_data_refs_alignment (loop_vec_info
> >> > >> > loop_vinfo)
> >> > >> >
> >> > >> >    do_versioning =
> >> > >> >         optimize_loop_nest_for_speed_p (loop)
> >> > >> > -       && (!loop->inner); /* FORNOW */
> >> > >> > +       && (!loop->inner) /* FORNOW */
> >> > >> > +        /* Required versioning was performed for the
> >> > >> > +          original loop and is not required for epilogue.  */
> >> > >> > +       && !LOOP_VINFO_EPILOGUE_P (loop_vinfo);
> >> > >> >
> >> > >> >    if (do_versioning)
> >> > >> >      {
> >> > >> >
> >> > >> > please do that check in the single caller of this function.
> >> > >> >
> >> > >> > Otherwise I still dislike the new ->aux use and I believe that simply
> >> > >> > passing down info from the processed parent would be _much_ cleaner.
> >> > >> > That is, here (and avoid the FOR_EACH_LOOP change):
> >> > >> >
> >> > >> > @@ -580,12 +586,21 @@ vectorize_loops (void)
> >> > >> >             && dump_enabled_p ())
> >> > >> >            dump_printf_loc (MSG_OPTIMIZED_LOCATIONS, vect_location,
> >> > >> >                             "loop vectorized\n");
> >> > >> > -       vect_transform_loop (loop_vinfo);
> >> > >> > +       new_loop = vect_transform_loop (loop_vinfo);
> >> > >> >         num_vectorized_loops++;
> >> > >> >         /* Now that the loop has been vectorized, allow it to be unrolled
> >> > >> >            etc.  */
> >> > >> >         loop->force_vectorize = false;
> >> > >> >
> >> > >> > +       /* Add new loop to a processing queue.  To make it easier
> >> > >> > +          to match loop and its epilogue vectorization in dumps
> >> > >> > +          put new loop as the next loop to process.  */
> >> > >> > +       if (new_loop)
> >> > >> > +         {
> >> > >> > +           loops.safe_insert (i + 1, new_loop->num);
> >> > >> > +           vect_loops_num = number_of_loops (cfun);
> >> > >> > +         }
> >> > >> >
> >> > >> > simply dispatch to a vectorize_epilogue (loop_vinfo, new_loop)
> >> > >> > function which will set up stuff properly (and also perform
> >> > >> > the if-conversion of the epilogue there).
> >> > >> >
> >> > >> > That said, if we can get in non-masked epilogue vectorization
> >> > >> > separately that would be great.
> >> > >> >
> >> > >> > I'm still torn about all the rest of the stuff and question its
> >> > >> > usability (esp. merging the epilogue with the main vector loop).
> >> > >> > But it has already been approved ... oh well.
> >> > >> >
> >> > >> > Thanks,
> >> > >> > Richard.
> >> > >>
> >> > >>
> >> > >
> >> > > --
> >> > > Richard Biener <rguenther@suse.de>
> >> > > SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)
> >> >
> >>
> >>
> >
> > --
> > Richard Biener <rguenther@suse.de>
> > SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH, vec-tails] Support loop epilogue vectorization
  2016-11-11 14:43                 ` Yuri Rumyantsev
@ 2016-11-14 12:56                   ` Richard Biener
  0 siblings, 0 replies; 38+ messages in thread
From: Richard Biener @ 2016-11-14 12:56 UTC (permalink / raw)
  To: Yuri Rumyantsev; +Cc: Jeff Law, gcc-patches, Ilya Enkovich

On Fri, 11 Nov 2016, Yuri Rumyantsev wrote:

> Richard,
> 
> Here is a fixed version of the updated patch 3.
> 
> Any comments will be appreciated.

Looks good apart from

+  if (epilogue)
+    {
+      epilogue->force_vectorize = loop->force_vectorize;
+      epilogue->safelen = loop->safelen;
+      epilogue->dont_vectorize = false;
+
+      /* We may need to if-convert epilogue to vectorize it.  */
+      if (LOOP_VINFO_SCALAR_LOOP (loop_vinfo))
+       tree_if_conversion (epilogue);
+
+      gcc_assert (!epilogue->aux);
+      epilogue->aux = loop_vinfo;

where the last two lines should now no longer be necessary?
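
A minimal sketch of why the ->aux hand-off becomes redundant, using toy
types rather than the real GCC structures (every name below is
hypothetical): once the driver passes the original loop's info into the
analysis step as an explicit argument, the epilogue no longer needs a back
pointer stored on the loop itself.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins for struct loop and loop_vec_info.  */
struct toy_loop
{
  int vf;                      /* vectorization factor, illustrative only */
  struct toy_loop *epilogue;   /* scalar epilogue produced by peeling */
};

struct toy_info
{
  struct toy_loop *loop;
  const struct toy_info *orig; /* non-NULL when LOOP is an epilogue */
};

static int
toy_is_epilogue (const struct toy_info *info)
{
  return info->orig != NULL;
}

/* Process LOOP and then its epilogue, passing the parent's info down
   explicitly.  Returns the number of loops processed.  */
int
toy_vectorize_chain (struct toy_loop *loop, struct toy_info *infos, int max)
{
  assert (infos != NULL);

  const struct toy_info *orig = NULL;
  int n = 0;

  while (loop != NULL && n < max)
    {
      /* "Analysis": record the loop and where it came from.  */
      infos[n].loop = loop;
      infos[n].orig = orig;

      /* "Transform": an epilogue never yields a further epilogue.  */
      struct toy_loop *next
        = toy_is_epilogue (&infos[n]) ? NULL : loop->epilogue;

      orig = &infos[n];
      loop = next;
      n++;
    }

  return n;
}
```

The driver keeps the parent info in a local variable, mirroring the
orig_loop_vinfo variable in the patch, so nothing has to be stashed on the
loop between the transform and the next analysis.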

Thanks,
Richard.

> Thanks.
> Yuri.
> 
> 2016-11-11 17:15 GMT+03:00 Yuri Rumyantsev <ysrumyan@gmail.com>:
> > Richard,
> >
> > Sorry for the confusion, but my updated patch does not work properly, so I
> > need to fix it.
> >
> > Yuri.
> >
> > 2016-11-11 14:15 GMT+03:00 Yuri Rumyantsev <ysrumyan@gmail.com>:
> >> Richard,
> >>
> >> I prepared an updated patch 3 that passes an additional argument to
> >> vect_analyze_loop, as you proposed (untested).
> >>
> >> You wrote:
> >> Btw, I wonder if you can produce a single patch containing just
> >> epilogue vectorization, that is combine patches 1-3 but rip out
> >> changes only needed by later patches?
> >>
> >> Did you mean that I exclude all support for vectorization epilogues,
> >> i.e. exclude from 2-nd patch all non-related changes
> >> like
> >>
> >> diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
> >> index 11863af..32011c1 100644
> >> --- a/gcc/tree-vect-loop.c
> >> +++ b/gcc/tree-vect-loop.c
> >> @@ -1120,6 +1120,12 @@ new_loop_vec_info (struct loop *loop)
> >>    LOOP_VINFO_PEELING_FOR_GAPS (res) = false;
> >>    LOOP_VINFO_PEELING_FOR_NITER (res) = false;
> >>    LOOP_VINFO_OPERANDS_SWAPPED (res) = false;
> >> +  LOOP_VINFO_CAN_BE_MASKED (res) = false;
> >> +  LOOP_VINFO_REQUIRED_MASKS (res) = 0;
> >> +  LOOP_VINFO_COMBINE_EPILOGUE (res) = false;
> >> +  LOOP_VINFO_MASK_EPILOGUE (res) = false;
> >> +  LOOP_VINFO_NEED_MASKING (res) = false;
> >> +  LOOP_VINFO_ORIG_LOOP_INFO (res) = NULL;
> >>
> >> Did you mean also that new combined patch must be working patch, i.e.
> >> can be integrated without other patches?
> >>
> >> Could you please look at updated patch?
> >>
> >> Thanks.
> >> Yuri.
> >>
> >> 2016-11-10 15:36 GMT+03:00 Richard Biener <rguenther@suse.de>:
> >>> On Thu, 10 Nov 2016, Richard Biener wrote:
> >>>
> >>>> On Tue, 8 Nov 2016, Yuri Rumyantsev wrote:
> >>>>
> >>>> > Richard,
> >>>> >
> >>>> > Here is the updated patch 3.
> >>>> >
> >>>> > I checked that all new tests related to epilogue vectorization passed with it.
> >>>> >
> >>>> > Your comments will be appreciated.
> >>>>
> >>>> A lot better now.  Instead of the ->aux dance I now prefer to
> >>>> pass the original loops loop_vinfo to vect_analyze_loop as
> >>>> optional argument (if non-NULL we analyze the epilogue of that
> >>>> loop_vinfo).  OTOH I remember we mainly use it to get at the
> >>>> original vectorization factor?  So we can pass down an (optional)
> >>>> forced vectorization factor as well?
> >>>
> >>> Btw, I wonder if you can produce a single patch containing just
> >>> epilogue vectorization, that is combine patches 1-3 but rip out
> >>> changes only needed by later patches?
> >>>
> >>> Thanks,
> >>> Richard.
> >>>
> >>>> Richard.
> >>>>
> >>>> > 2016-11-08 15:38 GMT+03:00 Richard Biener <rguenther@suse.de>:
> >>>> > > On Thu, 3 Nov 2016, Yuri Rumyantsev wrote:
> >>>> > >
> >>>> > >> Hi Richard,
> >>>> > >>
> >>>> > >> I did not understand your last remark:
> >>>> > >>
> >>>> > >> > That is, here (and avoid the FOR_EACH_LOOP change):
> >>>> > >> >
> >>>> > >> > @@ -580,12 +586,21 @@ vectorize_loops (void)
> >>>> > >> >           && dump_enabled_p ())
> >>>> > >> >           dump_printf_loc (MSG_OPTIMIZED_LOCATIONS, vect_location,
> >>>> > >> >                            "loop vectorized\n");
> >>>> > >> > -       vect_transform_loop (loop_vinfo);
> >>>> > >> > +       new_loop = vect_transform_loop (loop_vinfo);
> >>>> > >> >         num_vectorized_loops++;
> >>>> > >> >        /* Now that the loop has been vectorized, allow it to be unrolled
> >>>> > >> >           etc.  */
> >>>> > >> >      loop->force_vectorize = false;
> >>>> > >> >
> >>>> > >> > +       /* Add new loop to a processing queue.  To make it easier
> >>>> > >> > +          to match loop and its epilogue vectorization in dumps
> >>>> > >> > +          put new loop as the next loop to process.  */
> >>>> > >> > +       if (new_loop)
> >>>> > >> > +         {
> >>>> > >> > +           loops.safe_insert (i + 1, new_loop->num);
> >>>> > >> > +           vect_loops_num = number_of_loops (cfun);
> >>>> > >> > +         }
> >>>> > >> >
> >>>> > >> > simply dispatch to a vectorize_epilogue (loop_vinfo, new_loop)
> >>>> > >> > function which will set up stuff properly (and also perform
> >>>> > >> > the if-conversion of the epilogue there).
> >>>> > >> >
> >>>> > >> > That said, if we can get in non-masked epilogue vectorization
> >>>> > >> > separately that would be great.
> >>>> > >>
> >>>> > >> Could you please clarify your proposal.
> >>>> > >
> >>>> > > When a loop was vectorized set things up to immediately vectorize
> >>>> > > its epilogue, avoiding changing the loop iteration and avoiding
> >>>> > > the re-use of ->aux.
> >>>> > >
> >>>> > > Richard.
> >>>> > >
> >>>> > >> Thanks.
> >>>> > >> Yuri.
> >>>> > >>
> >>>> > >> 2016-11-02 15:27 GMT+03:00 Richard Biener <rguenther@suse.de>:
> >>>> > >> > On Tue, 1 Nov 2016, Yuri Rumyantsev wrote:
> >>>> > >> >
> >>>> > >> >> Hi All,
> >>>> > >> >>
> >>>> > >> >> I re-send all patches sent by Ilya earlier for review which support
> >>>> > >> >> vectorization of loop epilogues and loops with low trip count. We
> >>>> > >> >> assume that the only patch - vec-tails-07-combine-tail.patch - was not
> >>>> > >> >> approved by Jeff.
> >>>> > >> >>
> >>>> > >> >> I did re-base of all patches and performed bootstrapping and
> >>>> > >> >> regression testing that did not show any new failures. Also all
> >>>> > >> >> changes related to new vect_do_peeling algorithm have been changed
> >>>> > >> >> accordingly.
> >>>> > >> >>
> >>>> > >> >> Is it OK for trunk?
> >>>> > >> >
> >>>> > >> > I would have preferred that the series up to -03-nomask-tails would
> >>>> > >> > _only_ contain epilogue loop vectorization changes but unfortunately
> >>>> > >> > the patchset is oddly separated.
> >>>> > >> >
> >>>> > >> > I have a comment on that part nevertheless:
> >>>> > >> >
> >>>> > >> > @@ -1608,7 +1614,10 @@ vect_enhance_data_refs_alignment (loop_vec_info
> >>>> > >> > loop_vinfo)
> >>>> > >> >    /* Check if we can possibly peel the loop.  */
> >>>> > >> >    if (!vect_can_advance_ivs_p (loop_vinfo)
> >>>> > >> >        || !slpeel_can_duplicate_loop_p (loop, single_exit (loop))
> >>>> > >> > -      || loop->inner)
> >>>> > >> > +      || loop->inner
> >>>> > >> > +      /* Required peeling was performed in prologue and
> >>>> > >> > +        is not required for epilogue.  */
> >>>> > >> > +      || LOOP_VINFO_EPILOGUE_P (loop_vinfo))
> >>>> > >> >      do_peeling = false;
> >>>> > >> >
> >>>> > >> >    if (do_peeling
> >>>> > >> > @@ -1888,7 +1897,10 @@ vect_enhance_data_refs_alignment (loop_vec_info
> >>>> > >> > loop_vinfo)
> >>>> > >> >
> >>>> > >> >    do_versioning =
> >>>> > >> >         optimize_loop_nest_for_speed_p (loop)
> >>>> > >> > -       && (!loop->inner); /* FORNOW */
> >>>> > >> > +       && (!loop->inner) /* FORNOW */
> >>>> > >> > +        /* Required versioning was performed for the
> >>>> > >> > +          original loop and is not required for epilogue.  */
> >>>> > >> > +       && !LOOP_VINFO_EPILOGUE_P (loop_vinfo);
> >>>> > >> >
> >>>> > >> >    if (do_versioning)
> >>>> > >> >      {
> >>>> > >> >
> >>>> > >> > please do that check in the single caller of this function.
> >>>> > >> >
> >>>> > >> > Otherwise I still dislike the new ->aux use and I believe that simply
> >>>> > >> > passing down info from the processed parent would be _much_ cleaner.
> >>>> > >> > That is, here (and avoid the FOR_EACH_LOOP change):
> >>>> > >> >
> >>>> > >> > @@ -580,12 +586,21 @@ vectorize_loops (void)
> >>>> > >> >             && dump_enabled_p ())
> >>>> > >> >            dump_printf_loc (MSG_OPTIMIZED_LOCATIONS, vect_location,
> >>>> > >> >                             "loop vectorized\n");
> >>>> > >> > -       vect_transform_loop (loop_vinfo);
> >>>> > >> > +       new_loop = vect_transform_loop (loop_vinfo);
> >>>> > >> >         num_vectorized_loops++;
> >>>> > >> >         /* Now that the loop has been vectorized, allow it to be unrolled
> >>>> > >> >            etc.  */
> >>>> > >> >         loop->force_vectorize = false;
> >>>> > >> >
> >>>> > >> > +       /* Add new loop to a processing queue.  To make it easier
> >>>> > >> > +          to match loop and its epilogue vectorization in dumps
> >>>> > >> > +          put new loop as the next loop to process.  */
> >>>> > >> > +       if (new_loop)
> >>>> > >> > +         {
> >>>> > >> > +           loops.safe_insert (i + 1, new_loop->num);
> >>>> > >> > +           vect_loops_num = number_of_loops (cfun);
> >>>> > >> > +         }
> >>>> > >> >
> >>>> > >> > simply dispatch to a vectorize_epilogue (loop_vinfo, new_loop)
> >>>> > >> > function which will set up stuff properly (and also perform
> >>>> > >> > the if-conversion of the epilogue there).
> >>>> > >> >
> >>>> > >> > That said, if we can get in non-masked epilogue vectorization
> >>>> > >> > separately that would be great.
> >>>> > >> >
> >>>> > >> > I'm still torn about all the rest of the stuff and question its
> >>>> > >> > usability (esp. merging the epilogue with the main vector loop).
> >>>> > >> > But it has already been approved ... oh well.
> >>>> > >> >
> >>>> > >> > Thanks,
> >>>> > >> > Richard.
> >>>> > >>
> >>>> > >>
> >>>> > >
> >>>> > > --
> >>>> > > Richard Biener <rguenther@suse.de>
> >>>> > > SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)
> >>>> >
> >>>>
> >>>>
> >>>
> >>> --
> >>> Richard Biener <rguenther@suse.de>
> >>> SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH, vec-tails] Support loop epilogue vectorization
  2016-11-14 12:51               ` Richard Biener
@ 2016-11-14 13:30                 ` Yuri Rumyantsev
  2016-11-14 13:41                   ` Richard Biener
  0 siblings, 1 reply; 38+ messages in thread
From: Yuri Rumyantsev @ 2016-11-14 13:30 UTC (permalink / raw)
  To: Richard Biener; +Cc: Jeff Law, gcc-patches, Ilya Enkovich

[-- Attachment #1: Type: text/plain, Size: 9492 bytes --]

Richard,

In my previous patch I forgot to remove a couple of lines related to the
aux field. Here is the corrected patch.

Thanks.
Yuri.

2016-11-14 15:51 GMT+03:00 Richard Biener <rguenther@suse.de>:
> On Fri, 11 Nov 2016, Yuri Rumyantsev wrote:
>
>> Richard,
>>
>> I prepared an updated patch 3 that passes an additional argument to
>> vect_analyze_loop, as you proposed (untested).
>>
>> You wrote:
>> Btw, I wonder if you can produce a single patch containing just
>> epilogue vectorization, that is combine patches 1-3 but rip out
>> changes only needed by later patches?
>>
>> Did you mean that I exclude all support for vectorization epilogues,
>> i.e. exclude from 2-nd patch all non-related changes
>> like
>>
>> diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
>> index 11863af..32011c1 100644
>> --- a/gcc/tree-vect-loop.c
>> +++ b/gcc/tree-vect-loop.c
>> @@ -1120,6 +1120,12 @@ new_loop_vec_info (struct loop *loop)
>>    LOOP_VINFO_PEELING_FOR_GAPS (res) = false;
>>    LOOP_VINFO_PEELING_FOR_NITER (res) = false;
>>    LOOP_VINFO_OPERANDS_SWAPPED (res) = false;
>> +  LOOP_VINFO_CAN_BE_MASKED (res) = false;
>> +  LOOP_VINFO_REQUIRED_MASKS (res) = 0;
>> +  LOOP_VINFO_COMBINE_EPILOGUE (res) = false;
>> +  LOOP_VINFO_MASK_EPILOGUE (res) = false;
>> +  LOOP_VINFO_NEED_MASKING (res) = false;
>> +  LOOP_VINFO_ORIG_LOOP_INFO (res) = NULL;
>
> Yes.
>
>> Did you mean also that new combined patch must be working patch, i.e.
>> can be integrated without other patches?
>
> Yes.
>
>> Could you please look at updated patch?
>
> Will do.
>
> Thanks,
> Richard.
>
>> Thanks.
>> Yuri.
>>
>> 2016-11-10 15:36 GMT+03:00 Richard Biener <rguenther@suse.de>:
>> > On Thu, 10 Nov 2016, Richard Biener wrote:
>> >
>> >> On Tue, 8 Nov 2016, Yuri Rumyantsev wrote:
>> >>
>> >> > Richard,
>> >> >
>> >> > Here is the updated patch 3.
>> >> >
>> >> > I checked that all new tests related to epilogue vectorization passed with it.
>> >> >
>> >> > Your comments will be appreciated.
>> >>
>> >> A lot better now.  Instead of the ->aux dance I now prefer to
>> >> pass the original loops loop_vinfo to vect_analyze_loop as
>> >> optional argument (if non-NULL we analyze the epilogue of that
>> >> loop_vinfo).  OTOH I remember we mainly use it to get at the
>> >> original vectorization factor?  So we can pass down an (optional)
>> >> forced vectorization factor as well?
>> >
>> > Btw, I wonder if you can produce a single patch containing just
>> > epilogue vectorization, that is combine patches 1-3 but rip out
>> > changes only needed by later patches?
>> >
>> > Thanks,
>> > Richard.
>> >
>> >> Richard.
>> >>
>> >> > 2016-11-08 15:38 GMT+03:00 Richard Biener <rguenther@suse.de>:
>> >> > > On Thu, 3 Nov 2016, Yuri Rumyantsev wrote:
>> >> > >
>> >> > >> Hi Richard,
>> >> > >>
>> >> > >> I did not understand your last remark:
>> >> > >>
>> >> > >> > That is, here (and avoid the FOR_EACH_LOOP change):
>> >> > >> >
>> >> > >> > @@ -580,12 +586,21 @@ vectorize_loops (void)
>> >> > >> >           && dump_enabled_p ())
>> >> > >> >           dump_printf_loc (MSG_OPTIMIZED_LOCATIONS, vect_location,
>> >> > >> >                            "loop vectorized\n");
>> >> > >> > -       vect_transform_loop (loop_vinfo);
>> >> > >> > +       new_loop = vect_transform_loop (loop_vinfo);
>> >> > >> >         num_vectorized_loops++;
>> >> > >> >        /* Now that the loop has been vectorized, allow it to be unrolled
>> >> > >> >           etc.  */
>> >> > >> >      loop->force_vectorize = false;
>> >> > >> >
>> >> > >> > +       /* Add new loop to a processing queue.  To make it easier
>> >> > >> > +          to match loop and its epilogue vectorization in dumps
>> >> > >> > +          put new loop as the next loop to process.  */
>> >> > >> > +       if (new_loop)
>> >> > >> > +         {
>> >> > >> > +           loops.safe_insert (i + 1, new_loop->num);
>> >> > >> > +           vect_loops_num = number_of_loops (cfun);
>> >> > >> > +         }
>> >> > >> >
>> >> > >> > simply dispatch to a vectorize_epilogue (loop_vinfo, new_loop)
>> >> > >> > function which will set up stuff properly (and also perform
>> >> > >> > the if-conversion of the epilogue there).
>> >> > >> >
>> >> > >> > That said, if we can get in non-masked epilogue vectorization
>> >> > >> > separately that would be great.
>> >> > >>
>> >> > >> Could you please clarify your proposal.
>> >> > >
>> >> > > When a loop was vectorized set things up to immediately vectorize
>> >> > > its epilogue, avoiding changing the loop iteration and avoiding
>> >> > > the re-use of ->aux.
>> >> > >
>> >> > > Richard.
>> >> > >
>> >> > >> Thanks.
>> >> > >> Yuri.
>> >> > >>
>> >> > >> 2016-11-02 15:27 GMT+03:00 Richard Biener <rguenther@suse.de>:
>> >> > >> > On Tue, 1 Nov 2016, Yuri Rumyantsev wrote:
>> >> > >> >
>> >> > >> >> Hi All,
>> >> > >> >>
>> >> > >> >> I re-send all patches sent by Ilya earlier for review which support
>> >> > >> >> vectorization of loop epilogues and loops with low trip count. We
>> >> > >> >> assume that the only patch - vec-tails-07-combine-tail.patch - was not
>> >> > >> >> approved by Jeff.
>> >> > >> >>
>> >> > >> >> I did re-base of all patches and performed bootstrapping and
>> >> > >> >> regression testing that did not show any new failures. Also all
>> >> > >> >> changes related to new vect_do_peeling algorithm have been changed
>> >> > >> >> accordingly.
>> >> > >> >>
>> >> > >> >> Is it OK for trunk?
>> >> > >> >
>> >> > >> > I would have preferred that the series up to -03-nomask-tails would
>> >> > >> > _only_ contain epilogue loop vectorization changes but unfortunately
>> >> > >> > the patchset is oddly separated.
>> >> > >> >
>> >> > >> > I have a comment on that part nevertheless:
>> >> > >> >
>> >> > >> > @@ -1608,7 +1614,10 @@ vect_enhance_data_refs_alignment (loop_vec_info
>> >> > >> > loop_vinfo)
>> >> > >> >    /* Check if we can possibly peel the loop.  */
>> >> > >> >    if (!vect_can_advance_ivs_p (loop_vinfo)
>> >> > >> >        || !slpeel_can_duplicate_loop_p (loop, single_exit (loop))
>> >> > >> > -      || loop->inner)
>> >> > >> > +      || loop->inner
>> >> > >> > +      /* Required peeling was performed in prologue and
>> >> > >> > +        is not required for epilogue.  */
>> >> > >> > +      || LOOP_VINFO_EPILOGUE_P (loop_vinfo))
>> >> > >> >      do_peeling = false;
>> >> > >> >
>> >> > >> >    if (do_peeling
>> >> > >> > @@ -1888,7 +1897,10 @@ vect_enhance_data_refs_alignment (loop_vec_info
>> >> > >> > loop_vinfo)
>> >> > >> >
>> >> > >> >    do_versioning =
>> >> > >> >         optimize_loop_nest_for_speed_p (loop)
>> >> > >> > -       && (!loop->inner); /* FORNOW */
>> >> > >> > +       && (!loop->inner) /* FORNOW */
>> >> > >> > +        /* Required versioning was performed for the
>> >> > >> > +          original loop and is not required for epilogue.  */
>> >> > >> > +       && !LOOP_VINFO_EPILOGUE_P (loop_vinfo);
>> >> > >> >
>> >> > >> >    if (do_versioning)
>> >> > >> >      {
>> >> > >> >
>> >> > >> > please do that check in the single caller of this function.
>> >> > >> >
>> >> > >> > Otherwise I still dislike the new ->aux use and I believe that simply
>> >> > >> > passing down info from the processed parent would be _much_ cleaner.
>> >> > >> > That is, here (and avoid the FOR_EACH_LOOP change):
>> >> > >> >
>> >> > >> > @@ -580,12 +586,21 @@ vectorize_loops (void)
>> >> > >> >             && dump_enabled_p ())
>> >> > >> >            dump_printf_loc (MSG_OPTIMIZED_LOCATIONS, vect_location,
>> >> > >> >                             "loop vectorized\n");
>> >> > >> > -       vect_transform_loop (loop_vinfo);
>> >> > >> > +       new_loop = vect_transform_loop (loop_vinfo);
>> >> > >> >         num_vectorized_loops++;
>> >> > >> >         /* Now that the loop has been vectorized, allow it to be unrolled
>> >> > >> >            etc.  */
>> >> > >> >         loop->force_vectorize = false;
>> >> > >> >
>> >> > >> > +       /* Add new loop to a processing queue.  To make it easier
>> >> > >> > +          to match loop and its epilogue vectorization in dumps
>> >> > >> > +          put new loop as the next loop to process.  */
>> >> > >> > +       if (new_loop)
>> >> > >> > +         {
>> >> > >> > +           loops.safe_insert (i + 1, new_loop->num);
>> >> > >> > +           vect_loops_num = number_of_loops (cfun);
>> >> > >> > +         }
>> >> > >> >
>> >> > >> > simply dispatch to a vectorize_epilogue (loop_vinfo, new_loop)
>> >> > >> > function which will set up stuff properly (and also perform
>> >> > >> > the if-conversion of the epilogue there).
>> >> > >> >
>> >> > >> > That said, if we can get in non-masked epilogue vectorization
>> >> > >> > separately that would be great.
>> >> > >> >
>> >> > >> > I'm still torn about all the rest of the stuff and question its
>> >> > >> > usability (esp. merging the epilogue with the main vector loop).
>> >> > >> > But it has already been approved ... oh well.
>> >> > >> >
>> >> > >> > Thanks,
>> >> > >> > Richard.
>> >> > >>
>> >> > >>
>> >> > >
>> >> > > --
>> >> > > Richard Biener <rguenther@suse.de>
>> >> > > SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)
>> >> >
>> >>
>> >>
>> >
>> > --
>> > Richard Biener <rguenther@suse.de>
>> > SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)
>>
>
> --
> Richard Biener <rguenther@suse.de>
> SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)

[-- Attachment #2: patch.03.update2 --]
[-- Type: application/octet-stream, Size: 15316 bytes --]

diff --git a/gcc/tree-if-conv.c b/gcc/tree-if-conv.c
index 0a20189..0b86ffe 100644
--- a/gcc/tree-if-conv.c
+++ b/gcc/tree-if-conv.c
@@ -2734,7 +2734,7 @@ ifcvt_local_dce (basic_block bb)
    profitability analysis.  Returns non-zero todo flags when something
    changed.  */
 
-static unsigned int
+unsigned int
 tree_if_conversion (struct loop *loop)
 {
   unsigned int todo = 0;
diff --git a/gcc/tree-if-conv.h b/gcc/tree-if-conv.h
new file mode 100644
index 0000000..3a732c2
--- /dev/null
+++ b/gcc/tree-if-conv.h
@@ -0,0 +1,24 @@
+/* Copyright (C) 2016 Free Software Foundation, Inc.
+
+This file is part of GCC.
+
+GCC is free software; you can redistribute it and/or modify it under
+the terms of the GNU General Public License as published by the Free
+Software Foundation; either version 3, or (at your option) any later
+version.
+
+GCC is distributed in the hope that it will be useful, but WITHOUT ANY
+WARRANTY; without even the implied warranty of MERCHANTABILITY or
+FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
+for more details.
+
+You should have received a copy of the GNU General Public License
+along with GCC; see the file COPYING3.  If not see
+<http://www.gnu.org/licenses/>.  */
+
+#ifndef GCC_TREE_IF_CONV_H
+#define GCC_TREE_IF_CONV_H
+
+unsigned int tree_if_conversion (struct loop *);
+
+#endif  /* GCC_TREE_IF_CONV_H  */
diff --git a/gcc/tree-vect-data-refs.c b/gcc/tree-vect-data-refs.c
index 9346cfe..1fc4966 100644
--- a/gcc/tree-vect-data-refs.c
+++ b/gcc/tree-vect-data-refs.c
@@ -480,9 +480,15 @@ vect_analyze_data_ref_dependences (loop_vec_info loop_vinfo, int *max_vf)
 				LOOP_VINFO_LOOP_NEST (loop_vinfo), true))
     return false;
 
-  FOR_EACH_VEC_ELT (LOOP_VINFO_DDRS (loop_vinfo), i, ddr)
-    if (vect_analyze_data_ref_dependence (ddr, loop_vinfo, max_vf))
-      return false;
+  /* For epilogues we either have no aliases or alias versioning
+     was applied to original loop.  Therefore we may just get max_vf
+     using VF of original loop.  */
+  if (LOOP_VINFO_EPILOGUE_P (loop_vinfo))
+    *max_vf = LOOP_VINFO_ORIG_VECT_FACTOR (loop_vinfo);
+  else
+    FOR_EACH_VEC_ELT (LOOP_VINFO_DDRS (loop_vinfo), i, ddr)
+      if (vect_analyze_data_ref_dependence (ddr, loop_vinfo, max_vf))
+	return false;
 
   return true;
 }
diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c
index 6bfd332..80585ed 100644
--- a/gcc/tree-vect-loop-manip.c
+++ b/gcc/tree-vect-loop-manip.c
@@ -1611,11 +1611,13 @@ slpeel_update_phi_nodes_for_lcssa (struct loop *epilog)
 
    Note this function peels prolog and epilog only if it's necessary,
    as well as guards.
+   Returns created epilogue or NULL.
 
    TODO: Guard for prefer_scalar_loop should be emitted along with
    versioning conditions if loop versioning is needed.  */
 
-void
+
+struct loop *
 vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
 		 tree *niters_vector, int th, bool check_profitability,
 		 bool niters_no_overflow)
@@ -1631,7 +1633,7 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
 			 || LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo));
 
   if (!prolog_peeling && !epilog_peeling)
-    return;
+    return NULL;
 
   prob_vector = 9 * REG_BR_PROB_BASE / 10;
   if ((vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo)) == 2)
@@ -1639,7 +1641,7 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
   prob_prolog = prob_epilog = (vf - 1) * REG_BR_PROB_BASE / vf;
   vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
 
-  struct loop *prolog, *epilog, *loop = LOOP_VINFO_LOOP (loop_vinfo);
+  struct loop *prolog, *epilog = NULL, *loop = LOOP_VINFO_LOOP (loop_vinfo);
   struct loop *first_loop = loop;
   create_lcssa_for_virtual_phi (loop);
   update_ssa (TODO_update_ssa_only_virtuals);
@@ -1821,6 +1823,8 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
     }
   adjust_vec.release ();
   free_original_copy_tables ();
+
+  return epilog;
 }
 
 /* Function vect_create_cond_for_niters_checks.
diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index 9cca9b7..5511eac 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -49,6 +49,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "gimple-fold.h"
 #include "cgraph.h"
 #include "tree-cfg.h"
+#include "tree-if-conv.h"
 
 /* Loop Vectorization Pass.
 
@@ -1171,6 +1172,12 @@ new_loop_vec_info (struct loop *loop)
   LOOP_VINFO_PEELING_FOR_GAPS (res) = false;
   LOOP_VINFO_PEELING_FOR_NITER (res) = false;
   LOOP_VINFO_OPERANDS_SWAPPED (res) = false;
+  LOOP_VINFO_CAN_BE_MASKED (res) = false;
+  LOOP_VINFO_REQUIRED_MASKS (res) = 0;
+  LOOP_VINFO_COMBINE_EPILOGUE (res) = false;
+  LOOP_VINFO_MASK_EPILOGUE (res) = false;
+  LOOP_VINFO_NEED_MASKING (res) = false;
+  LOOP_VINFO_ORIG_LOOP_INFO (res) = NULL;
 
   return res;
 }
@@ -2025,15 +2032,20 @@ start_over:
   if (!ok)
     return false;
 
-  /* This pass will decide on using loop versioning and/or loop peeling in
-     order to enhance the alignment of data references in the loop.  */
-  ok = vect_enhance_data_refs_alignment (loop_vinfo);
-  if (!ok)
+  /* Do not invoke vect_enhance_data_refs_alignment for epilogue
+     vectorization.  */
+  if (!LOOP_VINFO_EPILOGUE_P (loop_vinfo))
     {
-      if (dump_enabled_p ())
-	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
-			 "bad data alignment.\n");
-      return false;
+    /* This pass will decide on using loop versioning and/or loop peeling in
+       order to enhance the alignment of data references in the loop.  */
+    ok = vect_enhance_data_refs_alignment (loop_vinfo);
+    if (!ok)
+      {
+	if (dump_enabled_p ())
+	  dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+			   "bad data alignment.\n");
+        return false;
+      }
     }
 
   if (slp)
@@ -2287,9 +2299,10 @@ again:
 
    Apply a set of analyses on LOOP, and create a loop_vec_info struct
    for it.  The different analyses will record information in the
-   loop_vec_info struct.  */
+   loop_vec_info struct.  If ORIG_LOOP_VINFO is not NULL epilogue must
+   be vectorized.  */
 loop_vec_info
-vect_analyze_loop (struct loop *loop)
+vect_analyze_loop (struct loop *loop, loop_vec_info orig_loop_vinfo)
 {
   loop_vec_info loop_vinfo;
   unsigned int vector_sizes;
@@ -2325,6 +2338,10 @@ vect_analyze_loop (struct loop *loop)
 	}
 
       bool fatal = false;
+
+      if (orig_loop_vinfo)
+	LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo) = orig_loop_vinfo;
+
       if (vect_analyze_loop_2 (loop_vinfo, fatal))
 	{
 	  LOOP_VINFO_VECTORIZABLE_P (loop_vinfo) = 1;
@@ -6657,12 +6674,14 @@ loop_niters_no_overflow (loop_vec_info loop_vinfo)
 
    The analysis phase has determined that the loop is vectorizable.
    Vectorize the loop - created vectorized stmts to replace the scalar
-   stmts in the loop, and update the loop exit condition.  */
+   stmts in the loop, and update the loop exit condition.
+   Returns scalar epilogue loop if any.  */
 
-void
+struct loop *
 vect_transform_loop (loop_vec_info loop_vinfo)
 {
   struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
+  struct loop *epilogue = NULL;
   basic_block *bbs = LOOP_VINFO_BBS (loop_vinfo);
   int nbbs = loop->num_nodes;
   int i;
@@ -6741,8 +6760,8 @@ vect_transform_loop (loop_vec_info loop_vinfo)
   LOOP_VINFO_NITERS_UNCHANGED (loop_vinfo) = niters;
   tree nitersm1 = unshare_expr (LOOP_VINFO_NITERSM1 (loop_vinfo));
   bool niters_no_overflow = loop_niters_no_overflow (loop_vinfo);
-  vect_do_peeling (loop_vinfo, niters, nitersm1, &niters_vector, th,
-		   check_profitability, niters_no_overflow);
+  epilogue = vect_do_peeling (loop_vinfo, niters, nitersm1, &niters_vector, th,
+			      check_profitability, niters_no_overflow);
   if (niters_vector == NULL_TREE)
     {
       if (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo))
@@ -7045,6 +7064,56 @@ vect_transform_loop (loop_vec_info loop_vinfo)
   /* Clear-up safelen field since its value is invalid after vectorization
      since vectorized loop can have loop-carried dependencies.  */
   loop->safelen = 0;
+
+  /* Don't vectorize epilogue for epilogue.  */
+  if (LOOP_VINFO_EPILOGUE_P (loop_vinfo))
+    epilogue = NULL;
+  /* Scalar epilogue is not vectorized in case
+     we use combined vector epilogue.  */
+  else if (LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo))
+    epilogue = NULL;
+
+  if (epilogue)
+    {
+      if (!LOOP_VINFO_MASK_EPILOGUE (loop_vinfo))
+	{
+	  unsigned int vector_sizes
+	    = targetm.vectorize.autovectorize_vector_sizes ();
+	  vector_sizes &= current_vector_size - 1;
+
+	  if (!PARAM_VALUE (PARAM_VECT_EPILOGUES_NOMASK))
+	    epilogue = NULL;
+	  else if (!vector_sizes)
+	    epilogue = NULL;
+	  else if (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
+		   && LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo) >= 0)
+	    {
+	      int smallest_vec_size = 1 << ctz_hwi (vector_sizes);
+	      int ratio = current_vector_size / smallest_vec_size;
+	      int eiters = LOOP_VINFO_INT_NITERS (loop_vinfo)
+		- LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo);
+	      eiters = eiters % vf;
+
+	      epilogue->nb_iterations_upper_bound = eiters - 1;
+
+	      if (eiters < vf / ratio)
+		epilogue = NULL;
+	    }
+	}
+    }
+
+  if (epilogue)
+    {
+      epilogue->force_vectorize = loop->force_vectorize;
+      epilogue->safelen = loop->safelen;
+      epilogue->dont_vectorize = false;
+
+      /* We may need to if-convert epilogue to vectorize it.  */
+      if (LOOP_VINFO_SCALAR_LOOP (loop_vinfo))
+	tree_if_conversion (epilogue);
+    }
+
+  return epilogue;
 }
 
 /* The code below is trying to perform simple optimization - revert
diff --git a/gcc/tree-vectorizer.c b/gcc/tree-vectorizer.c
index 22e587a..35d7a3e 100644
--- a/gcc/tree-vectorizer.c
+++ b/gcc/tree-vectorizer.c
@@ -514,6 +514,7 @@ vectorize_loops (void)
   hash_table<simd_array_to_simduid> *simd_array_to_simduid_htab = NULL;
   bool any_ifcvt_loops = false;
   unsigned ret = 0;
+  struct loop *new_loop;
 
   vect_loops_num = number_of_loops (cfun);
 
@@ -538,7 +539,8 @@ vectorize_loops (void)
 	      && optimize_loop_nest_for_speed_p (loop))
 	     || loop->force_vectorize)
       {
-	loop_vec_info loop_vinfo;
+	loop_vec_info loop_vinfo, orig_loop_vinfo = NULL;
+vectorize_epilogue:
 	vect_location = find_loop_location (loop);
         if (LOCATION_LOCUS (vect_location) != UNKNOWN_LOCATION
 	    && dump_enabled_p ())
@@ -546,7 +548,7 @@ vectorize_loops (void)
                        LOCATION_FILE (vect_location),
 		       LOCATION_LINE (vect_location));
 
-	loop_vinfo = vect_analyze_loop (loop);
+	loop_vinfo = vect_analyze_loop (loop, orig_loop_vinfo);
 	loop->aux = loop_vinfo;
 
 	if (!loop_vinfo || !LOOP_VINFO_VECTORIZABLE_P (loop_vinfo))
@@ -580,7 +582,7 @@ vectorize_loops (void)
 	    && dump_enabled_p ())
           dump_printf_loc (MSG_OPTIMIZED_LOCATIONS, vect_location,
                            "loop vectorized\n");
-	vect_transform_loop (loop_vinfo);
+	new_loop = vect_transform_loop (loop_vinfo);
 	num_vectorized_loops++;
 	/* Now that the loop has been vectorized, allow it to be unrolled
 	   etc.  */
@@ -602,6 +604,15 @@ vectorize_loops (void)
 	    fold_loop_vectorized_call (loop_vectorized_call, boolean_true_node);
 	    ret |= TODO_cleanup_cfg;
 	  }
+
+	if (new_loop)
+	  {
+	    /* Epilogue of vectorized loop must be vectorized too.  */
+	    vect_loops_num = number_of_loops (cfun);
+	    loop = new_loop;
+	    orig_loop_vinfo = loop_vinfo;  /* To pass vect_analyze_loop.  */
+	    goto vectorize_epilogue;
+	  }
       }
 
   vect_location = UNKNOWN_LOCATION;
diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index 3866548..735097d 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -335,6 +335,23 @@ typedef struct _loop_vec_info : public vec_info {
   /* Mark loops having masked stores.  */
   bool has_mask_store;
 
+  /* True if vectorized loop can be masked.  */
+  bool can_be_masked;
+  /* If vector mask with 2^N elements is required to mask the loop
+     then N-th bit of this field is set to 1.  */
+  unsigned required_masks;
+
+  /* True if we should vectorize loop epilogue with masking.  */
+  bool mask_epilogue;
+  /* True if we should combine main loop with epilogue using masking.  */
+  bool combine_epilogue;
+  /* True if loop vectorization requires masking.  E.g. we want to
+     vectorize loop with low trip count.  */
+  bool need_masking;
+  /* For loops being epilogues of already vectorized loops
+     this points to the original vectorized loop.  Otherwise NULL.  */
+  _loop_vec_info *orig_loop_info;
+
 } *loop_vec_info;
 
 /* Access Functions.  */
@@ -374,6 +391,12 @@ typedef struct _loop_vec_info : public vec_info {
 #define LOOP_VINFO_HAS_MASK_STORE(L)       (L)->has_mask_store
 #define LOOP_VINFO_SCALAR_ITERATION_COST(L) (L)->scalar_cost_vec
 #define LOOP_VINFO_SINGLE_SCALAR_ITERATION_COST(L) (L)->single_scalar_iteration_cost
+#define LOOP_VINFO_CAN_BE_MASKED(L)	   (L)->can_be_masked
+#define LOOP_VINFO_REQUIRED_MASKS(L)       (L)->required_masks
+#define LOOP_VINFO_COMBINE_EPILOGUE(L)     (L)->combine_epilogue
+#define LOOP_VINFO_MASK_EPILOGUE(L)	   (L)->mask_epilogue
+#define LOOP_VINFO_NEED_MASKING(L)	   (L)->need_masking
+#define LOOP_VINFO_ORIG_LOOP_INFO(L)       (L)->orig_loop_info
 
 #define LOOP_REQUIRES_VERSIONING_FOR_ALIGNMENT(L)	\
   ((L)->may_misalign_stmts.length () > 0)
@@ -389,6 +412,14 @@ typedef struct _loop_vec_info : public vec_info {
 #define LOOP_VINFO_NITERS_KNOWN_P(L)          \
   (tree_fits_shwi_p ((L)->num_iters) && tree_to_shwi ((L)->num_iters) > 0)
 
+#define LOOP_VINFO_EPILOGUE_P(L) \
+  (LOOP_VINFO_ORIG_LOOP_INFO (L) != NULL)
+
+#define LOOP_VINFO_ORIG_MASK_EPILOGUE(L) \
+  (LOOP_VINFO_MASK_EPILOGUE (LOOP_VINFO_ORIG_LOOP_INFO (L)))
+#define LOOP_VINFO_ORIG_VECT_FACTOR(L) \
+  (LOOP_VINFO_VECT_FACTOR (LOOP_VINFO_ORIG_LOOP_INFO (L)))
+
 static inline loop_vec_info
 loop_vec_info_for_loop (struct loop *loop)
 {
@@ -1032,8 +1063,8 @@ extern bool slpeel_can_duplicate_loop_p (const struct loop *, const_edge);
 struct loop *slpeel_tree_duplicate_loop_to_edge_cfg (struct loop *,
 						     struct loop *, edge);
 extern void vect_loop_versioning (loop_vec_info, unsigned int, bool);
-extern void vect_do_peeling (loop_vec_info, tree, tree,
-			     tree *, int, bool, bool);
+extern struct loop *vect_do_peeling (loop_vec_info, tree, tree,
+				     tree *, int, bool, bool);
 extern source_location find_loop_location (struct loop *);
 extern bool vect_can_advance_ivs_p (loop_vec_info);
 
@@ -1144,11 +1175,11 @@ extern void destroy_loop_vec_info (loop_vec_info, bool);
 extern gimple *vect_force_simple_reduction (loop_vec_info, gimple *, bool,
 					    bool *, bool);
 /* Drive for loop analysis stage.  */
-extern loop_vec_info vect_analyze_loop (struct loop *);
+extern loop_vec_info vect_analyze_loop (struct loop *, loop_vec_info);
 extern tree vect_build_loop_niters (loop_vec_info);
 extern void vect_gen_vector_loop_niters (loop_vec_info, tree, tree *, bool);
 /* Drive for loop transformation stage.  */
-extern void vect_transform_loop (loop_vec_info);
+extern struct loop *vect_transform_loop (loop_vec_info);
 extern loop_vec_info vect_analyze_loop_form (struct loop *);
 extern bool vectorizable_live_operation (gimple *, gimple_stmt_iterator *,
 					 slp_tree, int, gimple **);
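
For readers following the vect_transform_loop hunk above, the check that
decides whether a known-trip-count epilogue is worth vectorizing can be
sketched in isolation.  This is only an illustration, not code from the
patch: epilogue_worth_vectorizing and lowest_set_bit are hypothetical names,
and the sketch assumes that the target's vector sizes form a bitmask of byte
sizes (for example 32 | 16 on an AVX2 target), as the masking with
current_vector_size - 1 suggests.  It models only the branch where the
iteration count and the alignment peeling are known at compile time.

```c
#include <stdbool.h>

/* Index of the lowest set bit of a non-zero mask; a portable stand-in
   for ctz_hwi used in the patch.  */
static int
lowest_set_bit (unsigned int mask)
{
  int n = 0;
  while ((mask & 1u) == 0)
    {
      mask >>= 1;
      n++;
    }
  return n;
}

/* Hypothetical helper, not part of the patch.  Returns true if the
   epilogue left after the main vector loop has enough iterations to
   fill at least one vector of the smallest remaining vector size.
   TARGET_VECTOR_SIZES is assumed to be a bitmask of vector sizes in
   bytes, CURRENT_VECTOR_SIZE the size used by the main loop, NITERS
   the known trip count, PEEL the iterations peeled for alignment and
   VF the vectorization factor of the main loop.  */
bool
epilogue_worth_vectorizing (unsigned int target_vector_sizes,
			    unsigned int current_vector_size,
			    int niters, int peel, int vf)
{
  /* Keep only vector sizes strictly smaller than the one just used.  */
  unsigned int smaller = target_vector_sizes & (current_vector_size - 1);
  if (smaller == 0)
    return false;

  int smallest_vec_size = 1 << lowest_set_bit (smaller);
  int ratio = (int) current_vector_size / smallest_vec_size;

  /* Iterations left over for the epilogue.  */
  int eiters = (niters - peel) % vf;

  /* The patch drops the epilogue when eiters < vf / ratio.  */
  return eiters >= vf / ratio;
}
```

With sizes 32 | 16, a 32-byte main loop and vf = 8, a loop of 103 iterations
leaves 7 for the epilogue, which is at least vf / ratio = 4, so the epilogue
is kept; with 99 iterations only 3 remain and it is dropped.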

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH, vec-tails] Support loop epilogue vectorization
  2016-11-14 13:30                 ` Yuri Rumyantsev
@ 2016-11-14 13:41                   ` Richard Biener
  2016-11-14 15:39                     ` Yuri Rumyantsev
  0 siblings, 1 reply; 38+ messages in thread
From: Richard Biener @ 2016-11-14 13:41 UTC (permalink / raw)
  To: Yuri Rumyantsev; +Cc: Jeff Law, gcc-patches, Ilya Enkovich

On Mon, 14 Nov 2016, Yuri Rumyantsev wrote:

> Richard,
> 
> In my previous patch I forgot to remove couple lines related to aux field.
> Here is the correct updated patch.

Yeah, I noticed.  This patch would be ok for trunk (together with the
necessary parts from 1 and 2) if all parts that are not required are
removed (and you add testcases covering non-masked tail vect).

Thus, can you please produce a single complete patch containing only
non-masked epilogue vectorization?

Thanks,
Richard.

> Thanks.
> Yuri.
> 
> 2016-11-14 15:51 GMT+03:00 Richard Biener <rguenther@suse.de>:
> > On Fri, 11 Nov 2016, Yuri Rumyantsev wrote:
> >
> >> Richard,
> >>
> >> I prepare updated 3 patch with passing additional argument to
> >> vect_analyze_loop as you proposed (untested).
> >>
> >> You wrote:
> >> Btw, I wonder if you can produce a single patch containing just
> >> epilogue vectorization, that is combine patches 1-3 but rip out
> >> changes only needed by later patches?
> >>
> >> Did you mean that I exclude all support for vectorization epilogues,
> >> i.e. exclude from 2-nd patch all non-related changes
> >> like
> >>
> >> diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
> >> index 11863af..32011c1 100644
> >> --- a/gcc/tree-vect-loop.c
> >> +++ b/gcc/tree-vect-loop.c
> >> @@ -1120,6 +1120,12 @@ new_loop_vec_info (struct loop *loop)
> >>    LOOP_VINFO_PEELING_FOR_GAPS (res) = false;
> >>    LOOP_VINFO_PEELING_FOR_NITER (res) = false;
> >>    LOOP_VINFO_OPERANDS_SWAPPED (res) = false;
> >> +  LOOP_VINFO_CAN_BE_MASKED (res) = false;
> >> +  LOOP_VINFO_REQUIRED_MASKS (res) = 0;
> >> +  LOOP_VINFO_COMBINE_EPILOGUE (res) = false;
> >> +  LOOP_VINFO_MASK_EPILOGUE (res) = false;
> >> +  LOOP_VINFO_NEED_MASKING (res) = false;
> >> +  LOOP_VINFO_ORIG_LOOP_INFO (res) = NULL;
> >
> > Yes.
> >
> >> Did you mean also that new combined patch must be working patch, i.e.
> >> can be integrated without other patches?
> >
> > Yes.
> >
> >> Could you please look at updated patch?
> >
> > Will do.
> >
> > Thanks,
> > Richard.
> >
> >> Thanks.
> >> Yuri.
> >>
> >> 2016-11-10 15:36 GMT+03:00 Richard Biener <rguenther@suse.de>:
> >> > On Thu, 10 Nov 2016, Richard Biener wrote:
> >> >
> >> >> On Tue, 8 Nov 2016, Yuri Rumyantsev wrote:
> >> >>
> >> >> > Richard,
> >> >> >
> >> >> > Here is updated 3 patch.
> >> >> >
> >> >> > I checked that all new tests related to epilogue vectorization passed with it.
> >> >> >
> >> >> > Your comments will be appreciated.
> >> >>
> >> >> A lot better now.  Instead of the ->aux dance I now prefer to
> >> >> pass the original loops loop_vinfo to vect_analyze_loop as
> >> >> optional argument (if non-NULL we analyze the epilogue of that
> >> >> loop_vinfo).  OTOH I remember we mainly use it to get at the
> >> >> original vectorization factor?  So we can pass down an (optional)
> >> >> forced vectorization factor as well?
> >> >
> >> > Btw, I wonder if you can produce a single patch containing just
> >> > epilogue vectorization, that is combine patches 1-3 but rip out
> >> > changes only needed by later patches?
> >> >
> >> > Thanks,
> >> > Richard.
> >> >
> >> >> Richard.
> >> >>
> >> >> > 2016-11-08 15:38 GMT+03:00 Richard Biener <rguenther@suse.de>:
> >> >> > > On Thu, 3 Nov 2016, Yuri Rumyantsev wrote:
> >> >> > >
> >> >> > >> Hi Richard,
> >> >> > >>
> >> >> > >> I did not understand your last remark:
> >> >> > >>
> >> >> > >> > That is, here (and avoid the FOR_EACH_LOOP change):
> >> >> > >> >
> >> >> > >> > @@ -580,12 +586,21 @@ vectorize_loops (void)
> >> >> > >> >           && dump_enabled_p ())
> >> >> > >> >           dump_printf_loc (MSG_OPTIMIZED_LOCATIONS, vect_location,
> >> >> > >> >                            "loop vectorized\n");
> >> >> > >> > -       vect_transform_loop (loop_vinfo);
> >> >> > >> > +       new_loop = vect_transform_loop (loop_vinfo);
> >> >> > >> >         num_vectorized_loops++;
> >> >> > >> >        /* Now that the loop has been vectorized, allow it to be unrolled
> >> >> > >> >           etc.  */
> >> >> > >> >      loop->force_vectorize = false;
> >> >> > >> >
> >> >> > >> > +       /* Add new loop to a processing queue.  To make it easier
> >> >> > >> > +          to match loop and its epilogue vectorization in dumps
> >> >> > >> > +          put new loop as the next loop to process.  */
> >> >> > >> > +       if (new_loop)
> >> >> > >> > +         {
> >> >> > >> > +           loops.safe_insert (i + 1, new_loop->num);
> >> >> > >> > +           vect_loops_num = number_of_loops (cfun);
> >> >> > >> > +         }
> >> >> > >> >
> >> >> > >> > simply dispatch to a vectorize_epilogue (loop_vinfo, new_loop)
> >> >> > >> > function which will set up stuff properly (and also perform
> >> >> > >> > the if-conversion of the epilogue there).
> >> >> > >> >
> >> >> > >> > That said, if we can get in non-masked epilogue vectorization
> >> >> > >> > separately that would be great.
> >> >> > >>
> >> >> > >> Could you please clarify your proposal.
> >> >> > >
> >> >> > > When a loop was vectorized set things up to immediately vectorize
> >> >> > > its epilogue, avoiding changing the loop iteration and avoiding
> >> >> > > the re-use of ->aux.
> >> >> > >
> >> >> > > Richard.
> >> >> > >
> >> >> > >> Thanks.
> >> >> > >> Yuri.
> >> >> > >>
> >> >> > >> 2016-11-02 15:27 GMT+03:00 Richard Biener <rguenther@suse.de>:
> >> >> > >> > On Tue, 1 Nov 2016, Yuri Rumyantsev wrote:
> >> >> > >> >
> >> >> > >> >> Hi All,
> >> >> > >> >>
> >> >> > >> >> I re-send all patches sent by Ilya earlier for review which support
> >> >> > >> >> vectorization of loop epilogues and loops with low trip count. We
> >> >> > >> >> assume that the only patch - vec-tails-07-combine-tail.patch - was not
> >> >> > >> >> approved by Jeff.
> >> >> > >> >>
> >> >> > >> >> I did re-base of all patches and performed bootstrapping and
> >> >> > >> >> regression testing that did not show any new failures. Also all
> >> >> > >> >> changes related to new vect_do_peeling algorithm have been changed
> >> >> > >> >> accordingly.
> >> >> > >> >>
> >> >> > >> >> Is it OK for trunk?
> >> >> > >> >
> >> >> > >> > I would have preferred that the series up to -03-nomask-tails would
> >> >> > >> > _only_ contain epilogue loop vectorization changes but unfortunately
> >> >> > >> > the patchset is oddly separated.
> >> >> > >> >
> >> >> > >> > I have a comment on that part nevertheless:
> >> >> > >> >
> >> >> > >> > @@ -1608,7 +1614,10 @@ vect_enhance_data_refs_alignment (loop_vec_info
> >> >> > >> > loop_vinfo)
> >> >> > >> >    /* Check if we can possibly peel the loop.  */
> >> >> > >> >    if (!vect_can_advance_ivs_p (loop_vinfo)
> >> >> > >> >        || !slpeel_can_duplicate_loop_p (loop, single_exit (loop))
> >> >> > >> > -      || loop->inner)
> >> >> > >> > +      || loop->inner
> >> >> > >> > +      /* Required peeling was performed in prologue and
> >> >> > >> > +        is not required for epilogue.  */
> >> >> > >> > +      || LOOP_VINFO_EPILOGUE_P (loop_vinfo))
> >> >> > >> >      do_peeling = false;
> >> >> > >> >
> >> >> > >> >    if (do_peeling
> >> >> > >> > @@ -1888,7 +1897,10 @@ vect_enhance_data_refs_alignment (loop_vec_info
> >> >> > >> > loop_vinfo)
> >> >> > >> >
> >> >> > >> >    do_versioning =
> >> >> > >> >         optimize_loop_nest_for_speed_p (loop)
> >> >> > >> > -       && (!loop->inner); /* FORNOW */
> >> >> > >> > +       && (!loop->inner) /* FORNOW */
> >> >> > >> > +        /* Required versioning was performed for the
> >> >> > >> > +          original loop and is not required for epilogue.  */
> >> >> > >> > +       && !LOOP_VINFO_EPILOGUE_P (loop_vinfo);
> >> >> > >> >
> >> >> > >> >    if (do_versioning)
> >> >> > >> >      {
> >> >> > >> >
> >> >> > >> > please do that check in the single caller of this function.
> >> >> > >> >
> >> >> > >> > Otherwise I still dislike the new ->aux use and I believe that simply
> >> >> > >> > passing down info from the processed parent would be _much_ cleaner.
> >> >> > >> > That is, here (and avoid the FOR_EACH_LOOP change):
> >> >> > >> >
> >> >> > >> > @@ -580,12 +586,21 @@ vectorize_loops (void)
> >> >> > >> >             && dump_enabled_p ())
> >> >> > >> >            dump_printf_loc (MSG_OPTIMIZED_LOCATIONS, vect_location,
> >> >> > >> >                             "loop vectorized\n");
> >> >> > >> > -       vect_transform_loop (loop_vinfo);
> >> >> > >> > +       new_loop = vect_transform_loop (loop_vinfo);
> >> >> > >> >         num_vectorized_loops++;
> >> >> > >> >         /* Now that the loop has been vectorized, allow it to be unrolled
> >> >> > >> >            etc.  */
> >> >> > >> >         loop->force_vectorize = false;
> >> >> > >> >
> >> >> > >> > +       /* Add new loop to a processing queue.  To make it easier
> >> >> > >> > +          to match loop and its epilogue vectorization in dumps
> >> >> > >> > +          put new loop as the next loop to process.  */
> >> >> > >> > +       if (new_loop)
> >> >> > >> > +         {
> >> >> > >> > +           loops.safe_insert (i + 1, new_loop->num);
> >> >> > >> > +           vect_loops_num = number_of_loops (cfun);
> >> >> > >> > +         }
> >> >> > >> >
> >> >> > >> > simply dispatch to a vectorize_epilogue (loop_vinfo, new_loop)
> >> >> > >> > function which will set up stuff properly (and also perform
> >> >> > >> > the if-conversion of the epilogue there).
> >> >> > >> >
> >> >> > >> > That said, if we can get in non-masked epilogue vectorization
> >> >> > >> > separately that would be great.
> >> >> > >> >
> >> >> > >> > I'm still torn about all the rest of the stuff and question its
> >> >> > >> > usability (esp. merging the epilogue with the main vector loop).
> >> >> > >> > But it has already been approved ... oh well.
> >> >> > >> >
> >> >> > >> > Thanks,
> >> >> > >> > Richard.
> >> >> > >>
> >> >> > >>
> >> >> > >
> >> >> > > --
> >> >> > > Richard Biener <rguenther@suse.de>
> >> >> > > SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)
> >> >> >
> >> >>
> >> >>
> >> >
> >> > --
> >> > Richard Biener <rguenther@suse.de>
> >> > SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)
> >>
> >
> > --
> > Richard Biener <rguenther@suse.de>
> > SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH, vec-tails] Support loop epilogue vectorization
  2016-11-14 13:41                   ` Richard Biener
@ 2016-11-14 15:39                     ` Yuri Rumyantsev
  2016-11-14 17:59                       ` Richard Biener
  0 siblings, 1 reply; 38+ messages in thread
From: Yuri Rumyantsev @ 2016-11-14 15:39 UTC (permalink / raw)
  To: Richard Biener; +Cc: Jeff Law, gcc-patches, Ilya Enkovich

Richard,

I checked one of the tests designed for epilogue vectorization using
patches 1 - 3 and found that the built compiler vectorizes epilogues
when --param vect-epilogues-nomask=1 is passed:

$ gcc -Ofast -mavx2 t1.c -S --param vect-epilogues-nomask=1 \
    -o t1.new-nomask.s -fdump-tree-vect-details
$ grep VECTORIZED -c t1.c.156t.vect
4
Without the param only 2 loops are vectorized.
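
The test t1.c itself is not shown in this thread.  As a rough idea of the
kind of loop such a test might exercise, here is a minimal sketch; the file
contents, array names and function names below are assumptions made for
illustration only.  The trip count is a runtime value, so after the main
vector loop some iterations may remain for the epilogue, which a compiler
with this patch and the param enabled might then vectorize with a smaller
vector size.  Whether it actually does depends on the compiler build and
target.

```c
#define N 1000

/* Hypothetical test data; the real t1.c is not part of this thread.  */
int a[N], b[N], c[N];

/* Fill the inputs with simple, predictable values.  */
void
init_arrays (void)
{
  for (int i = 0; i < N; i++)
    {
      b[i] = i;
      c[i] = 2 * i;
    }
}

/* Element-wise add over the first n elements (n <= N).  The trip count
   is not known at compile time, so the vectorizer has to emit an
   epilogue for the iterations the main vector loop does not cover.  */
void
add_arrays (int n)
{
  for (int i = 0; i < n; i++)
    a[i] = b[i] + c[i];
}
```

Compiling such a file with the command above and counting VECTORIZED in the
vect dump is the check described here.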

Should I simply add the tests related to this feature, or must I also
delete all unnecessary changes?

Thanks.
Yuri.

2016-11-14 16:40 GMT+03:00 Richard Biener <rguenther@suse.de>:
> On Mon, 14 Nov 2016, Yuri Rumyantsev wrote:
>
>> Richard,
>>
>> In my previous patch I forgot to remove couple lines related to aux field.
>> Here is the correct updated patch.
>
> Yeah, I noticed.  This patch would be ok for trunk (together with
> necessary parts from 1 and 2) if all not required parts are removed
> (and you'd add the testcases covering non-masked tail vect).
>
> Thus, can you please produce a single complete patch containing only
> non-masked epilogue vectorization?
>
> Thanks,
> Richard.
>
>> Thanks.
>> Yuri.
>>
>> 2016-11-14 15:51 GMT+03:00 Richard Biener <rguenther@suse.de>:
>> > On Fri, 11 Nov 2016, Yuri Rumyantsev wrote:
>> >
>> >> Richard,
>> >>
>> >> I prepare updated 3 patch with passing additional argument to
>> >> vect_analyze_loop as you proposed (untested).
>> >>
>> >> You wrote:
> >> >> Btw, I wonder if you can produce a single patch containing just
>> >> epilogue vectorization, that is combine patches 1-3 but rip out
>> >> changes only needed by later patches?
>> >>
>> >> Did you mean that I exclude all support for vectorization epilogues,
>> >> i.e. exclude from 2-nd patch all non-related changes
>> >> like
>> >>
>> >> diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
>> >> index 11863af..32011c1 100644
>> >> --- a/gcc/tree-vect-loop.c
>> >> +++ b/gcc/tree-vect-loop.c
>> >> @@ -1120,6 +1120,12 @@ new_loop_vec_info (struct loop *loop)
>> >>    LOOP_VINFO_PEELING_FOR_GAPS (res) = false;
>> >>    LOOP_VINFO_PEELING_FOR_NITER (res) = false;
>> >>    LOOP_VINFO_OPERANDS_SWAPPED (res) = false;
>> >> +  LOOP_VINFO_CAN_BE_MASKED (res) = false;
>> >> +  LOOP_VINFO_REQUIRED_MASKS (res) = 0;
>> >> +  LOOP_VINFO_COMBINE_EPILOGUE (res) = false;
>> >> +  LOOP_VINFO_MASK_EPILOGUE (res) = false;
>> >> +  LOOP_VINFO_NEED_MASKING (res) = false;
>> >> +  LOOP_VINFO_ORIG_LOOP_INFO (res) = NULL;
>> >
>> > Yes.
>> >
>> >> Did you mean also that new combined patch must be working patch, i.e.
>> >> can be integrated without other patches?
>> >
>> > Yes.
>> >
>> >> Could you please look at updated patch?
>> >
>> > Will do.
>> >
>> > Thanks,
>> > Richard.
>> >
>> >> Thanks.
>> >> Yuri.
>> >>
>> >> 2016-11-10 15:36 GMT+03:00 Richard Biener <rguenther@suse.de>:
>> >> > On Thu, 10 Nov 2016, Richard Biener wrote:
>> >> >
>> >> >> On Tue, 8 Nov 2016, Yuri Rumyantsev wrote:
>> >> >>
>> >> >> > Richard,
>> >> >> >
>> >> >> > Here is updated 3 patch.
>> >> >> >
>> >> >> > I checked that all new tests related to epilogue vectorization passed with it.
>> >> >> >
>> >> >> > Your comments will be appreciated.
>> >> >>
>> >> >> A lot better now.  Instead of the ->aux dance I now prefer to
>> >> >> pass the original loops loop_vinfo to vect_analyze_loop as
>> >> >> optional argument (if non-NULL we analyze the epilogue of that
>> >> >> loop_vinfo).  OTOH I remember we mainly use it to get at the
>> >> >> original vectorization factor?  So we can pass down an (optional)
>> >> >> forced vectorization factor as well?
>> >> >
>> >> > Btw, I wonder if you can produce a single patch containing just
>> >> > epilogue vectorization, that is combine patches 1-3 but rip out
>> >> > changes only needed by later patches?
>> >> >
>> >> > Thanks,
>> >> > Richard.
>> >> >
>> >> >> Richard.
>> >> >>
>> >> >> > 2016-11-08 15:38 GMT+03:00 Richard Biener <rguenther@suse.de>:
>> >> >> > > On Thu, 3 Nov 2016, Yuri Rumyantsev wrote:
>> >> >> > >
>> >> >> > >> Hi Richard,
>> >> >> > >>
>> >> >> > >> I did not understand your last remark:
>> >> >> > >>
>> >> >> > >> > That is, here (and avoid the FOR_EACH_LOOP change):
>> >> >> > >> >
>> >> >> > >> > @@ -580,12 +586,21 @@ vectorize_loops (void)
>> >> >> > >> >           && dump_enabled_p ())
>> >> >> > >> >           dump_printf_loc (MSG_OPTIMIZED_LOCATIONS, vect_location,
>> >> >> > >> >                            "loop vectorized\n");
>> >> >> > >> > -       vect_transform_loop (loop_vinfo);
>> >> >> > >> > +       new_loop = vect_transform_loop (loop_vinfo);
>> >> >> > >> >         num_vectorized_loops++;
>> >> >> > >> >        /* Now that the loop has been vectorized, allow it to be unrolled
>> >> >> > >> >           etc.  */
>> >> >> > >> >      loop->force_vectorize = false;
>> >> >> > >> >
>> >> >> > >> > +       /* Add new loop to a processing queue.  To make it easier
>> >> >> > >> > +          to match loop and its epilogue vectorization in dumps
>> >> >> > >> > +          put new loop as the next loop to process.  */
>> >> >> > >> > +       if (new_loop)
>> >> >> > >> > +         {
>> >> >> > >> > +           loops.safe_insert (i + 1, new_loop->num);
>> >> >> > >> > +           vect_loops_num = number_of_loops (cfun);
>> >> >> > >> > +         }
>> >> >> > >> >
>> >> >> > >> > simply dispatch to a vectorize_epilogue (loop_vinfo, new_loop)
> >> >> >> > >> > function which will set up stuff properly (and also perform
>> >> >> > >> > the if-conversion of the epilogue there).
>> >> >> > >> >
>> >> >> > >> > That said, if we can get in non-masked epilogue vectorization
>> >> >> > >> > separately that would be great.
>> >> >> > >>
>> >> >> > >> Could you please clarify your proposal.
>> >> >> > >
>> >> >> > > When a loop was vectorized set things up to immediately vectorize
>> >> >> > > its epilogue, avoiding changing the loop iteration and avoiding
>> >> >> > > the re-use of ->aux.
>> >> >> > >
>> >> >> > > Richard.
>> >> >> > >
>> >> >> > >> Thanks.
>> >> >> > >> Yuri.
>> >> >> > >>
>> >> >> > >> 2016-11-02 15:27 GMT+03:00 Richard Biener <rguenther@suse.de>:
>> >> >> > >> > On Tue, 1 Nov 2016, Yuri Rumyantsev wrote:
>> >> >> > >> >
>> >> >> > >> >> Hi All,
>> >> >> > >> >>
>> >> >> > >> >> I re-send all patches sent by Ilya earlier for review which support
>> >> >> > >> >> vectorization of loop epilogues and loops with low trip count. We
>> >> >> > >> >> assume that the only patch - vec-tails-07-combine-tail.patch - was not
>> >> >> > >> >> approved by Jeff.
>> >> >> > >> >>
>> >> >> > >> >> I did re-base of all patches and performed bootstrapping and
>> >> >> > >> >> regression testing that did not show any new failures. Also all
>> >> >> > >> >> changes related to new vect_do_peeling algorithm have been changed
>> >> >> > >> >> accordingly.
>> >> >> > >> >>
>> >> >> > >> >> Is it OK for trunk?
>> >> >> > >> >
> >> >> >> > >> > I would have preferred that the series up to -03-nomask-tails would
>> >> >> > >> > _only_ contain epilogue loop vectorization changes but unfortunately
>> >> >> > >> > the patchset is oddly separated.
>> >> >> > >> >
>> >> >> > >> > I have a comment on that part nevertheless:
>> >> >> > >> >
>> >> >> > >> > @@ -1608,7 +1614,10 @@ vect_enhance_data_refs_alignment (loop_vec_info
>> >> >> > >> > loop_vinfo)
>> >> >> > >> >    /* Check if we can possibly peel the loop.  */
>> >> >> > >> >    if (!vect_can_advance_ivs_p (loop_vinfo)
>> >> >> > >> >        || !slpeel_can_duplicate_loop_p (loop, single_exit (loop))
>> >> >> > >> > -      || loop->inner)
>> >> >> > >> > +      || loop->inner
>> >> >> > >> > +      /* Required peeling was performed in prologue and
>> >> >> > >> > +        is not required for epilogue.  */
>> >> >> > >> > +      || LOOP_VINFO_EPILOGUE_P (loop_vinfo))
>> >> >> > >> >      do_peeling = false;
>> >> >> > >> >
>> >> >> > >> >    if (do_peeling
>> >> >> > >> > @@ -1888,7 +1897,10 @@ vect_enhance_data_refs_alignment (loop_vec_info
>> >> >> > >> > loop_vinfo)
>> >> >> > >> >
>> >> >> > >> >    do_versioning =
>> >> >> > >> >         optimize_loop_nest_for_speed_p (loop)
>> >> >> > >> > -       && (!loop->inner); /* FORNOW */
>> >> >> > >> > +       && (!loop->inner) /* FORNOW */
>> >> >> > >> > +        /* Required versioning was performed for the
>> >> >> > >> > +          original loop and is not required for epilogue.  */
>> >> >> > >> > +       && !LOOP_VINFO_EPILOGUE_P (loop_vinfo);
>> >> >> > >> >
>> >> >> > >> >    if (do_versioning)
>> >> >> > >> >      {
>> >> >> > >> >
>> >> >> > >> > please do that check in the single caller of this function.
>> >> >> > >> >
>> >> >> > >> > Otherwise I still dislike the new ->aux use and I believe that simply
>> >> >> > >> > passing down info from the processed parent would be _much_ cleaner.
>> >> >> > >> > That is, here (and avoid the FOR_EACH_LOOP change):
>> >> >> > >> >
>> >> >> > >> > @@ -580,12 +586,21 @@ vectorize_loops (void)
>> >> >> > >> >             && dump_enabled_p ())
>> >> >> > >> >            dump_printf_loc (MSG_OPTIMIZED_LOCATIONS, vect_location,
>> >> >> > >> >                             "loop vectorized\n");
>> >> >> > >> > -       vect_transform_loop (loop_vinfo);
>> >> >> > >> > +       new_loop = vect_transform_loop (loop_vinfo);
>> >> >> > >> >         num_vectorized_loops++;
>> >> >> > >> >         /* Now that the loop has been vectorized, allow it to be unrolled
>> >> >> > >> >            etc.  */
>> >> >> > >> >         loop->force_vectorize = false;
>> >> >> > >> >
>> >> >> > >> > +       /* Add new loop to a processing queue.  To make it easier
>> >> >> > >> > +          to match loop and its epilogue vectorization in dumps
>> >> >> > >> > +          put new loop as the next loop to process.  */
>> >> >> > >> > +       if (new_loop)
>> >> >> > >> > +         {
>> >> >> > >> > +           loops.safe_insert (i + 1, new_loop->num);
>> >> >> > >> > +           vect_loops_num = number_of_loops (cfun);
>> >> >> > >> > +         }
>> >> >> > >> >
>> >> >> > >> > simply dispatch to a vectorize_epilogue (loop_vinfo, new_loop)
>> >> >> > >> > function which will set up stuff properly (and also perform
>> >> >> > >> > the if-conversion of the epilogue there).
>> >> >> > >> >
>> >> >> > >> > That said, if we can get in non-masked epilogue vectorization
>> >> >> > >> > separately that would be great.
>> >> >> > >> >
>> >> >> > >> > I'm still torn about all the rest of the stuff and question its
>> >> >> > >> > usability (esp. merging the epilogue with the main vector loop).
>> >> >> > >> > But it has already been approved ... oh well.
>> >> >> > >> >
>> >> >> > >> > Thanks,
>> >> >> > >> > Richard.
>> >> >> > >>
>> >> >> > >>
>> >> >> > >
>> >> >> > > --
>> >> >> > > Richard Biener <rguenther@suse.de>
>> >> >> > > SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)
>> >> >> >
>> >> >>
>> >> >>
>> >> >
>> >> > --
>> >> > Richard Biener <rguenther@suse.de>
>> >> > SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)
>> >>
>> >
>> > --
>> > Richard Biener <rguenther@suse.de>
>> > SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)
>>
>
> --
> Richard Biener <rguenther@suse.de>
> SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH, vec-tails] Support loop epilogue vectorization
  2016-11-14 15:39                     ` Yuri Rumyantsev
@ 2016-11-14 17:59                       ` Richard Biener
  2016-11-15 14:42                         ` Yuri Rumyantsev
  0 siblings, 1 reply; 38+ messages in thread
From: Richard Biener @ 2016-11-14 17:59 UTC (permalink / raw)
  To: Yuri Rumyantsev; +Cc: Jeff Law, gcc-patches, Ilya Enkovich

On November 14, 2016 4:39:40 PM GMT+01:00, Yuri Rumyantsev <ysrumyan@gmail.com> wrote:
>Richard,
>
>I checked one of the tests designed for epilogue vectorization using
>patches 1 - 3 and found out that build compiler performs vectorization
>of epilogues with --param vect-epilogues-nomask=1 passed:
>
>$ gcc -Ofast -mavx2 t1.c -S --param vect-epilogues-nomask=1 -o
>t1.new-nomask.s -fdump-tree-vect-details
>$ grep VECTORIZED -c t1.c.156t.vect
>4
> Without param only 2 loops are vectorized.
>
>Should I simply add a part of tests related to this feature or I must
>delete all not necessary changes also?

Please remove all not necessary changes.

Richard.

>Thanks.
>Yuri.
>
>2016-11-14 16:40 GMT+03:00 Richard Biener <rguenther@suse.de>:
>> On Mon, 14 Nov 2016, Yuri Rumyantsev wrote:
>>
>>> Richard,
>>>
>>> In my previous patch I forgot to remove couple lines related to aux
>field.
>>> Here is the correct updated patch.
>>
>> Yeah, I noticed.  This patch would be ok for trunk (together with
>> necessary parts from 1 and 2) if all not required parts are removed
>> (and you'd add the testcases covering non-masked tail vect).
>>
>> Thus, can you please produce a single complete patch containing only
>> non-masked epilogue vectorization?
>>
>> Thanks,
>> Richard.
>>
>>> Thanks.
>>> Yuri.
>>>
>>> 2016-11-14 15:51 GMT+03:00 Richard Biener <rguenther@suse.de>:
>>> > On Fri, 11 Nov 2016, Yuri Rumyantsev wrote:
>>> >
>>> >> Richard,
>>> >>
>>> >> I prepare updated 3 patch with passing additional argument to
>>> >> vect_analyze_loop as you proposed (untested).
>>> >>
>>> >> You wrote:
>>> >> Btw, I wonder if you can produce a single patch containing just
>>> >> epilogue vectorization, that is combine patches 1-3 but rip out
>>> >> changes only needed by later patches?
>>> >>
>>> >> Did you mean that I exclude all support for vectorization
>epilogues,
>>> >> i.e. exclude from 2-nd patch all non-related changes
>>> >> like
>>> >>
>>> >> diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
>>> >> index 11863af..32011c1 100644
>>> >> --- a/gcc/tree-vect-loop.c
>>> >> +++ b/gcc/tree-vect-loop.c
>>> >> @@ -1120,6 +1120,12 @@ new_loop_vec_info (struct loop *loop)
>>> >>    LOOP_VINFO_PEELING_FOR_GAPS (res) = false;
>>> >>    LOOP_VINFO_PEELING_FOR_NITER (res) = false;
>>> >>    LOOP_VINFO_OPERANDS_SWAPPED (res) = false;
>>> >> +  LOOP_VINFO_CAN_BE_MASKED (res) = false;
>>> >> +  LOOP_VINFO_REQUIRED_MASKS (res) = 0;
>>> >> +  LOOP_VINFO_COMBINE_EPILOGUE (res) = false;
>>> >> +  LOOP_VINFO_MASK_EPILOGUE (res) = false;
>>> >> +  LOOP_VINFO_NEED_MASKING (res) = false;
>>> >> +  LOOP_VINFO_ORIG_LOOP_INFO (res) = NULL;
>>> >
>>> > Yes.
>>> >
>>> >> Did you mean also that new combined patch must be working patch,
>i.e.
>>> >> can be integrated without other patches?
>>> >
>>> > Yes.
>>> >
>>> >> Could you please look at updated patch?
>>> >
>>> > Will do.
>>> >
>>> > Thanks,
>>> > Richard.
>>> >
>>> >> Thanks.
>>> >> Yuri.
>>> >>
>>> >> 2016-11-10 15:36 GMT+03:00 Richard Biener <rguenther@suse.de>:
>>> >> > On Thu, 10 Nov 2016, Richard Biener wrote:
>>> >> >
>>> >> >> On Tue, 8 Nov 2016, Yuri Rumyantsev wrote:
>>> >> >>
>>> >> >> > Richard,
>>> >> >> >
>>> >> >> > Here is updated 3 patch.
>>> >> >> >
>>> >> >> > I checked that all new tests related to epilogue
>vectorization passed with it.
>>> >> >> >
>>> >> >> > Your comments will be appreciated.
>>> >> >>
>>> >> >> A lot better now.  Instead of the ->aux dance I now prefer to
>>> >> >> pass the original loops loop_vinfo to vect_analyze_loop as
>>> >> >> optional argument (if non-NULL we analyze the epilogue of that
>>> >> >> loop_vinfo).  OTOH I remember we mainly use it to get at the
>>> >> >> original vectorization factor?  So we can pass down an
>(optional)
>>> >> >> forced vectorization factor as well?
>>> >> >
>>> >> > Btw, I wonder if you can produce a single patch containing just
>>> >> > epilogue vectorization, that is combine patches 1-3 but rip out
>>> >> > changes only needed by later patches?
>>> >> >
>>> >> > Thanks,
>>> >> > Richard.
>>> >> >
>>> >> >> Richard.
>>> >> >>
>>> >> >> > 2016-11-08 15:38 GMT+03:00 Richard Biener
><rguenther@suse.de>:
>>> >> >> > > On Thu, 3 Nov 2016, Yuri Rumyantsev wrote:
>>> >> >> > >
>>> >> >> > >> Hi Richard,
>>> >> >> > >>
>>> >> >> > >> I did not understand your last remark:
>>> >> >> > >>
>>> >> >> > >> > That is, here (and avoid the FOR_EACH_LOOP change):
>>> >> >> > >> >
>>> >> >> > >> > @@ -580,12 +586,21 @@ vectorize_loops (void)
>>> >> >> > >> >           && dump_enabled_p ())
>>> >> >> > >> >           dump_printf_loc (MSG_OPTIMIZED_LOCATIONS,
>vect_location,
>>> >> >> > >> >                            "loop vectorized\n");
>>> >> >> > >> > -       vect_transform_loop (loop_vinfo);
>>> >> >> > >> > +       new_loop = vect_transform_loop (loop_vinfo);
>>> >> >> > >> >         num_vectorized_loops++;
>>> >> >> > >> >        /* Now that the loop has been vectorized, allow
>it to be unrolled
>>> >> >> > >> >           etc.  */
>>> >> >> > >> >      loop->force_vectorize = false;
>>> >> >> > >> >
>>> >> >> > >> > +       /* Add new loop to a processing queue.  To make
>it easier
>>> >> >> > >> > +          to match loop and its epilogue vectorization
>in dumps
>>> >> >> > >> > +          put new loop as the next loop to process. 
>*/
>>> >> >> > >> > +       if (new_loop)
>>> >> >> > >> > +         {
>>> >> >> > >> > +           loops.safe_insert (i + 1, new_loop->num);
>>> >> >> > >> > +           vect_loops_num = number_of_loops (cfun);
>>> >> >> > >> > +         }
>>> >> >> > >> >
>>> >> >> > >> > simply dispatch to a vectorize_epilogue (loop_vinfo,
>new_loop)
>>> >> >> > >> > function which will set up stuff properly (and also
>perform
>>> >> >> > >> > the if-conversion of the epilogue there).
>>> >> >> > >> >
>>> >> >> > >> > That said, if we can get in non-masked epilogue
>vectorization
>>> >> >> > >> > separately that would be great.
>>> >> >> > >>
>>> >> >> > >> Could you please clarify your proposal.
>>> >> >> > >
>>> >> >> > > When a loop was vectorized set things up to immediately
>vectorize
>>> >> >> > > its epilogue, avoiding changing the loop iteration and
>avoiding
>>> >> >> > > the re-use of ->aux.
>>> >> >> > >
>>> >> >> > > Richard.
>>> >> >> > >
>>> >> >> > >> Thanks.
>>> >> >> > >> Yuri.
>>> >> >> > >>
>>> >> >> > >> 2016-11-02 15:27 GMT+03:00 Richard Biener
><rguenther@suse.de>:
>>> >> >> > >> > On Tue, 1 Nov 2016, Yuri Rumyantsev wrote:
>>> >> >> > >> >
>>> >> >> > >> >> Hi All,
>>> >> >> > >> >>
>>> >> >> > >> >> I re-send all patches sent by Ilya earlier for review
>which support
>>> >> >> > >> >> vectorization of loop epilogues and loops with low
>trip count. We
>>> >> >> > >> >> assume that the only patch -
>vec-tails-07-combine-tail.patch - was not
>>> >> >> > >> >> approved by Jeff.
>>> >> >> > >> >>
>>> >> >> > >> >> I did re-base of all patches and performed
>bootstrapping and
>>> >> >> > >> >> regression testing that did not show any new failures.
>Also all
>>> >> >> > >> >> changes related to new vect_do_peeling algorithm have
>been changed
>>> >> >> > >> >> accordingly.
>>> >> >> > >> >>
>>> >> >> > >> >> Is it OK for trunk?
>>> >> >> > >> >
>>> >> >> > >> > I would have preferred that the series up to
>-03-nomask-tails would
>>> >> >> > >> > _only_ contain epilogue loop vectorization changes but
>unfortunately
>>> >> >> > >> > the patchset is oddly separated.
>>> >> >> > >> >
>>> >> >> > >> > I have a comment on that part nevertheless:
>>> >> >> > >> >
>>> >> >> > >> > @@ -1608,7 +1614,10 @@ vect_enhance_data_refs_alignment
>(loop_vec_info
>>> >> >> > >> > loop_vinfo)
>>> >> >> > >> >    /* Check if we can possibly peel the loop.  */
>>> >> >> > >> >    if (!vect_can_advance_ivs_p (loop_vinfo)
>>> >> >> > >> >        || !slpeel_can_duplicate_loop_p (loop,
>single_exit (loop))
>>> >> >> > >> > -      || loop->inner)
>>> >> >> > >> > +      || loop->inner
>>> >> >> > >> > +      /* Required peeling was performed in prologue
>and
>>> >> >> > >> > +        is not required for epilogue.  */
>>> >> >> > >> > +      || LOOP_VINFO_EPILOGUE_P (loop_vinfo))
>>> >> >> > >> >      do_peeling = false;
>>> >> >> > >> >
>>> >> >> > >> >    if (do_peeling
>>> >> >> > >> > @@ -1888,7 +1897,10 @@ vect_enhance_data_refs_alignment
>(loop_vec_info
>>> >> >> > >> > loop_vinfo)
>>> >> >> > >> >
>>> >> >> > >> >    do_versioning =
>>> >> >> > >> >         optimize_loop_nest_for_speed_p (loop)
>>> >> >> > >> > -       && (!loop->inner); /* FORNOW */
>>> >> >> > >> > +       && (!loop->inner) /* FORNOW */
>>> >> >> > >> > +        /* Required versioning was performed for the
>>> >> >> > >> > +          original loop and is not required for
>epilogue.  */
>>> >> >> > >> > +       && !LOOP_VINFO_EPILOGUE_P (loop_vinfo);
>>> >> >> > >> >
>>> >> >> > >> >    if (do_versioning)
>>> >> >> > >> >      {
>>> >> >> > >> >
>>> >> >> > >> > please do that check in the single caller of this
>function.
>>> >> >> > >> >
>>> >> >> > >> > Otherwise I still dislike the new ->aux use and I
>believe that simply
>>> >> >> > >> > passing down info from the processed parent would be
>_much_ cleaner.
>>> >> >> > >> > That is, here (and avoid the FOR_EACH_LOOP change):
>>> >> >> > >> >
>>> >> >> > >> > @@ -580,12 +586,21 @@ vectorize_loops (void)
>>> >> >> > >> >             && dump_enabled_p ())
>>> >> >> > >> >            dump_printf_loc (MSG_OPTIMIZED_LOCATIONS,
>vect_location,
>>> >> >> > >> >                             "loop vectorized\n");
>>> >> >> > >> > -       vect_transform_loop (loop_vinfo);
>>> >> >> > >> > +       new_loop = vect_transform_loop (loop_vinfo);
>>> >> >> > >> >         num_vectorized_loops++;
>>> >> >> > >> >         /* Now that the loop has been vectorized, allow
>it to be unrolled
>>> >> >> > >> >            etc.  */
>>> >> >> > >> >         loop->force_vectorize = false;
>>> >> >> > >> >
>>> >> >> > >> > +       /* Add new loop to a processing queue.  To make
>it easier
>>> >> >> > >> > +          to match loop and its epilogue vectorization
>in dumps
>>> >> >> > >> > +          put new loop as the next loop to process. 
>*/
>>> >> >> > >> > +       if (new_loop)
>>> >> >> > >> > +         {
>>> >> >> > >> > +           loops.safe_insert (i + 1, new_loop->num);
>>> >> >> > >> > +           vect_loops_num = number_of_loops (cfun);
>>> >> >> > >> > +         }
>>> >> >> > >> >
>>> >> >> > >> > simply dispatch to a vectorize_epilogue (loop_vinfo,
>new_loop)
>>> >> >> > >> > function which will set up stuff properly (and also
>perform
>>> >> >> > >> > the if-conversion of the epilogue there).
>>> >> >> > >> >
>>> >> >> > >> > That said, if we can get in non-masked epilogue
>vectorization
>>> >> >> > >> > separately that would be great.
>>> >> >> > >> >
>>> >> >> > >> > I'm still torn about all the rest of the stuff and
>question its
>>> >> >> > >> > usability (esp. merging the epilogue with the main
>vector loop).
>>> >> >> > >> > But it has already been approved ... oh well.
>>> >> >> > >> >
>>> >> >> > >> > Thanks,
>>> >> >> > >> > Richard.
>>> >> >> > >>
>>> >> >> > >>
>>> >> >> > >
>>> >> >> > > --
>>> >> >> > > Richard Biener <rguenther@suse.de>
>>> >> >> > > SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard,
>Graham Norton, HRB 21284 (AG Nuernberg)
>>> >> >> >
>>> >> >>
>>> >> >>
>>> >> >
>>> >> > --
>>> >> > Richard Biener <rguenther@suse.de>
>>> >> > SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham
>Norton, HRB 21284 (AG Nuernberg)
>>> >>
>>> >
>>> > --
>>> > Richard Biener <rguenther@suse.de>
>>> > SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham
>Norton, HRB 21284 (AG Nuernberg)
>>>
>>
>> --
>> Richard Biener <rguenther@suse.de>
>> SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham
>Norton, HRB 21284 (AG Nuernberg)


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH, vec-tails] Support loop epilogue vectorization
  2016-11-14 17:59                       ` Richard Biener
@ 2016-11-15 14:42                         ` Yuri Rumyantsev
  2016-11-16  9:56                           ` Richard Biener
  2016-11-18 13:20                           ` Christophe Lyon
  0 siblings, 2 replies; 38+ messages in thread
From: Yuri Rumyantsev @ 2016-11-15 14:42 UTC (permalink / raw)
  To: Richard Biener; +Cc: Jeff Law, gcc-patches, Ilya Enkovich

[-- Attachment #1: Type: text/plain, Size: 13876 bytes --]

Hi All,

Here is the patch for non-masked epilogue vectorization.

Bootstrap and regression testing did not show any new failures.

Is it OK for trunk?

Thanks.
Changelog:

2016-11-15  Yuri Rumyantsev  <ysrumyan@gmail.com>

* params.def (PARAM_VECT_EPILOGUES_NOMASK): New.
* tree-if-conv.c (tree_if_conversion): Make public.
* tree-if-conv.h: New file.
* tree-vect-data-refs.c (vect_analyze_data_ref_dependences): Avoid
dynamic alias checks for epilogues.
* tree-vect-loop-manip.c (vect_do_peeling): Return created epilogue.
* tree-vect-loop.c: Include tree-if-conv.h.
(new_loop_vec_info): Add zeroing orig_loop_info field.
(vect_analyze_loop_2): Don't try to enhance alignment for epilogues.
(vect_analyze_loop): Add argument ORIG_LOOP_INFO which is not NULL
if epilogue is vectorized, set up orig_loop_info field of loop_vinfo
using passed argument.
(vect_transform_loop): Check if created epilogue should be returned
for further vectorization with less vf.  If-convert epilogue if
required. Print vectorization success for epilogue.
* tree-vectorizer.c (vectorize_loops): Add epilogue vectorization
if it is required, pass loop_vinfo produced during vectorization of
loop body to vect_analyze_loop.
* tree-vectorizer.h (struct _loop_vec_info): Add new field
orig_loop_info.
(LOOP_VINFO_ORIG_LOOP_INFO): New.
(LOOP_VINFO_EPILOGUE_P): New.
(LOOP_VINFO_ORIG_VECT_FACTOR): New.
(vect_do_peeling): Change prototype to return epilogue.
(vect_analyze_loop): Add argument of loop_vec_info type.
(vect_transform_loop): Return created loop.

gcc/testsuite/

* lib/target-supports.exp (check_avx2_hw_available): New.
(check_effective_target_avx2_runtime): New.
* gcc.dg/vect/vect-tail-nomask-1.c: New test.
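
The control flow the ChangeLog describes (the transform step hands back
the epilogue, and the driver analyzes it again with the original loop's
info passed down) can be modeled in a few lines. To be clear, this is a
toy model and not code from the patch: every toy_* name is made up, and
both the starting factor of 8 and the "halve the factor for the epilogue"
rule are assumptions standing in for "further vectorization with less vf".

```c
#include <stddef.h>

/* Toy stand-ins; none of these are GCC types or functions.  */
struct toy_loop
{
  unsigned niters;              /* Iterations still left in this loop.  */
};

struct toy_vinfo
{
  unsigned vf;                  /* Chosen vectorization factor.  */
  const struct toy_vinfo *orig; /* Non-NULL when analyzing an epilogue.  */
};

/* Analyze LOOP.  ORIG is NULL for an ordinary loop and points to the
   parent's info for an epilogue.  Returns 1 on success, 0 otherwise.  */
static int
toy_analyze (const struct toy_loop *loop, const struct toy_vinfo *orig,
             struct toy_vinfo *out)
{
  /* Assumption: an epilogue gets half of the original factor.  */
  unsigned vf = orig ? orig->vf / 2 : 8;

  if (vf < 2 || loop->niters < vf)
    return 0;
  out->vf = vf;
  out->orig = orig;
  return 1;
}

/* "Transform" LOOP: keep whole vector iterations in it and move the
   remainder into EPILOGUE.  Returns 1 if an epilogue was created.  */
static int
toy_transform (struct toy_loop *loop, const struct toy_vinfo *vinfo,
               struct toy_loop *epilogue)
{
  epilogue->niters = loop->niters % vinfo->vf;
  loop->niters -= epilogue->niters;
  return epilogue->niters != 0;
}

/* Driver: vectorize a loop and then, immediately, its epilogue.
   Returns the number of loops that were vectorized (0, 1 or 2).  */
unsigned
toy_vectorize (unsigned niters)
{
  struct toy_loop loop = { niters }, epilogue, rest;
  struct toy_vinfo vinfo, epilogue_vinfo;
  unsigned vectorized = 0;

  if (!toy_analyze (&loop, NULL, &vinfo))
    return 0;
  vectorized++;

  /* The parent's info is passed down explicitly instead of being
     stashed in a side field of the loop.  */
  if (toy_transform (&loop, &vinfo, &epilogue)
      && toy_analyze (&epilogue, &vinfo, &epilogue_vinfo))
    {
      toy_transform (&epilogue, &epilogue_vinfo, &rest);
      vectorized++;
    }
  return vectorized;
}
```

In this model a trip count of 15 vectorizes both the loop and its
epilogue, while 16 (no remainder) and 10 (remainder too short) vectorize
only the main loop.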


2016-11-14 20:04 GMT+03:00 Richard Biener <rguenther@suse.de>:
> On November 14, 2016 4:39:40 PM GMT+01:00, Yuri Rumyantsev <ysrumyan@gmail.com> wrote:
>>Richard,
>>
>>I checked one of the tests designed for epilogue vectorization using
>>patches 1 - 3 and found out that build compiler performs vectorization
>>of epilogues with --param vect-epilogues-nomask=1 passed:
>>
>>$ gcc -Ofast -mavx2 t1.c -S --param vect-epilogues-nomask=1 -o
>>t1.new-nomask.s -fdump-tree-vect-details
>>$ grep VECTORIZED -c t1.c.156t.vect
>>4
>> Without param only 2 loops are vectorized.
>>
>>Should I simply add a part of tests related to this feature or I must
>>delete all not necessary changes also?
>
> Please remove all not necessary changes.
>
> Richard.
>
>>Thanks.
>>Yuri.
>>
>>2016-11-14 16:40 GMT+03:00 Richard Biener <rguenther@suse.de>:
>>> On Mon, 14 Nov 2016, Yuri Rumyantsev wrote:
>>>
>>>> Richard,
>>>>
>>>> In my previous patch I forgot to remove couple lines related to aux
>>field.
>>>> Here is the correct updated patch.
>>>
>>> Yeah, I noticed.  This patch would be ok for trunk (together with
>>> necessary parts from 1 and 2) if all not required parts are removed
>>> (and you'd add the testcases covering non-masked tail vect).
>>>
>>> Thus, can you please produce a single complete patch containing only
>>> non-masked epilogue vectorization?
>>>
>>> Thanks,
>>> Richard.
>>>
>>>> Thanks.
>>>> Yuri.
>>>>
>>>> 2016-11-14 15:51 GMT+03:00 Richard Biener <rguenther@suse.de>:
>>>> > On Fri, 11 Nov 2016, Yuri Rumyantsev wrote:
>>>> >
>>>> >> Richard,
>>>> >>
>>>> >> I prepare updated 3 patch with passing additional argument to
>>>> >> vect_analyze_loop as you proposed (untested).
>>>> >>
>>>> >> You wrote:
>>>> >> Btw, I wonder if you can produce a single patch containing just
>>>> >> epilogue vectorization, that is combine patches 1-3 but rip out
>>>> >> changes only needed by later patches?
>>>> >>
>>>> >> Did you mean that I exclude all support for vectorization
>>epilogues,
>>>> >> i.e. exclude from 2-nd patch all non-related changes
>>>> >> like
>>>> >>
>>>> >> diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
>>>> >> index 11863af..32011c1 100644
>>>> >> --- a/gcc/tree-vect-loop.c
>>>> >> +++ b/gcc/tree-vect-loop.c
>>>> >> @@ -1120,6 +1120,12 @@ new_loop_vec_info (struct loop *loop)
>>>> >>    LOOP_VINFO_PEELING_FOR_GAPS (res) = false;
>>>> >>    LOOP_VINFO_PEELING_FOR_NITER (res) = false;
>>>> >>    LOOP_VINFO_OPERANDS_SWAPPED (res) = false;
>>>> >> +  LOOP_VINFO_CAN_BE_MASKED (res) = false;
>>>> >> +  LOOP_VINFO_REQUIRED_MASKS (res) = 0;
>>>> >> +  LOOP_VINFO_COMBINE_EPILOGUE (res) = false;
>>>> >> +  LOOP_VINFO_MASK_EPILOGUE (res) = false;
>>>> >> +  LOOP_VINFO_NEED_MASKING (res) = false;
>>>> >> +  LOOP_VINFO_ORIG_LOOP_INFO (res) = NULL;
>>>> >
>>>> > Yes.
>>>> >
>>>> >> Did you mean also that new combined patch must be working patch,
>>i.e.
>>>> >> can be integrated without other patches?
>>>> >
>>>> > Yes.
>>>> >
>>>> >> Could you please look at updated patch?
>>>> >
>>>> > Will do.
>>>> >
>>>> > Thanks,
>>>> > Richard.
>>>> >
>>>> >> Thanks.
>>>> >> Yuri.
>>>> >>
>>>> >> 2016-11-10 15:36 GMT+03:00 Richard Biener <rguenther@suse.de>:
>>>> >> > On Thu, 10 Nov 2016, Richard Biener wrote:
>>>> >> >
>>>> >> >> On Tue, 8 Nov 2016, Yuri Rumyantsev wrote:
>>>> >> >>
>>>> >> >> > Richard,
>>>> >> >> >
>>>> >> >> > Here is updated 3 patch.
>>>> >> >> >
>>>> >> >> > I checked that all new tests related to epilogue
>>vectorization passed with it.
>>>> >> >> >
>>>> >> >> > Your comments will be appreciated.
>>>> >> >>
>>>> >> >> A lot better now.  Instead of the ->aux dance I now prefer to
>>>> >> >> pass the original loops loop_vinfo to vect_analyze_loop as
>>>> >> >> optional argument (if non-NULL we analyze the epilogue of that
>>>> >> >> loop_vinfo).  OTOH I remember we mainly use it to get at the
>>>> >> >> original vectorization factor?  So we can pass down an
>>(optional)
>>>> >> >> forced vectorization factor as well?
>>>> >> >
>>>> >> > Btw, I wonder if you can produce a single patch containing just
>>>> >> > epilogue vectorization, that is combine patches 1-3 but rip out
>>>> >> > changes only needed by later patches?
>>>> >> >
>>>> >> > Thanks,
>>>> >> > Richard.
>>>> >> >
>>>> >> >> Richard.
>>>> >> >>
>>>> >> >> > 2016-11-08 15:38 GMT+03:00 Richard Biener
>><rguenther@suse.de>:
>>>> >> >> > > On Thu, 3 Nov 2016, Yuri Rumyantsev wrote:
>>>> >> >> > >
>>>> >> >> > >> Hi Richard,
>>>> >> >> > >>
>>>> >> >> > >> I did not understand your last remark:
>>>> >> >> > >>
>>>> >> >> > >> > That is, here (and avoid the FOR_EACH_LOOP change):
>>>> >> >> > >> >
>>>> >> >> > >> > @@ -580,12 +586,21 @@ vectorize_loops (void)
>>>> >> >> > >> >           && dump_enabled_p ())
>>>> >> >> > >> >           dump_printf_loc (MSG_OPTIMIZED_LOCATIONS,
>>vect_location,
>>>> >> >> > >> >                            "loop vectorized\n");
>>>> >> >> > >> > -       vect_transform_loop (loop_vinfo);
>>>> >> >> > >> > +       new_loop = vect_transform_loop (loop_vinfo);
>>>> >> >> > >> >         num_vectorized_loops++;
>>>> >> >> > >> >        /* Now that the loop has been vectorized, allow
>>it to be unrolled
>>>> >> >> > >> >           etc.  */
>>>> >> >> > >> >      loop->force_vectorize = false;
>>>> >> >> > >> >
>>>> >> >> > >> > +       /* Add new loop to a processing queue.  To make
>>it easier
>>>> >> >> > >> > +          to match loop and its epilogue vectorization
>>in dumps
>>>> >> >> > >> > +          put new loop as the next loop to process.
>>*/
>>>> >> >> > >> > +       if (new_loop)
>>>> >> >> > >> > +         {
>>>> >> >> > >> > +           loops.safe_insert (i + 1, new_loop->num);
>>>> >> >> > >> > +           vect_loops_num = number_of_loops (cfun);
>>>> >> >> > >> > +         }
>>>> >> >> > >> >
>>>> >> >> > >> > simply dispatch to a vectorize_epilogue (loop_vinfo,
>>new_loop)
>>>> >> >> > >> > function which will set up stuff properly (and also
>>perform
>>>> >> >> > >> > the if-conversion of the epilogue there).
>>>> >> >> > >> >
>>>> >> >> > >> > That said, if we can get in non-masked epilogue
>>vectorization
>>>> >> >> > >> > separately that would be great.
>>>> >> >> > >>
>>>> >> >> > >> Could you please clarify your proposal.
>>>> >> >> > >
>>>> >> >> > > When a loop was vectorized set things up to immediately
>>vectorize
>>>> >> >> > > its epilogue, avoiding changing the loop iteration and
>>avoiding
>>>> >> >> > > the re-use of ->aux.
>>>> >> >> > >
>>>> >> >> > > Richard.
>>>> >> >> > >
>>>> >> >> > >> Thanks.
>>>> >> >> > >> Yuri.
>>>> >> >> > >>
>>>> >> >> > >> 2016-11-02 15:27 GMT+03:00 Richard Biener
>><rguenther@suse.de>:
>>>> >> >> > >> > On Tue, 1 Nov 2016, Yuri Rumyantsev wrote:
>>>> >> >> > >> >
>>>> >> >> > >> >> Hi All,
>>>> >> >> > >> >>
>>>> >> >> > >> >> I re-send all patches sent by Ilya earlier for review
>>which support
>>>> >> >> > >> >> vectorization of loop epilogues and loops with low
>>trip count. We
>>>> >> >> > >> >> assume that the only patch -
>>vec-tails-07-combine-tail.patch - was not
>>>> >> >> > >> >> approved by Jeff.
>>>> >> >> > >> >>
>>>> >> >> > >> >> I did re-base of all patches and performed
>>bootstrapping and
>>>> >> >> > >> >> regression testing that did not show any new failures.
>>Also all
>>>> >> >> > >> >> changes related to new vect_do_peeling algorithm have
>>been changed
>>>> >> >> > >> >> accordingly.
>>>> >> >> > >> >>
>>>> >> >> > >> >> Is it OK for trunk?
>>>> >> >> > >> >
>>>> >> >> > >> > I would have preferred that the series up to
>>-03-nomask-tails would
>>>> >> >> > >> > _only_ contain epilogue loop vectorization changes but
>>unfortunately
>>>> >> >> > >> > the patchset is oddly separated.
>>>> >> >> > >> >
>>>> >> >> > >> > I have a comment on that part nevertheless:
>>>> >> >> > >> >
>>>> >> >> > >> > @@ -1608,7 +1614,10 @@ vect_enhance_data_refs_alignment
>>(loop_vec_info
>>>> >> >> > >> > loop_vinfo)
>>>> >> >> > >> >    /* Check if we can possibly peel the loop.  */
>>>> >> >> > >> >    if (!vect_can_advance_ivs_p (loop_vinfo)
>>>> >> >> > >> >        || !slpeel_can_duplicate_loop_p (loop,
>>single_exit (loop))
>>>> >> >> > >> > -      || loop->inner)
>>>> >> >> > >> > +      || loop->inner
>>>> >> >> > >> > +      /* Required peeling was performed in prologue
>>and
>>>> >> >> > >> > +        is not required for epilogue.  */
>>>> >> >> > >> > +      || LOOP_VINFO_EPILOGUE_P (loop_vinfo))
>>>> >> >> > >> >      do_peeling = false;
>>>> >> >> > >> >
>>>> >> >> > >> >    if (do_peeling
>>>> >> >> > >> > @@ -1888,7 +1897,10 @@ vect_enhance_data_refs_alignment
>>(loop_vec_info
>>>> >> >> > >> > loop_vinfo)
>>>> >> >> > >> >
>>>> >> >> > >> >    do_versioning =
>>>> >> >> > >> >         optimize_loop_nest_for_speed_p (loop)
>>>> >> >> > >> > -       && (!loop->inner); /* FORNOW */
>>>> >> >> > >> > +       && (!loop->inner) /* FORNOW */
>>>> >> >> > >> > +        /* Required versioning was performed for the
>>>> >> >> > >> > +          original loop and is not required for
>>epilogue.  */
>>>> >> >> > >> > +       && !LOOP_VINFO_EPILOGUE_P (loop_vinfo);
>>>> >> >> > >> >
>>>> >> >> > >> >    if (do_versioning)
>>>> >> >> > >> >      {
>>>> >> >> > >> >
>>>> >> >> > >> > please do that check in the single caller of this
>>function.
>>>> >> >> > >> >
>>>> >> >> > >> > Otherwise I still dislike the new ->aux use and I
>>believe that simply
>>>> >> >> > >> > passing down info from the processed parent would be
>>_much_ cleaner.
>>>> >> >> > >> > That is, here (and avoid the FOR_EACH_LOOP change):
>>>> >> >> > >> >
>>>> >> >> > >> > @@ -580,12 +586,21 @@ vectorize_loops (void)
>>>> >> >> > >> >             && dump_enabled_p ())
>>>> >> >> > >> >            dump_printf_loc (MSG_OPTIMIZED_LOCATIONS,
>>vect_location,
>>>> >> >> > >> >                             "loop vectorized\n");
>>>> >> >> > >> > -       vect_transform_loop (loop_vinfo);
>>>> >> >> > >> > +       new_loop = vect_transform_loop (loop_vinfo);
>>>> >> >> > >> >         num_vectorized_loops++;
>>>> >> >> > >> >         /* Now that the loop has been vectorized, allow
>>it to be unrolled
>>>> >> >> > >> >            etc.  */
>>>> >> >> > >> >         loop->force_vectorize = false;
>>>> >> >> > >> >
>>>> >> >> > >> > +       /* Add new loop to a processing queue.  To make
>>it easier
>>>> >> >> > >> > +          to match loop and its epilogue vectorization
>>in dumps
>>>> >> >> > >> > +          put new loop as the next loop to process.
>>*/
>>>> >> >> > >> > +       if (new_loop)
>>>> >> >> > >> > +         {
>>>> >> >> > >> > +           loops.safe_insert (i + 1, new_loop->num);
>>>> >> >> > >> > +           vect_loops_num = number_of_loops (cfun);
>>>> >> >> > >> > +         }
>>>> >> >> > >> >
>>>> >> >> > >> > simply dispatch to a vectorize_epilogue (loop_vinfo,
>>new_loop)
>>>> >> >> > >> > function which will set up stuff properly (and also
>>perform
>>>> >> >> > >> > the if-conversion of the epilogue there).
>>>> >> >> > >> >
>>>> >> >> > >> > That said, if we can get in non-masked epilogue
>>vectorization
>>>> >> >> > >> > separately that would be great.
>>>> >> >> > >> >
>>>> >> >> > >> > I'm still torn about all the rest of the stuff and
>>question its
>>>> >> >> > >> > usability (esp. merging the epilogue with the main
>>vector loop).
>>>> >> >> > >> > But it has already been approved ... oh well.
>>>> >> >> > >> >
>>>> >> >> > >> > Thanks,
>>>> >> >> > >> > Richard.
>>>> >> >> > >>
>>>> >> >> > >>
>>>> >> >> > >
>>>> >> >> > > --
>>>> >> >> > > Richard Biener <rguenther@suse.de>
>>>> >> >> > > SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard,
>>Graham Norton, HRB 21284 (AG Nuernberg)
>>>> >> >> >
>>>> >> >>
>>>> >> >>
>>>> >> >
>>>> >> > --
>>>> >> > Richard Biener <rguenther@suse.de>
>>>> >> > SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham
>>Norton, HRB 21284 (AG Nuernberg)
>>>> >>
>>>> >
>>>> > --
>>>> > Richard Biener <rguenther@suse.de>
>>>> > SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham
>>Norton, HRB 21284 (AG Nuernberg)
>>>>
>>>
>>> --
>>> Richard Biener <rguenther@suse.de>
>>> SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham
>>Norton, HRB 21284 (AG Nuernberg)
>
>

[-- Attachment #2: patch.1-3 --]
[-- Type: application/octet-stream, Size: 19559 bytes --]

diff --git a/gcc/params.def b/gcc/params.def
index ab3eb3d..8025efa 100644
--- a/gcc/params.def
+++ b/gcc/params.def
@@ -1264,6 +1264,11 @@ DEFPARAM (PARAM_MAX_VRP_SWITCH_ASSERTIONS,
 	  "edge of a switch statement during VRP",
 	  10, 0, 0)
 
+DEFPARAM (PARAM_VECT_EPILOGUES_NOMASK,
+	  "vect-epilogues-nomask",
+	  "Enable loop epilogue vectorization using smaller vector size.",
+	  0, 0, 1)
+
 /*
 
 Local variables:
diff --git a/gcc/testsuite/gcc.dg/vect/vect-tail-nomask-1.c b/gcc/testsuite/gcc.dg/vect/vect-tail-nomask-1.c
new file mode 100644
index 0000000..dc016bb
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-tail-nomask-1.c
@@ -0,0 +1,106 @@
+/* { dg-do run } */
+/* { dg-require-weak "" } */
+/* { dg-additional-options "--param vect-epilogues-nomask=1 -mavx2" { target avx2_runtime } } */
+
+#define SIZE 1023
+#define ALIGN 64
+
+extern int posix_memalign(void **memptr, __SIZE_TYPE__ alignment, __SIZE_TYPE__ size) __attribute__((weak));
+extern void free (void *);
+
+void __attribute__((noinline))
+test_citer (int * __restrict__ a,
+	    int * __restrict__ b,
+	    int * __restrict__ c)
+{
+  int i;
+
+  a = (int *)__builtin_assume_aligned (a, ALIGN);
+  b = (int *)__builtin_assume_aligned (b, ALIGN);
+  c = (int *)__builtin_assume_aligned (c, ALIGN);
+
+  for (i = 0; i < SIZE; i++)
+    c[i] = a[i] + b[i];
+}
+
+void __attribute__((noinline))
+test_viter (int * __restrict__ a,
+	    int * __restrict__ b,
+	    int * __restrict__ c,
+	    int size)
+{
+  int i;
+
+  a = (int *)__builtin_assume_aligned (a, ALIGN);
+  b = (int *)__builtin_assume_aligned (b, ALIGN);
+  c = (int *)__builtin_assume_aligned (c, ALIGN);
+
+  for (i = 0; i < size; i++)
+    c[i] = a[i] + b[i];
+}
+
+void __attribute__((noinline))
+init_data (int * __restrict__ a,
+	   int * __restrict__ b,
+	   int * __restrict__ c,
+	   int size)
+{
+  for (int i = 0; i < size; i++)
+    {
+      a[i] = i;
+      b[i] = -i;
+      c[i] = 0;
+      asm volatile("": : :"memory");
+    }
+  a[size] = b[size] = c[size] = size;
+}
+
+
+void __attribute__((noinline))
+run_test ()
+{
+  int *a;
+  int *b;
+  int *c;
+  int i;
+
+  if (posix_memalign ((void **)&a, ALIGN, (SIZE + 1) * sizeof (int)) != 0)
+    return;
+  if (posix_memalign ((void **)&b, ALIGN, (SIZE + 1) * sizeof (int)) != 0)
+    return;
+  if (posix_memalign ((void **)&c, ALIGN, (SIZE + 1) * sizeof (int)) != 0)
+    return;
+
+  init_data (a, b, c, SIZE);
+  test_citer (a, b, c);
+  for (i = 0; i < SIZE; i++)
+    if (c[i] != a[i] + b[i])
+      __builtin_abort ();
+  if (a[SIZE] != SIZE || b[SIZE] != SIZE || c[SIZE] != SIZE)
+    __builtin_abort ();
+
+  init_data (a, b, c, SIZE);
+  test_viter (a, b, c, SIZE);
+  for (i = 0; i < SIZE; i++)
+    if (c[i] != a[i] + b[i])
+      __builtin_abort ();
+  if (a[SIZE] != SIZE || b[SIZE] != SIZE || c[SIZE] != SIZE)
+    __builtin_abort ();
+
+  free (a);
+  free (b);
+  free (c);
+}
+
+int
+main (int argc, const char **argv)
+{
+  if (!posix_memalign)
+    return 0;
+
+  run_test ();
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "LOOP VECTORIZED" 2 "vect" { target avx2_runtime } } } */
+/* { dg-final { scan-tree-dump-times "LOOP EPILOGUE VECTORIZED \\(VS=16\\)" 2 "vect" { target avx2_runtime } } } */
diff --git a/gcc/testsuite/lib/target-supports.exp b/gcc/testsuite/lib/target-supports.exp
index 722955a..4dfab68 100644
--- a/gcc/testsuite/lib/target-supports.exp
+++ b/gcc/testsuite/lib/target-supports.exp
@@ -1732,6 +1732,36 @@ proc check_avx_hw_available { } {
     }]
 }
 
+# Return 1 if the target supports executing AVX2 instructions, 0
+# otherwise.  Cache the result.
+
+proc check_avx2_hw_available { } {
+    return [check_cached_effective_target avx2_hw_available {
+	# If this is not the right target then we can skip the test.
+	if { !([istarget x86_64-*-*] || [istarget i?86-*-*]) } {
+	    expr 0
+	} else {
+	    check_runtime_nocache avx2_hw_available {
+		#include "cpuid.h"
+		int main ()
+		{
+		  unsigned int eax, ebx, ecx, edx;
+		  if (!__get_cpuid (1, &eax, &ebx, &ecx, &edx)
+		      || ((ecx & bit_OSXSAVE) != bit_OSXSAVE))
+		    return 1;
+
+		  if (__get_cpuid_max (0, NULL) < 7)
+		    return 1;
+
+		  __cpuid_count (7, 0, eax, ebx, ecx, edx);
+
+		  return (ebx & bit_AVX2) != bit_AVX2;
+		}
+	    } ""
+	}
+    }]
+}
+
 # Return 1 if the target supports running SSE executables, 0 otherwise.
 
 proc check_effective_target_sse_runtime { } {
@@ -1807,6 +1837,17 @@ proc check_effective_target_avx_runtime { } {
     return 0
 }
 
+# Return 1 if the target supports running AVX2 executables, 0 otherwise.
+
+proc check_effective_target_avx2_runtime { } {
+    if { [check_effective_target_avx2]
+	 && [check_avx2_hw_available]
+	 && [check_avx_os_support_available] } {
+	return 1
+    }
+    return 0
+}
+
 # Return 1 if we are compiling for 64-bit PowerPC but we do not use direct
 # move instructions for moves from GPR to FPR.
 
diff --git a/gcc/tree-if-conv.c b/gcc/tree-if-conv.c
index 0a20189..0b86ffe 100644
--- a/gcc/tree-if-conv.c
+++ b/gcc/tree-if-conv.c
@@ -2734,7 +2734,7 @@ ifcvt_local_dce (basic_block bb)
    profitability analysis.  Returns non-zero todo flags when something
    changed.  */
 
-static unsigned int
+unsigned int
 tree_if_conversion (struct loop *loop)
 {
   unsigned int todo = 0;
diff --git a/gcc/tree-if-conv.h b/gcc/tree-if-conv.h
new file mode 100644
index 0000000..3a732c2
--- /dev/null
+++ b/gcc/tree-if-conv.h
@@ -0,0 +1,24 @@
+/* Copyright (C) 2016 Free Software Foundation, Inc.
+
+This file is part of GCC.
+
+GCC is free software; you can redistribute it and/or modify it under
+the terms of the GNU General Public License as published by the Free
+Software Foundation; either version 3, or (at your option) any later
+version.
+
+GCC is distributed in the hope that it will be useful, but WITHOUT ANY
+WARRANTY; without even the implied warranty of MERCHANTABILITY or
+FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
+for more details.
+
+You should have received a copy of the GNU General Public License
+along with GCC; see the file COPYING3.  If not see
+<http://www.gnu.org/licenses/>.  */
+
+#ifndef GCC_TREE_IF_CONV_H
+#define GCC_TREE_IF_CONV_H
+
+unsigned int tree_if_conversion (struct loop *);
+
+#endif  /* GCC_TREE_IF_CONV_H  */
diff --git a/gcc/tree-vect-data-refs.c b/gcc/tree-vect-data-refs.c
index 9346cfe..1fc4966 100644
--- a/gcc/tree-vect-data-refs.c
+++ b/gcc/tree-vect-data-refs.c
@@ -480,9 +480,15 @@ vect_analyze_data_ref_dependences (loop_vec_info loop_vinfo, int *max_vf)
 				LOOP_VINFO_LOOP_NEST (loop_vinfo), true))
     return false;
 
-  FOR_EACH_VEC_ELT (LOOP_VINFO_DDRS (loop_vinfo), i, ddr)
-    if (vect_analyze_data_ref_dependence (ddr, loop_vinfo, max_vf))
-      return false;
+  /* For epilogues we either have no aliases or alias versioning
+     was applied to original loop.  Therefore we may just get max_vf
+     using VF of original loop.  */
+  if (LOOP_VINFO_EPILOGUE_P (loop_vinfo))
+    *max_vf = LOOP_VINFO_ORIG_VECT_FACTOR (loop_vinfo);
+  else
+    FOR_EACH_VEC_ELT (LOOP_VINFO_DDRS (loop_vinfo), i, ddr)
+      if (vect_analyze_data_ref_dependence (ddr, loop_vinfo, max_vf))
+	return false;
 
   return true;
 }
diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c
index 6bfd332..80585ed 100644
--- a/gcc/tree-vect-loop-manip.c
+++ b/gcc/tree-vect-loop-manip.c
@@ -1611,11 +1611,13 @@ slpeel_update_phi_nodes_for_lcssa (struct loop *epilog)
 
    Note this function peels prolog and epilog only if it's necessary,
    as well as guards.
+   Returns created epilogue or NULL.
 
    TODO: Guard for prefer_scalar_loop should be emitted along with
    versioning conditions if loop versioning is needed.  */
 
-void
+
+struct loop *
 vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
 		 tree *niters_vector, int th, bool check_profitability,
 		 bool niters_no_overflow)
@@ -1631,7 +1633,7 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
 			 || LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo));
 
   if (!prolog_peeling && !epilog_peeling)
-    return;
+    return NULL;
 
   prob_vector = 9 * REG_BR_PROB_BASE / 10;
   if ((vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo)) == 2)
@@ -1639,7 +1641,7 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
   prob_prolog = prob_epilog = (vf - 1) * REG_BR_PROB_BASE / vf;
   vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
 
-  struct loop *prolog, *epilog, *loop = LOOP_VINFO_LOOP (loop_vinfo);
+  struct loop *prolog, *epilog = NULL, *loop = LOOP_VINFO_LOOP (loop_vinfo);
   struct loop *first_loop = loop;
   create_lcssa_for_virtual_phi (loop);
   update_ssa (TODO_update_ssa_only_virtuals);
@@ -1821,6 +1823,8 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
     }
   adjust_vec.release ();
   free_original_copy_tables ();
+
+  return epilog;
 }
 
 /* Function vect_create_cond_for_niters_checks.
diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index 9cca9b7..778479d 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -49,6 +49,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "gimple-fold.h"
 #include "cgraph.h"
 #include "tree-cfg.h"
+#include "tree-if-conv.h"
 
 /* Loop Vectorization Pass.
 
@@ -1171,6 +1172,7 @@ new_loop_vec_info (struct loop *loop)
   LOOP_VINFO_PEELING_FOR_GAPS (res) = false;
   LOOP_VINFO_PEELING_FOR_NITER (res) = false;
   LOOP_VINFO_OPERANDS_SWAPPED (res) = false;
+  LOOP_VINFO_ORIG_LOOP_INFO (res) = NULL;
 
   return res;
 }
@@ -2025,15 +2027,20 @@ start_over:
   if (!ok)
     return false;
 
-  /* This pass will decide on using loop versioning and/or loop peeling in
-     order to enhance the alignment of data references in the loop.  */
-  ok = vect_enhance_data_refs_alignment (loop_vinfo);
-  if (!ok)
+  /* Do not invoke vect_enhance_data_refs_alignment for epilogue
+     vectorization.  */
+  if (!LOOP_VINFO_EPILOGUE_P (loop_vinfo))
     {
-      if (dump_enabled_p ())
-	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
-			 "bad data alignment.\n");
-      return false;
+    /* This pass will decide on using loop versioning and/or loop peeling in
+       order to enhance the alignment of data references in the loop.  */
+    ok = vect_enhance_data_refs_alignment (loop_vinfo);
+    if (!ok)
+      {
+	if (dump_enabled_p ())
+	  dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+			   "bad data alignment.\n");
+        return false;
+      }
     }
 
   if (slp)
@@ -2287,9 +2294,10 @@ again:
 
    Apply a set of analyses on LOOP, and create a loop_vec_info struct
    for it.  The different analyses will record information in the
-   loop_vec_info struct.  */
+   loop_vec_info struct.  If ORIG_LOOP_VINFO is not NULL epilogue must
+   be vectorized.  */
 loop_vec_info
-vect_analyze_loop (struct loop *loop)
+vect_analyze_loop (struct loop *loop, loop_vec_info orig_loop_vinfo)
 {
   loop_vec_info loop_vinfo;
   unsigned int vector_sizes;
@@ -2325,6 +2333,10 @@ vect_analyze_loop (struct loop *loop)
 	}
 
       bool fatal = false;
+
+      if (orig_loop_vinfo)
+	LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo) = orig_loop_vinfo;
+
       if (vect_analyze_loop_2 (loop_vinfo, fatal))
 	{
 	  LOOP_VINFO_VECTORIZABLE_P (loop_vinfo) = 1;
@@ -6657,12 +6669,14 @@ loop_niters_no_overflow (loop_vec_info loop_vinfo)
 
    The analysis phase has determined that the loop is vectorizable.
    Vectorize the loop - created vectorized stmts to replace the scalar
-   stmts in the loop, and update the loop exit condition.  */
+   stmts in the loop, and update the loop exit condition.
+   Returns scalar epilogue loop if any.  */
 
-void
+struct loop *
 vect_transform_loop (loop_vec_info loop_vinfo)
 {
   struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
+  struct loop *epilogue = NULL;
   basic_block *bbs = LOOP_VINFO_BBS (loop_vinfo);
   int nbbs = loop->num_nodes;
   int i;
@@ -6741,8 +6755,8 @@ vect_transform_loop (loop_vec_info loop_vinfo)
   LOOP_VINFO_NITERS_UNCHANGED (loop_vinfo) = niters;
   tree nitersm1 = unshare_expr (LOOP_VINFO_NITERSM1 (loop_vinfo));
   bool niters_no_overflow = loop_niters_no_overflow (loop_vinfo);
-  vect_do_peeling (loop_vinfo, niters, nitersm1, &niters_vector, th,
-		   check_profitability, niters_no_overflow);
+  epilogue = vect_do_peeling (loop_vinfo, niters, nitersm1, &niters_vector, th,
+			      check_profitability, niters_no_overflow);
   if (niters_vector == NULL_TREE)
     {
       if (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo))
@@ -7028,12 +7042,19 @@ vect_transform_loop (loop_vec_info loop_vinfo)
 
   if (dump_enabled_p ())
     {
-      dump_printf_loc (MSG_NOTE, vect_location,
-		       "LOOP VECTORIZED\n");
-      if (loop->inner)
+      if (!LOOP_VINFO_EPILOGUE_P (loop_vinfo))
+	{
+	  dump_printf_loc (MSG_NOTE, vect_location,
+			   "LOOP VECTORIZED\n");
+	  if (loop->inner)
+	    dump_printf_loc (MSG_NOTE, vect_location,
+			     "OUTER LOOP VECTORIZED\n");
+	  dump_printf (MSG_NOTE, "\n");
+	}
+      else
 	dump_printf_loc (MSG_NOTE, vect_location,
-			 "OUTER LOOP VECTORIZED\n");
-      dump_printf (MSG_NOTE, "\n");
+			 "LOOP EPILOGUE VECTORIZED (VS=%d)\n",
+			 current_vector_size);
     }
 
   /* Free SLP instances here because otherwise stmt reference counting
@@ -7045,6 +7066,49 @@ vect_transform_loop (loop_vec_info loop_vinfo)
   /* Clear-up safelen field since its value is invalid after vectorization
      since vectorized loop can have loop-carried dependencies.  */
   loop->safelen = 0;
+
+  /* Don't vectorize epilogue for epilogue.  */
+  if (LOOP_VINFO_EPILOGUE_P (loop_vinfo))
+    epilogue = NULL;
+
+  if (epilogue)
+    {
+	unsigned int vector_sizes
+	  = targetm.vectorize.autovectorize_vector_sizes ();
+	vector_sizes &= current_vector_size - 1;
+
+	if (!PARAM_VALUE (PARAM_VECT_EPILOGUES_NOMASK))
+	  epilogue = NULL;
+	else if (!vector_sizes)
+	  epilogue = NULL;
+	else if (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
+		 && LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo) >= 0)
+	  {
+	    int smallest_vec_size = 1 << ctz_hwi (vector_sizes);
+	    int ratio = current_vector_size / smallest_vec_size;
+	    int eiters = LOOP_VINFO_INT_NITERS (loop_vinfo)
+	      - LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo);
+	    eiters = eiters % vf;
+
+	    epilogue->nb_iterations_upper_bound = eiters - 1;
+
+	    if (eiters < vf / ratio)
+	      epilogue = NULL;
+	    }
+    }
+
+  if (epilogue)
+    {
+      epilogue->force_vectorize = loop->force_vectorize;
+      epilogue->safelen = loop->safelen;
+      epilogue->dont_vectorize = false;
+
+      /* We may need to if-convert epilogue to vectorize it.  */
+      if (LOOP_VINFO_SCALAR_LOOP (loop_vinfo))
+	tree_if_conversion (epilogue);
+    }
+
+  return epilogue;
 }
 
 /* The code below is trying to perform simple optimization - revert
diff --git a/gcc/tree-vectorizer.c b/gcc/tree-vectorizer.c
index 22e587a..35d7a3e 100644
--- a/gcc/tree-vectorizer.c
+++ b/gcc/tree-vectorizer.c
@@ -514,6 +514,7 @@ vectorize_loops (void)
   hash_table<simd_array_to_simduid> *simd_array_to_simduid_htab = NULL;
   bool any_ifcvt_loops = false;
   unsigned ret = 0;
+  struct loop *new_loop;
 
   vect_loops_num = number_of_loops (cfun);
 
@@ -538,7 +539,8 @@ vectorize_loops (void)
 	      && optimize_loop_nest_for_speed_p (loop))
 	     || loop->force_vectorize)
       {
-	loop_vec_info loop_vinfo;
+	loop_vec_info loop_vinfo, orig_loop_vinfo = NULL;
+vectorize_epilogue:
 	vect_location = find_loop_location (loop);
         if (LOCATION_LOCUS (vect_location) != UNKNOWN_LOCATION
 	    && dump_enabled_p ())
@@ -546,7 +548,7 @@ vectorize_loops (void)
                        LOCATION_FILE (vect_location),
 		       LOCATION_LINE (vect_location));
 
-	loop_vinfo = vect_analyze_loop (loop);
+	loop_vinfo = vect_analyze_loop (loop, orig_loop_vinfo);
 	loop->aux = loop_vinfo;
 
 	if (!loop_vinfo || !LOOP_VINFO_VECTORIZABLE_P (loop_vinfo))
@@ -580,7 +582,7 @@ vectorize_loops (void)
 	    && dump_enabled_p ())
           dump_printf_loc (MSG_OPTIMIZED_LOCATIONS, vect_location,
                            "loop vectorized\n");
-	vect_transform_loop (loop_vinfo);
+	new_loop = vect_transform_loop (loop_vinfo);
 	num_vectorized_loops++;
 	/* Now that the loop has been vectorized, allow it to be unrolled
 	   etc.  */
@@ -602,6 +604,15 @@ vectorize_loops (void)
 	    fold_loop_vectorized_call (loop_vectorized_call, boolean_true_node);
 	    ret |= TODO_cleanup_cfg;
 	  }
+
+	if (new_loop)
+	  {
+	    /* Epilogue of vectorized loop must be vectorized too.  */
+	    vect_loops_num = number_of_loops (cfun);
+	    loop = new_loop;
+	    orig_loop_vinfo = loop_vinfo;  /* To pass vect_analyze_loop.  */
+	    goto vectorize_epilogue;
+	  }
       }
 
   vect_location = UNKNOWN_LOCATION;
diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index 3866548..4450a19 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -335,6 +335,10 @@ typedef struct _loop_vec_info : public vec_info {
   /* Mark loops having masked stores.  */
   bool has_mask_store;
 
+  /* For loops being epilogues of already vectorized loops
+     this points to the original vectorized loop.  Otherwise NULL.  */
+  _loop_vec_info *orig_loop_info;
+
 } *loop_vec_info;
 
 /* Access Functions.  */
@@ -374,6 +378,7 @@ typedef struct _loop_vec_info : public vec_info {
 #define LOOP_VINFO_HAS_MASK_STORE(L)       (L)->has_mask_store
 #define LOOP_VINFO_SCALAR_ITERATION_COST(L) (L)->scalar_cost_vec
 #define LOOP_VINFO_SINGLE_SCALAR_ITERATION_COST(L) (L)->single_scalar_iteration_cost
+#define LOOP_VINFO_ORIG_LOOP_INFO(L)       (L)->orig_loop_info
 
 #define LOOP_REQUIRES_VERSIONING_FOR_ALIGNMENT(L)	\
   ((L)->may_misalign_stmts.length () > 0)
@@ -389,6 +394,12 @@ typedef struct _loop_vec_info : public vec_info {
 #define LOOP_VINFO_NITERS_KNOWN_P(L)          \
   (tree_fits_shwi_p ((L)->num_iters) && tree_to_shwi ((L)->num_iters) > 0)
 
+#define LOOP_VINFO_EPILOGUE_P(L) \
+  (LOOP_VINFO_ORIG_LOOP_INFO (L) != NULL)
+
+#define LOOP_VINFO_ORIG_VECT_FACTOR(L) \
+  (LOOP_VINFO_VECT_FACTOR (LOOP_VINFO_ORIG_LOOP_INFO (L)))
+
 static inline loop_vec_info
 loop_vec_info_for_loop (struct loop *loop)
 {
@@ -1032,8 +1043,8 @@ extern bool slpeel_can_duplicate_loop_p (const struct loop *, const_edge);
 struct loop *slpeel_tree_duplicate_loop_to_edge_cfg (struct loop *,
 						     struct loop *, edge);
 extern void vect_loop_versioning (loop_vec_info, unsigned int, bool);
-extern void vect_do_peeling (loop_vec_info, tree, tree,
-			     tree *, int, bool, bool);
+extern struct loop *vect_do_peeling (loop_vec_info, tree, tree,
+				     tree *, int, bool, bool);
 extern source_location find_loop_location (struct loop *);
 extern bool vect_can_advance_ivs_p (loop_vec_info);
 
@@ -1144,11 +1155,11 @@ extern void destroy_loop_vec_info (loop_vec_info, bool);
 extern gimple *vect_force_simple_reduction (loop_vec_info, gimple *, bool,
 					    bool *, bool);
 /* Drive for loop analysis stage.  */
-extern loop_vec_info vect_analyze_loop (struct loop *);
+extern loop_vec_info vect_analyze_loop (struct loop *, loop_vec_info);
 extern tree vect_build_loop_niters (loop_vec_info);
 extern void vect_gen_vector_loop_niters (loop_vec_info, tree, tree *, bool);
 /* Drive for loop transformation stage.  */
-extern void vect_transform_loop (loop_vec_info);
+extern struct loop *vect_transform_loop (loop_vec_info);
 extern loop_vec_info vect_analyze_loop_form (struct loop *);
 extern bool vectorizable_live_operation (gimple *, gimple_stmt_iterator *,
 					 slp_tree, int, gimple **);

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH, vec-tails] Support loop epilogue vectorization
  2016-11-15 14:42                         ` Yuri Rumyantsev
@ 2016-11-16  9:56                           ` Richard Biener
  2016-11-18 13:20                           ` Christophe Lyon
  1 sibling, 0 replies; 38+ messages in thread
From: Richard Biener @ 2016-11-16  9:56 UTC (permalink / raw)
  To: Yuri Rumyantsev; +Cc: Jeff Law, gcc-patches, Ilya Enkovich

On Tue, 15 Nov 2016, Yuri Rumyantsev wrote:

> Hi All,
> 
> Here is the patch for non-masked epilogue vectorization.
> 
> Bootstrap and regression testing did not show any new failures.
> 
> Is it OK for trunk?

Ok for trunk.

I believe we ultimately want to remove the new
--param and enable this by default with a better cost model.
What immediately comes to my mind when seeing

+  if (epilogue)
+    {
+       unsigned int vector_sizes
+         = targetm.vectorize.autovectorize_vector_sizes ();
+       vector_sizes &= current_vector_size - 1;
+
+       if (!PARAM_VALUE (PARAM_VECT_EPILOGUES_NOMASK))
+         epilogue = NULL;
+       else if (!vector_sizes)
+         epilogue = NULL;
+       else if (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
+                && LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo) >= 0)
+         {
+           int smallest_vec_size = 1 << ctz_hwi (vector_sizes);
+           int ratio = current_vector_size / smallest_vec_size;
+           int eiters = LOOP_VINFO_INT_NITERS (loop_vinfo)
+             - LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo);
+           eiters = eiters % vf;
+
+           epilogue->nb_iterations_upper_bound = eiters - 1;
+
+           if (eiters < vf / ratio)
+             epilogue = NULL;
+           }

is that we have targetm.vectorize.preferred_simd_mode which
for example with -mprefer-avx128 will first try with 128bit
vectorization.  So if a non-preferred vector size ends up
creating the epilogue we know vectorizing with the preferred
size will fail.  The above also does not take into account
peeling for gaps in which case we know the epilogue runs at
least VF times (but the vectorized epilogue might also need
an epilogue itself if not masked).
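
To make the arithmetic in the quoted hunk concrete, here is a minimal
standalone sketch of the known-iteration-count case.  It is not GCC code:
lowest_set_bit and epilogue_worth_vectorizing are hypothetical names
introduced only for illustration, the peeling amount is assumed to be known,
and the 32|16 size mask used in the usage note is an assumption about what an
AVX2 target reports.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helper, not part of GCC: index of the lowest set bit,
   standing in for ctz_hwi.  MASK must be non-zero.  */
static unsigned
lowest_set_bit (unsigned mask)
{
  unsigned n = 0;
  assert (mask != 0);
  while ((mask & 1u) == 0)
    {
      mask >>= 1;
      n++;
    }
  return n;
}

/* Sketch of the quoted check for a known iteration count.
   VECTOR_SIZES is a bitmask of supported vector sizes in bytes,
   CUR_SIZE the (power of two) size used for the main loop, VF its
   vectorization factor, NITERS the scalar iteration count and PEEL
   the number of iterations peeled for alignment.  */
static bool
epilogue_worth_vectorizing (unsigned vector_sizes, unsigned cur_size,
			    unsigned vf, unsigned niters, unsigned peel)
{
  assert (cur_size != 0 && (cur_size & (cur_size - 1)) == 0);
  assert (vf != 0 && peel <= niters);

  /* Keep only the sizes strictly smaller than the current one.  */
  vector_sizes &= cur_size - 1;
  if (vector_sizes == 0)
    return false;

  unsigned smallest = 1u << lowest_set_bit (vector_sizes);
  unsigned ratio = cur_size / smallest;

  /* Iterations left over for the epilogue after the main vector loop.  */
  unsigned eiters = (niters - peel) % vf;

  /* Require at least one full iteration at the smallest vector size.  */
  return eiters >= vf / ratio;
}
```

With the values from vect-tail-nomask-1.c (1023 int iterations, 32-byte
vectors, VF = 8, assumed size mask 32|16) the epilogue has 7 iterations left
and the threshold is 8 / 2 = 4, so the sketch keeps the epilogue, which is
consistent with the "VS=16" dump line the test scans for.  Note that, like
the hunk, it ignores both the preferred SIMD mode and peeling for gaps.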

The natural next step is the masked epilogue support, after
that the merged masked epilogue one.

Thanks,
Richard.

> Thanks.
> Changelog:
> 
> 2016-11-15  Yuri Rumyantsev  <ysrumyan@gmail.com>
> 
> * params.def (PARAM_VECT_EPILOGUES_NOMASK): New.
> * tree-if-conv.c (tree_if_conversion): Make public.
> * tree-if-conv.h: New file.
> * tree-vect-data-refs.c (vect_analyze_data_ref_dependences): Avoid
> dynamic alias checks for epilogues.
> * tree-vect-loop-manip.c (vect_do_peeling): Return created epilog.
> * tree-vect-loop.c: Include tree-if-conv.h.
> (new_loop_vec_info): Add zeroing orig_loop_info field.
> (vect_analyze_loop_2): Don't try to enhance alignment for epilogues.
> (vect_analyze_loop): Add argument ORIG_LOOP_INFO which is not NULL
> if epilogue is vectorized, set up orig_loop_info field of loop_vinfo
> using passed argument.
> (vect_transform_loop): Check if created epilogue should be returned
> for further vectorization with less vf.  If-convert epilogue if
> required. Print vectorization success for epilogue.
> * tree-vectorizer.c (vectorize_loops): Add epilogue vectorization
> if it is required, pass loop_vinfo produced during vectorization of
> loop body to vect_analyze_loop.
> * tree-vectorizer.h (struct _loop_vec_info): Add new field
> orig_loop_info.
> (LOOP_VINFO_ORIG_LOOP_INFO): New.
> (LOOP_VINFO_EPILOGUE_P): New.
> (LOOP_VINFO_ORIG_VECT_FACTOR): New.
> (vect_do_peeling): Change prototype to return epilogue.
> (vect_analyze_loop): Add argument of loop_vec_info type.
> (vect_transform_loop): Return created loop.
> 
> gcc/testsuite/
> 
> * lib/target-supports.exp (check_avx2_hw_available): New.
> (check_effective_target_avx2_runtime): New.
> * gcc.dg/vect/vect-tail-nomask-1.c: New test.
> 
> 
> 2016-11-14 20:04 GMT+03:00 Richard Biener <rguenther@suse.de>:
> > On November 14, 2016 4:39:40 PM GMT+01:00, Yuri Rumyantsev <ysrumyan@gmail.com> wrote:
> >>Richard,
> >>
> >>I checked one of the tests designed for epilogue vectorization using
> >>patches 1 - 3 and found that the built compiler performs vectorization
> >>of epilogues with --param vect-epilogues-nomask=1 passed:
> >>
> >>$ gcc -Ofast -mavx2 t1.c -S --param vect-epilogues-nomask=1 -o
> >>t1.new-nomask.s -fdump-tree-vect-details
> >>$ grep VECTORIZED -c t1.c.156t.vect
> >>4
> >> Without param only 2 loops are vectorized.
> >>
> >>Should I simply add a part of tests related to this feature or I must
> >>delete all not necessary changes also?
> >
> > Please remove all not necessary changes.
> >
> > Richard.
> >
> >>Thanks.
> >>Yuri.
> >>
> >>2016-11-14 16:40 GMT+03:00 Richard Biener <rguenther@suse.de>:
> >>> On Mon, 14 Nov 2016, Yuri Rumyantsev wrote:
> >>>
> >>>> Richard,
> >>>>
> >>>> In my previous patch I forgot to remove a couple of lines related to aux
> >>field.
> >>>> Here is the correct updated patch.
> >>>
> >>> Yeah, I noticed.  This patch would be ok for trunk (together with
> >>> necessary parts from 1 and 2) if all not required parts are removed
> >>> (and you'd add the testcases covering non-masked tail vect).
> >>>
> >>> Thus, can you please produce a single complete patch containing only
> >>> non-masked epilogue vectorization?
> >>>
> >>> Thanks,
> >>> Richard.
> >>>
> >>>> Thanks.
> >>>> Yuri.
> >>>>
> >>>> 2016-11-14 15:51 GMT+03:00 Richard Biener <rguenther@suse.de>:
> >>>> > On Fri, 11 Nov 2016, Yuri Rumyantsev wrote:
> >>>> >
> >>>> >> Richard,
> >>>> >>
> >>>> >> I prepared an updated patch 3, passing an additional argument to
> >>>> >> vect_analyze_loop as you proposed (untested).
> >>>> >>
> >>>> >> You wrote:
> >>>> >> Btw, I wonder if you can produce a single patch containing just
> >>>> >> epilogue vectorization, that is combine patches 1-3 but rip out
> >>>> >> changes only needed by later patches?
> >>>> >>
> >>>> >> Did you mean that I exclude all support for vectorization
> >>epilogues,
> >>>> >> i.e. exclude from 2-nd patch all non-related changes
> >>>> >> like
> >>>> >>
> >>>> >> diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
> >>>> >> index 11863af..32011c1 100644
> >>>> >> --- a/gcc/tree-vect-loop.c
> >>>> >> +++ b/gcc/tree-vect-loop.c
> >>>> >> @@ -1120,6 +1120,12 @@ new_loop_vec_info (struct loop *loop)
> >>>> >>    LOOP_VINFO_PEELING_FOR_GAPS (res) = false;
> >>>> >>    LOOP_VINFO_PEELING_FOR_NITER (res) = false;
> >>>> >>    LOOP_VINFO_OPERANDS_SWAPPED (res) = false;
> >>>> >> +  LOOP_VINFO_CAN_BE_MASKED (res) = false;
> >>>> >> +  LOOP_VINFO_REQUIRED_MASKS (res) = 0;
> >>>> >> +  LOOP_VINFO_COMBINE_EPILOGUE (res) = false;
> >>>> >> +  LOOP_VINFO_MASK_EPILOGUE (res) = false;
> >>>> >> +  LOOP_VINFO_NEED_MASKING (res) = false;
> >>>> >> +  LOOP_VINFO_ORIG_LOOP_INFO (res) = NULL;
> >>>> >
> >>>> > Yes.
> >>>> >
> >>>> >> Did you mean also that new combined patch must be working patch,
> >>i.e.
> >>>> >> can be integrated without other patches?
> >>>> >
> >>>> > Yes.
> >>>> >
> >>>> >> Could you please look at updated patch?
> >>>> >
> >>>> > Will do.
> >>>> >
> >>>> > Thanks,
> >>>> > Richard.
> >>>> >
> >>>> >> Thanks.
> >>>> >> Yuri.
> >>>> >>
> >>>> >> 2016-11-10 15:36 GMT+03:00 Richard Biener <rguenther@suse.de>:
> >>>> >> > On Thu, 10 Nov 2016, Richard Biener wrote:
> >>>> >> >
> >>>> >> >> On Tue, 8 Nov 2016, Yuri Rumyantsev wrote:
> >>>> >> >>
> >>>> >> >> > Richard,
> >>>> >> >> >
> >>>> >> >> > Here is updated 3 patch.
> >>>> >> >> >
> >>>> >> >> > I checked that all new tests related to epilogue
> >>vectorization passed with it.
> >>>> >> >> >
> >>>> >> >> > Your comments will be appreciated.
> >>>> >> >>
> >>>> >> >> A lot better now.  Instead of the ->aux dance I now prefer to
> >>>> >> >> pass the original loops loop_vinfo to vect_analyze_loop as
> >>>> >> >> optional argument (if non-NULL we analyze the epilogue of that
> >>>> >> >> loop_vinfo).  OTOH I remember we mainly use it to get at the
> >>>> >> >> original vectorization factor?  So we can pass down an
> >>(optional)
> >>>> >> >> forced vectorization factor as well?
> >>>> >> >
> >>>> >> > Btw, I wonder if you can produce a single patch containing just
> >>>> >> > epilogue vectorization, that is combine patches 1-3 but rip out
> >>>> >> > changes only needed by later patches?
> >>>> >> >
> >>>> >> > Thanks,
> >>>> >> > Richard.
> >>>> >> >
> >>>> >> >> Richard.
> >>>> >> >>
> >>>> >> >> > 2016-11-08 15:38 GMT+03:00 Richard Biener
> >><rguenther@suse.de>:
> >>>> >> >> > > On Thu, 3 Nov 2016, Yuri Rumyantsev wrote:
> >>>> >> >> > >
> >>>> >> >> > >> Hi Richard,
> >>>> >> >> > >>
> >>>> >> >> > >> I did not understand your last remark:
> >>>> >> >> > >>
> >>>> >> >> > >> > That is, here (and avoid the FOR_EACH_LOOP change):
> >>>> >> >> > >> >
> >>>> >> >> > >> > @@ -580,12 +586,21 @@ vectorize_loops (void)
> >>>> >> >> > >> >           && dump_enabled_p ())
> >>>> >> >> > >> >           dump_printf_loc (MSG_OPTIMIZED_LOCATIONS,
> >>vect_location,
> >>>> >> >> > >> >                            "loop vectorized\n");
> >>>> >> >> > >> > -       vect_transform_loop (loop_vinfo);
> >>>> >> >> > >> > +       new_loop = vect_transform_loop (loop_vinfo);
> >>>> >> >> > >> >         num_vectorized_loops++;
> >>>> >> >> > >> >        /* Now that the loop has been vectorized, allow
> >>it to be unrolled
> >>>> >> >> > >> >           etc.  */
> >>>> >> >> > >> >      loop->force_vectorize = false;
> >>>> >> >> > >> >
> >>>> >> >> > >> > +       /* Add new loop to a processing queue.  To make
> >>it easier
> >>>> >> >> > >> > +          to match loop and its epilogue vectorization
> >>in dumps
> >>>> >> >> > >> > +          put new loop as the next loop to process.
> >>*/
> >>>> >> >> > >> > +       if (new_loop)
> >>>> >> >> > >> > +         {
> >>>> >> >> > >> > +           loops.safe_insert (i + 1, new_loop->num);
> >>>> >> >> > >> > +           vect_loops_num = number_of_loops (cfun);
> >>>> >> >> > >> > +         }
> >>>> >> >> > >> >
> >>>> >> >> > >> > simply dispatch to a vectorize_epilogue (loop_vinfo,
> >>new_loop)
> >>>> >> >> > >> > function which will set up stuff properly (and also
> >>perform
> >>>> >> >> > >> > the if-conversion of the epilogue there).
> >>>> >> >> > >> >
> >>>> >> >> > >> > That said, if we can get in non-masked epilogue
> >>vectorization
> >>>> >> >> > >> > separately that would be great.
> >>>> >> >> > >>
> >>>> >> >> > >> Could you please clarify your proposal.
> >>>> >> >> > >
> >>>> >> >> > > When a loop was vectorized set things up to immediately
> >>vectorize
> >>>> >> >> > > its epilogue, avoiding changing the loop iteration and
> >>avoiding
> >>>> >> >> > > the re-use of ->aux.
> >>>> >> >> > >
> >>>> >> >> > > Richard.
> >>>> >> >> > >
> >>>> >> >> > >> Thanks.
> >>>> >> >> > >> Yuri.
> >>>> >> >> > >>
> >>>> >> >> > >> 2016-11-02 15:27 GMT+03:00 Richard Biener
> >><rguenther@suse.de>:
> >>>> >> >> > >> > On Tue, 1 Nov 2016, Yuri Rumyantsev wrote:
> >>>> >> >> > >> >
> >>>> >> >> > >> >> Hi All,
> >>>> >> >> > >> >>
> >>>> >> >> > >> >> I re-send all patches sent by Ilya earlier for review
> >>which support
> >>>> >> >> > >> >> vectorization of loop epilogues and loops with low
> >>trip count. We
> >>>> >> >> > >> >> assume that the only patch -
> >>vec-tails-07-combine-tail.patch - was not
> >>>> >> >> > >> >> approved by Jeff.
> >>>> >> >> > >> >>
> >>>> >> >> > >> >> I did re-base of all patches and performed
> >>bootstrapping and
> >>>> >> >> > >> >> regression testing that did not show any new failures.
> >>Also all
> >>>> >> >> > >> >> changes related to new vect_do_peeling algorithm have
> >>been changed
> >>>> >> >> > >> >> accordingly.
> >>>> >> >> > >> >>
> >>>> >> >> > >> >> Is it OK for trunk?
> >>>> >> >> > >> >
> >>>> >> >> > >> > I would have preferred that the series up to
> >>-03-nomask-tails would
> >>>> >> >> > >> > _only_ contain epilogue loop vectorization changes but
> >>unfortunately
> >>>> >> >> > >> > the patchset is oddly separated.
> >>>> >> >> > >> >
> >>>> >> >> > >> > I have a comment on that part nevertheless:
> >>>> >> >> > >> >
> >>>> >> >> > >> > @@ -1608,7 +1614,10 @@ vect_enhance_data_refs_alignment
> >>(loop_vec_info
> >>>> >> >> > >> > loop_vinfo)
> >>>> >> >> > >> >    /* Check if we can possibly peel the loop.  */
> >>>> >> >> > >> >    if (!vect_can_advance_ivs_p (loop_vinfo)
> >>>> >> >> > >> >        || !slpeel_can_duplicate_loop_p (loop,
> >>single_exit (loop))
> >>>> >> >> > >> > -      || loop->inner)
> >>>> >> >> > >> > +      || loop->inner
> >>>> >> >> > >> > +      /* Required peeling was performed in prologue
> >>and
> >>>> >> >> > >> > +        is not required for epilogue.  */
> >>>> >> >> > >> > +      || LOOP_VINFO_EPILOGUE_P (loop_vinfo))
> >>>> >> >> > >> >      do_peeling = false;
> >>>> >> >> > >> >
> >>>> >> >> > >> >    if (do_peeling
> >>>> >> >> > >> > @@ -1888,7 +1897,10 @@ vect_enhance_data_refs_alignment
> >>(loop_vec_info
> >>>> >> >> > >> > loop_vinfo)
> >>>> >> >> > >> >
> >>>> >> >> > >> >    do_versioning =
> >>>> >> >> > >> >         optimize_loop_nest_for_speed_p (loop)
> >>>> >> >> > >> > -       && (!loop->inner); /* FORNOW */
> >>>> >> >> > >> > +       && (!loop->inner) /* FORNOW */
> >>>> >> >> > >> > +        /* Required versioning was performed for the
> >>>> >> >> > >> > +          original loop and is not required for
> >>epilogue.  */
> >>>> >> >> > >> > +       && !LOOP_VINFO_EPILOGUE_P (loop_vinfo);
> >>>> >> >> > >> >
> >>>> >> >> > >> >    if (do_versioning)
> >>>> >> >> > >> >      {
> >>>> >> >> > >> >
> >>>> >> >> > >> > please do that check in the single caller of this
> >>function.
> >>>> >> >> > >> >
> >>>> >> >> > >> > Otherwise I still dislike the new ->aux use and I
> >>believe that simply
> >>>> >> >> > >> > passing down info from the processed parent would be
> >>_much_ cleaner.
> >>>> >> >> > >> > That is, here (and avoid the FOR_EACH_LOOP change):
> >>>> >> >> > >> >
> >>>> >> >> > >> > @@ -580,12 +586,21 @@ vectorize_loops (void)
> >>>> >> >> > >> >             && dump_enabled_p ())
> >>>> >> >> > >> >            dump_printf_loc (MSG_OPTIMIZED_LOCATIONS,
> >>vect_location,
> >>>> >> >> > >> >                             "loop vectorized\n");
> >>>> >> >> > >> > -       vect_transform_loop (loop_vinfo);
> >>>> >> >> > >> > +       new_loop = vect_transform_loop (loop_vinfo);
> >>>> >> >> > >> >         num_vectorized_loops++;
> >>>> >> >> > >> >         /* Now that the loop has been vectorized, allow
> >>it to be unrolled
> >>>> >> >> > >> >            etc.  */
> >>>> >> >> > >> >         loop->force_vectorize = false;
> >>>> >> >> > >> >
> >>>> >> >> > >> > +       /* Add new loop to a processing queue.  To make
> >>it easier
> >>>> >> >> > >> > +          to match loop and its epilogue vectorization
> >>in dumps
> >>>> >> >> > >> > +          put new loop as the next loop to process.
> >>*/
> >>>> >> >> > >> > +       if (new_loop)
> >>>> >> >> > >> > +         {
> >>>> >> >> > >> > +           loops.safe_insert (i + 1, new_loop->num);
> >>>> >> >> > >> > +           vect_loops_num = number_of_loops (cfun);
> >>>> >> >> > >> > +         }
> >>>> >> >> > >> >
> >>>> >> >> > >> > simply dispatch to a vectorize_epilogue (loop_vinfo,
> >>new_loop)
> >>>> >> >> > >> > function which will set up stuff properly (and also
> >>perform
> >>>> >> >> > >> > the if-conversion of the epilogue there).
> >>>> >> >> > >> >
> >>>> >> >> > >> > That said, if we can get in non-masked epilogue
> >>vectorization
> >>>> >> >> > >> > separately that would be great.
> >>>> >> >> > >> >
> >>>> >> >> > >> > I'm still torn about all the rest of the stuff and
> >>question its
> >>>> >> >> > >> > usability (esp. merging the epilogue with the main
> >>vector loop).
> >>>> >> >> > >> > But it has already been approved ... oh well.
> >>>> >> >> > >> >
> >>>> >> >> > >> > Thanks,
> >>>> >> >> > >> > Richard.
> >>>> >> >> > >>
> >>>> >> >> > >>
> >>>> >> >> > >
> >>>> >> >> > > --
> >>>> >> >> > > Richard Biener <rguenther@suse.de>
> >>>> >> >> > > SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard,
> >>Graham Norton, HRB 21284 (AG Nuernberg)
> >>>> >> >> >
> >>>> >> >>
> >>>> >> >>
> >>>> >> >
> >>>> >> > --
> >>>> >> > Richard Biener <rguenther@suse.de>
> >>>> >> > SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham
> >>Norton, HRB 21284 (AG Nuernberg)
> >>>> >>
> >>>> >
> >>>> > --
> >>>> > Richard Biener <rguenther@suse.de>
> >>>> > SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham
> >>Norton, HRB 21284 (AG Nuernberg)
> >>>>
> >>>
> >>> --
> >>> Richard Biener <rguenther@suse.de>
> >>> SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham
> >>Norton, HRB 21284 (AG Nuernberg)
> >
> >
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)


* Re: [PATCH, vec-tails] Support loop epilogue vectorization
  2016-11-15 14:42                         ` Yuri Rumyantsev
  2016-11-16  9:56                           ` Richard Biener
@ 2016-11-18 13:20                           ` Christophe Lyon
  2016-11-18 15:46                             ` Yuri Rumyantsev
  1 sibling, 1 reply; 38+ messages in thread
From: Christophe Lyon @ 2016-11-18 13:20 UTC (permalink / raw)
  To: Yuri Rumyantsev; +Cc: Richard Biener, Jeff Law, gcc-patches, Ilya Enkovich

On 15 November 2016 at 15:41, Yuri Rumyantsev <ysrumyan@gmail.com> wrote:
> Hi All,
>
> Here is the patch for non-masked epilogue vectorization.
>
> Bootstrap and regression testing did not show any new failures.
>
> Is it OK for trunk?
>
> Thanks.
> Changelog:
>
> 2016-11-15  Yuri Rumyantsev  <ysrumyan@gmail.com>
>
> * params.def (PARAM_VECT_EPILOGUES_NOMASK): New.
> * tree-if-conv.c (tree_if_conversion): Make public.
> * tree-if-conv.h: New file.
> * tree-vect-data-refs.c (vect_analyze_data_ref_dependences): Avoid
> dynamic alias checks for epilogues.
> * tree-vect-loop-manip.c (vect_do_peeling): Return created epilog.
> * tree-vect-loop.c: Include tree-if-conv.h.
> (new_loop_vec_info): Add zeroing orig_loop_info field.
> (vect_analyze_loop_2): Don't try to enhance alignment for epilogues.
> (vect_analyze_loop): Add argument ORIG_LOOP_INFO which is not NULL
> if epilogue is vectorized, set up orig_loop_info field of loop_vinfo
> using passed argument.
> (vect_transform_loop): Check if created epilogue should be returned
> for further vectorization with less vf.  If-convert epilogue if
> required. Print vectorization success for epilogue.
> * tree-vectorizer.c (vectorize_loops): Add epilogue vectorization
> if it is required, pass loop_vinfo produced during vectorization of
> loop body to vect_analyze_loop.
> * tree-vectorizer.h (struct _loop_vec_info): Add new field
> orig_loop_info.
> (LOOP_VINFO_ORIG_LOOP_INFO): New.
> (LOOP_VINFO_EPILOGUE_P): New.
> (LOOP_VINFO_ORIG_VECT_FACTOR): New.
> (vect_do_peeling): Change prototype to return epilogue.
> (vect_analyze_loop): Add argument of loop_vec_info type.
> (vect_transform_loop): Return created loop.
>
> gcc/testsuite/
>
> * lib/target-supports.exp (check_avx2_hw_available): New.
> (check_effective_target_avx2_runtime): New.
> * gcc.dg/vect/vect-tail-nomask-1.c: New test.
>
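
For readers following the thread, the ChangeLog entry about returning the
epilogue "for further vectorization with less vf" can be pictured with the
sketch below. It is only an illustration written in plain C, not code from
the patch: the function name and the two vector factors (8 and 4) are
assumptions, and the fixed-width inner loops merely stand in for single
vector operations.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical widths chosen for illustration only: a main vector
   factor of 8 and a smaller epilogue factor of 4.  */
enum { MAIN_VF = 8, EPILOGUE_VF = 4 };

/* Adds b[] to a[] elementwise, structured the way a loop looks after
   the main body and then its epilogue have both been vectorized.
   Fixed-width inner loops stand in for single vector operations.  */
void
add_with_vectorized_epilogue (int *a, const int *b, size_t n)
{
  size_t i = 0;

  /* Main vectorized loop: MAIN_VF elements per iteration.  */
  for (; i + MAIN_VF <= n; i += MAIN_VF)
    for (size_t lane = 0; lane < MAIN_VF; lane++)
      a[i + lane] += b[i + lane];

  /* At most MAIN_VF - 1 elements remain here.  */
  assert (n - i < MAIN_VF);

  /* Vectorized epilogue: the leftover iterations are processed with a
     smaller vector factor instead of one element at a time.  */
  for (; i + EPILOGUE_VF <= n; i += EPILOGUE_VF)
    for (size_t lane = 0; lane < EPILOGUE_VF; lane++)
      a[i + lane] += b[i + lane];

  /* Scalar remainder: fewer than EPILOGUE_VF elements are left.  */
  for (; i < n; i++)
    a[i] += b[i];
}
```

With n = 23, for example, the main loop covers 16 elements, the epilogue
covers 4, and the scalar remainder handles the last 3.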

Hi,

This new test fails on arm-none-eabi (using default cpu/fpu/mode):
  gcc.dg/vect/vect-tail-nomask-1.c -flto -ffat-lto-objects execution test
  gcc.dg/vect/vect-tail-nomask-1.c execution test

It does pass on the same target if configured --with-cpu=cortex-a9.

Christophe



>
> 2016-11-14 20:04 GMT+03:00 Richard Biener <rguenther@suse.de>:
>> On November 14, 2016 4:39:40 PM GMT+01:00, Yuri Rumyantsev <ysrumyan@gmail.com> wrote:
>>>Richard,
>>>
>>>I checked one of the tests designed for epilogue vectorization using
>>>patches 1 - 3 and found out that build compiler performs vectorization
>>>of epilogues with --param vect-epilogues-nomask=1 passed:
>>>
>>>$ gcc -Ofast -mavx2 t1.c -S --param vect-epilogues-nomask=1 -o
>>>t1.new-nomask.s -fdump-tree-vect-details
>>>$ grep VECTORIZED -c t1.c.156t.vect
>>>4
>>> Without param only 2 loops are vectorized.
>>>
>>>Should I simply add a part of tests related to this feature or I must
>>>delete all not necessary changes also?
>>
>> Please remove all not necessary changes.
>>
>> Richard.
>>
>>>Thanks.
>>>Yuri.
>>>
>>>2016-11-14 16:40 GMT+03:00 Richard Biener <rguenther@suse.de>:
>>>> On Mon, 14 Nov 2016, Yuri Rumyantsev wrote:
>>>>
>>>>> Richard,
>>>>>
>>>>> In my previous patch I forgot to remove couple lines related to aux
>>>field.
>>>>> Here is the correct updated patch.
>>>>
>>>> Yeah, I noticed.  This patch would be ok for trunk (together with
>>>> necessary parts from 1 and 2) if all not required parts are removed
>>>> (and you'd add the testcases covering non-masked tail vect).
>>>>
>>>> Thus, can you please produce a single complete patch containing only
>>>> non-masked epilogue vectorization?
>>>>
>>>> Thanks,
>>>> Richard.
>>>>
>>>>> Thanks.
>>>>> Yuri.
>>>>>
>>>>> 2016-11-14 15:51 GMT+03:00 Richard Biener <rguenther@suse.de>:
>>>>> > On Fri, 11 Nov 2016, Yuri Rumyantsev wrote:
>>>>> >
>>>>> >> Richard,
>>>>> >>
>>>>> >> I prepare updated 3 patch with passing additional argument to
>>>>> >> vect_analyze_loop as you proposed (untested).
>>>>> >>
>>>>> >> You wrote:
>>>>> >> Btw, I wonder if you can produce a single patch containing just
>>>>> >> epilogue vectorization, that is combine patches 1-3 but rip out
>>>>> >> changes only needed by later patches?
>>>>> >>
>>>>> >> Did you mean that I exclude all support for vectorization
>>>epilogues,
>>>>> >> i.e. exclude from 2-nd patch all non-related changes
>>>>> >> like
>>>>> >>
>>>>> >> diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
>>>>> >> index 11863af..32011c1 100644
>>>>> >> --- a/gcc/tree-vect-loop.c
>>>>> >> +++ b/gcc/tree-vect-loop.c
>>>>> >> @@ -1120,6 +1120,12 @@ new_loop_vec_info (struct loop *loop)
>>>>> >>    LOOP_VINFO_PEELING_FOR_GAPS (res) = false;
>>>>> >>    LOOP_VINFO_PEELING_FOR_NITER (res) = false;
>>>>> >>    LOOP_VINFO_OPERANDS_SWAPPED (res) = false;
>>>>> >> +  LOOP_VINFO_CAN_BE_MASKED (res) = false;
>>>>> >> +  LOOP_VINFO_REQUIRED_MASKS (res) = 0;
>>>>> >> +  LOOP_VINFO_COMBINE_EPILOGUE (res) = false;
>>>>> >> +  LOOP_VINFO_MASK_EPILOGUE (res) = false;
>>>>> >> +  LOOP_VINFO_NEED_MASKING (res) = false;
>>>>> >> +  LOOP_VINFO_ORIG_LOOP_INFO (res) = NULL;
>>>>> >
>>>>> > Yes.
>>>>> >
>>>>> >> Did you mean also that new combined patch must be working patch,
>>>i.e.
>>>>> >> can be integrated without other patches?
>>>>> >
>>>>> > Yes.
>>>>> >
>>>>> >> Could you please look at updated patch?
>>>>> >
>>>>> > Will do.
>>>>> >
>>>>> > Thanks,
>>>>> > Richard.
>>>>> >
>>>>> >> Thanks.
>>>>> >> Yuri.
>>>>> >>
>>>>> >> 2016-11-10 15:36 GMT+03:00 Richard Biener <rguenther@suse.de>:
>>>>> >> > On Thu, 10 Nov 2016, Richard Biener wrote:
>>>>> >> >
>>>>> >> >> On Tue, 8 Nov 2016, Yuri Rumyantsev wrote:
>>>>> >> >>
>>>>> >> >> > Richard,
>>>>> >> >> >
>>>>> >> >> > Here is updated 3 patch.
>>>>> >> >> >
>>>>> >> >> > I checked that all new tests related to epilogue
>>>vectorization passed with it.
>>>>> >> >> >
>>>>> >> >> > Your comments will be appreciated.
>>>>> >> >>
>>>>> >> >> A lot better now.  Instead of the ->aux dance I now prefer to
>>>>> >> >> pass the original loops loop_vinfo to vect_analyze_loop as
>>>>> >> >> optional argument (if non-NULL we analyze the epilogue of that
>>>>> >> >> loop_vinfo).  OTOH I remember we mainly use it to get at the
>>>>> >> >> original vectorization factor?  So we can pass down an
>>>(optional)
>>>>> >> >> forced vectorization factor as well?
>>>>> >> >
>>>>> >> > Btw, I wonder if you can produce a single patch containing just
>>>>> >> > epilogue vectorization, that is combine patches 1-3 but rip out
>>>>> >> > changes only needed by later patches?
>>>>> >> >
>>>>> >> > Thanks,
>>>>> >> > Richard.
>>>>> >> >
>>>>> >> >> Richard.
>>>>> >> >>
>>>>> >> >> > 2016-11-08 15:38 GMT+03:00 Richard Biener
>>><rguenther@suse.de>:
>>>>> >> >> > > On Thu, 3 Nov 2016, Yuri Rumyantsev wrote:
>>>>> >> >> > >
>>>>> >> >> > >> Hi Richard,
>>>>> >> >> > >>
>>>>> >> >> > >> I did not understand your last remark:
>>>>> >> >> > >>
>>>>> >> >> > >> > That is, here (and avoid the FOR_EACH_LOOP change):
>>>>> >> >> > >> >
>>>>> >> >> > >> > @@ -580,12 +586,21 @@ vectorize_loops (void)
>>>>> >> >> > >> >           && dump_enabled_p ())
>>>>> >> >> > >> >           dump_printf_loc (MSG_OPTIMIZED_LOCATIONS,
>>>vect_location,
>>>>> >> >> > >> >                            "loop vectorized\n");
>>>>> >> >> > >> > -       vect_transform_loop (loop_vinfo);
>>>>> >> >> > >> > +       new_loop = vect_transform_loop (loop_vinfo);
>>>>> >> >> > >> >         num_vectorized_loops++;
>>>>> >> >> > >> >        /* Now that the loop has been vectorized, allow
>>>it to be unrolled
>>>>> >> >> > >> >           etc.  */
>>>>> >> >> > >> >      loop->force_vectorize = false;
>>>>> >> >> > >> >
>>>>> >> >> > >> > +       /* Add new loop to a processing queue.  To make
>>>it easier
>>>>> >> >> > >> > +          to match loop and its epilogue vectorization
>>>in dumps
>>>>> >> >> > >> > +          put new loop as the next loop to process.
>>>*/
>>>>> >> >> > >> > +       if (new_loop)
>>>>> >> >> > >> > +         {
>>>>> >> >> > >> > +           loops.safe_insert (i + 1, new_loop->num);
>>>>> >> >> > >> > +           vect_loops_num = number_of_loops (cfun);
>>>>> >> >> > >> > +         }
>>>>> >> >> > >> >
>>>>> >> >> > >> > simply dispatch to a vectorize_epilogue (loop_vinfo,
>>>new_loop)
>>>>> >> >> > >> > function which will set up stuff properly (and also
>>>perform
>>>>> >> >> > >> > the if-conversion of the epilogue there).
>>>>> >> >> > >> >
>>>>> >> >> > >> > That said, if we can get in non-masked epilogue
>>>vectorization
>>>>> >> >> > >> > separately that would be great.
>>>>> >> >> > >>
>>>>> >> >> > >> Could you please clarify your proposal.
>>>>> >> >> > >
>>>>> >> >> > > When a loop was vectorized set things up to immediately
>>>vectorize
>>>>> >> >> > > its epilogue, avoiding changing the loop iteration and
>>>avoiding
>>>>> >> >> > > the re-use of ->aux.
>>>>> >> >> > >
>>>>> >> >> > > Richard.
>>>>> >> >> > >
>>>>> >> >> > >> Thanks.
>>>>> >> >> > >> Yuri.
>>>>> >> >> > >>
>>>>> >> >> > >> 2016-11-02 15:27 GMT+03:00 Richard Biener
>>><rguenther@suse.de>:
>>>>> >> >> > >> > On Tue, 1 Nov 2016, Yuri Rumyantsev wrote:
>>>>> >> >> > >> >
>>>>> >> >> > >> >> Hi All,
>>>>> >> >> > >> >>
>>>>> >> >> > >> >> I re-send all patches sent by Ilya earlier for review
>>>which support
>>>>> >> >> > >> >> vectorization of loop epilogues and loops with low
>>>trip count. We
>>>>> >> >> > >> >> assume that the only patch -
>>>vec-tails-07-combine-tail.patch - was not
>>>>> >> >> > >> >> approved by Jeff.
>>>>> >> >> > >> >>
>>>>> >> >> > >> >> I did re-base of all patches and performed
>>>bootstrapping and
>>>>> >> >> > >> >> regression testing that did not show any new failures.
>>>Also all
>>>>> >> >> > >> >> changes related to new vect_do_peeling algorithm have
>>>been changed
>>>>> >> >> > >> >> accordingly.
>>>>> >> >> > >> >>
>>>>> >> >> > >> >> Is it OK for trunk?
>>>>> >> >> > >> >
>>>>> >> >> > >> > I would have preferred that the series up to
>>>-03-nomask-tails would
>>>>> >> >> > >> > _only_ contain epilogue loop vectorization changes but
>>>unfortunately
>>>>> >> >> > >> > the patchset is oddly separated.
>>>>> >> >> > >> >
>>>>> >> >> > >> > I have a comment on that part nevertheless:
>>>>> >> >> > >> >
>>>>> >> >> > >> > @@ -1608,7 +1614,10 @@ vect_enhance_data_refs_alignment
>>>(loop_vec_info
>>>>> >> >> > >> > loop_vinfo)
>>>>> >> >> > >> >    /* Check if we can possibly peel the loop.  */
>>>>> >> >> > >> >    if (!vect_can_advance_ivs_p (loop_vinfo)
>>>>> >> >> > >> >        || !slpeel_can_duplicate_loop_p (loop,
>>>single_exit (loop))
>>>>> >> >> > >> > -      || loop->inner)
>>>>> >> >> > >> > +      || loop->inner
>>>>> >> >> > >> > +      /* Required peeling was performed in prologue
>>>and
>>>>> >> >> > >> > +        is not required for epilogue.  */
>>>>> >> >> > >> > +      || LOOP_VINFO_EPILOGUE_P (loop_vinfo))
>>>>> >> >> > >> >      do_peeling = false;
>>>>> >> >> > >> >
>>>>> >> >> > >> >    if (do_peeling
>>>>> >> >> > >> > @@ -1888,7 +1897,10 @@ vect_enhance_data_refs_alignment
>>>(loop_vec_info
>>>>> >> >> > >> > loop_vinfo)
>>>>> >> >> > >> >
>>>>> >> >> > >> >    do_versioning =
>>>>> >> >> > >> >         optimize_loop_nest_for_speed_p (loop)
>>>>> >> >> > >> > -       && (!loop->inner); /* FORNOW */
>>>>> >> >> > >> > +       && (!loop->inner) /* FORNOW */
>>>>> >> >> > >> > +        /* Required versioning was performed for the
>>>>> >> >> > >> > +          original loop and is not required for
>>>epilogue.  */
>>>>> >> >> > >> > +       && !LOOP_VINFO_EPILOGUE_P (loop_vinfo);
>>>>> >> >> > >> >
>>>>> >> >> > >> >    if (do_versioning)
>>>>> >> >> > >> >      {
>>>>> >> >> > >> >
>>>>> >> >> > >> > please do that check in the single caller of this
>>>function.
>>>>> >> >> > >> >
>>>>> >> >> > >> > Otherwise I still dislike the new ->aux use and I
>>>believe that simply
>>>>> >> >> > >> > passing down info from the processed parent would be
>>>_much_ cleaner.
>>>>> >> >> > >> > That is, here (and avoid the FOR_EACH_LOOP change):
>>>>> >> >> > >> >
>>>>> >> >> > >> > @@ -580,12 +586,21 @@ vectorize_loops (void)
>>>>> >> >> > >> >             && dump_enabled_p ())
>>>>> >> >> > >> >            dump_printf_loc (MSG_OPTIMIZED_LOCATIONS,
>>>vect_location,
>>>>> >> >> > >> >                             "loop vectorized\n");
>>>>> >> >> > >> > -       vect_transform_loop (loop_vinfo);
>>>>> >> >> > >> > +       new_loop = vect_transform_loop (loop_vinfo);
>>>>> >> >> > >> >         num_vectorized_loops++;
>>>>> >> >> > >> >         /* Now that the loop has been vectorized, allow
>>>it to be unrolled
>>>>> >> >> > >> >            etc.  */
>>>>> >> >> > >> >         loop->force_vectorize = false;
>>>>> >> >> > >> >
>>>>> >> >> > >> > +       /* Add new loop to a processing queue.  To make
>>>it easier
>>>>> >> >> > >> > +          to match loop and its epilogue vectorization
>>>in dumps
>>>>> >> >> > >> > +          put new loop as the next loop to process.
>>>*/
>>>>> >> >> > >> > +       if (new_loop)
>>>>> >> >> > >> > +         {
>>>>> >> >> > >> > +           loops.safe_insert (i + 1, new_loop->num);
>>>>> >> >> > >> > +           vect_loops_num = number_of_loops (cfun);
>>>>> >> >> > >> > +         }
>>>>> >> >> > >> >
>>>>> >> >> > >> > simply dispatch to a vectorize_epilogue (loop_vinfo,
>>>new_loop)
>>>>> >> >> > >> > function which will set up stuff properly (and also
>>>perform
>>>>> >> >> > >> > the if-conversion of the epilogue there).
>>>>> >> >> > >> >
>>>>> >> >> > >> > That said, if we can get in non-masked epilogue
>>>vectorization
>>>>> >> >> > >> > separately that would be great.
>>>>> >> >> > >> >
>>>>> >> >> > >> > I'm still torn about all the rest of the stuff and
>>>question its
>>>>> >> >> > >> > usability (esp. merging the epilogue with the main
>>>vector loop).
>>>>> >> >> > >> > But it has already been approved ... oh well.
>>>>> >> >> > >> >
>>>>> >> >> > >> > Thanks,
>>>>> >> >> > >> > Richard.
>>>>> >> >> > >>
>>>>> >> >> > >>
>>>>> >> >> > >
>>>>> >> >> > > --
>>>>> >> >> > > Richard Biener <rguenther@suse.de>
>>>>> >> >> > > SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard,
>>>Graham Norton, HRB 21284 (AG Nuernberg)
>>>>> >> >> >
>>>>> >> >>
>>>>> >> >>
>>>>> >> >
>>>>> >> > --
>>>>> >> > Richard Biener <rguenther@suse.de>
>>>>> >> > SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham
>>>Norton, HRB 21284 (AG Nuernberg)
>>>>> >>
>>>>> >
>>>>> > --
>>>>> > Richard Biener <rguenther@suse.de>
>>>>> > SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham
>>>Norton, HRB 21284 (AG Nuernberg)
>>>>>
>>>>
>>>> --
>>>> Richard Biener <rguenther@suse.de>
>>>> SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham
>>>Norton, HRB 21284 (AG Nuernberg)
>>
>>


* Re: [PATCH, vec-tails] Support loop epilogue vectorization
  2016-11-18 13:20                           ` Christophe Lyon
@ 2016-11-18 15:46                             ` Yuri Rumyantsev
  2016-11-18 15:54                               ` Christophe Lyon
  0 siblings, 1 reply; 38+ messages in thread
From: Yuri Rumyantsev @ 2016-11-18 15:46 UTC (permalink / raw)
  To: Christophe Lyon; +Cc: Richard Biener, Jeff Law, gcc-patches, Ilya Enkovich

It is very strange that this test failed on arm, since it requires the
avx2_runtime target to check the vectorizer dumps:

/* { dg-final { scan-tree-dump-times "LOOP VECTORIZED" 2 "vect" { target avx2_runtime } } } */
/* { dg-final { scan-tree-dump-times "LOOP EPILOGUE VECTORIZED \\(VS=16\\)" 2 "vect" { target avx2_runtime } } } */

Could you please clarify the reason for the failure?

Thanks.
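
One point that may help when reading the report: the two failing lines are
"execution test" results, not dump scans. A target selector written on a
dg-final directive guards only that directive, so the program itself is
still built and run on every target that executes the test. The skeleton
below is a hypothetical illustration of that split, not the contents of
vect-tail-nomask-1.c; the trip count, arrays and function name are
assumptions, and it says nothing about the actual cause of the arm failure.

```c
/* Hypothetical skeleton for illustration; this is not the contents of
   gcc.dg/vect/vect-tail-nomask-1.c.  */
/* { dg-do run } */
/* { dg-final { scan-tree-dump-times "LOOP VECTORIZED" 2 "vect" { target avx2_runtime } } } */

#define N 19  /* Assumed trip count, picked so that a tail exists.  */

static int a[N], b[N];

/* Runs the kernel and returns the number of mismatches against the
   expected values.  This part is compiled and executed on every target
   that runs the test, regardless of the selector on dg-final above.  */
int
run_and_count_errors (void)
{
  int errors = 0;

  for (int i = 0; i < N; i++)
    {
      a[i] = i;
      b[i] = 2 * i;
    }

  /* The loop a vectorizer would transform.  */
  for (int i = 0; i < N; i++)
    a[i] += b[i];

  for (int i = 0; i < N; i++)
    if (a[i] != 3 * i)
      errors++;

  return errors;
}
```

In such a layout only the dump scan is skipped on non-AVX2 targets; a
mismatch in the executed part would still be reported as a failed
execution test.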

2016-11-18 16:20 GMT+03:00 Christophe Lyon <christophe.lyon@linaro.org>:
> On 15 November 2016 at 15:41, Yuri Rumyantsev <ysrumyan@gmail.com> wrote:
>> Hi All,
>>
>> Here is the patch for non-masked epilogue vectorization.
>>
>> Bootstrap and regression testing did not show any new failures.
>>
>> Is it OK for trunk?
>>
>> Thanks.
>> Changelog:
>>
>> 2016-11-15  Yuri Rumyantsev  <ysrumyan@gmail.com>
>>
>> * params.def (PARAM_VECT_EPILOGUES_NOMASK): New.
>> * tree-if-conv.c (tree_if_conversion): Make public.
>> * tree-if-conv.h: New file.
>> * tree-vect-data-refs.c (vect_analyze_data_ref_dependences): Avoid
>> dynamic alias checks for epilogues.
>> * tree-vect-loop-manip.c (vect_do_peeling): Return created epilog.
>> * tree-vect-loop.c: Include tree-if-conv.h.
>> (new_loop_vec_info): Add zeroing orig_loop_info field.
>> (vect_analyze_loop_2): Don't try to enhance alignment for epilogues.
>> (vect_analyze_loop): Add argument ORIG_LOOP_INFO which is not NULL
>> if epilogue is vectorized, set up orig_loop_info field of loop_vinfo
>> using passed argument.
>> (vect_transform_loop): Check if created epilogue should be returned
>> for further vectorization with less vf.  If-convert epilogue if
>> required. Print vectorization success for epilogue.
>> * tree-vectorizer.c (vectorize_loops): Add epilogue vectorization
>> if it is required, pass loop_vinfo produced during vectorization of
>> loop body to vect_analyze_loop.
>> * tree-vectorizer.h (struct _loop_vec_info): Add new field
>> orig_loop_info.
>> (LOOP_VINFO_ORIG_LOOP_INFO): New.
>> (LOOP_VINFO_EPILOGUE_P): New.
>> (LOOP_VINFO_ORIG_VECT_FACTOR): New.
>> (vect_do_peeling): Change prototype to return epilogue.
>> (vect_analyze_loop): Add argument of loop_vec_info type.
>> (vect_transform_loop): Return created loop.
>>
>> gcc/testsuite/
>>
>> * lib/target-supports.exp (check_avx2_hw_available): New.
>> (check_effective_target_avx2_runtime): New.
>> * gcc.dg/vect/vect-tail-nomask-1.c: New test.
>>
>
> Hi,
>
> This new test fails on arm-none-eabi (using default cpu/fpu/mode):
>   gcc.dg/vect/vect-tail-nomask-1.c -flto -ffat-lto-objects execution test
>   gcc.dg/vect/vect-tail-nomask-1.c execution test
>
> It does pass on the same target if configured --with-cpu=cortex-a9.
>
> Christophe
>
>
>
>>
>> 2016-11-14 20:04 GMT+03:00 Richard Biener <rguenther@suse.de>:
>>> On November 14, 2016 4:39:40 PM GMT+01:00, Yuri Rumyantsev <ysrumyan@gmail.com> wrote:
>>>>Richard,
>>>>
>>>>I checked one of the tests designed for epilogue vectorization using
>>>>patches 1 - 3 and found out that build compiler performs vectorization
>>>>of epilogues with --param vect-epilogues-nomask=1 passed:
>>>>
>>>>$ gcc -Ofast -mavx2 t1.c -S --param vect-epilogues-nomask=1 -o
>>>>t1.new-nomask.s -fdump-tree-vect-details
>>>>$ grep VECTORIZED -c t1.c.156t.vect
>>>>4
>>>> Without param only 2 loops are vectorized.
>>>>
>>>>Should I simply add a part of tests related to this feature or I must
>>>>delete all not necessary changes also?
>>>
>>> Please remove all not necessary changes.
>>>
>>> Richard.
>>>
>>>>Thanks.
>>>>Yuri.
>>>>
>>>>2016-11-14 16:40 GMT+03:00 Richard Biener <rguenther@suse.de>:
>>>>> On Mon, 14 Nov 2016, Yuri Rumyantsev wrote:
>>>>>
>>>>>> Richard,
>>>>>>
>>>>>> In my previous patch I forgot to remove couple lines related to aux
>>>>field.
>>>>>> Here is the correct updated patch.
>>>>>
>>>>> Yeah, I noticed.  This patch would be ok for trunk (together with
>>>>> necessary parts from 1 and 2) if all not required parts are removed
>>>>> (and you'd add the testcases covering non-masked tail vect).
>>>>>
>>>>> Thus, can you please produce a single complete patch containing only
>>>>> non-masked epilogue vectorization?
>>>>>
>>>>> Thanks,
>>>>> Richard.
>>>>>
>>>>>> Thanks.
>>>>>> Yuri.
>>>>>>
>>>>>> 2016-11-14 15:51 GMT+03:00 Richard Biener <rguenther@suse.de>:
>>>>>> > On Fri, 11 Nov 2016, Yuri Rumyantsev wrote:
>>>>>> >
>>>>>> >> Richard,
>>>>>> >>
>>>>>> >> I prepare updated 3 patch with passing additional argument to
>>>>>> >> vect_analyze_loop as you proposed (untested).
>>>>>> >>
>>>>>> >> You wrote:
>>>>>> >> Btw, I wonder if you can produce a single patch containing just
>>>>>> >> epilogue vectorization, that is combine patches 1-3 but rip out
>>>>>> >> changes only needed by later patches?
>>>>>> >>
>>>>>> >> Did you mean that I exclude all support for vectorization
>>>>epilogues,
>>>>>> >> i.e. exclude from 2-nd patch all non-related changes
>>>>>> >> like
>>>>>> >>
>>>>>> >> diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
>>>>>> >> index 11863af..32011c1 100644
>>>>>> >> --- a/gcc/tree-vect-loop.c
>>>>>> >> +++ b/gcc/tree-vect-loop.c
>>>>>> >> @@ -1120,6 +1120,12 @@ new_loop_vec_info (struct loop *loop)
>>>>>> >>    LOOP_VINFO_PEELING_FOR_GAPS (res) = false;
>>>>>> >>    LOOP_VINFO_PEELING_FOR_NITER (res) = false;
>>>>>> >>    LOOP_VINFO_OPERANDS_SWAPPED (res) = false;
>>>>>> >> +  LOOP_VINFO_CAN_BE_MASKED (res) = false;
>>>>>> >> +  LOOP_VINFO_REQUIRED_MASKS (res) = 0;
>>>>>> >> +  LOOP_VINFO_COMBINE_EPILOGUE (res) = false;
>>>>>> >> +  LOOP_VINFO_MASK_EPILOGUE (res) = false;
>>>>>> >> +  LOOP_VINFO_NEED_MASKING (res) = false;
>>>>>> >> +  LOOP_VINFO_ORIG_LOOP_INFO (res) = NULL;
>>>>>> >
>>>>>> > Yes.
>>>>>> >
>>>>>> >> Did you mean also that new combined patch must be working patch,
>>>>i.e.
>>>>>> >> can be integrated without other patches?
>>>>>> >
>>>>>> > Yes.
>>>>>> >
>>>>>> >> Could you please look at updated patch?
>>>>>> >
>>>>>> > Will do.
>>>>>> >
>>>>>> > Thanks,
>>>>>> > Richard.
>>>>>> >
>>>>>> >> Thanks.
>>>>>> >> Yuri.
>>>>>> >>
>>>>>> >> 2016-11-10 15:36 GMT+03:00 Richard Biener <rguenther@suse.de>:
>>>>>> >> > On Thu, 10 Nov 2016, Richard Biener wrote:
>>>>>> >> >
>>>>>> >> >> On Tue, 8 Nov 2016, Yuri Rumyantsev wrote:
>>>>>> >> >>
>>>>>> >> >> > Richard,
>>>>>> >> >> >
>>>>>> >> >> > Here is updated 3 patch.
>>>>>> >> >> >
>>>>>> >> >> > I checked that all new tests related to epilogue
>>>>vectorization passed with it.
>>>>>> >> >> >
>>>>>> >> >> > Your comments will be appreciated.
>>>>>> >> >>
>>>>>> >> >> A lot better now.  Instead of the ->aux dance I now prefer to
>>>>>> >> >> pass the original loops loop_vinfo to vect_analyze_loop as
>>>>>> >> >> optional argument (if non-NULL we analyze the epilogue of that
>>>>>> >> >> loop_vinfo).  OTOH I remember we mainly use it to get at the
>>>>>> >> >> original vectorization factor?  So we can pass down an
>>>>(optional)
>>>>>> >> >> forced vectorization factor as well?
>>>>>> >> >
>>>>>> >> > Btw, I wonder if you can produce a single patch containing just
>>>>>> >> > epilogue vectorization, that is combine patches 1-3 but rip out
>>>>>> >> > changes only needed by later patches?
>>>>>> >> >
>>>>>> >> > Thanks,
>>>>>> >> > Richard.
>>>>>> >> >
>>>>>> >> >> Richard.
>>>>>> >> >>
>>>>>> >> >> > 2016-11-08 15:38 GMT+03:00 Richard Biener
>>>><rguenther@suse.de>:
>>>>>> >> >> > > On Thu, 3 Nov 2016, Yuri Rumyantsev wrote:
>>>>>> >> >> > >
>>>>>> >> >> > >> Hi Richard,
>>>>>> >> >> > >>
>>>>>> >> >> > >> I did not understand your last remark:
>>>>>> >> >> > >>
>>>>>> >> >> > >> > That is, here (and avoid the FOR_EACH_LOOP change):
>>>>>> >> >> > >> >
>>>>>> >> >> > >> > @@ -580,12 +586,21 @@ vectorize_loops (void)
>>>>>> >> >> > >> >           && dump_enabled_p ())
>>>>>> >> >> > >> >           dump_printf_loc (MSG_OPTIMIZED_LOCATIONS,
>>>>vect_location,
>>>>>> >> >> > >> >                            "loop vectorized\n");
>>>>>> >> >> > >> > -       vect_transform_loop (loop_vinfo);
>>>>>> >> >> > >> > +       new_loop = vect_transform_loop (loop_vinfo);
>>>>>> >> >> > >> >         num_vectorized_loops++;
>>>>>> >> >> > >> >        /* Now that the loop has been vectorized, allow
>>>>it to be unrolled
>>>>>> >> >> > >> >           etc.  */
>>>>>> >> >> > >> >      loop->force_vectorize = false;
>>>>>> >> >> > >> >
>>>>>> >> >> > >> > +       /* Add new loop to a processing queue.  To make
>>>>it easier
>>>>>> >> >> > >> > +          to match loop and its epilogue vectorization
>>>>in dumps
>>>>>> >> >> > >> > +          put new loop as the next loop to process.
>>>>*/
>>>>>> >> >> > >> > +       if (new_loop)
>>>>>> >> >> > >> > +         {
>>>>>> >> >> > >> > +           loops.safe_insert (i + 1, new_loop->num);
>>>>>> >> >> > >> > +           vect_loops_num = number_of_loops (cfun);
>>>>>> >> >> > >> > +         }
>>>>>> >> >> > >> >
>>>>>> >> >> > >> > simply dispatch to a vectorize_epilogue (loop_vinfo,
>>>>new_loop)
>>>>>> >> >> > >> > function which will set up stuff properly (and also
>>>>perform
>>>>>> >> >> > >> > the if-conversion of the epilogue there).
>>>>>> >> >> > >> >
>>>>>> >> >> > >> > That said, if we can get in non-masked epilogue
>>>>vectorization
>>>>>> >> >> > >> > separately that would be great.
>>>>>> >> >> > >>
>>>>>> >> >> > >> Could you please clarify your proposal.
>>>>>> >> >> > >
>>>>>> >> >> > > When a loop was vectorized set things up to immediately
>>>>vectorize
>>>>>> >> >> > > its epilogue, avoiding changing the loop iteration and
>>>>avoiding
>>>>>> >> >> > > the re-use of ->aux.
>>>>>> >> >> > >
>>>>>> >> >> > > Richard.
>>>>>> >> >> > >
>>>>>> >> >> > >> Thanks.
>>>>>> >> >> > >> Yuri.
>>>>>> >> >> > >>
>>>>>> >> >> > >> 2016-11-02 15:27 GMT+03:00 Richard Biener
>>>><rguenther@suse.de>:
>>>>>> >> >> > >> > On Tue, 1 Nov 2016, Yuri Rumyantsev wrote:
>>>>>> >> >> > >> >
>>>>>> >> >> > >> >> Hi All,
>>>>>> >> >> > >> >>
>>>>>> >> >> > >> >> I re-send all patches sent by Ilya earlier for review
>>>>which support
>>>>>> >> >> > >> >> vectorization of loop epilogues and loops with low
>>>>trip count. We
>>>>>> >> >> > >> >> assume that the only patch -
>>>>vec-tails-07-combine-tail.patch - was not
>>>>>> >> >> > >> >> approved by Jeff.
>>>>>> >> >> > >> >>
>>>>>> >> >> > >> >> I did re-base of all patches and performed
>>>>bootstrapping and
>>>>>> >> >> > >> >> regression testing that did not show any new failures.
>>>>Also all
>>>>>> >> >> > >> >> changes related to new vect_do_peeling algorithm have
>>>>been changed
>>>>>> >> >> > >> >> accordingly.
>>>>>> >> >> > >> >>
>>>>>> >> >> > >> >> Is it OK for trunk?
>>>>>> >> >> > >> >
>>>>>> >> >> > >> > I would have prefered that the series up to
>>>>-03-nomask-tails would
>>>>>> >> >> > >> > _only_ contain epilogue loop vectorization changes but
>>>>unfortunately
>>>>>> >> >> > >> > the patchset is oddly separated.
>>>>>> >> >> > >> >
>>>>>> >> >> > >> > I have a comment on that part nevertheless:
>>>>>> >> >> > >> >
>>>>>> >> >> > >> > @@ -1608,7 +1614,10 @@ vect_enhance_data_refs_alignment
>>>>(loop_vec_info
>>>>>> >> >> > >> > loop_vinfo)
>>>>>> >> >> > >> >    /* Check if we can possibly peel the loop.  */
>>>>>> >> >> > >> >    if (!vect_can_advance_ivs_p (loop_vinfo)
>>>>>> >> >> > >> >        || !slpeel_can_duplicate_loop_p (loop,
>>>>single_exit (loop))
>>>>>> >> >> > >> > -      || loop->inner)
>>>>>> >> >> > >> > +      || loop->inner
>>>>>> >> >> > >> > +      /* Required peeling was performed in prologue
>>>>and
>>>>>> >> >> > >> > +        is not required for epilogue.  */
>>>>>> >> >> > >> > +      || LOOP_VINFO_EPILOGUE_P (loop_vinfo))
>>>>>> >> >> > >> >      do_peeling = false;
>>>>>> >> >> > >> >
>>>>>> >> >> > >> >    if (do_peeling
>>>>>> >> >> > >> > @@ -1888,7 +1897,10 @@ vect_enhance_data_refs_alignment
>>>>(loop_vec_info
>>>>>> >> >> > >> > loop_vinfo)
>>>>>> >> >> > >> >
>>>>>> >> >> > >> >    do_versioning =
>>>>>> >> >> > >> >         optimize_loop_nest_for_speed_p (loop)
>>>>>> >> >> > >> > -       && (!loop->inner); /* FORNOW */
>>>>>> >> >> > >> > +       && (!loop->inner) /* FORNOW */
>>>>>> >> >> > >> > +        /* Required versioning was performed for the
>>>>>> >> >> > >> > +          original loop and is not required for
>>>>epilogue.  */
>>>>>> >> >> > >> > +       && !LOOP_VINFO_EPILOGUE_P (loop_vinfo);
>>>>>> >> >> > >> >
>>>>>> >> >> > >> >    if (do_versioning)
>>>>>> >> >> > >> >      {
>>>>>> >> >> > >> >
>>>>>> >> >> > >> > please do that check in the single caller of this
>>>>function.
>>>>>> >> >> > >> >
>>>>>> >> >> > >> > Otherwise I still dislike the new ->aux use and I
>>>>believe that simply
>>>>>> >> >> > >> > passing down info from the processed parent would be
>>>>_much_ cleaner.
>>>>>> >> >> > >> > That is, here (and avoid the FOR_EACH_LOOP change):
>>>>>> >> >> > >> >
>>>>>> >> >> > >> > @@ -580,12 +586,21 @@ vectorize_loops (void)
>>>>>> >> >> > >> >             && dump_enabled_p ())
>>>>>> >> >> > >> >            dump_printf_loc (MSG_OPTIMIZED_LOCATIONS,
>>>>vect_location,
>>>>>> >> >> > >> >                             "loop vectorized\n");
>>>>>> >> >> > >> > -       vect_transform_loop (loop_vinfo);
>>>>>> >> >> > >> > +       new_loop = vect_transform_loop (loop_vinfo);
>>>>>> >> >> > >> >         num_vectorized_loops++;
>>>>>> >> >> > >> >         /* Now that the loop has been vectorized, allow
>>>>it to be unrolled
>>>>>> >> >> > >> >            etc.  */
>>>>>> >> >> > >> >         loop->force_vectorize = false;
>>>>>> >> >> > >> >
>>>>>> >> >> > >> > +       /* Add new loop to a processing queue.  To make
>>>>it easier
>>>>>> >> >> > >> > +          to match loop and its epilogue vectorization
>>>>in dumps
>>>>>> >> >> > >> > +          put new loop as the next loop to process.
>>>>*/
>>>>>> >> >> > >> > +       if (new_loop)
>>>>>> >> >> > >> > +         {
>>>>>> >> >> > >> > +           loops.safe_insert (i + 1, new_loop->num);
>>>>>> >> >> > >> > +           vect_loops_num = number_of_loops (cfun);
>>>>>> >> >> > >> > +         }
>>>>>> >> >> > >> >
>>>>>> >> >> > >> > simply dispatch to a vectorize_epilogue (loop_vinfo,
>>>>new_loop)
>>>>>> >> >> > >> > function which will set up stuff properly (and also
>>>>perform
>>>>>> >> >> > >> > the if-conversion of the epilogue there).
>>>>>> >> >> > >> >
>>>>>> >> >> > >> > That said, if we can get in non-masked epilogue
>>>>vectorization
>>>>>> >> >> > >> > separately that would be great.
>>>>>> >> >> > >> >
>>>>>> >> >> > >> > I'm still torn about all the rest of the stuff and
>>>>question its
>>>>>> >> >> > >> > usability (esp. merging the epilogue with the main
>>>>vector loop).
>>>>>> >> >> > >> > But it has already been approved ... oh well.
>>>>>> >> >> > >> >
>>>>>> >> >> > >> > Thanks,
>>>>>> >> >> > >> > Richard.
>>>>>> >> >> > >>
>>>>>> >> >> > >>
>>>>>> >> >> > >
>>>>>> >> >> > > --
>>>>>> >> >> > > Richard Biener <rguenther@suse.de>
>>>>>> >> >> > > SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard,
>>>>Graham Norton, HRB 21284 (AG Nuernberg)
>>>>>> >> >> >
>>>>>> >> >>
>>>>>> >> >>
>>>>>> >> >
>>>>>> >> > --
>>>>>> >> > Richard Biener <rguenther@suse.de>
>>>>>> >> > SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham
>>>>Norton, HRB 21284 (AG Nuernberg)
>>>>>> >>
>>>>>> >
>>>>>> > --
>>>>>> > Richard Biener <rguenther@suse.de>
>>>>>> > SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham
>>>>Norton, HRB 21284 (AG Nuernberg)
>>>>>>
>>>>>
>>>>> --
>>>>> Richard Biener <rguenther@suse.de>
>>>>> SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham
>>>>Norton, HRB 21284 (AG Nuernberg)
>>>
>>>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH, vec-tails] Support loop epilogue vectorization
  2016-11-18 15:46                             ` Yuri Rumyantsev
@ 2016-11-18 15:54                               ` Christophe Lyon
  2016-11-24 13:42                                 ` Yuri Rumyantsev
  2016-11-29 16:22                                 ` Christophe Lyon
  0 siblings, 2 replies; 38+ messages in thread
From: Christophe Lyon @ 2016-11-18 15:54 UTC (permalink / raw)
  To: Yuri Rumyantsev; +Cc: Richard Biener, Jeff Law, gcc-patches, Ilya Enkovich

On 18 November 2016 at 16:46, Yuri Rumyantsev <ysrumyan@gmail.com> wrote:
> It is very strange that this test failed on arm, since it requires
> target avx2 to check vectorizer dumps:
>
> /* { dg-final { scan-tree-dump-times "LOOP VECTORIZED" 2 "vect" {
> target avx2_runtime } } } */
> /* { dg-final { scan-tree-dump-times "LOOP EPILOGUE VECTORIZED
> \\(VS=16\\)" 2 "vect" { target avx2_runtime } } } */
>
> Could you please clarify what is the reason of the failure?

It's not the scan-dumps that fail, but the execution.
The test calls abort() for some reason.

It will take me a while to rebuild the test manually in the right
debug environment to provide you with more traces.



>
> Thanks.
>
> 2016-11-18 16:20 GMT+03:00 Christophe Lyon <christophe.lyon@linaro.org>:
>> On 15 November 2016 at 15:41, Yuri Rumyantsev <ysrumyan@gmail.com> wrote:
>>> Hi All,
>>>
>>> Here is patch for non-masked epilogue vectorization.
>>>
>>> Bootstrap and regression testing did not show any new failures.
>>>
>>> Is it OK for trunk?
>>>
>>> Thanks.
>>> Changelog:
>>>
>>> 2016-11-15  Yuri Rumyantsev  <ysrumyan@gmail.com>
>>>
>>> * params.def (PARAM_VECT_EPILOGUES_NOMASK): New.
>>> * tree-if-conv.c (tree_if_conversion): Make public.
>>> * tree-if-conv.h: New file.
>>> * tree-vect-data-refs.c (vect_analyze_data_ref_dependences): Avoid
>>> dynamic alias checks for epilogues.
>>> * tree-vect-loop-manip.c (vect_do_peeling): Return created epilog.
>>> * tree-vect-loop.c: include tree-if-conv.h.
>>> (new_loop_vec_info): Add zeroing orig_loop_info field.
>>> (vect_analyze_loop_2): Don't try to enhance alignment for epilogues.
>>> (vect_analyze_loop): Add argument ORIG_LOOP_INFO which is not NULL
>>> if epilogue is vectorized, set up orig_loop_info field of loop_vinfo
>>> using passed argument.
>>> (vect_transform_loop): Check if created epilogue should be returned
>>> for further vectorization with less vf.  If-convert epilogue if
>>> required. Print vectorization success for epilogue.
>>> * tree-vectorizer.c (vectorize_loops): Add epilogue vectorization
>>> if it is required, pass loop_vinfo produced during vectorization of
>>> loop body to vect_analyze_loop.
>>> * tree-vectorizer.h (struct _loop_vec_info): Add new field
>>> orig_loop_info.
>>> (LOOP_VINFO_ORIG_LOOP_INFO): New.
>>> (LOOP_VINFO_EPILOGUE_P): New.
>>> (LOOP_VINFO_ORIG_VECT_FACTOR): New.
>>> (vect_do_peeling): Change prototype to return epilogue.
>>> (vect_analyze_loop): Add argument of loop_vec_info type.
>>> (vect_transform_loop): Return created loop.
>>>
>>> gcc/testsuite/
>>>
>>> * lib/target-supports.exp (check_avx2_hw_available): New.
>>> (check_effective_target_avx2_runtime): New.
>>> * gcc.dg/vect/vect-tail-nomask-1.c: New test.
>>>
>>
>> Hi,
>>
>> This new test fails on arm-none-eabi (using default cpu/fpu/mode):
>>   gcc.dg/vect/vect-tail-nomask-1.c -flto -ffat-lto-objects execution test
>>   gcc.dg/vect/vect-tail-nomask-1.c execution test
>>
>> It does pass on the same target if configured --with-cpu=cortex-a9.
>>
>> Christophe
>>
>>
>>
>>>
>>> 2016-11-14 20:04 GMT+03:00 Richard Biener <rguenther@suse.de>:
>>>> On November 14, 2016 4:39:40 PM GMT+01:00, Yuri Rumyantsev <ysrumyan@gmail.com> wrote:
>>>>>Richard,
>>>>>
>>>>>I checked one of the tests designed for epilogue vectorization using
>>>>>patches 1 - 3 and found out that build compiler performs vectorization
>>>>>of epilogues with --param vect-epilogues-nomask=1 passed:
>>>>>
>>>>>$ gcc -Ofast -mavx2 t1.c -S --param vect-epilogues-nomask=1 -o
>>>>>t1.new-nomask.s -fdump-tree-vect-details
>>>>>$ grep VECTORIZED -c t1.c.156t.vect
>>>>>4
>>>>> Without param only 2 loops are vectorized.
>>>>>
>>>>>Should I simply add a part of tests related to this feature or I must
>>>>>delete all not necessary changes also?
>>>>
>>>> Please remove all not necessary changes.
>>>>
>>>> Richard.
>>>>
>>>>>Thanks.
>>>>>Yuri.
>>>>>
>>>>>2016-11-14 16:40 GMT+03:00 Richard Biener <rguenther@suse.de>:
>>>>>> On Mon, 14 Nov 2016, Yuri Rumyantsev wrote:
>>>>>>
>>>>>>> Richard,
>>>>>>>
>>>>>>> In my previous patch I forgot to remove couple lines related to aux
>>>>>field.
>>>>>>> Here is the correct updated patch.
>>>>>>
>>>>>> Yeah, I noticed.  This patch would be ok for trunk (together with
>>>>>> necessary parts from 1 and 2) if all not required parts are removed
>>>>>> (and you'd add the testcases covering non-masked tail vect).
>>>>>>
>>>>>> Thus, can you please produce a single complete patch containing only
>>>>>> non-masked epilogue vectorization?
>>>>>>
>>>>>> Thanks,
>>>>>> Richard.
>>>>>>
>>>>>>> Thanks.
>>>>>>> Yuri.
>>>>>>>
>>>>>>> 2016-11-14 15:51 GMT+03:00 Richard Biener <rguenther@suse.de>:
>>>>>>> > On Fri, 11 Nov 2016, Yuri Rumyantsev wrote:
>>>>>>> >
>>>>>>> >> Richard,
>>>>>>> >>
>>>>>>> >> I prepare updated 3 patch with passing additional argument to
>>>>>>> >> vect_analyze_loop as you proposed (untested).
>>>>>>> >>
>>>>>>> >> You wrote:
>>>>>>> >> Btw, I wonder if you can produce a single patch containing just
>>>>>>> >> epilogue vectorization, that is combine patches 1-3 but rip out
>>>>>>> >> changes only needed by later patches?
>>>>>>> >>
>>>>>>> >> Did you mean that I exclude all support for vectorization
>>>>>epilogues,
>>>>>>> >> i.e. exclude from 2-nd patch all non-related changes
>>>>>>> >> like
>>>>>>> >>
>>>>>>> >> diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
>>>>>>> >> index 11863af..32011c1 100644
>>>>>>> >> --- a/gcc/tree-vect-loop.c
>>>>>>> >> +++ b/gcc/tree-vect-loop.c
>>>>>>> >> @@ -1120,6 +1120,12 @@ new_loop_vec_info (struct loop *loop)
>>>>>>> >>    LOOP_VINFO_PEELING_FOR_GAPS (res) = false;
>>>>>>> >>    LOOP_VINFO_PEELING_FOR_NITER (res) = false;
>>>>>>> >>    LOOP_VINFO_OPERANDS_SWAPPED (res) = false;
>>>>>>> >> +  LOOP_VINFO_CAN_BE_MASKED (res) = false;
>>>>>>> >> +  LOOP_VINFO_REQUIRED_MASKS (res) = 0;
>>>>>>> >> +  LOOP_VINFO_COMBINE_EPILOGUE (res) = false;
>>>>>>> >> +  LOOP_VINFO_MASK_EPILOGUE (res) = false;
>>>>>>> >> +  LOOP_VINFO_NEED_MASKING (res) = false;
>>>>>>> >> +  LOOP_VINFO_ORIG_LOOP_INFO (res) = NULL;
>>>>>>> >
>>>>>>> > Yes.
>>>>>>> >
>>>>>>> >> Did you mean also that new combined patch must be working patch,
>>>>>i.e.
>>>>>>> >> can be integrated without other patches?
>>>>>>> >
>>>>>>> > Yes.
>>>>>>> >
>>>>>>> >> Could you please look at updated patch?
>>>>>>> >
>>>>>>> > Will do.
>>>>>>> >
>>>>>>> > Thanks,
>>>>>>> > Richard.
>>>>>>> >
>>>>>>> >> Thanks.
>>>>>>> >> Yuri.
>>>>>>> >>
>>>>>>> >> 2016-11-10 15:36 GMT+03:00 Richard Biener <rguenther@suse.de>:
>>>>>>> >> > On Thu, 10 Nov 2016, Richard Biener wrote:
>>>>>>> >> >
>>>>>>> >> >> On Tue, 8 Nov 2016, Yuri Rumyantsev wrote:
>>>>>>> >> >>
>>>>>>> >> >> > Richard,
>>>>>>> >> >> >
>>>>>>> >> >> > Here is updated 3 patch.
>>>>>>> >> >> >
>>>>>>> >> >> > I checked that all new tests related to epilogue
>>>>>vectorization passed with it.
>>>>>>> >> >> >
>>>>>>> >> >> > Your comments will be appreciated.
>>>>>>> >> >>
>>>>>>> >> >> A lot better now.  Instead of the ->aux dance I now prefer to
>>>>>>> >> >> pass the original loops loop_vinfo to vect_analyze_loop as
>>>>>>> >> >> optional argument (if non-NULL we analyze the epilogue of that
>>>>>>> >> >> loop_vinfo).  OTOH I remember we mainly use it to get at the
>>>>>>> >> >> original vectorization factor?  So we can pass down an
>>>>>(optional)
>>>>>>> >> >> forced vectorization factor as well?
>>>>>>> >> >
>>>>>>> >> > Btw, I wonder if you can produce a single patch containing just
>>>>>>> >> > epilogue vectorization, that is combine patches 1-3 but rip out
>>>>>>> >> > changes only needed by later patches?
>>>>>>> >> >
>>>>>>> >> > Thanks,
>>>>>>> >> > Richard.
>>>>>>> >> >
>>>>>>> >> >> Richard.
>>>>>>> >> >>
>>>>>>> >> >> > 2016-11-08 15:38 GMT+03:00 Richard Biener
>>>>><rguenther@suse.de>:
>>>>>>> >> >> > > On Thu, 3 Nov 2016, Yuri Rumyantsev wrote:
>>>>>>> >> >> > >
>>>>>>> >> >> > >> Hi Richard,
>>>>>>> >> >> > >>
>>>>>>> >> >> > >> I did not understand your last remark:
>>>>>>> >> >> > >>
>>>>>>> >> >> > >> > That is, here (and avoid the FOR_EACH_LOOP change):
>>>>>>> >> >> > >> >
>>>>>>> >> >> > >> > @@ -580,12 +586,21 @@ vectorize_loops (void)
>>>>>>> >> >> > >> >           && dump_enabled_p ())
>>>>>>> >> >> > >> >           dump_printf_loc (MSG_OPTIMIZED_LOCATIONS,
>>>>>vect_location,
>>>>>>> >> >> > >> >                            "loop vectorized\n");
>>>>>>> >> >> > >> > -       vect_transform_loop (loop_vinfo);
>>>>>>> >> >> > >> > +       new_loop = vect_transform_loop (loop_vinfo);
>>>>>>> >> >> > >> >         num_vectorized_loops++;
>>>>>>> >> >> > >> >        /* Now that the loop has been vectorized, allow
>>>>>it to be unrolled
>>>>>>> >> >> > >> >           etc.  */
>>>>>>> >> >> > >> >      loop->force_vectorize = false;
>>>>>>> >> >> > >> >
>>>>>>> >> >> > >> > +       /* Add new loop to a processing queue.  To make
>>>>>it easier
>>>>>>> >> >> > >> > +          to match loop and its epilogue vectorization
>>>>>in dumps
>>>>>>> >> >> > >> > +          put new loop as the next loop to process.
>>>>>*/
>>>>>>> >> >> > >> > +       if (new_loop)
>>>>>>> >> >> > >> > +         {
>>>>>>> >> >> > >> > +           loops.safe_insert (i + 1, new_loop->num);
>>>>>>> >> >> > >> > +           vect_loops_num = number_of_loops (cfun);
>>>>>>> >> >> > >> > +         }
>>>>>>> >> >> > >> >
>>>>>>> >> >> > >> > simply dispatch to a vectorize_epilogue (loop_vinfo,
>>>>>new_loop)
>>>>>>> >> >> > >> > function which will set up stuff properly (and also
>>>>>perform
>>>>>>> >> >> > >> > the if-conversion of the epilogue there).
>>>>>>> >> >> > >> >
>>>>>>> >> >> > >> > That said, if we can get in non-masked epilogue
>>>>>vectorization
>>>>>>> >> >> > >> > separately that would be great.
>>>>>>> >> >> > >>
>>>>>>> >> >> > >> Could you please clarify your proposal.
>>>>>>> >> >> > >
>>>>>>> >> >> > > When a loop was vectorized set things up to immediately
>>>>>vectorize
>>>>>>> >> >> > > its epilogue, avoiding changing the loop iteration and
>>>>>avoiding
>>>>>>> >> >> > > the re-use of ->aux.
>>>>>>> >> >> > >
>>>>>>> >> >> > > Richard.
>>>>>>> >> >> > >
>>>>>>> >> >> > >> Thanks.
>>>>>>> >> >> > >> Yuri.
>>>>>>> >> >> > >>
>>>>>>> >> >> > >> 2016-11-02 15:27 GMT+03:00 Richard Biener
>>>>><rguenther@suse.de>:
>>>>>>> >> >> > >> > On Tue, 1 Nov 2016, Yuri Rumyantsev wrote:
>>>>>>> >> >> > >> >
>>>>>>> >> >> > >> >> Hi All,
>>>>>>> >> >> > >> >>
>>>>>>> >> >> > >> >> I re-send all patches sent by Ilya earlier for review
>>>>>which support
>>>>>>> >> >> > >> >> vectorization of loop epilogues and loops with low
>>>>>trip count. We
>>>>>>> >> >> > >> >> assume that the only patch -
>>>>>vec-tails-07-combine-tail.patch - was not
>>>>>>> >> >> > >> >> approved by Jeff.
>>>>>>> >> >> > >> >>
>>>>>>> >> >> > >> >> I did re-base of all patches and performed
>>>>>bootstrapping and
>>>>>>> >> >> > >> >> regression testing that did not show any new failures.
>>>>>Also all
>>>>>>> >> >> > >> >> changes related to new vect_do_peeling algorithm have
>>>>>been changed
>>>>>>> >> >> > >> >> accordingly.
>>>>>>> >> >> > >> >>
>>>>>>> >> >> > >> >> Is it OK for trunk?
>>>>>>> >> >> > >> >
>>>>>>> >> >> > >> > I would have prefered that the series up to
>>>>>-03-nomask-tails would
>>>>>>> >> >> > >> > _only_ contain epilogue loop vectorization changes but
>>>>>unfortunately
>>>>>>> >> >> > >> > the patchset is oddly separated.
>>>>>>> >> >> > >> >
>>>>>>> >> >> > >> > I have a comment on that part nevertheless:
>>>>>>> >> >> > >> >
>>>>>>> >> >> > >> > @@ -1608,7 +1614,10 @@ vect_enhance_data_refs_alignment
>>>>>(loop_vec_info
>>>>>>> >> >> > >> > loop_vinfo)
>>>>>>> >> >> > >> >    /* Check if we can possibly peel the loop.  */
>>>>>>> >> >> > >> >    if (!vect_can_advance_ivs_p (loop_vinfo)
>>>>>>> >> >> > >> >        || !slpeel_can_duplicate_loop_p (loop,
>>>>>single_exit (loop))
>>>>>>> >> >> > >> > -      || loop->inner)
>>>>>>> >> >> > >> > +      || loop->inner
>>>>>>> >> >> > >> > +      /* Required peeling was performed in prologue
>>>>>and
>>>>>>> >> >> > >> > +        is not required for epilogue.  */
>>>>>>> >> >> > >> > +      || LOOP_VINFO_EPILOGUE_P (loop_vinfo))
>>>>>>> >> >> > >> >      do_peeling = false;
>>>>>>> >> >> > >> >
>>>>>>> >> >> > >> >    if (do_peeling
>>>>>>> >> >> > >> > @@ -1888,7 +1897,10 @@ vect_enhance_data_refs_alignment
>>>>>(loop_vec_info
>>>>>>> >> >> > >> > loop_vinfo)
>>>>>>> >> >> > >> >
>>>>>>> >> >> > >> >    do_versioning =
>>>>>>> >> >> > >> >         optimize_loop_nest_for_speed_p (loop)
>>>>>>> >> >> > >> > -       && (!loop->inner); /* FORNOW */
>>>>>>> >> >> > >> > +       && (!loop->inner) /* FORNOW */
>>>>>>> >> >> > >> > +        /* Required versioning was performed for the
>>>>>>> >> >> > >> > +          original loop and is not required for
>>>>>epilogue.  */
>>>>>>> >> >> > >> > +       && !LOOP_VINFO_EPILOGUE_P (loop_vinfo);
>>>>>>> >> >> > >> >
>>>>>>> >> >> > >> >    if (do_versioning)
>>>>>>> >> >> > >> >      {
>>>>>>> >> >> > >> >
>>>>>>> >> >> > >> > please do that check in the single caller of this
>>>>>function.
>>>>>>> >> >> > >> >
>>>>>>> >> >> > >> > Otherwise I still dislike the new ->aux use and I
>>>>>believe that simply
>>>>>>> >> >> > >> > passing down info from the processed parent would be
>>>>>_much_ cleaner.
>>>>>>> >> >> > >> > That is, here (and avoid the FOR_EACH_LOOP change):
>>>>>>> >> >> > >> >
>>>>>>> >> >> > >> > @@ -580,12 +586,21 @@ vectorize_loops (void)
>>>>>>> >> >> > >> >             && dump_enabled_p ())
>>>>>>> >> >> > >> >            dump_printf_loc (MSG_OPTIMIZED_LOCATIONS,
>>>>>vect_location,
>>>>>>> >> >> > >> >                             "loop vectorized\n");
>>>>>>> >> >> > >> > -       vect_transform_loop (loop_vinfo);
>>>>>>> >> >> > >> > +       new_loop = vect_transform_loop (loop_vinfo);
>>>>>>> >> >> > >> >         num_vectorized_loops++;
>>>>>>> >> >> > >> >         /* Now that the loop has been vectorized, allow
>>>>>it to be unrolled
>>>>>>> >> >> > >> >            etc.  */
>>>>>>> >> >> > >> >         loop->force_vectorize = false;
>>>>>>> >> >> > >> >
>>>>>>> >> >> > >> > +       /* Add new loop to a processing queue.  To make
>>>>>it easier
>>>>>>> >> >> > >> > +          to match loop and its epilogue vectorization
>>>>>in dumps
>>>>>>> >> >> > >> > +          put new loop as the next loop to process.
>>>>>*/
>>>>>>> >> >> > >> > +       if (new_loop)
>>>>>>> >> >> > >> > +         {
>>>>>>> >> >> > >> > +           loops.safe_insert (i + 1, new_loop->num);
>>>>>>> >> >> > >> > +           vect_loops_num = number_of_loops (cfun);
>>>>>>> >> >> > >> > +         }
>>>>>>> >> >> > >> >
>>>>>>> >> >> > >> > simply dispatch to a vectorize_epilogue (loop_vinfo,
>>>>>new_loop)
>>>>>>> >> >> > >> > function which will set up stuff properly (and also
>>>>>perform
>>>>>>> >> >> > >> > the if-conversion of the epilogue there).
>>>>>>> >> >> > >> >
>>>>>>> >> >> > >> > That said, if we can get in non-masked epilogue
>>>>>vectorization
>>>>>>> >> >> > >> > separately that would be great.
>>>>>>> >> >> > >> >
>>>>>>> >> >> > >> > I'm still torn about all the rest of the stuff and
>>>>>question its
>>>>>>> >> >> > >> > usability (esp. merging the epilogue with the main
>>>>>vector loop).
>>>>>>> >> >> > >> > But it has already been approved ... oh well.
>>>>>>> >> >> > >> >
>>>>>>> >> >> > >> > Thanks,
>>>>>>> >> >> > >> > Richard.
>>>>>>> >> >> > >>
>>>>>>> >> >> > >>
>>>>>>> >> >> > >
>>>>>>> >> >> > > --
>>>>>>> >> >> > > Richard Biener <rguenther@suse.de>
>>>>>>> >> >> > > SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard,
>>>>>Graham Norton, HRB 21284 (AG Nuernberg)
>>>>>>> >> >> >
>>>>>>> >> >>
>>>>>>> >> >>
>>>>>>> >> >
>>>>>>> >> > --
>>>>>>> >> > Richard Biener <rguenther@suse.de>
>>>>>>> >> > SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham
>>>>>Norton, HRB 21284 (AG Nuernberg)
>>>>>>> >>
>>>>>>> >
>>>>>>> > --
>>>>>>> > Richard Biener <rguenther@suse.de>
>>>>>>> > SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham
>>>>>Norton, HRB 21284 (AG Nuernberg)
>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Richard Biener <rguenther@suse.de>
>>>>>> SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham
>>>>>Norton, HRB 21284 (AG Nuernberg)
>>>>
>>>>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH, vec-tails] Support loop epilogue vectorization
  2016-11-18 15:54                               ` Christophe Lyon
@ 2016-11-24 13:42                                 ` Yuri Rumyantsev
  2016-11-28 14:39                                   ` Richard Biener
  2016-11-29 16:22                                 ` Christophe Lyon
  1 sibling, 1 reply; 38+ messages in thread
From: Yuri Rumyantsev @ 2016-11-24 13:42 UTC (permalink / raw)
  To: Christophe Lyon; +Cc: Richard Biener, Jeff Law, gcc-patches, Ilya Enkovich

[-- Attachment #1: Type: text/plain, Size: 19946 bytes --]

Hi All,

Here is the second patch, which supports epilogue vectorization using
masking, without a cost model. Currently it is enabled only by passing
the parameter "--param vect-epilogues-mask=1".
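
To illustrate the idea only (this is not code from the patch), here is a
minimal scalar sketch of what a masked epilogue does. The names VF,
masked_add_step and add_with_masked_epilogue are hypothetical, and the
per-lane "if" stands in for a hardware mask: the tail is handled by one
extra iteration with the same vector width, where only the remaining
lanes are active.

```c
#include <stddef.h>

enum { VF = 8 };  /* Hypothetical vectorization factor.  */

/* Emulate one vector iteration of VF lanes.  Lanes whose index is
   not below ACTIVE are masked off and touch no memory.  */
static void
masked_add_step (int *dst, const int *a, const int *b, size_t active)
{
  for (size_t lane = 0; lane < VF; lane++)
    if (lane < active)  /* Mask bit for this lane.  */
      dst[lane] = a[lane] + b[lane];
}

/* Compute dst[i] = a[i] + b[i] for i in [0, n).  */
void
add_with_masked_epilogue (int *dst, const int *a, const int *b, size_t n)
{
  size_t i = 0;

  /* Main vector loop: every lane is active.  */
  for (; i + VF <= n; i += VF)
    masked_add_step (dst + i, a + i, b + i, VF);

  /* Masked epilogue: a single iteration with the same VF in which
     only the remaining n - i lanes are active.  */
  if (i < n)
    masked_add_step (dst + i, a + i, b + i, n - i);
}
```

The point of the sketch is that no scalar remainder loop is left: the
tail elements go through the same vector-width step as the main loop.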

Bootstrapping and regression testing did not show any new regressions.

Any comments will be appreciated.

ChangeLog:
2016-11-24  Yuri Rumyantsev  <ysrumyan@gmail.com>

* params.def (PARAM_VECT_EPILOGUES_MASK): New.
* tree-vect-data-refs.c (vect_get_new_ssa_name): Support vect_mask_var.
* tree-vect-loop.c: Include insn-config.h, recog.h and alias.h.
(new_loop_vec_info): Zero the can_be_masked, mask_loop and
required_masks fields.
(vect_check_required_masks_widening): New.
(vect_check_required_masks_narrowing): New.
(vect_get_masking_iv_elems): New.
(vect_get_masking_iv_type): New.
(vect_get_extreme_masks): New.
(vect_check_required_masks): New.
(vect_analyze_loop_operations): Call vect_check_required_masks if all
statements can be masked.
(vect_analyze_loop_2): Initialize min_scalar_loop_bound to zero.
Add check that epilogue can be masked with the same vf, issuing
failure notes.  Allow epilogue vectorization through masking of low trip
count loops.  Set can_be_masked field to true before loop operation
analysis.  Do not set up min_scalar_loop_bound for epilogue vectorization
through masking.  Do not peel for epilogue masking.  Reset can_be_masked
field before repeating analysis.
(vect_estimate_min_profitable_iters): Do not compute profitability
for epilogue masking.  Set mask_loop field to true if parameter
PARAM_VECT_EPILOGUES_MASK is non-zero.
(vectorizable_reduction): Add check that statement can be masked.
(vectorizable_induction): Do not support masking for induction.
(vect_gen_ivs_for_masking): New.
(vect_get_mask_index_for_elems): New.
(vect_get_mask_index_for_type): New.
(vect_create_narrowed_masks): New.
(vect_create_widened_masks): New.
(vect_gen_loop_masks): New.
(vect_mask_reduction_stmt): New.
(vect_mask_mask_load_store_stmt): New.
(vect_mask_load_store_stmt): New.
(vect_mask_loop): New.
(vect_transform_loop): Invoke vect_mask_loop if required.
Use div_ceil to recompute upper bounds for masked loops.  Issue
statistics for epilogue vectorization through masking. Do not reduce
vf for masking epilogue.
* tree-vect-stmts.c: Include tree-ssa-loop-ivopts.h.
(can_mask_load_store): New.
(vectorizable_mask_load_store): Check that mask conjunction is
supported.  Set up first_copy_p field of stmt_vinfo.
(vectorizable_simd_clone_call): Check that simd clone can not be
masked.
(vectorizable_store): Check that store can be masked. Mark the first
copy of generated vector stores and provide it with vectype and the
original data reference.
(vectorizable_load): Check that load can be masked.
(vect_stmt_should_be_masked_for_epilogue): New.
(vect_add_required_mask_for_stmt): New.
(vect_analyze_stmt): Add check for statements that do not support
masking and print a message.
* tree-vectorizer.h (struct _loop_vec_info): Add new fields
can_be_masked, required_masks, mask_loop.
(LOOP_VINFO_CAN_BE_MASKED): New.
(LOOP_VINFO_REQUIRED_MASKS): New.
(LOOP_VINFO_MASK_LOOP): New.
(struct _stmt_vec_info): Add first_copy_p field.
(STMT_VINFO_FIRST_COPY_P): New.

gcc/testsuite/

* gcc.dg/vect/vect-tail-mask-1.c: New test.
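
As a side note on the div_ceil entry above, a small sketch of the
arithmetic, assuming the bound in question is the number of vector
iterations (the function names are hypothetical, not taken from the
patch): a non-masked vector loop runs floor (n / vf) times and leaves a
remainder, while a masked loop also covers the tail and so runs
ceil (n / vf) times.

```c
/* Iterations of a non-masked vector loop over N elements; the
   remaining N % VF elements are left for an epilogue.  */
unsigned long
vector_iters_unmasked (unsigned long n, unsigned long vf)
{
  return n / vf;
}

/* Iterations of a masked vector loop over N elements; the last
   iteration may be partial.  Written so that it cannot overflow.  */
unsigned long
vector_iters_masked (unsigned long n, unsigned long vf)
{
  return n / vf + (n % vf != 0);
}
```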

2016-11-18 18:54 GMT+03:00 Christophe Lyon <christophe.lyon@linaro.org>:
> On 18 November 2016 at 16:46, Yuri Rumyantsev <ysrumyan@gmail.com> wrote:
>> It is very strange that this test failed on arm, since it requires
>> target avx2 to check vectorizer dumps:
>>
>> /* { dg-final { scan-tree-dump-times "LOOP VECTORIZED" 2 "vect" {
>> target avx2_runtime } } } */
>> /* { dg-final { scan-tree-dump-times "LOOP EPILOGUE VECTORIZED
>> \\(VS=16\\)" 2 "vect" { target avx2_runtime } } } */
>>
>> Could you please clarify what is the reason of the failure?
>
> It's not the scan-dumps that fail, but the execution.
> The test calls abort() for some reason.
>
> It will take me a while to rebuild the test manually in the right
> debug environment to provide you with more traces.
>
>
>
>>
>> Thanks.
>>
>> 2016-11-18 16:20 GMT+03:00 Christophe Lyon <christophe.lyon@linaro.org>:
>>> On 15 November 2016 at 15:41, Yuri Rumyantsev <ysrumyan@gmail.com> wrote:
>>>> Hi All,
>>>>
>>>> Here is patch for non-masked epilogue vectorization.
>>>>
>>>> Bootstrap and regression testing did not show any new failures.
>>>>
>>>> Is it OK for trunk?
>>>>
>>>> Thanks.
>>>> Changelog:
>>>>
>>>> 2016-11-15  Yuri Rumyantsev  <ysrumyan@gmail.com>
>>>>
>>>> * params.def (PARAM_VECT_EPILOGUES_NOMASK): New.
>>>> * tree-if-conv.c (tree_if_conversion): Make public.
>>>> * tree-if-conv.h: New file.
>>>> * tree-vect-data-refs.c (vect_analyze_data_ref_dependences): Avoid
>>>> dynamic alias checks for epilogues.
>>>> * tree-vect-loop-manip.c (vect_do_peeling): Return created epilog.
>>>> * tree-vect-loop.c: include tree-if-conv.h.
>>>> (new_loop_vec_info): Add zeroing orig_loop_info field.
>>>> (vect_analyze_loop_2): Don't try to enhance alignment for epilogues.
>>>> (vect_analyze_loop): Add argument ORIG_LOOP_INFO which is not NULL
>>>> if epilogue is vectorized, set up orig_loop_info field of loop_vinfo
>>>> using passed argument.
>>>> (vect_transform_loop): Check if created epilogue should be returned
>>>> for further vectorization with less vf.  If-convert epilogue if
>>>> required. Print vectorization success for epilogue.
>>>> * tree-vectorizer.c (vectorize_loops): Add epilogue vectorization
>>>> if it is required, pass loop_vinfo produced during vectorization of
>>>> loop body to vect_analyze_loop.
>>>> * tree-vectorizer.h (struct _loop_vec_info): Add new field
>>>> orig_loop_info.
>>>> (LOOP_VINFO_ORIG_LOOP_INFO): New.
>>>> (LOOP_VINFO_EPILOGUE_P): New.
>>>> (LOOP_VINFO_ORIG_VECT_FACTOR): New.
>>>> (vect_do_peeling): Change prototype to return epilogue.
>>>> (vect_analyze_loop): Add argument of loop_vec_info type.
>>>> (vect_transform_loop): Return created loop.
>>>>
>>>> gcc/testsuite/
>>>>
>>>> * lib/target-supports.exp (check_avx2_hw_available): New.
>>>> (check_effective_target_avx2_runtime): New.
>>>> * gcc.dg/vect/vect-tail-nomask-1.c: New test.
>>>>
>>>
>>> Hi,
>>>
>>> This new test fails on arm-none-eabi (using default cpu/fpu/mode):
>>>   gcc.dg/vect/vect-tail-nomask-1.c -flto -ffat-lto-objects execution test
>>>   gcc.dg/vect/vect-tail-nomask-1.c execution test
>>>
>>> It does pass on the same target if configured --with-cpu=cortex-a9.
>>>
>>> Christophe
>>>
>>>
>>>
>>>>
>>>> 2016-11-14 20:04 GMT+03:00 Richard Biener <rguenther@suse.de>:
>>>>> On November 14, 2016 4:39:40 PM GMT+01:00, Yuri Rumyantsev <ysrumyan@gmail.com> wrote:
>>>>>>Richard,
>>>>>>
>>>>>>I checked one of the tests designed for epilogue vectorization using
>>>>>>patches 1 - 3 and found out that the built compiler performs vectorization
>>>>>>of epilogues with --param vect-epilogues-nomask=1 passed:
>>>>>>
>>>>>>$ gcc -Ofast -mavx2 t1.c -S --param vect-epilogues-nomask=1 -o
>>>>>>t1.new-nomask.s -fdump-tree-vect-details
>>>>>>$ grep VECTORIZED -c t1.c.156t.vect
>>>>>>4
>>>>>> Without param only 2 loops are vectorized.
>>>>>>
>>>>>>Should I simply add a part of tests related to this feature or I must
>>>>>>delete all not necessary changes also?
>>>>>
>>>>> Please remove all not necessary changes.
>>>>>
>>>>> Richard.
>>>>>
>>>>>>Thanks.
>>>>>>Yuri.
>>>>>>
>>>>>>2016-11-14 16:40 GMT+03:00 Richard Biener <rguenther@suse.de>:
>>>>>>> On Mon, 14 Nov 2016, Yuri Rumyantsev wrote:
>>>>>>>
>>>>>>>> Richard,
>>>>>>>>
>>>>>>>> In my previous patch I forgot to remove couple lines related to aux
>>>>>>field.
>>>>>>>> Here is the correct updated patch.
>>>>>>>
>>>>>>> Yeah, I noticed.  This patch would be ok for trunk (together with
>>>>>>> necessary parts from 1 and 2) if all not required parts are removed
>>>>>>> (and you'd add the testcases covering non-masked tail vect).
>>>>>>>
>>>>>>> Thus, can you please produce a single complete patch containing only
>>>>>>> non-masked epilogue vectorization?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Richard.
>>>>>>>
>>>>>>>> Thanks.
>>>>>>>> Yuri.
>>>>>>>>
>>>>>>>> 2016-11-14 15:51 GMT+03:00 Richard Biener <rguenther@suse.de>:
>>>>>>>> > On Fri, 11 Nov 2016, Yuri Rumyantsev wrote:
>>>>>>>> >
>>>>>>>> >> Richard,
>>>>>>>> >>
>>>>>>>> >> I prepare updated 3 patch with passing additional argument to
>>>>>>>> >> vect_analyze_loop as you proposed (untested).
>>>>>>>> >>
>>>>>>>> >> You wrote:
>>>>>>>> >> Btw, I wonder if you can produce a single patch containing just
>>>>>>>> >> epilogue vectorization, that is combine patches 1-3 but rip out
>>>>>>>> >> changes only needed by later patches?
>>>>>>>> >>
>>>>>>>> >> Did you mean that I exclude all support for vectorization
>>>>>>epilogues,
>>>>>>>> >> i.e. exclude from 2-nd patch all non-related changes
>>>>>>>> >> like
>>>>>>>> >>
>>>>>>>> >> diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
>>>>>>>> >> index 11863af..32011c1 100644
>>>>>>>> >> --- a/gcc/tree-vect-loop.c
>>>>>>>> >> +++ b/gcc/tree-vect-loop.c
>>>>>>>> >> @@ -1120,6 +1120,12 @@ new_loop_vec_info (struct loop *loop)
>>>>>>>> >>    LOOP_VINFO_PEELING_FOR_GAPS (res) = false;
>>>>>>>> >>    LOOP_VINFO_PEELING_FOR_NITER (res) = false;
>>>>>>>> >>    LOOP_VINFO_OPERANDS_SWAPPED (res) = false;
>>>>>>>> >> +  LOOP_VINFO_CAN_BE_MASKED (res) = false;
>>>>>>>> >> +  LOOP_VINFO_REQUIRED_MASKS (res) = 0;
>>>>>>>> >> +  LOOP_VINFO_COMBINE_EPILOGUE (res) = false;
>>>>>>>> >> +  LOOP_VINFO_MASK_EPILOGUE (res) = false;
>>>>>>>> >> +  LOOP_VINFO_NEED_MASKING (res) = false;
>>>>>>>> >> +  LOOP_VINFO_ORIG_LOOP_INFO (res) = NULL;
>>>>>>>> >
>>>>>>>> > Yes.
>>>>>>>> >
>>>>>>>> >> Did you mean also that new combined patch must be working patch,
>>>>>>i.e.
>>>>>>>> >> can be integrated without other patches?
>>>>>>>> >
>>>>>>>> > Yes.
>>>>>>>> >
>>>>>>>> >> Could you please look at updated patch?
>>>>>>>> >
>>>>>>>> > Will do.
>>>>>>>> >
>>>>>>>> > Thanks,
>>>>>>>> > Richard.
>>>>>>>> >
>>>>>>>> >> Thanks.
>>>>>>>> >> Yuri.
>>>>>>>> >>
>>>>>>>> >> 2016-11-10 15:36 GMT+03:00 Richard Biener <rguenther@suse.de>:
>>>>>>>> >> > On Thu, 10 Nov 2016, Richard Biener wrote:
>>>>>>>> >> >
>>>>>>>> >> >> On Tue, 8 Nov 2016, Yuri Rumyantsev wrote:
>>>>>>>> >> >>
>>>>>>>> >> >> > Richard,
>>>>>>>> >> >> >
>>>>>>>> >> >> > Here is updated 3 patch.
>>>>>>>> >> >> >
>>>>>>>> >> >> > I checked that all new tests related to epilogue
>>>>>>vectorization passed with it.
>>>>>>>> >> >> >
>>>>>>>> >> >> > Your comments will be appreciated.
>>>>>>>> >> >>
>>>>>>>> >> >> A lot better now.  Instead of the ->aux dance I now prefer to
>>>>>>>> >> >> pass the original loops loop_vinfo to vect_analyze_loop as
>>>>>>>> >> >> optional argument (if non-NULL we analyze the epilogue of that
>>>>>>>> >> >> loop_vinfo).  OTOH I remember we mainly use it to get at the
>>>>>>>> >> >> original vectorization factor?  So we can pass down an
>>>>>>(optional)
>>>>>>>> >> >> forced vectorization factor as well?
>>>>>>>> >> >
>>>>>>>> >> > Btw, I wonder if you can produce a single patch containing just
>>>>>>>> >> > epilogue vectorization, that is combine patches 1-3 but rip out
>>>>>>>> >> > changes only needed by later patches?
>>>>>>>> >> >
>>>>>>>> >> > Thanks,
>>>>>>>> >> > Richard.
>>>>>>>> >> >
>>>>>>>> >> >> Richard.
>>>>>>>> >> >>
>>>>>>>> >> >> > 2016-11-08 15:38 GMT+03:00 Richard Biener
>>>>>><rguenther@suse.de>:
>>>>>>>> >> >> > > On Thu, 3 Nov 2016, Yuri Rumyantsev wrote:
>>>>>>>> >> >> > >
>>>>>>>> >> >> > >> Hi Richard,
>>>>>>>> >> >> > >>
>>>>>>>> >> >> > >> I did not understand your last remark:
>>>>>>>> >> >> > >>
>>>>>>>> >> >> > >> > That is, here (and avoid the FOR_EACH_LOOP change):
>>>>>>>> >> >> > >> >
>>>>>>>> >> >> > >> > @@ -580,12 +586,21 @@ vectorize_loops (void)
>>>>>>>> >> >> > >> >           && dump_enabled_p ())
>>>>>>>> >> >> > >> >           dump_printf_loc (MSG_OPTIMIZED_LOCATIONS,
>>>>>>vect_location,
>>>>>>>> >> >> > >> >                            "loop vectorized\n");
>>>>>>>> >> >> > >> > -       vect_transform_loop (loop_vinfo);
>>>>>>>> >> >> > >> > +       new_loop = vect_transform_loop (loop_vinfo);
>>>>>>>> >> >> > >> >         num_vectorized_loops++;
>>>>>>>> >> >> > >> >        /* Now that the loop has been vectorized, allow
>>>>>>it to be unrolled
>>>>>>>> >> >> > >> >           etc.  */
>>>>>>>> >> >> > >> >      loop->force_vectorize = false;
>>>>>>>> >> >> > >> >
>>>>>>>> >> >> > >> > +       /* Add new loop to a processing queue.  To make
>>>>>>it easier
>>>>>>>> >> >> > >> > +          to match loop and its epilogue vectorization
>>>>>>in dumps
>>>>>>>> >> >> > >> > +          put new loop as the next loop to process.
>>>>>>*/
>>>>>>>> >> >> > >> > +       if (new_loop)
>>>>>>>> >> >> > >> > +         {
>>>>>>>> >> >> > >> > +           loops.safe_insert (i + 1, new_loop->num);
>>>>>>>> >> >> > >> > +           vect_loops_num = number_of_loops (cfun);
>>>>>>>> >> >> > >> > +         }
>>>>>>>> >> >> > >> >
>>>>>>>> >> >> > >> > simply dispatch to a vectorize_epilogue (loop_vinfo,
>>>>>>new_loop)
>>>>>>>> >> >> > >> > function which will set up stuff properly (and also
>>>>>>perform
>>>>>>>> >> >> > >> > the if-conversion of the epilogue there).
>>>>>>>> >> >> > >> >
>>>>>>>> >> >> > >> > That said, if we can get in non-masked epilogue
>>>>>>vectorization
>>>>>>>> >> >> > >> > separately that would be great.
>>>>>>>> >> >> > >>
>>>>>>>> >> >> > >> Could you please clarify your proposal.
>>>>>>>> >> >> > >
>>>>>>>> >> >> > > When a loop was vectorized set things up to immediately
>>>>>>vectorize
>>>>>>>> >> >> > > its epilogue, avoiding changing the loop iteration and
>>>>>>avoiding
>>>>>>>> >> >> > > the re-use of ->aux.
>>>>>>>> >> >> > >
>>>>>>>> >> >> > > Richard.
>>>>>>>> >> >> > >
>>>>>>>> >> >> > >> Thanks.
>>>>>>>> >> >> > >> Yuri.
>>>>>>>> >> >> > >>
>>>>>>>> >> >> > >> 2016-11-02 15:27 GMT+03:00 Richard Biener
>>>>>><rguenther@suse.de>:
>>>>>>>> >> >> > >> > On Tue, 1 Nov 2016, Yuri Rumyantsev wrote:
>>>>>>>> >> >> > >> >
>>>>>>>> >> >> > >> >> Hi All,
>>>>>>>> >> >> > >> >>
>>>>>>>> >> >> > >> >> I re-send all patches sent by Ilya earlier for review
>>>>>>which support
>>>>>>>> >> >> > >> >> vectorization of loop epilogues and loops with low
>>>>>>trip count. We
>>>>>>>> >> >> > >> >> assume that the only patch -
>>>>>>vec-tails-07-combine-tail.patch - was not
>>>>>>>> >> >> > >> >> approved by Jeff.
>>>>>>>> >> >> > >> >>
>>>>>>>> >> >> > >> >> I did re-base of all patches and performed
>>>>>>bootstrapping and
>>>>>>>> >> >> > >> >> regression testing that did not show any new failures.
>>>>>>Also all
>>>>>>>> >> >> > >> >> changes related to new vect_do_peeling algorithm have
>>>>>>been changed
>>>>>>>> >> >> > >> >> accordingly.
>>>>>>>> >> >> > >> >>
>>>>>>>> >> >> > >> >> Is it OK for trunk?
>>>>>>>> >> >> > >> >
>>>>>>>> >> >> > >> > I would have preferred that the series up to
>>>>>>-03-nomask-tails would
>>>>>>>> >> >> > >> > _only_ contain epilogue loop vectorization changes but
>>>>>>unfortunately
>>>>>>>> >> >> > >> > the patchset is oddly separated.
>>>>>>>> >> >> > >> >
>>>>>>>> >> >> > >> > I have a comment on that part nevertheless:
>>>>>>>> >> >> > >> >
>>>>>>>> >> >> > >> > @@ -1608,7 +1614,10 @@ vect_enhance_data_refs_alignment
>>>>>>(loop_vec_info
>>>>>>>> >> >> > >> > loop_vinfo)
>>>>>>>> >> >> > >> >    /* Check if we can possibly peel the loop.  */
>>>>>>>> >> >> > >> >    if (!vect_can_advance_ivs_p (loop_vinfo)
>>>>>>>> >> >> > >> >        || !slpeel_can_duplicate_loop_p (loop,
>>>>>>single_exit (loop))
>>>>>>>> >> >> > >> > -      || loop->inner)
>>>>>>>> >> >> > >> > +      || loop->inner
>>>>>>>> >> >> > >> > +      /* Required peeling was performed in prologue
>>>>>>and
>>>>>>>> >> >> > >> > +        is not required for epilogue.  */
>>>>>>>> >> >> > >> > +      || LOOP_VINFO_EPILOGUE_P (loop_vinfo))
>>>>>>>> >> >> > >> >      do_peeling = false;
>>>>>>>> >> >> > >> >
>>>>>>>> >> >> > >> >    if (do_peeling
>>>>>>>> >> >> > >> > @@ -1888,7 +1897,10 @@ vect_enhance_data_refs_alignment
>>>>>>(loop_vec_info
>>>>>>>> >> >> > >> > loop_vinfo)
>>>>>>>> >> >> > >> >
>>>>>>>> >> >> > >> >    do_versioning =
>>>>>>>> >> >> > >> >         optimize_loop_nest_for_speed_p (loop)
>>>>>>>> >> >> > >> > -       && (!loop->inner); /* FORNOW */
>>>>>>>> >> >> > >> > +       && (!loop->inner) /* FORNOW */
>>>>>>>> >> >> > >> > +        /* Required versioning was performed for the
>>>>>>>> >> >> > >> > +          original loop and is not required for
>>>>>>epilogue.  */
>>>>>>>> >> >> > >> > +       && !LOOP_VINFO_EPILOGUE_P (loop_vinfo);
>>>>>>>> >> >> > >> >
>>>>>>>> >> >> > >> >    if (do_versioning)
>>>>>>>> >> >> > >> >      {
>>>>>>>> >> >> > >> >
>>>>>>>> >> >> > >> > please do that check in the single caller of this
>>>>>>function.
>>>>>>>> >> >> > >> >
>>>>>>>> >> >> > >> > Otherwise I still dislike the new ->aux use and I
>>>>>>believe that simply
>>>>>>>> >> >> > >> > passing down info from the processed parent would be
>>>>>>_much_ cleaner.
>>>>>>>> >> >> > >> > That is, here (and avoid the FOR_EACH_LOOP change):
>>>>>>>> >> >> > >> >
>>>>>>>> >> >> > >> > @@ -580,12 +586,21 @@ vectorize_loops (void)
>>>>>>>> >> >> > >> >             && dump_enabled_p ())
>>>>>>>> >> >> > >> >            dump_printf_loc (MSG_OPTIMIZED_LOCATIONS,
>>>>>>vect_location,
>>>>>>>> >> >> > >> >                             "loop vectorized\n");
>>>>>>>> >> >> > >> > -       vect_transform_loop (loop_vinfo);
>>>>>>>> >> >> > >> > +       new_loop = vect_transform_loop (loop_vinfo);
>>>>>>>> >> >> > >> >         num_vectorized_loops++;
>>>>>>>> >> >> > >> >         /* Now that the loop has been vectorized, allow
>>>>>>it to be unrolled
>>>>>>>> >> >> > >> >            etc.  */
>>>>>>>> >> >> > >> >         loop->force_vectorize = false;
>>>>>>>> >> >> > >> >
>>>>>>>> >> >> > >> > +       /* Add new loop to a processing queue.  To make
>>>>>>it easier
>>>>>>>> >> >> > >> > +          to match loop and its epilogue vectorization
>>>>>>in dumps
>>>>>>>> >> >> > >> > +          put new loop as the next loop to process.
>>>>>>*/
>>>>>>>> >> >> > >> > +       if (new_loop)
>>>>>>>> >> >> > >> > +         {
>>>>>>>> >> >> > >> > +           loops.safe_insert (i + 1, new_loop->num);
>>>>>>>> >> >> > >> > +           vect_loops_num = number_of_loops (cfun);
>>>>>>>> >> >> > >> > +         }
>>>>>>>> >> >> > >> >
>>>>>>>> >> >> > >> > simply dispatch to a vectorize_epilogue (loop_vinfo,
>>>>>>new_loop)
>>>>>>>> >> >> > >> > function which will set up stuff properly (and also
>>>>>>perform
>>>>>>>> >> >> > >> > the if-conversion of the epilogue there).
>>>>>>>> >> >> > >> >
>>>>>>>> >> >> > >> > That said, if we can get in non-masked epilogue
>>>>>>vectorization
>>>>>>>> >> >> > >> > separately that would be great.
>>>>>>>> >> >> > >> >
>>>>>>>> >> >> > >> > I'm still torn about all the rest of the stuff and
>>>>>>question its
>>>>>>>> >> >> > >> > usability (esp. merging the epilogue with the main
>>>>>>vector loop).
>>>>>>>> >> >> > >> > But it has already been approved ... oh well.
>>>>>>>> >> >> > >> >
>>>>>>>> >> >> > >> > Thanks,
>>>>>>>> >> >> > >> > Richard.
>>>>>>>> >> >> > >>
>>>>>>>> >> >> > >>
>>>>>>>> >> >> > >
>>>>>>>> >> >> > > --
>>>>>>>> >> >> > > Richard Biener <rguenther@suse.de>
>>>>>>>> >> >> > > SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard,
>>>>>>Graham Norton, HRB 21284 (AG Nuernberg)
>>>>>>>> >> >> >
>>>>>>>> >> >>
>>>>>>>> >> >>
>>>>>>>> >> >
>>>>>>>> >> > --
>>>>>>>> >> > Richard Biener <rguenther@suse.de>
>>>>>>>> >> > SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham
>>>>>>Norton, HRB 21284 (AG Nuernberg)
>>>>>>>> >>
>>>>>>>> >
>>>>>>>> > --
>>>>>>>> > Richard Biener <rguenther@suse.de>
>>>>>>>> > SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham
>>>>>>Norton, HRB 21284 (AG Nuernberg)
>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Richard Biener <rguenther@suse.de>
>>>>>>> SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham
>>>>>>Norton, HRB 21284 (AG Nuernberg)
>>>>>
>>>>>
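
For readers following the thread: the attached patch masks the epilogue
by comparing a vector IV against the iteration count (see the comment on
vect_gen_ivs_for_masking and the LT_EXPR check in
vect_check_required_masks).  A scalar model of that idea might look
roughly like the sketch below.  It is only an illustration and not code
from the patch: the function name masked_add, the fixed VF of 8 and the
element-wise add are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

#define VF 8  /* Assumed vectorization factor, for illustration only.  */

/* Scalar model of a masked vector loop (hypothetical helper).  Lane K of
   the IV starts at K and advances by VF on every vector iteration; a lane
   is active only while its IV value is below NITERS.  */
static void
masked_add (int *dst, const int *a, const int *b, unsigned niters)
{
  unsigned iv[VF];

  assert (dst != NULL && a != NULL && b != NULL);

  /* Initial IV value {0, 1, ..., VF - 1}.  */
  for (unsigned k = 0; k < VF; k++)
    iv[k] = k;

  for (unsigned done = 0; done < niters; done += VF)
    for (unsigned k = 0; k < VF; k++)
      {
        /* Models the vector "IV < NITERS" comparison that yields the mask.  */
        int mask = iv[k] < niters;

        /* Masked operation: inactive lanes touch no memory.  */
        if (mask)
          dst[iv[k]] = a[iv[k]] + b[iv[k]];

        /* IV step {VF, VF, ..., VF}.  */
        iv[k] += VF;
      }
}
```

With niters = 11 the second vector iteration runs with only lanes 0..2
active, so no scalar tail loop is needed and element 11 of dst is left
untouched.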

[-- Attachment #2: epilog-mask.patch --]
[-- Type: application/octet-stream, Size: 51372 bytes --]

diff --git a/gcc/params.def b/gcc/params.def
index 50f75a7..db860b6 100644
--- a/gcc/params.def
+++ b/gcc/params.def
@@ -1270,6 +1270,12 @@ DEFPARAM (PARAM_MAX_VRP_SWITCH_ASSERTIONS,
 	  "edge of a switch statement during VRP",
 	  10, 0, 0)
 
+DEFPARAM (PARAM_VECT_EPILOGUES_MASK,
+	  "vect-epilogues-mask",
+	  "Enable loop epilogue vectorization using the same vector "
+	  "size and masking.",
+	  0, 0, 1)
+
 DEFPARAM (PARAM_VECT_EPILOGUES_NOMASK,
 	  "vect-epilogues-nomask",
 	  "Enable loop epilogue vectorization using smaller vector size.",
diff --git a/gcc/testsuite/gcc.dg/vect/vect-tail-mask-1.c b/gcc/testsuite/gcc.dg/vect/vect-tail-mask-1.c
new file mode 100644
index 0000000..0637056
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-tail-mask-1.c
@@ -0,0 +1,8 @@
+/* { dg-do run } */
+/* { dg-require-weak "" } */
+/* { dg-additional-options "--param vect-epilogues-mask=1 -mavx2" { target avx2_runtime } } */
+
+#include "vect-tail-nomask-1.c"
+
+/* { dg-final { scan-tree-dump-times "LOOP VECTORIZED" 2 "vect" { target avx2_runtime } } } */
+/* { dg-final { scan-tree-dump-times "LOOP EPILOGUE VECTORIZED AND MASKED \\(VS=32\\)" 2 "vect" { target avx2_runtime } } } */
diff --git a/gcc/tree-vect-data-refs.c b/gcc/tree-vect-data-refs.c
index 5a30314..1a0df70 100644
--- a/gcc/tree-vect-data-refs.c
+++ b/gcc/tree-vect-data-refs.c
@@ -4125,6 +4125,9 @@ vect_get_new_ssa_name (tree type, enum vect_var_kind var_kind, const char *name)
   case vect_scalar_var:
     prefix = "stmp";
     break;
+  case vect_mask_var:
+    prefix = "mask";
+    break;
   case vect_pointer_var:
     prefix = "vectp";
     break;
diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c
index e13d6a2..36be342 100644
--- a/gcc/tree-vect-loop-manip.c
+++ b/gcc/tree-vect-loop-manip.c
@@ -1635,6 +1635,13 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
   bool epilog_peeling = (LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo)
 			 || LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo));
 
+  if (LOOP_VINFO_EPILOGUE_P (loop_vinfo))
+    {
+      prolog_peeling = false;
+      if (LOOP_VINFO_MASK_LOOP (loop_vinfo))
+	epilog_peeling = false;
+    }
+
   if (!prolog_peeling && !epilog_peeling)
     return NULL;
 
diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index 4150b0d..b948283 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -31,6 +31,8 @@ along with GCC; see the file COPYING3.  If not see
 #include "tree-pass.h"
 #include "ssa.h"
 #include "optabs-tree.h"
+#include "insn-config.h"
+#include "recog.h"		/* FIXME: for insn_data.  */
 #include "diagnostic-core.h"
 #include "fold-const.h"
 #include "stor-layout.h"
@@ -50,6 +52,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "cgraph.h"
 #include "tree-cfg.h"
 #include "tree-if-conv.h"
+#include "alias.h"
 
 /* Loop Vectorization Pass.
 
@@ -1172,6 +1175,9 @@ new_loop_vec_info (struct loop *loop)
   LOOP_VINFO_PEELING_FOR_GAPS (res) = false;
   LOOP_VINFO_PEELING_FOR_NITER (res) = false;
   LOOP_VINFO_OPERANDS_SWAPPED (res) = false;
+  LOOP_VINFO_CAN_BE_MASKED (res) = false;
+  LOOP_VINFO_MASK_LOOP (res) = false;
+  LOOP_VINFO_REQUIRED_MASKS (res) = 0;
   LOOP_VINFO_ORIG_LOOP_INFO (res) = NULL;
 
   return res;
@@ -1664,6 +1670,235 @@ vect_update_vf_for_slp (loop_vec_info loop_vinfo)
 		     vectorization_factor);
 }
 
+/* Function vect_check_required_masks_widening.
+
+   Return true if vector mask of type MASK_TYPE can be widened
+   to a type having REQ_ELEMS elements in a single vector.  */
+
+static bool
+vect_check_required_masks_widening (loop_vec_info loop_vinfo,
+				    tree mask_type, unsigned req_elems)
+{
+  unsigned mask_elems = TYPE_VECTOR_SUBPARTS (mask_type);
+
+  gcc_assert (mask_elems > req_elems);
+
+  /* Don't convert if it requires too many intermediate steps.  */
+  int steps = exact_log2 (mask_elems / req_elems);
+  if (steps > MAX_INTERM_CVT_STEPS + 1)
+    return false;
+
+  /* Check we have conversion support for given mask mode.  */
+  machine_mode mode = TYPE_MODE (mask_type);
+  insn_code icode = optab_handler (vec_unpacks_lo_optab, mode);
+  if (icode == CODE_FOR_nothing
+      || optab_handler (vec_unpacks_hi_optab, mode) == CODE_FOR_nothing)
+    return false;
+
+  /* Make recursive call for multi-step conversion.  */
+  if (steps > 1)
+    {
+      mask_elems = mask_elems >> 1;
+      mask_type = build_truth_vector_type (mask_elems, current_vector_size);
+      if (TYPE_MODE (mask_type) != insn_data[icode].operand[0].mode)
+	return false;
+
+      if (!vect_check_required_masks_widening (loop_vinfo, mask_type,
+					       req_elems))
+	return false;
+    }
+  else
+    {
+      mask_type = build_truth_vector_type (req_elems, current_vector_size);
+      if (TYPE_MODE (mask_type) != insn_data[icode].operand[0].mode)
+	return false;
+    }
+
+  return true;
+}
+
+/* Function vect_check_required_masks_narrowing.
+
+   Return true if vector mask of type MASK_TYPE can be narrowed
+   to a type having REQ_ELEMS elements in a single vector.  */
+
+static bool
+vect_check_required_masks_narrowing (loop_vec_info loop_vinfo,
+				     tree mask_type, unsigned req_elems)
+{
+  unsigned mask_elems = TYPE_VECTOR_SUBPARTS (mask_type);
+
+  gcc_assert (req_elems > mask_elems);
+
+  /* Don't convert if it requires too many intermediate steps.  */
+  int steps = exact_log2 (req_elems / mask_elems);
+  if (steps > MAX_INTERM_CVT_STEPS + 1)
+    return false;
+
+  /* Check we have conversion support for given mask mode.  */
+  machine_mode mode = TYPE_MODE (mask_type);
+  insn_code icode = optab_handler (vec_pack_trunc_optab, mode);
+  if (icode == CODE_FOR_nothing)
+    return false;
+
+  /* Make recursive call for multi-step conversion.  */
+  if (steps > 1)
+    {
+      mask_elems = mask_elems << 1;
+      mask_type = build_truth_vector_type (mask_elems, current_vector_size);
+      if (TYPE_MODE (mask_type) != insn_data[icode].operand[0].mode)
+	return false;
+
+      if (!vect_check_required_masks_narrowing (loop_vinfo, mask_type,
+						req_elems))
+	return false;
+    }
+  else
+    {
+      mask_type = build_truth_vector_type (req_elems, current_vector_size);
+      if (TYPE_MODE (mask_type) != insn_data[icode].operand[0].mode)
+	return false;
+    }
+
+  return true;
+}
+
+/* Function vect_get_masking_iv_elems.
+
+   Return a number of elements in IV used for loop masking.  */
+static int
+vect_get_masking_iv_elems (loop_vec_info loop_vinfo)
+{
+  tree iv_type = TREE_TYPE (LOOP_VINFO_NITERS (loop_vinfo));
+  tree iv_vectype = get_vectype_for_scalar_type (iv_type);
+
+  /* We extend IV type in case it is not big enough to
+     fill full vector.  */
+  return MIN ((int)TYPE_VECTOR_SUBPARTS (iv_vectype),
+	      LOOP_VINFO_VECT_FACTOR (loop_vinfo));
+}
+
+/* Function vect_get_masking_iv_type.
+
+   Return a type of IV used for loop masking.  */
+static tree
+vect_get_masking_iv_type (loop_vec_info loop_vinfo)
+{
+  /* Masking IV is to be compared to vector of NITERS and therefore
+     type of NITERS is used as a basic type for IV.
+     FIXME: It can be improved by using smaller size when possible
+     for more efficient masks computation.  */
+  tree iv_type = TREE_TYPE (LOOP_VINFO_NITERS (loop_vinfo));
+  tree iv_vectype = get_vectype_for_scalar_type (iv_type);
+  unsigned vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
+
+  if (TYPE_VECTOR_SUBPARTS (iv_vectype) <= vf)
+    return iv_vectype;
+
+  unsigned elem_size = current_vector_size * BITS_PER_UNIT / vf;
+  iv_type = build_nonstandard_integer_type (elem_size, TYPE_UNSIGNED (iv_type));
+
+  return get_vectype_for_scalar_type (iv_type);
+}
+
+/* Function vect_get_extreme_masks.
+
+   Determine minimum and maximum number of elements in masks
+   required for masking a loop described by LOOP_VINFO.
+   Computed values are returned in MIN_MASK_ELEMS and
+   MAX_MASK_ELEMS.  */
+
+static void
+vect_get_extreme_masks (loop_vec_info loop_vinfo,
+			unsigned *min_mask_elems,
+			unsigned *max_mask_elems)
+{
+  unsigned required_masks = LOOP_VINFO_REQUIRED_MASKS (loop_vinfo);
+  unsigned elems = 1;
+
+  *min_mask_elems = *max_mask_elems = vect_get_masking_iv_elems (loop_vinfo);
+
+  while (required_masks)
+    {
+      if (required_masks & 1)
+	{
+	  if (elems < *min_mask_elems)
+	    *min_mask_elems = elems;
+	  if (elems > *max_mask_elems)
+	    *max_mask_elems = elems;
+	}
+      elems = elems << 1;
+      required_masks = required_masks >> 1;
+    }
+}
+
+/* Function vect_check_required_masks.
+
+   For given LOOP_VINFO check all required masks can be computed.  */
+
+static void
+vect_check_required_masks (loop_vec_info loop_vinfo)
+{
+  if (!LOOP_VINFO_REQUIRED_MASKS (loop_vinfo))
+    return;
+
+  /* Firstly check we have a proper comparison to get
+     an initial mask.  */
+  tree iv_vectype = vect_get_masking_iv_type (loop_vinfo);
+  unsigned iv_elems = TYPE_VECTOR_SUBPARTS (iv_vectype);
+
+  tree mask_type = build_same_sized_truth_vector_type (iv_vectype);
+
+  if (!expand_vec_cmp_expr_p (iv_vectype, mask_type, LT_EXPR))
+    {
+      if (dump_enabled_p ())
+	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+			 "cannot be masked: required vector comparison "
+			 "is not supported.\n");
+      LOOP_VINFO_CAN_BE_MASKED (loop_vinfo) = false;
+      return;
+    }
+
+  /* Now check the widest and the narrowest masks.
+     All intermediate values are obtained while
+     computing extreme values.  */
+  unsigned min_mask_elems = 0;
+  unsigned max_mask_elems = 0;
+
+  vect_get_extreme_masks (loop_vinfo, &min_mask_elems, &max_mask_elems);
+
+  if (min_mask_elems < iv_elems)
+    {
+      /* Check mask widening is available.  */
+      if (!vect_check_required_masks_widening (loop_vinfo, mask_type,
+					       min_mask_elems))
+	{
+	  if (dump_enabled_p ())
+	    dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+			     "cannot be masked: required mask widening "
+			     "is not supported.\n");
+	  LOOP_VINFO_CAN_BE_MASKED (loop_vinfo) = false;
+	  return;
+	}
+
+    }
+
+  if (max_mask_elems > iv_elems)
+    {
+      if (!vect_check_required_masks_narrowing (loop_vinfo, mask_type,
+						max_mask_elems))
+	{
+	  if (dump_enabled_p ())
+	    dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+			     "cannot be masked: required mask narrowing "
+			     "is not supported.\n");
+	  LOOP_VINFO_CAN_BE_MASKED (loop_vinfo) = false;
+	  return;
+	}
+
+    }
+}
+
 /* Function vect_analyze_loop_operations.
 
    Scan the loop stmts and make sure they are all vectorizable.  */
@@ -1815,6 +2050,12 @@ vect_analyze_loop_operations (loop_vec_info loop_vinfo)
       return false;
     }
 
+  /* If all statements can be masked then we also need
+     to check we may compute required masks and compute
+     their cost.  */
+  if (LOOP_VINFO_CAN_BE_MASKED (loop_vinfo))
+    vect_check_required_masks (loop_vinfo);
+
   return true;
 }
 
@@ -1984,7 +2225,7 @@ vect_analyze_loop_2 (loop_vec_info loop_vinfo, bool &fatal)
   int saved_vectorization_factor = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
   HOST_WIDE_INT estimated_niter;
   unsigned th;
-  int min_scalar_loop_bound;
+  int min_scalar_loop_bound = 0;
 
   /* Check the SLP opportunities in the loop, analyze and build SLP trees.  */
   ok = vect_analyze_slp (loop_vinfo, n_stmts);
@@ -2009,6 +2250,34 @@ start_over:
   unsigned vectorization_factor = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
   gcc_assert (vectorization_factor != 0);
 
+  /* For now we mask loop epilogue using the same VF since it was used
+     for cost estimations and it should be easier for reduction
+     optimization.  */
+  if (LOOP_VINFO_EPILOGUE_P (loop_vinfo)
+      && LOOP_VINFO_MASK_LOOP (loop_vinfo)
+      && LOOP_VINFO_ORIG_VECT_FACTOR (loop_vinfo) != (int)vectorization_factor)
+    {
+      /* If we couldn't vectorize epilogue with masking then we may still
+	 try to vectorize it with a smaller vector size.  */
+      if (LOOP_VINFO_ORIG_VECT_FACTOR (loop_vinfo) > (int)vectorization_factor
+	  && PARAM_VALUE (PARAM_VECT_EPILOGUES_NOMASK))
+	{
+	  if (dump_enabled_p ())
+	    dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+			     "couldn't mask loop epilogue; trying to vectorize "
+			     "using a smaller vector.\n");
+	  LOOP_VINFO_MASK_LOOP (loop_vinfo) = false;
+	}
+      else
+	{
+	  if (dump_enabled_p ())
+	    dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+			     "not vectorized: VF for loop epilogue doesn't "
+			     "match original loop VF.\n");
+	  return false;
+	}
+    }
+
   if (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo) && dump_enabled_p ())
     dump_printf_loc (MSG_NOTE, vect_location,
 		     "vectorization_factor = %d, niters = "
@@ -2022,11 +2291,18 @@ start_over:
       || (max_niter != -1
 	  && (unsigned HOST_WIDE_INT) max_niter < vectorization_factor))
     {
-      if (dump_enabled_p ())
-	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
-			 "not vectorized: iteration count smaller than "
-			 "vectorization factor.\n");
-      return false;
+      /* Allow low trip count for loop epilogue we want to mask.  */
+      if (LOOP_VINFO_EPILOGUE_P (loop_vinfo)
+	  && PARAM_VALUE (PARAM_VECT_EPILOGUES_MASK))
+	LOOP_VINFO_MASK_LOOP (loop_vinfo) = true;
+      else
+	{
+	  if (dump_enabled_p ())
+	    dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+			     "not vectorized: iteration count smaller than "
+			     "vectorization factor.\n");
+	  return false;
+	}
     }
 
   /* Analyze the alignment of the data-refs in the loop.
@@ -2064,6 +2340,8 @@ start_over:
       }
     }
 
+  LOOP_VINFO_CAN_BE_MASKED (loop_vinfo) = true;
+
   if (slp)
     {
       /* Analyze operations in the SLP instances.  Note this may
@@ -2123,8 +2401,10 @@ start_over:
       goto again;
     }
 
-  min_scalar_loop_bound = ((PARAM_VALUE (PARAM_MIN_VECT_LOOP_BOUND)
-			    * vectorization_factor) - 1);
+  if (!(LOOP_VINFO_EPILOGUE_P (loop_vinfo)
+	&& LOOP_VINFO_MASK_LOOP (loop_vinfo)))
+    min_scalar_loop_bound = ((PARAM_VALUE (PARAM_MIN_VECT_LOOP_BOUND)
+			      * vectorization_factor) - 1);
 
   /* Use the cost model only if it is more conservative than user specified
      threshold.  */
@@ -2177,7 +2457,9 @@ start_over:
         / LOOP_VINFO_VECT_FACTOR (loop_vinfo))
        * LOOP_VINFO_VECT_FACTOR (loop_vinfo);
 
-  if (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
+  if (LOOP_VINFO_MASK_LOOP (loop_vinfo))
+    ;  /* Loop epilogue is not required.  */
+  else if (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
       && LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo) > 0)
     {
       if (ctz_hwi (LOOP_VINFO_INT_NITERS (loop_vinfo)
@@ -2307,6 +2589,7 @@ again:
   LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo) = false;
   LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo) = false;
   LOOP_VINFO_COST_MODEL_THRESHOLD (loop_vinfo) = 0;
+  LOOP_VINFO_CAN_BE_MASKED (loop_vinfo) = true;
 
   goto start_over;
 }
@@ -3225,12 +3508,32 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
   int npeel = LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo);
   void *target_cost_data = LOOP_VINFO_TARGET_COST_DATA (loop_vinfo);
 
+  if (LOOP_VINFO_EPILOGUE_P (loop_vinfo))
+    {
+      /* Currently we don't produce scalar epilogue version in case
+	 its masked version is provided.  It means we don't need to
+	 compute profitability one more time here.  Just make a
+	 masked loop version.  */
+      if (LOOP_VINFO_CAN_BE_MASKED (loop_vinfo)
+	  && PARAM_VALUE (PARAM_VECT_EPILOGUES_MASK))
+	{
+	  dump_printf_loc (MSG_NOTE, vect_location,
+			   "cost model: mask loop epilogue.\n");
+	  LOOP_VINFO_MASK_LOOP (loop_vinfo) = true;
+	  *ret_min_profitable_niters = 0;
+	  *ret_min_profitable_estimate = 0;
+	  return;
+	}
+    }
   /* Cost model disabled.  */
-  if (unlimited_cost_model (LOOP_VINFO_LOOP (loop_vinfo)))
+  else if (unlimited_cost_model (LOOP_VINFO_LOOP (loop_vinfo)))
     {
       dump_printf_loc (MSG_NOTE, vect_location, "cost model disabled.\n");
       *ret_min_profitable_niters = 0;
       *ret_min_profitable_estimate = 0;
+      if (PARAM_VALUE (PARAM_VECT_EPILOGUES_MASK)
+	  && LOOP_VINFO_CAN_BE_MASKED (loop_vinfo))
+	LOOP_VINFO_MASK_LOOP (loop_vinfo) = true;
       return;
     }
 
@@ -5523,6 +5826,7 @@ vectorizable_reduction (gimple *stmt, gimple_stmt_iterator *gsi,
       outer_loop = loop;
       loop = loop->inner;
       nested_cycle = true;
+      LOOP_VINFO_CAN_BE_MASKED (loop_vinfo) = false;  /* Not supported yet.  */
     }
 
   /* 1. Is vectorizable reduction?  */
@@ -5777,6 +6081,18 @@ vectorizable_reduction (gimple *stmt, gimple_stmt_iterator *gsi,
 
   gcc_assert (ncopies >= 1);
 
+  if (slp_node || PURE_SLP_STMT (stmt_info) || code == COND_EXPR
+      || STMT_VINFO_VEC_REDUCTION_TYPE (stmt_info) == COND_REDUCTION
+      || STMT_VINFO_VEC_REDUCTION_TYPE (stmt_info)
+	 == INTEGER_INDUC_COND_REDUCTION)
+    {
+      if (dump_enabled_p ())
+	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+			 "cannot be masked: unsupported conditional "
+			 "reduction\n");
+      LOOP_VINFO_CAN_BE_MASKED (loop_vinfo) = false;
+    }
+
   vec_mode = TYPE_MODE (vectype_in);
 
   if (code == COND_EXPR)
@@ -6058,6 +6374,20 @@ vectorizable_reduction (gimple *stmt, gimple_stmt_iterator *gsi,
 	}
     }
 
+  if (loop_vinfo && LOOP_VINFO_CAN_BE_MASKED (loop_vinfo))
+    {
+      /* Check that masking of reduction is supported.  */
+      tree mask_vtype = build_same_sized_truth_vector_type (vectype_out);
+      if (!expand_vec_cond_expr_p (vectype_out, mask_vtype, EQ_EXPR))
+	{
+	  if (dump_enabled_p ())
+	    dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+			     "cannot be masked: required vector conditional "
+			     "expression is not supported.\n");
+	  LOOP_VINFO_CAN_BE_MASKED (loop_vinfo) = false;
+	}
+    }
+
   if (!vec_stmt) /* transformation not required.  */
     {
       if (first_p
@@ -6482,6 +6812,14 @@ vectorizable_induction (gimple *phi,
   if (gimple_code (phi) != GIMPLE_PHI)
     return false;
 
+  if (loop_vinfo && LOOP_VINFO_CAN_BE_MASKED (loop_vinfo))
+    {
+      LOOP_VINFO_CAN_BE_MASKED (loop_vinfo) = false;
+      if (dump_enabled_p ())
+	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+			 "cannot be masked: unsupported induction.\n");
+    }
+
   if (!vec_stmt) /* transformation not required.  */
     {
       STMT_VINFO_TYPE (stmt_info) = induc_vec_info_type;
@@ -6704,6 +7042,524 @@ loop_niters_no_overflow (loop_vec_info loop_vinfo)
   return false;
 }
 
+/* Function vect_gen_ivs_for_masking.
+
+   Create IVs to be used for masks computation to mask loop described
+   by LOOP_VINFO.  Created IVs are stored in IVS vector.
+
+   Initial IV value is {0, 1, ..., VF - 1} (possibly split into several
+   vectors, in which case IVS elements with lower index hold IVs with
+   smaller numbers).  IV step is {VF, VF, ..., VF}.  VF is the
+   vectorization factor used.  */
+
+static void
+vect_gen_ivs_for_masking (loop_vec_info loop_vinfo, vec<tree> *ivs)
+{
+  struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
+  tree vectype = vect_get_masking_iv_type (loop_vinfo);
+  tree type = TREE_TYPE (vectype);
+  int vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
+  unsigned elems = TYPE_VECTOR_SUBPARTS (vectype);
+  int ncopies  = vf / elems;
+  int i, k;
+  tree iv, init_val, step_val;
+  bool insert_after;
+  gimple_stmt_iterator gsi;
+  tree *vtemp;
+
+  /* Create {VF, ..., VF} vector constant.  */
+  step_val = build_vector_from_val (vectype, build_int_cst (type, vf));
+
+  vtemp = XALLOCAVEC (tree, vf);
+  for (i = 0; i < ncopies; i++)
+    {
+      /* Create initial IV value.  */
+      for (k = 0; k < vf; k++)
+	vtemp[k] = build_int_cst (type, k + i * elems);
+      init_val = build_vector (vectype, vtemp);
+
+      /* Create an inductive variable including phi node.  */
+      standard_iv_increment_position (loop, &gsi, &insert_after);
+      create_iv (init_val, step_val, NULL, loop, &gsi, insert_after,
+		 &iv, NULL);
+      ivs->safe_push (iv);
+    }
+}
+
+/* Function vect_get_mask_index_for_elems.
+
+   A helper function to access masks vector.  See vect_gen_loop_masks
+   for masks vector sorting description.  Return index of the first
+   mask having MASK_ELEMS elements.  */
+
+static inline unsigned
+vect_get_mask_index_for_elems (unsigned mask_elems)
+{
+  return current_vector_size / mask_elems - 1;
+}
+
+/* Function vect_get_mask_index_for_type.
+
+   A helper function to access masks vector.  See vect_gen_loop_masks
+   for masks vector sorting description.  Return index of the first
+   mask appropriate for VECTYPE.  */
+
+static inline unsigned
+vect_get_mask_index_for_type (tree vectype)
+{
+  unsigned elems = TYPE_VECTOR_SUBPARTS (vectype);
+  return vect_get_mask_index_for_elems (elems);
+}
+
+/* Function vect_create_narrowed_masks.
+
+   Create masks by narrowing NMASKS base masks having BASE_MASK_ELEMS
+   elements each and put them into MASKS vector.  MAX_MASK_ELEMS holds
+   the maximum number of elements in a mask required.  Generated
+   statements are inserted before GSI.  */
+static void
+vect_create_narrowed_masks (vec<tree> *masks, unsigned nmasks,
+			    unsigned base_mask_elems, unsigned max_mask_elems,
+			    gimple_stmt_iterator *gsi)
+{
+  unsigned cur_mask_elems = base_mask_elems;
+  unsigned cur_mask, prev_mask;
+  unsigned vec_size = current_vector_size;
+  tree mask_type, mask;
+  gimple *stmt;
+
+  while (cur_mask_elems < max_mask_elems)
+    {
+      prev_mask = vect_get_mask_index_for_elems (cur_mask_elems);
+
+      cur_mask_elems <<= 1;
+      nmasks >>= 1;
+
+      cur_mask = vect_get_mask_index_for_elems (cur_mask_elems);
+
+      mask_type = build_truth_vector_type (cur_mask_elems, vec_size);
+
+      for (unsigned i = 0; i < nmasks; i++)
+	{
+	  tree mask_low = (*masks)[prev_mask++];
+	  tree mask_hi = (*masks)[prev_mask++];
+	  mask = vect_get_new_ssa_name (mask_type, vect_mask_var);
+	  stmt = gimple_build_assign (mask, VEC_PACK_TRUNC_EXPR,
+				      mask_low, mask_hi);
+	  gsi_insert_before (gsi, stmt, GSI_SAME_STMT);
+	  (*masks)[cur_mask++] = mask;
+	}
+    }
+}
+
+/* Function vect_create_widened_masks.
+
+   Create masks by widening NMASKS base masks having BASE_MASK_ELEMS
+   elements each and put them into MASKS vector.  MIN_MASK_ELEMS holds
+   the minimum number of elements in a mask required.  Generated
+   statements are inserted before GSI.  */
+static void
+vect_create_widened_masks (vec<tree> *masks, unsigned nmasks,
+			   unsigned base_mask_elems, unsigned min_mask_elems,
+			   gimple_stmt_iterator *gsi)
+{
+  unsigned cur_mask_elems = base_mask_elems;
+  unsigned cur_mask, prev_mask;
+  unsigned vec_size = current_vector_size;
+  tree mask_type, mask;
+  gimple *stmt;
+
+  while (cur_mask_elems > min_mask_elems)
+    {
+      prev_mask = vect_get_mask_index_for_elems (cur_mask_elems);
+
+      cur_mask_elems >>= 1;
+      nmasks <<= 1;
+
+      cur_mask = vect_get_mask_index_for_elems (cur_mask_elems);
+
+      mask_type = build_truth_vector_type (cur_mask_elems, vec_size);
+
+      for (unsigned i = 0; i < nmasks; i += 2)
+	{
+	  tree orig_mask = (*masks)[prev_mask++];
+
+	  mask = vect_get_new_ssa_name (mask_type, vect_mask_var);
+	  stmt = gimple_build_assign (mask, VEC_UNPACK_LO_EXPR, orig_mask);
+	  gsi_insert_before (gsi, stmt, GSI_SAME_STMT);
+	  (*masks)[cur_mask++] = mask;
+
+	  mask = vect_get_new_ssa_name (mask_type, vect_mask_var);
+	  stmt = gimple_build_assign (mask, VEC_UNPACK_HI_EXPR, orig_mask);
+	  gsi_insert_before (gsi, stmt, GSI_SAME_STMT);
+	  (*masks)[cur_mask++] = mask;
+	}
+    }
+}
+
+/* Function vect_gen_loop_masks.
+
+   Create masks to mask a loop described by LOOP_VINFO.  Masks
+   are created according to LOOP_VINFO_REQUIRED_MASKS and are stored
+   into MASKS vector.
+
+   Index of a mask in the vector is computed from the number of the
+   mask's elements.  Masks are sorted by number of their elements
+   in descending order.  Index 0 is used to access a mask with
+   current_vector_size elements.  Among masks with the same number
+   of elements the one with lower index is used to mask iterations
+   with smaller iteration counter.  Note that vector may have NULL values
+   for masks which are not required.  Use vect_get_mask_index_for_elems
+   or vect_get_mask_index_for_type to access resulting vector.  */
+
+static void
+vect_gen_loop_masks (loop_vec_info loop_vinfo, vec<tree> *masks)
+{
+  struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
+  edge pe = loop_preheader_edge (loop);
+  tree niters = LOOP_VINFO_NITERS (loop_vinfo);
+  unsigned min_mask_elems, max_mask_elems, nmasks;
+  unsigned iv_elems, cur_mask;
+  auto_vec<tree> ivs;
+  tree vectype, mask_type;
+  tree vec_niters, vec_niters_val, mask;
+  gimple *stmt;
+  basic_block bb;
+  gimple_stmt_iterator gsi = gsi_after_labels (loop->header);
+
+  /* Create required IVs.  */
+  vect_gen_ivs_for_masking (loop_vinfo, &ivs);
+  vectype = TREE_TYPE (ivs[0]);
+
+  iv_elems = TYPE_VECTOR_SUBPARTS (vectype);
+
+  /* Get a proper niter to build a vector.  */
+  if (!is_gimple_val (niters))
+    {
+      gimple_seq seq = NULL;
+      niters = force_gimple_operand (niters, &seq, true, NULL);
+      gsi_insert_seq_on_edge_immediate (pe, seq);
+    }
+
+  /* We may need a type cast in case the type of niter is too small
+     for the generated IVs.  */
+  if (!types_compatible_p (TREE_TYPE (vectype), TREE_TYPE (niters)))
+    {
+      tree new_niters = make_temp_ssa_name (TREE_TYPE (vectype),
+					    NULL, "niters");
+      stmt = gimple_build_assign (new_niters, CONVERT_EXPR, niters);
+      bb = gsi_insert_on_edge_immediate (pe, stmt);
+      gcc_assert (!bb);
+      niters = new_niters;
+    }
+
+  /* Create {NITERS, ..., NITERS} vector and put to SSA_NAME.  */
+  vec_niters_val = build_vector_from_val (vectype, niters);
+  vec_niters = vect_get_new_ssa_name (vectype, vect_simple_var, "niters");
+  stmt = gimple_build_assign (vec_niters, vec_niters_val);
+  bb = gsi_insert_on_edge_immediate (pe, stmt);
+  gcc_assert (!bb);
+
+  /* Determine which masks we need to compute and how many.  */
+  vect_get_extreme_masks (loop_vinfo, &min_mask_elems, &max_mask_elems);
+  nmasks = vect_get_mask_index_for_elems (MIN (min_mask_elems, iv_elems) / 2);
+  masks->safe_grow_cleared (nmasks);
+
+  /* Now create base masks through comparison IV < VEC_NITERS.  */
+  mask_type = build_same_sized_truth_vector_type (vectype);
+  cur_mask = vect_get_mask_index_for_elems (iv_elems);
+  for (unsigned i = 0; i < ivs.length (); i++)
+    {
+      tree iv = ivs[i];
+      mask = vect_get_new_ssa_name (mask_type, vect_mask_var);
+      stmt = gimple_build_assign (mask, LT_EXPR, iv, vec_niters);
+      gsi_insert_before (&gsi, stmt, GSI_SAME_STMT);
+      (*masks)[cur_mask++] = mask;
+    }
+
+  vect_create_narrowed_masks (masks, ivs.length (), iv_elems,
+			      max_mask_elems, &gsi);
+
+  vect_create_widened_masks (masks, ivs.length (), iv_elems,
+			     min_mask_elems, &gsi);
+}
+
+/* Function vect_mask_reduction_stmt.
+
+   Mask given vectorized reduction statement STMT using
+   MASK.  In case scalar reduction statement is vectorized
+   into several vector statements then PREV holds a
+   preceding vector statement copy for STMT.
+
+   Masking is performed using VEC_COND_EXPR:
+
+   S1: r_1 = r_2 + d_3
+
+   is transformed into
+
+   S1': r_4 = r_2 + d_3
+   S2': r_1 = VEC_COND_EXPR<MASK, r_4, r_2>
+
+   Return generated condition statement.  */
+
+static gimple *
+vect_mask_reduction_stmt (gimple *stmt, tree mask, gimple *prev)
+{
+  gimple_stmt_iterator gsi;
+  tree vectype;
+  tree lhs, rhs, tmp;
+  gimple *new_stmt, *phi;
+
+  lhs = gimple_assign_lhs (stmt);
+  vectype = TREE_TYPE (lhs);
+
+  gcc_assert (TYPE_VECTOR_SUBPARTS (vectype)
+	      == TYPE_VECTOR_SUBPARTS (TREE_TYPE (mask)));
+
+  /* Find operand RHS defined by PHI node.  */
+  rhs = gimple_assign_rhs1 (stmt);
+  gcc_assert (TREE_CODE (rhs) == SSA_NAME);
+  phi = SSA_NAME_DEF_STMT (rhs);
+
+  if (phi != prev && gimple_code (phi) != GIMPLE_PHI)
+    {
+      rhs = gimple_assign_rhs2 (stmt);
+      gcc_assert (TREE_CODE (rhs) == SSA_NAME);
+      phi = SSA_NAME_DEF_STMT (rhs);
+      gcc_assert (phi == prev || gimple_code (phi) == GIMPLE_PHI);
+    }
+
+  /* Convert reduction stmt to ordinary assignment to TMP.  */
+  tmp = vect_get_new_ssa_name (vectype, vect_simple_var, NULL);
+  gimple_assign_set_lhs (stmt, tmp);
+
+  /* Create VEC_COND_EXPR and insert it after STMT.  */
+  new_stmt = gimple_build_assign (lhs, VEC_COND_EXPR, mask, tmp, rhs);
+  gsi = gsi_for_stmt (stmt);
+  gsi_insert_after (&gsi, new_stmt, GSI_SAME_STMT);
+
+  return new_stmt;
+}
+
+/* Function vect_mask_mask_load_store_stmt.
+
+   Mask given vectorized MASK_LOAD or MASK_STORE statement
+   STMT using MASK.  Function replaces a mask used by STMT
+   with its conjunction with MASK.  */
+
+static void
+vect_mask_mask_load_store_stmt (gimple *stmt, tree mask)
+{
+  gimple *new_stmt;
+  tree old_mask, new_mask;
+  gimple_stmt_iterator gsi;
+
+  gsi = gsi_for_stmt (stmt);
+  old_mask = gimple_call_arg (stmt, 2);
+
+  gcc_assert (types_compatible_p (TREE_TYPE (old_mask), TREE_TYPE (mask)));
+
+  new_mask = vect_get_new_ssa_name (TREE_TYPE (mask), vect_simple_var, NULL);
+  new_stmt = gimple_build_assign (new_mask, BIT_AND_EXPR, old_mask, mask);
+  gsi_insert_before (&gsi, new_stmt, GSI_SAME_STMT);
+
+  gimple_call_set_arg (stmt, 2, new_mask);
+  update_stmt (stmt);
+}
+
+
+/* Function vect_mask_load_store_stmt.
+
+   Mask given vectorized load or store statement STMT using
+   MASK.  DR is a data reference for a scalar memory access.
+   Assignment is transformed into MASK_LOAD or MASK_STORE
+   statement.  SI is either an iterator pointing to STMT and
+   is to be updated or NULL.  */
+
+static void
+vect_mask_load_store_stmt (gimple *stmt, tree vectype, tree mask,
+			   data_reference *dr, gimple_stmt_iterator *si)
+{
+  tree mem, val, addr, ptr;
+  gimple_stmt_iterator gsi = gsi_for_stmt (stmt);
+  unsigned align, misalign;
+  tree elem_type = TREE_TYPE (vectype);
+  gimple *new_stmt;
+
+  gcc_assert (!si || gsi_stmt (*si) == stmt);
+
+  gsi = gsi_for_stmt (stmt);
+  if (gimple_store_p (stmt))
+    {
+      val = gimple_assign_rhs1 (stmt);
+      mem = gimple_assign_lhs (stmt);
+    }
+  else
+    {
+      val = gimple_assign_lhs (stmt);
+      mem = gimple_assign_rhs1 (stmt);
+    }
+
+  gcc_assert (TYPE_VECTOR_SUBPARTS (vectype)
+	      == TYPE_VECTOR_SUBPARTS (TREE_TYPE (mask)));
+
+  addr = force_gimple_operand_gsi (&gsi, build_fold_addr_expr (mem),
+				   true, NULL_TREE, true,
+				   GSI_SAME_STMT);
+
+  align = TYPE_ALIGN_UNIT (vectype);
+  if (aligned_access_p (dr))
+    misalign = 0;
+  else if (DR_MISALIGNMENT (dr) == -1)
+    {
+      align = TYPE_ALIGN_UNIT (elem_type);
+      misalign = 0;
+    }
+  else
+    misalign = DR_MISALIGNMENT (dr);
+  set_ptr_info_alignment (get_ptr_info (addr), align, misalign);
+  ptr = build_int_cst (reference_alias_ptr_type (mem),
+		       misalign ? misalign & -misalign : align);
+
+  if (gimple_store_p (stmt))
+    new_stmt = gimple_build_call_internal (IFN_MASK_STORE, 4, addr, ptr,
+					   mask, val);
+  else
+    {
+      new_stmt = gimple_build_call_internal (IFN_MASK_LOAD, 3, addr, ptr,
+					     mask);
+      gimple_call_set_lhs (new_stmt, val);
+    }
+  gsi_replace (si ? si : &gsi, new_stmt, false);
+}
+
+/* Function vect_mask_loop.
+
+   Perform masking of vectorized loop, only memory accesses and
+   reductions are masked, other statements stay unchanged.  */
+
+static void
+vect_mask_loop (loop_vec_info loop_vinfo)
+{
+  struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
+  basic_block *bbs = LOOP_VINFO_BBS (loop_vinfo);
+  unsigned mask_no;
+  auto_vec<tree> masks;
+
+  vect_gen_loop_masks (loop_vinfo, &masks);
+
+  /* Convert reduction statements if any.  */
+  for (unsigned i = 0; i < LOOP_VINFO_REDUCTIONS (loop_vinfo).length (); i++)
+    {
+      gimple *stmt = LOOP_VINFO_REDUCTIONS (loop_vinfo)[i];
+      gimple *prev_stmt = NULL;
+      stmt_vec_info stmt_info = vinfo_for_stmt (stmt);
+
+      mask_no = vect_get_mask_index_for_type (STMT_VINFO_VECTYPE (stmt_info));
+
+      stmt = STMT_VINFO_VEC_STMT (stmt_info);
+      while (stmt)
+	{
+	  prev_stmt = vect_mask_reduction_stmt (stmt, masks[mask_no++],
+						prev_stmt);
+	  stmt_info = vinfo_for_stmt (stmt);
+	  stmt = stmt_info ? STMT_VINFO_RELATED_STMT (stmt_info) : NULL;
+	}
+    }
+
+  /* Scan all loop statements and mask vector loads/stores, including
+     those already in masked form.  */
+  for (unsigned i = 0; i < loop->num_nodes; i++)
+    {
+      basic_block bb = bbs[i];
+      for (gimple_stmt_iterator si = gsi_start_bb (bb);
+	   !gsi_end_p (si); gsi_next (&si))
+	{
+	  gimple *stmt = gsi_stmt (si);
+	  stmt_vec_info stmt_info = NULL;
+	  tree vectype = NULL;
+	  data_reference *dr;
+
+	  /* Mask load case.  */
+	  if (is_gimple_call (stmt)
+	      && gimple_call_internal_p (stmt)
+	      && gimple_call_internal_fn (stmt) == IFN_MASK_LOAD
+	      && !VECTOR_TYPE_P (TREE_TYPE (gimple_call_arg (stmt, 2))))
+	    {
+	      stmt_info = vinfo_for_stmt (stmt);
+	      if (!STMT_VINFO_VEC_STMT (stmt_info))
+		continue;
+	      stmt = STMT_VINFO_VEC_STMT (stmt_info);
+	      vectype = STMT_VINFO_VECTYPE (stmt_info);
+	    }
+	  /* Mask store case.  */
+	  else if (is_gimple_call (stmt)
+		   && gimple_call_internal_p (stmt)
+		   && gimple_call_internal_fn (stmt) == IFN_MASK_STORE
+		   && vinfo_for_stmt (stmt)
+		   && STMT_VINFO_FIRST_COPY_P (vinfo_for_stmt (stmt)))
+	    {
+	      stmt_info = vinfo_for_stmt (stmt);
+	      vectype = TREE_TYPE (gimple_call_arg (stmt, 2));
+	    }
+	  /* Load case.  */
+	  else if (gimple_assign_load_p (stmt)
+		   && !VECTOR_TYPE_P (TREE_TYPE (gimple_assign_lhs (stmt))))
+	    {
+	      stmt_info = vinfo_for_stmt (stmt);
+
+	      /* Skip vector loads.  */
+	      if (!STMT_VINFO_VEC_STMT (stmt_info))
+		continue;
+
+	      /* Skip invariant loads.  */
+	      if (integer_zerop (nested_in_vect_loop_p (loop, stmt)
+				 ? STMT_VINFO_DR_STEP (stmt_info)
+				 : DR_STEP (STMT_VINFO_DATA_REF (stmt_info))))
+		continue;
+	      stmt = STMT_VINFO_VEC_STMT (stmt_info);
+	      vectype = STMT_VINFO_VECTYPE (stmt_info);
+	    }
+	  /* Store case.  */
+	  else if (gimple_code (stmt) == GIMPLE_ASSIGN
+		   && gimple_store_p (stmt)
+		   && vinfo_for_stmt (stmt)
+		   && STMT_VINFO_FIRST_COPY_P (vinfo_for_stmt (stmt)))
+	    {
+	      stmt_info = vinfo_for_stmt (stmt);
+	      vectype = STMT_VINFO_VECTYPE (stmt_info);
+	    }
+	  else
+	    continue;
+
+	  /* Skip hoisted out statements.  */
+	  if (!flow_bb_inside_loop_p (loop, gimple_bb (stmt)))
+	    continue;
+
+	  mask_no = vect_get_mask_index_for_type (vectype);
+
+	  dr = STMT_VINFO_DATA_REF (stmt_info);
+	  while (stmt)
+	    {
+	      if (is_gimple_call (stmt))
+		vect_mask_mask_load_store_stmt (stmt, masks[mask_no++]);
+	      else
+		vect_mask_load_store_stmt (stmt, vectype, masks[mask_no++], dr,
+					   /* Have to update iterator only if
+					      it points to stmt we mask.  */
+					   stmt == gsi_stmt (si) ? &si : NULL);
+
+	      stmt_info = vinfo_for_stmt (stmt);
+	      stmt = stmt_info ? STMT_VINFO_RELATED_STMT (stmt_info) : NULL;
+	    }
+	}
+    }
+
+  if (dump_enabled_p ())
+    dump_printf_loc (MSG_NOTE, vect_location,
+		     "=== Loop has been masked ===\n");
+}
+
 /* Function vect_transform_loop.
 
    The analysis phase has determined that the loop is vectorizable.
@@ -7054,6 +7910,9 @@ vect_transform_loop (loop_vec_info loop_vinfo)
 
   slpeel_make_loop_iterate_ntimes (loop, niters_vector);
 
+  if (LOOP_VINFO_MASK_LOOP (loop_vinfo))
+    vect_mask_loop (loop_vinfo);
+
   /* Reduce loop iterations by the vectorization factor.  */
   scale_loop_profile (loop, GCOV_COMPUTE_SCALE (1, vf),
 		      expected_iterations / vf);
@@ -7067,16 +7926,21 @@ vect_transform_loop (loop_vec_info loop_vinfo)
   int bias = 1 - min_epilogue_iters;
   /* In these calculations the "- 1" converts loop iteration counts
      back to latch counts.  */
+  #define DIV(x, y)  \
+    (LOOP_VINFO_MASK_LOOP (loop_vinfo) \
+      ? wi::div_ceil ((x), (y), UNSIGNED) \
+      : wi::udiv_floor ((x), (y)))
+
   if (loop->any_upper_bound)
     loop->nb_iterations_upper_bound
-      = wi::udiv_floor (loop->nb_iterations_upper_bound + bias, vf) - 1;
+      = DIV (loop->nb_iterations_upper_bound + bias, vf) - 1;
   if (loop->any_likely_upper_bound)
     loop->nb_iterations_likely_upper_bound
-      = wi::udiv_floor (loop->nb_iterations_likely_upper_bound + bias, vf) - 1;
+      = DIV (loop->nb_iterations_likely_upper_bound + bias, vf) - 1;
   if (loop->any_estimate)
     loop->nb_iterations_estimate
-      = wi::udiv_floor (loop->nb_iterations_estimate + bias, vf) - 1;
-
+      = DIV (loop->nb_iterations_estimate + bias, vf) - 1;
+  #undef DIV
   if (dump_enabled_p ())
     {
       if (!LOOP_VINFO_EPILOGUE_P (loop_vinfo))
@@ -7088,6 +7952,10 @@ vect_transform_loop (loop_vec_info loop_vinfo)
 			     "OUTER LOOP VECTORIZED\n");
 	  dump_printf (MSG_NOTE, "\n");
 	}
+      else if (LOOP_VINFO_MASK_LOOP (loop_vinfo))
+	dump_printf_loc (MSG_NOTE, vect_location,
+			 "LOOP EPILOGUE VECTORIZED AND MASKED (VS=%d)\n",
+			 current_vector_size);
       else
 	dump_printf_loc (MSG_NOTE, vect_location,
 			 "LOOP EPILOGUE VECTORIZED (VS=%d)\n",
@@ -7110,28 +7978,31 @@ vect_transform_loop (loop_vec_info loop_vinfo)
 
   if (epilogue)
     {
-	unsigned int vector_sizes
-	  = targetm.vectorize.autovectorize_vector_sizes ();
-	vector_sizes &= current_vector_size - 1;
-
-	if (!PARAM_VALUE (PARAM_VECT_EPILOGUES_NOMASK))
-	  epilogue = NULL;
-	else if (!vector_sizes)
-	  epilogue = NULL;
-	else if (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
-		 && LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo) >= 0)
-	  {
-	    int smallest_vec_size = 1 << ctz_hwi (vector_sizes);
-	    int ratio = current_vector_size / smallest_vec_size;
-	    int eiters = LOOP_VINFO_INT_NITERS (loop_vinfo)
-	      - LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo);
-	    eiters = eiters % vf;
+      if (!PARAM_VALUE (PARAM_VECT_EPILOGUES_MASK))
+	{
+	  unsigned int vector_sizes
+	    = targetm.vectorize.autovectorize_vector_sizes ();
+	  vector_sizes &= current_vector_size - 1;
+
+	  if (!PARAM_VALUE (PARAM_VECT_EPILOGUES_NOMASK))
+	    epilogue = NULL;
+	  else if (!vector_sizes)
+	    epilogue = NULL;
+	  else if (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
+		   && LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo) >= 0)
+	    {
+	      int smallest_vec_size = 1 << ctz_hwi (vector_sizes);
+	      int ratio = current_vector_size / smallest_vec_size;
+	      int eiters = LOOP_VINFO_INT_NITERS (loop_vinfo)
+		- LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo);
+	      eiters = eiters % vf;
 
-	    epilogue->nb_iterations_upper_bound = eiters - 1;
+	      epilogue->nb_iterations_upper_bound = eiters - 1;
 
-	    if (eiters < vf / ratio)
-	      epilogue = NULL;
+	      if (eiters < vf / ratio)
+		epilogue = NULL;
 	    }
+	}
     }
 
   if (epilogue)
diff --git a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c
index b0b131d..07c7dd5 100644
--- a/gcc/tree-vect-stmts.c
+++ b/gcc/tree-vect-stmts.c
@@ -48,6 +48,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "tree-vectorizer.h"
 #include "builtins.h"
 #include "internal-fn.h"
+#include "tree-ssa-loop-ivopts.h"
 
 /* For lang_hooks.types.type_for_mode.  */
 #include "langhooks.h"
@@ -584,6 +585,38 @@ process_use (gimple *stmt, tree use, loop_vec_info loop_vinfo,
   return true;
 }
 
+/* Return true if STMT can be converted to masked form.  */
+
+static bool
+can_mask_load_store (gimple *stmt)
+{
+  stmt_vec_info stmt_info = vinfo_for_stmt (stmt);
+  tree vectype, mask_vectype;
+  tree lhs, ref;
+
+  if (!stmt_info)
+    return false;
+  lhs = gimple_assign_lhs (stmt);
+  ref = (TREE_CODE (lhs) == SSA_NAME) ? gimple_assign_rhs1 (stmt) : lhs;
+  if (may_be_nonaddressable_p (ref))
+    return false;
+  vectype = STMT_VINFO_VECTYPE (stmt_info);
+  mask_vectype = build_same_sized_truth_vector_type (vectype);
+  if (!can_vec_mask_load_store_p (TYPE_MODE (vectype),
+				  TYPE_MODE (mask_vectype),
+				  gimple_assign_load_p (stmt)))
+    {
+      if (dump_enabled_p ())
+	{
+	  dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+			   "Statement can't be masked.\n");
+	  dump_gimple_stmt (MSG_MISSED_OPTIMIZATION, TDF_SLIM, stmt, 0);
+	}
+
+      return false;
+    }
+  return true;
+}
 
 /* Function vect_mark_stmts_to_be_vectorized.
 
@@ -2113,6 +2146,20 @@ vectorizable_mask_load_store (gimple *stmt, gimple_stmt_iterator *gsi,
 	       && !useless_type_conversion_p (vectype, rhs_vectype)))
     return false;
 
+  if (LOOP_VINFO_CAN_BE_MASKED (loop_vinfo))
+    {
+      /* Check that mask conjunction is supported.  */
+      optab tab;
+      tab = optab_for_tree_code (BIT_AND_EXPR, vectype, optab_default);
+      if (!tab || optab_handler (tab, TYPE_MODE (vectype)) == CODE_FOR_nothing)
+	{
+	  if (dump_enabled_p ())
+	    dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+			     "cannot be masked: unsupported mask operation\n");
+	  LOOP_VINFO_CAN_BE_MASKED (loop_vinfo) = false;
+	}
+    }
+
   if (!vec_stmt) /* transformation not required.  */
     {
       STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) = memory_access_type;
@@ -2354,7 +2401,10 @@ vectorizable_mask_load_store (gimple *stmt, gimple_stmt_iterator *gsi,
 					  ptr, vec_mask, vec_rhs);
 	  vect_finish_stmt_generation (stmt, new_stmt, gsi);
 	  if (i == 0)
-	    STMT_VINFO_VEC_STMT (stmt_info) = *vec_stmt = new_stmt;
+	    {
+	      STMT_VINFO_VEC_STMT (stmt_info) = *vec_stmt = new_stmt;
+	      STMT_VINFO_FIRST_COPY_P (vinfo_for_stmt (new_stmt)) = true;
+	    }
 	  else
 	    STMT_VINFO_RELATED_STMT (prev_stmt_info) = new_stmt;
 	  prev_stmt_info = vinfo_for_stmt (new_stmt);
@@ -3211,6 +3261,18 @@ vectorizable_simd_clone_call (gimple *stmt, gimple_stmt_iterator *gsi,
   if (slp_node)
     return false;
 
+  /* Masked clones are not yet supported.  But we allow
+     calls which may be just called with no mask.  */
+  if (!(gimple_call_flags (stmt) & ECF_CONST)
+      || (gimple_call_flags (stmt) & ECF_LOOPING_CONST_OR_PURE))
+    {
+      if (dump_enabled_p ())
+	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+			 "cannot be masked: non-const call "
+			 "(masked calls are not supported)\n");
+      LOOP_VINFO_CAN_BE_MASKED (loop_vinfo) = false;
+    }
+
   /* Process function arguments.  */
   nargs = gimple_call_num_args (stmt);
 
@@ -5730,6 +5792,24 @@ vectorizable_store (gimple *stmt, gimple_stmt_iterator *gsi, gimple **vec_stmt,
 			    &memory_access_type, &gs_info))
     return false;
 
+  if (loop_vinfo && LOOP_VINFO_CAN_BE_MASKED (loop_vinfo)
+      && memory_access_type != VMAT_CONTIGUOUS)
+    {
+      LOOP_VINFO_CAN_BE_MASKED (loop_vinfo) = false;
+      if (dump_enabled_p ())
+	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+			 "cannot be masked: unsupported memory access type.\n");
+    }
+
+  if (loop_vinfo && LOOP_VINFO_CAN_BE_MASKED (loop_vinfo)
+      && !can_mask_load_store (stmt))
+    {
+      LOOP_VINFO_CAN_BE_MASKED (loop_vinfo) = false;
+      if (dump_enabled_p ())
+	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+			 "cannot be masked: unsupported masked store.\n");
+    }
+
   if (!vec_stmt) /* transformation not required.  */
     {
       STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) = memory_access_type;
@@ -6389,7 +6469,16 @@ vectorizable_store (gimple *stmt, gimple_stmt_iterator *gsi, gimple **vec_stmt,
       if (!slp)
 	{
 	  if (j == 0)
-	    STMT_VINFO_VEC_STMT (stmt_info) = *vec_stmt = new_stmt;
+	    {
+	      STMT_VINFO_VEC_STMT (stmt_info) = *vec_stmt = new_stmt;
+	      STMT_VINFO_FIRST_COPY_P (vinfo_for_stmt (new_stmt)) = true;
+	      /* Original statement is replaced with the first vector one.
+		 Keep data reference and original vectype in the first
+		 vector copy for masking purposes.  */
+	      STMT_VINFO_DATA_REF (vinfo_for_stmt (new_stmt))
+		= STMT_VINFO_DATA_REF (stmt_info);
+	      STMT_VINFO_VECTYPE (vinfo_for_stmt (new_stmt)) = vectype;
+	    }
 	  else
 	    STMT_VINFO_RELATED_STMT (prev_stmt_info) = new_stmt;
 	  prev_stmt_info = vinfo_for_stmt (new_stmt);
@@ -6667,6 +6756,15 @@ vectorizable_load (gimple *stmt, gimple_stmt_iterator *gsi, gimple **vec_stmt,
       gcc_assert (!nested_in_vect_loop);
       gcc_assert (!STMT_VINFO_GATHER_SCATTER_P (stmt_info));
 
+      if (loop_vinfo && LOOP_VINFO_CAN_BE_MASKED (loop_vinfo))
+	{
+	  if (dump_enabled_p ())
+	    dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+			     "cannot be masked: grouped access is not"
+			     " supported.\n");
+	  LOOP_VINFO_CAN_BE_MASKED (loop_vinfo) = false;
+	}
+
       first_stmt = GROUP_FIRST_ELEMENT (stmt_info);
       group_size = GROUP_SIZE (vinfo_for_stmt (first_stmt));
 
@@ -6707,6 +6805,36 @@ vectorizable_load (gimple *stmt, gimple_stmt_iterator *gsi, gimple **vec_stmt,
 			    &memory_access_type, &gs_info))
     return false;
 
+  if (loop_vinfo && LOOP_VINFO_CAN_BE_MASKED (loop_vinfo)
+      && integer_zerop (nested_in_vect_loop
+			? STMT_VINFO_DR_STEP (stmt_info)
+			: DR_STEP (dr)))
+    {
+      if (dump_enabled_p ())
+	dump_printf_loc (MSG_NOTE, vect_location,
+			 "allow invariant load for masked loop.\n");
+    }
+
+  if (loop_vinfo && LOOP_VINFO_CAN_BE_MASKED (loop_vinfo)
+      && memory_access_type != VMAT_CONTIGUOUS
+      && memory_access_type != VMAT_INVARIANT)
+    {
+      LOOP_VINFO_CAN_BE_MASKED (loop_vinfo) = false;
+      if (dump_enabled_p ())
+	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+			 "cannot be masked: unsupported memory access type.\n");
+    }
+
+  if (loop_vinfo && LOOP_VINFO_CAN_BE_MASKED (loop_vinfo)
+      && memory_access_type != VMAT_INVARIANT
+      && !can_mask_load_store (stmt))
+    {
+      if (dump_enabled_p ())
+	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+			 "cannot be masked: unsupported masked load.\n");
+      LOOP_VINFO_CAN_BE_MASKED (loop_vinfo) = false;
+    }
+
   if (!vec_stmt) /* transformation not required.  */
     {
       if (!slp)
@@ -8252,6 +8380,43 @@ vectorizable_comparison (gimple *stmt, gimple_stmt_iterator *gsi,
   return true;
 }
 
+/* Return true if vector version of STMT should be masked
+   in a vectorized loop epilogue (considering usage of the
+   same VF as for main loop).  */
+
+static bool
+vect_stmt_should_be_masked_for_epilogue (gimple *stmt)
+{
+  stmt_vec_info stmt_info = vinfo_for_stmt (stmt);
+
+  /* We should mask all statements accessing memory.  */
+  if (STMT_VINFO_DATA_REF (stmt_info))
+    return true;
+
+  /* We should also mask all reductions.  */
+  if (STMT_VINFO_DEF_TYPE (stmt_info) == vect_reduction_def
+      || STMT_VINFO_DEF_TYPE (stmt_info) == vect_double_reduction_def)
+    return true;
+
+  return false;
+}
+
+/* Add a mask required to mask STMT to LOOP_VINFO_REQUIRED_MASKS.  */
+
+static void
+vect_add_required_mask_for_stmt (gimple *stmt)
+{
+  stmt_vec_info stmt_info = vinfo_for_stmt (stmt);
+  loop_vec_info loop_vinfo = STMT_VINFO_LOOP_VINFO (stmt_info);
+  tree vectype = STMT_VINFO_VECTYPE (stmt_info);
+  unsigned HOST_WIDE_INT nelems = TYPE_VECTOR_SUBPARTS (vectype);
+  int bit_no = exact_log2 (nelems);
+
+  gcc_assert (bit_no >= 0);
+
+  LOOP_VINFO_REQUIRED_MASKS (loop_vinfo) |= (1 << bit_no);
+}
+
 /* Make sure the statement is vectorizable.  */
 
 bool
@@ -8259,6 +8424,7 @@ vect_analyze_stmt (gimple *stmt, bool *need_to_vectorize, slp_tree node)
 {
   stmt_vec_info stmt_info = vinfo_for_stmt (stmt);
   bb_vec_info bb_vinfo = STMT_VINFO_BB_VINFO (stmt_info);
+  loop_vec_info loop_vinfo = STMT_VINFO_LOOP_VINFO (stmt_info);
   enum vect_relevant relevance = STMT_VINFO_RELEVANT (stmt_info);
   bool ok;
   tree scalar_type, vectype;
@@ -8426,6 +8592,10 @@ vect_analyze_stmt (gimple *stmt, bool *need_to_vectorize, slp_tree node)
       STMT_VINFO_VECTYPE (stmt_info) = vectype;
    }
 
+  /* Masking is not supported for SLP yet.  */
+  if (loop_vinfo && node)
+    LOOP_VINFO_CAN_BE_MASKED (loop_vinfo) = false;
+
   if (STMT_VINFO_RELEVANT_P (stmt_info))
     {
       gcc_assert (!VECTOR_MODE_P (TYPE_MODE (gimple_expr_type (stmt))));
@@ -8485,6 +8655,30 @@ vect_analyze_stmt (gimple *stmt, bool *need_to_vectorize, slp_tree node)
       return false;
     }
 
+  if (loop_vinfo
+      && LOOP_VINFO_CAN_BE_MASKED (loop_vinfo))
+    {
+      /* Currently we have real masking for loads and stores only.
+	 We can't mask a loop that has other statements which may
+	 trap.  */
+      if ((!is_gimple_call (stmt)
+	   || !gimple_call_internal_p (stmt)
+	   || (gimple_call_internal_fn (stmt) != IFN_MASK_STORE
+	       && gimple_call_internal_fn (stmt) != IFN_MASK_LOAD))
+	  && gimple_could_trap_p_1 (stmt, false, false))
+	{
+	  if (dump_enabled_p ())
+	    {
+	      dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+			       "cannot be masked: unsupported trapping stmt: ");
+	      dump_gimple_stmt (MSG_MISSED_OPTIMIZATION, TDF_SLIM, stmt, 0);
+	    }
+	  LOOP_VINFO_CAN_BE_MASKED (loop_vinfo) = false;
+	}
+      else if (vect_stmt_should_be_masked_for_epilogue (stmt))
+	vect_add_required_mask_for_stmt (stmt);
+    }
+
   if (bb_vinfo)
     return true;
 
diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index 2a7fa0a..196a09b 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -335,6 +335,16 @@ typedef struct _loop_vec_info : public vec_info {
   /* Mark loops having masked stores.  */
   bool has_mask_store;
 
+  /* True if vectorized loop can be masked.  */
+  bool can_be_masked;
+
+  /* If vector mask with 2^N elements is required to mask the loop
+     then N-th bit of this field is set to 1.  */
+  unsigned required_masks;
+
+  /* True if we should vectorize loop with masking.  */
+  bool mask_loop;
+
   /* For loops being epilogues of already vectorized loops
      this points to the original vectorized loop.  Otherwise NULL.  */
   _loop_vec_info *orig_loop_info;
@@ -378,6 +388,9 @@ typedef struct _loop_vec_info : public vec_info {
 #define LOOP_VINFO_HAS_MASK_STORE(L)       (L)->has_mask_store
 #define LOOP_VINFO_SCALAR_ITERATION_COST(L) (L)->scalar_cost_vec
 #define LOOP_VINFO_SINGLE_SCALAR_ITERATION_COST(L) (L)->single_scalar_iteration_cost
+#define LOOP_VINFO_CAN_BE_MASKED(L)        (L)->can_be_masked
+#define LOOP_VINFO_REQUIRED_MASKS(L)       (L)->required_masks
+#define LOOP_VINFO_MASK_LOOP(L)            (L)->mask_loop
 #define LOOP_VINFO_ORIG_LOOP_INFO(L)       (L)->orig_loop_info
 
 #define LOOP_REQUIRES_VERSIONING_FOR_ALIGNMENT(L)	\
@@ -665,6 +678,10 @@ typedef struct _stmt_vec_info {
   /* For both loads and stores.  */
   bool simd_lane_access_p;
 
+  /* True for the first vector statement copy when scalar
+     statement is vectorized into several vector ones.  */
+  bool first_copy_p;
+
   /* For reduction loops, this is the type of reduction.  */
   enum vect_reduction_type v_reduc_type;
 
@@ -724,6 +741,7 @@ STMT_VINFO_BB_VINFO (stmt_vec_info stmt_vinfo)
 #define STMT_VINFO_STRIDED_P(S)	   	   (S)->strided_p
 #define STMT_VINFO_MEMORY_ACCESS_TYPE(S)   (S)->memory_access_type
 #define STMT_VINFO_SIMD_LANE_ACCESS_P(S)   (S)->simd_lane_access_p
+#define STMT_VINFO_FIRST_COPY_P(S)	   (S)->first_copy_p
 #define STMT_VINFO_VEC_REDUCTION_TYPE(S)   (S)->v_reduc_type
 #define STMT_VINFO_VEC_CONST_COND_REDUC_CODE(S) (S)->const_cond_reduc_code
 

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH, vec-tails] Support loop epilogue vectorization
  2016-11-24 13:42                                 ` Yuri Rumyantsev
@ 2016-11-28 14:39                                   ` Richard Biener
  2016-11-28 16:57                                     ` Yuri Rumyantsev
  0 siblings, 1 reply; 38+ messages in thread
From: Richard Biener @ 2016-11-28 14:39 UTC (permalink / raw)
  To: Yuri Rumyantsev; +Cc: Christophe Lyon, Jeff Law, gcc-patches, Ilya Enkovich

On Thu, 24 Nov 2016, Yuri Rumyantsev wrote:

> Hi All,
> 
> Here is the second patch, which supports epilogue vectorization using
> masking without a cost model. Currently it is enabled only by
> passing the parameter "--param vect-epilogues-mask=1".
> 
> Bootstrapping and regression testing did not show any new regression.
> 
> Any comments will be appreciated.

Going over the patch, the main question is how it works -- it looks
like the decision whether to vectorize & mask the epilogue is made
when vectorizing the loop that generates the epilogue rather than
in the epilogue vectorization path?

That is, I'd have expected to see this handling low-trip count loops
by masking?  And thus masking the epilogue simply by it being
low-trip count?

Richard.

> ChangeLog:
> 2016-11-24  Yuri Rumyantsev  <ysrumyan@gmail.com>
> 
> * params.def (PARAM_VECT_EPILOGUES_MASK): New.
> * tree-vect-data-refs.c (vect_get_new_ssa_name): Support vect_mask_var.
> * tree-vect-loop.c: Include insn-config.h, recog.h and alias.h.
> (new_loop_vec_info): Zero the can_be_masked, mask_loop and
> required_masks fields.
> (vect_check_required_masks_widening): New.
> (vect_check_required_masks_narrowing): New.
> (vect_get_masking_iv_elems): New.
> (vect_get_masking_iv_type): New.
> (vect_get_extreme_masks): New.
> (vect_check_required_masks): New.
> (vect_analyze_loop_operations): Call vect_check_required_masks if all
> statements can be masked.
> (vect_analyze_loop_2): Initialize min_scalar_loop_bound to zero.
> Add check that epilogue can be masked with the same vf, issuing
> failure notes.  Allow epilogue vectorization through masking of low trip
> loops.  Set can_be_masked field to true before loop operation analysis.
> Do not set up min_scalar_loop_bound for epilogue vectorization through
> masking.  Do not peel for epilogue masking.  Reset can_be_masked
> field before repeating analysis.
> (vect_estimate_min_profitable_iters): Do not compute profitability
> for epilogue masking.  Set mask_loop field to true if parameter
> PARAM_VECT_EPILOGUES_MASK is non-zero.
> (vectorizable_reduction): Add check that statement can be masked.
> (vectorizable_induction): Do not support masking for induction.
> (vect_gen_ivs_for_masking): New.
> (vect_get_mask_index_for_elems): New.
> (vect_get_mask_index_for_type): New.
> (vect_create_narrowed_masks): New.
> (vect_create_widened_masks): New.
> (vect_gen_loop_masks): New.
> (vect_mask_reduction_stmt): New.
> (vect_mask_mask_load_store_stmt): New.
> (vect_mask_load_store_stmt): New.
> (vect_mask_loop): New.
> (vect_transform_loop): Invoke vect_mask_loop if required.
> Use div_ceil to recompute upper bounds for masked loops.  Issue
> statistics for epilogue vectorization through masking. Do not reduce
> vf for masking epilogue.
> * tree-vect-stmts.c: Include tree-ssa-loop-ivopts.h.
> (can_mask_load_store): New.
> (vectorizable_mask_load_store): Check that mask conjunction is
> supported.  Set up first_copy_p field of stmt_vinfo.
> (vectorizable_simd_clone_call): Check that simd clone can not be
> masked.
> (vectorizable_store): Check that store can be masked. Mark the first
> copy of generated vector stores and provide it with vectype and the
> original data reference.
> (vectorizable_load): Check that load can be masked.
> (vect_stmt_should_be_masked_for_epilogue): New.
> (vect_add_required_mask_for_stmt): New.
> (vect_analyze_stmt): Add check on unsupported statements for masking
> with printing message.
> * tree-vectorizer.h (struct _loop_vec_info): Add new fields
> can_be_masked, required_masks, mask_loop.
> (LOOP_VINFO_CAN_BE_MASKED): New.
> (LOOP_VINFO_REQUIRED_MASKS): New.
> (LOOP_VINFO_MASK_LOOP): New.
> (struct _stmt_vec_info): Add first_copy_p field.
> (STMT_VINFO_FIRST_COPY_P): New.
> 
> gcc/testsuite/
> 
> * gcc.dg/vect/vect-tail-mask-1.c: New test.
> 
> 2016-11-18 18:54 GMT+03:00 Christophe Lyon <christophe.lyon@linaro.org>:
> > On 18 November 2016 at 16:46, Yuri Rumyantsev <ysrumyan@gmail.com> wrote:
> >> It is very strange that this test failed on arm, since it requires
> >> target avx2 to check vectorizer dumps:
> >>
> >> /* { dg-final { scan-tree-dump-times "LOOP VECTORIZED" 2 "vect" {
> >> target avx2_runtime } } } */
> >> /* { dg-final { scan-tree-dump-times "LOOP EPILOGUE VECTORIZED
> >> \\(VS=16\\)" 2 "vect" { target avx2_runtime } } } */
> >>
> >> Could you please clarify what is the reason of the failure?
> >
> > It's not the scan-dumps that fail, but the execution.
> > The test calls abort() for some reason.
> >
> > It will take me a while to rebuild the test manually in the right
> > debug environment to provide you with more traces.
> >
> >
> >
> >>
> >> Thanks.
> >>
> >> 2016-11-18 16:20 GMT+03:00 Christophe Lyon <christophe.lyon@linaro.org>:
> >>> On 15 November 2016 at 15:41, Yuri Rumyantsev <ysrumyan@gmail.com> wrote:
> >>>> Hi All,
> >>>>
> >>>> Here is the patch for non-masked epilogue vectorization.
> >>>>
> >>>> Bootstrap and regression testing did not show any new failures.
> >>>>
> >>>> Is it OK for trunk?
> >>>>
> >>>> Thanks.
> >>>> Changelog:
> >>>>
> >>>> 2016-11-15  Yuri Rumyantsev  <ysrumyan@gmail.com>
> >>>>
> >>>> * params.def (PARAM_VECT_EPILOGUES_NOMASK): New.
> >>>> * tree-if-conv.c (tree_if_conversion): Make public.
> >>>> * tree-if-conv.h: New file.
> >>>> * tree-vect-data-refs.c (vect_analyze_data_ref_dependences): Avoid
> >>>> dynamic alias checks for epilogues.
> >>>> * tree-vect-loop-manip.c (vect_do_peeling): Return created epilog.
> >>>> * tree-vect-loop.c: Include tree-if-conv.h.
> >>>> (new_loop_vec_info): Add zeroing orig_loop_info field.
> >>>> (vect_analyze_loop_2): Don't try to enhance alignment for epilogues.
> >>>> (vect_analyze_loop): Add argument ORIG_LOOP_INFO which is not NULL
> >>>> if epilogue is vectorized, set up orig_loop_info field of loop_vinfo
> >>>> using passed argument.
> >>>> (vect_transform_loop): Check if created epilogue should be returned
> >>>> for further vectorization with less vf.  If-convert epilogue if
> >>>> required. Print vectorization success for epilogue.
> >>>> * tree-vectorizer.c (vectorize_loops): Add epilogue vectorization
> >>>> if it is required, pass loop_vinfo produced during vectorization of
> >>>> loop body to vect_analyze_loop.
> >>>> * tree-vectorizer.h (struct _loop_vec_info): Add new field
> >>>> orig_loop_info.
> >>>> (LOOP_VINFO_ORIG_LOOP_INFO): New.
> >>>> (LOOP_VINFO_EPILOGUE_P): New.
> >>>> (LOOP_VINFO_ORIG_VECT_FACTOR): New.
> >>>> (vect_do_peeling): Change prototype to return epilogue.
> >>>> (vect_analyze_loop): Add argument of loop_vec_info type.
> >>>> (vect_transform_loop): Return created loop.
> >>>>
> >>>> gcc/testsuite/
> >>>>
> >>>> * lib/target-supports.exp (check_avx2_hw_available): New.
> >>>> (check_effective_target_avx2_runtime): New.
> >>>> * gcc.dg/vect/vect-tail-nomask-1.c: New test.
> >>>>
> >>>
> >>> Hi,
> >>>
> >>> This new test fails on arm-none-eabi (using default cpu/fpu/mode):
> >>>   gcc.dg/vect/vect-tail-nomask-1.c -flto -ffat-lto-objects execution test
> >>>   gcc.dg/vect/vect-tail-nomask-1.c execution test
> >>>
> >>> It does pass on the same target if configured --with-cpu=cortex-a9.
> >>>
> >>> Christophe
> >>>
> >>>
> >>>
> >>>>
> >>>> 2016-11-14 20:04 GMT+03:00 Richard Biener <rguenther@suse.de>:
> >>>>> On November 14, 2016 4:39:40 PM GMT+01:00, Yuri Rumyantsev <ysrumyan@gmail.com> wrote:
> >>>>>>Richard,
> >>>>>>
> >>>>>>I checked one of the tests designed for epilogue vectorization using
> >>>>>>patches 1 - 3 and found out that the built compiler performs vectorization
> >>>>>>of epilogues with --param vect-epilogues-nomask=1 passed:
> >>>>>>
> >>>>>>$ gcc -Ofast -mavx2 t1.c -S --param vect-epilogues-nomask=1 -o
> >>>>>>t1.new-nomask.s -fdump-tree-vect-details
> >>>>>>$ grep VECTORIZED -c t1.c.156t.vect
> >>>>>>4
> >>>>>> Without param only 2 loops are vectorized.
> >>>>>>
> >>>>>>Should I simply add a part of tests related to this feature or I must
> >>>>>>delete all not necessary changes also?
> >>>>>
> >>>>> Please remove all not necessary changes.
> >>>>>
> >>>>> Richard.
> >>>>>
> >>>>>>Thanks.
> >>>>>>Yuri.
> >>>>>>
> >>>>>>2016-11-14 16:40 GMT+03:00 Richard Biener <rguenther@suse.de>:
> >>>>>>> On Mon, 14 Nov 2016, Yuri Rumyantsev wrote:
> >>>>>>>
> >>>>>>>> Richard,
> >>>>>>>>
> >>>>>>>> In my previous patch I forgot to remove couple lines related to aux
> >>>>>>field.
> >>>>>>>> Here is the correct updated patch.
> >>>>>>>
> >>>>>>> Yeah, I noticed.  This patch would be ok for trunk (together with
> >>>>>>> necessary parts from 1 and 2) if all not required parts are removed
> >>>>>>> (and you'd add the testcases covering non-masked tail vect).
> >>>>>>>
> >>>>>>> Thus, can you please produce a single complete patch containing only
> >>>>>>> non-masked epilogue vectorization?
> >>>>>>>
> >>>>>>> Thanks,
> >>>>>>> Richard.
> >>>>>>>
> >>>>>>>> Thanks.
> >>>>>>>> Yuri.
> >>>>>>>>
> >>>>>>>> 2016-11-14 15:51 GMT+03:00 Richard Biener <rguenther@suse.de>:
> >>>>>>>> > On Fri, 11 Nov 2016, Yuri Rumyantsev wrote:
> >>>>>>>> >
> >>>>>>>> >> Richard,
> >>>>>>>> >>
> >>>>>>>> >> I prepared an updated patch 3 that passes an additional argument to
> >>>>>>>> >> vect_analyze_loop as you proposed (untested).
> >>>>>>>> >>
> >>>>>>>> >> You wrote:
> >>>>>>>> >> Btw, I wonder if you can produce a single patch containing just
> >>>>>>>> >> epilogue vectorization, that is combine patches 1-3 but rip out
> >>>>>>>> >> changes only needed by later patches?
> >>>>>>>> >>
> >>>>>>>> >> Did you mean that I exclude all support for vectorization
> >>>>>>epilogues,
> >>>>>>>> >> i.e. exclude from 2-nd patch all non-related changes
> >>>>>>>> >> like
> >>>>>>>> >>
> >>>>>>>> >> diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
> >>>>>>>> >> index 11863af..32011c1 100644
> >>>>>>>> >> --- a/gcc/tree-vect-loop.c
> >>>>>>>> >> +++ b/gcc/tree-vect-loop.c
> >>>>>>>> >> @@ -1120,6 +1120,12 @@ new_loop_vec_info (struct loop *loop)
> >>>>>>>> >>    LOOP_VINFO_PEELING_FOR_GAPS (res) = false;
> >>>>>>>> >>    LOOP_VINFO_PEELING_FOR_NITER (res) = false;
> >>>>>>>> >>    LOOP_VINFO_OPERANDS_SWAPPED (res) = false;
> >>>>>>>> >> +  LOOP_VINFO_CAN_BE_MASKED (res) = false;
> >>>>>>>> >> +  LOOP_VINFO_REQUIRED_MASKS (res) = 0;
> >>>>>>>> >> +  LOOP_VINFO_COMBINE_EPILOGUE (res) = false;
> >>>>>>>> >> +  LOOP_VINFO_MASK_EPILOGUE (res) = false;
> >>>>>>>> >> +  LOOP_VINFO_NEED_MASKING (res) = false;
> >>>>>>>> >> +  LOOP_VINFO_ORIG_LOOP_INFO (res) = NULL;
> >>>>>>>> >
> >>>>>>>> > Yes.
> >>>>>>>> >
> >>>>>>>> >> Did you mean also that new combined patch must be working patch,
> >>>>>>i.e.
> >>>>>>>> >> can be integrated without other patches?
> >>>>>>>> >
> >>>>>>>> > Yes.
> >>>>>>>> >
> >>>>>>>> >> Could you please look at updated patch?
> >>>>>>>> >
> >>>>>>>> > Will do.
> >>>>>>>> >
> >>>>>>>> > Thanks,
> >>>>>>>> > Richard.
> >>>>>>>> >
> >>>>>>>> >> Thanks.
> >>>>>>>> >> Yuri.
> >>>>>>>> >>
> >>>>>>>> >> 2016-11-10 15:36 GMT+03:00 Richard Biener <rguenther@suse.de>:
> >>>>>>>> >> > On Thu, 10 Nov 2016, Richard Biener wrote:
> >>>>>>>> >> >
> >>>>>>>> >> >> On Tue, 8 Nov 2016, Yuri Rumyantsev wrote:
> >>>>>>>> >> >>
> >>>>>>>> >> >> > Richard,
> >>>>>>>> >> >> >
> >>>>>>>> >> >> > Here is updated 3 patch.
> >>>>>>>> >> >> >
> >>>>>>>> >> >> > I checked that all new tests related to epilogue
> >>>>>>vectorization passed with it.
> >>>>>>>> >> >> >
> >>>>>>>> >> >> > Your comments will be appreciated.
> >>>>>>>> >> >>
> >>>>>>>> >> >> A lot better now.  Instead of the ->aux dance I now prefer to
> >>>>>>>> >> >> pass the original loops loop_vinfo to vect_analyze_loop as
> >>>>>>>> >> >> optional argument (if non-NULL we analyze the epilogue of that
> >>>>>>>> >> >> loop_vinfo).  OTOH I remember we mainly use it to get at the
> >>>>>>>> >> >> original vectorization factor?  So we can pass down an
> >>>>>>(optional)
> >>>>>>>> >> >> forced vectorization factor as well?
> >>>>>>>> >> >
> >>>>>>>> >> > Btw, I wonder if you can produce a single patch containing just
> >>>>>>>> >> > epilogue vectorization, that is combine patches 1-3 but rip out
> >>>>>>>> >> > changes only needed by later patches?
> >>>>>>>> >> >
> >>>>>>>> >> > Thanks,
> >>>>>>>> >> > Richard.
> >>>>>>>> >> >
> >>>>>>>> >> >> Richard.
> >>>>>>>> >> >>
> >>>>>>>> >> >> > 2016-11-08 15:38 GMT+03:00 Richard Biener
> >>>>>><rguenther@suse.de>:
> >>>>>>>> >> >> > > On Thu, 3 Nov 2016, Yuri Rumyantsev wrote:
> >>>>>>>> >> >> > >
> >>>>>>>> >> >> > >> Hi Richard,
> >>>>>>>> >> >> > >>
> >>>>>>>> >> >> > >> I did not understand your last remark:
> >>>>>>>> >> >> > >>
> >>>>>>>> >> >> > >> > That is, here (and avoid the FOR_EACH_LOOP change):
> >>>>>>>> >> >> > >> >
> >>>>>>>> >> >> > >> > @@ -580,12 +586,21 @@ vectorize_loops (void)
> >>>>>>>> >> >> > >> >           && dump_enabled_p ())
> >>>>>>>> >> >> > >> >           dump_printf_loc (MSG_OPTIMIZED_LOCATIONS,
> >>>>>>vect_location,
> >>>>>>>> >> >> > >> >                            "loop vectorized\n");
> >>>>>>>> >> >> > >> > -       vect_transform_loop (loop_vinfo);
> >>>>>>>> >> >> > >> > +       new_loop = vect_transform_loop (loop_vinfo);
> >>>>>>>> >> >> > >> >         num_vectorized_loops++;
> >>>>>>>> >> >> > >> >        /* Now that the loop has been vectorized, allow
> >>>>>>it to be unrolled
> >>>>>>>> >> >> > >> >           etc.  */
> >>>>>>>> >> >> > >> >      loop->force_vectorize = false;
> >>>>>>>> >> >> > >> >
> >>>>>>>> >> >> > >> > +       /* Add new loop to a processing queue.  To make
> >>>>>>it easier
> >>>>>>>> >> >> > >> > +          to match loop and its epilogue vectorization
> >>>>>>in dumps
> >>>>>>>> >> >> > >> > +          put new loop as the next loop to process.
> >>>>>>*/
> >>>>>>>> >> >> > >> > +       if (new_loop)
> >>>>>>>> >> >> > >> > +         {
> >>>>>>>> >> >> > >> > +           loops.safe_insert (i + 1, new_loop->num);
> >>>>>>>> >> >> > >> > +           vect_loops_num = number_of_loops (cfun);
> >>>>>>>> >> >> > >> > +         }
> >>>>>>>> >> >> > >> >
> >>>>>>>> >> >> > >> > simply dispatch to a vectorize_epilogue (loop_vinfo,
> >>>>>>new_loop)
> >>>>>>>> >> >> > >> > function which will set up stuff properly (and also
> >>>>>>perform
> >>>>>>>> >> >> > >> > the if-conversion of the epilogue there).
> >>>>>>>> >> >> > >> >
> >>>>>>>> >> >> > >> > That said, if we can get in non-masked epilogue
> >>>>>>vectorization
> >>>>>>>> >> >> > >> > separately that would be great.
> >>>>>>>> >> >> > >>
> >>>>>>>> >> >> > >> Could you please clarify your proposal.
> >>>>>>>> >> >> > >
> >>>>>>>> >> >> > > When a loop was vectorized set things up to immediately
> >>>>>>vectorize
> >>>>>>>> >> >> > > its epilogue, avoiding changing the loop iteration and
> >>>>>>avoiding
> >>>>>>>> >> >> > > the re-use of ->aux.
> >>>>>>>> >> >> > >
> >>>>>>>> >> >> > > Richard.
> >>>>>>>> >> >> > >
> >>>>>>>> >> >> > >> Thanks.
> >>>>>>>> >> >> > >> Yuri.
> >>>>>>>> >> >> > >>
> >>>>>>>> >> >> > >> 2016-11-02 15:27 GMT+03:00 Richard Biener
> >>>>>><rguenther@suse.de>:
> >>>>>>>> >> >> > >> > On Tue, 1 Nov 2016, Yuri Rumyantsev wrote:
> >>>>>>>> >> >> > >> >
> >>>>>>>> >> >> > >> >> Hi All,
> >>>>>>>> >> >> > >> >>
> >>>>>>>> >> >> > >> >> I re-send all patches sent by Ilya earlier for review
> >>>>>>which support
> >>>>>>>> >> >> > >> >> vectorization of loop epilogues and loops with low
> >>>>>>trip count. We
> >>>>>>>> >> >> > >> >> assume that the only patch -
> >>>>>>vec-tails-07-combine-tail.patch - was not
> >>>>>>>> >> >> > >> >> approved by Jeff.
> >>>>>>>> >> >> > >> >>
> >>>>>>>> >> >> > >> >> I did re-base of all patches and performed
> >>>>>>bootstrapping and
> >>>>>>>> >> >> > >> >> regression testing that did not show any new failures.
> >>>>>>Also all
> >>>>>>>> >> >> > >> >> changes related to new vect_do_peeling algorithm have
> >>>>>>been changed
> >>>>>>>> >> >> > >> >> accordingly.
> >>>>>>>> >> >> > >> >>
> >>>>>>>> >> >> > >> >> Is it OK for trunk?
> >>>>>>>> >> >> > >> >
> >>>>>>>> >> >> > >> > I would have preferred that the series up to
> >>>>>>-03-nomask-tails would
> >>>>>>>> >> >> > >> > _only_ contain epilogue loop vectorization changes but
> >>>>>>unfortunately
> >>>>>>>> >> >> > >> > the patchset is oddly separated.
> >>>>>>>> >> >> > >> >
> >>>>>>>> >> >> > >> > I have a comment on that part nevertheless:
> >>>>>>>> >> >> > >> >
> >>>>>>>> >> >> > >> > @@ -1608,7 +1614,10 @@ vect_enhance_data_refs_alignment
> >>>>>>(loop_vec_info
> >>>>>>>> >> >> > >> > loop_vinfo)
> >>>>>>>> >> >> > >> >    /* Check if we can possibly peel the loop.  */
> >>>>>>>> >> >> > >> >    if (!vect_can_advance_ivs_p (loop_vinfo)
> >>>>>>>> >> >> > >> >        || !slpeel_can_duplicate_loop_p (loop,
> >>>>>>single_exit (loop))
> >>>>>>>> >> >> > >> > -      || loop->inner)
> >>>>>>>> >> >> > >> > +      || loop->inner
> >>>>>>>> >> >> > >> > +      /* Required peeling was performed in prologue
> >>>>>>and
> >>>>>>>> >> >> > >> > +        is not required for epilogue.  */
> >>>>>>>> >> >> > >> > +      || LOOP_VINFO_EPILOGUE_P (loop_vinfo))
> >>>>>>>> >> >> > >> >      do_peeling = false;
> >>>>>>>> >> >> > >> >
> >>>>>>>> >> >> > >> >    if (do_peeling
> >>>>>>>> >> >> > >> > @@ -1888,7 +1897,10 @@ vect_enhance_data_refs_alignment
> >>>>>>(loop_vec_info
> >>>>>>>> >> >> > >> > loop_vinfo)
> >>>>>>>> >> >> > >> >
> >>>>>>>> >> >> > >> >    do_versioning =
> >>>>>>>> >> >> > >> >         optimize_loop_nest_for_speed_p (loop)
> >>>>>>>> >> >> > >> > -       && (!loop->inner); /* FORNOW */
> >>>>>>>> >> >> > >> > +       && (!loop->inner) /* FORNOW */
> >>>>>>>> >> >> > >> > +        /* Required versioning was performed for the
> >>>>>>>> >> >> > >> > +          original loop and is not required for
> >>>>>>epilogue.  */
> >>>>>>>> >> >> > >> > +       && !LOOP_VINFO_EPILOGUE_P (loop_vinfo);
> >>>>>>>> >> >> > >> >
> >>>>>>>> >> >> > >> >    if (do_versioning)
> >>>>>>>> >> >> > >> >      {
> >>>>>>>> >> >> > >> >
> >>>>>>>> >> >> > >> > please do that check in the single caller of this
> >>>>>>function.
> >>>>>>>> >> >> > >> >
> >>>>>>>> >> >> > >> > Otherwise I still dislike the new ->aux use and I
> >>>>>>believe that simply
> >>>>>>>> >> >> > >> > passing down info from the processed parent would be
> >>>>>>_much_ cleaner.
> >>>>>>>> >> >> > >> > That is, here (and avoid the FOR_EACH_LOOP change):
> >>>>>>>> >> >> > >> >
> >>>>>>>> >> >> > >> > @@ -580,12 +586,21 @@ vectorize_loops (void)
> >>>>>>>> >> >> > >> >             && dump_enabled_p ())
> >>>>>>>> >> >> > >> >            dump_printf_loc (MSG_OPTIMIZED_LOCATIONS,
> >>>>>>vect_location,
> >>>>>>>> >> >> > >> >                             "loop vectorized\n");
> >>>>>>>> >> >> > >> > -       vect_transform_loop (loop_vinfo);
> >>>>>>>> >> >> > >> > +       new_loop = vect_transform_loop (loop_vinfo);
> >>>>>>>> >> >> > >> >         num_vectorized_loops++;
> >>>>>>>> >> >> > >> >         /* Now that the loop has been vectorized, allow
> >>>>>>it to be unrolled
> >>>>>>>> >> >> > >> >            etc.  */
> >>>>>>>> >> >> > >> >         loop->force_vectorize = false;
> >>>>>>>> >> >> > >> >
> >>>>>>>> >> >> > >> > +       /* Add new loop to a processing queue.  To make
> >>>>>>it easier
> >>>>>>>> >> >> > >> > +          to match loop and its epilogue vectorization
> >>>>>>in dumps
> >>>>>>>> >> >> > >> > +          put new loop as the next loop to process.
> >>>>>>*/
> >>>>>>>> >> >> > >> > +       if (new_loop)
> >>>>>>>> >> >> > >> > +         {
> >>>>>>>> >> >> > >> > +           loops.safe_insert (i + 1, new_loop->num);
> >>>>>>>> >> >> > >> > +           vect_loops_num = number_of_loops (cfun);
> >>>>>>>> >> >> > >> > +         }
> >>>>>>>> >> >> > >> >
> >>>>>>>> >> >> > >> > simply dispatch to a vectorize_epilogue (loop_vinfo,
> >>>>>>new_loop)
> >>>>>>>> >> >> > >> > function which will set up stuff properly (and also
> >>>>>>perform
> >>>>>>>> >> >> > >> > the if-conversion of the epilogue there).
> >>>>>>>> >> >> > >> >
> >>>>>>>> >> >> > >> > That said, if we can get in non-masked epilogue
> >>>>>>vectorization
> >>>>>>>> >> >> > >> > separately that would be great.
> >>>>>>>> >> >> > >> >
> >>>>>>>> >> >> > >> > I'm still torn about all the rest of the stuff and
> >>>>>>question its
> >>>>>>>> >> >> > >> > usability (esp. merging the epilogue with the main
> >>>>>>vector loop).
> >>>>>>>> >> >> > >> > But it has already been approved ... oh well.
> >>>>>>>> >> >> > >> >
> >>>>>>>> >> >> > >> > Thanks,
> >>>>>>>> >> >> > >> > Richard.
> >>>>>>>> >> >> > >>
> >>>>>>>> >> >> > >>
> >>>>>>>> >> >> > >
> >>>>>>>> >> >> > > --
> >>>>>>>> >> >> > > Richard Biener <rguenther@suse.de>
> >>>>>>>> >> >> > > SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard,
> >>>>>>Graham Norton, HRB 21284 (AG Nuernberg)
> >>>>>>>> >> >> >
> >>>>>>>> >> >>
> >>>>>>>> >> >>
> >>>>>>>> >> >
> >>>>>>>> >> > --
> >>>>>>>> >> > Richard Biener <rguenther@suse.de>
> >>>>>>>> >> > SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham
> >>>>>>Norton, HRB 21284 (AG Nuernberg)
> >>>>>>>> >>
> >>>>>>>> >
> >>>>>>>> > --
> >>>>>>>> > Richard Biener <rguenther@suse.de>
> >>>>>>>> > SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham
> >>>>>>Norton, HRB 21284 (AG Nuernberg)
> >>>>>>>>
> >>>>>>>
> >>>>>>> --
> >>>>>>> Richard Biener <rguenther@suse.de>
> >>>>>>> SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham
> >>>>>>Norton, HRB 21284 (AG Nuernberg)
> >>>>>
> >>>>>
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH, vec-tails] Support loop epilogue vectorization
  2016-11-28 14:39                                   ` Richard Biener
@ 2016-11-28 16:57                                     ` Yuri Rumyantsev
  2016-12-01 11:34                                       ` Richard Biener
  0 siblings, 1 reply; 38+ messages in thread
From: Yuri Rumyantsev @ 2016-11-28 16:57 UTC (permalink / raw)
  To: Richard Biener; +Cc: Christophe Lyon, Jeff Law, gcc-patches, Ilya Enkovich

[-- Attachment #1: Type: text/plain, Size: 23203 bytes --]

Richard!

I attached the vect dump for the part of the attached test-case which
illustrates how vectorization of epilogues works through masking:
#define SIZE 1023
#define ALIGN 64

extern int posix_memalign(void **memptr, __SIZE_TYPE__ alignment,
__SIZE_TYPE__ size) __attribute__((weak));
extern void free (void *);

void __attribute__((noinline))
test_citer (int * __restrict__ a,
   int * __restrict__ b,
   int * __restrict__ c)
{
  int i;

  a = (int *)__builtin_assume_aligned (a, ALIGN);
  b = (int *)__builtin_assume_aligned (b, ALIGN);
  c = (int *)__builtin_assume_aligned (c, ALIGN);

  for (i = 0; i < SIZE; i++)
    c[i] = a[i] + b[i];
}

It was compiled with the options -mavx2 --param vect-epilogues-mask=1.
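
For readers following the thread, the effect of masking the epilogue on this test can be sketched in plain C. This is only a scalar emulation under stated assumptions, not the code the patch generates: VF is assumed to be 8 (eight ints per 256-bit AVX2 vector), and the names masked_add_step and add_with_masked_tail are hypothetical. With SIZE = 1023 the loop would run 127 full vector iterations and then one masked iteration with 7 active lanes.

```c
#include <assert.h>
#include <stddef.h>

#define VF 8  /* Assumed vectorization factor: 8 ints per AVX2 vector.  */

/* Scalar emulation of one masked vector iteration.  Lane I is active
   only while BASE + I < N, mirroring a mask derived from the loop IV.  */
static void
masked_add_step (int *c, const int *a, const int *b, size_t base, size_t n)
{
  for (size_t lane = 0; lane < VF; lane++)
    {
      int active = base + lane < n;  /* Mask bit for this lane.  */
      if (active)
        c[base + lane] = a[base + lane] + b[base + lane];
    }
}

/* Hypothetical helper: run full vector iterations, then handle the
   tail with a single masked iteration instead of a scalar epilogue.
   Returns the number of vector iterations executed.  */
size_t
add_with_masked_tail (int *c, const int *a, const int *b, size_t n)
{
  assert (a && b && c);
  size_t iters = 0;
  size_t i = 0;

  for (; i + VF <= n; i += VF, iters++)
    masked_add_step (c, a, b, i, n);   /* All lanes active.  */

  if (i < n)
    {
      masked_add_step (c, a, b, i, n); /* Only n - i lanes active.  */
      iters++;
    }
  return iters;
}
```

Calling add_with_masked_tail (c, a, b, 1023) executes 128 steps, i.e. the ceiling of 1023 / 8, which is consistent with the ChangeLog note about using div_ceil to recompute upper bounds for masked loops.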

I did not include vectorization of low trip-count loops in this patch,
since the original patch introduced an additional parameter for it:
+DEFPARAM (PARAM_VECT_SHORT_LOOPS,
+  "vect-short-loops",
+  "Enable vectorization of low trip count loops using masking.",
+  0, 0, 1)

I assume that this ability can be added very quickly, but it also
requires cost model enhancements.

Best regards.
Yuri.


2016-11-28 17:39 GMT+03:00 Richard Biener <rguenther@suse.de>:
> On Thu, 24 Nov 2016, Yuri Rumyantsev wrote:
>
>> Hi All,
>>
>> Here is the second patch, which supports epilogue vectorization using
>> masking without a cost model. Currently it is enabled only by
>> passing the parameter "--param vect-epilogues-mask=1".
>>
>> Bootstrapping and regression testing did not show any new regression.
>>
>> Any comments will be appreciated.
>
> Going over the patch, the main question is how it works -- it looks
> like the decision whether to vectorize & mask the epilogue is made
> when vectorizing the loop that generates the epilogue rather than
> in the epilogue vectorization path?
>
> That is, I'd have expected to see this handling low-trip count loops
> by masking?  And thus masking the epilogue simply by it being
> low-trip count?
>
> Richard.
>
>> ChangeLog:
>> 2016-11-24  Yuri Rumyantsev  <ysrumyan@gmail.com>
>>
>> * params.def (PARAM_VECT_EPILOGUES_MASK): New.
>> * tree-vect-data-refs.c (vect_get_new_ssa_name): Support vect_mask_var.
>> * tree-vect-loop.c: Include insn-config.h, recog.h and alias.h.
>> (new_loop_vec_info): Zero the can_be_masked, mask_loop and
>> required_masks fields.
>> (vect_check_required_masks_widening): New.
>> (vect_check_required_masks_narrowing): New.
>> (vect_get_masking_iv_elems): New.
>> (vect_get_masking_iv_type): New.
>> (vect_get_extreme_masks): New.
>> (vect_check_required_masks): New.
>> (vect_analyze_loop_operations): Call vect_check_required_masks if all
>> statements can be masked.
>> (vect_analyze_loop_2): Initialize min_scalar_loop_bound to zero.
>> Add check that epilogue can be masked with the same vf, issuing
>> failure notes.  Allow epilogue vectorization through masking of low trip
>> loops.  Set can_be_masked field to true before loop operation analysis.
>> Do not set up min_scalar_loop_bound for epilogue vectorization through
>> masking.  Do not peel for epilogue masking.  Reset can_be_masked
>> field before repeating analysis.
>> (vect_estimate_min_profitable_iters): Do not compute profitability
>> for epilogue masking.  Set mask_loop field to true if parameter
>> PARAM_VECT_EPILOGUES_MASK is non-zero.
>> (vectorizable_reduction): Add check that statement can be masked.
>> (vectorizable_induction): Do not support masking for induction.
>> (vect_gen_ivs_for_masking): New.
>> (vect_get_mask_index_for_elems): New.
>> (vect_get_mask_index_for_type): New.
>> (vect_create_narrowed_masks): New.
>> (vect_create_widened_masks): New.
>> (vect_gen_loop_masks): New.
>> (vect_mask_reduction_stmt): New.
>> (vect_mask_mask_load_store_stmt): New.
>> (vect_mask_load_store_stmt): New.
>> (vect_mask_loop): New.
>> (vect_transform_loop): Invoke vect_mask_loop if required.
>> Use div_ceil to recompute upper bounds for masked loops.  Issue
>> statistics for epilogue vectorization through masking. Do not reduce
>> vf for masking epilogue.
>> * tree-vect-stmts.c: Include tree-ssa-loop-ivopts.h.
>> (can_mask_load_store): New.
>> (vectorizable_mask_load_store): Check that mask conjunction is
>> supported.  Set up first_copy_p field of stmt_vinfo.
>> (vectorizable_simd_clone_call): Check that simd clone can not be
>> masked.
>> (vectorizable_store): Check that store can be masked. Mark the first
>> copy of generated vector stores and provide it with vectype and the
>> original data reference.
>> (vectorizable_load): Check that load can be masked.
>> (vect_stmt_should_be_masked_for_epilogue): New.
>> (vect_add_required_mask_for_stmt): New.
>> (vect_analyze_stmt): Add check on unsupported statements for masking
>> with printing message.
>> * tree-vectorizer.h (struct _loop_vec_info): Add new fields
>> can_be_masked, required_masks, mask_loop.
>> (LOOP_VINFO_CAN_BE_MASKED): New.
>> (LOOP_VINFO_REQUIRED_MASKS): New.
>> (LOOP_VINFO_MASK_LOOP): New.
>> (struct _stmt_vec_info): Add first_copy_p field.
>> (STMT_VINFO_FIRST_COPY_P): New.
>>
>> gcc/testsuite/
>>
>> * gcc.dg/vect/vect-tail-mask-1.c: New test.
>>
>> 2016-11-18 18:54 GMT+03:00 Christophe Lyon <christophe.lyon@linaro.org>:
>> > On 18 November 2016 at 16:46, Yuri Rumyantsev <ysrumyan@gmail.com> wrote:
>> >> It is very strange that this test failed on arm, since it requires
>> >> target avx2 to check vectorizer dumps:
>> >>
>> >> /* { dg-final { scan-tree-dump-times "LOOP VECTORIZED" 2 "vect" {
>> >> target avx2_runtime } } } */
>> >> /* { dg-final { scan-tree-dump-times "LOOP EPILOGUE VECTORIZED
>> >> \\(VS=16\\)" 2 "vect" { target avx2_runtime } } } */
>> >>
>> >> Could you please clarify what is the reason of the failure?
>> >
>> > It's not the scan-dumps that fail, but the execution.
>> > The test calls abort() for some reason.
>> >
>> > It will take me a while to rebuild the test manually in the right
>> > debug environment to provide you with more traces.
>> >
>> >
>> >
>> >>
>> >> Thanks.
>> >>
>> >> 2016-11-18 16:20 GMT+03:00 Christophe Lyon <christophe.lyon@linaro.org>:
>> >>> On 15 November 2016 at 15:41, Yuri Rumyantsev <ysrumyan@gmail.com> wrote:
>> >>>> Hi All,
>> >>>>
>> >>>> Here is the patch for non-masked epilogue vectorization.
>> >>>>
>> >>>> Bootstrap and regression testing did not show any new failures.
>> >>>>
>> >>>> Is it OK for trunk?
>> >>>>
>> >>>> Thanks.
>> >>>> Changelog:
>> >>>>
>> >>>> 2016-11-15  Yuri Rumyantsev  <ysrumyan@gmail.com>
>> >>>>
>> >>>> * params.def (PARAM_VECT_EPILOGUES_NOMASK): New.
>> >>>> * tree-if-conv.c (tree_if_conversion): Make public.
>> >>>> * tree-if-conv.h: New file.
>> >>>> * tree-vect-data-refs.c (vect_analyze_data_ref_dependences): Avoid
>> >>>> dynamic alias checks for epilogues.
>> >>>> * tree-vect-loop-manip.c (vect_do_peeling): Return created epilog.
>> >>>> * tree-vect-loop.c: Include tree-if-conv.h.
>> >>>> (new_loop_vec_info): Add zeroing orig_loop_info field.
>> >>>> (vect_analyze_loop_2): Don't try to enhance alignment for epilogues.
>> >>>> (vect_analyze_loop): Add argument ORIG_LOOP_INFO which is not NULL
>> >>>> if epilogue is vectorized, set up orig_loop_info field of loop_vinfo
>> >>>> using passed argument.
>> >>>> (vect_transform_loop): Check if created epilogue should be returned
>> >>>> for further vectorization with less vf.  If-convert epilogue if
>> >>>> required. Print vectorization success for epilogue.
>> >>>> * tree-vectorizer.c (vectorize_loops): Add epilogue vectorization
>> >>>> if it is required, pass loop_vinfo produced during vectorization of
>> >>>> loop body to vect_analyze_loop.
>> >>>> * tree-vectorizer.h (struct _loop_vec_info): Add new field
>> >>>> orig_loop_info.
>> >>>> (LOOP_VINFO_ORIG_LOOP_INFO): New.
>> >>>> (LOOP_VINFO_EPILOGUE_P): New.
>> >>>> (LOOP_VINFO_ORIG_VECT_FACTOR): New.
>> >>>> (vect_do_peeling): Change prototype to return epilogue.
>> >>>> (vect_analyze_loop): Add argument of loop_vec_info type.
>> >>>> (vect_transform_loop): Return created loop.
>> >>>>
>> >>>> gcc/testsuite/
>> >>>>
>> >>>> * lib/target-supports.exp (check_avx2_hw_available): New.
>> >>>> (check_effective_target_avx2_runtime): New.
>> >>>> * gcc.dg/vect/vect-tail-nomask-1.c: New test.
>> >>>>
>> >>>
>> >>> Hi,
>> >>>
>> >>> This new test fails on arm-none-eabi (using default cpu/fpu/mode):
>> >>>   gcc.dg/vect/vect-tail-nomask-1.c -flto -ffat-lto-objects execution test
>> >>>   gcc.dg/vect/vect-tail-nomask-1.c execution test
>> >>>
>> >>> It does pass on the same target if configured --with-cpu=cortex-a9.
>> >>>
>> >>> Christophe
>> >>>
>> >>>
>> >>>
>> >>>>
>> >>>> 2016-11-14 20:04 GMT+03:00 Richard Biener <rguenther@suse.de>:
>> >>>>> On November 14, 2016 4:39:40 PM GMT+01:00, Yuri Rumyantsev <ysrumyan@gmail.com> wrote:
>> >>>>>>Richard,
>> >>>>>>
>> >>>>>>I checked one of the tests designed for epilogue vectorization using
>> >>>>>>patches 1 - 3 and found out that the built compiler performs vectorization
>> >>>>>>of epilogues with --param vect-epilogues-nomask=1 passed:
>> >>>>>>
>> >>>>>>$ gcc -Ofast -mavx2 t1.c -S --param vect-epilogues-nomask=1 -o
>> >>>>>>t1.new-nomask.s -fdump-tree-vect-details
>> >>>>>>$ grep VECTORIZED -c t1.c.156t.vect
>> >>>>>>4
>> >>>>>> Without param only 2 loops are vectorized.
>> >>>>>>
>> >>>>>>Should I simply add a part of tests related to this feature or I must
>> >>>>>>delete all not necessary changes also?
>> >>>>>
>> >>>>> Please remove all not necessary changes.
>> >>>>>
>> >>>>> Richard.
>> >>>>>
>> >>>>>>Thanks.
>> >>>>>>Yuri.
>> >>>>>>
>> >>>>>>2016-11-14 16:40 GMT+03:00 Richard Biener <rguenther@suse.de>:
>> >>>>>>> On Mon, 14 Nov 2016, Yuri Rumyantsev wrote:
>> >>>>>>>
>> >>>>>>>> Richard,
>> >>>>>>>>
>> >>>>>>>> In my previous patch I forgot to remove a couple of lines related to
>> >>>>>>>> the aux field.
>> >>>>>>>> Here is the correct updated patch.
>> >>>>>>>
>> >>>>>>> Yeah, I noticed.  This patch would be ok for trunk (together with
>> >>>>>>> necessary parts from 1 and 2) if all not required parts are removed
>> >>>>>>> (and you'd add the testcases covering non-masked tail vect).
>> >>>>>>>
>> >>>>>>> Thus, can you please produce a single complete patch containing only
>> >>>>>>> non-masked epilogue vectorization?
>> >>>>>>>
>> >>>>>>> Thanks,
>> >>>>>>> Richard.
>> >>>>>>>
>> >>>>>>>> Thanks.
>> >>>>>>>> Yuri.
>> >>>>>>>>
>> >>>>>>>> 2016-11-14 15:51 GMT+03:00 Richard Biener <rguenther@suse.de>:
>> >>>>>>>> > On Fri, 11 Nov 2016, Yuri Rumyantsev wrote:
>> >>>>>>>> >
>> >>>>>>>> >> Richard,
>> >>>>>>>> >>
>> >>>>>>>> >> I prepared an updated patch 3 that passes an additional argument to
>> >>>>>>>> >> vect_analyze_loop as you proposed (untested).
>> >>>>>>>> >>
>> >>>>>>>> >> You wrote:
>> >>>>>>>> >> Btw, I wonder if you can produce a single patch containing just
>> >>>>>>>> >> epilogue vectorization, that is combine patches 1-3 but rip out
>> >>>>>>>> >> changes only needed by later patches?
>> >>>>>>>> >>
>> >>>>>>>> >> Did you mean that I exclude all support for vectorization
>> >>>>>>>> >> epilogues, i.e. exclude from 2-nd patch all non-related changes
>> >>>>>>>> >> like
>> >>>>>>>> >>
>> >>>>>>>> >> diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
>> >>>>>>>> >> index 11863af..32011c1 100644
>> >>>>>>>> >> --- a/gcc/tree-vect-loop.c
>> >>>>>>>> >> +++ b/gcc/tree-vect-loop.c
>> >>>>>>>> >> @@ -1120,6 +1120,12 @@ new_loop_vec_info (struct loop *loop)
>> >>>>>>>> >>    LOOP_VINFO_PEELING_FOR_GAPS (res) = false;
>> >>>>>>>> >>    LOOP_VINFO_PEELING_FOR_NITER (res) = false;
>> >>>>>>>> >>    LOOP_VINFO_OPERANDS_SWAPPED (res) = false;
>> >>>>>>>> >> +  LOOP_VINFO_CAN_BE_MASKED (res) = false;
>> >>>>>>>> >> +  LOOP_VINFO_REQUIRED_MASKS (res) = 0;
>> >>>>>>>> >> +  LOOP_VINFO_COMBINE_EPILOGUE (res) = false;
>> >>>>>>>> >> +  LOOP_VINFO_MASK_EPILOGUE (res) = false;
>> >>>>>>>> >> +  LOOP_VINFO_NEED_MASKING (res) = false;
>> >>>>>>>> >> +  LOOP_VINFO_ORIG_LOOP_INFO (res) = NULL;
>> >>>>>>>> >
>> >>>>>>>> > Yes.
>> >>>>>>>> >
>> >>>>>>>> >> Did you mean also that new combined patch must be working patch,
>> >>>>>>>> >> i.e. can be integrated without other patches?
>> >>>>>>>> >
>> >>>>>>>> > Yes.
>> >>>>>>>> >
>> >>>>>>>> >> Could you please look at updated patch?
>> >>>>>>>> >
>> >>>>>>>> > Will do.
>> >>>>>>>> >
>> >>>>>>>> > Thanks,
>> >>>>>>>> > Richard.
>> >>>>>>>> >
>> >>>>>>>> >> Thanks.
>> >>>>>>>> >> Yuri.
>> >>>>>>>> >>
>> >>>>>>>> >> 2016-11-10 15:36 GMT+03:00 Richard Biener <rguenther@suse.de>:
>> >>>>>>>> >> > On Thu, 10 Nov 2016, Richard Biener wrote:
>> >>>>>>>> >> >
>> >>>>>>>> >> >> On Tue, 8 Nov 2016, Yuri Rumyantsev wrote:
>> >>>>>>>> >> >>
>> >>>>>>>> >> >> > Richard,
>> >>>>>>>> >> >> >
>> >>>>>>>> >> >> > Here is updated 3 patch.
>> >>>>>>>> >> >> >
>> >>>>>>>> >> >> > I checked that all new tests related to epilogue
>> >>>>>>>> >> >> > vectorization passed with it.
>> >>>>>>>> >> >> >
>> >>>>>>>> >> >> > Your comments will be appreciated.
>> >>>>>>>> >> >>
>> >>>>>>>> >> >> A lot better now.  Instead of the ->aux dance I now prefer to
>> >>>>>>>> >> >> pass the original loop's loop_vinfo to vect_analyze_loop as
>> >>>>>>>> >> >> optional argument (if non-NULL we analyze the epilogue of that
>> >>>>>>>> >> >> loop_vinfo).  OTOH I remember we mainly use it to get at the
>> >>>>>>>> >> >> original vectorization factor?  So we can pass down an
>> >>>>>>>> >> >> (optional) forced vectorization factor as well?
>> >>>>>>>> >> >
>> >>>>>>>> >> > Btw, I wonder if you can produce a single patch containing just
>> >>>>>>>> >> > epilogue vectorization, that is combine patches 1-3 but rip out
>> >>>>>>>> >> > changes only needed by later patches?
>> >>>>>>>> >> >
>> >>>>>>>> >> > Thanks,
>> >>>>>>>> >> > Richard.
>> >>>>>>>> >> >
>> >>>>>>>> >> >> Richard.
>> >>>>>>>> >> >>
>> >>>>>>>> >> >> > 2016-11-08 15:38 GMT+03:00 Richard Biener <rguenther@suse.de>:
>> >>>>>>>> >> >> > > On Thu, 3 Nov 2016, Yuri Rumyantsev wrote:
>> >>>>>>>> >> >> > >
>> >>>>>>>> >> >> > >> Hi Richard,
>> >>>>>>>> >> >> > >>
>> >>>>>>>> >> >> > >> I did not understand your last remark:
>> >>>>>>>> >> >> > >>
>> >>>>>>>> >> >> > >> > That is, here (and avoid the FOR_EACH_LOOP change):
>> >>>>>>>> >> >> > >> >
>> >>>>>>>> >> >> > >> > @@ -580,12 +586,21 @@ vectorize_loops (void)
>> >>>>>>>> >> >> > >> >             && dump_enabled_p ())
>> >>>>>>>> >> >> > >> >            dump_printf_loc (MSG_OPTIMIZED_LOCATIONS, vect_location,
>> >>>>>>>> >> >> > >> >                             "loop vectorized\n");
>> >>>>>>>> >> >> > >> > -       vect_transform_loop (loop_vinfo);
>> >>>>>>>> >> >> > >> > +       new_loop = vect_transform_loop (loop_vinfo);
>> >>>>>>>> >> >> > >> >         num_vectorized_loops++;
>> >>>>>>>> >> >> > >> >         /* Now that the loop has been vectorized, allow it to be unrolled
>> >>>>>>>> >> >> > >> >            etc.  */
>> >>>>>>>> >> >> > >> >         loop->force_vectorize = false;
>> >>>>>>>> >> >> > >> >
>> >>>>>>>> >> >> > >> > +       /* Add new loop to a processing queue.  To make it easier
>> >>>>>>>> >> >> > >> > +          to match loop and its epilogue vectorization in dumps
>> >>>>>>>> >> >> > >> > +          put new loop as the next loop to process.  */
>> >>>>>>>> >> >> > >> > +       if (new_loop)
>> >>>>>>>> >> >> > >> > +         {
>> >>>>>>>> >> >> > >> > +           loops.safe_insert (i + 1, new_loop->num);
>> >>>>>>>> >> >> > >> > +           vect_loops_num = number_of_loops (cfun);
>> >>>>>>>> >> >> > >> > +         }
>> >>>>>>>> >> >> > >> >
>> >>>>>>>> >> >> > >> > simply dispatch to a vectorize_epilogue (loop_vinfo, new_loop)
>> >>>>>>>> >> >> > >> > function which will set up stuff properly (and also perform
>> >>>>>>>> >> >> > >> > the if-conversion of the epilogue there).
>> >>>>>>>> >> >> > >> >
>> >>>>>>>> >> >> > >> > That said, if we can get in non-masked epilogue vectorization
>> >>>>>>>> >> >> > >> > separately that would be great.
>> >>>>>>>> >> >> > >>
>> >>>>>>>> >> >> > >> Could you please clarify your proposal.
>> >>>>>>>> >> >> > >
>> >>>>>>>> >> >> > > When a loop was vectorized set things up to immediately vectorize
>> >>>>>>>> >> >> > > its epilogue, avoiding changing the loop iteration and avoiding
>> >>>>>>>> >> >> > > the re-use of ->aux.
>> >>>>>>>> >> >> > >
>> >>>>>>>> >> >> > > Richard.
>> >>>>>>>> >> >> > >
>> >>>>>>>> >> >> > >> Thanks.
>> >>>>>>>> >> >> > >> Yuri.
>> >>>>>>>> >> >> > >>
>> >>>>>>>> >> >> > >> 2016-11-02 15:27 GMT+03:00 Richard Biener <rguenther@suse.de>:
>> >>>>>>>> >> >> > >> > On Tue, 1 Nov 2016, Yuri Rumyantsev wrote:
>> >>>>>>>> >> >> > >> >
>> >>>>>>>> >> >> > >> >> Hi All,
>> >>>>>>>> >> >> > >> >>
>> >>>>>>>> >> >> > >> >> I re-send all patches sent by Ilya earlier for review which support
>> >>>>>>>> >> >> > >> >> vectorization of loop epilogues and loops with low trip count. We
>> >>>>>>>> >> >> > >> >> assume that the only patch - vec-tails-07-combine-tail.patch - was not
>> >>>>>>>> >> >> > >> >> approved by Jeff.
>> >>>>>>>> >> >> > >> >>
>> >>>>>>>> >> >> > >> >> I did re-base of all patches and performed bootstrapping and
>> >>>>>>>> >> >> > >> >> regression testing that did not show any new failures. Also all
>> >>>>>>>> >> >> > >> >> changes related to new vect_do_peeling algorithm have been changed
>> >>>>>>>> >> >> > >> >> accordingly.
>> >>>>>>>> >> >> > >> >>
>> >>>>>>>> >> >> > >> >> Is it OK for trunk?
>> >>>>>>>> >> >> > >> >
>> >>>>>>>> >> >> > >> > I would have preferred that the series up to -03-nomask-tails would
>> >>>>>>>> >> >> > >> > _only_ contain epilogue loop vectorization changes but unfortunately
>> >>>>>>>> >> >> > >> > the patchset is oddly separated.
>> >>>>>>>> >> >> > >> >
>> >>>>>>>> >> >> > >> > I have a comment on that part nevertheless:
>> >>>>>>>> >> >> > >> >
>> >>>>>>>> >> >> > >> > @@ -1608,7 +1614,10 @@ vect_enhance_data_refs_alignment (loop_vec_info loop_vinfo)
>> >>>>>>>> >> >> > >> >    /* Check if we can possibly peel the loop.  */
>> >>>>>>>> >> >> > >> >    if (!vect_can_advance_ivs_p (loop_vinfo)
>> >>>>>>>> >> >> > >> >        || !slpeel_can_duplicate_loop_p (loop, single_exit (loop))
>> >>>>>>>> >> >> > >> > -      || loop->inner)
>> >>>>>>>> >> >> > >> > +      || loop->inner
>> >>>>>>>> >> >> > >> > +      /* Required peeling was performed in prologue and
>> >>>>>>>> >> >> > >> > +        is not required for epilogue.  */
>> >>>>>>>> >> >> > >> > +      || LOOP_VINFO_EPILOGUE_P (loop_vinfo))
>> >>>>>>>> >> >> > >> >      do_peeling = false;
>> >>>>>>>> >> >> > >> >
>> >>>>>>>> >> >> > >> >    if (do_peeling
>> >>>>>>>> >> >> > >> > @@ -1888,7 +1897,10 @@ vect_enhance_data_refs_alignment (loop_vec_info loop_vinfo)
>> >>>>>>>> >> >> > >> >
>> >>>>>>>> >> >> > >> >    do_versioning =
>> >>>>>>>> >> >> > >> >         optimize_loop_nest_for_speed_p (loop)
>> >>>>>>>> >> >> > >> > -       && (!loop->inner); /* FORNOW */
>> >>>>>>>> >> >> > >> > +       && (!loop->inner) /* FORNOW */
>> >>>>>>>> >> >> > >> > +        /* Required versioning was performed for the
>> >>>>>>>> >> >> > >> > +          original loop and is not required for epilogue.  */
>> >>>>>>>> >> >> > >> > +       && !LOOP_VINFO_EPILOGUE_P (loop_vinfo);
>> >>>>>>>> >> >> > >> >
>> >>>>>>>> >> >> > >> >    if (do_versioning)
>> >>>>>>>> >> >> > >> >      {
>> >>>>>>>> >> >> > >> >
>> >>>>>>>> >> >> > >> > please do that check in the single caller of this function.
>> >>>>>>>> >> >> > >> >
>> >>>>>>>> >> >> > >> > Otherwise I still dislike the new ->aux use and I believe that simply
>> >>>>>>>> >> >> > >> > passing down info from the processed parent would be _much_ cleaner.
>> >>>>>>>> >> >> > >> > That is, here (and avoid the FOR_EACH_LOOP change):
>> >>>>>>>> >> >> > >> >
>> >>>>>>>> >> >> > >> > @@ -580,12 +586,21 @@ vectorize_loops (void)
>> >>>>>>>> >> >> > >> >             && dump_enabled_p ())
>> >>>>>>>> >> >> > >> >            dump_printf_loc (MSG_OPTIMIZED_LOCATIONS, vect_location,
>> >>>>>>>> >> >> > >> >                             "loop vectorized\n");
>> >>>>>>>> >> >> > >> > -       vect_transform_loop (loop_vinfo);
>> >>>>>>>> >> >> > >> > +       new_loop = vect_transform_loop (loop_vinfo);
>> >>>>>>>> >> >> > >> >         num_vectorized_loops++;
>> >>>>>>>> >> >> > >> >         /* Now that the loop has been vectorized, allow it to be unrolled
>> >>>>>>>> >> >> > >> >            etc.  */
>> >>>>>>>> >> >> > >> >         loop->force_vectorize = false;
>> >>>>>>>> >> >> > >> >
>> >>>>>>>> >> >> > >> > +       /* Add new loop to a processing queue.  To make it easier
>> >>>>>>>> >> >> > >> > +          to match loop and its epilogue vectorization in dumps
>> >>>>>>>> >> >> > >> > +          put new loop as the next loop to process.  */
>> >>>>>>>> >> >> > >> > +       if (new_loop)
>> >>>>>>>> >> >> > >> > +         {
>> >>>>>>>> >> >> > >> > +           loops.safe_insert (i + 1, new_loop->num);
>> >>>>>>>> >> >> > >> > +           vect_loops_num = number_of_loops (cfun);
>> >>>>>>>> >> >> > >> > +         }
>> >>>>>>>> >> >> > >> >
>> >>>>>>>> >> >> > >> > simply dispatch to a vectorize_epilogue (loop_vinfo, new_loop)
>> >>>>>>>> >> >> > >> > function which will set up stuff properly (and also perform
>> >>>>>>>> >> >> > >> > the if-conversion of the epilogue there).
>> >>>>>>>> >> >> > >> >
>> >>>>>>>> >> >> > >> > That said, if we can get in non-masked epilogue vectorization
>> >>>>>>>> >> >> > >> > separately that would be great.
>> >>>>>>>> >> >> > >> >
>> >>>>>>>> >> >> > >> > I'm still torn about all the rest of the stuff and question its
>> >>>>>>>> >> >> > >> > usability (esp. merging the epilogue with the main vector loop).
>> >>>>>>>> >> >> > >> > But it has already been approved ... oh well.
>> >>>>>>>> >> >> > >> >
>> >>>>>>>> >> >> > >> > Thanks,
>> >>>>>>>> >> >> > >> > Richard.
>> >>>>>>>> >> >> > >>
>> >>>>>>>> >> >> > >>
>> >>>>>>>> >> >> > >
>> >>>>>>>> >> >> > > --
>> >>>>>>>> >> >> > > Richard Biener <rguenther@suse.de>
>> >>>>>>>> >> >> > > SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)
>> >>>>>>>> >> >> >
>> >>>>>>>> >> >>
>> >>>>>>>> >> >>
>> >>>>>>>> >> >
>> >>>>>>>> >> > --
>> >>>>>>>> >> > Richard Biener <rguenther@suse.de>
>> >>>>>>>> >> > SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)
>> >>>>>>>> >>
>> >>>>>>>> >
>> >>>>>>>> > --
>> >>>>>>>> > Richard Biener <rguenther@suse.de>
>> >>>>>>>> > SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)
>> >>>>>>>>
>> >>>>>>>
>> >>>>>>> --
>> >>>>>>> Richard Biener <rguenther@suse.de>
>> >>>>>>> SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)
>> >>>>>
>> >>>>>
>>
>
> --
> Richard Biener <rguenther@suse.de>
> SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)
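
For readers following the attached dump, here is a rough sketch of the
split it describes. The source of t11.c is not included in this thread,
so the loop body (c[i] = a[i] + b[i] over 1023 ints) is reconstructed
from the dump only, and the names add_scalar, add_split and check_split
are hypothetical. With a vectorization factor of 8 the main loop should
run 127 times (covering i = 0..1015) and leave 7 scalar iterations
starting at i = 1016, which is the epilogue that the second "Analyzing
loop" pass then tries to vectorize again. This is a plain C model of
that arithmetic, not what the vectorizer emits.

```c
#include <assert.h>
#include <stddef.h>

/* Values taken from the dump: niters = 1023, vectorization factor = 8.  */
enum { N = 1023, VF = 8 };

/* Scalar reference loop, assumed shape of the loop in t11.c.  */
static void
add_scalar (const int *a, const int *b, int *c, size_t n)
{
  for (size_t i = 0; i < n; i++)
    c[i] = a[i] + b[i];
}

/* Model of the transformed code: a main loop that handles VF elements
   per iteration, followed by an epilogue for the remaining n % VF
   elements.  The iteration counts of both loops are reported back.  */
static void
add_split (const int *a, const int *b, int *c, size_t n,
	   size_t *main_iters, size_t *epilogue_iters)
{
  size_t i = 0;
  *main_iters = 0;
  *epilogue_iters = 0;

  /* Main loop: each trip stands in for one vector(8) int operation.  */
  for (; i + VF <= n; i += VF)
    {
      for (size_t j = 0; j < VF; j++)
	c[i + j] = a[i + j] + b[i + j];
      ++*main_iters;
    }

  /* Epilogue: the leftover scalar iterations.  */
  for (; i < n; i++)
    {
      c[i] = a[i] + b[i];
      ++*epilogue_iters;
    }

  assert (i == n);
}

/* Returns 1 if the split version matches the scalar one and the trip
   counts agree with the dump (127 main iterations, 7 in the epilogue).  */
static int
check_split (void)
{
  static int a[N], b[N], c1[N], c2[N];
  size_t main_iters, epilogue_iters;

  for (int i = 0; i < N; i++)
    {
      a[i] = i;
      b[i] = 2 * i;
    }

  add_scalar (a, b, c1, N);
  add_split (a, b, c2, N, &main_iters, &epilogue_iters);

  for (int i = 0; i < N; i++)
    if (c1[i] != c2[i])
      return 0;

  return main_iters == 127 && epilogue_iters == 7;
}
```

The 7 leftover iterations are fewer than 8, so a vector(8) int epilogue
cannot cover them without masking; that is presumably why the thread
talks about re-vectorizing the epilogue "with less vf".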

[-- Attachment #2: t11.c.156t.vect --]
[-- Type: application/octet-stream, Size: 35335 bytes --]


;; Function test_citer (test_citer, funcdef_no=0, decl_uid=2253, cgraph_uid=0, symbol_order=0)


Analyzing loop at t11.c:18
t11.c:18:3: note: ===== analyze_loop_nest =====
t11.c:18:3: note: === vect_analyze_loop_form ===
t11.c:18:3: note: === get_loop_niters ===
Analyzing # of iterations of loop 1
  exit condition [1022, + , 4294967295] != 0
  bounds on difference of bases: -1022 ... -1022
  result:
    # of iterations 1022, bounded by 1022
Creating dr for *_3
analyze_innermost: success.
	base_address: a_12
	offset from base address: 0
	constant offset from base address: 0
	step: 4
	aligned to: 256
	base_object: *a_12
	Access function 0: {0B, +, 4}_1
Creating dr for *_5
analyze_innermost: success.
	base_address: b_14
	offset from base address: 0
	constant offset from base address: 0
	step: 4
	aligned to: 256
	base_object: *b_14
	Access function 0: {0B, +, 4}_1
Creating dr for *_7
analyze_innermost: success.
	base_address: c_16
	offset from base address: 0
	constant offset from base address: 0
	step: 4
	aligned to: 256
	base_object: *c_16
	Access function 0: {0B, +, 4}_1
t11.c:18:3: note: === vect_analyze_data_refs ===
t11.c:18:3: note: got vectype for stmt: _4 = *_3;
vector(8) int
t11.c:18:3: note: got vectype for stmt: _6 = *_5;
vector(8) int
t11.c:18:3: note: got vectype for stmt: *_7 = _8;
vector(8) int
t11.c:18:3: note: === vect_analyze_scalar_cycles ===
t11.c:18:3: note: Analyze phi: i_23 = PHI <i_19(4), 0(2)>
t11.c:18:3: note: Access function of PHI: {0, +, 1}_1
t11.c:18:3: note: step: 1,  init: 0
t11.c:18:3: note: Detected induction.
t11.c:18:3: note: Analyze phi: .MEM_24 = PHI <.MEM_18(4), .MEM_17(D)(2)>
t11.c:18:3: note: Analyze phi: ivtmp_39 = PHI <ivtmp_38(4), 1023(2)>
t11.c:18:3: note: Access function of PHI: {1023, +, 4294967295}_1
t11.c:18:3: note: step: 4294967295,  init: 1023
t11.c:18:3: note: Detected induction.
t11.c:18:3: note: === vect_pattern_recog ===
t11.c:18:3: note: vect_is_simple_use: operand _1
t11.c:18:3: note: def_stmt: _1 = (long unsigned int) i_23;
t11.c:18:3: note: type of def: internal
t11.c:18:3: note: vect_is_simple_use: operand i_23
t11.c:18:3: note: def_stmt: i_23 = PHI <i_19(4), 0(2)>
t11.c:18:3: note: type of def: induction
t11.c:18:3: note: vect_is_simple_use: operand 4
t11.c:18:3: note: === vect_analyze_data_ref_accesses ===
t11.c:18:3: note: === vect_mark_stmts_to_be_vectorized ===
t11.c:18:3: note: init: phi relevant? i_23 = PHI <i_19(4), 0(2)>
t11.c:18:3: note: init: phi relevant? .MEM_24 = PHI <.MEM_18(4), .MEM_17(D)(2)>
t11.c:18:3: note: init: phi relevant? ivtmp_39 = PHI <ivtmp_38(4), 1023(2)>
t11.c:18:3: note: init: stmt relevant? _1 = (long unsigned int) i_23;
t11.c:18:3: note: init: stmt relevant? _2 = _1 * 4;
t11.c:18:3: note: init: stmt relevant? _3 = a_12 + _2;
t11.c:18:3: note: init: stmt relevant? _4 = *_3;
t11.c:18:3: note: init: stmt relevant? _5 = b_14 + _2;
t11.c:18:3: note: init: stmt relevant? _6 = *_5;
t11.c:18:3: note: init: stmt relevant? _7 = c_16 + _2;
t11.c:18:3: note: init: stmt relevant? _8 = _4 + _6;
t11.c:18:3: note: init: stmt relevant? *_7 = _8;
t11.c:18:3: note: vec_stmt_relevant_p: stmt has vdefs.
t11.c:18:3: note: mark relevant 5, live 0: *_7 = _8;
t11.c:18:3: note: init: stmt relevant? i_19 = i_23 + 1;
t11.c:18:3: note: init: stmt relevant? ivtmp_38 = ivtmp_39 - 1;
t11.c:18:3: note: init: stmt relevant? if (ivtmp_38 != 0)
t11.c:18:3: note: worklist: examine stmt: *_7 = _8;
t11.c:18:3: note: vect_is_simple_use: operand _8
t11.c:18:3: note: def_stmt: _8 = _4 + _6;
t11.c:18:3: note: type of def: internal
t11.c:18:3: note: mark relevant 5, live 0: _8 = _4 + _6;
t11.c:18:3: note: worklist: examine stmt: _8 = _4 + _6;
t11.c:18:3: note: vect_is_simple_use: operand _4
t11.c:18:3: note: def_stmt: _4 = *_3;
t11.c:18:3: note: type of def: internal
t11.c:18:3: note: mark relevant 5, live 0: _4 = *_3;
t11.c:18:3: note: vect_is_simple_use: operand _6
t11.c:18:3: note: def_stmt: _6 = *_5;
t11.c:18:3: note: type of def: internal
t11.c:18:3: note: mark relevant 5, live 0: _6 = *_5;
t11.c:18:3: note: worklist: examine stmt: _6 = *_5;
t11.c:18:3: note: worklist: examine stmt: _4 = *_3;
t11.c:18:3: note: === vect_analyze_data_ref_dependences ===
(compute_affine_dependence
  stmt_a: _4 = *_3;
  stmt_b: _6 = *_5;
) -> no dependence
(compute_affine_dependence
  stmt_a: _4 = *_3;
  stmt_b: *_7 = _8;
) -> no dependence
(compute_affine_dependence
  stmt_a: _6 = *_5;
  stmt_b: *_7 = _8;
) -> no dependence
(compute_affine_dependence
  stmt_a: _4 = *_3;
  stmt_b: _4 = *_3;
(analyze_overlapping_iterations 
  (chrec_a = {0B, +, 4}_1)
  (chrec_b = {0B, +, 4}_1)
  (overlap_iterations_a = [0])
  (overlap_iterations_b = [0]))
)
(compute_affine_dependence
  stmt_a: _6 = *_5;
  stmt_b: _6 = *_5;
(analyze_overlapping_iterations 
  (chrec_a = {0B, +, 4}_1)
  (chrec_b = {0B, +, 4}_1)
  (overlap_iterations_a = [0])
  (overlap_iterations_b = [0]))
)
(compute_affine_dependence
  stmt_a: *_7 = _8;
  stmt_b: *_7 = _8;
(analyze_overlapping_iterations 
  (chrec_a = {0B, +, 4}_1)
  (chrec_b = {0B, +, 4}_1)
  (overlap_iterations_a = [0])
  (overlap_iterations_b = [0]))
)
t11.c:18:3: note: === vect_determine_vectorization_factor ===
t11.c:18:3: note: ==> examining phi: i_23 = PHI <i_19(4), 0(2)>
t11.c:18:3: note: ==> examining phi: .MEM_24 = PHI <.MEM_18(4), .MEM_17(D)(2)>
t11.c:18:3: note: ==> examining phi: ivtmp_39 = PHI <ivtmp_38(4), 1023(2)>
t11.c:18:3: note: ==> examining statement: _1 = (long unsigned int) i_23;
t11.c:18:3: note: skip.
t11.c:18:3: note: ==> examining statement: _2 = _1 * 4;
t11.c:18:3: note: skip.
t11.c:18:3: note: ==> examining statement: _3 = a_12 + _2;
t11.c:18:3: note: skip.
t11.c:18:3: note: ==> examining statement: _4 = *_3;
t11.c:18:3: note: get vectype for scalar type:  int
t11.c:18:3: note: vectype: vector(8) int
t11.c:18:3: note: nunits = 8
t11.c:18:3: note: ==> examining statement: _5 = b_14 + _2;
t11.c:18:3: note: skip.
t11.c:18:3: note: ==> examining statement: _6 = *_5;
t11.c:18:3: note: get vectype for scalar type:  int
t11.c:18:3: note: vectype: vector(8) int
t11.c:18:3: note: nunits = 8
t11.c:18:3: note: ==> examining statement: _7 = c_16 + _2;
t11.c:18:3: note: skip.
t11.c:18:3: note: ==> examining statement: _8 = _4 + _6;
t11.c:18:3: note: get vectype for scalar type:  int
t11.c:18:3: note: vectype: vector(8) int
t11.c:18:3: note: get vectype for scalar type:  int
t11.c:18:3: note: vectype: vector(8) int
t11.c:18:3: note: nunits = 8
t11.c:18:3: note: ==> examining statement: *_7 = _8;
t11.c:18:3: note: get vectype for scalar type:  int
t11.c:18:3: note: vectype: vector(8) int
t11.c:18:3: note: nunits = 8
t11.c:18:3: note: ==> examining statement: i_19 = i_23 + 1;
t11.c:18:3: note: skip.
t11.c:18:3: note: ==> examining statement: ivtmp_38 = ivtmp_39 - 1;
t11.c:18:3: note: skip.
t11.c:18:3: note: ==> examining statement: if (ivtmp_38 != 0)
t11.c:18:3: note: skip.
t11.c:18:3: note: vectorization factor = 8
t11.c:18:3: note: === vect_analyze_slp ===
t11.c:18:3: note: === vect_make_slp_decision ===
t11.c:18:3: note: vectorization_factor = 8, niters = 1023
t11.c:18:3: note: === vect_analyze_data_refs_alignment ===
t11.c:18:3: note: vect_compute_data_ref_alignment:
t11.c:18:3: note: misalign = 0 bytes of ref *_3
t11.c:18:3: note: vect_compute_data_ref_alignment:
t11.c:18:3: note: misalign = 0 bytes of ref *_5
t11.c:18:3: note: vect_compute_data_ref_alignment:
t11.c:18:3: note: misalign = 0 bytes of ref *_7
t11.c:18:3: note: === vect_prune_runtime_alias_test_list ===
t11.c:18:3: note: === vect_enhance_data_refs_alignment ===
t11.c:18:3: note: vect_can_advance_ivs_p:
t11.c:18:3: note: Analyze phi: i_23 = PHI <i_19(4), 0(2)>
t11.c:18:3: note: Analyze phi: .MEM_24 = PHI <.MEM_18(4), .MEM_17(D)(2)>
t11.c:18:3: note: reduc or virtual phi. skip.
t11.c:18:3: note: Analyze phi: ivtmp_39 = PHI <ivtmp_38(4), 1023(2)>
t11.c:18:3: note: vect_model_load_cost: aligned.
t11.c:18:3: note: vect_get_data_access_cost: inside_cost = 1, outside_cost = 0.
t11.c:18:3: note: vect_model_load_cost: aligned.
t11.c:18:3: note: vect_get_data_access_cost: inside_cost = 2, outside_cost = 0.
t11.c:18:3: note: vect_model_store_cost: aligned.
t11.c:18:3: note: vect_get_data_access_cost: inside_cost = 3, outside_cost = 0.
t11.c:18:3: note: === vect_analyze_loop_operations ===
t11.c:18:3: note: examining phi: i_23 = PHI <i_19(4), 0(2)>
t11.c:18:3: note: examining phi: .MEM_24 = PHI <.MEM_18(4), .MEM_17(D)(2)>
t11.c:18:3: note: examining phi: ivtmp_39 = PHI <ivtmp_38(4), 1023(2)>
t11.c:18:3: note: ==> examining statement: _1 = (long unsigned int) i_23;
t11.c:18:3: note: irrelevant.
t11.c:18:3: note: ==> examining statement: _2 = _1 * 4;
t11.c:18:3: note: irrelevant.
t11.c:18:3: note: ==> examining statement: _3 = a_12 + _2;
t11.c:18:3: note: irrelevant.
t11.c:18:3: note: ==> examining statement: _4 = *_3;
t11.c:18:3: note: vect_is_simple_use: operand *_3
t11.c:18:3: note: not ssa-name.
t11.c:18:3: note: use not simple.
t11.c:18:3: note: vect_is_simple_use: operand *_3
t11.c:18:3: note: not ssa-name.
t11.c:18:3: note: use not simple.
t11.c:18:3: note: vect_model_load_cost: aligned.
t11.c:18:3: note: vect_model_load_cost: inside_cost = 1, prologue_cost = 0 .
t11.c:18:3: note: ==> examining statement: _5 = b_14 + _2;
t11.c:18:3: note: irrelevant.
t11.c:18:3: note: ==> examining statement: _6 = *_5;
t11.c:18:3: note: vect_is_simple_use: operand *_5
t11.c:18:3: note: not ssa-name.
t11.c:18:3: note: use not simple.
t11.c:18:3: note: vect_is_simple_use: operand *_5
t11.c:18:3: note: not ssa-name.
t11.c:18:3: note: use not simple.
t11.c:18:3: note: vect_model_load_cost: aligned.
t11.c:18:3: note: vect_model_load_cost: inside_cost = 1, prologue_cost = 0 .
t11.c:18:3: note: ==> examining statement: _7 = c_16 + _2;
t11.c:18:3: note: irrelevant.
t11.c:18:3: note: ==> examining statement: _8 = _4 + _6;
t11.c:18:3: note: vect_is_simple_use: operand _4
t11.c:18:3: note: def_stmt: _4 = *_3;
t11.c:18:3: note: type of def: internal
t11.c:18:3: note: vect_is_simple_use: operand _6
t11.c:18:3: note: def_stmt: _6 = *_5;
t11.c:18:3: note: type of def: internal
t11.c:18:3: note: === vectorizable_operation ===
t11.c:18:3: note: vect_model_simple_cost: inside_cost = 1, prologue_cost = 0 .
t11.c:18:3: note: ==> examining statement: *_7 = _8;
t11.c:18:3: note: vect_is_simple_use: operand _8
t11.c:18:3: note: def_stmt: _8 = _4 + _6;
t11.c:18:3: note: type of def: internal
t11.c:18:3: note: vect_model_store_cost: aligned.
t11.c:18:3: note: vect_model_store_cost: inside_cost = 1, prologue_cost = 0 .
t11.c:18:3: note: ==> examining statement: i_19 = i_23 + 1;
t11.c:18:3: note: irrelevant.
t11.c:18:3: note: ==> examining statement: ivtmp_38 = ivtmp_39 - 1;
t11.c:18:3: note: irrelevant.
t11.c:18:3: note: ==> examining statement: if (ivtmp_38 != 0)
t11.c:18:3: note: irrelevant.
t11.c:18:3: note: Cost model analysis: 
  Vector inside of loop cost: 4
  Vector prologue cost: 0
  Vector epilogue cost: 28
  Scalar iteration cost: 4
  Scalar outside cost: 0
  Vector outside cost: 28
  prologue iterations: 0
  epilogue iterations: 7
  Calculated minimum iters for profitability: 8
t11.c:18:3: note:   Runtime profitability threshold = 7
t11.c:18:3: note:   Static estimate profitability threshold = 7
t11.c:18:3: note: epilog loop required
t11.c:18:3: note: vect_can_advance_ivs_p:
t11.c:18:3: note: Analyze phi: i_23 = PHI <i_19(4), 0(2)>
t11.c:18:3: note: Analyze phi: .MEM_24 = PHI <.MEM_18(4), .MEM_17(D)(2)>
t11.c:18:3: note: reduc or virtual phi. skip.
t11.c:18:3: note: Analyze phi: ivtmp_39 = PHI <ivtmp_38(4), 1023(2)>
t11.c:18:3: note: loop vectorized
t11.c:18:3: note: === vec_transform_loop ===
Removing basic block 6
basic block 6, loop depth 0
 pred:       2
 succ:      


t11.c:18:3: note: vect_can_advance_ivs_p:
t11.c:18:3: note: Analyze phi: i_23 = PHI <i_19(4), 0(2)>
t11.c:18:3: note: Analyze phi: .MEM_24 = PHI <.MEM_18(4), .MEM_17(D)(2)>
t11.c:18:3: note: reduc or virtual phi. skip.
t11.c:18:3: note: Analyze phi: ivtmp_39 = PHI <ivtmp_38(4), 1023(2)>
t11.c:18:3: note: vect_update_ivs_after_vectorizer: phi: i_23 = PHI <i_19(4), 0(2)>
t11.c:18:3: note: vect_update_ivs_after_vectorizer: phi: .MEM_24 = PHI <.MEM_18(4), .MEM_17(D)(2)>
t11.c:18:3: note: reduc or virtual phi. skip.
t11.c:18:3: note: vect_update_ivs_after_vectorizer: phi: ivtmp_39 = PHI <ivtmp_38(4), 1023(2)>
t11.c:18:3: note: ------>vectorizing phi: i_23 = PHI <i_19(4), 0(10)>
t11.c:18:3: note: ------>vectorizing phi: .MEM_24 = PHI <.MEM_18(4), .MEM_17(D)(10)>
t11.c:18:3: note: ------>vectorizing phi: ivtmp_39 = PHI <ivtmp_38(4), 1023(10)>
t11.c:18:3: note: ------>vectorizing statement: _1 = (long unsigned int) i_23;
t11.c:18:3: note: ------>vectorizing statement: _2 = _1 * 4;
t11.c:18:3: note: ------>vectorizing statement: _3 = a_12 + _2;
t11.c:18:3: note: ------>vectorizing statement: _4 = *_3;
t11.c:18:3: note: transform statement.
t11.c:18:3: note: transform load. ncopies = 1
t11.c:18:3: note: create vector_type-pointer variable to type: vector(8) int  vectorizing a pointer ref: *a_12
Applying pattern match.pd:83, generic-match.c:11885
t11.c:18:3: note: created a_12
t11.c:18:3: note: add new stmt: vect__4.7_40 = MEM[(int *)vectp_a.5_29];
t11.c:18:3: note: ------>vectorizing statement: _5 = b_14 + _2;
t11.c:18:3: note: ------>vectorizing statement: _6 = *_5;
t11.c:18:3: note: transform statement.
t11.c:18:3: note: transform load. ncopies = 1
t11.c:18:3: note: create vector_type-pointer variable to type: vector(8) int  vectorizing a pointer ref: *b_14
Applying pattern match.pd:83, generic-match.c:11885
t11.c:18:3: note: created b_14
t11.c:18:3: note: add new stmt: vect__6.10_43 = MEM[(int *)vectp_b.8_41];
t11.c:18:3: note: ------>vectorizing statement: _7 = c_16 + _2;
t11.c:18:3: note: ------>vectorizing statement: _8 = _4 + _6;
t11.c:18:3: note: transform statement.
t11.c:18:3: note: vect_is_simple_use: operand _4
t11.c:18:3: note: def_stmt: _4 = *_3;
t11.c:18:3: note: type of def: internal
t11.c:18:3: note: vect_is_simple_use: operand _6
t11.c:18:3: note: def_stmt: _6 = *_5;
t11.c:18:3: note: type of def: internal
t11.c:18:3: note: transform binary/unary operation.
t11.c:18:3: note: vect_get_vec_def_for_operand: _4
t11.c:18:3: note: vect_is_simple_use: operand _4
t11.c:18:3: note: def_stmt: _4 = *_3;
t11.c:18:3: note: type of def: internal
t11.c:18:3: note:   def_stmt =  _4 = *_3;
t11.c:18:3: note: vect_get_vec_def_for_operand: _6
t11.c:18:3: note: vect_is_simple_use: operand _6
t11.c:18:3: note: def_stmt: _6 = *_5;
t11.c:18:3: note: type of def: internal
t11.c:18:3: note:   def_stmt =  _6 = *_5;
t11.c:18:3: note: add new stmt: vect__8.11_44 = vect__4.7_40 + vect__6.10_43;
t11.c:18:3: note: ------>vectorizing statement: *_7 = _8;
t11.c:18:3: note: transform statement.
t11.c:18:3: note: vect_is_simple_use: operand _8
t11.c:18:3: note: def_stmt: _8 = _4 + _6;
t11.c:18:3: note: type of def: internal
t11.c:18:3: note: transform store. ncopies = 1
t11.c:18:3: note: vect_get_vec_def_for_operand: _8
t11.c:18:3: note: vect_is_simple_use: operand _8
t11.c:18:3: note: def_stmt: _8 = _4 + _6;
t11.c:18:3: note: type of def: internal
t11.c:18:3: note:   def_stmt =  _8 = _4 + _6;
t11.c:18:3: note: create vector_type-pointer variable to type: vector(8) int  vectorizing a pointer ref: *c_16
Applying pattern match.pd:83, generic-match.c:11885
t11.c:18:3: note: created c_16
t11.c:18:3: note: add new stmt: MEM[(int *)vectp_c.12_45] = vect__8.11_44;
t11.c:18:3: note: ------>vectorizing statement: i_19 = i_23 + 1;
t11.c:18:3: note: ------>vectorizing statement: ivtmp_38 = ivtmp_39 - 1;
t11.c:18:3: note: ------>vectorizing statement: vectp_a.5_28 = vectp_a.5_29 + 32;
t11.c:18:3: note: ------>vectorizing statement: vectp_b.8_42 = vectp_b.8_41 + 32;
t11.c:18:3: note: ------>vectorizing statement: vectp_c.12_46 = vectp_c.12_45 + 32;
t11.c:18:3: note: ------>vectorizing statement: if (ivtmp_38 != 0)

loop at t11.c:19: if (ivtmp_49 < 127)
;; Scaling loop 1 with scale 0.125000, bounding iterations to 127 from guessed 98
;; guessed iterations are now 13
t11.c:18:3: note: LOOP VECTORIZED


Analyzing loop at t11.c:18
t11.c:18:3: note: ===== analyze_loop_nest =====
t11.c:18:3: note: === vect_analyze_loop_form ===
t11.c:18:3: note: === get_loop_niters ===
Analyzing # of iterations of loop 2
  exit condition [6, + , 4294967295] != 0
  bounds on difference of bases: -6 ... -6
  result:
    # of iterations 6, bounded by 6
Creating dr for *_25
analyze_innermost: Applying pattern match.pd:83, generic-match.c:11885
success.
Applying pattern match.pd:83, generic-match.c:11885
	base_address: a_12
	offset from base address: 0
	constant offset from base address: 4064
	step: 4
	aligned to: 256
	base_object: *a_12
	Access function 0: {4064B, +, 4}_2
Creating dr for *_21
analyze_innermost: Applying pattern match.pd:83, generic-match.c:11885
success.
Applying pattern match.pd:83, generic-match.c:11885
	base_address: b_14
	offset from base address: 0
	constant offset from base address: 4064
	step: 4
	aligned to: 256
	base_object: *b_14
	Access function 0: {4064B, +, 4}_2
Creating dr for *_10
analyze_innermost: Applying pattern match.pd:83, generic-match.c:11885
success.
Applying pattern match.pd:83, generic-match.c:11885
	base_address: c_16
	offset from base address: 0
	constant offset from base address: 4064
	step: 4
	aligned to: 256
	base_object: *c_16
	Access function 0: {4064B, +, 4}_2
t11.c:18:3: note: === vect_analyze_data_refs ===
t11.c:18:3: note: got vectype for stmt: _22 = *_25;
vector(8) int
t11.c:18:3: note: got vectype for stmt: _20 = *_21;
vector(8) int
t11.c:18:3: note: got vectype for stmt: *_10 = _9;
vector(8) int
t11.c:18:3: note: === vect_analyze_scalar_cycles ===
t11.c:18:3: note: Analyze phi: i_36 = PHI <1016(7), i_32(9)>
t11.c:18:3: note: Access function of PHI: {1016, +, 1}_2
t11.c:18:3: note: step: 1,  init: 1016
t11.c:18:3: note: Detected induction.
t11.c:18:3: note: Analyze phi: .MEM_35 = PHI <.MEM_30(7), .MEM_33(9)>
t11.c:18:3: note: Analyze phi: ivtmp_34 = PHI <7(7), ivtmp_31(9)>
t11.c:18:3: note: Access function of PHI: {7, +, 4294967295}_2
t11.c:18:3: note: step: 4294967295,  init: 7
t11.c:18:3: note: Detected induction.
t11.c:18:3: note: === vect_pattern_recog ===
t11.c:18:3: note: vect_is_simple_use: operand _27
t11.c:18:3: note: def_stmt: _27 = (long unsigned int) i_36;
t11.c:18:3: note: type of def: internal
t11.c:18:3: note: vect_is_simple_use: operand i_36
t11.c:18:3: note: def_stmt: i_36 = PHI <1016(7), i_32(9)>
t11.c:18:3: note: type of def: induction
t11.c:18:3: note: vect_is_simple_use: operand 4
t11.c:18:3: note: === vect_analyze_data_ref_accesses ===
t11.c:18:3: note: === vect_mark_stmts_to_be_vectorized ===
t11.c:18:3: note: init: phi relevant? i_36 = PHI <1016(7), i_32(9)>
t11.c:18:3: note: init: phi relevant? .MEM_35 = PHI <.MEM_30(7), .MEM_33(9)>
t11.c:18:3: note: init: phi relevant? ivtmp_34 = PHI <7(7), ivtmp_31(9)>
t11.c:18:3: note: init: stmt relevant? _27 = (long unsigned int) i_36;
t11.c:18:3: note: init: stmt relevant? _26 = _27 * 4;
t11.c:18:3: note: init: stmt relevant? _25 = a_12 + _26;
t11.c:18:3: note: init: stmt relevant? _22 = *_25;
t11.c:18:3: note: init: stmt relevant? _21 = b_14 + _26;
t11.c:18:3: note: init: stmt relevant? _20 = *_21;
t11.c:18:3: note: init: stmt relevant? _10 = c_16 + _26;
t11.c:18:3: note: init: stmt relevant? _9 = _22 + _20;
t11.c:18:3: note: init: stmt relevant? *_10 = _9;
t11.c:18:3: note: vec_stmt_relevant_p: stmt has vdefs.
t11.c:18:3: note: mark relevant 5, live 0: *_10 = _9;
t11.c:18:3: note: init: stmt relevant? i_32 = i_36 + 1;
t11.c:18:3: note: init: stmt relevant? ivtmp_31 = ivtmp_34 - 1;
t11.c:18:3: note: init: stmt relevant? if (ivtmp_31 != 0)
t11.c:18:3: note: worklist: examine stmt: *_10 = _9;
t11.c:18:3: note: vect_is_simple_use: operand _9
t11.c:18:3: note: def_stmt: _9 = _22 + _20;
t11.c:18:3: note: type of def: internal
t11.c:18:3: note: mark relevant 5, live 0: _9 = _22 + _20;
t11.c:18:3: note: worklist: examine stmt: _9 = _22 + _20;
t11.c:18:3: note: vect_is_simple_use: operand _22
t11.c:18:3: note: def_stmt: _22 = *_25;
t11.c:18:3: note: type of def: internal
t11.c:18:3: note: mark relevant 5, live 0: _22 = *_25;
t11.c:18:3: note: vect_is_simple_use: operand _20
t11.c:18:3: note: def_stmt: _20 = *_21;
t11.c:18:3: note: type of def: internal
t11.c:18:3: note: mark relevant 5, live 0: _20 = *_21;
t11.c:18:3: note: worklist: examine stmt: _20 = *_21;
t11.c:18:3: note: worklist: examine stmt: _22 = *_25;
t11.c:18:3: note: === vect_analyze_data_ref_dependences ===
(compute_affine_dependence
  stmt_a: _22 = *_25;
  stmt_b: _20 = *_21;
) -> no dependence
(compute_affine_dependence
  stmt_a: _22 = *_25;
  stmt_b: *_10 = _9;
) -> no dependence
(compute_affine_dependence
  stmt_a: _20 = *_21;
  stmt_b: *_10 = _9;
) -> no dependence
(compute_affine_dependence
  stmt_a: _22 = *_25;
  stmt_b: _22 = *_25;
(analyze_overlapping_iterations 
  (chrec_a = {4064B, +, 4}_2)
  (chrec_b = {4064B, +, 4}_2)
  (overlap_iterations_a = [0])
  (overlap_iterations_b = [0]))
)
(compute_affine_dependence
  stmt_a: _20 = *_21;
  stmt_b: _20 = *_21;
(analyze_overlapping_iterations 
  (chrec_a = {4064B, +, 4}_2)
  (chrec_b = {4064B, +, 4}_2)
  (overlap_iterations_a = [0])
  (overlap_iterations_b = [0]))
)
(compute_affine_dependence
  stmt_a: *_10 = _9;
  stmt_b: *_10 = _9;
(analyze_overlapping_iterations 
  (chrec_a = {4064B, +, 4}_2)
  (chrec_b = {4064B, +, 4}_2)
  (overlap_iterations_a = [0])
  (overlap_iterations_b = [0]))
)
t11.c:18:3: note: === vect_determine_vectorization_factor ===
t11.c:18:3: note: ==> examining phi: i_36 = PHI <1016(7), i_32(9)>
t11.c:18:3: note: ==> examining phi: .MEM_35 = PHI <.MEM_30(7), .MEM_33(9)>
t11.c:18:3: note: ==> examining phi: ivtmp_34 = PHI <7(7), ivtmp_31(9)>
t11.c:18:3: note: ==> examining statement: _27 = (long unsigned int) i_36;
t11.c:18:3: note: skip.
t11.c:18:3: note: ==> examining statement: _26 = _27 * 4;
t11.c:18:3: note: skip.
t11.c:18:3: note: ==> examining statement: _25 = a_12 + _26;
t11.c:18:3: note: skip.
t11.c:18:3: note: ==> examining statement: _22 = *_25;
t11.c:18:3: note: get vectype for scalar type:  int
t11.c:18:3: note: vectype: vector(8) int
t11.c:18:3: note: nunits = 8
t11.c:18:3: note: ==> examining statement: _21 = b_14 + _26;
t11.c:18:3: note: skip.
t11.c:18:3: note: ==> examining statement: _20 = *_21;
t11.c:18:3: note: get vectype for scalar type:  int
t11.c:18:3: note: vectype: vector(8) int
t11.c:18:3: note: nunits = 8
t11.c:18:3: note: ==> examining statement: _10 = c_16 + _26;
t11.c:18:3: note: skip.
t11.c:18:3: note: ==> examining statement: _9 = _22 + _20;
t11.c:18:3: note: get vectype for scalar type:  int
t11.c:18:3: note: vectype: vector(8) int
t11.c:18:3: note: get vectype for scalar type:  int
t11.c:18:3: note: vectype: vector(8) int
t11.c:18:3: note: nunits = 8
t11.c:18:3: note: ==> examining statement: *_10 = _9;
t11.c:18:3: note: get vectype for scalar type:  int
t11.c:18:3: note: vectype: vector(8) int
t11.c:18:3: note: nunits = 8
t11.c:18:3: note: ==> examining statement: i_32 = i_36 + 1;
t11.c:18:3: note: skip.
t11.c:18:3: note: ==> examining statement: ivtmp_31 = ivtmp_34 - 1;
t11.c:18:3: note: skip.
t11.c:18:3: note: ==> examining statement: if (ivtmp_31 != 0)
t11.c:18:3: note: skip.
t11.c:18:3: note: vectorization factor = 8
t11.c:18:3: note: === vect_analyze_slp ===
t11.c:18:3: note: === vect_make_slp_decision ===
t11.c:18:3: note: vectorization_factor = 8, niters = 7
t11.c:18:3: note: === vect_analyze_data_refs_alignment ===
t11.c:18:3: note: vect_compute_data_ref_alignment:
t11.c:18:3: note: misalign = 0 bytes of ref *_25
t11.c:18:3: note: vect_compute_data_ref_alignment:
t11.c:18:3: note: misalign = 0 bytes of ref *_21
t11.c:18:3: note: vect_compute_data_ref_alignment:
t11.c:18:3: note: misalign = 0 bytes of ref *_10
t11.c:18:3: note: === vect_prune_runtime_alias_test_list ===
t11.c:18:3: note: === vect_analyze_loop_operations ===
t11.c:18:3: note: examining phi: i_36 = PHI <1016(7), i_32(9)>
t11.c:18:3: note: examining phi: .MEM_35 = PHI <.MEM_30(7), .MEM_33(9)>
t11.c:18:3: note: examining phi: ivtmp_34 = PHI <7(7), ivtmp_31(9)>
t11.c:18:3: note: ==> examining statement: _27 = (long unsigned int) i_36;
t11.c:18:3: note: irrelevant.
t11.c:18:3: note: ==> examining statement: _26 = _27 * 4;
t11.c:18:3: note: irrelevant.
t11.c:18:3: note: ==> examining statement: _25 = a_12 + _26;
t11.c:18:3: note: irrelevant.
t11.c:18:3: note: ==> examining statement: _22 = *_25;
t11.c:18:3: note: vect_is_simple_use: operand *_25
t11.c:18:3: note: not ssa-name.
t11.c:18:3: note: use not simple.
t11.c:18:3: note: vect_is_simple_use: operand *_25
t11.c:18:3: note: not ssa-name.
t11.c:18:3: note: use not simple.
t11.c:18:3: note: vect_model_load_cost: aligned.
t11.c:18:3: note: vect_model_load_cost: inside_cost = 1, prologue_cost = 0 .
t11.c:18:3: note: ==> examining statement: _21 = b_14 + _26;
t11.c:18:3: note: irrelevant.
t11.c:18:3: note: ==> examining statement: _20 = *_21;
t11.c:18:3: note: vect_is_simple_use: operand *_21
t11.c:18:3: note: not ssa-name.
t11.c:18:3: note: use not simple.
t11.c:18:3: note: vect_is_simple_use: operand *_21
t11.c:18:3: note: not ssa-name.
t11.c:18:3: note: use not simple.
t11.c:18:3: note: vect_model_load_cost: aligned.
t11.c:18:3: note: vect_model_load_cost: inside_cost = 1, prologue_cost = 0 .
t11.c:18:3: note: ==> examining statement: _10 = c_16 + _26;
t11.c:18:3: note: irrelevant.
t11.c:18:3: note: ==> examining statement: _9 = _22 + _20;
t11.c:18:3: note: vect_is_simple_use: operand _22
t11.c:18:3: note: def_stmt: _22 = *_25;
t11.c:18:3: note: type of def: internal
t11.c:18:3: note: vect_is_simple_use: operand _20
t11.c:18:3: note: def_stmt: _20 = *_21;
t11.c:18:3: note: type of def: internal
t11.c:18:3: note: === vectorizable_operation ===
t11.c:18:3: note: vect_model_simple_cost: inside_cost = 1, prologue_cost = 0 .
t11.c:18:3: note: ==> examining statement: *_10 = _9;
t11.c:18:3: note: vect_is_simple_use: operand _9
t11.c:18:3: note: def_stmt: _9 = _22 + _20;
t11.c:18:3: note: type of def: internal
t11.c:18:3: note: vect_model_store_cost: aligned.
t11.c:18:3: note: vect_model_store_cost: inside_cost = 1, prologue_cost = 0 .
t11.c:18:3: note: ==> examining statement: i_32 = i_36 + 1;
t11.c:18:3: note: irrelevant.
t11.c:18:3: note: ==> examining statement: ivtmp_31 = ivtmp_34 - 1;
t11.c:18:3: note: irrelevant.
t11.c:18:3: note: ==> examining statement: if (ivtmp_31 != 0)
t11.c:18:3: note: irrelevant.
t11.c:18:3: note: cost model: mask loop epilogue.
t11.c:18:3: note: loop vectorized
t11.c:18:3: note: === vec_transform_loop ===
t11.c:18:3: note: ------>vectorizing phi: i_36 = PHI <1016(11), i_32(9)>
t11.c:18:3: note: ------>vectorizing phi: .MEM_35 = PHI <.MEM_30(11), .MEM_33(9)>
t11.c:18:3: note: ------>vectorizing phi: ivtmp_34 = PHI <7(11), ivtmp_31(9)>
t11.c:18:3: note: ------>vectorizing statement: _27 = (long unsigned int) i_36;
t11.c:18:3: note: ------>vectorizing statement: _26 = _27 * 4;
t11.c:18:3: note: ------>vectorizing statement: _25 = a_12 + _26;
t11.c:18:3: note: ------>vectorizing statement: _22 = *_25;
t11.c:18:3: note: transform statement.
t11.c:18:3: note: transform load. ncopies = 1
t11.c:18:3: note: create vector_type-pointer variable to type: vector(8) int  vectorizing a pointer ref: *a_12
t11.c:18:3: note: created vectp_a.15_50
t11.c:18:3: note: add new stmt: vect__22.16_53 = MEM[(int *)vectp_a.14_51];
t11.c:18:3: note: ------>vectorizing statement: _21 = b_14 + _26;
t11.c:18:3: note: ------>vectorizing statement: _20 = *_21;
t11.c:18:3: note: transform statement.
t11.c:18:3: note: transform load. ncopies = 1
t11.c:18:3: note: create vector_type-pointer variable to type: vector(8) int  vectorizing a pointer ref: *b_14
t11.c:18:3: note: created vectp_b.18_54
t11.c:18:3: note: add new stmt: vect__20.19_57 = MEM[(int *)vectp_b.17_55];
t11.c:18:3: note: ------>vectorizing statement: _10 = c_16 + _26;
t11.c:18:3: note: ------>vectorizing statement: _9 = _22 + _20;
t11.c:18:3: note: transform statement.
t11.c:18:3: note: vect_is_simple_use: operand _22
t11.c:18:3: note: def_stmt: _22 = *_25;
t11.c:18:3: note: type of def: internal
t11.c:18:3: note: vect_is_simple_use: operand _20
t11.c:18:3: note: def_stmt: _20 = *_21;
t11.c:18:3: note: type of def: internal
t11.c:18:3: note: transform binary/unary operation.
t11.c:18:3: note: vect_get_vec_def_for_operand: _22
t11.c:18:3: note: vect_is_simple_use: operand _22
t11.c:18:3: note: def_stmt: _22 = *_25;
t11.c:18:3: note: type of def: internal
t11.c:18:3: note:   def_stmt =  _22 = *_25;
t11.c:18:3: note: vect_get_vec_def_for_operand: _20
t11.c:18:3: note: vect_is_simple_use: operand _20
t11.c:18:3: note: def_stmt: _20 = *_21;
t11.c:18:3: note: type of def: internal
t11.c:18:3: note:   def_stmt =  _20 = *_21;
t11.c:18:3: note: add new stmt: vect__9.20_58 = vect__22.16_53 + vect__20.19_57;
t11.c:18:3: note: ------>vectorizing statement: *_10 = _9;
t11.c:18:3: note: transform statement.
t11.c:18:3: note: vect_is_simple_use: operand _9
t11.c:18:3: note: def_stmt: _9 = _22 + _20;
t11.c:18:3: note: type of def: internal
t11.c:18:3: note: transform store. ncopies = 1
t11.c:18:3: note: vect_get_vec_def_for_operand: _9
t11.c:18:3: note: vect_is_simple_use: operand _9
t11.c:18:3: note: def_stmt: _9 = _22 + _20;
t11.c:18:3: note: type of def: internal
t11.c:18:3: note:   def_stmt =  _9 = _22 + _20;
t11.c:18:3: note: create vector_type-pointer variable to type: vector(8) int  vectorizing a pointer ref: *c_16
t11.c:18:3: note: created vectp_c.22_59
t11.c:18:3: note: add new stmt: MEM[(int *)vectp_c.21_60] = vect__9.20_58;
t11.c:18:3: note: ------>vectorizing statement: i_32 = i_36 + 1;
t11.c:18:3: note: ------>vectorizing statement: ivtmp_31 = ivtmp_34 - 1;
t11.c:18:3: note: ------>vectorizing statement: vectp_a.14_52 = vectp_a.14_51 + 32;
t11.c:18:3: note: ------>vectorizing statement: vectp_b.17_56 = vectp_b.17_55 + 32;
t11.c:18:3: note: ------>vectorizing statement: vectp_c.21_61 = vectp_c.21_60 + 32;
t11.c:18:3: note: ------>vectorizing statement: if (ivtmp_31 != 0)

loop at t11.c:19: if (ivtmp_64 < 0)
t11.c:18:3: note: === Loop has beed masked ===
;; Scaling loop 2 with scale 0.125000, bounding iterations to 0 from guessed 6
;; guessed iterations are now 6
t11.c:18:3: note: LOOP EPILOGUE VECTORIZED AND MASKED (VS=32)
t11.c:8:1: note: vectorized 2 loops in function.

Updating SSA:
Registering new PHI nodes in block #0
Registering new PHI nodes in block #2
Registering new PHI nodes in block #10
Registering new PHI nodes in block #3
Updating SSA information for statement vect__4.7_40 = MEM[(int *)vectp_a.5_29];
Updating SSA information for statement _4 = *_3;
Updating SSA information for statement vect__6.10_43 = MEM[(int *)vectp_b.8_41];
Updating SSA information for statement _6 = *_5;
Updating SSA information for statement MEM[(int *)vectp_c.12_45] = vect__8.11_44;
Registering new PHI nodes in block #7
Registering new PHI nodes in block #11
Registering new PHI nodes in block #8
Updating SSA information for statement vect__22.16_53 = MASK_LOAD (vectp_a.14_51, 32B, mask_68);
Updating SSA information for statement _22 = *_25;
Updating SSA information for statement vect__20.19_57 = MASK_LOAD (vectp_b.17_55, 32B, mask_68);
Updating SSA information for statement _20 = *_21;
Updating SSA information for statement MASK_STORE (vectp_c.21_60, 32B, mask_68, vect__9.20_58);
Registering new PHI nodes in block #5
Updating SSA information for statement return;
Registering new PHI nodes in block #9
Registering new PHI nodes in block #4

Symbols to be put in SSA form
{ D.2260 }
Incremental SSA update started at block: 0
Number of blocks in CFG: 12
Number of blocks to update: 10 ( 83%)
Affected blocks: 0 2 3 4 5 7 8 9 10 11


Merging blocks 2 and 10
Merging blocks 7 and 11
Applying pattern match.pd:2020, gimple-match.c:4233
Applying pattern match.pd:2654, gimple-match.c:6588
Removing basic block 9
basic block 9, loop depth 1
 pred:       8
goto <bb 8>;
 succ:       8


Merging blocks 7 and 8
gimple_simplified to i_32 = 1017;
gimple_simplified to _27 = 1016;
gimple_simplified to ivtmp_31 = 6;
gimple_simplified to ivtmp_64 = 1;
gimple_simplified to ivtmp_66 = { 8, 9, 10, 11, 12, 13, 14, 15 };
Merging blocks 7 and 5
fix_loop_structure: fixing up loops for function
fix_loop_structure: removing loop 2
__attribute__((noinline))
test_citer (int * restrict a, int * restrict b, int * restrict c)
{
  vector(8) int * vectp_c.22;
  vector(8) int * vectp_c.21;
  vector(8) int vect__9.20;
  vector(8) int vect__20.19;
  vector(8) int * vectp_b.18;
  vector(8) int * vectp_b.17;
  vector(8) int vect__22.16;
  vector(8) int * vectp_a.15;
  vector(8) int * vectp_a.14;
  vector(8) int * vectp_c.13;
  vector(8) int * vectp_c.12;
  vector(8) int vect__8.11;
  vector(8) int vect__6.10;
  vector(8) int * vectp_b.9;
  vector(8) int * vectp_b.8;
  vector(8) int vect__4.7;
  vector(8) int * vectp_a.6;
  vector(8) int * vectp_a.5;
  unsigned int tmp.4;
  int tmp.3;
  int i;
  long unsigned int _1;
  long unsigned int _2;
  int * _3;
  int _4;
  int * _5;
  int _6;
  int * _7;
  int _8;
  int _9;
  int * _10;
  int _20;
  int * _21;
  int _22;
  int * _25;
  long unsigned int _26;
  long unsigned int _27;
  unsigned int ivtmp_31;
  unsigned int ivtmp_38;
  unsigned int ivtmp_39;
  unsigned int ivtmp_48;
  unsigned int ivtmp_49;
  unsigned int ivtmp_64;
  vector(8) unsigned int ivtmp_66;
  vector(8) unsigned int vect_niters_67;
  vector(8) <unnamed type> mask_68;

  <bb 2>:
  a_12 = __builtin_assume_aligned (a_11(D), 64);
  b_14 = __builtin_assume_aligned (b_13(D), 64);
  c_16 = __builtin_assume_aligned (c_15(D), 64);

  <bb 3>:
  # i_23 = PHI <i_19(4), 0(2)>
  # ivtmp_39 = PHI <ivtmp_38(4), 1023(2)>
  # vectp_a.5_29 = PHI <vectp_a.5_28(4), a_12(2)>
  # vectp_b.8_41 = PHI <vectp_b.8_42(4), b_14(2)>
  # vectp_c.12_45 = PHI <vectp_c.12_46(4), c_16(2)>
  # ivtmp_48 = PHI <ivtmp_49(4), 0(2)>
  _1 = (long unsigned int) i_23;
  _2 = _1 * 4;
  _3 = a_12 + _2;
  vect__4.7_40 = MEM[(int *)vectp_a.5_29];
  _4 = *_3;
  _5 = b_14 + _2;
  vect__6.10_43 = MEM[(int *)vectp_b.8_41];
  _6 = *_5;
  _7 = c_16 + _2;
  vect__8.11_44 = vect__4.7_40 + vect__6.10_43;
  _8 = _4 + _6;
  MEM[(int *)vectp_c.12_45] = vect__8.11_44;
  i_19 = i_23 + 1;
  ivtmp_38 = ivtmp_39 - 1;
  vectp_a.5_28 = vectp_a.5_29 + 32;
  vectp_b.8_42 = vectp_b.8_41 + 32;
  vectp_c.12_46 = vectp_c.12_45 + 32;
  ivtmp_49 = ivtmp_48 + 1;
  if (ivtmp_49 < 127)
    goto <bb 4>;
  else
    goto <bb 5>;

  <bb 4>:
  goto <bb 3>;

  <bb 5>:
  vectp_a.15_50 = a_12 + 4064;
  vectp_b.18_54 = b_14 + 4064;
  vectp_c.22_59 = c_16 + 4064;
  vect_niters_67 = { 7, 7, 7, 7, 7, 7, 7, 7 };
  vectp_a.14_51 = vectp_a.15_50;
  vectp_b.17_55 = vectp_b.18_54;
  vectp_c.21_60 = vectp_c.22_59;
  mask_68 = vect_niters_67 > { 0, 1, 2, 3, 4, 5, 6, 7 };
  _27 = 1016;
  _26 = _27 * 4;
  _25 = a_12 + _26;
  vect__22.16_53 = MASK_LOAD (vectp_a.14_51, 32B, mask_68);
  _22 = *_25;
  _21 = b_14 + _26;
  vect__20.19_57 = MASK_LOAD (vectp_b.17_55, 32B, mask_68);
  _20 = *_21;
  _10 = c_16 + _26;
  vect__9.20_58 = vect__22.16_53 + vect__20.19_57;
  _9 = _22 + _20;
  MASK_STORE (vectp_c.21_60, 32B, mask_68, vect__9.20_58);
  i_32 = 1017;
  ivtmp_31 = 6;
  vectp_a.14_52 = vectp_a.14_51 + 32;
  vectp_b.17_56 = vectp_b.17_55 + 32;
  vectp_c.21_61 = vectp_c.21_60 + 32;
  ivtmp_64 = 1;
  ivtmp_66 = { 8, 9, 10, 11, 12, 13, 14, 15 };
  return;

}
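
For readers following the dump above, the masked tail can be emulated in scalar C. This is only a sketch under assumptions: the helper names `masked_add_tail` and `add_arrays` are hypothetical and do not appear in the patch. The per-lane test mirrors `mask_68 = vect_niters_67 > { 0, 1, ..., 7 }`, and the guarded body stands in for the `MASK_LOAD` / `MASK_STORE` pair.

```c
#include <assert.h>
#include <stddef.h>

enum { VF = 8 };  /* vectorization factor: vector(8) int in the dump */

/* Scalar emulation of one masked vector iteration.  A lane is active
   when remaining > lane, which mirrors the mask computation in the dump.
   Inactive lanes are neither read nor written.  */
void masked_add_tail(const int *a, const int *b, int *c, unsigned remaining)
{
    assert(remaining <= VF);
    for (unsigned lane = 0; lane < VF; lane++) {
        int active = remaining > lane;
        if (active)
            c[lane] = a[lane] + b[lane];
    }
}

/* Full vector iterations first, then a single masked iteration for the
   leftover elements (hypothetical driver, not code from the patch).  */
void add_arrays(const int *a, const int *b, int *c, size_t n)
{
    size_t i = 0;
    for (; i + VF <= n; i += VF)
        for (unsigned lane = 0; lane < VF; lane++)
            c[i + lane] = a[i + lane] + b[i + lane];
    if (i < n)
        masked_add_tail(a + i, b + i, c + i, (unsigned)(n - i));
}
```

With n = 1023 the first loop covers 127 groups of 8 elements (indices 0 to 1015) and the masked step handles the remaining 7, starting at element 1016, i.e. byte offset 4064 as in the dump.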




* Re: [PATCH, vec-tails] Support loop epilogue vectorization
  2016-11-18 15:54                               ` Christophe Lyon
  2016-11-24 13:42                                 ` Yuri Rumyantsev
@ 2016-11-29 16:22                                 ` Christophe Lyon
  1 sibling, 0 replies; 38+ messages in thread
From: Christophe Lyon @ 2016-11-29 16:22 UTC (permalink / raw)
  To: Yuri Rumyantsev; +Cc: Richard Biener, Jeff Law, gcc-patches, Ilya Enkovich

On 18 November 2016 at 16:54, Christophe Lyon
<christophe.lyon@linaro.org> wrote:
> On 18 November 2016 at 16:46, Yuri Rumyantsev <ysrumyan@gmail.com> wrote:
>> It is very strange that this test failed on arm, since it requires
>> target avx2 to check vectorizer dumps:
>>
>> /* { dg-final { scan-tree-dump-times "LOOP VECTORIZED" 2 "vect" {
>> target avx2_runtime } } } */
>> /* { dg-final { scan-tree-dump-times "LOOP EPILOGUE VECTORIZED
>> \\(VS=16\\)" 2 "vect" { target avx2_runtime } } } */
>>
>> Could you please clarify what is the reason of the failure?
>
> It's not the scan-dumps that fail, but the execution.
> The test calls abort() for some reason.
>
> It will take me a while to rebuild the test manually in the right
> debug environment to provide you with more traces.
>
>
Sorry for the delay... This problem is not directly related to your patch.

The tests in gcc.dg/vect are compiled with -mfpu=neon
-mfloat-abi=softfp -march=armv7-a
and thus cannot be executed on older versions of the architecture.

This is another instance of what I discussed with Jakub several months ago:
https://gcc.gnu.org/ml/gcc-patches/2016-06/msg00666.html
but the thread died.

Basically, check_vect_support_and_set_flags sets
dg-do-what-default to compile, but
some tests in gcc.dg/vect have dg-do run hardcoded.

Jakub was not happy with my patch that was removing all these dg-do
run directives :-)
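
To illustrate the kind of test I mean, here is a rough sketch. It is a hypothetical file, not the actual vect-tail-nomask-1.c, and the names add_loop and run_check are made up for the example. The point is only that it carries no dg-do run directive, so whatever default the harness selected (compile or run) applies.

```c
/* Hypothetical gcc.dg/vect-style test sketch (not the actual
   vect-tail-nomask-1.c).  It deliberately carries no { dg-do run }
   directive, so the harness default decides between compile and run.  */
/* { dg-require-effective-target vect_int } */

#include <assert.h>

#define N 1023  /* 127 full vectors of 8 ints plus a 7-element tail */

static int a[N], b[N], c[N];

/* Same shape as the loop in the dump: c[i] = a[i] + b[i].  */
__attribute__((noinline)) void
add_loop (int *restrict pa, int *restrict pb, int *restrict pc, int n)
{
  assert (n >= 0);
  for (int i = 0; i < n; i++)
    pc[i] = pa[i] + pb[i];
}

/* Returns the number of mismatching elements; a real test would call
   abort () when the count is non-zero.  */
int
run_check (void)
{
  int bad = 0;
  for (int i = 0; i < N; i++)
    {
      a[i] = i;
      b[i] = 2 * i;
    }
  add_loop (a, b, c, N);
  for (int i = 0; i < N; i++)
    bad += (c[i] != 3 * i);
  return bad;
}
```

A test written this way is only compiled when the target cannot execute the vector code, whereas a hardcoded dg-do run forces execution regardless.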

Christophe


>
>>
>> Thanks.
>>
>> 2016-11-18 16:20 GMT+03:00 Christophe Lyon <christophe.lyon@linaro.org>:
>>> On 15 November 2016 at 15:41, Yuri Rumyantsev <ysrumyan@gmail.com> wrote:
>>>> Hi All,
>>>>
>>>> Here is the patch for non-masked epilogue vectorization.
>>>>
>>>> Bootstrap and regression testing did not show any new failures.
>>>>
>>>> Is it OK for trunk?
>>>>
>>>> Thanks.
>>>> Changelog:
>>>>
>>>> 2016-11-15  Yuri Rumyantsev  <ysrumyan@gmail.com>
>>>>
>>>> * params.def (PARAM_VECT_EPILOGUES_NOMASK): New.
>>>> * tree-if-conv.c (tree_if_conversion): Make public.
>>>> * tree-if-conv.h: New file.
>>>> * tree-vect-data-refs.c (vect_analyze_data_ref_dependences): Avoid
>>>> dynamic alias checks for epilogues.
>>>> * tree-vect-loop-manip.c (vect_do_peeling): Return created epilog.
>>>> * tree-vect-loop.c: include tree-if-conv.h.
>>>> (new_loop_vec_info): Add zeroing orig_loop_info field.
>>>> (vect_analyze_loop_2): Don't try to enhance alignment for epilogues.
>>>> (vect_analyze_loop): Add argument ORIG_LOOP_INFO which is not NULL
>>>> if epilogue is vectorized, set up orig_loop_info field of loop_vinfo
>>>> using passed argument.
>>>> (vect_transform_loop): Check if created epilogue should be returned
>>>> for further vectorization with less vf.  If-convert epilogue if
>>>> required. Print vectorization success for epilogue.
>>>> * tree-vectorizer.c (vectorize_loops): Add epilogue vectorization
>>>> if it is required, pass loop_vinfo produced during vectorization of
>>>> loop body to vect_analyze_loop.
>>>> * tree-vectorizer.h (struct _loop_vec_info): Add new field
>>>> orig_loop_info.
>>>> (LOOP_VINFO_ORIG_LOOP_INFO): New.
>>>> (LOOP_VINFO_EPILOGUE_P): New.
>>>> (LOOP_VINFO_ORIG_VECT_FACTOR): New.
>>>> (vect_do_peeling): Change prototype to return epilogue.
>>>> (vect_analyze_loop): Add argument of loop_vec_info type.
>>>> (vect_transform_loop): Return created loop.
>>>>
>>>> gcc/testsuite/
>>>>
>>>> * lib/target-supports.exp (check_avx2_hw_available): New.
>>>> (check_effective_target_avx2_runtime): New.
>>>> * gcc.dg/vect/vect-tail-nomask-1.c: New test.
>>>>
>>>
>>> Hi,
>>>
>>> This new test fails on arm-none-eabi (using default cpu/fpu/mode):
>>>   gcc.dg/vect/vect-tail-nomask-1.c -flto -ffat-lto-objects execution test
>>>   gcc.dg/vect/vect-tail-nomask-1.c execution test
>>>
>>> It does pass on the same target if configured --with-cpu=cortex-a9.
>>>
>>> Christophe
>>>
>>>
>>>
>>>>
>>>> 2016-11-14 20:04 GMT+03:00 Richard Biener <rguenther@suse.de>:
>>>>> On November 14, 2016 4:39:40 PM GMT+01:00, Yuri Rumyantsev <ysrumyan@gmail.com> wrote:
>>>>>>Richard,
>>>>>>
>>>>>>I checked one of the tests designed for epilogue vectorization using
>>>>>>patches 1 - 3 and found out that the built compiler performs vectorization
>>>>>>of epilogues with --param vect-epilogues-nomask=1 passed:
>>>>>>
>>>>>>$ gcc -Ofast -mavx2 t1.c -S --param vect-epilogues-nomask=1 -o
>>>>>>t1.new-nomask.s -fdump-tree-vect-details
>>>>>>$ grep VECTORIZED -c t1.c.156t.vect
>>>>>>4
>>>>>> Without param only 2 loops are vectorized.
>>>>>>
>>>>>>Should I simply add a part of tests related to this feature or I must
>>>>>>delete all not necessary changes also?
>>>>>
>>>>> Please remove all not necessary changes.
>>>>>
>>>>> Richard.
>>>>>
>>>>>>Thanks.
>>>>>>Yuri.
>>>>>>
>>>>>>2016-11-14 16:40 GMT+03:00 Richard Biener <rguenther@suse.de>:
>>>>>>> On Mon, 14 Nov 2016, Yuri Rumyantsev wrote:
>>>>>>>
>>>>>>>> Richard,
>>>>>>>>
>>>>>>>> In my previous patch I forgot to remove a couple of lines related to
>>>>>>>> the aux field.
>>>>>>>> Here is the correct updated patch.
>>>>>>>
>>>>>>> Yeah, I noticed.  This patch would be ok for trunk (together with
>>>>>>> necessary parts from 1 and 2) if all not required parts are removed
>>>>>>> (and you'd add the testcases covering non-masked tail vect).
>>>>>>>
>>>>>>> Thus, can you please produce a single complete patch containing only
>>>>>>> non-masked epilogue vectorization?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Richard.
>>>>>>>
>>>>>>>> Thanks.
>>>>>>>> Yuri.
>>>>>>>>
>>>>>>>> 2016-11-14 15:51 GMT+03:00 Richard Biener <rguenther@suse.de>:
>>>>>>>> > On Fri, 11 Nov 2016, Yuri Rumyantsev wrote:
>>>>>>>> >
>>>>>>>> >> Richard,
>>>>>>>> >>
>>>>>>>> >> I prepare updated 3 patch with passing additional argument to
>>>>>>>> >> vect_analyze_loop as you proposed (untested).
>>>>>>>> >>
>>>>>>>> >> You wrote:
>>>>>>>> >> Btw, I wonder if you can produce a single patch containing just
>>>>>>>> >> epilogue vectorization, that is combine patches 1-3 but rip out
>>>>>>>> >> changes only needed by later patches?
>>>>>>>> >>
>>>>>>>> >> Did you mean that I exclude all support for vectorization
>>>>>>epilogues,
>>>>>>>> >> i.e. exclude from 2-nd patch all non-related changes
>>>>>>>> >> like
>>>>>>>> >>
>>>>>>>> >> diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
>>>>>>>> >> index 11863af..32011c1 100644
>>>>>>>> >> --- a/gcc/tree-vect-loop.c
>>>>>>>> >> +++ b/gcc/tree-vect-loop.c
>>>>>>>> >> @@ -1120,6 +1120,12 @@ new_loop_vec_info (struct loop *loop)
>>>>>>>> >>    LOOP_VINFO_PEELING_FOR_GAPS (res) = false;
>>>>>>>> >>    LOOP_VINFO_PEELING_FOR_NITER (res) = false;
>>>>>>>> >>    LOOP_VINFO_OPERANDS_SWAPPED (res) = false;
>>>>>>>> >> +  LOOP_VINFO_CAN_BE_MASKED (res) = false;
>>>>>>>> >> +  LOOP_VINFO_REQUIRED_MASKS (res) = 0;
>>>>>>>> >> +  LOOP_VINFO_COMBINE_EPILOGUE (res) = false;
>>>>>>>> >> +  LOOP_VINFO_MASK_EPILOGUE (res) = false;
>>>>>>>> >> +  LOOP_VINFO_NEED_MASKING (res) = false;
>>>>>>>> >> +  LOOP_VINFO_ORIG_LOOP_INFO (res) = NULL;
>>>>>>>> >
>>>>>>>> > Yes.
>>>>>>>> >
>>>>>>>> >> Did you mean also that new combined patch must be working patch,
>>>>>>i.e.
>>>>>>>> >> can be integrated without other patches?
>>>>>>>> >
>>>>>>>> > Yes.
>>>>>>>> >
>>>>>>>> >> Could you please look at updated patch?
>>>>>>>> >
>>>>>>>> > Will do.
>>>>>>>> >
>>>>>>>> > Thanks,
>>>>>>>> > Richard.
>>>>>>>> >
>>>>>>>> >> Thanks.
>>>>>>>> >> Yuri.
>>>>>>>> >>
>>>>>>>> >> 2016-11-10 15:36 GMT+03:00 Richard Biener <rguenther@suse.de>:
>>>>>>>> >> > On Thu, 10 Nov 2016, Richard Biener wrote:
>>>>>>>> >> >
>>>>>>>> >> >> On Tue, 8 Nov 2016, Yuri Rumyantsev wrote:
>>>>>>>> >> >>
>>>>>>>> >> >> > Richard,
>>>>>>>> >> >> >
>>>>>>>> >> >> > Here is updated 3 patch.
>>>>>>>> >> >> >
>>>>>>>> >> >> > I checked that all new tests related to epilogue
>>>>>>vectorization passed with it.
>>>>>>>> >> >> >
>>>>>>>> >> >> > Your comments will be appreciated.
>>>>>>>> >> >>
>>>>>>>> >> >> A lot better now.  Instead of the ->aux dance I now prefer to
>>>>>>>> >> >> pass the original loops loop_vinfo to vect_analyze_loop as
>>>>>>>> >> >> optional argument (if non-NULL we analyze the epilogue of that
>>>>>>>> >> >> loop_vinfo).  OTOH I remember we mainly use it to get at the
>>>>>>>> >> >> original vectorization factor?  So we can pass down an
>>>>>>(optional)
>>>>>>>> >> >> forced vectorization factor as well?
>>>>>>>> >> >
>>>>>>>> >> > Btw, I wonder if you can produce a single patch containing just
>>>>>>>> >> > epilogue vectorization, that is combine patches 1-3 but rip out
>>>>>>>> >> > changes only needed by later patches?
>>>>>>>> >> >
>>>>>>>> >> > Thanks,
>>>>>>>> >> > Richard.
>>>>>>>> >> >
>>>>>>>> >> >> Richard.
>>>>>>>> >> >>
>>>>>>>> >> >> > 2016-11-08 15:38 GMT+03:00 Richard Biener
>>>>>><rguenther@suse.de>:
>>>>>>>> >> >> > > On Thu, 3 Nov 2016, Yuri Rumyantsev wrote:
>>>>>>>> >> >> > >
>>>>>>>> >> >> > >> Hi Richard,
>>>>>>>> >> >> > >>
>>>>>>>> >> >> > >> I did not understand your last remark:
>>>>>>>> >> >> > >>
>>>>>>>> >> >> > >> > That is, here (and avoid the FOR_EACH_LOOP change):
>>>>>>>> >> >> > >> >
>>>>>>>> >> >> > >> > @@ -580,12 +586,21 @@ vectorize_loops (void)
>>>>>>>> >> >> > >> >           && dump_enabled_p ())
>>>>>>>> >> >> > >> >           dump_printf_loc (MSG_OPTIMIZED_LOCATIONS,
>>>>>>vect_location,
>>>>>>>> >> >> > >> >                            "loop vectorized\n");
>>>>>>>> >> >> > >> > -       vect_transform_loop (loop_vinfo);
>>>>>>>> >> >> > >> > +       new_loop = vect_transform_loop (loop_vinfo);
>>>>>>>> >> >> > >> >         num_vectorized_loops++;
>>>>>>>> >> >> > >> >        /* Now that the loop has been vectorized, allow
>>>>>>it to be unrolled
>>>>>>>> >> >> > >> >           etc.  */
>>>>>>>> >> >> > >> >      loop->force_vectorize = false;
>>>>>>>> >> >> > >> >
>>>>>>>> >> >> > >> > +       /* Add new loop to a processing queue.  To make
>>>>>>it easier
>>>>>>>> >> >> > >> > +          to match loop and its epilogue vectorization
>>>>>>in dumps
>>>>>>>> >> >> > >> > +          put new loop as the next loop to process.
>>>>>>*/
>>>>>>>> >> >> > >> > +       if (new_loop)
>>>>>>>> >> >> > >> > +         {
>>>>>>>> >> >> > >> > +           loops.safe_insert (i + 1, new_loop->num);
>>>>>>>> >> >> > >> > +           vect_loops_num = number_of_loops (cfun);
>>>>>>>> >> >> > >> > +         }
>>>>>>>> >> >> > >> >
>>>>>>>> >> >> > >> > simply dispatch to a vectorize_epilogue (loop_vinfo,
>>>>>>new_loop)
>>>>>>>> >> >> > >> > function which will set up stuff properly (and also
>>>>>>perform
>>>>>>>> >> >> > >> > the if-conversion of the epilogue there).
>>>>>>>> >> >> > >> >
>>>>>>>> >> >> > >> > That said, if we can get in non-masked epilogue
>>>>>>vectorization
>>>>>>>> >> >> > >> > separately that would be great.
>>>>>>>> >> >> > >>
>>>>>>>> >> >> > >> Could you please clarify your proposal.
>>>>>>>> >> >> > >
>>>>>>>> >> >> > > When a loop was vectorized set things up to immediately
>>>>>>vectorize
>>>>>>>> >> >> > > its epilogue, avoiding changing the loop iteration and
>>>>>>avoiding
>>>>>>>> >> >> > > the re-use of ->aux.
>>>>>>>> >> >> > >
>>>>>>>> >> >> > > Richard.
>>>>>>>> >> >> > >
>>>>>>>> >> >> > >> Thanks.
>>>>>>>> >> >> > >> Yuri.
>>>>>>>> >> >> > >>
>>>>>>>> >> >> > >> 2016-11-02 15:27 GMT+03:00 Richard Biener
>>>>>><rguenther@suse.de>:
>>>>>>>> >> >> > >> > On Tue, 1 Nov 2016, Yuri Rumyantsev wrote:
>>>>>>>> >> >> > >> >
>>>>>>>> >> >> > >> >> Hi All,
>>>>>>>> >> >> > >> >>
>>>>>>>> >> >> > >> >> I re-send all patches sent by Ilya earlier for review
>>>>>>which support
>>>>>>>> >> >> > >> >> vectorization of loop epilogues and loops with low
>>>>>>trip count. We
>>>>>>>> >> >> > >> >> assume that the only patch -
>>>>>>vec-tails-07-combine-tail.patch - was not
>>>>>>>> >> >> > >> >> approved by Jeff.
>>>>>>>> >> >> > >> >>
>>>>>>>> >> >> > >> >> I did re-base of all patches and performed
>>>>>>bootstrapping and
>>>>>>>> >> >> > >> >> regression testing that did not show any new failures.
>>>>>>Also all
>>>>>>>> >> >> > >> >> changes related to new vect_do_peeling algorithm have
>>>>>>been changed
>>>>>>>> >> >> > >> >> accordingly.
>>>>>>>> >> >> > >> >>
>>>>>>>> >> >> > >> >> Is it OK for trunk?
>>>>>>>> >> >> > >> >
>>>>>>>> >> >> > >> > I would have preferred that the series up to
>>>>>>-03-nomask-tails would
>>>>>>>> >> >> > >> > _only_ contain epilogue loop vectorization changes but
>>>>>>unfortunately
>>>>>>>> >> >> > >> > the patchset is oddly separated.
>>>>>>>> >> >> > >> >
>>>>>>>> >> >> > >> > I have a comment on that part nevertheless:
>>>>>>>> >> >> > >> >
>>>>>>>> >> >> > >> > @@ -1608,7 +1614,10 @@ vect_enhance_data_refs_alignment
>>>>>>(loop_vec_info
>>>>>>>> >> >> > >> > loop_vinfo)
>>>>>>>> >> >> > >> >    /* Check if we can possibly peel the loop.  */
>>>>>>>> >> >> > >> >    if (!vect_can_advance_ivs_p (loop_vinfo)
>>>>>>>> >> >> > >> >        || !slpeel_can_duplicate_loop_p (loop,
>>>>>>single_exit (loop))
>>>>>>>> >> >> > >> > -      || loop->inner)
>>>>>>>> >> >> > >> > +      || loop->inner
>>>>>>>> >> >> > >> > +      /* Required peeling was performed in prologue
>>>>>>and
>>>>>>>> >> >> > >> > +        is not required for epilogue.  */
>>>>>>>> >> >> > >> > +      || LOOP_VINFO_EPILOGUE_P (loop_vinfo))
>>>>>>>> >> >> > >> >      do_peeling = false;
>>>>>>>> >> >> > >> >
>>>>>>>> >> >> > >> >    if (do_peeling
>>>>>>>> >> >> > >> > @@ -1888,7 +1897,10 @@ vect_enhance_data_refs_alignment
>>>>>>(loop_vec_info
>>>>>>>> >> >> > >> > loop_vinfo)
>>>>>>>> >> >> > >> >
>>>>>>>> >> >> > >> >    do_versioning =
>>>>>>>> >> >> > >> >         optimize_loop_nest_for_speed_p (loop)
>>>>>>>> >> >> > >> > -       && (!loop->inner); /* FORNOW */
>>>>>>>> >> >> > >> > +       && (!loop->inner) /* FORNOW */
>>>>>>>> >> >> > >> > +        /* Required versioning was performed for the
>>>>>>>> >> >> > >> > +          original loop and is not required for
>>>>>>epilogue.  */
>>>>>>>> >> >> > >> > +       && !LOOP_VINFO_EPILOGUE_P (loop_vinfo);
>>>>>>>> >> >> > >> >
>>>>>>>> >> >> > >> >    if (do_versioning)
>>>>>>>> >> >> > >> >      {
>>>>>>>> >> >> > >> >
>>>>>>>> >> >> > >> > please do that check in the single caller of this
>>>>>>function.
>>>>>>>> >> >> > >> >
>>>>>>>> >> >> > >> > Otherwise I still dislike the new ->aux use and I
>>>>>>believe that simply
>>>>>>>> >> >> > >> > passing down info from the processed parent would be
>>>>>>_much_ cleaner.
>>>>>>>> >> >> > >> > That is, here (and avoid the FOR_EACH_LOOP change):
>>>>>>>> >> >> > >> >
>>>>>>>> >> >> > >> > @@ -580,12 +586,21 @@ vectorize_loops (void)
>>>>>>>> >> >> > >> >             && dump_enabled_p ())
>>>>>>>> >> >> > >> >            dump_printf_loc (MSG_OPTIMIZED_LOCATIONS,
>>>>>>vect_location,
>>>>>>>> >> >> > >> >                             "loop vectorized\n");
>>>>>>>> >> >> > >> > -       vect_transform_loop (loop_vinfo);
>>>>>>>> >> >> > >> > +       new_loop = vect_transform_loop (loop_vinfo);
>>>>>>>> >> >> > >> >         num_vectorized_loops++;
>>>>>>>> >> >> > >> >         /* Now that the loop has been vectorized, allow
>>>>>>it to be unrolled
>>>>>>>> >> >> > >> >            etc.  */
>>>>>>>> >> >> > >> >         loop->force_vectorize = false;
>>>>>>>> >> >> > >> >
>>>>>>>> >> >> > >> > +       /* Add new loop to a processing queue.  To make
>>>>>>it easier
>>>>>>>> >> >> > >> > +          to match loop and its epilogue vectorization
>>>>>>in dumps
>>>>>>>> >> >> > >> > +          put new loop as the next loop to process.
>>>>>>*/
>>>>>>>> >> >> > >> > +       if (new_loop)
>>>>>>>> >> >> > >> > +         {
>>>>>>>> >> >> > >> > +           loops.safe_insert (i + 1, new_loop->num);
>>>>>>>> >> >> > >> > +           vect_loops_num = number_of_loops (cfun);
>>>>>>>> >> >> > >> > +         }
>>>>>>>> >> >> > >> >
>>>>>>>> >> >> > >> > simply dispatch to a vectorize_epilogue (loop_vinfo,
>>>>>>new_loop)
>>>>>>>> >> >> > >> > function which will set up stuff properly (and also
>>>>>>perform
>>>>>>>> >> >> > >> > the if-conversion of the epilogue there).
>>>>>>>> >> >> > >> >
>>>>>>>> >> >> > >> > That said, if we can get in non-masked epilogue
>>>>>>vectorization
>>>>>>>> >> >> > >> > separately that would be great.
>>>>>>>> >> >> > >> >
>>>>>>>> >> >> > >> > I'm still torn about all the rest of the stuff and
>>>>>>question its
>>>>>>>> >> >> > >> > usability (esp. merging the epilogue with the main
>>>>>>vector loop).
>>>>>>>> >> >> > >> > But it has already been approved ... oh well.
>>>>>>>> >> >> > >> >
>>>>>>>> >> >> > >> > Thanks,
>>>>>>>> >> >> > >> > Richard.
>>>>>>>> >> >> > >>
>>>>>>>> >> >> > >>
>>>>>>>> >> >> > >
>>>>>>>> >> >> > > --
>>>>>>>> >> >> > > Richard Biener <rguenther@suse.de>
>>>>>>>> >> >> > > SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard,
>>>>>>Graham Norton, HRB 21284 (AG Nuernberg)

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH, vec-tails] Support loop epilogue vectorization
  2016-11-28 16:57                                     ` Yuri Rumyantsev
@ 2016-12-01 11:34                                       ` Richard Biener
  2016-12-01 14:27                                         ` Yuri Rumyantsev
  0 siblings, 1 reply; 38+ messages in thread
From: Richard Biener @ 2016-12-01 11:34 UTC (permalink / raw)
  To: Yuri Rumyantsev; +Cc: Christophe Lyon, Jeff Law, gcc-patches, Ilya Enkovich

On Mon, 28 Nov 2016, Yuri Rumyantsev wrote:

> Richard!
> 
> I attached the vect dump for the part of the attached test-case which
> illustrates how vectorization of epilogues works through masking:
> #define SIZE 1023
> #define ALIGN 64
> 
> extern int posix_memalign(void **memptr, __SIZE_TYPE__ alignment,
> __SIZE_TYPE__ size) __attribute__((weak));
> extern void free (void *);
> 
> void __attribute__((noinline))
> test_citer (int * __restrict__ a,
>    int * __restrict__ b,
>    int * __restrict__ c)
> {
>   int i;
> 
>   a = (int *)__builtin_assume_aligned (a, ALIGN);
>   b = (int *)__builtin_assume_aligned (b, ALIGN);
>   c = (int *)__builtin_assume_aligned (c, ALIGN);
> 
>   for (i = 0; i < SIZE; i++)
>     c[i] = a[i] + b[i];
> }
> 
> It was compiled with -mavx2 --param vect-epilogues-mask=1 options.
> 
> I did not include in this patch vectorization of low trip-count loops
> since in the original patch additional parameter was introduced:
> +DEFPARAM (PARAM_VECT_SHORT_LOOPS,
> +  "vect-short-loops",
> +  "Enable vectorization of low trip count loops using masking.",
> +  0, 0, 1)
> 
> I assume that this ability can be included very quickly but it
> requires cost model enhancements also.

Comments on the patch itself (as I'm having a closer look again,
I know how it vectorizes the above but I wondered why epilogue
and short-trip loops are not basically the same code path).
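
For readers following along, the effect of masking the epilogue of the
quoted test_citer loop can be modelled in plain scalar C.  This is only a
sketch under stated assumptions, not what the patch emits: VF = 8 assumes
eight 32-bit int lanes in a 256-bit AVX2 vector, and the function name
add_with_masked_tail is made up for illustration.

```c
#include <assert.h>

#define SIZE 1023   /* trip count from the quoted test case */
#define VF   8      /* assumed: 8 x 32-bit int lanes per 256-bit vector */

/* Scalar model of c[i] = a[i] + b[i] with a masked vector epilogue.
   Returns the number of vector iterations executed, the masked one
   included.  Hypothetical helper, for illustration only.  */
int
add_with_masked_tail (const int *a, const int *b, int *c)
{
  int i = 0, vec_iters = 0;

  /* Main vector loop: processes full vectors only.  */
  for (; i + VF <= SIZE; i += VF, vec_iters++)
    for (int lane = 0; lane < VF; lane++)
      c[i + lane] = a[i + lane] + b[i + lane];

  /* Masked epilogue: one more vector iteration in which only the lanes
     with i + lane < SIZE are active; inactive lanes are not touched.  */
  if (i < SIZE)
    {
      for (int lane = 0; lane < VF; lane++)
        if (i + lane < SIZE)
          c[i + lane] = a[i + lane] + b[i + lane];
      vec_iters++;
    }

  /* Every scalar iteration is covered by exactly one vector iteration.  */
  assert (vec_iters == (SIZE + VF - 1) / VF);
  return vec_iters;
}
```

With SIZE = 1023 and VF = 8 the main loop covers 1016 elements and the
masked iteration covers the remaining 7, so no scalar tail is left.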

Btw, I don't like that the features are behind a --param paywall.
That just means a) nobody will use it, b) it will bit-rot quickly,
c) bugs are well-hidden.

+  if (loop_vinfo && LOOP_VINFO_CAN_BE_MASKED (loop_vinfo)
+      && integer_zerop (nested_in_vect_loop
+                       ? STMT_VINFO_DR_STEP (stmt_info)
+                       : DR_STEP (dr)))
+    {
+      if (dump_enabled_p ())
+       dump_printf_loc (MSG_NOTE, vect_location,
+                        "allow invariant load for masked loop.\n");
+    }

this can test memory_access_type == VMAT_INVARIANT.  Please put
all the checks in a common

  if (loop_vinfo && LOOP_VINFO_CAN_BE_MASKED (loop_vinfo))
    {
       if (memory_access_type == VMAT_INVARIANT)
         {
         }
       else if (...)
         {
            LOOP_VINFO_CAN_BE_MASKED (loop_vinfo) = false;
         }
       else if (..)
...
    }

@@ -6667,6 +6756,15 @@ vectorizable_load (gimple *stmt, 
gimple_stmt_iterator *gsi, gimple **vec_stmt,
       gcc_assert (!nested_in_vect_loop);
       gcc_assert (!STMT_VINFO_GATHER_SCATTER_P (stmt_info));

+      if (loop_vinfo && LOOP_VINFO_CAN_BE_MASKED (loop_vinfo))
+       {
+         if (dump_enabled_p ())
+           dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+                            "cannot be masked: grouped access is not"
+                            " supported.");
+         LOOP_VINFO_CAN_BE_MASKED (loop_vinfo) = false;
+      }
+

isn't this already handled by the above?  Or rather the general
disallowance of SLP?

@@ -5730,6 +5792,24 @@ vectorizable_store (gimple *stmt, 
gimple_stmt_iterator *gsi, gimple **vec_stmt,
                            &memory_access_type, &gs_info))
     return false;

+  if (loop_vinfo && LOOP_VINFO_CAN_BE_MASKED (loop_vinfo)
+      && memory_access_type != VMAT_CONTIGUOUS)
+    {
+      LOOP_VINFO_CAN_BE_MASKED (loop_vinfo) = false;
+      if (dump_enabled_p ())
+       dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+                        "cannot be masked: unsupported memory access 
type.\n");
+    }
+
+  if (loop_vinfo && LOOP_VINFO_CAN_BE_MASKED (loop_vinfo)
+      && !can_mask_load_store (stmt))
+    {
+      LOOP_VINFO_CAN_BE_MASKED (loop_vinfo) = false;
+      if (dump_enabled_p ())
+       dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+                        "cannot be masked: unsupported masked store.\n");
+    }
+

likewise please combine the ifs.

@@ -2354,7 +2401,10 @@ vectorizable_mask_load_store (gimple *stmt, 
gimple_stmt_iterator *gsi,
                                          ptr, vec_mask, vec_rhs);
          vect_finish_stmt_generation (stmt, new_stmt, gsi);
          if (i == 0)
-           STMT_VINFO_VEC_STMT (stmt_info) = *vec_stmt = new_stmt;
+           {
+             STMT_VINFO_VEC_STMT (stmt_info) = *vec_stmt = new_stmt;
+             STMT_VINFO_FIRST_COPY_P (vinfo_for_stmt (new_stmt)) = true;
+           }
          else
            STMT_VINFO_RELATED_STMT (prev_stmt_info) = new_stmt;
          prev_stmt_info = vinfo_for_stmt (new_stmt);

here you only set the flag, elsewhere you copy DR and VECTYPE as well.

@@ -2113,6 +2146,20 @@ vectorizable_mask_load_store (gimple *stmt, 
gimple_stmt_iterator *gsi,
               && !useless_type_conversion_p (vectype, rhs_vectype)))
     return false;

+  if (LOOP_VINFO_CAN_BE_MASKED (loop_vinfo))
+    {
+      /* Check that mask conjunction is supported.  */
+      optab tab;
+      tab = optab_for_tree_code (BIT_AND_EXPR, vectype, optab_default);
+      if (!tab || optab_handler (tab, TYPE_MODE (vectype)) == 
CODE_FOR_nothing)
+       {
+         if (dump_enabled_p ())
+           dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+                            "cannot be masked: unsupported mask 
operation\n");
+         LOOP_VINFO_CAN_BE_MASKED (loop_vinfo) = false;
+       }
+    }

does this really test whether we can bit-and the mask?  You are
using the vector type of the store (which might be V2DF for example),
also for AVX512 it might be a vector-bool type with integer mode?
Of course we can maybe simply assume mask conjunction is available
(I know of no ISA where that would not be true).

+/* Return true if STMT can be converted to masked form.  */
+
+static bool
+can_mask_load_store (gimple *stmt)
+{
+  stmt_vec_info stmt_info = vinfo_for_stmt (stmt);
+  tree vectype, mask_vectype;
+  tree lhs, ref;
+
+  if (!stmt_info)
+    return false;
+  lhs = gimple_assign_lhs (stmt);
+  ref = (TREE_CODE (lhs) == SSA_NAME) ? gimple_assign_rhs1 (stmt) : lhs;
+  if (may_be_nonaddressable_p (ref))
+    return false;
+  vectype = STMT_VINFO_VECTYPE (stmt_info);

You probably modeled this after ifcvt_can_use_mask_load_store but I
don't think checking may_be_nonaddressable_p is necessary (we couldn't
even vectorize such refs).  stmt_info should never be NULL either.
With the check removed tree-ssa-loop-ivopts.h should no longer be
necessary.

+static void
+vect_mask_load_store_stmt (gimple *stmt, tree vectype, tree mask,
+                          data_reference *dr, gimple_stmt_iterator *si)
+{
...
+  addr = force_gimple_operand_gsi (&gsi, build_fold_addr_expr (mem),
+                                  true, NULL_TREE, true,
+                                  GSI_SAME_STMT);
+
+  align = TYPE_ALIGN_UNIT (vectype);
+  if (aligned_access_p (dr))
+    misalign = 0;
+  else if (DR_MISALIGNMENT (dr) == -1)
+    {
+      align = TYPE_ALIGN_UNIT (elem_type);
+      misalign = 0;
+    }
+  else
+    misalign = DR_MISALIGNMENT (dr);
+  set_ptr_info_alignment (get_ptr_info (addr), align, misalign);
+  ptr = build_int_cst (reference_alias_ptr_type (mem),
+                      misalign ? misalign & -misalign : align);

you should simply use

  align = get_object_alignment (mem) / BITS_PER_UNIT;

here rather than trying to be clever.  Eventually you don't need
the DR then (see question above).

+    }
+  gsi_replace (si ? si : &gsi, new_stmt, false);

when you replace the load/store please previously copy VUSE and VDEF
from the original one (we were nearly clean enough to no longer
require a virtual operand rewrite after vectorization...)  Thus

  gimple_set_vuse (new_stmt, gimple_vuse (stmt));
  gimple_set_vdef (new_stmt, gimple_vdef (stmt));

+static void
+vect_mask_loop (loop_vec_info loop_vinfo)
+{
...
+  /* Scan all loop statements to convert vector load/store including 
masked
+     form.  */
+  for (unsigned i = 0; i < loop->num_nodes; i++)
+    {
+      basic_block bb = bbs[i];
+      for (gimple_stmt_iterator si = gsi_start_bb (bb);
+          !gsi_end_p (si); gsi_next (&si))
+       {
+         gimple *stmt = gsi_stmt (si);
+         stmt_vec_info stmt_info = NULL;
+         tree vectype = NULL;
+         data_reference *dr;
+
+         /* Mask load case.  */
+         if (is_gimple_call (stmt)
+             && gimple_call_internal_p (stmt)
+             && gimple_call_internal_fn (stmt) == IFN_MASK_LOAD
+             && !VECTOR_TYPE_P (TREE_TYPE (gimple_call_arg (stmt, 2))))
+           {
...
+             /* Skip invariant loads.  */
+             if (integer_zerop (nested_in_vect_loop_p (loop, stmt)
+                                ? STMT_VINFO_DR_STEP (stmt_info)
+                                : DR_STEP (STMT_VINFO_DATA_REF 
(stmt_info))))
+               continue;

seeing this it would be nice if stmt_info had a flag for whether
the stmt needs masking (and a flag on whether this is a scalar or a
vectorized stmt).

+         /* Skip hoisted out statements.  */
+         if (!flow_bb_inside_loop_p (loop, gimple_bb (stmt)))
+           continue;

err, you walk stmts in the loop!  Isn't this covered by the above
skipping of 'invariant loads'?

+static gimple *
+vect_mask_reduction_stmt (gimple *stmt, tree mask, gimple *prev)
+{

depending on the reduction operand there are variants that
could get away w/o the VEC_COND_EXPR, like

  S1': tem_4 = d_3 & MASK;
  S2': r_1 = r_2 + tem_4;

which works for plus at least.  More generally doing

  S1': tem_4 = VEC_COND_EXPR<MASK, d_3, neutral operand>
  S2': r_1 = r_2 OP tem_4;

and leaving optimization to & to later opts (& won't work for
AVX512 mask registers I guess).

Good enough for later enhancement of course.
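
The two reduction variants discussed above can be compared in a lane-by-lane
scalar model.  This is a rough sketch only: VF, the function names and the
garbage value placed in inactive lanes are assumptions made for
illustration, and the mask is modelled as all-ones/all-zeros per lane
(which is why the & form would not carry over to AVX512 mask registers).

```c
#include <assert.h>
#include <stddef.h>

#define VF 8   /* assumed number of lanes per vector */

/* General form: tem = VEC_COND_EXPR <MASK, d, neutral>; r = r OP tem.
   For OP = plus the neutral operand is 0.  */
int
sum_with_select (const int *d, size_t n)
{
  int r[VF] = { 0 };
  assert (d != NULL);
  for (size_t base = 0; base < n; base += VF)
    for (size_t lane = 0; lane < VF; lane++)
      {
        int active = base + lane < n;            /* MASK lane */
        int tem = active ? d[base + lane] : 0;   /* S1' */
        r[lane] += tem;                          /* S2' */
      }
  int sum = 0;
  for (size_t lane = 0; lane < VF; lane++)
    sum += r[lane];
  return sum;
}

/* Plus-only shortcut: tem = d & MASK, with MASK lanes all-ones or zero.  */
int
sum_with_and (const int *d, size_t n)
{
  int r[VF] = { 0 };
  assert (d != NULL);
  for (size_t base = 0; base < n; base += VF)
    for (size_t lane = 0; lane < VF; lane++)
      {
        int active = base + lane < n;
        int all_ones = -active;                  /* -1 or 0 */
        /* Inactive lanes hold a don't-care value; use garbage to show
           that the & really clears it.  */
        int loaded = active ? d[base + lane] : 0x5a5a5a5a;
        int tem = loaded & all_ones;             /* S1' */
        r[lane] += tem;                          /* S2' */
      }
  int sum = 0;
  for (size_t lane = 0; lane < VF; lane++)
    sum += r[lane];
  return sum;
}
```

Both functions give the same result for a sum; for other reduction codes
only the select form with the matching neutral operand applies.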

+static void
+vect_gen_ivs_for_masking (loop_vec_info loop_vinfo, vec<tree> *ivs)
+{
...

isn't it enough to always create a single IV and derive the
additional copies by IV + i * { elems, elems, elems ... }?
IVs are expensive -- I'm sure we can optimize the rest of the
scheme further as well but this one looks obvious to me.
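
A small scalar model of the single-IV idea, as a sketch only: ELEMS,
NCOPIES and gen_masks are hypothetical names, and comparing each derived
IV lane against the iteration count is an assumption about how the masks
would be computed from the IVs.

```c
#include <assert.h>

#define ELEMS   4   /* assumed lanes per IV vector */
#define NCOPIES 2   /* assumed number of IV copies the loop needs */

/* Build the masks for one vector iteration whose first scalar index is
   IV_BASE, using a single IV {iv_base, iv_base + 1, ...} and deriving
   copy number COPY as IV + COPY * {ELEMS, ELEMS, ...}.  */
void
gen_masks (unsigned iv_base, unsigned niters, int mask[NCOPIES][ELEMS])
{
  unsigned iv[ELEMS];

  assert (niters > 0);

  /* The only real induction variable.  */
  for (unsigned lane = 0; lane < ELEMS; lane++)
    iv[lane] = iv_base + lane;

  /* Derived copies: one vector add per copy instead of one IV each.  */
  for (unsigned copy = 0; copy < NCOPIES; copy++)
    for (unsigned lane = 0; lane < ELEMS; lane++)
      {
        unsigned derived = iv[lane] + copy * ELEMS;
        mask[copy][lane] = derived < niters;
      }
}
```

For iv_base = 8 and niters = 13 the first copy covers indices 8..11 (all
active) and the second covers 12..15 (only the first lane active).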

@@ -3225,12 +3508,32 @@ vect_estimate_min_profitable_iters (loop_vec_info 
loop_vinfo,
   int npeel = LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo);
   void *target_cost_data = LOOP_VINFO_TARGET_COST_DATA (loop_vinfo);

+  if (LOOP_VINFO_EPILOGUE_P (loop_vinfo))
+    {
+      /* Currently we don't produce scalar epilogue version in case
+        its masked version is provided.  It means we don't need to
+        compute profitability one more time here.  Just make a
+        masked loop version.  */
+      if (LOOP_VINFO_CAN_BE_MASKED (loop_vinfo)
+         && PARAM_VALUE (PARAM_VECT_EPILOGUES_MASK))
+       {
+         dump_printf_loc (MSG_NOTE, vect_location,
+                          "cost model: mask loop epilogue.\n");
+         LOOP_VINFO_MASK_LOOP (loop_vinfo) = true;
+         *ret_min_profitable_niters = 0;
+         *ret_min_profitable_estimate = 0;
+         return;
+       }
+    }
   /* Cost model disabled.  */
-  if (unlimited_cost_model (LOOP_VINFO_LOOP (loop_vinfo)))
+  else if (unlimited_cost_model (LOOP_VINFO_LOOP (loop_vinfo)))
     {
       dump_printf_loc (MSG_NOTE, vect_location, "cost model 
disabled.\n");
       *ret_min_profitable_niters = 0;
       *ret_min_profitable_estimate = 0;
+      if (PARAM_VALUE (PARAM_VECT_EPILOGUES_MASK)
+         && LOOP_VINFO_CAN_BE_MASKED (loop_vinfo))
+       LOOP_VINFO_MASK_LOOP (loop_vinfo) = true;
       return;
     }

the unlimited_cost_model case should come first?  OTOH masking or
not is probably not sth covered by 'unlimited' - that is about
vectorizing or not.  But the above code means that for
epilogue vectorization w/o masking we ignore unlimited_cost_model ()?
That doesn't make sense to me.

Plus if this is short-trip or epilogue vectorization and the
cost model is _not_ unlimited then we don't want to enable
masking always (if it is possible).  It might be we statically
know the epilogue executes for at most two iterations for example.

I don't see _any_ cost model for vectorizing the epilogue with
masking?  Am I missing something?  A "trivial" cost model
should at least consider the additional IV(s), the mask
compute and the widening and narrowing ops required.

diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c
index e13d6a2..36be342 100644
--- a/gcc/tree-vect-loop-manip.c
+++ b/gcc/tree-vect-loop-manip.c
@@ -1635,6 +1635,13 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree 
niters, tree nitersm1,
   bool epilog_peeling = (LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo)
                         || LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo));

+  if (LOOP_VINFO_EPILOGUE_P (loop_vinfo))
+    {
+      prolog_peeling = false;
+      if (LOOP_VINFO_MASK_LOOP (loop_vinfo))
+       epilog_peeling = false;
+    }
+
   if (!prolog_peeling && !epilog_peeling)
     return NULL;

I think the prolog_peeling was fixed during the epilogue vectorization 
review and should no longer be necessary.  Please add
a && ! LOOP_VINFO_MASK_LOOP () to the epilog_peeling init instead
(it should also work for short-trip loop vectorization).

@@ -2022,11 +2291,18 @@ start_over:
       || (max_niter != -1
          && (unsigned HOST_WIDE_INT) max_niter < vectorization_factor))
     {
-      if (dump_enabled_p ())
-       dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
-                        "not vectorized: iteration count smaller than "
-                        "vectorization factor.\n");
-      return false;
+      /* Allow low trip count for loop epilogue we want to mask.  */
+      if (LOOP_VINFO_EPILOGUE_P (loop_vinfo)
+         && PARAM_VALUE (PARAM_VECT_EPILOGUES_MASK))
+       LOOP_VINFO_MASK_LOOP (loop_vinfo) = true;
+      else
+       {
+         if (dump_enabled_p ())

so why do we test only LOOP_VINFO_EPILOGUE_P here?  All the code
I saw so far would also work for the main loop (but the cost
model is missing).

I am missing testcases.  There's only a single one but we should
have cases covering all kinds of mask IV widths and widen/shorten
masks.

Do you have any numbers on SPEC 2k6 with epilogue vect and/or masking
enabled for an AVX2 machine?

Oh, and I really dislike the --param paywall.

Thanks,
Richard.

> Best regards.
> Yuri.
> 
> 
> 2016-11-28 17:39 GMT+03:00 Richard Biener <rguenther@suse.de>:
> > On Thu, 24 Nov 2016, Yuri Rumyantsev wrote:
> >
> >> Hi All,
> >>
> >> Here is the second patch which supports epilogue vectorization using
> >> masking without cost model. Currently it is possible
> >> only with passing parameter "--param vect-epilogues-mask=1".
> >>
> >> Bootstrapping and regression testing did not show any new regression.
> >>
> >> Any comments will be appreciated.
> >
> > Going over the patch the main question is one how it works -- it looks
> > like the decision whether to vectorize & mask the epilogue is made
> > when vectorizing the loop that generates the epilogue rather than
> > in the epilogue vectorization path?
> >
> > That is, I'd have expected to see this handling low-trip count loops
> > by masking?  And thus masking the epilogue simply by it being
> > low-trip count?
> >
> > Richard.
> >
> >> ChangeLog:
> >> 2016-11-24  Yuri Rumyantsev  <ysrumyan@gmail.com>
> >>
> >> * params.def (PARAM_VECT_EPILOGUES_MASK): New.
> >> * tree-vect-data-refs.c (vect_get_new_ssa_name): Support vect_mask_var.
> >> * tree-vect-loop.c: Include insn-config.h, recog.h and alias.h.
> >> (new_loop_vec_info): Add zeroing can_be_masked, mask_loop and
> >> required_mask fields.
> >> (vect_check_required_masks_widening): New.
> >> (vect_check_required_masks_narrowing): New.
> >> (vect_get_masking_iv_elems): New.
> >> (vect_get_masking_iv_type): New.
> >> (vect_get_extreme_masks): New.
> >> (vect_check_required_masks): New.
> >> (vect_analyze_loop_operations): Call vect_check_required_masks if all
> >> statements can be masked.
> >> (vect_analyze_loop_2): Initialize to zero min_scalar_loop_bound.
> >> Add check that epilogue can be masked with the same vf with issue
> >> fail notes.  Allow epilogue vectorization through masking of low trip
> >> loops. Set to true can_be_masked field before loop operation analysis.
> >> Do not set-up min_scalar_loop_bound for epilogue vectorization through
> >> masking.  Do not peel for epilogue masking.  Reset can_be_masked
> >> field before repeat analysis.
> >> (vect_estimate_min_profitable_iters): Do not compute profitability
> >> for epilogue masking.  Set up mask_loop field to true if parameter
> >> PARAM_VECT_EPILOGUES_MASK is non-zero.
> >> (vectorizable_reduction): Add check that statement can be masked.
> >> (vectorizable_induction): Do not support masking for induction.
> >> (vect_gen_ivs_for_masking): New.
> >> (vect_get_mask_index_for_elems): New.
> >> (vect_get_mask_index_for_type): New.
> >> (vect_create_narrowed_masks): New.
> >> (vect_create_widened_masks): New.
> >> (vect_gen_loop_masks): New.
> >> (vect_mask_reduction_stmt): New.
> >> (vect_mask_mask_load_store_stmt): New.
> >> (vect_mask_load_store_stmt): New.
> >> (vect_mask_loop): New.
> >> (vect_transform_loop): Invoke vect_mask_loop if required.
> >> Use div_ceil to recompute upper bounds for masked loops.  Issue
> >> statistics for epilogue vectorization through masking. Do not reduce
> >> vf for masking epilogue.
> >> * tree-vect-stmts.c: Include tree-ssa-loop-ivopts.h.
> >> (can_mask_load_store): New.
> >> (vectorizable_mask_load_store): Check that mask conjunction is
> >> supported.  Set-up first_copy_p field of stmt_vinfo.
> >> (vectorizable_simd_clone_call): Check that simd clone can not be
> >> masked.
> >> (vectorizable_store): Check that store can be masked. Mark the first
> >> copy of generated vector stores and provide it with vectype and the
> >> original data reference.
> >> (vectorizable_load): Check that load can be masked.
> >> (vect_stmt_should_be_masked_for_epilogue): New.
> >> (vect_add_required_mask_for_stmt): New.
> >> (vect_analyze_stmt): Add check on unsupported statements for masking
> >> with printing message.
> >> * tree-vectorizer.h (struct _loop_vec_info): Add new fields
> >> can_be_masked, required_masks, mask_loop.
> >> (LOOP_VINFO_CAN_BE_MASKED): New.
> >> (LOOP_VINFO_REQUIRED_MASKS): New.
> >> (LOOP_VINFO_MASK_LOOP): New.
> >> (struct _stmt_vec_info): Add first_copy_p field.
> >> (STMT_VINFO_FIRST_COPY_P): New.
> >>
> >> gcc/testsuite/
> >>
> >> * gcc.dg/vect/vect-tail-mask-1.c: New test.
> >>
> >> 2016-11-18 18:54 GMT+03:00 Christophe Lyon <christophe.lyon@linaro.org>:
> >> > On 18 November 2016 at 16:46, Yuri Rumyantsev <ysrumyan@gmail.com> wrote:
> >> >> It is very strange that this test failed on arm, since it requires
> >> >> target avx2 to check vectorizer dumps:
> >> >>
> >> >> /* { dg-final { scan-tree-dump-times "LOOP VECTORIZED" 2 "vect" {
> >> >> target avx2_runtime } } } */
> >> >> /* { dg-final { scan-tree-dump-times "LOOP EPILOGUE VECTORIZED
> >> >> \\(VS=16\\)" 2 "vect" { target avx2_runtime } } } */
> >> >>
> >> >> Could you please clarify what is the reason of the failure?
> >> >
> >> > It's not the scan-dumps that fail, but the execution.
> >> > The test calls abort() for some reason.
> >> >
> >> > It will take me a while to rebuild the test manually in the right
> >> > debug environment to provide you with more traces.
> >> >
> >> >
> >> >
> >> >>
> >> >> Thanks.
> >> >>
> >> >> 2016-11-18 16:20 GMT+03:00 Christophe Lyon <christophe.lyon@linaro.org>:
> >> >>> On 15 November 2016 at 15:41, Yuri Rumyantsev <ysrumyan@gmail.com> wrote:
> >> >>>> Hi All,
> >> >>>>
> >> >>>> Here is the patch for non-masked epilogue vectorization.
> >> >>>>
> >> >>>> Bootstrap and regression testing did not show any new failures.
> >> >>>>
> >> >>>> Is it OK for trunk?
> >> >>>>
> >> >>>> Thanks.
> >> >>>> Changelog:
> >> >>>>
> >> >>>> 2016-11-15  Yuri Rumyantsev  <ysrumyan@gmail.com>
> >> >>>>
> >> >>>> * params.def (PARAM_VECT_EPILOGUES_NOMASK): New.
> >> >>>> * tree-if-conv.c (tree_if_conversion): Make public.
> >> >>>> * tree-if-conv.h: New file.
> >> >>>> * tree-vect-data-refs.c (vect_analyze_data_ref_dependences): Avoid
> >> >>>> dynamic alias checks for epilogues.
> >> >>>> * tree-vect-loop-manip.c (vect_do_peeling): Return created epilog.
> >> >>>> * tree-vect-loop.c: Include tree-if-conv.h.
> >> >>>> (new_loop_vec_info): Add zeroing orig_loop_info field.
> >> >>>> (vect_analyze_loop_2): Don't try to enhance alignment for epilogues.
> >> >>>> (vect_analyze_loop): Add argument ORIG_LOOP_INFO which is not NULL
> >> >>>> if epilogue is vectorized, set up orig_loop_info field of loop_vinfo
> >> >>>> using passed argument.
> >> >>>> (vect_transform_loop): Check if created epilogue should be returned
> >> >>>> for further vectorization with less vf.  If-convert epilogue if
> >> >>>> required. Print vectorization success for epilogue.
> >> >>>> * tree-vectorizer.c (vectorize_loops): Add epilogue vectorization
> >> >>>> if it is required, pass loop_vinfo produced during vectorization of
> >> >>>> loop body to vect_analyze_loop.
> >> >>>> * tree-vectorizer.h (struct _loop_vec_info): Add new field
> >> >>>> orig_loop_info.
> >> >>>> (LOOP_VINFO_ORIG_LOOP_INFO): New.
> >> >>>> (LOOP_VINFO_EPILOGUE_P): New.
> >> >>>> (LOOP_VINFO_ORIG_VECT_FACTOR): New.
> >> >>>> (vect_do_peeling): Change prototype to return epilogue.
> >> >>>> (vect_analyze_loop): Add argument of loop_vec_info type.
> >> >>>> (vect_transform_loop): Return created loop.
> >> >>>>
> >> >>>> gcc/testsuite/
> >> >>>>
> >> >>>> * lib/target-supports.exp (check_avx2_hw_available): New.
> >> >>>> (check_effective_target_avx2_runtime): New.
> >> >>>> * gcc.dg/vect/vect-tail-nomask-1.c: New test.
> >> >>>>
> >> >>>
> >> >>> Hi,
> >> >>>
> >> >>> This new test fails on arm-none-eabi (using default cpu/fpu/mode):
> >> >>>   gcc.dg/vect/vect-tail-nomask-1.c -flto -ffat-lto-objects execution test
> >> >>>   gcc.dg/vect/vect-tail-nomask-1.c execution test
> >> >>>
> >> >>> It does pass on the same target if configured --with-cpu=cortex-a9.
> >> >>>
> >> >>> Christophe
> >> >>>
> >> >>>
> >> >>>
> >> >>>>
> >> >>>> 2016-11-14 20:04 GMT+03:00 Richard Biener <rguenther@suse.de>:
> >> >>>>> On November 14, 2016 4:39:40 PM GMT+01:00, Yuri Rumyantsev <ysrumyan@gmail.com> wrote:
> >> >>>>>>Richard,
> >> >>>>>>
> >> >>>>>>I checked one of the tests designed for epilogue vectorization using
> >> >>>>>>patches 1 - 3 and found out that the built compiler performs vectorization
> >> >>>>>>of epilogues with --param vect-epilogues-nomask=1 passed:
> >> >>>>>>
> >> >>>>>>$ gcc -Ofast -mavx2 t1.c -S --param vect-epilogues-nomask=1 -o
> >> >>>>>>t1.new-nomask.s -fdump-tree-vect-details
> >> >>>>>>$ grep VECTORIZED -c t1.c.156t.vect
> >> >>>>>>4
> >> >>>>>> Without param only 2 loops are vectorized.
> >> >>>>>>
> >> >>>>>>Should I simply add a part of tests related to this feature or I must
> >> >>>>>>delete all not necessary changes also?
> >> >>>>>
> >> >>>>> Please remove all not necessary changes.
> >> >>>>>
> >> >>>>> Richard.
> >> >>>>>
> >> >>>>>>Thanks.
> >> >>>>>>Yuri.
> >> >>>>>>
> >> >>>>>>2016-11-14 16:40 GMT+03:00 Richard Biener <rguenther@suse.de>:
> >> >>>>>>> On Mon, 14 Nov 2016, Yuri Rumyantsev wrote:
> >> >>>>>>>
> >> >>>>>>>> Richard,
> >> >>>>>>>>
> >> >>>>>>>> In my previous patch I forgot to remove couple lines related to aux
> >> >>>>>>field.
> >> >>>>>>>> Here is the correct updated patch.
> >> >>>>>>>
> >> >>>>>>> Yeah, I noticed.  This patch would be ok for trunk (together with
> >> >>>>>>> necessary parts from 1 and 2) if all not required parts are removed
> >> >>>>>>> (and you'd add the testcases covering non-masked tail vect).
> >> >>>>>>>
> >> >>>>>>> Thus, can you please produce a single complete patch containing only
> >> >>>>>>> non-masked epilogue vectorization?
> >> >>>>>>>
> >> >>>>>>> Thanks,
> >> >>>>>>> Richard.
> >> >>>>>>>
> >> >>>>>>>> Thanks.
> >> >>>>>>>> Yuri.
> >> >>>>>>>>
> >> >>>>>>>> 2016-11-14 15:51 GMT+03:00 Richard Biener <rguenther@suse.de>:
> >> >>>>>>>> > On Fri, 11 Nov 2016, Yuri Rumyantsev wrote:
> >> >>>>>>>> >
> >> >>>>>>>> >> Richard,
> >> >>>>>>>> >>
> >> >>>>>>>> >> I prepare updated 3 patch with passing additional argument to
> >> >>>>>>>> >> vect_analyze_loop as you proposed (untested).
> >> >>>>>>>> >>
> >> >>>>>>>> >> You wrote:
> >> >>>>>>>> >> Btw, I wonder if you can produce a single patch containing just
> >> >>>>>>>> >> epilogue vectorization, that is combine patches 1-3 but rip out
> >> >>>>>>>> >> changes only needed by later patches?
> >> >>>>>>>> >>
> >> >>>>>>>> >> Did you mean that I exclude all support for vectorization
> >> >>>>>>epilogues,
> >> >>>>>>>> >> i.e. exclude from 2-nd patch all non-related changes
> >> >>>>>>>> >> like
> >> >>>>>>>> >>
> >> >>>>>>>> >> diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
> >> >>>>>>>> >> index 11863af..32011c1 100644
> >> >>>>>>>> >> --- a/gcc/tree-vect-loop.c
> >> >>>>>>>> >> +++ b/gcc/tree-vect-loop.c
> >> >>>>>>>> >> @@ -1120,6 +1120,12 @@ new_loop_vec_info (struct loop *loop)
> >> >>>>>>>> >>    LOOP_VINFO_PEELING_FOR_GAPS (res) = false;
> >> >>>>>>>> >>    LOOP_VINFO_PEELING_FOR_NITER (res) = false;
> >> >>>>>>>> >>    LOOP_VINFO_OPERANDS_SWAPPED (res) = false;
> >> >>>>>>>> >> +  LOOP_VINFO_CAN_BE_MASKED (res) = false;
> >> >>>>>>>> >> +  LOOP_VINFO_REQUIRED_MASKS (res) = 0;
> >> >>>>>>>> >> +  LOOP_VINFO_COMBINE_EPILOGUE (res) = false;
> >> >>>>>>>> >> +  LOOP_VINFO_MASK_EPILOGUE (res) = false;
> >> >>>>>>>> >> +  LOOP_VINFO_NEED_MASKING (res) = false;
> >> >>>>>>>> >> +  LOOP_VINFO_ORIG_LOOP_INFO (res) = NULL;
> >> >>>>>>>> >
> >> >>>>>>>> > Yes.
> >> >>>>>>>> >
> >> >>>>>>>> >> Did you mean also that new combined patch must be working patch,
> >> >>>>>>i.e.
> >> >>>>>>>> >> can be integrated without other patches?
> >> >>>>>>>> >
> >> >>>>>>>> > Yes.
> >> >>>>>>>> >
> >> >>>>>>>> >> Could you please look at updated patch?
> >> >>>>>>>> >
> >> >>>>>>>> > Will do.
> >> >>>>>>>> >
> >> >>>>>>>> > Thanks,
> >> >>>>>>>> > Richard.
> >> >>>>>>>> >
> >> >>>>>>>> >> Thanks.
> >> >>>>>>>> >> Yuri.
> >> >>>>>>>> >>
> >> >>>>>>>> >> 2016-11-10 15:36 GMT+03:00 Richard Biener <rguenther@suse.de>:
> >> >>>>>>>> >> > On Thu, 10 Nov 2016, Richard Biener wrote:
> >> >>>>>>>> >> >
> >> >>>>>>>> >> >> On Tue, 8 Nov 2016, Yuri Rumyantsev wrote:
> >> >>>>>>>> >> >>
> >> >>>>>>>> >> >> > Richard,
> >> >>>>>>>> >> >> >
> >> >>>>>>>> >> >> > Here is updated 3 patch.
> >> >>>>>>>> >> >> >
> >> >>>>>>>> >> >> > I checked that all new tests related to epilogue
> >> >>>>>>vectorization passed with it.
> >> >>>>>>>> >> >> >
> >> >>>>>>>> >> >> > Your comments will be appreciated.
> >> >>>>>>>> >> >>
> >> >>>>>>>> >> >> A lot better now.  Instead of the ->aux dance I now prefer to
> >> >>>>>>>> >> >> pass the original loops loop_vinfo to vect_analyze_loop as
> >> >>>>>>>> >> >> optional argument (if non-NULL we analyze the epilogue of that
> >> >>>>>>>> >> >> loop_vinfo).  OTOH I remember we mainly use it to get at the
> >> >>>>>>>> >> >> original vectorization factor?  So we can pass down an
> >> >>>>>>(optional)
> >> >>>>>>>> >> >> forced vectorization factor as well?
> >> >>>>>>>> >> >
> >> >>>>>>>> >> > Btw, I wonder if you can produce a single patch containing just
> >> >>>>>>>> >> > epilogue vectorization, that is combine patches 1-3 but rip out
> >> >>>>>>>> >> > changes only needed by later patches?
> >> >>>>>>>> >> >
> >> >>>>>>>> >> > Thanks,
> >> >>>>>>>> >> > Richard.
> >> >>>>>>>> >> >
> >> >>>>>>>> >> >> Richard.
> >> >>>>>>>> >> >>
> >> >>>>>>>> >> >> > 2016-11-08 15:38 GMT+03:00 Richard Biener
> >> >>>>>><rguenther@suse.de>:
> >> >>>>>>>> >> >> > > On Thu, 3 Nov 2016, Yuri Rumyantsev wrote:
> >> >>>>>>>> >> >> > >
> >> >>>>>>>> >> >> > >> Hi Richard,
> >> >>>>>>>> >> >> > >>
> >> >>>>>>>> >> >> > >> I did not understand your last remark:
> >> >>>>>>>> >> >> > >>
> >> >>>>>>>> >> >> > >> > That is, here (and avoid the FOR_EACH_LOOP change):
> >> >>>>>>>> >> >> > >> >
> >> >>>>>>>> >> >> > >> > @@ -580,12 +586,21 @@ vectorize_loops (void)
> >> >>>>>>>> >> >> > >> >           && dump_enabled_p ())
> >> >>>>>>>> >> >> > >> >           dump_printf_loc (MSG_OPTIMIZED_LOCATIONS,
> >> >>>>>>vect_location,
> >> >>>>>>>> >> >> > >> >                            "loop vectorized\n");
> >> >>>>>>>> >> >> > >> > -       vect_transform_loop (loop_vinfo);
> >> >>>>>>>> >> >> > >> > +       new_loop = vect_transform_loop (loop_vinfo);
> >> >>>>>>>> >> >> > >> >         num_vectorized_loops++;
> >> >>>>>>>> >> >> > >> >        /* Now that the loop has been vectorized, allow
> >> >>>>>>it to be unrolled
> >> >>>>>>>> >> >> > >> >           etc.  */
> >> >>>>>>>> >> >> > >> >      loop->force_vectorize = false;
> >> >>>>>>>> >> >> > >> >
> >> >>>>>>>> >> >> > >> > +       /* Add new loop to a processing queue.  To make
> >> >>>>>>it easier
> >> >>>>>>>> >> >> > >> > +          to match loop and its epilogue vectorization
> >> >>>>>>in dumps
> >> >>>>>>>> >> >> > >> > +          put new loop as the next loop to process.
> >> >>>>>>*/
> >> >>>>>>>> >> >> > >> > +       if (new_loop)
> >> >>>>>>>> >> >> > >> > +         {
> >> >>>>>>>> >> >> > >> > +           loops.safe_insert (i + 1, new_loop->num);
> >> >>>>>>>> >> >> > >> > +           vect_loops_num = number_of_loops (cfun);
> >> >>>>>>>> >> >> > >> > +         }
> >> >>>>>>>> >> >> > >> >
> >> >>>>>>>> >> >> > >> > simply dispatch to a vectorize_epilogue (loop_vinfo,
> >> >>>>>>new_loop)
> >> >>>>>>>> >> >> > >> > function which will set up stuff properly (and also
> >> >>>>>>perform
> >> >>>>>>>> >> >> > >> > the if-conversion of the epilogue there).
> >> >>>>>>>> >> >> > >> >
> >> >>>>>>>> >> >> > >> > That said, if we can get in non-masked epilogue
> >> >>>>>>vectorization
> >> >>>>>>>> >> >> > >> > separately that would be great.
> >> >>>>>>>> >> >> > >>
> >> >>>>>>>> >> >> > >> Could you please clarify your proposal.
> >> >>>>>>>> >> >> > >
> >> >>>>>>>> >> >> > > When a loop was vectorized set things up to immediately
> >> >>>>>>vectorize
> >> >>>>>>>> >> >> > > its epilogue, avoiding changing the loop iteration and
> >> >>>>>>avoiding
> >> >>>>>>>> >> >> > > the re-use of ->aux.
> >> >>>>>>>> >> >> > >
> >> >>>>>>>> >> >> > > Richard.
> >> >>>>>>>> >> >> > >
> >> >>>>>>>> >> >> > >> Thanks.
> >> >>>>>>>> >> >> > >> Yuri.
> >> >>>>>>>> >> >> > >>
> >> >>>>>>>> >> >> > >> 2016-11-02 15:27 GMT+03:00 Richard Biener
> >> >>>>>><rguenther@suse.de>:
> >> >>>>>>>> >> >> > >> > On Tue, 1 Nov 2016, Yuri Rumyantsev wrote:
> >> >>>>>>>> >> >> > >> >
> >> >>>>>>>> >> >> > >> >> Hi All,
> >> >>>>>>>> >> >> > >> >>
> >> >>>>>>>> >> >> > >> >> I re-send all patches sent by Ilya earlier for review
> >> >>>>>>which support
> >> >>>>>>>> >> >> > >> >> vectorization of loop epilogues and loops with low
> >> >>>>>>trip count. We
> >> >>>>>>>> >> >> > >> >> assume that the only patch -
> >> >>>>>>vec-tails-07-combine-tail.patch - was not
> >> >>>>>>>> >> >> > >> >> approved by Jeff.
> >> >>>>>>>> >> >> > >> >>
> >> >>>>>>>> >> >> > >> >> I did re-base of all patches and performed
> >> >>>>>>bootstrapping and
> >> >>>>>>>> >> >> > >> >> regression testing that did not show any new failures.
> >> >>>>>>Also all
> >> >>>>>>>> >> >> > >> >> changes related to new vect_do_peeling algorithm have
> >> >>>>>>been changed
> >> >>>>>>>> >> >> > >> >> accordingly.
> >> >>>>>>>> >> >> > >> >>
> >> >>>>>>>> >> >> > >> >> Is it OK for trunk?
> >> >>>>>>>> >> >> > >> >
> >> >>>>>>>> >> >> > >> > I would have preferred that the series up to
> >> >>>>>>-03-nomask-tails would
> >> >>>>>>>> >> >> > >> > _only_ contain epilogue loop vectorization changes but
> >> >>>>>>unfortunately
> >> >>>>>>>> >> >> > >> > the patchset is oddly separated.
> >> >>>>>>>> >> >> > >> >
> >> >>>>>>>> >> >> > >> > I have a comment on that part nevertheless:
> >> >>>>>>>> >> >> > >> >
> >> >>>>>>>> >> >> > >> > @@ -1608,7 +1614,10 @@ vect_enhance_data_refs_alignment
> >> >>>>>>(loop_vec_info
> >> >>>>>>>> >> >> > >> > loop_vinfo)
> >> >>>>>>>> >> >> > >> >    /* Check if we can possibly peel the loop.  */
> >> >>>>>>>> >> >> > >> >    if (!vect_can_advance_ivs_p (loop_vinfo)
> >> >>>>>>>> >> >> > >> >        || !slpeel_can_duplicate_loop_p (loop,
> >> >>>>>>single_exit (loop))
> >> >>>>>>>> >> >> > >> > -      || loop->inner)
> >> >>>>>>>> >> >> > >> > +      || loop->inner
> >> >>>>>>>> >> >> > >> > +      /* Required peeling was performed in prologue
> >> >>>>>>and
> >> >>>>>>>> >> >> > >> > +        is not required for epilogue.  */
> >> >>>>>>>> >> >> > >> > +      || LOOP_VINFO_EPILOGUE_P (loop_vinfo))
> >> >>>>>>>> >> >> > >> >      do_peeling = false;
> >> >>>>>>>> >> >> > >> >
> >> >>>>>>>> >> >> > >> >    if (do_peeling
> >> >>>>>>>> >> >> > >> > @@ -1888,7 +1897,10 @@ vect_enhance_data_refs_alignment
> >> >>>>>>(loop_vec_info
> >> >>>>>>>> >> >> > >> > loop_vinfo)
> >> >>>>>>>> >> >> > >> >
> >> >>>>>>>> >> >> > >> >    do_versioning =
> >> >>>>>>>> >> >> > >> >         optimize_loop_nest_for_speed_p (loop)
> >> >>>>>>>> >> >> > >> > -       && (!loop->inner); /* FORNOW */
> >> >>>>>>>> >> >> > >> > +       && (!loop->inner) /* FORNOW */
> >> >>>>>>>> >> >> > >> > +        /* Required versioning was performed for the
> >> >>>>>>>> >> >> > >> > +          original loop and is not required for
> >> >>>>>>epilogue.  */
> >> >>>>>>>> >> >> > >> > +       && !LOOP_VINFO_EPILOGUE_P (loop_vinfo);
> >> >>>>>>>> >> >> > >> >
> >> >>>>>>>> >> >> > >> >    if (do_versioning)
> >> >>>>>>>> >> >> > >> >      {
> >> >>>>>>>> >> >> > >> >
> >> >>>>>>>> >> >> > >> > please do that check in the single caller of this
> >> >>>>>>function.
> >> >>>>>>>> >> >> > >> >
> >> >>>>>>>> >> >> > >> > Otherwise I still dislike the new ->aux use and I
> >> >>>>>>believe that simply
> >> >>>>>>>> >> >> > >> > passing down info from the processed parent would be
> >> >>>>>>_much_ cleaner.
> >> >>>>>>>> >> >> > >> > That is, here (and avoid the FOR_EACH_LOOP change):
> >> >>>>>>>> >> >> > >> >
> >> >>>>>>>> >> >> > >> > @@ -580,12 +586,21 @@ vectorize_loops (void)
> >> >>>>>>>> >> >> > >> >             && dump_enabled_p ())
> >> >>>>>>>> >> >> > >> >            dump_printf_loc (MSG_OPTIMIZED_LOCATIONS,
> >> >>>>>>vect_location,
> >> >>>>>>>> >> >> > >> >                             "loop vectorized\n");
> >> >>>>>>>> >> >> > >> > -       vect_transform_loop (loop_vinfo);
> >> >>>>>>>> >> >> > >> > +       new_loop = vect_transform_loop (loop_vinfo);
> >> >>>>>>>> >> >> > >> >         num_vectorized_loops++;
> >> >>>>>>>> >> >> > >> >         /* Now that the loop has been vectorized, allow
> >> >>>>>>it to be unrolled
> >> >>>>>>>> >> >> > >> >            etc.  */
> >> >>>>>>>> >> >> > >> >         loop->force_vectorize = false;
> >> >>>>>>>> >> >> > >> >
> >> >>>>>>>> >> >> > >> > +       /* Add new loop to a processing queue.  To make
> >> >>>>>>it easier
> >> >>>>>>>> >> >> > >> > +          to match loop and its epilogue vectorization
> >> >>>>>>in dumps
> >> >>>>>>>> >> >> > >> > +          put new loop as the next loop to process.
> >> >>>>>>*/
> >> >>>>>>>> >> >> > >> > +       if (new_loop)
> >> >>>>>>>> >> >> > >> > +         {
> >> >>>>>>>> >> >> > >> > +           loops.safe_insert (i + 1, new_loop->num);
> >> >>>>>>>> >> >> > >> > +           vect_loops_num = number_of_loops (cfun);
> >> >>>>>>>> >> >> > >> > +         }
> >> >>>>>>>> >> >> > >> >
> >> >>>>>>>> >> >> > >> > simply dispatch to a vectorize_epilogue (loop_vinfo,
> >> >>>>>>new_loop)
> >> >>>>>>>> >> >> > >> > function which will set up stuff properly (and also
> >> >>>>>>perform
> >> >>>>>>>> >> >> > >> > the if-conversion of the epilogue there).
> >> >>>>>>>> >> >> > >> >
> >> >>>>>>>> >> >> > >> > That said, if we can get in non-masked epilogue
> >> >>>>>>vectorization
> >> >>>>>>>> >> >> > >> > separately that would be great.
> >> >>>>>>>> >> >> > >> >
> >> >>>>>>>> >> >> > >> > I'm still torn about all the rest of the stuff and
> >> >>>>>>question its
> >> >>>>>>>> >> >> > >> > usability (esp. merging the epilogue with the main
> >> >>>>>>vector loop).
> >> >>>>>>>> >> >> > >> > But it has already been approved ... oh well.
> >> >>>>>>>> >> >> > >> >
> >> >>>>>>>> >> >> > >> > Thanks,
> >> >>>>>>>> >> >> > >> > Richard.

-- 
Richard Biener <rguenther@suse.de>
SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH, vec-tails] Support loop epilogue vectorization
  2016-12-01 11:34                                       ` Richard Biener
@ 2016-12-01 14:27                                         ` Yuri Rumyantsev
  2016-12-01 14:46                                           ` Richard Biener
  0 siblings, 1 reply; 38+ messages in thread
From: Yuri Rumyantsev @ 2016-12-01 14:27 UTC (permalink / raw)
  To: Richard Biener; +Cc: Christophe Lyon, Jeff Law, gcc-patches, Ilya Enkovich

Thanks Richard for your comments.

You asked about possible performance improvements on AVX2 machines.
We did not see any visible speed-up on spec2k with any method of
masking, including epilogue masking and combining; we saw one only on
an AVX512 machine (KNL).

I will answer your question later.

Best regards.
Yuri

2016-12-01 14:33 GMT+03:00 Richard Biener <rguenther@suse.de>:
> On Mon, 28 Nov 2016, Yuri Rumyantsev wrote:
>
>> Richard!
>>
>> I attached the vect dump for the part of the attached test-case which
>> illustrates how vectorization of epilogues works through masking:
>> #define SIZE 1023
>> #define ALIGN 64
>>
>> extern int posix_memalign(void **memptr, __SIZE_TYPE__ alignment,
>> __SIZE_TYPE__ size) __attribute__((weak));
>> extern void free (void *);
>>
>> void __attribute__((noinline))
>> test_citer (int * __restrict__ a,
>>    int * __restrict__ b,
>>    int * __restrict__ c)
>> {
>>   int i;
>>
>>   a = (int *)__builtin_assume_aligned (a, ALIGN);
>>   b = (int *)__builtin_assume_aligned (b, ALIGN);
>>   c = (int *)__builtin_assume_aligned (c, ALIGN);
>>
>>   for (i = 0; i < SIZE; i++)
>>     c[i] = a[i] + b[i];
>> }
>>
>> It was compiled with -mavx2 --param vect-epilogues-mask=1 options.
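As a rough illustration of what a masked epilogue means for the test case above, here is a scalar emulation in C. It is only a sketch under stated assumptions: VF is assumed to be 8 (eight ints per 256-bit AVX2 vector), and the helper names masked_vec_add and add_with_masked_epilogue are hypothetical and do not come from the patch. The real vectorizer emits masked vector loads and stores, not a per-lane loop.

```c
#include <assert.h>

#define VF 8  /* assumed vectorization factor: 8 ints per 256-bit vector */

/* Emulate one vector iteration of c[i] = a[i] + b[i] starting at START.
   Lane L is enabled only if START + L < N, which is the condition the
   epilogue mask encodes.  Returns the number of enabled lanes.  */
static int
masked_vec_add (int *c, const int *a, const int *b, int start, int n)
{
  int enabled = 0;
  for (int lane = 0; lane < VF; lane++)
    if (start + lane < n)
      {
        c[start + lane] = a[start + lane] + b[start + lane];
        enabled++;
      }
  return enabled;
}

/* Add N elements: full vector iterations first, then at most one masked
   iteration for the remaining N % VF elements.  Returns the number of
   lanes enabled in the masked epilogue (0 if there is no tail).  */
static int
add_with_masked_epilogue (int *c, const int *a, const int *b, int n)
{
  assert (n >= 0);
  int i = 0;
  for (; i + VF <= n; i += VF)
    masked_vec_add (c, a, b, i, n);   /* all VF lanes are enabled here */
  return i < n ? masked_vec_add (c, a, b, i, n) : 0;
}
```

With SIZE equal to 1023 the main loop covers 1016 elements and the single masked iteration enables the remaining 7 lanes, so no scalar tail loop is needed.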
>>
>> I did not include in this patch vectorization of low trip-count loops
>> since in the original patch additional parameter was introduced:
>> +DEFPARAM (PARAM_VECT_SHORT_LOOPS,
>> +  "vect-short-loops",
>> +  "Enable vectorization of low trip count loops using masking.",
>> +  0, 0, 1)
>>
>> I assume that this ability can be included very quickly but it
>> requires cost model enhancements also.
>
> Comments on the patch itself (as I'm having a closer look again,
> I know how it vectorizes the above but I wondered why epilogue
> and short-trip loops are not basically the same code path).
>
> Btw, I don't like that the features are behind a --param paywall.
> That just means a) nobody will use it, b) it will bit-rot quickly,
> c) bugs are well-hidden.
>
> +  if (loop_vinfo && LOOP_VINFO_CAN_BE_MASKED (loop_vinfo)
> +      && integer_zerop (nested_in_vect_loop
> +                       ? STMT_VINFO_DR_STEP (stmt_info)
> +                       : DR_STEP (dr)))
> +    {
> +      if (dump_enabled_p ())
> +       dump_printf_loc (MSG_NOTE, vect_location,
> +                        "allow invariant load for masked loop.\n");
> +    }
>
> this can test memory_access_type == VMAT_INVARIANT.  Please put
> all the checks in a common
>
>   if (loop_vinfo && LOOP_VINFO_CAN_BE_MASKED (loop_vinfo))
>     {
>        if (memory_access_type == VMAT_INVARIANT)
>          {
>          }
>        else if (...)
>          {
>             LOOP_VINFO_CAN_BE_MASKED (loop_vinfo) = false;
>          }
>        else if (..)
> ...
>     }
>
> @@ -6667,6 +6756,15 @@ vectorizable_load (gimple *stmt,
> gimple_stmt_iterator *gsi, gimple **vec_stmt,
>        gcc_assert (!nested_in_vect_loop);
>        gcc_assert (!STMT_VINFO_GATHER_SCATTER_P (stmt_info));
>
> +      if (loop_vinfo && LOOP_VINFO_CAN_BE_MASKED (loop_vinfo))
> +       {
> +         if (dump_enabled_p ())
> +           dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> +                            "cannot be masked: grouped access is not"
> +                            " supported.");
> +         LOOP_VINFO_CAN_BE_MASKED (loop_vinfo) = false;
> +      }
> +
>
> isn't this already handled by the above?  Or rather the general
> disallowance of SLP?
>
> @@ -5730,6 +5792,24 @@ vectorizable_store (gimple *stmt,
> gimple_stmt_iterator *gsi, gimple **vec_stmt,
>                             &memory_access_type, &gs_info))
>      return false;
>
> +  if (loop_vinfo && LOOP_VINFO_CAN_BE_MASKED (loop_vinfo)
> +      && memory_access_type != VMAT_CONTIGUOUS)
> +    {
> +      LOOP_VINFO_CAN_BE_MASKED (loop_vinfo) = false;
> +      if (dump_enabled_p ())
> +       dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> +                        "cannot be masked: unsupported memory access
> type.\n");
> +    }
> +
> +  if (loop_vinfo && LOOP_VINFO_CAN_BE_MASKED (loop_vinfo)
> +      && !can_mask_load_store (stmt))
> +    {
> +      LOOP_VINFO_CAN_BE_MASKED (loop_vinfo) = false;
> +      if (dump_enabled_p ())
> +       dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> +                        "cannot be masked: unsupported masked store.\n");
> +    }
> +
>
> likewise please combine the ifs.
>
> @@ -2354,7 +2401,10 @@ vectorizable_mask_load_store (gimple *stmt,
> gimple_stmt_iterator *gsi,
>                                           ptr, vec_mask, vec_rhs);
>           vect_finish_stmt_generation (stmt, new_stmt, gsi);
>           if (i == 0)
> -           STMT_VINFO_VEC_STMT (stmt_info) = *vec_stmt = new_stmt;
> +           {
> +             STMT_VINFO_VEC_STMT (stmt_info) = *vec_stmt = new_stmt;
> +             STMT_VINFO_FIRST_COPY_P (vinfo_for_stmt (new_stmt)) = true;
> +           }
>           else
>             STMT_VINFO_RELATED_STMT (prev_stmt_info) = new_stmt;
>           prev_stmt_info = vinfo_for_stmt (new_stmt);
>
> here you only set the flag, elsewhere you copy DR and VECTYPE as well.
>
> @@ -2113,6 +2146,20 @@ vectorizable_mask_load_store (gimple *stmt,
> gimple_stmt_iterator *gsi,
>                && !useless_type_conversion_p (vectype, rhs_vectype)))
>      return false;
>
> +  if (LOOP_VINFO_CAN_BE_MASKED (loop_vinfo))
> +    {
> +      /* Check that mask conjunction is supported.  */
> +      optab tab;
> +      tab = optab_for_tree_code (BIT_AND_EXPR, vectype, optab_default);
> +      if (!tab || optab_handler (tab, TYPE_MODE (vectype)) ==
> CODE_FOR_nothing)
> +       {
> +         if (dump_enabled_p ())
> +           dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> +                            "cannot be masked: unsupported mask
> operation\n");
> +         LOOP_VINFO_CAN_BE_MASKED (loop_vinfo) = false;
> +       }
> +    }
>
> does this really test whether we can bit-and the mask?  You are
> using the vector type of the store (which might be V2DF for example),
> also for AVX512 it might be a vector-bool type with integer mode?
> Of course we maybe can simply assume mask conjunction is available
> (I know no ISA where that would be not true).
>
> +/* Return true if STMT can be converted to masked form.  */
> +
> +static bool
> +can_mask_load_store (gimple *stmt)
> +{
> +  stmt_vec_info stmt_info = vinfo_for_stmt (stmt);
> +  tree vectype, mask_vectype;
> +  tree lhs, ref;
> +
> +  if (!stmt_info)
> +    return false;
> +  lhs = gimple_assign_lhs (stmt);
> +  ref = (TREE_CODE (lhs) == SSA_NAME) ? gimple_assign_rhs1 (stmt) : lhs;
> +  if (may_be_nonaddressable_p (ref))
> +    return false;
> +  vectype = STMT_VINFO_VECTYPE (stmt_info);
>
> You probably modeled this after ifcvt_can_use_mask_load_store but I
> don't think checking may_be_nonaddressable_p is necessary (we couldn't
> even vectorize such refs).  stmt_info should never be NULL either.
> With the check removed tree-ssa-loop-ivopts.h should no longer be
> necessary.
>
> +static void
> +vect_mask_load_store_stmt (gimple *stmt, tree vectype, tree mask,
> +                          data_reference *dr, gimple_stmt_iterator *si)
> +{
> ...
> +  addr = force_gimple_operand_gsi (&gsi, build_fold_addr_expr (mem),
> +                                  true, NULL_TREE, true,
> +                                  GSI_SAME_STMT);
> +
> +  align = TYPE_ALIGN_UNIT (vectype);
> +  if (aligned_access_p (dr))
> +    misalign = 0;
> +  else if (DR_MISALIGNMENT (dr) == -1)
> +    {
> +      align = TYPE_ALIGN_UNIT (elem_type);
> +      misalign = 0;
> +    }
> +  else
> +    misalign = DR_MISALIGNMENT (dr);
> +  set_ptr_info_alignment (get_ptr_info (addr), align, misalign);
> +  ptr = build_int_cst (reference_alias_ptr_type (mem),
> +                      misalign ? misalign & -misalign : align);
>
> you should simply use
>
>   align = get_object_alignment (mem) / BITS_PER_UNIT;
>
> here rather than trying to be clever.  Eventually you don't need
> the DR then (see question above).
>
> +    }
> +  gsi_replace (si ? si : &gsi, new_stmt, false);
>
> when you replace the load/store please previously copy VUSE and VDEF
> from the original one (we were nearly clean enough to no longer
> require a virtual operand rewrite after vectorization...)  Thus
>
>   gimple_set_vuse (new_stmt, gimple_vuse (stmt));
>   gimple_set_vdef (new_stmt, gimple_vdef (stmt));
>
> +static void
> +vect_mask_loop (loop_vec_info loop_vinfo)
> +{
> ...
> +  /* Scan all loop statements to convert vector load/store including
> masked
> +     form.  */
> +  for (unsigned i = 0; i < loop->num_nodes; i++)
> +    {
> +      basic_block bb = bbs[i];
> +      for (gimple_stmt_iterator si = gsi_start_bb (bb);
> +          !gsi_end_p (si); gsi_next (&si))
> +       {
> +         gimple *stmt = gsi_stmt (si);
> +         stmt_vec_info stmt_info = NULL;
> +         tree vectype = NULL;
> +         data_reference *dr;
> +
> +         /* Mask load case.  */
> +         if (is_gimple_call (stmt)
> +             && gimple_call_internal_p (stmt)
> +             && gimple_call_internal_fn (stmt) == IFN_MASK_LOAD
> +             && !VECTOR_TYPE_P (TREE_TYPE (gimple_call_arg (stmt, 2))))
> +           {
> ...
> +             /* Skip invariant loads.  */
> +             if (integer_zerop (nested_in_vect_loop_p (loop, stmt)
> +                                ? STMT_VINFO_DR_STEP (stmt_info)
> +                                : DR_STEP (STMT_VINFO_DATA_REF
> (stmt_info))))
> +               continue;
>
> seeing this it would be nice if stmt_info had a flag for whether
> the stmt needs masking (and a flag on whether this is a scalar or a
> vectorized stmt).
>
> +         /* Skip hoisted out statements.  */
> +         if (!flow_bb_inside_loop_p (loop, gimple_bb (stmt)))
> +           continue;
>
> err, you walk stmts in the loop!  Isn't this covered by the above
> skipping of 'invariant loads'?
>
> +static gimple *
> +vect_mask_reduction_stmt (gimple *stmt, tree mask, gimple *prev)
> +{
>
> depending on the reduction operand there are variants that
> could get away w/o the VEC_COND_EXPR, like
>
>   S1': tem_4 = d_3 & MASK;
>   S2': r_1 = r_2 + tem_4;
>
> which works for plus at least.  More generally doing
>
>   S1': tem_4 = VEC_COND_EXPR<MASK, d_3, neutral operand>
>   S2': r_1 = r_2 OP tem_4;
>
> and leaving optimization to & to later opts (& won't work for
> AVX512 mask registers I guess).
>
> Good enough for later enhancement of course.
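The two reduction variants discussed above can be emulated lane by lane in plain C. This is a sketch only: VF and the function names are hypothetical, and the mask is assumed to hold all-ones (-1) or zero in each lane, which is what makes the AND form valid for PLUS. The select form corresponds to VEC_COND_EXPR with the neutral operand of the reduction operation.

```c
#include <assert.h>

#define VF 4  /* hypothetical number of lanes */

/* General form: tem = MASK ? d : neutral; r = r + tem.
   The neutral operand of PLUS is 0.  */
static void
masked_plus_select (int *r, const int *d, const int *mask)
{
  for (int lane = 0; lane < VF; lane++)
    {
      int tem = mask[lane] ? d[lane] : 0;
      r[lane] += tem;
    }
}

/* Shortcut for PLUS: tem = d & MASK.  Only valid when every mask lane
   is either all-ones (-1) or zero.  */
static void
masked_plus_and (int *r, const int *d, const int *mask)
{
  for (int lane = 0; lane < VF; lane++)
    r[lane] += d[lane] & mask[lane];
}

/* MULT needs the select form: its neutral operand is 1, so ANDing the
   operand with the mask would zero the accumulator in disabled lanes.  */
static void
masked_mult_select (int *r, const int *d, const int *mask)
{
  for (int lane = 0; lane < VF; lane++)
    {
      int tem = mask[lane] ? d[lane] : 1;
      r[lane] *= tem;
    }
}
```

Disabled lanes leave the accumulator unchanged in all three functions, which is the property the final reduction across lanes relies on.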
>
> +static void
> +vect_gen_ivs_for_masking (loop_vec_info loop_vinfo, vec<tree> *ivs)
> +{
> ...
>
> isn't it enough to always create a single IV and derive the
> additional copies by IV + i * { elems, elems, elems ... }?
> IVs are expensive -- I'm sure we can optimize the rest of the
> scheme further as well but this one looks obvious to me.
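A minimal sketch of the single-IV idea, again as a scalar emulation in C. ELEMS, derive_iv and mask_for_copy are hypothetical names for illustration and are not taken from the patch. Only base_iv stands for a real induction variable; each additional copy is computed from it by adding a constant vector { ELEMS, ELEMS, ... } scaled by the copy index, and the mask is then the lane-wise comparison against the iteration count.

```c
#include <assert.h>

#define ELEMS 4  /* hypothetical number of lanes per vector */

/* Compute the lane indices for copy COPY of the vector iteration whose
   first scalar index is BASE:
     iv = base_iv + COPY * { ELEMS, ELEMS, ... }
   where base_iv = { BASE, BASE + 1, ... } is the only real IV.  */
static void
derive_iv (unsigned *iv, unsigned base, unsigned copy)
{
  for (unsigned lane = 0; lane < ELEMS; lane++)
    {
      unsigned base_iv = base + lane;     /* the single induction variable */
      iv[lane] = base_iv + copy * ELEMS;  /* derived copy, no extra IV */
    }
}

/* Build the mask for copy COPY: a lane is all-ones (-1) while its scalar
   index is still below NITERS and zero otherwise.  */
static void
mask_for_copy (int *mask, unsigned base, unsigned copy, unsigned niters)
{
  unsigned iv[ELEMS];
  derive_iv (iv, base, copy);
  for (unsigned lane = 0; lane < ELEMS; lane++)
    mask[lane] = iv[lane] < niters ? -1 : 0;
}
```

For example, with BASE equal to 8 and NITERS equal to 14, copy 0 covers indices 8 to 11 and is fully enabled, while copy 1 covers 12 to 15 and enables only its first two lanes.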
>
> @@ -3225,12 +3508,32 @@ vect_estimate_min_profitable_iters (loop_vec_info
> loop_vinfo,
>    int npeel = LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo);
>    void *target_cost_data = LOOP_VINFO_TARGET_COST_DATA (loop_vinfo);
>
> +  if (LOOP_VINFO_EPILOGUE_P (loop_vinfo))
> +    {
> +      /* Currently we don't produce scalar epilogue version in case
> +        its masked version is provided.  It means we don't need to
> +        compute profitability one more time here.  Just make a
> +        masked loop version.  */
> +      if (LOOP_VINFO_CAN_BE_MASKED (loop_vinfo)
> +         && PARAM_VALUE (PARAM_VECT_EPILOGUES_MASK))
> +       {
> +         dump_printf_loc (MSG_NOTE, vect_location,
> +                          "cost model: mask loop epilogue.\n");
> +         LOOP_VINFO_MASK_LOOP (loop_vinfo) = true;
> +         *ret_min_profitable_niters = 0;
> +         *ret_min_profitable_estimate = 0;
> +         return;
> +       }
> +    }
>    /* Cost model disabled.  */
> -  if (unlimited_cost_model (LOOP_VINFO_LOOP (loop_vinfo)))
> +  else if (unlimited_cost_model (LOOP_VINFO_LOOP (loop_vinfo)))
>      {
>        dump_printf_loc (MSG_NOTE, vect_location, "cost model
> disabled.\n");
>        *ret_min_profitable_niters = 0;
>        *ret_min_profitable_estimate = 0;
> +      if (PARAM_VALUE (PARAM_VECT_EPILOGUES_MASK)
> +         && LOOP_VINFO_CAN_BE_MASKED (loop_vinfo))
> +       LOOP_VINFO_MASK_LOOP (loop_vinfo) = true;
>        return;
>      }
>
> the unlimited_cost_model case should come first?  OTOH masking or
> not is probably not sth covered by 'unlimited' - that is about
> vectorizing or not.  But the above code means that for
> epilogue vectorization w/o masking we ignore unlimited_cost_model ()?
> That doesn't make sense to me.
>
> Plus if this is short-trip or epilogue vectorization and the
> cost model is _not_ unlimited then we don't want to enable
> masking always (if it is possible).  It might be we statically
> know the epilogue executes for at most two iterations for example.
>
> I don't see _any_ cost model for vectorizing the epilogue with
> masking?  Am I missing something?  A "trivial" cost model
> should at least consider the additional IV(s), the mask
> compute and the widening and narrowing ops required.
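To make the shape of such a "trivial" cost model concrete, here is a hedged sketch in C. Everything in it is an assumption for illustration: struct mask_costs, its fields, and both functions are hypothetical and are not GCC's target cost hooks, and the cost units are arbitrary. It charges the masked epilogue for one vector iteration plus the extra IV, the mask computation and the mask widening or narrowing, and compares that with the scalar epilogue at its statically known maximum trip count.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical cost inputs, in arbitrary units.  */
struct mask_costs
{
  unsigned scalar_iter_cost;   /* one iteration of the scalar epilogue */
  unsigned vector_iter_cost;   /* one unmasked vector iteration */
  unsigned iv_cost;            /* additional IV(s) feeding the mask */
  unsigned mask_compute_cost;  /* comparison that produces the mask */
  unsigned mask_convert_cost;  /* widening/narrowing of masks */
};

/* Cost of running the epilogue as a single masked vector iteration.  */
static unsigned
masked_epilogue_cost (const struct mask_costs *c)
{
  return c->vector_iter_cost + c->iv_cost
         + c->mask_compute_cost + c->mask_convert_cost;
}

/* Return true if masking is cheaper than executing the scalar epilogue
   for MAX_EPILOGUE_ITERS iterations (its statically known upper bound).  */
static bool
masking_profitable_p (const struct mask_costs *c, unsigned max_epilogue_iters)
{
  assert (c != 0);
  return masked_epilogue_cost (c) < c->scalar_iter_cost * max_epilogue_iters;
}
```

With these made-up numbers, an epilogue known to run at most two iterations stays scalar, whereas one that may run seven iterations is masked, which matches the concern about enabling masking unconditionally.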
>
> diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c
> index e13d6a2..36be342 100644
> --- a/gcc/tree-vect-loop-manip.c
> +++ b/gcc/tree-vect-loop-manip.c
> @@ -1635,6 +1635,13 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree
> niters, tree nitersm1,
>    bool epilog_peeling = (LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo)
>                          || LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo));
>
> +  if (LOOP_VINFO_EPILOGUE_P (loop_vinfo))
> +    {
> +      prolog_peeling = false;
> +      if (LOOP_VINFO_MASK_LOOP (loop_vinfo))
> +       epilog_peeling = false;
> +    }
> +
>    if (!prolog_peeling && !epilog_peeling)
>      return NULL;
>
> I think the prolog_peeling was fixed during the epilogue vectorization
> review and should no longer be necessary.  Please add
> a && ! LOOP_VINFO_MASK_LOOP () to the epilog_peeling init instead
> (it should also work for short-trip loop vectorization).
>
> @@ -2022,11 +2291,18 @@ start_over:
>        || (max_niter != -1
>           && (unsigned HOST_WIDE_INT) max_niter < vectorization_factor))
>      {
> -      if (dump_enabled_p ())
> -       dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> -                        "not vectorized: iteration count smaller than "
> -                        "vectorization factor.\n");
> -      return false;
> +      /* Allow low trip count for loop epilogue we want to mask.  */
> +      if (LOOP_VINFO_EPILOGUE_P (loop_vinfo)
> +         && PARAM_VALUE (PARAM_VECT_EPILOGUES_MASK))
> +       LOOP_VINFO_MASK_LOOP (loop_vinfo) = true;
> +      else
> +       {
> +         if (dump_enabled_p ())
>
> so why do we test only LOOP_VINFO_EPILOGUE_P here?  All the code
> I saw so far would also work for the main loop (but the cost
> model is missing).
>
> I am missing testcases.  There's only a single one but we should
> have cases covering all kinds of mask IV widths and widen/shorten
> masks.
>
> Do you have any numbers on SPEC 2k6 with epilogue vect and/or masking
> enabled for an AVX2 machine?
>
> Oh, and I really dislike the --param paywall.
>
> Thanks,
> Richard.
>
>> Best regards.
>> Yuri.
>>
>>
>> 2016-11-28 17:39 GMT+03:00 Richard Biener <rguenther@suse.de>:
>> > On Thu, 24 Nov 2016, Yuri Rumyantsev wrote:
>> >
>> >> Hi All,
>> >>
>> >> Here is the second patch which supports epilogue vectorization using
>> >> masking without cost model. Currently it is possible
>> >> only with passing parameter "--param vect-epilogues-mask=1".
>> >>
>> >> Bootstrapping and regression testing did not show any new regression.
>> >>
>> >> Any comments will be appreciated.
>> >
>> > Going over the patch the main question is one how it works -- it looks
>> > like the decision whether to vectorize & mask the epilogue is made
>> > when vectorizing the loop that generates the epilogue rather than
>> > in the epilogue vectorization path?
>> >
>> > That is, I'd have expected to see this handling low-trip count loops
>> > by masking?  And thus masking the epilogue simply by it being
>> > low-trip count?
>> >
>> > Richard.
>> >
>> >> ChangeLog:
>> >> 2016-11-24  Yuri Rumyantsev  <ysrumyan@gmail.com>
>> >>
>> >> * params.def (PARAM_VECT_EPILOGUES_MASK): New.
>> >> * tree-vect-data-refs.c (vect_get_new_ssa_name): Support vect_mask_var.
>> >> * tree-vect-loop.c: Include insn-config.h, recog.h and alias.h.
>> >> (new_loop_vec_info): Add zeroing can_be_masked, mask_loop and
>> >> required_mask fields.
>> >> (vect_check_required_masks_widening): New.
>> >> (vect_check_required_masks_narrowing): New.
>> >> (vect_get_masking_iv_elems): New.
>> >> (vect_get_masking_iv_type): New.
>> >> (vect_get_extreme_masks): New.
>> >> (vect_check_required_masks): New.
>> >> (vect_analyze_loop_operations): Call vect_check_required_masks if all
>> >> statements can be masked.
>> >> (vect_analyze_loop_2): Initialize to zero min_scalar_loop_bound.
>> >> Add check that epilogue can be masked with the same vf with issue
>> >> fail notes.  Allow epilogue vectorization through masking of low trip
>> >> loops. Set to true can_be_masked field before loop operation analysis.
>> >> Do not set-up min_scalar_loop_bound for epilogue vectorization through
>> >> masking.  Do not peel for epilogue masking.  Reset can_be_masked
>> >> field before repeat analysis.
>> >> (vect_estimate_min_profitable_iters): Do not compute profitability
>> >> for epilogue masking.  Set up mask_loop field to true if parameter
>> >> PARAM_VECT_EPILOGUES_MASK is non-zero.
>> >> (vectorizable_reduction): Add check that statement can be masked.
>> >> (vectorizable_induction): Do not support masking for induction.
>> >> (vect_gen_ivs_for_masking): New.
>> >> (vect_get_mask_index_for_elems): New.
>> >> (vect_get_mask_index_for_type): New.
>> >> (vect_create_narrowed_masks): New.
>> >> (vect_create_widened_masks): New.
>> >> (vect_gen_loop_masks): New.
>> >> (vect_mask_reduction_stmt): New.
>> >> (vect_mask_mask_load_store_stmt): New.
>> >> (vect_mask_load_store_stmt): New.
>> >> (vect_mask_loop): New.
>> >> (vect_transform_loop): Invoke vect_mask_loop if required.
>> >> Use div_ceil to recompute upper bounds for masked loops.  Issue
>> >> statistics for epilogue vectorization through masking. Do not reduce
>> >> vf for masking epilogue.
>> >> * tree-vect-stmts.c: Include tree-ssa-loop-ivopts.h.
>> >> (can_mask_load_store): New.
>> >> (vectorizable_mask_load_store): Check that mask conjunction is
>> >> supported.  Set-up first_copy_p field of stmt_vinfo.
>> >> (vectorizable_simd_clone_call): Check that simd clone can not be
>> >> masked.
>> >> (vectorizable_store): Check that store can be masked. Mark the first
>> >> copy of generated vector stores and provide it with vectype and the
>> >> original data reference.
>> >> (vectorizable_load): Check that load can be masked.
>> >> (vect_stmt_should_be_masked_for_epilogue): New.
>> >> (vect_add_required_mask_for_stmt): New.
>> >> (vect_analyze_stmt): Add check on unsupported statements for masking
>> >> with printing message.
>> >> * tree-vectorizer.h (struct _loop_vec_info): Add new fields
>> >> can_be_masked, required_masks, mask_loop.
>> >> (LOOP_VINFO_CAN_BE_MASKED): New.
>> >> (LOOP_VINFO_REQUIRED_MASKS): New.
>> >> (LOOP_VINFO_MASK_LOOP): New.
>> >> (struct _stmt_vec_info): Add first_copy_p field.
>> >> (STMT_VINFO_FIRST_COPY_P): New.
>> >>
>> >> gcc/testsuite/
>> >>
>> >> * gcc.dg/vect/vect-tail-mask-1.c: New test.
>> >>
>> >> 2016-11-18 18:54 GMT+03:00 Christophe Lyon <christophe.lyon@linaro.org>:
>> >> > On 18 November 2016 at 16:46, Yuri Rumyantsev <ysrumyan@gmail.com> wrote:
>> >> >> It is very strange that this test failed on arm, since it requires
>> >> >> target avx2 to check vectorizer dumps:
>> >> >>
>> >> >> /* { dg-final { scan-tree-dump-times "LOOP VECTORIZED" 2 "vect" {
>> >> >> target avx2_runtime } } } */
>> >> >> /* { dg-final { scan-tree-dump-times "LOOP EPILOGUE VECTORIZED
>> >> >> \\(VS=16\\)" 2 "vect" { target avx2_runtime } } } */
>> >> >>
>> >> >> Could you please clarify what is the reason of the failure?
>> >> >
>> >> > It's not the scan-dumps that fail, but the execution.
>> >> > The test calls abort() for some reason.
>> >> >
>> >> > It will take me a while to rebuild the test manually in the right
>> >> > debug environment to provide you with more traces.
>> >> >
>> >> >
>> >> >
>> >> >>
>> >> >> Thanks.
>> >> >>
>> >> >> 2016-11-18 16:20 GMT+03:00 Christophe Lyon <christophe.lyon@linaro.org>:
>> >> >>> On 15 November 2016 at 15:41, Yuri Rumyantsev <ysrumyan@gmail.com> wrote:
>> >> >>>> Hi All,
>> >> >>>>
>> >> >>>> Here is the patch for non-masked epilogue vectorization.
>> >> >>>>
>> >> >>>> Bootstrap and regression testing did not show any new failures.
>> >> >>>>
>> >> >>>> Is it OK for trunk?
>> >> >>>>
>> >> >>>> Thanks.
>> >> >>>> Changelog:
>> >> >>>>
>> >> >>>> 2016-11-15  Yuri Rumyantsev  <ysrumyan@gmail.com>
>> >> >>>>
>> >> >>>> * params.def (PARAM_VECT_EPILOGUES_NOMASK): New.
>> >> >>>> * tree-if-conv.c (tree_if_conversion): Make public.
>> >> >>>> * tree-if-conv.h: New file.
>> >> >>>> * tree-vect-data-refs.c (vect_analyze_data_ref_dependences): Avoid
>> >> >>>> dynamic alias checks for epilogues.
>> >> >>>> * tree-vect-loop-manip.c (vect_do_peeling): Return created epilog.
>> >> >>>> * tree-vect-loop.c: include tree-if-conv.h.
>> >> >>>> (new_loop_vec_info): Add zeroing orig_loop_info field.
>> >> >>>> (vect_analyze_loop_2): Don't try to enhance alignment for epilogues.
>> >> >>>> (vect_analyze_loop): Add argument ORIG_LOOP_INFO which is not NULL
>> >> >>>> if epilogue is vectorized, set up orig_loop_info field of loop_vinfo
>> >> >>>> using passed argument.
>> >> >>>> (vect_transform_loop): Check if created epilogue should be returned
>> >> >>>> for further vectorization with less vf.  If-convert epilogue if
>> >> >>>> required. Print vectorization success for epilogue.
>> >> >>>> * tree-vectorizer.c (vectorize_loops): Add epilogue vectorization
>> >> >>>> if it is required, pass loop_vinfo produced during vectorization of
>> >> >>>> loop body to vect_analyze_loop.
>> >> >>>> * tree-vectorizer.h (struct _loop_vec_info): Add new field
>> >> >>>> orig_loop_info.
>> >> >>>> (LOOP_VINFO_ORIG_LOOP_INFO): New.
>> >> >>>> (LOOP_VINFO_EPILOGUE_P): New.
>> >> >>>> (LOOP_VINFO_ORIG_VECT_FACTOR): New.
>> >> >>>> (vect_do_peeling): Change prototype to return epilogue.
>> >> >>>> (vect_analyze_loop): Add argument of loop_vec_info type.
>> >> >>>> (vect_transform_loop): Return created loop.
>> >> >>>>
>> >> >>>> gcc/testsuite/
>> >> >>>>
>> >> >>>> * lib/target-supports.exp (check_avx2_hw_available): New.
>> >> >>>> (check_effective_target_avx2_runtime): New.
>> >> >>>> * gcc.dg/vect/vect-tail-nomask-1.c: New test.
>> >> >>>>
>> >> >>>
>> >> >>> Hi,
>> >> >>>
>> >> >>> This new test fails on arm-none-eabi (using default cpu/fpu/mode):
>> >> >>>   gcc.dg/vect/vect-tail-nomask-1.c -flto -ffat-lto-objects execution test
>> >> >>>   gcc.dg/vect/vect-tail-nomask-1.c execution test
>> >> >>>
>> >> >>> It does pass on the same target if configured --with-cpu=cortex-a9.
>> >> >>>
>> >> >>> Christophe
>> >> >>>
>> >> >>>
>> >> >>>
>> >> >>>>
>> >> >>>> 2016-11-14 20:04 GMT+03:00 Richard Biener <rguenther@suse.de>:
>> >> >>>>> On November 14, 2016 4:39:40 PM GMT+01:00, Yuri Rumyantsev <ysrumyan@gmail.com> wrote:
>> >> >>>>>>Richard,
>> >> >>>>>>
>> >> >>>>>>I checked one of the tests designed for epilogue vectorization using
>> >> >>>>>>patches 1 - 3 and found out that build compiler performs vectorization
>> >> >>>>>>of epilogues with --param vect-epilogues-nomask=1 passed:
>> >> >>>>>>
>> >> >>>>>>$ gcc -Ofast -mavx2 t1.c -S --param vect-epilogues-nomask=1 -o
>> >> >>>>>>t1.new-nomask.s -fdump-tree-vect-details
>> >> >>>>>>$ grep VECTORIZED -c t1.c.156t.vect
>> >> >>>>>>4
>> >> >>>>>> Without param only 2 loops are vectorized.
>> >> >>>>>>
>> >> >>>>>>Should I simply add a part of tests related to this feature or I must
>> >> >>>>>>delete all not necessary changes also?
>> >> >>>>>
>> >> >>>>> Please remove all not necessary changes.
>> >> >>>>>
>> >> >>>>> Richard.
>> >> >>>>>
>> >> >>>>>>Thanks.
>> >> >>>>>>Yuri.
>> >> >>>>>>
>> >> >>>>>>2016-11-14 16:40 GMT+03:00 Richard Biener <rguenther@suse.de>:
>> >> >>>>>>> On Mon, 14 Nov 2016, Yuri Rumyantsev wrote:
>> >> >>>>>>>
>> >> >>>>>>>> Richard,
>> >> >>>>>>>>
>> >> >>>>>>>> In my previous patch I forgot to remove couple lines related to aux
>> >> >>>>>>field.
>> >> >>>>>>>> Here is the correct updated patch.
>> >> >>>>>>>
>> >> >>>>>>> Yeah, I noticed.  This patch would be ok for trunk (together with
>> >> >>>>>>> necessary parts from 1 and 2) if all not required parts are removed
>> >> >>>>>>> (and you'd add the testcases covering non-masked tail vect).
>> >> >>>>>>>
>> >> >>>>>>> Thus, can you please produce a single complete patch containing only
>> >> >>>>>>> non-masked epilogue vectorization?
>> >> >>>>>>>
>> >> >>>>>>> Thanks,
>> >> >>>>>>> Richard.
>> >> >>>>>>>
>> >> >>>>>>>> Thanks.
>> >> >>>>>>>> Yuri.
>> >> >>>>>>>>
>> >> >>>>>>>> 2016-11-14 15:51 GMT+03:00 Richard Biener <rguenther@suse.de>:
>> >> >>>>>>>> > On Fri, 11 Nov 2016, Yuri Rumyantsev wrote:
>> >> >>>>>>>> >
>> >> >>>>>>>> >> Richard,
>> >> >>>>>>>> >>
>> >> >>>>>>>> >> I prepare updated 3 patch with passing additional argument to
>> >> >>>>>>>> >> vect_analyze_loop as you proposed (untested).
>> >> >>>>>>>> >>
>> >> >>>>>>>> >> You wrote:
>> >> >>>>>>>> >> Btw, I wonder if you can produce a single patch containing just
>> >> >>>>>>>> >> epilogue vectorization, that is combine patches 1-3 but rip out
>> >> >>>>>>>> >> changes only needed by later patches?
>> >> >>>>>>>> >>
>> >> >>>>>>>> >> Did you mean that I exclude all support for vectorization
>> >> >>>>>>epilogues,
>> >> >>>>>>>> >> i.e. exclude from 2-nd patch all non-related changes
>> >> >>>>>>>> >> like
>> >> >>>>>>>> >>
>> >> >>>>>>>> >> diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
>> >> >>>>>>>> >> index 11863af..32011c1 100644
>> >> >>>>>>>> >> --- a/gcc/tree-vect-loop.c
>> >> >>>>>>>> >> +++ b/gcc/tree-vect-loop.c
>> >> >>>>>>>> >> @@ -1120,6 +1120,12 @@ new_loop_vec_info (struct loop *loop)
>> >> >>>>>>>> >>    LOOP_VINFO_PEELING_FOR_GAPS (res) = false;
>> >> >>>>>>>> >>    LOOP_VINFO_PEELING_FOR_NITER (res) = false;
>> >> >>>>>>>> >>    LOOP_VINFO_OPERANDS_SWAPPED (res) = false;
>> >> >>>>>>>> >> +  LOOP_VINFO_CAN_BE_MASKED (res) = false;
>> >> >>>>>>>> >> +  LOOP_VINFO_REQUIRED_MASKS (res) = 0;
>> >> >>>>>>>> >> +  LOOP_VINFO_COMBINE_EPILOGUE (res) = false;
>> >> >>>>>>>> >> +  LOOP_VINFO_MASK_EPILOGUE (res) = false;
>> >> >>>>>>>> >> +  LOOP_VINFO_NEED_MASKING (res) = false;
>> >> >>>>>>>> >> +  LOOP_VINFO_ORIG_LOOP_INFO (res) = NULL;
>> >> >>>>>>>> >
>> >> >>>>>>>> > Yes.
>> >> >>>>>>>> >
>> >> >>>>>>>> >> Did you mean also that new combined patch must be working patch,
>> >> >>>>>>i.e.
>> >> >>>>>>>> >> can be integrated without other patches?
>> >> >>>>>>>> >
>> >> >>>>>>>> > Yes.
>> >> >>>>>>>> >
>> >> >>>>>>>> >> Could you please look at updated patch?
>> >> >>>>>>>> >
>> >> >>>>>>>> > Will do.
>> >> >>>>>>>> >
>> >> >>>>>>>> > Thanks,
>> >> >>>>>>>> > Richard.
>> >> >>>>>>>> >
>> >> >>>>>>>> >> Thanks.
>> >> >>>>>>>> >> Yuri.
>> >> >>>>>>>> >>
>> >> >>>>>>>> >> 2016-11-10 15:36 GMT+03:00 Richard Biener <rguenther@suse.de>:
>> >> >>>>>>>> >> > On Thu, 10 Nov 2016, Richard Biener wrote:
>> >> >>>>>>>> >> >
>> >> >>>>>>>> >> >> On Tue, 8 Nov 2016, Yuri Rumyantsev wrote:
>> >> >>>>>>>> >> >>
>> >> >>>>>>>> >> >> > Richard,
>> >> >>>>>>>> >> >> >
>> >> >>>>>>>> >> >> > Here is updated 3 patch.
>> >> >>>>>>>> >> >> >
>> >> >>>>>>>> >> >> > I checked that all new tests related to epilogue
>> >> >>>>>>vectorization passed with it.
>> >> >>>>>>>> >> >> >
>> >> >>>>>>>> >> >> > Your comments will be appreciated.
>> >> >>>>>>>> >> >>
>> >> >>>>>>>> >> >> A lot better now.  Instead of the ->aux dance I now prefer to
>> >> >>>>>>>> >> >> pass the original loops loop_vinfo to vect_analyze_loop as
>> >> >>>>>>>> >> >> optional argument (if non-NULL we analyze the epilogue of that
>> >> >>>>>>>> >> >> loop_vinfo).  OTOH I remember we mainly use it to get at the
>> >> >>>>>>>> >> >> original vectorization factor?  So we can pass down an
>> >> >>>>>>(optional)
>> >> >>>>>>>> >> >> forced vectorization factor as well?
>> >> >>>>>>>> >> >
>> >> >>>>>>>> >> > Btw, I wonder if you can produce a single patch containing just
>> >> >>>>>>>> >> > epilogue vectorization, that is combine patches 1-3 but rip out
>> >> >>>>>>>> >> > changes only needed by later patches?
>> >> >>>>>>>> >> >
>> >> >>>>>>>> >> > Thanks,
>> >> >>>>>>>> >> > Richard.
>> >> >>>>>>>> >> >
>> >> >>>>>>>> >> >> Richard.
>> >> >>>>>>>> >> >>
>> >> >>>>>>>> >> >> > 2016-11-08 15:38 GMT+03:00 Richard Biener
>> >> >>>>>><rguenther@suse.de>:
>> >> >>>>>>>> >> >> > > On Thu, 3 Nov 2016, Yuri Rumyantsev wrote:
>> >> >>>>>>>> >> >> > >
>> >> >>>>>>>> >> >> > >> Hi Richard,
>> >> >>>>>>>> >> >> > >>
>> >> >>>>>>>> >> >> > >> I did not understand your last remark:
>> >> >>>>>>>> >> >> > >>
>> >> >>>>>>>> >> >> > >> > That is, here (and avoid the FOR_EACH_LOOP change):
>> >> >>>>>>>> >> >> > >> >
>> >> >>>>>>>> >> >> > >> > @@ -580,12 +586,21 @@ vectorize_loops (void)
>> >> >>>>>>>> >> >> > >> >           && dump_enabled_p ())
>> >> >>>>>>>> >> >> > >> >           dump_printf_loc (MSG_OPTIMIZED_LOCATIONS,
>> >> >>>>>>vect_location,
>> >> >>>>>>>> >> >> > >> >                            "loop vectorized\n");
>> >> >>>>>>>> >> >> > >> > -       vect_transform_loop (loop_vinfo);
>> >> >>>>>>>> >> >> > >> > +       new_loop = vect_transform_loop (loop_vinfo);
>> >> >>>>>>>> >> >> > >> >         num_vectorized_loops++;
>> >> >>>>>>>> >> >> > >> >        /* Now that the loop has been vectorized, allow
>> >> >>>>>>it to be unrolled
>> >> >>>>>>>> >> >> > >> >           etc.  */
>> >> >>>>>>>> >> >> > >> >      loop->force_vectorize = false;
>> >> >>>>>>>> >> >> > >> >
>> >> >>>>>>>> >> >> > >> > +       /* Add new loop to a processing queue.  To make
>> >> >>>>>>it easier
>> >> >>>>>>>> >> >> > >> > +          to match loop and its epilogue vectorization
>> >> >>>>>>in dumps
>> >> >>>>>>>> >> >> > >> > +          put new loop as the next loop to process.
>> >> >>>>>>*/
>> >> >>>>>>>> >> >> > >> > +       if (new_loop)
>> >> >>>>>>>> >> >> > >> > +         {
>> >> >>>>>>>> >> >> > >> > +           loops.safe_insert (i + 1, new_loop->num);
>> >> >>>>>>>> >> >> > >> > +           vect_loops_num = number_of_loops (cfun);
>> >> >>>>>>>> >> >> > >> > +         }
>> >> >>>>>>>> >> >> > >> >
>> >> >>>>>>>> >> >> > >> > simply dispatch to a vectorize_epilogue (loop_vinfo,
>> >> >>>>>>new_loop)
>> >> >>>>>>>> >> >> > >> > function which will set up stuff properly (and also
>> >> >>>>>>perform
>> >> >>>>>>>> >> >> > >> > the if-conversion of the epilogue there).
>> >> >>>>>>>> >> >> > >> >
>> >> >>>>>>>> >> >> > >> > That said, if we can get in non-masked epilogue
>> >> >>>>>>vectorization
>> >> >>>>>>>> >> >> > >> > separately that would be great.
>> >> >>>>>>>> >> >> > >>
>> >> >>>>>>>> >> >> > >> Could you please clarify your proposal.
>> >> >>>>>>>> >> >> > >
>> >> >>>>>>>> >> >> > > When a loop was vectorized set things up to immediately
>> >> >>>>>>vectorize
>> >> >>>>>>>> >> >> > > its epilogue, avoiding changing the loop iteration and
>> >> >>>>>>avoiding
>> >> >>>>>>>> >> >> > > the re-use of ->aux.
>> >> >>>>>>>> >> >> > >
>> >> >>>>>>>> >> >> > > Richard.
>> >> >>>>>>>> >> >> > >
>> >> >>>>>>>> >> >> > >> Thanks.
>> >> >>>>>>>> >> >> > >> Yuri.
>> >> >>>>>>>> >> >> > >>
>> >> >>>>>>>> >> >> > >> 2016-11-02 15:27 GMT+03:00 Richard Biener
>> >> >>>>>><rguenther@suse.de>:
>> >> >>>>>>>> >> >> > >> > On Tue, 1 Nov 2016, Yuri Rumyantsev wrote:
>> >> >>>>>>>> >> >> > >> >
>> >> >>>>>>>> >> >> > >> >> Hi All,
>> >> >>>>>>>> >> >> > >> >>
>> >> >>>>>>>> >> >> > >> >> I re-send all patches sent by Ilya earlier for review
>> >> >>>>>>which support
>> >> >>>>>>>> >> >> > >> >> vectorization of loop epilogues and loops with low
>> >> >>>>>>trip count. We
>> >> >>>>>>>> >> >> > >> >> assume that the only patch -
>> >> >>>>>>vec-tails-07-combine-tail.patch - was not
>> >> >>>>>>>> >> >> > >> >> approved by Jeff.
>> >> >>>>>>>> >> >> > >> >>
>> >> >>>>>>>> >> >> > >> >> I did re-base of all patches and performed
>> >> >>>>>>bootstrapping and
>> >> >>>>>>>> >> >> > >> >> regression testing that did not show any new failures.
>> >> >>>>>>Also all
>> >> >>>>>>>> >> >> > >> >> changes related to new vect_do_peeling algorithm have
>> >> >>>>>>been changed
>> >> >>>>>>>> >> >> > >> >> accordingly.
>> >> >>>>>>>> >> >> > >> >>
>> >> >>>>>>>> >> >> > >> >> Is it OK for trunk?
>> >> >>>>>>>> >> >> > >> >
>> >> >>>>>>>> >> >> > >> > I would have preferred that the series up to
>> >> >>>>>>-03-nomask-tails would
>> >> >>>>>>>> >> >> > >> > _only_ contain epilogue loop vectorization changes but
>> >> >>>>>>unfortunately
>> >> >>>>>>>> >> >> > >> > the patchset is oddly separated.
>> >> >>>>>>>> >> >> > >> >
>> >> >>>>>>>> >> >> > >> > I have a comment on that part nevertheless:
>> >> >>>>>>>> >> >> > >> >
>> >> >>>>>>>> >> >> > >> > @@ -1608,7 +1614,10 @@ vect_enhance_data_refs_alignment
>> >> >>>>>>(loop_vec_info
>> >> >>>>>>>> >> >> > >> > loop_vinfo)
>> >> >>>>>>>> >> >> > >> >    /* Check if we can possibly peel the loop.  */
>> >> >>>>>>>> >> >> > >> >    if (!vect_can_advance_ivs_p (loop_vinfo)
>> >> >>>>>>>> >> >> > >> >        || !slpeel_can_duplicate_loop_p (loop,
>> >> >>>>>>single_exit (loop))
>> >> >>>>>>>> >> >> > >> > -      || loop->inner)
>> >> >>>>>>>> >> >> > >> > +      || loop->inner
>> >> >>>>>>>> >> >> > >> > +      /* Required peeling was performed in prologue
>> >> >>>>>>and
>> >> >>>>>>>> >> >> > >> > +        is not required for epilogue.  */
>> >> >>>>>>>> >> >> > >> > +      || LOOP_VINFO_EPILOGUE_P (loop_vinfo))
>> >> >>>>>>>> >> >> > >> >      do_peeling = false;
>> >> >>>>>>>> >> >> > >> >
>> >> >>>>>>>> >> >> > >> >    if (do_peeling
>> >> >>>>>>>> >> >> > >> > @@ -1888,7 +1897,10 @@ vect_enhance_data_refs_alignment
>> >> >>>>>>(loop_vec_info
>> >> >>>>>>>> >> >> > >> > loop_vinfo)
>> >> >>>>>>>> >> >> > >> >
>> >> >>>>>>>> >> >> > >> >    do_versioning =
>> >> >>>>>>>> >> >> > >> >         optimize_loop_nest_for_speed_p (loop)
>> >> >>>>>>>> >> >> > >> > -       && (!loop->inner); /* FORNOW */
>> >> >>>>>>>> >> >> > >> > +       && (!loop->inner) /* FORNOW */
>> >> >>>>>>>> >> >> > >> > +        /* Required versioning was performed for the
>> >> >>>>>>>> >> >> > >> > +          original loop and is not required for
>> >> >>>>>>epilogue.  */
>> >> >>>>>>>> >> >> > >> > +       && !LOOP_VINFO_EPILOGUE_P (loop_vinfo);
>> >> >>>>>>>> >> >> > >> >
>> >> >>>>>>>> >> >> > >> >    if (do_versioning)
>> >> >>>>>>>> >> >> > >> >      {
>> >> >>>>>>>> >> >> > >> >
>> >> >>>>>>>> >> >> > >> > please do that check in the single caller of this
>> >> >>>>>>function.
>> >> >>>>>>>> >> >> > >> >
>> >> >>>>>>>> >> >> > >> > Otherwise I still dislike the new ->aux use and I
>> >> >>>>>>believe that simply
>> >> >>>>>>>> >> >> > >> > passing down info from the processed parent would be
>> >> >>>>>>_much_ cleaner.
>> >> >>>>>>>> >> >> > >> > That is, here (and avoid the FOR_EACH_LOOP change):
>> >> >>>>>>>> >> >> > >> >
>> >> >>>>>>>> >> >> > >> > @@ -580,12 +586,21 @@ vectorize_loops (void)
>> >> >>>>>>>> >> >> > >> >             && dump_enabled_p ())
>> >> >>>>>>>> >> >> > >> >            dump_printf_loc (MSG_OPTIMIZED_LOCATIONS,
>> >> >>>>>>vect_location,
>> >> >>>>>>>> >> >> > >> >                             "loop vectorized\n");
>> >> >>>>>>>> >> >> > >> > -       vect_transform_loop (loop_vinfo);
>> >> >>>>>>>> >> >> > >> > +       new_loop = vect_transform_loop (loop_vinfo);
>> >> >>>>>>>> >> >> > >> >         num_vectorized_loops++;
>> >> >>>>>>>> >> >> > >> >         /* Now that the loop has been vectorized, allow
>> >> >>>>>>it to be unrolled
>> >> >>>>>>>> >> >> > >> >            etc.  */
>> >> >>>>>>>> >> >> > >> >         loop->force_vectorize = false;
>> >> >>>>>>>> >> >> > >> >
>> >> >>>>>>>> >> >> > >> > +       /* Add new loop to a processing queue.  To make
>> >> >>>>>>it easier
>> >> >>>>>>>> >> >> > >> > +          to match loop and its epilogue vectorization
>> >> >>>>>>in dumps
>> >> >>>>>>>> >> >> > >> > +          put new loop as the next loop to process.
>> >> >>>>>>*/
>> >> >>>>>>>> >> >> > >> > +       if (new_loop)
>> >> >>>>>>>> >> >> > >> > +         {
>> >> >>>>>>>> >> >> > >> > +           loops.safe_insert (i + 1, new_loop->num);
>> >> >>>>>>>> >> >> > >> > +           vect_loops_num = number_of_loops (cfun);
>> >> >>>>>>>> >> >> > >> > +         }
>> >> >>>>>>>> >> >> > >> >
>> >> >>>>>>>> >> >> > >> > simply dispatch to a vectorize_epilogue (loop_vinfo,
>> >> >>>>>>new_loop)
>> >> >>>>>>>> >> >> > >> > function which will set up stuff properly (and also
>> >> >>>>>>perform
>> >> >>>>>>>> >> >> > >> > the if-conversion of the epilogue there).
>> >> >>>>>>>> >> >> > >> >
>> >> >>>>>>>> >> >> > >> > That said, if we can get in non-masked epilogue
>> >> >>>>>>vectorization
>> >> >>>>>>>> >> >> > >> > separately that would be great.
>> >> >>>>>>>> >> >> > >> >
>> >> >>>>>>>> >> >> > >> > I'm still torn about all the rest of the stuff and
>> >> >>>>>>question its
>> >> >>>>>>>> >> >> > >> > usability (esp. merging the epilogue with the main
>> >> >>>>>>vector loop).
>> >> >>>>>>>> >> >> > >> > But it has already been approved ... oh well.
>> >> >>>>>>>> >> >> > >> >
>> >> >>>>>>>> >> >> > >> > Thanks,
>> >> >>>>>>>> >> >> > >> > Richard.
>> >> >>>>>>>> >> >> > >>
>> >> >>>>>>>> >> >> > >>
>> >> >>>>>>>> >> >> > >
>> >> >>>>>>>> >> >> > > --
>> >> >>>>>>>> >> >> > > Richard Biener <rguenther@suse.de>
>> >> >>>>>>>> >> >> > > SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard,
>> >> >>>>>>Graham Norton, HRB 21284 (AG Nuernberg)

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH, vec-tails] Support loop epilogue vectorization
  2016-12-01 14:27                                         ` Yuri Rumyantsev
@ 2016-12-01 14:46                                           ` Richard Biener
       [not found]                                             ` <CAEoMCqSkWgz+DJLe1M1CDxbk4LBtBU4r3rcVv7OcgpsGW4eTJA@mail.gmail.com>
  0 siblings, 1 reply; 38+ messages in thread
From: Richard Biener @ 2016-12-01 14:46 UTC (permalink / raw)
  To: Yuri Rumyantsev; +Cc: Christophe Lyon, Jeff Law, gcc-patches, Ilya Enkovich

On Thu, 1 Dec 2016, Yuri Rumyantsev wrote:

> Thanks Richard for your comments.
> 
> You asked me about possible performance improvements for AVX2 machines
> - we did not see any visible speed-up for spec2k with any method of

Spec 2000?  Can you check with SPEC 2006 or CPUv6?

Did you see performance degradation?  What about compile-time and
binary size effects?

> masking, including epilogue masking and combining, only on AVX512
> machine aka knl.

I see.

Note that as said in the initial review patch the cost model I
saw therein looked flawed.  In the end I'd expect a sensible
approach would be to do

 if (n < scalar-most-profitable-niter)
   {
     no vectorization
   }
 else if (n < masking-more-profitable-than-not-masking-plus-epilogue)
   {
     do masked vectorization
   }
 else
   {
     do unmasked vectorization (with epilogue, eventually vectorized)
   }

where for short trip loops the else path would never be taken
(statically).
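
The three-way split above can be sketched as a small C helper.  This is
only an illustration: the enum, the function name and both threshold
parameters are hypothetical and are not part of the patch or of the
vectorizer; in practice the thresholds would have to come from the cost
model.

```c
#include <assert.h>

/* Hypothetical names, for illustration only (not GCC internals).  */
enum vect_strategy
{
  VECT_NONE,      /* keep the scalar loop */
  VECT_MASKED,    /* single masked vector loop, no epilogue */
  VECT_UNMASKED   /* unmasked vector loop plus epilogue */
};

/* Choose a strategy for a loop running N iterations.
   MIN_VECT_NITER is the assumed trip count below which scalar code wins.
   MASK_LIMIT_NITER is the assumed trip count below which a masked loop
   beats the unmasked main loop plus its epilogue.  */
static enum vect_strategy
choose_vect_strategy (unsigned long n,
                      unsigned long min_vect_niter,
                      unsigned long mask_limit_niter)
{
  /* The sketch assumes the two thresholds are ordered.  */
  assert (min_vect_niter <= mask_limit_niter);

  if (n < min_vect_niter)
    return VECT_NONE;
  else if (n < mask_limit_niter)
    return VECT_MASKED;
  else
    return VECT_UNMASKED;
}
```

If n is a compile-time constant below the second threshold, the last
branch is statically dead, which is the short-trip case.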

And yes, that means masking will only be useful for short-trip loops,
which in the end means an overall performance benefit is unlikely
unless we have a lot of short-trip loops that are slow because of
the overhead of the main unmasked loop plus epilogue.

Richard.

> I will answer on your question later.
> 
> Best regards.
> Yuri
> 
> 2016-12-01 14:33 GMT+03:00 Richard Biener <rguenther@suse.de>:
> > On Mon, 28 Nov 2016, Yuri Rumyantsev wrote:
> >
> >> Richard!
> >>
> >> I attached the vect dump for the part of the attached test-case which
> >> illustrated how vectorization of epilogues works through masking:
> >> #define SIZE 1023
> >> #define ALIGN 64
> >>
> >> extern int posix_memalign(void **memptr, __SIZE_TYPE__ alignment,
> >> __SIZE_TYPE__ size) __attribute__((weak));
> >> extern void free (void *);
> >>
> >> void __attribute__((noinline))
> >> test_citer (int * __restrict__ a,
> >>    int * __restrict__ b,
> >>    int * __restrict__ c)
> >> {
> >>   int i;
> >>
> >>   a = (int *)__builtin_assume_aligned (a, ALIGN);
> >>   b = (int *)__builtin_assume_aligned (b, ALIGN);
> >>   c = (int *)__builtin_assume_aligned (c, ALIGN);
> >>
> >>   for (i = 0; i < SIZE; i++)
> >>     c[i] = a[i] + b[i];
> >> }
> >>
> >> It was compiled with -mavx2 --param vect-epilogues-mask=1 options.
> >>
> >> I did not include in this patch vectorization of low trip-count loops
> >> since in the original patch additional parameter was introduced:
> >> +DEFPARAM (PARAM_VECT_SHORT_LOOPS,
> >> +  "vect-short-loops",
> >> +  "Enable vectorization of low trip count loops using masking.",
> >> +  0, 0, 1)
> >>
> >> I assume that this ability can be included very quickly but it
> >> requires cost model enhancements also.
> >
> > Comments on the patch itself (as I'm having a closer look again,
> > I know how it vectorizes the above but I wondered why epilogue
> > and short-trip loops are not basically the same code path).
> >
> > Btw, I don't like that the features are behind a --param paywall.
> > That just means a) nobody will use it, b) it will bit-rot quickly,
> > c) bugs are well-hidden.
> >
> > +  if (loop_vinfo && LOOP_VINFO_CAN_BE_MASKED (loop_vinfo)
> > +      && integer_zerop (nested_in_vect_loop
> > +                       ? STMT_VINFO_DR_STEP (stmt_info)
> > +                       : DR_STEP (dr)))
> > +    {
> > +      if (dump_enabled_p ())
> > +       dump_printf_loc (MSG_NOTE, vect_location,
> > +                        "allow invariant load for masked loop.\n");
> > +    }
> >
> > this can test memory_access_type == VMAT_INVARIANT.  Please put
> > all the checks in a common
> >
> >   if (loop_vinfo && LOOP_VINFO_CAN_BE_MASKED (loop_vinfo))
> >     {
> >        if (memory_access_type == VMAT_INVARIANT)
> >          {
> >          }
> >        else if (...)
> >          {
> >             LOOP_VINFO_CAN_BE_MASKED (loop_vinfo) = false;
> >          }
> >        else if (..)
> > ...
> >     }
> >
> > @@ -6667,6 +6756,15 @@ vectorizable_load (gimple *stmt,
> > gimple_stmt_iterator *gsi, gimple **vec_stmt,
> >        gcc_assert (!nested_in_vect_loop);
> >        gcc_assert (!STMT_VINFO_GATHER_SCATTER_P (stmt_info));
> >
> > +      if (loop_vinfo && LOOP_VINFO_CAN_BE_MASKED (loop_vinfo))
> > +       {
> > +         if (dump_enabled_p ())
> > +           dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> > +                            "cannot be masked: grouped access is not"
> > +                            " supported.");
> > +         LOOP_VINFO_CAN_BE_MASKED (loop_vinfo) = false;
> > +      }
> > +
> >
> > isn't this already handled by the above?  Or rather the general
> > disallowance of SLP?
> >
> > @@ -5730,6 +5792,24 @@ vectorizable_store (gimple *stmt,
> > gimple_stmt_iterator *gsi, gimple **vec_stmt,
> >                             &memory_access_type, &gs_info))
> >      return false;
> >
> > +  if (loop_vinfo && LOOP_VINFO_CAN_BE_MASKED (loop_vinfo)
> > +      && memory_access_type != VMAT_CONTIGUOUS)
> > +    {
> > +      LOOP_VINFO_CAN_BE_MASKED (loop_vinfo) = false;
> > +      if (dump_enabled_p ())
> > +       dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> > +                        "cannot be masked: unsupported memory access
> > type.\n");
> > +    }
> > +
> > +  if (loop_vinfo && LOOP_VINFO_CAN_BE_MASKED (loop_vinfo)
> > +      && !can_mask_load_store (stmt))
> > +    {
> > +      LOOP_VINFO_CAN_BE_MASKED (loop_vinfo) = false;
> > +      if (dump_enabled_p ())
> > +       dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> > +                        "cannot be masked: unsupported masked store.\n");
> > +    }
> > +
> >
> > likewise please combine the ifs.
> >
> > @@ -2354,7 +2401,10 @@ vectorizable_mask_load_store (gimple *stmt,
> > gimple_stmt_iterator *gsi,
> >                                           ptr, vec_mask, vec_rhs);
> >           vect_finish_stmt_generation (stmt, new_stmt, gsi);
> >           if (i == 0)
> > -           STMT_VINFO_VEC_STMT (stmt_info) = *vec_stmt = new_stmt;
> > +           {
> > +             STMT_VINFO_VEC_STMT (stmt_info) = *vec_stmt = new_stmt;
> > +             STMT_VINFO_FIRST_COPY_P (vinfo_for_stmt (new_stmt)) = true;
> > +           }
> >           else
> >             STMT_VINFO_RELATED_STMT (prev_stmt_info) = new_stmt;
> >           prev_stmt_info = vinfo_for_stmt (new_stmt);
> >
> > here you only set the flag, elsewhere you copy DR and VECTYPE as well.
> >
> > @@ -2113,6 +2146,20 @@ vectorizable_mask_load_store (gimple *stmt,
> > gimple_stmt_iterator *gsi,
> >                && !useless_type_conversion_p (vectype, rhs_vectype)))
> >      return false;
> >
> > +  if (LOOP_VINFO_CAN_BE_MASKED (loop_vinfo))
> > +    {
> > +      /* Check that mask conjuction is supported.  */
> > +      optab tab;
> > +      tab = optab_for_tree_code (BIT_AND_EXPR, vectype, optab_default);
> > +      if (!tab || optab_handler (tab, TYPE_MODE (vectype)) ==
> > CODE_FOR_nothing)
> > +       {
> > +         if (dump_enabled_p ())
> > +           dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> > +                            "cannot be masked: unsupported mask
> > operation\n");
> > +         LOOP_VINFO_CAN_BE_MASKED (loop_vinfo) = false;
> > +       }
> > +    }
> >
> > does this really test whether we can bit-and the mask?  You are
> > using the vector type of the store (which might be V2DF for example),
> > also for AVX512 it might be a vector-bool type with integer mode?
> > Of course we maybe can simply assume mask conjunction is available
> > (I know no ISA where that would be not true).
> >
> > +/* Return true if STMT can be converted to masked form.  */
> > +
> > +static bool
> > +can_mask_load_store (gimple *stmt)
> > +{
> > +  stmt_vec_info stmt_info = vinfo_for_stmt (stmt);
> > +  tree vectype, mask_vectype;
> > +  tree lhs, ref;
> > +
> > +  if (!stmt_info)
> > +    return false;
> > +  lhs = gimple_assign_lhs (stmt);
> > +  ref = (TREE_CODE (lhs) == SSA_NAME) ? gimple_assign_rhs1 (stmt) : lhs;
> > +  if (may_be_nonaddressable_p (ref))
> > +    return false;
> > +  vectype = STMT_VINFO_VECTYPE (stmt_info);
> >
> > You probably modeled this after ifcvt_can_use_mask_load_store but I
> > don't think checking may_be_nonaddressable_p is necessary (we couldn't
> > even vectorize such refs).  stmt_info should never be NULL either.
> > With the check removed tree-ssa-loop-ivopts.h should no longer be
> > necessary.
> >
> > +static void
> > +vect_mask_load_store_stmt (gimple *stmt, tree vectype, tree mask,
> > +                          data_reference *dr, gimple_stmt_iterator *si)
> > +{
> > ...
> > +  addr = force_gimple_operand_gsi (&gsi, build_fold_addr_expr (mem),
> > +                                  true, NULL_TREE, true,
> > +                                  GSI_SAME_STMT);
> > +
> > +  align = TYPE_ALIGN_UNIT (vectype);
> > +  if (aligned_access_p (dr))
> > +    misalign = 0;
> > +  else if (DR_MISALIGNMENT (dr) == -1)
> > +    {
> > +      align = TYPE_ALIGN_UNIT (elem_type);
> > +      misalign = 0;
> > +    }
> > +  else
> > +    misalign = DR_MISALIGNMENT (dr);
> > +  set_ptr_info_alignment (get_ptr_info (addr), align, misalign);
> > +  ptr = build_int_cst (reference_alias_ptr_type (mem),
> > +                      misalign ? misalign & -misalign : align);
> >
> > you should simply use
> >
> >   align = get_object_alignment (mem) / BITS_PER_UNIT;
> >
> > here rather than trying to be clever.  Eventually you don't need
> > the DR then (see question above).
> >
> > +    }
> > +  gsi_replace (si ? si : &gsi, new_stmt, false);
> >
> > when you replace the load/store please previously copy VUSE and VDEF
> > from the original one (we were nearly clean enough to no longer
> > require a virtual operand rewrite after vectorization...)  Thus
> >
> >   gimple_set_vuse (new_stmt, gimple_vuse (stmt));
> >   gimple_set_vdef (new_stmt, gimple_vdef (stmt));
> >
> > +static void
> > +vect_mask_loop (loop_vec_info loop_vinfo)
> > +{
> > ...
> > +  /* Scan all loop statements to convert vector load/store including
> > masked
> > +     form.  */
> > +  for (unsigned i = 0; i < loop->num_nodes; i++)
> > +    {
> > +      basic_block bb = bbs[i];
> > +      for (gimple_stmt_iterator si = gsi_start_bb (bb);
> > +          !gsi_end_p (si); gsi_next (&si))
> > +       {
> > +         gimple *stmt = gsi_stmt (si);
> > +         stmt_vec_info stmt_info = NULL;
> > +         tree vectype = NULL;
> > +         data_reference *dr;
> > +
> > +         /* Mask load case.  */
> > +         if (is_gimple_call (stmt)
> > +             && gimple_call_internal_p (stmt)
> > +             && gimple_call_internal_fn (stmt) == IFN_MASK_LOAD
> > +             && !VECTOR_TYPE_P (TREE_TYPE (gimple_call_arg (stmt, 2))))
> > +           {
> > ...
> > +             /* Skip invariant loads.  */
> > +             if (integer_zerop (nested_in_vect_loop_p (loop, stmt)
> > +                                ? STMT_VINFO_DR_STEP (stmt_info)
> > +                                : DR_STEP (STMT_VINFO_DATA_REF
> > (stmt_info))))
> > +               continue;
> >
> > seeing this it would be nice if stmt_info had a flag for whether
> > the stmt needs masking (and a flag on whether this is a scalar or a
> > vectorized stmt).
> >
> > +         /* Skip hoisted out statements.  */
> > +         if (!flow_bb_inside_loop_p (loop, gimple_bb (stmt)))
> > +           continue;
> >
> > err, you walk stmts in the loop!  Isn't this covered by the above
> > skipping of 'invariant loads'?
> >
> > +static gimple *
> > +vect_mask_reduction_stmt (gimple *stmt, tree mask, gimple *prev)
> > +{
> >
> > depending on the reduction operand there are variants that
> > could get away w/o the VEC_COND_EXPR, like
> >
> >   S1': tem_4 = d_3 & MASK;
> >   S2': r_1 = r_2 + tem_4;
> >
> > which works for plus at least.  More generally doing
> >
> >   S1': tem_4 = VEC_COND_EXPR<MASK, d_3, neutral operand>
> >   S2': r_1 = r_2 OP tem_4;
> >
> > and leaving optimization to & to later opts (& won't work for
> > AVX512 mask registers I guess).
> >
> > Good enough for later enhancement of course.
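
For illustration, the two reduction variants quoted above can be emulated
lane by lane in plain C.  This is a scalar sketch, not vectorizer code:
the function names are hypothetical, and mask lanes are assumed to be
either 0 (inactive) or -1 (all bits set), which is what makes the AND
form equivalent for plus.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Emulates  tem = VEC_COND_EXPR <MASK, d, neutral>;  r = r + tem;
   The neutral operand of plus is 0.  */
static int32_t
masked_sum_select (const int32_t *d, const int32_t *mask, size_t n)
{
  assert (d != NULL && mask != NULL);
  int32_t r = 0;
  for (size_t i = 0; i < n; i++)
    {
      int32_t tem = mask[i] ? d[i] : 0;
      r += tem;
    }
  return r;
}

/* Emulates  tem = d & MASK;  r = r + tem;
   Only valid because each mask lane is assumed to be 0 or -1.  */
static int32_t
masked_sum_and (const int32_t *d, const int32_t *mask, size_t n)
{
  assert (d != NULL && mask != NULL);
  int32_t r = 0;
  for (size_t i = 0; i < n; i++)
    {
      int32_t tem = d[i] & mask[i];
      r += tem;
    }
  return r;
}
```

The select form generalizes to other reduction operations by swapping
the neutral operand (for example 1 for multiplication); the AND form
does not.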
> >
> > +static void
> > +vect_gen_ivs_for_masking (loop_vec_info loop_vinfo, vec<tree> *ivs)
> > +{
> > ...
> >
> > isn't it enough to always create a single IV and derive the
> > additional copies by IV + i * { elems, elems, elems ... }?
> > IVs are expensive -- I'm sure we can optimize the rest of the
> > scheme further as well but this one looks obvious to me.
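
A scalar sketch of the single-IV idea, under assumptions: one IV vector
holds the lane indices for the first copy, and copy k is derived as
IV + k * { elems, elems, ... }.  A lane is active while its index is
below the iteration count.  The function name and parameters are
hypothetical and only model the arithmetic, not the generated GIMPLE.

```c
#include <assert.h>
#include <stddef.h>

/* Fill MASK with NCOPIES * ELEMS lanes for the vector iteration whose
   first scalar index is BASE.  Active lanes get -1, inactive lanes 0.  */
static void
gen_masks_from_single_iv (unsigned base, unsigned elems, unsigned ncopies,
                          unsigned niters, int *mask)
{
  assert (mask != NULL);
  for (unsigned k = 0; k < ncopies; k++)
    for (unsigned j = 0; j < elems; j++)
      {
        unsigned iv = base + j;          /* lane j of the single IV */
        unsigned idx = iv + k * elems;   /* lane j of derived copy k */
        mask[k * elems + j] = idx < niters ? -1 : 0;
      }
}
```

Only the single IV needs to be carried around the loop; the additions
for the extra copies are loop-invariant offsets.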
> >
> > @@ -3225,12 +3508,32 @@ vect_estimate_min_profitable_iters (loop_vec_info
> > loop_vinfo,
> >    int npeel = LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo);
> >    void *target_cost_data = LOOP_VINFO_TARGET_COST_DATA (loop_vinfo);
> >
> > +  if (LOOP_VINFO_EPILOGUE_P (loop_vinfo))
> > +    {
> > +      /* Currently we don't produce scalar epilogue version in case
> > +        its masked version is provided.  It means we don't need to
> > +        compute profitability one more time here.  Just make a
> > +        masked loop version.  */
> > +      if (LOOP_VINFO_CAN_BE_MASKED (loop_vinfo)
> > +         && PARAM_VALUE (PARAM_VECT_EPILOGUES_MASK))
> > +       {
> > +         dump_printf_loc (MSG_NOTE, vect_location,
> > +                          "cost model: mask loop epilogue.\n");
> > +         LOOP_VINFO_MASK_LOOP (loop_vinfo) = true;
> > +         *ret_min_profitable_niters = 0;
> > +         *ret_min_profitable_estimate = 0;
> > +         return;
> > +       }
> > +    }
> >    /* Cost model disabled.  */
> > -  if (unlimited_cost_model (LOOP_VINFO_LOOP (loop_vinfo)))
> > +  else if (unlimited_cost_model (LOOP_VINFO_LOOP (loop_vinfo)))
> >      {
> >        dump_printf_loc (MSG_NOTE, vect_location, "cost model
> > disabled.\n");
> >        *ret_min_profitable_niters = 0;
> >        *ret_min_profitable_estimate = 0;
> > +      if (PARAM_VALUE (PARAM_VECT_EPILOGUES_MASK)
> > +         && LOOP_VINFO_CAN_BE_MASKED (loop_vinfo))
> > +       LOOP_VINFO_MASK_LOOP (loop_vinfo) = true;
> >        return;
> >      }
> >
> > the unlimited_cost_model case should come first?  OTOH masking or
> > not is probably not sth covered by 'unlimited' - that is about
> > vectorizing or not.  But the above code means that for
> > epilogue vectorization w/o masking we ignore unlimited_cost_model ()?
> > That doesn't make sense to me.
> >
> > Plus if this is short-trip or epilogue vectorization and the
> > cost model is _not_ unlimited then we don't want to enable
> > masking always (if it is possible).  It might be we statically
> > know the epilogue executes for at most two iterations for example.
> >
> > I don't see _any_ cost model for vectorizing the epilogue with
> > masking?  Am I missing something?  A "trivial" cost model
> > should at least consider the additional IV(s), the mask
> > compute and the widening and narrowing ops required.
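To make the request concrete, a "trivial" cost comparison could look roughly like the sketch below. Everything here is an assumption for illustration: the struct, its fields and the decision rule are hypothetical and do not correspond to the vectorizer's cost hooks. It only shows the listed overheads (additional IVs, mask compute, widening/narrowing) being weighed against a statically known upper bound on scalar epilogue iterations.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical cost inputs, in arbitrary units.  */
struct mask_cost {
  unsigned vec_body;      /* one vector iteration of the loop body */
  unsigned iv;            /* one additional masking IV */
  unsigned n_ivs;         /* number of additional masking IVs */
  unsigned mask_compute;  /* computing the loop mask(s) */
  unsigned widen_narrow;  /* mask widening and narrowing ops */
  unsigned scalar_iter;   /* one scalar iteration */
};

/* Return true if one masked vector iteration is estimated to be cheaper
   than running up to `max_epilogue_iters` scalar iterations.  */
bool masking_profitable(const struct mask_cost *c, unsigned max_epilogue_iters)
{
  assert(c != NULL);
  unsigned masked = c->vec_body + c->n_ivs * c->iv
                    + c->mask_compute + c->widen_narrow;
  unsigned scalar = max_epilogue_iters * c->scalar_iter;
  return masked < scalar;
}
```

With such a rule, an epilogue known to run for at most two iterations can come out as not worth masking, while a longer one can.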
> >
> > diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c
> > index e13d6a2..36be342 100644
> > --- a/gcc/tree-vect-loop-manip.c
> > +++ b/gcc/tree-vect-loop-manip.c
> > @@ -1635,6 +1635,13 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree
> > niters, tree nitersm1,
> >    bool epilog_peeling = (LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo)
> >                          || LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo));
> >
> > +  if (LOOP_VINFO_EPILOGUE_P (loop_vinfo))
> > +    {
> > +      prolog_peeling = false;
> > +      if (LOOP_VINFO_MASK_LOOP (loop_vinfo))
> > +       epilog_peeling = false;
> > +    }
> > +
> >    if (!prolog_peeling && !epilog_peeling)
> >      return NULL;
> >
> > I think the prolog_peeling was fixed during the epilogue vectorization
> > review and should no longer be necessary.  Please add
> > a && ! LOOP_VINFO_MASK_LOOP () to the epilog_peeling init instead
> > (it should also work for short-trip loop vectorization).
> >
> > @@ -2022,11 +2291,18 @@ start_over:
> >        || (max_niter != -1
> >           && (unsigned HOST_WIDE_INT) max_niter < vectorization_factor))
> >      {
> > -      if (dump_enabled_p ())
> > -       dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> > -                        "not vectorized: iteration count smaller than "
> > -                        "vectorization factor.\n");
> > -      return false;
> > +      /* Allow low trip count for loop epilogue we want to mask.  */
> > +      if (LOOP_VINFO_EPILOGUE_P (loop_vinfo)
> > +         && PARAM_VALUE (PARAM_VECT_EPILOGUES_MASK))
> > +       LOOP_VINFO_MASK_LOOP (loop_vinfo) = true;
> > +      else
> > +       {
> > +         if (dump_enabled_p ())
> >
> > so why do we test only LOOP_VINFO_EPILOGUE_P here?  All the code
> > I saw so far would also work for the main loop (but the cost
> > model is missing).
> >
> > I am missing testcases.  There's only a single one but we should
> > have cases covering all kinds of mask IV widths and widen/shorten
> > masks.
> >
> > Do you have any numbers on SPEC 2k6 with epilogue vect and/or masking
> > enabled for an AVX2 machine?
> >
> > Oh, and I really dislike the --param paywall.
> >
> > Thanks,
> > Richard.
> >
> >> Best regards.
> >> Yuri.
> >>
> >>
> >> 2016-11-28 17:39 GMT+03:00 Richard Biener <rguenther@suse.de>:
> >> > On Thu, 24 Nov 2016, Yuri Rumyantsev wrote:
> >> >
> >> >> Hi All,
> >> >>
> >> >> Here is the second patch which supports epilogue vectorization using
> >> >> masking without cost model. Currently it is possible
> >> >> only with passing parameter "--param vect-epilogues-mask=1".
> >> >>
> >> >> Bootstrapping and regression testing did not show any new regression.
> >> >>
> >> >> Any comments will be appreciated.
> >> >
> >> > Going over the patch the main question is how it works -- it looks
> >> > like the decision whether to vectorize & mask the epilogue is made
> >> > when vectorizing the loop that generates the epilogue rather than
> >> > in the epilogue vectorization path?
> >> >
> >> > That is, I'd have expected to see this handling low-trip count loops
> >> > by masking?  And thus masking the epilogue simply by it being
> >> > low-trip count?
> >> >
> >> > Richard.
> >> >
> >> >> ChangeLog:
> >> >> 2016-11-24  Yuri Rumyantsev  <ysrumyan@gmail.com>
> >> >>
> >> >> * params.def (PARAM_VECT_EPILOGUES_MASK): New.
> >> >> * tree-vect-data-refs.c (vect_get_new_ssa_name): Support vect_mask_var.
> >> >> * tree-vect-loop.c: Include insn-config.h, recog.h and alias.h.
> >> >> (new_loop_vec_info): Add zeroing can_be_masked, mask_loop and
> >> >> required_mask fields.
> >> >> (vect_check_required_masks_widening): New.
> >> >> (vect_check_required_masks_narrowing): New.
> >> >> (vect_get_masking_iv_elems): New.
> >> >> (vect_get_masking_iv_type): New.
> >> >> (vect_get_extreme_masks): New.
> >> >> (vect_check_required_masks): New.
> >> >> (vect_analyze_loop_operations): Call vect_check_required_masks if all
> >> >> statements can be masked.
> >> >> (vect_analyze_loop_2): Initialize to zero min_scalar_loop_bound.
> >> >> Add check that epilogue can be masked with the same vf with issue
> >> >> fail notes.  Allow epilogue vectorization through masking of low trip
> >> >> loops. Set to true can_be_masked field before loop operation analysis.
> >> >> Do not set-up min_scalar_loop_bound for epilogue vectorization through
> >> >> masking.  Do not peel for epilogue masking.  Reset can_be_masked
> >> >> field before repeat analysis.
> >> >> (vect_estimate_min_profitable_iters): Do not compute profitability
> >> >> for epilogue masking.  Set up mask_loop field to true if parameter
> >> >> PARAM_VECT_EPILOGUES_MASK is non-zero.
> >> >> (vectorizable_reduction): Add check that statement can be masked.
> >> >> (vectorizable_induction): Do not support masking for induction.
> >> >> (vect_gen_ivs_for_masking): New.
> >> >> (vect_get_mask_index_for_elems): New.
> >> >> (vect_get_mask_index_for_type): New.
> >> >> (vect_create_narrowed_masks): New.
> >> >> (vect_create_widened_masks): New.
> >> >> (vect_gen_loop_masks): New.
> >> >> (vect_mask_reduction_stmt): New.
> >> >> (vect_mask_mask_load_store_stmt): New.
> >> >> (vect_mask_load_store_stmt): New.
> >> >> (vect_mask_loop): New.
> >> >> (vect_transform_loop): Invoke vect_mask_loop if required.
> >> >> Use div_ceil to recompute upper bounds for masked loops.  Issue
> >> >> statistics for epilogue vectorization through masking. Do not reduce
> >> >> vf for masking epilogue.
> >> >> * tree-vect-stmts.c: Include tree-ssa-loop-ivopts.h.
> >> >> (can_mask_load_store): New.
> >> >> (vectorizable_mask_load_store): Check that mask conjunction is
> >> >> supported.  Set-up first_copy_p field of stmt_vinfo.
> >> >> (vectorizable_simd_clone_call): Check that simd clone can not be
> >> >> masked.
> >> >> (vectorizable_store): Check that store can be masked. Mark the first
> >> >> copy of generated vector stores and provide it with vectype and the
> >> >> original data reference.
> >> >> (vectorizable_load): Check that load can be masked.
> >> >> (vect_stmt_should_be_masked_for_epilogue): New.
> >> >> (vect_add_required_mask_for_stmt): New.
> >> >> (vect_analyze_stmt): Add check on unsupported statements for masking
> >> >> with printing message.
> >> >> * tree-vectorizer.h (struct _loop_vec_info): Add new fields
> >> >> can_be_masked, required_masks, mask_loop.
> >> >> (LOOP_VINFO_CAN_BE_MASKED): New.
> >> >> (LOOP_VINFO_REQUIRED_MASKS): New.
> >> >> (LOOP_VINFO_MASK_LOOP): New.
> >> >> (struct _stmt_vec_info): Add first_copy_p field.
> >> >> (STMT_VINFO_FIRST_COPY_P): New.
> >> >>
> >> >> gcc/testsuite/
> >> >>
> >> >> * gcc.dg/vect/vect-tail-mask-1.c: New test.
> >> >>
> >> >> 2016-11-18 18:54 GMT+03:00 Christophe Lyon <christophe.lyon@linaro.org>:
> >> >> > On 18 November 2016 at 16:46, Yuri Rumyantsev <ysrumyan@gmail.com> wrote:
> >> >> >> It is very strange that this test failed on arm, since it requires
> >> >> >> target avx2 to check vectorizer dumps:
> >> >> >>
> >> >> >> /* { dg-final { scan-tree-dump-times "LOOP VECTORIZED" 2 "vect" {
> >> >> >> target avx2_runtime } } } */
> >> >> >> /* { dg-final { scan-tree-dump-times "LOOP EPILOGUE VECTORIZED
> >> >> >> \\(VS=16\\)" 2 "vect" { target avx2_runtime } } } */
> >> >> >>
> >> >> >> Could you please clarify what is the reason of the failure?
> >> >> >
> >> >> > It's not the scan-dumps that fail, but the execution.
> >> >> > The test calls abort() for some reason.
> >> >> >
> >> >> > It will take me a while to rebuild the test manually in the right
> >> >> > debug environment to provide you with more traces.
> >> >> >
> >> >> >
> >> >> >
> >> >> >>
> >> >> >> Thanks.
> >> >> >>
> >> >> >> 2016-11-18 16:20 GMT+03:00 Christophe Lyon <christophe.lyon@linaro.org>:
> >> >> >>> On 15 November 2016 at 15:41, Yuri Rumyantsev <ysrumyan@gmail.com> wrote:
> >> >> >>>> Hi All,
> >> >> >>>>
> >> >> >>>> Here is the patch for non-masked epilogue vectorization.
> >> >> >>>>
> >> >> >>>> Bootstrap and regression testing did not show any new failures.
> >> >> >>>>
> >> >> >>>> Is it OK for trunk?
> >> >> >>>>
> >> >> >>>> Thanks.
> >> >> >>>> Changelog:
> >> >> >>>>
> >> >> >>>> 2016-11-15  Yuri Rumyantsev  <ysrumyan@gmail.com>
> >> >> >>>>
> >> >> >>>> * params.def (PARAM_VECT_EPILOGUES_NOMASK): New.
> >> >> >>>> * tree-if-conv.c (tree_if_conversion): Make public.
> >> >> >>>> * tree-if-conv.h: New file.
> >> >> >>>> * tree-vect-data-refs.c (vect_analyze_data_ref_dependences): Avoid
> >> >> >>>> dynamic alias checks for epilogues.
> >> >> >>>> * tree-vect-loop-manip.c (vect_do_peeling): Return created epilog.
> >> >> >>>> * tree-vect-loop.c: include tree-if-conv.h.
> >> >> >>>> (new_loop_vec_info): Add zeroing orig_loop_info field.
> >> >> >>>> (vect_analyze_loop_2): Don't try to enhance alignment for epilogues.
> >> >> >>>> (vect_analyze_loop): Add argument ORIG_LOOP_INFO which is not NULL
> >> >> >>>> if epilogue is vectorized, set up orig_loop_info field of loop_vinfo
> >> >> >>>> using passed argument.
> >> >> >>>> (vect_transform_loop): Check if created epilogue should be returned
> >> >> >>>> for further vectorization with less vf.  If-convert epilogue if
> >> >> >>>> required. Print vectorization success for epilogue.
> >> >> >>>> * tree-vectorizer.c (vectorize_loops): Add epilogue vectorization
> >> >> >>>> if it is required, pass loop_vinfo produced during vectorization of
> >> >> >>>> loop body to vect_analyze_loop.
> >> >> >>>> * tree-vectorizer.h (struct _loop_vec_info): Add new field
> >> >> >>>> orig_loop_info.
> >> >> >>>> (LOOP_VINFO_ORIG_LOOP_INFO): New.
> >> >> >>>> (LOOP_VINFO_EPILOGUE_P): New.
> >> >> >>>> (LOOP_VINFO_ORIG_VECT_FACTOR): New.
> >> >> >>>> (vect_do_peeling): Change prototype to return epilogue.
> >> >> >>>> (vect_analyze_loop): Add argument of loop_vec_info type.
> >> >> >>>> (vect_transform_loop): Return created loop.
> >> >> >>>>
> >> >> >>>> gcc/testsuite/
> >> >> >>>>
> >> >> >>>> * lib/target-supports.exp (check_avx2_hw_available): New.
> >> >> >>>> (check_effective_target_avx2_runtime): New.
> >> >> >>>> * gcc.dg/vect/vect-tail-nomask-1.c: New test.
> >> >> >>>>
> >> >> >>>
> >> >> >>> Hi,
> >> >> >>>
> >> >> >>> This new test fails on arm-none-eabi (using default cpu/fpu/mode):
> >> >> >>>   gcc.dg/vect/vect-tail-nomask-1.c -flto -ffat-lto-objects execution test
> >> >> >>>   gcc.dg/vect/vect-tail-nomask-1.c execution test
> >> >> >>>
> >> >> >>> It does pass on the same target if configured --with-cpu=cortex-a9.
> >> >> >>>
> >> >> >>> Christophe
> >> >> >>>
> >> >> >>>
> >> >> >>>
> >> >> >>>>
> >> >> >>>> 2016-11-14 20:04 GMT+03:00 Richard Biener <rguenther@suse.de>:
> >> >> >>>>> On November 14, 2016 4:39:40 PM GMT+01:00, Yuri Rumyantsev <ysrumyan@gmail.com> wrote:
> >> >> >>>>>>Richard,
> >> >> >>>>>>
> >> >> >>>>>>I checked one of the tests designed for epilogue vectorization using
> >> >> >>>>>>patches 1 - 3 and found out that build compiler performs vectorization
> >> >> >>>>>>of epilogues with --param vect-epilogues-nomask=1 passed:
> >> >> >>>>>>
> >> >> >>>>>>$ gcc -Ofast -mavx2 t1.c -S --param vect-epilogues-nomask=1 -o
> >> >> >>>>>>t1.new-nomask.s -fdump-tree-vect-details
> >> >> >>>>>>$ grep VECTORIZED -c t1.c.156t.vect
> >> >> >>>>>>4
> >> >> >>>>>> Without param only 2 loops are vectorized.
> >> >> >>>>>>
> >> >> >>>>>>Should I simply add a part of tests related to this feature or I must
> >> >> >>>>>>delete all not necessary changes also?
> >> >> >>>>>
> >> >> >>>>> Please remove all not necessary changes.
> >> >> >>>>>
> >> >> >>>>> Richard.
> >> >> >>>>>
> >> >> >>>>>>Thanks.
> >> >> >>>>>>Yuri.
> >> >> >>>>>>
> >> >> >>>>>>2016-11-14 16:40 GMT+03:00 Richard Biener <rguenther@suse.de>:
> >> >> >>>>>>> On Mon, 14 Nov 2016, Yuri Rumyantsev wrote:
> >> >> >>>>>>>
> >> >> >>>>>>>> Richard,
> >> >> >>>>>>>>
> >> >> >>>>>>>> In my previous patch I forgot to remove couple lines related to aux
> >> >> >>>>>>field.
> >> >> >>>>>>>> Here is the correct updated patch.
> >> >> >>>>>>>
> >> >> >>>>>>> Yeah, I noticed.  This patch would be ok for trunk (together with
> >> >> >>>>>>> necessary parts from 1 and 2) if all not required parts are removed
> >> >> >>>>>>> (and you'd add the testcases covering non-masked tail vect).
> >> >> >>>>>>>
> >> >> >>>>>>> Thus, can you please produce a single complete patch containing only
> >> >> >>>>>>> non-masked epilogue vectorization?
> >> >> >>>>>>>
> >> >> >>>>>>> Thanks,
> >> >> >>>>>>> Richard.
> >> >> >>>>>>>
> >> >> >>>>>>>> Thanks.
> >> >> >>>>>>>> Yuri.
> >> >> >>>>>>>>
> >> >> >>>>>>>> 2016-11-14 15:51 GMT+03:00 Richard Biener <rguenther@suse.de>:
> >> >> >>>>>>>> > On Fri, 11 Nov 2016, Yuri Rumyantsev wrote:
> >> >> >>>>>>>> >
> >> >> >>>>>>>> >> Richard,
> >> >> >>>>>>>> >>
> >> >> >>>>>>>> >> I prepare updated 3 patch with passing additional argument to
> >> >> >>>>>>>> >> vect_analyze_loop as you proposed (untested).
> >> >> >>>>>>>> >>
> >> >> >>>>>>>> >> You wrote:
> >> >> >>>>>>>> >> Btw, I wonder if you can produce a single patch containing just
> >> >> >>>>>>>> >> epilogue vectorization, that is combine patches 1-3 but rip out
> >> >> >>>>>>>> >> changes only needed by later patches?
> >> >> >>>>>>>> >>
> >> >> >>>>>>>> >> Did you mean that I exclude all support for vectorization
> >> >> >>>>>>epilogues,
> >> >> >>>>>>>> >> i.e. exclude from 2-nd patch all non-related changes
> >> >> >>>>>>>> >> like
> >> >> >>>>>>>> >>
> >> >> >>>>>>>> >> diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
> >> >> >>>>>>>> >> index 11863af..32011c1 100644
> >> >> >>>>>>>> >> --- a/gcc/tree-vect-loop.c
> >> >> >>>>>>>> >> +++ b/gcc/tree-vect-loop.c
> >> >> >>>>>>>> >> @@ -1120,6 +1120,12 @@ new_loop_vec_info (struct loop *loop)
> >> >> >>>>>>>> >>    LOOP_VINFO_PEELING_FOR_GAPS (res) = false;
> >> >> >>>>>>>> >>    LOOP_VINFO_PEELING_FOR_NITER (res) = false;
> >> >> >>>>>>>> >>    LOOP_VINFO_OPERANDS_SWAPPED (res) = false;
> >> >> >>>>>>>> >> +  LOOP_VINFO_CAN_BE_MASKED (res) = false;
> >> >> >>>>>>>> >> +  LOOP_VINFO_REQUIRED_MASKS (res) = 0;
> >> >> >>>>>>>> >> +  LOOP_VINFO_COMBINE_EPILOGUE (res) = false;
> >> >> >>>>>>>> >> +  LOOP_VINFO_MASK_EPILOGUE (res) = false;
> >> >> >>>>>>>> >> +  LOOP_VINFO_NEED_MASKING (res) = false;
> >> >> >>>>>>>> >> +  LOOP_VINFO_ORIG_LOOP_INFO (res) = NULL;
> >> >> >>>>>>>> >
> >> >> >>>>>>>> > Yes.
> >> >> >>>>>>>> >
> >> >> >>>>>>>> >> Did you mean also that new combined patch must be working patch,
> >> >> >>>>>>i.e.
> >> >> >>>>>>>> >> can be integrated without other patches?
> >> >> >>>>>>>> >
> >> >> >>>>>>>> > Yes.
> >> >> >>>>>>>> >
> >> >> >>>>>>>> >> Could you please look at updated patch?
> >> >> >>>>>>>> >
> >> >> >>>>>>>> > Will do.
> >> >> >>>>>>>> >
> >> >> >>>>>>>> > Thanks,
> >> >> >>>>>>>> > Richard.
> >> >> >>>>>>>> >
> >> >> >>>>>>>> >> Thanks.
> >> >> >>>>>>>> >> Yuri.
> >> >> >>>>>>>> >>
> >> >> >>>>>>>> >> 2016-11-10 15:36 GMT+03:00 Richard Biener <rguenther@suse.de>:
> >> >> >>>>>>>> >> > On Thu, 10 Nov 2016, Richard Biener wrote:
> >> >> >>>>>>>> >> >
> >> >> >>>>>>>> >> >> On Tue, 8 Nov 2016, Yuri Rumyantsev wrote:
> >> >> >>>>>>>> >> >>
> >> >> >>>>>>>> >> >> > Richard,
> >> >> >>>>>>>> >> >> >
> >> >> >>>>>>>> >> >> > Here is updated 3 patch.
> >> >> >>>>>>>> >> >> >
> >> >> >>>>>>>> >> >> > I checked that all new tests related to epilogue
> >> >> >>>>>>vectorization passed with it.
> >> >> >>>>>>>> >> >> >
> >> >> >>>>>>>> >> >> > Your comments will be appreciated.
> >> >> >>>>>>>> >> >>
> >> >> >>>>>>>> >> >> A lot better now.  Instead of the ->aux dance I now prefer to
> >> >> >>>>>>>> >> >> pass the original loops loop_vinfo to vect_analyze_loop as
> >> >> >>>>>>>> >> >> optional argument (if non-NULL we analyze the epilogue of that
> >> >> >>>>>>>> >> >> loop_vinfo).  OTOH I remember we mainly use it to get at the
> >> >> >>>>>>>> >> >> original vectorization factor?  So we can pass down an
> >> >> >>>>>>(optional)
> >> >> >>>>>>>> >> >> forced vectorization factor as well?
> >> >> >>>>>>>> >> >
> >> >> >>>>>>>> >> > Btw, I wonder if you can produce a single patch containing just
> >> >> >>>>>>>> >> > epilogue vectorization, that is combine patches 1-3 but rip out
> >> >> >>>>>>>> >> > changes only needed by later patches?
> >> >> >>>>>>>> >> >
> >> >> >>>>>>>> >> > Thanks,
> >> >> >>>>>>>> >> > Richard.
> >> >> >>>>>>>> >> >
> >> >> >>>>>>>> >> >> Richard.
> >> >> >>>>>>>> >> >>
> >> >> >>>>>>>> >> >> > 2016-11-08 15:38 GMT+03:00 Richard Biener
> >> >> >>>>>><rguenther@suse.de>:
> >> >> >>>>>>>> >> >> > > On Thu, 3 Nov 2016, Yuri Rumyantsev wrote:
> >> >> >>>>>>>> >> >> > >
> >> >> >>>>>>>> >> >> > >> Hi Richard,
> >> >> >>>>>>>> >> >> > >>
> >> >> >>>>>>>> >> >> > >> I did not understand your last remark:
> >> >> >>>>>>>> >> >> > >>
> >> >> >>>>>>>> >> >> > >> > That is, here (and avoid the FOR_EACH_LOOP change):
> >> >> >>>>>>>> >> >> > >> >
> >> >> >>>>>>>> >> >> > >> > @@ -580,12 +586,21 @@ vectorize_loops (void)
> >> >> >>>>>>>> >> >> > >> >           && dump_enabled_p ())
> >> >> >>>>>>>> >> >> > >> >           dump_printf_loc (MSG_OPTIMIZED_LOCATIONS,
> >> >> >>>>>>vect_location,
> >> >> >>>>>>>> >> >> > >> >                            "loop vectorized\n");
> >> >> >>>>>>>> >> >> > >> > -       vect_transform_loop (loop_vinfo);
> >> >> >>>>>>>> >> >> > >> > +       new_loop = vect_transform_loop (loop_vinfo);
> >> >> >>>>>>>> >> >> > >> >         num_vectorized_loops++;
> >> >> >>>>>>>> >> >> > >> >        /* Now that the loop has been vectorized, allow
> >> >> >>>>>>it to be unrolled
> >> >> >>>>>>>> >> >> > >> >           etc.  */
> >> >> >>>>>>>> >> >> > >> >      loop->force_vectorize = false;
> >> >> >>>>>>>> >> >> > >> >
> >> >> >>>>>>>> >> >> > >> > +       /* Add new loop to a processing queue.  To make
> >> >> >>>>>>it easier
> >> >> >>>>>>>> >> >> > >> > +          to match loop and its epilogue vectorization
> >> >> >>>>>>in dumps
> >> >> >>>>>>>> >> >> > >> > +          put new loop as the next loop to process.
> >> >> >>>>>>*/
> >> >> >>>>>>>> >> >> > >> > +       if (new_loop)
> >> >> >>>>>>>> >> >> > >> > +         {
> >> >> >>>>>>>> >> >> > >> > +           loops.safe_insert (i + 1, new_loop->num);
> >> >> >>>>>>>> >> >> > >> > +           vect_loops_num = number_of_loops (cfun);
> >> >> >>>>>>>> >> >> > >> > +         }
> >> >> >>>>>>>> >> >> > >> >
> >> >> >>>>>>>> >> >> > >> > simply dispatch to a vectorize_epilogue (loop_vinfo,
> >> >> >>>>>>new_loop)
> >> >> >>>>>>>> >> >> > >> > function which will set up stuff properly (and also
> >> >> >>>>>>perform
> >> >> >>>>>>>> >> >> > >> > the if-conversion of the epilogue there).
> >> >> >>>>>>>> >> >> > >> >
> >> >> >>>>>>>> >> >> > >> > That said, if we can get in non-masked epilogue
> >> >> >>>>>>vectorization
> >> >> >>>>>>>> >> >> > >> > separately that would be great.
> >> >> >>>>>>>> >> >> > >>
> >> >> >>>>>>>> >> >> > >> Could you please clarify your proposal.
> >> >> >>>>>>>> >> >> > >
> >> >> >>>>>>>> >> >> > > When a loop was vectorized set things up to immediately
> >> >> >>>>>>vectorize
> >> >> >>>>>>>> >> >> > > its epilogue, avoiding changing the loop iteration and
> >> >> >>>>>>avoiding
> >> >> >>>>>>>> >> >> > > the re-use of ->aux.
> >> >> >>>>>>>> >> >> > >
> >> >> >>>>>>>> >> >> > > Richard.
> >> >> >>>>>>>> >> >> > >
> >> >> >>>>>>>> >> >> > >> Thanks.
> >> >> >>>>>>>> >> >> > >> Yuri.
> >> >> >>>>>>>> >> >> > >>
> >> >> >>>>>>>> >> >> > >> 2016-11-02 15:27 GMT+03:00 Richard Biener
> >> >> >>>>>><rguenther@suse.de>:
> >> >> >>>>>>>> >> >> > >> > On Tue, 1 Nov 2016, Yuri Rumyantsev wrote:
> >> >> >>>>>>>> >> >> > >> >
> >> >> >>>>>>>> >> >> > >> >> Hi All,
> >> >> >>>>>>>> >> >> > >> >>
> >> >> >>>>>>>> >> >> > >> >> I re-send all patches sent by Ilya earlier for review
> >> >> >>>>>>which support
> >> >> >>>>>>>> >> >> > >> >> vectorization of loop epilogues and loops with low
> >> >> >>>>>>trip count. We
> >> >> >>>>>>>> >> >> > >> >> assume that the only patch -
> >> >> >>>>>>vec-tails-07-combine-tail.patch - was not
> >> >> >>>>>>>> >> >> > >> >> approved by Jeff.
> >> >> >>>>>>>> >> >> > >> >>
> >> >> >>>>>>>> >> >> > >> >> I did re-base of all patches and performed
> >> >> >>>>>>bootstrapping and
> >> >> >>>>>>>> >> >> > >> >> regression testing that did not show any new failures.
> >> >> >>>>>>Also all
> >> >> >>>>>>>> >> >> > >> >> changes related to new vect_do_peeling algorithm have
> >> >> >>>>>>been changed
> >> >> >>>>>>>> >> >> > >> >> accordingly.
> >> >> >>>>>>>> >> >> > >> >>
> >> >> >>>>>>>> >> >> > >> >> Is it OK for trunk?
> >> >> >>>>>>>> >> >> > >> >
> >> >> >>>>>>>> >> >> > >> > I would have preferred that the series up to
> >> >> >>>>>>-03-nomask-tails would
> >> >> >>>>>>>> >> >> > >> > _only_ contain epilogue loop vectorization changes but
> >> >> >>>>>>unfortunately
> >> >> >>>>>>>> >> >> > >> > the patchset is oddly separated.
> >> >> >>>>>>>> >> >> > >> >
> >> >> >>>>>>>> >> >> > >> > I have a comment on that part nevertheless:
> >> >> >>>>>>>> >> >> > >> >
> >> >> >>>>>>>> >> >> > >> > @@ -1608,7 +1614,10 @@ vect_enhance_data_refs_alignment
> >> >> >>>>>>(loop_vec_info
> >> >> >>>>>>>> >> >> > >> > loop_vinfo)
> >> >> >>>>>>>> >> >> > >> >    /* Check if we can possibly peel the loop.  */
> >> >> >>>>>>>> >> >> > >> >    if (!vect_can_advance_ivs_p (loop_vinfo)
> >> >> >>>>>>>> >> >> > >> >        || !slpeel_can_duplicate_loop_p (loop,
> >> >> >>>>>>single_exit (loop))
> >> >> >>>>>>>> >> >> > >> > -      || loop->inner)
> >> >> >>>>>>>> >> >> > >> > +      || loop->inner
> >> >> >>>>>>>> >> >> > >> > +      /* Required peeling was performed in prologue
> >> >> >>>>>>and
> >> >> >>>>>>>> >> >> > >> > +        is not required for epilogue.  */
> >> >> >>>>>>>> >> >> > >> > +      || LOOP_VINFO_EPILOGUE_P (loop_vinfo))
> >> >> >>>>>>>> >> >> > >> >      do_peeling = false;
> >> >> >>>>>>>> >> >> > >> >
> >> >> >>>>>>>> >> >> > >> >    if (do_peeling
> >> >> >>>>>>>> >> >> > >> > @@ -1888,7 +1897,10 @@ vect_enhance_data_refs_alignment
> >> >> >>>>>>(loop_vec_info
> >> >> >>>>>>>> >> >> > >> > loop_vinfo)
> >> >> >>>>>>>> >> >> > >> >
> >> >> >>>>>>>> >> >> > >> >    do_versioning =
> >> >> >>>>>>>> >> >> > >> >         optimize_loop_nest_for_speed_p (loop)
> >> >> >>>>>>>> >> >> > >> > -       && (!loop->inner); /* FORNOW */
> >> >> >>>>>>>> >> >> > >> > +       && (!loop->inner) /* FORNOW */
> >> >> >>>>>>>> >> >> > >> > +        /* Required versioning was performed for the
> >> >> >>>>>>>> >> >> > >> > +          original loop and is not required for
> >> >> >>>>>>epilogue.  */
> >> >> >>>>>>>> >> >> > >> > +       && !LOOP_VINFO_EPILOGUE_P (loop_vinfo);
> >> >> >>>>>>>> >> >> > >> >
> >> >> >>>>>>>> >> >> > >> >    if (do_versioning)
> >> >> >>>>>>>> >> >> > >> >      {
> >> >> >>>>>>>> >> >> > >> >
> >> >> >>>>>>>> >> >> > >> > please do that check in the single caller of this
> >> >> >>>>>>function.
> >> >> >>>>>>>> >> >> > >> >
> >> >> >>>>>>>> >> >> > >> > Otherwise I still dislike the new ->aux use and I
> >> >> >>>>>>believe that simply
> >> >> >>>>>>>> >> >> > >> > passing down info from the processed parent would be
> >> >> >>>>>>_much_ cleaner.
> >> >> >>>>>>>> >> >> > >> > That is, here (and avoid the FOR_EACH_LOOP change):
> >> >> >>>>>>>> >> >> > >> >
> >> >> >>>>>>>> >> >> > >> > @@ -580,12 +586,21 @@ vectorize_loops (void)
> >> >> >>>>>>>> >> >> > >> >             && dump_enabled_p ())
> >> >> >>>>>>>> >> >> > >> >            dump_printf_loc (MSG_OPTIMIZED_LOCATIONS,
> >> >> >>>>>>vect_location,
> >> >> >>>>>>>> >> >> > >> >                             "loop vectorized\n");
> >> >> >>>>>>>> >> >> > >> > -       vect_transform_loop (loop_vinfo);
> >> >> >>>>>>>> >> >> > >> > +       new_loop = vect_transform_loop (loop_vinfo);
> >> >> >>>>>>>> >> >> > >> >         num_vectorized_loops++;
> >> >> >>>>>>>> >> >> > >> >         /* Now that the loop has been vectorized, allow
> >> >> >>>>>>it to be unrolled
> >> >> >>>>>>>> >> >> > >> >            etc.  */
> >> >> >>>>>>>> >> >> > >> >         loop->force_vectorize = false;
> >> >> >>>>>>>> >> >> > >> >
> >> >> >>>>>>>> >> >> > >> > +       /* Add new loop to a processing queue.  To make
> >> >> >>>>>>it easier
> >> >> >>>>>>>> >> >> > >> > +          to match loop and its epilogue vectorization
> >> >> >>>>>>in dumps
> >> >> >>>>>>>> >> >> > >> > +          put new loop as the next loop to process.
> >> >> >>>>>>*/
> >> >> >>>>>>>> >> >> > >> > +       if (new_loop)
> >> >> >>>>>>>> >> >> > >> > +         {
> >> >> >>>>>>>> >> >> > >> > +           loops.safe_insert (i + 1, new_loop->num);
> >> >> >>>>>>>> >> >> > >> > +           vect_loops_num = number_of_loops (cfun);
> >> >> >>>>>>>> >> >> > >> > +         }
> >> >> >>>>>>>> >> >> > >> >
> >> >> >>>>>>>> >> >> > >> > simply dispatch to a vectorize_epilogue (loop_vinfo,
> >> >> >>>>>>new_loop)
> >> >> >>>>>>>> >> >> > >> > function which will set up stuff properly (and also
> >> >> >>>>>>perform
> >> >> >>>>>>>> >> >> > >> > the if-conversion of the epilogue there).
> >> >> >>>>>>>> >> >> > >> >
> >> >> >>>>>>>> >> >> > >> > That said, if we can get in non-masked epilogue
> >> >> >>>>>>vectorization
> >> >> >>>>>>>> >> >> > >> > separately that would be great.
> >> >> >>>>>>>> >> >> > >> >
> >> >> >>>>>>>> >> >> > >> > I'm still torn about all the rest of the stuff and
> >> >> >>>>>>question its
> >> >> >>>>>>>> >> >> > >> > usability (esp. merging the epilogue with the main
> >> >> >>>>>>vector loop).
> >> >> >>>>>>>> >> >> > >> > But it has already been approved ... oh well.
> >> >> >>>>>>>> >> >> > >> >
> >> >> >>>>>>>> >> >> > >> > Thanks,
> >> >> >>>>>>>> >> >> > >> > Richard.
> >> >> >>>>>>>> >> >> > >>
> >> >> >>>>>>>> >> >> > >>
> >> >> >>>>>>>> >> >> > >
> >> >> >>>>>>>> >> >> > > --
> >> >> >>>>>>>> >> >> > > Richard Biener <rguenther@suse.de>
> >> >> >>>>>>>> >> >> > > SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard,
> >> >> >>>>>>Graham Norton, HRB 21284 (AG Nuernberg)
> >> >> >>>>>>>> >> >> >
> >> >> >>>>>>>> >> >>
> >> >> >>>>>>>> >> >>
> >> >> >>>>>>>> >> >
> >> >> >>>>>>>> >> > --
> >> >> >>>>>>>> >> > Richard Biener <rguenther@suse.de>
> >> >> >>>>>>>> >> > SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham
> >> >> >>>>>>Norton, HRB 21284 (AG Nuernberg)
> >> >> >>>>>>>> >>
> >> >> >>>>>>>> >
> >> >> >>>>>>>> > --
> >> >> >>>>>>>> > Richard Biener <rguenther@suse.de>
> >> >> >>>>>>>> > SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham
> >> >> >>>>>>Norton, HRB 21284 (AG Nuernberg)
> >> >> >>>>>>>>
> >> >> >>>>>>>
> >> >> >>>>>>> --
> >> >> >>>>>>> Richard Biener <rguenther@suse.de>
> >> >> >>>>>>> SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham
> >> >> >>>>>>Norton, HRB 21284 (AG Nuernberg)
> >> >> >>>>>
> >> >> >>>>>
> >> >>
> >> >
> >> > --
> >> > Richard Biener <rguenther@suse.de>
> >> > SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)
> >>
> >
> > --
> > Richard Biener <rguenther@suse.de>
> > SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)
> 
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH, vec-tails] Support loop epilogue vectorization
       [not found]                                                         ` <alpine.LSU.2.11.1612131455080.5294@t29.fhfr.qr>
@ 2016-12-21 10:14                                                           ` Yuri Rumyantsev
  2016-12-21 17:23                                                             ` Yuri Rumyantsev
  0 siblings, 1 reply; 38+ messages in thread
From: Yuri Rumyantsev @ 2016-12-21 10:14 UTC (permalink / raw)
  To: Richard Biener, gcc-patches, Pavel Chupin

[-- Attachment #1: Type: text/plain, Size: 58864 bytes --]

Hi Richard,

I accidentally found a bug in my patch for epilogue vectorization
without masking: the label needs to be put before the
initialization of loop_vectorized_call.

Could you please review it and integrate it into trunk? A test case is also attached.


Thanks in advance.
Yuri.

ChangeLog:
2016-12-21  Yuri Rumyantsev  <ysrumyan@gmail.com>

* tree-vectorizer.c (vectorize_loops): Put label before initialization
of loop_vectorized_call.

gcc/testsuite/

* gcc.dg/vect/vect-tail-nomask-2.c: New test.
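The general shape of this kind of bug can be sketched in plain C. This is a minimal illustration under assumptions, not the actual vectorize_loops code: `lookup_call` and both functions are hypothetical stand-ins. When a per-loop value is initialized before a label and control later jumps back to that label for the epilogue, the value computed for the original loop is silently reused.

```c
#include <assert.h>

/* Hypothetical stand-in for a per-loop lookup done before processing.  */
static int lookup_call(int loop_id)
{
  assert(loop_id >= 0);
  return loop_id * 10;
}

/* Label after the initialization: the retry for the epilogue reuses
   the value that was computed for the original loop.  */
int last_call_stale(int loop_id, int epilogue_id)
{
  int call = lookup_call(loop_id);
retry:
  if (loop_id != epilogue_id) {
    loop_id = epilogue_id;  /* switch to the epilogue and process it */
    goto retry;
  }
  return call;
}

/* Label before the initialization: the value is recomputed each time
   control reaches the label.  */
int last_call_fresh(int loop_id, int epilogue_id)
{
  int call;
retry:
  call = lookup_call(loop_id);
  if (loop_id != epilogue_id) {
    loop_id = epilogue_id;
    goto retry;
  }
  return call;
}
```

Calling both with an original loop id of 1 and an epilogue id of 2 returns the value for loop 1 from the first function and the value for loop 2 from the second.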

2016-12-13 16:59 GMT+03:00 Richard Biener <rguenther@suse.de>:
> On Mon, 12 Dec 2016, Yuri Rumyantsev wrote:
>
>> Richard,
>>
>> Could you please review the cost model patch before I include it in
>> the epilogue masking patch and add the masking cost estimation you
>> requested?
>
> That's just the middle-end / target changes.  I was not 100% happy
> with them but well, the vectorizer cost modeling needs work
> (aka another rewrite).
>
> From below...
>
>> Thanks.
>>
>> Patch and ChangeLog are attached.
>>
>> 2016-12-12 15:47 GMT+03:00 Yuri Rumyantsev <ysrumyan@gmail.com>:
>> > Hi Richard,
>> >
>> > You asked me about performance of spec2006 on AVX2 machine with new feature.
>> >
>> > I tried the following on Haswell using original patch designed by Ilya.
>> > 1. Masking low trip count loops: only 6 benchmarks are affected and
>> > performance is almost the same
>> > 464.h264ref     63.9000    64.0000 +0.15%
>> > 416.gamess      42.9000    42.9000 +0%
>> > 435.gromacs     32.8000    32.7000 -0.30%
>> > 447.dealII      68.5000    68.3000 -0.29%
>> > 453.povray      61.9000    62.1000 +0.32%
>> > 454.calculix    39.8000    39.8000 +0%
>> > 465.tonto       29.9000    29.9000 +0%
>> >
>> > 2. epilogue vectorization without masking (using a smaller vf) (3 benchmarks
>> > are not affected)
>> > 400.perlbench     47.2000    46.5000 -1.48%
>> > 401.bzip2         29.9000    29.9000 +0%
>> > 403.gcc           41.8000    41.6000 -0.47%
>> > 456.hmmer         32.0000    32.0000 +0%
>> > 462.libquantum    81.5000    82.0000 +0.61%
>> > 464.h264ref       65.0000    65.5000 +0.76%
>> > 471.omnetpp       27.8000    28.2000 +1.43%
>> > 473.astar         28.7000    28.6000 -0.34%
>> > 483.xalancbmk     48.7000    48.6000 -0.20%
>> > 410.bwaves        95.3000    95.3000 +0%
>> > 416.gamess        42.9000    42.8000 -0.23%
>> > 433.milc          38.8000    38.8000 +0%
>> > 434.zeusmp        51.7000    51.4000 -0.58%
>> > 435.gromacs       32.8000    32.8000 +0%
>> > 436.cactusADM     85.0000    83.0000 -2.35%
>> > 437.leslie3d      55.5000    55.5000 +0%
>> > 444.namd          31.3000    31.3000 +0%
>> > 447.dealII        68.7000    68.9000 +0.29%
>> > 450.soplex        47.3000    47.4000 +0.21%
>> > 453.povray        62.1000    61.4000 -1.12%
>> > 454.calculix      39.7000    39.3000 -1.00%
>> > 459.GemsFDTD      44.9000    45.0000 +0.22%
>> > 465.tonto         29.8000    29.8000 +0%
>> > 481.wrf           51.0000    51.2000 +0.39%
>> > 482.sphinx3       69.8000    71.2000 +2.00%
>
> I see 471.omnetpp and 482.sphinx3 are in a similar ballpark and it
> would be nice to catch the relevant case(s) with a cost model for
> epilogue vectorization without masking first (to get rid of
> --param vect-epilogues-nomask).
>
> As said elsewhere any non-conservative cost modeling (if the
> number of scalar iterations is not statically constant) might
> require versioning of the loop into a non-vectorized,
> short-trip vectorized and regular vectorized case (the Intel
> compiler does way more aggressive versioning IIRC).
>
> Richard.
>
>> > 3. epilogue vectorization using masking (4 benchmarks are not affected):
>> > 400.perlbench     47.5000    46.8000 -1.47%
>> > 401.bzip2         30.0000    29.9000 -0.33%
>> > 403.gcc           42.3000    42.3000 +0%
>> > 445.gobmk         32.1000    32.8000 +2.18%
>> > 456.hmmer         32.0000    32.0000 +0%
>> > 458.sjeng         36.1000    35.5000 -1.66%
>> > 462.libquantum    81.1000    81.1000 +0%
>> > 464.h264ref       65.4000    65.0000 -0.61%
>> > 483.xalancbmk     49.4000    49.3000 -0.20%
>> > 410.bwaves        95.9000    95.5000 -0.41%
>> > 416.gamess        42.8000    42.6000 -0.46%
>> > 433.milc          38.8000    39.1000 +0.77%
>> > 434.zeusmp        52.1000    51.3000 -1.53%
>> > 435.gromacs       32.9000    32.9000 +0%
>> > 436.cactusADM     78.8000    85.3000 +8.24%
>> > 437.leslie3d      55.4000    55.4000 +0%
>> > 444.namd          31.3000    31.3000 +0%
>> > 447.dealII        69.0000    69.2000 +0.28%
>> > 450.soplex        47.7000    47.6000 -0.20%
>> > 453.povray        62.2000    61.7000 -0.80%
>> > 454.calculix      39.7000    38.2000 -3.77%
>> > 459.GemsFDTD      44.9000    45.0000 +0.22%
>> > 465.tonto         29.8000    29.9000 +0.33%
>> > 481.wrf           51.2000    51.6000 +0.78%
>> > 482.sphinx3       70.3000    65.4000 -6.97%
>> >
>> > There is a good speed-up for 436 but there is a significant slowdown on 482 and 454.
>> >
>> > So in general we don't have any advantages for AVX2.
>> >
>> > Best regards.
>> > Yuri.
>> >
>> > P.S.
>> > I am not able to provide you with AVX512 numbers because I don't have
>> > access to such a machine.
>> > An updated patch will be sent later.
>> >
>> > Best regards.
>> > Yuri.
>> >
>> >
>> > 2016-12-05 15:44 GMT+03:00 Yuri Rumyantsev <ysrumyan@gmail.com>:
>> >> Richard,
>> >>
>> >> Sorry, I sent you the wrong assembly produced for the loop with low trip
>> >> count; here is the correct one:
>> >>
>> >> vmovdqa .LC0(%rip), %ymm0
>> >> vpmaskmovd b(%rip), %ymm0, %ymm1
>> >> vpmaskmovd c(%rip), %ymm0, %ymm2
>> >> vpaddd %ymm2, %ymm1, %ymm1
>> >> vpmaskmovd %ymm1, %ymm0, a(%rip)
>> >>
>> >> where .LC0 is a vector with all elements equal to -1 except for the last.
>> >>
>> >> Note also that additional option is required --param
>> >> vect-short-loops=1 to do such conversion.
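The effect of the .LC0 mask and the vpmaskmovd sequence above can be emulated in scalar C. This is only a sketch: `VF`, `build_lane_mask` and `masked_add` are hypothetical names chosen for illustration and do not appear in the patch. It assumes 8 lanes of 32-bit integers, as in a 256-bit AVX2 register.

```c
#include <assert.h>
#include <stdint.h>

#define VF 8  /* assumed: 8 x 32-bit lanes in a 256-bit vector */

/* Fill MASK so that the first NITERS lanes are all-ones (-1) and the
   remaining lanes are zero.  For NITERS == 7 this matches the .LC0
   constant described above: every element is -1 except the last.  */
static void
build_lane_mask (int32_t mask[VF], int niters)
{
  assert (niters >= 0 && niters <= VF);
  for (int lane = 0; lane < VF; lane++)
    mask[lane] = lane < niters ? -1 : 0;
}

/* Scalar emulation of masked load + add + masked store: lanes whose
   mask element is zero neither read B/C nor write A.  */
static void
masked_add (int32_t *a, const int32_t *b, const int32_t *c,
            const int32_t mask[VF])
{
  for (int lane = 0; lane < VF; lane++)
    if (mask[lane])
      a[lane] = b[lane] + c[lane];
}
```

With a trip count of 7, a single masked vector iteration covers the whole loop and the eighth element of `a` is left untouched.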
>> >>
>> >> Best regards.
>> >> Yuri.
>> >>
>> >> 2016-12-02 18:59 GMT+03:00 Yuri Rumyantsev <ysrumyan@gmail.com>:
>> >>> Richard,
>> >>>
>> >>> Important clarification: the test I sent you with low trip count is
>> >>> vectorized through masking only under
>> >>> --param vect-epilogues-combine=1 -fvect-epilogue-cost-model=unlimited
>> >>> for avx2. The last option is required for avx2 since a masked store has
>> >>> a big cost in comparison with a masked load.
>> >>>
>> >>> Below is the assembly produced for it:
>> >>> vpcmpeqd %xmm0, %xmm0, %xmm0
>> >>> vpmaskmovd b(%rip), %xmm0, %xmm1
>> >>> vpmaskmovd c(%rip), %xmm0, %xmm2
>> >>> vpaddd %xmm2, %xmm1, %xmm1
>> >>> vpmaskmovd %xmm1, %xmm0, a(%rip)
>> >>> ret
>> >>>
>> >>> Thanks.
>> >>> Yuri.
>> >>>
>> >>>
>> >>> 2016-12-02 18:49 GMT+03:00 Yuri Rumyantsev <ysrumyan@gmail.com>:
>> >>>> Richard,
>> >>>>
>> >>>> I also have a question about low trip count loops.
>> >>>> Did you mean that
>> >>>> int a[128], b[128], c[128];
>> >>>>
>> >>>> void foo ()
>> >>>> {
>> >>>>   int i;
>> >>>>   for (i = 0; i<7; i++)
>> >>>>     a[i] = b[i] + c[i];
>> >>>> }
>> >>>>
>> >>>> must be vectorized with masking without epilogue creation (e.g. for avx2)?
>> >>>>
>> >>>> Currently it is vectorized with vector size 128. I also noticed that
>> >>>> Ilya's original patch does nothing for such masking.
>> >>>>
>> >>>> Thanks.
>> >>>> Yuri.
>> >>>>
>> >>>> 2016-12-02 17:08 GMT+03:00 Yuri Rumyantsev <ysrumyan@gmail.com>:
>> >>>>> Richard,
>> >>>>>
>> >>>>> You wrote:
>> >>>>> I don't see _any_ cost model for vectorizing the epilogue with
>> >>>>> masking?  Am I missing something?  A "trivial" cost model
>> >>>>> should at least consider the additional IV(s), the mask
>> >>>>> compute and the widening and narrowing ops required.
>> >>>>>
>> >>>>> I skipped all changes related to the cost model, assuming that one of the
>> >>>>> next patches will contain all cost model changes.
>> >>>>>
>> >>>>> Should I include them in this patch?
>> >>>>>
>> >>>>> Thanks.
>> >>>>> Yuri.
>> >>>>>
>> >>>>> 2016-12-01 17:45 GMT+03:00 Richard Biener <rguenther@suse.de>:
>> >>>>>> On Thu, 1 Dec 2016, Yuri Rumyantsev wrote:
>> >>>>>>
>> >>>>>>> Thanks Richard for your comments.
>> >>>>>>>
>> >>>>>>> You asked me about possible performance improvements for AVX2 machines
>> >>>>>>> - we did not see any visible speed-up for spec2k with any method of
>> >>>>>>
>> >>>>>> Spec 2000?  Can you check with SPEC 2006 or CPUv6?
>> >>>>>>
>> >>>>>> Did you see performance degradation?  What about compile-time and
>> >>>>>> binary size effects?
>> >>>>>>
>> >>>>>>> masking, including epilogue masking and combining, only on AVX512
>> >>>>>>> machine aka knl.
>> >>>>>>
>> >>>>>> I see.
>> >>>>>>
>> >>>>>> Note that as said in the initial review patch the cost model I
>> >>>>>> saw therein looked flawed.  In the end I'd expect a sensible
>> >>>>>> approach would be to do
>> >>>>>>
>> >>>>>>  if (n < scalar-most-profitable-niter)
>> >>>>>>    {
>> >>>>>>      no vectorization
>> >>>>>>    }
>> >>>>>>  else if (n < masking-more-profitable-than-not-masking-plus-epilogue)
>> >>>>>>    {
>> >>>>>>      do masked vectorization
>> >>>>>>    }
>> >>>>>>  else
>> >>>>>>    {
>> >>>>>>      do unmasked vectorization (with epilogue, eventually vectorized)
>> >>>>>>    }
>> >>>>>>
>> >>>>>> where for short trip loops the else path would never be taken
>> >>>>>> (statically).
>> >>>>>>
>> >>>>>> And yes, that means masking will only be useful for short-trip loops
>> >>>>>> which in the end means an overall performance benefit is unlikely
>> >>>>>> unless we have a lot of short-trip loops that are slow because of
>> >>>>>> the overhead of main unmasked loop plus epilogue.
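The three-way decision outlined above can be written as a small C function. This is a sketch only: `choose_strategy`, `min_vect_niters` and `masked_limit` are hypothetical names, and the two thresholds stand in for break-even points that a real cost model would have to compute per loop.

```c
#include <assert.h>

enum vect_strategy { VECT_NONE, VECT_MASKED, VECT_UNMASKED };

/* Pick a strategy for a loop with N scalar iterations.
   MIN_VECT_NITERS: below this, the scalar loop is assumed cheapest.
   MASKED_LIMIT: below this, masked vectorization is assumed to beat
   an unmasked main loop plus epilogue.  Both are hypothetical inputs.  */
static enum vect_strategy
choose_strategy (unsigned long n, unsigned long min_vect_niters,
                 unsigned long masked_limit)
{
  assert (min_vect_niters <= masked_limit);
  if (n < min_vect_niters)
    return VECT_NONE;
  else if (n < masked_limit)
    return VECT_MASKED;
  else
    return VECT_UNMASKED;
}
```

When `n` is a compile-time constant only one branch survives, which is the "statically" case mentioned above; otherwise each branch would correspond to a separate version of the loop.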
>> >>>>>>
>> >>>>>> Richard.
>> >>>>>>
>> >>>>>>> I will answer your question later.
>> >>>>>>>
>> >>>>>>> Best regards.
>> >>>>>>> Yuri
>> >>>>>>>
>> >>>>>>> 2016-12-01 14:33 GMT+03:00 Richard Biener <rguenther@suse.de>:
>> >>>>>>> > On Mon, 28 Nov 2016, Yuri Rumyantsev wrote:
>> >>>>>>> >
>> >>>>>>> >> Richard!
>> >>>>>>> >>
>> >>>>>>> >> I attached the vect dump for the part of the attached test-case which
>> >>>>>>> >> illustrates how vectorization of epilogues works through masking:
>> >>>>>>> >> #define SIZE 1023
>> >>>>>>> >> #define ALIGN 64
>> >>>>>>> >>
>> >>>>>>> >> extern int posix_memalign(void **memptr, __SIZE_TYPE__ alignment,
>> >>>>>>> >> __SIZE_TYPE__ size) __attribute__((weak));
>> >>>>>>> >> extern void free (void *);
>> >>>>>>> >>
>> >>>>>>> >> void __attribute__((noinline))
>> >>>>>>> >> test_citer (int * __restrict__ a,
>> >>>>>>> >>    int * __restrict__ b,
>> >>>>>>> >>    int * __restrict__ c)
>> >>>>>>> >> {
>> >>>>>>> >>   int i;
>> >>>>>>> >>
>> >>>>>>> >>   a = (int *)__builtin_assume_aligned (a, ALIGN);
>> >>>>>>> >>   b = (int *)__builtin_assume_aligned (b, ALIGN);
>> >>>>>>> >>   c = (int *)__builtin_assume_aligned (c, ALIGN);
>> >>>>>>> >>
>> >>>>>>> >>   for (i = 0; i < SIZE; i++)
>> >>>>>>> >>     c[i] = a[i] + b[i];
>> >>>>>>> >> }
>> >>>>>>> >>
>> >>>>>>> >> It was compiled with -mavx2 --param vect-epilogues-mask=1 options.
>> >>>>>>> >>
>> >>>>>>> >> I did not include in this patch vectorization of low trip-count loops
>> >>>>>>> >> since in the original patch additional parameter was introduced:
>> >>>>>>> >> +DEFPARAM (PARAM_VECT_SHORT_LOOPS,
>> >>>>>>> >> +  "vect-short-loops",
>> >>>>>>> >> +  "Enable vectorization of low trip count loops using masking.",
>> >>>>>>> >> +  0, 0, 1)
>> >>>>>>> >>
>> >>>>>>> >> I assume that this ability can be included very quickly but it
>> >>>>>>> >> requires cost model enhancements also.
>> >>>>>>> >
>> >>>>>>> > Comments on the patch itself (as I'm having a closer look again,
>> >>>>>>> > I know how it vectorizes the above but I wondered why epilogue
>> >>>>>>> > and short-trip loops are not basically the same code path).
>> >>>>>>> >
>> >>>>>>> > Btw, I don't like that the features are behind a --param paywall.
>> >>>>>>> > That just means a) nobody will use it, b) it will bit-rot quickly,
>> >>>>>>> > c) bugs are well-hidden.
>> >>>>>>> >
>> >>>>>>> > +  if (loop_vinfo && LOOP_VINFO_CAN_BE_MASKED (loop_vinfo)
>> >>>>>>> > +      && integer_zerop (nested_in_vect_loop
>> >>>>>>> > +                       ? STMT_VINFO_DR_STEP (stmt_info)
>> >>>>>>> > +                       : DR_STEP (dr)))
>> >>>>>>> > +    {
>> >>>>>>> > +      if (dump_enabled_p ())
>> >>>>>>> > +       dump_printf_loc (MSG_NOTE, vect_location,
>> >>>>>>> > +                        "allow invariant load for masked loop.\n");
>> >>>>>>> > +    }
>> >>>>>>> >
>> >>>>>>> > this can test memory_access_type == VMAT_INVARIANT.  Please put
>> >>>>>>> > all the checks in a common
>> >>>>>>> >
>> >>>>>>> >   if (loop_vinfo && LOOP_VINFO_CAN_BE_MASKED (loop_vinfo))
>> >>>>>>> >     {
>> >>>>>>> >        if (memory_access_type == VMAT_INVARIANT)
>> >>>>>>> >          {
>> >>>>>>> >          }
>> >>>>>>> >        else if (...)
>> >>>>>>> >          {
>> >>>>>>> >             LOOP_VINFO_CAN_BE_MASKED (loop_vinfo) = false;
>> >>>>>>> >          }
>> >>>>>>> >        else if (..)
>> >>>>>>> > ...
>> >>>>>>> >     }
>> >>>>>>> >
>> >>>>>>> > @@ -6667,6 +6756,15 @@ vectorizable_load (gimple *stmt,
>> >>>>>>> > gimple_stmt_iterator *gsi, gimple **vec_stmt,
>> >>>>>>> >        gcc_assert (!nested_in_vect_loop);
>> >>>>>>> >        gcc_assert (!STMT_VINFO_GATHER_SCATTER_P (stmt_info));
>> >>>>>>> >
>> >>>>>>> > +      if (loop_vinfo && LOOP_VINFO_CAN_BE_MASKED (loop_vinfo))
>> >>>>>>> > +       {
>> >>>>>>> > +         if (dump_enabled_p ())
>> >>>>>>> > +           dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
>> >>>>>>> > +                            "cannot be masked: grouped access is not"
>> >>>>>>> > +                            " supported.");
>> >>>>>>> > +         LOOP_VINFO_CAN_BE_MASKED (loop_vinfo) = false;
>> >>>>>>> > +      }
>> >>>>>>> > +
>> >>>>>>> >
>> >>>>>>> > isn't this already handled by the above?  Or rather the general
>> >>>>>>> > disallowance of SLP?
>> >>>>>>> >
>> >>>>>>> > @@ -5730,6 +5792,24 @@ vectorizable_store (gimple *stmt,
>> >>>>>>> > gimple_stmt_iterator *gsi, gimple **vec_stmt,
>> >>>>>>> >                             &memory_access_type, &gs_info))
>> >>>>>>> >      return false;
>> >>>>>>> >
>> >>>>>>> > +  if (loop_vinfo && LOOP_VINFO_CAN_BE_MASKED (loop_vinfo)
>> >>>>>>> > +      && memory_access_type != VMAT_CONTIGUOUS)
>> >>>>>>> > +    {
>> >>>>>>> > +      LOOP_VINFO_CAN_BE_MASKED (loop_vinfo) = false;
>> >>>>>>> > +      if (dump_enabled_p ())
>> >>>>>>> > +       dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
>> >>>>>>> > +                        "cannot be masked: unsupported memory access
>> >>>>>>> > type.\n");
>> >>>>>>> > +    }
>> >>>>>>> > +
>> >>>>>>> > +  if (loop_vinfo && LOOP_VINFO_CAN_BE_MASKED (loop_vinfo)
>> >>>>>>> > +      && !can_mask_load_store (stmt))
>> >>>>>>> > +    {
>> >>>>>>> > +      LOOP_VINFO_CAN_BE_MASKED (loop_vinfo) = false;
>> >>>>>>> > +      if (dump_enabled_p ())
>> >>>>>>> > +       dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
>> >>>>>>> > +                        "cannot be masked: unsupported masked store.\n");
>> >>>>>>> > +    }
>> >>>>>>> > +
>> >>>>>>> >
>> >>>>>>> > likewise please combine the ifs.
>> >>>>>>> >
>> >>>>>>> > @@ -2354,7 +2401,10 @@ vectorizable_mask_load_store (gimple *stmt,
>> >>>>>>> > gimple_stmt_iterator *gsi,
>> >>>>>>> >                                           ptr, vec_mask, vec_rhs);
>> >>>>>>> >           vect_finish_stmt_generation (stmt, new_stmt, gsi);
>> >>>>>>> >           if (i == 0)
>> >>>>>>> > -           STMT_VINFO_VEC_STMT (stmt_info) = *vec_stmt = new_stmt;
>> >>>>>>> > +           {
>> >>>>>>> > +             STMT_VINFO_VEC_STMT (stmt_info) = *vec_stmt = new_stmt;
>> >>>>>>> > +             STMT_VINFO_FIRST_COPY_P (vinfo_for_stmt (new_stmt)) = true;
>> >>>>>>> > +           }
>> >>>>>>> >           else
>> >>>>>>> >             STMT_VINFO_RELATED_STMT (prev_stmt_info) = new_stmt;
>> >>>>>>> >           prev_stmt_info = vinfo_for_stmt (new_stmt);
>> >>>>>>> >
>> >>>>>>> > here you only set the flag, elsewhere you copy DR and VECTYPE as well.
>> >>>>>>> >
>> >>>>>>> > @@ -2113,6 +2146,20 @@ vectorizable_mask_load_store (gimple *stmt,
>> >>>>>>> > gimple_stmt_iterator *gsi,
>> >>>>>>> >                && !useless_type_conversion_p (vectype, rhs_vectype)))
>> >>>>>>> >      return false;
>> >>>>>>> >
>> >>>>>>> > +  if (LOOP_VINFO_CAN_BE_MASKED (loop_vinfo))
>> >>>>>>> > +    {
>> >>>>>>> > +      /* Check that mask conjuction is supported.  */
>> >>>>>>> > +      optab tab;
>> >>>>>>> > +      tab = optab_for_tree_code (BIT_AND_EXPR, vectype, optab_default);
>> >>>>>>> > +      if (!tab || optab_handler (tab, TYPE_MODE (vectype)) ==
>> >>>>>>> > CODE_FOR_nothing)
>> >>>>>>> > +       {
>> >>>>>>> > +         if (dump_enabled_p ())
>> >>>>>>> > +           dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
>> >>>>>>> > +                            "cannot be masked: unsupported mask
>> >>>>>>> > operation\n");
>> >>>>>>> > +         LOOP_VINFO_CAN_BE_MASKED (loop_vinfo) = false;
>> >>>>>>> > +       }
>> >>>>>>> > +    }
>> >>>>>>> >
>> >>>>>>> > does this really test whether we can bit-and the mask?  You are
>> >>>>>>> > using the vector type of the store (which might be V2DF for example),
>> >>>>>>> > also for AVX512 it might be a vector-bool type with integer mode?
>> >>>>>>> > Of course we maybe can simply assume mask conjunction is available
>> >>>>>>> > (I know no ISA where that would be not true).
>> >>>>>>> >
>> >>>>>>> > +/* Return true if STMT can be converted to masked form.  */
>> >>>>>>> > +
>> >>>>>>> > +static bool
>> >>>>>>> > +can_mask_load_store (gimple *stmt)
>> >>>>>>> > +{
>> >>>>>>> > +  stmt_vec_info stmt_info = vinfo_for_stmt (stmt);
>> >>>>>>> > +  tree vectype, mask_vectype;
>> >>>>>>> > +  tree lhs, ref;
>> >>>>>>> > +
>> >>>>>>> > +  if (!stmt_info)
>> >>>>>>> > +    return false;
>> >>>>>>> > +  lhs = gimple_assign_lhs (stmt);
>> >>>>>>> > +  ref = (TREE_CODE (lhs) == SSA_NAME) ? gimple_assign_rhs1 (stmt) : lhs;
>> >>>>>>> > +  if (may_be_nonaddressable_p (ref))
>> >>>>>>> > +    return false;
>> >>>>>>> > +  vectype = STMT_VINFO_VECTYPE (stmt_info);
>> >>>>>>> >
>> >>>>>>> > You probably modeled this after ifcvt_can_use_mask_load_store but I
>> >>>>>>> > don't think checking may_be_nonaddressable_p is necessary (we couldn't
>> >>>>>>> > even vectorize such refs).  stmt_info should never be NULL either.
>> >>>>>>> > With the check removed tree-ssa-loop-ivopts.h should no longer be
>> >>>>>>> > necessary.
>> >>>>>>> >
>> >>>>>>> > +static void
>> >>>>>>> > +vect_mask_load_store_stmt (gimple *stmt, tree vectype, tree mask,
>> >>>>>>> > +                          data_reference *dr, gimple_stmt_iterator *si)
>> >>>>>>> > +{
>> >>>>>>> > ...
>> >>>>>>> > +  addr = force_gimple_operand_gsi (&gsi, build_fold_addr_expr (mem),
>> >>>>>>> > +                                  true, NULL_TREE, true,
>> >>>>>>> > +                                  GSI_SAME_STMT);
>> >>>>>>> > +
>> >>>>>>> > +  align = TYPE_ALIGN_UNIT (vectype);
>> >>>>>>> > +  if (aligned_access_p (dr))
>> >>>>>>> > +    misalign = 0;
>> >>>>>>> > +  else if (DR_MISALIGNMENT (dr) == -1)
>> >>>>>>> > +    {
>> >>>>>>> > +      align = TYPE_ALIGN_UNIT (elem_type);
>> >>>>>>> > +      misalign = 0;
>> >>>>>>> > +    }
>> >>>>>>> > +  else
>> >>>>>>> > +    misalign = DR_MISALIGNMENT (dr);
>> >>>>>>> > +  set_ptr_info_alignment (get_ptr_info (addr), align, misalign);
>> >>>>>>> > +  ptr = build_int_cst (reference_alias_ptr_type (mem),
>> >>>>>>> > +                      misalign ? misalign & -misalign : align);
>> >>>>>>> >
>> >>>>>>> > you should simply use
>> >>>>>>> >
>> >>>>>>> >   align = get_object_alignment (mem) / BITS_PER_UNIT;
>> >>>>>>> >
>> >>>>>>> > here rather than trying to be clever.  Eventually you don't need
>> >>>>>>> > the DR then (see question above).
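For reference, the expression `misalign ? misalign & -misalign : align` in the quoted hunk extracts the largest power of two that divides the misalignment, i.e. the alignment that is still guaranteed. A minimal C sketch of that arithmetic follows; `known_alignment` is a hypothetical helper name and not a GCC function.

```c
#include <assert.h>

/* Return the alignment (in bytes) that is guaranteed for an address
   known to be MISALIGN bytes past an ALIGN-byte boundary.
   ALIGN must be a power of two and MISALIGN must be less than ALIGN.  */
static unsigned
known_alignment (unsigned align, unsigned misalign)
{
  assert (align != 0 && (align & (align - 1)) == 0);
  assert (misalign < align);
  /* x & -x isolates the lowest set bit of x.  */
  return misalign ? (misalign & -misalign) : align;
}
```

For example, an address 12 bytes past a 32-byte boundary is only guaranteed to be 4-byte aligned.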
>> >>>>>>> >
>> >>>>>>> > +    }
>> >>>>>>> > +  gsi_replace (si ? si : &gsi, new_stmt, false);
>> >>>>>>> >
>> >>>>>>> > when you replace the load/store please previously copy VUSE and VDEF
>> >>>>>>> > from the original one (we were nearly clean enough to no longer
>> >>>>>>> > require a virtual operand rewrite after vectorization...)  Thus
>> >>>>>>> >
>> >>>>>>> >   gimple_set_vuse (new_stmt, gimple_vuse (stmt));
>> >>>>>>> >   gimple_set_vdef (new_stmt, gimple_vdef (stmt));
>> >>>>>>> >
>> >>>>>>> > +static void
>> >>>>>>> > +vect_mask_loop (loop_vec_info loop_vinfo)
>> >>>>>>> > +{
>> >>>>>>> > ...
>> >>>>>>> > +  /* Scan all loop statements to convert vector load/store including
>> >>>>>>> > masked
>> >>>>>>> > +     form.  */
>> >>>>>>> > +  for (unsigned i = 0; i < loop->num_nodes; i++)
>> >>>>>>> > +    {
>> >>>>>>> > +      basic_block bb = bbs[i];
>> >>>>>>> > +      for (gimple_stmt_iterator si = gsi_start_bb (bb);
>> >>>>>>> > +          !gsi_end_p (si); gsi_next (&si))
>> >>>>>>> > +       {
>> >>>>>>> > +         gimple *stmt = gsi_stmt (si);
>> >>>>>>> > +         stmt_vec_info stmt_info = NULL;
>> >>>>>>> > +         tree vectype = NULL;
>> >>>>>>> > +         data_reference *dr;
>> >>>>>>> > +
>> >>>>>>> > +         /* Mask load case.  */
>> >>>>>>> > +         if (is_gimple_call (stmt)
>> >>>>>>> > +             && gimple_call_internal_p (stmt)
>> >>>>>>> > +             && gimple_call_internal_fn (stmt) == IFN_MASK_LOAD
>> >>>>>>> > +             && !VECTOR_TYPE_P (TREE_TYPE (gimple_call_arg (stmt, 2))))
>> >>>>>>> > +           {
>> >>>>>>> > ...
>> >>>>>>> > +             /* Skip invariant loads.  */
>> >>>>>>> > +             if (integer_zerop (nested_in_vect_loop_p (loop, stmt)
>> >>>>>>> > +                                ? STMT_VINFO_DR_STEP (stmt_info)
>> >>>>>>> > +                                : DR_STEP (STMT_VINFO_DATA_REF
>> >>>>>>> > (stmt_info))))
>> >>>>>>> > +               continue;
>> >>>>>>> >
>> >>>>>>> > seeing this it would be nice if stmt_info had a flag for whether
>> >>>>>>> > the stmt needs masking (and a flag on whether this is a scalar or a
>> >>>>>>> > vectorized stmt).
>> >>>>>>> >
>> >>>>>>> > +         /* Skip hoisted out statements.  */
>> >>>>>>> > +         if (!flow_bb_inside_loop_p (loop, gimple_bb (stmt)))
>> >>>>>>> > +           continue;
>> >>>>>>> >
>> >>>>>>> > err, you walk stmts in the loop!  Isn't this covered by the above
>> >>>>>>> > skipping of 'invariant loads'?
>> >>>>>>> >
>> >>>>>>> > +static gimple *
>> >>>>>>> > +vect_mask_reduction_stmt (gimple *stmt, tree mask, gimple *prev)
>> >>>>>>> > +{
>> >>>>>>> >
>> >>>>>>> > depending on the reduction operand there are variants that
>> >>>>>>> > could get away w/o the VEC_COND_EXPR, like
>> >>>>>>> >
>> >>>>>>> >   S1': tem_4 = d_3 & MASK;
>> >>>>>>> >   S2': r_1 = r_2 + tem_4;
>> >>>>>>> >
>> >>>>>>> > which works for plus at least.  More generally doing
>> >>>>>>> >
>> >>>>>>> >   S1': tem_4 = VEC_COND_EXPR<MASK, d_3, neutral operand>
>> >>>>>>> >   S2': r_1 = r_2 OP tem_4;
>> >>>>>>> >
>> >>>>>>> > and leaving optimization to & to later opts (& won't work for
>> >>>>>>> > AVX512 mask registers I guess).
>> >>>>>>> >
>> >>>>>>> > Good enough for later enhancement of course.
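The two masked-reduction variants discussed above can be compared with a scalar emulation. This is a sketch under assumptions: masks are per-lane integers that are either -1 (all-ones) or 0, `VF` is 8, and the function names are hypothetical. The select form corresponds to VEC_COND_EXPR with the neutral operand of the reduction; the AND form is only valid for plus with such all-ones/zero masks.

```c
#include <assert.h>
#include <stdint.h>

#define VF 8  /* assumed number of lanes */

/* r = r + (MASK ? d : 0): select form with neutral operand 0.  */
static int32_t
reduce_plus_select (const int32_t d[VF], const int32_t mask[VF])
{
  int32_t r = 0;
  for (int lane = 0; lane < VF; lane++)
    r += mask[lane] ? d[lane] : 0;
  return r;
}

/* r = r + (d & MASK): cheaper form, equivalent for plus when every
   mask lane is either all-ones or zero.  */
static int32_t
reduce_plus_and (const int32_t d[VF], const int32_t mask[VF])
{
  int32_t r = 0;
  for (int lane = 0; lane < VF; lane++)
    {
      assert (mask[lane] == -1 || mask[lane] == 0);
      r += d[lane] & mask[lane];
    }
  return r;
}

/* Max reduction: the neutral operand is INT32_MIN, so the AND trick
   does not apply and the select form is needed.  */
static int32_t
reduce_max_select (const int32_t d[VF], const int32_t mask[VF])
{
  int32_t r = INT32_MIN;
  for (int lane = 0; lane < VF; lane++)
    {
      int32_t tem = mask[lane] ? d[lane] : INT32_MIN;
      if (tem > r)
        r = tem;
    }
  return r;
}
```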
>> >>>>>>> >
>> >>>>>>> > +static void
>> >>>>>>> > +vect_gen_ivs_for_masking (loop_vec_info loop_vinfo, vec<tree> *ivs)
>> >>>>>>> > +{
>> >>>>>>> > ...
>> >>>>>>> >
>> >>>>>>> > isn't it enough to always create a single IV and derive the
>> >>>>>>> > additional copies by IV + i * { elems, elems, elems ... }?
>> >>>>>>> > IVs are expensive -- I'm sure we can optimize the rest of the
>> >>>>>>> > scheme further as well but this one looks obvious to me.
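The single-IV suggestion above can be sketched in scalar C: one induction vector `{base, base+1, ...}` is kept, and the index vector for the i-th vector copy is obtained by adding `i * ELEMS` to every lane before comparing against the iteration count. `ELEMS` and `mask_for_copy` are hypothetical names used only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

#define ELEMS 4  /* assumed number of lanes per vector */

/* Compute the mask for vector copy COPY.  BASE is the first-lane value
   of the single IV in the current vector iteration; NITERS is the
   scalar iteration count.  A lane is active iff its scalar index is
   below NITERS.  */
static void
mask_for_copy (int32_t mask[ELEMS], uint32_t base, unsigned copy,
               uint32_t niters)
{
  /* Guard against wrap-around of the derived indices.  */
  assert (base <= UINT32_MAX - (copy + 1) * ELEMS);
  for (unsigned lane = 0; lane < ELEMS; lane++)
    {
      uint32_t iv = base + lane;         /* the single IV */
      uint32_t idx = iv + copy * ELEMS;  /* IV + copy * {ELEMS, ...} */
      mask[lane] = idx < niters ? -1 : 0;
    }
}
```

With `niters` equal to 7 and two copies per iteration, copy 0 is fully active and copy 1 has its last lane masked off, without a second induction variable in the loop.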
>> >>>>>>> >
>> >>>>>>> > @@ -3225,12 +3508,32 @@ vect_estimate_min_profitable_iters (loop_vec_info
>> >>>>>>> > loop_vinfo,
>> >>>>>>> >    int npeel = LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo);
>> >>>>>>> >    void *target_cost_data = LOOP_VINFO_TARGET_COST_DATA (loop_vinfo);
>> >>>>>>> >
>> >>>>>>> > +  if (LOOP_VINFO_EPILOGUE_P (loop_vinfo))
>> >>>>>>> > +    {
>> >>>>>>> > +      /* Currently we don't produce scalar epilogue version in case
>> >>>>>>> > +        its masked version is provided.  It means we don't need to
>> >>>>>>> > +        compute profitability one more time here.  Just make a
>> >>>>>>> > +        masked loop version.  */
>> >>>>>>> > +      if (LOOP_VINFO_CAN_BE_MASKED (loop_vinfo)
>> >>>>>>> > +         && PARAM_VALUE (PARAM_VECT_EPILOGUES_MASK))
>> >>>>>>> > +       {
>> >>>>>>> > +         dump_printf_loc (MSG_NOTE, vect_location,
>> >>>>>>> > +                          "cost model: mask loop epilogue.\n");
>> >>>>>>> > +         LOOP_VINFO_MASK_LOOP (loop_vinfo) = true;
>> >>>>>>> > +         *ret_min_profitable_niters = 0;
>> >>>>>>> > +         *ret_min_profitable_estimate = 0;
>> >>>>>>> > +         return;
>> >>>>>>> > +       }
>> >>>>>>> > +    }
>> >>>>>>> >    /* Cost model disabled.  */
>> >>>>>>> > -  if (unlimited_cost_model (LOOP_VINFO_LOOP (loop_vinfo)))
>> >>>>>>> > +  else if (unlimited_cost_model (LOOP_VINFO_LOOP (loop_vinfo)))
>> >>>>>>> >      {
>> >>>>>>> >        dump_printf_loc (MSG_NOTE, vect_location, "cost model
>> >>>>>>> > disabled.\n");
>> >>>>>>> >        *ret_min_profitable_niters = 0;
>> >>>>>>> >        *ret_min_profitable_estimate = 0;
>> >>>>>>> > +      if (PARAM_VALUE (PARAM_VECT_EPILOGUES_MASK)
>> >>>>>>> > +         && LOOP_VINFO_CAN_BE_MASKED (loop_vinfo))
>> >>>>>>> > +       LOOP_VINFO_MASK_LOOP (loop_vinfo) = true;
>> >>>>>>> >        return;
>> >>>>>>> >      }
>> >>>>>>> >
>> >>>>>>> > the unlimited_cost_model case should come first?  OTOH masking or
>> >>>>>>> > not is probably not sth covered by 'unlimited' - that is about
>> >>>>>>> > vectorizing or not.  But the above code means that for
>> >>>>>>> > epilogue vectorization w/o masking we ignore unlimited_cost_model ()?
>> >>>>>>> > That doesn't make sense to me.
>> >>>>>>> >
>> >>>>>>> > Plus if this is short-trip or epilogue vectorization and the
>> >>>>>>> > cost model is _not_ unlimited then we don't want to enable
>> >>>>>>> > masking always (if it is possible).  It might be we statically
>> >>>>>>> > know the epilogue executes for at most two iterations for example.
>> >>>>>>> >
>> >>>>>>> > I don't see _any_ cost model for vectorizing the epilogue with
>> >>>>>>> > masking?  Am I missing something?  A "trivial" cost model
>> >>>>>>> > should at least consider the additional IV(s), the mask
>> >>>>>>> > compute and the widening and narrowing ops required.
>> >>>>>>> >
>> >>>>>>> > diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c
>> >>>>>>> > index e13d6a2..36be342 100644
>> >>>>>>> > --- a/gcc/tree-vect-loop-manip.c
>> >>>>>>> > +++ b/gcc/tree-vect-loop-manip.c
>> >>>>>>> > @@ -1635,6 +1635,13 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree
>> >>>>>>> > niters, tree nitersm1,
>> >>>>>>> >    bool epilog_peeling = (LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo)
>> >>>>>>> >                          || LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo));
>> >>>>>>> >
>> >>>>>>> > +  if (LOOP_VINFO_EPILOGUE_P (loop_vinfo))
>> >>>>>>> > +    {
>> >>>>>>> > +      prolog_peeling = false;
>> >>>>>>> > +      if (LOOP_VINFO_MASK_LOOP (loop_vinfo))
>> >>>>>>> > +       epilog_peeling = false;
>> >>>>>>> > +    }
>> >>>>>>> > +
>> >>>>>>> >    if (!prolog_peeling && !epilog_peeling)
>> >>>>>>> >      return NULL;
>> >>>>>>> >
>> >>>>>>> > I think the prolog_peeling was fixed during the epilogue vectorization
>> >>>>>>> > review and should no longer be necessary.  Please add
>> >>>>>>> > a && ! LOOP_VINFO_MASK_LOOP () to the epilog_peeling init instead
>> >>>>>>> > (it should also work for short-trip loop vectorization).
>> >>>>>>> >
>> >>>>>>> > @@ -2022,11 +2291,18 @@ start_over:
>> >>>>>>> >        || (max_niter != -1
>> >>>>>>> >           && (unsigned HOST_WIDE_INT) max_niter < vectorization_factor))
>> >>>>>>> >      {
>> >>>>>>> > -      if (dump_enabled_p ())
>> >>>>>>> > -       dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
>> >>>>>>> > -                        "not vectorized: iteration count smaller than "
>> >>>>>>> > -                        "vectorization factor.\n");
>> >>>>>>> > -      return false;
>> >>>>>>> > +      /* Allow low trip count for loop epilogue we want to mask.  */
>> >>>>>>> > +      if (LOOP_VINFO_EPILOGUE_P (loop_vinfo)
>> >>>>>>> > +         && PARAM_VALUE (PARAM_VECT_EPILOGUES_MASK))
>> >>>>>>> > +       LOOP_VINFO_MASK_LOOP (loop_vinfo) = true;
>> >>>>>>> > +      else
>> >>>>>>> > +       {
>> >>>>>>> > +         if (dump_enabled_p ())
>> >>>>>>> >
>> >>>>>>> > so why do we test only LOOP_VINFO_EPILOGUE_P here?  All the code
>> >>>>>>> > I saw so far would also work for the main loop (but the cost
>> >>>>>>> > model is missing).
>> >>>>>>> >
>> >>>>>>> > I am missing testcases.  There's only a single one but we should
>> >>>>>>> > have cases covering all kinds of mask IV widths and widen/shorten
>> >>>>>>> > masks.
>> >>>>>>> >
>> >>>>>>> > Do you have any numbers on SPEC 2k6 with epilogue vect and/or masking
>> >>>>>>> > enabled for an AVX2 machine?
>> >>>>>>> >
>> >>>>>>> > Oh, and I really dislike the --param paywall.
>> >>>>>>> >
>> >>>>>>> > Thanks,
>> >>>>>>> > Richard.
>> >>>>>>> >
>> >>>>>>> >> Best regards.
>> >>>>>>> >> Yuri.
>> >>>>>>> >>
>> >>>>>>> >>
>> >>>>>>> >> 2016-11-28 17:39 GMT+03:00 Richard Biener <rguenther@suse.de>:
>> >>>>>>> >> > On Thu, 24 Nov 2016, Yuri Rumyantsev wrote:
>> >>>>>>> >> >
>> >>>>>>> >> >> Hi All,
>> >>>>>>> >> >>
>> >>>>>>> >> >> Here is the second patch which supports epilogue vectorization using
>> >>>>>>> >> >> masking without cost model. Currently it is possible
>> >>>>>>> >> >> only with passing parameter "--param vect-epilogues-mask=1".
>> >>>>>>> >> >>
>> >>>>>>> >> >> Bootstrapping and regression testing did not show any new regression.
>> >>>>>>> >> >>
>> >>>>>>> >> >> Any comments will be appreciated.
>> >>>>>>> >> >
>> >>>>>>> >> > Going over the patch the main question is one how it works -- it looks
>> >>>>>>> >> > like the decision whether to vectorize & mask the epilogue is made
>> >>>>>>> >> > when vectorizing the loop that generates the epilogue rather than
>> >>>>>>> >> > in the epilogue vectorization path?
>> >>>>>>> >> >
>> >>>>>>> >> > That is, I'd have expected to see this handling low-trip count loops
>> >>>>>>> >> > by masking?  And thus masking the epilogue simply by it being
>> >>>>>>> >> > low-trip count?
>> >>>>>>> >> >
>> >>>>>>> >> > Richard.
>> >>>>>>> >> >
>> >>>>>>> >> >> ChangeLog:
>> >>>>>>> >> >> 2016-11-24  Yuri Rumyantsev  <ysrumyan@gmail.com>
>> >>>>>>> >> >>
>> >>>>>>> >> >> * params.def (PARAM_VECT_EPILOGUES_MASK): New.
>> >>>>>>> >> >> * tree-vect-data-refs.c (vect_get_new_ssa_name): Support vect_mask_var.
>> >>>>>>> >> >> * tree-vect-loop.c: Include insn-config.h, recog.h and alias.h.
>> >>>>>>> >> >> (new_loop_vec_info): Add zeroing can_be_masked, mask_loop and
>> >>>>>>> >> >> required_mask fields.
>> >>>>>>> >> >> (vect_check_required_masks_widening): New.
>> >>>>>>> >> >> (vect_check_required_masks_narrowing): New.
>> >>>>>>> >> >> (vect_get_masking_iv_elems): New.
>> >>>>>>> >> >> (vect_get_masking_iv_type): New.
>> >>>>>>> >> >> (vect_get_extreme_masks): New.
>> >>>>>>> >> >> (vect_check_required_masks): New.
>> >>>>>>> >> >> (vect_analyze_loop_operations): Call vect_check_required_masks if all
>> >>>>>>> >> >> statements can be masked.
>> >>>>>>> >> >> (vect_analyze_loop_2): Initialize to zero min_scalar_loop_bound.
>> >>>>>>> >> >> Add check that epilogue can be masked with the same vf with issue
>> >>>>>>> >> >> fail notes.  Allow epilogue vectorization through masking of low trip
>> >>>>>>> >> >> loops. Set to true can_be_masked field before loop operation analysis.
>> >>>>>>> >> >> Do not set-up min_scalar_loop_bound for epilogue vectorization through
>> >>>>>>> >> >> masking.  Do not peel for epilogue masking.  Reset can_be_masked
>> >>>>>>> >> >> field before repeat analysis.
>> >>>>>>> >> >> (vect_estimate_min_profitable_iters): Do not compute profitability
>> >>>>>>> >> >> for epilogue masking.  Set up mask_loop field to true if parameter
>> >>>>>>> >> >> PARAM_VECT_EPILOGUES_MASK is non-zero.
>> >>>>>>> >> >> (vectorizable_reduction): Add check that statement can be masked.
>> >>>>>>> >> >> (vectorizable_induction): Do not support masking for induction.
>> >>>>>>> >> >> (vect_gen_ivs_for_masking): New.
>> >>>>>>> >> >> (vect_get_mask_index_for_elems): New.
>> >>>>>>> >> >> (vect_get_mask_index_for_type): New.
>> >>>>>>> >> >> (vect_create_narrowed_masks): New.
>> >>>>>>> >> >> (vect_create_widened_masks): New.
>> >>>>>>> >> >> (vect_gen_loop_masks): New.
>> >>>>>>> >> >> (vect_mask_reduction_stmt): New.
>> >>>>>>> >> >> (vect_mask_mask_load_store_stmt): New.
>> >>>>>>> >> >> (vect_mask_load_store_stmt): New.
>> >>>>>>> >> >> (vect_mask_loop): New.
>> >>>>>>> >> >> (vect_transform_loop): Invoke vect_mask_loop if required.
>> >>>>>>> >> >> Use div_ceil to recompute upper bounds for masked loops.  Issue
>> >>>>>>> >> >> statistics for epilogue vectorization through masking. Do not reduce
>> >>>>>>> >> >> vf for masking epilogue.
>> >>>>>>> >> >> * tree-vect-stmts.c: Include tree-ssa-loop-ivopts.h.
>> >>>>>>> >> >> (can_mask_load_store): New.
>> >>>>>>> >> >> (vectorizable_mask_load_store): Check that mask conjunction is
>> >>>>>>> >> >> supported.  Set-up first_copy_p field of stmt_vinfo.
>> >>>>>>> >> >> (vectorizable_simd_clone_call): Check that simd clone can not be
>> >>>>>>> >> >> masked.
>> >>>>>>> >> >> (vectorizable_store): Check that store can be masked. Mark the first
>> >>>>>>> >> >> copy of generated vector stores and provide it with vectype and the
>> >>>>>>> >> >> original data reference.
>> >>>>>>> >> >> (vectorizable_load): Check that load can be masked.
>> >>>>>>> >> >> (vect_stmt_should_be_masked_for_epilogue): New.
>> >>>>>>> >> >> (vect_add_required_mask_for_stmt): New.
>> >>>>>>> >> >> (vect_analyze_stmt): Add check on unsupported statements for masking
>> >>>>>>> >> >> with printing message.
>> >>>>>>> >> >> * tree-vectorizer.h (struct _loop_vec_info): Add new fields
>> >>>>>>> >> >> can_be_masked, required_masks, mask_loop.
>> >>>>>>> >> >> (LOOP_VINFO_CAN_BE_MASKED): New.
>> >>>>>>> >> >> (LOOP_VINFO_REQUIRED_MASKS): New.
>> >>>>>>> >> >> (LOOP_VINFO_MASK_LOOP): New.
>> >>>>>>> >> >> (struct _stmt_vec_info): Add first_copy_p field.
>> >>>>>>> >> >> (STMT_VINFO_FIRST_COPY_P): New.
>> >>>>>>> >> >>
>> >>>>>>> >> >> gcc/testsuite/
>> >>>>>>> >> >>
>> >>>>>>> >> >> * gcc.dg/vect/vect-tail-mask-1.c: New test.
>> >>>>>>> >> >>
>> >>>>>>> >> >> 2016-11-18 18:54 GMT+03:00 Christophe Lyon <christophe.lyon@linaro.org>:
>> >>>>>>> >> >> > On 18 November 2016 at 16:46, Yuri Rumyantsev <ysrumyan@gmail.com> wrote:
>> >>>>>>> >> >> >> It is very strange that this test failed on arm, since it requires
>> >>>>>>> >> >> >> target avx2 to check vectorizer dumps:
>> >>>>>>> >> >> >>
>> >>>>>>> >> >> >> /* { dg-final { scan-tree-dump-times "LOOP VECTORIZED" 2 "vect" {
>> >>>>>>> >> >> >> target avx2_runtime } } } */
>> >>>>>>> >> >> >> /* { dg-final { scan-tree-dump-times "LOOP EPILOGUE VECTORIZED
>> >>>>>>> >> >> >> \\(VS=16\\)" 2 "vect" { target avx2_runtime } } } */
>> >>>>>>> >> >> >>
>> >>>>>>> >> >> >> Could you please clarify what is the reason of the failure?
>> >>>>>>> >> >> >
>> >>>>>>> >> >> > It's not the scan-dumps that fail, but the execution.
>> >>>>>>> >> >> > The test calls abort() for some reason.
>> >>>>>>> >> >> >
>> >>>>>>> >> >> > It will take me a while to rebuild the test manually in the right
>> >>>>>>> >> >> > debug environment to provide you with more traces.
>> >>>>>>> >> >> >
>> >>>>>>> >> >> >
>> >>>>>>> >> >> >
>> >>>>>>> >> >> >>
>> >>>>>>> >> >> >> Thanks.
>> >>>>>>> >> >> >>
>> >>>>>>> >> >> >> 2016-11-18 16:20 GMT+03:00 Christophe Lyon <christophe.lyon@linaro.org>:
>> >>>>>>> >> >> >>> On 15 November 2016 at 15:41, Yuri Rumyantsev <ysrumyan@gmail.com> wrote:
>> >>>>>>> >> >> >>>> Hi All,
>> >>>>>>> >> >> >>>>
>> >>>>>>> >> >> >>>> Here is the patch for non-masked epilogue vectorization.
>> >>>>>>> >> >> >>>>
>> >>>>>>> >> >> >>>> Bootstrap and regression testing did not show any new failures.
>> >>>>>>> >> >> >>>>
>> >>>>>>> >> >> >>>> Is it OK for trunk?
>> >>>>>>> >> >> >>>>
>> >>>>>>> >> >> >>>> Thanks.
>> >>>>>>> >> >> >>>> Changelog:
>> >>>>>>> >> >> >>>>
>> >>>>>>> >> >> >>>> 2016-11-15  Yuri Rumyantsev  <ysrumyan@gmail.com>
>> >>>>>>> >> >> >>>>
>> >>>>>>> >> >> >>>> * params.def (PARAM_VECT_EPILOGUES_NOMASK): New.
>> >>>>>>> >> >> >>>> * tree-if-conv.c (tree_if_conversion): Make public.
>> >>>>>>> >> >> >>>> * tree-if-conv.h: New file.
>> >>>>>>> >> >> >>>> * tree-vect-data-refs.c (vect_analyze_data_ref_dependences): Avoid
>> >>>>>>> >> >> >>>> dynamic alias checks for epilogues.
>> >>>>>>> >> >> >>>> * tree-vect-loop-manip.c (vect_do_peeling): Return created epilog.
>> >>>>>>> >> >> >>>> * tree-vect-loop.c: Include tree-if-conv.h.
>> >>>>>>> >> >> >>>> (new_loop_vec_info): Add zeroing orig_loop_info field.
>> >>>>>>> >> >> >>>> (vect_analyze_loop_2): Don't try to enhance alignment for epilogues.
>> >>>>>>> >> >> >>>> (vect_analyze_loop): Add argument ORIG_LOOP_INFO which is not NULL
>> >>>>>>> >> >> >>>> if epilogue is vectorized, set up orig_loop_info field of loop_vinfo
>> >>>>>>> >> >> >>>> using passed argument.
>> >>>>>>> >> >> >>>> (vect_transform_loop): Check if created epilogue should be returned
>> >>>>>>> >> >> >>>> for further vectorization with less vf.  If-convert epilogue if
>> >>>>>>> >> >> >>>> required. Print vectorization success for epilogue.
>> >>>>>>> >> >> >>>> * tree-vectorizer.c (vectorize_loops): Add epilogue vectorization
>> >>>>>>> >> >> >>>> if it is required, pass loop_vinfo produced during vectorization of
>> >>>>>>> >> >> >>>> loop body to vect_analyze_loop.
>> >>>>>>> >> >> >>>> * tree-vectorizer.h (struct _loop_vec_info): Add new field
>> >>>>>>> >> >> >>>> orig_loop_info.
>> >>>>>>> >> >> >>>> (LOOP_VINFO_ORIG_LOOP_INFO): New.
>> >>>>>>> >> >> >>>> (LOOP_VINFO_EPILOGUE_P): New.
>> >>>>>>> >> >> >>>> (LOOP_VINFO_ORIG_VECT_FACTOR): New.
>> >>>>>>> >> >> >>>> (vect_do_peeling): Change prototype to return epilogue.
>> >>>>>>> >> >> >>>> (vect_analyze_loop): Add argument of loop_vec_info type.
>> >>>>>>> >> >> >>>> (vect_transform_loop): Return created loop.
>> >>>>>>> >> >> >>>>
>> >>>>>>> >> >> >>>> gcc/testsuite/
>> >>>>>>> >> >> >>>>
>> >>>>>>> >> >> >>>> * lib/target-supports.exp (check_avx2_hw_available): New.
>> >>>>>>> >> >> >>>> (check_effective_target_avx2_runtime): New.
>> >>>>>>> >> >> >>>> * gcc.dg/vect/vect-tail-nomask-1.c: New test.
>> >>>>>>> >> >> >>>>
>> >>>>>>> >> >> >>>
>> >>>>>>> >> >> >>> Hi,
>> >>>>>>> >> >> >>>
>> >>>>>>> >> >> >>> This new test fails on arm-none-eabi (using default cpu/fpu/mode):
>> >>>>>>> >> >> >>>   gcc.dg/vect/vect-tail-nomask-1.c -flto -ffat-lto-objects execution test
>> >>>>>>> >> >> >>>   gcc.dg/vect/vect-tail-nomask-1.c execution test
>> >>>>>>> >> >> >>>
>> >>>>>>> >> >> >>> It does pass on the same target if configured --with-cpu=cortex-a9.
>> >>>>>>> >> >> >>>
>> >>>>>>> >> >> >>> Christophe
>> >>>>>>> >> >> >>>
>> >>>>>>> >> >> >>>
>> >>>>>>> >> >> >>>
>> >>>>>>> >> >> >>>>
>> >>>>>>> >> >> >>>> 2016-11-14 20:04 GMT+03:00 Richard Biener <rguenther@suse.de>:
>> >>>>>>> >> >> >>>>> On November 14, 2016 4:39:40 PM GMT+01:00, Yuri Rumyantsev <ysrumyan@gmail.com> wrote:
>> >>>>>>> >> >> >>>>>>Richard,
>> >>>>>>> >> >> >>>>>>
>> >>>>>>> >> >> >>>>>>I checked one of the tests designed for epilogue vectorization using
>> >>>>>>> >> >> >>>>>>patches 1 - 3 and found out that build compiler performs vectorization
>> >>>>>>> >> >> >>>>>>of epilogues with --param vect-epilogues-nomask=1 passed:
>> >>>>>>> >> >> >>>>>>
>> >>>>>>> >> >> >>>>>>$ gcc -Ofast -mavx2 t1.c -S --param vect-epilogues-nomask=1 -o
>> >>>>>>> >> >> >>>>>>t1.new-nomask.s -fdump-tree-vect-details
>> >>>>>>> >> >> >>>>>>$ grep VECTORIZED -c t1.c.156t.vect
>> >>>>>>> >> >> >>>>>>4
>> >>>>>>> >> >> >>>>>> Without param only 2 loops are vectorized.
>> >>>>>>> >> >> >>>>>>
>> >>>>>>> >> >> >>>>>>Should I simply add a part of tests related to this feature or I must
>> >>>>>>> >> >> >>>>>>delete all not necessary changes also?
>> >>>>>>> >> >> >>>>>
>> >>>>>>> >> >> >>>>> Please remove all not necessary changes.
>> >>>>>>> >> >> >>>>>
>> >>>>>>> >> >> >>>>> Richard.
>> >>>>>>> >> >> >>>>>
>> >>>>>>> >> >> >>>>>>Thanks.
>> >>>>>>> >> >> >>>>>>Yuri.
>> >>>>>>> >> >> >>>>>>
>> >>>>>>> >> >> >>>>>>2016-11-14 16:40 GMT+03:00 Richard Biener <rguenther@suse.de>:
>> >>>>>>> >> >> >>>>>>> On Mon, 14 Nov 2016, Yuri Rumyantsev wrote:
>> >>>>>>> >> >> >>>>>>>
>> >>>>>>> >> >> >>>>>>>> Richard,
>> >>>>>>> >> >> >>>>>>>>
>> >>>>>>> >> >> >>>>>>>> In my previous patch I forgot to remove couple lines related to aux
>> >>>>>>> >> >> >>>>>>field.
>> >>>>>>> >> >> >>>>>>>> Here is the correct updated patch.
>> >>>>>>> >> >> >>>>>>>
>> >>>>>>> >> >> >>>>>>> Yeah, I noticed.  This patch would be ok for trunk (together with
>> >>>>>>> >> >> >>>>>>> necessary parts from 1 and 2) if all not required parts are removed
>> >>>>>>> >> >> >>>>>>> (and you'd add the testcases covering non-masked tail vect).
>> >>>>>>> >> >> >>>>>>>
>> >>>>>>> >> >> >>>>>>> Thus, can you please produce a single complete patch containing only
>> >>>>>>> >> >> >>>>>>> non-masked epilogue vectorization?
>> >>>>>>> >> >> >>>>>>>
>> >>>>>>> >> >> >>>>>>> Thanks,
>> >>>>>>> >> >> >>>>>>> Richard.
>> >>>>>>> >> >> >>>>>>>
>> >>>>>>> >> >> >>>>>>>> Thanks.
>> >>>>>>> >> >> >>>>>>>> Yuri.
>> >>>>>>> >> >> >>>>>>>>
>> >>>>>>> >> >> >>>>>>>> 2016-11-14 15:51 GMT+03:00 Richard Biener <rguenther@suse.de>:
>> >>>>>>> >> >> >>>>>>>> > On Fri, 11 Nov 2016, Yuri Rumyantsev wrote:
>> >>>>>>> >> >> >>>>>>>> >
>> >>>>>>> >> >> >>>>>>>> >> Richard,
>> >>>>>>> >> >> >>>>>>>> >>
>> >>>>>>> >> >> >>>>>>>> >> I prepare updated 3 patch with passing additional argument to
>> >>>>>>> >> >> >>>>>>>> >> vect_analyze_loop as you proposed (untested).
>> >>>>>>> >> >> >>>>>>>> >>
>> >>>>>>> >> >> >>>>>>>> >> You wrote:
>> >>>>>>> >> >> >>>>>>>> >> Btw, I wonder if you can produce a single patch containing just
>> >>>>>>> >> >> >>>>>>>> >> epilogue vectorization, that is combine patches 1-3 but rip out
>> >>>>>>> >> >> >>>>>>>> >> changes only needed by later patches?
>> >>>>>>> >> >> >>>>>>>> >>
>> >>>>>>> >> >> >>>>>>>> >> Did you mean that I exclude all support for vectorization
>> >>>>>>> >> >> >>>>>>epilogues,
>> >>>>>>> >> >> >>>>>>>> >> i.e. exclude from 2-nd patch all non-related changes
>> >>>>>>> >> >> >>>>>>>> >> like
>> >>>>>>> >> >> >>>>>>>> >>
>> >>>>>>> >> >> >>>>>>>> >> diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
>> >>>>>>> >> >> >>>>>>>> >> index 11863af..32011c1 100644
>> >>>>>>> >> >> >>>>>>>> >> --- a/gcc/tree-vect-loop.c
>> >>>>>>> >> >> >>>>>>>> >> +++ b/gcc/tree-vect-loop.c
>> >>>>>>> >> >> >>>>>>>> >> @@ -1120,6 +1120,12 @@ new_loop_vec_info (struct loop *loop)
>> >>>>>>> >> >> >>>>>>>> >>    LOOP_VINFO_PEELING_FOR_GAPS (res) = false;
>> >>>>>>> >> >> >>>>>>>> >>    LOOP_VINFO_PEELING_FOR_NITER (res) = false;
>> >>>>>>> >> >> >>>>>>>> >>    LOOP_VINFO_OPERANDS_SWAPPED (res) = false;
>> >>>>>>> >> >> >>>>>>>> >> +  LOOP_VINFO_CAN_BE_MASKED (res) = false;
>> >>>>>>> >> >> >>>>>>>> >> +  LOOP_VINFO_REQUIRED_MASKS (res) = 0;
>> >>>>>>> >> >> >>>>>>>> >> +  LOOP_VINFO_COMBINE_EPILOGUE (res) = false;
>> >>>>>>> >> >> >>>>>>>> >> +  LOOP_VINFO_MASK_EPILOGUE (res) = false;
>> >>>>>>> >> >> >>>>>>>> >> +  LOOP_VINFO_NEED_MASKING (res) = false;
>> >>>>>>> >> >> >>>>>>>> >> +  LOOP_VINFO_ORIG_LOOP_INFO (res) = NULL;
>> >>>>>>> >> >> >>>>>>>> >
>> >>>>>>> >> >> >>>>>>>> > Yes.
>> >>>>>>> >> >> >>>>>>>> >
>> >>>>>>> >> >> >>>>>>>> >> Did you mean also that new combined patch must be working patch,
>> >>>>>>> >> >> >>>>>>i.e.
>> >>>>>>> >> >> >>>>>>>> >> can be integrated without other patches?
>> >>>>>>> >> >> >>>>>>>> >
>> >>>>>>> >> >> >>>>>>>> > Yes.
>> >>>>>>> >> >> >>>>>>>> >
>> >>>>>>> >> >> >>>>>>>> >> Could you please look at updated patch?
>> >>>>>>> >> >> >>>>>>>> >
>> >>>>>>> >> >> >>>>>>>> > Will do.
>> >>>>>>> >> >> >>>>>>>> >
>> >>>>>>> >> >> >>>>>>>> > Thanks,
>> >>>>>>> >> >> >>>>>>>> > Richard.
>> >>>>>>> >> >> >>>>>>>> >
>> >>>>>>> >> >> >>>>>>>> >> Thanks.
>> >>>>>>> >> >> >>>>>>>> >> Yuri.
>> >>>>>>> >> >> >>>>>>>> >>
>> >>>>>>> >> >> >>>>>>>> >> 2016-11-10 15:36 GMT+03:00 Richard Biener <rguenther@suse.de>:
>> >>>>>>> >> >> >>>>>>>> >> > On Thu, 10 Nov 2016, Richard Biener wrote:
>> >>>>>>> >> >> >>>>>>>> >> >
>> >>>>>>> >> >> >>>>>>>> >> >> On Tue, 8 Nov 2016, Yuri Rumyantsev wrote:
>> >>>>>>> >> >> >>>>>>>> >> >>
>> >>>>>>> >> >> >>>>>>>> >> >> > Richard,
>> >>>>>>> >> >> >>>>>>>> >> >> >
>> >>>>>>> >> >> >>>>>>>> >> >> > Here is updated 3 patch.
>> >>>>>>> >> >> >>>>>>>> >> >> >
>> >>>>>>> >> >> >>>>>>>> >> >> > I checked that all new tests related to epilogue
>> >>>>>>> >> >> >>>>>>vectorization passed with it.
>> >>>>>>> >> >> >>>>>>>> >> >> >
>> >>>>>>> >> >> >>>>>>>> >> >> > Your comments will be appreciated.
>> >>>>>>> >> >> >>>>>>>> >> >>
>> >>>>>>> >> >> >>>>>>>> >> >> A lot better now.  Instead of the ->aux dance I now prefer to
>> >>>>>>> >> >> >>>>>>>> >> >> pass the original loops loop_vinfo to vect_analyze_loop as
>> >>>>>>> >> >> >>>>>>>> >> >> optional argument (if non-NULL we analyze the epilogue of that
>> >>>>>>> >> >> >>>>>>>> >> >> loop_vinfo).  OTOH I remember we mainly use it to get at the
>> >>>>>>> >> >> >>>>>>>> >> >> original vectorization factor?  So we can pass down an
>> >>>>>>> >> >> >>>>>>(optional)
>> >>>>>>> >> >> >>>>>>>> >> >> forced vectorization factor as well?
>> >>>>>>> >> >> >>>>>>>> >> >
>> >>>>>>> >> >> >>>>>>>> >> > Btw, I wonder if you can produce a single patch containing just
>> >>>>>>> >> >> >>>>>>>> >> > epilogue vectorization, that is combine patches 1-3 but rip out
>> >>>>>>> >> >> >>>>>>>> >> > changes only needed by later patches?
>> >>>>>>> >> >> >>>>>>>> >> >
>> >>>>>>> >> >> >>>>>>>> >> > Thanks,
>> >>>>>>> >> >> >>>>>>>> >> > Richard.
>> >>>>>>> >> >> >>>>>>>> >> >
>> >>>>>>> >> >> >>>>>>>> >> >> Richard.
>> >>>>>>> >> >> >>>>>>>> >> >>
>> >>>>>>> >> >> >>>>>>>> >> >> > 2016-11-08 15:38 GMT+03:00 Richard Biener
>> >>>>>>> >> >> >>>>>><rguenther@suse.de>:
>> >>>>>>> >> >> >>>>>>>> >> >> > > On Thu, 3 Nov 2016, Yuri Rumyantsev wrote:
>> >>>>>>> >> >> >>>>>>>> >> >> > >
>> >>>>>>> >> >> >>>>>>>> >> >> > >> Hi Richard,
>> >>>>>>> >> >> >>>>>>>> >> >> > >>
>> >>>>>>> >> >> >>>>>>>> >> >> > >> I did not understand your last remark:
>> >>>>>>> >> >> >>>>>>>> >> >> > >>
>> >>>>>>> >> >> >>>>>>>> >> >> > >> > That is, here (and avoid the FOR_EACH_LOOP change):
>> >>>>>>> >> >> >>>>>>>> >> >> > >> >
>> >>>>>>> >> >> >>>>>>>> >> >> > >> > @@ -580,12 +586,21 @@ vectorize_loops (void)
>> >>>>>>> >> >> >>>>>>>> >> >> > >> >           && dump_enabled_p ())
>> >>>>>>> >> >> >>>>>>>> >> >> > >> >           dump_printf_loc (MSG_OPTIMIZED_LOCATIONS,
>> >>>>>>> >> >> >>>>>>vect_location,
>> >>>>>>> >> >> >>>>>>>> >> >> > >> >                            "loop vectorized\n");
>> >>>>>>> >> >> >>>>>>>> >> >> > >> > -       vect_transform_loop (loop_vinfo);
>> >>>>>>> >> >> >>>>>>>> >> >> > >> > +       new_loop = vect_transform_loop (loop_vinfo);
>> >>>>>>> >> >> >>>>>>>> >> >> > >> >         num_vectorized_loops++;
>> >>>>>>> >> >> >>>>>>>> >> >> > >> >        /* Now that the loop has been vectorized, allow
>> >>>>>>> >> >> >>>>>>it to be unrolled
>> >>>>>>> >> >> >>>>>>>> >> >> > >> >           etc.  */
>> >>>>>>> >> >> >>>>>>>> >> >> > >> >      loop->force_vectorize = false;
>> >>>>>>> >> >> >>>>>>>> >> >> > >> >
>> >>>>>>> >> >> >>>>>>>> >> >> > >> > +       /* Add new loop to a processing queue.  To make
>> >>>>>>> >> >> >>>>>>it easier
>> >>>>>>> >> >> >>>>>>>> >> >> > >> > +          to match loop and its epilogue vectorization
>> >>>>>>> >> >> >>>>>>in dumps
>> >>>>>>> >> >> >>>>>>>> >> >> > >> > +          put new loop as the next loop to process.
>> >>>>>>> >> >> >>>>>>*/
>> >>>>>>> >> >> >>>>>>>> >> >> > >> > +       if (new_loop)
>> >>>>>>> >> >> >>>>>>>> >> >> > >> > +         {
>> >>>>>>> >> >> >>>>>>>> >> >> > >> > +           loops.safe_insert (i + 1, new_loop->num);
>> >>>>>>> >> >> >>>>>>>> >> >> > >> > +           vect_loops_num = number_of_loops (cfun);
>> >>>>>>> >> >> >>>>>>>> >> >> > >> > +         }
>> >>>>>>> >> >> >>>>>>>> >> >> > >> >
>> >>>>>>> >> >> >>>>>>>> >> >> > >> > simply dispatch to a vectorize_epilogue (loop_vinfo,
>> >>>>>>> >> >> >>>>>>new_loop)
>> >>>>>>> >> >> >>>>>>>> >> >> > >> > function which will set up stuff properly (and also
>> >>>>>>> >> >> >>>>>>perform
>> >>>>>>> >> >> >>>>>>>> >> >> > >> > the if-conversion of the epilogue there).
>> >>>>>>> >> >> >>>>>>>> >> >> > >> >
>> >>>>>>> >> >> >>>>>>>> >> >> > >> > That said, if we can get in non-masked epilogue
>> >>>>>>> >> >> >>>>>>vectorization
>> >>>>>>> >> >> >>>>>>>> >> >> > >> > separately that would be great.
>> >>>>>>> >> >> >>>>>>>> >> >> > >>
>> >>>>>>> >> >> >>>>>>>> >> >> > >> Could you please clarify your proposal.
>> >>>>>>> >> >> >>>>>>>> >> >> > >
>> >>>>>>> >> >> >>>>>>>> >> >> > > When a loop was vectorized set things up to immediately
>> >>>>>>> >> >> >>>>>>vectorize
>> >>>>>>> >> >> >>>>>>>> >> >> > > its epilogue, avoiding changing the loop iteration and
>> >>>>>>> >> >> >>>>>>avoiding
>> >>>>>>> >> >> >>>>>>>> >> >> > > the re-use of ->aux.
>> >>>>>>> >> >> >>>>>>>> >> >> > >
>> >>>>>>> >> >> >>>>>>>> >> >> > > Richard.
>> >>>>>>> >> >> >>>>>>>> >> >> > >
>> >>>>>>> >> >> >>>>>>>> >> >> > >> Thanks.
>> >>>>>>> >> >> >>>>>>>> >> >> > >> Yuri.
>> >>>>>>> >> >> >>>>>>>> >> >> > >>
>> >>>>>>> >> >> >>>>>>>> >> >> > >> 2016-11-02 15:27 GMT+03:00 Richard Biener
>> >>>>>>> >> >> >>>>>><rguenther@suse.de>:
>> >>>>>>> >> >> >>>>>>>> >> >> > >> > On Tue, 1 Nov 2016, Yuri Rumyantsev wrote:
>> >>>>>>> >> >> >>>>>>>> >> >> > >> >
>> >>>>>>> >> >> >>>>>>>> >> >> > >> >> Hi All,
>> >>>>>>> >> >> >>>>>>>> >> >> > >> >>
>> >>>>>>> >> >> >>>>>>>> >> >> > >> >> I re-send all patches sent by Ilya earlier for review
>> >>>>>>> >> >> >>>>>>which support
>> >>>>>>> >> >> >>>>>>>> >> >> > >> >> vectorization of loop epilogues and loops with low
>> >>>>>>> >> >> >>>>>>trip count. We
>> >>>>>>> >> >> >>>>>>>> >> >> > >> >> assume that the only patch -
>> >>>>>>> >> >> >>>>>>vec-tails-07-combine-tail.patch - was not
>> >>>>>>> >> >> >>>>>>>> >> >> > >> >> approved by Jeff.
>> >>>>>>> >> >> >>>>>>>> >> >> > >> >>
>> >>>>>>> >> >> >>>>>>>> >> >> > >> >> I did re-base of all patches and performed
>> >>>>>>> >> >> >>>>>>bootstrapping and
>> >>>>>>> >> >> >>>>>>>> >> >> > >> >> regression testing that did not show any new failures.
>> >>>>>>> >> >> >>>>>>Also all
>> >>>>>>> >> >> >>>>>>>> >> >> > >> >> changes related to new vect_do_peeling algorithm have
>> >>>>>>> >> >> >>>>>>been changed
>> >>>>>>> >> >> >>>>>>>> >> >> > >> >> accordingly.
>> >>>>>>> >> >> >>>>>>>> >> >> > >> >>
>> >>>>>>> >> >> >>>>>>>> >> >> > >> >> Is it OK for trunk?
>> >>>>>>> >> >> >>>>>>>> >> >> > >> >
>> >>>>>>> >> >> >>>>>>>> >> >> > >> > I would have preferred that the series up to
>> >>>>>>> >> >> >>>>>>-03-nomask-tails would
>> >>>>>>> >> >> >>>>>>>> >> >> > >> > _only_ contain epilogue loop vectorization changes but
>> >>>>>>> >> >> >>>>>>unfortunately
>> >>>>>>> >> >> >>>>>>>> >> >> > >> > the patchset is oddly separated.
>> >>>>>>> >> >> >>>>>>>> >> >> > >> >
>> >>>>>>> >> >> >>>>>>>> >> >> > >> > I have a comment on that part nevertheless:
>> >>>>>>> >> >> >>>>>>>> >> >> > >> >
>> >>>>>>> >> >> >>>>>>>> >> >> > >> > @@ -1608,7 +1614,10 @@ vect_enhance_data_refs_alignment
>> >>>>>>> >> >> >>>>>>(loop_vec_info
>> >>>>>>> >> >> >>>>>>>> >> >> > >> > loop_vinfo)
>> >>>>>>> >> >> >>>>>>>> >> >> > >> >    /* Check if we can possibly peel the loop.  */
>> >>>>>>> >> >> >>>>>>>> >> >> > >> >    if (!vect_can_advance_ivs_p (loop_vinfo)
>> >>>>>>> >> >> >>>>>>>> >> >> > >> >        || !slpeel_can_duplicate_loop_p (loop,
>> >>>>>>> >> >> >>>>>>single_exit (loop))
>> >>>>>>> >> >> >>>>>>>> >> >> > >> > -      || loop->inner)
>> >>>>>>> >> >> >>>>>>>> >> >> > >> > +      || loop->inner
>> >>>>>>> >> >> >>>>>>>> >> >> > >> > +      /* Required peeling was performed in prologue
>> >>>>>>> >> >> >>>>>>and
>> >>>>>>> >> >> >>>>>>>> >> >> > >> > +        is not required for epilogue.  */
>> >>>>>>> >> >> >>>>>>>> >> >> > >> > +      || LOOP_VINFO_EPILOGUE_P (loop_vinfo))
>> >>>>>>> >> >> >>>>>>>> >> >> > >> >      do_peeling = false;
>> >>>>>>> >> >> >>>>>>>> >> >> > >> >
>> >>>>>>> >> >> >>>>>>>> >> >> > >> >    if (do_peeling
>> >>>>>>> >> >> >>>>>>>> >> >> > >> > @@ -1888,7 +1897,10 @@ vect_enhance_data_refs_alignment
>> >>>>>>> >> >> >>>>>>(loop_vec_info
>> >>>>>>> >> >> >>>>>>>> >> >> > >> > loop_vinfo)
>> >>>>>>> >> >> >>>>>>>> >> >> > >> >
>> >>>>>>> >> >> >>>>>>>> >> >> > >> >    do_versioning =
>> >>>>>>> >> >> >>>>>>>> >> >> > >> >         optimize_loop_nest_for_speed_p (loop)
>> >>>>>>> >> >> >>>>>>>> >> >> > >> > -       && (!loop->inner); /* FORNOW */
>> >>>>>>> >> >> >>>>>>>> >> >> > >> > +       && (!loop->inner) /* FORNOW */
>> >>>>>>> >> >> >>>>>>>> >> >> > >> > +        /* Required versioning was performed for the
>> >>>>>>> >> >> >>>>>>>> >> >> > >> > +          original loop and is not required for
>> >>>>>>> >> >> >>>>>>epilogue.  */
>> >>>>>>> >> >> >>>>>>>> >> >> > >> > +       && !LOOP_VINFO_EPILOGUE_P (loop_vinfo);
>> >>>>>>> >> >> >>>>>>>> >> >> > >> >
>> >>>>>>> >> >> >>>>>>>> >> >> > >> >    if (do_versioning)
>> >>>>>>> >> >> >>>>>>>> >> >> > >> >      {
>> >>>>>>> >> >> >>>>>>>> >> >> > >> >
>> >>>>>>> >> >> >>>>>>>> >> >> > >> > please do that check in the single caller of this
>> >>>>>>> >> >> >>>>>>function.
>> >>>>>>> >> >> >>>>>>>> >> >> > >> >
>> >>>>>>> >> >> >>>>>>>> >> >> > >> > Otherwise I still dislike the new ->aux use and I
>> >>>>>>> >> >> >>>>>>believe that simply
>> >>>>>>> >> >> >>>>>>>> >> >> > >> > passing down info from the processed parent would be
>> >>>>>>> >> >> >>>>>>_much_ cleaner.
>> >>>>>>> >> >> >>>>>>>> >> >> > >> > That is, here (and avoid the FOR_EACH_LOOP change):
>> >>>>>>> >> >> >>>>>>>> >> >> > >> >
>> >>>>>>> >> >> >>>>>>>> >> >> > >> > @@ -580,12 +586,21 @@ vectorize_loops (void)
>> >>>>>>> >> >> >>>>>>>> >> >> > >> >             && dump_enabled_p ())
>> >>>>>>> >> >> >>>>>>>> >> >> > >> >            dump_printf_loc (MSG_OPTIMIZED_LOCATIONS,
>> >>>>>>> >> >> >>>>>>vect_location,
>> >>>>>>> >> >> >>>>>>>> >> >> > >> >                             "loop vectorized\n");
>> >>>>>>> >> >> >>>>>>>> >> >> > >> > -       vect_transform_loop (loop_vinfo);
>> >>>>>>> >> >> >>>>>>>> >> >> > >> > +       new_loop = vect_transform_loop (loop_vinfo);
>> >>>>>>> >> >> >>>>>>>> >> >> > >> >         num_vectorized_loops++;
>> >>>>>>> >> >> >>>>>>>> >> >> > >> >         /* Now that the loop has been vectorized, allow
>> >>>>>>> >> >> >>>>>>it to be unrolled
>> >>>>>>> >> >> >>>>>>>> >> >> > >> >            etc.  */
>> >>>>>>> >> >> >>>>>>>> >> >> > >> >         loop->force_vectorize = false;
>> >>>>>>> >> >> >>>>>>>> >> >> > >> >
>> >>>>>>> >> >> >>>>>>>> >> >> > >> > +       /* Add new loop to a processing queue.  To make
>> >>>>>>> >> >> >>>>>>it easier
>> >>>>>>> >> >> >>>>>>>> >> >> > >> > +          to match loop and its epilogue vectorization
>> >>>>>>> >> >> >>>>>>in dumps
>> >>>>>>> >> >> >>>>>>>> >> >> > >> > +          put new loop as the next loop to process.
>> >>>>>>> >> >> >>>>>>*/
>> >>>>>>> >> >> >>>>>>>> >> >> > >> > +       if (new_loop)
>> >>>>>>> >> >> >>>>>>>> >> >> > >> > +         {
>> >>>>>>> >> >> >>>>>>>> >> >> > >> > +           loops.safe_insert (i + 1, new_loop->num);
>> >>>>>>> >> >> >>>>>>>> >> >> > >> > +           vect_loops_num = number_of_loops (cfun);
>> >>>>>>> >> >> >>>>>>>> >> >> > >> > +         }
>> >>>>>>> >> >> >>>>>>>> >> >> > >> >
>> >>>>>>> >> >> >>>>>>>> >> >> > >> > simply dispatch to a vectorize_epilogue (loop_vinfo,
>> >>>>>>> >> >> >>>>>>new_loop)
>> >>>>>>> >> >> >>>>>>>> >> >> > >> > function which will set up stuff properly (and also
>> >>>>>>> >> >> >>>>>>perform
>> >>>>>>> >> >> >>>>>>>> >> >> > >> > the if-conversion of the epilogue there).
>> >>>>>>> >> >> >>>>>>>> >> >> > >> >
>> >>>>>>> >> >> >>>>>>>> >> >> > >> > That said, if we can get in non-masked epilogue
>> >>>>>>> >> >> >>>>>>vectorization
>> >>>>>>> >> >> >>>>>>>> >> >> > >> > separately that would be great.
>> >>>>>>> >> >> >>>>>>>> >> >> > >> >
>> >>>>>>> >> >> >>>>>>>> >> >> > >> > I'm still torn about all the rest of the stuff and
>> >>>>>>> >> >> >>>>>>question its
>> >>>>>>> >> >> >>>>>>>> >> >> > >> > usability (esp. merging the epilogue with the main
>> >>>>>>> >> >> >>>>>>vector loop).
>> >>>>>>> >> >> >>>>>>>> >> >> > >> > But it has already been approved ... oh well.
>> >>>>>>> >> >> >>>>>>>> >> >> > >> >
>> >>>>>>> >> >> >>>>>>>> >> >> > >> > Thanks,
>> >>>>>>> >> >> >>>>>>>> >> >> > >> > Richard.
>> >>>>>>> >> >> >>>>>>>> >> >> > >>
>> >>>>>>> >> >> >>>>>>>> >> >> > >>
>> >>>>>>> >> >> >>>>>>>> >> >> > >
>> >>>>>>> >> >> >>>>>>>> >> >> > > --
>> >>>>>>> >> >> >>>>>>>> >> >> > > Richard Biener <rguenther@suse.de>
>> >>>>>>> >> >> >>>>>>>> >> >> > > SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard,
>> >>>>>>> >> >> >>>>>>Graham Norton, HRB 21284 (AG Nuernberg)
>> >>>>>>> >> >> >>>>>>>> >> >> >
>> >>>>>>> >> >> >>>>>>>> >> >>
>> >>>>>>> >> >> >>>>>>>> >> >>
>> >>>>>>> >> >> >>>>>>>> >> >
>> >>>>>>> >> >> >>>>>>>> >> > --
>> >>>>>>> >> >> >>>>>>>> >> > Richard Biener <rguenther@suse.de>
>> >>>>>>> >> >> >>>>>>>> >> > SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham
>> >>>>>>> >> >> >>>>>>Norton, HRB 21284 (AG Nuernberg)
>> >>>>>>> >> >> >>>>>>>> >>
>> >>>>>>> >> >> >>>>>>>> >
>> >>>>>>> >> >> >>>>>>>> > --
>> >>>>>>> >> >> >>>>>>>> > Richard Biener <rguenther@suse.de>
>> >>>>>>> >> >> >>>>>>>> > SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham
>> >>>>>>> >> >> >>>>>>Norton, HRB 21284 (AG Nuernberg)
>> >>>>>>> >> >> >>>>>>>>
>> >>>>>>> >> >> >>>>>>>
>> >>>>>>> >> >> >>>>>>> --
>> >>>>>>> >> >> >>>>>>> Richard Biener <rguenther@suse.de>
>> >>>>>>> >> >> >>>>>>> SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham
>> >>>>>>> >> >> >>>>>>Norton, HRB 21284 (AG Nuernberg)
>> >>>>>>> >> >> >>>>>
>> >>>>>>> >> >> >>>>>
>> >>>>>>> >> >>
>> >>>>>>> >> >
>> >>>>>>> >> > --
>> >>>>>>> >> > Richard Biener <rguenther@suse.de>
>> >>>>>>> >> > SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)
>> >>>>>>> >>
>> >>>>>>> >
>> >>>>>>> > --
>> >>>>>>> > Richard Biener <rguenther@suse.de>
>> >>>>>>> > SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)
>> >>>>>>>
>> >>>>>>>
>> >>>>>>
>> >>>>>> --
>> >>>>>> Richard Biener <rguenther@suse.de>
>> >>>>>> SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)
>>
>
> --
> Richard Biener <rguenther@suse.de>
> SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)

[-- Attachment #2: nomask.patch --]
[-- Type: application/octet-stream, Size: 4428 bytes --]

diff --git a/gcc/testsuite/gcc.dg/vect/vec-tail-nomask-2.c b/gcc/testsuite/gcc.dg/vect/vec-tail-nomask-2.c
new file mode 100755
index 0000000..47bb4b7
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vec-tail-nomask-2.c
@@ -0,0 +1,155 @@
+/* { dg-do run } */
+/* { dg-require-weak "" } */
+/* { dg-additional-options "-ffast-math --param vect-epilogues-nomask=1 -mavx2" { target avx2_runtime } } */
+
+#define SIZE 1023
+#define ALIGN 64
+
+extern int posix_memalign(void **memptr, __SIZE_TYPE__ alignment, __SIZE_TYPE__ size);
+extern void free (void *);
+
+double __attribute__((noinline))
+test_citer (int * __restrict__ a,
+	    long long * __restrict__ b,
+	    float * __restrict__ c,
+	    double * __restrict__ d)
+{
+  double res = 0;
+  int i;
+
+  a = (int *)__builtin_assume_aligned (a, ALIGN);
+  b = (long long *)__builtin_assume_aligned (b, ALIGN);
+  c = (float *)__builtin_assume_aligned (c, ALIGN);
+  d = (double *)__builtin_assume_aligned (d, ALIGN);
+
+  for (i = 0; i < SIZE; i++)
+    {
+      a[i] = c[i] + 1;
+      if (b[i] < 0)
+	res += d[i];
+    }
+
+  return res;
+}
+
+double __attribute__((noinline))
+test_viter (int * __restrict__ a,
+	    long long * __restrict__ b,
+	    float * __restrict__ c,
+	    double * __restrict__ d,
+	    int size)
+{
+  double res = 0;
+  int i;
+
+  a = (int *)__builtin_assume_aligned (a, ALIGN);
+  b = (long long *)__builtin_assume_aligned (b, ALIGN);
+  c = (float *)__builtin_assume_aligned (c, ALIGN);
+  d = (double *)__builtin_assume_aligned (d, ALIGN);
+
+  for (i = 0; i < size; i++)
+    {
+      a[i] = c[i] + 1;
+      if (b[i] < 0)
+	res += d[i];
+    }
+
+  return res;
+}
+
+void __attribute__((noinline))
+init_data (int * __restrict__ a,
+	   long long * __restrict__ b,
+	   float * __restrict__ c,
+	   double * __restrict__ d,
+	   int size)
+{
+  int i;
+  for (i = 0; i < size; i++)
+    {
+      if (i % 2)
+	{
+	  a[i] = 0;
+	  b[i] = i;
+	  c[i] = 2.5;
+	  d[i] = 1;
+	}
+      else
+	{
+	  a[i] = 0;
+	  b[i] = -i;
+	  c[i] = 2.5;
+	  d[i] = -1;
+	}
+      asm volatile("": : :"memory");
+    }
+  a[size] = (int)size;
+  b[size] = (long long)size;
+  c[size] = (float)size;
+  d[size] = (double)size;
+}
+
+void __attribute__((noinline))
+run_test ()
+{
+  int *a;
+  long long *b;
+  float *c;
+  double *d;
+  double res;
+  int i;
+
+  if (posix_memalign ((void **)&a, ALIGN, (SIZE + 1) * sizeof (int)) != 0)
+    return;
+  if (posix_memalign ((void **)&b, ALIGN, (SIZE + 1) * sizeof (long long)) != 0)
+    return;
+  if (posix_memalign ((void **)&c, ALIGN, (SIZE + 1) * sizeof (float)) != 0)
+    return;
+  if (posix_memalign ((void **)&d, ALIGN, (SIZE + 1) * sizeof (double)) != 0)
+    return;
+
+  init_data (a, b, c, d, SIZE);
+  res = test_citer (a, b, c, d);
+  res += SIZE / 2;
+  if (res > 0.01 || res < -0.01)
+    __builtin_abort ();
+  for (i = 0; i < SIZE; i++)
+    if (a[i] != 3)
+      __builtin_abort ();
+  if (a[SIZE] != (int)SIZE
+      || b[SIZE] != (long long)SIZE
+      || c[SIZE] != (float)SIZE
+      || d[SIZE] != (double)SIZE)
+    __builtin_abort ();
+
+  init_data (a, b, c, d, SIZE);
+  res = test_viter (a, b, c, d, SIZE);
+  res += SIZE / 2;
+  if (res > 0.01 || res < -0.01)
+    __builtin_abort ();
+  for (i = 0; i < SIZE; i++)
+    if (a[i] != 3)
+      __builtin_abort ();
+  if (a[SIZE] != (int)SIZE
+      || b[SIZE] != (long long)SIZE
+      || c[SIZE] != (float)SIZE
+      || d[SIZE] != (double)SIZE)
+    __builtin_abort ();
+
+  free (a);
+  free (b);
+  free (c);
+}
+
+int
+main (int argc, const char **argv)
+{
+  if (!posix_memalign)
+    return 0;
+
+  run_test ();
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "LOOP VECTORIZED \\(VS=32\\)" 2 "vect" { target avx2_runtime } } } */
+/* { dg-final { scan-tree-dump-times "LOOP EPILOGUE COMBINED \\(VS=32\\)" 2 "vect" { target avx2_runtime } } } */
diff --git a/gcc/tree-vectorizer.c b/gcc/tree-vectorizer.c
index 7538c6c..39762cb 100644
--- a/gcc/tree-vectorizer.c
+++ b/gcc/tree-vectorizer.c
@@ -540,8 +540,8 @@ vectorize_loops (void)
 	     || loop->force_vectorize)
       {
 	loop_vec_info loop_vinfo, orig_loop_vinfo = NULL;
-	gimple *loop_vectorized_call = vect_loop_vectorized_call (loop);
 vectorize_epilogue:
+	gimple *loop_vectorized_call = vect_loop_vectorized_call (loop);
 	vect_location = find_loop_location (loop);
         if (LOCATION_LOCUS (vect_location) != UNKNOWN_LOCATION
 	    && dump_enabled_p ())

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH, vec-tails] Support loop epilogue vectorization
  2016-12-21 10:14                                                           ` Yuri Rumyantsev
@ 2016-12-21 17:23                                                             ` Yuri Rumyantsev
  0 siblings, 0 replies; 38+ messages in thread
From: Yuri Rumyantsev @ 2016-12-21 17:23 UTC (permalink / raw)
  To: Richard Biener, gcc-patches, Pavel Chupin

[-- Attachment #1: Type: text/plain, Size: 60196 bytes --]

Sorry,

I attached the wrong test; the corrected one is attached here.

2016-12-21 13:12 GMT+03:00 Yuri Rumyantsev <ysrumyan@gmail.com>:
> Hi Richard,
>
> I accidentally found a bug in my patch for epilogue vectorization
> without masking: the label needs to be placed before the
> initialization.
>
> Could you please review it and integrate it into trunk? The test case is also attached.
>
>
> Thanks in advance.
> Yuri.
>
> ChangeLog:
> 2016-12-21  Yuri Rumyantsev  <ysrumyan@gmail.com>
>
> * tree-vectorizer.c (vectorize_loops): Put label before initialization
> of loop_vectorized_call.
>
> gcc/testsuite/
>
> * gcc.dg/vect/vect-tail-nomask-2.c: New test.
>
> 2016-12-13 16:59 GMT+03:00 Richard Biener <rguenther@suse.de>:
>> On Mon, 12 Dec 2016, Yuri Rumyantsev wrote:
>>
>>> Richard,
>>>
>>> Could you please review the cost model patch before I include it in
>>> the epilogue masking patch and add masking cost estimation as you
>>> requested?
>>
>> That's just the middle-end / target changes.  I was not 100% happy
>> with them but well, the vectorizer cost modeling needs work
>> (aka another rewrite).
>>
>> From below...
>>
>>> Thanks.
>>>
>>> Patch and ChangeLog are attached.
>>>
>>> 2016-12-12 15:47 GMT+03:00 Yuri Rumyantsev <ysrumyan@gmail.com>:
>>> > Hi Richard,
>>> >
>>> > You asked me about the performance of SPEC2006 on an AVX2 machine with the new feature.
>>> >
>>> > I tried the following on Haswell using the original patch designed by Ilya.
>>> > 1. Masking low trip count loops: only 6 benchmarks are affected and
>>> > performance is almost the same
>>> > 464.h264ref     63.9000    64.0000 +0.15%
>>> > 416.gamess      42.9000    42.9000 +0%
>>> > 435.gromacs     32.8000    32.7000 -0.30%
>>> > 447.dealII      68.5000    68.3000 -0.29%
>>> > 453.povray      61.9000    62.1000 +0.32%
>>> > 454.calculix    39.8000    39.8000 +0%
>>> > 465.tonto       29.9000    29.9000 +0%
>>> >
>>> > 2. epilogue vectorization without masking (use less vf) (3 benchmarks
>>> > are not affected)
>>> > 400.perlbench     47.2000    46.5000 -1.48%
>>> > 401.bzip2         29.9000    29.9000 +0%
>>> > 403.gcc           41.8000    41.6000 -0.47%
>>> > 456.hmmer         32.0000    32.0000 +0%
>>> > 462.libquantum    81.5000    82.0000 +0.61%
>>> > 464.h264ref       65.0000    65.5000 +0.76%
>>> > 471.omnetpp       27.8000    28.2000 +1.43%
>>> > 473.astar         28.7000    28.6000 -0.34%
>>> > 483.xalancbmk     48.7000    48.6000 -0.20%
>>> > 410.bwaves        95.3000    95.3000 +0%
>>> > 416.gamess        42.9000    42.8000 -0.23%
>>> > 433.milc          38.8000    38.8000 +0%
>>> > 434.zeusmp        51.7000    51.4000 -0.58%
>>> > 435.gromacs       32.8000    32.8000 +0%
>>> > 436.cactusADM     85.0000    83.0000 -2.35%
>>> > 437.leslie3d      55.5000    55.5000 +0%
>>> > 444.namd          31.3000    31.3000 +0%
>>> > 447.dealII        68.7000    68.9000 +0.29%
>>> > 450.soplex        47.3000    47.4000 +0.21%
>>> > 453.povray        62.1000    61.4000 -1.12%
>>> > 454.calculix      39.7000    39.3000 -1.00%
>>> > 459.GemsFDTD      44.9000    45.0000 +0.22%
>>> > 465.tonto         29.8000    29.8000 +0%
>>> > 481.wrf           51.0000    51.2000 +0.39%
>>> > 482.sphinx3       69.8000    71.2000 +2.00%
>>
>> I see 471.omnetpp and 482.sphinx3 are in a similar ballpark and it
>> would be nice to catch the relevant case(s) with a cost model for
>> epilogue vectorization without masking first (to get rid of
>> --param vect-epilogues-nomask).
>>
>> As said elsewhere any non-conservative cost modeling (if the
>> number of scalar iterations is not statically constant) might
>> require versioning of the loop into a non-vectorized,
>> short-trip vectorized and regular vectorized case (the Intel
>> compiler does way more aggressive versioning IIRC).
>>
>> Richard.
>>
>>> > 3. epilogue vectorization using masking (4 benchmarks are not affected):
>>> > 400.perlbench     47.5000    46.8000 -1.47%
>>> > 401.bzip2         30.0000    29.9000 -0.33%
>>> > 403.gcc           42.3000    42.3000 +0%
>>> > 445.gobmk         32.1000    32.8000 +2.18%
>>> > 456.hmmer         32.0000    32.0000 +0%
>>> > 458.sjeng         36.1000    35.5000 -1.66%
>>> > 462.libquantum    81.1000    81.1000 +0%
>>> > 464.h264ref       65.4000    65.0000 -0.61%
>>> > 483.xalancbmk     49.4000    49.3000 -0.20%
>>> > 410.bwaves        95.9000    95.5000 -0.41%
>>> > 416.gamess        42.8000    42.6000 -0.46%
>>> > 433.milc          38.8000    39.1000 +0.77%
>>> > 434.zeusmp        52.1000    51.3000 -1.53%
>>> > 435.gromacs       32.9000    32.9000 +0%
>>> > 436.cactusADM     78.8000    85.3000 +8.24%
>>> > 437.leslie3d      55.4000    55.4000 +0%
>>> > 444.namd          31.3000    31.3000 +0%
>>> > 447.dealII        69.0000    69.2000 +0.28%
>>> > 450.soplex        47.7000    47.6000 -0.20%
>>> > 453.povray        62.2000    61.7000 -0.80%
>>> > 454.calculix      39.7000    38.2000 -3.77%
>>> > 459.GemsFDTD      44.9000    45.0000 +0.22%
>>> > 465.tonto         29.8000    29.9000 +0.33%
>>> > 481.wrf           51.2000    51.6000 +0.78%
>>> > 482.sphinx3       70.3000    65.4000 -6.97%
>>> >
>>> > There is a good speed-up for 436 but an essential slowdown on 482 and 454.
>>> >
>>> > So in general we don't have any advantage for AVX2.
>>> >
>>> > Best regards.
>>> > Yuri.
>>> >
>>> > P.S.
>>> > I am not able to provide you with AVX512 numbers because I don't have
>>> > access to such a machine.
>>> > An updated patch will be sent later.
>>> >
>>> > Best regards.
>>> > Yuri.
>>> >
>>> >
>>> > 2016-12-05 15:44 GMT+03:00 Yuri Rumyantsev <ysrumyan@gmail.com>:
>>> >> Richard,
>>> >>
>>> >> Sorry, I sent you the wrong assembly produced for the loop with low trip
>>> >> count; here is the correct one:
>>> >>
>>> >> vmovdqa .LC0(%rip), %ymm0
>>> >> vpmaskmovd b(%rip), %ymm0, %ymm1
>>> >> vpmaskmovd c(%rip), %ymm0, %ymm2
>>> >> vpaddd %ymm2, %ymm1, %ymm1
>>> >> vpmaskmovd %ymm1, %ymm0, a(%rip)
>>> >>
>>> >> where .LC0 is a vector with all elements equal to -1 except for the last.
>>> >>
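To make the masked sequence quoted above easier to follow, here is a minimal scalar C sketch that emulates an 8-lane masked load, add and masked store for the 7-iteration loop. The names `VF`, `make_mask` and `masked_add` are hypothetical and do not come from the patch; the lane-by-lane loop only models what `vpmaskmovd` does (inactive load lanes read as zero, inactive store lanes are not written).

```c
#include <assert.h>
#include <stdint.h>

enum { VF = 8 };  /* assumption: eight 32-bit lanes, as in one AVX2 ymm register */

/* Build the lane mask for a loop of n iterations (n <= VF):
   lane i is all-ones (-1) when i < n and 0 otherwise.  For n == 7 this
   matches the .LC0 constant described in the mail.  */
static void make_mask(int32_t mask[VF], int n)
{
    assert(n >= 0 && n <= VF);
    for (int i = 0; i < VF; i++)
        mask[i] = (i < n) ? -1 : 0;
}

/* Emulate: masked load of b and c, vector add, masked store to a.
   Memory behind inactive lanes is neither read nor written.  */
static void masked_add(int32_t *a, const int32_t *b, const int32_t *c,
                       const int32_t mask[VF])
{
    for (int i = 0; i < VF; i++) {
        int32_t vb = mask[i] ? b[i] : 0;  /* masked load: inactive lane reads as 0 */
        int32_t vc = mask[i] ? c[i] : 0;
        if (mask[i])                      /* masked store: skip inactive lane */
            a[i] = vb + vc;
    }
}
```

Calling `make_mask(m, 7)` followed by `masked_add(a, b, c, m)` updates `a[0]` through `a[6]` and leaves `a[7]` unchanged, which is why no scalar epilogue is needed for the 7-iteration loop.
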
>>> >> Note also that the additional option --param vect-short-loops=1
>>> >> is required to do such conversion.
>>> >>
>>> >> Best regards.
>>> >> Yuri.
>>> >>
>>> >> 2016-12-02 18:59 GMT+03:00 Yuri Rumyantsev <ysrumyan@gmail.com>:
>>> >>> Richard,
>>> >>>
>>> >>> Important clarification: the test I sent you with low trip count is
>>> >>> vectorized through masking only under
>>> >>> --param vect-epilogues-combine=1 -fvect-epilogue-cost-model=unlimited
>>> >>> for avx2. The last option is required for avx2 since masked store has
>>> >>> a big cost in comparison with masked load.
>>> >>>
>>> >>> Below is the assembly produced for it:
>>> >>> vpcmpeqd %xmm0, %xmm0, %xmm0
>>> >>> vpmaskmovd b(%rip), %xmm0, %xmm1
>>> >>> vpmaskmovd c(%rip), %xmm0, %xmm2
>>> >>> vpaddd %xmm2, %xmm1, %xmm1
>>> >>> vpmaskmovd %xmm1, %xmm0, a(%rip)
>>> >>> ret
>>> >>>
>>> >>> Thanks.
>>> >>> Yuri.
>>> >>>
>>> >>>
>>> >>> 2016-12-02 18:49 GMT+03:00 Yuri Rumyantsev <ysrumyan@gmail.com>:
>>> >>>> Richard,
>>> >>>>
>>> >>>> I have also question about low trip count loops.
>>> >>>> Did you mean that
>>> >>>> int a[128], b[128], c[128];
>>> >>>>
>>> >>>> void foo ()
>>> >>>> {
>>> >>>>   int i;
>>> >>>>   for (i = 0; i<7; i++)
>>> >>>>     a[i] = b[i] + c[i];
>>> >>>> }
>>> >>>>
>>> >>>> must be vectorized with masking without epilogue creation (e.g. for avx2)?
>>> >>>>
>>> >>>> Currently it is vectorized with vector size 128. I also noticed that
>>> >>>> the original patch by Ilya does nothing for such masking.
>>> >>>>
>>> >>>> Thanks.
>>> >>>> Yuri.
>>> >>>>
>>> >>>> 2016-12-02 17:08 GMT+03:00 Yuri Rumyantsev <ysrumyan@gmail.com>:
>>> >>>>> Richard,
>>> >>>>>
>>> >>>>> You wrote:
>>> >>>>> I don't see _any_ cost model for vectorizing the epilogue with
>>> >>>>> masking?  Am I missing something?  A "trivial" cost model
>>> >>>>> should at least consider the additional IV(s), the mask
>>> >>>>> compute and the widening and narrowing ops required.
>>> >>>>>
>>> >>>>> I skipped all changes related to the cost model, assuming that one of the
>>> >>>>> next patches will contain all cost model changes.
>>> >>>>>
>>> >>>>> Should I include them in this patch?
>>> >>>>>
>>> >>>>> Thanks.
>>> >>>>> Yuri.
>>> >>>>>
>>> >>>>> 2016-12-01 17:45 GMT+03:00 Richard Biener <rguenther@suse.de>:
>>> >>>>>> On Thu, 1 Dec 2016, Yuri Rumyantsev wrote:
>>> >>>>>>
>>> >>>>>>> Thanks Richard for your comments.
>>> >>>>>>>
>>> >>>>>>> You asked me about possible performance improvements for AVX2 machines
>>> >>>>>>> - we did not see any visible speed-up for spec2k with any method of
>>> >>>>>>
>>> >>>>>> Spec 2000?  Can you check with SPEC 2006 or CPUv6?
>>> >>>>>>
>>> >>>>>> Did you see performance degradation?  What about compile-time and
>>> >>>>>> binary size effects?
>>> >>>>>>
>>> >>>>>>> masking, including epilogue masking and combining, only on AVX512
>>> >>>>>>> machine aka knl.
>>> >>>>>>
>>> >>>>>> I see.
>>> >>>>>>
>>> >>>>>> Note that as said in the initial review patch the cost model I
>>> >>>>>> saw therein looked flawed.  In the end I'd expect a sensible
>>> >>>>>> approach would be to do
>>> >>>>>>
>>> >>>>>>  if (n < scalar-most-profitable-niter)
>>> >>>>>>    {
>>> >>>>>>      no vectorization
>>> >>>>>>    }
>>> >>>>>>  else if (n < masking-more-profitable-than-not-masking-plus-epilogue)
>>> >>>>>>    {
>>> >>>>>>      do masked vectorization
>>> >>>>>>    }
>>> >>>>>>  else
>>> >>>>>>    {
>>> >>>>>>      do unmasked vectorization (with epilogue, eventually vectorized)
>>> >>>>>>    }
>>> >>>>>>
>>> >>>>>> where for short trip loops the else path would never be taken
>>> >>>>>> (statically).
>>> >>>>>>
>>> >>>>>> And yes, that means masking will only be useful for short-trip loops
>>> >>>>>> which in the end means an overall performance benefit is unlikely
>>> >>>>>> unless we have a lot of short-trip loops that are slow because of
>>> >>>>>> the overhead of main unmasked loop plus epilogue.
>>> >>>>>>
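The three-way decision sketched in the pseudocode above could be written as an ordinary C function. This is only an illustration under stated assumptions: `choose_strategy` and both threshold parameters are hypothetical names, and a real cost model would have to compute the thresholds from target costs rather than receive them as arguments.

```c
#include <assert.h>

typedef enum {
    STRATEGY_SCALAR,    /* too few iterations: no vectorization */
    STRATEGY_MASKED,    /* short trip count: one masked vector loop, no epilogue */
    STRATEGY_UNMASKED   /* long trip count: unmasked vector loop plus epilogue */
} vect_strategy;

/* n                   : number of scalar iterations
   min_vect_niters     : hypothetical threshold below which scalar code wins
   min_unmasked_niters : hypothetical threshold from which the unmasked loop
                         plus epilogue beats the masked loop  */
static vect_strategy choose_strategy(unsigned long n,
                                     unsigned long min_vect_niters,
                                     unsigned long min_unmasked_niters)
{
    assert(min_vect_niters <= min_unmasked_niters);
    if (n < min_vect_niters)
        return STRATEGY_SCALAR;
    if (n < min_unmasked_niters)
        return STRATEGY_MASKED;
    return STRATEGY_UNMASKED;
}
```

When `n` is a compile-time constant, as in a short-trip loop, only one branch can ever be taken, which corresponds to the remark that the else path is never taken statically.
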
>>> >>>>>> Richard.
>>> >>>>>>
>>> >>>>>>> I will answer your question later.
>>> >>>>>>>
>>> >>>>>>> Best regards.
>>> >>>>>>> Yuri
>>> >>>>>>>
>>> >>>>>>> 2016-12-01 14:33 GMT+03:00 Richard Biener <rguenther@suse.de>:
>>> >>>>>>> > On Mon, 28 Nov 2016, Yuri Rumyantsev wrote:
>>> >>>>>>> >
>>> >>>>>>> >> Richard!
>>> >>>>>>> >>
>>> >>>>>>> >> I attached the vect dump for the part of the attached test-case which
>>> >>>>>>> >> illustrates how vectorization of epilogues works through masking:
>>> >>>>>>> >> #define SIZE 1023
>>> >>>>>>> >> #define ALIGN 64
>>> >>>>>>> >>
>>> >>>>>>> >> extern int posix_memalign(void **memptr, __SIZE_TYPE__ alignment,
>>> >>>>>>> >> __SIZE_TYPE__ size) __attribute__((weak));
>>> >>>>>>> >> extern void free (void *);
>>> >>>>>>> >>
>>> >>>>>>> >> void __attribute__((noinline))
>>> >>>>>>> >> test_citer (int * __restrict__ a,
>>> >>>>>>> >>    int * __restrict__ b,
>>> >>>>>>> >>    int * __restrict__ c)
>>> >>>>>>> >> {
>>> >>>>>>> >>   int i;
>>> >>>>>>> >>
>>> >>>>>>> >>   a = (int *)__builtin_assume_aligned (a, ALIGN);
>>> >>>>>>> >>   b = (int *)__builtin_assume_aligned (b, ALIGN);
>>> >>>>>>> >>   c = (int *)__builtin_assume_aligned (c, ALIGN);
>>> >>>>>>> >>
>>> >>>>>>> >>   for (i = 0; i < SIZE; i++)
>>> >>>>>>> >>     c[i] = a[i] + b[i];
>>> >>>>>>> >> }
>>> >>>>>>> >>
>>> >>>>>>> >> It was compiled with -mavx2 --param vect-epilogues-mask=1 options.
>>> >>>>>>> >>
>>> >>>>>>> >> I did not include in this patch vectorization of low trip-count loops
>>> >>>>>>> >> since in the original patch additional parameter was introduced:
>>> >>>>>>> >> +DEFPARAM (PARAM_VECT_SHORT_LOOPS,
>>> >>>>>>> >> +  "vect-short-loops",
>>> >>>>>>> >> +  "Enable vectorization of low trip count loops using masking.",
>>> >>>>>>> >> +  0, 0, 1)
>>> >>>>>>> >>
>>> >>>>>>> >> I assume that this ability can be included very quickly but it
>>> >>>>>>> >> requires cost model enhancements also.
>>> >>>>>>> >
>>> >>>>>>> > Comments on the patch itself (as I'm having a closer look again,
>>> >>>>>>> > I know how it vectorizes the above but I wondered why epilogue
>>> >>>>>>> > and short-trip loops are not basically the same code path).
>>> >>>>>>> >
>>> >>>>>>> > Btw, I don't like that the features are behind a --param paywall.
>>> >>>>>>> > That just means a) nobody will use it, b) it will bit-rot quickly,
>>> >>>>>>> > c) bugs are well-hidden.
>>> >>>>>>> >
>>> >>>>>>> > +  if (loop_vinfo && LOOP_VINFO_CAN_BE_MASKED (loop_vinfo)
>>> >>>>>>> > +      && integer_zerop (nested_in_vect_loop
>>> >>>>>>> > +                       ? STMT_VINFO_DR_STEP (stmt_info)
>>> >>>>>>> > +                       : DR_STEP (dr)))
>>> >>>>>>> > +    {
>>> >>>>>>> > +      if (dump_enabled_p ())
>>> >>>>>>> > +       dump_printf_loc (MSG_NOTE, vect_location,
>>> >>>>>>> > +                        "allow invariant load for masked loop.\n");
>>> >>>>>>> > +    }
>>> >>>>>>> >
>>> >>>>>>> > this can test memory_access_type == VMAT_INVARIANT.  Please put
>>> >>>>>>> > all the checks in a common
>>> >>>>>>> >
>>> >>>>>>> >   if (loop_vinfo && LOOP_VINFO_CAN_BE_MASKED (loop_vinfo))
>>> >>>>>>> >     {
>>> >>>>>>> >        if (memory_access_type == VMAT_INVARIANT)
>>> >>>>>>> >          {
>>> >>>>>>> >          }
>>> >>>>>>> >        else if (...)
>>> >>>>>>> >          {
>>> >>>>>>> >             LOOP_VINFO_CAN_BE_MASKED (loop_vinfo) = false;
>>> >>>>>>> >          }
>>> >>>>>>> >        else if (..)
>>> >>>>>>> > ...
>>> >>>>>>> >     }
>>> >>>>>>> >
>>> >>>>>>> > @@ -6667,6 +6756,15 @@ vectorizable_load (gimple *stmt,
>>> >>>>>>> > gimple_stmt_iterator *gsi, gimple **vec_stmt,
>>> >>>>>>> >        gcc_assert (!nested_in_vect_loop);
>>> >>>>>>> >        gcc_assert (!STMT_VINFO_GATHER_SCATTER_P (stmt_info));
>>> >>>>>>> >
>>> >>>>>>> > +      if (loop_vinfo && LOOP_VINFO_CAN_BE_MASKED (loop_vinfo))
>>> >>>>>>> > +       {
>>> >>>>>>> > +         if (dump_enabled_p ())
>>> >>>>>>> > +           dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
>>> >>>>>>> > +                            "cannot be masked: grouped access is not"
>>> >>>>>>> > +                            " supported.");
>>> >>>>>>> > +         LOOP_VINFO_CAN_BE_MASKED (loop_vinfo) = false;
>>> >>>>>>> > +      }
>>> >>>>>>> > +
>>> >>>>>>> >
>>> >>>>>>> > isn't this already handled by the above?  Or rather the general
>>> >>>>>>> > disallowance of SLP?
>>> >>>>>>> >
>>> >>>>>>> > @@ -5730,6 +5792,24 @@ vectorizable_store (gimple *stmt,
>>> >>>>>>> > gimple_stmt_iterator *gsi, gimple **vec_stmt,
>>> >>>>>>> >                             &memory_access_type, &gs_info))
>>> >>>>>>> >      return false;
>>> >>>>>>> >
>>> >>>>>>> > +  if (loop_vinfo && LOOP_VINFO_CAN_BE_MASKED (loop_vinfo)
>>> >>>>>>> > +      && memory_access_type != VMAT_CONTIGUOUS)
>>> >>>>>>> > +    {
>>> >>>>>>> > +      LOOP_VINFO_CAN_BE_MASKED (loop_vinfo) = false;
>>> >>>>>>> > +      if (dump_enabled_p ())
>>> >>>>>>> > +       dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
>>> >>>>>>> > +                        "cannot be masked: unsupported memory access
>>> >>>>>>> > type.\n");
>>> >>>>>>> > +    }
>>> >>>>>>> > +
>>> >>>>>>> > +  if (loop_vinfo && LOOP_VINFO_CAN_BE_MASKED (loop_vinfo)
>>> >>>>>>> > +      && !can_mask_load_store (stmt))
>>> >>>>>>> > +    {
>>> >>>>>>> > +      LOOP_VINFO_CAN_BE_MASKED (loop_vinfo) = false;
>>> >>>>>>> > +      if (dump_enabled_p ())
>>> >>>>>>> > +       dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
>>> >>>>>>> > +                        "cannot be masked: unsupported masked store.\n");
>>> >>>>>>> > +    }
>>> >>>>>>> > +
>>> >>>>>>> >
>>> >>>>>>> > likewise please combine the ifs.
>>> >>>>>>> >
>>> >>>>>>> > @@ -2354,7 +2401,10 @@ vectorizable_mask_load_store (gimple *stmt,
>>> >>>>>>> > gimple_stmt_iterator *gsi,
>>> >>>>>>> >                                           ptr, vec_mask, vec_rhs);
>>> >>>>>>> >           vect_finish_stmt_generation (stmt, new_stmt, gsi);
>>> >>>>>>> >           if (i == 0)
>>> >>>>>>> > -           STMT_VINFO_VEC_STMT (stmt_info) = *vec_stmt = new_stmt;
>>> >>>>>>> > +           {
>>> >>>>>>> > +             STMT_VINFO_VEC_STMT (stmt_info) = *vec_stmt = new_stmt;
>>> >>>>>>> > +             STMT_VINFO_FIRST_COPY_P (vinfo_for_stmt (new_stmt)) = true;
>>> >>>>>>> > +           }
>>> >>>>>>> >           else
>>> >>>>>>> >             STMT_VINFO_RELATED_STMT (prev_stmt_info) = new_stmt;
>>> >>>>>>> >           prev_stmt_info = vinfo_for_stmt (new_stmt);
>>> >>>>>>> >
>>> >>>>>>> > here you only set the flag, elsewhere you copy DR and VECTYPE as well.
>>> >>>>>>> >
>>> >>>>>>> > @@ -2113,6 +2146,20 @@ vectorizable_mask_load_store (gimple *stmt,
>>> >>>>>>> > gimple_stmt_iterator *gsi,
>>> >>>>>>> >                && !useless_type_conversion_p (vectype, rhs_vectype)))
>>> >>>>>>> >      return false;
>>> >>>>>>> >
>>> >>>>>>> > +  if (LOOP_VINFO_CAN_BE_MASKED (loop_vinfo))
>>> >>>>>>> > +    {
>>> >>>>>>> > +      /* Check that mask conjuction is supported.  */
>>> >>>>>>> > +      optab tab;
>>> >>>>>>> > +      tab = optab_for_tree_code (BIT_AND_EXPR, vectype, optab_default);
>>> >>>>>>> > +      if (!tab || optab_handler (tab, TYPE_MODE (vectype)) ==
>>> >>>>>>> > CODE_FOR_nothing)
>>> >>>>>>> > +       {
>>> >>>>>>> > +         if (dump_enabled_p ())
>>> >>>>>>> > +           dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
>>> >>>>>>> > +                            "cannot be masked: unsupported mask
>>> >>>>>>> > operation\n");
>>> >>>>>>> > +         LOOP_VINFO_CAN_BE_MASKED (loop_vinfo) = false;
>>> >>>>>>> > +       }
>>> >>>>>>> > +    }
>>> >>>>>>> >
>>> >>>>>>> > does this really test whether we can bit-and the mask?  You are
>>> >>>>>>> > using the vector type of the store (which might be V2DF for example),
>>> >>>>>>> > also for AVX512 it might be a vector-bool type with integer mode?
>>> >>>>>>> > Of course we maybe can simply assume mask conjunction is available
>>> >>>>>>> > (I know no ISA where that would be not true).
>>> >>>>>>> >
>>> >>>>>>> > +/* Return true if STMT can be converted to masked form.  */
>>> >>>>>>> > +
>>> >>>>>>> > +static bool
>>> >>>>>>> > +can_mask_load_store (gimple *stmt)
>>> >>>>>>> > +{
>>> >>>>>>> > +  stmt_vec_info stmt_info = vinfo_for_stmt (stmt);
>>> >>>>>>> > +  tree vectype, mask_vectype;
>>> >>>>>>> > +  tree lhs, ref;
>>> >>>>>>> > +
>>> >>>>>>> > +  if (!stmt_info)
>>> >>>>>>> > +    return false;
>>> >>>>>>> > +  lhs = gimple_assign_lhs (stmt);
>>> >>>>>>> > +  ref = (TREE_CODE (lhs) == SSA_NAME) ? gimple_assign_rhs1 (stmt) : lhs;
>>> >>>>>>> > +  if (may_be_nonaddressable_p (ref))
>>> >>>>>>> > +    return false;
>>> >>>>>>> > +  vectype = STMT_VINFO_VECTYPE (stmt_info);
>>> >>>>>>> >
>>> >>>>>>> > You probably modeled this after ifcvt_can_use_mask_load_store but I
>>> >>>>>>> > don't think checking may_be_nonaddressable_p is necessary (we couldn't
>>> >>>>>>> > even vectorize such refs).  stmt_info should never be NULL either.
>>> >>>>>>> > With the check removed tree-ssa-loop-ivopts.h should no longer be
>>> >>>>>>> > necessary.
>>> >>>>>>> >
>>> >>>>>>> > +static void
>>> >>>>>>> > +vect_mask_load_store_stmt (gimple *stmt, tree vectype, tree mask,
>>> >>>>>>> > +                          data_reference *dr, gimple_stmt_iterator *si)
>>> >>>>>>> > +{
>>> >>>>>>> > ...
>>> >>>>>>> > +  addr = force_gimple_operand_gsi (&gsi, build_fold_addr_expr (mem),
>>> >>>>>>> > +                                  true, NULL_TREE, true,
>>> >>>>>>> > +                                  GSI_SAME_STMT);
>>> >>>>>>> > +
>>> >>>>>>> > +  align = TYPE_ALIGN_UNIT (vectype);
>>> >>>>>>> > +  if (aligned_access_p (dr))
>>> >>>>>>> > +    misalign = 0;
>>> >>>>>>> > +  else if (DR_MISALIGNMENT (dr) == -1)
>>> >>>>>>> > +    {
>>> >>>>>>> > +      align = TYPE_ALIGN_UNIT (elem_type);
>>> >>>>>>> > +      misalign = 0;
>>> >>>>>>> > +    }
>>> >>>>>>> > +  else
>>> >>>>>>> > +    misalign = DR_MISALIGNMENT (dr);
>>> >>>>>>> > +  set_ptr_info_alignment (get_ptr_info (addr), align, misalign);
>>> >>>>>>> > +  ptr = build_int_cst (reference_alias_ptr_type (mem),
>>> >>>>>>> > +                      misalign ? misalign & -misalign : align);
>>> >>>>>>> >
>>> >>>>>>> > you should simply use
>>> >>>>>>> >
>>> >>>>>>> >   align = get_object_alignment (mem) / BITS_PER_UNIT;
>>> >>>>>>> >
>>> >>>>>>> > here rather than trying to be clever.  Eventually you don't need
>>> >>>>>>> > the DR then (see question above).
>>> >>>>>>> >
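For readers unfamiliar with the `misalign & -misalign` idiom in the quoted hunk: it isolates the lowest set bit of the misalignment, which is the largest power-of-two alignment that can still be guaranteed for the address. A minimal sketch follows; `known_alignment` is a hypothetical helper written for illustration, not a GCC function.

```c
#include <assert.h>

/* Largest power-of-two alignment guaranteed for an address that lies
   `misalign` bytes past an `align`-byte boundary.  `align` must be a
   power of two and `misalign` must be smaller than `align`.  */
static unsigned known_alignment(unsigned align, unsigned misalign)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    assert(misalign < align);
    /* For unsigned operands, -misalign is the two's-complement negation,
       so the AND keeps only the lowest set bit of misalign.  */
    return misalign ? (misalign & -misalign) : align;
}
```

For example, an address 12 bytes past a 32-byte boundary is only guaranteed to be 4-byte aligned, while a misalignment of 0 keeps the full 32-byte alignment.
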
>>> >>>>>>> > +    }
>>> >>>>>>> > +  gsi_replace (si ? si : &gsi, new_stmt, false);
>>> >>>>>>> >
>>> >>>>>>> > when you replace the load/store please previously copy VUSE and VDEF
>>> >>>>>>> > from the original one (we were nearly clean enough to no longer
>>> >>>>>>> > require a virtual operand rewrite after vectorization...)  Thus
>>> >>>>>>> >
>>> >>>>>>> >   gimple_set_vuse (new_stmt, gimple_vuse (stmt));
>>> >>>>>>> >   gimple_set_vdef (new_stmt, gimple_vdef (stmt));
>>> >>>>>>> >
>>> >>>>>>> > +static void
>>> >>>>>>> > +vect_mask_loop (loop_vec_info loop_vinfo)
>>> >>>>>>> > +{
>>> >>>>>>> > ...
>>> >>>>>>> > +  /* Scan all loop statements to convert vector load/store including
>>> >>>>>>> > masked
>>> >>>>>>> > +     form.  */
>>> >>>>>>> > +  for (unsigned i = 0; i < loop->num_nodes; i++)
>>> >>>>>>> > +    {
>>> >>>>>>> > +      basic_block bb = bbs[i];
>>> >>>>>>> > +      for (gimple_stmt_iterator si = gsi_start_bb (bb);
>>> >>>>>>> > +          !gsi_end_p (si); gsi_next (&si))
>>> >>>>>>> > +       {
>>> >>>>>>> > +         gimple *stmt = gsi_stmt (si);
>>> >>>>>>> > +         stmt_vec_info stmt_info = NULL;
>>> >>>>>>> > +         tree vectype = NULL;
>>> >>>>>>> > +         data_reference *dr;
>>> >>>>>>> > +
>>> >>>>>>> > +         /* Mask load case.  */
>>> >>>>>>> > +         if (is_gimple_call (stmt)
>>> >>>>>>> > +             && gimple_call_internal_p (stmt)
>>> >>>>>>> > +             && gimple_call_internal_fn (stmt) == IFN_MASK_LOAD
>>> >>>>>>> > +             && !VECTOR_TYPE_P (TREE_TYPE (gimple_call_arg (stmt, 2))))
>>> >>>>>>> > +           {
>>> >>>>>>> > ...
>>> >>>>>>> > +             /* Skip invariant loads.  */
>>> >>>>>>> > +             if (integer_zerop (nested_in_vect_loop_p (loop, stmt)
>>> >>>>>>> > +                                ? STMT_VINFO_DR_STEP (stmt_info)
>>> >>>>>>> > +                                : DR_STEP (STMT_VINFO_DATA_REF
>>> >>>>>>> > (stmt_info))))
>>> >>>>>>> > +               continue;
>>> >>>>>>> >
>>> >>>>>>> > seeing this it would be nice if stmt_info had a flag for whether
>>> >>>>>>> > the stmt needs masking (and a flag on whether this is a scalar or a
>>> >>>>>>> > vectorized stmt).
>>> >>>>>>> >
>>> >>>>>>> > +         /* Skip hoisted out statements.  */
>>> >>>>>>> > +         if (!flow_bb_inside_loop_p (loop, gimple_bb (stmt)))
>>> >>>>>>> > +           continue;
>>> >>>>>>> >
>>> >>>>>>> > err, you walk stmts in the loop!  Isn't this covered by the above
>>> >>>>>>> > skipping of 'invariant loads'?
>>> >>>>>>> >
>>> >>>>>>> > +static gimple *
>>> >>>>>>> > +vect_mask_reduction_stmt (gimple *stmt, tree mask, gimple *prev)
>>> >>>>>>> > +{
>>> >>>>>>> >
>>> >>>>>>> > depending on the reduction operand there are variants that
>>> >>>>>>> > could get away w/o the VEC_COND_EXPR, like
>>> >>>>>>> >
>>> >>>>>>> >   S1': tem_4 = d_3 & MASK;
>>> >>>>>>> >   S2': r_1 = r_2 + tem_4;
>>> >>>>>>> >
>>> >>>>>>> > which works for plus at least.  More generally doing
>>> >>>>>>> >
>>> >>>>>>> >   S1': tem_4 = VEC_COND_EXPR<MASK, d_3, neutral operand>
>>> >>>>>>> >   S2': r_1 = r_2 OP tem_4;
>>> >>>>>>> >
>>> >>>>>>> > and leaving optimization to & to later opts (& won't work for
>>> >>>>>>> > AVX512 mask registers I guess).
>>> >>>>>>> >
>>> >>>>>>> > Good enough for later enhancement of course.
>>> >>>>>>> >
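The two masked-reduction variants discussed above can be compared with a scalar emulation. This is a sketch under assumptions: the function names and the 8-lane width are hypothetical, masks are modelled as all-ones/zero integers (the AVX2 style, not AVX512 mask registers), and the select form stands in for VEC_COND_EXPR with the operation's neutral element.

```c
#include <assert.h>
#include <stdint.h>

enum { VF = 8 };  /* assumption: eight 32-bit lanes */

/* General form: tem = mask ? d : neutral; r = r OP tem.
   For addition the neutral element is 0.  */
static int32_t reduce_add_select(const int32_t d[VF], const int32_t mask[VF])
{
    int32_t r = 0;
    for (int i = 0; i < VF; i++) {
        int32_t tem = mask[i] ? d[i] : 0;
        r += tem;
    }
    return r;
}

/* Plus-only shortcut: tem = d & mask.  It gives the same result because an
   all-ones mask keeps d and a zero mask yields the neutral element 0.  */
static int32_t reduce_add_and(const int32_t d[VF], const int32_t mask[VF])
{
    int32_t r = 0;
    for (int i = 0; i < VF; i++)
        r += d[i] & mask[i];
    return r;
}

/* For max the neutral element is INT32_MIN, so the AND shortcut does not
   apply and the select form is needed.  */
static int32_t reduce_max_select(const int32_t d[VF], const int32_t mask[VF])
{
    int32_t r = INT32_MIN;
    for (int i = 0; i < VF; i++) {
        int32_t tem = mask[i] ? d[i] : INT32_MIN;
        if (tem > r)
            r = tem;
    }
    return r;
}
```

With the last lane masked off, a large value stored there contributes to neither the sum nor the maximum.
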
>>> >>>>>>> > +static void
>>> >>>>>>> > +vect_gen_ivs_for_masking (loop_vec_info loop_vinfo, vec<tree> *ivs)
>>> >>>>>>> > +{
>>> >>>>>>> > ...
>>> >>>>>>> >
>>> >>>>>>> > isn't it enough to always create a single IV and derive the
>>> >>>>>>> > additional copies by IV + i * { elems, elems, elems ... }?
>>> >>>>>>> > IVs are expensive -- I'm sure we can optimize the rest of the
>>> >>>>>>> > scheme further as well but this one looks obvious to me.
>>> >>>>>>> >
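The suggestion above, a single IV from which the masks of the additional vector copies are derived by adding `i * { elems, elems, ... }`, can be illustrated with a scalar emulation. `ELEMS` and `mask_for_copy` are hypothetical names chosen for this sketch; the patch's actual mask generation is not shown in the thread and may differ.

```c
#include <assert.h>
#include <stdint.h>

enum { ELEMS = 4 };  /* assumption: four lanes per vector copy */

/* base_iv is the scalar index held in lane 0 of copy 0 for the current
   vector iteration; it is the only induction variable.  The index of
   lane `lane` in copy `copy` is base_iv + copy * ELEMS + lane, and the
   lane is active while that index is below niters.  */
static void mask_for_copy(int32_t mask[ELEMS], uint32_t base_iv,
                          unsigned copy, uint32_t niters)
{
    for (unsigned lane = 0; lane < ELEMS; lane++) {
        uint32_t idx = base_iv + copy * ELEMS + lane;
        mask[lane] = (idx < niters) ? -1 : 0;
    }
}
```

For `niters == 7` and `base_iv == 0`, copy 0 is fully active and copy 1 has only its last lane masked off, without a second IV being maintained.
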
>>> >>>>>>> > @@ -3225,12 +3508,32 @@ vect_estimate_min_profitable_iters (loop_vec_info
>>> >>>>>>> > loop_vinfo,
>>> >>>>>>> >    int npeel = LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo);
>>> >>>>>>> >    void *target_cost_data = LOOP_VINFO_TARGET_COST_DATA (loop_vinfo);
>>> >>>>>>> >
>>> >>>>>>> > +  if (LOOP_VINFO_EPILOGUE_P (loop_vinfo))
>>> >>>>>>> > +    {
>>> >>>>>>> > +      /* Currently we don't produce scalar epilogue version in case
>>> >>>>>>> > +        its masked version is provided.  It means we don't need to
>>> >>>>>>> > +        compute profitability one more time here.  Just make a
>>> >>>>>>> > +        masked loop version.  */
>>> >>>>>>> > +      if (LOOP_VINFO_CAN_BE_MASKED (loop_vinfo)
>>> >>>>>>> > +         && PARAM_VALUE (PARAM_VECT_EPILOGUES_MASK))
>>> >>>>>>> > +       {
>>> >>>>>>> > +         dump_printf_loc (MSG_NOTE, vect_location,
>>> >>>>>>> > +                          "cost model: mask loop epilogue.\n");
>>> >>>>>>> > +         LOOP_VINFO_MASK_LOOP (loop_vinfo) = true;
>>> >>>>>>> > +         *ret_min_profitable_niters = 0;
>>> >>>>>>> > +         *ret_min_profitable_estimate = 0;
>>> >>>>>>> > +         return;
>>> >>>>>>> > +       }
>>> >>>>>>> > +    }
>>> >>>>>>> >    /* Cost model disabled.  */
>>> >>>>>>> > -  if (unlimited_cost_model (LOOP_VINFO_LOOP (loop_vinfo)))
>>> >>>>>>> > +  else if (unlimited_cost_model (LOOP_VINFO_LOOP (loop_vinfo)))
>>> >>>>>>> >      {
>>> >>>>>>> >        dump_printf_loc (MSG_NOTE, vect_location, "cost model
>>> >>>>>>> > disabled.\n");
>>> >>>>>>> >        *ret_min_profitable_niters = 0;
>>> >>>>>>> >        *ret_min_profitable_estimate = 0;
>>> >>>>>>> > +      if (PARAM_VALUE (PARAM_VECT_EPILOGUES_MASK)
>>> >>>>>>> > +         && LOOP_VINFO_CAN_BE_MASKED (loop_vinfo))
>>> >>>>>>> > +       LOOP_VINFO_MASK_LOOP (loop_vinfo) = true;
>>> >>>>>>> >        return;
>>> >>>>>>> >      }
>>> >>>>>>> >
>>> >>>>>>> > the unlimited_cost_model case should come first?  OTOH masking or
>>> >>>>>>> > not is probably not sth covered by 'unlimited' - that is about
>>> >>>>>>> > vectorizing or not.  But the above code means that for
>>> >>>>>>> > epilogue vectorization w/o masking we ignore unlimited_cost_model ()?
>>> >>>>>>> > That doesn't make sense to me.
>>> >>>>>>> >
>>> >>>>>>> > Plus if this is short-trip or epilogue vectorization and the
>>> >>>>>>> > cost model is _not_ unlimited then we don't want to enable
>>> >>>>>>> > masking always (if it is possible).  It might be we statically
>>> >>>>>>> > know the epilogue executes for at most two iterations for example.
>>> >>>>>>> >
>>> >>>>>>> > I don't see _any_ cost model for vectorizing the epilogue with
>>> >>>>>>> > masking?  Am I missing something?  A "trivial" cost model
>>> >>>>>>> > should at least consider the additional IV(s), the mask
>>> >>>>>>> > compute and the widening and narrowing ops required.
>>> >>>>>>> >
>>> >>>>>>> > diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c
>>> >>>>>>> > index e13d6a2..36be342 100644
>>> >>>>>>> > --- a/gcc/tree-vect-loop-manip.c
>>> >>>>>>> > +++ b/gcc/tree-vect-loop-manip.c
>>> >>>>>>> > @@ -1635,6 +1635,13 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree
>>> >>>>>>> > niters, tree nitersm1,
>>> >>>>>>> >    bool epilog_peeling = (LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo)
>>> >>>>>>> >                          || LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo));
>>> >>>>>>> >
>>> >>>>>>> > +  if (LOOP_VINFO_EPILOGUE_P (loop_vinfo))
>>> >>>>>>> > +    {
>>> >>>>>>> > +      prolog_peeling = false;
>>> >>>>>>> > +      if (LOOP_VINFO_MASK_LOOP (loop_vinfo))
>>> >>>>>>> > +       epilog_peeling = false;
>>> >>>>>>> > +    }
>>> >>>>>>> > +
>>> >>>>>>> >    if (!prolog_peeling && !epilog_peeling)
>>> >>>>>>> >      return NULL;
>>> >>>>>>> >
>>> >>>>>>> > I think the prolog_peeling was fixed during the epilogue vectorization
>>> >>>>>>> > review and should no longer be necessary.  Please add
>>> >>>>>>> > a && ! LOOP_VINFO_MASK_LOOP () to the epilog_peeling init instead
>>> >>>>>>> > (it should also work for short-trip loop vectorization).
>>> >>>>>>> >
>>> >>>>>>> > @@ -2022,11 +2291,18 @@ start_over:
>>> >>>>>>> >        || (max_niter != -1
>>> >>>>>>> >           && (unsigned HOST_WIDE_INT) max_niter < vectorization_factor))
>>> >>>>>>> >      {
>>> >>>>>>> > -      if (dump_enabled_p ())
>>> >>>>>>> > -       dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
>>> >>>>>>> > -                        "not vectorized: iteration count smaller than "
>>> >>>>>>> > -                        "vectorization factor.\n");
>>> >>>>>>> > -      return false;
>>> >>>>>>> > +      /* Allow low trip count for loop epilogue we want to mask.  */
>>> >>>>>>> > +      if (LOOP_VINFO_EPILOGUE_P (loop_vinfo)
>>> >>>>>>> > +         && PARAM_VALUE (PARAM_VECT_EPILOGUES_MASK))
>>> >>>>>>> > +       LOOP_VINFO_MASK_LOOP (loop_vinfo) = true;
>>> >>>>>>> > +      else
>>> >>>>>>> > +       {
>>> >>>>>>> > +         if (dump_enabled_p ())
>>> >>>>>>> >
>>> >>>>>>> > so why do we test only LOOP_VINFO_EPILOGUE_P here?  All the code
>>> >>>>>>> > I saw so far would also work for the main loop (but the cost
>>> >>>>>>> > model is missing).
>>> >>>>>>> >
>>> >>>>>>> > I am missing testcases.  There's only a single one but we should
>>> >>>>>>> > have cases covering all kinds of mask IV widths and widen/shorten
>>> >>>>>>> > masks.
>>> >>>>>>> >
>>> >>>>>>> > Do you have any numbers on SPEC 2k6 with epilogue vect and/or masking
>>> >>>>>>> > enabled for an AVX2 machine?
>>> >>>>>>> >
>>> >>>>>>> > Oh, and I really dislike the --param paywall.
>>> >>>>>>> >
>>> >>>>>>> > Thanks,
>>> >>>>>>> > Richard.
>>> >>>>>>> >
>>> >>>>>>> >> Best regards.
>>> >>>>>>> >> Yuri.
>>> >>>>>>> >>
>>> >>>>>>> >>
>>> >>>>>>> >> 2016-11-28 17:39 GMT+03:00 Richard Biener <rguenther@suse.de>:
>>> >>>>>>> >> > On Thu, 24 Nov 2016, Yuri Rumyantsev wrote:
>>> >>>>>>> >> >
>>> >>>>>>> >> >> Hi All,
>>> >>>>>>> >> >>
>>> >>>>>>> >> >> Here is the second patch which supports epilogue vectorization using
>>> >>>>>>> >> >> masking without cost model. Currently it is possible
>>> >>>>>>> >> >> only with passing parameter "--param vect-epilogues-mask=1".
>>> >>>>>>> >> >>
>>> >>>>>>> >> >> Bootstrapping and regression testing did not show any new regression.
>>> >>>>>>> >> >>
>>> >>>>>>> >> >> Any comments will be appreciated.
>>> >>>>>>> >> >
>>> >>>>>>> >> > Going over the patch, the main question is how it works -- it looks
>>> >>>>>>> >> > like the decision whether to vectorize & mask the epilogue is made
>>> >>>>>>> >> > when vectorizing the loop that generates the epilogue rather than
>>> >>>>>>> >> > in the epilogue vectorization path?
>>> >>>>>>> >> >
>>> >>>>>>> >> > That is, I'd have expected to see this handling low-trip count loops
>>> >>>>>>> >> > by masking?  And thus masking the epilogue simply by it being
>>> >>>>>>> >> > low-trip count?
>>> >>>>>>> >> >
>>> >>>>>>> >> > Richard.
>>> >>>>>>> >> >
>>> >>>>>>> >> >> ChangeLog:
>>> >>>>>>> >> >> 2016-11-24  Yuri Rumyantsev  <ysrumyan@gmail.com>
>>> >>>>>>> >> >>
>>> >>>>>>> >> >> * params.def (PARAM_VECT_EPILOGUES_MASK): New.
>>> >>>>>>> >> >> * tree-vect-data-refs.c (vect_get_new_ssa_name): Support vect_mask_var.
>>> >>>>>>> >> >> * tree-vect-loop.c: Include insn-config.h, recog.h and alias.h.
>>> >>>>>>> >> >> (new_loop_vec_info): Add zeroing can_be_masked, mask_loop and
>>> >>>>>>> >> >> required_mask fields.
>>> >>>>>>> >> >> (vect_check_required_masks_widening): New.
>>> >>>>>>> >> >> (vect_check_required_masks_narrowing): New.
>>> >>>>>>> >> >> (vect_get_masking_iv_elems): New.
>>> >>>>>>> >> >> (vect_get_masking_iv_type): New.
>>> >>>>>>> >> >> (vect_get_extreme_masks): New.
>>> >>>>>>> >> >> (vect_check_required_masks): New.
>>> >>>>>>> >> >> (vect_analyze_loop_operations): Call vect_check_required_masks if all
>>> >>>>>>> >> >> statements can be masked.
>>> >>>>>>> >> >> (vect_analyze_loop_2): Initialize to zero min_scalar_loop_bound.
>>> >>>>>>> >> >> Add check that epilogue can be masked with the same vf with issue
>>> >>>>>>> >> >> fail notes.  Allow epilogue vectorization through masking of low trip
>>> >>>>>>> >> >> loops. Set to true can_be_masked field before loop operation analysis.
>>> >>>>>>> >> >> Do not set-up min_scalar_loop_bound for epilogue vectorization through
>>> >>>>>>> >> >> masking.  Do not peel for epilogue masking.  Reset can_be_masked
>>> >>>>>>> >> >> field before repeat analysis.
>>> >>>>>>> >> >> (vect_estimate_min_profitable_iters): Do not compute profitability
>>> >>>>>>> >> >> for epilogue masking.  Set up mask_loop field to true if parameter
>>> >>>>>>> >> >> PARAM_VECT_EPILOGUES_MASK is non-zero.
>>> >>>>>>> >> >> (vectorizable_reduction): Add check that statement can be masked.
>>> >>>>>>> >> >> (vectorizable_induction): Do not support masking for induction.
>>> >>>>>>> >> >> (vect_gen_ivs_for_masking): New.
>>> >>>>>>> >> >> (vect_get_mask_index_for_elems): New.
>>> >>>>>>> >> >> (vect_get_mask_index_for_type): New.
>>> >>>>>>> >> >> (vect_create_narrowed_masks): New.
>>> >>>>>>> >> >> (vect_create_widened_masks): New.
>>> >>>>>>> >> >> (vect_gen_loop_masks): New.
>>> >>>>>>> >> >> (vect_mask_reduction_stmt): New.
>>> >>>>>>> >> >> (vect_mask_mask_load_store_stmt): New.
>>> >>>>>>> >> >> (vect_mask_load_store_stmt): New.
>>> >>>>>>> >> >> (vect_mask_loop): New.
>>> >>>>>>> >> >> (vect_transform_loop): Invoke vect_mask_loop if required.
>>> >>>>>>> >> >> Use div_ceil to recompute upper bounds for masked loops.  Issue
>>> >>>>>>> >> >> statistics for epilogue vectorization through masking. Do not reduce
>>> >>>>>>> >> >> vf for masking epilogue.
>>> >>>>>>> >> >> * tree-vect-stmts.c: Include tree-ssa-loop-ivopts.h.
>>> >>>>>>> >> >> (can_mask_load_store): New.
>>> >>>>>>> >> >> (vectorizable_mask_load_store): Check that mask conjunction is
>>> >>>>>>> >> >> supported.  Set-up first_copy_p field of stmt_vinfo.
>>> >>>>>>> >> >> (vectorizable_simd_clone_call): Check that simd clone can not be
>>> >>>>>>> >> >> masked.
>>> >>>>>>> >> >> (vectorizable_store): Check that store can be masked. Mark the first
>>> >>>>>>> >> >> copy of generated vector stores and provide it with vectype and the
>>> >>>>>>> >> >> original data reference.
>>> >>>>>>> >> >> (vectorizable_load): Check that load can be masked.
>>> >>>>>>> >> >> (vect_stmt_should_be_masked_for_epilogue): New.
>>> >>>>>>> >> >> (vect_add_required_mask_for_stmt): New.
>>> >>>>>>> >> >> (vect_analyze_stmt): Add check on unsupported statements for masking
>>> >>>>>>> >> >> with printing message.
>>> >>>>>>> >> >> * tree-vectorizer.h (struct _loop_vec_info): Add new fields
>>> >>>>>>> >> >> can_be_masked, required_masks, mask_loop.
>>> >>>>>>> >> >> (LOOP_VINFO_CAN_BE_MASKED): New.
>>> >>>>>>> >> >> (LOOP_VINFO_REQUIRED_MASKS): New.
>>> >>>>>>> >> >> (LOOP_VINFO_MASK_LOOP): New.
>>> >>>>>>> >> >> (struct _stmt_vec_info): Add first_copy_p field.
>>> >>>>>>> >> >> (STMT_VINFO_FIRST_COPY_P): New.
>>> >>>>>>> >> >>
>>> >>>>>>> >> >> gcc/testsuite/
>>> >>>>>>> >> >>
>>> >>>>>>> >> >> * gcc.dg/vect/vect-tail-mask-1.c: New test.
>>> >>>>>>> >> >>
>>> >>>>>>> >> >> 2016-11-18 18:54 GMT+03:00 Christophe Lyon <christophe.lyon@linaro.org>:
>>> >>>>>>> >> >> > On 18 November 2016 at 16:46, Yuri Rumyantsev <ysrumyan@gmail.com> wrote:
>>> >>>>>>> >> >> >> It is very strange that this test failed on arm, since it requires
>>> >>>>>>> >> >> >> target avx2 to check vectorizer dumps:
>>> >>>>>>> >> >> >>
>>> >>>>>>> >> >> >> /* { dg-final { scan-tree-dump-times "LOOP VECTORIZED" 2 "vect" {
>>> >>>>>>> >> >> >> target avx2_runtime } } } */
>>> >>>>>>> >> >> >> /* { dg-final { scan-tree-dump-times "LOOP EPILOGUE VECTORIZED
>>> >>>>>>> >> >> >> \\(VS=16\\)" 2 "vect" { target avx2_runtime } } } */
>>> >>>>>>> >> >> >>
>>> >>>>>>> >> >> >> Could you please clarify what is the reason of the failure?
>>> >>>>>>> >> >> >
>>> >>>>>>> >> >> > It's not the scan-dumps that fail, but the execution.
>>> >>>>>>> >> >> > The test calls abort() for some reason.
>>> >>>>>>> >> >> >
>>> >>>>>>> >> >> > It will take me a while to rebuild the test manually in the right
>>> >>>>>>> >> >> > debug environment to provide you with more traces.
>>> >>>>>>> >> >> >
>>> >>>>>>> >> >> >
>>> >>>>>>> >> >> >
>>> >>>>>>> >> >> >>
>>> >>>>>>> >> >> >> Thanks.
>>> >>>>>>> >> >> >>
>>> >>>>>>> >> >> >> 2016-11-18 16:20 GMT+03:00 Christophe Lyon <christophe.lyon@linaro.org>:
>>> >>>>>>> >> >> >>> On 15 November 2016 at 15:41, Yuri Rumyantsev <ysrumyan@gmail.com> wrote:
>>> >>>>>>> >> >> >>>> Hi All,
>>> >>>>>>> >> >> >>>>
>>> >>>>>>> >> >> >>>> Here is the patch for non-masked epilogue vectorization.
>>> >>>>>>> >> >> >>>>
>>> >>>>>>> >> >> >>>> Bootstrap and regression testing did not show any new failures.
>>> >>>>>>> >> >> >>>>
>>> >>>>>>> >> >> >>>> Is it OK for trunk?
>>> >>>>>>> >> >> >>>>
>>> >>>>>>> >> >> >>>> Thanks.
>>> >>>>>>> >> >> >>>> Changelog:
>>> >>>>>>> >> >> >>>>
>>> >>>>>>> >> >> >>>> 2016-11-15  Yuri Rumyantsev  <ysrumyan@gmail.com>
>>> >>>>>>> >> >> >>>>
>>> >>>>>>> >> >> >>>> * params.def (PARAM_VECT_EPILOGUES_NOMASK): New.
>>> >>>>>>> >> >> >>>> * tree-if-conv.c (tree_if_conversion): Make public.
>>> >>>>>>> >> >> >>>> * tree-if-conv.h: New file.
>>> >>>>>>> >> >> >>>> * tree-vect-data-refs.c (vect_analyze_data_ref_dependences): Avoid
>>> >>>>>>> >> >> >>>> dynamic alias checks for epilogues.
>>> >>>>>>> >> >> >>>> * tree-vect-loop-manip.c (vect_do_peeling): Return created epilog.
>>> >>>>>>> >> >> >>>> * tree-vect-loop.c: Include tree-if-conv.h.
>>> >>>>>>> >> >> >>>> (new_loop_vec_info): Add zeroing orig_loop_info field.
>>> >>>>>>> >> >> >>>> (vect_analyze_loop_2): Don't try to enhance alignment for epilogues.
>>> >>>>>>> >> >> >>>> (vect_analyze_loop): Add argument ORIG_LOOP_INFO which is not NULL
>>> >>>>>>> >> >> >>>> if epilogue is vectorized, set up orig_loop_info field of loop_vinfo
>>> >>>>>>> >> >> >>>> using passed argument.
>>> >>>>>>> >> >> >>>> (vect_transform_loop): Check if created epilogue should be returned
>>> >>>>>>> >> >> >>>> for further vectorization with less vf.  If-convert epilogue if
>>> >>>>>>> >> >> >>>> required. Print vectorization success for epilogue.
>>> >>>>>>> >> >> >>>> * tree-vectorizer.c (vectorize_loops): Add epilogue vectorization
>>> >>>>>>> >> >> >>>> if it is required, pass loop_vinfo produced during vectorization of
>>> >>>>>>> >> >> >>>> loop body to vect_analyze_loop.
>>> >>>>>>> >> >> >>>> * tree-vectorizer.h (struct _loop_vec_info): Add new field
>>> >>>>>>> >> >> >>>> orig_loop_info.
>>> >>>>>>> >> >> >>>> (LOOP_VINFO_ORIG_LOOP_INFO): New.
>>> >>>>>>> >> >> >>>> (LOOP_VINFO_EPILOGUE_P): New.
>>> >>>>>>> >> >> >>>> (LOOP_VINFO_ORIG_VECT_FACTOR): New.
>>> >>>>>>> >> >> >>>> (vect_do_peeling): Change prototype to return epilogue.
>>> >>>>>>> >> >> >>>> (vect_analyze_loop): Add argument of loop_vec_info type.
>>> >>>>>>> >> >> >>>> (vect_transform_loop): Return created loop.
>>> >>>>>>> >> >> >>>>
>>> >>>>>>> >> >> >>>> gcc/testsuite/
>>> >>>>>>> >> >> >>>>
>>> >>>>>>> >> >> >>>> * lib/target-supports.exp (check_avx2_hw_available): New.
>>> >>>>>>> >> >> >>>> (check_effective_target_avx2_runtime): New.
>>> >>>>>>> >> >> >>>> * gcc.dg/vect/vect-tail-nomask-1.c: New test.
>>> >>>>>>> >> >> >>>>
>>> >>>>>>> >> >> >>>
>>> >>>>>>> >> >> >>> Hi,
>>> >>>>>>> >> >> >>>
>>> >>>>>>> >> >> >>> This new test fails on arm-none-eabi (using default cpu/fpu/mode):
>>> >>>>>>> >> >> >>>   gcc.dg/vect/vect-tail-nomask-1.c -flto -ffat-lto-objects execution test
>>> >>>>>>> >> >> >>>   gcc.dg/vect/vect-tail-nomask-1.c execution test
>>> >>>>>>> >> >> >>>
>>> >>>>>>> >> >> >>> It does pass on the same target if configured --with-cpu=cortex-a9.
>>> >>>>>>> >> >> >>>
>>> >>>>>>> >> >> >>> Christophe
>>> >>>>>>> >> >> >>>
>>> >>>>>>> >> >> >>>
>>> >>>>>>> >> >> >>>
>>> >>>>>>> >> >> >>>>
>>> >>>>>>> >> >> >>>> 2016-11-14 20:04 GMT+03:00 Richard Biener <rguenther@suse.de>:
>>> >>>>>>> >> >> >>>>> On November 14, 2016 4:39:40 PM GMT+01:00, Yuri Rumyantsev <ysrumyan@gmail.com> wrote:
>>> >>>>>>> >> >> >>>>>>Richard,
>>> >>>>>>> >> >> >>>>>>
>>> >>>>>>> >> >> >>>>>>I checked one of the tests designed for epilogue vectorization using
>>> >>>>>>> >> >> >>>>>>patches 1 - 3 and found out that build compiler performs vectorization
>>> >>>>>>> >> >> >>>>>>of epilogues with --param vect-epilogues-nomask=1 passed:
>>> >>>>>>> >> >> >>>>>>
>>> >>>>>>> >> >> >>>>>>$ gcc -Ofast -mavx2 t1.c -S --param vect-epilogues-nomask=1 -o
>>> >>>>>>> >> >> >>>>>>t1.new-nomask.s -fdump-tree-vect-details
>>> >>>>>>> >> >> >>>>>>$ grep VECTORIZED -c t1.c.156t.vect
>>> >>>>>>> >> >> >>>>>>4
>>> >>>>>>> >> >> >>>>>> Without param only 2 loops are vectorized.
>>> >>>>>>> >> >> >>>>>>
>>> >>>>>>> >> >> >>>>>>Should I simply add a part of tests related to this feature or I must
>>> >>>>>>> >> >> >>>>>>delete all not necessary changes also?
>>> >>>>>>> >> >> >>>>>
>>> >>>>>>> >> >> >>>>> Please remove all not necessary changes.
>>> >>>>>>> >> >> >>>>>
>>> >>>>>>> >> >> >>>>> Richard.
>>> >>>>>>> >> >> >>>>>
>>> >>>>>>> >> >> >>>>>>Thanks.
>>> >>>>>>> >> >> >>>>>>Yuri.
>>> >>>>>>> >> >> >>>>>>
>>> >>>>>>> >> >> >>>>>>2016-11-14 16:40 GMT+03:00 Richard Biener <rguenther@suse.de>:
>>> >>>>>>> >> >> >>>>>>> On Mon, 14 Nov 2016, Yuri Rumyantsev wrote:
>>> >>>>>>> >> >> >>>>>>>
>>> >>>>>>> >> >> >>>>>>>> Richard,
>>> >>>>>>> >> >> >>>>>>>>
>>> >>>>>>> >> >> >>>>>>>> In my previous patch I forgot to remove couple lines related to aux
>>> >>>>>>> >> >> >>>>>>field.
>>> >>>>>>> >> >> >>>>>>>> Here is the correct updated patch.
>>> >>>>>>> >> >> >>>>>>>
>>> >>>>>>> >> >> >>>>>>> Yeah, I noticed.  This patch would be ok for trunk (together with
>>> >>>>>>> >> >> >>>>>>> necessary parts from 1 and 2) if all not required parts are removed
>>> >>>>>>> >> >> >>>>>>> (and you'd add the testcases covering non-masked tail vect).
>>> >>>>>>> >> >> >>>>>>>
>>> >>>>>>> >> >> >>>>>>> Thus, can you please produce a single complete patch containing only
>>> >>>>>>> >> >> >>>>>>> non-masked epilogue vectorization?
>>> >>>>>>> >> >> >>>>>>>
>>> >>>>>>> >> >> >>>>>>> Thanks,
>>> >>>>>>> >> >> >>>>>>> Richard.
>>> >>>>>>> >> >> >>>>>>>
>>> >>>>>>> >> >> >>>>>>>> Thanks.
>>> >>>>>>> >> >> >>>>>>>> Yuri.
>>> >>>>>>> >> >> >>>>>>>>
>>> >>>>>>> >> >> >>>>>>>> 2016-11-14 15:51 GMT+03:00 Richard Biener <rguenther@suse.de>:
>>> >>>>>>> >> >> >>>>>>>> > On Fri, 11 Nov 2016, Yuri Rumyantsev wrote:
>>> >>>>>>> >> >> >>>>>>>> >
>>> >>>>>>> >> >> >>>>>>>> >> Richard,
>>> >>>>>>> >> >> >>>>>>>> >>
>>> >>>>>>> >> >> >>>>>>>> >> I prepare updated 3 patch with passing additional argument to
>>> >>>>>>> >> >> >>>>>>>> >> vect_analyze_loop as you proposed (untested).
>>> >>>>>>> >> >> >>>>>>>> >>
>>> >>>>>>> >> >> >>>>>>>> >> You wrote:
>>> >>>>>>> >> >> >>>>>>>> >> Btw, I wonder if you can produce a single patch containing just
>>> >>>>>>> >> >> >>>>>>>> >> epilogue vectorization, that is combine patches 1-3 but rip out
>>> >>>>>>> >> >> >>>>>>>> >> changes only needed by later patches?
>>> >>>>>>> >> >> >>>>>>>> >>
>>> >>>>>>> >> >> >>>>>>>> >> Did you mean that I exclude all support for vectorization
>>> >>>>>>> >> >> >>>>>>epilogues,
>>> >>>>>>> >> >> >>>>>>>> >> i.e. exclude from 2-nd patch all non-related changes
>>> >>>>>>> >> >> >>>>>>>> >> like
>>> >>>>>>> >> >> >>>>>>>> >>
>>> >>>>>>> >> >> >>>>>>>> >> diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
>>> >>>>>>> >> >> >>>>>>>> >> index 11863af..32011c1 100644
>>> >>>>>>> >> >> >>>>>>>> >> --- a/gcc/tree-vect-loop.c
>>> >>>>>>> >> >> >>>>>>>> >> +++ b/gcc/tree-vect-loop.c
>>> >>>>>>> >> >> >>>>>>>> >> @@ -1120,6 +1120,12 @@ new_loop_vec_info (struct loop *loop)
>>> >>>>>>> >> >> >>>>>>>> >>    LOOP_VINFO_PEELING_FOR_GAPS (res) = false;
>>> >>>>>>> >> >> >>>>>>>> >>    LOOP_VINFO_PEELING_FOR_NITER (res) = false;
>>> >>>>>>> >> >> >>>>>>>> >>    LOOP_VINFO_OPERANDS_SWAPPED (res) = false;
>>> >>>>>>> >> >> >>>>>>>> >> +  LOOP_VINFO_CAN_BE_MASKED (res) = false;
>>> >>>>>>> >> >> >>>>>>>> >> +  LOOP_VINFO_REQUIRED_MASKS (res) = 0;
>>> >>>>>>> >> >> >>>>>>>> >> +  LOOP_VINFO_COMBINE_EPILOGUE (res) = false;
>>> >>>>>>> >> >> >>>>>>>> >> +  LOOP_VINFO_MASK_EPILOGUE (res) = false;
>>> >>>>>>> >> >> >>>>>>>> >> +  LOOP_VINFO_NEED_MASKING (res) = false;
>>> >>>>>>> >> >> >>>>>>>> >> +  LOOP_VINFO_ORIG_LOOP_INFO (res) = NULL;
>>> >>>>>>> >> >> >>>>>>>> >
>>> >>>>>>> >> >> >>>>>>>> > Yes.
>>> >>>>>>> >> >> >>>>>>>> >
>>> >>>>>>> >> >> >>>>>>>> >> Did you mean also that new combined patch must be working patch,
>>> >>>>>>> >> >> >>>>>>i.e.
>>> >>>>>>> >> >> >>>>>>>> >> can be integrated without other patches?
>>> >>>>>>> >> >> >>>>>>>> >
>>> >>>>>>> >> >> >>>>>>>> > Yes.
>>> >>>>>>> >> >> >>>>>>>> >
>>> >>>>>>> >> >> >>>>>>>> >> Could you please look at updated patch?
>>> >>>>>>> >> >> >>>>>>>> >
>>> >>>>>>> >> >> >>>>>>>> > Will do.
>>> >>>>>>> >> >> >>>>>>>> >
>>> >>>>>>> >> >> >>>>>>>> > Thanks,
>>> >>>>>>> >> >> >>>>>>>> > Richard.
>>> >>>>>>> >> >> >>>>>>>> >
>>> >>>>>>> >> >> >>>>>>>> >> Thanks.
>>> >>>>>>> >> >> >>>>>>>> >> Yuri.
>>> >>>>>>> >> >> >>>>>>>> >>
>>> >>>>>>> >> >> >>>>>>>> >> 2016-11-10 15:36 GMT+03:00 Richard Biener <rguenther@suse.de>:
>>> >>>>>>> >> >> >>>>>>>> >> > On Thu, 10 Nov 2016, Richard Biener wrote:
>>> >>>>>>> >> >> >>>>>>>> >> >
>>> >>>>>>> >> >> >>>>>>>> >> >> On Tue, 8 Nov 2016, Yuri Rumyantsev wrote:
>>> >>>>>>> >> >> >>>>>>>> >> >>
>>> >>>>>>> >> >> >>>>>>>> >> >> > Richard,
>>> >>>>>>> >> >> >>>>>>>> >> >> >
>>> >>>>>>> >> >> >>>>>>>> >> >> > Here is updated 3 patch.
>>> >>>>>>> >> >> >>>>>>>> >> >> >
>>> >>>>>>> >> >> >>>>>>>> >> >> > I checked that all new tests related to epilogue
>>> >>>>>>> >> >> >>>>>>vectorization passed with it.
>>> >>>>>>> >> >> >>>>>>>> >> >> >
>>> >>>>>>> >> >> >>>>>>>> >> >> > Your comments will be appreciated.
>>> >>>>>>> >> >> >>>>>>>> >> >>
>>> >>>>>>> >> >> >>>>>>>> >> >> A lot better now.  Instead of the ->aux dance I now prefer to
>>> >>>>>>> >> >> >>>>>>>> >> >> pass the original loops loop_vinfo to vect_analyze_loop as
>>> >>>>>>> >> >> >>>>>>>> >> >> optional argument (if non-NULL we analyze the epilogue of that
>>> >>>>>>> >> >> >>>>>>>> >> >> loop_vinfo).  OTOH I remember we mainly use it to get at the
>>> >>>>>>> >> >> >>>>>>>> >> >> original vectorization factor?  So we can pass down an
>>> >>>>>>> >> >> >>>>>>(optional)
>>> >>>>>>> >> >> >>>>>>>> >> >> forced vectorization factor as well?
>>> >>>>>>> >> >> >>>>>>>> >> >
>>> >>>>>>> >> >> >>>>>>>> >> > Btw, I wonder if you can produce a single patch containing just
>>> >>>>>>> >> >> >>>>>>>> >> > epilogue vectorization, that is combine patches 1-3 but rip out
>>> >>>>>>> >> >> >>>>>>>> >> > changes only needed by later patches?
>>> >>>>>>> >> >> >>>>>>>> >> >
>>> >>>>>>> >> >> >>>>>>>> >> > Thanks,
>>> >>>>>>> >> >> >>>>>>>> >> > Richard.
>>> >>>>>>> >> >> >>>>>>>> >> >
>>> >>>>>>> >> >> >>>>>>>> >> >> Richard.
>>> >>>>>>> >> >> >>>>>>>> >> >>
>>> >>>>>>> >> >> >>>>>>>> >> >> > 2016-11-08 15:38 GMT+03:00 Richard Biener
>>> >>>>>>> >> >> >>>>>><rguenther@suse.de>:
>>> >>>>>>> >> >> >>>>>>>> >> >> > > On Thu, 3 Nov 2016, Yuri Rumyantsev wrote:
>>> >>>>>>> >> >> >>>>>>>> >> >> > >
>>> >>>>>>> >> >> >>>>>>>> >> >> > >> Hi Richard,
>>> >>>>>>> >> >> >>>>>>>> >> >> > >>
>>> >>>>>>> >> >> >>>>>>>> >> >> > >> I did not understand your last remark:
>>> >>>>>>> >> >> >>>>>>>> >> >> > >>
>>> >>>>>>> >> >> >>>>>>>> >> >> > >> > That is, here (and avoid the FOR_EACH_LOOP change):
>>> >>>>>>> >> >> >>>>>>>> >> >> > >> >
>>> >>>>>>> >> >> >>>>>>>> >> >> > >> > @@ -580,12 +586,21 @@ vectorize_loops (void)
>>> >>>>>>> >> >> >>>>>>>> >> >> > >> >           && dump_enabled_p ())
>>> >>>>>>> >> >> >>>>>>>> >> >> > >> >           dump_printf_loc (MSG_OPTIMIZED_LOCATIONS,
>>> >>>>>>> >> >> >>>>>>vect_location,
>>> >>>>>>> >> >> >>>>>>>> >> >> > >> >                            "loop vectorized\n");
>>> >>>>>>> >> >> >>>>>>>> >> >> > >> > -       vect_transform_loop (loop_vinfo);
>>> >>>>>>> >> >> >>>>>>>> >> >> > >> > +       new_loop = vect_transform_loop (loop_vinfo);
>>> >>>>>>> >> >> >>>>>>>> >> >> > >> >         num_vectorized_loops++;
>>> >>>>>>> >> >> >>>>>>>> >> >> > >> >        /* Now that the loop has been vectorized, allow
>>> >>>>>>> >> >> >>>>>>it to be unrolled
>>> >>>>>>> >> >> >>>>>>>> >> >> > >> >           etc.  */
>>> >>>>>>> >> >> >>>>>>>> >> >> > >> >      loop->force_vectorize = false;
>>> >>>>>>> >> >> >>>>>>>> >> >> > >> >
>>> >>>>>>> >> >> >>>>>>>> >> >> > >> > +       /* Add new loop to a processing queue.  To make
>>> >>>>>>> >> >> >>>>>>it easier
>>> >>>>>>> >> >> >>>>>>>> >> >> > >> > +          to match loop and its epilogue vectorization
>>> >>>>>>> >> >> >>>>>>in dumps
>>> >>>>>>> >> >> >>>>>>>> >> >> > >> > +          put new loop as the next loop to process.
>>> >>>>>>> >> >> >>>>>>*/
>>> >>>>>>> >> >> >>>>>>>> >> >> > >> > +       if (new_loop)
>>> >>>>>>> >> >> >>>>>>>> >> >> > >> > +         {
>>> >>>>>>> >> >> >>>>>>>> >> >> > >> > +           loops.safe_insert (i + 1, new_loop->num);
>>> >>>>>>> >> >> >>>>>>>> >> >> > >> > +           vect_loops_num = number_of_loops (cfun);
>>> >>>>>>> >> >> >>>>>>>> >> >> > >> > +         }
>>> >>>>>>> >> >> >>>>>>>> >> >> > >> >
>>> >>>>>>> >> >> >>>>>>>> >> >> > >> > simply dispatch to a vectorize_epilogue (loop_vinfo,
>>> >>>>>>> >> >> >>>>>>new_loop)
>>> >>>>>>> >> >> >>>>>>>> >> >> > >> > function which will set up stuff properly (and also
>>> >>>>>>> >> >> >>>>>>perform
>>> >>>>>>> >> >> >>>>>>>> >> >> > >> > the if-conversion of the epilogue there).
>>> >>>>>>> >> >> >>>>>>>> >> >> > >> >
>>> >>>>>>> >> >> >>>>>>>> >> >> > >> > That said, if we can get in non-masked epilogue
>>> >>>>>>> >> >> >>>>>>vectorization
>>> >>>>>>> >> >> >>>>>>>> >> >> > >> > separately that would be great.
>>> >>>>>>> >> >> >>>>>>>> >> >> > >>
>>> >>>>>>> >> >> >>>>>>>> >> >> > >> Could you please clarify your proposal.
>>> >>>>>>> >> >> >>>>>>>> >> >> > >
>>> >>>>>>> >> >> >>>>>>>> >> >> > > When a loop was vectorized set things up to immediately
>>> >>>>>>> >> >> >>>>>>vectorize
>>> >>>>>>> >> >> >>>>>>>> >> >> > > its epilogue, avoiding changing the loop iteration and
>>> >>>>>>> >> >> >>>>>>avoiding
>>> >>>>>>> >> >> >>>>>>>> >> >> > > the re-use of ->aux.
>>> >>>>>>> >> >> >>>>>>>> >> >> > >
>>> >>>>>>> >> >> >>>>>>>> >> >> > > Richard.
>>> >>>>>>> >> >> >>>>>>>> >> >> > >
>>> >>>>>>> >> >> >>>>>>>> >> >> > >> Thanks.
>>> >>>>>>> >> >> >>>>>>>> >> >> > >> Yuri.
>>> >>>>>>> >> >> >>>>>>>> >> >> > >>
>>> >>>>>>> >> >> >>>>>>>> >> >> > >> 2016-11-02 15:27 GMT+03:00 Richard Biener
>>> >>>>>>> >> >> >>>>>><rguenther@suse.de>:
>>> >>>>>>> >> >> >>>>>>>> >> >> > >> > On Tue, 1 Nov 2016, Yuri Rumyantsev wrote:
>>> >>>>>>> >> >> >>>>>>>> >> >> > >> >
>>> >>>>>>> >> >> >>>>>>>> >> >> > >> >> Hi All,
>>> >>>>>>> >> >> >>>>>>>> >> >> > >> >>
>>> >>>>>>> >> >> >>>>>>>> >> >> > >> >> I re-send all patches sent by Ilya earlier for review
>>> >>>>>>> >> >> >>>>>>which support
>>> >>>>>>> >> >> >>>>>>>> >> >> > >> >> vectorization of loop epilogues and loops with low
>>> >>>>>>> >> >> >>>>>>trip count. We
>>> >>>>>>> >> >> >>>>>>>> >> >> > >> >> assume that the only patch -
>>> >>>>>>> >> >> >>>>>>vec-tails-07-combine-tail.patch - was not
>>> >>>>>>> >> >> >>>>>>>> >> >> > >> >> approved by Jeff.
>>> >>>>>>> >> >> >>>>>>>> >> >> > >> >>
>>> >>>>>>> >> >> >>>>>>>> >> >> > >> >> I did re-base of all patches and performed
>>> >>>>>>> >> >> >>>>>>bootstrapping and
>>> >>>>>>> >> >> >>>>>>>> >> >> > >> >> regression testing that did not show any new failures.
>>> >>>>>>> >> >> >>>>>>Also all
>>> >>>>>>> >> >> >>>>>>>> >> >> > >> >> changes related to new vect_do_peeling algorithm have
>>> >>>>>>> >> >> >>>>>>been changed
>>> >>>>>>> >> >> >>>>>>>> >> >> > >> >> accordingly.
>>> >>>>>>> >> >> >>>>>>>> >> >> > >> >>
>>> >>>>>>> >> >> >>>>>>>> >> >> > >> >> Is it OK for trunk?
>>> >>>>>>> >> >> >>>>>>>> >> >> > >> >
>>> >>>>>>> >> >> >>>>>>>> >> >> > >> > I would have preferred that the series up to
>>> >>>>>>> >> >> >>>>>>-03-nomask-tails would
>>> >>>>>>> >> >> >>>>>>>> >> >> > >> > _only_ contain epilogue loop vectorization changes but
>>> >>>>>>> >> >> >>>>>>unfortunately
>>> >>>>>>> >> >> >>>>>>>> >> >> > >> > the patchset is oddly separated.
>>> >>>>>>> >> >> >>>>>>>> >> >> > >> >
>>> >>>>>>> >> >> >>>>>>>> >> >> > >> > I have a comment on that part nevertheless:
>>> >>>>>>> >> >> >>>>>>>> >> >> > >> >
>>> >>>>>>> >> >> >>>>>>>> >> >> > >> > @@ -1608,7 +1614,10 @@ vect_enhance_data_refs_alignment
>>> >>>>>>> >> >> >>>>>>(loop_vec_info
>>> >>>>>>> >> >> >>>>>>>> >> >> > >> > loop_vinfo)
>>> >>>>>>> >> >> >>>>>>>> >> >> > >> >    /* Check if we can possibly peel the loop.  */
>>> >>>>>>> >> >> >>>>>>>> >> >> > >> >    if (!vect_can_advance_ivs_p (loop_vinfo)
>>> >>>>>>> >> >> >>>>>>>> >> >> > >> >        || !slpeel_can_duplicate_loop_p (loop,
>>> >>>>>>> >> >> >>>>>>single_exit (loop))
>>> >>>>>>> >> >> >>>>>>>> >> >> > >> > -      || loop->inner)
>>> >>>>>>> >> >> >>>>>>>> >> >> > >> > +      || loop->inner
>>> >>>>>>> >> >> >>>>>>>> >> >> > >> > +      /* Required peeling was performed in prologue
>>> >>>>>>> >> >> >>>>>>and
>>> >>>>>>> >> >> >>>>>>>> >> >> > >> > +        is not required for epilogue.  */
>>> >>>>>>> >> >> >>>>>>>> >> >> > >> > +      || LOOP_VINFO_EPILOGUE_P (loop_vinfo))
>>> >>>>>>> >> >> >>>>>>>> >> >> > >> >      do_peeling = false;
>>> >>>>>>> >> >> >>>>>>>> >> >> > >> >
>>> >>>>>>> >> >> >>>>>>>> >> >> > >> >    if (do_peeling
>>> >>>>>>> >> >> >>>>>>>> >> >> > >> > @@ -1888,7 +1897,10 @@ vect_enhance_data_refs_alignment
>>> >>>>>>> >> >> >>>>>>(loop_vec_info
>>> >>>>>>> >> >> >>>>>>>> >> >> > >> > loop_vinfo)
>>> >>>>>>> >> >> >>>>>>>> >> >> > >> >
>>> >>>>>>> >> >> >>>>>>>> >> >> > >> >    do_versioning =
>>> >>>>>>> >> >> >>>>>>>> >> >> > >> >         optimize_loop_nest_for_speed_p (loop)
>>> >>>>>>> >> >> >>>>>>>> >> >> > >> > -       && (!loop->inner); /* FORNOW */
>>> >>>>>>> >> >> >>>>>>>> >> >> > >> > +       && (!loop->inner) /* FORNOW */
>>> >>>>>>> >> >> >>>>>>>> >> >> > >> > +        /* Required versioning was performed for the
>>> >>>>>>> >> >> >>>>>>>> >> >> > >> > +          original loop and is not required for
>>> >>>>>>> >> >> >>>>>>epilogue.  */
>>> >>>>>>> >> >> >>>>>>>> >> >> > >> > +       && !LOOP_VINFO_EPILOGUE_P (loop_vinfo);
>>> >>>>>>> >> >> >>>>>>>> >> >> > >> >
>>> >>>>>>> >> >> >>>>>>>> >> >> > >> >    if (do_versioning)
>>> >>>>>>> >> >> >>>>>>>> >> >> > >> >      {
>>> >>>>>>> >> >> >>>>>>>> >> >> > >> >
>>> >>>>>>> >> >> >>>>>>>> >> >> > >> > please do that check in the single caller of this
>>> >>>>>>> >> >> >>>>>>function.
>>> >>>>>>> >> >> >>>>>>>> >> >> > >> >
>>> >>>>>>> >> >> >>>>>>>> >> >> > >> > Otherwise I still dislike the new ->aux use and I
>>> >>>>>>> >> >> >>>>>>believe that simply
>>> >>>>>>> >> >> >>>>>>>> >> >> > >> > passing down info from the processed parent would be
>>> >>>>>>> >> >> >>>>>>_much_ cleaner.
>>> >>>>>>> >> >> >>>>>>>> >> >> > >> > That is, here (and avoid the FOR_EACH_LOOP change):
>>> >>>>>>> >> >> >>>>>>>> >> >> > >> >
>>> >>>>>>> >> >> >>>>>>>> >> >> > >> > @@ -580,12 +586,21 @@ vectorize_loops (void)
>>> >>>>>>> >> >> >>>>>>>> >> >> > >> >             && dump_enabled_p ())
>>> >>>>>>> >> >> >>>>>>>> >> >> > >> >            dump_printf_loc (MSG_OPTIMIZED_LOCATIONS,
>>> >>>>>>> >> >> >>>>>>vect_location,
>>> >>>>>>> >> >> >>>>>>>> >> >> > >> >                             "loop vectorized\n");
>>> >>>>>>> >> >> >>>>>>>> >> >> > >> > -       vect_transform_loop (loop_vinfo);
>>> >>>>>>> >> >> >>>>>>>> >> >> > >> > +       new_loop = vect_transform_loop (loop_vinfo);
>>> >>>>>>> >> >> >>>>>>>> >> >> > >> >         num_vectorized_loops++;
>>> >>>>>>> >> >> >>>>>>>> >> >> > >> >         /* Now that the loop has been vectorized, allow
>>> >>>>>>> >> >> >>>>>>it to be unrolled
>>> >>>>>>> >> >> >>>>>>>> >> >> > >> >            etc.  */
>>> >>>>>>> >> >> >>>>>>>> >> >> > >> >         loop->force_vectorize = false;
>>> >>>>>>> >> >> >>>>>>>> >> >> > >> >
>>> >>>>>>> >> >> >>>>>>>> >> >> > >> > +       /* Add new loop to a processing queue.  To make
>>> >>>>>>> >> >> >>>>>>it easier
>>> >>>>>>> >> >> >>>>>>>> >> >> > >> > +          to match loop and its epilogue vectorization
>>> >>>>>>> >> >> >>>>>>in dumps
>>> >>>>>>> >> >> >>>>>>>> >> >> > >> > +          put new loop as the next loop to process.
>>> >>>>>>> >> >> >>>>>>*/
>>> >>>>>>> >> >> >>>>>>>> >> >> > >> > +       if (new_loop)
>>> >>>>>>> >> >> >>>>>>>> >> >> > >> > +         {
>>> >>>>>>> >> >> >>>>>>>> >> >> > >> > +           loops.safe_insert (i + 1, new_loop->num);
>>> >>>>>>> >> >> >>>>>>>> >> >> > >> > +           vect_loops_num = number_of_loops (cfun);
>>> >>>>>>> >> >> >>>>>>>> >> >> > >> > +         }
>>> >>>>>>> >> >> >>>>>>>> >> >> > >> >
>>> >>>>>>> >> >> >>>>>>>> >> >> > >> > simply dispatch to a vectorize_epilogue (loop_vinfo,
>>> >>>>>>> >> >> >>>>>>new_loop)
>>> >>>>>>> >> >> >>>>>>>> >> >> > >> > function which will set up stuff properly (and also
>>> >>>>>>> >> >> >>>>>>perform
>>> >>>>>>> >> >> >>>>>>>> >> >> > >> > the if-conversion of the epilogue there).
>>> >>>>>>> >> >> >>>>>>>> >> >> > >> >
>>> >>>>>>> >> >> >>>>>>>> >> >> > >> > That said, if we can get in non-masked epilogue
>>> >>>>>>> >> >> >>>>>>vectorization
>>> >>>>>>> >> >> >>>>>>>> >> >> > >> > separately that would be great.
>>> >>>>>>> >> >> >>>>>>>> >> >> > >> >
>>> >>>>>>> >> >> >>>>>>>> >> >> > >> > I'm still torn about all the rest of the stuff and
>>> >>>>>>> >> >> >>>>>>question its
>>> >>>>>>> >> >> >>>>>>>> >> >> > >> > usability (esp. merging the epilogue with the main
>>> >>>>>>> >> >> >>>>>>vector loop).
>>> >>>>>>> >> >> >>>>>>>> >> >> > >> > But it has already been approved ... oh well.
>>> >>>>>>> >> >> >>>>>>>> >> >> > >> >
>>> >>>>>>> >> >> >>>>>>>> >> >> > >> > Thanks,
>>> >>>>>>> >> >> >>>>>>>> >> >> > >> > Richard.
>>> >>>>>>> >> >> >>>>>>>> >> >> > >>
>>> >>>>>>> >> >> >>>>>>>> >> >> > >>
>>> >>>>>>> >> >> >>>>>>>> >> >> > >
>>> >>>>>>> >> >> >>>>>>>> >> >> > > --
>>> >>>>>>> >> >> >>>>>>>> >> >> > > Richard Biener <rguenther@suse.de>
>>> >>>>>>> >> >> >>>>>>>> >> >> > > SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard,
>>> >>>>>>> >> >> >>>>>>Graham Norton, HRB 21284 (AG Nuernberg)
>>> >>>>>>> >> >> >>>>>>>> >> >> >
>>> >>>>>>> >> >> >>>>>>>> >> >>
>>> >>>>>>> >> >> >>>>>>>> >> >>
>>> >>>>>>> >> >> >>>>>>>> >> >
>>> >>>>>>> >> >> >>>>>>>> >> > --
>>> >>>>>>> >> >> >>>>>>>> >> > Richard Biener <rguenther@suse.de>
>>> >>>>>>> >> >> >>>>>>>> >> > SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham
>>> >>>>>>> >> >> >>>>>>Norton, HRB 21284 (AG Nuernberg)
>>> >>>>>>> >> >> >>>>>>>> >>
>>> >>>>>>> >> >> >>>>>>>> >
>>> >>>>>>> >> >> >>>>>>>> > --
>>> >>>>>>> >> >> >>>>>>>> > Richard Biener <rguenther@suse.de>
>>> >>>>>>> >> >> >>>>>>>> > SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham
>>> >>>>>>> >> >> >>>>>>Norton, HRB 21284 (AG Nuernberg)
>>> >>>>>>> >> >> >>>>>>>>
>>> >>>>>>> >> >> >>>>>>>
>>> >>>>>>> >> >> >>>>>>> --
>>> >>>>>>> >> >> >>>>>>> Richard Biener <rguenther@suse.de>
>>> >>>>>>> >> >> >>>>>>> SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham
>>> >>>>>>> >> >> >>>>>>Norton, HRB 21284 (AG Nuernberg)
>>> >>>>>>> >> >> >>>>>
>>> >>>>>>> >> >> >>>>>
>>> >>>>>>> >> >>
>>> >>>>>>> >> >
>>> >>>>>>> >> > --
>>> >>>>>>> >> > Richard Biener <rguenther@suse.de>
>>> >>>>>>> >> > SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)
>>> >>>>>>> >>
>>> >>>>>>> >
>>> >>>>>>> > --
>>> >>>>>>> > Richard Biener <rguenther@suse.de>
>>> >>>>>>> > SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)
>>> >>>>>>>
>>> >>>>>>>
>>> >>>>>>
>>> >>>>>> --
>>> >>>>>> Richard Biener <rguenther@suse.de>
>>> >>>>>> SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)
>>>
>>
>> --
>> Richard Biener <rguenther@suse.de>
>> SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)
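
For readers following the thread, here is a rough scalar model of what
non-masked epilogue vectorization does to a loop.  It is only a sketch,
not code from the patch: the names sum_scalar, sum_tiered,
tiered_matches_scalar, MAIN_VF and EPILOGUE_VF are made up for
illustration, and the factors 8 and 4 assume int elements in 32-byte
and 16-byte vectors (matching the VS=32 / VS=16 dump strings in the
test).  The inner j-loops stand in for single vector operations.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical vectorization factors: 8 ints per 32-byte vector for the
   main loop, 4 ints per 16-byte vector for the epilogue.  */
enum { MAIN_VF = 8, EPILOGUE_VF = 4 };

/* Reference: the original scalar reduction.  */
long
sum_scalar (const int *a, size_t n)
{
  long res = 0;
  for (size_t i = 0; i < n; i++)
    res += a[i];
  return res;
}

/* The same reduction, split the way a vectorizer with a non-masked
   vectorized epilogue would split the iteration space.  */
long
sum_tiered (const int *a, size_t n)
{
  long res = 0;
  size_t i = 0;

  /* Main vector loop: consumes MAIN_VF elements per iteration.  */
  for (; i + MAIN_VF <= n; i += MAIN_VF)
    for (size_t j = 0; j < MAIN_VF; j++)
      res += a[i + j];

  /* Vectorized epilogue: fewer than MAIN_VF elements remain, so they
     are processed with the smaller factor and no masking.  */
  for (; i + EPILOGUE_VF <= n; i += EPILOGUE_VF)
    for (size_t j = 0; j < EPILOGUE_VF; j++)
      res += a[i + j];

  /* Scalar tail: whatever is left after the vectorized epilogue.  */
  for (; i < n; i++)
    res += a[i];

  return res;
}

/* Fill a buffer with n small values and compare both versions.
   Returns 1 when they agree.  */
int
tiered_matches_scalar (size_t n)
{
  int buf[1024];
  assert (n <= 1024);
  for (size_t i = 0; i < n; i++)
    buf[i] = (int) (i % 7) - 3;
  return sum_tiered (buf, n) == sum_scalar (buf, n);
}
```

With n = 1023 (the SIZE used in the test) the main loop covers 127 * 8 =
1016 elements, the epilogue takes 4 of the remaining 7, and the scalar
tail handles the last 3.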

[-- Attachment #2: nomask.patch --]
[-- Type: application/octet-stream, Size: 4430 bytes --]

diff --git a/gcc/testsuite/gcc.dg/vect/vec-tail-nomask-2.c b/gcc/testsuite/gcc.dg/vect/vec-tail-nomask-2.c
new file mode 100755
index 0000000..47bb4b7
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vec-tail-nomask-2.c
@@ -0,0 +1,155 @@
+/* { dg-do run } */
+/* { dg-require-weak "" } */
+/* { dg-additional-options "-ffast-math --param vect-epilogues-nomask=1 -mavx2" { target avx2_runtime } } */
+
+#define SIZE 1023
+#define ALIGN 64
+
+extern int posix_memalign(void **memptr, __SIZE_TYPE__ alignment, __SIZE_TYPE__ size);
+extern void free (void *);
+
+double __attribute__((noinline))
+test_citer (int * __restrict__ a,
+	    long long * __restrict__ b,
+	    float * __restrict__ c,
+	    double * __restrict__ d)
+{
+  double res = 0;
+  int i;
+
+  a = (int *)__builtin_assume_aligned (a, ALIGN);
+  b = (long long *)__builtin_assume_aligned (b, ALIGN);
+  c = (float *)__builtin_assume_aligned (c, ALIGN);
+  d = (double *)__builtin_assume_aligned (d, ALIGN);
+
+  for (i = 0; i < SIZE; i++)
+    {
+      a[i] = c[i] + 1;
+      if (b[i] < 0)
+	res += d[i];
+    }
+
+  return res;
+}
+
+double __attribute__((noinline))
+test_viter (int * __restrict__ a,
+	    long long * __restrict__ b,
+	    float * __restrict__ c,
+	    double * __restrict__ d,
+	    int size)
+{
+  double res = 0;
+  int i;
+
+  a = (int *)__builtin_assume_aligned (a, ALIGN);
+  b = (long long *)__builtin_assume_aligned (b, ALIGN);
+  c = (float *)__builtin_assume_aligned (c, ALIGN);
+  d = (double *)__builtin_assume_aligned (d, ALIGN);
+
+  for (i = 0; i < size; i++)
+    {
+      a[i] = c[i] + 1;
+      if (b[i] < 0)
+	res += d[i];
+    }
+
+  return res;
+}
+
+void __attribute__((noinline))
+init_data (int * __restrict__ a,
+	   long long * __restrict__ b,
+	   float * __restrict__ c,
+	   double * __restrict__ d,
+	   int size)
+{
+  int i;
+  for (i = 0; i < size; i++)
+    {
+      if (i % 2)
+	{
+	  a[i] = 0;
+	  b[i] = i;
+	  c[i] = 2.5;
+	  d[i] = 1;
+	}
+      else
+	{
+	  a[i] = 0;
+	  b[i] = -i;
+	  c[i] = 2.5;
+	  d[i] = -1;
+	}
+      asm volatile("": : :"memory");
+    }
+  a[size] = (int)size;
+  b[size] = (long long)size;
+  c[size] = (float)size;
+  d[size] = (double)size;
+}
+
+void __attribute__((noinline))
+run_test ()
+{
+  int *a;
+  long long *b;
+  float *c;
+  double *d;
+  double res;
+  int i;
+
+  if (posix_memalign ((void **)&a, ALIGN, (SIZE + 1) * sizeof (int)) != 0)
+    return;
+  if (posix_memalign ((void **)&b, ALIGN, (SIZE + 1) * sizeof (long long)) != 0)
+    return;
+  if (posix_memalign ((void **)&c, ALIGN, (SIZE + 1) * sizeof (float)) != 0)
+    return;
+  if (posix_memalign ((void **)&d, ALIGN, (SIZE + 1) * sizeof (double)) != 0)
+    return;
+
+  init_data (a, b, c, d, SIZE);
+  res = test_citer (a, b, c, d);
+  res += SIZE / 2;
+  if (res > 0.01 || res < -0.01)
+    __builtin_abort ();
+  for (i = 0; i < SIZE; i++)
+    if (a[i] != 3)
+      __builtin_abort ();
+  if (a[SIZE] != (int)SIZE
+      || b[SIZE] != (long long)SIZE
+      || c[SIZE] != (float)SIZE
+      || d[SIZE] != (double)SIZE)
+    __builtin_abort ();
+
+  init_data (a, b, c, d, SIZE);
+  res = test_viter (a, b, c, d, SIZE);
+  res += SIZE / 2;
+  if (res > 0.01 || res < -0.01)
+    __builtin_abort ();
+  for (i = 0; i < SIZE; i++)
+    if (a[i] != 3)
+      __builtin_abort ();
+  if (a[SIZE] != (int)SIZE
+      || b[SIZE] != (long long)SIZE
+      || c[SIZE] != (float)SIZE
+      || d[SIZE] != (double)SIZE)
+    __builtin_abort ();
+
+  free (a);
+  free (b);
+  free (c);
+}
+
+int
+main (int argc, const char **argv)
+{
+  if (!posix_memalign)
+    return 0;
+
+  run_test ();
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "LOOP VECTORIZED \\(VS=32\\)" 2 "vect" { target avx2_runtime } } } */
+/* { dg-final { scan-tree-dump-times "LOOP EPILOGUE VECTORIZED \\(VS=16\\)" 2 "vect" { target avx2_runtime } } } */
diff --git a/gcc/tree-vectorizer.c b/gcc/tree-vectorizer.c
index 7538c6c..39762cb 100644
--- a/gcc/tree-vectorizer.c
+++ b/gcc/tree-vectorizer.c
@@ -540,8 +540,8 @@ vectorize_loops (void)
 	     || loop->force_vectorize)
       {
 	loop_vec_info loop_vinfo, orig_loop_vinfo = NULL;
-	gimple *loop_vectorized_call = vect_loop_vectorized_call (loop);
 vectorize_epilogue:
+	gimple *loop_vectorized_call = vect_loop_vectorized_call (loop);
 	vect_location = find_loop_location (loop);
         if (LOCATION_LOCUS (vect_location) != UNKNOWN_LOCATION
 	    && dump_enabled_p ())


end of thread, other threads:[~2016-12-21 16:33 UTC | newest]

Thread overview: 38+ messages (download: mbox.gz / follow: Atom feed)
2016-11-01 12:38 [PATCH, vec-tails] Support loop epilogue vectorization Yuri Rumyantsev
2016-11-02 12:27 ` Richard Biener
2016-11-03 12:33   ` Yuri Rumyantsev
2016-11-08 12:39     ` Richard Biener
2016-11-08 14:17       ` Yuri Rumyantsev
2016-11-10 12:34         ` Richard Biener
2016-11-10 12:36           ` Richard Biener
2016-11-11 11:15             ` Yuri Rumyantsev
2016-11-11 14:15               ` Yuri Rumyantsev
2016-11-11 14:43                 ` Yuri Rumyantsev
2016-11-14 12:56                   ` Richard Biener
2016-11-14 12:51               ` Richard Biener
2016-11-14 13:30                 ` Yuri Rumyantsev
2016-11-14 13:41                   ` Richard Biener
2016-11-14 15:39                     ` Yuri Rumyantsev
2016-11-14 17:59                       ` Richard Biener
2016-11-15 14:42                         ` Yuri Rumyantsev
2016-11-16  9:56                           ` Richard Biener
2016-11-18 13:20                           ` Christophe Lyon
2016-11-18 15:46                             ` Yuri Rumyantsev
2016-11-18 15:54                               ` Christophe Lyon
2016-11-24 13:42                                 ` Yuri Rumyantsev
2016-11-28 14:39                                   ` Richard Biener
2016-11-28 16:57                                     ` Yuri Rumyantsev
2016-12-01 11:34                                       ` Richard Biener
2016-12-01 14:27                                         ` Yuri Rumyantsev
2016-12-01 14:46                                           ` Richard Biener
     [not found]                                             ` <CAEoMCqSkWgz+DJLe1M1CDxbk4LBtBU4r3rcVv7OcgpsGW4eTJA@mail.gmail.com>
     [not found]                                               ` <CAEoMCqRVVYTYWfhYrpi3TOuBe6XBw4ScVNstoqd8YShBsvRwMw@mail.gmail.com>
     [not found]                                                 ` <CAEoMCqTdOHO_OxJ-5gxDJRPQDS+9kYkKd+WdgGJz8WMuUzD61A@mail.gmail.com>
     [not found]                                                   ` <CAEoMCqQ5ZaT6TPbDL37DOZCEF5DHKWx995yn2fQZO3kV+vQ+EA@mail.gmail.com>
     [not found]                                                     ` <CAEoMCqTCaRQU-mia98uX00CtpKA9w03fhaR2hXCdywXuVAQmSw@mail.gmail.com>
     [not found]                                                       ` <CAEoMCqST8pOZmndKKuYWSyD=juPdGG1UAJ6NyAV3qkuxjV+3gA@mail.gmail.com>
     [not found]                                                         ` <alpine.LSU.2.11.1612131455080.5294@t29.fhfr.qr>
2016-12-21 10:14                                                           ` Yuri Rumyantsev
2016-12-21 17:23                                                             ` Yuri Rumyantsev
2016-11-29 16:22                                 ` Christophe Lyon
2016-11-05 18:35   ` Jeff Law
2016-11-06 11:16     ` Richard Biener
2016-11-09 10:37 ` Bin.Cheng
2016-11-09 11:28   ` Yuri Rumyantsev
2016-11-09 11:46     ` Bin.Cheng
2016-11-09 12:12       ` Yuri Rumyantsev
2016-11-09 12:40         ` Bin.Cheng
2016-11-09 12:52     ` Richard Biener
