public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed
* [PATCH] Allow fwprop to undo vectorization harm (PR68961)
@ 2016-06-10  9:20 Richard Biener
  2016-06-13  8:49 ` Richard Biener
                   ` (2 more replies)
  0 siblings, 3 replies; 20+ messages in thread
From: Richard Biener @ 2016-06-10  9:20 UTC (permalink / raw)
  To: gcc-patches; +Cc: segher


With the proposed cost change for vector construction we will end up
vectorizing the testcase in PR68961 again (on x86_64 and likely
on ppc64le as well after that target gets adjustments).  Currently
we can't optimize that away again noticing the direct overlap of
argument and return registers.  The obstackle is

(insn 7 4 8 2 (set (reg:V2DF 93)
        (vec_concat:V2DF (reg/v:DF 91 [ a ])
            (reg/v:DF 92 [ aa ]))) 
...
(insn 21 8 24 2 (set (reg:DI 97 [ D.1756 ])
        (subreg:DI (reg:TI 88 [ D.1756 ]) 0))
(insn 24 21 11 2 (set (reg:DI 100 [+8 ])
        (subreg:DI (reg:TI 88 [ D.1756 ]) 8))

which we eventually optimize to DFmode subregs of (reg:V2DF 93).

First of all simplify_subreg doesn't handle the subregs of a vec_concat
(easy fix below).

Then combine doesn't like to simplify the multi-use (it tries some
parallel it seems).  So I went to forwprop which eventually manages
to do this but throws away the result (reg:DF 91) or (reg:DF 92)
because it is not a constant.  Thus I allow arbitrary simplification
results for SUBREGs of [VEC_]CONCAT operations.  There doesn't seem
to be a magic flag to tell it to restrict to the case where all
uses can be simplified or so, nor to restrict simplifications to a REG.
But I don't see any undesirable simplifications of (subreg 
([vec_]concat)).

For the testcase I'm not sure if I have to exclude some ABIs (mingw?).

Boostrap and regtest in progress on x86_64-unknown-linux-gnu, I'll
install the simplify-rtx.c if that succeeds but like to have opinions
on the fwprop.c change.

Thanks,
Richard.

2016-06-10  Richard Biener  <rguenther@suse.de>

	PR rtl-optimization/68961
	* simplify-rtx.c (simplify_subreg): Handle VEC_CONCAT like CONCAT.
	* fwprop.c (propagate_rtx): Allow SUBREGs of VEC_CONCAT and CONCAT
	to simplify to a non-constant.

	* gcc.target/i386/pr68961.c: New testcase.

Index: gcc/simplify-rtx.c
===================================================================
*** gcc/simplify-rtx.c	(revision 237286)
--- gcc/simplify-rtx.c	(working copy)
*************** simplify_subreg (machine_mode outermode,
*** 6108,6116 ****
        && GET_MODE_SIZE (outermode) <= GET_MODE_SIZE (GET_MODE (op)))
      return adjust_address_nv (op, outermode, byte);
  
!   /* Handle complex values represented as CONCAT
!      of real and imaginary part.  */
!   if (GET_CODE (op) == CONCAT)
      {
        unsigned int part_size, final_offset;
        rtx part, res;
--- 6108,6117 ----
        && GET_MODE_SIZE (outermode) <= GET_MODE_SIZE (GET_MODE (op)))
      return adjust_address_nv (op, outermode, byte);
  
!   /* Handle complex or vector values represented as CONCAT or VEC_CONCAT
!      of two parts.  */
!   if (GET_CODE (op) == CONCAT
!       || GET_CODE (op) == VEC_CONCAT)
      {
        unsigned int part_size, final_offset;
        rtx part, res;
Index: gcc/fwprop.c
===================================================================
*** gcc/fwprop.c	(revision 237286)
--- gcc/fwprop.c	(working copy)
*************** propagate_rtx (rtx x, machine_mode mode,
*** 664,670 ****
        || (GET_CODE (new_rtx) == SUBREG
  	  && REG_P (SUBREG_REG (new_rtx))
  	  && (GET_MODE_SIZE (mode)
! 	      <= GET_MODE_SIZE (GET_MODE (SUBREG_REG (new_rtx))))))
      flags |= PR_CAN_APPEAR;
    if (!varying_mem_p (new_rtx))
      flags |= PR_HANDLE_MEM;
--- 664,673 ----
        || (GET_CODE (new_rtx) == SUBREG
  	  && REG_P (SUBREG_REG (new_rtx))
  	  && (GET_MODE_SIZE (mode)
! 	      <= GET_MODE_SIZE (GET_MODE (SUBREG_REG (new_rtx)))))
!       || ((GET_CODE (new_rtx) == VEC_CONCAT
! 	   || GET_CODE (new_rtx) == CONCAT)
! 	  && GET_CODE (x) == SUBREG))
      flags |= PR_CAN_APPEAR;
    if (!varying_mem_p (new_rtx))
      flags |= PR_HANDLE_MEM;
Index: gcc/testsuite/gcc.target/i386/pr68961.c
===================================================================
*** gcc/testsuite/gcc.target/i386/pr68961.c	(revision 0)
--- gcc/testsuite/gcc.target/i386/pr68961.c	(working copy)
***************
*** 0 ****
--- 1,19 ----
+ /* { dg-do compile { target lp64 } } */
+ /* { dg-options "-O3 -fno-vect-cost-model -fdump-tree-slp2-details" } */
+ 
+ struct x { double d[2]; };
+ 
+ struct x
+ pack (double a, double aa)
+ {
+   struct x u;
+   u.d[0] = a;
+   u.d[1] = aa;
+   return u;
+ }
+ 
+ /* The function should be optimized to just return as arguments and
+    result exactly overlap even when previously vectorized.  */
+ 
+ /* { dg-final { scan-tree-dump "basic block vectorized" "slp2" } } */
+ /* { dg-final { scan-assembler-not "mov" } } */

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] Allow fwprop to undo vectorization harm (PR68961)
  2016-06-10  9:20 [PATCH] Allow fwprop to undo vectorization harm (PR68961) Richard Biener
@ 2016-06-13  8:49 ` Richard Biener
  2016-06-13 13:11   ` Richard Biener
  2016-06-15 18:37 ` Richard Sandiford
  2016-06-17  0:07 ` RFC: 2->2 combine patch (was: Re: [PATCH] Allow fwprop to undo vectorization harm (PR68961)) Segher Boessenkool
  2 siblings, 1 reply; 20+ messages in thread
From: Richard Biener @ 2016-06-13  8:49 UTC (permalink / raw)
  To: gcc-patches; +Cc: segher

On Fri, 10 Jun 2016, Richard Biener wrote:

> 
> With the proposed cost change for vector construction we will end up
> vectorizing the testcase in PR68961 again (on x86_64 and likely
> on ppc64le as well after that target gets adjustments).  Currently
> we can't optimize that away again noticing the direct overlap of
> argument and return registers.  The obstackle is
> 
> (insn 7 4 8 2 (set (reg:V2DF 93)
>         (vec_concat:V2DF (reg/v:DF 91 [ a ])
>             (reg/v:DF 92 [ aa ]))) 
> ...
> (insn 21 8 24 2 (set (reg:DI 97 [ D.1756 ])
>         (subreg:DI (reg:TI 88 [ D.1756 ]) 0))
> (insn 24 21 11 2 (set (reg:DI 100 [+8 ])
>         (subreg:DI (reg:TI 88 [ D.1756 ]) 8))
> 
> which we eventually optimize to DFmode subregs of (reg:V2DF 93).
> 
> First of all simplify_subreg doesn't handle the subregs of a vec_concat
> (easy fix below).
> 
> Then combine doesn't like to simplify the multi-use (it tries some
> parallel it seems).  So I went to forwprop which eventually manages
> to do this but throws away the result (reg:DF 91) or (reg:DF 92)
> because it is not a constant.  Thus I allow arbitrary simplification
> results for SUBREGs of [VEC_]CONCAT operations.  There doesn't seem
> to be a magic flag to tell it to restrict to the case where all
> uses can be simplified or so, nor to restrict simplifications to a REG.
> But I don't see any undesirable simplifications of (subreg 
> ([vec_]concat)).
> 
> For the testcase I'm not sure if I have to exclude some ABIs (mingw?).
> 
> Boostrap and regtest in progress on x86_64-unknown-linux-gnu, I'll
> install the simplify-rtx.c if that succeeds but like to have opinions
> on the fwprop.c change.

So the bootstrap exposes a latent issue in simplify-rtx.c in the changed
hunk via gcc.target/i386/mmx-8.c on i?86 which ends up with a 

(vec_concat:V2SI (reg:SI 103)
    (const_int 0 [0]))

and thus a VOIDmode 2nd operand (I'm sure this can happen for
complex integer concat as well, thus latent).  I am adjusting the
simplify_subreg hunk to always pass GET_MODE_INNER (innermode)
(that hopefully exercises it a bit more than just using that
if GET_MODE (part) == VOIDmode - and hopefully they should always
agree).

Re-bootstrap / regtest running on x86_64-unknown-linux-gnu.

Comments still welcome.

Thanks,
Richard.

2016-06-13  Richard Biener  <rguenther@suse.de>

	PR rtl-optimization/68961
	* simplify-rtx.c (simplify_subreg): Handle VEC_CONCAT like CONCAT.
	* fwprop.c (propagate_rtx): Allow SUBREGs of VEC_CONCAT and CONCAT
	to simplify to a non-constant.

	* gcc.target/i386/pr68961.c: New testcase.

Index: gcc/simplify-rtx.c
===================================================================
*** gcc/simplify-rtx.c	(revision 237286)
--- gcc/simplify-rtx.c	(working copy)
*************** simplify_subreg (machine_mode outermode,
*** 6108,6116 ****
        && GET_MODE_SIZE (outermode) <= GET_MODE_SIZE (GET_MODE (op)))
      return adjust_address_nv (op, outermode, byte);
  
!   /* Handle complex values represented as CONCAT
!      of real and imaginary part.  */
!   if (GET_CODE (op) == CONCAT)
      {
        unsigned int part_size, final_offset;
        rtx part, res;
--- 6108,6117 ----
        && GET_MODE_SIZE (outermode) <= GET_MODE_SIZE (GET_MODE (op)))
      return adjust_address_nv (op, outermode, byte);
  
!   /* Handle complex or vector values represented as CONCAT or VEC_CONCAT
!      of two parts.  */
!   if (GET_CODE (op) == CONCAT
!       || GET_CODE (op) == VEC_CONCAT)
      {
        unsigned int part_size, final_offset;
        rtx part, res;
*************** simplify_subreg (machine_mode outermode,
*** 6130,6139 ****
        if (final_offset + GET_MODE_SIZE (outermode) > part_size)
  	return NULL_RTX;
  
!       res = simplify_subreg (outermode, part, GET_MODE (part), final_offset);
        if (res)
  	return res;
!       if (validate_subreg (outermode, GET_MODE (part), part, final_offset))
  	return gen_rtx_SUBREG (outermode, part, final_offset);
        return NULL_RTX;
      }
--- 6131,6141 ----
        if (final_offset + GET_MODE_SIZE (outermode) > part_size)
  	return NULL_RTX;
  
!       enum machine_mode part_mode = GET_MODE_INNER (innermode);
!       res = simplify_subreg (outermode, part, part_mode, final_offset);
        if (res)
  	return res;
!       if (validate_subreg (outermode, part_mode, part, final_offset))
  	return gen_rtx_SUBREG (outermode, part, final_offset);
        return NULL_RTX;
      }
Index: gcc/fwprop.c
===================================================================
*** gcc/fwprop.c	(revision 237286)
--- gcc/fwprop.c	(working copy)
*************** propagate_rtx (rtx x, machine_mode mode,
*** 664,670 ****
        || (GET_CODE (new_rtx) == SUBREG
  	  && REG_P (SUBREG_REG (new_rtx))
  	  && (GET_MODE_SIZE (mode)
! 	      <= GET_MODE_SIZE (GET_MODE (SUBREG_REG (new_rtx))))))
      flags |= PR_CAN_APPEAR;
    if (!varying_mem_p (new_rtx))
      flags |= PR_HANDLE_MEM;
--- 664,673 ----
        || (GET_CODE (new_rtx) == SUBREG
  	  && REG_P (SUBREG_REG (new_rtx))
  	  && (GET_MODE_SIZE (mode)
! 	      <= GET_MODE_SIZE (GET_MODE (SUBREG_REG (new_rtx)))))
!       || ((GET_CODE (new_rtx) == VEC_CONCAT
! 	   || GET_CODE (new_rtx) == CONCAT)
! 	  && GET_CODE (x) == SUBREG))
      flags |= PR_CAN_APPEAR;
    if (!varying_mem_p (new_rtx))
      flags |= PR_HANDLE_MEM;
Index: gcc/testsuite/gcc.target/i386/pr68961.c
===================================================================
*** gcc/testsuite/gcc.target/i386/pr68961.c	(revision 0)
--- gcc/testsuite/gcc.target/i386/pr68961.c	(working copy)
***************
*** 0 ****
--- 1,19 ----
+ /* { dg-do compile { target lp64 } } */
+ /* { dg-options "-O3 -fno-vect-cost-model -fdump-tree-slp2-details" } */
+ 
+ struct x { double d[2]; };
+ 
+ struct x
+ pack (double a, double aa)
+ {
+   struct x u;
+   u.d[0] = a;
+   u.d[1] = aa;
+   return u;
+ }
+ 
+ /* The function should be optimized to just return as arguments and
+    result exactly overlap even when previously vectorized.  */
+ 
+ /* { dg-final { scan-tree-dump "basic block vectorized" "slp2" } } */
+ /* { dg-final { scan-assembler-not "mov" } } */

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] Allow fwprop to undo vectorization harm (PR68961)
  2016-06-13  8:49 ` Richard Biener
@ 2016-06-13 13:11   ` Richard Biener
  0 siblings, 0 replies; 20+ messages in thread
From: Richard Biener @ 2016-06-13 13:11 UTC (permalink / raw)
  To: gcc-patches; +Cc: segher

On Mon, 13 Jun 2016, Richard Biener wrote:

> On Fri, 10 Jun 2016, Richard Biener wrote:
> 
> > 
> > With the proposed cost change for vector construction we will end up
> > vectorizing the testcase in PR68961 again (on x86_64 and likely
> > on ppc64le as well after that target gets adjustments).  Currently
> > we can't optimize that away again noticing the direct overlap of
> > argument and return registers.  The obstackle is
> > 
> > (insn 7 4 8 2 (set (reg:V2DF 93)
> >         (vec_concat:V2DF (reg/v:DF 91 [ a ])
> >             (reg/v:DF 92 [ aa ]))) 
> > ...
> > (insn 21 8 24 2 (set (reg:DI 97 [ D.1756 ])
> >         (subreg:DI (reg:TI 88 [ D.1756 ]) 0))
> > (insn 24 21 11 2 (set (reg:DI 100 [+8 ])
> >         (subreg:DI (reg:TI 88 [ D.1756 ]) 8))
> > 
> > which we eventually optimize to DFmode subregs of (reg:V2DF 93).
> > 
> > First of all simplify_subreg doesn't handle the subregs of a vec_concat
> > (easy fix below).
> > 
> > Then combine doesn't like to simplify the multi-use (it tries some
> > parallel it seems).  So I went to forwprop which eventually manages
> > to do this but throws away the result (reg:DF 91) or (reg:DF 92)
> > because it is not a constant.  Thus I allow arbitrary simplification
> > results for SUBREGs of [VEC_]CONCAT operations.  There doesn't seem
> > to be a magic flag to tell it to restrict to the case where all
> > uses can be simplified or so, nor to restrict simplifications to a REG.
> > But I don't see any undesirable simplifications of (subreg 
> > ([vec_]concat)).
> > 
> > For the testcase I'm not sure if I have to exclude some ABIs (mingw?).
> > 
> > Boostrap and regtest in progress on x86_64-unknown-linux-gnu, I'll
> > install the simplify-rtx.c if that succeeds but like to have opinions
> > on the fwprop.c change.
> 
> So the bootstrap exposes a latent issue in simplify-rtx.c in the changed
> hunk via gcc.target/i386/mmx-8.c on i?86 which ends up with a 
> 
> (vec_concat:V2SI (reg:SI 103)
>     (const_int 0 [0]))
> 
> and thus a VOIDmode 2nd operand (I'm sure this can happen for
> complex integer concat as well, thus latent).  I am adjusting the
> simplify_subreg hunk to always pass GET_MODE_INNER (innermode)
> (that hopefully exercises it a bit more than just using that
> if GET_MODE (part) == VOIDmode - and hopefully they should always
> agree).
> 
> Re-bootstrap / regtest running on x86_64-unknown-linux-gnu.

That works worse given that vec_concat can be

(vec_concat:V16QI (us_truncate:V8QI (reg:V8HI 159))
    (us_truncate:V8QI (reg:V8HI 160)))

... now I think the VOIDmode case can only happen for scalar vec_concat
and thus

      enum machine_mode part_mode = GET_MODE (part);
      if (part_mode == VOIDmode)
        part_mode = GET_MODE_INNER (GET_MODE (op));

should work.  Re-testing with that... (ok, I know it has coverage of
exactly one testcase on x86_64 as it would otherwise ICE).

Richard.


2016-06-13  Richard Biener  <rguenther@suse.de>

	PR rtl-optimization/68961
	* simplify-rtx.c (simplify_subreg): Handle VEC_CONCAT like CONCAT.
	* fwprop.c (propagate_rtx): Allow SUBREGs of VEC_CONCAT and CONCAT
	to simplify to a non-constant.

	* gcc.target/i386/pr68961.c: New testcase.

Index: gcc/simplify-rtx.c
===================================================================
*** gcc/simplify-rtx.c	(revision 237372)
--- gcc/simplify-rtx.c	(working copy)
*************** simplify_subreg (machine_mode outermode,
*** 6108,6116 ****
        && GET_MODE_SIZE (outermode) <= GET_MODE_SIZE (GET_MODE (op)))
      return adjust_address_nv (op, outermode, byte);
  
!   /* Handle complex values represented as CONCAT
!      of real and imaginary part.  */
!   if (GET_CODE (op) == CONCAT)
      {
        unsigned int part_size, final_offset;
        rtx part, res;
--- 6108,6117 ----
        && GET_MODE_SIZE (outermode) <= GET_MODE_SIZE (GET_MODE (op)))
      return adjust_address_nv (op, outermode, byte);
  
!   /* Handle complex or vector values represented as CONCAT or VEC_CONCAT
!      of two parts.  */
!   if (GET_CODE (op) == CONCAT
!       || GET_CODE (op) == VEC_CONCAT)
      {
        unsigned int part_size, final_offset;
        rtx part, res;
*************** simplify_subreg (machine_mode outermode,
*** 6130,6139 ****
        if (final_offset + GET_MODE_SIZE (outermode) > part_size)
  	return NULL_RTX;
  
!       res = simplify_subreg (outermode, part, GET_MODE (part), final_offset);
        if (res)
  	return res;
!       if (validate_subreg (outermode, GET_MODE (part), part, final_offset))
  	return gen_rtx_SUBREG (outermode, part, final_offset);
        return NULL_RTX;
      }
--- 6131,6143 ----
        if (final_offset + GET_MODE_SIZE (outermode) > part_size)
  	return NULL_RTX;
  
!       enum machine_mode part_mode = GET_MODE (part);
!       if (part_mode == VOIDmode)
! 	part_mode = GET_MODE_INNER (GET_MODE (op));
!       res = simplify_subreg (outermode, part, part_mode, final_offset);
        if (res)
  	return res;
!       if (validate_subreg (outermode, part_mode, part, final_offset))
  	return gen_rtx_SUBREG (outermode, part, final_offset);
        return NULL_RTX;
      }
Index: gcc/fwprop.c
===================================================================
*** gcc/fwprop.c	(revision 237372)
--- gcc/fwprop.c	(working copy)
*************** propagate_rtx (rtx x, machine_mode mode,
*** 664,670 ****
        || (GET_CODE (new_rtx) == SUBREG
  	  && REG_P (SUBREG_REG (new_rtx))
  	  && (GET_MODE_SIZE (mode)
! 	      <= GET_MODE_SIZE (GET_MODE (SUBREG_REG (new_rtx))))))
      flags |= PR_CAN_APPEAR;
    if (!varying_mem_p (new_rtx))
      flags |= PR_HANDLE_MEM;
--- 664,673 ----
        || (GET_CODE (new_rtx) == SUBREG
  	  && REG_P (SUBREG_REG (new_rtx))
  	  && (GET_MODE_SIZE (mode)
! 	      <= GET_MODE_SIZE (GET_MODE (SUBREG_REG (new_rtx)))))
!       || ((GET_CODE (new_rtx) == VEC_CONCAT
! 	   || GET_CODE (new_rtx) == CONCAT)
! 	  && GET_CODE (x) == SUBREG))
      flags |= PR_CAN_APPEAR;
    if (!varying_mem_p (new_rtx))
      flags |= PR_HANDLE_MEM;
Index: gcc/testsuite/gcc.target/i386/pr68961.c
===================================================================
*** gcc/testsuite/gcc.target/i386/pr68961.c	(revision 0)
--- gcc/testsuite/gcc.target/i386/pr68961.c	(working copy)
***************
*** 0 ****
--- 1,19 ----
+ /* { dg-do compile { target lp64 } } */
+ /* { dg-options "-O3 -fno-vect-cost-model -fdump-tree-slp2-details" } */
+ 
+ struct x { double d[2]; };
+ 
+ struct x
+ pack (double a, double aa)
+ {
+   struct x u;
+   u.d[0] = a;
+   u.d[1] = aa;
+   return u;
+ }
+ 
+ /* The function should be optimized to just return as arguments and
+    result exactly overlap even when previously vectorized.  */
+ 
+ /* { dg-final { scan-tree-dump "basic block vectorized" "slp2" } } */
+ /* { dg-final { scan-assembler-not "mov" } } */

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] Allow fwprop to undo vectorization harm (PR68961)
  2016-06-10  9:20 [PATCH] Allow fwprop to undo vectorization harm (PR68961) Richard Biener
  2016-06-13  8:49 ` Richard Biener
@ 2016-06-15 18:37 ` Richard Sandiford
  2016-06-27  9:59   ` Richard Biener
  2016-06-17  0:07 ` RFC: 2->2 combine patch (was: Re: [PATCH] Allow fwprop to undo vectorization harm (PR68961)) Segher Boessenkool
  2 siblings, 1 reply; 20+ messages in thread
From: Richard Sandiford @ 2016-06-15 18:37 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches, segher

Richard Biener <rguenther@suse.de> writes:
> With the proposed cost change for vector construction we will end up
> vectorizing the testcase in PR68961 again (on x86_64 and likely
> on ppc64le as well after that target gets adjustments).  Currently
> we can't optimize that away again noticing the direct overlap of
> argument and return registers.  The obstackle is
>
> (insn 7 4 8 2 (set (reg:V2DF 93)
>         (vec_concat:V2DF (reg/v:DF 91 [ a ])
>             (reg/v:DF 92 [ aa ]))) 
> ...
> (insn 21 8 24 2 (set (reg:DI 97 [ D.1756 ])
>         (subreg:DI (reg:TI 88 [ D.1756 ]) 0))
> (insn 24 21 11 2 (set (reg:DI 100 [+8 ])
>         (subreg:DI (reg:TI 88 [ D.1756 ]) 8))
>
> which we eventually optimize to DFmode subregs of (reg:V2DF 93).
>
> First of all simplify_subreg doesn't handle the subregs of a vec_concat
> (easy fix below).
>
> Then combine doesn't like to simplify the multi-use (it tries some
> parallel it seems).  So I went to forwprop which eventually manages
> to do this but throws away the result (reg:DF 91) or (reg:DF 92)
> because it is not a constant.  Thus I allow arbitrary simplification
> results for SUBREGs of [VEC_]CONCAT operations.  There doesn't seem
> to be a magic flag to tell it to restrict to the case where all
> uses can be simplified or so, nor to restrict simplifications to a REG.
> But I don't see any undesirable simplifications of (subreg 
> ([vec_]concat)).

Adding that as a special case to propgate_rtx feels like a hack though :-)
I think:

> Index: gcc/fwprop.c
> ===================================================================
> *** gcc/fwprop.c	(revision 237286)
> --- gcc/fwprop.c	(working copy)
> *************** propagate_rtx (rtx x, machine_mode mode,
> *** 664,670 ****
>         || (GET_CODE (new_rtx) == SUBREG
>   	  && REG_P (SUBREG_REG (new_rtx))
>   	  && (GET_MODE_SIZE (mode)
> ! 	      <= GET_MODE_SIZE (GET_MODE (SUBREG_REG (new_rtx))))))
>       flags |= PR_CAN_APPEAR;
>     if (!varying_mem_p (new_rtx))
>       flags |= PR_HANDLE_MEM;
> --- 664,673 ----
>         || (GET_CODE (new_rtx) == SUBREG
>   	  && REG_P (SUBREG_REG (new_rtx))
>   	  && (GET_MODE_SIZE (mode)
> ! 	      <= GET_MODE_SIZE (GET_MODE (SUBREG_REG (new_rtx)))))
> !       || ((GET_CODE (new_rtx) == VEC_CONCAT
> ! 	   || GET_CODE (new_rtx) == CONCAT)
> ! 	  && GET_CODE (x) == SUBREG))
>       flags |= PR_CAN_APPEAR;
>     if (!varying_mem_p (new_rtx))
>       flags |= PR_HANDLE_MEM;

...this if statement should fundamentally only test new_rtx.
E.g. we'd want the same thing for any SUBREG inside X.

How about changing:

  /* The replacement we made so far is valid, if all of the recursive
     replacements were valid, or we could simplify everything to
     a constant.  */
  return valid_ops || can_appear || CONSTANT_P (tem);

so that (REG_P (tem) && !HARD_REGISTER_P (tem)) is also valid?
I suppose that's likely to increase register pressure though,
if only some uses of new_rtx simplify.  (There again, requiring all
uses to be replacable could make hot code the hostage of cold code.)

Thanks,
Richard

^ permalink raw reply	[flat|nested] 20+ messages in thread

* RFC: 2->2 combine patch (was: Re: [PATCH] Allow fwprop to undo vectorization harm (PR68961))
  2016-06-10  9:20 [PATCH] Allow fwprop to undo vectorization harm (PR68961) Richard Biener
  2016-06-13  8:49 ` Richard Biener
  2016-06-15 18:37 ` Richard Sandiford
@ 2016-06-17  0:07 ` Segher Boessenkool
  2016-06-20 13:39   ` RFC: 2->2 combine patch Kyrill Tkachov
  2 siblings, 1 reply; 20+ messages in thread
From: Segher Boessenkool @ 2016-06-17  0:07 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches

On Fri, Jun 10, 2016 at 11:20:22AM +0200, Richard Biener wrote:
> With the proposed cost change for vector construction we will end up
> vectorizing the testcase in PR68961 again (on x86_64 and likely
> on ppc64le as well after that target gets adjustments).  Currently
> we can't optimize that away again noticing the direct overlap of
> argument and return registers.  The obstackle is
> 
> (insn 7 4 8 2 (set (reg:V2DF 93)
>         (vec_concat:V2DF (reg/v:DF 91 [ a ])
>             (reg/v:DF 92 [ aa ]))) 
> ...
> (insn 21 8 24 2 (set (reg:DI 97 [ D.1756 ])
>         (subreg:DI (reg:TI 88 [ D.1756 ]) 0))
> (insn 24 21 11 2 (set (reg:DI 100 [+8 ])
>         (subreg:DI (reg:TI 88 [ D.1756 ]) 8))
> 
> which we eventually optimize to DFmode subregs of (reg:V2DF 93).
> 
> First of all simplify_subreg doesn't handle the subregs of a vec_concat
> (easy fix below).
> 
> Then combine doesn't like to simplify the multi-use (it tries some
> parallel it seems).

Combine will not do a 2->2 combination currently.  Say it is combining
A with a later B into C, and the result of A is used again later, then
it tries a parallel of A with C.  That usually does not match an insn for
the target.

If this were a 3->2 (or 4->2) combination, or A or C are no-op moves
(so that they will disappear later in combines), combine will break the
parallel into two and see if that matches.  We can in fact do that for
2->2 combinations as well: this removes a log_link (from A to B), so
combine cannot get into an infinite loop, even though it does not make
the number of RTL insns lower.

So I tried out the patch below.  It decreases code size on most targets
(mostly fixed length insn targets), and increases it a small bit on some
variable length insn targets (doing an op twice, instead of doing it once
and doing a move).  It looks to be all good there too, but there are so
many changes that it is almost impossible to really check.

So: can people try this out with their favourite benchmarks, please?


Segher


diff --git a/gcc/combine.c b/gcc/combine.c
index 6b5d000..2c99b4e 100644
--- a/gcc/combine.c
+++ b/gcc/combine.c
@@ -3933,8 +3933,6 @@ try_combine (rtx_insn *i3, rtx_insn *i2, rtx_insn *i1, rtx_insn *i0,
 	   && XVECLEN (newpat, 0) == 2
 	   && GET_CODE (XVECEXP (newpat, 0, 0)) == SET
 	   && GET_CODE (XVECEXP (newpat, 0, 1)) == SET
-	   && (i1 || set_noop_p (XVECEXP (newpat, 0, 0))
-		  || set_noop_p (XVECEXP (newpat, 0, 1)))
 	   && GET_CODE (SET_DEST (XVECEXP (newpat, 0, 0))) != ZERO_EXTRACT
 	   && GET_CODE (SET_DEST (XVECEXP (newpat, 0, 0))) != STRICT_LOW_PART
 	   && GET_CODE (SET_DEST (XVECEXP (newpat, 0, 1))) != ZERO_EXTRACT

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: RFC: 2->2 combine patch
  2016-06-17  0:07 ` RFC: 2->2 combine patch (was: Re: [PATCH] Allow fwprop to undo vectorization harm (PR68961)) Segher Boessenkool
@ 2016-06-20 13:39   ` Kyrill Tkachov
  2016-06-20 14:38     ` Segher Boessenkool
  0 siblings, 1 reply; 20+ messages in thread
From: Kyrill Tkachov @ 2016-06-20 13:39 UTC (permalink / raw)
  To: Segher Boessenkool, Richard Biener; +Cc: gcc-patches

Hi Segher,

On 17/06/16 01:07, Segher Boessenkool wrote:
> On Fri, Jun 10, 2016 at 11:20:22AM +0200, Richard Biener wrote:
>> With the proposed cost change for vector construction we will end up
>> vectorizing the testcase in PR68961 again (on x86_64 and likely
>> on ppc64le as well after that target gets adjustments).  Currently
>> we can't optimize that away again noticing the direct overlap of
>> argument and return registers.  The obstackle is
>>
>> (insn 7 4 8 2 (set (reg:V2DF 93)
>>          (vec_concat:V2DF (reg/v:DF 91 [ a ])
>>              (reg/v:DF 92 [ aa ])))
>> ...
>> (insn 21 8 24 2 (set (reg:DI 97 [ D.1756 ])
>>          (subreg:DI (reg:TI 88 [ D.1756 ]) 0))
>> (insn 24 21 11 2 (set (reg:DI 100 [+8 ])
>>          (subreg:DI (reg:TI 88 [ D.1756 ]) 8))
>>
>> which we eventually optimize to DFmode subregs of (reg:V2DF 93).
>>
>> First of all simplify_subreg doesn't handle the subregs of a vec_concat
>> (easy fix below).
>>
>> Then combine doesn't like to simplify the multi-use (it tries some
>> parallel it seems).
> Combine will not do a 2->2 combination currently.  Say it is combining
> A with a later B into C, and the result of A is used again later, then
> it tries a parallel of A with C.  That usually does not match an insn for
> the target.
>
> If this were a 3->2 (or 4->2) combination, or A or C are no-op moves
> (so that they will disappear later in combines), combine will break the
> parallel into two and see if that matches.  We can in fact do that for
> 2->2 combinations as well: this removes a log_link (from A to B), so
> combine cannot get into an infinite loop, even though it does not make
> the number of RTL insns lower.
>
> So I tried out the patch below.  It decreases code size on most targets
> (mostly fixed length insn targets), and increases it a small bit on some
> variable length insn targets (doing an op twice, instead of doing it once
> and doing a move).  It looks to be all good there too, but there are so
> many changes that it is almost impossible to really check.
>
> So: can people try this out with their favourite benchmarks, please?

I hope to give this a run on AArch64 but I'll probably manage to get to it only next week.
In the meantime I've had a quick look at some SPEC2006 codegen on aarch64.
Some benchmarks decrease in size, others increase. One recurring theme I spotted is
shifts being repeatedly combined with arithmetic operations rather than being computed
once and reusing the result. For example:
     lsl    x30, x15, 3
     add    x4, x5, x30
     add    x9, x7, x30
     add    x24, x8, x30
     add    x10, x0, x30
     add    x2, x22, x30

becoming (modulo regalloc fluctuations):
     add    x14, x2, x15, lsl 3
     add    x13, x22, x15, lsl 3
     add    x21, x4, x15, lsl 3
     add    x6, x0, x15, lsl 3
     add    x3, x30, x15, lsl 3

which, while saving one instruction can be harmful overall because the extra shift operation
in the arithmetic instructions can increase the latency of the instruction. I believe the aarch64
rtx costs should convey this information. Do you expect RTX costs to gate this behaviour?

Some occurrences that hurt code size look like:
     cmp    x8, x11, asr 5

being transformed into:
     asr    x12, x11, 5
     cmp    x12, x8, uxtw //zero-extend x8
with the user of the condition code inverted to match the change in order of operands
to the comparisons.
I haven't looked at the RTL dumps yet to figure out why this is happening, it could be a backend
RTL representation issue.

Kyrill


>
> Segher
>
>
> diff --git a/gcc/combine.c b/gcc/combine.c
> index 6b5d000..2c99b4e 100644
> --- a/gcc/combine.c
> +++ b/gcc/combine.c
> @@ -3933,8 +3933,6 @@ try_combine (rtx_insn *i3, rtx_insn *i2, rtx_insn *i1, rtx_insn *i0,
>   	   && XVECLEN (newpat, 0) == 2
>   	   && GET_CODE (XVECEXP (newpat, 0, 0)) == SET
>   	   && GET_CODE (XVECEXP (newpat, 0, 1)) == SET
> -	   && (i1 || set_noop_p (XVECEXP (newpat, 0, 0))
> -		  || set_noop_p (XVECEXP (newpat, 0, 1)))
>   	   && GET_CODE (SET_DEST (XVECEXP (newpat, 0, 0))) != ZERO_EXTRACT
>   	   && GET_CODE (SET_DEST (XVECEXP (newpat, 0, 0))) != STRICT_LOW_PART
>   	   && GET_CODE (SET_DEST (XVECEXP (newpat, 0, 1))) != ZERO_EXTRACT
>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: RFC: 2->2 combine patch
  2016-06-20 13:39   ` RFC: 2->2 combine patch Kyrill Tkachov
@ 2016-06-20 14:38     ` Segher Boessenkool
  0 siblings, 0 replies; 20+ messages in thread
From: Segher Boessenkool @ 2016-06-20 14:38 UTC (permalink / raw)
  To: Kyrill Tkachov; +Cc: Richard Biener, gcc-patches

On Mon, Jun 20, 2016 at 02:39:06PM +0100, Kyrill Tkachov wrote:
> >So I tried out the patch below.  It decreases code size on most targets
> >(mostly fixed length insn targets), and increases it a small bit on some
> >variable length insn targets (doing an op twice, instead of doing it once
> >and doing a move).  It looks to be all good there too, but there are so
> >many changes that it is almost impossible to really check.
> >
> >So: can people try this out with their favourite benchmarks, please?
> 
> I hope to give this a run on AArch64 but I'll probably manage to get to it 
> only next week.
> In the meantime I've had a quick look at some SPEC2006 codegen on aarch64.
> Some benchmarks decrease in size, others increase. One recurring theme I 
> spotted is
> shifts being repeatedly combined with arithmetic operations rather than 
> being computed
> once and reusing the result. For example:
>     lsl    x30, x15, 3
>     add    x4, x5, x30
>     add    x9, x7, x30
>     add    x24, x8, x30
>     add    x10, x0, x30
>     add    x2, x22, x30
> 
> becoming (modulo regalloc fluctuations):
>     add    x14, x2, x15, lsl 3
>     add    x13, x22, x15, lsl 3
>     add    x21, x4, x15, lsl 3
>     add    x6, x0, x15, lsl 3
>     add    x3, x30, x15, lsl 3
> 
> which, while saving one instruction can be harmful overall because the 
> extra shift operation
> in the arithmetic instructions can increase the latency of the instruction. 
> I believe the aarch64
> rtx costs should convey this information. Do you expect RTX costs to gate 
> this behaviour?

Yes, RTX costs are used for *all* of combine's combinations.  So it seems
your add,lsl patterns are the same cost as plain add?  If it were more
expensive, combine would reject this combination.

> Some occurrences that hurt code size look like:
>     cmp    x8, x11, asr 5
> 
> being transformed into:
>     asr    x12, x11, 5
>     cmp    x12, x8, uxtw //zero-extend x8
> with the user of the condition code inverted to match the change in order 
> of operands
> to the comparisons.
> I haven't looked at the RTL dumps yet to figure out why this is happening, 
> it could be a backend
> RTL representation issue.

That could be a target thing yes, hard to tell; it's not clear to me
what combination is made here (if any).


Segher

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] Allow fwprop to undo vectorization harm (PR68961)
  2016-06-15 18:37 ` Richard Sandiford
@ 2016-06-27  9:59   ` Richard Biener
  2016-06-29  8:08     ` Richard Biener
  2016-07-03 17:25     ` Richard Sandiford
  0 siblings, 2 replies; 20+ messages in thread
From: Richard Biener @ 2016-06-27  9:59 UTC (permalink / raw)
  To: Richard Sandiford; +Cc: gcc-patches, segher

On Wed, 15 Jun 2016, Richard Sandiford wrote:

> Richard Biener <rguenther@suse.de> writes:
> > With the proposed cost change for vector construction we will end up
> > vectorizing the testcase in PR68961 again (on x86_64 and likely
> > on ppc64le as well after that target gets adjustments).  Currently
> > we can't optimize that away again noticing the direct overlap of
> > argument and return registers.  The obstackle is
> >
> > (insn 7 4 8 2 (set (reg:V2DF 93)
> >         (vec_concat:V2DF (reg/v:DF 91 [ a ])
> >             (reg/v:DF 92 [ aa ]))) 
> > ...
> > (insn 21 8 24 2 (set (reg:DI 97 [ D.1756 ])
> >         (subreg:DI (reg:TI 88 [ D.1756 ]) 0))
> > (insn 24 21 11 2 (set (reg:DI 100 [+8 ])
> >         (subreg:DI (reg:TI 88 [ D.1756 ]) 8))
> >
> > which we eventually optimize to DFmode subregs of (reg:V2DF 93).
> >
> > First of all simplify_subreg doesn't handle the subregs of a vec_concat
> > (easy fix below).
> >
> > Then combine doesn't like to simplify the multi-use (it tries some
> > parallel it seems).  So I went to forwprop which eventually manages
> > to do this but throws away the result (reg:DF 91) or (reg:DF 92)
> > because it is not a constant.  Thus I allow arbitrary simplification
> > results for SUBREGs of [VEC_]CONCAT operations.  There doesn't seem
> > to be a magic flag to tell it to restrict to the case where all
> > uses can be simplified or so, nor to restrict simplifications to a REG.
> > But I don't see any undesirable simplifications of (subreg 
> > ([vec_]concat)).
> 
> Adding that as a special case to propgate_rtx feels like a hack though :-)
> I think:
> 
> > Index: gcc/fwprop.c
> > ===================================================================
> > *** gcc/fwprop.c	(revision 237286)
> > --- gcc/fwprop.c	(working copy)
> > *************** propagate_rtx (rtx x, machine_mode mode,
> > *** 664,670 ****
> >         || (GET_CODE (new_rtx) == SUBREG
> >   	  && REG_P (SUBREG_REG (new_rtx))
> >   	  && (GET_MODE_SIZE (mode)
> > ! 	      <= GET_MODE_SIZE (GET_MODE (SUBREG_REG (new_rtx))))))
> >       flags |= PR_CAN_APPEAR;
> >     if (!varying_mem_p (new_rtx))
> >       flags |= PR_HANDLE_MEM;
> > --- 664,673 ----
> >         || (GET_CODE (new_rtx) == SUBREG
> >   	  && REG_P (SUBREG_REG (new_rtx))
> >   	  && (GET_MODE_SIZE (mode)
> > ! 	      <= GET_MODE_SIZE (GET_MODE (SUBREG_REG (new_rtx)))))
> > !       || ((GET_CODE (new_rtx) == VEC_CONCAT
> > ! 	   || GET_CODE (new_rtx) == CONCAT)
> > ! 	  && GET_CODE (x) == SUBREG))
> >       flags |= PR_CAN_APPEAR;
> >     if (!varying_mem_p (new_rtx))
> >       flags |= PR_HANDLE_MEM;
> 
> ...this if statement should fundamentally only test new_rtx.
> E.g. we'd want the same thing for any SUBREG inside X.
> 
> How about changing:
> 
>   /* The replacement we made so far is valid, if all of the recursive
>      replacements were valid, or we could simplify everything to
>      a constant.  */
>   return valid_ops || can_appear || CONSTANT_P (tem);
> 
> so that (REG_P (tem) && !HARD_REGISTER_P (tem)) is also valid?
> I suppose that's likely to increase register pressure though,
> if only some uses of new_rtx simplify.  (There again, requiring all
> uses to be replacable could make hot code the hostage of cold code.)

Yes, my fear was about register presure increase for the case not all
uses can be replaced (fwprop doesn't seem to have code to verify or
require that).

I can avoid checking for GET_CODE (x) == SUBREG and add a PR_REG
case to restrict REG_P (tem) && !HARD_REGISTER_P (tem) to the
new_rtx == [VEC_]CONCAT case for example.

Richard.

> Thanks,
> Richard
> 
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] Allow fwprop to undo vectorization harm (PR68961)
  2016-06-27  9:59   ` Richard Biener
@ 2016-06-29  8:08     ` Richard Biener
  2016-07-03 17:25     ` Richard Sandiford
  1 sibling, 0 replies; 20+ messages in thread
From: Richard Biener @ 2016-06-29  8:08 UTC (permalink / raw)
  To: Richard Sandiford; +Cc: gcc-patches, segher

On Mon, 27 Jun 2016, Richard Biener wrote:

> On Wed, 15 Jun 2016, Richard Sandiford wrote:
> 
> > Richard Biener <rguenther@suse.de> writes:
> > > With the proposed cost change for vector construction we will end up
> > > vectorizing the testcase in PR68961 again (on x86_64 and likely
> > > on ppc64le as well after that target gets adjustments).  Currently
> > > we can't optimize that away again noticing the direct overlap of
> > > argument and return registers.  The obstackle is
> > >
> > > (insn 7 4 8 2 (set (reg:V2DF 93)
> > >         (vec_concat:V2DF (reg/v:DF 91 [ a ])
> > >             (reg/v:DF 92 [ aa ]))) 
> > > ...
> > > (insn 21 8 24 2 (set (reg:DI 97 [ D.1756 ])
> > >         (subreg:DI (reg:TI 88 [ D.1756 ]) 0))
> > > (insn 24 21 11 2 (set (reg:DI 100 [+8 ])
> > >         (subreg:DI (reg:TI 88 [ D.1756 ]) 8))
> > >
> > > which we eventually optimize to DFmode subregs of (reg:V2DF 93).
> > >
> > > First of all simplify_subreg doesn't handle the subregs of a vec_concat
> > > (easy fix below).
> > >
> > > Then combine doesn't like to simplify the multi-use (it tries some
> > > parallel it seems).  So I went to forwprop which eventually manages
> > > to do this but throws away the result (reg:DF 91) or (reg:DF 92)
> > > because it is not a constant.  Thus I allow arbitrary simplification
> > > results for SUBREGs of [VEC_]CONCAT operations.  There doesn't seem
> > > to be a magic flag to tell it to restrict to the case where all
> > > uses can be simplified or so, nor to restrict simplifications to a REG.
> > > But I don't see any undesirable simplifications of (subreg 
> > > ([vec_]concat)).
> > 
> > Adding that as a special case to propgate_rtx feels like a hack though :-)
> > I think:
> > 
> > > Index: gcc/fwprop.c
> > > ===================================================================
> > > *** gcc/fwprop.c	(revision 237286)
> > > --- gcc/fwprop.c	(working copy)
> > > *************** propagate_rtx (rtx x, machine_mode mode,
> > > *** 664,670 ****
> > >         || (GET_CODE (new_rtx) == SUBREG
> > >   	  && REG_P (SUBREG_REG (new_rtx))
> > >   	  && (GET_MODE_SIZE (mode)
> > > ! 	      <= GET_MODE_SIZE (GET_MODE (SUBREG_REG (new_rtx))))))
> > >       flags |= PR_CAN_APPEAR;
> > >     if (!varying_mem_p (new_rtx))
> > >       flags |= PR_HANDLE_MEM;
> > > --- 664,673 ----
> > >         || (GET_CODE (new_rtx) == SUBREG
> > >   	  && REG_P (SUBREG_REG (new_rtx))
> > >   	  && (GET_MODE_SIZE (mode)
> > > ! 	      <= GET_MODE_SIZE (GET_MODE (SUBREG_REG (new_rtx)))))
> > > !       || ((GET_CODE (new_rtx) == VEC_CONCAT
> > > ! 	   || GET_CODE (new_rtx) == CONCAT)
> > > ! 	  && GET_CODE (x) == SUBREG))
> > >       flags |= PR_CAN_APPEAR;
> > >     if (!varying_mem_p (new_rtx))
> > >       flags |= PR_HANDLE_MEM;
> > 
> > ...this if statement should fundamentally only test new_rtx.
> > E.g. we'd want the same thing for any SUBREG inside X.
> > 
> > How about changing:
> > 
> >   /* The replacement we made so far is valid, if all of the recursive
> >      replacements were valid, or we could simplify everything to
> >      a constant.  */
> >   return valid_ops || can_appear || CONSTANT_P (tem);
> > 
> > so that (REG_P (tem) && !HARD_REGISTER_P (tem)) is also valid?
> > I suppose that's likely to increase register pressure though,
> > if only some uses of new_rtx simplify.  (There again, requiring all
> > uses to be replacable could make hot code the hostage of cold code.)
> 
> Yes, my fear was about register presure increase for the case not all
> uses can be replaced (fwprop doesn't seem to have code to verify or
> require that).
> 
> I can avoid checking for GET_CODE (x) == SUBREG and add a PR_REG
> case to restrict REG_P (tem) && !HARD_REGISTER_P (tem) to the
> new_rtx == [VEC_]CONCAT case for example.

Btw, I have installed the simplify-rtx.c part now.

Richard.

2016-06-29  Richard Biener  <rguenther@suse.de>

        PR rtl-optimization/68961
        * simplify-rtx.c (simplify_subreg): Handle VEC_CONCAT like CONCAT.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] Allow fwprop to undo vectorization harm (PR68961)
  2016-06-27  9:59   ` Richard Biener
  2016-06-29  8:08     ` Richard Biener
@ 2016-07-03 17:25     ` Richard Sandiford
  2016-07-04 10:48       ` Richard Biener
  1 sibling, 1 reply; 20+ messages in thread
From: Richard Sandiford @ 2016-07-03 17:25 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches, segher

Richard Biener <rguenther@suse.de> writes:
> On Wed, 15 Jun 2016, Richard Sandiford wrote:
>
>> Richard Biener <rguenther@suse.de> writes:
>> > With the proposed cost change for vector construction we will end up
>> > vectorizing the testcase in PR68961 again (on x86_64 and likely
>> > on ppc64le as well after that target gets adjustments).  Currently
>> > we can't optimize that away again noticing the direct overlap of
>> > argument and return registers.  The obstackle is
>> >
>> > (insn 7 4 8 2 (set (reg:V2DF 93)
>> >         (vec_concat:V2DF (reg/v:DF 91 [ a ])
>> >             (reg/v:DF 92 [ aa ]))) 
>> > ...
>> > (insn 21 8 24 2 (set (reg:DI 97 [ D.1756 ])
>> >         (subreg:DI (reg:TI 88 [ D.1756 ]) 0))
>> > (insn 24 21 11 2 (set (reg:DI 100 [+8 ])
>> >         (subreg:DI (reg:TI 88 [ D.1756 ]) 8))
>> >
>> > which we eventually optimize to DFmode subregs of (reg:V2DF 93).
>> >
>> > First of all simplify_subreg doesn't handle the subregs of a vec_concat
>> > (easy fix below).
>> >
>> > Then combine doesn't like to simplify the multi-use (it tries some
>> > parallel it seems).  So I went to forwprop which eventually manages
>> > to do this but throws away the result (reg:DF 91) or (reg:DF 92)
>> > because it is not a constant.  Thus I allow arbitrary simplification
>> > results for SUBREGs of [VEC_]CONCAT operations.  There doesn't seem
>> > to be a magic flag to tell it to restrict to the case where all
>> > uses can be simplified or so, nor to restrict simplifications to a REG.
>> > But I don't see any undesirable simplifications of (subreg 
>> > ([vec_]concat)).
>> 
>> Adding that as a special case to propgate_rtx feels like a hack though :-)
>> I think:
>> 
>> > Index: gcc/fwprop.c
>> > ===================================================================
>> > *** gcc/fwprop.c	(revision 237286)
>> > --- gcc/fwprop.c	(working copy)
>> > *************** propagate_rtx (rtx x, machine_mode mode,
>> > *** 664,670 ****
>> >         || (GET_CODE (new_rtx) == SUBREG
>> >   	  && REG_P (SUBREG_REG (new_rtx))
>> >   	  && (GET_MODE_SIZE (mode)
>> > ! 	      <= GET_MODE_SIZE (GET_MODE (SUBREG_REG (new_rtx))))))
>> >       flags |= PR_CAN_APPEAR;
>> >     if (!varying_mem_p (new_rtx))
>> >       flags |= PR_HANDLE_MEM;
>> > --- 664,673 ----
>> >         || (GET_CODE (new_rtx) == SUBREG
>> >   	  && REG_P (SUBREG_REG (new_rtx))
>> >   	  && (GET_MODE_SIZE (mode)
>> > ! 	      <= GET_MODE_SIZE (GET_MODE (SUBREG_REG (new_rtx)))))
>> > !       || ((GET_CODE (new_rtx) == VEC_CONCAT
>> > ! 	   || GET_CODE (new_rtx) == CONCAT)
>> > ! 	  && GET_CODE (x) == SUBREG))
>> >       flags |= PR_CAN_APPEAR;
>> >     if (!varying_mem_p (new_rtx))
>> >       flags |= PR_HANDLE_MEM;
>> 
>> ...this if statement should fundamentally only test new_rtx.
>> E.g. we'd want the same thing for any SUBREG inside X.
>> 
>> How about changing:
>> 
>>   /* The replacement we made so far is valid, if all of the recursive
>>      replacements were valid, or we could simplify everything to
>>      a constant.  */
>>   return valid_ops || can_appear || CONSTANT_P (tem);
>> 
>> so that (REG_P (tem) && !HARD_REGISTER_P (tem)) is also valid?
>> I suppose that's likely to increase register pressure though,
>> if only some uses of new_rtx simplify.  (There again, requiring all
>> uses to be replacable could make hot code the hostage of cold code.)
>
> Yes, my fear was about register presure increase for the case not all
> uses can be replaced (fwprop doesn't seem to have code to verify or
> require that).
>
> I can avoid checking for GET_CODE (x) == SUBREG and add a PR_REG
> case to restrict REG_P (tem) && !HARD_REGISTER_P (tem) to the
> new_rtx == [VEC_]CONCAT case for example.

I don't think that helps though.  There might be other uses of a
VEC_CONCAT that aren't SUBREGs, in which case we'd have the same
problem of keeping both values live at once.

How about restricting the REG_P (tem) && !HARD_REGISTER_P (tem)
to cases where new_rtx has more words than tem?

Thanks,
Richard


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] Allow fwprop to undo vectorization harm (PR68961)
  2016-07-03 17:25     ` Richard Sandiford
@ 2016-07-04 10:48       ` Richard Biener
  2016-07-05 20:50         ` Richard Sandiford
  0 siblings, 1 reply; 20+ messages in thread
From: Richard Biener @ 2016-07-04 10:48 UTC (permalink / raw)
  To: Richard Sandiford; +Cc: gcc-patches, segher

On Sun, 3 Jul 2016, Richard Sandiford wrote:

> Richard Biener <rguenther@suse.de> writes:
> > On Wed, 15 Jun 2016, Richard Sandiford wrote:
> >
> >> Richard Biener <rguenther@suse.de> writes:
> >> > With the proposed cost change for vector construction we will end up
> >> > vectorizing the testcase in PR68961 again (on x86_64 and likely
> >> > on ppc64le as well after that target gets adjustments).  Currently
> >> > we can't optimize that away again noticing the direct overlap of
> >> > argument and return registers.  The obstackle is
> >> >
> >> > (insn 7 4 8 2 (set (reg:V2DF 93)
> >> >         (vec_concat:V2DF (reg/v:DF 91 [ a ])
> >> >             (reg/v:DF 92 [ aa ]))) 
> >> > ...
> >> > (insn 21 8 24 2 (set (reg:DI 97 [ D.1756 ])
> >> >         (subreg:DI (reg:TI 88 [ D.1756 ]) 0))
> >> > (insn 24 21 11 2 (set (reg:DI 100 [+8 ])
> >> >         (subreg:DI (reg:TI 88 [ D.1756 ]) 8))
> >> >
> >> > which we eventually optimize to DFmode subregs of (reg:V2DF 93).
> >> >
> >> > First of all simplify_subreg doesn't handle the subregs of a vec_concat
> >> > (easy fix below).
> >> >
> >> > Then combine doesn't like to simplify the multi-use (it tries some
> >> > parallel it seems).  So I went to forwprop which eventually manages
> >> > to do this but throws away the result (reg:DF 91) or (reg:DF 92)
> >> > because it is not a constant.  Thus I allow arbitrary simplification
> >> > results for SUBREGs of [VEC_]CONCAT operations.  There doesn't seem
> >> > to be a magic flag to tell it to restrict to the case where all
> >> > uses can be simplified or so, nor to restrict simplifications to a REG.
> >> > But I don't see any undesirable simplifications of (subreg 
> >> > ([vec_]concat)).
> >> 
> >> Adding that as a special case to propgate_rtx feels like a hack though :-)
> >> I think:
> >> 
> >> > Index: gcc/fwprop.c
> >> > ===================================================================
> >> > *** gcc/fwprop.c	(revision 237286)
> >> > --- gcc/fwprop.c	(working copy)
> >> > *************** propagate_rtx (rtx x, machine_mode mode,
> >> > *** 664,670 ****
> >> >         || (GET_CODE (new_rtx) == SUBREG
> >> >   	  && REG_P (SUBREG_REG (new_rtx))
> >> >   	  && (GET_MODE_SIZE (mode)
> >> > ! 	      <= GET_MODE_SIZE (GET_MODE (SUBREG_REG (new_rtx))))))
> >> >       flags |= PR_CAN_APPEAR;
> >> >     if (!varying_mem_p (new_rtx))
> >> >       flags |= PR_HANDLE_MEM;
> >> > --- 664,673 ----
> >> >         || (GET_CODE (new_rtx) == SUBREG
> >> >   	  && REG_P (SUBREG_REG (new_rtx))
> >> >   	  && (GET_MODE_SIZE (mode)
> >> > ! 	      <= GET_MODE_SIZE (GET_MODE (SUBREG_REG (new_rtx)))))
> >> > !       || ((GET_CODE (new_rtx) == VEC_CONCAT
> >> > ! 	   || GET_CODE (new_rtx) == CONCAT)
> >> > ! 	  && GET_CODE (x) == SUBREG))
> >> >       flags |= PR_CAN_APPEAR;
> >> >     if (!varying_mem_p (new_rtx))
> >> >       flags |= PR_HANDLE_MEM;
> >> 
> >> ...this if statement should fundamentally only test new_rtx.
> >> E.g. we'd want the same thing for any SUBREG inside X.
> >> 
> >> How about changing:
> >> 
> >>   /* The replacement we made so far is valid, if all of the recursive
> >>      replacements were valid, or we could simplify everything to
> >>      a constant.  */
> >>   return valid_ops || can_appear || CONSTANT_P (tem);
> >> 
> >> so that (REG_P (tem) && !HARD_REGISTER_P (tem)) is also valid?
> >> I suppose that's likely to increase register pressure though,
> >> if only some uses of new_rtx simplify.  (There again, requiring all
> >> uses to be replacable could make hot code the hostage of cold code.)
> >
> > Yes, my fear was about register presure increase for the case not all
> > uses can be replaced (fwprop doesn't seem to have code to verify or
> > require that).
> >
> > I can avoid checking for GET_CODE (x) == SUBREG and add a PR_REG
> > case to restrict REG_P (tem) && !HARD_REGISTER_P (tem) to the
> > new_rtx == [VEC_]CONCAT case for example.
> 
> I don't think that helps though.  There might be other uses of a
> VEC_CONCAT that aren't SUBREGs, in which case we'd have the same
> problem of keeping both values live at once.
> 
> How about restricting the REG_P (tem) && !HARD_REGISTER_P (tem)
> to cases where new_rtx has more words than tem?

So would you really make a simple mode-size check here?  I wonder
which cases are there other than the subreg of [vec_concat] that
would end up with this case.  That is,

  if (REG_P (tem) && !HARD_REGISTER_P (tem)
      && GET_MODE (tem) == GET_MODE_INNER (GET_MODE (new_rtx))
      && (VECTOR_MODE_P (GET_MODE (new_rtx))
          || COMPLEX_MODE_P (GET_MODE (new_rtx))))
    return true;

works as would constraining new_rtx to [VEC_]CONCAT instead of
vector/complex modes.  I'm worried about relaxing it further
as partial int modes also have a GET_MODE_INNER - relaxing to

      && GET_MODE_INNER (GET_MODE (new_rtx)) != GET_MODE (new_rtx)

I mean.  If we restrict it to [VEC_]CONCAT we can even allow
any REG_P && !HARD_REGISTER_P as I can't see any outer (non-noop)
operation that will simplify into a REG_P besides a subreg.

But the above form would capture the intent to allow simplifying
operations on vector / complex to a scalar component.

Which means I can test the following if that looks better
than what I had originally.  There's Seghers 2->2 combine
patch but that has wider impact as well.

Thanks,
Richard.

2016-07-04  Richard Biener  <rguenther@suse.de>

	PR rtl-optimization/68961
	* fwprop.c (propagate_rtx): Allow SUBREGs of VEC_CONCAT and CONCAT
	to simplify to a non-constant.

	* gcc.target/i386/pr68961.c: New testcase.

Index: gcc/testsuite/gcc.target/i386/pr68961.c
===================================================================
*** gcc/testsuite/gcc.target/i386/pr68961.c	(revision 0)
--- gcc/testsuite/gcc.target/i386/pr68961.c	(working copy)
***************
*** 0 ****
--- 1,19 ----
+ /* { dg-do compile { target lp64 } } */
+ /* { dg-options "-O3 -fno-vect-cost-model -fdump-tree-slp2-details" } */
+ 
+ struct x { double d[2]; };
+ 
+ struct x
+ pack (double a, double aa)
+ {
+   struct x u;
+   u.d[0] = a;
+   u.d[1] = aa;
+   return u;
+ }
+ 
+ /* The function should be optimized to just return as arguments and
+    result exactly overlap even when previously vectorized.  */
+ 
+ /* { dg-final { scan-tree-dump "basic block vectorized" "slp2" } } */
+ /* { dg-final { scan-assembler-not "mov" } } */
Index: gcc/fwprop.c
===================================================================
*** gcc/fwprop.c	(revision 237955)
--- gcc/fwprop.c	(working copy)
*************** propagate_rtx_1 (rtx *px, rtx old_rtx, r
*** 619,624 ****
--- 619,633 ----
  
    *px = tem;
  
+   /* Allow replacements that simplify operations on a vector or complex
+      value to a component.  The most prominent case is
+      (subreg ([vec_]concat ...)).   */
+   if (REG_P (tem) && !HARD_REGISTER_P (tem)
+       && GET_MODE (tem) == GET_MODE_INNER (GET_MODE (new_rtx))
+       && (VECTOR_MODE_P (GET_MODE (new_rtx))
+ 	  || COMPLEX_MODE_P (GET_MODE (new_rtx))))
+     return true;
+ 
    /* The replacement we made so far is valid, if all of the recursive
       replacements were valid, or we could simplify everything to
       a constant.  */

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] Allow fwprop to undo vectorization harm (PR68961)
  2016-07-04 10:48       ` Richard Biener
@ 2016-07-05 20:50         ` Richard Sandiford
  2016-07-06 13:18           ` Richard Biener
  0 siblings, 1 reply; 20+ messages in thread
From: Richard Sandiford @ 2016-07-05 20:50 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches, segher

Richard Biener <rguenther@suse.de> writes:
> On Sun, 3 Jul 2016, Richard Sandiford wrote:
>
>> Richard Biener <rguenther@suse.de> writes:
>> > On Wed, 15 Jun 2016, Richard Sandiford wrote:
>> >
>> >> Richard Biener <rguenther@suse.de> writes:
>> >> > With the proposed cost change for vector construction we will end up
>> >> > vectorizing the testcase in PR68961 again (on x86_64 and likely
>> >> > on ppc64le as well after that target gets adjustments).  Currently
>> >> > we can't optimize that away again noticing the direct overlap of
>> >> > argument and return registers.  The obstackle is
>> >> >
>> >> > (insn 7 4 8 2 (set (reg:V2DF 93)
>> >> >         (vec_concat:V2DF (reg/v:DF 91 [ a ])
>> >> >             (reg/v:DF 92 [ aa ]))) 
>> >> > ...
>> >> > (insn 21 8 24 2 (set (reg:DI 97 [ D.1756 ])
>> >> >         (subreg:DI (reg:TI 88 [ D.1756 ]) 0))
>> >> > (insn 24 21 11 2 (set (reg:DI 100 [+8 ])
>> >> >         (subreg:DI (reg:TI 88 [ D.1756 ]) 8))
>> >> >
>> >> > which we eventually optimize to DFmode subregs of (reg:V2DF 93).
>> >> >
>> >> > First of all simplify_subreg doesn't handle the subregs of a vec_concat
>> >> > (easy fix below).
>> >> >
>> >> > Then combine doesn't like to simplify the multi-use (it tries some
>> >> > parallel it seems).  So I went to forwprop which eventually manages
>> >> > to do this but throws away the result (reg:DF 91) or (reg:DF 92)
>> >> > because it is not a constant.  Thus I allow arbitrary simplification
>> >> > results for SUBREGs of [VEC_]CONCAT operations.  There doesn't seem
>> >> > to be a magic flag to tell it to restrict to the case where all
>> >> > uses can be simplified or so, nor to restrict simplifications to a REG.
>> >> > But I don't see any undesirable simplifications of (subreg 
>> >> > ([vec_]concat)).
>> >> 
>> >> Adding that as a special case to propgate_rtx feels like a hack though :-)
>> >> I think:
>> >> 
>> >> > Index: gcc/fwprop.c
>> >> > ===================================================================
>> >> > *** gcc/fwprop.c	(revision 237286)
>> >> > --- gcc/fwprop.c	(working copy)
>> >> > *************** propagate_rtx (rtx x, machine_mode mode,
>> >> > *** 664,670 ****
>> >> >         || (GET_CODE (new_rtx) == SUBREG
>> >> >   	  && REG_P (SUBREG_REG (new_rtx))
>> >> >   	  && (GET_MODE_SIZE (mode)
>> >> > ! 	      <= GET_MODE_SIZE (GET_MODE (SUBREG_REG (new_rtx))))))
>> >> >       flags |= PR_CAN_APPEAR;
>> >> >     if (!varying_mem_p (new_rtx))
>> >> >       flags |= PR_HANDLE_MEM;
>> >> > --- 664,673 ----
>> >> >         || (GET_CODE (new_rtx) == SUBREG
>> >> >   	  && REG_P (SUBREG_REG (new_rtx))
>> >> >   	  && (GET_MODE_SIZE (mode)
>> >> > ! 	      <= GET_MODE_SIZE (GET_MODE (SUBREG_REG (new_rtx)))))
>> >> > !       || ((GET_CODE (new_rtx) == VEC_CONCAT
>> >> > ! 	   || GET_CODE (new_rtx) == CONCAT)
>> >> > ! 	  && GET_CODE (x) == SUBREG))
>> >> >       flags |= PR_CAN_APPEAR;
>> >> >     if (!varying_mem_p (new_rtx))
>> >> >       flags |= PR_HANDLE_MEM;
>> >> 
>> >> ...this if statement should fundamentally only test new_rtx.
>> >> E.g. we'd want the same thing for any SUBREG inside X.
>> >> 
>> >> How about changing:
>> >> 
>> >>   /* The replacement we made so far is valid, if all of the recursive
>> >>      replacements were valid, or we could simplify everything to
>> >>      a constant.  */
>> >>   return valid_ops || can_appear || CONSTANT_P (tem);
>> >> 
>> >> so that (REG_P (tem) && !HARD_REGISTER_P (tem)) is also valid?
>> >> I suppose that's likely to increase register pressure though,
>> >> if only some uses of new_rtx simplify.  (There again, requiring all
>> >> uses to be replacable could make hot code the hostage of cold code.)
>> >
>> > Yes, my fear was about register presure increase for the case not all
>> > uses can be replaced (fwprop doesn't seem to have code to verify or
>> > require that).
>> >
>> > I can avoid checking for GET_CODE (x) == SUBREG and add a PR_REG
>> > case to restrict REG_P (tem) && !HARD_REGISTER_P (tem) to the
>> > new_rtx == [VEC_]CONCAT case for example.
>> 
>> I don't think that helps though.  There might be other uses of a
>> VEC_CONCAT that aren't SUBREGs, in which case we'd have the same
>> problem of keeping both values live at once.
>> 
>> How about restricting the REG_P (tem) && !HARD_REGISTER_P (tem)
>> to cases where new_rtx has more words than tem?
>
> So would you really make a simple mode-size check here?

I thought a mode check would be worth trying since (for better or worse)
words are special for subregs.  But...

> I wonder which cases are there other than the subreg of [vec_concat]
> that would end up with this case.  That is,
>
>   if (REG_P (tem) && !HARD_REGISTER_P (tem)
>       && GET_MODE (tem) == GET_MODE_INNER (GET_MODE (new_rtx))
>       && (VECTOR_MODE_P (GET_MODE (new_rtx))
>           || COMPLEX_MODE_P (GET_MODE (new_rtx))))
>     return true;

this looks good to me too FWIW.  I think it reads more naturally if
you do the GET_MODE_INNER check last.

> works as would constraining new_rtx to [VEC_]CONCAT instead of
> vector/complex modes.  I'm worried about relaxing it further
> as partial int modes also have a GET_MODE_INNER - relaxing to
>
>       && GET_MODE_INNER (GET_MODE (new_rtx)) != GET_MODE (new_rtx)
>
> I mean.  If we restrict it to [VEC_]CONCAT we can even allow
> any REG_P && !HARD_REGISTER_P as I can't see any outer (non-noop)
> operation that will simplify into a REG_P besides a subreg.

I think the simplification could also trigger for VEC_DUPLICATE and
VEC_MERGE, so the patch seems better than a check for specific codes.

> But the above form would capture the intent to allow simplifying
> operations on vector / complex to a scalar component.
>
> Which means I can test the following if that looks better
> than what I had originally.  There's Seghers 2->2 combine
> patch but that has wider impact as well.
>
> Thanks,
> Richard.
>
> 2016-07-04  Richard Biener  <rguenther@suse.de>
>
> 	PR rtl-optimization/68961
> 	* fwprop.c (propagate_rtx): Allow SUBREGs of VEC_CONCAT and CONCAT
> 	to simplify to a non-constant.
>
> 	* gcc.target/i386/pr68961.c: New testcase.

Thanks, LGTM.

Richard

>
> Index: gcc/testsuite/gcc.target/i386/pr68961.c
> ===================================================================
> *** gcc/testsuite/gcc.target/i386/pr68961.c	(revision 0)
> --- gcc/testsuite/gcc.target/i386/pr68961.c	(working copy)
> ***************
> *** 0 ****
> --- 1,19 ----
> + /* { dg-do compile { target lp64 } } */
> + /* { dg-options "-O3 -fno-vect-cost-model -fdump-tree-slp2-details" } */
> + 
> + struct x { double d[2]; };
> + 
> + struct x
> + pack (double a, double aa)
> + {
> +   struct x u;
> +   u.d[0] = a;
> +   u.d[1] = aa;
> +   return u;
> + }
> + 
> + /* The function should be optimized to just return as arguments and
> +    result exactly overlap even when previously vectorized.  */
> + 
> + /* { dg-final { scan-tree-dump "basic block vectorized" "slp2" } } */
> + /* { dg-final { scan-assembler-not "mov" } } */
> Index: gcc/fwprop.c
> ===================================================================
> *** gcc/fwprop.c	(revision 237955)
> --- gcc/fwprop.c	(working copy)
> *************** propagate_rtx_1 (rtx *px, rtx old_rtx, r
> *** 619,624 ****
> --- 619,633 ----
>   
>     *px = tem;
>   
> +   /* Allow replacements that simplify operations on a vector or complex
> +      value to a component.  The most prominent case is
> +      (subreg ([vec_]concat ...)).   */
> +   if (REG_P (tem) && !HARD_REGISTER_P (tem)
> +       && GET_MODE (tem) == GET_MODE_INNER (GET_MODE (new_rtx))
> +       && (VECTOR_MODE_P (GET_MODE (new_rtx))
> + 	  || COMPLEX_MODE_P (GET_MODE (new_rtx))))
> +     return true;
> + 
>     /* The replacement we made so far is valid, if all of the recursive
>        replacements were valid, or we could simplify everything to
>        a constant.  */

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] Allow fwprop to undo vectorization harm (PR68961)
  2016-07-05 20:50         ` Richard Sandiford
@ 2016-07-06 13:18           ` Richard Biener
  2016-07-10  8:13             ` Uros Bizjak
  0 siblings, 1 reply; 20+ messages in thread
From: Richard Biener @ 2016-07-06 13:18 UTC (permalink / raw)
  To: Richard Sandiford; +Cc: gcc-patches, segher, ubizjak

On Tue, 5 Jul 2016, Richard Sandiford wrote:

> Richard Biener <rguenther@suse.de> writes:
> > On Sun, 3 Jul 2016, Richard Sandiford wrote:
> >
> >> Richard Biener <rguenther@suse.de> writes:
> >> > On Wed, 15 Jun 2016, Richard Sandiford wrote:
> >> >
> >> >> Richard Biener <rguenther@suse.de> writes:
> >> >> > With the proposed cost change for vector construction we will end up
> >> >> > vectorizing the testcase in PR68961 again (on x86_64 and likely
> >> >> > on ppc64le as well after that target gets adjustments).  Currently
> >> >> > we can't optimize that away again noticing the direct overlap of
> >> >> > argument and return registers.  The obstackle is
> >> >> >
> >> >> > (insn 7 4 8 2 (set (reg:V2DF 93)
> >> >> >         (vec_concat:V2DF (reg/v:DF 91 [ a ])
> >> >> >             (reg/v:DF 92 [ aa ]))) 
> >> >> > ...
> >> >> > (insn 21 8 24 2 (set (reg:DI 97 [ D.1756 ])
> >> >> >         (subreg:DI (reg:TI 88 [ D.1756 ]) 0))
> >> >> > (insn 24 21 11 2 (set (reg:DI 100 [+8 ])
> >> >> >         (subreg:DI (reg:TI 88 [ D.1756 ]) 8))
> >> >> >
> >> >> > which we eventually optimize to DFmode subregs of (reg:V2DF 93).
> >> >> >
> >> >> > First of all simplify_subreg doesn't handle the subregs of a vec_concat
> >> >> > (easy fix below).
> >> >> >
> >> >> > Then combine doesn't like to simplify the multi-use (it tries some
> >> >> > parallel it seems).  So I went to forwprop which eventually manages
> >> >> > to do this but throws away the result (reg:DF 91) or (reg:DF 92)
> >> >> > because it is not a constant.  Thus I allow arbitrary simplification
> >> >> > results for SUBREGs of [VEC_]CONCAT operations.  There doesn't seem
> >> >> > to be a magic flag to tell it to restrict to the case where all
> >> >> > uses can be simplified or so, nor to restrict simplifications to a REG.
> >> >> > But I don't see any undesirable simplifications of (subreg 
> >> >> > ([vec_]concat)).
> >> >> 
> >> >> Adding that as a special case to propgate_rtx feels like a hack though :-)
> >> >> I think:
> >> >> 
> >> >> > Index: gcc/fwprop.c
> >> >> > ===================================================================
> >> >> > *** gcc/fwprop.c	(revision 237286)
> >> >> > --- gcc/fwprop.c	(working copy)
> >> >> > *************** propagate_rtx (rtx x, machine_mode mode,
> >> >> > *** 664,670 ****
> >> >> >         || (GET_CODE (new_rtx) == SUBREG
> >> >> >   	  && REG_P (SUBREG_REG (new_rtx))
> >> >> >   	  && (GET_MODE_SIZE (mode)
> >> >> > ! 	      <= GET_MODE_SIZE (GET_MODE (SUBREG_REG (new_rtx))))))
> >> >> >       flags |= PR_CAN_APPEAR;
> >> >> >     if (!varying_mem_p (new_rtx))
> >> >> >       flags |= PR_HANDLE_MEM;
> >> >> > --- 664,673 ----
> >> >> >         || (GET_CODE (new_rtx) == SUBREG
> >> >> >   	  && REG_P (SUBREG_REG (new_rtx))
> >> >> >   	  && (GET_MODE_SIZE (mode)
> >> >> > ! 	      <= GET_MODE_SIZE (GET_MODE (SUBREG_REG (new_rtx)))))
> >> >> > !       || ((GET_CODE (new_rtx) == VEC_CONCAT
> >> >> > ! 	   || GET_CODE (new_rtx) == CONCAT)
> >> >> > ! 	  && GET_CODE (x) == SUBREG))
> >> >> >       flags |= PR_CAN_APPEAR;
> >> >> >     if (!varying_mem_p (new_rtx))
> >> >> >       flags |= PR_HANDLE_MEM;
> >> >> 
> >> >> ...this if statement should fundamentally only test new_rtx.
> >> >> E.g. we'd want the same thing for any SUBREG inside X.
> >> >> 
> >> >> How about changing:
> >> >> 
> >> >>   /* The replacement we made so far is valid, if all of the recursive
> >> >>      replacements were valid, or we could simplify everything to
> >> >>      a constant.  */
> >> >>   return valid_ops || can_appear || CONSTANT_P (tem);
> >> >> 
> >> >> so that (REG_P (tem) && !HARD_REGISTER_P (tem)) is also valid?
> >> >> I suppose that's likely to increase register pressure though,
> >> >> if only some uses of new_rtx simplify.  (There again, requiring all
> >> >> uses to be replacable could make hot code the hostage of cold code.)
> >> >
> >> > Yes, my fear was about register presure increase for the case not all
> >> > uses can be replaced (fwprop doesn't seem to have code to verify or
> >> > require that).
> >> >
> >> > I can avoid checking for GET_CODE (x) == SUBREG and add a PR_REG
> >> > case to restrict REG_P (tem) && !HARD_REGISTER_P (tem) to the
> >> > new_rtx == [VEC_]CONCAT case for example.
> >> 
> >> I don't think that helps though.  There might be other uses of a
> >> VEC_CONCAT that aren't SUBREGs, in which case we'd have the same
> >> problem of keeping both values live at once.
> >> 
> >> How about restricting the REG_P (tem) && !HARD_REGISTER_P (tem)
> >> to cases where new_rtx has more words than tem?
> >
> > So would you really make a simple mode-size check here?
> 
> I thought a mode check would be worth trying since (for better or worse)
> words are special for subregs.  But...
> 
> > I wonder which cases are there other than the subreg of [vec_concat]
> > that would end up with this case.  That is,
> >
> >   if (REG_P (tem) && !HARD_REGISTER_P (tem)
> >       && GET_MODE (tem) == GET_MODE_INNER (GET_MODE (new_rtx))
> >       && (VECTOR_MODE_P (GET_MODE (new_rtx))
> >           || COMPLEX_MODE_P (GET_MODE (new_rtx))))
> >     return true;
> 
> this looks good to me too FWIW.  I think it reads more naturally if
> you do the GET_MODE_INNER check last.
> 
> > works as would constraining new_rtx to [VEC_]CONCAT instead of
> > vector/complex modes.  I'm worried about relaxing it further
> > as partial int modes also have a GET_MODE_INNER - relaxing to
> >
> >       && GET_MODE_INNER (GET_MODE (new_rtx)) != GET_MODE (new_rtx)
> >
> > I mean.  If we restrict it to [VEC_]CONCAT we can even allow
> > any REG_P && !HARD_REGISTER_P as I can't see any outer (non-noop)
> > operation that will simplify into a REG_P besides a subreg.
> 
> I think the simplification could also trigger for VEC_DUPLICATE and
> VEC_MERGE, so the patch seems better than a check for specific codes.
> 
> > But the above form would capture the intent to allow simplifying
> > operations on vector / complex to a scalar component.
> >
> > Which means I can test the following if that looks better
> > than what I had originally.  There's Seghers 2->2 combine
> > patch but that has wider impact as well.
> >
> > Thanks,
> > Richard.
> >
> > 2016-07-04  Richard Biener  <rguenther@suse.de>
> >
> > 	PR rtl-optimization/68961
> > 	* fwprop.c (propagate_rtx): Allow SUBREGs of VEC_CONCAT and CONCAT
> > 	to simplify to a non-constant.
> >
> > 	* gcc.target/i386/pr68961.c: New testcase.
> 
> Thanks, LGTM.

Bootstrapped and tested on x86_64-unknown-linux-gnu, it causes

FAIL: gcc.target/i386/sse2-load-multi.c scan-assembler-times movup 2

as the peephole created for that testcase no longer applies as fwprop
does

In insn 10, replacing
 (vec_concat:V2DF (vec_select:DF (reg:V2DF 91)
            (parallel [
                    (const_int 0 [0])
                ]))
        (mem:DF (reg/f:DI 95) [0  S8 A128]))
 with (vec_concat:V2DF (reg:DF 93 [ MEM[(const double *)&a + 8B] ])
        (mem:DF (reg/f:DI 95) [0  S8 A128]))
Changed insn 10

resulting in

        movsd   a+8(%rip), %xmm0
        movhpd  a+16(%rip), %xmm0

again rather than movupd.

Uros, there is probably a missing peephole for the new form - can you
fix this as a followup or should I hold on this patch for a bit longer?

Thanks,
Richard.

2016-07-06  Richard Biener  <rguenther@suse.de>

	PR rtl-optimization/68961
	* fwprop.c (propagate_rtx): Allow SUBREGs of VEC_CONCAT and CONCAT
	to simplify to a non-constant.

	* gcc.target/i386/pr68961.c: New testcase.

Index: gcc/testsuite/gcc.target/i386/pr68961.c
===================================================================
*** gcc/testsuite/gcc.target/i386/pr68961.c	(revision 0)
--- gcc/testsuite/gcc.target/i386/pr68961.c	(working copy)
***************
*** 0 ****
--- 1,19 ----
+ /* { dg-do compile { target lp64 } } */
+ /* { dg-options "-O3 -fno-vect-cost-model -fdump-tree-slp2-details" } */
+ 
+ struct x { double d[2]; };
+ 
+ struct x
+ pack (double a, double aa)
+ {
+   struct x u;
+   u.d[0] = a;
+   u.d[1] = aa;
+   return u;
+ }
+ 
+ /* The function should be optimized to just return as arguments and
+    result exactly overlap even when previously vectorized.  */
+ 
+ /* { dg-final { scan-tree-dump "basic block vectorized" "slp2" } } */
+ /* { dg-final { scan-assembler-not "mov" } } */
Index: gcc/fwprop.c
===================================================================
*** gcc/fwprop.c	(revision 238005)
--- gcc/fwprop.c	(working copy)
*************** propagate_rtx_1 (rtx *px, rtx old_rtx, r
*** 619,624 ****
--- 619,633 ----
  
    *px = tem;
  
+   /* Allow replacements that simplify operations on a vector or complex
+      value to a component.  The most prominent case is
+      (subreg ([vec_]concat ...)).   */
+   if (REG_P (tem) && !HARD_REGISTER_P (tem)
+       && (VECTOR_MODE_P (GET_MODE (new_rtx))
+ 	  || COMPLEX_MODE_P (GET_MODE (new_rtx)))
+       && GET_MODE (tem) == GET_MODE_INNER (GET_MODE (new_rtx)))
+     return true;
+ 
    /* The replacement we made so far is valid, if all of the recursive
       replacements were valid, or we could simplify everything to
       a constant.  */

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] Allow fwprop to undo vectorization harm (PR68961)
  2016-07-06 13:18           ` Richard Biener
@ 2016-07-10  8:13             ` Uros Bizjak
  2016-07-12  8:59               ` Richard Biener
  2016-07-12 22:08               ` Uros Bizjak
  0 siblings, 2 replies; 20+ messages in thread
From: Uros Bizjak @ 2016-07-10  8:13 UTC (permalink / raw)
  To: Richard Biener; +Cc: Richard Sandiford, gcc-patches, Segher Boessenkool

On Wed, Jul 6, 2016 at 3:18 PM, Richard Biener <rguenther@suse.de> wrote:

>> > 2016-07-04  Richard Biener  <rguenther@suse.de>
>> >
>> >     PR rtl-optimization/68961
>> >     * fwprop.c (propagate_rtx): Allow SUBREGs of VEC_CONCAT and CONCAT
>> >     to simplify to a non-constant.
>> >
>> >     * gcc.target/i386/pr68961.c: New testcase.
>>
>> Thanks, LGTM.
>
> Bootstrapped and tested on x86_64-unknown-linux-gnu, it causes
>
> FAIL: gcc.target/i386/sse2-load-multi.c scan-assembler-times movup 2
>
> as the peephole created for that testcase no longer applies as fwprop
> does
>
> In insn 10, replacing
>  (vec_concat:V2DF (vec_select:DF (reg:V2DF 91)
>             (parallel [
>                     (const_int 0 [0])
>                 ]))
>         (mem:DF (reg/f:DI 95) [0  S8 A128]))
>  with (vec_concat:V2DF (reg:DF 93 [ MEM[(const double *)&a + 8B] ])
>         (mem:DF (reg/f:DI 95) [0  S8 A128]))
> Changed insn 10
>
> resulting in
>
>         movsd   a+8(%rip), %xmm0
>         movhpd  a+16(%rip), %xmm0
>
> again rather than movupd.
>
> Uros, there is probably a missing peephole for the new form - can you
> fix this as a followup or should I hold on this patch for a bit longer?

No, please proceed with the patch, I'll fix this fallout with a
followup patch in a couple of days.

Uros.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] Allow fwprop to undo vectorization harm (PR68961)
  2016-07-10  8:13             ` Uros Bizjak
@ 2016-07-12  8:59               ` Richard Biener
  2016-07-12  9:09                 ` Uros Bizjak
  2016-07-12 22:08               ` Uros Bizjak
  1 sibling, 1 reply; 20+ messages in thread
From: Richard Biener @ 2016-07-12  8:59 UTC (permalink / raw)
  To: Uros Bizjak; +Cc: Richard Sandiford, gcc-patches, Segher Boessenkool

On Sun, 10 Jul 2016, Uros Bizjak wrote:

> On Wed, Jul 6, 2016 at 3:18 PM, Richard Biener <rguenther@suse.de> wrote:
> 
> >> > 2016-07-04  Richard Biener  <rguenther@suse.de>
> >> >
> >> >     PR rtl-optimization/68961
> >> >     * fwprop.c (propagate_rtx): Allow SUBREGs of VEC_CONCAT and CONCAT
> >> >     to simplify to a non-constant.
> >> >
> >> >     * gcc.target/i386/pr68961.c: New testcase.
> >>
> >> Thanks, LGTM.
> >
> > Bootstrapped and tested on x86_64-unknown-linux-gnu, it causes
> >
> > FAIL: gcc.target/i386/sse2-load-multi.c scan-assembler-times movup 2
> >
> > as the peephole created for that testcase no longer applies as fwprop
> > does
> >
> > In insn 10, replacing
> >  (vec_concat:V2DF (vec_select:DF (reg:V2DF 91)
> >             (parallel [
> >                     (const_int 0 [0])
> >                 ]))
> >         (mem:DF (reg/f:DI 95) [0  S8 A128]))
> >  with (vec_concat:V2DF (reg:DF 93 [ MEM[(const double *)&a + 8B] ])
> >         (mem:DF (reg/f:DI 95) [0  S8 A128]))
> > Changed insn 10
> >
> > resulting in
> >
> >         movsd   a+8(%rip), %xmm0
> >         movhpd  a+16(%rip), %xmm0
> >
> > again rather than movupd.
> >
> > Uros, there is probably a missing peephole for the new form - can you
> > fix this as a followup or should I hold on this patch for a bit longer?
> 
> No, please proceed with the patch, I'll fix this fallout with a
> followup patch in a couple of days.

Applied as r238238.  Is the following x86 change ok then which
adjusts the vectorizer vector construction cost to sth more sensible?
I have adjusted the generic implementation in targhooks.c this way
a few weeks ago already.

Thanks,
Richard.

2016-07-12  Richard Biener  <rguenther@suse.de>

	* targhooks.c (default_builtin_vectorization_cost): Adjust
	vec_construct cost.
	* config/i386/i386.c (ix86_builtin_vectorization_cost): Likewise.

Index: gcc/config/i386/i386.c
===================================================================
--- gcc/config/i386/i386.c	(revision 237196)
+++ gcc/config/i386/i386.c	(working copy)
@@ -49503,8 +49520,6 @@ static int
 ix86_builtin_vectorization_cost (enum vect_cost_for_stmt type_of_cost,
                                  tree vectype, int)
 {
-  unsigned elements;
-
   switch (type_of_cost)
     {
       case scalar_stmt:
@@ -49546,8 +49561,7 @@ ix86_builtin_vectorization_cost (enum ve
         return ix86_cost->vec_stmt_cost;
 
       case vec_construct:
-	elements = TYPE_VECTOR_SUBPARTS (vectype);
-	return ix86_cost->vec_stmt_cost * (elements / 2 + 1);
+	return ix86_cost->vec_stmt_cost * (TYPE_VECTOR_SUBPARTS (vectype) - 1);
 
       default:
         gcc_unreachable ();

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] Allow fwprop to undo vectorization harm (PR68961)
  2016-07-12  8:59               ` Richard Biener
@ 2016-07-12  9:09                 ` Uros Bizjak
  2016-07-12  9:14                   ` Richard Biener
  0 siblings, 1 reply; 20+ messages in thread
From: Uros Bizjak @ 2016-07-12  9:09 UTC (permalink / raw)
  To: Richard Biener
  Cc: Richard Sandiford, gcc-patches, Segher Boessenkool, Kirill Yukhin

On Tue, Jul 12, 2016 at 10:58 AM, Richard Biener <rguenther@suse.de> wrote:
> On Sun, 10 Jul 2016, Uros Bizjak wrote:
>
>> On Wed, Jul 6, 2016 at 3:18 PM, Richard Biener <rguenther@suse.de> wrote:
>>
>> >> > 2016-07-04  Richard Biener  <rguenther@suse.de>
>> >> >
>> >> >     PR rtl-optimization/68961
>> >> >     * fwprop.c (propagate_rtx): Allow SUBREGs of VEC_CONCAT and CONCAT
>> >> >     to simplify to a non-constant.
>> >> >
>> >> >     * gcc.target/i386/pr68961.c: New testcase.
>> >>
>> >> Thanks, LGTM.
>> >
>> > Bootstrapped and tested on x86_64-unknown-linux-gnu, it causes
>> >
>> > FAIL: gcc.target/i386/sse2-load-multi.c scan-assembler-times movup 2
>> >
>> > as the peephole created for that testcase no longer applies as fwprop
>> > does
>> >
>> > In insn 10, replacing
>> >  (vec_concat:V2DF (vec_select:DF (reg:V2DF 91)
>> >             (parallel [
>> >                     (const_int 0 [0])
>> >                 ]))
>> >         (mem:DF (reg/f:DI 95) [0  S8 A128]))
>> >  with (vec_concat:V2DF (reg:DF 93 [ MEM[(const double *)&a + 8B] ])
>> >         (mem:DF (reg/f:DI 95) [0  S8 A128]))
>> > Changed insn 10
>> >
>> > resulting in
>> >
>> >         movsd   a+8(%rip), %xmm0
>> >         movhpd  a+16(%rip), %xmm0
>> >
>> > again rather than movupd.
>> >
>> > Uros, there is probably a missing peephole for the new form - can you
>> > fix this as a followup or should I hold on this patch for a bit longer?
>>
>> No, please proceed with the patch, I'll fix this fallout with a
>> followup patch in a couple of days.
>
> Applied as r238238.  Is the following x86 change ok then which
> adjusts the vectorizer vector construction cost to sth more sensible?
> I have adjusted the generic implementation in targhooks.c this way
> a few weeks ago already.
>
> Thanks,
> Richard.
>
> 2016-07-12  Richard Biener  <rguenther@suse.de>
>
>         * targhooks.c (default_builtin_vectorization_cost): Adjust
>         vec_construct cost.
>         * config/i386/i386.c (ix86_builtin_vectorization_cost): Likewise.

Looks OK to me, but let's also give Intel chance to comment.

Uros.

> Index: gcc/config/i386/i386.c
> ===================================================================
> --- gcc/config/i386/i386.c      (revision 237196)
> +++ gcc/config/i386/i386.c      (working copy)
> @@ -49503,8 +49520,6 @@ static int
>  ix86_builtin_vectorization_cost (enum vect_cost_for_stmt type_of_cost,
>                                   tree vectype, int)
>  {
> -  unsigned elements;
> -
>    switch (type_of_cost)
>      {
>        case scalar_stmt:
> @@ -49546,8 +49561,7 @@ ix86_builtin_vectorization_cost (enum ve
>          return ix86_cost->vec_stmt_cost;
>
>        case vec_construct:
> -       elements = TYPE_VECTOR_SUBPARTS (vectype);
> -       return ix86_cost->vec_stmt_cost * (elements / 2 + 1);
> +       return ix86_cost->vec_stmt_cost * (TYPE_VECTOR_SUBPARTS (vectype) - 1);
>
>        default:
>          gcc_unreachable ();

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] Allow fwprop to undo vectorization harm (PR68961)
  2016-07-12  9:09                 ` Uros Bizjak
@ 2016-07-12  9:14                   ` Richard Biener
  2016-07-15  7:35                     ` Richard Biener
  0 siblings, 1 reply; 20+ messages in thread
From: Richard Biener @ 2016-07-12  9:14 UTC (permalink / raw)
  To: Uros Bizjak
  Cc: Richard Sandiford, gcc-patches, Segher Boessenkool, Kirill Yukhin

On Tue, 12 Jul 2016, Uros Bizjak wrote:

> On Tue, Jul 12, 2016 at 10:58 AM, Richard Biener <rguenther@suse.de> wrote:
> > On Sun, 10 Jul 2016, Uros Bizjak wrote:
> >
> >> On Wed, Jul 6, 2016 at 3:18 PM, Richard Biener <rguenther@suse.de> wrote:
> >>
> >> >> > 2016-07-04  Richard Biener  <rguenther@suse.de>
> >> >> >
> >> >> >     PR rtl-optimization/68961
> >> >> >     * fwprop.c (propagate_rtx): Allow SUBREGs of VEC_CONCAT and CONCAT
> >> >> >     to simplify to a non-constant.
> >> >> >
> >> >> >     * gcc.target/i386/pr68961.c: New testcase.
> >> >>
> >> >> Thanks, LGTM.
> >> >
> >> > Bootstrapped and tested on x86_64-unknown-linux-gnu, it causes
> >> >
> >> > FAIL: gcc.target/i386/sse2-load-multi.c scan-assembler-times movup 2
> >> >
> >> > as the peephole created for that testcase no longer applies as fwprop
> >> > does
> >> >
> >> > In insn 10, replacing
> >> >  (vec_concat:V2DF (vec_select:DF (reg:V2DF 91)
> >> >             (parallel [
> >> >                     (const_int 0 [0])
> >> >                 ]))
> >> >         (mem:DF (reg/f:DI 95) [0  S8 A128]))
> >> >  with (vec_concat:V2DF (reg:DF 93 [ MEM[(const double *)&a + 8B] ])
> >> >         (mem:DF (reg/f:DI 95) [0  S8 A128]))
> >> > Changed insn 10
> >> >
> >> > resulting in
> >> >
> >> >         movsd   a+8(%rip), %xmm0
> >> >         movhpd  a+16(%rip), %xmm0
> >> >
> >> > again rather than movupd.
> >> >
> >> > Uros, there is probably a missing peephole for the new form - can you
> >> > fix this as a followup or should I hold on this patch for a bit longer?
> >>
> >> No, please proceed with the patch, I'll fix this fallout with a
> >> followup patch in a couple of days.
> >
> > Applied as r238238.  Is the following x86 change ok then which
> > adjusts the vectorizer vector construction cost to sth more sensible?
> > I have adjusted the generic implementation in targhooks.c this way
> > a few weeks ago already.
> >
> > Thanks,
> > Richard.
> >
> > 2016-07-12  Richard Biener  <rguenther@suse.de>
> >
> >         * targhooks.c (default_builtin_vectorization_cost): Adjust
> >         vec_construct cost.
> >         * config/i386/i386.c (ix86_builtin_vectorization_cost): Likewise.
> 
> Looks OK to me, but let's also give Intel chance to comment.

Btw, the motivation is that the cost of large initializers like for
v16qi or v32qi is underestimated currently.  You end up with
15 or 31 vinsert calls (or similar with other ISAs) and you can't do
better than elements - 1 operations.  It doesn't really matter
for smaller vectors of course (seen for CPU v6 x264)

Richard.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] Allow fwprop to undo vectorization harm (PR68961)
  2016-07-10  8:13             ` Uros Bizjak
  2016-07-12  8:59               ` Richard Biener
@ 2016-07-12 22:08               ` Uros Bizjak
  2016-07-13  7:04                 ` Richard Biener
  1 sibling, 1 reply; 20+ messages in thread
From: Uros Bizjak @ 2016-07-12 22:08 UTC (permalink / raw)
  To: Richard Biener; +Cc: Richard Sandiford, gcc-patches, Segher Boessenkool

[-- Attachment #1: Type: text/plain, Size: 1743 bytes --]

On Sun, Jul 10, 2016 at 10:12 AM, Uros Bizjak <ubizjak@gmail.com> wrote:
> On Wed, Jul 6, 2016 at 3:18 PM, Richard Biener <rguenther@suse.de> wrote:
>
>>> > 2016-07-04  Richard Biener  <rguenther@suse.de>
>>> >
>>> >     PR rtl-optimization/68961
>>> >     * fwprop.c (propagate_rtx): Allow SUBREGs of VEC_CONCAT and CONCAT
>>> >     to simplify to a non-constant.
>>> >
>>> >     * gcc.target/i386/pr68961.c: New testcase.
>>>
>>> Thanks, LGTM.
>>
>> Bootstrapped and tested on x86_64-unknown-linux-gnu, it causes
>>
>> FAIL: gcc.target/i386/sse2-load-multi.c scan-assembler-times movup 2
>>
>> as the peephole created for that testcase no longer applies as fwprop
>> does
>>
>> In insn 10, replacing
>>  (vec_concat:V2DF (vec_select:DF (reg:V2DF 91)
>>             (parallel [
>>                     (const_int 0 [0])
>>                 ]))
>>         (mem:DF (reg/f:DI 95) [0  S8 A128]))
>>  with (vec_concat:V2DF (reg:DF 93 [ MEM[(const double *)&a + 8B] ])
>>         (mem:DF (reg/f:DI 95) [0  S8 A128]))
>> Changed insn 10
>>
>> resulting in
>>
>>         movsd   a+8(%rip), %xmm0
>>         movhpd  a+16(%rip), %xmm0
>>
>> again rather than movupd.
>>
>> Uros, there is probably a missing peephole for the new form - can you
>> fix this as a followup or should I hold on this patch for a bit longer?
>
> No, please proceed with the patch, I'll fix this fallout with a
> followup patch in a couple of days.

Fixed with attached patch.

2016-07-13  Uros Bizjak  <ubizjak@gmail.com>

    PR rtl-optimization/68961
    * config/i386/sse.md (movsd/movhpd to movupd peephole2s): Add new
    peephole variant.  Use sse_reg_operand predicates.

Bootstrapped and regression tested on x86_64-linux-gnu {,-m32}.

Committed to mainline SVN.

Uros.

[-- Attachment #2: p.diff.txt --]
[-- Type: text/plain, Size: 2026 bytes --]

Index: config/i386/sse.md
===================================================================
--- config/i386/sse.md	(revision 238258)
+++ config/i386/sse.md	(working copy)
@@ -1169,10 +1169,10 @@
 
 ;; Merge movsd/movhpd to movupd for TARGET_SSE_UNALIGNED_LOAD_OPTIMAL targets.
 (define_peephole2
-  [(set (match_operand:V2DF 0 "register_operand")
+  [(set (match_operand:V2DF 0 "sse_reg_operand")
 	(vec_concat:V2DF (match_operand:DF 1 "memory_operand")
 			 (match_operand:DF 4 "const0_operand")))
-   (set (match_operand:V2DF 2 "register_operand")
+   (set (match_operand:V2DF 2 "sse_reg_operand")
 	(vec_concat:V2DF (vec_select:DF (match_dup 2)
 					(parallel [(const_int 0)]))
 			 (match_operand:DF 3 "memory_operand")))]
@@ -1181,13 +1181,25 @@
   [(set (match_dup 2) (match_dup 4))]
   "operands[4] = adjust_address (operands[1], V2DFmode, 0);")
 
+(define_peephole2
+  [(set (match_operand:DF 0 "sse_reg_operand")
+	(match_operand:DF 1 "memory_operand"))
+   (set (match_operand:V2DF 2 "sse_reg_operand")
+	(vec_concat:V2DF (match_operand:DF 4 "sse_reg_operand")
+			 (match_operand:DF 3 "memory_operand")))]
+  "TARGET_SSE2 && TARGET_SSE_UNALIGNED_LOAD_OPTIMAL
+   && REGNO (operands[4]) == REGNO (operands[2])
+   && ix86_operands_ok_for_move_multiple (operands, true, DFmode)"
+  [(set (match_dup 2) (match_dup 4))]
+  "operands[4] = adjust_address (operands[1], V2DFmode, 0);")
+
 ;; Merge movlpd/movhpd to movupd for TARGET_SSE_UNALIGNED_STORE_OPTIMAL targets.
 (define_peephole2
   [(set (match_operand:DF 0 "memory_operand")
-	(vec_select:DF (match_operand:V2DF 1 "register_operand")
+	(vec_select:DF (match_operand:V2DF 1 "sse_reg_operand")
 		       (parallel [(const_int 0)])))
    (set (match_operand:DF 2 "memory_operand")
-	(vec_select:DF (match_operand:V2DF 3 "register_operand")
+	(vec_select:DF (match_operand:V2DF 3 "sse_reg_operand")
 		       (parallel [(const_int 1)])))]
   "TARGET_SSE2 && TARGET_SSE_UNALIGNED_STORE_OPTIMAL
    && ix86_operands_ok_for_move_multiple (operands, false, DFmode)"

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] Allow fwprop to undo vectorization harm (PR68961)
  2016-07-12 22:08               ` Uros Bizjak
@ 2016-07-13  7:04                 ` Richard Biener
  0 siblings, 0 replies; 20+ messages in thread
From: Richard Biener @ 2016-07-13  7:04 UTC (permalink / raw)
  To: Uros Bizjak; +Cc: Richard Sandiford, gcc-patches, Segher Boessenkool

On Wed, 13 Jul 2016, Uros Bizjak wrote:

> On Sun, Jul 10, 2016 at 10:12 AM, Uros Bizjak <ubizjak@gmail.com> wrote:
> > On Wed, Jul 6, 2016 at 3:18 PM, Richard Biener <rguenther@suse.de> wrote:
> >
> >>> > 2016-07-04  Richard Biener  <rguenther@suse.de>
> >>> >
> >>> >     PR rtl-optimization/68961
> >>> >     * fwprop.c (propagate_rtx): Allow SUBREGs of VEC_CONCAT and CONCAT
> >>> >     to simplify to a non-constant.
> >>> >
> >>> >     * gcc.target/i386/pr68961.c: New testcase.
> >>>
> >>> Thanks, LGTM.
> >>
> >> Bootstrapped and tested on x86_64-unknown-linux-gnu, it causes
> >>
> >> FAIL: gcc.target/i386/sse2-load-multi.c scan-assembler-times movup 2
> >>
> >> as the peephole created for that testcase no longer applies as fwprop
> >> does
> >>
> >> In insn 10, replacing
> >>  (vec_concat:V2DF (vec_select:DF (reg:V2DF 91)
> >>             (parallel [
> >>                     (const_int 0 [0])
> >>                 ]))
> >>         (mem:DF (reg/f:DI 95) [0  S8 A128]))
> >>  with (vec_concat:V2DF (reg:DF 93 [ MEM[(const double *)&a + 8B] ])
> >>         (mem:DF (reg/f:DI 95) [0  S8 A128]))
> >> Changed insn 10
> >>
> >> resulting in
> >>
> >>         movsd   a+8(%rip), %xmm0
> >>         movhpd  a+16(%rip), %xmm0
> >>
> >> again rather than movupd.
> >>
> >> Uros, there is probably a missing peephole for the new form - can you
> >> fix this as a followup or should I hold on this patch for a bit longer?
> >
> > No, please proceed with the patch, I'll fix this fallout with a
> > followup patch in a couple of days.
> 
> Fixed with attached patch.
> 
> 2016-07-13  Uros Bizjak  <ubizjak@gmail.com>
> 
>     PR rtl-optimization/68961
>     * config/i386/sse.md (movsd/movhpd to movupd peephole2s): Add new
>     peephole variant.  Use sse_reg_operand predicates.
> 
> Bootstrapped and regression tested on x86_64-linux-gnu {,-m32}.
> 
> Committed to mainline SVN.

Thanks Uros.

Richard.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] Allow fwprop to undo vectorization harm (PR68961)
  2016-07-12  9:14                   ` Richard Biener
@ 2016-07-15  7:35                     ` Richard Biener
  0 siblings, 0 replies; 20+ messages in thread
From: Richard Biener @ 2016-07-15  7:35 UTC (permalink / raw)
  To: Uros Bizjak
  Cc: Richard Sandiford, gcc-patches, Segher Boessenkool, Kirill Yukhin

On Tue, 12 Jul 2016, Richard Biener wrote:

> On Tue, 12 Jul 2016, Uros Bizjak wrote:
> 
> > On Tue, Jul 12, 2016 at 10:58 AM, Richard Biener <rguenther@suse.de> wrote:
> > > On Sun, 10 Jul 2016, Uros Bizjak wrote:
> > >
> > >> On Wed, Jul 6, 2016 at 3:18 PM, Richard Biener <rguenther@suse.de> wrote:
> > >>
> > >> >> > 2016-07-04  Richard Biener  <rguenther@suse.de>
> > >> >> >
> > >> >> >     PR rtl-optimization/68961
> > >> >> >     * fwprop.c (propagate_rtx): Allow SUBREGs of VEC_CONCAT and CONCAT
> > >> >> >     to simplify to a non-constant.
> > >> >> >
> > >> >> >     * gcc.target/i386/pr68961.c: New testcase.
> > >> >>
> > >> >> Thanks, LGTM.
> > >> >
> > >> > Bootstrapped and tested on x86_64-unknown-linux-gnu, it causes
> > >> >
> > >> > FAIL: gcc.target/i386/sse2-load-multi.c scan-assembler-times movup 2
> > >> >
> > >> > as the peephole created for that testcase no longer applies as fwprop
> > >> > does
> > >> >
> > >> > In insn 10, replacing
> > >> >  (vec_concat:V2DF (vec_select:DF (reg:V2DF 91)
> > >> >             (parallel [
> > >> >                     (const_int 0 [0])
> > >> >                 ]))
> > >> >         (mem:DF (reg/f:DI 95) [0  S8 A128]))
> > >> >  with (vec_concat:V2DF (reg:DF 93 [ MEM[(const double *)&a + 8B] ])
> > >> >         (mem:DF (reg/f:DI 95) [0  S8 A128]))
> > >> > Changed insn 10
> > >> >
> > >> > resulting in
> > >> >
> > >> >         movsd   a+8(%rip), %xmm0
> > >> >         movhpd  a+16(%rip), %xmm0
> > >> >
> > >> > again rather than movupd.
> > >> >
> > >> > Uros, there is probably a missing peephole for the new form - can you
> > >> > fix this as a followup or should I hold on this patch for a bit longer?
> > >>
> > >> No, please proceed with the patch, I'll fix this fallout with a
> > >> followup patch in a couple of days.
> > >
> > > Applied as r238238.  Is the following x86 change ok then which
> > > adjusts the vectorizer vector construction cost to sth more sensible?
> > > I have adjusted the generic implementation in targhooks.c this way
> > > a few weeks ago already.
> > >
> > > Thanks,
> > > Richard.
> > >
> > > 2016-07-12  Richard Biener  <rguenther@suse.de>
> > >
> > >         * targhooks.c (default_builtin_vectorization_cost): Adjust
> > >         vec_construct cost.
> > >         * config/i386/i386.c (ix86_builtin_vectorization_cost): Likewise.
> > 
> > Looks OK to me, but let's also give Intel chance to comment.
> 
> Btw, the motivation is that the cost of large initializers like for
> v16qi or v32qi is underestimated currently.  You end up with
> 15 or 31 vinsert calls (or similar with other ISAs) and you can't do
> better than elements - 1 operations.  It doesn't really matter
> for smaller vectors of course (seen for CPU v6 x264)

I've applied the patch now given no further comments from Intel.

Richard.

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2016-07-15  7:35 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-06-10  9:20 [PATCH] Allow fwprop to undo vectorization harm (PR68961) Richard Biener
2016-06-13  8:49 ` Richard Biener
2016-06-13 13:11   ` Richard Biener
2016-06-15 18:37 ` Richard Sandiford
2016-06-27  9:59   ` Richard Biener
2016-06-29  8:08     ` Richard Biener
2016-07-03 17:25     ` Richard Sandiford
2016-07-04 10:48       ` Richard Biener
2016-07-05 20:50         ` Richard Sandiford
2016-07-06 13:18           ` Richard Biener
2016-07-10  8:13             ` Uros Bizjak
2016-07-12  8:59               ` Richard Biener
2016-07-12  9:09                 ` Uros Bizjak
2016-07-12  9:14                   ` Richard Biener
2016-07-15  7:35                     ` Richard Biener
2016-07-12 22:08               ` Uros Bizjak
2016-07-13  7:04                 ` Richard Biener
2016-06-17  0:07 ` RFC: 2->2 combine patch (was: Re: [PATCH] Allow fwprop to undo vectorization harm (PR68961)) Segher Boessenkool
2016-06-20 13:39   ` RFC: 2->2 combine patch Kyrill Tkachov
2016-06-20 14:38     ` Segher Boessenkool

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).