public inbox for gcc@gcc.gnu.org
* VEC_WIDEN_MULT_(LO|HI)_EXPR vs. VEC_WIDEN_MULT_(EVEN|ODD)_EXPR in vectorization.
@ 2014-01-22 14:33 Bingfeng Mei
  2014-01-28 11:09 ` Richard Biener
  0 siblings, 1 reply; 6+ messages in thread
From: Bingfeng Mei @ 2014-01-22 14:33 UTC (permalink / raw)
  To: gcc

Hi,
I noticed a regression in vectorization on our port between 4.8 and the ancient 4.5. After a bit of investigation, I found the following code, which prefers the even|odd version over the lo|hi one. That is obviously the right choice for AltiVec and maybe some other targets, but the even|odd versions (which expand to a series of instructions) are less efficient on our target than the lo|hi ones. Shouldn't there be a target-specific hook to make this choice instead of hard-coding it here, or some cost-estimating technique to compare the two alternatives?

     /* The result of a vectorized widening operation usually requires
	 two vectors (because the widened results do not fit into one vector).
	 The generated vector results would normally be expected to be
	 generated in the same order as in the original scalar computation,
	 i.e. if 8 results are generated in each vector iteration, they are
	 to be organized as follows:
		vect1: [res1,res2,res3,res4],
		vect2: [res5,res6,res7,res8].

	 However, in the special case that the result of the widening
	 operation is used in a reduction computation only, the order doesn't
	 matter (because when vectorizing a reduction we change the order of
	 the computation).  Some targets can take advantage of this and
	 generate more efficient code.  For example, targets like Altivec,
	 that support widen_mult using a sequence of {mult_even,mult_odd}
	 generate the following vectors:
		vect1: [res1,res3,res5,res7],
		vect2: [res2,res4,res6,res8].

	 When vectorizing outer-loops, we execute the inner-loop sequentially
	 (each vectorized inner-loop iteration contributes to VF outer-loop
	 iterations in parallel).  We therefore don't allow to change the
	 order of the computation in the inner-loop during outer-loop
	 vectorization.  */
      /* TODO: Another case in which order doesn't *really* matter is when we
	 widen and then contract again, e.g. (short)((int)x * y >> 8).
	 Normally, pack_trunc performs an even/odd permute, whereas the 
	 repack from an even/odd expansion would be an interleave, which
	 would be significantly simpler for e.g. AVX2.  */
      /* In any case, in order to avoid duplicating the code below, recurse
	 on VEC_WIDEN_MULT_EVEN_EXPR.  If it succeeds, all the return values
	 are properly set up for the caller.  If we fail, we'll continue with
	 a VEC_WIDEN_MULT_LO/HI_EXPR check.  */
      if (vect_loop
	  && STMT_VINFO_RELEVANT (stmt_info) == vect_used_by_reduction
	  && !nested_in_vect_loop_p (vect_loop, stmt)
	  && supportable_widening_operation (VEC_WIDEN_MULT_EVEN_EXPR,
					     stmt, vectype_out, vectype_in,
					     code1, code2, multi_step_cvt,
					     interm_types))
	return true;
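
To make the trade-off concrete, here is a minimal sketch (hypothetical code, not from our port) of the kind of loop this check is about: a widening multiply whose products feed only a sum reduction, so the vectorizer is free to produce them in even/odd order instead of the original element order.

#include <stdint.h>

/* The 32-bit products are only accumulated, so whether they are
   materialized as [res1,res2,res3,res4]/[res5,res6,res7,res8] (lo/hi)
   or [res1,res3,res5,res7]/[res2,res4,res6,res8] (even/odd) does not
   change the final sum.  */
int32_t
dot16 (const int16_t *a, const int16_t *b, int n)
{
  int32_t sum = 0;
  for (int i = 0; i < n; i++)
    sum += (int32_t) a[i] * b[i];
  return sum;
}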


Thanks,
Bingfeng Mei


* Re: VEC_WIDEN_MULT_(LO|HI)_EXPR vs. VEC_WIDEN_MULT_(EVEN|ODD)_EXPR in vectorization.
  2014-01-22 14:33 VEC_WIDEN_MULT_(LO|HI)_EXPR vs. VEC_WIDEN_MULT_(EVEN|ODD)_EXPR in vectorization Bingfeng Mei
@ 2014-01-28 11:09 ` Richard Biener
  2014-01-28 11:56   ` Bingfeng Mei
  0 siblings, 1 reply; 6+ messages in thread
From: Richard Biener @ 2014-01-28 11:09 UTC (permalink / raw)
  To: Bingfeng Mei; +Cc: gcc

On Wed, Jan 22, 2014 at 1:20 PM, Bingfeng Mei <bmei@broadcom.com> wrote:
> Hi,
> I noticed a regression in vectorization on our port between 4.8 and the ancient 4.5. After a bit of investigation, I found the following code, which prefers the even|odd version over the lo|hi one. That is obviously the right choice for AltiVec and maybe some other targets, but the even|odd versions (which expand to a series of instructions) are less efficient on our target than the lo|hi ones. Shouldn't there be a target-specific hook to make this choice instead of hard-coding it here, or some cost-estimating technique to compare the two alternatives?

Hmm, what's the reason for a target to support both?  I think the idea
was that a target supports only one of them (whichever is more efficient).

Richard.


* RE: VEC_WIDEN_MULT_(LO|HI)_EXPR vs. VEC_WIDEN_MULT_(EVEN|ODD)_EXPR in vectorization.
  2014-01-28 11:09 ` Richard Biener
@ 2014-01-28 11:56   ` Bingfeng Mei
  2014-01-28 15:17     ` Richard Biener
  0 siblings, 1 reply; 6+ messages in thread
From: Bingfeng Mei @ 2014-01-28 11:56 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc

Thanks, Richard. It is not very clear from the documentation.

"Signed/Unsigned widening multiplication. The two inputs (operands 1 and 2)
are vectors with N signed/unsigned elements of size S. Multiply the high/low
or even/odd elements of the two vectors, and put the N/2 products of size 2*S
in the output vector (operand 0)."

So I thought that implementing both could help the vectorizer optimize more loops.
Maybe we should improve the documentation.
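
To illustrate what I found unclear: for a hypothetical V8HI multiply with inputs a = [a0,...,a7] and b = [b0,...,b7] (element numbering assuming a little-endian layout), the two pattern pairs select different elements:

    lo/hi pair (original element order preserved across the two results):
        vec_widen_smult_lo:   [a0*b0, a1*b1, a2*b2, a3*b3]
        vec_widen_smult_hi:   [a4*b4, a5*b5, a6*b6, a7*b7]

    even/odd pair (element order changed):
        vec_widen_smult_even: [a0*b0, a2*b2, a4*b4, a6*b6]
        vec_widen_smult_odd:  [a1*b1, a3*b3, a5*b5, a7*b7]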

Bingfeng 




* Re: VEC_WIDEN_MULT_(LO|HI)_EXPR vs. VEC_WIDEN_MULT_(EVEN|ODD)_EXPR in vectorization.
  2014-01-28 11:56   ` Bingfeng Mei
@ 2014-01-28 15:17     ` Richard Biener
  2014-01-28 17:28       ` Bingfeng Mei
  0 siblings, 1 reply; 6+ messages in thread
From: Richard Biener @ 2014-01-28 15:17 UTC (permalink / raw)
  To: Bingfeng Mei; +Cc: gcc

On Tue, Jan 28, 2014 at 12:08 PM, Bingfeng Mei <bmei@broadcom.com> wrote:
> Thanks, Richard. It is not very clear from the documentation.
>
> "Signed/Unsigned widening multiplication. The two inputs (operands 1 and 2)
> are vectors with N signed/unsigned elements of size S. Multiply the high/low
> or even/odd elements of the two vectors, and put the N/2 products of size 2*S
> in the output vector (operand 0)."
>
> So I thought that implementing both could help the vectorizer optimize more loops.
> Maybe we should improve the documentation.

Maybe.  But my answer was from the top of my head - so better double-check
in the vectorizer sources.

Richard.


* RE: VEC_WIDEN_MULT_(LO|HI)_EXPR vs. VEC_WIDEN_MULT_(EVEN|ODD)_EXPR in vectorization.
  2014-01-28 15:17     ` Richard Biener
@ 2014-01-28 17:28       ` Bingfeng Mei
  2014-01-29  9:36         ` Richard Biener
  0 siblings, 1 reply; 6+ messages in thread
From: Bingfeng Mei @ 2014-01-28 17:28 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc, gcc-patches

I checked the vectorization code, and it seems that the only relevant place where vec_widen_mult_even/odd & vec_widen_mult_lo/hi are generated is supportable_widening_operation. One of the two pairs is selected, with priority given to vec_widen_mult_even/odd if it is a reduction loop. However, the lo/hi pair seems to have wider usage than the even/odd pair (non-loop? non-reduction?). Maybe that's why AltiVec and x86 still implement both pairs. Is the following patch OK?

Index: gcc/ChangeLog
===================================================================
--- gcc/ChangeLog	(revision 207183)
+++ gcc/ChangeLog	(working copy)
@@ -1,3 +1,9 @@
+2014-01-28  Bingfeng Mei  <bmei@broadcom.com>
+
+	* doc/md.texi: Mention that a target shouldn't implement 
+	vec_widen_(s|u)mul_even/odd pair if it is less efficient
+	than hi/lo pair.
+
 2014-01-28  Richard Biener  <rguenther@suse.de>
 
 	Revert
Index: gcc/doc/md.texi
===================================================================
--- gcc/doc/md.texi	(revision 207183)
+++ gcc/doc/md.texi	(working copy)
@@ -4918,7 +4918,8 @@ the output vector (operand 0).
 Signed/Unsigned widening multiplication.  The two inputs (operands 1 and 2)
 are vectors with N signed/unsigned elements of size S@.  Multiply the high/low
 or even/odd elements of the two vectors, and put the N/2 products of size 2*S
-in the output vector (operand 0).
+in the output vector (operand 0).  A target shouldn't implement the even/odd
+pattern pair if it is less efficient than the lo/hi one.
 
 @cindex @code{vec_widen_ushiftl_hi_@var{m}} instruction pattern
 @cindex @code{vec_widen_ushiftl_lo_@var{m}} instruction pattern
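
To illustrate the "wider usage" point above: in a loop where the widened products are stored rather than reduced, the original element order must be preserved, so the lo/hi pair applies directly while even/odd results would need an extra permute to restore the order. A hypothetical example (not from our port):

#include <stdint.h>

/* The widened products are stored in order, so this is the case the
   lo/hi pair covers; even/odd results would have to be re-interleaved
   before the store.  */
void
widen_store (int32_t *restrict out, const int16_t *a,
             const int16_t *b, int n)
{
  for (int i = 0; i < n; i++)
    out[i] = (int32_t) a[i] * b[i];
}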



* Re: VEC_WIDEN_MULT_(LO|HI)_EXPR vs. VEC_WIDEN_MULT_(EVEN|ODD)_EXPR in vectorization.
  2014-01-28 17:28       ` Bingfeng Mei
@ 2014-01-29  9:36         ` Richard Biener
  0 siblings, 0 replies; 6+ messages in thread
From: Richard Biener @ 2014-01-29  9:36 UTC (permalink / raw)
  To: Bingfeng Mei; +Cc: gcc, gcc-patches

On Tue, Jan 28, 2014 at 4:17 PM, Bingfeng Mei <bmei@broadcom.com> wrote:
> I checked the vectorization code, and it seems that the only relevant place where vec_widen_mult_even/odd & vec_widen_mult_lo/hi are generated is supportable_widening_operation. One of the two pairs is selected, with priority given to vec_widen_mult_even/odd if it is a reduction loop. However, the lo/hi pair seems to have wider usage than the even/odd pair (non-loop? non-reduction?). Maybe that's why AltiVec and x86 still implement both pairs. Is the following patch OK?

Ok.

Thanks,
Richard.

