public inbox for gcc-patches@gcc.gnu.org
* [PATCH] Improve __builtin_mul_overflow (uns, 1U << y, &res) (PR target/83210)
From: Jakub Jelinek @ 2017-11-29 22:49 UTC
  To: Richard Biener; +Cc: gcc-patches

Hi!

Even if the target has an umulv4_optab pattern, like i?86 mul[lq]; jo,
if one argument is a constant power of two, using two shifts (or, for
multiplication by 2, just shl with js on the previous value) is faster
(though for -Os, mul[lq]; jo wins).
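
For illustration, for 32-bit unsigned x the emitted check is equivalent
to the following C sketch (mul_pow2_check is a hypothetical helper name,
not something in the patch):

  /* Unsigned x * (1U << cnt) overflows iff any of the top cnt bits of
     x are set, i.e. iff x >> (32 - cnt) is nonzero.  */
  unsigned
  mul_pow2_check (unsigned x, int cnt, unsigned *res)
  {
    *res = x << cnt;			/* the result shift */
    if (cnt == 0)
      return 0;				/* x * 1 never overflows */
    return (x >> (32 - cnt)) != 0;	/* the shifted-out bits */
  }

For cnt == 1 the upper shift just extracts the sign bit of x, which is
where the shl + js sequence above comes from.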

Bootstrapped/regtested on x86_64-linux and i686-linux, ok for trunk?

2017-11-29  Jakub Jelinek  <jakub@redhat.com>

	PR target/83210
	* internal-fn.c (expand_mul_overflow): Optimize unsigned
	multiplication by power of 2 constant into two shifts + comparison.

	* gcc.target/i386/pr83210.c: New test.

--- gcc/internal-fn.c.jj	2017-11-22 21:37:46.000000000 +0100
+++ gcc/internal-fn.c	2017-11-29 17:44:17.699009169 +0100
@@ -1462,6 +1462,39 @@ expand_mul_overflow (location_t loc, tre
   type = build_nonstandard_integer_type (GET_MODE_PRECISION (mode), uns);
   sign = uns ? UNSIGNED : SIGNED;
   icode = optab_handler (uns ? umulv4_optab : mulv4_optab, mode);
+  if (uns
+      && (integer_pow2p (arg0) || integer_pow2p (arg1))
+      && (optimize_insn_for_speed_p () || icode == CODE_FOR_nothing))
+    {
+      /* Optimize unsigned multiplication by power of 2 constant
+	 using 2 shifts, one for result, one to extract the shifted
+	 out bits to see if they are all zero.
+	 Don't do this if optimizing for size and we have umulv4_optab,
+	 in that case assume multiplication will be shorter.  */
+      rtx opn0 = op0;
+      rtx opn1 = op1;
+      tree argn0 = arg0;
+      tree argn1 = arg1;
+      if (integer_pow2p (arg0))
+	{
+	  std::swap (opn0, opn1);
+	  std::swap (argn0, argn1);
+	}
+      int cnt = tree_log2 (argn1);
+      if (cnt >= 0 && cnt < GET_MODE_PRECISION (mode))
+	{
+	  rtx upper = const0_rtx;
+	  res = expand_shift (LSHIFT_EXPR, mode, opn0, cnt, NULL_RTX, uns);
+	  if (cnt != 0)
+	    upper = expand_shift (RSHIFT_EXPR, mode, opn0,
+				  GET_MODE_PRECISION (mode) - cnt,
+				  NULL_RTX, uns);
+	  do_compare_rtx_and_jump (upper, const0_rtx, EQ, true, mode,
+				   NULL_RTX, NULL, done_label,
+				   profile_probability::very_likely ());
+	  goto do_error_label;
+	}
+    }
   if (icode != CODE_FOR_nothing)
     {
       struct expand_operand ops[4];
--- gcc/testsuite/gcc.target/i386/pr83210.c.jj	2017-11-29 17:58:00.102881065 +0100
+++ gcc/testsuite/gcc.target/i386/pr83210.c	2017-11-29 17:57:36.000000000 +0100
@@ -0,0 +1,53 @@
+/* PR target/83210 */
+/* { dg-do compile } */
+/* { dg-options "-O2" } */
+/* { dg-final { scan-assembler-not {\mmul[lq]\M} } } */
+
+void bar (void);
+
+unsigned
+f1 (unsigned int x)
+{
+  unsigned res;
+  if (__builtin_mul_overflow (x, 2, &res))
+    bar ();
+  return res;
+}
+
+unsigned long
+f2 (unsigned long x)
+{
+  unsigned long res;
+  if (__builtin_mul_overflow (16, x, &res))
+    bar ();
+  return res;
+}
+
+unsigned long long
+f3 (unsigned long long x)
+{
+  unsigned long long res;
+  if (__builtin_mul_overflow (x, (1ULL << (__SIZEOF_LONG_LONG__ * __CHAR_BIT__ - 1)), &res))
+    bar ();
+  return res;
+}
+
+#ifdef __SIZEOF_INT128__
+unsigned __int128
+f4 (unsigned __int128 x)
+{
+  unsigned __int128 res;
+  if (__builtin_mul_overflow (x, (((unsigned __int128) 1) << (__SIZEOF_INT128__ * __CHAR_BIT__ / 2)), &res))
+    bar ();
+  return res;
+}
+
+unsigned __int128
+f5 (unsigned __int128 x)
+{
+  unsigned __int128 res;
+  if (__builtin_mul_overflow (x, (((unsigned __int128) 1) << (__SIZEOF_INT128__ * __CHAR_BIT__ / 2 + 3)), &res))
+    bar ();
+  return res;
+}
+#endif

	Jakub


* Re: [PATCH] Improve __builtin_mul_overflow (uns, 1U << y, &res) (PR target/83210)
From: Richard Biener @ 2017-11-30  9:40 UTC
  To: Jakub Jelinek; +Cc: gcc-patches

On Wed, 29 Nov 2017, Jakub Jelinek wrote:

> Hi!
> 
> Even if the target has an umulv4_optab pattern, like i?86 mul[lq]; jo,
> if one argument is a constant power of two, using two shifts (or, for
> multiplication by 2, just shl with js on the previous value) is faster
> (though for -Os, mul[lq]; jo wins).
> 
> Bootstrapped/regtested on x86_64-linux and i686-linux, ok for trunk?

So this is because we don't go through the standard multiplication
expander, right?  But for icode == CODE_FOR_nothing we probably
should do this and then emit the compare-and-jump?

And for constant multipliers in general as well?

As you do it, you assume that the target's optab is less efficient
than two shifts and a compare, but how do you know?  Shouldn't we
do the regular mult expansion and cost the resulting sequence against
the one from the umulv expander?

That said, the patch looks x86-specific but is done in generic code...

Thanks,
Richard.

> 2017-11-29  Jakub Jelinek  <jakub@redhat.com>
> 
> 	PR target/83210
> 	* internal-fn.c (expand_mul_overflow): Optimize unsigned
> 	multiplication by power of 2 constant into two shifts + comparison.
> 
> 	* gcc.target/i386/pr83210.c: New test.
> 
> --- gcc/internal-fn.c.jj	2017-11-22 21:37:46.000000000 +0100
> +++ gcc/internal-fn.c	2017-11-29 17:44:17.699009169 +0100
> @@ -1462,6 +1462,39 @@ expand_mul_overflow (location_t loc, tre
>    type = build_nonstandard_integer_type (GET_MODE_PRECISION (mode), uns);
>    sign = uns ? UNSIGNED : SIGNED;
>    icode = optab_handler (uns ? umulv4_optab : mulv4_optab, mode);
> +  if (uns
> +      && (integer_pow2p (arg0) || integer_pow2p (arg1))
> +      && (optimize_insn_for_speed_p () || icode == CODE_FOR_nothing))
> +    {
> +      /* Optimize unsigned multiplication by power of 2 constant
> +	 using 2 shifts, one for result, one to extract the shifted
> +	 out bits to see if they are all zero.
> +	 Don't do this if optimizing for size and we have umulv4_optab,
> +	 in that case assume multiplication will be shorter.  */
> +      rtx opn0 = op0;
> +      rtx opn1 = op1;
> +      tree argn0 = arg0;
> +      tree argn1 = arg1;
> +      if (integer_pow2p (arg0))
> +	{
> +	  std::swap (opn0, opn1);
> +	  std::swap (argn0, argn1);
> +	}
> +      int cnt = tree_log2 (argn1);
> +      if (cnt >= 0 && cnt < GET_MODE_PRECISION (mode))
> +	{
> +	  rtx upper = const0_rtx;
> +	  res = expand_shift (LSHIFT_EXPR, mode, opn0, cnt, NULL_RTX, uns);
> +	  if (cnt != 0)
> +	    upper = expand_shift (RSHIFT_EXPR, mode, opn0,
> +				  GET_MODE_PRECISION (mode) - cnt,
> +				  NULL_RTX, uns);
> +	  do_compare_rtx_and_jump (upper, const0_rtx, EQ, true, mode,
> +				   NULL_RTX, NULL, done_label,
> +				   profile_probability::very_likely ());
> +	  goto do_error_label;
> +	}
> +    }
>    if (icode != CODE_FOR_nothing)
>      {
>        struct expand_operand ops[4];
> --- gcc/testsuite/gcc.target/i386/pr83210.c.jj	2017-11-29 17:58:00.102881065 +0100
> +++ gcc/testsuite/gcc.target/i386/pr83210.c	2017-11-29 17:57:36.000000000 +0100
> @@ -0,0 +1,53 @@
> +/* PR target/83210 */
> +/* { dg-do compile } */
> +/* { dg-options "-O2" } */
> +/* { dg-final { scan-assembler-not {\mmul[lq]\M} } } */
> +
> +void bar (void);
> +
> +unsigned
> +f1 (unsigned int x)
> +{
> +  unsigned res;
> +  if (__builtin_mul_overflow (x, 2, &res))
> +    bar ();
> +  return res;
> +}
> +
> +unsigned long
> +f2 (unsigned long x)
> +{
> +  unsigned long res;
> +  if (__builtin_mul_overflow (16, x, &res))
> +    bar ();
> +  return res;
> +}
> +
> +unsigned long long
> +f3 (unsigned long long x)
> +{
> +  unsigned long long res;
> +  if (__builtin_mul_overflow (x, (1ULL << (__SIZEOF_LONG_LONG__ * __CHAR_BIT__ - 1)), &res))
> +    bar ();
> +  return res;
> +}
> +
> +#ifdef __SIZEOF_INT128__
> +unsigned __int128
> +f4 (unsigned __int128 x)
> +{
> +  unsigned __int128 res;
> +  if (__builtin_mul_overflow (x, (((unsigned __int128) 1) << (__SIZEOF_INT128__ * __CHAR_BIT__ / 2)), &res))
> +    bar ();
> +  return res;
> +}
> +
> +unsigned __int128
> +f5 (unsigned __int128 x)
> +{
> +  unsigned __int128 res;
> +  if (__builtin_mul_overflow (x, (((unsigned __int128) 1) << (__SIZEOF_INT128__ * __CHAR_BIT__ / 2 + 3)), &res))
> +    bar ();
> +  return res;
> +}
> +#endif
> 
> 	Jakub
> 
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)


* Re: [PATCH] Improve __builtin_mul_overflow (uns, 1U << y, &res) (PR target/83210)
From: Jakub Jelinek @ 2017-11-30  9:56 UTC
  To: Richard Biener; +Cc: gcc-patches

On Thu, Nov 30, 2017 at 10:26:33AM +0100, Richard Biener wrote:
> On Wed, 29 Nov 2017, Jakub Jelinek wrote:
> 
> > Hi!
> > 
> > Even if the target has an umulv4_optab pattern, like i?86 mul[lq]; jo,
> > if one argument is a constant power of two, using two shifts (or, for
> > multiplication by 2, just shl with js on the previous value) is faster
> > (though for -Os, mul[lq]; jo wins).
> > 
> > Bootstrapped/regtested on x86_64-linux and i686-linux, ok for trunk?
> 
> So this is because we don't go through the standard multiplication
> expander, right?

We sometimes go through the standard multiplication expander, sometimes
the widening multiplication expander, sometimes the highpart
multiplication expander, and sometimes with the arguments forced into
registers because we pick various subregs of them, etc.
expand_mul_overflow already has a lot of cases.

>  But for icode == CODE_FOR_nothing we probably
> should do this and then emit the compare-and-jump?

The icode == CODE_FOR_nothing check is there because the code emitted
when the backend doesn't provide anything is typically larger or much
larger; just see how many different expansions there are.

The patch handles this fully, i.e. it emits the two shifts plus
compare-and-jump, and does that for power of 2 constants always when
icode == CODE_FOR_nothing, and otherwise (i.e. on i?86/x86_64) only
when not optimizing for size.

> And for constant multipliers in general as well?

See the PR; I'm not convinced that would generally be a win.

The case of two constant arguments should be handled earlier in the
gimple optimizers, and the case of one constant argument that is not a
power of two means we'd compute the overflow by dividing the result by
the constant again and comparing against the original value (if that
ends up as a real divide, I'm pretty sure it is always a loss; if it
turns into a multiplication, likewise; and if it is a multiplication by
some large constant plus shifts, we can end up with really large code).
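
Concretely, the divide-back check for, say, c = 10 (an arbitrary example
constant, not from the PR) would be something like this sketch:

  unsigned
  mul_const10_check (unsigned x, unsigned *res)
  {
    *res = x * 10;
    /* Recover x by dividing back; a mismatch means the multiplication
       wrapped.  The division by the constant is typically expanded as
       a highpart multiplication by a magic constant plus shifts, which
       is where the extra code comes from.  */
    return *res / 10 != x;
  }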

> As you do it, you assume that the target's optab is less efficient
> than two shifts and a compare, but how do you know?  Shouldn't we
> do the regular mult expansion and cost the resulting sequence against
> the one from the umulv expander?
> 
> That said, the patch looks x86-specific but is done in generic code...

Well, right now x86 is the only target that has umulv<mode>4 patterns,
and from recent discussions with Segher and the ARM folks it doesn't
look like that is going to change, at least for power/arm/aarch64, any
time soon.

So I'm not sure it is worth doing the expensive and hard-to-maintain
expansion of both variants and picking the cheaper one right now; it is
not a correctness issue.  If further targets add these patterns and the
current strategy doesn't work well there, we can reconsider.

	Jakub


* Re: [PATCH] Improve __builtin_mul_overflow (uns, 1U << y, &res) (PR target/83210)
From: Richard Biener @ 2017-11-30 11:04 UTC
  To: Jakub Jelinek; +Cc: gcc-patches

On Thu, 30 Nov 2017, Jakub Jelinek wrote:

> On Thu, Nov 30, 2017 at 10:26:33AM +0100, Richard Biener wrote:
> > On Wed, 29 Nov 2017, Jakub Jelinek wrote:
> > 
> > > Hi!
> > > 
> > > Even if the target has an umulv4_optab pattern, like i?86 mul[lq]; jo,
> > > if one argument is a constant power of two, using two shifts (or, for
> > > multiplication by 2, just shl with js on the previous value) is faster
> > > (though for -Os, mul[lq]; jo wins).
> > > 
> > > Bootstrapped/regtested on x86_64-linux and i686-linux, ok for trunk?
> > 
> > So this is because we don't go through the standard multiplication
> > expander, right?
> 
> We sometimes go through the standard multiplication expander, sometimes
> the widening multiplication expander, sometimes the highpart
> multiplication expander, and sometimes with the arguments forced into
> registers because we pick various subregs of them, etc.
> expand_mul_overflow already has a lot of cases.
> 
> >  But for icode == CODE_FOR_nothing we probably
> > should do this and then emit the compare-and-jump?
> 
> The icode == CODE_FOR_nothing check is there because the code emitted
> when the backend doesn't provide anything is typically larger or much
> larger; just see how many different expansions there are.
> 
> The patch handles this fully, i.e. it emits the two shifts plus
> compare-and-jump, and does that for power of 2 constants always when
> icode == CODE_FOR_nothing, and otherwise (i.e. on i?86/x86_64) only
> when not optimizing for size.
> 
> > And for constant multipliers in general as well?
> 
> See the PR; I'm not convinced that would generally be a win.
> 
> The case of two constant arguments should be handled earlier in the
> gimple optimizers, and the case of one constant argument that is not a
> power of two means we'd compute the overflow by dividing the result by
> the constant again and comparing against the original value (if that
> ends up as a real divide, I'm pretty sure it is always a loss; if it
> turns into a multiplication, likewise; and if it is a multiplication by
> some large constant plus shifts, we can end up with really large code).
> 
> > As you do it, you assume that the target's optab is less efficient
> > than two shifts and a compare, but how do you know?  Shouldn't we
> > do the regular mult expansion and cost the resulting sequence against
> > the one from the umulv expander?
> > 
> > That said, the patch looks x86-specific but is done in generic code...
> 
> Well, right now x86 is the only target that has umulv<mode>4 patterns,
> and from recent discussions with Segher and the ARM folks it doesn't
> look like that is going to change, at least for power/arm/aarch64, any
> time soon.
> 
> So I'm not sure it is worth doing the expensive and hard-to-maintain
> expansion of both variants and picking the cheaper one right now; it is
> not a correctness issue.  If further targets add these patterns and the
> current strategy doesn't work well there, we can reconsider.

Ok, thanks for the elaborate answer.

Can you add a comment indicating that?

The patch is ok with that.

Richard.

