public inbox for gcc-patches@gcc.gnu.org
* [PATCH] Improved RTL expansion of 1LL << x.
@ 2023-10-14 23:32 Roger Sayle
  2023-10-15  4:32 ` Jeff Law
  0 siblings, 1 reply; 2+ messages in thread
From: Roger Sayle @ 2023-10-14 23:32 UTC (permalink / raw)
  To: gcc-patches; +Cc: 'Jeff Law'

[-- Attachment #1: Type: text/plain, Size: 2764 bytes --]


This patch improves the initial RTL expanded for double word shifts
on architectures with conditional moves, so that later passes don't
need to clean up unnecessary and/or unused instructions.

Consider the general case, x << y, which is currently expanded quite well as:

        t1 = y & 32;
        t2 = 0;
        t3 = x_lo >> 1;
        t4 = y ^ ~0;
        t5 = t3 >> t4;
        tmp_hi = x_hi << y;
        tmp_hi |= t5;
        tmp_lo = x_lo << y;
        out_hi = t1 ? tmp_lo : tmp_hi;
        out_lo = t1 ? t2 : tmp_lo;
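As a concrete C model of this sequence (an illustrative sketch, not GCC
code; the function name shl64 is made up, and shift counts are masked
with 31 here to avoid C undefined behaviour, standing in for the
hardware truncation that the RTL form relies on):

```c
#include <assert.h>
#include <stdint.h>

/* C sketch of the double-word shift expansion above: computes the
   64-bit value (x_hi:x_lo) << y on a hypothetical 32-bit target.  */
static uint64_t
shl64 (uint32_t x_lo, uint32_t x_hi, unsigned int y)
{
  uint32_t t1 = y & 32;            /* nonzero if the count crosses a word */
  uint32_t t2 = 0;
  uint32_t t3 = x_lo >> 1;
  uint32_t t4 = y ^ ~0U;           /* i.e. ~y; (~y & 31) == 31 - (y & 31) */
  uint32_t t5 = t3 >> (t4 & 31);   /* bits of x_lo carried into the high word */
  uint32_t tmp_hi = x_hi << (y & 31);
  tmp_hi |= t5;
  uint32_t tmp_lo = x_lo << (y & 31);
  uint32_t out_hi = t1 ? tmp_lo : tmp_hi;
  uint32_t out_lo = t1 ? t2 : tmp_lo;
  return ((uint64_t) out_hi << 32) | out_lo;
}
```

The two conditional assignments at the end are the conditional moves
that select between the "small shift" and "word-crossing" results.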

This is nearly optimal; the only remaining improvement is that a
unary NOT operation, "t4 = ~y", is better than XOR with -1 on
targets that support it.  [Note the one_cmpl_optab expander didn't
fall back to XOR when this code was originally written, but has
since been improved.]

Now consider the relatively common idiom of 1LL << y, which
currently produces the RTL equivalent of:

        t1 = y & 32;
        t2 = 0;
        t3 = 1 >> 1;
        t4 = y ^ ~0;
        t5 = t3 >> t4;
        tmp_hi = 0 << y;
        tmp_hi |= t5;
        tmp_lo = 1 << y;
        out_hi = t1 ? tmp_lo : tmp_hi;
        out_lo = t1 ? t2 : tmp_lo;

Notice here that t3 is always zero, so the assignment of t5 is a
variable shift of zero, which expands to a loop on many smaller
targets; the first tmp_hi assignment contains a similar shift by
zero (another loop); the value of t4 is no longer required (as t3
is zero); and the ultimate value of tmp_hi is always zero.

Fortunately, for many (but perhaps not all) targets this mess gets
cleaned up by later optimization passes.  However, this patch avoids
generating unnecessary RTL at expand time, by calling
simplify_expand_binop instead of expand_binop, and by not generating
dead or redundant code when intermediate values are known to be
zero.  For the 1LL << y test case above, we now generate:

        t1 = y & 32;
        t2 = 0;
        tmp_hi = 0;
        tmp_lo = 1 << y;
        out_hi = t1 ? tmp_lo : tmp_hi;
        out_lo = t1 ? t2 : tmp_lo;
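In the same illustrative C model (again a sketch, not GCC code, with
the shift count masked by 31 to stand in for hardware truncation),
the improved sequence reduces to:

```c
#include <assert.h>
#include <stdint.h>

/* C sketch of the improved expansion of 1LL << y.  Both variable
   shifts of zero, and the NOT/XOR feeding them, are gone; only one
   single-word shift remains besides the conditional moves.  */
static uint64_t
shl64_one (unsigned int y)
{
  uint32_t t1 = y & 32;
  uint32_t t2 = 0;
  uint32_t tmp_hi = 0;
  uint32_t tmp_lo = 1U << (y & 31);
  uint32_t out_hi = t1 ? tmp_lo : tmp_hi;
  uint32_t out_lo = t1 ? t2 : tmp_lo;
  return ((uint64_t) out_hi << 32) | out_lo;
}
```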

On arc-elf, for example, there are 18 RTL INSN_P instructions
generated by expand before this patch, but only 12 with this patch
(improving both compile-time and memory usage).
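The mechanism can be illustrated with a toy model (purely
hypothetical; the types and function names below are invented and do
not use GCC's rtx or optab APIs): a simplifying expander folds an
operation whose operands are compile-time constants instead of
emitting an instruction, so dependent operations on the folded result
can themselves be skipped.

```c
#include <assert.h>

/* Toy stand-in for an RTL value: either a known constant or a
   "register" produced by an emitted instruction.  */
struct val { int is_const; int n; };   /* n = constant value or reg id */

static int insns_emitted;              /* count of emitted "instructions" */

/* expand_binop analogue: unconditionally emits an instruction.  */
static struct val
expand_shl (struct val a, struct val b)
{
  (void) a; (void) b;
  ++insns_emitted;
  return (struct val) { 0, insns_emitted };
}

/* simplify_expand_binop analogue: fold constant operands first,
   emitting no instruction in that case.  */
static struct val
simplify_expand_shl (struct val a, struct val b)
{
  if (a.is_const && b.is_const)
    return (struct val) { 1, a.n << b.n };
  return expand_shl (a, b);
}
```

In the patch the same idea goes one step further: when the folded
carries value is const0_rtx, the dependent NOT, shift, and IOR
operations are not generated at all.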


This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
and make -k check, both with and without --target_board=unix{-m32}
with no new failures.  Ok for mainline?


2023-10-15  Roger Sayle  <roger@nextmovesoftware.com>

gcc/ChangeLog
        * optabs.cc (expand_subword_shift): Call simplify_expand_binop
        instead of expand_binop.  Optimize cases (i.e. avoid generating
        RTL) when CARRIES or INTO_INPUT is zero.  Use one_cmpl_optab
        (i.e. NOT) instead of xor_optab with ~0 to calculate ~OP1.


Thanks in advance,
Roger
--


[-- Attachment #2: patchex.txt --]
[-- Type: text/plain, Size: 2842 bytes --]

diff --git a/gcc/optabs.cc b/gcc/optabs.cc
index e1898da..f0a048a 100644
--- a/gcc/optabs.cc
+++ b/gcc/optabs.cc
@@ -533,15 +533,13 @@ expand_subword_shift (scalar_int_mode op1_mode, optab binoptab,
 	 has unknown behavior.  Do a single shift first, then shift by the
 	 remainder.  It's OK to use ~OP1 as the remainder if shift counts
 	 are truncated to the mode size.  */
-      carries = expand_binop (word_mode, reverse_unsigned_shift,
-			      outof_input, const1_rtx, 0, unsignedp, methods);
-      if (shift_mask == BITS_PER_WORD - 1)
-	{
-	  tmp = immed_wide_int_const
-	    (wi::minus_one (GET_MODE_PRECISION (op1_mode)), op1_mode);
-	  tmp = simplify_expand_binop (op1_mode, xor_optab, op1, tmp,
-				       0, true, methods);
-	}
+      carries = simplify_expand_binop (word_mode, reverse_unsigned_shift,
+				       outof_input, const1_rtx, 0,
+				       unsignedp, methods);
+      if (carries == const0_rtx)
+	tmp = const0_rtx;
+      else if (shift_mask == BITS_PER_WORD - 1)
+	tmp = expand_unop (op1_mode, one_cmpl_optab, op1, 0, true);
       else
 	{
 	  tmp = immed_wide_int_const (wi::shwi (BITS_PER_WORD - 1,
@@ -552,22 +550,29 @@ expand_subword_shift (scalar_int_mode op1_mode, optab binoptab,
     }
   if (tmp == 0 || carries == 0)
     return false;
-  carries = expand_binop (word_mode, reverse_unsigned_shift,
-			  carries, tmp, 0, unsignedp, methods);
+  if (carries != const0_rtx && tmp != const0_rtx)
+    carries = simplify_expand_binop (word_mode, reverse_unsigned_shift,
+				     carries, tmp, 0, unsignedp, methods);
   if (carries == 0)
     return false;
 
-  /* Shift INTO_INPUT logically by OP1.  This is the last use of INTO_INPUT
-     so the result can go directly into INTO_TARGET if convenient.  */
-  tmp = expand_binop (word_mode, unsigned_shift, into_input, op1,
-		      into_target, unsignedp, methods);
-  if (tmp == 0)
-    return false;
+  if (into_input != const0_rtx)
+    {
+      /* Shift INTO_INPUT logically by OP1.  This is the last use of
+	 INTO_INPUT so the result can go directly into INTO_TARGET if
+	 convenient.  */
+      tmp = simplify_expand_binop (word_mode, unsigned_shift, into_input,
+				   op1, into_target, unsignedp, methods);
+      if (tmp == 0)
+	return false;
 
-  /* Now OR in the bits carried over from OUTOF_INPUT.  */
-  if (!force_expand_binop (word_mode, ior_optab, tmp, carries,
-			   into_target, unsignedp, methods))
-    return false;
+      /* Now OR in the bits carried over from OUTOF_INPUT.  */
+      if (!force_expand_binop (word_mode, ior_optab, tmp, carries,
+			       into_target, unsignedp, methods))
+	return false;
+    }
+  else
+    emit_move_insn (into_target, carries);
 
   /* Use a standard word_mode shift for the out-of half.  */
   if (outof_target != 0)


* Re: [PATCH] Improved RTL expansion of 1LL << x.
  2023-10-14 23:32 [PATCH] Improved RTL expansion of 1LL << x Roger Sayle
@ 2023-10-15  4:32 ` Jeff Law
  0 siblings, 0 replies; 2+ messages in thread
From: Jeff Law @ 2023-10-15  4:32 UTC (permalink / raw)
  To: Roger Sayle, gcc-patches



On 10/14/23 17:32, Roger Sayle wrote:
> 
> This patch improves the initial RTL expanded for double word shifts
> on architectures with conditional moves, so that later passes don't
> need to clean-up unnecessary and/or unused instructions.
> 
> 2023-10-15  Roger Sayle  <roger@nextmovesoftware.com>
> 
> gcc/ChangeLog
>          * optabs.cc (expand_subword_shift): Call simplify_expand_binop
>          instead of expand_binop.  Optimize cases (i.e. avoid generating
>          RTL) when CARRIES or INTO_INPUT is zero.  Use one_cmpl_optab
>          (i.e. NOT) instead of xor_optab with ~0 to calculate ~OP1.
OK.

FWIW, the H8 is another target where variable shifts have to be
implemented as a loop.

Jeff

