public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed
* [ARC PATCH] Split SImode shifts pre-reload on !TARGET_BARREL_SHIFTER.
@ 2023-09-28 11:26 Roger Sayle
  2023-10-03 14:25 ` Claudiu Zissulescu
  2023-10-04  8:37 ` Claudiu Zissulescu Ianculescu
  0 siblings, 2 replies; 5+ messages in thread
From: Roger Sayle @ 2023-09-28 11:26 UTC (permalink / raw)
  To: gcc-patches; +Cc: 'Claudiu Zissulescu'

[-- Attachment #1: Type: text/plain, Size: 7086 bytes --]


Hi Claudiu,
It was great meeting up with you and the Synopsys ARC team at the
GNU tools Cauldron in Cambridge.

This patch is the first in a series to improve SImode and DImode
shifts and rotates in the ARC backend.  This first piece splits
SImode shifts, for !TARGET_BARREL_SHIFTER targets, after combine
and before reload, in the split1 pass, as suggested by the FIXME
comment above output_shift in arc.cc.  To do this I've copied the
implementation of the x86_pre_reload_split function from i386
backend, and renamed it arc_pre_reload_split.

Although the actual implementations of shifts remain the same
(as in output_shift), having them as explicit instructions in
the RTL stream allows better scheduling and use of compact forms
when available.  The benefits can be seen in two short examples
below.

For the function:
unsigned int foo(unsigned int x, unsigned int y) {
  return y << 2;
}

GCC with -O2 -mcpu=em would previously generate:
foo:    add r1,r1,r1
        add r1,r1,r1
        j_s.d   [blink]
        mov_s   r0,r1   ;4
and with this patch now generates:
foo:    asl_s r0,r1
        j_s.d   [blink]
        asl_s r0,r0

Notice the original (from shift_si3's output_shift) requires the
shift sequence to be monolithic with the same destination register
as the source (requiring an extra mov_s).  The new version can
eliminate this move, and schedule the second asl in the branch
delay slot of the return.

For the function:
int x,y,z;

void bar()
{
  x <<= 3;
  y <<= 3;
  z <<= 3;
}

GCC -O2 -mcpu=em currently generates:
bar:    push_s  r13
        ld.as   r12,[gp,@x@sda] ;23
        ld.as   r3,[gp,@y@sda]  ;23
        mov r2,0
        add3 r12,r2,r12
        mov r2,0
        add3 r3,r2,r3
        ld.as   r2,[gp,@z@sda]  ;23
        st.as   r12,[gp,@x@sda] ;26
        mov r13,0
        add3 r2,r13,r2
        st.as   r3,[gp,@y@sda]  ;26
        st.as   r2,[gp,@z@sda]  ;26
        j_s.d   [blink]
        pop_s   r13

where each shift by 3, uses ARC's add3 instruction, which is similar
to x86's lea implementing x = (y<<3) + z, but requires the value zero
to be placed in a temporary register "z".  Splitting this before reload
allows these pseudos to be shared/reused.  With this patch, we get

bar:    ld.as   r2,[gp,@x@sda]  ;23
        mov_s   r3,0    ;3
        add3    r2,r3,r2
        ld.as   r3,[gp,@y@sda]  ;23
        st.as   r2,[gp,@x@sda]  ;26
        ld.as   r2,[gp,@z@sda]  ;23
        mov_s   r12,0   ;3
        add3    r3,r12,r3
        add3    r2,r12,r2
        st.as   r3,[gp,@y@sda]  ;26
        st.as   r2,[gp,@z@sda]  ;26
        j_s     [blink]

Unfortunately, register allocation means that we only share two of the
three "mov_s z,0", but this is sufficient to reduce register pressure
enough to avoid spilling r13 in the prologue/epilogue.

This patch also contains a (latent?) bug fix.  The implementation of
the default insn "length" attribute, assumes instructions of type
"shift" have two input operands and accesses operands[2], hence 
specializations of shifts that don't have a operands[2], need to be
categorized as type "unary" (which results in the correct length).

This patch has been tested on a cross-compiler to arc-elf (hosted on
x86_64-pc-linux-gnu), but because I've an incomplete tool chain many
of the regression test fail, but there are no new failures with new
test cases added below.  If you can confirm that there are no issues
from additional testing, is this OK for mainline?

Finally a quick technical question.  ARC's zero overhead loops require
at least two instructions in the loop, so currently the backend's
implementation of shr20 pads the loop body with a "nop".

lshr20: mov.f lp_count, 20
        lpnz    2f
        lsr r0,r0
        nop
2:      # end single insn loop
        j_s     [blink]

could this be more efficiently implemented as:

lshr20: mov lp_count, 10
        lp 2f
        lsr_s r0,r0
        lsr_s r0,r0
2:      # end single insn loop
        j_s     [blink]

i.e. half the number of iterations, but doing twice as much useful
work in each iteration?  Or might the nop be free on advanced
microarchitectures, and/or the consecutive dependent shifts cause
a pipeline stall?  It would be nice to fuse loops to implement
rotations, such that rotr16 (aka swap) would look like:

rot16:  mov_s r1,r0
        mov lp_count, 16
        lp 2f
        asl_s r0,r0
        lsr_s r1,r1
2:      # end single insn loop
        j_s.d    [blink]
        or_s r0,r0,r1

Thanks in advance,
Roger


2023-09-28  Roger Sayle  <roger@nextmovesoftware.com>

gcc/ChangeLog
        * config/arc/arc-protos.h (emit_shift): Delete prototype.
        (arc_pre_reload_split): New function prototype.
        * config/arc/arc.cc (emit_shift): Delete function.
        (arc_pre_reload_split): New predicate function, copied from i386,
        to schedule define_insn_and_split splitters to the split1 pass.
        * config/arc/arc.md (ashlsi3): Expand RTL template unconditionally.
        (ashrsi3): Likewise.
        (lshrsi3): Likewise.
        (shift_si3): Move after other shift patterns, and disable when
        operands[2] is one (which is handled by its own define_insn).
        Use shiftr4_operator, instead of shift4_operator, as this is no
        longer used for left shifts.
        (shift_si3_loop): Likewise.  Additionally remove match_scratch.
        (*ashlsi3_nobs): New pre-reload define_insn_and_split.
        (*ashrsi3_nobs): Likewise.
        (*lshrsi3_nobs): Likewise.
        (rotrsi3_cnt1): Rename define_insn from *rotrsi3_cnt1.
        (ashlsi3_cnt1): Rename define_insn from *ashlsi2_cnt1.  Change
        insn type attribute to "unary", as this doesn't have operands[2].
        Change length attribute to "*,4" to allow compact representation.
        (lshrsi3_cnt1): Rename define_insn from *lshrsi3_cnt1.  Change
        insn type attribute to "unary", as this doesn't have operands[2].
        (ashrsi3_cnt1): Rename define_insn from *ashrsi3_cnt1.  Change
        insn type attribute to "unary", as this doesn't have operands[2].
        (add_shift): Rename define_insn from *add_shift.
        * config/arc/predicates.md (shiftl4_operator): Delete.
        (shift4_operator): Delete.

gcc/testsuite/ChangeLog
        * gcc.target/arc/ashrsi-1.c: New TARGET_BARREL_SHIFTER test case.
        * gcc.target/arc/ashrsi-2.c: New !TARGET_BARREL_SHIFTER test case.
        * gcc.target/arc/ashrsi-3.c: Likewise.
        * gcc.target/arc/ashrsi-4.c: Likewise.
        * gcc.target/arc/ashrsi-5.c: Likewise.
        * gcc.target/arc/lshrsi-1.c: New TARGET_BARREL_SHIFTER test case.
        * gcc.target/arc/lshrsi-2.c: New !TARGET_BARREL_SHIFTER test case.
        * gcc.target/arc/lshrsi-3.c: Likewise.
        * gcc.target/arc/lshrsi-4.c: Likewise.
        * gcc.target/arc/lshrsi-5.c: Likewise.
        * gcc.target/arc/shlsi-1.c: New TARGET_BARREL_SHIFTER test case.
        * gcc.target/arc/shlsi-2.c: New !TARGET_BARREL_SHIFTER test case.
        * gcc.target/arc/shlsi-3.c: Likewise.
        * gcc.target/arc/shlsi-4.c: Likewise.
        * gcc.target/arc/shlsi-5.c: Likewise.


[-- Attachment #2: patchar.txt --]
[-- Type: text/plain, Size: 24762 bytes --]

diff --git a/gcc/config/arc/arc-protos.h b/gcc/config/arc/arc-protos.h
index d47b475..2b3012b 100644
--- a/gcc/config/arc/arc-protos.h
+++ b/gcc/config/arc/arc-protos.h
@@ -36,7 +36,7 @@ extern int arc_output_addsi (rtx *operands, bool, bool);
 extern int arc_output_commutative_cond_exec (rtx *operands, bool);
 extern bool arc_expand_cpymem (rtx *operands);
 extern bool prepare_move_operands (rtx *operands, machine_mode mode);
-extern void emit_shift (enum rtx_code, rtx, rtx, rtx);
+extern bool arc_pre_reload_split (void);
 extern void arc_expand_atomic_op (enum rtx_code, rtx, rtx, rtx, rtx, rtx);
 extern void arc_split_compare_and_swap (rtx *);
 extern void arc_expand_compare_and_swap (rtx *);
diff --git a/gcc/config/arc/arc.cc b/gcc/config/arc/arc.cc
index 266ba8b..647f4a1 100644
--- a/gcc/config/arc/arc.cc
+++ b/gcc/config/arc/arc.cc
@@ -4239,18 +4239,16 @@ arc_unspec_offset (rtx loc, int unspec)
 					       unspec));
 }
 
-/* !TARGET_BARREL_SHIFTER support.  */
-/* Emit a shift insn to set OP0 to OP1 shifted by OP2; CODE specifies what
-   kind of shift.  */
+/* Predicate for pre-reload splitters with associated instructions,
+   which can match any time before the split1 pass (usually combine),
+   then are unconditionally split in that pass and should not be
+   matched again afterwards.  */
 
-void
-emit_shift (enum rtx_code code, rtx op0, rtx op1, rtx op2)
+bool
+arc_pre_reload_split (void)
 {
-  rtx shift = gen_rtx_fmt_ee (code, SImode, op1, op2);
-  rtx pat
-    = ((shift4_operator (shift, SImode) ?  gen_shift_si3 : gen_shift_si3_loop)
-	(op0, op1, op2, shift));
-  emit_insn (pat);
+  return (can_create_pseudo_p ()
+	  && !(cfun->curr_properties & PROP_rtl_split_insns));
 }
 
 /* Output the assembler code for doing a shift.
diff --git a/gcc/config/arc/arc.md b/gcc/config/arc/arc.md
index 1f122d9..d3898d1 100644
--- a/gcc/config/arc/arc.md
+++ b/gcc/config/arc/arc.md
@@ -3404,70 +3404,19 @@ archs4x, archs4xd"
   [(set (match_operand:SI 0 "dest_reg_operand" "")
 	(ashift:SI (match_operand:SI 1 "register_operand" "")
 		   (match_operand:SI 2 "nonmemory_operand" "")))]
-  ""
-  "
-{
-  if (!TARGET_BARREL_SHIFTER)
-    {
-      emit_shift (ASHIFT, operands[0], operands[1], operands[2]);
-      DONE;
-    }
-}")
+  "")
 
 (define_expand "ashrsi3"
   [(set (match_operand:SI 0 "dest_reg_operand" "")
 	(ashiftrt:SI (match_operand:SI 1 "register_operand" "")
 		     (match_operand:SI 2 "nonmemory_operand" "")))]
-  ""
-  "
-{
-  if (!TARGET_BARREL_SHIFTER)
-    {
-      emit_shift (ASHIFTRT, operands[0], operands[1], operands[2]);
-      DONE;
-    }
-}")
+  "")
 
 (define_expand "lshrsi3"
   [(set (match_operand:SI 0 "dest_reg_operand" "")
 	(lshiftrt:SI (match_operand:SI 1 "register_operand" "")
 		     (match_operand:SI 2 "nonmemory_operand" "")))]
-  ""
-  "
-{
-  if (!TARGET_BARREL_SHIFTER)
-    {
-      emit_shift (LSHIFTRT, operands[0], operands[1], operands[2]);
-      DONE;
-    }
-}")
-
-(define_insn "shift_si3"
-  [(set (match_operand:SI 0 "dest_reg_operand" "=r")
-	(match_operator:SI 3 "shift4_operator"
-			   [(match_operand:SI 1 "register_operand" "0")
-			    (match_operand:SI 2 "const_int_operand" "n")]))
-   (clobber (match_scratch:SI 4 "=&r"))
-   (clobber (reg:CC CC_REG))
-  ]
-  "!TARGET_BARREL_SHIFTER"
-  "* return output_shift (operands);"
-  [(set_attr "type" "shift")
-   (set_attr "length" "16")])
-
-(define_insn "shift_si3_loop"
-  [(set (match_operand:SI 0 "dest_reg_operand" "=r,r")
-	(match_operator:SI 3 "shift_operator"
-			   [(match_operand:SI 1 "register_operand" "0,0")
-			    (match_operand:SI 2 "nonmemory_operand" "rn,Cal")]))
-   (clobber (match_scratch:SI 4 "=X,X"))
-   (clobber (reg:SI LP_COUNT))
-   (clobber (reg:CC CC_REG))
-  ]
-  "!TARGET_BARREL_SHIFTER"
-  "* return output_shift (operands);"
-  [(set_attr "type" "shift")
-   (set_attr "length" "16,20")])
+  "")
 
 ; asl, asr, lsr patterns:
 ; There is no point in including an 'I' alternative since only the lowest 5
@@ -3515,6 +3464,183 @@ archs4x, archs4xd"
    (set_attr "predicable" "no,no,no,yes,no,no")
    (set_attr "cond" "canuse,nocond,canuse,canuse,nocond,nocond")])
 
+(define_insn_and_split "*ashlsi3_nobs"
+  [(set (match_operand:SI 0 "dest_reg_operand")
+	(ashift:SI (match_operand:SI 1 "register_operand")
+		   (match_operand:SI 2 "nonmemory_operand")))]
+  "!TARGET_BARREL_SHIFTER
+   && operands[2] != const1_rtx
+   && arc_pre_reload_split ()"
+  "#"
+  "&& 1"
+  [(const_int 0)]
+{
+  if (CONST_INT_P (operands[2]))
+    {
+      int n = INTVAL (operands[2]) & 0x1f;
+      if (n <= 9)
+	{
+	  if (n == 0)
+	    emit_move_insn (operands[0], operands[1]);
+	  else if (n <= 2)
+	    {
+	      emit_insn (gen_ashlsi3_cnt1 (operands[0], operands[1]));
+	      if (n == 2)
+		emit_insn (gen_ashlsi3_cnt1 (operands[0], operands[0]));
+	    }
+	  else
+	    {
+	      rtx zero = gen_reg_rtx (SImode);
+	      emit_move_insn (zero, const0_rtx);
+	      emit_insn (gen_add_shift (operands[0], operands[1],
+					GEN_INT (3), zero));
+	      for (n -= 3; n >= 3; n -= 3)
+		emit_insn (gen_add_shift (operands[0], operands[0],
+					  GEN_INT (3), zero));
+	      if (n == 2)
+		emit_insn (gen_add_shift (operands[0], operands[0],
+					  const2_rtx, zero));
+	      else if (n)
+		emit_insn (gen_ashlsi3_cnt1 (operands[0], operands[0]));
+	    }
+	  DONE;
+	}
+      else if (n >= 29)
+	{
+	  if (n < 31)
+	    {
+	      if (n == 29)
+		{
+		  emit_insn (gen_andsi3_i (operands[0], operands[1],
+					   GEN_INT (7)));
+		  emit_insn (gen_rotrsi3_cnt1 (operands[0], operands[0]));
+		}
+	      else
+		emit_insn (gen_andsi3_i (operands[0], operands[1],
+					 GEN_INT (3)));
+	      emit_insn (gen_rotrsi3_cnt1 (operands[0], operands[0]));
+	    }
+	  else
+	    emit_insn (gen_andsi3_i (operands[0], operands[1], const1_rtx));
+	  emit_insn (gen_rotrsi3_cnt1 (operands[0], operands[0]));
+	  DONE;
+	}
+    }
+
+  rtx shift = gen_rtx_fmt_ee (ASHIFT, SImode, operands[1], operands[2]);
+  emit_insn (gen_shift_si3_loop (operands[0], operands[1],
+				 operands[2], shift));
+  DONE;
+})
+
+(define_insn_and_split "*ashlri3_nobs"
+  [(set (match_operand:SI 0 "dest_reg_operand")
+	(ashiftrt:SI (match_operand:SI 1 "register_operand")
+		     (match_operand:SI 2 "nonmemory_operand")))]
+  "!TARGET_BARREL_SHIFTER
+   && operands[2] != const1_rtx
+   && arc_pre_reload_split ()"
+  "#"
+  "&& 1"
+  [(const_int 0)]
+{
+  if (CONST_INT_P (operands[2]))
+    {
+      int n = INTVAL (operands[2]) & 0x1f;
+      if (n <= 4)
+	{
+	  if (n != 0)
+	    {
+	      emit_insn (gen_ashrsi3_cnt1 (operands[0], operands[1]));
+	      while (--n > 0)
+		emit_insn (gen_ashrsi3_cnt1 (operands[0], operands[0]));
+	    }
+	  else 
+	    emit_move_insn (operands[0], operands[1]);
+	  DONE;
+	}
+    }
+
+  rtx pat;
+  rtx shift = gen_rtx_fmt_ee (ASHIFTRT, SImode, operands[1], operands[2]);
+  if (shiftr4_operator (shift, SImode))
+    pat = gen_shift_si3 (operands[0], operands[1], operands[2], shift);
+  else
+    pat = gen_shift_si3_loop (operands[0], operands[1], operands[2], shift);
+  emit_insn (pat);
+  DONE;
+})
+
+(define_insn_and_split "*lshrsi3_nobs"
+  [(set (match_operand:SI 0 "dest_reg_operand")
+	(lshiftrt:SI (match_operand:SI 1 "register_operand")
+		     (match_operand:SI 2 "nonmemory_operand")))]
+  "!TARGET_BARREL_SHIFTER
+   && operands[2] != const1_rtx
+   && arc_pre_reload_split ()"
+  "#"
+  "&& 1"
+  [(const_int 0)]
+{
+  if (CONST_INT_P (operands[2]))
+    {
+      int n = INTVAL (operands[2]) & 0x1f;
+      if (n <= 4)
+	{
+	  if (n != 0)
+	    {
+	      emit_insn (gen_lshrsi3_cnt1 (operands[0], operands[1]));
+	      while (--n > 0)
+		emit_insn (gen_lshrsi3_cnt1 (operands[0], operands[0]));
+	    }
+	  else 
+	    emit_move_insn (operands[0], operands[1]);
+	  DONE;
+	}
+    }
+
+  rtx pat;
+  rtx shift = gen_rtx_fmt_ee (LSHIFTRT, SImode, operands[1], operands[2]);
+  if (shiftr4_operator (shift, SImode))
+    pat = gen_shift_si3 (operands[0], operands[1], operands[2], shift);
+  else
+    pat = gen_shift_si3_loop (operands[0], operands[1], operands[2], shift);
+  emit_insn (pat);
+  DONE;
+})
+
+;; shift_si3 appears after {ashr,lshr}si3_nobs
+(define_insn "shift_si3"
+  [(set (match_operand:SI 0 "dest_reg_operand" "=r")
+	(match_operator:SI 3 "shiftr4_operator"
+			   [(match_operand:SI 1 "register_operand" "0")
+			    (match_operand:SI 2 "const_int_operand" "n")]))
+   (clobber (match_scratch:SI 4 "=&r"))
+   (clobber (reg:CC CC_REG))
+  ]
+  "!TARGET_BARREL_SHIFTER
+   && operands[2] != const1_rtx"
+  "* return output_shift (operands);"
+  [(set_attr "type" "shift")
+   (set_attr "length" "16")])
+
+;; shift_si3_loop appears after {ashl,ashr,lshr}si3_nobs
+(define_insn "shift_si3_loop"
+  [(set (match_operand:SI 0 "dest_reg_operand" "=r,r")
+	(match_operator:SI 3 "shift_operator"
+			   [(match_operand:SI 1 "register_operand" "0,0")
+			    (match_operand:SI 2 "nonmemory_operand" "rn,Cal")]))
+   (clobber (reg:SI LP_COUNT))
+   (clobber (reg:CC CC_REG))
+  ]
+  "!TARGET_BARREL_SHIFTER
+   && operands[2] != const1_rtx"
+  "* return output_shift (operands);"
+  [(set_attr "type" "shift")
+   (set_attr "length" "16,20")])
+
+;; Rotate instructions.
+
 (define_insn "rotrsi3"
   [(set (match_operand:SI 0 "dest_reg_operand"                    "=r, r,   r")
 	(rotatert:SI (match_operand:SI 1 "arc_nonmemory_operand"  " 0,rL,rCsz")
@@ -5945,7 +6071,7 @@ archs4x, archs4xd"
 		   (zero_extract:SI (match_dup 1) (match_dup 5) (match_dup 7)))])
    (match_dup 1)])
 
-(define_insn "*rotrsi3_cnt1"
+(define_insn "rotrsi3_cnt1"
   [(set (match_operand:SI 0 "dest_reg_operand"              "=r")
 	(rotatert:SI (match_operand:SI 1 "nonmemory_operand" "rL")
 		     (const_int 1)))]
@@ -5965,15 +6091,15 @@ archs4x, archs4xd"
    (set_attr "predicable" "no")
    (set_attr "length" "4")])
 
-(define_insn "*ashlsi2_cnt1"
+(define_insn "ashlsi3_cnt1"
   [(set (match_operand:SI 0 "dest_reg_operand"           "=q,w")
 	(ashift:SI (match_operand:SI 1 "register_operand" "q,c")
 		   (const_int 1)))]
   ""
   "asl%? %0,%1%&"
-  [(set_attr "type" "shift")
+  [(set_attr "type" "unary")
    (set_attr "iscompact" "maybe,false")
-   (set_attr "length" "4")
+   (set_attr "length" "*,4")
    (set_attr "predicable" "no,no")])
 
 (define_insn "*ashlsi2_cnt8"
@@ -5998,23 +6124,23 @@ archs4x, archs4xd"
    (set_attr "length" "4")
    (set_attr "predicable" "no")])
 
-(define_insn "*lshrsi3_cnt1"
+(define_insn "lshrsi3_cnt1"
   [(set (match_operand:SI 0 "dest_reg_operand"             "=q,w")
 	(lshiftrt:SI (match_operand:SI 1 "register_operand" "q,c")
 		     (const_int 1)))]
   ""
   "lsr%? %0,%1%&"
-  [(set_attr "type" "shift")
+  [(set_attr "type" "unary")
    (set_attr "iscompact" "maybe,false")
    (set_attr "predicable" "no,no")])
 
-(define_insn "*ashrsi3_cnt1"
+(define_insn "ashrsi3_cnt1"
   [(set (match_operand:SI 0 "dest_reg_operand"             "=q,w")
 	(ashiftrt:SI (match_operand:SI 1 "register_operand" "q,c")
 		     (const_int 1)))]
   ""
   "asr%? %0,%1%&"
-  [(set_attr "type" "shift")
+  [(set_attr "type" "unary")
    (set_attr "iscompact" "maybe,false")
    (set_attr "predicable" "no,no")])
 
@@ -6364,7 +6490,7 @@ archs4x, archs4xd"
    (set_attr "type" "multi")
    (set_attr "predicable" "yes")])
 
-(define_insn "*add_shift"
+(define_insn "add_shift"
   [(set (match_operand:SI 0 "register_operand" "=q,r,r")
 	(plus:SI (ashift:SI (match_operand:SI 1 "register_operand" "q,r,r")
 			    (match_operand:SI 2 "_1_2_3_operand" ""))
diff --git a/gcc/config/arc/predicates.md b/gcc/config/arc/predicates.md
index 7650e47..e37d884 100644
--- a/gcc/config/arc/predicates.md
+++ b/gcc/config/arc/predicates.md
@@ -549,16 +549,6 @@
   (match_code "ashiftrt, lshiftrt, ashift")
 )
 
-;; Return true if OP is a left shift operator that can be implemented in
-;; four insn words or less without a barrel shifter or multiplier.
-(define_predicate "shiftl4_operator"
-  (and (match_code "ashift")
-       (match_test "const_int_operand (XEXP (op, 1), VOIDmode) ")
-       (match_test "UINTVAL (XEXP (op, 1)) <= 9U
-		    || INTVAL (XEXP (op, 1)) == 29
-		    || INTVAL (XEXP (op, 1)) == 30
-		    || INTVAL (XEXP (op, 1)) == 31")))
-
 ;; Return true if OP is a right shift operator that can be implemented in
 ;; four insn words or less without a barrel shifter or multiplier.
 (define_predicate "shiftr4_operator"
@@ -568,12 +558,6 @@
 		    || INTVAL (XEXP (op, 1)) == 30
 		    || INTVAL (XEXP (op, 1)) == 31")))
 
-;; Return true if OP is a shift operator that can be implemented in
-;; four insn words or less without a barrel shifter or multiplier.
-(define_predicate "shift4_operator"
-  (ior (match_operand 0 "shiftl4_operator")
-       (match_operand 0 "shiftr4_operator")))
-
 (define_predicate "mult_operator"
     (and (match_code "mult") (match_test "TARGET_MPY"))
 )
diff --git a/gcc/testsuite/gcc.target/arc/ashrsi-1.c b/gcc/testsuite/gcc.target/arc/ashrsi-1.c
new file mode 100644
index 0000000..3100aa3
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arc/ashrsi-1.c
@@ -0,0 +1,36 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mcpu=hs" } */
+
+int ashr1(int x) { return x >> 1; }
+int ashr2(int x) { return x >> 2; }
+int ashr3(int x) { return x >> 3; }
+int ashr4(int x) { return x >> 4; }
+int ashr5(int x) { return x >> 5; }
+int ashr6(int x) { return x >> 6; }
+int ashr7(int x) { return x >> 7; }
+int ashr8(int x) { return x >> 8; }
+int ashr9(int x) { return x >> 9; }
+int ashr10(int x) { return x >> 10; }
+int ashr11(int x) { return x >> 11; }
+int ashr12(int x) { return x >> 12; }
+int ashr13(int x) { return x >> 13; }
+int ashr14(int x) { return x >> 14; }
+int ashr15(int x) { return x >> 15; }
+int ashr16(int x) { return x >> 16; }
+int ashr17(int x) { return x >> 17; }
+int ashr18(int x) { return x >> 18; }
+int ashr19(int x) { return x >> 19; }
+int ashr20(int x) { return x >> 20; }
+int ashr21(int x) { return x >> 21; }
+int ashr22(int x) { return x >> 22; }
+int ashr23(int x) { return x >> 23; }
+int ashr24(int x) { return x >> 24; }
+int ashr25(int x) { return x >> 25; }
+int ashr26(int x) { return x >> 26; }
+int ashr27(int x) { return x >> 27; }
+int ashr28(int x) { return x >> 28; }
+int ashr29(int x) { return x >> 29; }
+int ashr30(int x) { return x >> 30; }
+int ashr31(int x) { return x >> 31; }
+
+/* { dg-final { scan-assembler-times "asr_s\\s+r0,r0" 31 } } */
diff --git a/gcc/testsuite/gcc.target/arc/ashrsi-2.c b/gcc/testsuite/gcc.target/arc/ashrsi-2.c
new file mode 100644
index 0000000..b551ee5
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arc/ashrsi-2.c
@@ -0,0 +1,7 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mcpu=em" } */
+
+int foo(int x) { return x >> 1; }
+
+/* { dg-final { scan-assembler-times "asr_s\\s+r0,r0" 1 } } */
+/* { dg-final { scan-assembler "j_s\.d" } } */
diff --git a/gcc/testsuite/gcc.target/arc/ashrsi-3.c b/gcc/testsuite/gcc.target/arc/ashrsi-3.c
new file mode 100644
index 0000000..c030682
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arc/ashrsi-3.c
@@ -0,0 +1,7 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mcpu=em" } */
+
+int foo(int x, int y) { return y >> 1; }
+
+/* { dg-final { scan-assembler-times "asr_s\\s+r0,r1" 1 } } */
+/* { dg-final { scan-assembler "j_s\.d" } } */
diff --git a/gcc/testsuite/gcc.target/arc/ashrsi-4.c b/gcc/testsuite/gcc.target/arc/ashrsi-4.c
new file mode 100644
index 0000000..98e58bc
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arc/ashrsi-4.c
@@ -0,0 +1,7 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mcpu=em" } */
+
+int foo(int x) { return x >> 2; }
+
+/* { dg-final { scan-assembler-times "asr_s\\s+r0,r0" 2 } } */
+/* { dg-final { scan-assembler "j_s\.d" } } */
diff --git a/gcc/testsuite/gcc.target/arc/ashrsi-5.c b/gcc/testsuite/gcc.target/arc/ashrsi-5.c
new file mode 100644
index 0000000..f40af2e
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arc/ashrsi-5.c
@@ -0,0 +1,8 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mcpu=em" } */
+
+int foo(int x, int y) { return y >> 2; }
+
+/* { dg-final { scan-assembler-times "asr_s\\s+r0,r0" 1 } } */
+/* { dg-final { scan-assembler-times "asr_s\\s+r0,r1" 1 } } */
+/* { dg-final { scan-assembler "j_s\.d" } } */
diff --git a/gcc/testsuite/gcc.target/arc/lshrsi-1.c b/gcc/testsuite/gcc.target/arc/lshrsi-1.c
new file mode 100644
index 0000000..9bec79d
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arc/lshrsi-1.c
@@ -0,0 +1,36 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mcpu=hs" } */
+
+unsigned int lshr1(unsigned int x) { return x >> 1; }
+unsigned int lshr2(unsigned int x) { return x >> 2; }
+unsigned int lshr3(unsigned int x) { return x >> 3; }
+unsigned int lshr4(unsigned int x) { return x >> 4; }
+unsigned int lshr5(unsigned int x) { return x >> 5; }
+unsigned int lshr6(unsigned int x) { return x >> 6; }
+unsigned int lshr7(unsigned int x) { return x >> 7; }
+unsigned int lshr8(unsigned int x) { return x >> 8; }
+unsigned int lshr9(unsigned int x) { return x >> 9; }
+unsigned int lshr10(unsigned int x) { return x >> 10; }
+unsigned int lshr11(unsigned int x) { return x >> 11; }
+unsigned int lshr12(unsigned int x) { return x >> 12; }
+unsigned int lshr13(unsigned int x) { return x >> 13; }
+unsigned int lshr14(unsigned int x) { return x >> 14; }
+unsigned int lshr15(unsigned int x) { return x >> 15; }
+unsigned int lshr16(unsigned int x) { return x >> 16; }
+unsigned int lshr17(unsigned int x) { return x >> 17; }
+unsigned int lshr18(unsigned int x) { return x >> 18; }
+unsigned int lshr19(unsigned int x) { return x >> 19; }
+unsigned int lshr20(unsigned int x) { return x >> 20; }
+unsigned int lshr21(unsigned int x) { return x >> 21; }
+unsigned int lshr22(unsigned int x) { return x >> 22; }
+unsigned int lshr23(unsigned int x) { return x >> 23; }
+unsigned int lshr24(unsigned int x) { return x >> 24; }
+unsigned int lshr25(unsigned int x) { return x >> 25; }
+unsigned int lshr26(unsigned int x) { return x >> 26; }
+unsigned int lshr27(unsigned int x) { return x >> 27; }
+unsigned int lshr28(unsigned int x) { return x >> 28; }
+unsigned int lshr29(unsigned int x) { return x >> 29; }
+unsigned int lshr30(unsigned int x) { return x >> 30; }
+unsigned int lshr31(unsigned int x) { return x >> 31; }
+
+/* { dg-final { scan-assembler-times "lsr_s\\s+r0,r0" 31 } } */
diff --git a/gcc/testsuite/gcc.target/arc/lshrsi-2.c b/gcc/testsuite/gcc.target/arc/lshrsi-2.c
new file mode 100644
index 0000000..d857740
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arc/lshrsi-2.c
@@ -0,0 +1,7 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mcpu=em" } */
+
+unsigned int foo(unsigned int x) { return x >> 1; }
+
+/* { dg-final { scan-assembler-times "lsr_s\\s+r0,r0" 1 } } */
+/* { dg-final { scan-assembler "j_s\.d" } } */
diff --git a/gcc/testsuite/gcc.target/arc/lshrsi-3.c b/gcc/testsuite/gcc.target/arc/lshrsi-3.c
new file mode 100644
index 0000000..58bfac0
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arc/lshrsi-3.c
@@ -0,0 +1,7 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mcpu=em" } */
+
+unsigned int foo(unsigned int x, unsigned int y){ return y >> 1; }
+
+/* { dg-final { scan-assembler-times "lsr_s\\s+r0,r1" 1 } } */
+/* { dg-final { scan-assembler "j_s\.d" } } */
diff --git a/gcc/testsuite/gcc.target/arc/lshrsi-4.c b/gcc/testsuite/gcc.target/arc/lshrsi-4.c
new file mode 100644
index 0000000..3094de2
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arc/lshrsi-4.c
@@ -0,0 +1,7 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mcpu=em" } */
+
+unsigned int foo(unsigned int x) { return x >> 2; }
+
+/* { dg-final { scan-assembler-times "lsr_s\\s+r0,r0" 2 } } */
+/* { dg-final { scan-assembler "j_s\.d" } } */
diff --git a/gcc/testsuite/gcc.target/arc/lshrsi-5.c b/gcc/testsuite/gcc.target/arc/lshrsi-5.c
new file mode 100644
index 0000000..dce3f00
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arc/lshrsi-5.c
@@ -0,0 +1,8 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mcpu=em" } */
+
+unsigned int foo(unsigned int x, unsigned int y){ return y >> 2; }
+
+/* { dg-final { scan-assembler-times "lsr_s\\s+r0,r0" 1 } } */
+/* { dg-final { scan-assembler-times "lsr_s\\s+r0,r1" 1 } } */
+/* { dg-final { scan-assembler "j_s\.d" } } */
diff --git a/gcc/testsuite/gcc.target/arc/shlsi-1.c b/gcc/testsuite/gcc.target/arc/shlsi-1.c
new file mode 100644
index 0000000..eea7c56
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arc/shlsi-1.c
@@ -0,0 +1,36 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mcpu=hs" } */
+
+unsigned int shl1(unsigned int x) { return x << 1; }
+unsigned int shl2(unsigned int x) { return x << 2; }
+unsigned int shl3(unsigned int x) { return x << 3; }
+unsigned int shl4(unsigned int x) { return x << 4; }
+unsigned int shl5(unsigned int x) { return x << 5; }
+unsigned int shl6(unsigned int x) { return x << 6; }
+unsigned int shl7(unsigned int x) { return x << 7; }
+unsigned int shl8(unsigned int x) { return x << 8; }
+unsigned int shl9(unsigned int x) { return x << 9; }
+unsigned int shl10(unsigned int x) { return x << 10; }
+unsigned int shl11(unsigned int x) { return x << 11; }
+unsigned int shl12(unsigned int x) { return x << 12; }
+unsigned int shl13(unsigned int x) { return x << 13; }
+unsigned int shl14(unsigned int x) { return x << 14; }
+unsigned int shl15(unsigned int x) { return x << 15; }
+unsigned int shl16(unsigned int x) { return x << 16; }
+unsigned int shl17(unsigned int x) { return x << 17; }
+unsigned int shl18(unsigned int x) { return x << 18; }
+unsigned int shl19(unsigned int x) { return x << 19; }
+unsigned int shl20(unsigned int x) { return x << 20; }
+unsigned int shl21(unsigned int x) { return x << 21; }
+unsigned int shl22(unsigned int x) { return x << 22; }
+unsigned int shl23(unsigned int x) { return x << 23; }
+unsigned int shl24(unsigned int x) { return x << 24; }
+unsigned int shl25(unsigned int x) { return x << 25; }
+unsigned int shl26(unsigned int x) { return x << 26; }
+unsigned int shl27(unsigned int x) { return x << 27; }
+unsigned int shl28(unsigned int x) { return x << 28; }
+unsigned int shl29(unsigned int x) { return x << 29; }
+unsigned int shl30(unsigned int x) { return x << 30; }
+unsigned int shl31(unsigned int x) { return x << 31; }
+
+/* { dg-final { scan-assembler-times "asl_s\\s+r0,r0,\[1-9\]" 31 } } */
diff --git a/gcc/testsuite/gcc.target/arc/shlsi-2.c b/gcc/testsuite/gcc.target/arc/shlsi-2.c
new file mode 100644
index 0000000..ab8d2f8
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arc/shlsi-2.c
@@ -0,0 +1,8 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mcpu=em" } */
+
+unsigned int foo(unsigned int x) { return x << 1; }
+
+/* { dg-final { scan-assembler-times "asl_s\\s+r0,r0" 1 } } */
+/* { dg-final { scan-assembler "j_s\.d" } } */
+
diff --git a/gcc/testsuite/gcc.target/arc/shlsi-3.c b/gcc/testsuite/gcc.target/arc/shlsi-3.c
new file mode 100644
index 0000000..244a786
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arc/shlsi-3.c
@@ -0,0 +1,8 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mcpu=em" } */
+
+unsigned int foo(unsigned int x, unsigned int y) { return y << 1; }
+
+/* { dg-final { scan-assembler-times "asl_s\\s+r0,r1" 1 } } */
+/* { dg-final { scan-assembler "j_s\.d" } } */
+
diff --git a/gcc/testsuite/gcc.target/arc/shlsi-4.c b/gcc/testsuite/gcc.target/arc/shlsi-4.c
new file mode 100644
index 0000000..8fdc25e
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arc/shlsi-4.c
@@ -0,0 +1,8 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mcpu=em" } */
+
+unsigned int foo(unsigned int x) { return x << 2; }
+
+/* { dg-final { scan-assembler-times "asl_s\\s+r0,r0" 2 } } */
+/* { dg-final { scan-assembler "j_s\.d" } } */
+
diff --git a/gcc/testsuite/gcc.target/arc/shlsi-5.c b/gcc/testsuite/gcc.target/arc/shlsi-5.c
new file mode 100644
index 0000000..a91103e
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arc/shlsi-5.c
@@ -0,0 +1,9 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mcpu=em" } */
+
+unsigned int foo(unsigned int x, unsigned int y) { return y << 2; }
+
+/* { dg-final { scan-assembler-times "asl_s\\s+r0,r0" 1 } } */
+/* { dg-final { scan-assembler-times "asl_s\\s+r0,r1" 1 } } */
+/* { dg-final { scan-assembler "j_s\.d" } } */
+

^ permalink raw reply	[flat|nested] 5+ messages in thread

* RE: [ARC PATCH] Split SImode shifts pre-reload on !TARGET_BARREL_SHIFTER.
  2023-09-28 11:26 [ARC PATCH] Split SImode shifts pre-reload on !TARGET_BARREL_SHIFTER Roger Sayle
@ 2023-10-03 14:25 ` Claudiu Zissulescu
  2023-10-03 15:34   ` Roger Sayle
  2023-10-04  8:37 ` Claudiu Zissulescu Ianculescu
  1 sibling, 1 reply; 5+ messages in thread
From: Claudiu Zissulescu @ 2023-10-03 14:25 UTC (permalink / raw)
  To: Roger Sayle, gcc-patches

Hi Roger,

It was nice to meet you too.

Thank you in looking into the ARC's non-Barrel Shifter configurations.  I will dive into your patch asap, but before starting here are a few of my comments: 

-----Original Message-----
From: Roger Sayle <roger@nextmovesoftware.com> 
Sent: Thursday, September 28, 2023 2:27 PM
To: gcc-patches@gcc.gnu.org
Cc: Claudiu Zissulescu <claziss@synopsys.com>
Subject: [ARC PATCH] Split SImode shifts pre-reload on !TARGET_BARREL_SHIFTER.


Hi Claudiu,
It was great meeting up with you and the Synopsys ARC team at the GNU tools Cauldron in Cambridge.

This patch is the first in a series to improve SImode and DImode shifts and rotates in the ARC backend.  This first piece splits SImode shifts, for !TARGET_BARREL_SHIFTER targets, after combine and before reload, in the split1 pass, as suggested by the FIXME comment above output_shift in arc.cc.  To do this I've copied the implementation of the x86_pre_reload_split function from i386 backend, and renamed it arc_pre_reload_split.

Although the actual implementations of shifts remain the same (as in output_shift), having them as explicit instructions in the RTL stream allows better scheduling and use of compact forms when available.  The benefits can be seen in two short examples below.

For the function:
unsigned int foo(unsigned int x, unsigned int y) {
  return y << 2;
}

GCC with -O2 -mcpu=em would previously generate:
foo:    add r1,r1,r1
        add r1,r1,r1
        j_s.d   [blink]
        mov_s   r0,r1   ;4

[CZI] The move shouldn't be generated indeed. The use of ADDs are slightly beneficial for older ARCv1 arches.

and with this patch now generates:
foo:    asl_s r0,r1
        j_s.d   [blink]
        asl_s r0,r0

[CZI] Nice. This new sequence is as fast as we can get for our ARCv2 cpus.

Notice the original (from shift_si3's output_shift) requires the shift sequence to be monolithic with the same destination register as the source (requiring an extra mov_s).  The new version can eliminate this move, and schedule the second asl in the branch delay slot of the return.

For the function:
int x,y,z;

void bar()
{
  x <<= 3;
  y <<= 3;
  z <<= 3;
}

GCC -O2 -mcpu=em currently generates:
bar:    push_s  r13
        ld.as   r12,[gp,@x@sda] ;23
        ld.as   r3,[gp,@y@sda]  ;23
        mov r2,0
        add3 r12,r2,r12
        mov r2,0
        add3 r3,r2,r3
        ld.as   r2,[gp,@z@sda]  ;23
        st.as   r12,[gp,@x@sda] ;26
        mov r13,0
        add3 r2,r13,r2
        st.as   r3,[gp,@y@sda]  ;26
        st.as   r2,[gp,@z@sda]  ;26
        j_s.d   [blink]
        pop_s   r13

where each shift by 3, uses ARC's add3 instruction, which is similar to x86's lea implementing x = (y<<3) + z, but requires the value zero to be placed in a temporary register "z".  Splitting this before reload allows these pseudos to be shared/reused.  With this patch, we get

bar:    ld.as   r2,[gp,@x@sda]  ;23
        mov_s   r3,0    ;3
        add3    r2,r3,r2
        ld.as   r3,[gp,@y@sda]  ;23
        st.as   r2,[gp,@x@sda]  ;26
        ld.as   r2,[gp,@z@sda]  ;23
        mov_s   r12,0   ;3
        add3    r3,r12,r3
        add3    r2,r12,r2
        st.as   r3,[gp,@y@sda]  ;26
        st.as   r2,[gp,@z@sda]  ;26
        j_s     [blink]

[CZI] Looks great, but it also shows that I've forgot to add to ADD3 instruction the Ra,LIMM,RC variant, which will lead to have instead of 
        mov_s   r3,0    ;3
        add3    r2,r3,r2
Only this add3,0,r2, Indeed it is longer instruction but faster.

Unfortunately, register allocation means that we only share two of the three "mov_s z,0", but this is sufficient to reduce register pressure enough to avoid spilling r13 in the prologue/epilogue.

This patch also contains a (latent?) bug fix.  The implementation of the default insn "length" attribute, assumes instructions of type "shift" have two input operands and accesses operands[2], hence specializations of shifts that don't have a operands[2], need to be categorized as type "unary" (which results in the correct length).

[CZI] The ARC types need an upgrade too.

This patch has been tested on a cross-compiler to arc-elf (hosted on x86_64-pc-linux-gnu), but because I've an incomplete tool chain many of the regression test fail, but there are no new failures with new test cases added below.  If you can confirm that there are no issues from additional testing, is this OK for mainline?

Finally a quick technical question.  ARC's zero overhead loops require at least two instructions in the loop, so currently the backend's implementation of shr20 pads the loop body with a "nop".

lshr20: mov.f lp_count, 20
        lpnz    2f
        lsr r0,r0
        nop
2:      # end single insn loop
        j_s     [blink]


[CZI] The ZOLs (LP instructions) are not great when dealing with short loop blocks. Hence, the NOP instruction. Personally, I don't fancy using the LP instruction in this case, as it prohibits LP usage for a true for-loop.

could this be more efficiently implemented as:

lshr20: mov lp_count, 10
        lp 2f
        lsr_s r0,r0
        lsr_s r0,r0
2:      # end single insn loop
        j_s     [blink]

i.e. half the number of iterations, but doing twice as much useful work in each iteration?  Or might the nop be free on advanced microarchitectures, and/or the consecutive dependent shifts cause a pipeline stall?  It would be nice to fuse loops to implement rotations, such that rotr16 (aka swap) would look like:

rot16:  mov_s r1,r0
        mov lp_count, 16
        lp 2f
        asl_s r0,r0
        lsr_s r1,r1
2:      # end single insn loop
        j_s.d    [blink]
        or_s r0,r0,r1

Thanks in advance,
Roger


2023-09-28  Roger Sayle  <roger@nextmovesoftware.com>

gcc/ChangeLog
        * config/arc/arc-protos.h (emit_shift): Delete prototype.
        (arc_pre_reload_split): New function prototype.
        * config/arc/arc.cc (emit_shift): Delete function.
        (arc_pre_reload_split): New predicate function, copied from i386,
        to schedule define_insn_and_split splitters to the split1 pass.
        * config/arc/arc.md (ashlsi3): Expand RTL template unconditionally.
        (ashrsi3): Likewise.
        (lshrsi3): Likewise.
        (shift_si3): Move after other shift patterns, and disable when
        operands[2] is one (which is handled by its own define_insn).
        Use shiftr4_operator, instead of shift4_operator, as this is no
        longer used for left shifts.
        (shift_si3_loop): Likewise.  Additionally remove match_scratch.
        (*ashlsi3_nobs): New pre-reload define_insn_and_split.
        (*ashrsi3_nobs): Likewise.
        (*lshrsi3_nobs): Likewise.
        (rotrsi3_cnt1): Rename define_insn from *rotrsi3_cnt1.
        (ashlsi3_cnt1): Rename define_insn from *ashlsi2_cnt1.  Change
        insn type attribute to "unary", as this doesn't have operands[2].
        Change length attribute to "*,4" to allow compact representation.
        (lshrsi3_cnt1): Rename define_insn from *lshrsi3_cnt1.  Change
        insn type attribute to "unary", as this doesn't have operands[2].
        (ashrsi3_cnt1): Rename define_insn from *ashrsi3_cnt1.  Change
        insn type attribute to "unary", as this doesn't have operands[2].
        (add_shift): Rename define_insn from *add_shift.
        * config/arc/predicates.md (shiftl4_operator): Delete.
        (shift4_operator): Delete.

gcc/testsuite/ChangeLog
        * gcc.target/arc/ashrsi-1.c: New TARGET_BARREL_SHIFTER test case.
        * gcc.target/arc/ashrsi-2.c: New !TARGET_BARREL_SHIFTER test case.
        * gcc.target/arc/ashrsi-3.c: Likewise.
        * gcc.target/arc/ashrsi-4.c: Likewise.
        * gcc.target/arc/ashrsi-5.c: Likewise.
        * gcc.target/arc/lshrsi-1.c: New TARGET_BARREL_SHIFTER test case.
        * gcc.target/arc/lshrsi-2.c: New !TARGET_BARREL_SHIFTER test case.
        * gcc.target/arc/lshrsi-3.c: Likewise.
        * gcc.target/arc/lshrsi-4.c: Likewise.
        * gcc.target/arc/lshrsi-5.c: Likewise.
        * gcc.target/arc/shlsi-1.c: New TARGET_BARREL_SHIFTER test case.
        * gcc.target/arc/shlsi-2.c: New !TARGET_BARREL_SHIFTER test case.
        * gcc.target/arc/shlsi-3.c: Likewise.
        * gcc.target/arc/shlsi-4.c: Likewise.
        * gcc.target/arc/shlsi-5.c: Likewise.


^ permalink raw reply	[flat|nested] 5+ messages in thread

* RE: [ARC PATCH] Split SImode shifts pre-reload on !TARGET_BARREL_SHIFTER.
  2023-10-03 14:25 ` Claudiu Zissulescu
@ 2023-10-03 15:34   ` Roger Sayle
  2023-10-03 16:07     ` Claudiu Zissulescu Ianculescu
  0 siblings, 1 reply; 5+ messages in thread
From: Roger Sayle @ 2023-10-03 15:34 UTC (permalink / raw)
  To: 'Claudiu Zissulescu'; +Cc: gcc-patches


Hi Claudiu,
Thanks for the answers to my technical questions.
If you'd prefer to update arc.md's add3 pattern first,
I'm happy to update/revise my patch based on this
and your feedback, for example preferring add over
asl_s (or controlling this choice with -Os).

Thanks again.
Roger
--

> -----Original Message-----
> From: Claudiu Zissulescu <Claudiu.Zissulescu@synopsys.com>
> Sent: 03 October 2023 15:26
> To: Roger Sayle <roger@nextmovesoftware.com>; gcc-patches@gcc.gnu.org
> Subject: RE: [ARC PATCH] Split SImode shifts pre-reload on
> !TARGET_BARREL_SHIFTER.
> 
> Hi Roger,
> 
> It was nice to meet you too.
> 
> Thank you in looking into the ARC's non-Barrel Shifter configurations.  I
will dive
> into your patch asap, but before starting here are a few of my comments:
> 
> -----Original Message-----
> From: Roger Sayle <roger@nextmovesoftware.com>
> Sent: Thursday, September 28, 2023 2:27 PM
> To: gcc-patches@gcc.gnu.org
> Cc: Claudiu Zissulescu <claziss@synopsys.com>
> Subject: [ARC PATCH] Split SImode shifts pre-reload on
> !TARGET_BARREL_SHIFTER.
> 
> 
> Hi Claudiu,
> It was great meeting up with you and the Synopsys ARC team at the GNU
tools
> Cauldron in Cambridge.
> 
> This patch is the first in a series to improve SImode and DImode shifts
and rotates
> in the ARC backend.  This first piece splits SImode shifts, for
> !TARGET_BARREL_SHIFTER targets, after combine and before reload, in the
split1
> pass, as suggested by the FIXME comment above output_shift in arc.cc.  To
do
> this I've copied the implementation of the x86_pre_reload_split function
from
> i386 backend, and renamed it arc_pre_reload_split.
> 
> Although the actual implementations of shifts remain the same (as in
> output_shift), having them as explicit instructions in the RTL stream
allows better
> scheduling and use of compact forms when available.  The benefits can be
seen in
> two short examples below.
> 
> For the function:
> unsigned int foo(unsigned int x, unsigned int y) {
>   return y << 2;
> }
> 
> GCC with -O2 -mcpu=em would previously generate:
> foo:    add r1,r1,r1
>         add r1,r1,r1
>         j_s.d   [blink]
>         mov_s   r0,r1   ;4
> 
> [CZI] The move shouldn't be generated indeed. The use of ADDs are slightly
> beneficial for older ARCv1 arches.
> 
> and with this patch now generates:
> foo:    asl_s r0,r1
>         j_s.d   [blink]
>         asl_s r0,r0
> 
> [CZI] Nice. This new sequence is as fast as we can get for our ARCv2 cpus.
> 
> Notice the original (from shift_si3's output_shift) requires the shift
sequence to be
> monolithic with the same destination register as the source (requiring an
extra
> mov_s).  The new version can eliminate this move, and schedule the second
asl in
> the branch delay slot of the return.
> 
> For the function:
> int x,y,z;
> 
> void bar()
> {
>   x <<= 3;
>   y <<= 3;
>   z <<= 3;
> }
> 
> GCC -O2 -mcpu=em currently generates:
> bar:    push_s  r13
>         ld.as   r12,[gp,@x@sda] ;23
>         ld.as   r3,[gp,@y@sda]  ;23
>         mov r2,0
>         add3 r12,r2,r12
>         mov r2,0
>         add3 r3,r2,r3
>         ld.as   r2,[gp,@z@sda]  ;23
>         st.as   r12,[gp,@x@sda] ;26
>         mov r13,0
>         add3 r2,r13,r2
>         st.as   r3,[gp,@y@sda]  ;26
>         st.as   r2,[gp,@z@sda]  ;26
>         j_s.d   [blink]
>         pop_s   r13
> 
> where each shift by 3, uses ARC's add3 instruction, which is similar to
x86's lea
> implementing x = (y<<3) + z, but requires the value zero to be placed in a
> temporary register "z".  Splitting this before reload allows these pseudos
to be
> shared/reused.  With this patch, we get
> 
> bar:    ld.as   r2,[gp,@x@sda]  ;23
>         mov_s   r3,0    ;3
>         add3    r2,r3,r2
>         ld.as   r3,[gp,@y@sda]  ;23
>         st.as   r2,[gp,@x@sda]  ;26
>         ld.as   r2,[gp,@z@sda]  ;23
>         mov_s   r12,0   ;3
>         add3    r3,r12,r3
>         add3    r2,r12,r2
>         st.as   r3,[gp,@y@sda]  ;26
>         st.as   r2,[gp,@z@sda]  ;26
>         j_s     [blink]
> 
> [CZI] Looks great, but it also shows that I've forgot to add to ADD3
instruction the
> Ra,LIMM,RC variant, which will lead to have instead of
>         mov_s   r3,0    ;3
>         add3    r2,r3,r2
> Only this add3,0,r2, Indeed it is longer instruction but faster.
> 
> Unfortunately, register allocation means that we only share two of the
three
> "mov_s z,0", but this is sufficient to reduce register pressure enough to
avoid
> spilling r13 in the prologue/epilogue.
> 
> This patch also contains a (latent?) bug fix.  The implementation of the
default
> insn "length" attribute, assumes instructions of type "shift" have two
input
> operands and accesses operands[2], hence specializations of shifts that
don't
> have a operands[2], need to be categorized as type "unary" (which results
in the
> correct length).
> 
> [CZI] The ARC types need an upgrade too.
> 
> This patch has been tested on a cross-compiler to arc-elf (hosted on
x86_64-pc-
> linux-gnu), but because I've an incomplete tool chain many of the
regression test
> fail, but there are no new failures with new test cases added below.  If
you can
> confirm that there are no issues from additional testing, is this OK for
mainline?
> 
> Finally a quick technical question.  ARC's zero overhead loops require at
least two
> instructions in the loop, so currently the backend's implementation of
shr20 pads
> the loop body with a "nop".
> 
> lshr20: mov.f lp_count, 20
>         lpnz    2f
>         lsr r0,r0
>         nop
> 2:      # end single insn loop
>         j_s     [blink]
> 
> 
> [CZI] The ZOLs (LP instructions) are not great when dealing with short
loop blocks.
> Hence, the NOP instruction. Personally, I don't fancy using the LP
instruction in
> this case, as it prohibits LP usage for a true for-loop.
> 
> could this be more efficiently implemented as:
> 
> lshr20: mov lp_count, 10
>         lp 2f
>         lsr_s r0,r0
>         lsr_s r0,r0
> 2:      # end single insn loop
>         j_s     [blink]
> 
> i.e. half the number of iterations, but doing twice as much useful work in
each
> iteration?  Or might the nop be free on advanced microarchitectures,
and/or the
> consecutive dependent shifts cause a pipeline stall?  It would be nice to
fuse loops
> to implement rotations, such that rotr16 (aka swap) would look like:
> 
> rot16:  mov_s r1,r0
>         mov lp_count, 16
>         lp 2f
>         asl_s r0,r0
>         lsr_s r1,r1
> 2:      # end single insn loop
>         j_s.d    [blink]
>         or_s r0,r0,r1
> 
> Thanks in advance,
> Roger
> 
> 
> 2023-09-28  Roger Sayle  <roger@nextmovesoftware.com>
> 
> gcc/ChangeLog
>         * config/arc/arc-protos.h (emit_shift): Delete prototype.
>         (arc_pre_reload_split): New function prototype.
>         * config/arc/arc.cc (emit_shift): Delete function.
>         (arc_pre_reload_split): New predicate function, copied from i386,
>         to schedule define_insn_and_split splitters to the split1 pass.
>         * config/arc/arc.md (ashlsi3): Expand RTL template
unconditionally.
>         (ashrsi3): Likewise.
>         (lshrsi3): Likewise.
>         (shift_si3): Move after other shift patterns, and disable when
>         operands[2] is one (which is handled by its own define_insn).
>         Use shiftr4_operator, instead of shift4_operator, as this is no
>         longer used for left shifts.
>         (shift_si3_loop): Likewise.  Additionally remove match_scratch.
>         (*ashlsi3_nobs): New pre-reload define_insn_and_split.
>         (*ashrsi3_nobs): Likewise.
>         (*lshrsi3_nobs): Likewise.
>         (rotrsi3_cnt1): Rename define_insn from *rotrsi3_cnt1.
>         (ashlsi3_cnt1): Rename define_insn from *ashlsi2_cnt1.  Change
>         insn type attribute to "unary", as this doesn't have operands[2].
>         Change length attribute to "*,4" to allow compact representation.
>         (lshrsi3_cnt1): Rename define_insn from *lshrsi3_cnt1.  Change
>         insn type attribute to "unary", as this doesn't have operands[2].
>         (ashrsi3_cnt1): Rename define_insn from *ashrsi3_cnt1.  Change
>         insn type attribute to "unary", as this doesn't have operands[2].
>         (add_shift): Rename define_insn from *add_shift.
>         * config/arc/predicates.md (shiftl4_operator): Delete.
>         (shift4_operator): Delete.
> 
> gcc/testsuite/ChangeLog
>         * gcc.target/arc/ashrsi-1.c: New TARGET_BARREL_SHIFTER test case.
>         * gcc.target/arc/ashrsi-2.c: New !TARGET_BARREL_SHIFTER test case.
>         * gcc.target/arc/ashrsi-3.c: Likewise.
>         * gcc.target/arc/ashrsi-4.c: Likewise.
>         * gcc.target/arc/ashrsi-5.c: Likewise.
>         * gcc.target/arc/lshrsi-1.c: New TARGET_BARREL_SHIFTER test case.
>         * gcc.target/arc/lshrsi-2.c: New !TARGET_BARREL_SHIFTER test case.
>         * gcc.target/arc/lshrsi-3.c: Likewise.
>         * gcc.target/arc/lshrsi-4.c: Likewise.
>         * gcc.target/arc/lshrsi-5.c: Likewise.
>         * gcc.target/arc/shlsi-1.c: New TARGET_BARREL_SHIFTER test case.
>         * gcc.target/arc/shlsi-2.c: New !TARGET_BARREL_SHIFTER test case.
>         * gcc.target/arc/shlsi-3.c: Likewise.
>         * gcc.target/arc/shlsi-4.c: Likewise.
>         * gcc.target/arc/shlsi-5.c: Likewise.



^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [ARC PATCH] Split SImode shifts pre-reload on !TARGET_BARREL_SHIFTER.
  2023-10-03 15:34   ` Roger Sayle
@ 2023-10-03 16:07     ` Claudiu Zissulescu Ianculescu
  0 siblings, 0 replies; 5+ messages in thread
From: Claudiu Zissulescu Ianculescu @ 2023-10-03 16:07 UTC (permalink / raw)
  To: Roger Sayle; +Cc: Claudiu Zissulescu, gcc-patches

Hi Roger,

It is not necessary to do any mods on your patch. I've just answered
the questions which you asked me. The adds are faster for the ARC CPUs
which are still in production, and I suppose we can leverage the LP
instruction use with DBNZ instructions for implementing loops. I'll
come back to you asap, after I've got the nightly results :)

Thank you,
Claudiu

On Tue, Oct 3, 2023 at 6:34 PM Roger Sayle <roger@nextmovesoftware.com> wrote:
>
>
> Hi Claudiu,
> Thanks for the answers to my technical questions.
> If you'd prefer to update arc.md's add3 pattern first,
> I'm happy to update/revise my patch based on this
> and your feedback, for example preferring add over
> asl_s (or controlling this choice with -Os).
>
> Thanks again.
> Roger
> --
>
> > -----Original Message-----
> > From: Claudiu Zissulescu <Claudiu.Zissulescu@synopsys.com>
> > Sent: 03 October 2023 15:26
> > To: Roger Sayle <roger@nextmovesoftware.com>; gcc-patches@gcc.gnu.org
> > Subject: RE: [ARC PATCH] Split SImode shifts pre-reload on
> > !TARGET_BARREL_SHIFTER.
> >
> > Hi Roger,
> >
> > It was nice to meet you too.
> >
> > Thank you in looking into the ARC's non-Barrel Shifter configurations.  I
> will dive
> > into your patch asap, but before starting here are a few of my comments:
> >
> > -----Original Message-----
> > From: Roger Sayle <roger@nextmovesoftware.com>
> > Sent: Thursday, September 28, 2023 2:27 PM
> > To: gcc-patches@gcc.gnu.org
> > Cc: Claudiu Zissulescu <claziss@synopsys.com>
> > Subject: [ARC PATCH] Split SImode shifts pre-reload on
> > !TARGET_BARREL_SHIFTER.
> >
> >
> > Hi Claudiu,
> > It was great meeting up with you and the Synopsys ARC team at the GNU
> tools
> > Cauldron in Cambridge.
> >
> > This patch is the first in a series to improve SImode and DImode shifts
> and rotates
> > in the ARC backend.  This first piece splits SImode shifts, for
> > !TARGET_BARREL_SHIFTER targets, after combine and before reload, in the
> split1
> > pass, as suggested by the FIXME comment above output_shift in arc.cc.  To
> do
> > this I've copied the implementation of the x86_pre_reload_split function
> from
> > i386 backend, and renamed it arc_pre_reload_split.
> >
> > Although the actual implementations of shifts remain the same (as in
> > output_shift), having them as explicit instructions in the RTL stream
> allows better
> > scheduling and use of compact forms when available.  The benefits can be
> seen in
> > two short examples below.
> >
> > For the function:
> > unsigned int foo(unsigned int x, unsigned int y) {
> >   return y << 2;
> > }
> >
> > GCC with -O2 -mcpu=em would previously generate:
> > foo:    add r1,r1,r1
> >         add r1,r1,r1
> >         j_s.d   [blink]
> >         mov_s   r0,r1   ;4
> >
> > [CZI] The move shouldn't be generated indeed. The use of ADDs are slightly
> > beneficial for older ARCv1 arches.
> >
> > and with this patch now generates:
> > foo:    asl_s r0,r1
> >         j_s.d   [blink]
> >         asl_s r0,r0
> >
> > [CZI] Nice. This new sequence is as fast as we can get for our ARCv2 cpus.
> >
> > Notice the original (from shift_si3's output_shift) requires the shift
> sequence to be
> > monolithic with the same destination register as the source (requiring an
> extra
> > mov_s).  The new version can eliminate this move, and schedule the second
> asl in
> > the branch delay slot of the return.
> >
> > For the function:
> > int x,y,z;
> >
> > void bar()
> > {
> >   x <<= 3;
> >   y <<= 3;
> >   z <<= 3;
> > }
> >
> > GCC -O2 -mcpu=em currently generates:
> > bar:    push_s  r13
> >         ld.as   r12,[gp,@x@sda] ;23
> >         ld.as   r3,[gp,@y@sda]  ;23
> >         mov r2,0
> >         add3 r12,r2,r12
> >         mov r2,0
> >         add3 r3,r2,r3
> >         ld.as   r2,[gp,@z@sda]  ;23
> >         st.as   r12,[gp,@x@sda] ;26
> >         mov r13,0
> >         add3 r2,r13,r2
> >         st.as   r3,[gp,@y@sda]  ;26
> >         st.as   r2,[gp,@z@sda]  ;26
> >         j_s.d   [blink]
> >         pop_s   r13
> >
> > where each shift by 3, uses ARC's add3 instruction, which is similar to
> x86's lea
> > implementing x = (y<<3) + z, but requires the value zero to be placed in a
> > temporary register "z".  Splitting this before reload allows these pseudos
> to be
> > shared/reused.  With this patch, we get
> >
> > bar:    ld.as   r2,[gp,@x@sda]  ;23
> >         mov_s   r3,0    ;3
> >         add3    r2,r3,r2
> >         ld.as   r3,[gp,@y@sda]  ;23
> >         st.as   r2,[gp,@x@sda]  ;26
> >         ld.as   r2,[gp,@z@sda]  ;23
> >         mov_s   r12,0   ;3
> >         add3    r3,r12,r3
> >         add3    r2,r12,r2
> >         st.as   r3,[gp,@y@sda]  ;26
> >         st.as   r2,[gp,@z@sda]  ;26
> >         j_s     [blink]
> >
> > [CZI] Looks great, but it also shows that I've forgot to add to ADD3
> instruction the
> > Ra,LIMM,RC variant, which will lead to have instead of
> >         mov_s   r3,0    ;3
> >         add3    r2,r3,r2
> > Only this add3,0,r2, Indeed it is longer instruction but faster.
> >
> > Unfortunately, register allocation means that we only share two of the
> three
> > "mov_s z,0", but this is sufficient to reduce register pressure enough to
> avoid
> > spilling r13 in the prologue/epilogue.
> >
> > This patch also contains a (latent?) bug fix.  The implementation of the
> default
> > insn "length" attribute, assumes instructions of type "shift" have two
> input
> > operands and accesses operands[2], hence specializations of shifts that
> don't
> > have a operands[2], need to be categorized as type "unary" (which results
> in the
> > correct length).
> >
> > [CZI] The ARC types need an upgrade too.
> >
> > This patch has been tested on a cross-compiler to arc-elf (hosted on
> x86_64-pc-
> > linux-gnu), but because I've an incomplete tool chain many of the
> regression test
> > fail, but there are no new failures with new test cases added below.  If
> you can
> > confirm that there are no issues from additional testing, is this OK for
> mainline?
> >
> > Finally a quick technical question.  ARC's zero overhead loops require at
> least two
> > instructions in the loop, so currently the backend's implementation of
> shr20 pads
> > the loop body with a "nop".
> >
> > lshr20: mov.f lp_count, 20
> >         lpnz    2f
> >         lsr r0,r0
> >         nop
> > 2:      # end single insn loop
> >         j_s     [blink]
> >
> >
> > [CZI] The ZOLs (LP instructions) are not great when dealing with short
> loop blocks.
> > Hence, the NOP instruction. Personally, I don't fancy using the LP
> instruction in
> > this case, as it prohibits LP usage for a true for-loop.
> >
> > could this be more efficiently implemented as:
> >
> > lshr20: mov lp_count, 10
> >         lp 2f
> >         lsr_s r0,r0
> >         lsr_s r0,r0
> > 2:      # end single insn loop
> >         j_s     [blink]
> >
> > i.e. half the number of iterations, but doing twice as much useful work in
> each
> > iteration?  Or might the nop be free on advanced microarchitectures,
> and/or the
> > consecutive dependent shifts cause a pipeline stall?  It would be nice to
> fuse loops
> > to implement rotations, such that rotr16 (aka swap) would look like:
> >
> > rot16:  mov_s r1,r0
> >         mov lp_count, 16
> >         lp 2f
> >         asl_s r0,r0
> >         lsr_s r1,r1
> > 2:      # end single insn loop
> >         j_s.d    [blink]
> >         or_s r0,r0,r1
> >
> > Thanks in advance,
> > Roger
> >
> >
> > 2023-09-28  Roger Sayle  <roger@nextmovesoftware.com>
> >
> > gcc/ChangeLog
> >         * config/arc/arc-protos.h (emit_shift): Delete prototype.
> >         (arc_pre_reload_split): New function prototype.
> >         * config/arc/arc.cc (emit_shift): Delete function.
> >         (arc_pre_reload_split): New predicate function, copied from i386,
> >         to schedule define_insn_and_split splitters to the split1 pass.
> >         * config/arc/arc.md (ashlsi3): Expand RTL template
> unconditionally.
> >         (ashrsi3): Likewise.
> >         (lshrsi3): Likewise.
> >         (shift_si3): Move after other shift patterns, and disable when
> >         operands[2] is one (which is handled by its own define_insn).
> >         Use shiftr4_operator, instead of shift4_operator, as this is no
> >         longer used for left shifts.
> >         (shift_si3_loop): Likewise.  Additionally remove match_scratch.
> >         (*ashlsi3_nobs): New pre-reload define_insn_and_split.
> >         (*ashrsi3_nobs): Likewise.
> >         (*lshrsi3_nobs): Likewise.
> >         (rotrsi3_cnt1): Rename define_insn from *rotrsi3_cnt1.
> >         (ashlsi3_cnt1): Rename define_insn from *ashlsi2_cnt1.  Change
> >         insn type attribute to "unary", as this doesn't have operands[2].
> >         Change length attribute to "*,4" to allow compact representation.
> >         (lshrsi3_cnt1): Rename define_insn from *lshrsi3_cnt1.  Change
> >         insn type attribute to "unary", as this doesn't have operands[2].
> >         (ashrsi3_cnt1): Rename define_insn from *ashrsi3_cnt1.  Change
> >         insn type attribute to "unary", as this doesn't have operands[2].
> >         (add_shift): Rename define_insn from *add_shift.
> >         * config/arc/predicates.md (shiftl4_operator): Delete.
> >         (shift4_operator): Delete.
> >
> > gcc/testsuite/ChangeLog
> >         * gcc.target/arc/ashrsi-1.c: New TARGET_BARREL_SHIFTER test case.
> >         * gcc.target/arc/ashrsi-2.c: New !TARGET_BARREL_SHIFTER test case.
> >         * gcc.target/arc/ashrsi-3.c: Likewise.
> >         * gcc.target/arc/ashrsi-4.c: Likewise.
> >         * gcc.target/arc/ashrsi-5.c: Likewise.
> >         * gcc.target/arc/lshrsi-1.c: New TARGET_BARREL_SHIFTER test case.
> >         * gcc.target/arc/lshrsi-2.c: New !TARGET_BARREL_SHIFTER test case.
> >         * gcc.target/arc/lshrsi-3.c: Likewise.
> >         * gcc.target/arc/lshrsi-4.c: Likewise.
> >         * gcc.target/arc/lshrsi-5.c: Likewise.
> >         * gcc.target/arc/shlsi-1.c: New TARGET_BARREL_SHIFTER test case.
> >         * gcc.target/arc/shlsi-2.c: New !TARGET_BARREL_SHIFTER test case.
> >         * gcc.target/arc/shlsi-3.c: Likewise.
> >         * gcc.target/arc/shlsi-4.c: Likewise.
> >         * gcc.target/arc/shlsi-5.c: Likewise.
>
>

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [ARC PATCH] Split SImode shifts pre-reload on !TARGET_BARREL_SHIFTER.
  2023-09-28 11:26 [ARC PATCH] Split SImode shifts pre-reload on !TARGET_BARREL_SHIFTER Roger Sayle
  2023-10-03 14:25 ` Claudiu Zissulescu
@ 2023-10-04  8:37 ` Claudiu Zissulescu Ianculescu
  1 sibling, 0 replies; 5+ messages in thread
From: Claudiu Zissulescu Ianculescu @ 2023-10-04  8:37 UTC (permalink / raw)
  To: Roger Sayle; +Cc: gcc-patches, Claudiu Zissulescu

Hi Roger,

The patch as it is passed the validation, and it is in general OK.
Although it doesn't address the elephant in the room, namely
output_shift function, it is a welcome cleanup.
I would like you to split the patch in two. One which deals with
improvements on shifts in absence of a barrel shifter, and one which
addresses the default instruction length, as they can be seen as
separate work. Please feel free to commit resulting patches to the
mainline.

Thank you for your contribution,
Claudiu

On Thu, Sep 28, 2023 at 2:27 PM Roger Sayle <roger@nextmovesoftware.com> wrote:
>
>
> Hi Claudiu,
> It was great meeting up with you and the Synopsys ARC team at the
> GNU tools Cauldron in Cambridge.
>
> This patch is the first in a series to improve SImode and DImode
> shifts and rotates in the ARC backend.  This first piece splits
> SImode shifts, for !TARGET_BARREL_SHIFTER targets, after combine
> and before reload, in the split1 pass, as suggested by the FIXME
> comment above output_shift in arc.cc.  To do this I've copied the
> implementation of the x86_pre_reload_split function from i386
> backend, and renamed it arc_pre_reload_split.
>
> Although the actual implementations of shifts remain the same
> (as in output_shift), having them as explicit instructions in
> the RTL stream allows better scheduling and use of compact forms
> when available.  The benefits can be seen in two short examples
> below.
>
> For the function:
> unsigned int foo(unsigned int x, unsigned int y) {
>   return y << 2;
> }
>
> GCC with -O2 -mcpu=em would previously generate:
> foo:    add r1,r1,r1
>         add r1,r1,r1
>         j_s.d   [blink]
>         mov_s   r0,r1   ;4
> and with this patch now generates:
> foo:    asl_s r0,r1
>         j_s.d   [blink]
>         asl_s r0,r0
>
> Notice the original (from shift_si3's output_shift) requires the
> shift sequence to be monolithic with the same destination register
> as the source (requiring an extra mov_s).  The new version can
> eliminate this move, and schedule the second asl in the branch
> delay slot of the return.
>
> For the function:
> int x,y,z;
>
> void bar()
> {
>   x <<= 3;
>   y <<= 3;
>   z <<= 3;
> }
>
> GCC -O2 -mcpu=em currently generates:
> bar:    push_s  r13
>         ld.as   r12,[gp,@x@sda] ;23
>         ld.as   r3,[gp,@y@sda]  ;23
>         mov r2,0
>         add3 r12,r2,r12
>         mov r2,0
>         add3 r3,r2,r3
>         ld.as   r2,[gp,@z@sda]  ;23
>         st.as   r12,[gp,@x@sda] ;26
>         mov r13,0
>         add3 r2,r13,r2
>         st.as   r3,[gp,@y@sda]  ;26
>         st.as   r2,[gp,@z@sda]  ;26
>         j_s.d   [blink]
>         pop_s   r13
>
> where each shift by 3, uses ARC's add3 instruction, which is similar
> to x86's lea implementing x = (y<<3) + z, but requires the value zero
> to be placed in a temporary register "z".  Splitting this before reload
> allows these pseudos to be shared/reused.  With this patch, we get
>
> bar:    ld.as   r2,[gp,@x@sda]  ;23
>         mov_s   r3,0    ;3
>         add3    r2,r3,r2
>         ld.as   r3,[gp,@y@sda]  ;23
>         st.as   r2,[gp,@x@sda]  ;26
>         ld.as   r2,[gp,@z@sda]  ;23
>         mov_s   r12,0   ;3
>         add3    r3,r12,r3
>         add3    r2,r12,r2
>         st.as   r3,[gp,@y@sda]  ;26
>         st.as   r2,[gp,@z@sda]  ;26
>         j_s     [blink]
>
> Unfortunately, register allocation means that we only share two of the
> three "mov_s z,0", but this is sufficient to reduce register pressure
> enough to avoid spilling r13 in the prologue/epilogue.
>
> This patch also contains a (latent?) bug fix.  The implementation of
> the default insn "length" attribute, assumes instructions of type
> "shift" have two input operands and accesses operands[2], hence
> specializations of shifts that don't have a operands[2], need to be
> categorized as type "unary" (which results in the correct length).
>
> This patch has been tested on a cross-compiler to arc-elf (hosted on
> x86_64-pc-linux-gnu), but because I've an incomplete tool chain many
> of the regression test fail, but there are no new failures with new
> test cases added below.  If you can confirm that there are no issues
> from additional testing, is this OK for mainline?
>
> Finally a quick technical question.  ARC's zero overhead loops require
> at least two instructions in the loop, so currently the backend's
> implementation of shr20 pads the loop body with a "nop".
>
> lshr20: mov.f lp_count, 20
>         lpnz    2f
>         lsr r0,r0
>         nop
> 2:      # end single insn loop
>         j_s     [blink]
>
> could this be more efficiently implemented as:
>
> lshr20: mov lp_count, 10
>         lp 2f
>         lsr_s r0,r0
>         lsr_s r0,r0
> 2:      # end single insn loop
>         j_s     [blink]
>
> i.e. half the number of iterations, but doing twice as much useful
> work in each iteration?  Or might the nop be free on advanced
> microarchitectures, and/or the consecutive dependent shifts cause
> a pipeline stall?  It would be nice to fuse loops to implement
> rotations, such that rotr16 (aka swap) would look like:
>
> rot16:  mov_s r1,r0
>         mov lp_count, 16
>         lp 2f
>         asl_s r0,r0
>         lsr_s r1,r1
> 2:      # end single insn loop
>         j_s.d    [blink]
>         or_s r0,r0,r1
>
> Thanks in advance,
> Roger
>
>
> 2023-09-28  Roger Sayle  <roger@nextmovesoftware.com>
>
> gcc/ChangeLog
>         * config/arc/arc-protos.h (emit_shift): Delete prototype.
>         (arc_pre_reload_split): New function prototype.
>         * config/arc/arc.cc (emit_shift): Delete function.
>         (arc_pre_reload_split): New predicate function, copied from i386,
>         to schedule define_insn_and_split splitters to the split1 pass.
>         * config/arc/arc.md (ashlsi3): Expand RTL template unconditionally.
>         (ashrsi3): Likewise.
>         (lshrsi3): Likewise.
>         (shift_si3): Move after other shift patterns, and disable when
>         operands[2] is one (which is handled by its own define_insn).
>         Use shiftr4_operator, instead of shift4_operator, as this is no
>         longer used for left shifts.
>         (shift_si3_loop): Likewise.  Additionally remove match_scratch.
>         (*ashlsi3_nobs): New pre-reload define_insn_and_split.
>         (*ashrsi3_nobs): Likewise.
>         (*lshrsi3_nobs): Likewise.
>         (rotrsi3_cnt1): Rename define_insn from *rotrsi3_cnt1.
>         (ashlsi3_cnt1): Rename define_insn from *ashlsi2_cnt1.  Change
>         insn type attribute to "unary", as this doesn't have operands[2].
>         Change length attribute to "*,4" to allow compact representation.
>         (lshrsi3_cnt1): Rename define_insn from *lshrsi3_cnt1.  Change
>         insn type attribute to "unary", as this doesn't have operands[2].
>         (ashrsi3_cnt1): Rename define_insn from *ashrsi3_cnt1.  Change
>         insn type attribute to "unary", as this doesn't have operands[2].
>         (add_shift): Rename define_insn from *add_shift.
>         * config/arc/predicates.md (shiftl4_operator): Delete.
>         (shift4_operator): Delete.
>
> gcc/testsuite/ChangeLog
>         * gcc.target/arc/ashrsi-1.c: New TARGET_BARREL_SHIFTER test case.
>         * gcc.target/arc/ashrsi-2.c: New !TARGET_BARREL_SHIFTER test case.
>         * gcc.target/arc/ashrsi-3.c: Likewise.
>         * gcc.target/arc/ashrsi-4.c: Likewise.
>         * gcc.target/arc/ashrsi-5.c: Likewise.
>         * gcc.target/arc/lshrsi-1.c: New TARGET_BARREL_SHIFTER test case.
>         * gcc.target/arc/lshrsi-2.c: New !TARGET_BARREL_SHIFTER test case.
>         * gcc.target/arc/lshrsi-3.c: Likewise.
>         * gcc.target/arc/lshrsi-4.c: Likewise.
>         * gcc.target/arc/lshrsi-5.c: Likewise.
>         * gcc.target/arc/shlsi-1.c: New TARGET_BARREL_SHIFTER test case.
>         * gcc.target/arc/shlsi-2.c: New !TARGET_BARREL_SHIFTER test case.
>         * gcc.target/arc/shlsi-3.c: Likewise.
>         * gcc.target/arc/shlsi-4.c: Likewise.
>         * gcc.target/arc/shlsi-5.c: Likewise.
>

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2023-10-04  8:38 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-09-28 11:26 [ARC PATCH] Split SImode shifts pre-reload on !TARGET_BARREL_SHIFTER Roger Sayle
2023-10-03 14:25 ` Claudiu Zissulescu
2023-10-03 15:34   ` Roger Sayle
2023-10-03 16:07     ` Claudiu Zissulescu Ianculescu
2023-10-04  8:37 ` Claudiu Zissulescu Ianculescu

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).