Re: [ARM] Implementing doloop pattern

public inbox for gcc@gcc.gnu.org
 help / color / mirror / Atom feed

From: Revital1 Eres <ERES@il.ibm.com>
To: Roman Zhuykov <zhroma@ispras.ru>
Cc: dm@ispras.ru, gcc@gcc.gnu.org, cltang@codesourcery.com,
	       yao@codesourcery.com, Ayal Zaks <ZAKS@il.ibm.com>
Subject: Re: [ARM] Implementing doloop pattern
Date: Thu, 30 Dec 2010 16:56:00 -0000	[thread overview]
Message-ID: <OFFD96764C.38FEC784-ONC2257809.005BD5C2-C2257809.005D0F95@il.ibm.com> (raw)
In-Reply-To: <4D1C915C.10001@ispras.ru>

[-- Attachment #1: Type: text/plain, Size: 3882 bytes --]

Hello,

The attached patch is my latest attempt to model doloop for arm.
I followed Chung-Lin Tang suggestion and used subs+jump similar to your
patch.
On crotex-A8 I see gain of 29% on autocor benchmark (telecom suite) with
SMS using the following flags: -fmodulo-sched-allow-regmoves
-funsafe-loop-optimizations -fmodulo-sched   -fno-auto-inc-dec
-fdump-rtl-sms -mthumb  -mcpu=cortex-a8 -O3. (compare to using only
-mthumb  -mcpu=cortex-a8 -O3)

I have not fully tested the patch and it's not in the proper format of
submission yet.

Thanks,
Revital

(See attached file: patch_arm_doloop.txt)



From:	Roman Zhuykov <zhroma@ispras.ru>
To:	gcc@gcc.gnu.org
Cc:	dm@ispras.ru
Date:	30/12/2010 04:04 PM
Subject:	[ARM] Implementing doloop pattern
Sent by:	gcc-owner@gcc.gnu.org



Hello!

The main idea of the work described below was to estimate speedup we can
gain from SMS on ARM.  SMS depends on doloop_end pattern and there is no
appropriate instruction on ARM.  We decided to create a "fake"
doloop_end pattern on ARM using a pair of "subs" and "bne" assembler
instructions.  In implementation we used ideas from machine description
files of other architectures, e. g. spu, which expands doloop_end
pattern only when SMS is enabled.  The patch is attached.

This patch allows to use any possible register for the doloop pattern.
It was tested on trunk snapshot from 30 Aug 2010.  It works fine on
several small examples, but gives an ICE on sqlite-amalgamation-3.6.1
source:
sqlite3.c: In function 'sqlite3WhereBegin':
sqlite3.c:76683:1: internal compiler error: in patch_jump_insn, at
cfgrtl.c:1020

ICE happens in ira pass, when cleanup_cfg is called at the end or ira.

The "bad" instruction looks like
(jump_insn 3601 628 4065 76 (parallel [
             (set (pc)
                 (if_then_else (ne (mem/c:SI (plus:SI (reg/f:SI 13 sp)
                                 (const_int 36 [0x24])) [105 %sfp+-916
S4 A32])
                         (const_int 1 [0x1]))
                     (label_ref 3600)
                     (pc)))
             (set (mem/c:SI (plus:SI (reg/f:SI 13 sp)
                         (const_int 36 [0x24])) [105 %sfp+-916 S4 A32])
                 (plus:SI (mem/c:SI (plus:SI (reg/f:SI 13 sp)
                             (const_int 36 [0x24])) [105 %sfp+-916 S4 A32])
                     (const_int -1 [0xffffffffffffffff])))
         ]) sqlite3.c:75235 328 {doloop_end_internal}
      (expr_list:REG_BR_PROB (const_int 9100 [0x238c])
         (nil))
  -> 3600)

So, the problem seems to be with ira.  Memory is used instead of a
register to store doloop counter.  We tried to fix this by explicitly
specifying hard register (r5) for doloop pattern.  The fixed version
seems to work, but this doesn't look like a proper fix.  On trunk
snapshot from 17 Dec 2010 the ICE described above have disappeared, but
probably it's just a coincidence, and it will shop up anyway on some
other test case.

The r5-fix shows the following results (compare "-O2 -fno-auto-inc-dec
-fmodulo-sched" vs "-O2 -fno-auto-inc-dec").
Aburto benchmarks: heapsort and matmult - 3% speedup. nsieve - 7% slowdown.
Other aburto tests, sqlite tests and libevas rasterization library
(expedite testsuite) show around zero results.

A motivating example shows about 23% speedup:

char scal (int n, char *a, char *b)
{
   int i;
   char s = 0;
   for (i = 0; i < n; i++)
     s += a[i] * b[i];
   return s;
}

We have analyzed SMS results, and can conclude that if SMS has
successfully built a schedule for the loop we usually gain a speedup,
and when SMS fails, we often have some slowdown, which have appeared
because of do-loop conversion.

The questions are:
How to properly fix the ICE described?
Do you think this approach (after the fixes) can make its way into trunk?

Happy holidays!
--
Roman Zhuykov

[attachment "sms-doloop-any-reg.diff" deleted by Revital1 Eres/Haifa/IBM]

[-- Attachment #2: patch_arm_doloop.txt --]
[-- Type: text/plain, Size: 5558 bytes --]

Index: modulo-sched.c
===================================================================
--- modulo-sched.c	(revision 167637)
+++ modulo-sched.c	(working copy)
@@ -1021,7 +1021,8 @@ sms_schedule (void)
         if (CALL_P (insn)
             || BARRIER_P (insn)
             || (NONDEBUG_INSN_P (insn) && !JUMP_P (insn)
-                && !single_set (insn) && GET_CODE (PATTERN (insn)) != USE)
+                && !single_set (insn) && GET_CODE (PATTERN (insn)) != USE
+                && !reg_mentioned_p (count_reg, insn))
             || (FIND_REG_INC_NOTE (insn, NULL_RTX) != 0)
             || (INSN_P (insn) && (set = single_set (insn))
                 && GET_CODE (SET_DEST (set)) == SUBREG))
Index: loop-doloop.c
===================================================================
--- loop-doloop.c	(revision 167637)
+++ loop-doloop.c	(working copy)
@@ -96,7 +96,15 @@ doloop_condition_get (rtx doloop_pat)
      2)  (set (reg) (plus (reg) (const_int -1))
          (set (pc) (if_then_else (reg != 0)
 	                         (label_ref (label))
-			         (pc))).  */
+			         (pc))).  
+
+     In ARM the following sequence of instructions implements doloop:
+
+     3) (parallel [(set (cc) (compare ((plus (reg) (const_int -1), 0)))
+                   (set (reg) (plus (reg) (const_int -1)))])
+        (set (pc) (if_then_else (NE == cc)
+                                (label_ref (label))
+                                (pc))) */
 
   pattern = PATTERN (doloop_pat);
 
@@ -111,7 +119,12 @@ doloop_condition_get (rtx doloop_pat)
         return 0;
 
       cmp = pattern;
-      inc = PATTERN (PREV_INSN (doloop_pat));
+      if (GET_CODE (PATTERN (prev_insn)) == PARALLEL)
+       {
+         inc = XVECEXP (PATTERN (prev_insn), 0, 1);
+       }
+      else
+        inc = PATTERN (PREV_INSN (doloop_pat));
       /* We expect the condition to be of the form (reg != 0)  */
       cond = XEXP (SET_SRC (cmp), 0);
       if (GET_CODE (cond) != NE || XEXP (cond, 1) != const0_rtx)
@@ -162,6 +175,7 @@ doloop_condition_get (rtx doloop_pat)
     return 0;
 
   if ((XEXP (condition, 0) == reg)
+      || (REGNO (XEXP (condition, 0)) == CC_REGNUM)
       || (GET_CODE (XEXP (condition, 0)) == PLUS
 		   && XEXP (XEXP (condition, 0), 0) == reg))
    {
@@ -181,7 +195,24 @@ doloop_condition_get (rtx doloop_pat)
                      (set (reg) (plus (reg) (const_int -1)))
                      (additional clobbers and uses)])
 
-         So we return that form instead.
+        For the third form we expect:
+
+        (parallel [(set (cc) (compare ((plus (reg) (const_int -1)), 0))
+                   (set (reg) (plus (reg) (const_int -1)))])
+        (set (pc) (if_then_else (NE == cc)
+                                (label_ref (label))
+                                (pc))) 
+
+        which is equivalent to the following:
+
+        (parallel [(set (cc) (compare (reg,  1))
+                   (set (reg) (plus (reg) (const_int -1)))])
+        (set (pc) (if_then_else (NE == cc)
+                               (label_ref (label))
+                                (pc)))
+
+        So we return the second form instead.
+
      */
         condition = gen_rtx_fmt_ee (NE, VOIDmode, inc_src, const1_rtx);
 
Index: config/arm/thumb2.md
===================================================================
--- config/arm/thumb2.md	(revision 167637)
+++ config/arm/thumb2.md	(working copy)
@@ -836,7 +836,7 @@
   "operands[4] = GEN_INT (- INTVAL (operands[2]));"
 )
 
-(define_insn "*thumb2_addsi3_compare0"
+(define_insn "thumb2_addsi3_compare0"
   [(set (reg:CC_NOOV CC_REGNUM)
 	(compare:CC_NOOV
 	  (plus:SI (match_operand:SI 1 "s_register_operand" "l,  0, r")
@@ -1118,3 +1118,49 @@
   "
   operands[2] = GEN_INT (32 - INTVAL (operands[2]));
   ")
+
+ ;; Define the subtract-one-and-jump insns so loop.c
+ ;; knows what to generate.
+ (define_expand "doloop_end"
+   [(use (match_operand 0 "" ""))      ; loop pseudo
+    (use (match_operand 1 "" ""))      ; iterations; zero if unknown
+    (use (match_operand 2 "" ""))      ; max iterations
+    (use (match_operand 3 "" ""))      ; loop level
+    (use (match_operand 4 "" ""))]     ; label
+   "TARGET_THUMB2"
+   "
+ {
+   /* Currently SMS relies on the do-loop pattern to recognize loops
+      where (1) the control part comprises of all insns defining and/or
+      using a certain 'count' register and (2) the loop count can be
+      adjusted by modifying this register prior to the loop.
+.     ??? The possible introduction of a new block to initialize the
+      new IV can potentially effects branch optimizations.  */
+   if (optimize > 0 && flag_modulo_sched)
+   {
+     rtx s0;
+     rtx bcomp;
+     rtx loc_ref;
+     rtx cc_reg;
+
+     /* Only use this on innermost loops.  */
+     if (INTVAL (operands[3]) > 1)
+       FAIL;
+     if (GET_MODE (operands[0]) != SImode)
+       FAIL;
+
+     cc_reg = gen_rtx_REG (CC_NOOVmode, CC_REGNUM);
+     s0 = operands [0];
+     emit_insn (gen_thumb2_addsi3_compare0 (s0, s0, GEN_INT (-1)));
+     bcomp = gen_rtx_NE(VOIDmode, cc_reg, const0_rtx); 
+     loc_ref = gen_rtx_LABEL_REF (VOIDmode, operands [4]);
+     emit_jump_insn (gen_rtx_SET (VOIDmode, pc_rtx,
+                                  gen_rtx_IF_THEN_ELSE (VOIDmode, bcomp,
+                                                        loc_ref, pc_rtx)));
+
+     DONE;
+   }else
+      FAIL;
+ }")
+
+

next prev parent reply	other threads:[~2010-12-30 16:56 UTC|newest]

Thread overview: 8+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2010-12-30 14:04 Roman Zhuykov
2010-12-30 16:02 ` Ulrich Weigand
2010-12-30 16:56 ` Revital1 Eres [this message]
2011-01-05 15:35   ` Richard Earnshaw
2011-01-06  7:59     ` Revital1 Eres
2011-01-06  9:11       ` Andreas Schwab
2011-01-13 11:11         ` Ramana Radhakrishnan
2011-01-13 13:51       ` Nathan Froyd

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=OFFD96764C.38FEC784-ONC2257809.005BD5C2-C2257809.005D0F95@il.ibm.com \
    --to=eres@il.ibm.com \
    --cc=ZAKS@il.ibm.com \
    --cc=cltang@codesourcery.com \
    --cc=dm@ispras.ru \
    --cc=gcc@gcc.gnu.org \
    --cc=yao@codesourcery.com \
    --cc=zhroma@ispras.ru \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).