public inbox for gcc@gcc.gnu.org
* [ARM] Implementing doloop pattern
@ 2010-12-30 14:04 Roman Zhuykov
  2010-12-30 16:02 ` Ulrich Weigand
  2010-12-30 16:56 ` Revital1 Eres
  0 siblings, 2 replies; 8+ messages in thread
From: Roman Zhuykov @ 2010-12-30 14:04 UTC
  To: gcc; +Cc: dm

[-- Attachment #1: Type: text/plain, Size: 3055 bytes --]

Hello!

The main idea of the work described below was to estimate the speedup
we can gain from SMS on ARM.  SMS depends on the doloop_end pattern, and
there is no appropriate instruction on ARM.  We decided to create a
"fake" doloop_end pattern on ARM using a pair of "subs" and "bne"
assembler instructions.  In the implementation we used ideas from the
machine description files of other architectures, e.g. spu, which
expands the doloop_end pattern only when SMS is enabled.  The patch is
attached.
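
To illustrate what the do-loop form means at the source level, here is
a minimal C sketch.  It is only an illustration and not part of the
patch; the function names sum_up and sum_doloop are made up for this
example.  The second function keeps a separate counter that runs from n
down to 1 and takes the back edge while the counter is not yet 1, which
is the test-and-decrement that the "subs"/"bne" pair is meant to
implement.

```c
/* Ordinary counted loop: counts up and compares against n. */
static int sum_up(const int *a, int n)
{
    int s = 0;
    for (int i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Do-loop form (illustrative): a dedicated counter runs from n down
   to 1.  "count-- != 1" compares the old value with 1 and then
   decrements, like the (ne counter 1) test paired with the
   (plus counter -1) update in the doloop pattern. */
static int sum_doloop(const int *a, int n)
{
    int s = 0;
    if (n <= 0)          /* the do-loop form assumes at least one iteration */
        return s;
    int count = n;
    int i = 0;
    do {
        s += a[i++];
    } while (count-- != 1);
    return s;
}
```

Both functions visit the same n elements; only the loop control
differs.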

This patch allows any register to be used for the doloop pattern.
It was tested on a trunk snapshot from 30 Aug 2010.  It works fine on
several small examples, but gives an ICE on the
sqlite-amalgamation-3.6.1 source:
sqlite3.c: In function 'sqlite3WhereBegin':
sqlite3.c:76683:1: internal compiler error: in patch_jump_insn, at cfgrtl.c:1020

The ICE happens in the ira pass, when cleanup_cfg is called at the end of ira.

The "bad" instruction looks like
(jump_insn 3601 628 4065 76 (parallel [
             (set (pc)
                 (if_then_else (ne (mem/c:SI (plus:SI (reg/f:SI 13 sp)
                                 (const_int 36 [0x24])) [105 %sfp+-916 S4 A32])
                         (const_int 1 [0x1]))
                     (label_ref 3600)
                     (pc)))
             (set (mem/c:SI (plus:SI (reg/f:SI 13 sp)
                         (const_int 36 [0x24])) [105 %sfp+-916 S4 A32])
                 (plus:SI (mem/c:SI (plus:SI (reg/f:SI 13 sp)
                             (const_int 36 [0x24])) [105 %sfp+-916 S4 A32])
                     (const_int -1 [0xffffffffffffffff])))
         ]) sqlite3.c:75235 328 {doloop_end_internal}
      (expr_list:REG_BR_PROB (const_int 9100 [0x238c])
         (nil))
  -> 3600)

So the problem seems to be with ira: memory is used instead of a
register to store the doloop counter.  We tried to fix this by
explicitly specifying a hard register (r5) for the doloop pattern.  The
fixed version seems to work, but this doesn't look like a proper fix.
On a trunk snapshot from 17 Dec 2010 the ICE described above has
disappeared, but that is probably just a coincidence, and it will show
up anyway on some other test case.
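
Read as C, the jump_insn shown above does two things in parallel: it
branches if the counter is not 1, and it stores counter - 1 back.  The
sketch below models that next to the "subs"/"bne" sequence the pattern
emits.  It is a simplified model written for this message, not code
from GCC; the names doloop_rtl and doloop_subs_bne are hypothetical.
It also shows why a counter living in a stack slot is a problem: the
emitted "subs" works on a register, so the memory operand in the dump
has no direct equivalent in the output template.

```c
#include <stdbool.h>
#include <stdint.h>

/* Model of the (parallel ...) in doloop_end_internal: both sets read
   the counter value from before the instruction.  Returns true when
   the branch to the loop label is taken. */
static bool doloop_rtl(uint32_t *counter)
{
    uint32_t old = *counter;
    bool taken = (old != 1);   /* (ne counter (const_int 1)) */
    *counter = old - 1;        /* (plus counter (const_int -1)) */
    return taken;
}

/* Model of "subs rN, rN, #1 ; bne label": subs writes the result and
   sets the Z flag when the result is zero; bne branches when Z is
   clear. */
static bool doloop_subs_bne(uint32_t *counter)
{
    *counter -= 1;
    bool z = (*counter == 0);
    return !z;
}
```

For every starting value the two models agree on both the branch
decision and the new counter value, since "old != 1" and
"old - 1 != 0" are the same condition.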

The r5 fix shows the following results (comparing "-O2
-fno-auto-inc-dec -fmodulo-sched" against "-O2 -fno-auto-inc-dec").
Aburto benchmarks: heapsort and matmult show a 3% speedup; nsieve shows
a 7% slowdown.
The other Aburto tests, the sqlite tests and the libevas rasterization
library (expedite testsuite) show a difference of around zero.

A motivating example shows about 23% speedup:

char scal (int n, char *a, char *b)
{
   int i;
   char s = 0;
   for (i = 0; i < n; i++)
     s += a[i] * b[i];
   return s;
}
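
For readers less familiar with SMS, the sketch below shows by hand the
kind of overlap between iterations that modulo scheduling aims for on
this loop: the loads for iteration i are issued before the
multiply-accumulate of iteration i-1.  This is only a hand-written
illustration under that assumption, not the code that SMS actually
generates, and scal_pipelined is a name made up for this example.

```c
/* Reference version, as in the example above. */
char scal(int n, char *a, char *b)
{
    int i;
    char s = 0;
    for (i = 0; i < n; i++)
        s += a[i] * b[i];
    return s;
}

/* Hand-pipelined sketch with two stages: load, then multiply-add. */
char scal_pipelined(int n, char *a, char *b)
{
    char s = 0;
    if (n <= 0)
        return s;
    char x = a[0], y = b[0];        /* prologue: loads of iteration 0 */
    for (int i = 1; i < n; i++) {   /* kernel */
        char nx = a[i], ny = b[i];  /* loads of iteration i */
        s += x * y;                 /* multiply-add of iteration i-1 */
        x = nx;
        y = ny;
    }
    s += x * y;                     /* epilogue: last multiply-add */
    return s;
}
```

The kernel runs n-1 times, and the prologue and epilogue account for
the first loads and the last multiply-add, so both versions compute the
same sum.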

We have analyzed the SMS results and can conclude that when SMS
successfully builds a schedule for the loop we usually gain a speedup,
and when SMS fails we often see some slowdown, which is caused by the
do-loop conversion.

The questions are:
How can the ICE described above be fixed properly?
Do you think this approach (after the fixes) can make its way into trunk?

Happy holidays!
--
Roman Zhuykov


[-- Attachment #2: sms-doloop-any-reg.diff --]
[-- Type: text/plain, Size: 1574 bytes --]

diff --git a/gcc/config/arm/arm.md b/gcc/config/arm/arm.md
index 9d7310b..ab0373d 100644
--- a/gcc/config/arm/arm.md
+++ b/gcc/config/arm/arm.md
@@ -10699,6 +10699,49 @@
   "
 )
 
+(define_expand "doloop_end"
+   [(use (match_operand 0 "arm_general_register_operand" ""))      ; loop pseudo
+    (use (match_operand 1 "" ""))      ; iterations; zero if unknown
+    (use (match_operand 2 "" ""))      ; max iterations
+    (use (match_operand 3 "" ""))      ; loop level
+    (use (match_operand 4 "" ""))]     ; label
+   ""
+   "
+ {
+   if (optimize > 0 && flag_modulo_sched)
+   {
+     /* Only use this on innermost loops. */
+     if (INTVAL (operands[3]) > 1)
+       FAIL;
+     if (GET_MODE (operands[0]) != SImode)
+       FAIL;
+     emit_jump_insn (gen_doloop_end_internal(operands[0], operands[4]));
+     DONE;
+   }else
+     FAIL;
+ }")
+
+(define_insn "doloop_end_internal"
+  [(set (pc) (if_then_else
+              (ne (match_operand:SI 0 "arm_general_register_operand" "")
+                  (const_int 1))
+              (label_ref (match_operand 1 "" ""))
+              (pc)))
+      (set (match_dup 0)
+           (plus:SI (match_dup 0)
+                   (const_int -1)))]
+  "TARGET_32BIT && optimize > 0 && flag_modulo_sched"
+  "*
+  if (arm_ccfsm_state == 1 || arm_ccfsm_state == 2)
+    {
+      arm_ccfsm_state += 2;
+    }
+  return \"subs\\t%0, %0, #1\;bne\\t%l1\";
+  "
+  [(set_attr "length" "8")
+   (set_attr "type" "branch")]
+)
+
 ;; Load the load/store multiple patterns
 (include "ldmstm.md")
 ;; Load the FPA co-processor patterns


Thread overview: 8+ messages
2010-12-30 14:04 [ARM] Implementing doloop pattern Roman Zhuykov
2010-12-30 16:02 ` Ulrich Weigand
2010-12-30 16:56 ` Revital1 Eres
2011-01-05 15:35   ` Richard Earnshaw
2011-01-06  7:59     ` Revital1 Eres
2011-01-06  9:11       ` Andreas Schwab
2011-01-13 11:11         ` Ramana Radhakrishnan
2011-01-13 13:51       ` Nathan Froyd
