* [ARM] Implementing doloop pattern
@ 2010-12-30 14:04 Roman Zhuykov
  2010-12-30 16:02 ` Ulrich Weigand
  2010-12-30 16:56 ` Revital1 Eres
  0 siblings, 2 replies; 8+ messages in thread
From: Roman Zhuykov @ 2010-12-30 14:04 UTC (permalink / raw)
To: gcc; +Cc: dm

[-- Attachment #1: Type: text/plain, Size: 3055 bytes --]

Hello!

The main idea of the work described below was to estimate the speedup we
can gain from SMS on ARM.  SMS depends on the doloop_end pattern, and
there is no appropriate instruction on ARM.  We decided to create a
"fake" doloop_end pattern on ARM using a pair of "subs" and "bne"
assembler instructions.  The implementation borrows ideas from the
machine description files of other architectures, e.g. spu, which
expands the doloop_end pattern only when SMS is enabled.  The patch is
attached.

This patch allows any available register to be used for the doloop
pattern.  It was tested on a trunk snapshot from 30 Aug 2010.  It works
fine on several small examples, but gives an ICE on the
sqlite-amalgamation-3.6.1 source:

sqlite3.c: In function 'sqlite3WhereBegin':
sqlite3.c:76683:1: internal compiler error: in patch_jump_insn, at cfgrtl.c:1020

The ICE happens in the ira pass, when cleanup_cfg is called at the end
of ira.  The "bad" instruction looks like:

(jump_insn 3601 628 4065 76 (parallel [
            (set (pc)
                (if_then_else (ne (mem/c:SI (plus:SI (reg/f:SI 13 sp)
                                (const_int 36 [0x24])) [105 %sfp+-916 S4 A32])
                        (const_int 1 [0x1]))
                    (label_ref 3600)
                    (pc)))
            (set (mem/c:SI (plus:SI (reg/f:SI 13 sp)
                        (const_int 36 [0x24])) [105 %sfp+-916 S4 A32])
                (plus:SI (mem/c:SI (plus:SI (reg/f:SI 13 sp)
                            (const_int 36 [0x24])) [105 %sfp+-916 S4 A32])
                    (const_int -1 [0xffffffffffffffff])))
        ]) sqlite3.c:75235 328 {doloop_end_internal}
     (expr_list:REG_BR_PROB (const_int 9100 [0x238c])
        (nil))
 -> 3600)

So the problem seems to be with ira: memory is used instead of a
register to store the doloop counter.  We tried to fix this by
explicitly specifying a hard register (r5) for the doloop pattern.
The fixed version seems to work, but this doesn't look like a proper
fix.  On a trunk snapshot from 17 Dec 2010 the ICE described above has
disappeared, but that is probably just a coincidence, and it will show
up anyway on some other test case.

The r5 fix shows the following results (comparing "-O2 -fno-auto-inc-dec
-fmodulo-sched" vs "-O2 -fno-auto-inc-dec").  Aburto benchmarks:
heapsort and matmult - 3% speedup; nsieve - 7% slowdown.  The other
aburto tests, the sqlite tests and the libevas rasterization library
(expedite testsuite) show results around zero.

A motivating example shows about 23% speedup:

char scal (int n, char *a, char *b)
{
  int i;
  char s = 0;
  for (i = 0; i < n; i++)
    s += a[i] * b[i];
  return s;
}

We have analyzed the SMS results and can conclude that when SMS
successfully builds a schedule for the loop we usually gain a speedup,
and when SMS fails we often get some slowdown, which appears to be
caused by the do-loop conversion.

The questions are:
How do we properly fix the ICE described?
Do you think this approach (after the fixes) can make its way into trunk?

Happy holidays!
--
Roman Zhuykov

[-- Attachment #2: sms-doloop-any-reg.diff --]
[-- Type: text/plain, Size: 1574 bytes --]

diff --git a/gcc/config/arm/arm.md b/gcc/config/arm/arm.md
index 9d7310b..ab0373d 100644
--- a/gcc/config/arm/arm.md
+++ b/gcc/config/arm/arm.md
@@ -10699,6 +10699,49 @@
   "
 )
 
+(define_expand "doloop_end"
+  [(use (match_operand 0 "arm_general_register_operand" "")) ; loop pseudo
+   (use (match_operand 1 "" ""))  ; iterations; zero if unknown
+   (use (match_operand 2 "" ""))  ; max iterations
+   (use (match_operand 3 "" ""))  ; loop level
+   (use (match_operand 4 "" ""))] ; label
+  ""
+  "
+  {
+    if (optimize > 0 && flag_modulo_sched)
+      {
+	/* Only use this on innermost loops.  */
+	if (INTVAL (operands[3]) > 1)
+	  FAIL;
+	if (GET_MODE (operands[0]) != SImode)
+	  FAIL;
+	emit_jump_insn (gen_doloop_end_internal (operands[0], operands[4]));
+	DONE;
+      }
+    else
+      FAIL;
+  }")
+
+(define_insn "doloop_end_internal"
+  [(set (pc) (if_then_else
+	       (ne (match_operand:SI 0 "arm_general_register_operand" "")
+		   (const_int 1))
+	       (label_ref (match_operand 1 "" ""))
+	       (pc)))
+   (set (match_dup 0)
+	(plus:SI (match_dup 0)
+		 (const_int -1)))]
+  "TARGET_32BIT && optimize > 0 && flag_modulo_sched"
+  "*
+  if (arm_ccfsm_state == 1 || arm_ccfsm_state == 2)
+    {
+      arm_ccfsm_state += 2;
+    }
+  return \"subs\\t%0, %0, #1\;bne\\t%l1\";
+  "
+  [(set_attr "length" "8")
+   (set_attr "type" "branch")]
+)
+
 ;; Load the load/store multiple patterns
 (include "ldmstm.md")
 ;; Load the FPA co-processor patterns

^ permalink raw reply	[flat|nested] 8+ messages in thread
* Re: [ARM] Implementing doloop pattern
  2010-12-30 14:04 [ARM] Implementing doloop pattern Roman Zhuykov
@ 2010-12-30 16:02 ` Ulrich Weigand
  2010-12-30 16:56 ` Revital1 Eres
  1 sibling, 0 replies; 8+ messages in thread
From: Ulrich Weigand @ 2010-12-30 16:02 UTC (permalink / raw)
To: Roman Zhuykov; +Cc: gcc, dm

Roman Zhuykov wrote:

> Memory is used instead of a register to store doloop counter.

Yes, this can happen, and your doloop insn pattern *must* be able to
handle this.  This is usually done via a splitter (and possibly an
additional scratch register allocated via an extra insn operand).  See
various other doloop implementations for examples, like s390 or rs6000.

(The reason why the register allocator and/or reload cannot fix this is:
the doloop counter is an *output* as well as an input of the insn, so it
would require an output reload to fix; however, the doloop insn is also
a *jump* pattern, and jumps must never have output reloads, since reload
has no place to put them.)

Bye,
Ulrich

--
  Dr. Ulrich Weigand
  GNU Toolchain for Linux on System z and Cell BE
  Ulrich.Weigand@de.ibm.com

^ permalink raw reply	[flat|nested] 8+ messages in thread
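The shape Ulrich describes - an insn whose constraints also accept a
memory operand, plus a splitter and a scratch register that break the
memory case apart after reload - might look roughly like the sketch
below.  This is a hypothetical illustration only, loosely modeled on how
other targets handle it; the predicates, constraints, and the final
compare-and-branch step are assumptions, not working ARM md code (on ARM
the branch would itself have to be a valid CC-setting subs/bne pattern).

```lisp
;; Hypothetical sketch (not working ARM code): let the counter live in a
;; register or in memory, and give reload a scratch (operand 2) so the
;; memory case can be split into load / decrement / store / branch.
(define_insn_and_split "doloop_end_internal"
  [(set (pc) (if_then_else
               (ne (match_operand:SI 0 "nonimmediate_operand" "+r,m")
                   (const_int 1))
               (label_ref (match_operand 1 "" ""))
               (pc)))
   (set (match_dup 0)
        (plus:SI (match_dup 0) (const_int -1)))
   (clobber (match_scratch:SI 2 "=X,&r"))]
  "TARGET_32BIT"
  "subs\\t%0, %0, #1\;bne\\t%l1"   ; register alternative only
  "&& reload_completed && MEM_P (operands[0])"
  ;; Memory alternative, split after reload:
  ;;   scratch = mem; scratch -= 1; mem = scratch; branch if scratch != 0
  [(set (match_dup 2) (match_dup 0))
   (set (match_dup 2) (plus:SI (match_dup 2) (const_int -1)))
   (set (match_dup 0) (match_dup 2))
   (set (pc) (if_then_else (ne (match_dup 2) (const_int 0))
                           (label_ref (match_dup 1))
                           (pc)))]
  "")
```

A real implementation would emit "#" as the output template for the
memory alternative (it is always split, never printed) and use the
target's real compare-and-branch patterns in the split; the s390 and
rs6000 doloop patterns Ulrich mentions are the places to look.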
* Re: [ARM] Implementing doloop pattern
  2010-12-30 14:04 [ARM] Implementing doloop pattern Roman Zhuykov
  2010-12-30 16:02 ` Ulrich Weigand
@ 2010-12-30 16:56 ` Revital1 Eres
  2011-01-05 15:35 ` Richard Earnshaw
  1 sibling, 1 reply; 8+ messages in thread
From: Revital1 Eres @ 2010-12-30 16:56 UTC (permalink / raw)
To: Roman Zhuykov; +Cc: dm, gcc, cltang, yao, Ayal Zaks

[-- Attachment #1: Type: text/plain, Size: 3882 bytes --]

Hello,

The attached patch is my latest attempt to model doloop for ARM.  I
followed Chung-Lin Tang's suggestion and used subs+jump, similar to your
patch.  On Cortex-A8 I see a gain of 29% on the autocor benchmark
(telecom suite) with SMS, using the following flags:
-fmodulo-sched-allow-regmoves -funsafe-loop-optimizations -fmodulo-sched
-fno-auto-inc-dec -fdump-rtl-sms -mthumb -mcpu=cortex-a8 -O3
(compared to using only -mthumb -mcpu=cortex-a8 -O3).

I have not fully tested the patch, and it is not yet in the proper
format for submission.

Thanks,
Revital

(See attached file: patch_arm_doloop.txt)

[-- Attachment #2: patch_arm_doloop.txt --]
[-- Type: text/plain, Size: 5558 bytes --]

Index: modulo-sched.c
===================================================================
--- modulo-sched.c	(revision 167637)
+++ modulo-sched.c	(working copy)
@@ -1021,7 +1021,8 @@ sms_schedule (void)
       if (CALL_P (insn) || BARRIER_P (insn)
 	  || (NONDEBUG_INSN_P (insn) && !JUMP_P (insn)
-	      && !single_set (insn) && GET_CODE (PATTERN (insn)) != USE)
+	      && !single_set (insn) && GET_CODE (PATTERN (insn)) != USE
+	      && !reg_mentioned_p (count_reg, insn))
 	  || (FIND_REG_INC_NOTE (insn, NULL_RTX) != 0)
 	  || (INSN_P (insn) && (set = single_set (insn))
 	      && GET_CODE (SET_DEST (set)) == SUBREG))
Index: loop-doloop.c
===================================================================
--- loop-doloop.c	(revision 167637)
+++ loop-doloop.c	(working copy)
@@ -96,7 +96,15 @@ doloop_condition_get (rtx doloop_pat)
      2)  (set (reg) (plus (reg) (const_int -1))
          (set (pc) (if_then_else (reg != 0)
                                  (label_ref (label))
-                                 (pc))).  */
+                                 (pc))).
+
+     In ARM the following sequence of instructions implements doloop:
+
+     3) (parallel [(set (cc) (compare ((plus (reg) (const_int -1), 0)))
+                   (set (reg) (plus (reg) (const_int -1)))])
+        (set (pc) (if_then_else (NE == cc)
+                                (label_ref (label))
+                                (pc))) */
 
   pattern = PATTERN (doloop_pat);
 
@@ -111,7 +119,12 @@ doloop_condition_get (rtx doloop_pat)
 	return 0;
 
       cmp = pattern;
-      inc = PATTERN (PREV_INSN (doloop_pat));
+      if (GET_CODE (PATTERN (prev_insn)) == PARALLEL)
+        {
+          inc = XVECEXP (PATTERN (prev_insn), 0, 1);
+        }
+      else
+        inc = PATTERN (PREV_INSN (doloop_pat));
       /* We expect the condition to be of the form (reg != 0)  */
       cond = XEXP (SET_SRC (cmp), 0);
       if (GET_CODE (cond) != NE || XEXP (cond, 1) != const0_rtx)
@@ -162,6 +175,7 @@ doloop_condition_get (rtx doloop_pat)
     return 0;
 
   if ((XEXP (condition, 0) == reg)
+      || (REGNO (XEXP (condition, 0)) == CC_REGNUM)
       || (GET_CODE (XEXP (condition, 0)) == PLUS
 	  && XEXP (XEXP (condition, 0), 0) == reg))
     {
@@ -181,7 +195,24 @@
        (set (reg) (plus (reg) (const_int -1)))
        (additional clobbers and uses)])
 
-     So we return that form instead.
+     For the third form we expect:
+
+     (parallel [(set (cc) (compare ((plus (reg) (const_int -1)), 0))
+                (set (reg) (plus (reg) (const_int -1)))])
+     (set (pc) (if_then_else (NE == cc)
+                             (label_ref (label))
+                             (pc)))
+
+     which is equivalent to the following:
+
+     (parallel [(set (cc) (compare (reg, 1))
+                (set (reg) (plus (reg) (const_int -1)))])
+     (set (pc) (if_then_else (NE == cc)
+                             (label_ref (label))
+                             (pc)))
+
+     So we return the second form instead.
+  */
 
   condition = gen_rtx_fmt_ee (NE, VOIDmode, inc_src, const1_rtx);
 
Index: config/arm/thumb2.md
===================================================================
--- config/arm/thumb2.md	(revision 167637)
+++ config/arm/thumb2.md	(working copy)
@@ -836,7 +836,7 @@
   "operands[4] = GEN_INT (- INTVAL (operands[2]));"
 )
 
-(define_insn "*thumb2_addsi3_compare0"
+(define_insn "thumb2_addsi3_compare0"
   [(set (reg:CC_NOOV CC_REGNUM)
 	(compare:CC_NOOV
 	  (plus:SI (match_operand:SI 1 "s_register_operand" "l, 0, r")
@@ -1118,3 +1118,49 @@
   "
   operands[2] = GEN_INT (32 - INTVAL (operands[2]));
   ")
+
+;; Define the subtract-one-and-jump insns so loop.c
+;; knows what to generate.
+(define_expand "doloop_end"
+  [(use (match_operand 0 "" ""))  ; loop pseudo
+   (use (match_operand 1 "" ""))  ; iterations; zero if unknown
+   (use (match_operand 2 "" ""))  ; max iterations
+   (use (match_operand 3 "" ""))  ; loop level
+   (use (match_operand 4 "" ""))] ; label
+  "TARGET_THUMB2"
+  "
+  {
+    /* Currently SMS relies on the do-loop pattern to recognize loops
+       where (1) the control part comprises of all insns defining and/or
+       using a certain 'count' register and (2) the loop count can be
+       adjusted by modifying this register prior to the loop.
+       ??? The possible introduction of a new block to initialize the
+       new IV can potentially effects branch optimizations.  */
+    if (optimize > 0 && flag_modulo_sched)
+      {
+	rtx s0;
+	rtx bcomp;
+	rtx loc_ref;
+	rtx cc_reg;
+
+	/* Only use this on innermost loops.  */
+	if (INTVAL (operands[3]) > 1)
+	  FAIL;
+	if (GET_MODE (operands[0]) != SImode)
+	  FAIL;
+
+	cc_reg = gen_rtx_REG (CC_NOOVmode, CC_REGNUM);
+	s0 = operands[0];
+	emit_insn (gen_thumb2_addsi3_compare0 (s0, s0, GEN_INT (-1)));
+	bcomp = gen_rtx_NE (VOIDmode, cc_reg, const0_rtx);
+	loc_ref = gen_rtx_LABEL_REF (VOIDmode, operands[4]);
+	emit_jump_insn (gen_rtx_SET (VOIDmode, pc_rtx,
+				     gen_rtx_IF_THEN_ELSE (VOIDmode, bcomp,
+							   loc_ref, pc_rtx)));
+
+	DONE;
+      }
+    else
+      FAIL;
+  }")

^ permalink raw reply	[flat|nested] 8+ messages in thread
* Re: [ARM] Implementing doloop pattern
  2010-12-30 16:56 ` Revital1 Eres
@ 2011-01-05 15:35 ` Richard Earnshaw
  2011-01-06  7:59 ` Revital1 Eres
  0 siblings, 1 reply; 8+ messages in thread
From: Richard Earnshaw @ 2011-01-05 15:35 UTC (permalink / raw)
To: Revital1 Eres; +Cc: Roman Zhuykov, dm, gcc, cltang, yao, Ayal Zaks

On Thu, 2010-12-30 at 18:56 +0200, Revital1 Eres wrote:
> Hello,
>
> The attached patch is my latest attempt to model doloop for arm.
> I followed Chung-Lin Tang suggestion and used subs+jump similar to your
> patch.

@@ -162,6 +175,7 @@ doloop_condition_get (rtx doloop_pat)
     return 0;
 
   if ((XEXP (condition, 0) == reg)
+      || (REGNO (XEXP (condition, 0)) == CC_REGNUM)
       || (GET_CODE (XEXP (condition, 0)) == PLUS
 	  && XEXP (XEXP (condition, 0), 0) == reg))

You can't depend on CC_REGNUM in generic code.  That's part of the
private machine description for ARM.  Other cores have different ways of
representing condition codes.

R.

^ permalink raw reply	[flat|nested] 8+ messages in thread
* Re: [ARM] Implementing doloop pattern
  2011-01-05 15:35 ` Richard Earnshaw
@ 2011-01-06  7:59 ` Revital1 Eres
  2011-01-06  9:11 ` Andreas Schwab
  2011-01-13 13:51 ` Nathan Froyd
  0 siblings, 2 replies; 8+ messages in thread
From: Revital1 Eres @ 2011-01-06 7:59 UTC (permalink / raw)
To: Richard Earnshaw; +Cc: Ayal Zaks, cltang, dm, gcc, yao, Roman Zhuykov

[-- Attachment #1: Type: text/plain, Size: 685 bytes --]

Hello,

> > @@ -162,6 +175,7 @@ doloop_condition_get (rtx doloop_pat)
> >      return 0;
> >
> >    if ((XEXP (condition, 0) == reg)
> > +      || (REGNO (XEXP (condition, 0)) == CC_REGNUM)
> >        || (GET_CODE (XEXP (condition, 0)) == PLUS
> >  	  && XEXP (XEXP (condition, 0), 0) == reg))
>
> You can't depend on CC_REGNUM in generic code.  That's part of the
> private machine description for ARM.  Other cores have different ways of
> representing condition codes.
>
> R.

Yes, thanks, I found that out when testing the patch on PowerPC.
Attached is a newer version of the patch, which is currently under
testing.

Thanks,
Revital

(See attached file: patch_arm_doloop_5.txt)

[-- Attachment #2: patch_arm_doloop_5.txt --]
[-- Type: text/plain, Size: 7131 bytes --]

Index: modulo-sched.c
===================================================================
--- modulo-sched.c	(revision 168397)
+++ modulo-sched.c	(working copy)
@@ -1021,7 +1021,8 @@ sms_schedule (void)
       if (CALL_P (insn) || BARRIER_P (insn)
 	  || (NONDEBUG_INSN_P (insn) && !JUMP_P (insn)
-	      && !single_set (insn) && GET_CODE (PATTERN (insn)) != USE)
+	      && !single_set (insn) && GET_CODE (PATTERN (insn)) != USE
+	      && !reg_mentioned_p (count_reg, insn))
 	  || (FIND_REG_INC_NOTE (insn, NULL_RTX) != 0)
 	  || (INSN_P (insn) && (set = single_set (insn))
 	      && GET_CODE (SET_DEST (set)) == SUBREG))
Index: loop-doloop.c
===================================================================
--- loop-doloop.c	(revision 168397)
+++ loop-doloop.c	(working copy)
@@ -78,6 +78,8 @@ doloop_condition_get (rtx doloop_pat)
   rtx inc_src;
   rtx condition;
   rtx pattern;
+  rtx cc_reg = NULL_RTX;
+  rtx reg_orig;
 
@@ -96,7 +98,16 @@ doloop_condition_get (rtx doloop_pat)
      2)  (set (reg) (plus (reg) (const_int -1))
          (set (pc) (if_then_else (reg != 0)
                                  (label_ref (label))
-                                 (pc))).  */
+                                 (pc))).
+
+     Some targets (ARM) do the comparison before the branch, as in the
+     folloring form:
+
+     3) (parallel [(set (cc) (compare ((plus (reg) (const_int -1), 0)))
+                   (set (reg) (plus (reg) (const_int -1)))])
+        (set (pc) (if_then_else (NE == cc)
+                                (label_ref (label))
+                                (pc))) */
 
   pattern = PATTERN (doloop_pat);
 
@@ -104,14 +115,42 @@ doloop_condition_get (rtx doloop_pat)
     {
       rtx cond;
       rtx prev_insn = prev_nondebug_insn (doloop_pat);
+      rtx src_orig;
+      rtx cmp_orig;
 
-      /* We expect the decrement to immediately precede the branch.  */
+      /* In case the pattern is not PARALLEL we expect two forms
+	 of doloop which are cases 2) and 3) above: in case 2) the
+	 decrement is immediately precedes the branch. while in case
+	 3) the compre and decrement instructions immediately precede
+	 the branch.  */
 
       if (prev_insn == NULL_RTX || !INSN_P (prev_insn))
         return 0;
 
       cmp = pattern;
-      inc = PATTERN (PREV_INSN (doloop_pat));
+      if (GET_CODE (PATTERN (prev_insn)) == PARALLEL)
+        {
+	  /* The third case: the compre and decrement instructions
+	     immediately precede the branch.  */
+          cmp_orig = XVECEXP (PATTERN (prev_insn), 0, 0);
+          if (GET_CODE (cmp_orig) != SET)
+            return 0;
+          if (GET_CODE (SET_SRC (cmp_orig)) != COMPARE)
+            return 0;
+          src_orig = XEXP (SET_SRC (cmp_orig), 0);
+          if (XEXP (SET_SRC (cmp_orig), 1) != const0_rtx
+              || GET_CODE (src_orig) != PLUS)
+            return 0;
+          reg_orig = XEXP (src_orig, 0);
+          if (XEXP (src_orig, 1) != GEN_INT (-1)
+              || !REG_P (reg_orig))
+            return 0;
+          cc_reg = SET_DEST (cmp_orig);
+
+          inc = XVECEXP (PATTERN (prev_insn), 0, 1);
+        }
+      else
+        inc = PATTERN (PREV_INSN (doloop_pat));
       /* We expect the condition to be of the form (reg != 0)  */
       cond = XEXP (SET_SRC (cmp), 0);
       if (GET_CODE (cond) != NE || XEXP (cond, 1) != const0_rtx)
@@ -162,6 +201,9 @@ doloop_condition_get (rtx doloop_pat)
     return 0;
 
   if ((XEXP (condition, 0) == reg)
+      || ((cc_reg != NULL_RTX)
+	  && (XEXP (condition, 0) == cc_reg)
+	  && (reg_orig == reg))
       || (GET_CODE (XEXP (condition, 0)) == PLUS
 	  && XEXP (XEXP (condition, 0), 0) == reg))
     {
@@ -181,7 +223,24 @@
        (set (reg) (plus (reg) (const_int -1)))
        (additional clobbers and uses)])
 
-     So we return that form instead.
+     For the third form we expect:
+
+     (parallel [(set (cc) (compare ((plus (reg) (const_int -1)), 0))
+                (set (reg) (plus (reg) (const_int -1)))])
+     (set (pc) (if_then_else (NE == cc)
+                             (label_ref (label))
+                             (pc)))
+
+     which is equivalent to the following:
+
+     (parallel [(set (cc) (compare (reg, 1))
+                (set (reg) (plus (reg) (const_int -1)))
+                (set (pc) (if_then_else (NE == cc)
+                                        (label_ref (label))
+                                        (pc))))])
+
+     So we return the second form instead for the two cases.
+  */
 
   condition = gen_rtx_fmt_ee (NE, VOIDmode, inc_src, const1_rtx);
 
Index: config/arm/thumb2.md
===================================================================
--- config/arm/thumb2.md	(revision 168424)
+++ config/arm/thumb2.md	(working copy)
@@ -836,7 +836,7 @@
   "operands[4] = GEN_INT (- INTVAL (operands[2]));"
 )
 
-(define_insn "*thumb2_addsi3_compare0"
+(define_insn "thumb2_addsi3_compare0"
   [(set (reg:CC_NOOV CC_REGNUM)
 	(compare:CC_NOOV
 	  (plus:SI (match_operand:SI 1 "s_register_operand" "l, 0, r")
@@ -1118,3 +1118,53 @@
   "
   operands[2] = GEN_INT (32 - INTVAL (operands[2]));
   ")
+
+;; Define the subtract-one-and-jump insns so loop.c
+;; knows what to generate.
+(define_expand "doloop_end"
+  [(use (match_operand 0 "" ""))  ; loop pseudo
+   (use (match_operand 1 "" ""))  ; iterations; zero if unknown
+   (use (match_operand 2 "" ""))  ; max iterations
+   (use (match_operand 3 "" ""))  ; loop level
+   (use (match_operand 4 "" ""))] ; label
+  "TARGET_THUMB2"
+  "
+  {
+    /* Currently SMS relies on the do-loop pattern to recognize loops
+       where (1) the control part comprises of all insns defining and/or
+       using a certain 'count' register and (2) the loop count can be
+       adjusted by modifying this register prior to the loop.
+       ??? The possible introduction of a new block to initialize the
+       new IV can potentially effects branch optimizations.  */
+    if (optimize > 0 && flag_modulo_sched)
+      {
+	rtx s0;
+	rtx bcomp;
+	rtx loc_ref;
+	rtx cc_reg;
+	rtx insn;
+	rtx cmp;
+
+	/* Only use this on innermost loops.  */
+	if (INTVAL (operands[3]) > 1)
+	  FAIL;
+	if (GET_MODE (operands[0]) != SImode)
+	  FAIL;
+
+	s0 = operands[0];
+	insn = emit_insn (gen_thumb2_addsi3_compare0 (s0, s0, GEN_INT (-1)));
+	cmp = XVECEXP (PATTERN (insn), 0, 0);
+	cc_reg = SET_DEST (cmp);
+	bcomp = gen_rtx_NE (VOIDmode, cc_reg, const0_rtx);
+	loc_ref = gen_rtx_LABEL_REF (VOIDmode, operands[4]);
+	emit_jump_insn (gen_rtx_SET (VOIDmode, pc_rtx,
+				     gen_rtx_IF_THEN_ELSE (VOIDmode, bcomp,
+							   loc_ref, pc_rtx)));
+
+	DONE;
+      }
+    else
+      FAIL;
+  }")

^ permalink raw reply	[flat|nested] 8+ messages in thread
* Re: [ARM] Implementing doloop pattern
  2011-01-06  7:59 ` Revital1 Eres
@ 2011-01-06  9:11 ` Andreas Schwab
  2011-01-13 11:11 ` Ramana Radhakrishnan
  2011-01-13 13:51 ` Nathan Froyd
  1 sibling, 1 reply; 8+ messages in thread
From: Andreas Schwab @ 2011-01-06 9:11 UTC (permalink / raw)
To: Revital1 Eres
Cc: Richard Earnshaw, Ayal Zaks, cltang, dm, gcc, yao, Roman Zhuykov

Revital1 Eres <ERES@il.ibm.com> writes:

> Index: loop-doloop.c
> ===================================================================
> --- loop-doloop.c	(revision 168397)
> +++ loop-doloop.c	(working copy)

> +	  /* The third case: the compre and decrement instructions

s/compre/compare/.

> Index: config/arm/thumb2.md
> ===================================================================
> --- config/arm/thumb2.md	(revision 168424)
> +++ config/arm/thumb2.md	(working copy)

> +       ??? The possible introduction of a new block to initialize the
> +       new IV can potentially effects branch optimizations.  */

s/effects/effect/

Andreas.

--
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 58CA 54C7 6D53 942B 1756 01D3 44D5 214B 8276 4ED5
"And now for something completely different."

^ permalink raw reply	[flat|nested] 8+ messages in thread
* Re: [ARM] Implementing doloop pattern
  2011-01-06  9:11 ` Andreas Schwab
@ 2011-01-13 11:11 ` Ramana Radhakrishnan
  0 siblings, 0 replies; 8+ messages in thread
From: Ramana Radhakrishnan @ 2011-01-13 11:11 UTC (permalink / raw)
To: Andreas Schwab
Cc: Revital1 Eres, Richard Earnshaw, Ayal Zaks, cltang, dm, gcc, yao,
	Roman Zhuykov

>> Index: config/arm/thumb2.md
>> ===================================================================
>> --- config/arm/thumb2.md	(revision 168424)
>> +++ config/arm/thumb2.md	(working copy)
>
>> +       ??? The possible introduction of a new block to initialize the
>> +       new IV can potentially effects branch optimizations.  */
>
> s/effects/effect/

Err, I think it should be "affect" rather than "effect" here.  Thus
s/effects/affect/.

Ramana

^ permalink raw reply	[flat|nested] 8+ messages in thread
* Re: [ARM] Implementing doloop pattern
  2011-01-06  7:59 ` Revital1 Eres
  2011-01-06  9:11 ` Andreas Schwab
@ 2011-01-13 13:51 ` Nathan Froyd
  1 sibling, 0 replies; 8+ messages in thread
From: Nathan Froyd @ 2011-01-13 13:51 UTC (permalink / raw)
To: Revital1 Eres
Cc: Richard Earnshaw, Ayal Zaks, cltang, dm, gcc, yao, Roman Zhuykov

On Thu, Jan 06, 2011 at 09:59:08AM +0200, Revital1 Eres wrote:
> Index: loop-doloop.c

> +     Some targets (ARM) do the comparison before the branch, as in the
> +     folloring form:
        ^^^^^^^^^
"following"

> +      /* In case the pattern is not PARALLEL we expect two forms
> +	 of doloop which are cases 2) and 3) above: in case 2) the
> +	 decrement is immediately precedes the branch. while in case
                    ^^
Take out the "is".

> +	 3) the compre and decrement instructions immediately precede
           ^^^^^^
"compare"

> Index: config/arm/thumb2.md

> +    /* Currently SMS relies on the do-loop pattern to recognize loops
> +       where (1) the control part comprises of all insns defining and/or
                                     ^^^^^^^^^
I think "consists" would be more idiomatic here, even if it's still a
little awkward.

-Nathan

^ permalink raw reply	[flat|nested] 8+ messages in thread
end of thread, other threads:[~2011-01-13 13:51 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-12-30 14:04 [ARM] Implementing doloop pattern Roman Zhuykov
2010-12-30 16:02 ` Ulrich Weigand
2010-12-30 16:56 ` Revital1 Eres
2011-01-05 15:35 ` Richard Earnshaw
2011-01-06  7:59 ` Revital1 Eres
2011-01-06  9:11 ` Andreas Schwab
2011-01-13 11:11 ` Ramana Radhakrishnan
2011-01-13 13:51 ` Nathan Froyd