From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-bugs-return-424509-listarch-gcc-bugs=gcc.gnu.org@gcc.gnu.org>
Received: (qmail 4521 invoked by alias); 17 Jun 2013 00:13:49 -0000
Mailing-List: contact gcc-bugs-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Id: <gcc-bugs.gcc.gnu.org>
List-Archive: <http://gcc.gnu.org/ml/gcc-bugs/>
List-Post: <mailto:gcc-bugs@gcc.gnu.org>
List-Help: <mailto:gcc-bugs-help@gcc.gnu.org>
Sender: gcc-bugs-owner@gcc.gnu.org
Received: (qmail 4479 invoked by uid 48); 17 Jun 2013 00:13:44 -0000
From: "olegendo at gcc dot gnu.org" <gcc-bugzilla@gcc.gnu.org>
To: gcc-bugs@gcc.gnu.org
Subject: [Bug rtl-optimization/55190] [SH] ivopts causes loop setup bloat
Date: Mon, 17 Jun 2013 00:13:00 -0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: changed
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: gcc
X-Bugzilla-Component: rtl-optimization
X-Bugzilla-Version: 4.8.0
X-Bugzilla-Keywords:
X-Bugzilla-Severity: enhancement
X-Bugzilla-Who: olegendo at gcc dot gnu.org
X-Bugzilla-Status: NEW
X-Bugzilla-Priority: P3
X-Bugzilla-Assigned-To: unassigned at gcc dot gnu.org
X-Bugzilla-Target-Milestone: ---
X-Bugzilla-Flags:
X-Bugzilla-Changed-Fields: component
Message-ID: <bug-55190-4-hcRrWsf1pF@http.gcc.gnu.org/bugzilla/>
In-Reply-To: <bug-55190-4@http.gcc.gnu.org/bugzilla/>
References: <bug-55190-4@http.gcc.gnu.org/bugzilla/>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: 7bit
X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
X-SW-Source: 2013-06/txt/msg00888.txt.bz2

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55190

Oleg Endo <olegendo at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
          Component|target                      |rtl-optimization
--- Comment #2 from Oleg Endo <olegendo at gcc dot gnu.org> ---
Looking at a simpler case (compiled with -O2):

int test_0 (int* x, int y)
{
  int sum = 0;

  for (int i = 0; i < y; ++i)
    sum += x[i];

  return sum;
}

        cmp/pl  r5
        bf/s    .L6
        mov     #0,r0

        shll2   r5
        add     #-4,r5
        shlr2   r5
        add     #1,r5    // r5 = ((((unsigned int)y << 2) - 4) >> 2) + 1

        .align 2
.L3:
        mov.l   @r4+,r1
        dt      r5
        bf/s    .L3
        add     r1,r0
.L6:
        rts
        nop

In this case, if 'y' initially has the value '0x7FFFFFFF', the resulting loop
count is truncated to '0x3FFFFFFF', because the left shift by 2 discards the
upper two bits.  This is sort of OK, since the resulting address would overflow
and that is undefined behavior.
On the other hand, if an unlucky address for 'x' is passed in the first place,
the address might overflow much earlier than that.  Thus the loop counter
fiddling seems pointless.
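
The truncation can be reproduced outside the compiler.  The following is a
minimal sketch, not GCC code: 'doloop_count' is a hypothetical helper name, and
'uint32_t' is assumed to model the 32-bit SH register r5.  It replays the four
counter setup instructions from the listing above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the counter setup in the SH listing:
   r5 = ((((unsigned int)y << 2) - 4) >> 2) + 1
   All arithmetic wraps modulo 2^32, as it would in a 32-bit register.  */
uint32_t doloop_count (int32_t y)
{
  /* The listing only reaches the counter setup when y > 0 (cmp/pl, bf/s).  */
  assert (y > 0);

  uint32_t r5 = (uint32_t) y;
  r5 <<= 2;   /* shll2   r5 : upper two bits of y are lost here  */
  r5 -= 4;    /* add     #-4,r5  */
  r5 >>= 2;   /* shlr2   r5  */
  r5 += 1;    /* add     #1,r5  */
  return r5;
}
```

Under these assumptions, doloop_count (10) yields 10, while
doloop_count (0x7FFFFFFF) yields 0x3FFFFFFF and doloop_count (0x40000001)
yields 1, which matches the truncation described above.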

The tree-ssa-ivopts pass converts the loop to this:

Replacing exit test: if (y_3(D) > i_11)
int test_0(int*, int) (int * x, int y)
{
  unsigned int ivtmp.6;
  int i;
  int sum;
  unsigned int i.0;
  unsigned int _1;
  void * _2;
  int _9;
  unsigned int _19;
  unsigned int _20;
  unsigned int _21;

  <bb 2>:
  if (y_3(D) > 0)
    goto <bb 3>;
  else
    goto <bb 7>;

  <bb 3>:
  ivtmp.6_12 = (unsigned int) x_6(D);
  _1 = (unsigned int) y_3(D);
  _21 = _1 * 4;
  _20 = (unsigned int) x_6(D);
  _19 = _20 + _21;

  <bb 4>:
  # sum_16 = PHI <sum_10(6), 0(3)>
  # ivtmp.6_15 = PHI <ivtmp.6_14(6), ivtmp.6_12(3)>
  _2 = (void *) ivtmp.6_15;
  _9 = MEM[base: _2, offset: 0B];
  sum_10 = _9 + sum_16;
  ivtmp.6_14 = ivtmp.6_15 + 4;
  if (ivtmp.6_14 != _19)
    goto <bb 6>;
  else
    goto <bb 5>;

  <bb 5>:
  # sum_18 = PHI <sum_10(4)>
  goto <bb 7>;

  <bb 6>:
  goto <bb 4>;

  <bb 7>:
  # sum_13 = PHI <sum_18(5), 0(2)>
  return sum_13;

}

... which uses address '(x + y * 4)' as the loop exit test.
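
Written back as C, the loop shape after ivopts looks roughly like the sketch
below.  This is only an illustration of the GIMPLE dump, not compiler output:
the function name 'test_0_ivopts' and the variables 'p' and 'end' are made up
here, and pointers are used where the dump uses 'unsigned int' addresses.

```c
/* Hypothetical C rendering of the loop after tree-ssa-ivopts.
   The counter 'i' is gone; the exit test compares the running
   address against the precomputed end address.  */
int test_0_ivopts (int *x, int y)
{
  int sum = 0;

  if (y > 0)                /* <bb 2>: guard, loop body runs at least once  */
    {
      int *p = x;           /* ivtmp.6_12 = x  */
      int *end = x + y;     /* _19 = x + y * 4 (in bytes)  */

      do
        {
          sum += *p;        /* _9 = MEM[base: _2]; sum_10 = _9 + sum_16  */
          p++;              /* ivtmp.6_14 = ivtmp.6_15 + 4  */
        }
      while (p != end);     /* if (ivtmp.6_14 != _19)  */
    }

  return sum;
}
```

For in-range inputs this returns the same sum as test_0; the difference is only
in how the exit condition is expressed.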

It is expanded to RTL as '(x + (y << 2))'


;; Generating RTL for gimple basic block 3

;; ivtmp.6_12 = (unsigned int) x_6(D);

(insn 38 37 0 (set (reg:SI 190 [ ivtmp.6 ])
        (reg/v/f:SI 194 [ x ])) -1
     (nil))

;; _19 = ivtmp.6_12 + _21;

(insn 39 38 40 (set (reg:SI 196 [ D.1617 ])
        (ashift:SI (reg/v:SI 195 [ y ])
            (const_int 2 [0x2]))) -1
     (nil))

(insn 40 39 0 (set (reg:SI 191 [ D.1617 ])
        (plus:SI (reg:SI 190 [ ivtmp.6 ])
            (reg:SI 196 [ D.1617 ]))) -1
     (nil))

... and remains until the loop2_doloop RTL pass, which converts the whole thing
into a decrement-and-test loop and adds the other loop counter modifications:

Analyzing operand (reg:SI 191 [ D.1617 ]) of insn (insn 45 44 46 4 (set (reg:SI 147 t)
        (eq:SI (reg:SI 190 [ ivtmp.6 ])
            (reg:SI 191 [ D.1617 ]))) sh_tmp.cpp:5 17 {cmpeqsi_t}
     (expr_list:REG_DEAD (reg:SI 191 [ D.1617 ])
        (nil)))
  invariant (reg:SI 191 [ D.1617 ]) (in SI)
;; improved upper bound by one.
;; Determined upper bound -2.
Loop 1 is simple:
  simple exit 4 -> 7
  number of iterations: (lshiftrt:SI (plus:SI (minus:SI (reg:SI 191 [ D.1617 ])
            (reg:SI 190 [ ivtmp.6 ]))
        (const_int -4 [0xfffffffffffffffc]))
    (const_int 2 [0x2]))
  upper bound: 1073741822
  realistic bound: -1


If the tree-ssa-ivopts pass is turned off via -fno-ivopts, the code in
loop-iv.c gets the actual loop count upper bound instead of the truncated
address upper bound, and works out the correct loop count.

I have also tried out the same code on ARM:

        cmp     r1, #0
        ble     .L4
        mov     r3, r0
        add     r1, r0, r1, asl #2
        mov     r0, #0
.L3:
        ldr     r2, [r3], #4
        cmp     r3, r1
        add     r0, r0, r2
        bne     .L3
        bx      lr
.L4:
        mov     r0, #0
        bx      lr

Since there is no doloop pattern on ARM, the code is left as it was output by
the tree-ssa-ivopts pass, i.e. the exit test uses 'x + (y << 2)'.

So this doesn't seem to be an SH-only issue.  However, I'm not sure whether
this is more a problem of tree-ssa-ivopts or of loop-iv.
If the tree-ssa-ivopts pass left the loop counter alone, the doloop pass would
pick it up and the upper bound calculations in this case would go away.
However, targets that can do better without doloop (such as ARM) would probably
suffer.  So it would probably be better to handle this overflow case in
loop-iv.