public inbox for gcc-bugs@sourceware.org
* [Bug c/67606] New: Missing optimization: load possible return value before early-out test
@ 2015-09-17  6:08 peter at cordes dot ca
  2015-09-17  8:08 ` [Bug middle-end/67606] " rguenth at gcc dot gnu.org
  0 siblings, 1 reply; 2+ messages in thread
From: peter at cordes dot ca @ 2015-09-17  6:08 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67606

            Bug ID: 67606
           Summary: Missing optimization: load possible return value
                    before early-out test
           Product: gcc
           Version: 5.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
          Assignee: unassigned at gcc dot gnu.org
          Reporter: peter at cordes dot ca
  Target Milestone: ---

Sorry if this is a duplicate; I couldn't think of any good search terms to
describe it.


In a function that might return a constant if a loop condition is false for the
first iteration, gcc generates a separate basic block for the early-out return
when it doesn't need to.

int f(int a[], int length) {
  int count=0;
  for (int i = 0 ; i < length ; i++)
    if (a[i] > 5)
      count++;

  return count;
}

x86 gcc 5.2 -O3 -fno-tree-vectorize compiles it to:

f(int*, int):
        testl   %esi, %esi      # length
        jle     .L4     #,
        leal    -1(%rsi), %eax  #, D.2373
        leaq    4(%rdi,%rax,4), %rcx    #, D.2371
        xorl    %eax, %eax      # count
.L3:
        xorl    %edx, %edx      # D.2370
        cmpl    $5, (%rdi)      #, MEM[base: _29, offset: 0B]
        setg    %dl     #, D.2370
        addq    $4, %rdi        #, ivtmp.7
        addl    %edx, %eax      # D.2370, count
        cmpq    %rdi, %rcx      # ivtmp.7, D.2371
        jne     .L3     #,
        rep ret
.L4:
        xorl    %eax, %eax      # count
        ret

(actually g++, since I used godbolt.org, but gcc 4.9.2 looked similar)


A better sequence would be:

f(int*, int):
        xorl    %eax, %eax      # count
        testl   %esi, %esi      # length
        jle     .L4     #,

... unchanged ...

        cmpq    %rdi, %rcx      # ivtmp.7, D.2371
        jne     .L3     #,
.L4:
        rep ret



In this case, the ret was already a rep ret, so this change saves ~3
instruction bytes (or, more usually, none, because of padding) at zero speed
cost whether the early-out branch is taken or not.


Also (this looks like a separate bug), what is gcc doing with the two
lea instructions?  Defending against integer overflow when calculating a
one-past-the-end loop boundary pointer?  It doesn't do that when f() takes a
size_t arg.    (In this case nothing depends on the lea results soon enough
for it to matter; they're dependent on each other anyway, and the first one is
among the first 4 insns.)

        leal    -1(%rsi), %eax  #, D.2373
        leaq    4(%rdi,%rax,4), %rcx    #, D.2371
        xorl    %eax, %eax      # count

Unless I'm missing something, leaq (%rdi,%rsi,4), %rcx  should be just as
safe.  Is something subtracting 1 before scaling and then correcting after
scaling for a reason?  Is it something that doesn't realize the address will be
computed with lea, which avoids any problems?  It's an extra instruction, and
4(...) stores the offset as a 32-bit constant in the insn encoding, not a
sign-extended 8-bit value.



* [Bug middle-end/67606] Missing optimization: load possible return value before early-out test
  2015-09-17  6:08 [Bug c/67606] New: Missing optimization: load possible return value before early-out test peter at cordes dot ca
@ 2015-09-17  8:08 ` rguenth at gcc dot gnu.org
  0 siblings, 0 replies; 2+ messages in thread
From: rguenth at gcc dot gnu.org @ 2015-09-17  8:08 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67606

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2015-09-17
                 CC|                            |matz at gcc dot gnu.org
          Component|c                           |middle-end
     Ever confirmed|0                           |1

--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
So for the main part of this PR we actually expand from

  <bb 7>:
  # count_16 = PHI <count_1(6), 0(2)>
  return count_16;

so it is a matter of coalescing and where we put that copy from zero.

We coalesce the following way:

Coalesce list: (1)count_1 & (15)count_15 [map: 0, 7] : Success -> 0
Coalesce list: (3)ivtmp.6_3 & (13)ivtmp.6_13 [map: 2, 6] : Success -> 2
Coalesce list: (1)count_1 & (11)count_11 [map: 0, 5] : Success -> 0
Coalesce list: (1)count_1 & (16)count_16 [map: 0, 8] : Success -> 0
Coalesce list: (2)ivtmp.6_2 & (13)ivtmp.6_3 [map: 1, 2] : Success -> 2

so 'count' is fully coalesced but of course the constant is still there
and we insert a copy on the 2->7 edge.

Inserting a value copy on edge BB3->BB4 : PART.0 = 0
Inserting a value copy on edge BB2->BB7 : PART.0 = 0

which also looks good (we use the correct partition for this).  Note that
the zero init isn't partially redundant, so GCSE isn't able to optimize
here, and RTL code hoisting isn't very advanced.

I'm also sure the RA guys will say it's not the RA's job to do the
hoisting.

So with my usual GIMPLE hat on I'd say it would have been nice to help
things by placing the value copy more intelligently.  We have a pass
that is supposed to help here - uncprop.  We're faced with

  <bb 2>:
  if (length_4(D) > 0)
    goto <bb 3>;
  else
    goto <bb 7>;

  <bb 3>:
...

  <bb 4>:
  # count_15 = PHI <0(3), count_1(6)>
...

  <bb 6>:
  # count_1 = PHI <count_15(4), count_11(5)>
  ivtmp.6_3 = ivtmp.6_13 + 4;
  if (ivtmp.6_3 != _25)
    goto <bb 4>;
  else
    goto <bb 7>;

  <bb 7>:
  # count_16 = PHI <count_1(6), 0(2)>
  return count_16;

which might be a good enough pattern to detect (with a slight complication from
the forwarder BB3).  Note that we'd increase register pressure throughout
BB3, and that for the whole thing to work we'd still need to make sure we can
coalesce all of count and the register we init with zero.

Given the interaction with coalescing I wonder whether it makes sense to
do "uncprop" together with coalescing, or to make emitting and placing
value-copies work on GIMPLE, exposing the partitions explicitly somehow.

Well, trying to improve uncprop for this particular testcase would work
and shouldn't be too hard (you'd extend it from detecting edge equivalences
to existing vars to doing PHI value hoisting).  There is also other code
trying to improve coalescing - rewrite_out_of_ssa has insert_backedge_copies -
so we could also detect the pattern here and insert copies from the
common constant in the common dominator.


