[Bug c/50693] New: Slightly different loop body leads to 5.5x slower performance

* [Bug c/50693] New: Slightly different loop body leads to 5.5x slower performance
@ 2011-10-10 22:45 alex.gaynor at gmail dot com
  2011-10-10 22:52 ` [Bug c/50693] " alex.gaynor at gmail dot com
                   ` (26 more replies)
  0 siblings, 27 replies; 28+ messages in thread
From: alex.gaynor at gmail dot com @ 2011-10-10 22:45 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50693

             Bug #: 50693
           Summary: Slightly different loop body leads to 5.5x slower
                    performance
    Classification: Unclassified
           Product: gcc
           Version: 4.6.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
        AssignedTo: unassigned@gcc.gnu.org
        ReportedBy: alex.gaynor@gmail.com


Created attachment 25460
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=25460
Script which reproduces the speed difference.

Given the two loops in the attached file, the first generates excellent code,
which performs the same as a memset; the second, however, results in very poor
code which is 5.5x slower than the first. If anyone is curious why we have such
strange-looking code: it is generated by a compiler which targets C.

This was tested by compiling at -O3 with -std=gnu99 on my machine:


alex@alex-gaynor-laptop:/tmp$ gcc -O3 test.c -std=gnu99
alex@alex-gaynor-laptop:/tmp$ time ./a.out > /dev/null 

real    0m0.427s
user    0m0.412s
sys    0m0.016s
alex@alex-gaynor-laptop:/tmp$ time ./a.out > /dev/null 

real    0m0.428s
user    0m0.416s
sys    0m0.008s
alex@alex-gaynor-laptop:/tmp$ time ./a.out > /dev/null 

real    0m0.432s
user    0m0.404s
sys    0m0.024s
alex@alex-gaynor-laptop:/tmp$ 
alex@alex-gaynor-laptop:/tmp$ time ./a.out 0 > /dev/null 

real    0m2.225s
user    0m2.200s
sys    0m0.020s
alex@alex-gaynor-laptop:/tmp$ time ./a.out 0 > /dev/null 

real    0m2.217s
user    0m2.196s
sys    0m0.016s
alex@alex-gaynor-laptop:/tmp$ time ./a.out 0 > /dev/null 

real    0m2.268s
user    0m2.252s
sys    0m0.012s


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [Bug c/50693] Slightly different loop body leads to 5.5x slower performance
  2011-10-10 22:45 [Bug c/50693] New: Slightly different loop body leads to 5.5x slower performance alex.gaynor at gmail dot com
@ 2011-10-10 22:52 ` alex.gaynor at gmail dot com
  2011-10-10 22:54 ` [Bug tree-optimization/50693] Slightly different loop body leads not vectoring loop pinskia at gcc dot gnu.org
                   ` (25 subsequent siblings)
  26 siblings, 0 replies; 28+ messages in thread
From: alex.gaynor at gmail dot com @ 2011-10-10 22:52 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50693

--- Comment #1 from Alex Gaynor <alex.gaynor at gmail dot com> 2011-10-10 22:52:07 UTC ---
It may be interesting to note that clang (version 2.9) does not exhibit this
performance difference; both versions execute quickly.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [Bug tree-optimization/50693] Slightly different loop body leads not vectoring loop
  2011-10-10 22:45 [Bug c/50693] New: Slightly different loop body leads to 5.5x slower performance alex.gaynor at gmail dot com
  2011-10-10 22:52 ` [Bug c/50693] " alex.gaynor at gmail dot com
@ 2011-10-10 22:54 ` pinskia at gcc dot gnu.org
  2011-10-10 22:55 ` pinskia at gcc dot gnu.org
                   ` (24 subsequent siblings)
  26 siblings, 0 replies; 28+ messages in thread
From: pinskia at gcc dot gnu.org @ 2011-10-10 22:54 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50693

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |missed-optimization
          Component|c                           |tree-optimization
            Summary|Slightly different loop     |Slightly different loop
                   |body leads to 5.5x slower   |body leads not vectoring
                   |performance                 |loop

--- Comment #2 from Andrew Pinski <pinskia at gcc dot gnu.org> 2011-10-10 22:53:50 UTC ---
The loop is not vectorized in the second case for some reason.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [Bug tree-optimization/50693] Slightly different loop body leads not vectoring loop
  2011-10-10 22:45 [Bug c/50693] New: Slightly different loop body leads to 5.5x slower performance alex.gaynor at gmail dot com
  2011-10-10 22:52 ` [Bug c/50693] " alex.gaynor at gmail dot com
  2011-10-10 22:54 ` [Bug tree-optimization/50693] Slightly different loop body leads not vectoring loop pinskia at gcc dot gnu.org
@ 2011-10-10 22:55 ` pinskia at gcc dot gnu.org
  2011-10-10 23:57 ` dje at gcc dot gnu.org
                   ` (23 subsequent siblings)
  26 siblings, 0 replies; 28+ messages in thread
From: pinskia at gcc dot gnu.org @ 2011-10-10 22:55 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50693

--- Comment #3 from Andrew Pinski <pinskia at gcc dot gnu.org> 2011-10-10 22:55:07 UTC ---
(In reply to comment #1)
> It may be interesting to note that clang (version 2.9) does not exhibit this
> performance difference; both versions execute quickly.

Is that really true?  I thought LLVM does not have a vectorizer.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [Bug tree-optimization/50693] Slightly different loop body leads not vectoring loop
  2011-10-10 22:45 [Bug c/50693] New: Slightly different loop body leads to 5.5x slower performance alex.gaynor at gmail dot com
                   ` (2 preceding siblings ...)
  2011-10-10 22:55 ` pinskia at gcc dot gnu.org
@ 2011-10-10 23:57 ` dje at gcc dot gnu.org
  2011-10-11  0:04 ` dje at gcc dot gnu.org
                   ` (22 subsequent siblings)
  26 siblings, 0 replies; 28+ messages in thread
From: dje at gcc dot gnu.org @ 2011-10-10 23:57 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50693

--- Comment #4 from David Edelsohn <dje at gcc dot gnu.org> 2011-10-10 23:56:55 UTC ---
Created attachment 25462
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=25462
GCC 4.6.1 assembler output


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [Bug tree-optimization/50693] Slightly different loop body leads not vectoring loop
  2011-10-10 22:45 [Bug c/50693] New: Slightly different loop body leads to 5.5x slower performance alex.gaynor at gmail dot com
                   ` (3 preceding siblings ...)
  2011-10-10 23:57 ` dje at gcc dot gnu.org
@ 2011-10-11  0:04 ` dje at gcc dot gnu.org
  2011-10-11  0:22 ` dje at gcc dot gnu.org
                   ` (21 subsequent siblings)
  26 siblings, 0 replies; 28+ messages in thread
From: dje at gcc dot gnu.org @ 2011-10-11  0:04 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50693

David Edelsohn <dje at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
  Attachment #25462|0                           |1
        is obsolete|                            |

--- Comment #5 from David Edelsohn <dje at gcc dot gnu.org> 2011-10-11 00:04:05 UTC ---
Created attachment 25463
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=25463
Clang/LLVM x86_64 output


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [Bug tree-optimization/50693] Slightly different loop body leads not vectoring loop
  2011-10-10 22:45 [Bug c/50693] New: Slightly different loop body leads to 5.5x slower performance alex.gaynor at gmail dot com
                   ` (4 preceding siblings ...)
  2011-10-11  0:04 ` dje at gcc dot gnu.org
@ 2011-10-11  0:22 ` dje at gcc dot gnu.org
  2011-10-11  0:23 ` pinskia at gcc dot gnu.org
                   ` (20 subsequent siblings)
  26 siblings, 0 replies; 28+ messages in thread
From: dje at gcc dot gnu.org @ 2011-10-11  0:22 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50693

--- Comment #6 from David Edelsohn <dje at gcc dot gnu.org> 2011-10-11 00:21:48 UTC ---
Created attachment 25464
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=25464
GCC 4.6.1 assembler output


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [Bug tree-optimization/50693] Slightly different loop body leads not vectoring loop
  2011-10-10 22:45 [Bug c/50693] New: Slightly different loop body leads to 5.5x slower performance alex.gaynor at gmail dot com
                   ` (5 preceding siblings ...)
  2011-10-11  0:22 ` dje at gcc dot gnu.org
@ 2011-10-11  0:23 ` pinskia at gcc dot gnu.org
  2011-10-11  1:12 ` [Bug tree-optimization/50693] Loop optimization restricted by GOTOs dje at gcc dot gnu.org
                   ` (19 subsequent siblings)
  26 siblings, 0 replies; 28+ messages in thread
From: pinskia at gcc dot gnu.org @ 2011-10-11  0:23 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50693

--- Comment #7 from Andrew Pinski <pinskia at gcc dot gnu.org> 2011-10-11 00:22:28 UTC ---
Oh but LLVM has a memset loop detector which causes the speed up to happen.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [Bug tree-optimization/50693] Loop optimization restricted by GOTOs
  2011-10-10 22:45 [Bug c/50693] New: Slightly different loop body leads to 5.5x slower performance alex.gaynor at gmail dot com
                   ` (6 preceding siblings ...)
  2011-10-11  0:23 ` pinskia at gcc dot gnu.org
@ 2011-10-11  1:12 ` dje at gcc dot gnu.org
  2011-10-11  1:24 ` pinskia at gcc dot gnu.org
                   ` (18 subsequent siblings)
  26 siblings, 0 replies; 28+ messages in thread
From: dje at gcc dot gnu.org @ 2011-10-11  1:12 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50693

David Edelsohn <dje at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2011-10-11
     Ever Confirmed|0                           |1

--- Comment #8 from David Edelsohn <dje at gcc dot gnu.org> 2011-10-11 01:11:47 UTC ---
Both loop1 and loop2 produce the same code on LLVM, presumably from its memset
pattern:

        movq    %rax, 8(%r15)
        movq    %rbx, (%r15)
        testq   %rbx, %rbx
        je      .LBB1_3
# BB#1:
        movq    %rbx, %rcx
        movq    %rax, %rdx
        .align  16, 0x90
.LBB1_2:                                # %.lr.ph
                                        # =>This Inner Loop Header: Depth=1
        movb    %r14b, (%rdx)
        incq    %rdx
        decq    %rcx
        jne     .LBB1_2
.LBB1_3:                                # %._crit_edge
        movb    $0, (%rax,%rbx)

Direct pointer arithmetic might not be recommended, but Intel makes do.


For loop1, GCC produces:

        testq   %rbx, %rbx
        movq    %rax, 8(%rbp)
        movq    %rbx, 0(%rbp)
        je      .L3
        xorl    %edx, %edx
        .p2align 4,,10
        .p2align 3
.L5:
        movb    %r12b, (%rax,%rdx)
        addq    $1, %rdx
        movq    8(%rbp), %rax
        cmpq    %rbx, %rdx
        jne     .L5
.L3:
        movb    $0, (%rax,%rbx)

For loop2, GCC produces:

        xorl    %edx, %edx
        testq   %rbx, %rbx
        movq    %rax, 8(%rbp)
        movq    %rbx, 0(%rbp)
        jne     .L13
        jmp     .L9
        .p2align 4,,10
        .p2align 3
.L11:
        movq    8(%rbp), %rax
.L8:
.L13:
.L10:
        movb    %r12b, (%rax,%rdx)
        addq    $1, %rdx
        cmpq    %rbx, %rdx
        jne     .L11
        movq    8(%rbp), %rax
.L9:
        movb    $0, (%rax,%rbx)

In both cases GCC unnecessarily re-reads v->chars.

Is loop2 slower because the jne .L13 jump into the middle of the loop confuses
the Intel loop branch predictor logic?  Or does the loop2 instruction order
crack into uops badly?  The cause of the performance difference is not obvious.
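
A minimal sketch of the re-read of v->chars noted above. One common reason
for such a reload is that a store through a char pointer may alias any
object, including v->chars itself; the thread does not establish that this
is the cause here. The struct layout and the function names fill_reload and
fill_hoisted are assumptions (the attachment is not reproduced in this
thread), chosen so that the fields sit at offsets 0 and 8 on x86-64, as in
the listings above.

```c
#include <stddef.h>

/* Hypothetical layout: length at offset 0, data pointer at offset 8. */
struct vec {
    size_t len;
    char *chars;
};

/* Each store is a char store, which is allowed to alias v->chars,
   so the compiler may have to reload the pointer on every iteration. */
void fill_reload(struct vec *v, size_t n, char c)
{
    for (size_t i = 0; i < n; i++)
        v->chars[i] = c;
    v->chars[n] = '\0';
}

/* Copying the pointer into a local first lets it stay in a register,
   because stores through p cannot change the local variable p. */
void fill_hoisted(struct vec *v, size_t n, char c)
{
    char *p = v->chars;
    for (size_t i = 0; i < n; i++)
        p[i] = c;
    p[n] = '\0';
}
```

Both functions write the same bytes; only the number of loads of v->chars
that the compiler is obliged to keep may differ.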


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [Bug tree-optimization/50693] Loop optimization restricted by GOTOs
  2011-10-10 22:45 [Bug c/50693] New: Slightly different loop body leads to 5.5x slower performance alex.gaynor at gmail dot com
                   ` (7 preceding siblings ...)
  2011-10-11  1:12 ` [Bug tree-optimization/50693] Loop optimization restricted by GOTOs dje at gcc dot gnu.org
@ 2011-10-11  1:24 ` pinskia at gcc dot gnu.org
  2011-10-11  1:35 ` dje at gcc dot gnu.org
                   ` (17 subsequent siblings)
  26 siblings, 0 replies; 28+ messages in thread
From: pinskia at gcc dot gnu.org @ 2011-10-11  1:24 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50693

--- Comment #9 from Andrew Pinski <pinskia at gcc dot gnu.org> 2011-10-11 01:24:00 UTC ---
The vectorization is not being done for the second version of the loop with the
goto.  I have not looked into the cause of it though.  Note -fno-tree-vectorize
shows that the loop is slow for both cases.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [Bug tree-optimization/50693] Loop optimization restricted by GOTOs
  2011-10-10 22:45 [Bug c/50693] New: Slightly different loop body leads to 5.5x slower performance alex.gaynor at gmail dot com
                   ` (8 preceding siblings ...)
  2011-10-11  1:24 ` pinskia at gcc dot gnu.org
@ 2011-10-11  1:35 ` dje at gcc dot gnu.org
  2011-10-11  7:15 ` irar at il dot ibm.com
                   ` (16 subsequent siblings)
  26 siblings, 0 replies; 28+ messages in thread
From: dje at gcc dot gnu.org @ 2011-10-11  1:35 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50693

--- Comment #10 from David Edelsohn <dje at gcc dot gnu.org> 2011-10-11 01:35:20 UTC ---
Sorry, I was looking at the loop1 and loop2 functions, not the code inlined
into the benchmark for main.

LLVM generates:

        movq    %r12, %rdi
        movl    $99, %esi
        movq    %rbx, %rdx
        callq   memset

GCC vectorizes loop1:

.L22:
        addq    $1, %rdx
        movdqa  %xmm0, (%rcx)
        addq    $16, %rcx
        cmpq    %rsi, %rdx
        jb      .L22

but not loop2:

.L28:
.L29:
        movb    $99, (%rbx,%rax)
        addq    $1, %rax
        cmpq    %rbp, %rax
        jne     .L28


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [Bug tree-optimization/50693] Loop optimization restricted by GOTOs
  2011-10-10 22:45 [Bug c/50693] New: Slightly different loop body leads to 5.5x slower performance alex.gaynor at gmail dot com
                   ` (9 preceding siblings ...)
  2011-10-11  1:35 ` dje at gcc dot gnu.org
@ 2011-10-11  7:15 ` irar at il dot ibm.com
  2011-10-11 14:07 ` dje at gcc dot gnu.org
                   ` (15 subsequent siblings)
  26 siblings, 0 replies; 28+ messages in thread
From: irar at il dot ibm.com @ 2011-10-11  7:15 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50693

Ira Rosen <irar at il dot ibm.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |irar at il dot ibm.com

--- Comment #11 from Ira Rosen <irar at il dot ibm.com> 2011-10-11 07:15:15 UTC ---
The vectorizer doesn't handle control flow in loop, and for the second loop we
have:

<bb 3>:
  goto <bb 7> (copy_block);

loop_back:
  if (n_3(D) > i_10)
    goto <bb 6>;
  else
    goto <bb 5>;

<bb 5>:
  pretmp.20_6 = v_20->chars;
  goto <bb 8> (end);

<bb 6>:
  pretmp.20_2 = v_20->chars;

  # i_29 = PHI <i_10(6), 0(3)>
  # prephitmp.21_1 = PHI <pretmp.20_2(6), D.4528_22(3)>
copy_block:
  D.4443_8 = prephitmp.21_1 + i_29;
  *D.4443_8 = c_9(D);
  i_10 = i_29 + 1;
  goto <bb 4> (loop_back);


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [Bug tree-optimization/50693] Loop optimization restricted by GOTOs
  2011-10-10 22:45 [Bug c/50693] New: Slightly different loop body leads to 5.5x slower performance alex.gaynor at gmail dot com
                   ` (10 preceding siblings ...)
  2011-10-11  7:15 ` irar at il dot ibm.com
@ 2011-10-11 14:07 ` dje at gcc dot gnu.org
  2011-10-11 14:13 ` rguenth at gcc dot gnu.org
                   ` (14 subsequent siblings)
  26 siblings, 0 replies; 28+ messages in thread
From: dje at gcc dot gnu.org @ 2011-10-11 14:07 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50693

--- Comment #12 from David Edelsohn <dje at gcc dot gnu.org> 2011-10-11 14:06:34 UTC ---
Because the vectorizer analysis occurs fairly early, I guess there is not a lot
of opportunity to clean up the control flow.

Should GCC have a memset peephole pass like LLVM?


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [Bug tree-optimization/50693] Loop optimization restricted by GOTOs
  2011-10-10 22:45 [Bug c/50693] New: Slightly different loop body leads to 5.5x slower performance alex.gaynor at gmail dot com
                   ` (11 preceding siblings ...)
  2011-10-11 14:07 ` dje at gcc dot gnu.org
@ 2011-10-11 14:13 ` rguenth at gcc dot gnu.org
  2011-10-11 14:15 ` paolo.carlini at oracle dot com
                   ` (13 subsequent siblings)
  26 siblings, 0 replies; 28+ messages in thread
From: rguenth at gcc dot gnu.org @ 2011-10-11 14:13 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50693

--- Comment #13 from Richard Guenther <rguenth at gcc dot gnu.org> 2011-10-11 14:13:11 UTC ---
(In reply to comment #12)
> Because the vectorizer analysis occurs fairly early, I guess there is not a lot
> of opportunity to clean up the control flow.
> 
> Should GCC have a memset peephole pass like LLVM?

It does: -ftree-loop-distribute-patterns, enabled by default.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [Bug tree-optimization/50693] Loop optimization restricted by GOTOs
  2011-10-10 22:45 [Bug c/50693] New: Slightly different loop body leads to 5.5x slower performance alex.gaynor at gmail dot com
                   ` (12 preceding siblings ...)
  2011-10-11 14:13 ` rguenth at gcc dot gnu.org
@ 2011-10-11 14:15 ` paolo.carlini at oracle dot com
  2011-10-11 14:35 ` rguenth at gcc dot gnu.org
                   ` (12 subsequent siblings)
  26 siblings, 0 replies; 28+ messages in thread
From: paolo.carlini at oracle dot com @ 2011-10-11 14:15 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50693

--- Comment #14 from Paolo Carlini <paolo.carlini at oracle dot com> 2011-10-11 14:14:24 UTC ---
A memcmp too?!? (see also the discussion part of libstdc++/50661).


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [Bug tree-optimization/50693] Loop optimization restricted by GOTOs
  2011-10-10 22:45 [Bug c/50693] New: Slightly different loop body leads to 5.5x slower performance alex.gaynor at gmail dot com
                   ` (13 preceding siblings ...)
  2011-10-11 14:15 ` paolo.carlini at oracle dot com
@ 2011-10-11 14:35 ` rguenth at gcc dot gnu.org
  2011-10-11 14:36 ` rguenth at gcc dot gnu.org
                   ` (11 subsequent siblings)
  26 siblings, 0 replies; 28+ messages in thread
From: rguenth at gcc dot gnu.org @ 2011-10-11 14:35 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50693

--- Comment #15 from Richard Guenther <rguenth at gcc dot gnu.org> 2011-10-11 14:34:47 UTC ---
Note that it doesn't handle memset though, and the convoluted loop wouldn't be
easy to detect either.

    size_t i = 0;
    bool loop_cond = i < n;
    while (loop_cond) {
        goto copy_block;
        loop_back:
            loop_cond = i < n;
    }
    goto end;
    copy_block:
        v->chars[i] = c;
        i++;
        goto loop_back;
    end:
        v->chars[n] = '\0';
        return v;

is simply trying to be too clever.  I can't even understand that source ;)

What probably causes this is that we don't merge the blocks

  # i_29 = PHI <i_10(3), 0(2)>
copy_block:
  D.3506_7 = v_20->chars;
  D.3507_8 = D.3506_7 + i_29;
  *D.3507_8 = c_9(D);
  i_10 = i_29 + 1;
  goto <bb 3> (loop_back);

loop_back:
  loop_cond_11 = i_10 < n_3(D);
  if (loop_cond_11 != 0)
    goto <bb 4> (copy_block);
  else
    goto <bb 5> (end);

even though it's a fallthru edge.  We don't do this to preserve user labels
for debugging (and mind, no code-gen differences between -g0 vs. -g).
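
To make the source quoted above easier to follow, here is a self-contained
sketch that puts the goto form next to a conventional loop. The goto body is
taken from the snippet above; the struct layout, the function signatures and
the plain for-loop version are assumptions, since the attachment is not
reproduced in this thread.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical layout; only the chars member appears in the thread. */
struct vec {
    size_t len;
    char *chars;
};

/* Assumed shape of the first loop: an ordinary counted loop. */
struct vec *fill_plain(struct vec *v, size_t n, char c)
{
    for (size_t i = 0; i < n; i++)
        v->chars[i] = c;
    v->chars[n] = '\0';
    return v;
}

/* The goto form: the loop body lives outside the while statement and
   jumps back into it through the loop_back label. */
struct vec *fill_goto(struct vec *v, size_t n, char c)
{
    size_t i = 0;
    bool loop_cond = i < n;
    while (loop_cond) {
        goto copy_block;
        loop_back:
            loop_cond = i < n;
    }
    goto end;
    copy_block:
        v->chars[i] = c;
        i++;
        goto loop_back;
    end:
        v->chars[n] = '\0';
        return v;
}
```

Both functions store the same n bytes followed by a terminating zero, so any
difference in speed comes from how the compiler handles the control flow, not
from the work requested.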


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [Bug tree-optimization/50693] Loop optimization restricted by GOTOs
  2011-10-10 22:45 [Bug c/50693] New: Slightly different loop body leads to 5.5x slower performance alex.gaynor at gmail dot com
                   ` (14 preceding siblings ...)
  2011-10-11 14:35 ` rguenth at gcc dot gnu.org
@ 2011-10-11 14:36 ` rguenth at gcc dot gnu.org
  2011-10-11 14:40 ` dje at gcc dot gnu.org
                   ` (10 subsequent siblings)
  26 siblings, 0 replies; 28+ messages in thread
From: rguenth at gcc dot gnu.org @ 2011-10-11 14:36 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50693

--- Comment #16 from Richard Guenther <rguenth at gcc dot gnu.org> 2011-10-11 14:35:17 UTC ---
(In reply to comment #14)
> A memcmp too?!? (see also the discussion part of libstdc++/50661).

No, only memset with zero.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [Bug tree-optimization/50693] Loop optimization restricted by GOTOs
  2011-10-10 22:45 [Bug c/50693] New: Slightly different loop body leads to 5.5x slower performance alex.gaynor at gmail dot com
                   ` (15 preceding siblings ...)
  2011-10-11 14:36 ` rguenth at gcc dot gnu.org
@ 2011-10-11 14:40 ` dje at gcc dot gnu.org
  2011-10-11 14:44 ` rguenth at gcc dot gnu.org
                   ` (9 subsequent siblings)
  26 siblings, 0 replies; 28+ messages in thread
From: dje at gcc dot gnu.org @ 2011-10-11 14:40 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50693

--- Comment #17 from David Edelsohn <dje at gcc dot gnu.org> 2011-10-11 14:40:09 UTC ---
LLVM appears to be able to recognize memset of any value, not just zero.  And
apparently performs control flow simplification before attempting to recognize
the idiom, so it can expose the loop created by the convoluted GOTOs.
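
As an illustration of the idiom under discussion, the sketch below shows a
byte-fill loop next to the memset call that an idiom recognizer can replace
it with. The function names are made up for this example, and nothing here
claims to show what any particular compiler version emits.

```c
#include <stddef.h>
#include <string.h>

/* The loop shape: every byte of dst[0..n) receives the same value. */
void fill_loop(char *dst, size_t n, char c)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = c;
}

/* The equivalent library call, valid for any byte value, not just zero. */
void fill_memset(char *dst, size_t n, char c)
{
    memset(dst, c, n);
}
```

The two are interchangeable for c == 0 and for a non-zero value such as 99,
the constant passed to memset in the LLVM output earlier in the thread.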


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [Bug tree-optimization/50693] Loop optimization restricted by GOTOs
  2011-10-10 22:45 [Bug c/50693] New: Slightly different loop body leads to 5.5x slower performance alex.gaynor at gmail dot com
                   ` (16 preceding siblings ...)
  2011-10-11 14:40 ` dje at gcc dot gnu.org
@ 2011-10-11 14:44 ` rguenth at gcc dot gnu.org
  2011-10-11 14:46 ` rguenth at gcc dot gnu.org
                   ` (8 subsequent siblings)
  26 siblings, 0 replies; 28+ messages in thread
From: rguenth at gcc dot gnu.org @ 2011-10-11 14:44 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50693

Richard Guenther <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |aoliva at gcc dot gnu.org,
                   |                            |jakub at gcc dot gnu.org

--- Comment #18 from Richard Guenther <rguenth at gcc dot gnu.org> 2011-10-11 14:42:46 UTC ---
(In reply to comment #15)
> What probably causes this is that we don't merge the blocks
> 
>   # i_29 = PHI <i_10(3), 0(2)>
> copy_block:
>   D.3506_7 = v_20->chars;
>   D.3507_8 = D.3506_7 + i_29;
>   *D.3507_8 = c_9(D);
>   i_10 = i_29 + 1;
>   goto <bb 3> (loop_back);
> 
> loop_back:
>   loop_cond_11 = i_10 < n_3(D);
>   if (loop_cond_11 != 0)
>     goto <bb 4> (copy_block);
>   else
>     goto <bb 5> (end);
> 
> even though it's a fallthru edge.  We don't do this to preserve user labels
> for debugging (and mind, no code-gen differences between -g0 vs. -g).

Yes.  With

Index: gcc/tree-cfg.c
===================================================================
--- gcc/tree-cfg.c      (revision 179804)
+++ gcc/tree-cfg.c      (working copy)
@@ -1456,7 +1460,7 @@ gimple_can_merge_blocks_p (basic_block a

       /* Do not remove user labels.  */
       if (!DECL_ARTIFICIAL (lab))
-       return false;
+       ;
     }

   /* Protect the loop latches.  */

we merge the blocks and vectorize both loops.

The above patch is not acceptable though, which leaves making the vectorizer
deal with trivial control-flow (or a new GIMPLE_DEBUG kind that would
preserve the label?).
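
A source-level analogue of what merging the two blocks achieves, written by
hand for illustration. The real transformation happens on GIMPLE, not on C
source; the struct layout and the name fill_merged are assumptions. With
copy_block and loop_back fused, the loop is a single block with the exit test
at the bottom, and the user labels are gone.

```c
#include <stddef.h>

/* Hypothetical layout; only the chars member appears in the thread. */
struct vec {
    size_t len;
    char *chars;
};

struct vec *fill_merged(struct vec *v, size_t n, char c)
{
    size_t i = 0;
    if (i < n) {                  /* initial test, as in bool loop_cond = i < n */
        do {
            v->chars[i] = c;      /* former copy_block */
            i++;
        } while (i < n);          /* former loop_back, now in the same block */
    }
    v->chars[n] = '\0';           /* former end */
    return v;
}
```

The cost discussed in the surrounding comments is visible here as well: there
is no longer a copy_block or loop_back label to set a breakpoint on.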


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [Bug tree-optimization/50693] Loop optimization restricted by GOTOs
  2011-10-10 22:45 [Bug c/50693] New: Slightly different loop body leads to 5.5x slower performance alex.gaynor at gmail dot com
                   ` (17 preceding siblings ...)
  2011-10-11 14:44 ` rguenth at gcc dot gnu.org
@ 2011-10-11 14:46 ` rguenth at gcc dot gnu.org
  2011-10-11 14:51 ` jakub at gcc dot gnu.org
                   ` (7 subsequent siblings)
  26 siblings, 0 replies; 28+ messages in thread
From: rguenth at gcc dot gnu.org @ 2011-10-11 14:46 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50693

--- Comment #19 from Richard Guenther <rguenth at gcc dot gnu.org> 2011-10-11 14:45:22 UTC ---
(In reply to comment #17)
> LLVM appears to be able to recognize memset of any value, not just zero.  And
> apparently performs control flow simplification before attempting to recognize
> the idiom, so it can expose the loop created by the convoluted GOTOs.

I suppose you can no longer debug that though (break at the labels by
name), even when disabling the memset pattern detection?


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [Bug tree-optimization/50693] Loop optimization restricted by GOTOs
  2011-10-10 22:45 [Bug c/50693] New: Slightly different loop body leads to 5.5x slower performance alex.gaynor at gmail dot com
                   ` (18 preceding siblings ...)
  2011-10-11 14:46 ` rguenth at gcc dot gnu.org
@ 2011-10-11 14:51 ` jakub at gcc dot gnu.org
  2011-10-11 16:04 ` alex.gaynor at gmail dot com
                   ` (6 subsequent siblings)
  26 siblings, 0 replies; 28+ messages in thread
From: jakub at gcc dot gnu.org @ 2011-10-11 14:51 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50693

--- Comment #20 from Jakub Jelinek <jakub at gcc dot gnu.org> 2011-10-11 14:50:51 UTC ---
(In reply to comment #17)
> LLVM appears to be able to recognize memset of any value, not just zero.  And
> apparently performs control flow simplification before attempting to recognize
> the idiom, so it can expose the loop created by the convoluted GOTOs.

Well, GCC also performs lots of control flow simplifications; the bb's just
aren't merged here because that would mean the user label would be lost and
couldn't be used by the user debugging the code at all.

Vectorization restricts the cfg of the loop.  In successfully vectorized loops
it is unlikely user labels would be very helpful to the user, since multiple
iterations of the loop are performed together.

If we want to handle this obfuscated code, either we'd need to make debugging
experience worse for all loops (say at -O3), no matter if they will be
successfully vectorized or not, or lift up the restrictions in the vectorizer,
so that it would accept multiple basic blocks with only fallthru edges in
between and no phis or something similar, or temporarily merge the block and
split it again after vectorization, readding the user labels.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [Bug tree-optimization/50693] Loop optimization restricted by GOTOs
  2011-10-10 22:45 [Bug c/50693] New: Slightly different loop body leads to 5.5x slower performance alex.gaynor at gmail dot com
                   ` (19 preceding siblings ...)
  2011-10-11 14:51 ` jakub at gcc dot gnu.org
@ 2011-10-11 16:04 ` alex.gaynor at gmail dot com
  2011-10-12 15:21 ` rguenth at gcc dot gnu.org
                   ` (5 subsequent siblings)
  26 siblings, 0 replies; 28+ messages in thread
From: alex.gaynor at gmail dot com @ 2011-10-11 16:04 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50693

--- Comment #21 from Alex Gaynor <alex.gaynor at gmail dot com> 2011-10-11 16:02:56 UTC ---
Given the concern for preserving labels for debugging, perhaps allowing the
merging of basic blocks that eliminate labels could be conditional on either a
new function attribute or command line flag?


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [Bug tree-optimization/50693] Loop optimization restricted by GOTOs
  2011-10-10 22:45 [Bug c/50693] New: Slightly different loop body leads to 5.5x slower performance alex.gaynor at gmail dot com
                   ` (20 preceding siblings ...)
  2011-10-11 16:04 ` alex.gaynor at gmail dot com
@ 2011-10-12 15:21 ` rguenth at gcc dot gnu.org
  2011-10-25  4:48 ` aoliva at gcc dot gnu.org
                   ` (4 subsequent siblings)
  26 siblings, 0 replies; 28+ messages in thread
From: rguenth at gcc dot gnu.org @ 2011-10-12 15:21 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50693

--- Comment #22 from Richard Guenther <rguenth at gcc dot gnu.org> 2011-10-12 15:19:54 UTC ---
Yeah, maybe we can just throw them away with -O3.  Or decay them (on BB
merging) to

# DEBUG user_label:

that exposes the label to more code motion issues, so its location would be
less precise, but nothing prevents inter-block code-motion for labels at
the start of a fallthru destination either.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [Bug tree-optimization/50693] Loop optimization restricted by GOTOs
  2011-10-10 22:45 [Bug c/50693] New: Slightly different loop body leads to 5.5x slower performance alex.gaynor at gmail dot com
                   ` (21 preceding siblings ...)
  2011-10-12 15:21 ` rguenth at gcc dot gnu.org
@ 2011-10-25  4:48 ` aoliva at gcc dot gnu.org
  2011-11-04 17:18 ` jakub at gcc dot gnu.org
                   ` (3 subsequent siblings)
  26 siblings, 0 replies; 28+ messages in thread
From: aoliva at gcc dot gnu.org @ 2011-10-25  4:48 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50693

--- Comment #23 from Alexandre Oliva <aoliva at gcc dot gnu.org> 2011-10-25 04:48:24 UTC ---
Yup.  We don't even need a new debug stmt type, methinks.  Say, emit the debug
stmt with the LABEL_DECL, decay that to a debug stmt in cfgexpand, and turn
that into a NOTE_INSN_DELETED_LABEL during var-tracking initial scanning, to
minimize code motion impact.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [Bug tree-optimization/50693] Loop optimization restricted by GOTOs
  2011-10-10 22:45 [Bug c/50693] New: Slightly different loop body leads to 5.5x slower performance alex.gaynor at gmail dot com
                   ` (22 preceding siblings ...)
  2011-10-25  4:48 ` aoliva at gcc dot gnu.org
@ 2011-11-04 17:18 ` jakub at gcc dot gnu.org
  2011-11-05 19:59 ` jakub at gcc dot gnu.org
                   ` (2 subsequent siblings)
  26 siblings, 0 replies; 28+ messages in thread
From: jakub at gcc dot gnu.org @ 2011-11-04 17:18 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50693

Jakub Jelinek <jakub at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |ASSIGNED
         AssignedTo|unassigned at gcc dot       |jakub at gcc dot gnu.org
                   |gnu.org                     |

--- Comment #24 from Jakub Jelinek <jakub at gcc dot gnu.org> 2011-11-04 17:18:34 UTC ---
Created attachment 25721
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=25721
gcc47-pr50693.patch

Here is that idea implemented.  On the gimple/expansion side it was trivial;
the only complication was avoiding -fcompare-debug differences.
NOTE_INSN_DELETED_LABELs use the same numbers as other labels, unique for the
whole CU, so if we start allocating from that number pool for these labels,
which will only be present with -g (since they live in debug stmts and later on
in DEBUG_INSNs), we'll get label number differences.  Furthermore, since darwin
does some terrible hacks to work around the mess called MachO, it could even
result in different code generation.
So I decided to add a NOTE_INSN_DELETED_LABEL variant which will use a different
label namespace.
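
The numbering concern above can be illustrated with a toy model.  This is only
a sketch and not GCC code: `label_counters` and `last_code_label` are
hypothetical names invented here.  It assumes one debug-only label is created
before each ordinary label, and shows that ordinary label numbers shift under
-g when both kinds draw from one pool, but stay the same when debug labels get
their own namespace.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model, not GCC code: one counter per label namespace. */
struct label_counters {
    int code_label;   /* pool used by ordinary labels */
    int debug_label;  /* separate pool for debug-only labels */
};

/* Number n_code ordinary labels.  When with_debug is set, one debug-only
   label is allocated before each ordinary label, either from its own pool
   (separate_pool) or from the shared one.  Returns the number given to
   the last ordinary label. */
int last_code_label(bool with_debug, bool separate_pool, int n_code)
{
    struct label_counters c = {0, 0};
    int last = -1;

    assert(n_code > 0);
    for (int i = 0; i < n_code; i++) {
        if (with_debug) {
            if (separate_pool)
                c.debug_label++;   /* does not disturb ordinary numbering */
            else
                c.code_label++;    /* debug label consumes a shared number */
        }
        last = c.code_label++;
    }
    return last;
}
```

In this model, a build without debug labels and a build with debug labels in a
separate pool give the last ordinary label the same number, whereas sharing
the pool gives it a different one.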


* [Bug tree-optimization/50693] Loop optimization restricted by GOTOs
  2011-10-10 22:45 [Bug c/50693] New: Slightly different loop body leads to 5.5x slower performance alex.gaynor at gmail dot com
                   ` (23 preceding siblings ...)
  2011-11-04 17:18 ` jakub at gcc dot gnu.org
@ 2011-11-05 19:59 ` jakub at gcc dot gnu.org
  2011-11-05 20:09 ` alex.gaynor at gmail dot com
  2012-03-26 10:35 ` jakub at gcc dot gnu.org
  26 siblings, 0 replies; 28+ messages in thread
From: jakub at gcc dot gnu.org @ 2011-11-05 19:59 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50693

--- Comment #25 from Jakub Jelinek <jakub at gcc dot gnu.org> 2011-11-05 19:58:41 UTC ---
Author: jakub
Date: Sat Nov  5 19:58:37 2011
New Revision: 181014

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=181014
Log:
    PR tree-optimization/50693
    * tree-cfg.c (gimple_can_merge_blocks_p): Allow merging with
    non-forced user labels.
    (gimple_merge_blocks): Turn non-forced user labels into
    debug bind stmt with the label as first operand and reset value.
    (gimple_duplicate_bb): Don't duplicate label debug stmts.
    * dwarf2out.c (gen_label_die): Handle NOTE_INSN_DELETED_DEBUG_LABEL.
    * final.c (final_scan_insn): Likewise.
    (rest_of_clean_state): Don't dump NOTE_INSN_DELETED_DEBUG_LABEL.
    * var-tracking.c (debug_label_num): New variable.
    (delete_debug_insns): Don't delete DEBUG_INSNs for LABEL_DECLs,
    instead turn them into NOTE_INSN_DELETED_DEBUG_LABEL notes.
    * cfglayout.c (skip_insns_after_block, duplicate_insn_chain): Handle
    NOTE_INSN_DELETED_DEBUG_LABEL.
    (duplicate_insn_chain): Don't duplicate LABEL_DECL DEBUG_INSNs.
    * insn-notes.def (DELETED_DEBUG_LABEL): New note kind.
    * print-rtl.c (print_rtx): Handle NOTE_INSN_DELETED_DEBUG_LABEL.
    * gengtype.c (adjust_field_rtx_def): Likewise.
    * config/i386/i386.c (ix86_output_function_epilogue): For MachO
    clear CODE_LABEL_NUMBER of NOTE_INSN_DELETED_DEBUG_LABEL
    if they are at the end of function and nop hasn't been emitted.
    * config/rs6000/rs6000.c (rs6000_output_function_epilogue): Likewise.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/cfglayout.c
    trunk/gcc/config/i386/i386.c
    trunk/gcc/config/rs6000/rs6000.c
    trunk/gcc/dwarf2out.c
    trunk/gcc/final.c
    trunk/gcc/gengtype.c
    trunk/gcc/insn-notes.def
    trunk/gcc/print-rtl.c
    trunk/gcc/tree-cfg.c
    trunk/gcc/var-tracking.c
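
For readers without the attachment, the kind of source this change targets can
be sketched as follows.  This is a hypothetical reconstruction, not the
reporter's test case: `fill_plain` and `fill_goto` are names invented here.
The second function spells the same loop with user labels and gotos, roughly
as a compiler targeting C might emit it.  None of the labels has its address
taken, so they would presumably count as non-forced user labels in the sense
of the ChangeLog entry.

```c
#include <assert.h>
#include <stddef.h>

/* The loop written the ordinary way. */
void fill_plain(unsigned char *buf, size_t n, unsigned char v)
{
    assert(buf != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        buf[i] = v;
}

/* The same loop written with user labels and gotos.  Each label starts a
   new basic block, which is where block merging comes into play. */
void fill_goto(unsigned char *buf, size_t n, unsigned char v)
{
    size_t i = 0;

    assert(buf != NULL || n == 0);
loop_head:
    if (i >= n)
        goto loop_exit;
    goto loop_body;
loop_body:
    buf[i] = v;
    i++;
    goto loop_head;
loop_exit:
    return;
}
```

Both functions store the same bytes; whether a given GCC version generates
equally good code for them is the subject of this bug.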



* [Bug tree-optimization/50693] Loop optimization restricted by GOTOs
  2011-10-10 22:45 [Bug c/50693] New: Slightly different loop body leads to 5.5x slower performance alex.gaynor at gmail dot com
                   ` (24 preceding siblings ...)
  2011-11-05 19:59 ` jakub at gcc dot gnu.org
@ 2011-11-05 20:09 ` alex.gaynor at gmail dot com
  2012-03-26 10:35 ` jakub at gcc dot gnu.org
  26 siblings, 0 replies; 28+ messages in thread
From: alex.gaynor at gmail dot com @ 2011-11-05 20:09 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50693

--- Comment #26 from Alex Gaynor <alex.gaynor at gmail dot com> 2011-11-05 20:08:08 UTC ---
Thank you very much for fixing this!



* [Bug tree-optimization/50693] Loop optimization restricted by GOTOs
  2011-10-10 22:45 [Bug c/50693] New: Slightly different loop body leads to 5.5x slower performance alex.gaynor at gmail dot com
                   ` (25 preceding siblings ...)
  2011-11-05 20:09 ` alex.gaynor at gmail dot com
@ 2012-03-26 10:35 ` jakub at gcc dot gnu.org
  26 siblings, 0 replies; 28+ messages in thread
From: jakub at gcc dot gnu.org @ 2012-03-26 10:35 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50693

Jakub Jelinek <jakub at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |RESOLVED
         Resolution|                            |FIXED

--- Comment #27 from Jakub Jelinek <jakub at gcc dot gnu.org> 2012-03-26 10:23:22 UTC ---
Fixed for 4.7+, won't backport to older branches.



Thread overview: 28+ messages
2011-10-10 22:45 [Bug c/50693] New: Slightly different loop body leads to 5.5x slower performance alex.gaynor at gmail dot com
2011-10-10 22:52 ` [Bug c/50693] " alex.gaynor at gmail dot com
2011-10-10 22:54 ` [Bug tree-optimization/50693] Slightly different loop body leads not vectoring loop pinskia at gcc dot gnu.org
2011-10-10 22:55 ` pinskia at gcc dot gnu.org
2011-10-10 23:57 ` dje at gcc dot gnu.org
2011-10-11  0:04 ` dje at gcc dot gnu.org
2011-10-11  0:22 ` dje at gcc dot gnu.org
2011-10-11  0:23 ` pinskia at gcc dot gnu.org
2011-10-11  1:12 ` [Bug tree-optimization/50693] Loop optimization restricted by GOTOs dje at gcc dot gnu.org
2011-10-11  1:24 ` pinskia at gcc dot gnu.org
2011-10-11  1:35 ` dje at gcc dot gnu.org
2011-10-11  7:15 ` irar at il dot ibm.com
2011-10-11 14:07 ` dje at gcc dot gnu.org
2011-10-11 14:13 ` rguenth at gcc dot gnu.org
2011-10-11 14:15 ` paolo.carlini at oracle dot com
2011-10-11 14:35 ` rguenth at gcc dot gnu.org
2011-10-11 14:36 ` rguenth at gcc dot gnu.org
2011-10-11 14:40 ` dje at gcc dot gnu.org
2011-10-11 14:44 ` rguenth at gcc dot gnu.org
2011-10-11 14:46 ` rguenth at gcc dot gnu.org
2011-10-11 14:51 ` jakub at gcc dot gnu.org
2011-10-11 16:04 ` alex.gaynor at gmail dot com
2011-10-12 15:21 ` rguenth at gcc dot gnu.org
2011-10-25  4:48 ` aoliva at gcc dot gnu.org
2011-11-04 17:18 ` jakub at gcc dot gnu.org
2011-11-05 19:59 ` jakub at gcc dot gnu.org
2011-11-05 20:09 ` alex.gaynor at gmail dot com
2012-03-26 10:35 ` jakub at gcc dot gnu.org
