public inbox for gcc-bugs@sourceware.org
* [Bug c/52056] New: Code optimization sensitive to trivial changes
@ 2012-01-30 20:02 gccbug at jamasaru dot com
  2012-01-30 23:17 ` [Bug c/52056] " gccbug at jamasaru dot com
                   ` (3 more replies)
  0 siblings, 4 replies; 5+ messages in thread
From: gccbug at jamasaru dot com @ 2012-01-30 20:02 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52056

             Bug #: 52056
           Summary: Code optimization sensitive to trivial changes
    Classification: Unclassified
           Product: gcc
           Version: 4.6.1
            Status: UNCONFIRMED
          Severity: minor
          Priority: P3
         Component: c
        AssignedTo: unassigned@gcc.gnu.org
        ReportedBy: gccbug@jamasaru.com


GCC 4.6.1 (also reproducible in at least 4.5 and 4.4), on 64-bit Ubuntu.

The GCC optimizer detects and significantly speeds up some simple inner-loop
computations, but small changes to those inner loops can completely change the
behavior of the optimizer, losing 50% or more of the code speed.

Example code, simplified down from a hash table computation:

#include <stdio.h>
const int UNROLL=8;
int main()
{
  // static keyword in next line can trigger optimizer behavior!
  static unsigned long accum=0;
  unsigned int sum=0;  

  for (long j=0; j<0x400000000/UNROLL; ++j)            
    for (int unroll=0; unroll<UNROLL; ++unroll)    {
      unsigned long j;
      ++accum;
      // unsigned versus signed shift in next line can trigger optimizer
      j=(accum^(((unsigned long)accum)>>22));

      j*=0x6543210123456789;
      sum+=(unsigned int)(j>>32);
    }
  printf("Sum=%u\n", sum);  /* %u: sum is unsigned */
  return 0;
}

Compile with: gcc -O3 test.c -o test


The inner loop of shifts, XORs, multiplies, and adds is the core computation.
The GCC optimizer is very sensitive to whether the shift above is performed on
a SIGNED or UNSIGNED long. It is also sensitive to whether the "accum"
variable is static or not.

Some timings on an i7 980X CPU:

Static accum, signed shift:   11.3 seconds
Static accum, unsigned shift: 24.3 seconds
Local accum, signed shift:    12.1 seconds
Local accum, unsigned shift:  14.4 seconds


Looking at the assembly (-S) output, it's clear that the optimizations are
valid and effective, with the dramatic speed increase coming from switching
over to SSE math. The output is correct in all four cases.
The only problem is that the optimizer does not recognize the same
opportunity in all four cases, giving inconsistent performance.


This example is simplified down from the inner loops of some of my code
involving Monte Carlo simulation, and has a significant effect on runtime. The
Intel ICC compiler is significantly slower than GCC in most cases on my code.
For this specific example, ICC has an execution time of 14.3 seconds for all 4
cases. That 25% speed advantage of GCC is huge to us when we have 500 machines
in a cluster running simulations for days!



Thread overview: 5+ messages
2012-01-30 20:02 [Bug c/52056] New: Code optimization sensitive to trivial changes gccbug at jamasaru dot com
2012-01-30 23:17 ` [Bug c/52056] " gccbug at jamasaru dot com
2012-01-31  1:04 ` [Bug middle-end/52056] " jakub at gcc dot gnu.org
2012-01-31 11:40 ` [Bug tree-optimization/52056] Vectorizer cost model is imprecise rguenth at gcc dot gnu.org
2012-07-13  8:52 ` rguenth at gcc dot gnu.org
