public inbox for gcc-bugs@sourceware.org
* [Bug tree-optimization/46763] New: gcc 4.5: missed optimization: copy global to local, prefetch
@ 2010-12-02  9:56 edwintorok at gmail dot com
  2010-12-02 11:03 ` [Bug tree-optimization/46763] " amonakov at gcc dot gnu.org
                   ` (3 more replies)
  0 siblings, 4 replies; 5+ messages in thread
From: edwintorok at gmail dot com @ 2010-12-02  9:56 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46763

           Summary: gcc 4.5: missed optimization: copy global to local,
                    prefetch
           Product: gcc
           Version: 4.5.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
        AssignedTo: unassigned@gcc.gnu.org
        ReportedBy: edwintorok@gmail.com


Created attachment 22601
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=22601
gy.i.bz2

I made a simple change to OCaml's GC: copy a global to a local variable
(restoring it before calling an external function), and add a prefetchnta.
The global-to-local change alone is worth a ~4% speedup, the prefetchnta alone
~9%, and both together ~10%.
I would expect GCC to do this optimization by itself (at least the global to
register one).
Attached is a testcase showing the missed optimization; the relevant function
is sweep_slice (and its manually optimized variants sweep_slice2, ...):
$ gcc-4.5 gy.i -O2 -lm
$ ./a.out
             default: 1.325195s ( 100.0%)
            glob2loc: 1.268875s ( 95.8% +- 1.024%)
         prefetchnta: 1.207342s ( 91.1% +- 0.4986%)
            prefetch: 1.277638s ( 96.4% +- 0.1179%)
glob2loc+prefetchnta: 1.199906s ( 90.5% +- 0.3629%)


default is the original function (sweep_slice). glob2loc is my manual
optimization of caml_gc_sweep_hp. prefetchnta and prefetch are
__builtin_prefetch calls added by me (non-temporal prefetch is very good
here). The last one combines both manual optimizations, resulting in a 9.5%
speedup.
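
A minimal sketch of the two manual changes described above, not the code from
the attachment. All names here (sweep_pos, release_block, PREFETCH_BYTES, the
toy block layout) are hypothetical stand-ins; only caml_gc_sweep_hp and
__builtin_prefetch come from the report. The reload after the call is an extra
assumption of this sketch, needed only if the callee may move the cursor.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy heap: a block is one header word followed by (hd >> 1)
   payload words; bit 0 of the header marks the block as garbage. */
#define PREFETCH_BYTES 256   /* arbitrary lookahead distance, not tuned */

size_t *sweep_pos;           /* global cursor, in the role of caml_gc_sweep_hp */
size_t  freed_words;

/* Stands in for the external function. For a truly external callee the
   compiler must assume it can read or write sweep_pos. */
void release_block(size_t *block)
{
    freed_words += block[0] >> 1;
}

/* Baseline: the cursor is read and written through the global. */
void sweep_default(size_t *limit)
{
    while (sweep_pos < limit) {
        size_t *block = sweep_pos;
        size_t hd = *block;
        sweep_pos = block + 1 + (hd >> 1);
        if (hd & 1)
            release_block(block);
    }
}

/* Manual variant: cursor kept in a local, plus a non-temporal prefetch. */
void sweep_glob2loc_prefetch(size_t *limit)
{
    size_t *pos = sweep_pos;                 /* copy global to local */
    while (pos < limit) {
        /* rw = 0 (read), locality = 0 (no temporal locality) */
        __builtin_prefetch((const void *)((uintptr_t)pos + PREFETCH_BYTES), 0, 0);
        size_t *block = pos;
        size_t hd = *block;
        pos = block + 1 + (hd >> 1);
        if (hd & 1) {
            sweep_pos = pos;                 /* restore before the call */
            release_block(block);
            pos = sweep_pos;                 /* reload (assumption, see above) */
        }
    }
    sweep_pos = pos;                         /* write back on exit */
}
```

Passing locality 3 instead of 0 as the last argument gives the ordinary
(temporal) prefetch variant. On x86, GCC typically emits prefetchnta for
locality 0.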

The attached testcase is quite large because I dumped the sizes of all objects
from the GC to get a realistic GC run. I also included all functions needed
for the GC to run.

gcc-4.5 and gcc-4.4 both miss this optimization; I didn't try older versions.
BTW, OCaml uses just -O -fno-defer-pop to compile instead of -O2, but -O vs.
-O2 doesn't make much difference on this testcase, so I used -O2.

$ gcc-4.5 -v
Using built-in specs.
COLLECT_GCC=gcc-4.5
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/4.5.1/lto-wrapper
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Debian 4.5.1-11'
--with-bugurl=file:///usr/share/doc/gcc-4.5/README.Bugs
--enable-languages=c,c++,fortran,objc,obj-c++ --prefix=/usr
--program-suffix=-4.5 --enable-shared --enable-multiarch
--enable-linker-build-id --with-system-zlib --libexecdir=/usr/lib
--without-included-gettext --enable-threads=posix
--with-gxx-include-dir=/usr/include/c++/4.5 --libdir=/usr/lib --enable-nls
--enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes
--enable-plugin --enable-gold --enable-ld=default --with-plugin-ld=ld.gold
--enable-objc-gc --with-arch-32=i586 --with-tune=generic
--enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu
--target=x86_64-linux-gnu
Thread model: posix
gcc version 4.5.1 (Debian 4.5.1-11)

CPU: AMD Phenom(tm) II X6 1090T Processor
uname -a: Linux debian 2.6.36-phenom #107 SMP PREEMPT Sat Oct 23 10:30:01 EEST
2010 x86_64 GNU/Linux



* [Bug tree-optimization/46763] gcc 4.5: missed optimization: copy global to local, prefetch
From: amonakov at gcc dot gnu.org @ 2010-12-02 11:03 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46763

Alexander Monakov <amonakov at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |amonakov at gcc dot gnu.org

--- Comment #1 from Alexander Monakov <amonakov at gcc dot gnu.org> 2010-12-02 11:03:00 UTC ---
Small testcase for the global load/store issue:

int g;
extern int bar(int);
void foo(int n)
{
  int i;
  for (i = 0; i < n; i++)
    {
      if (g)
        {
          g++;
          g = bar(i);
        }
      else
        g = i;
    }
}

Trunk at -O3 does not optimize stores to g (at -O2, it also loads g on each
iteration):

.L3:
        movl    %ebx, g(%rip)
        movl    %ebx, %eax
        addl    $1, %ebx
        cmpl    %ebp, %ebx
        je      .L1
.L5:
        testl   %eax, %eax
        je      .L3
        addl    $1, %eax
        movl    %ebx, %edi
        addl    $1, %ebx
        movl    %eax, g(%rip)
        call    bar
        cmpl    %ebp, %ebx
        movl    %eax, g(%rip)
        jne     .L5



* [Bug tree-optimization/46763] gcc 4.5: missed optimization: copy global to local, prefetch
From: rguenth at gcc dot gnu.org @ 2010-12-02 13:26 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46763

Richard Guenther <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2010.12.02 13:26:31
                 CC|                            |rguenth at gcc dot gnu.org
         Depends on|                            |41490
     Ever Confirmed|0                           |1
           Severity|normal                      |enhancement

--- Comment #2 from Richard Guenther <rguenth at gcc dot gnu.org> 2010-12-02 13:26:31 UTC ---
GCC has to preserve the stores and loads around the call to bar(), because
bar() might read or change the value of the variable.  So transforming to

int g;
extern int bar(int);
void foo(int n)
{
  int i;
  int tem = g;
  for (i = 0; i < n; i++)
    {
      if (tem)
        {
          tem++;
          tem = bar(i);
        }
      else
        tem = i;
    }
  g = tem;
}

(if that is what you did in your source-to-source transformation) is not valid.

GCC can't do conditional store motion, that is, transform it to

int g;
extern int bar(int);
void foo(int n)
{
  int i;
  int tem = g;
  for (i = 0; i < n; i++)
    {
      if (tem)
        {
          tem++;
          g = tem;
          tem = bar(i);
        }
      else
        tem = i;
    }
  g = tem;
}

which would be valid.  An enabling transform is missing as well, sinking
the store to g:

int g;
extern int bar(int);
void foo(int n)
{
  int i, tem;
  for (i = 0; i < n; i++)
    {
      if (g)
        {
          g++;
          tem = bar(i);
        }
      else
        tem = i;
      g = tem;
    }
}

which would then allow us to do the load part of the partial store motion
by PRE.  That is, you'd get

int g;
extern int bar(int);
void foo(int n)
{
  int i,tem;
  tem = g;
  for (i = 0; i < n; i++)
    {
      if (tem)
        {
          tem++;
          g = tem;
          tem = bar(i);
        }
      else
        tem = i;
      g = tem;
    }
}

but we don't understand that we can sink the store out of the loop,
as we don't understand the combined effect of g = tem; tem = bar (i);
on g.  You also get the above with -O3 because we see a partial partial
redundancy, but then you retain three stores (we still miss both
sinking opportunities).  Fixing PR41490 might fix both.
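
To make the validity argument above concrete, here is a small sketch in which
bar() is given one possible definition that reads g. The bar() body, the trace
variable, the observe() helper, and the foo_* names are all hypothetical and
exist only for illustration; the three loop bodies are the original testcase,
the invalid full hoist, and the conditional store motion variant from this
comment.

```c
int g;
int trace;   /* hypothetical instrumentation: decimal digits of each g seen by bar() */

/* One possible external bar(): it observes g, which GCC must assume. */
int bar(int i)
{
    trace = trace * 10 + g;
    return i;
}

/* Original testcase. */
void foo_original(int n)
{
    int i;
    for (i = 0; i < n; i++) {
        if (g) {
            g++;
            g = bar(i);
        } else
            g = i;
    }
}

/* Invalid: g is stale while bar() runs. */
void foo_hoisted(int n)
{
    int i;
    int tem = g;
    for (i = 0; i < n; i++) {
        if (tem) {
            tem++;
            tem = bar(i);
        } else
            tem = i;
    }
    g = tem;
}

/* Valid: g is stored before each call and once after the loop. */
void foo_csm(int n)
{
    int i;
    int tem = g;
    for (i = 0; i < n; i++) {
        if (tem) {
            tem++;
            g = tem;
            tem = bar(i);
        } else
            tem = i;
    }
    g = tem;
}

/* Runs one variant starting from g == start; returns what bar() observed. */
int observe(void (*foo)(int), int start, int n)
{
    g = start;
    trace = 0;
    foo(n);
    return trace;
}
```

With this bar(), observe(foo_original, 1, 4) and observe(foo_csm, 1, 4) return
the same trace, while observe(foo_hoisted, 1, 4) returns a different one even
though the final value of g is the same in all three. The trace encoding only
works while g stays a single digit, which is enough for a small demonstration.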



* [Bug tree-optimization/46763] gcc 4.5: missed optimization: copy global to local, prefetch
From: rguenth at gcc dot gnu.org @ 2011-03-15 11:44 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46763

Richard Guenther <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |matz at gcc dot gnu.org

--- Comment #3 from Richard Guenther <rguenth at gcc dot gnu.org> 2011-03-15 11:32:49 UTC ---
Store sinking now works and exposes an if-conversion possibility:

<bb 5>:
  g.1_6 = i_19 + 1;
  g = g.1_6;
  i_7 = bar (i_16);
  g = i_7;
  goto <bb 7>;

<bb 6>:
  g = i_16;

<bb 7>:
  # i_5 = PHI <i_7(5), i_16(6)>

The stores to g can be if-converted by re-using the existing PHI like so:

<bb 5>:
  g.1_6 = i_19 + 1;
  g = g.1_6;
  i_7 = bar (i_16);
  goto <bb 7>;

<bb 6>:
  ;

<bb 7>:
  # i_5 = PHI <i_7(5), i_16(6)>
  g = i_5;

That eventually fits into the cs_elim framework, but cs_elim runs
too early. Micha, do you remember why it runs where it runs?



* [Bug tree-optimization/46763] gcc 4.5: missed optimization: copy global to local, prefetch
From: pinskia at gcc dot gnu.org @ 2021-09-12  5:10 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=46763

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |missed-optimization

--- Comment #4 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
-O3 produces:
        jmp     .L6
        .p2align 4,,10
        .p2align 3
.L3:
        leal    1(%rbx), %edx
        movl    %ebx, g(%rip)
        cmpl    %edx, %ebp
        je      .L1
.L4:
        movl    %ebx, %eax
        movl    %edx, %ebx
.L6:
        testl   %eax, %eax
        je      .L3
        addl    $1, %eax
        movl    %ebx, %edi
        movl    %eax, g(%rip)
        call    bar(int)
        leal    1(%rbx), %edx
        movl    %eax, g(%rip)
        cmpl    %edx, %ebp
        je      .L1
        movl    %eax, %ebx
        jmp     .L4

