public inbox for gcc-bugs@sourceware.org
* [Bug c/53726] New: [4.8 Regression] aes test performance drop for eembc_2_0_peak_32
@ 2012-06-20 6:10 vbyakovl23 at gmail dot com
From: vbyakovl23 at gmail dot com @ 2012-06-20 6:10 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53726
Bug #: 53726
Summary: [4.8 Regression] aes test performance drop for
eembc_2_0_peak_32
Classification: Unclassified
Product: gcc
Version: 4.8.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c
AssignedTo: unassigned@gcc.gnu.org
ReportedBy: vbyakovl23@gmail.com
After the fix
r188261 | rguenth | 2012-06-06 13:45:27 +0400 (Wed, 06 Jun 2012) | 23 lines
2012-06-06 Richard Guenther <rguenther@suse.de>
PR tree-optimization/53081
* tree-data-ref.h (adjacent_store_dr_p): Rename to ...
(adjacent_dr_p): ... this and make it work for reads, too.
* tree-loop-distribution.c (enum partition_kind): Add PKIND_MEMCPY.
(struct partition_s): Change main_stmt to main_dr, add
secondary_dr member.
(build_size_arg_loc): Change to take data-reference and not
gimplify here.
(build_addr_arg_loc): New function split out from ...
(generate_memset_builtin): ... here. Use it and simplify.
(generate_memcpy_builtin): New function.
(generate_code_for_partition): Adjust.
(classify_partition): Streamline pattern detection. Detect
memcpy.
(ldist_gen): Adjust.
(tree_loop_distribution): Adjust seed statements for memcpy
recognition.
* gcc.dg/tree-ssa/ldist-20.c: New testcase.
* gcc.dg/tree-ssa/loop-19.c: Add -fno-tree-loop-distribute-patterns.
The regression is 11% on Atom and 30% on Sandy Bridge. The fix led to
memcpy no longer being recognized. A reduced test case and the assembler
outputs are attached. Command line to reproduce:
gcc -ansi -O3 -ffast-math -msse2 -mfpmath=sse -m32 -static -march=corei7
-mtune=corei7 test.c
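A hypothetical sketch (not the attached test case; names are invented) of the kind of loop at issue: a byte-wise copy with a small, variable trip count, which the r188261 loop-distribution change can classify as a memcpy pattern and replace with a library call at -O3:

```c
#include <assert.h>
#include <string.h>

/* Hypothetical reduction: a byte-wise copy whose trip count bc is
   variable but small (at most 8 in the AES kernel).  With
   -ftree-loop-distribute-patterns this loop is recognized and
   replaced by memcpy(dst, src, bc). */
static void copy_bytes(unsigned char *dst, const unsigned char *src, int bc)
{
    int i;
    for (i = 0; i < bc; i++)
        dst[i] = src[i];
}
```

Whether the resulting memcpy call wins or loses then depends entirely on the library's handling of very small sizes.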
* [Bug c/53726] [4.8 Regression] aes test performance drop for eembc_2_0_peak_32
From: vbyakovl23 at gmail dot com @ 2012-06-20 6:13 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53726
--- Comment #1 from Vladimir Yakovlev <vbyakovl23 at gmail dot com> 2012-06-20 06:13:26 UTC ---
Created attachment 27658
--> http://gcc.gnu.org/bugzilla/attachment.cgi?id=27658
Test case and assemblers
* [Bug tree-optimization/53726] [4.8 Regression] aes test performance drop for eembc_2_0_peak_32
From: rguenth at gcc dot gnu.org @ 2012-06-20 9:28 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53726
Richard Guenther <rguenth at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|UNCONFIRMED |WAITING
Last reconfirmed| |2012-06-20
Component|c |tree-optimization
CC| |rguenth at gcc dot gnu.org
Ever Confirmed|0 |1
Target Milestone|--- |4.8.0
--- Comment #2 from Richard Guenther <rguenth at gcc dot gnu.org> 2012-06-20 09:27:52 UTC ---
You mean the fix led to recognition of memcpy? At least I see memcpy
calls in the bad assembly.
There is always a cost consideration for memcpy - does performance recover
with -minline-all-stringops? I suppose BC is actually very small?
The testcase does not include a runtime part so I can't check myself.
Definitely a byte-wise copy loop as in the .good assembly variant,
.L5:
	.loc 1 14 0 is_stmt 1 discriminator 2
	movzbl	16(%esp,%eax), %edx
	movb	%dl, (%esi,%eax)
	leal	1(%eax), %eax
.LVL5:
	cmpl	%ebx, %eax
	jl	.L5
does not look good - even a rep movsb should be faster, no?
* [Bug tree-optimization/53726] [4.8 Regression] aes test performance drop for eembc_2_0_peak_32
From: vbyakovl23 at gmail dot com @ 2012-06-20 10:48 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53726
--- Comment #3 from Vladimir Yakovlev <vbyakovl23 at gmail dot com> 2012-06-20 10:48:11 UTC ---
I added an executable test case. Command line to compile:
gcc -g -ansi -O3 -ffast-math -msse2 -mfpmath=sse -m32 -static -march=corei7
-mtune=corei7 test.c m.c
Run results
Wed Jun 20 14:39:05: /gnumnt/msticlxl25_users/vbyakovl/1020/test$ time
./test.corei7.bad.exe
real 0m6.317s
user 0m6.290s
sys 0m0.002s
Wed Jun 20 14:39:24: /gnumnt/msticlxl25_users/vbyakovl/1020/test$ time
./test.corei7.good.exe
real 0m4.815s
user 0m4.713s
sys 0m0.000s
* [Bug tree-optimization/53726] [4.8 Regression] aes test performance drop for eembc_2_0_peak_32
From: vbyakovl23 at gmail dot com @ 2012-06-20 10:50 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53726
--- Comment #4 from Vladimir Yakovlev <vbyakovl23 at gmail dot com> 2012-06-20 10:50:28 UTC ---
Created attachment 27664
--> http://gcc.gnu.org/bugzilla/attachment.cgi?id=27664
Executable test case
* [Bug tree-optimization/53726] [4.8 Regression] aes test performance drop for eembc_2_0_peak_32
From: rguenth at gcc dot gnu.org @ 2012-06-20 11:48 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53726
Richard Guenther <rguenth at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|WAITING |NEW
CC| |hubicka at gcc dot gnu.org
--- Comment #5 from Richard Guenther <rguenth at gcc dot gnu.org> 2012-06-20 11:48:13 UTC ---
OK. A rep movsb is as slow as a memcpy call (-mstringop-strategy=rep_byte
-minline-all-stringops). -minline-all-stringops itself is nearly as fast
as -fno-tree-loop-distribute-patterns.
To answer my own question, BC is between zero and 7.
But I really wonder why the rep movsb is slower than the explicit byte-copy
loop ...
We do seem to seriously hose the CFG though - with PGO we get a nice
loop nest CFG and the speed of before the patch - even when it uses
a memcpy call.
* [Bug tree-optimization/53726] [4.8 Regression] aes test performance drop for eembc_2_0_peak_32
From: rguenth at gcc dot gnu.org @ 2012-06-20 12:31 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53726
--- Comment #6 from Richard Guenther <rguenth at gcc dot gnu.org> 2012-06-20 12:31:22 UTC ---
Btw, I cannot reproduce the slowdown on 64-bit, and the 32-bit memcpy in glibc
simply does a rep movsb for any size lower than 20 bytes ... but as I have
been told, rep movsb setup cost is prohibitively high on most Intel CPUs ...
Thus, I suppose you should look at improving the memcpy implementation for
small sizes on 32bits.
* [Bug tree-optimization/53726] [4.8 Regression] aes test performance drop for eembc_2_0_peak_32
From: hjl.tools at gmail dot com @ 2012-06-20 12:47 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53726
--- Comment #7 from H.J. Lu <hjl.tools at gmail dot com> 2012-06-20 12:46:57 UTC ---
(In reply to comment #6)
> Btw, I cannot reproduce the slowdown on 64bit and the 32bit memcpy in glibc
Which glibc are you using?
> simply does a rep movsb; for any size lower than 20 bytes ... but as I have
> been told rep movsb; setup cost is prohibitively high on most Intel CPUs ...
>
I didn't see rep movsb in 32-bit memcpy in glibc 2.14.
* [Bug tree-optimization/53726] [4.8 Regression] aes test performance drop for eembc_2_0_peak_32
From: rguenther at suse dot de @ 2012-06-20 13:30 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53726
--- Comment #8 from rguenther at suse dot de <rguenther at suse dot de> 2012-06-20 13:30:21 UTC ---
On Wed, 20 Jun 2012, hjl.tools at gmail dot com wrote:
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53726
>
> --- Comment #7 from H.J. Lu <hjl.tools at gmail dot com> 2012-06-20 12:46:57 UTC ---
> (In reply to comment #6)
> > Btw, I cannot reproduce the slowdown on 64bit and the 32bit memcpy in glibc
>
> Which glibc are you using?
>
> > simply does a rep movsb; for any size lower than 20 bytes ... but as I have
> > been told rep movsb; setup cost is prohibitively high on most Intel CPUs ...
> >
>
> I didn't see rep movsb in 32-bit memcpy in glibc 2.14.
The one from openSUSE 12.1. Btw, this is with static linking, so I
suppose IFUNC and friends do not apply(?)
* [Bug tree-optimization/53726] [4.8 Regression] aes test performance drop for eembc_2_0_peak_32
From: hjl.tools at gmail dot com @ 2012-06-20 14:20 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53726
H.J. Lu <hjl.tools at gmail dot com> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |hjl.tools at gmail dot com
--- Comment #9 from H.J. Lu <hjl.tools at gmail dot com> 2012-06-20 14:19:29 UTC ---
(In reply to comment #3)
> I added executable testcase. Command line to compile
>
> gcc -g -ansi -O3 -ffast-math -msse2 -mfpmath=sse -m32 -static -march=corei7
> -mtune=corei7 test.c m.c
>
Please compare results without -static.
* [Bug tree-optimization/53726] [4.8 Regression] aes test performance drop for eembc_2_0_peak_32
From: vbyakovl23 at gmail dot com @ 2012-06-20 14:35 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53726
--- Comment #10 from Vladimir Yakovlev <vbyakovl23 at gmail dot com> 2012-06-20 14:34:32 UTC ---
I've tried without -static. The runtimes are still the same.
* [Bug tree-optimization/53726] [4.8 Regression] aes test performance drop for eembc_2_0_peak_32
From: rguenth at gcc dot gnu.org @ 2012-06-20 14:36 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53726
--- Comment #11 from Richard Guenther <rguenth at gcc dot gnu.org> 2012-06-20 14:36:00 UTC ---
(In reply to comment #9)
> (In reply to comment #3)
> > I added executable testcase. Command line to compile
> >
> > gcc -g -ansi -O3 -ffast-math -msse2 -mfpmath=sse -m32 -static -march=corei7
> > -mtune=corei7 test.c m.c
> >
>
> Please compare results without -static.
Same results. memcpy ends up here:
Dump of assembler code for function memcpy:
=> 0xf7ed0cc0 <+0>: push %edi
0xf7ed0cc1 <+1>: push %esi
0xf7ed0cc2 <+2>: mov 0xc(%esp),%edi
0xf7ed0cc6 <+6>: mov 0x10(%esp),%esi
0xf7ed0cca <+10>: mov 0x14(%esp),%ecx
0xf7ed0cce <+14>: mov %edi,%eax
0xf7ed0cd0 <+16>: cld
0xf7ed0cd1 <+17>: cmp $0x20,%ecx
0xf7ed0cd4 <+20>: jbe 0xf7ed0d2c <memcpy+108>
...
0xf7ed0d2c <+108>: rep movsb %ds:(%esi),%es:(%edi)
0xf7ed0d2e <+110>: pop %esi
0xf7ed0d2f <+111>: pop %edi
0xf7ed0d30 <+112>: ret
this seems to be sysdeps/i386/i586/memcpy.S
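The rep movsb path the small-size case falls into can be reproduced with GNU C extended inline assembly. This is an illustrative sketch only (the helper name is invented, and a portable fallback is included for non-x86 hosts), not glibc's actual implementation:

```c
#include <stddef.h>
#include <string.h>

/* Sketch of the `rep movsb` small-size path: copy n bytes from src
   to dst, with ECX/RCX as the count and ESI/EDI (RSI/RDI) as the
   pointers, exactly what the disassembly above executes at +108. */
static void rep_movsb_copy(void *dst, const void *src, size_t n)
{
#if defined(__i386__) || defined(__x86_64__)
    __asm__ volatile ("rep movsb"
                      : "+D" (dst), "+S" (src), "+c" (n)
                      :
                      : "memory");
#else
    memcpy(dst, src, n);   /* portable fallback for non-x86 hosts */
#endif
}
```

The string-move setup cost this thread discusses is paid on every call, which is why such a path can lose to an open-coded byte loop for tiny copies.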
* [Bug tree-optimization/53726] [4.8 Regression] aes test performance drop for eembc_2_0_peak_32
From: hjl.tools at gmail dot com @ 2012-06-20 14:44 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53726
--- Comment #12 from H.J. Lu <hjl.tools at gmail dot com> 2012-06-20 14:43:11 UTC ---
(In reply to comment #11)
> (In reply to comment #9)
> > (In reply to comment #3)
> > > I added executable testcase. Command line to compile
> > >
> > > gcc -g -ansi -O3 -ffast-math -msse2 -mfpmath=sse -m32 -static -march=corei7
> > > -mtune=corei7 test.c m.c
> > >
> >
> > Please compare results without -static.
>
> Same results. memcpy ends up here:
>
>
> this seems to be sysdeps/i386/i586/memcpy.S
Your libc.so doesn't support IFUNC optimization.
* [Bug tree-optimization/53726] [4.8 Regression] aes test performance drop for eembc_2_0_peak_32
From: hjl.tools at gmail dot com @ 2012-06-20 14:47 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53726
--- Comment #13 from H.J. Lu <hjl.tools at gmail dot com> 2012-06-20 14:46:55 UTC ---
(In reply to comment #10)
> I've tried without static. Runtimes is still the same.
It doesn't match what I saw. On Atom D510:
/export/gnu/import/git/gcc-regression/master/188261/usr/bin/gcc -ansi -O3
-ffast-math -msse2 -mfpmath=sse -m32 -march=atom m.c test.c -o new
time ./new
./new 58.46s user 0.00s system 99% cpu 58.479 total
/export/gnu/import/git/gcc-regression/master/188259/usr/bin/gcc -ansi -O3
-ffast-math -msse2 -mfpmath=sse -m32 -march=atom m.c test.c -o old
time ./old
./old 58.38s user 0.00s system 99% cpu 58.490 total
* [Bug tree-optimization/53726] [4.8 Regression] aes test performance drop for eembc_2_0_peak_32
From: hjl.tools at gmail dot com @ 2012-06-20 14:52 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53726
H.J. Lu <hjl.tools at gmail dot com> changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|NEW |WAITING
--- Comment #14 from H.J. Lu <hjl.tools at gmail dot com> 2012-06-20 14:51:31 UTC ---
With -static, there is a regression:
./new.static 61.99s user 0.00s system 99% cpu 1:02.14 total
./old.static 58.25s user 0.00s system 99% cpu 58.261 total
The problem is the slow memcpy, not GCC. Please find out why
your results differ.
* [Bug tree-optimization/53726] [4.8 Regression] aes test performance drop for eembc_2_0_peak_32
From: rguenther at suse dot de @ 2012-06-20 14:55 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53726
--- Comment #15 from rguenther at suse dot de <rguenther at suse dot de> 2012-06-20 14:53:58 UTC ---
On Wed, 20 Jun 2012, hjl.tools at gmail dot com wrote:
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53726
>
> --- Comment #12 from H.J. Lu <hjl.tools at gmail dot com> 2012-06-20 14:43:11 UTC ---
> (In reply to comment #11)
> > (In reply to comment #9)
> > > (In reply to comment #3)
> > > > I added executable testcase. Command line to compile
> > > >
> > > > gcc -g -ansi -O3 -ffast-math -msse2 -mfpmath=sse -m32 -static -march=corei7
> > > > -mtune=corei7 test.c m.c
> > > >
> > >
> > > Please compare results without -static.
> >
> > Same results. memcpy ends up here:
> >
> >
> > this seems to be sysdeps/i386/i586/memcpy.S
>
> Your libc.so doesn't support IFUNC optimization.
Quite possible (for whatever reason).
* [Bug tree-optimization/53726] [4.8 Regression] aes test performance drop for eembc_2_0_peak_32
From: rguenth at gcc dot gnu.org @ 2012-06-20 15:10 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53726
--- Comment #16 from Richard Guenther <rguenth at gcc dot gnu.org> 2012-06-20 15:09:46 UTC ---
What we could do for the case in question is look at the maximum possible
value of c, derived from number-of-iteration analysis which should tell
us 8 because of the size of the tem array.
But I am not sure if a good library implementation shouldn't be always
preferable to a byte-wise copy. We could, at least try to envision a way
to retain and use the knowledge that the size is at most 8 when expanding
the memcpy (with AVX we could use a masked store for example - quite fancy).
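The masked-store idea can be simulated in portable C for sizes up to 8: load 8 bytes, build a byte mask from the size, and merge into the destination. This is a sketch under stated assumptions (little-endian, buffers of at least 8 bytes, invented helper name); the real win would come from a single fault-tolerant unaligned load plus a hardware masked store:

```c
#include <stdint.h>
#include <string.h>

/* Simulation of a masked copy for n <= 8.  Both buffers must be at
   least 8 bytes long, since this over-reads and rewrites 8 bytes.
   Assumes little-endian byte order for the mask construction. */
static void masked_copy8(uint8_t *dst, const uint8_t *src, unsigned n)
{
    uint64_t s, d, mask;
    memcpy(&s, src, 8);                  /* unaligned 8-byte load */
    memcpy(&d, dst, 8);
    /* Little-endian byte mask: the low n bytes are all-ones. */
    mask = (n >= 8) ? ~(uint64_t)0 : (((uint64_t)1 << (8 * n)) - 1);
    d = (d & ~mask) | (s & mask);        /* first n bytes from src */
    memcpy(dst, &d, 8);
}
```

A compiler expanding a memcpy with a known upper bound of 8 could emit the load/merge/store above as a handful of straight-line instructions, with no loop and no call overhead.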
* [Bug tree-optimization/53726] [4.8 Regression] aes test performance drop for eembc_2_0_peak_32
From: hjl.tools at gmail dot com @ 2012-06-20 15:37 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53726
--- Comment #17 from H.J. Lu <hjl.tools at gmail dot com> 2012-06-20 15:36:09 UTC ---
(In reply to comment #16)
> But I am not sure if a good library implementation shouldn't be always
> preferable to a byte-wise copy. We could, at least try to envision a way
> to retain and use the knowledge that the size is at most 8 when expanding
> the memcpy (with AVX we could use a masked store for example - quite fancy).
string/memory functions in libc can be much faster than the ones generated
by GCC unless the size is very small, PR 43052.
* [Bug tree-optimization/53726] [4.8 Regression] aes test performance drop for eembc_2_0_peak_32
From: rguenther at suse dot de @ 2012-06-21 8:47 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53726
--- Comment #18 from rguenther at suse dot de <rguenther at suse dot de> 2012-06-21 08:46:11 UTC ---
On Wed, 20 Jun 2012, hjl.tools at gmail dot com wrote:
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53726
>
> --- Comment #17 from H.J. Lu <hjl.tools at gmail dot com> 2012-06-20 15:36:09 UTC ---
> (In reply to comment #16)
> > But I am not sure if a good library implementation shouldn't be always
> > preferable to a byte-wise copy. We could, at least try to envision a way
> > to retain and use the knowledge that the size is at most 8 when expanding
> > the memcpy (with AVX we could use a masked store for example - quite fancy).
>
> string/memory functions in libc can be much faster than the ones generated
> by GCC unless the size is very small, PR 43052.
Yes. The question is what is "very small" and how can we possibly
detect "very small". For this testcase we can derive an upper bound
of the size, which is 8, but the size is not constant. I think unless
we know we can expand the variable-size memcpy with, say, three
CPU instructions inline there is no reason to not call memcpy.
Thus if the CPU could do
tem = unaligned-load-8-bytes-from-src-and-ignore-faults;
mask = generate mask from size
store-unaligned-8-bytes-with-mask
then expanding the memcpy call inline would be a win I suppose.
AVX has VMASKMOV, but I'm not sure using that for sizes <= 16
bytes is profitable? Note that from the specs
of VMASKMOV it seems the memory operands need to be aligned and
the mask does not support byte-granularity.
Which would leave us to inline expanding the case of at most 2 byte
memcpy. Of course currently there is no way to record an upper
bound for the size (we do not retain value-range information - but
we of course should).
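The remaining "at most 2 byte memcpy" case could be open-coded as straight-line branches. A minimal sketch (hypothetical; not what GCC currently emits):

```c
/* Inline expansion a compiler could emit when it can prove the
   variable size n is at most 2: two predicated byte stores, no loop
   and no call. */
static void copy_le2(unsigned char *dst, const unsigned char *src, unsigned n)
{
    if (n >= 1)
        dst[0] = src[0];
    if (n >= 2)
        dst[1] = src[1];
}
```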
* [Bug tree-optimization/53726] [4.8 Regression] aes test performance drop for eembc_2_0_peak_32
From: vbyakovl23 at gmail dot com @ 2012-06-21 12:47 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53726
--- Comment #19 from Vladimir Yakovlev <vbyakovl23 at gmail dot com> 2012-06-21 12:46:11 UTC ---
(In reply to comment #13)
> (In reply to comment #10)
> > I've tried without static. Runtimes is still the same.
>
> It doesn't match what I saw. On Atom D510:
>
> /export/gnu/import/git/gcc-regression/master/188261/usr/bin/gcc -ansi -O3
> -ffast-math -msse2 -mfpmath=sse -m32 -march=atom m.c test.c -o new
> time ./new
> ./new 58.46s user 0.00s system 99% cpu 58.479 total
> /export/gnu/import/git/gcc-regression/master/188259/usr/bin/gcc -ansi -O3
> -ffast-math -msse2 -mfpmath=sse -m32 -march=atom m.c test.c -o old
> time ./old
> ./old 58.38s user 0.00s system 99% cpu 58.490 total
I rechecked: there is no regression without -static on Sandy Bridge or Atom.
* [Bug tree-optimization/53726] [4.8 Regression] aes test performance drop for eembc_2_0_peak_32
From: hjl.tools at gmail dot com @ 2012-06-21 12:51 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53726
--- Comment #20 from H.J. Lu <hjl.tools at gmail dot com> 2012-06-21 12:50:21 UTC ---
(In reply to comment #19)
>
> I rechecked: there is no regression without -static on Sandy Bridge or Atom.
Great. Please verify that -static isn't used on eembc and SPEC
CPU 2000/2006 runs.
* [Bug tree-optimization/53726] [4.8 Regression] aes test performance drop for eembc_2_0_peak_32
From: hjl.tools at gmail dot com @ 2012-06-21 12:56 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53726
H.J. Lu <hjl.tools at gmail dot com> changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|WAITING |RESOLVED
Resolution| |WORKSFORME
--- Comment #21 from H.J. Lu <hjl.tools at gmail dot com> 2012-06-21 12:55:33 UTC ---
(In reply to comment #18)
> > string/memory functions in libc can be much faster than the ones generated
> > by GCC unless the size is very small, PR 43052.
>
> Yes. The question is what is "very small" and how can we possibly
> detect "very small". For this testcase we can derive an upper bound
> of the size, which is 8, but the size is not constant. I think unless
> we know we can expand the variable-size memcpy with, say, three
> CPU instructions inline there is no reason to not call memcpy.
It is OK to call memcpy if the size isn't constant.
>
> Which would leave us to inline expanding the case of at most 2 byte
> memcpy. Of course currently there is no way to record an upper
> bound for the size (we do not retain value-range information - but
> we of course should).
It is nice to have. We can open another bug for this.
* [Bug tree-optimization/53726] [4.8 Regression] aes test performance drop for eembc_2_0_peak_32
From: hubicka at ucw dot cz @ 2012-06-22 22:46 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53726
--- Comment #22 from Jan Hubicka <hubicka at ucw dot cz> 2012-06-22 22:45:35 UTC ---
> Yes. The question is what is "very small" and how can we possibly
What counts as "very small" is defined in the cost tables in i386.c.
I simply ran a small benchmark testing library & GCC implementations to
fill them in. With new glibcs these tables may need updating. I updated
some of them to match the glibc in SUSE 11.x.
PR 43052 is about memcmp. Memcpy/memset should behave more or less sanely.
(that also reminds me that I should look again at the SSE memcpy/memset
implementation for 4.8)
> detect "very small". For this testcase we can derive an upper bound
> of the size, which is 8, but the size is not constant. I think unless
> we know we can expand the variable-size memcpy with, say, three
> CPU instructions inline there is no reason to not call memcpy.
>
> Thus if the CPU could do
>
> tem = unaligned-load-8-bytes-from-src-and-ignore-faults;
> mask = generate mask from size
> store-unaligned-8-bytes-with-mask
>
> then expanding the memcpy call inline would be a win I suppose.
> AVX has VMASKMOV, but I'm not sure using that for sizes <= 16
> bytes is profitable? Note that from the specs
> of VMASKMOV it seems the memory operands need to be aligned and
> the mask does not support byte-granularity.
>
> Which would leave us to inline expanding the case of at most 2 byte
> memcpy. Of course currently there is no way to record an upper
> bound for the size (we do not retain value-range information - but
> we of course should).
My secret plan was to make VRP produce a value-profiling histogram
when a value is known to be within a small range. It should be quite easy
to implement.
Honza
* Re: [Bug tree-optimization/53726] [4.8 Regression] aes test performance drop for eembc_2_0_peak_32
From: Jan Hubicka @ 2012-06-22 23:04 UTC (permalink / raw)
To: hubicka at ucw dot cz; +Cc: gcc-bugs
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53726
>
> --- Comment #22 from Jan Hubicka <hubicka at ucw dot cz> 2012-06-22 22:45:35 UTC ---
> > Yes. The question is what is "very small" and how can we possibly
>
> As what is very small is defined in the i386.c in the cost tables.
> I simply run a small benchmark testing library&GCC implementations to
> fill it in. With new glibcs these tables may need upating. I updated them
> on some to make glibc in SUSE 11.x.
>
> PR 43052 is about memcmp. Memcpy/memset should behave more or less sanely.
> (that also reminds me that I should look again at the SSE memcpy/memset
> implementation for 4.8)
That also reminds me that this tuning was mostly reverted with the SSE work.
I will look into those patches and push out the safe bits for 4.8.
Honza