public inbox for gcc-bugs@sourceware.org
* [Bug c/53726] New: [4.8 Regression] aes test performance drop for eembc_2_0_peak_32
@ 2012-06-20 6:10 vbyakovl23 at gmail dot com
From: vbyakovl23 at gmail dot com @ 2012-06-20 6:10 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53726
Bug #: 53726
Summary: [4.8 Regression] aes test performance drop for
eembc_2_0_peak_32
Classification: Unclassified
Product: gcc
Version: 4.8.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c
AssignedTo: unassigned@gcc.gnu.org
ReportedBy: vbyakovl23@gmail.com
After the fix
r188261 | rguenth | 2012-06-06 13:45:27 +0400 (Wed, 06 Jun 2012) | 23 lines
2012-06-06 Richard Guenther <rguenther@suse.de>
PR tree-optimization/53081
* tree-data-ref.h (adjacent_store_dr_p): Rename to ...
(adjacent_dr_p): ... this and make it work for reads, too.
* tree-loop-distribution.c (enum partition_kind): Add PKIND_MEMCPY.
(struct partition_s): Change main_stmt to main_dr, add
secondary_dr member.
(build_size_arg_loc): Change to take data-reference and not
gimplify here.
(build_addr_arg_loc): New function split out from ...
(generate_memset_builtin): ... here. Use it and simplify.
(generate_memcpy_builtin): New function.
(generate_code_for_partition): Adjust.
(classify_partition): Streamline pattern detection. Detect
memcpy.
(ldist_gen): Adjust.
(tree_loop_distribution): Adjust seed statements for memcpy
recognition.
* gcc.dg/tree-ssa/ldist-20.c: New testcase.
* gcc.dg/tree-ssa/loop-19.c: Add -fno-tree-loop-distribute-patterns.
The regression is 11% on Atom and 30% on Sandy Bridge. The fix led to
memcpy no longer being recognized. A reduced test case and the assembler
outputs are attached. Command line to reproduce:
gcc -ansi -O3 -ffast-math -msse2 -mfpmath=sse -m32 -static -march=corei7
-mtune=corei7 test.c
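A hypothetical sketch (not the attached test case; names are invented) of the kind of loop at issue: a byte-wise copy with a small, variable trip count, which the r188261 loop-distribution change can classify as a memcpy pattern and replace with a library call at -O3:

```c
#include <assert.h>
#include <string.h>

/* Hypothetical reduction: a byte-wise copy whose trip count bc is
   variable but small (at most 8 in the AES kernel).  With
   -ftree-loop-distribute-patterns this loop is recognized and
   replaced by memcpy(dst, src, bc). */
static void copy_bytes(unsigned char *dst, const unsigned char *src, int bc)
{
    int i;
    for (i = 0; i < bc; i++)
        dst[i] = src[i];
}
```

Whether the resulting memcpy call wins or loses then depends entirely on the library's handling of very small sizes.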
* [Bug c/53726] [4.8 Regression] aes test performance drop for eembc_2_0_peak_32
From: vbyakovl23 at gmail dot com @ 2012-06-20 6:13 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53726
--- Comment #1 from Vladimir Yakovlev <vbyakovl23 at gmail dot com> 2012-06-20 06:13:26 UTC ---
Created attachment 27658
--> http://gcc.gnu.org/bugzilla/attachment.cgi?id=27658
Test case and assemblers
* [Bug tree-optimization/53726] [4.8 Regression] aes test performance drop for eembc_2_0_peak_32
From: rguenth at gcc dot gnu.org @ 2012-06-20 9:28 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53726
Richard Guenther <rguenth at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|UNCONFIRMED |WAITING
Last reconfirmed| |2012-06-20
Component|c |tree-optimization
CC| |rguenth at gcc dot gnu.org
Ever Confirmed|0 |1
Target Milestone|--- |4.8.0
--- Comment #2 from Richard Guenther <rguenth at gcc dot gnu.org> 2012-06-20 09:27:52 UTC ---
You mean the fix led to recognition of memcpy? At least I see memcpy
calls in the bad assembly.
There is always a cost consideration for memcpy - does performance recover
with -minline-all-stringops? I suppose BC is actually very small?
The testcase does not include a runtime part so I can't check myself.
Definitely a byte-wise copy loop as in the .good assembly variant,
.L5:
	.loc 1 14 0 is_stmt 1 discriminator 2
	movzbl	16(%esp,%eax), %edx
	movb	%dl, (%esi,%eax)
	leal	1(%eax), %eax
.LVL5:
	cmpl	%ebx, %eax
	jl	.L5
does not look good - even a rep movsb should be faster, no?
* [Bug tree-optimization/53726] [4.8 Regression] aes test performance drop for eembc_2_0_peak_32
From: vbyakovl23 at gmail dot com @ 2012-06-20 10:48 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53726
--- Comment #3 from Vladimir Yakovlev <vbyakovl23 at gmail dot com> 2012-06-20 10:48:11 UTC ---
I added an executable test case. Command line to compile:
gcc -g -ansi -O3 -ffast-math -msse2 -mfpmath=sse -m32 -static -march=corei7
-mtune=corei7 test.c m.c
Run results
Wed Jun 20 14:39:05: /gnumnt/msticlxl25_users/vbyakovl/1020/test$ time
./test.corei7.bad.exe
real 0m6.317s
user 0m6.290s
sys 0m0.002s
Wed Jun 20 14:39:24: /gnumnt/msticlxl25_users/vbyakovl/1020/test$ time
./test.corei7.good.exe
real 0m4.815s
user 0m4.713s
sys 0m0.000s
* [Bug tree-optimization/53726] [4.8 Regression] aes test performance drop for eembc_2_0_peak_32
From: vbyakovl23 at gmail dot com @ 2012-06-20 10:50 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53726
--- Comment #4 from Vladimir Yakovlev <vbyakovl23 at gmail dot com> 2012-06-20 10:50:28 UTC ---
Created attachment 27664
--> http://gcc.gnu.org/bugzilla/attachment.cgi?id=27664
Executable test case
* [Bug tree-optimization/53726] [4.8 Regression] aes test performance drop for eembc_2_0_peak_32
From: rguenth at gcc dot gnu.org @ 2012-06-20 11:48 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53726
Richard Guenther <rguenth at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|WAITING |NEW
CC| |hubicka at gcc dot gnu.org
--- Comment #5 from Richard Guenther <rguenth at gcc dot gnu.org> 2012-06-20 11:48:13 UTC ---
OK. A rep movsb is as slow as a memcpy call (-mstringop-strategy=rep_byte
-minline-all-stringops). -minline-all-stringops itself is nearly as fast
as -fno-tree-loop-distribute-patterns.
To answer my own question, BC is between zero and 7.
But I really wonder why the rep movsb is slower than the explicit byte-copy
loop ...
We do seem to seriously hose the CFG though - with PGO we get a nice
loop nest CFG and the speed of before the patch - even when it uses
a memcpy call.
* [Bug tree-optimization/53726] [4.8 Regression] aes test performance drop for eembc_2_0_peak_32
From: rguenth at gcc dot gnu.org @ 2012-06-20 12:31 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53726
--- Comment #6 from Richard Guenther <rguenth at gcc dot gnu.org> 2012-06-20 12:31:22 UTC ---
Btw, I cannot reproduce the slowdown on 64-bit, and the 32-bit memcpy in glibc
simply does a rep movsb for any size lower than 20 bytes ... but as I have
been told, rep movsb setup cost is prohibitively high on most Intel CPUs ...
Thus, I suppose you should look at improving the memcpy implementation for
small sizes on 32bits.
* [Bug tree-optimization/53726] [4.8 Regression] aes test performance drop for eembc_2_0_peak_32
From: hjl.tools at gmail dot com @ 2012-06-20 12:47 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53726
--- Comment #7 from H.J. Lu <hjl.tools at gmail dot com> 2012-06-20 12:46:57 UTC ---
(In reply to comment #6)
> Btw, I cannot reproduce the slowdown on 64bit and the 32bit memcpy in glibc
Which glibc are you using?
> simply does a rep movsb; for any size lower than 20 bytes ... but as I have
> been told rep movsb; setup cost is prohibitively high on most Intel CPUs ...
>
I didn't see rep movsb in 32-bit memcpy in glibc 2.14.
* [Bug tree-optimization/53726] [4.8 Regression] aes test performance drop for eembc_2_0_peak_32
From: rguenther at suse dot de @ 2012-06-20 13:30 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53726
--- Comment #8 from rguenther at suse dot de <rguenther at suse dot de> 2012-06-20 13:30:21 UTC ---
On Wed, 20 Jun 2012, hjl.tools at gmail dot com wrote:
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53726
>
> --- Comment #7 from H.J. Lu <hjl.tools at gmail dot com> 2012-06-20 12:46:57 UTC ---
> (In reply to comment #6)
> > Btw, I cannot reproduce the slowdown on 64bit and the 32bit memcpy in glibc
>
> Which glibc are you using?
>
> > simply does a rep movsb; for any size lower than 20 bytes ... but as I have
> > been told rep movsb; setup cost is prohibitively high on most Intel CPUs ...
> >
>
> I didn't see rep movsb in 32-bit memcpy in glibc 2.14.
The one from openSUSE 12.1. Btw, this is with static linking, so I
suppose IFUNC and friends do not apply(?)
* [Bug tree-optimization/53726] [4.8 Regression] aes test performance drop for eembc_2_0_peak_32
From: hjl.tools at gmail dot com @ 2012-06-20 14:20 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53726
H.J. Lu <hjl.tools at gmail dot com> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |hjl.tools at gmail dot com
--- Comment #9 from H.J. Lu <hjl.tools at gmail dot com> 2012-06-20 14:19:29 UTC ---
(In reply to comment #3)
> I added executable testcase. Command line to compile
>
> gcc -g -ansi -O3 -ffast-math -msse2 -mfpmath=sse -m32 -static -march=corei7
> -mtune=corei7 test.c m.c
>
Please compare results without -static.
* [Bug tree-optimization/53726] [4.8 Regression] aes test performance drop for eembc_2_0_peak_32
From: vbyakovl23 at gmail dot com @ 2012-06-20 14:35 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53726
--- Comment #10 from Vladimir Yakovlev <vbyakovl23 at gmail dot com> 2012-06-20 14:34:32 UTC ---
I've tried without -static. The runtimes are still the same.
* [Bug tree-optimization/53726] [4.8 Regression] aes test performance drop for eembc_2_0_peak_32
From: rguenth at gcc dot gnu.org @ 2012-06-20 14:36 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53726
--- Comment #11 from Richard Guenther <rguenth at gcc dot gnu.org> 2012-06-20 14:36:00 UTC ---
(In reply to comment #9)
> (In reply to comment #3)
> > I added executable testcase. Command line to compile
> >
> > gcc -g -ansi -O3 -ffast-math -msse2 -mfpmath=sse -m32 -static -march=corei7
> > -mtune=corei7 test.c m.c
> >
>
> Please compare results without -static.
Same results. memcpy ends up here:
Dump of assembler code for function memcpy:
=> 0xf7ed0cc0 <+0>: push %edi
0xf7ed0cc1 <+1>: push %esi
0xf7ed0cc2 <+2>: mov 0xc(%esp),%edi
0xf7ed0cc6 <+6>: mov 0x10(%esp),%esi
0xf7ed0cca <+10>: mov 0x14(%esp),%ecx
0xf7ed0cce <+14>: mov %edi,%eax
0xf7ed0cd0 <+16>: cld
0xf7ed0cd1 <+17>: cmp $0x20,%ecx
0xf7ed0cd4 <+20>: jbe 0xf7ed0d2c <memcpy+108>
...
0xf7ed0d2c <+108>: rep movsb %ds:(%esi),%es:(%edi)
0xf7ed0d2e <+110>: pop %esi
0xf7ed0d2f <+111>: pop %edi
0xf7ed0d30 <+112>: ret
this seems to be sysdeps/i386/i586/memcpy.S
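The rep movsb path the small-size case falls into can be reproduced with GNU C extended inline assembly. This is an illustrative sketch only (the helper name is invented, and a portable fallback is included for non-x86 hosts), not glibc's actual implementation:

```c
#include <stddef.h>
#include <string.h>

/* Sketch of the `rep movsb` small-size path: copy n bytes from src
   to dst, with ECX/RCX as the count and ESI/EDI (RSI/RDI) as the
   pointers, exactly what the disassembly above executes at +108. */
static void rep_movsb_copy(void *dst, const void *src, size_t n)
{
#if defined(__i386__) || defined(__x86_64__)
    __asm__ volatile ("rep movsb"
                      : "+D" (dst), "+S" (src), "+c" (n)
                      :
                      : "memory");
#else
    memcpy(dst, src, n);   /* portable fallback for non-x86 hosts */
#endif
}
```

The string-move setup cost this thread discusses is paid on every call, which is why such a path can lose to an open-coded byte loop for tiny copies.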
* [Bug tree-optimization/53726] [4.8 Regression] aes test performance drop for eembc_2_0_peak_32
From: hjl.tools at gmail dot com @ 2012-06-20 14:44 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53726
--- Comment #12 from H.J. Lu <hjl.tools at gmail dot com> 2012-06-20 14:43:11 UTC ---
(In reply to comment #11)
> (In reply to comment #9)
> > (In reply to comment #3)
> > > I added executable testcase. Command line to compile
> > >
> > > gcc -g -ansi -O3 -ffast-math -msse2 -mfpmath=sse -m32 -static -march=corei7
> > > -mtune=corei7 test.c m.c
> > >
> >
> > Please compare results without -static.
>
> Same results. memcpy ends up here:
>
>
> this seems to be sysdeps/i386/i586/memcpy.S
Your libc.so doesn't support IFUNC optimization.
* [Bug tree-optimization/53726] [4.8 Regression] aes test performance drop for eembc_2_0_peak_32
From: hjl.tools at gmail dot com @ 2012-06-20 14:47 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53726
--- Comment #13 from H.J. Lu <hjl.tools at gmail dot com> 2012-06-20 14:46:55 UTC ---
(In reply to comment #10)
> I've tried without static. Runtimes is still the same.
It doesn't match what I saw. On Atom D510:
/export/gnu/import/git/gcc-regression/master/188261/usr/bin/gcc -ansi -O3
-ffast-math -msse2 -mfpmath=sse -m32 -march=atom m.c test.c -o new
time ./new
./new 58.46s user 0.00s system 99% cpu 58.479 total
/export/gnu/import/git/gcc-regression/master/188259/usr/bin/gcc -ansi -O3
-ffast-math -msse2 -mfpmath=sse -m32 -march=atom m.c test.c -o old
time ./old
./old 58.38s user 0.00s system 99% cpu 58.490 total
* [Bug tree-optimization/53726] [4.8 Regression] aes test performance drop for eembc_2_0_peak_32
From: hjl.tools at gmail dot com @ 2012-06-20 14:52 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53726
H.J. Lu <hjl.tools at gmail dot com> changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|NEW |WAITING
--- Comment #14 from H.J. Lu <hjl.tools at gmail dot com> 2012-06-20 14:51:31 UTC ---
With -static, there is a regression:
./new.static 61.99s user 0.00s system 99% cpu 1:02.14 total
./old.static 58.25s user 0.00s system 99% cpu 58.261 total
The problem is the slow memcpy, not GCC. Please find out why
your results differ.
* [Bug tree-optimization/53726] [4.8 Regression] aes test performance drop for eembc_2_0_peak_32
From: rguenther at suse dot de @ 2012-06-20 14:55 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53726
--- Comment #15 from rguenther at suse dot de <rguenther at suse dot de> 2012-06-20 14:53:58 UTC ---
On Wed, 20 Jun 2012, hjl.tools at gmail dot com wrote:
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53726
>
> --- Comment #12 from H.J. Lu <hjl.tools at gmail dot com> 2012-06-20 14:43:11 UTC ---
> (In reply to comment #11)
> > (In reply to comment #9)
> > > (In reply to comment #3)
> > > > I added executable testcase. Command line to compile
> > > >
> > > > gcc -g -ansi -O3 -ffast-math -msse2 -mfpmath=sse -m32 -static -march=corei7
> > > > -mtune=corei7 test.c m.c
> > > >
> > >
> > > Please compare results without -static.
> >
> > Same results. memcpy ends up here:
> >
> >
> > this seems to be sysdeps/i386/i586/memcpy.S
>
> Your libc.so doesn't support IFUNC optimization.
Quite possible (for whatever reason).
* [Bug tree-optimization/53726] [4.8 Regression] aes test performance drop for eembc_2_0_peak_32
From: rguenth at gcc dot gnu.org @ 2012-06-20 15:10 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53726
--- Comment #16 from Richard Guenther <rguenth at gcc dot gnu.org> 2012-06-20 15:09:46 UTC ---
What we could do for the case in question is look at the maximum possible
value of c, derived from number-of-iteration analysis which should tell
us 8 because of the size of the tem array.
But I am not sure if a good library implementation shouldn't be always
preferable to a byte-wise copy. We could, at least try to envision a way
to retain and use the knowledge that the size is at most 8 when expanding
the memcpy (with AVX we could use a masked store for example - quite fancy).
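The masked-store idea can be simulated in portable C for sizes up to 8: load 8 bytes, build a byte mask from the size, and merge into the destination. This is a sketch under stated assumptions (little-endian, buffers of at least 8 bytes, invented helper name); the real win would come from a single fault-tolerant unaligned load plus a hardware masked store:

```c
#include <stdint.h>
#include <string.h>

/* Simulation of a masked copy for n <= 8.  Both buffers must be at
   least 8 bytes long, since this over-reads and rewrites 8 bytes.
   Assumes little-endian byte order for the mask construction. */
static void masked_copy8(uint8_t *dst, const uint8_t *src, unsigned n)
{
    uint64_t s, d, mask;
    memcpy(&s, src, 8);                  /* unaligned 8-byte load */
    memcpy(&d, dst, 8);
    /* Little-endian byte mask: the low n bytes are all-ones. */
    mask = (n >= 8) ? ~(uint64_t)0 : (((uint64_t)1 << (8 * n)) - 1);
    d = (d & ~mask) | (s & mask);        /* first n bytes from src */
    memcpy(dst, &d, 8);
}
```

A compiler expanding a memcpy with a known upper bound of 8 could emit the load/merge/store above as a handful of straight-line instructions, with no loop and no call overhead.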
* [Bug tree-optimization/53726] [4.8 Regression] aes test performance drop for eembc_2_0_peak_32
From: hjl.tools at gmail dot com @ 2012-06-20 15:37 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53726
--- Comment #17 from H.J. Lu <hjl.tools at gmail dot com> 2012-06-20 15:36:09 UTC ---
(In reply to comment #16)
> But I am not sure if a good library implementation shouldn't be always
> preferable to a byte-wise copy. We could, at least try to envision a way
> to retain and use the knowledge that the size is at most 8 when expanding
> the memcpy (with AVX we could use a masked store for example - quite fancy).
string/memory functions in libc can be much faster than the ones generated
by GCC unless the size is very small, PR 43052.
* [Bug tree-optimization/53726] [4.8 Regression] aes test performance drop for eembc_2_0_peak_32
From: rguenther at suse dot de @ 2012-06-21 8:47 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53726
--- Comment #18 from rguenther at suse dot de <rguenther at suse dot de> 2012-06-21 08:46:11 UTC ---
On Wed, 20 Jun 2012, hjl.tools at gmail dot com wrote:
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53726
>
> --- Comment #17 from H.J. Lu <hjl.tools at gmail dot com> 2012-06-20 15:36:09 UTC ---
> (In reply to comment #16)
> > But I am not sure if a good library implementation shouldn't be always
> > preferable to a byte-wise copy. We could, at least try to envision a way
> > to retain and use the knowledge that the size is at most 8 when expanding
> > the memcpy (with AVX we could use a masked store for example - quite fancy).
>
> string/memory functions in libc can be much faster than the ones generated
> by GCC unless the size is very small, PR 43052.
Yes. The question is what is "very small" and how can we possibly
detect "very small". For this testcase we can derive an upper bound
of the size, which is 8, but the size is not constant. I think unless
we know we can expand the variable-size memcpy with, say, three
CPU instructions inline there is no reason to not call memcpy.
Thus if the CPU could do
tem = unaligned-load-8-bytes-from-src-and-ignore-faults;
mask = generate mask from size
store-unaligned-8-bytes-with-mask
then expanding the memcpy call inline would be a win I suppose.
AVX has VMASKMOV, but I'm not sure using that for sizes <= 16
bytes is profitable? Note that from the specs
of VMASKMOV it seems the memory operands need to be aligned and
the mask does not support byte-granularity.
Which would leave us to inline expanding the case of at most 2 byte
memcpy. Of course currently there is no way to record an upper
bound for the size (we do not retain value-range information - but
we of course should).
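The remaining "at most 2 byte memcpy" case could be open-coded as straight-line branches. A minimal sketch (hypothetical; not what GCC currently emits):

```c
/* Inline expansion a compiler could emit when it can prove the
   variable size n is at most 2: two predicated byte stores, no loop
   and no call. */
static void copy_le2(unsigned char *dst, const unsigned char *src, unsigned n)
{
    if (n >= 1)
        dst[0] = src[0];
    if (n >= 2)
        dst[1] = src[1];
}
```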
* [Bug tree-optimization/53726] [4.8 Regression] aes test performance drop for eembc_2_0_peak_32
From: vbyakovl23 at gmail dot com @ 2012-06-21 12:47 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53726
--- Comment #19 from Vladimir Yakovlev <vbyakovl23 at gmail dot com> 2012-06-21 12:46:11 UTC ---
(In reply to comment #13)
> (In reply to comment #10)
> > I've tried without static. Runtimes is still the same.
>
> It doesn't match what I saw. On Atom D510:
>
> /export/gnu/import/git/gcc-regression/master/188261/usr/bin/gcc -ansi -O3
> -ffast-math -msse2 -mfpmath=sse -m32 -march=atom m.c test.c -o new
> time ./new
> ./new 58.46s user 0.00s system 99% cpu 58.479 total
> /export/gnu/import/git/gcc-regression/master/188259/usr/bin/gcc -ansi -O3
> -ffast-math -msse2 -mfpmath=sse -m32 -march=atom m.c test.c -o old
> time ./old
> ./old 58.38s user 0.00s system 99% cpu 58.490 total
I rechecked: there is no regression without -static on Sandy Bridge or Atom.
* [Bug tree-optimization/53726] [4.8 Regression] aes test performance drop for eembc_2_0_peak_32
From: hjl.tools at gmail dot com @ 2012-06-21 12:51 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53726
--- Comment #20 from H.J. Lu <hjl.tools at gmail dot com> 2012-06-21 12:50:21 UTC ---
(In reply to comment #19)
>
> I rechecked: there is no regression without -static on Sandy Bridge or Atom.
Great. Please verify that -static isn't used on eembc and SPEC
CPU 2000/2006 runs.
* [Bug tree-optimization/53726] [4.8 Regression] aes test performance drop for eembc_2_0_peak_32
From: hjl.tools at gmail dot com @ 2012-06-21 12:56 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53726
H.J. Lu <hjl.tools at gmail dot com> changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|WAITING |RESOLVED
Resolution| |WORKSFORME
--- Comment #21 from H.J. Lu <hjl.tools at gmail dot com> 2012-06-21 12:55:33 UTC ---
(In reply to comment #18)
> > string/memory functions in libc can be much faster than the ones generated
> > by GCC unless the size is very small, PR 43052.
>
> Yes. The question is what is "very small" and how can we possibly
> detect "very small". For this testcase we can derive an upper bound
> of the size, which is 8, but the size is not constant. I think unless
> we know we can expand the variable-size memcpy with, say, three
> CPU instructions inline there is no reason to not call memcpy.
It is OK to call memcpy if the size isn't constant.
>
> Which would leave us to inline expanding the case of at most 2 byte
> memcpy. Of course currently there is no way to record an upper
> bound for the size (we do not retain value-range information - but
> we of course should).
It is nice to have. We can open another bug for this.
* [Bug tree-optimization/53726] [4.8 Regression] aes test performance drop for eembc_2_0_peak_32
From: hubicka at ucw dot cz @ 2012-06-22 22:46 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53726
--- Comment #22 from Jan Hubicka <hubicka at ucw dot cz> 2012-06-22 22:45:35 UTC ---
> Yes. The question is what is "very small" and how can we possibly
What counts as "very small" is defined in the cost tables in i386.c.
I simply ran a small benchmark testing library & GCC implementations to
fill them in. With new glibcs these tables may need updating. I updated
some of them to match the glibc in SUSE 11.x.
PR 43052 is about memcmp. Memcpy/memset should behave more or less sanely.
(that also reminds me that I should look again at the SSE memcpy/memset
implementation for 4.8)
> detect "very small". For this testcase we can derive an upper bound
> of the size, which is 8, but the size is not constant. I think unless
> we know we can expand the variable-size memcpy with, say, three
> CPU instructions inline there is no reason to not call memcpy.
>
> Thus if the CPU could do
>
> tem = unaligned-load-8-bytes-from-src-and-ignore-faults;
> mask = generate mask from size
> store-unaligned-8-bytes-with-mask
>
> then expanding the memcpy call inline would be a win I suppose.
> AVX has VMASKMOV, but I'm not sure using that for sizes <= 16
> bytes is profitable? Note that from the specs
> of VMASKMOV it seems the memory operands need to be aligned and
> the mask does not support byte-granularity.
>
> Which would leave us to inline expanding the case of at most 2 byte
> memcpy. Of course currently there is no way to record an upper
> bound for the size (we do not retain value-range information - but
> we of course should).
My secret plan was to make VRP produce a value-profiling histogram
when a value is known to be within a small range. It should be quite easy
to implement.
Honza
* Re: [Bug tree-optimization/53726] [4.8 Regression] aes test performance drop for eembc_2_0_peak_32
From: Jan Hubicka @ 2012-06-22 23:04 UTC (permalink / raw)
To: hubicka at ucw dot cz; +Cc: gcc-bugs
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53726
>
> --- Comment #22 from Jan Hubicka <hubicka at ucw dot cz> 2012-06-22 22:45:35 UTC ---
> > Yes. The question is what is "very small" and how can we possibly
>
> As what is very small is defined in the i386.c in the cost tables.
> I simply run a small benchmark testing library&GCC implementations to
> fill it in. With new glibcs these tables may need upating. I updated them
> on some to make glibc in SUSE 11.x.
>
> PR 43052 is about memcmp. Memcpy/memset should behave more or less sanely.
> (that also reminds me that I should look again at the SSE memcpy/memset
> implementation for 4.8)
That also reminds me that this tuning was mostly reverted with the SSE work.
I will look into those patches and push out the safe bits for 4.8.
Honza