[Bug c/47000] New: Major performance regression in parallel SSE2 impl of SHA256 hash algorithm

public inbox for gcc-bugs@sourceware.org

* [Bug c/47000] New: Major performance regression in parallel SSE2 impl of SHA256 hash algorithm
@ 2010-12-18  7:46 jgarzik at pobox dot com
  2010-12-18  7:48 ` [Bug c/47000] " jgarzik at pobox dot com
                   ` (29 more replies)
  0 siblings, 30 replies; 31+ messages in thread
From: jgarzik at pobox dot com @ 2010-12-18  7:46 UTC
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47000

           Summary: Major performance regression in parallel SSE2 impl of
                    SHA256 hash algorithm
           Product: gcc
           Version: 4.5.1
            Status: UNCONFIRMED
          Severity: major
          Priority: P3
         Component: c
        AssignedTo: unassigned@gcc.gnu.org
        ReportedBy: jgarzik@pobox.com


Created attachment 22805
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=22805
4-way SHA256 implementation, whose performance decreases markedly 4.4.x ->
4.5.x

OS: Fedora 14

My "cpuminer" open source project is -very- sensitive to performance of
generated code, and experiences a severe performance regression going from gcc
4.4.x to 4.5.x.

Our program core is essentially
     for (n = 0; n < 0xffffff; n++)
          sha256( sha256( data ) )      /* one iteration of inner loop */

Building with gcc 4.4.5 -or- Fedora 13 gcc (4.4.x derivative), we achieve
     1850.85 kilo-iterations per second

Building with gcc 4.5.1 -or- Fedora 14 gcc (4.5.x derivative), we achieve
     1389.82 kilo-iterations per second

This is a significant performance decrease, and the only variable is the
compiler.  I have presented x86_64 data below, but similar slowdowns are seen
on i686-mingw in Fedora 13 (fast gcc 4.4.x) or Fedora 14 (slow gcc 4.5.x).

This interesting variant of the standard SHA256 algorithm is implemented using
Intel/AMD SSE2-specific operations, effectively running four (4) SHA256
iterations in parallel, generating four (4) SHA256 hashes on four distinct
datasets.

See attachment sha256_4way.i.

--------------------------------------------------------------------------
fast, working gcc -v:
Using built-in specs.
Target: x86_64-unknown-linux-gnu
Configured with: ../src/gcc-4.4.5/configure --prefix=/garz/gcc44
--enable-languages=c
Thread model: posix
gcc version 4.4.5 (GCC) 

--------------------------------------------------------------------------
slow, broken gcc -v:
Using built-in specs.
COLLECT_GCC=/garz/gcc45/bin/gcc
COLLECT_LTO_WRAPPER=/garz/gcc45/libexec/gcc/x86_64-unknown-linux-gnu/4.5.1/lto-wrapper
Target: x86_64-unknown-linux-gnu
Configured with: ../src/gcc-4.5.1/configure --prefix=/garz/gcc45
--enable-languages=c
Thread model: posix
gcc version 4.5.1 (GCC)


* [Bug c/47000] Major performance regression in parallel SSE2 impl of SHA256 hash algorithm
From: jgarzik at pobox dot com @ 2010-12-18  7:48 UTC
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47000

--- Comment #1 from Jeff Garzik <jgarzik at pobox dot com> 2010-12-18 07:48:23 UTC ---
Besides the attached sha256_4way.i, the full source code is at
http://yyz.us/bitcoin/cpuminer-0.2.2.tar.gz  It's really quite small and easy
to build and use.

A sample RPC destination, usable for free, is http://mining.bitcoin.cz/


* [Bug target/47000] Major performance regression in parallel SSE2 impl of SHA256 hash algorithm
From: steven at gcc dot gnu.org @ 2010-12-18 12:27 UTC
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47000

Steven Bosscher <steven at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |steven at gcc dot gnu.org

--- Comment #2 from Steven Bosscher <steven at gcc dot gnu.org> 2010-12-18 12:27:37 UTC ---
What compiler options are you using?


* [Bug target/47000] Major performance regression in parallel SSE2 impl of SHA256 hash algorithm
From: steven at gcc dot gnu.org @ 2010-12-18 12:39 UTC
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47000

Steven Bosscher <steven at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2010.12.18 12:39:26
     Ever Confirmed|0                           |1

--- Comment #3 from Steven Bosscher <steven at gcc dot gnu.org> 2010-12-18 12:39:26 UTC ---
Compiled like so:
$ gcc-4.4.2 -S -O2 sha256_4way.i -o sha256_4way-44.s
$ gcc-4.5.0 -S -O2 sha256_4way.i -o sha256_4way-45.s

$ grep -c call *.s
sha256_4way-44.s:0
sha256_4way-45.s:484
$ grep call *.s|head
sha256_4way-45.s:    call    ROTR
sha256_4way-45.s:    call    ROTR
sha256_4way-45.s:    call    ROTR
sha256_4way-45.s:    call    ROTR
sha256_4way-45.s:    call    ROTR
sha256_4way-45.s:    call    ROTR
sha256_4way-45.s:    call    ROTR
sha256_4way-45.s:    call    ROTR
sha256_4way-45.s:    call    ROTR
sha256_4way-45.s:    call    ROTR
$ 

ROTR should have been inlined:

static inline __m128i ROTR(__m128i x, const int n) {
    return _mm_srli_epi32(x, n) | _mm_slli_epi32(x, 32 - n);
}

This probably explains the slowdown.


* [Bug target/47000] Major performance regression in parallel SSE2 impl of SHA256 hash algorithm
From: steven at gcc dot gnu.org @ 2010-12-18 12:49 UTC
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47000

Steven Bosscher <steven at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |missed-optimization
      Known to work|                            |4.4.2, 4.4.5
      Known to fail|                            |4.5.0, 4.5.1

--- Comment #4 from Steven Bosscher <steven at gcc dot gnu.org> 2010-12-18 12:49:12 UTC ---
GCC 4.6 (trunk revision 167996) also inlines ROTR. Is it possible for the
reporter to measure the number of k-iters with a recent snapshot of the trunk?


* [Bug target/47000] Major performance regression in parallel SSE2 impl of SHA256 hash algorithm
From: hjl.tools at gmail dot com @ 2010-12-18 15:36 UTC
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47000

H.J. Lu <hjl.tools at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |hubicka at gcc dot gnu.org

--- Comment #5 from H.J. Lu <hjl.tools at gmail dot com> 2010-12-18 15:36:30 UTC ---
(In reply to comment #3)
> Compiled like so:
> $ gcc-4.4.2 -S -O2 sha256_4way.i -o sha256_4way-44.s
> $ gcc-4.5.0 -S -O2 sha256_4way.i -o sha256_4way-45.s
> 
> $ grep -c call *.s
> sha256_4way-44.s:0
> sha256_4way-45.s:484
> $ grep call *.s|head
> sha256_4way-45.s:    call    ROTR
> sha256_4way-45.s:    call    ROTR
> sha256_4way-45.s:    call    ROTR
> sha256_4way-45.s:    call    ROTR
> sha256_4way-45.s:    call    ROTR
> sha256_4way-45.s:    call    ROTR
> sha256_4way-45.s:    call    ROTR
> sha256_4way-45.s:    call    ROTR
> sha256_4way-45.s:    call    ROTR
> sha256_4way-45.s:    call    ROTR
> $ 
> 
> ROTR should have been inlined:
> 
> static inline __m128i ROTR(__m128i x, const int n) {
>     return _mm_srli_epi32(x, n) | _mm_slli_epi32(x, 32 - n);
> }
> 
> This probably explains the slowdown.

This is caused by revision 151511:

http://gcc.gnu.org/ml/gcc-cvs/2009-09/msg00257.html


* [Bug target/47000] Major performance regression in parallel SSE2 impl of SHA256 hash algorithm
From: hjl.tools at gmail dot com @ 2010-12-18 15:41 UTC
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47000

--- Comment #6 from H.J. Lu <hjl.tools at gmail dot com> 2010-12-18 15:40:38 UTC ---
(In reply to comment #5)
> (In reply to comment #3)
> > Compiled like so:
> > $ gcc-4.4.2 -S -O2 sha256_4way.i -o sha256_4way-44.s
> > $ gcc-4.5.0 -S -O2 sha256_4way.i -o sha256_4way-45.s
> > 
> > $ grep -c call *.s
> > sha256_4way-44.s:0
> > sha256_4way-45.s:484
> > $ grep call *.s|head
> > sha256_4way-45.s:    call    ROTR
> > sha256_4way-45.s:    call    ROTR
> > sha256_4way-45.s:    call    ROTR
> > sha256_4way-45.s:    call    ROTR
> > sha256_4way-45.s:    call    ROTR
> > sha256_4way-45.s:    call    ROTR
> > sha256_4way-45.s:    call    ROTR
> > sha256_4way-45.s:    call    ROTR
> > sha256_4way-45.s:    call    ROTR
> > sha256_4way-45.s:    call    ROTR
> > $ 
> > 
> > ROTR should have been inlined:
> > 
> > static inline __m128i ROTR(__m128i x, const int n) {
> >     return _mm_srli_epi32(x, n) | _mm_slli_epi32(x, 32 - n);
> > }
> > 
> > This probably explains the slowdown.
> 
> This is caused by revision 151511:
> 
> http://gcc.gnu.org/ml/gcc-cvs/2009-09/msg00257.html

It is fixed by revision 166517:

http://gcc.gnu.org/ml/gcc-cvs/2010-11/msg00405.html


* [Bug target/47000] Major performance regression in parallel SSE2 impl of SHA256 hash algorithm
From: hjl.tools at gmail dot com @ 2010-12-18 15:43 UTC
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47000

--- Comment #7 from H.J. Lu <hjl.tools at gmail dot com> 2010-12-18 15:43:06 UTC ---
It may be fixed by the patch for PR 40436.


* [Bug target/47000] Major performance regression in parallel SSE2 impl of SHA256 hash algorithm
From: hjl.tools at gmail dot com @ 2010-12-18 16:04 UTC
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47000

--- Comment #8 from H.J. Lu <hjl.tools at gmail dot com> 2010-12-18 16:03:44 UTC ---
Can you try

--
diff --git a/gcc/tree-inline.c b/gcc/tree-inline.c
index af1adf4..dd00de6 100644
--- a/gcc/tree-inline.c
+++ b/gcc/tree-inline.c
@@ -3342,7 +3342,11 @@ estimate_num_insns (gimple stmt, eni_weights *weights)
     if (POINTER_TYPE_P (funtype))
       funtype = TREE_TYPE (funtype);

-    if (decl && DECL_BUILT_IN_CLASS (decl) == BUILT_IN_MD)
+    /* Do not special case builtins where we see the body.
+       This just confuse inliner.  */
+    if (!decl || cgraph_node (decl)->analyzed)
+      ;
+    else if (decl && DECL_BUILT_IN_CLASS (decl) == BUILT_IN_MD)
       cost = weights->target_builtin_call_cost;
     else
       cost = weights->call_cost;
---


* [Bug target/47000] Major performance regression in parallel SSE2 impl of SHA256 hash algorithm
From: jgarzik at pobox dot com @ 2010-12-18 18:24 UTC
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47000

--- Comment #9 from Jeff Garzik <jgarzik at pobox dot com> 2010-12-18 18:24:09 UTC ---
(In reply to comment #2)
> What compiler options are you using?

Pretty basic:  -O3 -Wall -msse2 -g

Sometimes -O3 -Wall -g -march=native, on a quad core Intel box

Results are the same: performance regression.

Will try HJ's patch next...


* [Bug target/47000] Major performance regression in parallel SSE2 impl of SHA256 hash algorithm
From: jgarzik at pobox dot com @ 2010-12-18 18:49 UTC
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47000

--- Comment #10 from Jeff Garzik <jgarzik at pobox dot com> 2010-12-18 18:49:35 UTC ---
(In reply to comment #8)
> Can you try
> 
> --
> diff --git a/gcc/tree-inline.c b/gcc/tree-inline.c
> index af1adf4..dd00de6 100644
> --- a/gcc/tree-inline.c
> +++ b/gcc/tree-inline.c
> @@ -3342,7 +3342,11 @@ estimate_num_insns (gimple stmt, eni_weights *weights)
>      if (POINTER_TYPE_P (funtype))
>        funtype = TREE_TYPE (funtype);
> 
> -    if (decl && DECL_BUILT_IN_CLASS (decl) == BUILT_IN_MD)
> +    /* Do not special case builtins where we see the body.
> +       This just confuse inliner.  */
> +    if (!decl || cgraph_node (decl)->analyzed)
> +      ;
> +    else if (decl && DECL_BUILT_IN_CLASS (decl) == BUILT_IN_MD)
>        cost = weights->target_builtin_call_cost;
>      else
>        cost = weights->call_cost;
> ---

ICE appears when attempting to make with gcc 4.5.1 + your patch:

/garz/gcc45/src/gcc-4.5.1.patch1/libgcc/config/libbid/bid128_fma.c:4460:1:
internal compiler error: in estimate_function_body_sizes, at ipa-inline.c:1920
Please submit a full bug report,


* [Bug target/47000] Major performance regression in parallel SSE2 impl of SHA256 hash algorithm
From: jgarzik at pobox dot com @ 2010-12-18 19:08 UTC
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47000

--- Comment #11 from Jeff Garzik <jgarzik at pobox dot com> 2010-12-18 19:08:45 UTC ---
(In reply to comment #4)
> GCC 4.6 (trunk revision 167996) also inlines ROTR. Is it possible for the
> reporter to measure the number of k-iters with a recent snapshot of the trunk?

Latest SVN trunk (r168027) is very nice, and does not show the performance
regression:

Performance is all the way up to 1977.31 kilo-iterations per second.


* [Bug target/47000] Major performance regression in parallel SSE2 impl of SHA256 hash algorithm
From: jgarzik at pobox dot com @ 2010-12-18 19:09 UTC
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47000

--- Comment #12 from Jeff Garzik <jgarzik at pobox dot com> 2010-12-18 19:09:25 UTC ---
Any other patches for me to try, for gcc 4.5.1?


* [Bug target/47000] Major performance regression in parallel SSE2 impl of SHA256 hash algorithm
From: steven at gcc dot gnu.org @ 2010-12-18 19:21 UTC
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47000

--- Comment #13 from Steven Bosscher <steven at gcc dot gnu.org> 2010-12-18 19:21:05 UTC ---
I'd like to wait for Honza's opinion before we just start trying random
patches. 

But if you feel like trying some other things, perhaps you can see if
backporting all changes of
http://gcc.gnu.org/viewcvs?view=revision&revision=166517 helps.


* [Bug target/47000] [4.5 Regression] Failure to inline SSE intrinsics
From: hjl.tools at gmail dot com @ 2010-12-18 19:35 UTC
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47000

--- Comment #14 from H.J. Lu <hjl.tools at gmail dot com> 2010-12-18 19:35:24 UTC ---
Created attachment 22813
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=22813
A new patch

Try this.


* [Bug target/47000] [4.5 Regression] Failure to inline SSE intrinsics
From: hjl.tools at gmail dot com @ 2010-12-18 19:39 UTC
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47000

--- Comment #15 from H.J. Lu <hjl.tools at gmail dot com> 2010-12-18 19:38:47 UTC ---
(In reply to comment #13)
> I'd like to wait for Honza's opinion before we just start trying random
> patches. 
> 
> But if you feel like trying some other things, perhaps you can see if
> backporting all changes of
> http://gcc.gnu.org/viewcvs?view=revision&revision=166517 helps.

This checkin depends on is_simple_builtin and is_inexpensive_builtin,
which are new in 4.6.


* [Bug target/47000] [4.5 Regression] Failure to inline SSE intrinsics
From: jakub at gcc dot gnu.org @ 2010-12-18 20:26 UTC
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47000

Jakub Jelinek <jakub at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jakub at gcc dot gnu.org

--- Comment #16 from Jakub Jelinek <jakub at gcc dot gnu.org> 2010-12-18 20:26:29 UTC ---
I don't think it is a good idea to change inliner heuristics in 4.5 at this
point.  If it is always a good idea to inline that function, it should be
__attribute__((always_inline)).


* [Bug target/47000] [4.5 Regression] Failure to inline SSE intrinsics
From: jgarzik at pobox dot com @ 2010-12-18 21:15 UTC
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47000

--- Comment #17 from Jeff Garzik <jgarzik at pobox dot com> 2010-12-18 21:15:28 UTC ---
(In reply to comment #8)
> -    if (decl && DECL_BUILT_IN_CLASS (decl) == BUILT_IN_MD)
> +    /* Do not special case builtins where we see the body.
> +       This just confuse inliner.  */
> +    if (!decl || cgraph_node (decl)->analyzed)
> +      ;
> +    else if (decl && DECL_BUILT_IN_CLASS (decl) == BUILT_IN_MD)
>        cost = weights->target_builtin_call_cost;
>      else
>        cost = weights->call_cost;

This patch successfully fixes the performance regression in 4.5.1.


* [Bug target/47000] [4.5 Regression] Failure to inline SSE intrinsics
From: jgarzik at pobox dot com @ 2010-12-18 21:16 UTC
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47000

--- Comment #18 from Jeff Garzik <jgarzik at pobox dot com> 2010-12-18 21:16:31 UTC ---
argh, please ignore comment #17.  misquote.


* [Bug target/47000] [4.5 Regression] Failure to inline SSE intrinsics
From: jgarzik at pobox dot com @ 2010-12-18 21:17 UTC
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47000

--- Comment #19 from Jeff Garzik <jgarzik at pobox dot com> 2010-12-18 21:17:09 UTC ---
(In reply to comment #14)
> Created attachment 22813 [details]
> A new patch
> 
> Try this.

This patch successfully fixes the performance regression in 4.5.1.

Thanks!


* [Bug target/47000] [4.5 Regression] Failure to inline SSE intrinsics
From: jgarzik at pobox dot com @ 2010-12-18 21:26 UTC
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47000

--- Comment #20 from Jeff Garzik <jgarzik at pobox dot com> 2010-12-18 21:25:46 UTC ---
(In reply to comment #16)
> I don't think it is a good idea to change inliner heuristics in 4.5 at this
> point.  If it is always a good idea to inline that function, it should be
> __attribute__((always_inline)).

I confirm that replacing 'inline' with '__attribute__((always_inline))' also
resolves the regression.

It is a bit disappointing to leave such a major performance difference (about
-25%, going by the figures above) in the latest stable compiler release without
resolution (if the decision is to leave the inliner alone).
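
For readers following along, the always_inline workaround from comments #16
and #20 might look roughly like the sketch below. It is not the code from
cpuminer: _mm_or_si128 is swapped in for the '|' operator used in the original
ROTR, the 'inline' keyword is kept alongside the attribute, and rotr4 is a
hypothetical helper added only to exercise the function.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* always_inline (a GCC extension) asks the compiler to inline this function
   regardless of the inliner's size and growth heuristics. */
static inline __attribute__((always_inline)) __m128i
ROTR(__m128i x, const int n)
{
    /* Rotate each of the four 32-bit lanes right by n bits (0 < n < 32). */
    return _mm_or_si128(_mm_srli_epi32(x, n), _mm_slli_epi32(x, 32 - n));
}

/* Hypothetical helper, not part of the attachment: rotate four words. */
static void rotr4(const uint32_t in[4], int n, uint32_t out[4])
{
    __m128i v = _mm_loadu_si128((const __m128i *)in);
    _mm_storeu_si128((__m128i *)out, ROTR(v, n));
}
```

On 32-bit x86 this needs -msse2. Whether ROTR was really inlined can be
checked the same way as in comment #3, by grepping the generated assembly for
calls to ROTR.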


* [Bug target/47000] [4.5 Regression] Failure to inline SSE intrinsics
From: hubicka at ucw dot cz @ 2010-12-19 11:50 UTC
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47000

--- Comment #21 from Jan Hubicka <hubicka at ucw dot cz> 2010-12-19 11:49:53 UTC ---
> I'd like to wait for Honza's opinion before we just start trying random
> patches. 
Well, if H.J.'s proposed backport of the builtin cost sizes helps, I guess it
is a sane way to fix this.  I will take a look at why the main inliner doesn't
do the job when the early inliner ignores the call.

I am not sure how much of the heuristics changes it makes sense to backport to
4.5.  It depends on the importance of the regression, I guess.

Honza


* [Bug target/47000] [4.5 Regression] Failure to inline SSE intrinsics
From: hubicka at gcc dot gnu.org @ 2010-12-19 11:53 UTC
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47000

--- Comment #22 from Jan Hubicka <hubicka at gcc dot gnu.org> 2010-12-19 11:53:32 UTC ---
  freq:  8000 size:  2 time:  2 D.13088_5480 = VIEW_CONVERT_EXPR<vector int>(D.8004_729);
  freq:  8000 size:  2 time:  2 D.13087_5481 = VIEW_CONVERT_EXPR<vector int>(a_5271);
  freq:  8000 size:  7 time: 16 D.13086_5482 = __builtin_ia32_paddd128 (D.13087_5481, D.13088_5480);

Obviously we should also count VIEW_CONVERT_EXPR (V_C_E) as free, like other
conversions.  I will test a patch for that.


* [Bug target/47000] [4.5 Regression] Failure to inline SSE intrinsics
From: hubicka at gcc dot gnu.org @ 2010-12-19 11:58 UTC
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47000

--- Comment #23 from Jan Hubicka <hubicka at gcc dot gnu.org> 2010-12-19 11:58:35 UTC ---
sha256_4way.c:287:78: warning: called from here
sha256_4way.c:50:23: warning: inlining failed in call to ‘ROTR’: --param inline-unit-growth limit reached

So you could also work around this with --param inline-unit-growth=<some
sufficiently large number>.  Otherwise H.J.'s proposed backport seems like the
most sane way to solve the problem.  I guess it can be backported.


I am testing
Index: tree-inline.c
===================================================================
--- tree-inline.c       (revision 168047)
+++ tree-inline.c       (working copy)
@@ -3281,6 +3281,7 @@ estimate_operator_cost (enum tree_code c
     CASE_CONVERT:
     case COMPLEX_EXPR:
     case PAREN_EXPR:
+    case VIEW_CONVERT_EXPR:
       return 0;

     /* Assign cost of 1 to usual operations.

to solve the V_C_E problems.


* [Bug target/47000] [4.5 Regression] Failure to inline SSE intrinsics
From: jakub at gcc dot gnu.org @ 2010-12-20  8:32 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47000

--- Comment #24 from Jakub Jelinek <jakub at gcc dot gnu.org> 2010-12-20 08:32:10 UTC ---
VCE is often very expensive though (often a memory store followed by memory
load into a different register, etc.), so 0 unconditionally is IMHO wrong.
Perhaps for some TYPE_MODE combinations at most.
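
For readers unfamiliar with the distinction, here is a small sketch contrasting
a bit reinterpretation (the kind of operation a VIEW_CONVERT_EXPR stands for)
with an ordinary value conversion.  The function names are hypothetical, and
the code assumes float is IEEE 754 binary32.  Whether GCC represents this
particular memcpy as a V_C_E internally depends on the version and the passes
involved; the example only shows the semantic difference.

```c
#include <stdint.h>
#include <string.h>

/* Reinterpret the bits of a float as a 32-bit integer.  The bits are
   unchanged, but on many targets the value has to move from a
   floating-point register to an integer register, possibly through
   memory, which is the cost being discussed here. */
static inline uint32_t float_bits(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);   /* well-defined type punning in C */
    return u;
}

/* Ordinary value conversion, for contrast: the numeric value is kept
   (truncated toward zero) and the bit pattern changes. */
static inline int32_t float_to_int(float f)
{
    return (int32_t)f;
}
```

With IEEE 754 floats, float_bits(1.0f) yields 0x3f800000 while
float_to_int(1.0f) yields 1.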


* [Bug target/47000] [4.5 Regression] Failure to inline SSE intrinsics
From: hubicka at gcc dot gnu.org @ 2010-12-21 10:31 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47000

--- Comment #25 from Jan Hubicka <hubicka at gcc dot gnu.org> 2010-12-21 10:30:36 UTC ---
Author: hubicka
Date: Tue Dec 21 10:30:33 2010
New Revision: 168108

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=168108
Log:

    PR middle-end/47000
    * tree-inline.c (estimate_operator_cost): Handle VIEW_CONVERT_EXPR.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/tree-inline.c


* [Bug target/47000] [4.5 Regression] Failure to inline SSE intrinsics
From: hubicka at gcc dot gnu.org @ 2010-12-21 10:39 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47000

--- Comment #26 from Jan Hubicka <hubicka at gcc dot gnu.org> 2010-12-21 10:39:42 UTC ---
Hi,
I read the comment only after committing the patch.  We generally assume
conversions to be free even though this is not always the case; FP->int
conversions tend to be expensive, too.  I don't think it is a serious problem,
since the conversions tend to be dominated by real work elsewhere, and there is
a good chance for conversions to combine and optimize when code is duplicated
by inlining, peeling, or the like.

For non-registers, V_C_Es are already counted, as all non-register accesses are
believed to be reads/writes.  So all we get wrong are those int<->fp V_C_Es.  I
don't think they are terribly common, and how expensive they really are is very
target specific...  SSE intrinsics and SRA are nowadays both quite good sources
of cheap V_C_Es, so I would guess that the vast majority of them are cheap
anyway.

Honza


* [Bug target/47000] [4.5 Regression] Failure to inline SSE intrinsics
From: rguenth at gcc dot gnu.org @ 2010-12-28 14:57 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47000

Richard Guenther <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|---                         |4.5.3

--- Comment #27 from Richard Guenther <rguenth at gcc dot gnu.org> 2010-12-28 14:57:37 UTC ---
(In reply to comment #24)
> VCE is often very expensive though (often a memory store followed by memory
> load into a different register, etc.), so 0 unconditionally is IMHO wrong.
> Perhaps for some TYPE_MODE combinations at most.

I think assuming VCE is zero-cost on the tree level makes sense though,
as they usually tend to go away (that is, when they appear in regular
code, not as a result of weird type punning).


* [Bug target/47000] [4.5 Regression] Failure to inline SSE intrinsics
From: rguenth at gcc dot gnu.org @ 2011-03-08 13:20 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47000

Richard Guenther <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Priority|P3                          |P2
      Known to fail|                            |4.5.2


* [Bug target/47000] [4.5 Regression] Failure to inline SSE intrinsics
From: rguenth at gcc dot gnu.org @ 2011-04-28 15:28 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47000

Richard Guenther <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|4.5.3                       |4.5.4

--- Comment #28 from Richard Guenther <rguenth at gcc dot gnu.org> 2011-04-28 14:51:41 UTC ---
GCC 4.5.3 is being released, adjusting target milestone.


* [Bug target/47000] [4.5 Regression] Failure to inline SSE intrinsics
From: rguenth at gcc dot gnu.org @ 2012-07-02 10:28 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47000

Richard Guenther <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED
   Target Milestone|4.5.4                       |4.6.0

--- Comment #29 from Richard Guenther <rguenth at gcc dot gnu.org> 2012-07-02 10:27:51 UTC ---
Fixed in 4.6.0, the 4.5 branch is being closed.

