public inbox for gcc-bugs@sourceware.org
* [Bug tree-optimization/94364] New: 505.mcf_r is 8% faster when compiled with -mprefer-vector-width=128
@ 2020-03-27 18:06 jamborm at gcc dot gnu.org
  2020-03-30  7:51 ` [Bug target/94364] " rguenth at gcc dot gnu.org
                   ` (5 more replies)
  0 siblings, 6 replies; 7+ messages in thread
From: jamborm at gcc dot gnu.org @ 2020-03-27 18:06 UTC
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94364

            Bug ID: 94364
           Summary: 505.mcf_r is 8% faster when compiled with
                    -mprefer-vector-width=128
           Product: gcc
           Version: 10.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: jamborm at gcc dot gnu.org
            Blocks: 26163
  Target Milestone: ---
              Host: x86_64-linux
            Target: x86_64-linux

SPEC 2017 INTrate benchmark 505.mcf_r, when compiled with options
-Ofast -march=native -mtune=native, is 8% slower than when we also use
option -mprefer-vector-width=128.  I have observed it on both AMD Zen2
and Intel Cascade Lake Server CPUs (using master revision 26b3e568a60).

Better vector width selection would therefore bring a noticeable
speed-up.


Symbol profiles (collected on AMD Rome):

-Ofast -march=native -mtune=native:

  Overhead       Samples  Shared Object    Symbol                          
  ........  ............  ...............  ................................

    28.64%        462302  mcf_r_peak.mine  spec_qsort
    21.58%        348703  mcf_r_peak.mine  cost_compare
    15.81%        255029  mcf_r_peak.mine  primal_bea_mpp
    15.58%        251176  mcf_r_peak.mine  replace_weaker_arc
     7.37%        118646  mcf_r_peak.mine  arc_compare
     6.53%        105337  mcf_r_peak.mine  price_out_impl
     1.38%         22276  mcf_r_peak.mine  update_tree

-Ofast -march=native -mtune=native -mprefer-vector-width=128:

  Overhead       Samples  Shared Object    Symbol                          
  ........  ............  ...............  ................................

    23.57%        354536  mcf_r_peak.mine  spec_qsort
    23.51%        353767  mcf_r_peak.mine  cost_compare
    16.98%        255104  mcf_r_peak.mine  primal_bea_mpp
    16.65%        249891  mcf_r_peak.mine  replace_weaker_arc
     7.29%        109267  mcf_r_peak.mine  arc_compare
     7.09%        106380  mcf_r_peak.mine  price_out_impl
     1.53%         22968  mcf_r_peak.mine  update_tree


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=26163
[Bug 26163] [meta-bug] missed optimization in SPEC (2k17, 2k and 2k6 and 95)


* [Bug target/94364] 505.mcf_r is 8% faster when compiled with -mprefer-vector-width=128
From: rguenth at gcc dot gnu.org @ 2020-03-30  7:51 UTC
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94364

--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
Huh, looks like this is the (patched by us) memory copying done in spec_qsort?
I wonder if you can re-measure with our patching undone but then with
-fno-strict-aliasing (though I think that only was required with LTO).

How large are the objects sorted in mcf?


* [Bug target/94364] 505.mcf_r is 8% faster when compiled with -mprefer-vector-width=128
From: jamborm at gcc dot gnu.org @ 2020-04-01 19:14 UTC
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94364

--- Comment #2 from Martin Jambor <jamborm at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #1)
> Huh, looks like this is the (patched by us) memory copying done in
> spec_qsort?

Yes

> I wonder if you can re-measure with our patching undone but then with
> -fno-strict-aliasing (though I think that only was required with LTO).
>

The difference indeed goes away :-/  The current code we're
benchmarking (when not using LTO) is slower in both cases :-/

> How large are the objects sorted in mcf?

It's always pointers, 8 bytes.


* [Bug target/94364] 505.mcf_r is 8% faster when compiled with -mprefer-vector-width=128
From: rguenth at gcc dot gnu.org @ 2020-04-02  7:30 UTC
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94364

--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Martin Jambor from comment #2)
> (In reply to Richard Biener from comment #1)
> > Huh, looks like this is the (patched by us) memory copying done in
> > spec_qsort?
> 
> Yes
> 
> > I wonder if you can re-measure with our patching undone but then with
> > -fno-strict-aliasing (though I think that only was required with LTO).
> >
> 
> The difference indeed goes away :-/  The current code we're
> benchmarking (when not using LTO) is slower in both cases :-/

:/

What is the diff we are using?  IIRC spec_qsort contains special casing
for standard integer type sizes and my original patch simply removed all
that premature optimization and instead always used the char copying loop
(which then seems to be vectorized).  Maybe we can resort to applying
-fno-strict-aliasing just to the qsort CU?  The patch wasn't intended to
introduce big differences compared to official runs...

> > How large are the objects sorted in mcf?
> 
> It's always pointers, 8 bytes.

OK, that would explain it then.
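
For illustration only, here is a minimal sketch of the two copying
strategies discussed above.  It is not the SPEC source and not the patch
in question; the helper names swap_bytes and swap_fixed8 are
hypothetical.  The first is a generic char-wise loop whose trip count is
only known at run time, which is the kind of loop the vectorizer may
turn into wide vector code even though only 8 bytes are ever moved.  The
second handles 8-byte elements with fixed-size memcpy calls, which stays
valid under strict aliasing and which compilers typically lower to plain
64-bit loads and stores.

```c
#include <stddef.h>
#include <string.h>

/* Generic element swap: exchanges es bytes one char at a time.
   The trip count is a run-time value, so the compiler cannot know
   that callers only ever pass es == 8. */
static void swap_bytes(char *a, char *b, size_t es)
{
    while (es-- > 0) {
        char t = *a;
        *a++ = *b;
        *b++ = t;
    }
}

/* Swap specialised for 8-byte elements (hypothetical helper).
   memcpy with a constant size avoids type punning, so it does not
   depend on -fno-strict-aliasing. */
static void swap_fixed8(void *a, void *b)
{
    unsigned char t[8];
    memcpy(t, a, 8);
    memcpy(a, b, 8);
    memcpy(b, t, 8);
}
```

Both helpers leave the data in the same state for 8-byte elements; they
differ only in what the optimizer can assume about the copy size.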


* [Bug target/94364] 505.mcf_r is 8% faster when compiled with -mprefer-vector-width=128
From: marxin at gcc dot gnu.org @ 2020-04-02 11:18 UTC
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94364

--- Comment #4 from Martin Liška <marxin at gcc dot gnu.org> ---
Created attachment 48169
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=48169&action=edit
qsort patch

I'm sending the spec_qsort patch we use. I'm going to prepare a patch that
will revert this and add a -fno-strict-aliasing attribute to the function.


* [Bug target/94364] 505.mcf_r is 8% faster when compiled with -mprefer-vector-width=128
From: marxin at gcc dot gnu.org @ 2020-04-02 12:00 UTC
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94364

--- Comment #5 from Martin Liška <marxin at gcc dot gnu.org> ---
With something like:

diff --git a/benchspec/CPU/505.mcf_r/src/spec_qsort/spec_qsort.c b/benchspec/CPU/505.mcf_r/src/spec_qsort/spec_qsort.c
index 05cad501..ad79ddae 100755
--- a/benchspec/CPU/505.mcf_r/src/spec_qsort/spec_qsort.c
+++ b/benchspec/CPU/505.mcf_r/src/spec_qsort/spec_qsort.c
@@ -112,6 +112,7 @@ med3(char *a, char *b, char *c, cmp_t *cmp)
 }

 void
+__attribute__((optimize("-fno-strict-aliasing")))
 spec_qsort(void *a, size_t n, size_t es, cmp_t *cmp)
 {
         char *pa, *pb, *pc, *pd, *pl, *pm, *pn;
diff --git a/benchspec/CPU/505.mcf_r/src/spec_qsort/spec_qsort.h b/benchspec/CPU/505.mcf_r/src/spec_qsort/spec_qsort.h
index 0519f867..c25a1159 100755
--- a/benchspec/CPU/505.mcf_r/src/spec_qsort/spec_qsort.h
+++ b/benchspec/CPU/505.mcf_r/src/spec_qsort/spec_qsort.h
@@ -6,5 +6,7 @@
 #ifdef __cplusplus
 extern "C"
 #endif
-void spec_qsort(void *array, size_t nitems, size_t size, int (*cmp)(const void*,const void*));
+void
+__attribute__((optimize("-fno-strict-aliasing")))
+spec_qsort(void *array, size_t nitems, size_t size, int (*cmp)(const void*,const void*));
 #endif

and -Ofast -march=znver2 I get:

    21.95%  mcf_r_peak.gcc7  mcf_r_peak.gcc7-m64  [.] cost_compare
    19.95%  mcf_r_peak.gcc7  mcf_r_peak.gcc7-m64  [.] spec_qsort
    19.63%  mcf_r_peak.gcc7  mcf_r_peak.gcc7-m64  [.] primal_bea_mpp
    14.20%  mcf_r_peak.gcc7  mcf_r_peak.gcc7-m64  [.] replace_weaker_arc
     9.17%  mcf_r_peak.gcc7  mcf_r_peak.gcc7-m64  [.] arc_compare
     8.47%  mcf_r_peak.gcc7  mcf_r_peak.gcc7-m64  [.] price_out_impl
     1.37%  mcf_r_peak.gcc7  mcf_r_peak.gcc7-m64  [.] update_tree
     0.97%  mcf_r_peak.gcc7  mcf_r_peak.gcc7-m64  [.] switch_arcs.constprop.0
     0.83%  mcf_r_peak.gcc7  mcf_r_peak.gcc7-m64  [.] suspend_impl
     0.69%  mcf_r_peak.gcc7  mcf_r_peak.gcc7-m64  [.] primal_iminus
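
As a self-contained illustration of the per-function attribute used in
the diff above, the following is a sketch under stated assumptions, not
code from SPEC: the macro NO_STRICT_ALIASING and the helper swap_words
are hypothetical names.  The attribute is a GCC extension, so the macro
expands to nothing for other compilers.  The helper swaps memory as
long-sized words, which is the kind of access that breaks the aliasing
rules when the underlying objects are not long (for example pointers).
Note that the GCC manual describes the optimize attribute as intended
for debugging rather than production use.

```c
#include <stddef.h>

/* Assumption: only GCC proper understands the optimize attribute,
   so it is compiled out for every other compiler. */
#if defined(__GNUC__) && !defined(__clang__)
#define NO_STRICT_ALIASING __attribute__((optimize("-fno-strict-aliasing")))
#else
#define NO_STRICT_ALIASING
#endif

/* Hypothetical helper: swaps n bytes as long-sized words.
   n must be a multiple of sizeof(long) and both pointers must be
   suitably aligned for long.  Reading non-long objects through
   long * is what the attribute is meant to make safe under GCC. */
NO_STRICT_ALIASING
static void swap_words(void *a, void *b, size_t n)
{
    long *pa = (long *)a;
    long *pb = (long *)b;
    size_t i;

    for (i = 0; i < n / sizeof(long); i++) {
        long t = pa[i];
        pa[i] = pb[i];
        pb[i] = t;
    }
}
```

Called on two arrays of long, as in `swap_words(x, y, sizeof x)`, the
helper involves no type punning at all; the attribute only matters when
the caller's element type differs from long.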


* [Bug target/94364] 505.mcf_r is 8% faster when compiled with -mprefer-vector-width=128
From: jamborm at gcc dot gnu.org @ 2020-04-02 14:37 UTC
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94364

Martin Jambor <jamborm at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |RESOLVED
         Resolution|---                         |WONTFIX

--- Comment #6 from Martin Jambor <jamborm at gcc dot gnu.org> ---
OK, I'm going to close this given that this problem is specific to our
mcf patch which we decided to change and the issue cannot easily be
avoided in the compiler.

