gcc9 snapshot 20190414 is 30x slower than gcc 6.3

public inbox for gcc-help@gcc.gnu.org

* gcc9 snapshot 20190414 is 30x slower than gcc 6.3
@ 2019-04-17  0:28 Jason Mancini
  2019-04-17  2:10 ` Jason Mancini
  0 siblings, 1 reply; 24+ messages in thread
From: Jason Mancini @ 2019-04-17  0:28 UTC (permalink / raw)
  To: gcc-help

Using gcc 6.3, my C++ source file compiles in 1m2s with -O0.  With snapshot 20190414 (compiled with --disable-checking and -O2 and make install-strip), it takes 31 minutes to compile the same file with -O0.  Have I overlooked disabling some snapshot self-checking code?  Are there known configuration mistakes that could result in this sort of performance degradation?  Thanks!  It will take a while to go back and try other gcc 6, 7, 8, and 9 snapshots to collect points of reference.  Both are pretty heavy on memory, gcc6 uses 3.7G and gcc9 uses 5.4G for this file.  There's a lot of templatized headers.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: gcc9 snapshot 20190414 is 30x slower than gcc 6.3
  2019-04-17  0:28 gcc9 snapshot 20190414 is 30x slower than gcc 6.3 Jason Mancini
@ 2019-04-17  2:10 ` Jason Mancini
  2019-04-17  2:20   ` Xi Ruoyao
  0 siblings, 1 reply; 24+ messages in thread
From: Jason Mancini @ 2019-04-17  2:10 UTC (permalink / raw)
  To: gcc-help

> Using gcc 6.3, my C++ source file compiles in 1m2s with -O0.  With snapshot 20190414 (compiled with --disable-checking
> and -O2 and make install-strip), it takes 31 minutes to compile the same file with -O0.  Have I overlooked disabling some
> snapshot self-checking code?  Are there known configuration mistakes that could result in this sort of performance
> degradation?  Thanks!  It will take a while to go back and try other gcc 6, 7, 8, and 9 snapshots to collect points of reference.
> Both are pretty heavy on memory, gcc6 uses 3.7G and gcc9 uses 5.4G for this file.  There's a lot of templatized headers.

Latest data points:
gcc-6.3/6.4 take about 43 seconds
gcc-7.2 takes 30 minutes
gcc-8.2 takes 27 minutes
gcc-9.0 takes 31 minutes (snapshot 20190414)
clang 6.0.1/7.0.1 take about 31 seconds

This is frustrating, as I'm going to have to capitulate to using clang here for a very large user base.  We've been a gcc
shop for decades.


* Re: gcc9 snapshot 20190414 is 30x slower than gcc 6.3
  2019-04-17  2:10 ` Jason Mancini
@ 2019-04-17  2:20   ` Xi Ruoyao
  2019-04-17  8:38     ` Jonathan Wakely
  0 siblings, 1 reply; 24+ messages in thread
From: Xi Ruoyao @ 2019-04-17  2:20 UTC (permalink / raw)
  To: Jason Mancini; +Cc: gcc-help

On 2019-04-17 02:09 +0000, Jason Mancini wrote:
> > Using gcc 6.3, my C++ source file compiles in 1m2s with -O0.  With snapshot
> > 20190414 (compiled with --disable-checking
> > and -O2 and make install-strip), it takes 31 minutes to compile the same
> > file with -O0.  Have I overlooked disabling some
> > snapshot self-checking code?  Are there known configuration mistakes that
> > could result in this sort of performance
> > degradation?  Thanks!  It will take a while to go back and try other gcc 6,
> > 7, 8, and 9 snapshots to collect points of reference.
> > Both are pretty heavy on memory, gcc6 uses 3.7G and gcc9 uses 5.4G for this
> > file.  There's a lot of templatized headers.
> 
> Latest data points:
> gcc-6.3/6.4 take about 43 seconds
> gcc-7.2 takes 30 minutes
> gcc-8.2 takes 27 minutes
> gcc-9.0 takes 31 minutes (snapshot 20190414)
> clang 6.0.1/7.0.1 take about 31 seconds
> 
> This is frustrating, as I'm going to have to capitulate to using clang here
> for a very large user base.  We've been a gcc
> shop for decades.

We'll never know why unless you can give a testcase to reproduce this issue.
-- 
Xi Ruoyao <xry111@mengyan1223.wang>
School of Aerospace Science and Technology, Xidian University


* Re: gcc9 snapshot 20190414 is 30x slower than gcc 6.3
  2019-04-17  2:20   ` Xi Ruoyao
@ 2019-04-17  8:38     ` Jonathan Wakely
  2019-04-17  9:07       ` Segher Boessenkool
  2019-04-18 16:06       ` Jason Mancini
  0 siblings, 2 replies; 24+ messages in thread
From: Jonathan Wakely @ 2019-04-17  8:38 UTC (permalink / raw)
  To: Jason Mancini; +Cc: gcc-help

On Wed, 17 Apr 2019 at 03:20, Xi Ruoyao wrote:
>
> On 2019-04-17 02:09 +0000, Jason Mancini wrote:
> > > Using gcc 6.3, my C++ source file compiles in 1m2s with -O0.  With snapshot
> > > 20190414 (compiled with --disable-checking
> > > and -O2 and make install-strip), it takes 31 minutes to compile the same
> > > file with -O0.  Have I overlooked disabling some
> > > snapshot self-checking code?  Are there known configuration mistakes that
> > > could result in this sort of performance
> > > degradation?  Thanks!  It will take a while to go back and try other gcc 6,
> > > 7, 8, and 9 snapshots to collect points of reference.
> > > Both are pretty heavy on memory, gcc6 uses 3.7G and gcc9 uses 5.4G for this
> > > file.  There's a lot of templatized headers.
> >
> > Latest data points:
> > gcc-6.3/6.4 take about 43 seconds
> > gcc-7.2 takes 30 minutes
> > gcc-8.2 takes 27 minutes
> > gcc-9.0 takes 31 minutes (snapshot 20190414)
> > clang 6.0.1/7.0.1 take about 31 seconds
> >
> > This is frustrating, as I'm going to have to capitulate to using clang here
> > for a very large user base.  We've been a gcc
> > shop for decades.
>
> We'll never know why unless you can give a testcase to reproduce this issue.

Even better would be a bug report.

The chances of it ever getting fixed are much higher if we know
there's a problem. If you just complain that you have to switch to
clang then nothing will change. And if you'd told us two years ago
that your program started compiling 40 times slower, maybe it would
have been fixed already.


* Re: gcc9 snapshot 20190414 is 30x slower than gcc 6.3
  2019-04-17  8:38     ` Jonathan Wakely
@ 2019-04-17  9:07       ` Segher Boessenkool
  2019-04-18 16:06       ` Jason Mancini
  1 sibling, 0 replies; 24+ messages in thread
From: Segher Boessenkool @ 2019-04-17  9:07 UTC (permalink / raw)
  To: Jonathan Wakely; +Cc: Jason Mancini, gcc-help

On Wed, Apr 17, 2019 at 09:37:54AM +0100, Jonathan Wakely wrote:
> On Wed, 17 Apr 2019 at 03:20, Xi Ruoyao wrote:
> >
> > On 2019-04-17 02:09 +0000, Jason Mancini wrote:
> > > > Using gcc 6.3, my C++ source file compiles in 1m2s with -O0.  With snapshot
> > > > 20190414 (compiled with --disable-checking
> > > > and -O2 and make install-strip), it takes 31 minutes to compile the same
> > > > file with -O0.  Have I overlooked disabling some
> > > > snapshot self-checking code?  Are there known configuration mistakes that
> > > > could result in this sort of performance
> > > > degradation?  Thanks!  It will take a while to go back and try other gcc 6,
> > > > 7, 8, and 9 snapshots to collect points of reference.
> > > > Both are pretty heavy on memory, gcc6 uses 3.7G and gcc9 uses 5.4G for this
> > > > file.  There's a lot of templatized headers.
> > >
> > > Latest data points:
> > > gcc-6.3/6.4 take about 43 seconds
> > > gcc-7.2 takes 30 minutes
> > > gcc-8.2 takes 27 minutes
> > > gcc-9.0 takes 31 minutes (snapshot 20190414)
> > > clang 6.0.1/7.0.1 take about 31 seconds
> > >
> > > This is frustrating, as I'm going to have to capitulate to using clang here
> > > for a very large user base.  We've been a gcc
> > > shop for decades.
> >
> > We'll never know why unless you can give a testcase to reproduce this issue.
> 
> Even better would be a bug report.

Yes...  With -ftime-report info, to start with.


Segher


* Re: gcc9 snapshot 20190414 is 30x slower than gcc 6.3
  2019-04-17  8:38     ` Jonathan Wakely
  2019-04-17  9:07       ` Segher Boessenkool
@ 2019-04-18 16:06       ` Jason Mancini
  2019-04-18 19:07         ` Segher Boessenkool
  1 sibling, 1 reply; 24+ messages in thread
From: Jason Mancini @ 2019-04-18 16:06 UTC (permalink / raw)
  To: Jonathan Wakely, gcc-help

The root cause is between 7.1 and 7.2!  7.1 is fast, 7.2 is slow.
Bisected, and found it's due to revision 249333 on gcc-7-branch.
Here's the commit log and -ftime-report output.  Where do we
go from here?  Thanks!  -JasonM

------------------------------------------------------------------------
r249333 | jason | 2017-06-16 19:34:15 -0700 (Fri, 16 Jun 2017) | 6 lines

        PR c++/81045 - Wrong type-dependence with auto return type.

        * pt.c (type_dependent_expression_p): An undeduced auto outside the
        template isn't dependent.
        * call.c (build_over_call): Instantiate undeduced auto even in a
        template.
------------------------------------------------------------------------

Execution times (seconds)
 phase setup             :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall    1385 kB ( 0%) ggc
 phase parsing           :1732.55 (99%) usr  50.00 (97%) sys 1782.71 (99%) wall 133075852 kB (99%) ggc
 phase lang. deferred    :  10.14 ( 1%) usr   1.33 ( 3%) sys  11.46 ( 1%) wall  494433 kB ( 0%) ggc
 phase opt and generate  :   4.63 ( 0%) usr   0.45 ( 1%) sys   5.09 ( 0%) wall  237946 kB ( 0%) ggc
 |name lookup            :   5.94 ( 0%) usr   1.76 ( 3%) sys   7.75 ( 0%) wall  220059 kB ( 0%) ggc
 |overload resolution    :  25.11 ( 1%) usr   3.98 ( 8%) sys  29.70 ( 2%) wall 2052677 kB ( 2%) ggc
 garbage collection      : 197.31 (11%) usr   8.98 (17%) sys 206.31 (11%) wall       0 kB ( 0%) ggc
 dump files              :   0.04 ( 0%) usr   0.04 ( 0%) sys   0.09 ( 0%) wall       0 kB ( 0%) ggc
 callgraph construction  :   0.77 ( 0%) usr   0.10 ( 0%) sys   0.94 ( 0%) wall   95588 kB ( 0%) ggc
 callgraph optimization  :   0.04 ( 0%) usr   0.02 ( 0%) sys   0.04 ( 0%) wall      64 kB ( 0%) ggc
 ipa dead code removal   :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall       0 kB ( 0%) ggc
 ipa inheritance graph   :   0.03 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall     719 kB ( 0%) ggc
 ipa inlining heuristics :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall       0 kB ( 0%) ggc
 cfg construction        :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall     248 kB ( 0%) ggc
 cfg cleanup             :   0.03 ( 0%) usr   0.00 ( 0%) sys   0.03 ( 0%) wall      53 kB ( 0%) ggc
 trivially dead code     :   0.03 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall       0 kB ( 0%) ggc
 df scan insns           :   0.11 ( 0%) usr   0.01 ( 0%) sys   0.14 ( 0%) wall      81 kB ( 0%) ggc
 df live regs            :   0.03 ( 0%) usr   0.00 ( 0%) sys   0.04 ( 0%) wall       0 kB ( 0%) ggc
 df reg dead/unused notes:   0.00 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall    2290 kB ( 0%) ggc
 register information    :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall       0 kB ( 0%) ggc
 alias analysis          :   0.09 ( 0%) usr   0.01 ( 0%) sys   0.07 ( 0%) wall     797 kB ( 0%) ggc
 rebuild jump labels     :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall       0 kB ( 0%) ggc
 preprocessing           :   0.57 ( 0%) usr   0.89 ( 2%) sys   1.32 ( 0%) wall   74289 kB ( 0%) ggc
 parser (global)         :   1.22 ( 0%) usr   1.10 ( 2%) sys   2.43 ( 0%) wall  202556 kB ( 0%) ggc
 parser struct body      :1459.61 (84%) usr  29.08 (56%) sys 1489.75 (83%) wall 122435248 kB (91%) ggc
 parser enumerator list  :   0.05 ( 0%) usr   0.00 ( 0%) sys   0.04 ( 0%) wall    2847 kB ( 0%) ggc
 parser function body    :   0.61 ( 0%) usr   0.18 ( 0%) sys   0.72 ( 0%) wall   29874 kB ( 0%) ggc
 parser inl. func. body  :   0.14 ( 0%) usr   0.07 ( 0%) sys   0.22 ( 0%) wall   10547 kB ( 0%) ggc
 parser inl. meth. body  :   3.86 ( 0%) usr   0.68 ( 1%) sys   4.39 ( 0%) wall  239905 kB ( 0%) ggc
 template instantiation  :  79.00 ( 5%) usr  10.23 (20%) sys  88.55 ( 5%) wall 10574851 kB ( 8%) ggc
 early inlining heuristics:   0.00 ( 0%) usr   0.01 ( 0%) sys   0.00 ( 0%) wall       3 kB ( 0%) ggc
 inline parameters       :   0.05 ( 0%) usr   0.01 ( 0%) sys   0.04 ( 0%) wall    1531 kB ( 0%) ggc
 tree gimplify           :   0.28 ( 0%) usr   0.02 ( 0%) sys   0.31 ( 0%) wall   21735 kB ( 0%) ggc
 tree eh                 :   0.02 ( 0%) usr   0.02 ( 0%) sys   0.04 ( 0%) wall    3173 kB ( 0%) ggc
 tree CFG construction   :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.04 ( 0%) wall    4453 kB ( 0%) ggc
 tree CFG cleanup        :   0.02 ( 0%) usr   0.00 ( 0%) sys   0.06 ( 0%) wall      24 kB ( 0%) ggc
 tree PHI insertion      :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall     732 kB ( 0%) ggc
 tree SSA rewrite        :   0.03 ( 0%) usr   0.00 ( 0%) sys   0.03 ( 0%) wall    2542 kB ( 0%) ggc
 tree SSA other          :   0.03 ( 0%) usr   0.01 ( 0%) sys   0.05 ( 0%) wall     258 kB ( 0%) ggc
 tree SSA incremental    :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall       0 kB ( 0%) ggc
 tree operand scan       :   1.17 ( 0%) usr   0.04 ( 0%) sys   1.18 ( 0%) wall    6878 kB ( 0%) ggc
 dominance computation   :   0.05 ( 0%) usr   0.00 ( 0%) sys   0.04 ( 0%) wall       0 kB ( 0%) ggc
 out of ssa              :   0.03 ( 0%) usr   0.01 ( 0%) sys   0.03 ( 0%) wall     174 kB ( 0%) ggc
 expand vars             :   0.09 ( 0%) usr   0.01 ( 0%) sys   0.09 ( 0%) wall    3785 kB ( 0%) ggc
 expand                  :   0.26 ( 0%) usr   0.02 ( 0%) sys   0.26 ( 0%) wall   22606 kB ( 0%) ggc
 post expand cleanups    :   0.02 ( 0%) usr   0.00 ( 0%) sys   0.02 ( 0%) wall    1994 kB ( 0%) ggc
 varconst                :   0.26 ( 0%) usr   0.08 ( 0%) sys   0.32 ( 0%) wall     164 kB ( 0%) ggc
 jump                    :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.02 ( 0%) wall       0 kB ( 0%) ggc
 loop init               :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.02 ( 0%) wall    1375 kB ( 0%) ggc
 mode switching          :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.02 ( 0%) wall       0 kB ( 0%) ggc
 integrated RA           :   0.63 ( 0%) usr   0.01 ( 0%) sys   0.69 ( 0%) wall   53889 kB ( 0%) ggc
 LRA non-specific        :   0.17 ( 0%) usr   0.03 ( 0%) sys   0.19 ( 0%) wall     279 kB ( 0%) ggc
 LRA virtuals elimination:   0.03 ( 0%) usr   0.00 ( 0%) sys   0.04 ( 0%) wall    1039 kB ( 0%) ggc
 LRA reload inheritance  :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.02 ( 0%) wall       0 kB ( 0%) ggc
 LRA create live ranges  :   0.02 ( 0%) usr   0.00 ( 0%) sys   0.02 ( 0%) wall     137 kB ( 0%) ggc
 LRA hard reg assignment :   0.02 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall       0 kB ( 0%) ggc
 reload                  :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall       0 kB ( 0%) ggc
 thread pro- & epilogue  :   0.10 ( 0%) usr   0.02 ( 0%) sys   0.03 ( 0%) wall    2859 kB ( 0%) ggc
 shorten branches        :   0.06 ( 0%) usr   0.00 ( 0%) sys   0.07 ( 0%) wall       0 kB ( 0%) ggc
 reg stack               :   0.00 ( 0%) usr   0.01 ( 0%) sys   0.00 ( 0%) wall       0 kB ( 0%) ggc
 final                   :   0.14 ( 0%) usr   0.00 ( 0%) sys   0.16 ( 0%) wall    3775 kB ( 0%) ggc
 symout                  :   0.06 ( 0%) usr   0.04 ( 0%) sys   0.12 ( 0%) wall       0 kB ( 0%) ggc
 initialize rtl          :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall      12 kB ( 0%) ggc
 rest of compilation     :   0.15 ( 0%) usr   0.05 ( 0%) sys   0.18 ( 0%) wall    4640 kB ( 0%) ggc
 TOTAL                 :1747.32            51.80          1799.28           133809627 kB


* Re: gcc9 snapshot 20190414 is 30x slower than gcc 6.3
  2019-04-18 16:06       ` Jason Mancini
@ 2019-04-18 19:07         ` Segher Boessenkool
  2019-04-18 19:38           ` Jeff Law
  0 siblings, 1 reply; 24+ messages in thread
From: Segher Boessenkool @ 2019-04-18 19:07 UTC (permalink / raw)
  To: Jason Mancini; +Cc: Jonathan Wakely, gcc-help

On Thu, Apr 18, 2019 at 04:06:17PM +0000, Jason Mancini wrote:
> The root cause is between 7.1 and 7.2!  7.1 is fast, 7.2 is slow.
> Bisected, and found it's due to revision 249333 on gcc-7-branch.
> Here's the commit log and -ftime-report output.  Where do we
> go from here?  Thanks!  -JasonM

Please open a bug report?  https://gcc.gnu.org/bugzilla

Thanks,


Segher


* Re: gcc9 snapshot 20190414 is 30x slower than gcc 6.3
  2019-04-18 19:07         ` Segher Boessenkool
@ 2019-04-18 19:38           ` Jeff Law
  2019-04-19  2:03             ` Jason Mancini
                               ` (2 more replies)
  0 siblings, 3 replies; 24+ messages in thread
From: Jeff Law @ 2019-04-18 19:38 UTC (permalink / raw)
  To: Segher Boessenkool, Jason Mancini; +Cc: Jonathan Wakely, gcc-help

On 4/18/19 1:07 PM, Segher Boessenkool wrote:
> On Thu, Apr 18, 2019 at 04:06:17PM +0000, Jason Mancini wrote:
>> The root cause is between 7.1 and 7.2!  7.1 is fast, 7.2 is slow.
>> Bisected, and found it's due to revision 249333 on gcc-7-branch.
>> Here's the commit log and -ftime-report output.  Where do we
>> go from here?  Thanks!  -JasonM
> 
> Please open a bug report?  https://gcc.gnu.org/bugzilla
With a testcase!

jeff


* Re: gcc9 snapshot 20190414 is 30x slower than gcc 6.3
  2019-04-18 19:38           ` Jeff Law
@ 2019-04-19  2:03             ` Jason Mancini
  2019-04-22 22:01             ` Jason Mancini
  2019-04-23  0:18             ` Jason Mancini
  2 siblings, 0 replies; 24+ messages in thread
From: Jason Mancini @ 2019-04-19  2:03 UTC (permalink / raw)
  To: Jeff Law, Segher Boessenkool; +Cc: Jonathan Wakely, gcc-help

Working on getting a testcase prepared to file a bug report.  The trimmed-down case shows a 6x degradation instead of 45x at -O0, but hopefully that's enough to pinpoint the reason.  (15 vs 85 seconds.)  Seems like some O(n^2) behavior: 15s becomes 90s, but 40s becomes 30m.

I used "gcc -E" to generate the output.  Is that what we're looking for here?  I'm not familiar with *.ii files (or is that the typical extension used for preprocessor output?)

Thanks!


* Re: gcc9 snapshot 20190414 is 30x slower than gcc 6.3
  2019-04-18 19:38           ` Jeff Law
  2019-04-19  2:03             ` Jason Mancini
@ 2019-04-22 22:01             ` Jason Mancini
  2019-04-22 22:17               ` Jason Mancini
  2019-04-23  0:18             ` Jason Mancini
  2 siblings, 1 reply; 24+ messages in thread
From: Jason Mancini @ 2019-04-22 22:01 UTC (permalink / raw)
  To: gcc-help

On gcc trunk, the performance culprit is at gcc/cp/call.c function build_over_call at:

if (undeduced_auto_decl (fn))
  mark_used (fn, complain); // <= this guy from gcc-7-branch r249333
else
  /* Otherwise set TREE_USED for the benefit of -Wunused-function. See PR80598.  */
  TREE_USED (fn) = 1;

I'm still working on a code sample.  The code sample has to be large to tickle the issue so far.


* Re: gcc9 snapshot 20190414 is 30x slower than gcc 6.3
  2019-04-22 22:01             ` Jason Mancini
@ 2019-04-22 22:17               ` Jason Mancini
  0 siblings, 0 replies; 24+ messages in thread
From: Jason Mancini @ 2019-04-22 22:17 UTC (permalink / raw)
  To: gcc-help

> On gcc trunk, the performance culprit is at gcc/cp/call.c function build_over_call at:
>
> if (undeduced_auto_decl (fn))
>   mark_used (fn, complain); // <= this guy from gcc-7-branch r249333
> else
>   /* Otherwise set TREE_USED for the benefit of -Wunused-function. See PR80598.  */
>   TREE_USED (fn) = 1;

mark_used is only called 1260 times, but inflates run time from ~13 to ~81 seconds for one sample.
The calls to mark_used aren't expensive, so they must be triggering a down-stream effect.


* Re: gcc9 snapshot 20190414 is 30x slower than gcc 6.3
  2019-04-18 19:38           ` Jeff Law
  2019-04-19  2:03             ` Jason Mancini
  2019-04-22 22:01             ` Jason Mancini
@ 2019-04-23  0:18             ` Jason Mancini
  2019-04-23 12:58               ` Jonathan Wakely
  2019-04-29 20:33               ` Jason Mancini
  2 siblings, 2 replies; 24+ messages in thread
From: Jason Mancini @ 2019-04-23  0:18 UTC (permalink / raw)
  To: gcc-help

We've determined that the gcc perf drop is due to use of decltype(auto) as the return type for template functions.  Replacing with a known type or func(...) -> decltype(...) trailing type syntax seems to avoid the performance issue.
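
For readers who have not met the two spellings mentioned above, here is a minimal sketch of each (C++14 or later). The names Box, get_deduced and get_trailing are made up for illustration and do not come from the code base under discussion; the sketch only shows that both forms yield the same return type and says nothing about compile time.

```cpp
#include <cassert>
#include <type_traits>
#include <utility>

// Hypothetical type, for illustration only.
struct Box {
    int value;
    int& get() { return value; }
};

// Form 1: decltype(auto). The return type is deduced from the return
// statement, so the body has to be instantiated to learn the type.
template <typename T>
decltype(auto) get_deduced(T& t) { return t.get(); }

// Form 2: trailing return type. The type is spelled out in the
// declaration and is known without looking at the body.
template <typename T>
auto get_trailing(T& t) -> decltype(t.get()) { return t.get(); }

// Both forms return int& for Box.
static_assert(std::is_same<decltype(get_deduced(std::declval<Box&>())), int&>::value, "");
static_assert(std::is_same<decltype(get_trailing(std::declval<Box&>())), int&>::value, "");
```

Either function can be called the same way, e.g. get_deduced(b) or get_trailing(b) for a Box b.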


* Re: gcc9 snapshot 20190414 is 30x slower than gcc 6.3
  2019-04-23  0:18             ` Jason Mancini
@ 2019-04-23 12:58               ` Jonathan Wakely
  2019-04-29 20:33               ` Jason Mancini
  1 sibling, 0 replies; 24+ messages in thread
From: Jonathan Wakely @ 2019-04-23 12:58 UTC (permalink / raw)
  To: Jason Mancini; +Cc: gcc-help

On Tue, 23 Apr 2019 at 01:18, Jason Mancini <jayrusman@hotmail.com> wrote:
>
> We've determined that the gcc perf drop is due to use of decltype(auto) as the return type for template functions.  Replacing with a known type or func(...) -> decltype(...) trailing type syntax seems to avoid the performance issue.

Is the code you're compiling the same in all cases, meaning
decltype(auto) is faster with GCC 6 than later releases?

Or are you only using decltype(auto) with the later releases?

It's possible that GCC 7 and later fixes some bugs in the handling of
decltype(auto) which makes it slower than GCC 6.

It's unsurprising that decltype(auto) requires the compiler to do more
work, but ideally that work wouldn't make compilation exponentially
slower, even with less buggy behaviour than in older releases.


* Re: gcc9 snapshot 20190414 is 30x slower than gcc 6.3
  2019-04-23  0:18             ` Jason Mancini
  2019-04-23 12:58               ` Jonathan Wakely
@ 2019-04-29 20:33               ` Jason Mancini
  2019-05-01 13:31                 ` C11, <stdatomic.h> and atomic pointers Chris Hall
  1 sibling, 1 reply; 24+ messages in thread
From: Jason Mancini @ 2019-04-29 20:33 UTC (permalink / raw)
  To: gcc-help

Jason Mancini said:
> We've determined that the gcc perf drop is due to use of decltype(auto) as the
> return type for template functions.  Replacing with a known type or func(...) -> decltype(...)
> trailing type syntax seems to avoid the performance issue.

I misspoke here.  Turns out that the above replacement made everything equally slow.
So the performance bug was lurking in there, and gcc-7-branch r249333 exposed it more.
Yeah yeah, we need to get an offending code blob cleaned up, approved, and bug filed.
I've been using gcc9 snapshots with part of r249333 reverted in the meantime to make
forward progress vetting gcc9 on our code base (no other problems so far!)
Jason


* C11, <stdatomic.h> and atomic pointers
  2019-04-29 20:33               ` Jason Mancini
@ 2019-05-01 13:31                 ` Chris Hall
  2019-05-01 14:15                   ` Martin Sebor
  2019-06-20 11:28                   ` __STDC_NO_THREADS__ and late model gcc/glibc Chris Hall
  0 siblings, 2 replies; 24+ messages in thread
From: Chris Hall @ 2019-05-01 13:31 UTC (permalink / raw)
  To: gcc-help

I find that:

   int * _Atomic foo ;
   int bar[12] ;

   foo = bar ;
   foo += 4 ;			// foo -> &bar[4] -- of course

   foo = bar ;
   atomic_fetch_add(&foo, 4) ; 	// foo -> &bar[1] -- ????

which, I confess, I did not quite expect.  (Happy I looked, though !)

So, I looked at the Standard:

   7.17.7.5 The atomic_fetch and modify generic functions

   1 The following operations perform arithmetic and bitwise
     computations. All of these operations are applicable to an
     object of any atomic integer type. None of these
     operations is applicable to atomic_bool.

Of course, "integer type" excludes pointers, so I guess what it does 
with pointers is undefined.

Should gcc be throwing a friendly warning here ?

But the Standard goes on to say:

   3 ... For signed integer types ... there are no undefined
     results. ...
     ... For address types, the result may be an undefined
     address, but the operations otherwise have no undefined
     behavior.

I don't know why it feels the need to mention "address types", given 
that they are not valid arguments ?  [I'm assuming that by "address 
types" it actually means "pointer types".  I can find no other mention 
of "address type".]

The "Synopsis" says:

   2 #include <stdatomic.h>
     <C> atomic_fetch_<key>(volatile <A> *object, <M> operand);
     <C> atomic_fetch_<key>_explicit(volatile <A> *object,
                               <M> operand, memory_order order);

and the meaning of <A>, <C> and <M> is given in "7.17.1 Introduction", 
as follows:

   5 In the following synopses:

      - An <A> refers to one of the atomic types.
      - A <C> refers to its corresponding non-atomic type.
      - An <M> refers to the type of the other argument for
        arithmetic operations. For atomic integer types, <M>
        is <C>. For atomic pointer types, <M> is ptrdiff_t.

As it happens, <M> is only used in the Synopsis for atomic_fetch_<key>... 
which is not defined for pointer types ?

I realise this is not really the place for discussion of the Standard, 
but I assume that what gcc does is based on some interpretation of it. 
Is there a good place to look for that interpretation ?

Chris

--------------------------

FWIW (1): the functions in gcc/glibc's <stdatomic.h> do *not* require 
the various <A> arguments to be atomic types... they are perfectly happy 
with ordinary types.  That doesn't seem right to me.

FWIW (2): the Standard (later in 7.17.7.5) says:

   5 NOTE The operation of the atomic_fetch and modify generic
     functions are nearly equivalent to the operation of the
     corresponding op= compound assignment operators. The only
     differences are that the compound assignment operators are
     not guaranteed to operate atomically, ...

Except that "6.5.16.2 Compound assignment" says:

   3 A compound assignment of the form E1 op= E2 ...
     ... If E1 has an atomic type, compound assignment is a
     read-modify-write operation with memory_order_seq_cst
     memory order semantics.

which looks like a flat contradiction to me.


* Re: C11, <stdatomic.h> and atomic pointers
  2019-05-01 13:31                 ` C11, <stdatomic.h> and atomic pointers Chris Hall
@ 2019-05-01 14:15                   ` Martin Sebor
  2019-05-02  7:54                     ` Jonathan Wakely
  2019-06-20 11:28                   ` __STDC_NO_THREADS__ and late model gcc/glibc Chris Hall
  1 sibling, 1 reply; 24+ messages in thread
From: Martin Sebor @ 2019-05-01 14:15 UTC (permalink / raw)
  To: Chris Hall, gcc-help

On 05/01/2019 07:31 AM, Chris Hall wrote:
> 
> I find that:
> 
>    int * _Atomic foo ;
>    int bar[12] ;
> 
>    foo = bar ;
>    foo += 4 ;                   // foo -> &bar[4] -- of course
> 
>    foo = bar ;
>    atomic_fetch_add(&foo, 4) ;  // foo -> &bar[1] -- ????
> 

I think this is bug 64843:
   https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64843

> which, I confess, I did not quite expect.  (Happy I looked, though !)
> 
> So, I looked at the Standard:
> 
>    7.17.7.5 The atomic_fetch and modify generic functions
> 
>    1 The following operations perform arithmetic and bitwise
>      computations. All of these operations are applicable to an
>      object of any atomic integer type. None of these
>      operations is applicable to atomic_bool.
> 
> Of course, "integer type" excludes pointers, so I guess what it does 
> with pointers is undefined.
> 
> Should gcc be throwing a friendly warning here ?
> 
> But the Standard goes on to say:
> 
>    3 ... For signed integer types ... there are no undefined
>      results. ...
>      ... For address types, the result may be an undefined
>      address, but the operations otherwise have no undefined
>      behavior.
> 
> I don't know why it feels the need to mention "address types", given
> that they are not valid arguments ?  [I'm assuming that by "address
> types" it actually means "pointer types".  I can find no other mention
> of "address type".]

Yes, that's a problem in the standard text that should be fixed.

> The "Synopsis" says:
> 
>    2 #include <stdatomic.h>
>      <C> atomic_fetch_<key>(volatile <A> *object, <M> operand);
>      <C> atomic_fetch_<key>_explicit(volatile <A> *object,
>                                <M> operand, memory_order order);
> 
> and the meaning of <A>, <C> and <M> is given in "7.17.1 Introduction", 
> as follows:
> 
>    5 In the following synopses:
> 
>       - An <A> refers to one of the atomic types.
>       - A <C> refers to its corresponding non-atomic type.
>       - An <M> refers to the type of the other argument for
>         arithmetic operations. For atomic integer types, <M>
>         is <C>. For atomic pointer types, <M> is ptrdiff_t.
> 
> As it happens, <M> is only used in the Synopsis for atomic_fetch_<key>... 
> which is not defined for pointer types ?
> 
> I realise this is not really the place for discussion of the Standard, 
> but I assume that what gcc does is based on some interpretation of it. 
> Is there a good place to look for that interpretation ?

There are C defect reports that GCC also considers.  Some may
already be incorporated, others are not.  C11 defect reports
are tracked here:
http://www.open-std.org/jtc1/sc22/wg14/www/docs/summary.htm

If the above isn't being tracked there or in the list below
we might want to write up a new issue for it and submit it to
WG14 to get it fixed in C2X.
http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2316.htm

Martin

> 
> Chris
> 
> --------------------------
> 
> FWIW (1): the functions in gcc/glibc's <stdatomic.h> do *not* require 
> the various <A> arguments to be atomic types... they are perfectly happy 
> with ordinary types.  That doesn't seem right to me.
> 
> FWIW (2): the Standard (later in 7.17.7.5) says:
> 
>    5 NOTE The operation of the atomic_fetch and modify generic
>      functions are nearly equivalent to the operation of the
>      corresponding op= compound assignment operators. The only
>      differences are that the compound assignment operators are
>      not guaranteed to operate atomically, ...
> 
> Except that "6.5.16.2 Compound assignment" says:
> 
>    3 A compound assignment of the form E1 op= E2 ...
>      ... If E1 has an atomic type, compound assignment is a
>      read-modify-write operation with memory_order_seq_cst
>      memory order semantics.
> 
> which looks like a flat contradiction to me.


* Re: C11, <stdatomic.h> and atomic pointers
  2019-05-01 14:15                   ` Martin Sebor
@ 2019-05-02  7:54                     ` Jonathan Wakely
  0 siblings, 0 replies; 24+ messages in thread
From: Jonathan Wakely @ 2019-05-02  7:54 UTC (permalink / raw)
  To: Martin Sebor; +Cc: Chris Hall, gcc-help

The equivalent wording in the C++ standard was modified by
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2017/p0558r1.pdf to
remove the nonsense words "address types", and to clarify that using
the fetch_xxx functions for arithmetic on pointers is valid, except
for pointers to void and function pointers.
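
As a small illustration of the C++ side, the sketch below applies fetch_add to a std::atomic<int*>. The function name index_after_fetch_add is made up for this example; the array and the offset of 4 mirror the C snippet earlier in the thread. In C++ the operand is counted in elements, not bytes, so the pointer ends up at &bar[4].

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

// Hypothetical helper: starts an atomic pointer at bar[0], adds 4 with
// fetch_add, and reports which element the pointer refers to afterwards.
std::ptrdiff_t index_after_fetch_add() {
    static int bar[12];
    std::atomic<int*> foo{bar};

    // fetch_add returns the previous value and advances by 4 elements.
    int* old = foo.fetch_add(4);
    assert(old == bar);

    return foo.load() - bar;
}
```

Calling index_after_fetch_add() gives 4, matching what foo += 4 does on a plain pointer.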


* __STDC_NO_THREADS__ and late model gcc/glibc
  2019-05-01 13:31                 ` C11, <stdatomic.h> and atomic pointers Chris Hall
  2019-05-01 14:15                   ` Martin Sebor
@ 2019-06-20 11:28                   ` Chris Hall
  2019-06-20 12:17                     ` Jonathan Wakely
  2020-01-06 15:09                     ` Function returning struct on x86_64 (at least) Chris Hall
  1 sibling, 2 replies; 24+ messages in thread
From: Chris Hall @ 2019-06-20 11:28 UTC (permalink / raw)
  To: gcc-help

I find that gcc 4.9.0 onwards defines __STDC_NO_THREADS__ to be '1', 
denying support for <threads.h>.  Nevertheless, it does support 
_Thread_local.

As of glibc 2.28, <threads.h> appears in the library.  I guess that 
means some version of gcc will no longer set __STDC_NO_THREADS__ ?

On gcc.godbolt.org, I find that __STDC_NO_THREADS__ is defined for all 
versions up to and including the "trunk".

But on my machine, __STDC_NO_THREADS__ is no longer set by gcc v9.1.1, 
and no longer set by gcc v7.2.0 (which I just built on my machine).

I note that gcc.godbolt.org has glibc 2.27, while my machine has glibc 
2.28.  I'm guessing that something, somewhere is taking that into account.

I have looked everywhere I can think of to find where 
__STDC_NO_THREADS__ is configured... but to no avail :-(

Does anyone know what I should expect, or where I should look to find 
out, please ?

Thanks,

Chris

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: __STDC_NO_THREADS__ and late model gcc/glibc
  2019-06-20 11:28                   ` __STDC_NO_THREADS__ and late model gcc/glibc Chris Hall
@ 2019-06-20 12:17                     ` Jonathan Wakely
  2019-06-21  9:39                       ` Chris Hall
  2020-01-06 15:09                     ` Function returning struct on x86_64 (at least) Chris Hall
  1 sibling, 1 reply; 24+ messages in thread
From: Jonathan Wakely @ 2019-06-20 12:17 UTC (permalink / raw)
  To: Chris Hall; +Cc: gcc-help

On Thu, 20 Jun 2019 at 12:28, Chris Hall wrote:
>
>
> I find that gcc 4.9.0 onwards defines __STDC_NO_THREADS__ to be '1',
> denying support for <threads.h>.  Nevertheless, it does support
> _Thread_local.
>
> As of glibc 2.28, <threads.h> appears in the library.  I guess that
> means some version of gcc will no longer set __STDC_NO_THREADS__ ?
>
> On gcc.godbolt.org, I find that __STDC_NO_THREADS__ is defined for all
> versions up to and including the "trunk".
>
> But on my machine, __STDC_NO_THREADS__ is no longer set by gcc v9.1.1,
> and no longer set by gcc v7.2.0 (which I just built on my machine).
>
> I note that gcc.godbolt.org has glibc 2.27, while my machine has glibc
> 2.28.  I'm guessing that something, somewhere is taking that into account.
>
> I have looked everywhere I can think of to find where
> __STDC_NO_THREADS__ is configured... but to no avail :-(
>
> Does anyone know what I should expect, or where I should look to find
> out, please ?

Glibc provides it in the /usr/include/stdc-predef.h file which is
implicitly pre-included by GCC.
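
A quick way to see what a particular gcc/glibc pairing ends up with is to test the macros from ordinary C. The following is only a sketch: the function names are invented for this example, and _STDC_PREDEF_H is assumed to be the include guard of glibc's stdc-predef.h.

```c
#include <stdio.h>

/* Returns 1 if __STDC_NO_THREADS__ is defined for this translation unit. */
int no_threads_macro_defined(void)
{
#ifdef __STDC_NO_THREADS__
    return 1;
#else
    return 0;
#endif
}

/* Returns 1 if the include guard of stdc-predef.h is visible, which
   suggests the header was pre-included (assumed guard name). */
int predef_header_seen(void)
{
#ifdef _STDC_PREDEF_H
    return 1;
#else
    return 0;
#endif
}

/* Prints both findings; no <threads.h> is needed for the check itself. */
void report_thread_macros(void)
{
    printf("__STDC_NO_THREADS__ defined: %d\n", no_threads_macro_defined());
    printf("_STDC_PREDEF_H defined:      %d\n", predef_header_seen());
}
```

Calling report_thread_macros() from main() on two machines with different glibc versions should make any difference visible without touching the gcc build.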

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: __STDC_NO_THREADS__ and late model gcc/glibc
  2019-06-20 12:17                     ` Jonathan Wakely
@ 2019-06-21  9:39                       ` Chris Hall
  2019-06-21  9:51                         ` Jonathan Wakely
  0 siblings, 1 reply; 24+ messages in thread
From: Chris Hall @ 2019-06-21  9:39 UTC (permalink / raw)
  To: Jonathan Wakely; +Cc: gcc-help

On 20/06/2019 13:17, Jonathan Wakely wrote:
> On Thu, 20 Jun 2019 at 12:28, Chris Hall wrote:
...
>> I have looked everywhere I can think of to find where
>> __STDC_NO_THREADS__ is configured... but to no avail :-(
>>
>> Does anyone know what I should expect, or where I should look to find
>> out, please ?

> Glibc provides it in the /usr/include/stdc-predef.h file which is
> implicitly pre-included by GCC.

Ah ha !  Thank you.

I guess that accounts for the "predefined" _STDC_PREDEF_H.

AFAICS, the decision to "preinclude" <stdc-predef.h> is made when gcc is 
built, depending on the target.  Deep Magic of the First Magnitude.

Thanks,

Chris


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: __STDC_NO_THREADS__ and late model gcc/glibc
  2019-06-21  9:39                       ` Chris Hall
@ 2019-06-21  9:51                         ` Jonathan Wakely
  0 siblings, 0 replies; 24+ messages in thread
From: Jonathan Wakely @ 2019-06-21  9:51 UTC (permalink / raw)
  To: Chris Hall; +Cc: gcc-help

On Fri, 21 Jun 2019 at 10:39, Chris Hall wrote:
>
> On 20/06/2019 13:17, Jonathan Wakely wrote:
> > On Thu, 20 Jun 2019 at 12:28, Chris Hall wrote:
> ...
> >> I have looked everywhere I can think of to find where
> >> __STDC_NO_THREADS__ is configured... but to no avail :-(
> >>
> >> Does anyone know what I should expect, or where I should look to find
> >> out, please ?
>
> > Glibc provides it in the /usr/include/stdc-predef.h file which is
> > implicitly pre-included by GCC.
>
> Ah ha !  Thank you.
>
> I guess that accounts for the "predefined" _STDC_PREDEF_H.
>
> AFAICS, the decision to "preinclude" <stdc-predef.h> is made when gcc is
> built, depending on the target.

Right, see gcc/config/glibc-c.c

>  Deep Magic of the First Magnitude.

:-)

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Function returning struct on x86_64 (at least)
  2019-06-20 11:28                   ` __STDC_NO_THREADS__ and late model gcc/glibc Chris Hall
  2019-06-20 12:17                     ` Jonathan Wakely
@ 2020-01-06 15:09                     ` Chris Hall
  2020-01-06 15:19                       ` Marc Glisse
  1 sibling, 1 reply; 24+ messages in thread
From: Chris Hall @ 2020-01-06 15:09 UTC (permalink / raw)
  To: gcc-help


I hoped to do something "clever" with a function of the form:

   typedef struct { char s[64] ; } qerr_str_t ;

   extern qerr_str_t
   qerrst0(int err)
   {
     qerr_str_t st ;

     snprintf(st.s, sizeof(st.s), "errno=%d", err) ;

     return st ;
   }

but was disappointed to find that this compiles (gcc 8.3 and others, 
-O2) to this:

   .LC0:
         .string "errno=%d"
   qerrst0:
         pushq   %rbx
         movl    %esi, %ecx
         movq    %rdi, %rbx
         movl    $.LC0, %edx
         movl    $64, %esi
         xorl    %eax, %eax
         subq    $64, %rsp
         movq    %rsp, %rdi
         call    snprintf
         movdqa  (%rsp), %xmm0
         movq    %rbx, %rax
         movdqa  16(%rsp), %xmm1
         movdqa  32(%rsp), %xmm2
         movdqa  48(%rsp), %xmm3
         movups  %xmm0, (%rbx)
         movups  %xmm1, 16(%rbx)
         movups  %xmm2, 32(%rbx)
         movups  %xmm3, 48(%rbx)
         addq    $64, %rsp
         popq    %rbx
         ret

On reflection, the compiler is playing safe and not writing to whatever 
the "hidden" pointer %rdi is pointing at, until the implicit assignment. 
  So I have no right to be disappointed.

The object of the exercise is to create temporary strings for use like this:

   int
   main(int argc, char* argv[])
   {
     printf("%s: %s\n", argv[0], qerrst0(argc).s) ;
   }

where the "hidden" pointer passed to qerrst0() does not, in fact, point 
to anything accessible.  Sadly, even when qerrst0() is inlined, I find:

   .LC0:
         .string "errno=%d"
   .LC1:
         .string "%s: %s\n"
   main:
         pushq   %rbx
         movl    %edi, %ecx
         movq    %rsi, %rbx
         movl    $.LC0, %edx
         movl    $64, %esi
         xorl    %eax, %eax
         addq    $-128, %rsp
         leaq    64(%rsp), %rdi
         call    snprintf
         movdqa  64(%rsp), %xmm0
         movq    (%rbx), %rsi
         xorl    %eax, %eax
         movdqa  80(%rsp), %xmm1
         movdqa  96(%rsp), %xmm2
         movq    %rsp, %rdx
         movl    $.LC1, %edi
         movdqa  112(%rsp), %xmm3
         movaps  %xmm0, (%rsp)
         movaps  %xmm1, 16(%rsp)
         movaps  %xmm2, 32(%rsp)
         movaps  %xmm3, 48(%rsp)
         call    printf
         subq    $-128, %rsp
         xorl    %eax, %eax
         popq    %rbx
         ret

where there is still an (unnecessary) assignment going on !

I tried something simpler:

   extern qerr_str_t
   qerrst1(int err)
   {
     qerr_str_t st ;

     st.s[0] = err ;

     return st ;
   }

which compiles to:

   qerrst1:
         movq    %rdi, %rax
         movb    %sil, (%rdi)
         ret

...so a trivial case optimises as one might hope.

As does:

   extern qerr_str_t
   qerrst2(int err)
   {
     qerr_str_t st ;
     char* q = st.s ;

     q[0]  = err ;
     q[63] = err ;

     return st ;
   }

   qerrst2:
         movq    %rdi, %rax
         movb    %sil, (%rdi)
         movb    %sil, 63(%rdi)
         ret

The following are also optimised:

   extern qerr_str_t
   qerrst3a(int err)
   {
     qerr_str_t st = { "" } ;

     return st ;
   }

   extern qerr_str_t
   qerrst3b(int err)
   {
     qerr_str_t st ;
     char* q = st.s ;

     memset(q, 0, sizeof(st.s)) ;

     return st ;
   }

to the same code:

   qerrst3a/b:
         pxor    %xmm0, %xmm0
         movq    %rdi, %rax
         movups  %xmm0, (%rdi)
         movups  %xmm0, 16(%rdi)
         movups  %xmm0, 32(%rdi)
         movups  %xmm0, 48(%rdi)
         ret

However, ever so slightly more complicated:

   extern qerr_str_t
   qerrst4(int err)
   {
     qerr_str_t st ;

     for (int i = 0 ; i < (err & 63) ; ++i)
       st.s[i] = err - i ;

     return st ;
   }

   qerrst4:
         movl    %esi, %edx
         movq    %rdi, %rax
         andl    $63, %edx
         je      .L12
         subl    $1, %edx
         leaq    -71(%rsp,%rdx), %r8
         leaq    -72(%rsp), %rdx
         addl    %edx, %esi
   .L11:
         movl    %esi, %ecx
         subl    %edx, %ecx
         addq    $1, %rdx
         movb    %cl, -1(%rdx)
         cmpq    %r8, %rdx
         jne     .L11
   .L12:
         movdqa  -72(%rsp), %xmm0
         movdqa  -56(%rsp), %xmm1
         movdqa  -40(%rsp), %xmm2
         movdqa  -24(%rsp), %xmm3
         movups  %xmm0, (%rax)
         movups  %xmm1, 16(%rax)
         movups  %xmm2, 32(%rax)
         movups  %xmm3, 48(%rax)
         ret

Which is a puzzle :-(

Interestingly, I also found (after a little effort):

   extern qerr_str_t
   qerrst5(int err, char* fred)
   {
     qerr_str_t st ;

     st.s[ 0] = err ;
     st.s[ 2] = fred[ 8] ;
     st.s[ 4] = fred[ 6] ;
     st.s[ 6] = fred[ 4] ;
     st.s[ 8] = fred[ 2] ;
     st.s[10] = fred[ 0] ;

     return st ;
   }

   qerrst5:
         movq    %rdi, %rax
         movzbl  8(%rdx), %r9d
         movzbl  6(%rdx), %r8d
         movzbl  4(%rdx), %edi
         movzbl  2(%rdx), %ecx
         movb    %sil, (%rax)	-- BUG iff %rax ==
         movzbl  (%rdx), %edx	--                 %rdx !
         movb    %r9b, 2(%rax)
         movb    %r8b, 4(%rax)
         movb    %dil, 6(%rax)
         movb    %cl, 8(%rax)
         movb    %dl, 10(%rax)
         ret

which is very nearly correct... except as noted, if *fred points at the 
final destination !!

For this to do what I had hoped (and I imagine is the majority case), 
what is needed is a way to mark the declaration of 'qerr_str_t st' in 
the function as a "clone" of the final destination 'qerr_str_t' in the 
caller -- so that the compiler could Just Do It.

I looked for an __attribute__(()) for this... but could not find one.

Is there any way in which I can persuade the compiler that a function 
returning a struct does not need to worry about preserving the value of 
the final destination (ie the struct at %rdi) ?

Chris

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Function returning struct on x86_64 (at least)
  2020-01-06 15:09                     ` Function returning struct on x86_64 (at least) Chris Hall
@ 2020-01-06 15:19                       ` Marc Glisse
  2020-01-07 16:20                         ` Chris Hall
  0 siblings, 1 reply; 24+ messages in thread
From: Marc Glisse @ 2020-01-06 15:19 UTC (permalink / raw)
  To: Chris Hall; +Cc: gcc-help

On Mon, 6 Jan 2020, Chris Hall wrote:

[description of NRVO]

> Is there any way in which I can persuade the compiler that a function 
> returning a struct does not need to worry about preserving the value of the 
> final destination (ie the struct at %rdi) ?

Compile the file as C++ instead of C. Not that it would be forbidden in C, 
but the optimization happens to be in the C++ front-end. There is also an 
optimization pass called nrv, but it does not trigger that often.

-- 
Marc Glisse

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Function returning struct on x86_64 (at least)
  2020-01-06 15:19                       ` Marc Glisse
@ 2020-01-07 16:20                         ` Chris Hall
  0 siblings, 0 replies; 24+ messages in thread
From: Chris Hall @ 2020-01-07 16:20 UTC (permalink / raw)
  To: gcc-help, Marc Glisse

On 06/01/2020 15:19, Marc Glisse wrote:
> On Mon, 6 Jan 2020, Chris Hall wrote:
> 
> [description of NRVO]
> 
>> Is there any way in which I can persuade the compiler that a function 
>> returning a struct does not need to worry about preserving the value 
>> of the final destination (ie the struct at %rdi) ?

> Compile the file as C++ instead of C. Not that it would be forbidden in 
> C, but the optimization happens to be in the C++ front-end. There is 
> also an optimization pass called nrv, but it does not trigger that often.

The idea of trying to write as-standard-as-possible C11, and then 
compiling it as C++ makes me queasy :-(

As far as I can see, tree-nrv is enabled.  Some of the functions do 
write directly to %rdi, so I guess that's being done by tree-nrv.  But, 
as you say, this optimization does not seem to be applied often.

I was puzzled by why the optimization would not be forbidden in C...

...clearly, the problem is that 'z = f(...) ;' is defined such that the 
result must be as if f() creates a temporary struct t, which is copied 
to z when f() completes.  If f() writes directly to z it must not then 
read from z -- which looks hard to guarantee, not least because only the 
*caller* knows what z is !!
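
The "as if via a temporary" requirement can be seen in a small, self-contained case. In this sketch (buf8_t, reversed and self_assign_is_safe are names invented for illustration) the argument points into the very object being assigned, so the function's reads and the final store overlap:

```c
#include <string.h>

typedef struct { char s[8]; } buf8_t;

/* Returns the 8 bytes at src in reverse order. */
buf8_t reversed(const char *src)
{
    buf8_t r;

    for (int i = 0; i < 8; ++i)
        r.s[i] = src[7 - i];

    return r;
}

/* Returns 1 if x = reversed(x.s) behaves as if the result was built
   in a temporary and only then copied to x. */
int self_assign_is_safe(void)
{
    buf8_t x;

    memcpy(x.s, "abcdefgh", 8);
    x = reversed(x.s);              /* source overlaps the destination */

    return memcmp(x.s, "hgfedcba", 8) == 0;
}
```

If reversed() stored straight into x while still reading from it, the second half of the result would be built from already overwritten bytes, giving "hgfeefgh" instead of "hgfedcba".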

Now, as I noted before, I found (after a little effort):

   typedef struct { char s[64] ; } qerrst_t ;

   extern qerrst_t
   qerrst5(int err, char* foo)
   {
     qerrst_t st ;

     st.s[ 0] = err ;
     st.s[ 2] = foo[ 8] ;
     st.s[ 4] = foo[ 6] ;
     st.s[ 6] = foo[ 4] ;
     st.s[ 8] = foo[ 2] ;
     st.s[10] = foo[ 0] ;

     return st ;
   }

   qerrst5:
         movq    %rdi, %rax
         movzbl  8(%rdx), %r9d
         movzbl  6(%rdx), %r8d
         movzbl  4(%rdx), %edi
         movzbl  2(%rdx), %ecx
         movb    %sil, (%rax)    -- BUG iff %rax ==
         movzbl  (%rdx), %edx    --                 %rdx !
         movb    %r9b, 2(%rax)
         movb    %r8b, 4(%rax)
         movb    %dil, 6(%rax)
         movb    %cl, 8(%rax)
         movb    %dl, 10(%rax)
         ret

which does *not* do a copy in the function, and which is very nearly 
correct... except as noted, if *foo points at the final destination !!

HOWEVER... when I tried this function the bug did NOT appear.  It turns 
out that the *caller* passes a pointer to a hidden struct and the 
*caller* copies that to the final destination !!!

The test I ran is:

   typedef struct { char s[64] ; } qerrst_t ;

   extern qerrst_t qerrst5(int err, char* foo) ;

   static void  __attribute__((noinline))
   show(const char* name, char* s)
   {
     printf("%s", name) ;
     for (int i = 0 ; i <= 10 ; ++i)
       printf(" %3d", (unsigned char)s[i]) ;
     printf("\n") ;
   }

   int
   main(Unused int argc, Unused char* argv[])
   {
     int err = argc ;
     qerrst_t x, y ;

     for (int i = 0 ; i < (int)sizeof(x.s) ; ++i)
       y.s[i] = x.s[i] = (char)(100 + i) ;

     y = qerrst5(err, x.s) ;
     show("y", y.s) ;

     x = qerrst5(err, x.s) ;
     show("x", x.s) ;

     return 0 ;
   }

and in a separate compilation unit (for completeness):

   extern qerrst_t
   qerrst5(int err, char* foo)
   {
     qerrst_t st ;

     st.s[ 0] = (char)err ;
     st.s[ 1] = -1 ;
     st.s[ 2] = foo[ 8] ;
     st.s[ 3] = -3 ;
     st.s[ 4] = foo[ 6] ;
     st.s[ 5] = -5 ;
     st.s[ 6] = foo[ 4] ;
     st.s[ 7] = -7 ;
     st.s[ 8] = foo[ 2] ;
     st.s[ 9] = -8 ;
     st.s[10] = foo[ 0] ;

     return st ;
   }

which compiles to:

   Dump of assembler code for function qerrst5:
    0x47fdd0 <+0>:     mov    %rdi,%rax
    0x47fdd3 <+3>:     movzbl 0x8(%rdx),%r8d   # read foo[8]
    0x47fdd8 <+8>:     movzbl 0x6(%rdx),%edi
    0x47fddc <+12>:    mov    %esi,%r9d
    0x47fddf <+15>:    movzbl 0x2(%rdx),%ecx
    0x47fde3 <+19>:    movzbl 0x4(%rdx),%esi
    0x47fde7 <+23>:    mov    %r9b,(%rax)      # write st.s[0]
    0x47fdea <+26>:    movb   $0xff,0x1(%rax)
    0x47fdee <+30>:    movb   $0xfd,0x3(%rax)
    0x47fdf2 <+34>:    movb   $0xfb,0x5(%rax)
    0x47fdf6 <+38>:    movb   $0xf9,0x7(%rax)
    0x47fdfa <+42>:    movb   $0xf8,0x9(%rax)
    0x47fdfe <+46>:    mov    %r8b,0x2(%rax)
    0x47fe02 <+50>:    mov    %dil,0x4(%rax)
    0x47fe06 <+54>:    mov    %sil,0x6(%rax)
    0x47fe0a <+58>:    mov    %cl,0x8(%rax)
    0x47fe0d <+61>:    movzbl (%rdx),%edx      # read foo[0]
                                               #  -- BUG if foo == st.s
    0x47fe10 <+64>:    mov    %dl,0xa(%rax)    # write st.s[10]
    0x47fe13 <+67>:    retq

And the result was:

   y   1 255 108 253 106 251 104 249 102 248 100
   x   1 255 108 253 106 251 104 249 102 248 100

SURPRISE !  I expected to see:

   x   1 255 108 253 106 251 104 249 102 248   1 <<< BUG

Looking at main() we see:

   Dump of assembler code for function main:
    0x4012a0 <+0>:     push   %rbp
    0x4012a1 <+1>:     mov    %edi,%esi
    0x4012a3 <+3>:     mov    %rsp,%rbp
    0x4012a6 <+6>:     push   %r12
    0x4012a8 <+8>:     mov    %edi,%r12d
    0x4012ab <+11>:    and    $0xffffffffffffffe0,%rsp
    0x4012af <+15>:    sub    $0xc0,%rsp
    0x4012b6 <+22>:    vmovaps 0x94642(%rip),%xmm0        # 0x495900
    0x4012be <+30>:    lea    0x40(%rsp),%rdx      # ->x
    0x4012c3 <+35>:    mov    %rsp,%rdi            # ->t
    0x4012c6 <+38>:    vmovaps %xmm0,0x40(%rsp)    # x0
    0x4012cc <+44>:    vmovaps %xmm0,0x80(%rsp)    # y0
    0x4012d5 <+53>:    vmovaps 0x94633(%rip),%xmm0        # 0x495910
    0x4012dd <+61>:    vmovaps %xmm0,0x50(%rsp)    # x1
    0x4012e3 <+67>:    vmovaps %xmm0,0x90(%rsp)    # y1
    0x4012ec <+76>:    vmovaps 0x9462c(%rip),%xmm0        # 0x495920
    0x4012f4 <+84>:    vmovaps %xmm0,0x60(%rsp)    # x2
    0x4012fa <+90>:    vmovaps %xmm0,0xa0(%rsp)    # y2
    0x401303 <+99>:    vmovaps 0x94625(%rip),%xmm0        # 0x495930
    0x40130b <+107>:   vmovaps %xmm0,0x70(%rsp)    # x3
    0x401311 <+113>:   vmovaps %xmm0,0xb0(%rsp)    # y3
    0x40131a <+122>:   callq  0x47fdd0 <qerrst5>
    0x40131f <+127>:   vmovups (%rsp),%xmm1        # t0
    0x401324 <+132>:   lea    0x80(%rsp),%rsi      # ->y
    0x40132c <+140>:   mov    $0x494b3e,%edi
    0x401331 <+145>:   vmovups 0x10(%rsp),%xmm2    # t1
    0x401337 <+151>:   vmovups 0x20(%rsp),%xmm3    # t2
    0x40133d <+157>:   vmovups 0x30(%rsp),%xmm4    # t3
    0x401343 <+163>:   vmovaps %xmm1,0x80(%rsp)    # y0
    0x40134c <+172>:   vmovaps %xmm2,0x90(%rsp)    # y1
    0x401355 <+181>:   vmovaps %xmm3,0xa0(%rsp)    # y2
    0x40135e <+190>:   vmovaps %xmm4,0xb0(%rsp)    # y3
    0x401367 <+199>:   callq  0x47fb60 <show>
    0x40136c <+204>:   lea    0x40(%rsp),%rdx      # ->x
    0x401371 <+209>:   mov    %r12d,%esi           # err
    0x401374 <+212>:   mov    %rsp,%rdi            # ->t
    0x401377 <+215>:   callq  0x47fdd0 <qerrst5>
    0x40137c <+220>:   vmovups (%rsp),%xmm5        # t0
    0x401381 <+225>:   lea    0x40(%rsp),%rsi      # ->x
    0x401386 <+230>:   mov    $0x4937c4,%edi
    0x40138b <+235>:   vmovups 0x10(%rsp),%xmm6    # t1
    0x401391 <+241>:   vmovups 0x20(%rsp),%xmm7    # t2
    0x401397 <+247>:   vmovups 0x30(%rsp),%xmm1    # t3
    0x40139d <+253>:   vmovaps %xmm5,0x40(%rsp)    # x0
    0x4013a3 <+259>:   vmovaps %xmm6,0x50(%rsp)    # x1
    0x4013a9 <+265>:   vmovaps %xmm7,0x60(%rsp)    # x2
    0x4013af <+271>:   vmovaps %xmm1,0x70(%rsp)    # x3
    0x4013b5 <+277>:   callq  0x47fb60 <show>
    0x4013ba <+282>:   xor    %eax,%eax
    0x4013bc <+284>:   mov    -0x8(%rbp),%r12
    0x4013c0 <+288>:   leaveq
    0x4013c1 <+289>:   retq

The caller is passing a pointer to a hidden 't' and then *itself* 
copying the result to the destination of the assignment !!

It looks like the caller is taking care of the problem, so a function 
returning a struct does not need to... surely ?

So I also tried:

   typedef struct { char s[64] ; } qerrst_t ;

   extern qerrst_t qerrst0(int err) ;

   int
   main(Unused int argc, Unused char* argv[])
   {
     int err = argc ;
     qerrst_t z ;

     printf("qerrst0()='%s'\n", qerrst0(err).s) ;

     z = qerrst0(err) ;
     printf("qerrst0()='%s'\n", z.s) ;

     return 0 ;
   }

and in a separate compilation unit (for completeness):

   extern qerrst_t
   qerrst0(int err)
   {
     qerrst_t st ;

     snprintf(st.s, sizeof(st.s), "errno=%d", err) ;

     return st ;
   }

which compiles to:

   Dump of assembler code for function qerrst0:
    0x47fc00 <+0>:     push   %r12
    0x47fc02 <+2>:     mov    %esi,%ecx
    0x47fc04 <+4>:     mov    %rdi,%r12
    0x47fc07 <+7>:     mov    $0x495910,%edx
    0x47fc0c <+12>:    sub    $0x40,%rsp
    0x47fc10 <+16>:    mov    $0x40,%esi
    0x47fc15 <+21>:    xor    %eax,%eax
    0x47fc17 <+23>:    mov    %rsp,%rdi
    0x47fc1a <+26>:    callq  0x4010b0 <snprintf@plt>
    0x47fc1f <+31>:    vmovaps (%rsp),%xmm0
    0x47fc24 <+36>:    mov    %r12,%rax
    0x47fc27 <+39>:    vmovaps 0x10(%rsp),%xmm1
    0x47fc2d <+45>:    vmovaps 0x20(%rsp),%xmm2
    0x47fc33 <+51>:    vmovaps 0x30(%rsp),%xmm3
    0x47fc39 <+57>:    vmovups %xmm0,(%r12)
    0x47fc3f <+63>:    vmovups %xmm1,0x10(%r12)
    0x47fc46 <+70>:    vmovups %xmm2,0x20(%r12)
    0x47fc4d <+77>:    vmovups %xmm3,0x30(%r12)
    0x47fc54 <+84>:    add    $0x40,%rsp
    0x47fc58 <+88>:    pop    %r12
    0x47fc5a <+90>:    retq

which, as before, creates a temporary, local struct which is copied to 
the return struct pointed to by %rdi.

And now we see:

   Dump of assembler code for function main:
    0x401280 <+0>:     push   %rbp
    0x401281 <+1>:     mov    %edi,%esi
    0x401283 <+3>:     mov    %rsp,%rbp
    0x401286 <+6>:     push   %r12
    0x401288 <+8>:     mov    %edi,%r12d
    0x40128b <+11>:    and    $0xffffffffffffffe0,%rsp
    0x40128f <+15>:    sub    $0xc0,%rsp
    0x401296 <+22>:    lea    0x80(%rsp),%rdi        # ->t
    0x40129e <+30>:    callq  0x47fc00 <qerrst0>     # qerrst0(err).s
    0x4012a3 <+35>:    lea    0x80(%rsp),%rsi
    0x4012ab <+43>:    mov    $0x495882,%edi
    0x4012b0 <+48>:    xor    %eax,%eax
    0x4012b2 <+50>:    callq  0x4010a0 <printf@plt>  # printf(..., t)
    0x4012b7 <+55>:    mov    %r12d,%esi
    0x4012ba <+58>:    mov    %rsp,%rdi              # ->t
    0x4012bd <+61>:    callq  0x47fc00 <qerrst0>     # z = qerrst0(err) ;
    0x4012c2 <+66>:    vmovups (%rsp),%xmm0          # t0
    0x4012c7 <+71>:    lea    0x40(%rsp),%rsi        # ->z
    0x4012cc <+76>:    mov    $0x495882,%edi
    0x4012d1 <+81>:    vmovups 0x10(%rsp),%xmm1      # t1
    0x4012d7 <+87>:    vmovups 0x20(%rsp),%xmm2      # t2
    0x4012dd <+93>:    xor    %eax,%eax
    0x4012df <+95>:    vmovups 0x30(%rsp),%xmm3      # t3
    0x4012e5 <+101>:   vmovaps %xmm0,0x40(%rsp)      # z0 )
    0x4012eb <+107>:   vmovaps %xmm1,0x50(%rsp)      # z1 ) copied from t
    0x4012f1 <+113>:   vmovaps %xmm2,0x60(%rsp)      # z2 )
    0x4012f7 <+119>:   vmovaps %xmm3,0x70(%rsp)      # z3 )
    0x4012fd <+125>:   callq  0x4010a0 <printf@plt>  # printf(..., z.s)
    0x401302 <+130>:   xor    %eax,%eax
    0x401304 <+132>:   mov    -0x8(%rbp),%r12
    0x401308 <+136>:   leaveq
    0x401309 <+137>:   retq

So for:

     printf("qerrst0()='%s'\n", qerrst0(err).s) ;

there is one (spurious) copy in qerrst0().

And for:

     z = qerrst0(err) ;
     printf("qerrst0()='%s'\n", z.s) ;

there is one (spurious) copy in qerrst0() AND a *second* copy in main().

Is it just me, or is this broken ?

So, I looked at the AMD64 ABI (Draft 0.99.7 – November 17, 2014 –
15:08), Section 3.2.3 Parameter Passing, p22:

   Returning of Values: ....

     2. If the type has class MEMORY, then the caller provides space
        for the return value and passes the address of this storage
        in %rdi as if it were the first argument to the function.
        In effect, this address becomes a “hidden” first argument.

        This storage must not overlap any data visible to the callee
        through other names than this argument.

So... the ABI appears to say that the callee does *not* need to do any 
copying *ever*.

This pushes the problem back to the caller.  If the caller can be sure 
that the final destination is not visible to the callee, it too can 
avoid copying.

So... why is the qerrst0() function doing a copy ?
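
One way to avoid depending on the optimizer at all, which nobody in this thread has proposed, is to let the caller own the storage explicitly. The sketch below is an assumption-laden illustration: qerrst_fill and QERRST are hypothetical names, and the temporary comes from a C99 compound literal instead of a struct returned by value.

```c
#include <stdio.h>

typedef struct { char s[64]; } qerrst_t;

/* Hypothetical helper: formats into caller-provided storage and
   returns the same pointer so the call can be used in an expression. */
qerrst_t *qerrst_fill(qerrst_t *out, int err)
{
    snprintf(out->s, sizeof(out->s), "errno=%d", err);
    return out;
}

/* Hypothetical convenience macro: the compound literal is the
   temporary; at block scope it lives until the enclosing block ends. */
#define QERRST(err) (qerrst_fill(&(qerrst_t){ { 0 } }, (err))->s)
```

With this, the original use case becomes printf("%s: %s\n", argv[0], QERRST(argc)) ; and no struct is returned by value, so there is no return-slot copy to worry about. The compound literal is zero-initialised, which has a cost of its own, so whether this is actually cheaper would need measuring.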

Chris



^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2020-01-07 16:20 UTC | newest]

Thread overview: 24+ messages
2019-04-17  0:28 gcc9 snapshot 20190414 is 30x slower than gcc 6.3 Jason Mancini
2019-04-17  2:10 ` Jason Mancini
2019-04-17  2:20   ` Xi Ruoyao
2019-04-17  8:38     ` Jonathan Wakely
2019-04-17  9:07       ` Segher Boessenkool
2019-04-18 16:06       ` Jason Mancini
2019-04-18 19:07         ` Segher Boessenkool
2019-04-18 19:38           ` Jeff Law
2019-04-19  2:03             ` Jason Mancini
2019-04-22 22:01             ` Jason Mancini
2019-04-22 22:17               ` Jason Mancini
2019-04-23  0:18             ` Jason Mancini
2019-04-23 12:58               ` Jonathan Wakely
2019-04-29 20:33               ` Jason Mancini
2019-05-01 13:31                 ` C11, <stdatomic.h> and atomic pointers Chris Hall
2019-05-01 14:15                   ` Martin Sebor
2019-05-02  7:54                     ` Jonathan Wakely
2019-06-20 11:28                   ` __STDC_NO_THREADS__ and late model gcc/glibc Chris Hall
2019-06-20 12:17                     ` Jonathan Wakely
2019-06-21  9:39                       ` Chris Hall
2019-06-21  9:51                         ` Jonathan Wakely
2020-01-06 15:09                     ` Function returning struct on x86_64 (at least) Chris Hall
2020-01-06 15:19                       ` Marc Glisse
2020-01-07 16:20                         ` Chris Hall