* gcc9 snapshot 20190414 is 30x slower than gcc 6.3
@ 2019-04-17 0:28 Jason Mancini
2019-04-17 2:10 ` Jason Mancini
0 siblings, 1 reply; 24+ messages in thread
From: Jason Mancini @ 2019-04-17 0:28 UTC (permalink / raw)
To: gcc-help
Using gcc 6.3, my C++ source file compiles in 1m2s with -O0. With snapshot 20190414 (compiled with --disable-checking and -O2 and make install-strip), it takes 31 minutes to compile the same file with -O0. Have I overlooked disabling some snapshot self-checking code? Are there known configuration mistakes that could result in this sort of performance degradation? Thanks! It will take a while to go back and try other gcc 6, 7, 8, and 9 snapshots to collect points of reference. Both are pretty heavy on memory, gcc6 uses 3.7G and gcc9 uses 5.4G for this file. There's a lot of templatized headers.
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: gcc9 snapshot 20190414 is 30x slower than gcc 6.3
2019-04-17 0:28 gcc9 snapshot 20190414 is 30x slower than gcc 6.3 Jason Mancini
@ 2019-04-17 2:10 ` Jason Mancini
2019-04-17 2:20 ` Xi Ruoyao
0 siblings, 1 reply; 24+ messages in thread
From: Jason Mancini @ 2019-04-17 2:10 UTC (permalink / raw)
To: gcc-help
> Using gcc 6.3, my C++ source file compiles in 1m2s with -O0. With snapshot 20190414 (compiled with --disable-checking
> and -O2 and make install-strip), it takes 31 minutes to compile the same file with -O0. Have I overlooked disabling some
> snapshot self-checking code? Are there known configuration mistakes that could result in this sort of performance
> degradation? Thanks! It will take a while to go back and try other gcc 6, 7, 8, and 9 snapshots to collect points of reference.
> Both are pretty heavy on memory, gcc6 uses 3.7G and gcc9 uses 5.4G for this file. There's a lot of templatized headers.
Latest data points:
gcc-6.3/6.4 take about 43 seconds
gcc-7.2 takes 30 minutes
gcc-8.2 takes 27 minutes
gcc-9.0 takes 31 minutes (snapshot 20190414)
clang 6.0.1/7.0.1 take about 31 seconds
This is frustrating, as I'm going to have to capitulate to using clang here for a very large user base. We've been a gcc
shop for decades.
* Re: gcc9 snapshot 20190414 is 30x slower than gcc 6.3
2019-04-17 2:10 ` Jason Mancini
@ 2019-04-17 2:20 ` Xi Ruoyao
2019-04-17 8:38 ` Jonathan Wakely
0 siblings, 1 reply; 24+ messages in thread
From: Xi Ruoyao @ 2019-04-17 2:20 UTC (permalink / raw)
To: Jason Mancini; +Cc: gcc-help
On 2019-04-17 02:09 +0000, Jason Mancini wrote:
> > Using gcc 6.3, my C++ source file compiles in 1m2s with -O0. With snapshot
> > 20190414 (compiled with --disable-checking
> > and -O2 and make install-strip), it takes 31 minutes to compile the same
> > file with -O0. Have I overlooked disabling some
> > snapshot self-checking code? Are there known configuration mistakes that
> > could result in this sort of performance
> > degradation? Thanks! It will take a while to go back and try other gcc 6,
> > 7, 8, and 9 snapshots to collect points of reference.
> > Both are pretty heavy on memory, gcc6 uses 3.7G and gcc9 uses 5.4G for this
> > file. There's a lot of templatized headers.
>
> Latest data points:
> gcc-6.3/6.4 take about 43 seconds
> gcc-7.2 takes 30 minutes
> gcc-8.2 takes 27 minutes
> gcc-9.0 takes 31 minutes (snapshot 20190414)
> clang 6.0.1/7.0.1 take about 31 seconds
>
> This is frustrating, as I'm going to have to capitulate to using clang here
> for a very large user base. We've been a gcc
> shop for decades.
We'll never know why unless you can give a testcase to reproduce this issue.
--
Xi Ruoyao <xry111@mengyan1223.wang>
School of Aerospace Science and Technology, Xidian University
* Re: gcc9 snapshot 20190414 is 30x slower than gcc 6.3
2019-04-17 2:20 ` Xi Ruoyao
@ 2019-04-17 8:38 ` Jonathan Wakely
2019-04-17 9:07 ` Segher Boessenkool
2019-04-18 16:06 ` Jason Mancini
0 siblings, 2 replies; 24+ messages in thread
From: Jonathan Wakely @ 2019-04-17 8:38 UTC (permalink / raw)
To: Jason Mancini; +Cc: gcc-help
On Wed, 17 Apr 2019 at 03:20, Xi Ruoyao wrote:
>
> On 2019-04-17 02:09 +0000, Jason Mancini wrote:
> > > Using gcc 6.3, my C++ source file compiles in 1m2s with -O0. With snapshot
> > > 20190414 (compiled with --disable-checking
> > > and -O2 and make install-strip), it takes 31 minutes to compile the same
> > > file with -O0. Have I overlooked disabling some
> > > snapshot self-checking code? Are there known configuration mistakes that
> > > could result in this sort of performance
> > > degradation? Thanks! It will take a while to go back and try other gcc 6,
> > > 7, 8, and 9 snapshots to collect points of reference.
> > > Both are pretty heavy on memory, gcc6 uses 3.7G and gcc9 uses 5.4G for this
> > > file. There's a lot of templatized headers.
> >
> > Latest data points:
> > gcc-6.3/6.4 take about 43 seconds
> > gcc-7.2 takes 30 minutes
> > gcc-8.2 takes 27 minutes
> > gcc-9.0 takes 31 minutes (snapshot 20190414)
> > clang 6.0.1/7.0.1 take about 31 seconds
> >
> > This is frustrating, as I'm going to have to capitulate to using clang here
> > for a very large user base. We've been a gcc
> > shop for decades.
>
> We'll never know why unless you can give a testcase to reproduce this issue.
Even better would be a bug report.
The chances of it ever getting fixed are much higher if we know
there's a problem. If you just complain that you have to switch to
clang then nothing will change. And if you'd told us two years ago
that your program started compiling 40 times slower, maybe it would
have been fixed already.
* Re: gcc9 snapshot 20190414 is 30x slower than gcc 6.3
2019-04-17 8:38 ` Jonathan Wakely
@ 2019-04-17 9:07 ` Segher Boessenkool
2019-04-18 16:06 ` Jason Mancini
1 sibling, 0 replies; 24+ messages in thread
From: Segher Boessenkool @ 2019-04-17 9:07 UTC (permalink / raw)
To: Jonathan Wakely; +Cc: Jason Mancini, gcc-help
On Wed, Apr 17, 2019 at 09:37:54AM +0100, Jonathan Wakely wrote:
> On Wed, 17 Apr 2019 at 03:20, Xi Ruoyao wrote:
> >
> > On 2019-04-17 02:09 +0000, Jason Mancini wrote:
> > > > Using gcc 6.3, my C++ source file compiles in 1m2s with -O0. With snapshot
> > > > 20190414 (compiled with --disable-checking
> > > > and -O2 and make install-strip), it takes 31 minutes to compile the same
> > > > file with -O0. Have I overlooked disabling some
> > > > snapshot self-checking code? Are there known configuration mistakes that
> > > > could result in this sort of performance
> > > > degradation? Thanks! It will take a while to go back and try other gcc 6,
> > > > 7, 8, and 9 snapshots to collect points of reference.
> > > > Both are pretty heavy on memory, gcc6 uses 3.7G and gcc9 uses 5.4G for this
> > > > file. There's a lot of templatized headers.
> > >
> > > Latest data points:
> > > gcc-6.3/6.4 take about 43 seconds
> > > gcc-7.2 takes 30 minutes
> > > gcc-8.2 takes 27 minutes
> > > gcc-9.0 takes 31 minutes (snapshot 20190414)
> > > clang 6.0.1/7.0.1 take about 31 seconds
> > >
> > > This is frustrating, as I'm going to have to capitulate to using clang here
> > > for a very large user base. We've been a gcc
> > > shop for decades.
> >
> > We'll never know why unless you can give a testcase to reproduce this issue.
>
> Even better would be a bug report.
Yes... With -ftime-report info, to start with.
Segher
* Re: gcc9 snapshot 20190414 is 30x slower than gcc 6.3
2019-04-17 8:38 ` Jonathan Wakely
2019-04-17 9:07 ` Segher Boessenkool
@ 2019-04-18 16:06 ` Jason Mancini
2019-04-18 19:07 ` Segher Boessenkool
1 sibling, 1 reply; 24+ messages in thread
From: Jason Mancini @ 2019-04-18 16:06 UTC (permalink / raw)
To: Jonathan Wakely, gcc-help
The root cause is between 7.1 and 7.2! 7.1 is fast, 7.2 is slow.
Bisected, and found it's due to revision 249333 on gcc-7-branch.
Here's the commit log and -ftime-report output. Where do we
go from here? Thanks! -JasonM
------------------------------------------------------------------------
r249333 | jason | 2017-06-16 19:34:15 -0700 (Fri, 16 Jun 2017) | 6 lines
PR c++/81045 - Wrong type-dependence with auto return type.
* pt.c (type_dependent_expression_p): An undeduced auto outside the
template isn't dependent.
* call.c (build_over_call): Instantiate undeduced auto even in a
template.
------------------------------------------------------------------------
Execution times (seconds)
phase setup : 0.00 ( 0%) usr 0.00 ( 0%) sys 0.01 ( 0%) wall 1385 kB ( 0%) ggc
phase parsing : 1732.55 (99%) usr 50.00 (97%) sys 1782.71 (99%) wall 133075852 kB (99%) ggc
phase lang. deferred : 10.14 ( 1%) usr 1.33 ( 3%) sys 11.46 ( 1%) wall 494433 kB ( 0%) ggc
phase opt and generate : 4.63 ( 0%) usr 0.45 ( 1%) sys 5.09 ( 0%) wall 237946 kB ( 0%) ggc
|name lookup : 5.94 ( 0%) usr 1.76 ( 3%) sys 7.75 ( 0%) wall 220059 kB ( 0%) ggc
|overload resolution : 25.11 ( 1%) usr 3.98 ( 8%) sys 29.70 ( 2%) wall 2052677 kB ( 2%) ggc
garbage collection : 197.31 (11%) usr 8.98 (17%) sys 206.31 (11%) wall 0 kB ( 0%) ggc
dump files : 0.04 ( 0%) usr 0.04 ( 0%) sys 0.09 ( 0%) wall 0 kB ( 0%) ggc
callgraph construction : 0.77 ( 0%) usr 0.10 ( 0%) sys 0.94 ( 0%) wall 95588 kB ( 0%) ggc
callgraph optimization : 0.04 ( 0%) usr 0.02 ( 0%) sys 0.04 ( 0%) wall 64 kB ( 0%) ggc
ipa dead code removal : 0.01 ( 0%) usr 0.00 ( 0%) sys 0.01 ( 0%) wall 0 kB ( 0%) ggc
ipa inheritance graph : 0.03 ( 0%) usr 0.00 ( 0%) sys 0.01 ( 0%) wall 719 kB ( 0%) ggc
ipa inlining heuristics : 0.01 ( 0%) usr 0.00 ( 0%) sys 0.00 ( 0%) wall 0 kB ( 0%) ggc
cfg construction : 0.00 ( 0%) usr 0.00 ( 0%) sys 0.01 ( 0%) wall 248 kB ( 0%) ggc
cfg cleanup : 0.03 ( 0%) usr 0.00 ( 0%) sys 0.03 ( 0%) wall 53 kB ( 0%) ggc
trivially dead code : 0.03 ( 0%) usr 0.00 ( 0%) sys 0.01 ( 0%) wall 0 kB ( 0%) ggc
df scan insns : 0.11 ( 0%) usr 0.01 ( 0%) sys 0.14 ( 0%) wall 81 kB ( 0%) ggc
df live regs : 0.03 ( 0%) usr 0.00 ( 0%) sys 0.04 ( 0%) wall 0 kB ( 0%) ggc
df reg dead/unused notes: 0.00 ( 0%) usr 0.00 ( 0%) sys 0.01 ( 0%) wall 2290 kB ( 0%) ggc
register information : 0.01 ( 0%) usr 0.00 ( 0%) sys 0.00 ( 0%) wall 0 kB ( 0%) ggc
alias analysis : 0.09 ( 0%) usr 0.01 ( 0%) sys 0.07 ( 0%) wall 797 kB ( 0%) ggc
rebuild jump labels : 0.01 ( 0%) usr 0.00 ( 0%) sys 0.00 ( 0%) wall 0 kB ( 0%) ggc
preprocessing : 0.57 ( 0%) usr 0.89 ( 2%) sys 1.32 ( 0%) wall 74289 kB ( 0%) ggc
parser (global) : 1.22 ( 0%) usr 1.10 ( 2%) sys 2.43 ( 0%) wall 202556 kB ( 0%) ggc
parser struct body : 1459.61 (84%) usr 29.08 (56%) sys 1489.75 (83%) wall 122435248 kB (91%) ggc
parser enumerator list : 0.05 ( 0%) usr 0.00 ( 0%) sys 0.04 ( 0%) wall 2847 kB ( 0%) ggc
parser function body : 0.61 ( 0%) usr 0.18 ( 0%) sys 0.72 ( 0%) wall 29874 kB ( 0%) ggc
parser inl. func. body : 0.14 ( 0%) usr 0.07 ( 0%) sys 0.22 ( 0%) wall 10547 kB ( 0%) ggc
parser inl. meth. body : 3.86 ( 0%) usr 0.68 ( 1%) sys 4.39 ( 0%) wall 239905 kB ( 0%) ggc
template instantiation : 79.00 ( 5%) usr 10.23 (20%) sys 88.55 ( 5%) wall 10574851 kB ( 8%) ggc
early inlining heuristics: 0.00 ( 0%) usr 0.01 ( 0%) sys 0.00 ( 0%) wall 3 kB ( 0%) ggc
inline parameters : 0.05 ( 0%) usr 0.01 ( 0%) sys 0.04 ( 0%) wall 1531 kB ( 0%) ggc
tree gimplify : 0.28 ( 0%) usr 0.02 ( 0%) sys 0.31 ( 0%) wall 21735 kB ( 0%) ggc
tree eh : 0.02 ( 0%) usr 0.02 ( 0%) sys 0.04 ( 0%) wall 3173 kB ( 0%) ggc
tree CFG construction : 0.01 ( 0%) usr 0.00 ( 0%) sys 0.04 ( 0%) wall 4453 kB ( 0%) ggc
tree CFG cleanup : 0.02 ( 0%) usr 0.00 ( 0%) sys 0.06 ( 0%) wall 24 kB ( 0%) ggc
tree PHI insertion : 0.01 ( 0%) usr 0.00 ( 0%) sys 0.01 ( 0%) wall 732 kB ( 0%) ggc
tree SSA rewrite : 0.03 ( 0%) usr 0.00 ( 0%) sys 0.03 ( 0%) wall 2542 kB ( 0%) ggc
tree SSA other : 0.03 ( 0%) usr 0.01 ( 0%) sys 0.05 ( 0%) wall 258 kB ( 0%) ggc
tree SSA incremental : 0.00 ( 0%) usr 0.00 ( 0%) sys 0.01 ( 0%) wall 0 kB ( 0%) ggc
tree operand scan : 1.17 ( 0%) usr 0.04 ( 0%) sys 1.18 ( 0%) wall 6878 kB ( 0%) ggc
dominance computation : 0.05 ( 0%) usr 0.00 ( 0%) sys 0.04 ( 0%) wall 0 kB ( 0%) ggc
out of ssa : 0.03 ( 0%) usr 0.01 ( 0%) sys 0.03 ( 0%) wall 174 kB ( 0%) ggc
expand vars : 0.09 ( 0%) usr 0.01 ( 0%) sys 0.09 ( 0%) wall 3785 kB ( 0%) ggc
expand : 0.26 ( 0%) usr 0.02 ( 0%) sys 0.26 ( 0%) wall 22606 kB ( 0%) ggc
post expand cleanups : 0.02 ( 0%) usr 0.00 ( 0%) sys 0.02 ( 0%) wall 1994 kB ( 0%) ggc
varconst : 0.26 ( 0%) usr 0.08 ( 0%) sys 0.32 ( 0%) wall 164 kB ( 0%) ggc
jump : 0.01 ( 0%) usr 0.00 ( 0%) sys 0.02 ( 0%) wall 0 kB ( 0%) ggc
loop init : 0.00 ( 0%) usr 0.00 ( 0%) sys 0.02 ( 0%) wall 1375 kB ( 0%) ggc
mode switching : 0.01 ( 0%) usr 0.00 ( 0%) sys 0.02 ( 0%) wall 0 kB ( 0%) ggc
integrated RA : 0.63 ( 0%) usr 0.01 ( 0%) sys 0.69 ( 0%) wall 53889 kB ( 0%) ggc
LRA non-specific : 0.17 ( 0%) usr 0.03 ( 0%) sys 0.19 ( 0%) wall 279 kB ( 0%) ggc
LRA virtuals elimination: 0.03 ( 0%) usr 0.00 ( 0%) sys 0.04 ( 0%) wall 1039 kB ( 0%) ggc
LRA reload inheritance : 0.01 ( 0%) usr 0.00 ( 0%) sys 0.02 ( 0%) wall 0 kB ( 0%) ggc
LRA create live ranges : 0.02 ( 0%) usr 0.00 ( 0%) sys 0.02 ( 0%) wall 137 kB ( 0%) ggc
LRA hard reg assignment : 0.02 ( 0%) usr 0.00 ( 0%) sys 0.00 ( 0%) wall 0 kB ( 0%) ggc
reload : 0.00 ( 0%) usr 0.00 ( 0%) sys 0.01 ( 0%) wall 0 kB ( 0%) ggc
thread pro- & epilogue : 0.10 ( 0%) usr 0.02 ( 0%) sys 0.03 ( 0%) wall 2859 kB ( 0%) ggc
shorten branches : 0.06 ( 0%) usr 0.00 ( 0%) sys 0.07 ( 0%) wall 0 kB ( 0%) ggc
reg stack : 0.00 ( 0%) usr 0.01 ( 0%) sys 0.00 ( 0%) wall 0 kB ( 0%) ggc
final : 0.14 ( 0%) usr 0.00 ( 0%) sys 0.16 ( 0%) wall 3775 kB ( 0%) ggc
symout : 0.06 ( 0%) usr 0.04 ( 0%) sys 0.12 ( 0%) wall 0 kB ( 0%) ggc
initialize rtl : 0.00 ( 0%) usr 0.00 ( 0%) sys 0.01 ( 0%) wall 12 kB ( 0%) ggc
rest of compilation : 0.15 ( 0%) usr 0.05 ( 0%) sys 0.18 ( 0%) wall 4640 kB ( 0%) ggc
TOTAL : 1747.32 51.80 1799.28 133809627 kB
* Re: gcc9 snapshot 20190414 is 30x slower than gcc 6.3
2019-04-18 16:06 ` Jason Mancini
@ 2019-04-18 19:07 ` Segher Boessenkool
2019-04-18 19:38 ` Jeff Law
0 siblings, 1 reply; 24+ messages in thread
From: Segher Boessenkool @ 2019-04-18 19:07 UTC (permalink / raw)
To: Jason Mancini; +Cc: Jonathan Wakely, gcc-help
On Thu, Apr 18, 2019 at 04:06:17PM +0000, Jason Mancini wrote:
> The root cause is between 7.1 and 7.2! 7.1 is fast, 7.2 is slow.
> Bisected, and found it's due to revision 249333 on gcc-7-branch.
> Here's the commit log and -ftime-report output. Where do we
> go from here? Thanks! -JasonM
Please open a bug report? https://gcc.gnu.org/bugzilla
Thanks,
Segher
* Re: gcc9 snapshot 20190414 is 30x slower than gcc 6.3
2019-04-18 19:07 ` Segher Boessenkool
@ 2019-04-18 19:38 ` Jeff Law
2019-04-19 2:03 ` Jason Mancini
` (2 more replies)
0 siblings, 3 replies; 24+ messages in thread
From: Jeff Law @ 2019-04-18 19:38 UTC (permalink / raw)
To: Segher Boessenkool, Jason Mancini; +Cc: Jonathan Wakely, gcc-help
On 4/18/19 1:07 PM, Segher Boessenkool wrote:
> On Thu, Apr 18, 2019 at 04:06:17PM +0000, Jason Mancini wrote:
>> The root cause is between 7.1 and 7.2! 7.1 is fast, 7.2 is slow.
>> Bisected, and found it's due to revision 249333 on gcc-7-branch.
>> Here's the commit log and -ftime-report output. Where do we
>> go from here? Thanks! -JasonM
>
> Please open a bug report? https://gcc.gnu.org/bugzilla
With a testcase!
jeff
* Re: gcc9 snapshot 20190414 is 30x slower than gcc 6.3
2019-04-18 19:38 ` Jeff Law
@ 2019-04-19 2:03 ` Jason Mancini
2019-04-22 22:01 ` Jason Mancini
2019-04-23 0:18 ` Jason Mancini
2 siblings, 0 replies; 24+ messages in thread
From: Jason Mancini @ 2019-04-19 2:03 UTC (permalink / raw)
To: Jeff Law, Segher Boessenkool; +Cc: Jonathan Wakely, gcc-help
Working on getting a testcase prepared to file a bug report. The trimmed-down case shows a 6x degradation instead of 45x at -O0 (15 vs 85 seconds), but hopefully that's enough to pinpoint the reason. It looks like some O(n^2) behavior: 15s becomes 90s, but 40s becomes 30m.
I used "gcc -E" to generate the output. Is that what we're looking for here? I'm not familiar with *.ii files (or is that the typical extension used for preprocessor output?)
Thanks!
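A rough check of the scaling guess in this message, offered as an illustrative sketch and not as part of the original post. It assumes the fast compiler's time (15s and 40s) is a usable stand-in for input size, which the thread does not establish, and it rests on only two data points. The function name growth_exponent is made up for this example.

```cpp
#include <cassert>
#include <cmath>

// Estimate k in "slow_time ~ size^k" from two measurements.
// Assumption (not from the thread): the fast compiler's time grows
// roughly linearly with input size, so it can stand in for "size".
inline double growth_exponent(double fast_small, double slow_small,
                              double fast_large, double slow_large) {
    // k = log(ratio of slow times) / log(ratio of fast times)
    return std::log(slow_large / slow_small) /
           std::log(fast_large / fast_small);
}
```

With the figures quoted above (15s -> 90s and 40s -> 30m, i.e. 1800s), growth_exponent(15, 90, 40, 1800) comes out at roughly 3. Under the stated assumption the growth is therefore at least as steep as the quadratic guess, though two points cannot distinguish a power law from other superlinear behavior.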
* Re: gcc9 snapshot 20190414 is 30x slower than gcc 6.3
2019-04-18 19:38 ` Jeff Law
2019-04-19 2:03 ` Jason Mancini
@ 2019-04-22 22:01 ` Jason Mancini
2019-04-22 22:17 ` Jason Mancini
2019-04-23 0:18 ` Jason Mancini
2 siblings, 1 reply; 24+ messages in thread
From: Jason Mancini @ 2019-04-22 22:01 UTC (permalink / raw)
To: gcc-help
On gcc trunk, the performance culprit is at gcc/cp/call.c function build_over_call at:
  if (undeduced_auto_decl (fn))
    mark_used (fn, complain); // <= this guy from gcc-7-branch r249333
  else
    /* Otherwise set TREE_USED for the benefit of -Wunused-function. See PR80598. */
    TREE_USED (fn) = 1;
I'm still working on a code sample. The code sample has to be large to tickle the issue so far.
* Re: gcc9 snapshot 20190414 is 30x slower than gcc 6.3
2019-04-22 22:01 ` Jason Mancini
@ 2019-04-22 22:17 ` Jason Mancini
0 siblings, 0 replies; 24+ messages in thread
From: Jason Mancini @ 2019-04-22 22:17 UTC (permalink / raw)
To: gcc-help
> On gcc trunk, the performance culprit is at gcc/cp/call.c function build_over_call at:
>
>   if (undeduced_auto_decl (fn))
>     mark_used (fn, complain); // <= this guy from gcc-7-branch r249333
>   else
>     /* Otherwise set TREE_USED for the benefit of -Wunused-function. See PR80598. */
>     TREE_USED (fn) = 1;
mark_used is only called 1260 times, but inflates run time from ~13 to ~81 seconds for one sample.
The calls to mark_used aren't expensive, so they must be triggering a down-stream effect.
* Re: gcc9 snapshot 20190414 is 30x slower than gcc 6.3
2019-04-18 19:38 ` Jeff Law
2019-04-19 2:03 ` Jason Mancini
2019-04-22 22:01 ` Jason Mancini
@ 2019-04-23 0:18 ` Jason Mancini
2019-04-23 12:58 ` Jonathan Wakely
2019-04-29 20:33 ` Jason Mancini
2 siblings, 2 replies; 24+ messages in thread
From: Jason Mancini @ 2019-04-23 0:18 UTC (permalink / raw)
To: gcc-help
We've determined that the gcc perf drop is due to use of decltype(auto) as the return type for template functions. Replacing with a known type or func(...) -> decltype(...) trailing type syntax seems to avoid the performance issue.
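For readers unfamiliar with the three spellings being compared, here is a minimal sketch. It is not the poster's code: Box, get_deduced, get_trailing, get_known and sum_via_all_three are hypothetical names chosen for illustration. It shows only that the forms are interchangeable in simple cases; it does not reproduce the slowdown, and a later message in this thread revises the performance claim.

```cpp
#include <cassert>
#include <type_traits>
#include <utility>

// Hypothetical type, used only for illustration.
struct Box {
    int value;
    int& get() { return value; }
};

// 1. Deduced return type: the body must be instantiated before the
//    return type is known to callers.
template <typename T>
decltype(auto) get_deduced(T& t) { return t.get(); }

// 2. Trailing return type: the type is spelled out in the declaration.
template <typename T>
auto get_trailing(T& t) -> decltype(t.get()) { return t.get(); }

// 3. Known type written by hand.
template <typename T>
int& get_known(T& t) { return t.get(); }

// All three spellings yield the same type for this example.
static_assert(std::is_same<decltype(get_deduced(std::declval<Box&>())), int&>::value, "deduced");
static_assert(std::is_same<decltype(get_trailing(std::declval<Box&>())), int&>::value, "trailing");
static_assert(std::is_same<decltype(get_known(std::declval<Box&>())), int&>::value, "known");

// Small helper so the three forms can also be exercised at run time.
inline int sum_via_all_three(int v) {
    Box b{v};
    return get_deduced(b) + get_trailing(b) + get_known(b);
}
```

The practical difference is when the compiler learns the return type: forms 2 and 3 state it in the declaration, while form 1 requires looking inside the function body.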
* Re: gcc9 snapshot 20190414 is 30x slower than gcc 6.3
2019-04-23 0:18 ` Jason Mancini
@ 2019-04-23 12:58 ` Jonathan Wakely
2019-04-29 20:33 ` Jason Mancini
1 sibling, 0 replies; 24+ messages in thread
From: Jonathan Wakely @ 2019-04-23 12:58 UTC (permalink / raw)
To: Jason Mancini; +Cc: gcc-help
On Tue, 23 Apr 2019 at 01:18, Jason Mancini <jayrusman@hotmail.com> wrote:
>
> We've determined that the gcc perf drop is due to use of decltype(auto) as the return type for template functions. Replacing with a known type or func(...) -> decltype(...) trailing type syntax seems to avoid the performance issue.
Is the code you're compiling the same in all cases, meaning
decltype(auto) is faster with GCC 6 than later releases?
Or are you only using decltype(auto) with the later releases?
It's possible that GCC 7 and later fixes some bugs in the handling of
decltype(auto) which makes it slower than GCC 6.
It's unsurprising that decltype(auto) requires the compiler to do more
work, but ideally that work wouldn't make compilation exponentially
slower, even with less buggy behaviour than in older releases.
* Re: gcc9 snapshot 20190414 is 30x slower than gcc 6.3
2019-04-23 0:18 ` Jason Mancini
2019-04-23 12:58 ` Jonathan Wakely
@ 2019-04-29 20:33 ` Jason Mancini
2019-05-01 13:31 ` C11, <stdatomic.h> and atomic pointers Chris Hall
1 sibling, 1 reply; 24+ messages in thread
From: Jason Mancini @ 2019-04-29 20:33 UTC (permalink / raw)
To: gcc-help
Jason Mancini said:
> We've determined that the gcc perf drop is due to use of decltype(auto) as the
> return type for template functions. Replacing with a known type or func(...) -> decltype(...)
> trailing type syntax seems to avoid the performance issue.
I misspoke here. Turns out that the above replacement made everything equally slow.
So the performance bug was lurking in there, and gcc-7-branch r249333 exposed it more.
Yeah yeah, we need to get an offending code blob cleaned up, approved, and bug filed.
I've been using gcc9 snapshots with part of r249333 reverted in the mean time to make
forward progress vetting gcc9 on our code base (no other problems so far!)
Jason
* C11, <stdatomic.h> and atomic pointers
2019-04-29 20:33 ` Jason Mancini
@ 2019-05-01 13:31 ` Chris Hall
2019-05-01 14:15 ` Martin Sebor
2019-06-20 11:28 ` __STDC_NO_THREADS__ and late model gcc/glibc Chris Hall
0 siblings, 2 replies; 24+ messages in thread
From: Chris Hall @ 2019-05-01 13:31 UTC (permalink / raw)
To: gcc-help
I find that:
int * _Atomic foo ;
int bar[12] ;
foo = bar ;
foo += 4 ; // foo -> &bar[4] -- of course
foo = bar ;
atomic_fetch_add(&foo, 4) ; // foo -> &bar[1] -- ????
which, I confess, I did not quite expect. (Happy I looked, though !)
So, I looked at the Standard:
7.17.7.5 The atomic_fetch and modify generic functions
1 The following operations perform arithmetic and bitwise
computations. All of these operations are applicable to an
object of any atomic integer type. None of these
operations is applicable to atomic_bool.
Of course, "integer type" excludes pointers, so I guess what it does
with pointers is undefined.
Should gcc be throwing a friendly warning here ?
But the Standard goes on to say:
3 ... For signed integer types ... there are no undefined
results. ...
... For address types, the result may be an undefined
address, but the operations otherwise have no undefined
behavior.
I don't know why it feels the need to mention "address types", given
that they are not valid arguments ? [I'm assuming that by "address
types" it actually means "pointer types". I can find no other mention
of "address type".]
The "Synopsis" says:
2 #include <stdatomic.h>
<C> atomic_fetch_<key>(volatile <A> *object, <M> operand);
<C> atomic_fetch_<key>_explicit(volatile <A> *object,
<M> operand, memory_order order);
and the meaning of <A>, <C> and <M> is given in "7.17.1 Introduction",
as follows:
5 In the following synopses:
- An <A> refers to one of the atomic types.
- A <C> refers to its corresponding non-atomic type.
- An <M> refers to the type of the other argument for
arithmetic operations. For atomic integer types, <M>
is <C>. For atomic pointer types, <M> is ptrdiff_t.
As it happens, <M> is only used in the Synopsis for atomic_fetch_<key>...
which is not defined for pointer types ?
I realise this is not really the place for discussion of the Standard,
but I assume that what gcc does is based on some interpretation of it.
Is there a good place to look for that interpretation ?
Chris
--------------------------
FWIW (1): the functions in gcc/glibc's <stdatomic.h> do *not* require
the various <A> arguments to be atomic types... they are perfectly happy
with ordinary types. That doesn't seem right to me.
FWIW (2): the Standard (later in 7.17.7.5) says:
5 NOTE The operation of the atomic_fetch and modify generic
functions are nearly equivalent to the operation of the
corresponding op= compound assignment operators. The only
differences are that the compound assignment operators are
not guaranteed to operate atomically, ...
Except that "6.5.16.2 Compound assignment" says:
3 A compound assignment of the form E1 op= E2 ...
... If E1 has an atomic type, compound assignment is a
read-modify-write operation with memory_order_seq_cst
memory order semantics.
which looks like a flat contradiction to me.
* Re: C11, <stdatomic.h> and atomic pointers
2019-05-01 13:31 ` C11, <stdatomic.h> and atomic pointers Chris Hall
@ 2019-05-01 14:15 ` Martin Sebor
2019-05-02 7:54 ` Jonathan Wakely
2019-06-20 11:28 ` __STDC_NO_THREADS__ and late model gcc/glibc Chris Hall
1 sibling, 1 reply; 24+ messages in thread
From: Martin Sebor @ 2019-05-01 14:15 UTC (permalink / raw)
To: Chris Hall, gcc-help
On 05/01/2019 07:31 AM, Chris Hall wrote:
>
> I find that:
>
>   int * _Atomic foo ;
>   int bar[12] ;
>
>   foo = bar ;
>   foo += 4 ;                    // foo -> &bar[4] -- of course
>
>   foo = bar ;
>   atomic_fetch_add(&foo, 4) ;   // foo -> &bar[1] -- ????
>
I think this is bug 64843:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64843
> which, I confess, I did not quite expect. (Happy I looked, though !)
>
> So, I looked at the Standard:
>
>   7.17.7.5 The atomic_fetch and modify generic functions
>
>   1 The following operations perform arithmetic and bitwise
>     computations. All of these operations are applicable to an
>     object of any atomic integer type. None of these
>     operations is applicable to atomic_bool.
>
> Of course, "integer type" excludes pointers, so I guess what it does
> with pointers is undefined.
>
> Should gcc be throwing a friendly warning here ?
>
> But the Standard goes on to say:
>
>   3 ... For signed integer types ... there are no undefined
>     results. ...
>     ... For address types, the result may be an undefined
>     address, but the operations otherwise have no undefined
>     behavior.
>
> I don't know why it feels the need to mention "address types", given
> that they are not valid arguments ? [I'm assuming that by "address
> types" it actually means "pointer types". I can find no other mention
> of "address type".]
Yes, that's a problem in the standard text that should be fixed.
> The "Synopsis" says:
>
>   2 #include <stdatomic.h>
>     <C> atomic_fetch_<key>(volatile <A> *object, <M> operand);
>     <C> atomic_fetch_<key>_explicit(volatile <A> *object,
>                                     <M> operand, memory_order order);
>
> and the meaning of <A>, <C> and <M> is given in "7.17.1 Introduction",
> as follows:
>
>   5 In the following synopses:
>
>     - An <A> refers to one of the atomic types.
>     - A <C> refers to its corresponding non-atomic type.
>     - An <M> refers to the type of the other argument for
>       arithmetic operations. For atomic integer types, <M>
>       is <C>. For atomic pointer types, <M> is ptrdiff_t.
>
> As it happens, <M> is only used in the Synopsis for atomic_fetch_<key>...
> which is not defined for pointer types ?
>
> I realise this is not really the place for discussion of the Standard,
> but I assume that what gcc does is based on some interpretation of it.
> Is there a good place to look for that interpretation ?
There are C defect reports that GCC also considers. Some may
already be incorporated, others are not. C11 defect reports
are tracked here:
http://www.open-std.org/jtc1/sc22/wg14/www/docs/summary.htm
If the above isn't being tracked there or in the list below
we might want to write up a new issue for it and submit it to
WG14 to get it fixed in C2X.
http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2316.htm
Martin
>
> Chris
>
> --------------------------
>
> FWIW (1): the functions in gcc/glibc's <stdatomic.h> do *not* require
> the various <A> arguments to be atomic types... they are perfectly happy
> with ordinary types. That doesn't seem right to me.
>
> FWIW (2): the Standard (later in 7.17.7.5) says:
>
>   5 NOTE The operation of the atomic_fetch and modify generic
>     functions are nearly equivalent to the operation of the
>     corresponding op= compound assignment operators. The only
>     differences are that the compound assignment operators are
>     not guaranteed to operate atomically, ...
>
> Except that "6.5.16.2 Compound assignment" says:
>
>   3 A compound assignment of the form E1 op= E2 ...
>     ... If E1 has an atomic type, compound assignment is a
>     read-modify-write operation with memory_order_seq_cst
>     memory order semantics.
>
> which looks like a flat contradiction to me.
* Re: C11, <stdatomic.h> and atomic pointers
2019-05-01 14:15 ` Martin Sebor
@ 2019-05-02 7:54 ` Jonathan Wakely
0 siblings, 0 replies; 24+ messages in thread
From: Jonathan Wakely @ 2019-05-02 7:54 UTC (permalink / raw)
To: Martin Sebor; +Cc: Chris Hall, gcc-help
The equivalent wording in the C++ standard was modified by
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2017/p0558r1.pdf to
remove the nonsense words "address types", and to clarify that using
the fetch_xxx functions for arithmetic on pointers is valid, except
for pointers to void and function pointers.
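As an illustration of the C++ behavior described here (a sketch, not code from the thread), the following uses std::atomic<int*> to mirror the foo/bar snippet from the start of this sub-thread. The helper names are hypothetical. It says nothing about what a given GCC version does with the C11 atomic_fetch_add generic function, which is the subject of the bug report cited earlier.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

// How many elements does the pointer move after fetch_add(n)?
// Valid for 0 <= n <= 12, so the result stays within (or one past) bar.
inline std::ptrdiff_t elements_advanced_by_fetch_add(std::ptrdiff_t n) {
    static int bar[12] = {};
    std::atomic<int*> foo{bar};
    int* previous = foo.fetch_add(n);  // returns the value held before the add
    assert(previous == bar);
    return foo.load() - bar;           // distance in elements, not bytes
}

// The same step with compound assignment, which for std::atomic<T*>
// is also an atomic read-modify-write.
inline std::ptrdiff_t elements_advanced_by_plus_equals(std::ptrdiff_t n) {
    static int bar[12] = {};
    std::atomic<int*> foo{bar};
    foo += n;
    return foo.load() - bar;
}
```

In C++, both calls with n = 4 leave the pointer at &bar[4]: the operand counts elements, as ordinary pointer arithmetic does.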
* __STDC_NO_THREADS__ and late model gcc/glibc
2019-05-01 13:31 ` C11, <stdatomic.h> and atomic pointers Chris Hall
2019-05-01 14:15 ` Martin Sebor
@ 2019-06-20 11:28 ` Chris Hall
2019-06-20 12:17 ` Jonathan Wakely
2020-01-06 15:09 ` Function returning struct on x86_64 (at least) Chris Hall
1 sibling, 2 replies; 24+ messages in thread
From: Chris Hall @ 2019-06-20 11:28 UTC (permalink / raw)
To: gcc-help
I find that gcc 4.9.0 onwards defines __STDC_NO_THREADS__ to be '1',
denying support for <threads.h>. Nevertheless, it does support
_Thread_local.
As of glibc 2.28, <threads.h> appears in the library. I guess that
means some version of gcc will no longer set __STDC_NO_THREADS__ ?
On gcc.godbolt.org, I find that __STDC_NO_THREADS__ is defined for all
versions up to and including the "trunk".
But on my machine, __STDC_NO_THREADS__ is no longer set by gcc v9.1.1,
and no longer set by gcc v7.2.0 (which I just built on my machine).
I note that gcc.godbolt.org has glibc 2.27, while my machine has glibc
2.28. I'm guessing that something, somewhere is taking that into account.
I have looked everywhere I can think of to find where
__STDC_NO_THREADS__ is configured... but to no avail :-(
Does anyone know what I should expect, or where I should look to find
out, please ?
Thanks,
Chris
* Re: __STDC_NO_THREADS__ and late model gcc/glibc
2019-06-20 11:28 ` __STDC_NO_THREADS__ and late model gcc/glibc Chris Hall
@ 2019-06-20 12:17 ` Jonathan Wakely
2019-06-21 9:39 ` Chris Hall
2020-01-06 15:09 ` Function returning struct on x86_64 (at least) Chris Hall
1 sibling, 1 reply; 24+ messages in thread
From: Jonathan Wakely @ 2019-06-20 12:17 UTC (permalink / raw)
To: Chris Hall; +Cc: gcc-help
On Thu, 20 Jun 2019 at 12:28, Chris Hall wrote:
>
>
> I find that gcc 4.9.0 onwards defines __STDC_NO_THREADS__ to be '1',
> denying support for <threads.h>. Nevertheless, it does support
> _Thread_local.
>
> As of glibc 2.28, <threads.h> appears in the library. I guess that
> means some version of gcc will no longer set __STDC_NO_THREADS__ ?
>
> On gcc.godbolt.org, I find that __STDC_NO_THREADS__ is defined for all
> versions up to and including the "trunk".
>
> But on my machine, __STDC_NO_THREADS__ is no longer set by gcc v9.1.1,
> and no longer set by gcc v7.2.0 (which I just built on my machine).
>
> I note that gcc.godbolt.org has glibc 2.27, while my machine has glibc
> 2.28. I'm guessing that something, somewhere is taking that into account.
>
> I have looked everywhere I can think of to find where
> __STDC_NO_THREADS__ is configured... but to no avail :-(
>
> Does anyone know what I should expect, or where I should look to find
> out, please ?
Glibc provides it in the /usr/include/stdc-predef.h file which is
implicitly pre-included by GCC.
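One way to observe this on a given machine is sketched below; it is an illustration, not something posted in the thread. The SKETCH_* macros and predef_report are hypothetical names. The example is written in C++ but the preprocessor checks would read the same in C. The checks are placed before any #include because a libc header may itself bring in <stdc-predef.h>, which would hide whether the compiler pre-included it.

```cpp
// These checks come first on purpose: they record the macro state
// before any header has been included by this file.
#ifdef _STDC_PREDEF_H
#define SKETCH_PREDEF_PREINCLUDED 1
#else
#define SKETCH_PREDEF_PREINCLUDED 0
#endif

#ifdef __STDC_NO_THREADS__
#define SKETCH_NO_THREADS_DEFINED 1
#else
#define SKETCH_NO_THREADS_DEFINED 0
#endif

#include <cassert>
#include <string>

// Summarize what was visible before the first #include.
inline std::string predef_report() {
    std::string out = "stdc-predef.h pre-included: ";
    out += SKETCH_PREDEF_PREINCLUDED ? "yes" : "no";
    out += "; __STDC_NO_THREADS__ defined: ";
    out += SKETCH_NO_THREADS_DEFINED ? "yes" : "no";
    return out;
}
```

The result depends on the installed C library and on how the compiler was built, so no particular output is claimed here. Printing predef_report() with the same compiler against different glibc versions is one way to see the difference discussed in this thread.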
* Re: __STDC_NO_THREADS__ and late model gcc/glibc
2019-06-20 12:17 ` Jonathan Wakely
@ 2019-06-21 9:39 ` Chris Hall
2019-06-21 9:51 ` Jonathan Wakely
0 siblings, 1 reply; 24+ messages in thread
From: Chris Hall @ 2019-06-21 9:39 UTC (permalink / raw)
To: Jonathan Wakely; +Cc: gcc-help
On 20/06/2019 13:17, Jonathan Wakely wrote:
> On Thu, 20 Jun 2019 at 12:28, Chris Hall wrote:
...
>> I have looked everywhere I can think of to find where
>> __STDC_NO_THREADS__ is configured... but to no avail :-(
>>
>> Does anyone know what I should expect, or where I should look to find
>> out, please ?
> Glibc provides it in the /usr/include/stdc-predef.h file which is
> implicitly pre-included by GCC.
Ah ha ! Thank you.
I guess that accounts for the "predefined" _STDC_PREDEF_H.
AFAICS, the decision to "preinclude" <stdc-predef.h> is made when gcc is
built, depending on the target. Deep Magic of the First Magnitude.
Thanks,
Chris
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: __STDC_NO_THREADS__ and late model gcc/glibc
2019-06-21 9:39 ` Chris Hall
@ 2019-06-21 9:51 ` Jonathan Wakely
0 siblings, 0 replies; 24+ messages in thread
From: Jonathan Wakely @ 2019-06-21 9:51 UTC (permalink / raw)
To: Chris Hall; +Cc: gcc-help
On Fri, 21 Jun 2019 at 10:39, Chris Hall wrote:
>
> On 20/06/2019 13:17, Jonathan Wakely wrote:
> > On Thu, 20 Jun 2019 at 12:28, Chris Hall wrote:
> ...
> >> I have looked everywhere I can think of to find where
> >> __STDC_NO_THREADS__ is configured... but to no avail :-(
> >>
> >> Does anyone know what I should expect, or where I should look to find
> >> out, please ?
>
> > Glibc provides it in the /usr/include/stdc-predef.h file which is
> > implicitly pre-included by GCC.
>
> Ah ha ! Thank you.
>
> I guess that accounts for the "predefined" _STDC_PREDEF_H.
>
> AFAICS, the decision to "preinclude" <stdc-predef.h> is made when gcc is
> built, depending on the target.
Right, see gcc/config/glibc-c.c
> Deep Magic of the First Magnitude.
:-)
^ permalink raw reply [flat|nested] 24+ messages in thread
* Function returning struct on x86_64 (at least)
2019-06-20 11:28 ` __STDC_NO_THREADS__ and late model gcc/glibc Chris Hall
2019-06-20 12:17 ` Jonathan Wakely
@ 2020-01-06 15:09 ` Chris Hall
2020-01-06 15:19 ` Marc Glisse
1 sibling, 1 reply; 24+ messages in thread
From: Chris Hall @ 2020-01-06 15:09 UTC (permalink / raw)
To: gcc-help
I hoped to do something "clever" with a function of the form:
typedef struct { char s[64] ; } qerr_str_t ;
extern qerr_str_t
qerrst0(int err)
{
qerr_str_t st ;
snprintf(st.s, sizeof(st.s), "errno=%d", err) ;
return st ;
}
but was disappointed to find that this compiles (gcc 8.3 and others,
-O2) to this:
.LC0:
.string "errno=%d"
qerrst0:
pushq %rbx
movl %esi, %ecx
movq %rdi, %rbx
movl $.LC0, %edx
movl $64, %esi
xorl %eax, %eax
subq $64, %rsp
movq %rsp, %rdi
call snprintf
movdqa (%rsp), %xmm0
movq %rbx, %rax
movdqa 16(%rsp), %xmm1
movdqa 32(%rsp), %xmm2
movdqa 48(%rsp), %xmm3
movups %xmm0, (%rbx)
movups %xmm1, 16(%rbx)
movups %xmm2, 32(%rbx)
movups %xmm3, 48(%rbx)
addq $64, %rsp
popq %rbx
ret
On reflection, the compiler is playing safe and not writing to whatever
the "hidden" pointer %rdi is pointing at, until the implicit assignment.
So I have no right to be disappointed.
The object of the exercise is to create temporary strings for use like this:
int
main(int argc, char* argv[])
{
printf("%s: %s\n", argv[0], qerrst0(argc).s) ;
}
where the "hidden" pointer passed to qerrst0() does not, in fact, point
to anything accessible. Sadly, even when qerrst0() is inlined, I find:
.LC0:
.string "errno=%d"
.LC1:
.string "%s: %s\n"
main:
pushq %rbx
movl %edi, %ecx
movq %rsi, %rbx
movl $.LC0, %edx
movl $64, %esi
xorl %eax, %eax
addq $-128, %rsp
leaq 64(%rsp), %rdi
call snprintf
movdqa 64(%rsp), %xmm0
movq (%rbx), %rsi
xorl %eax, %eax
movdqa 80(%rsp), %xmm1
movdqa 96(%rsp), %xmm2
movq %rsp, %rdx
movl $.LC1, %edi
movdqa 112(%rsp), %xmm3
movaps %xmm0, (%rsp)
movaps %xmm1, 16(%rsp)
movaps %xmm2, 32(%rsp)
movaps %xmm3, 48(%rsp)
call printf
subq $-128, %rsp
xorl %eax, %eax
popq %rbx
ret
where there is still an (unnecessary) assignment going on !
I tried something simpler:
extern qerr_str_t
qerrst1(int err)
{
qerr_str_t st ;
st.s[0] = err ;
return st ;
}
which compiles to:
qerrst1:
movq %rdi, %rax
movb %sil, (%rdi)
ret
...so a trivial case optimises as one might hope.
As does:
extern qerr_str_t
qerrst2(int err)
{
qerr_str_t st ;
char* q = st.s ;
q[0] = err ;
q[63] = err ;
return st ;
}
qerrst2:
movq %rdi, %rax
movb %sil, (%rdi)
movb %sil, 63(%rdi)
ret
The following are also optimised:
extern qerr_str_t
qerrst3a(int err)
{
qerr_str_t st = { "" } ;
return st ;
}
extern qerr_str_t
qerrst3b(int err)
{
qerr_str_t st ;
char* q = st.s ;
memset(q, 0, sizeof(st.s)) ;
return st ;
}
to the same code:
qerrst3a/b:
pxor %xmm0, %xmm0
movq %rdi, %rax
movups %xmm0, (%rdi)
movups %xmm0, 16(%rdi)
movups %xmm0, 32(%rdi)
movups %xmm0, 48(%rdi)
ret
However, ever so slightly more complicated:
extern qerr_str_t
qerrst4(int err)
{
qerr_str_t st ;
for (int i = 0 ; i < (err & 63) ; ++i)
st.s[i] = err - i ;
return st ;
}
qerrst4:
movl %esi, %edx
movq %rdi, %rax
andl $63, %edx
je .L12
subl $1, %edx
leaq -71(%rsp,%rdx), %r8
leaq -72(%rsp), %rdx
addl %edx, %esi
.L11:
movl %esi, %ecx
subl %edx, %ecx
addq $1, %rdx
movb %cl, -1(%rdx)
cmpq %r8, %rdx
jne .L11
.L12:
movdqa -72(%rsp), %xmm0
movdqa -56(%rsp), %xmm1
movdqa -40(%rsp), %xmm2
movdqa -24(%rsp), %xmm3
movups %xmm0, (%rax)
movups %xmm1, 16(%rax)
movups %xmm2, 32(%rax)
movups %xmm3, 48(%rax)
ret
Which is a puzzle :-(
Interestingly, I also found (after a little effort):
extern qerr_str_t
qerrst5(int err, char* fred)
{
qerr_str_t st ;
st.s[ 0] = err ;
st.s[ 2] = fred[ 8] ;
st.s[ 4] = fred[ 6] ;
st.s[ 6] = fred[ 4] ;
st.s[ 8] = fred[ 2] ;
st.s[10] = fred[ 0] ;
return st ;
}
qerrst5:
movq %rdi, %rax
movzbl 8(%rdx), %r9d
movzbl 6(%rdx), %r8d
movzbl 4(%rdx), %edi
movzbl 2(%rdx), %ecx
movb %sil, (%rax) -- BUG iff %rax ==
movzbl (%rdx), %edx -- %rdx !
movb %r9b, 2(%rax)
movb %r8b, 4(%rax)
movb %dil, 6(%rax)
movb %cl, 8(%rax)
movb %dl, 10(%rax)
ret
which is very nearly correct... except as noted, if fred points at the
final destination !!
For this to do what I had hoped (and I imagine is the majority case),
what is needed is a way to mark the declaration of 'qerr_str_t st' in
the function as a "clone" of the final destination 'qerr_str_t' in the
caller -- so that the compiler could Just Do It.
I looked for an __attribute__(()) for this... but could not find one.
Is there any way in which I can persuade the compiler that a function
returning a struct does not need to worry about preserving the value of
the final destination (ie the struct at %rdi) ?
Chris
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: Function returning struct on x86_64 (at least)
2020-01-06 15:09 ` Function returning struct on x86_64 (at least) Chris Hall
@ 2020-01-06 15:19 ` Marc Glisse
2020-01-07 16:20 ` Chris Hall
0 siblings, 1 reply; 24+ messages in thread
From: Marc Glisse @ 2020-01-06 15:19 UTC (permalink / raw)
To: Chris Hall; +Cc: gcc-help
On Mon, 6 Jan 2020, Chris Hall wrote:
[description of NRVO]
> Is there any way in which I can persuade the compiler that a function
> returning a struct does not need to worry about preserving the value of the
> final destination (ie the struct at %rdi) ?
Compile the file as C++ instead of C. Not that it would be forbidden in C,
but the optimization happens to be in the C++ front-end. There is also an
optimization pass called nrv, but it does not trigger that often.
--
Marc Glisse
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: Function returning struct on x86_64 (at least)
2020-01-06 15:19 ` Marc Glisse
@ 2020-01-07 16:20 ` Chris Hall
0 siblings, 0 replies; 24+ messages in thread
From: Chris Hall @ 2020-01-07 16:20 UTC (permalink / raw)
To: gcc-help, Marc Glisse
On 06/01/2020 15:19, Marc Glisse wrote:
> On Mon, 6 Jan 2020, Chris Hall wrote:
>
> [description of NRVO]
>
>> Is there any way in which I can persuade the compiler that a function
>> returning a struct does not need to worry about preserving the value
>> of the final destination (ie the struct at %rdi) ?
> Compile the file as C++ instead of C. Not that it would be forbidden in
> C, but the optimization happens to be in the C++ front-end. There is
> also an optimization pass called nrv, but it does not trigger that often.
The idea of trying to write as-standard-as-possible C11, and then
compiling it as C++ makes me queasy :-(
As far as I can see, tree-nrv is enabled. Some of the functions do
write directly to %rdi, so I guess that's being done by tree-nrv. But,
as you say, this optimization does seem not to be applied often.
I was puzzled by why the optimization would not be forbidden in C...
...clearly, the problem is that 'z = f(...) ;' is defined such that the
result must be as if f() creates a temporary struct t, which is copied
to z when f() completes. If f() writes directly to z it must not then
read from z -- which looks hard to guarantee, not least because only the
*caller* knows what z is !!
Now, as I noted before, I found (after a little effort):
typedef struct { char s[64] ; } qerrst_t ;
extern qerrst_t
qerrst5(int err, char* foo)
{
qerrst_t st ;
st.s[ 0] = err ;
st.s[ 2] = foo[ 8] ;
st.s[ 4] = foo[ 6] ;
st.s[ 6] = foo[ 4] ;
st.s[ 8] = foo[ 2] ;
st.s[10] = foo[ 0] ;
return st ;
}
qerrst5:
movq %rdi, %rax
movzbl 8(%rdx), %r9d
movzbl 6(%rdx), %r8d
movzbl 4(%rdx), %edi
movzbl 2(%rdx), %ecx
movb %sil, (%rax) -- BUG iff %rax ==
movzbl (%rdx), %edx -- %rdx !
movb %r9b, 2(%rax)
movb %r8b, 4(%rax)
movb %dil, 6(%rax)
movb %cl, 8(%rax)
movb %dl, 10(%rax)
ret
which does *not* do a copy in the function, and which is very nearly
correct... except as noted, if foo points at the final destination !!
HOWEVER... when I tried this function the bug did NOT appear. It turns
out that the *caller* passes a pointer to a hidden struct and the
*caller* copies that to the final destination !!!
The test I ran is:
typedef struct { char s[64] ; } qerrst_t ;
extern qerrst_t qerrst5(int err, char* foo) ;
static void __attribute__((noinline))
show(const char* name, char* s)
{
printf("%s", name) ;
for (int i = 0 ; i <= 10 ; ++i)
printf(" %3d", (unsigned char)s[i]) ;
printf("\n") ;
}
int
main(Unused int argc, Unused char* argv[])
{
int err = argc ;
qerrst_t x, y ;
for (int i = 0 ; i < (int)sizeof(x.s) ; ++i)
y.s[i] = x.s[i] = (char)(100 + i) ;
y = qerrst5(err, x.s) ;
show("y", y.s) ;
x = qerrst5(err, x.s) ;
show("x", x.s) ;
return 0 ;
}
and in a separate compilation unit (for completeness):
extern qerrst_t
qerrst5(int err, char* foo)
{
qerrst_t st ;
st.s[ 0] = (char)err ;
st.s[ 1] = -1 ;
st.s[ 2] = foo[ 8] ;
st.s[ 3] = -3 ;
st.s[ 4] = foo[ 6] ;
st.s[ 5] = -5 ;
st.s[ 6] = foo[ 4] ;
st.s[ 7] = -7 ;
st.s[ 8] = foo[ 2] ;
st.s[ 9] = -8 ;
st.s[10] = foo[ 0] ;
return st ;
}
which compiles to:
Dump of assembler code for function qerrst5:
0x47fdd0 <+0>: mov %rdi,%rax
0x47fdd3 <+3>: movzbl 0x8(%rdx),%r8d # read foo[8]
0x47fdd8 <+8>: movzbl 0x6(%rdx),%edi
0x47fddc <+12>: mov %esi,%r9d
0x47fddf <+15>: movzbl 0x2(%rdx),%ecx
0x47fde3 <+19>: movzbl 0x4(%rdx),%esi
0x47fde7 <+23>: mov %r9b,(%rax) # write st.s[0]
0x47fdea <+26>: movb $0xff,0x1(%rax)
0x47fdee <+30>: movb $0xfd,0x3(%rax)
0x47fdf2 <+34>: movb $0xfb,0x5(%rax)
0x47fdf6 <+38>: movb $0xf9,0x7(%rax)
0x47fdfa <+42>: movb $0xf8,0x9(%rax)
0x47fdfe <+46>: mov %r8b,0x2(%rax)
0x47fe02 <+50>: mov %dil,0x4(%rax)
0x47fe06 <+54>: mov %sil,0x6(%rax)
0x47fe0a <+58>: mov %cl,0x8(%rax)
0x47fe0d <+61>: movzbl (%rdx),%edx # read foo[0]
# -- BUG if foo == st.s
0x47fe10 <+64>: mov %dl,0xa(%rax) # write st.s[10]
0x47fe13 <+67>: retq
And the result was:
y 1 255 108 253 106 251 104 249 102 248 100
x 1 255 108 253 106 251 104 249 102 248 100
SURPRISE ! expected to see:
x 1 255 108 253 106 251 104 249 102 248 1 <<< BUG
Looking at main() we see:
Dump of assembler code for function main:
0x4012a0 <+0>: push %rbp
0x4012a1 <+1>: mov %edi,%esi
0x4012a3 <+3>: mov %rsp,%rbp
0x4012a6 <+6>: push %r12
0x4012a8 <+8>: mov %edi,%r12d
0x4012ab <+11>: and $0xffffffffffffffe0,%rsp
0x4012af <+15>: sub $0xc0,%rsp
0x4012b6 <+22>: vmovaps 0x94642(%rip),%xmm0 # 0x495900
0x4012be <+30>: lea 0x40(%rsp),%rdx # ->x
0x4012c3 <+35>: mov %rsp,%rdi # ->t
0x4012c6 <+38>: vmovaps %xmm0,0x40(%rsp) # x0
0x4012cc <+44>: vmovaps %xmm0,0x80(%rsp) # y0
0x4012d5 <+53>: vmovaps 0x94633(%rip),%xmm0 # 0x495910
0x4012dd <+61>: vmovaps %xmm0,0x50(%rsp) # x1
0x4012e3 <+67>: vmovaps %xmm0,0x90(%rsp) # y1
0x4012ec <+76>: vmovaps 0x9462c(%rip),%xmm0 # 0x495920
0x4012f4 <+84>: vmovaps %xmm0,0x60(%rsp) # x2
0x4012fa <+90>: vmovaps %xmm0,0xa0(%rsp) # y2
0x401303 <+99>: vmovaps 0x94625(%rip),%xmm0 # 0x495930
0x40130b <+107>: vmovaps %xmm0,0x70(%rsp) # x3
0x401311 <+113>: vmovaps %xmm0,0xb0(%rsp) # y3
0x40131a <+122>: callq 0x47fdd0 <qerrst5>
0x40131f <+127>: vmovups (%rsp),%xmm1 # t0
0x401324 <+132>: lea 0x80(%rsp),%rsi # ->y
0x40132c <+140>: mov $0x494b3e,%edi
0x401331 <+145>: vmovups 0x10(%rsp),%xmm2 # t1
0x401337 <+151>: vmovups 0x20(%rsp),%xmm3 # t2
0x40133d <+157>: vmovups 0x30(%rsp),%xmm4 # t3
0x401343 <+163>: vmovaps %xmm1,0x80(%rsp) # y0
0x40134c <+172>: vmovaps %xmm2,0x90(%rsp) # y1
0x401355 <+181>: vmovaps %xmm3,0xa0(%rsp) # y2
0x40135e <+190>: vmovaps %xmm4,0xb0(%rsp) # y3
0x401367 <+199>: callq 0x47fb60 <show>
0x40136c <+204>: lea 0x40(%rsp),%rdx # ->x
0x401371 <+209>: mov %r12d,%esi # err
0x401374 <+212>: mov %rsp,%rdi # ->t
0x401377 <+215>: callq 0x47fdd0 <qerrst5>
0x40137c <+220>: vmovups (%rsp),%xmm5 # t0
0x401381 <+225>: lea 0x40(%rsp),%rsi # ->x
0x401386 <+230>: mov $0x4937c4,%edi
0x40138b <+235>: vmovups 0x10(%rsp),%xmm6 # t1
0x401391 <+241>: vmovups 0x20(%rsp),%xmm7 # t2
0x401397 <+247>: vmovups 0x30(%rsp),%xmm1 # t3
0x40139d <+253>: vmovaps %xmm5,0x40(%rsp) # x0
0x4013a3 <+259>: vmovaps %xmm6,0x50(%rsp) # x1
0x4013a9 <+265>: vmovaps %xmm7,0x60(%rsp) # x2
0x4013af <+271>: vmovaps %xmm1,0x70(%rsp) # x3
0x4013b5 <+277>: callq 0x47fb60 <show>
0x4013ba <+282>: xor %eax,%eax
0x4013bc <+284>: mov -0x8(%rbp),%r12
0x4013c0 <+288>: leaveq
0x4013c1 <+289>: retq
The caller is passing a pointer to a hidden 't' and then *itself*
copying the result to the destination of the assignment !!
It looks like the caller is taking care of the problem, so a function
returning a struct does not need to... surely ?
So I also tried:
typedef struct { char s[64] ; } qerrst_t ;
extern qerrst_t qerrst0(int err) ;
int
main(Unused int argc, Unused char* argv[])
{
int err = argc ;
qerrst_t z ;
printf("qerrst0()='%s'\n", qerrst0(err).s) ;
z = qerrst0(err) ;
printf("qerrst0()='%s'\n", z.s) ;
return 0 ;
}
and in a separate compilation unit (for completeness):
extern qerrst_t
qerrst0(int err)
{
qerrst_t st ;
snprintf(st.s, sizeof(st.s), "errno=%d", err) ;
return st ;
}
which compiles to:
Dump of assembler code for function qerrst0:
0x47fc00 <+0>: push %r12
0x47fc02 <+2>: mov %esi,%ecx
0x47fc04 <+4>: mov %rdi,%r12
0x47fc07 <+7>: mov $0x495910,%edx
0x47fc0c <+12>: sub $0x40,%rsp
0x47fc10 <+16>: mov $0x40,%esi
0x47fc15 <+21>: xor %eax,%eax
0x47fc17 <+23>: mov %rsp,%rdi
0x47fc1a <+26>: callq 0x4010b0 <snprintf@plt>
0x47fc1f <+31>: vmovaps (%rsp),%xmm0
0x47fc24 <+36>: mov %r12,%rax
0x47fc27 <+39>: vmovaps 0x10(%rsp),%xmm1
0x47fc2d <+45>: vmovaps 0x20(%rsp),%xmm2
0x47fc33 <+51>: vmovaps 0x30(%rsp),%xmm3
0x47fc39 <+57>: vmovups %xmm0,(%r12)
0x47fc3f <+63>: vmovups %xmm1,0x10(%r12)
0x47fc46 <+70>: vmovups %xmm2,0x20(%r12)
0x47fc4d <+77>: vmovups %xmm3,0x30(%r12)
0x47fc54 <+84>: add $0x40,%rsp
0x47fc58 <+88>: pop %r12
0x47fc5a <+90>: retq
which, as before, creates a temporary, local struct which is copied to
the return struct pointed to by %rdi.
And now we see:
Dump of assembler code for function main:
0x401280 <+0>: push %rbp
0x401281 <+1>: mov %edi,%esi
0x401283 <+3>: mov %rsp,%rbp
0x401286 <+6>: push %r12
0x401288 <+8>: mov %edi,%r12d
0x40128b <+11>: and $0xffffffffffffffe0,%rsp
0x40128f <+15>: sub $0xc0,%rsp
0x401296 <+22>: lea 0x80(%rsp),%rdi # ->t
0x40129e <+30>: callq 0x47fc00 <qerrst0> # qerrst0(err).s
0x4012a3 <+35>: lea 0x80(%rsp),%rsi
0x4012ab <+43>: mov $0x495882,%edi
0x4012b0 <+48>: xor %eax,%eax
0x4012b2 <+50>: callq 0x4010a0 <printf@plt> # printf(..., t)
0x4012b7 <+55>: mov %r12d,%esi
0x4012ba <+58>: mov %rsp,%rdi # ->t
0x4012bd <+61>: callq 0x47fc00 <qerrst0> # z = qerrst0(err) ;
0x4012c2 <+66>: vmovups (%rsp),%xmm0 # t0
0x4012c7 <+71>: lea 0x40(%rsp),%rsi # ->z
0x4012cc <+76>: mov $0x495882,%edi
0x4012d1 <+81>: vmovups 0x10(%rsp),%xmm1 # t1
0x4012d7 <+87>: vmovups 0x20(%rsp),%xmm2 # t2
0x4012dd <+93>: xor %eax,%eax
0x4012df <+95>: vmovups 0x30(%rsp),%xmm3 # t3
0x4012e5 <+101>: vmovaps %xmm0,0x40(%rsp) # z0 )
0x4012eb <+107>: vmovaps %xmm1,0x50(%rsp) # z1 ) copied from t
0x4012f1 <+113>: vmovaps %xmm2,0x60(%rsp) # z2 )
0x4012f7 <+119>: vmovaps %xmm3,0x70(%rsp) # z3 )
0x4012fd <+125>: callq 0x4010a0 <printf@plt> # printf(..., z.s)
0x401302 <+130>: xor %eax,%eax
0x401304 <+132>: mov -0x8(%rbp),%r12
0x401308 <+136>: leaveq
0x401309 <+137>: retq
So for:
printf("qerrst0()='%s'\n", qerrst0(err).s) ;
there is one (spurious) copy in qerrst0().
And for:
z = qerrst0(err) ;
printf("qerrst0()='%s'\n", z.s) ;
there is one (spurious) copy in qerrst0() AND a *second* copy in main().
Is it just me, or is this broken ?
So, I looked at the AMD64 ABI (Draft 0.99.7 - November 17, 2014 -
15:08), Section 3.2.3 Parameter Passing, p22:
Returning of Values: ....
2. If the type has class MEMORY, then the caller provides space
for the return value and passes the address of this storage
in %rdi as if it were the first argument to the function.
In effect, this address becomes a "hidden" first argument.
This storage must not overlap any data visible to the callee
through other names than this argument.
So... the ABI appears to say that the callee does *not* need to do any
copying *ever*.
This pushes the problem back to the caller. If the caller can be sure
that the final destination is not visible to the callee, it too can
avoid copying.
So... why is the qerrst0() function doing a copy ?
Chris
^ permalink raw reply [flat|nested] 24+ messages in thread
end of thread, other threads:[~2020-01-07 16:20 UTC | newest]
Thread overview: 24+ messages
2019-04-17 0:28 gcc9 snapshot 20190414 is 30x slower than gcc 6.3 Jason Mancini
2019-04-17 2:10 ` Jason Mancini
2019-04-17 2:20 ` Xi Ruoyao
2019-04-17 8:38 ` Jonathan Wakely
2019-04-17 9:07 ` Segher Boessenkool
2019-04-18 16:06 ` Jason Mancini
2019-04-18 19:07 ` Segher Boessenkool
2019-04-18 19:38 ` Jeff Law
2019-04-19 2:03 ` Jason Mancini
2019-04-22 22:01 ` Jason Mancini
2019-04-22 22:17 ` Jason Mancini
2019-04-23 0:18 ` Jason Mancini
2019-04-23 12:58 ` Jonathan Wakely
2019-04-29 20:33 ` Jason Mancini
2019-05-01 13:31 ` C11, <stdatomic.h> and atomic pointers Chris Hall
2019-05-01 14:15 ` Martin Sebor
2019-05-02 7:54 ` Jonathan Wakely
2019-06-20 11:28 ` __STDC_NO_THREADS__ and late model gcc/glibc Chris Hall
2019-06-20 12:17 ` Jonathan Wakely
2019-06-21 9:39 ` Chris Hall
2019-06-21 9:51 ` Jonathan Wakely
2020-01-06 15:09 ` Function returning struct on x86_64 (at least) Chris Hall
2020-01-06 15:19 ` Marc Glisse
2020-01-07 16:20 ` Chris Hall