* gcc9 snapshot 20190414 is 30x slower than gcc 6.3 @ 2019-04-17 0:28 Jason Mancini 2019-04-17 2:10 ` Jason Mancini 0 siblings, 1 reply; 24+ messages in thread From: Jason Mancini @ 2019-04-17 0:28 UTC (permalink / raw) To: gcc-help Using gcc 6.3, my C++ source file compiles in 1m2s with -O0. With snapshot 20190414 (compiled with --disable-checking and -O2 and make install-strip), it takes 31 minutes to compile the same file with -O0. Have I overlooked disabling some snapshot self-checking code? Are there known configuration mistakes that could result in this sort of performance degradation? Thanks! It will take a while to go back and try other gcc 6, 7, 8, and 9 snapshots to collect points of reference. Both are pretty heavy on memory, gcc6 uses 3.7G and gcc9 uses 5.4G for this file. There's a lot of templatized headers. ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: gcc9 snapshot 20190414 is 30x slower than gcc 6.3 2019-04-17 0:28 gcc9 snapshot 20190414 is 30x slower than gcc 6.3 Jason Mancini @ 2019-04-17 2:10 ` Jason Mancini 2019-04-17 2:20 ` Xi Ruoyao 0 siblings, 1 reply; 24+ messages in thread From: Jason Mancini @ 2019-04-17 2:10 UTC (permalink / raw) To: gcc-help > Using gcc 6.3, my C++ source file compiles in 1m2s with -O0. With snapshot 20190414 (compiled with --disable-checking > and -O2 and make install-strip), it takes 31 minutes to compile the same file with -O0. Have I overlooked disabling some > snapshot self-checking code? Are there known configuration mistakes that could result in this sort of performance > degradation? Thanks! It will take a while to go back and try other gcc 6, 7, 8, and 9 snapshots to collect points of reference. > Both are pretty heavy on memory, gcc6 uses 3.7G and gcc9 uses 5.4G for this file. There's a lot of templatized headers. Latest data points: gcc-6.3/6.4 take about 43 seconds gcc-7.2 takes 30 minutes gcc-8.2 takes 27 minutes gcc-9.0 takes 31 minutes (snapshot 20190414) clang 6.0.1/7.01 take about 31 seconds This is frustrating, as I'm going to have to capitulate to using clang here for a very large user base. We've been a gcc shop for decades. ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: gcc9 snapshot 20190414 is 30x slower than gcc 6.3 2019-04-17 2:10 ` Jason Mancini @ 2019-04-17 2:20 ` Xi Ruoyao 2019-04-17 8:38 ` Jonathan Wakely 0 siblings, 1 reply; 24+ messages in thread From: Xi Ruoyao @ 2019-04-17 2:20 UTC (permalink / raw) To: Jason Mancini; +Cc: gcc-help On 2019-04-17 02:09 +0000, Jason Mancini wrote: > > Using gcc 6.3, my C++ source file compiles in 1m2s with -O0. With snapshot > > 20190414 (compiled with --disable-checking > > and -O2 and make install-strip), it takes 31 minutes to compile the same > > file with -O0. Have I overlooked disabling some > > snapshot self-checking code? Are there known configuration mistakes that > > could result in this sort of performance > > degradation? Thanks! It will take a while to go back and try other gcc 6, > > 7, 8, and 9 snapshots to collect points of reference. > > Both are pretty heavy on memory, gcc6 uses 3.7G and gcc9 uses 5.4G for this > > file. There's a lot of templatized headers. > > Latest data points: > gcc-6.3/6.4 take about 43 seconds > gcc-7.2 takes 30 minutes > gcc-8.2 takes 27 minutes > gcc-9.0 takes 31 minutes (snapshot 20190414) > clang 6.0.1/7.01 take about 31 seconds > > This is frustrating, as I'm going to have to capitulate to using clang here > for a very large user base. We've been a gcc > shop for decades. We'll never know why unless you can give a testcase to reproduce this issue. -- Xi Ruoyao <xry111@mengyan1223.wang> School of Aerospace Science and Technology, Xidian University ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: gcc9 snapshot 20190414 is 30x slower than gcc 6.3 2019-04-17 2:20 ` Xi Ruoyao @ 2019-04-17 8:38 ` Jonathan Wakely 2019-04-17 9:07 ` Segher Boessenkool 2019-04-18 16:06 ` Jason Mancini 0 siblings, 2 replies; 24+ messages in thread From: Jonathan Wakely @ 2019-04-17 8:38 UTC (permalink / raw) To: Jason Mancini; +Cc: gcc-help On Wed, 17 Apr 2019 at 03:20, Xi Ruoyao wrote: > > On 2019-04-17 02:09 +0000, Jason Mancini wrote: > > > Using gcc 6.3, my C++ source file compiles in 1m2s with -O0. With snapshot > > > 20190414 (compiled with --disable-checking > > > and -O2 and make install-strip), it takes 31 minutes to compile the same > > > file with -O0. Have I overlooked disabling some > > > snapshot self-checking code? Are there known configuration mistakes that > > > could result in this sort of performance > > > degradation? Thanks! It will take a while to go back and try other gcc 6, > > > 7, 8, and 9 snapshots to collect points of reference. > > > Both are pretty heavy on memory, gcc6 uses 3.7G and gcc9 uses 5.4G for this > > > file. There's a lot of templatized headers. > > > > Latest data points: > > gcc-6.3/6.4 take about 43 seconds > > gcc-7.2 takes 30 minutes > > gcc-8.2 takes 27 minutes > > gcc-9.0 takes 31 minutes (snapshot 20190414) > > clang 6.0.1/7.01 take about 31 seconds > > > > This is frustrating, as I'm going to have to capitulate to using clang here > > for a very large user base. We've been a gcc > > shop for decades. > > We'll never know why unless you can give a testcase to reproduce this issue. Even better would be a bug report. The chances of it ever getting fixed are much higher if we know there's a problem. If you just complain that you have to switch to clang then nothing will change. And if you'd told us two years ago that your program started compiling 40 times slower, maybe it would have been fixed already. ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: gcc9 snapshot 20190414 is 30x slower than gcc 6.3 2019-04-17 8:38 ` Jonathan Wakely @ 2019-04-17 9:07 ` Segher Boessenkool 2019-04-18 16:06 ` Jason Mancini 1 sibling, 0 replies; 24+ messages in thread From: Segher Boessenkool @ 2019-04-17 9:07 UTC (permalink / raw) To: Jonathan Wakely; +Cc: Jason Mancini, gcc-help On Wed, Apr 17, 2019 at 09:37:54AM +0100, Jonathan Wakely wrote: > On Wed, 17 Apr 2019 at 03:20, Xi Ruoyao wrote: > > > > On 2019-04-17 02:09 +0000, Jason Mancini wrote: > > > > Using gcc 6.3, my C++ source file compiles in 1m2s with -O0. With snapshot > > > > 20190414 (compiled with --disable-checking > > > > and -O2 and make install-strip), it takes 31 minutes to compile the same > > > > file with -O0. Have I overlooked disabling some > > > > snapshot self-checking code? Are there known configuration mistakes that > > > > could result in this sort of performance > > > > degradation? Thanks! It will take a while to go back and try other gcc 6, > > > > 7, 8, and 9 snapshots to collect points of reference. > > > > Both are pretty heavy on memory, gcc6 uses 3.7G and gcc9 uses 5.4G for this > > > > file. There's a lot of templatized headers. > > > > > > Latest data points: > > > gcc-6.3/6.4 take about 43 seconds > > > gcc-7.2 takes 30 minutes > > > gcc-8.2 takes 27 minutes > > > gcc-9.0 takes 31 minutes (snapshot 20190414) > > > clang 6.0.1/7.01 take about 31 seconds > > > > > > This is frustrating, as I'm going to have to capitulate to using clang here > > > for a very large user base. We've been a gcc > > > shop for decades. > > > > We'll never know why unless you can give a testcase to reproduce this issue. > > Even better would be a bug report. Yes... With -ftime-report info, to start with. Segher ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: gcc9 snapshot 20190414 is 30x slower than gcc 6.3 2019-04-17 8:38 ` Jonathan Wakely 2019-04-17 9:07 ` Segher Boessenkool @ 2019-04-18 16:06 ` Jason Mancini 2019-04-18 19:07 ` Segher Boessenkool 1 sibling, 1 reply; 24+ messages in thread From: Jason Mancini @ 2019-04-18 16:06 UTC (permalink / raw) To: Jonathan Wakely, gcc-help

The root cause is between 7.1 and 7.2! 7.1 is fast, 7.2 is slow.
Bisected, and found it's due to revision 249333 on gcc-7-branch.
Here's the commit log and -ftime-report output. Where do we
go from here? Thanks! -JasonM

------------------------------------------------------------------------
r249333 | jason | 2017-06-16 19:34:15 -0700 (Fri, 16 Jun 2017) | 6 lines

PR c++/81045 - Wrong type-dependence with auto return type.
* pt.c (type_dependent_expression_p): An undeduced auto outside the
template isn't dependent.
* call.c (build_over_call): Instantiate undeduced auto even in a
template.
------------------------------------------------------------------------

Execution times (seconds)
 phase setup              :    0.00 ( 0%) usr   0.00 ( 0%) sys    0.01 ( 0%) wall      1385 kB ( 0%) ggc
 phase parsing            : 1732.55 (99%) usr  50.00 (97%) sys 1782.71 (99%) wall 133075852 kB (99%) ggc
 phase lang. deferred     :   10.14 ( 1%) usr   1.33 ( 3%) sys   11.46 ( 1%) wall    494433 kB ( 0%) ggc
 phase opt and generate   :    4.63 ( 0%) usr   0.45 ( 1%) sys    5.09 ( 0%) wall    237946 kB ( 0%) ggc
 |name lookup             :    5.94 ( 0%) usr   1.76 ( 3%) sys    7.75 ( 0%) wall    220059 kB ( 0%) ggc
 |overload resolution     :   25.11 ( 1%) usr   3.98 ( 8%) sys   29.70 ( 2%) wall   2052677 kB ( 2%) ggc
 garbage collection       :  197.31 (11%) usr   8.98 (17%) sys  206.31 (11%) wall         0 kB ( 0%) ggc
 dump files               :    0.04 ( 0%) usr   0.04 ( 0%) sys    0.09 ( 0%) wall         0 kB ( 0%) ggc
 callgraph construction   :    0.77 ( 0%) usr   0.10 ( 0%) sys    0.94 ( 0%) wall     95588 kB ( 0%) ggc
 callgraph optimization   :    0.04 ( 0%) usr   0.02 ( 0%) sys    0.04 ( 0%) wall        64 kB ( 0%) ggc
 ipa dead code removal    :    0.01 ( 0%) usr   0.00 ( 0%) sys    0.01 ( 0%) wall         0 kB ( 0%) ggc
 ipa inheritance graph    :    0.03 ( 0%) usr   0.00 ( 0%) sys    0.01 ( 0%) wall       719 kB ( 0%) ggc
 ipa inlining heuristics  :    0.01 ( 0%) usr   0.00 ( 0%) sys    0.00 ( 0%) wall         0 kB ( 0%) ggc
 cfg construction         :    0.00 ( 0%) usr   0.00 ( 0%) sys    0.01 ( 0%) wall       248 kB ( 0%) ggc
 cfg cleanup              :    0.03 ( 0%) usr   0.00 ( 0%) sys    0.03 ( 0%) wall        53 kB ( 0%) ggc
 trivially dead code      :    0.03 ( 0%) usr   0.00 ( 0%) sys    0.01 ( 0%) wall         0 kB ( 0%) ggc
 df scan insns            :    0.11 ( 0%) usr   0.01 ( 0%) sys    0.14 ( 0%) wall        81 kB ( 0%) ggc
 df live regs             :    0.03 ( 0%) usr   0.00 ( 0%) sys    0.04 ( 0%) wall         0 kB ( 0%) ggc
 df reg dead/unused notes :    0.00 ( 0%) usr   0.00 ( 0%) sys    0.01 ( 0%) wall      2290 kB ( 0%) ggc
 register information     :    0.01 ( 0%) usr   0.00 ( 0%) sys    0.00 ( 0%) wall         0 kB ( 0%) ggc
 alias analysis           :    0.09 ( 0%) usr   0.01 ( 0%) sys    0.07 ( 0%) wall       797 kB ( 0%) ggc
 rebuild jump labels      :    0.01 ( 0%) usr   0.00 ( 0%) sys    0.00 ( 0%) wall         0 kB ( 0%) ggc
 preprocessing            :    0.57 ( 0%) usr   0.89 ( 2%) sys    1.32 ( 0%) wall     74289 kB ( 0%) ggc
 parser (global)          :    1.22 ( 0%) usr   1.10 ( 2%) sys    2.43 ( 0%) wall    202556 kB ( 0%) ggc
 parser struct body       : 1459.61 (84%) usr  29.08 (56%) sys 1489.75 (83%) wall 122435248 kB (91%) ggc
 parser enumerator list   :    0.05 ( 0%) usr   0.00 ( 0%) sys    0.04 ( 0%) wall      2847 kB ( 0%) ggc
 parser function body     :    0.61 ( 0%) usr   0.18 ( 0%) sys    0.72 ( 0%) wall     29874 kB ( 0%) ggc
 parser inl. func. body   :    0.14 ( 0%) usr   0.07 ( 0%) sys    0.22 ( 0%) wall     10547 kB ( 0%) ggc
 parser inl. meth. body   :    3.86 ( 0%) usr   0.68 ( 1%) sys    4.39 ( 0%) wall    239905 kB ( 0%) ggc
 template instantiation   :   79.00 ( 5%) usr  10.23 (20%) sys   88.55 ( 5%) wall  10574851 kB ( 8%) ggc
 early inlining heuristics:    0.00 ( 0%) usr   0.01 ( 0%) sys    0.00 ( 0%) wall         3 kB ( 0%) ggc
 inline parameters        :    0.05 ( 0%) usr   0.01 ( 0%) sys    0.04 ( 0%) wall      1531 kB ( 0%) ggc
 tree gimplify            :    0.28 ( 0%) usr   0.02 ( 0%) sys    0.31 ( 0%) wall     21735 kB ( 0%) ggc
 tree eh                  :    0.02 ( 0%) usr   0.02 ( 0%) sys    0.04 ( 0%) wall      3173 kB ( 0%) ggc
 tree CFG construction    :    0.01 ( 0%) usr   0.00 ( 0%) sys    0.04 ( 0%) wall      4453 kB ( 0%) ggc
 tree CFG cleanup         :    0.02 ( 0%) usr   0.00 ( 0%) sys    0.06 ( 0%) wall        24 kB ( 0%) ggc
 tree PHI insertion       :    0.01 ( 0%) usr   0.00 ( 0%) sys    0.01 ( 0%) wall       732 kB ( 0%) ggc
 tree SSA rewrite         :    0.03 ( 0%) usr   0.00 ( 0%) sys    0.03 ( 0%) wall      2542 kB ( 0%) ggc
 tree SSA other           :    0.03 ( 0%) usr   0.01 ( 0%) sys    0.05 ( 0%) wall       258 kB ( 0%) ggc
 tree SSA incremental     :    0.00 ( 0%) usr   0.00 ( 0%) sys    0.01 ( 0%) wall         0 kB ( 0%) ggc
 tree operand scan        :    1.17 ( 0%) usr   0.04 ( 0%) sys    1.18 ( 0%) wall      6878 kB ( 0%) ggc
 dominance computation    :    0.05 ( 0%) usr   0.00 ( 0%) sys    0.04 ( 0%) wall         0 kB ( 0%) ggc
 out of ssa               :    0.03 ( 0%) usr   0.01 ( 0%) sys    0.03 ( 0%) wall       174 kB ( 0%) ggc
 expand vars              :    0.09 ( 0%) usr   0.01 ( 0%) sys    0.09 ( 0%) wall      3785 kB ( 0%) ggc
 expand                   :    0.26 ( 0%) usr   0.02 ( 0%) sys    0.26 ( 0%) wall     22606 kB ( 0%) ggc
 post expand cleanups     :    0.02 ( 0%) usr   0.00 ( 0%) sys    0.02 ( 0%) wall      1994 kB ( 0%) ggc
 varconst                 :    0.26 ( 0%) usr   0.08 ( 0%) sys    0.32 ( 0%) wall       164 kB ( 0%) ggc
 jump                     :    0.01 ( 0%) usr   0.00 ( 0%) sys    0.02 ( 0%) wall         0 kB ( 0%) ggc
 loop init                :    0.00 ( 0%) usr   0.00 ( 0%) sys    0.02 ( 0%) wall      1375 kB ( 0%) ggc
 mode switching           :    0.01 ( 0%) usr   0.00 ( 0%) sys    0.02 ( 0%) wall         0 kB ( 0%) ggc
 integrated RA            :    0.63 ( 0%) usr   0.01 ( 0%) sys    0.69 ( 0%) wall     53889 kB ( 0%) ggc
 LRA non-specific         :    0.17 ( 0%) usr   0.03 ( 0%) sys    0.19 ( 0%) wall       279 kB ( 0%) ggc
 LRA virtuals elimination :    0.03 ( 0%) usr   0.00 ( 0%) sys    0.04 ( 0%) wall      1039 kB ( 0%) ggc
 LRA reload inheritance   :    0.01 ( 0%) usr   0.00 ( 0%) sys    0.02 ( 0%) wall         0 kB ( 0%) ggc
 LRA create live ranges   :    0.02 ( 0%) usr   0.00 ( 0%) sys    0.02 ( 0%) wall       137 kB ( 0%) ggc
 LRA hard reg assignment  :    0.02 ( 0%) usr   0.00 ( 0%) sys    0.00 ( 0%) wall         0 kB ( 0%) ggc
 reload                   :    0.00 ( 0%) usr   0.00 ( 0%) sys    0.01 ( 0%) wall         0 kB ( 0%) ggc
 thread pro- & epilogue   :    0.10 ( 0%) usr   0.02 ( 0%) sys    0.03 ( 0%) wall      2859 kB ( 0%) ggc
 shorten branches         :    0.06 ( 0%) usr   0.00 ( 0%) sys    0.07 ( 0%) wall         0 kB ( 0%) ggc
 reg stack                :    0.00 ( 0%) usr   0.01 ( 0%) sys    0.00 ( 0%) wall         0 kB ( 0%) ggc
 final                    :    0.14 ( 0%) usr   0.00 ( 0%) sys    0.16 ( 0%) wall      3775 kB ( 0%) ggc
 symout                   :    0.06 ( 0%) usr   0.04 ( 0%) sys    0.12 ( 0%) wall         0 kB ( 0%) ggc
 initialize rtl           :    0.00 ( 0%) usr   0.00 ( 0%) sys    0.01 ( 0%) wall        12 kB ( 0%) ggc
 rest of compilation      :    0.15 ( 0%) usr   0.05 ( 0%) sys    0.18 ( 0%) wall      4640 kB ( 0%) ggc
 TOTAL                    : 1747.32            51.80           1799.28            133809627 kB

^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: gcc9 snapshot 20190414 is 30x slower than gcc 6.3 2019-04-18 16:06 ` Jason Mancini @ 2019-04-18 19:07 ` Segher Boessenkool 2019-04-18 19:38 ` Jeff Law 0 siblings, 1 reply; 24+ messages in thread From: Segher Boessenkool @ 2019-04-18 19:07 UTC (permalink / raw) To: Jason Mancini; +Cc: Jonathan Wakely, gcc-help On Thu, Apr 18, 2019 at 04:06:17PM +0000, Jason Mancini wrote: > The root cause is between 7.1 and 7.2! 7.1 is fast, 7.2 is slow. > Bisected, and found it's due to revision 249333 on gcc-7-branch. > Here's the commit log and -ftime-report output. Where do we > go from here? Thanks! -JasonM Please open a bug report? https://gcc.gnu.org/bugzilla Thanks, Segher ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: gcc9 snapshot 20190414 is 30x slower than gcc 6.3 2019-04-18 19:07 ` Segher Boessenkool @ 2019-04-18 19:38 ` Jeff Law 2019-04-19 2:03 ` Jason Mancini ` (2 more replies) 0 siblings, 3 replies; 24+ messages in thread From: Jeff Law @ 2019-04-18 19:38 UTC (permalink / raw) To: Segher Boessenkool, Jason Mancini; +Cc: Jonathan Wakely, gcc-help On 4/18/19 1:07 PM, Segher Boessenkool wrote: > On Thu, Apr 18, 2019 at 04:06:17PM +0000, Jason Mancini wrote: >> The root cause is between 7.1 and 7.2! 7.1 is fast, 7.2 is slow. >> Bisected, and found it's due to revision 249333 on gcc-7-branch. >> Here's the commit log and -ftime-report output. Where do we >> go from here? Thanks! -JasonM > > Please open a bug report? https://gcc.gnu.org/bugzilla With a testcase! jeff ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: gcc9 snapshot 20190414 is 30x slower than gcc 6.3 2019-04-18 19:38 ` Jeff Law @ 2019-04-19 2:03 ` Jason Mancini 2019-04-22 22:01 ` Jason Mancini 2019-04-23 0:18 ` Jason Mancini 2 siblings, 0 replies; 24+ messages in thread From: Jason Mancini @ 2019-04-19 2:03 UTC (permalink / raw) To: Jeff Law, Segher Boessenkool; +Cc: Jonathan Wakely, gcc-help Working on getting a testcase prepared to file a bug report. The trimmed down case shows a 6x degradation instead of 45x at -O0, but hopefully that's enough to pinpoint the reason. (15 vs 85 seconds.) Seems like some O(n^2) behavior, 15s becomes 90s, but 40s becomes 30m. I used "gcc -E" to generate the output. Is that what we're looking for here? I'm not familiar with *.ii files (or is that the typical extension used for preprocessor output?) Thanks! ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: gcc9 snapshot 20190414 is 30x slower than gcc 6.3 2019-04-18 19:38 ` Jeff Law 2019-04-19 2:03 ` Jason Mancini @ 2019-04-22 22:01 ` Jason Mancini 2019-04-22 22:17 ` Jason Mancini 2019-04-23 0:18 ` Jason Mancini 2 siblings, 1 reply; 24+ messages in thread From: Jason Mancini @ 2019-04-22 22:01 UTC (permalink / raw) To: gcc-help

On gcc trunk, the performance culprit is at gcc/cp/call.c function build_over_call at:

  if (undeduced_auto_decl (fn))
    mark_used (fn, complain);  // <= this guy from gcc-7-branch r249333
  else
    /* Otherwise set TREE_USED for the benefit of -Wunused-function.
       See PR80598.  */
    TREE_USED (fn) = 1;

I'm still working on a code sample. The code sample has to be large to tickle the issue so far.

^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: gcc9 snapshot 20190414 is 30x slower than gcc 6.3 2019-04-22 22:01 ` Jason Mancini @ 2019-04-22 22:17 ` Jason Mancini 0 siblings, 0 replies; 24+ messages in thread From: Jason Mancini @ 2019-04-22 22:17 UTC (permalink / raw) To: gcc-help > On gcc trunk, the performance culprit is at gcc/cp/call.c function build_over_call at: > > if (undeduced_auto_decl (fn)) > mark_used (fn, complain); // <= this guy from gcc-7-branch r249333 > else > /* Otherwise set TREE_USED for the benefit of -Wunused-function. See PR80598. */ > TREE_USED (fn) = 1; mark_used is only called 1260 times, but inflates run time from ~13 to ~81 seconds for one sample. The calls to mark_used aren't expensive, so they must be triggering a down-stream effect. ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: gcc9 snapshot 20190414 is 30x slower than gcc 6.3 2019-04-18 19:38 ` Jeff Law 2019-04-19 2:03 ` Jason Mancini 2019-04-22 22:01 ` Jason Mancini @ 2019-04-23 0:18 ` Jason Mancini 2019-04-23 12:58 ` Jonathan Wakely 2019-04-29 20:33 ` Jason Mancini 2 siblings, 2 replies; 24+ messages in thread From: Jason Mancini @ 2019-04-23 0:18 UTC (permalink / raw) To: gcc-help We've determined that the gcc perf drop is due to use of decltype(auto) as the return type for template functions. Replacing with a known type or func(...) -> decltype(...) trailing type syntax seems to avoid the performance issue. ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: gcc9 snapshot 20190414 is 30x slower than gcc 6.3 2019-04-23 0:18 ` Jason Mancini @ 2019-04-23 12:58 ` Jonathan Wakely 2019-04-29 20:33 ` Jason Mancini 1 sibling, 0 replies; 24+ messages in thread From: Jonathan Wakely @ 2019-04-23 12:58 UTC (permalink / raw) To: Jason Mancini; +Cc: gcc-help On Tue, 23 Apr 2019 at 01:18, Jason Mancini <jayrusman@hotmail.com> wrote: > > We've determined that the gcc perf drop is due to use of decltype(auto) as the return type for template functions. Replacing with a known type or func(...) -> decltype(...) trailing type syntax seems to avoid the performance issue. Is the code you're compiling the same in all cases, meaning decltype(auto) is faster with GCC 6 than later releases? Or are you only using decltype(auto) with the later releases? It's possible that GCC 7 and later fixes some bugs in the handling of decltype(auto) which makes it slower than GCC 6. It's unsurprising that decltype(auto) requires the compiler to do more work, but ideally that work wouldn't make compilation exponentially slower, even with less buggy behaviour than in older releases. ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: gcc9 snapshot 20190414 is 30x slower than gcc 6.3 2019-04-23 0:18 ` Jason Mancini 2019-04-23 12:58 ` Jonathan Wakely @ 2019-04-29 20:33 ` Jason Mancini 2019-05-01 13:31 ` C11, <stdatomic.h> and atomic pointers Chris Hall 1 sibling, 1 reply; 24+ messages in thread From: Jason Mancini @ 2019-04-29 20:33 UTC (permalink / raw) To: gcc-help Jason Mancini said: > We've determined that the gcc perf drop is due to use of decltype(auto) as the > return type for template functions. Replacing with a known type or func(...) -> decltype(...) > trailing type syntax seems to avoid the performance issue. I misspoke here. Turns out that the above replacement made everything equally slow. So the performance bug was lurking in there, and gcc-7-branch r249333 exposed it more. Yeah yeah, we need to get an offending code blob cleaned up, approved, and bug filed. I've been using gcc9 snapshots with part of r249333 reverted in the mean time to make forward progress vetting gcc9 on our code base (no other problems so far!) Jason ^ permalink raw reply [flat|nested] 24+ messages in thread
* C11, <stdatomic.h> and atomic pointers 2019-04-29 20:33 ` Jason Mancini @ 2019-05-01 13:31 ` Chris Hall 2019-05-01 14:15 ` Martin Sebor 2019-06-20 11:28 ` __STDC_NO_THREADS__ and late model gcc/glibc Chris Hall 0 siblings, 2 replies; 24+ messages in thread From: Chris Hall @ 2019-05-01 13:31 UTC (permalink / raw) To: gcc-help

I find that:

  int * _Atomic foo ;
  int bar[12] ;

  foo = bar ;
  foo += 4 ;                     // foo -> &bar[4] -- of course

  foo = bar ;
  atomic_fetch_add(&foo, 4) ;    // foo -> &bar[1] -- ????

which, I confess, I did not quite expect. (Happy I looked, though !)

So, I looked at the Standard:

  7.17.7.5 The atomic_fetch and modify generic functions

  1 The following operations perform arithmetic and bitwise
    computations. All of these operations are applicable to an
    object of any atomic integer type. None of these
    operations is applicable to atomic_bool.

Of course, "integer type" excludes pointers, so I guess what it does
with pointers is undefined.

Should gcc be throwing a friendly warning here ?

But the Standard goes on to say:

  3 ... For signed integer types ... there are no undefined
    results. ...
    ... For address types, the result may be an undefined
    address, but the operations otherwise have no undefined
    behavior.

I don't know why it feels the need to mention "address types", given
that they are not valid arguments ? [I'm assuming that by "address
types" it actually means "pointer types". I can find no other mention
of "address type".]

The "Synopsis" says:

  2 #include <stdatomic.h>
    <C> atomic_fetch_<key>(volatile <A> *object, <M> operand);
    <C> atomic_fetch_<key>_explicit(volatile <A> *object,
                              <M> operand, memory_order order);

and the meaning of <A>, <C> and <M> is given in "7.17.1 Introduction",
as follows:

  5 In the following synopses:

     - An <A> refers to one of the atomic types.
     - A <C> refers to its corresponding non-atomic type.
     - An <M> refers to the type of the other argument for
       arithmetic operations. For atomic integer types, <M>
       is <C>. For atomic pointer types, <M> is ptrdiff_t.

As it happens, <M> is only used in the Synopsis for atomic_fetch_<key>...
which is not defined for pointer types ?

I realise this is not really the place for discussion of the Standard,
but I assume that what gcc does is based on some interpretation of it.
Is there a good place to look for that interpretation ?

Chris

--------------------------

FWIW (1): the functions in gcc/glibc's <stdatomic.h> do *not* require
the various <A> arguments to be atomic types... they are perfectly happy
with ordinary types. That doesn't seem right to me.

FWIW (2): the Standard (later in 7.17.7.5) says:

  5 NOTE The operation of the atomic_fetch and modify generic
    functions are nearly equivalent to the operation of the
    corresponding op= compound assignment operators. The only
    differences are that the compound assignment operators are
    not guaranteed to operate atomically, ...

Except that "6.5.16.2 Compound assignment" says:

  3 A compound assignment of the form E1 op= E2 ...
    ... If E1 has an atomic type, compound assignment is a
    read-modify-write operation with memory_order_seq_cst
    memory order semantics.

which looks like a flat contradiction to me.

^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: C11, <stdatomic.h> and atomic pointers 2019-05-01 13:31 ` C11, <stdatomic.h> and atomic pointers Chris Hall @ 2019-05-01 14:15 ` Martin Sebor 2019-05-02 7:54 ` Jonathan Wakely 2019-06-20 11:28 ` __STDC_NO_THREADS__ and late model gcc/glibc Chris Hall 1 sibling, 1 reply; 24+ messages in thread From: Martin Sebor @ 2019-05-01 14:15 UTC (permalink / raw) To: Chris Hall, gcc-help

On 05/01/2019 07:31 AM, Chris Hall wrote:
>
> I find that:
>
>   int * _Atomic foo ;
>   int bar[12] ;
>
>   foo = bar ;
>   foo += 4 ;                     // foo -> &bar[4] -- of course
>
>   foo = bar ;
>   atomic_fetch_add(&foo, 4) ;    // foo -> &bar[1] -- ????
>

I think this is bug 64843:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64843

> which, I confess, I did not quite expect. (Happy I looked, though !)
>
> So, I looked at the Standard:
>
>   7.17.7.5 The atomic_fetch and modify generic functions
>
>   1 The following operations perform arithmetic and bitwise
>     computations. All of these operations are applicable to an
>     object of any atomic integer type. None of these
>     operations is applicable to atomic_bool.
>
> Of course, "integer type" excludes pointers, so I guess what it does
> with pointers is undefined.
>
> Should gcc be throwing a friendly warning here ?
>
> But the Standard goes on to say:
>
>   3 ... For signed integer types ... there are no undefined
>     results. ...
>     ... For address types, the result may be an undefined
>     address, but the operations otherwise have no undefined
>     behavior.
>
> I don't know why it feels the need to mention "address types", given
> that they are not valid arguments ? [I'm assuming that by "address
> types" it actually means "pointer types". I can find no other mention
> of "address type".]

Yes, that's a problem in the standard text that should be fixed.

> The "Synopsis" says:
>
>   2 #include <stdatomic.h>
>     <C> atomic_fetch_<key>(volatile <A> *object, <M> operand);
>     <C> atomic_fetch_<key>_explicit(volatile <A> *object,
>                               <M> operand, memory_order order);
>
> and the meaning of <A>, <C> and <M> is given in "7.17.1 Introduction",
> as follows:
>
>   5 In the following synopses:
>
>      - An <A> refers to one of the atomic types.
>      - A <C> refers to its corresponding non-atomic type.
>      - An <M> refers to the type of the other argument for
>        arithmetic operations. For atomic integer types, <M>
>        is <C>. For atomic pointer types, <M> is ptrdiff_t.
>
> As it happens, <M> is only used in the Synopsis for atomic_fetch_<key>...
> which is not defined for pointer types ?
>
> I realise this is not really the place for discussion of the Standard,
> but I assume that what gcc does is based on some interpretation of it.
> Is there a good place to look for that interpretation ?

There are C defect reports that GCC also considers. Some may
already be incorporated, others are not. C11 defect reports are
tracked here:

  http://www.open-std.org/jtc1/sc22/wg14/www/docs/summary.htm

If the above isn't being tracked there or in the list below we
might want to write up a new issue for it and submit it to WG14 to
get it fixed in C2X.

  http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2316.htm

Martin

>
> Chris
>
> --------------------------
>
> FWIW (1): the functions in gcc/glibc's <stdatomic.h> do *not* require
> the various <A> arguments to be atomic types... they are perfectly happy
> with ordinary types. That doesn't seem right to me.
>
> FWIW (2): the Standard (later in 7.17.7.5) says:
>
>   5 NOTE The operation of the atomic_fetch and modify generic
>     functions are nearly equivalent to the operation of the
>     corresponding op= compound assignment operators. The only
>     differences are that the compound assignment operators are
>     not guaranteed to operate atomically, ...
>
> Except that "6.5.16.2 Compound assignment" says:
>
>   3 A compound assignment of the form E1 op= E2 ...
>     ... If E1 has an atomic type, compound assignment is a
>     read-modify-write operation with memory_order_seq_cst
>     memory order semantics.
>
> which looks like a flat contradiction to me.

^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: C11, <stdatomic.h> and atomic pointers 2019-05-01 14:15 ` Martin Sebor @ 2019-05-02 7:54 ` Jonathan Wakely 0 siblings, 0 replies; 24+ messages in thread From: Jonathan Wakely @ 2019-05-02 7:54 UTC (permalink / raw) To: Martin Sebor; +Cc: Chris Hall, gcc-help The equivalent wording in the C++ standard was modified by http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2017/p0558r1.pdf to remove the nonsense words "address types", and to clarify that using the fetch_xxx functions for arithmetic on pointers is valid, except for pointers to void and function pointers. ^ permalink raw reply [flat|nested] 24+ messages in thread
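For the C case raised in this sub-thread, one way to step an atomic pointer by whole elements without depending on how atomic_fetch_add treats a pointer operand is a compare-exchange loop. What follows is only a sketch under stated assumptions: atomic_ptr_advance and demo_advance are hypothetical names, not part of <stdatomic.h>, and a C11 compiler with _Atomic pointer support is assumed.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical helper (not a <stdatomic.h> function): atomically advance
   *p by n elements using ordinary, scaled pointer arithmetic, and return
   the value *p held before the update. */
int *atomic_ptr_advance(int *_Atomic *p, ptrdiff_t n)
{
    int *old = atomic_load(p);

    /* On failure the call stores the current value of *p into old, so
       each retry recomputes old + n from fresh data. */
    while (!atomic_compare_exchange_weak(p, &old, old + n)) {
        /* retry */
    }
    return old;
}

/* Hypothetical demo: mirrors the foo/bar snippet from the original post
   and returns how many elements foo ended up past bar. */
ptrdiff_t demo_advance(void)
{
    static int bar[12];
    int *_Atomic foo = bar;

    int *prev = atomic_ptr_advance(&foo, 4);
    assert(prev == bar);            /* the old value is returned */

    return atomic_load(&foo) - bar; /* element offset, not bytes */
}
```

Because the addition is written as `old + n` on a plain `int *`, the scaling is done by normal pointer arithmetic, so the result should not depend on the behaviour reported in bug 64843. Calling demo_advance() from main is enough to try it.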
* __STDC_NO_THREADS__ and late model gcc/glibc 2019-05-01 13:31 ` C11, <stdatomic.h> and atomic pointers Chris Hall 2019-05-01 14:15 ` Martin Sebor @ 2019-06-20 11:28 ` Chris Hall 2019-06-20 12:17 ` Jonathan Wakely 2020-01-06 15:09 ` Function returning struct on x86_64 (at least) Chris Hall 1 sibling, 2 replies; 24+ messages in thread From: Chris Hall @ 2019-06-20 11:28 UTC (permalink / raw) To: gcc-help I find that gcc 4.9.0 onwards defines __STDC_NO_THREADS__ to be '1', denying support for <threads.h>. Nevertheless, it does support _Thread_local. As of glibc 2.28, <threads.h> appears in the library. I guess that means some version of gcc will no longer set __STDC_NO_THREADS__ ? On gcc.godbolt.org, I find that __STDC_NO_THREADS__ is defined for all versions up to and including the "trunk". But on my machine, __STDC_NO_THREADS__ is no longer set by gcc v9.1.1, and no longer set by gcc v7.2.0 (which I just built on my machine). I note that gcc.godbolt.org has glibc 2.27, while my machine has glibc 2.28. I'm guessing that something, somewhere is taking that into account. I have looked everywhere I can think of to find where __STDC_NO_THREADS__ is configured... but to no avail :-( Does anyone know what I should expect, or where I should look to find out, please ? Thanks, Chris ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: __STDC_NO_THREADS__ and late model gcc/glibc 2019-06-20 11:28 ` __STDC_NO_THREADS__ and late model gcc/glibc Chris Hall @ 2019-06-20 12:17 ` Jonathan Wakely 2019-06-21 9:39 ` Chris Hall 2020-01-06 15:09 ` Function returning struct on x86_64 (at least) Chris Hall 1 sibling, 1 reply; 24+ messages in thread From: Jonathan Wakely @ 2019-06-20 12:17 UTC (permalink / raw) To: Chris Hall; +Cc: gcc-help On Thu, 20 Jun 2019 at 12:28, Chris Hall wrote: > > > I find that gcc 4.9.0 onwards defines __STDC_NO_THREADS__ to be '1', > denying support for <threads.h>. Nevertheless, it does support > _Thread_local. > > As of glibc 2.28, <threads.h> appears in the library. I guess that > means some version of gcc will no longer set __STDC_NO_THREADS__ ? > > On gcc.godbolt.org, I find that __STDC_NO_THREADS__ is defined for all > versions up to and including the "trunk". > > But on my machine, __STDC_NO_THREADS__ is no longer set by gcc v9.1.1, > and no longer set by gcc v7.2.0 (which I just built on my machine). > > I note that gcc.godbolt.org has glibc 2.27, while my machine has glibc > 2.28. I'm guessing that something, somewhere is taking that into account. > > I have looked everywhere I can think of to find where > __STDC_NO_THREADS__ is configured... but to no avail :-( > > Does anyone know what I should expect, or where I should look to find > out, please ? Glibc provides it in the /usr/include/stdc-predef.h file which is implicitly pre-included by GCC. ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: __STDC_NO_THREADS__ and late model gcc/glibc 2019-06-20 12:17 ` Jonathan Wakely @ 2019-06-21 9:39 ` Chris Hall 2019-06-21 9:51 ` Jonathan Wakely 0 siblings, 1 reply; 24+ messages in thread From: Chris Hall @ 2019-06-21 9:39 UTC (permalink / raw) To: Jonathan Wakely; +Cc: gcc-help On 20/06/2019 13:17, Jonathan Wakely wrote: > On Thu, 20 Jun 2019 at 12:28, Chris Hall wrote: ... >> I have looked everywhere I can think of to find where >> __STDC_NO_THREADS__ is configured... but to no avail :-( >> >> Does anyone know what I should expect, or where I should look to find >> out, please ? > Glibc provides it in the /usr/include/stdc-predef.h file which is > implicitly pre-included by GCC. Ah ha ! Thank you. I guess that accounts for the "predefined" _STDC_PREDEF_H. AFAICS, the decision to "preinclude" <stdc-predef.h> is made when gcc is built, depending on the target. Deep Magic of the First Magnitude. Thanks, Chris ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: __STDC_NO_THREADS__ and late model gcc/glibc
From: Jonathan Wakely @ 2019-06-21 9:51 UTC
To: Chris Hall; +Cc: gcc-help

On Fri, 21 Jun 2019 at 10:39, Chris Hall wrote:
>
> On 20/06/2019 13:17, Jonathan Wakely wrote:
> > On Thu, 20 Jun 2019 at 12:28, Chris Hall wrote:
> ...
> >> I have looked everywhere I can think of to find where
> >> __STDC_NO_THREADS__ is configured... but to no avail :-(
> >>
> >> Does anyone know what I should expect, or where I should look to find
> >> out, please ?
>
> > Glibc provides it in the /usr/include/stdc-predef.h file which is
> > implicitly pre-included by GCC.
>
> Ah ha ! Thank you.
>
> I guess that accounts for the "predefined" _STDC_PREDEF_H.
>
> AFAICS, the decision to "preinclude" <stdc-predef.h> is made when gcc is
> built, depending on the target.

Right, see gcc/config/glibc-c.c

> Deep Magic of the First Magnitude.

:-)
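For anyone wanting to check what their own toolchain does, here is a minimal sketch that reports whether the pre-included header seems to be in effect and whether __STDC_NO_THREADS__ ends up defined. The helper names (predef_was_included, threads_unavailable, report_predef) are made up for illustration; the only things taken from the thread are the two macros, and _STDC_PREDEF_H is assumed to be glibc's include guard for <stdc-predef.h>.

```c
#include <stdio.h>

/* Returns 1 if _STDC_PREDEF_H is defined in this translation unit.
 * Assumption: that macro is the include guard of glibc's <stdc-predef.h>,
 * so seeing it without an explicit #include suggests the pre-include ran. */
int predef_was_included(void)
{
#ifdef _STDC_PREDEF_H
    return 1;
#else
    return 0;
#endif
}

/* Returns 1 if the implementation says <threads.h> is not available. */
int threads_unavailable(void)
{
#ifdef __STDC_NO_THREADS__
    return 1;
#else
    return 0;
#endif
}

/* Hypothetical helper: print both findings for the current toolchain. */
void report_predef(void)
{
    printf("_STDC_PREDEF_H      : %s\n",
           predef_was_included() ? "defined" : "not defined");
    printf("__STDC_NO_THREADS__ : %s\n",
           threads_unavailable() ? "defined" : "not defined");
}
```

Calling report_predef() from main() with different gcc/glibc pairs should show the difference described above, since the answer depends on the installed C library headers rather than on the compiler version alone.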
* Function returning struct on x86_64 (at least)
From: Chris Hall @ 2020-01-06 15:09 UTC
To: gcc-help

I hoped to do something "clever" with a function of the form:

  typedef struct { char s[64] ; } qerr_str_t ;

  extern qerr_str_t
  qerrst0(int err)
  {
    qerr_str_t st ;

    snprintf(st.s, sizeof(st.s), "errno=%d", err) ;

    return st ;
  }

but was disappointed to find that this compiles (gcc 8.3 and others,
-O2) to this:

  .LC0:
        .string "errno=%d"
  qerrst0:
        pushq   %rbx
        movl    %esi, %ecx
        movq    %rdi, %rbx
        movl    $.LC0, %edx
        movl    $64, %esi
        xorl    %eax, %eax
        subq    $64, %rsp
        movq    %rsp, %rdi
        call    snprintf
        movdqa  (%rsp), %xmm0
        movq    %rbx, %rax
        movdqa  16(%rsp), %xmm1
        movdqa  32(%rsp), %xmm2
        movdqa  48(%rsp), %xmm3
        movups  %xmm0, (%rbx)
        movups  %xmm1, 16(%rbx)
        movups  %xmm2, 32(%rbx)
        movups  %xmm3, 48(%rbx)
        addq    $64, %rsp
        popq    %rbx
        ret

On reflection, the compiler is playing safe and not writing to whatever
the "hidden" pointer %rdi is pointing at, until the implicit assignment.
So I have no right to be disappointed.

The object of the exercise is to create temporary strings for use like
this:

  int
  main(int argc, char* argv[])
  {
    printf("%s: %s\n", argv[0], qerrst0(argc).s) ;
  }

where the "hidden" pointer passed to qerrst0() does not, in fact, point
to anything accessible.

Sadly, even when qerrst0() is inlined, I find:

  .LC0:
        .string "errno=%d"
  .LC1:
        .string "%s: %s\n"
  main:
        pushq   %rbx
        movl    %edi, %ecx
        movq    %rsi, %rbx
        movl    $.LC0, %edx
        movl    $64, %esi
        xorl    %eax, %eax
        addq    $-128, %rsp
        leaq    64(%rsp), %rdi
        call    snprintf
        movdqa  64(%rsp), %xmm0
        movq    (%rbx), %rsi
        xorl    %eax, %eax
        movdqa  80(%rsp), %xmm1
        movdqa  96(%rsp), %xmm2
        movq    %rsp, %rdx
        movl    $.LC1, %edi
        movdqa  112(%rsp), %xmm3
        movaps  %xmm0, (%rsp)
        movaps  %xmm1, 16(%rsp)
        movaps  %xmm2, 32(%rsp)
        movaps  %xmm3, 48(%rsp)
        call    printf
        subq    $-128, %rsp
        xorl    %eax, %eax
        popq    %rbx
        ret

where there is still an (unnecessary) assignment going on !

I tried something simpler:

  extern qerr_str_t
  qerrst1(int err)
  {
    qerr_str_t st ;

    st.s[0] = err ;

    return st ;
  }

which compiles to:

  qerrst1:
        movq    %rdi, %rax
        movb    %sil, (%rdi)
        ret

...so a trivial case optimises as one might hope. As does:

  extern qerr_str_t
  qerrst2(int err)
  {
    qerr_str_t st ;
    char* q = st.s ;

    q[0]  = err ;
    q[63] = err ;

    return st ;
  }

  qerrst2:
        movq    %rdi, %rax
        movb    %sil, (%rdi)
        movb    %sil, 63(%rdi)
        ret

The following are also optimised:

  extern qerr_str_t
  qerrst3a(int err)
  {
    qerr_str_t st = { "" } ;

    return st ;
  }

  extern qerr_str_t
  qerrst3b(int err)
  {
    qerr_str_t st ;
    char* q = st.s ;

    memset(q, 0, sizeof(st.s)) ;

    return st ;
  }

to the same code:

  qerrst3a/b:
        pxor    %xmm0, %xmm0
        movq    %rdi, %rax
        movups  %xmm0, (%rdi)
        movups  %xmm0, 16(%rdi)
        movups  %xmm0, 32(%rdi)
        movups  %xmm0, 48(%rdi)
        ret

However, ever so slightly more complicated:

  extern qerr_str_t
  qerrst4(int err)
  {
    qerr_str_t st ;

    for (int i = 0 ; i < (err & 63) ; ++i)
      st.s[i] = err - i ;

    return st ;
  }

  qerrst4:
        movl    %esi, %edx
        movq    %rdi, %rax
        andl    $63, %edx
        je      .L12
        subl    $1, %edx
        leaq    -71(%rsp,%rdx), %r8
        leaq    -72(%rsp), %rdx
        addl    %edx, %esi
  .L11:
        movl    %esi, %ecx
        subl    %edx, %ecx
        addq    $1, %rdx
        movb    %cl, -1(%rdx)
        cmpq    %r8, %rdx
        jne     .L11
  .L12:
        movdqa  -72(%rsp), %xmm0
        movdqa  -56(%rsp), %xmm1
        movdqa  -40(%rsp), %xmm2
        movdqa  -24(%rsp), %xmm3
        movups  %xmm0, (%rax)
        movups  %xmm1, 16(%rax)
        movups  %xmm2, 32(%rax)
        movups  %xmm3, 48(%rax)
        ret

Which is a puzzle :-(

Interestingly, I also found (after a little effort):

  extern qerr_str_t
  qerrst5(int err, char* fred)
  {
    qerr_str_t st ;

    st.s[ 0] = err ;
    st.s[ 2] = fred[ 8] ;
    st.s[ 4] = fred[ 6] ;
    st.s[ 6] = fred[ 4] ;
    st.s[ 8] = fred[ 2] ;
    st.s[10] = fred[ 0] ;

    return st ;
  }

  qerrst5:
        movq    %rdi, %rax
        movzbl  8(%rdx), %r9d
        movzbl  6(%rdx), %r8d
        movzbl  4(%rdx), %edi
        movzbl  2(%rdx), %ecx
        movb    %sil, (%rax)      -- BUG iff %rax ==
        movzbl  (%rdx), %edx      --         %rdx !
        movb    %r9b, 2(%rax)
        movb    %r8b, 4(%rax)
        movb    %dil, 6(%rax)
        movb    %cl, 8(%rax)
        movb    %dl, 10(%rax)
        ret

which is very nearly correct... except as noted, if *fred points at the
final destination !!

For this to do what I had hoped (and I imagine is the majority case),
what is needed is a way to mark the declaration of 'qerr_str_t st' in
the function as a "clone" of the final destination 'qerr_str_t' in the
caller -- so that the compiler could Just Do It.

I looked for an __attribute__(()) for this... but could not find one.

Is there any way in which I can persuade the compiler that a function
returning a struct does not need to worry about preserving the value of
the final destination (ie the struct at %rdi) ?

Chris
* Re: Function returning struct on x86_64 (at least)
From: Marc Glisse @ 2020-01-06 15:19 UTC
To: Chris Hall; +Cc: gcc-help

On Mon, 6 Jan 2020, Chris Hall wrote:

[description of NRVO]

> Is there any way in which I can persuade the compiler that a function
> returning a struct does not need to worry about preserving the value of the
> final destination (ie the struct at %rdi) ?

Compile the file as C++ instead of C. Not that it would be forbidden in C,
but the optimization happens to be in the C++ front-end. There is also an
optimization pass called nrv, but it doesn't trigger that often.

-- 
Marc Glisse
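If staying in C, one possible way to sidestep the question (not something proposed in the thread) is to let the caller own the buffer explicitly, so that no struct is returned by value at all. The sketch below is only an illustration under that assumption: qerrst_fill() and the QERRST() macro are hypothetical names, and the macro relies on a C99 compound literal, which is an unnamed object living until the end of the enclosing block.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

typedef struct { char s[64]; } qerr_str_t;

/* Hypothetical variant of qerrst0(): the caller passes the destination,
 * so snprintf() writes straight into it and nothing is copied afterwards. */
const char *qerrst_fill(qerr_str_t *out, int err)
{
    assert(out != NULL);
    snprintf(out->s, sizeof out->s, "errno=%d", err);
    return out->s;
}

/* Hypothetical convenience macro: the compound literal provides the
 * temporary buffer. It stays valid until the end of the enclosing block,
 * which is long enough for use as a printf() argument. */
#define QERRST(err) qerrst_fill(&(qerr_str_t){ { 0 } }, (err))
```

A call such as printf("%s: %s\n", argv[0], QERRST(argc)) would then replace qerrst0(argc).s. The trade-off is that the compound literal is zero-initialised first, so this exchanges the 64-byte copy for a 64-byte clear; whether that is a win would need measuring.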
* Re: Function returning struct on x86_64 (at least)
From: Chris Hall @ 2020-01-07 16:20 UTC
To: gcc-help, Marc Glisse

On 06/01/2020 15:19, Marc Glisse wrote:
> On Mon, 6 Jan 2020, Chris Hall wrote:
>
> [description of NRVO]
>
>> Is there any way in which I can persuade the compiler that a function
>> returning a struct does not need to worry about preserving the value
>> of the final destination (ie the struct at %rdi) ?

> Compile the file as C++ instead of C. Not that it would be forbidden in
> C, but the optimization happens to be in the C++ front-end. There is
> also an optimization pass called nrv, but it doesn't trigger that often.

The idea of trying to write as-standard-as-possible C11, and then
compiling it as C++ makes me queasy :-(

As far as I can see, tree-nrv is enabled. Some of the functions do
write directly to %rdi, so I guess that's being done by tree-nrv. But,
as you say, this optimization does seem not to be applied often.

I was puzzled by why the optimization would not be forbidden in C...

...clearly, the problem is that 'z = f(...) ;' is defined such that the
result must be as if f() creates a temporary struct t, which is copied
to z when f() completes. If f() writes directly to z it must not then
read from z -- which looks hard to guarantee, not least because only
the *caller* knows what z is !!

Now, as I noted before, I found (after a little effort):

  typedef struct { char s[64] ; } qerrst_t ;

  extern qerrst_t
  qerrst5(int err, char* foo)
  {
    qerrst_t st ;

    st.s[ 0] = err ;
    st.s[ 2] = foo[ 8] ;
    st.s[ 4] = foo[ 6] ;
    st.s[ 6] = foo[ 4] ;
    st.s[ 8] = foo[ 2] ;
    st.s[10] = foo[ 0] ;

    return st ;
  }

  qerrst5:
        movq    %rdi, %rax
        movzbl  8(%rdx), %r9d
        movzbl  6(%rdx), %r8d
        movzbl  4(%rdx), %edi
        movzbl  2(%rdx), %ecx
        movb    %sil, (%rax)      -- BUG iff %rax ==
        movzbl  (%rdx), %edx      --         %rdx !
        movb    %r9b, 2(%rax)
        movb    %r8b, 4(%rax)
        movb    %dil, 6(%rax)
        movb    %cl, 8(%rax)
        movb    %dl, 10(%rax)
        ret

which does *not* do a copy in the function, and which is very nearly
correct... except as noted, if *foo points at the final destination !!

HOWEVER... when I tried this function the bug did NOT appear. It turns
out that the *caller* passes a pointer to a hidden struct and the
*caller* copies that to the final destination !!!

The test I ran is:

  typedef struct { char s[64] ; } qerrst_t ;

  extern qerrst_t qerrst5(int err, char* foo) ;

  static void __attribute__((noinline))
  show(const char* name, char* s)
  {
    printf("%s", name) ;
    for (int i = 0 ; i <= 10 ; ++i)
      printf(" %3d", (unsigned char)s[i]) ;
    printf("\n") ;
  }

  int
  main(Unused int argc, Unused char* argv[])
  {
    int err = argc ;
    qerrst_t x, y ;

    for (int i = 0 ; i < (int)sizeof(x.s) ; ++i)
      y.s[i] = x.s[i] = (char)(100 + i) ;

    y = qerrst5(err, x.s) ;
    show("y", y.s) ;

    x = qerrst5(err, x.s) ;
    show("x", x.s) ;

    return 0 ;
  }

and in a separate compilation unit (for completeness):

  extern qerrst_t
  qerrst5(int err, char* foo)
  {
    qerrst_t st ;

    st.s[ 0] = (char)err ;
    st.s[ 1] = -1 ;
    st.s[ 2] = foo[ 8] ;
    st.s[ 3] = -3 ;
    st.s[ 4] = foo[ 6] ;
    st.s[ 5] = -5 ;
    st.s[ 6] = foo[ 4] ;
    st.s[ 7] = -7 ;
    st.s[ 8] = foo[ 2] ;
    st.s[ 9] = -8 ;
    st.s[10] = foo[ 0] ;

    return st ;
  }

which compiles to:

  Dump of assembler code for function qerrst5:
    0x47fdd0 <+0>:   mov     %rdi,%rax
    0x47fdd3 <+3>:   movzbl  0x8(%rdx),%r8d    # read foo[8]
    0x47fdd8 <+8>:   movzbl  0x6(%rdx),%edi
    0x47fddc <+12>:  mov     %esi,%r9d
    0x47fddf <+15>:  movzbl  0x2(%rdx),%ecx
    0x47fde3 <+19>:  movzbl  0x4(%rdx),%esi
    0x47fde7 <+23>:  mov     %r9b,(%rax)       # write st.s[0]
    0x47fdea <+26>:  movb    $0xff,0x1(%rax)
    0x47fdee <+30>:  movb    $0xfd,0x3(%rax)
    0x47fdf2 <+34>:  movb    $0xfb,0x5(%rax)
    0x47fdf6 <+38>:  movb    $0xf9,0x7(%rax)
    0x47fdfa <+42>:  movb    $0xf8,0x9(%rax)
    0x47fdfe <+46>:  mov     %r8b,0x2(%rax)
    0x47fe02 <+50>:  mov     %dil,0x4(%rax)
    0x47fe06 <+54>:  mov     %sil,0x6(%rax)
    0x47fe0a <+58>:  mov     %cl,0x8(%rax)
    0x47fe0d <+61>:  movzbl  (%rdx),%edx       # read foo[0]
                                               # -- BUG if foo == st.s
    0x47fe10 <+64>:  mov     %dl,0xa(%rax)     # write st.s[10]
    0x47fe13 <+67>:  retq

And the result was:

  y   1 255 108 253 106 251 104 249 102 248 100
  x   1 255 108 253 106 251 104 249 102 248 100

SURPRISE ! expected to see:

  x   1 255 108 253 106 251 104 249 102 248   1   <<< BUG

Looking at main() we see:

  Dump of assembler code for function main:
    0x4012a0 <+0>:    push    %rbp
    0x4012a1 <+1>:    mov     %edi,%esi
    0x4012a3 <+3>:    mov     %rsp,%rbp
    0x4012a6 <+6>:    push    %r12
    0x4012a8 <+8>:    mov     %edi,%r12d
    0x4012ab <+11>:   and     $0xffffffffffffffe0,%rsp
    0x4012af <+15>:   sub     $0xc0,%rsp
    0x4012b6 <+22>:   vmovaps 0x94642(%rip),%xmm0    # 0x495900
    0x4012be <+30>:   lea     0x40(%rsp),%rdx        # ->x
    0x4012c3 <+35>:   mov     %rsp,%rdi              # ->t
    0x4012c6 <+38>:   vmovaps %xmm0,0x40(%rsp)       # x0
    0x4012cc <+44>:   vmovaps %xmm0,0x80(%rsp)       # y0
    0x4012d5 <+53>:   vmovaps 0x94633(%rip),%xmm0    # 0x495910
    0x4012dd <+61>:   vmovaps %xmm0,0x50(%rsp)       # x1
    0x4012e3 <+67>:   vmovaps %xmm0,0x90(%rsp)       # y1
    0x4012ec <+76>:   vmovaps 0x9462c(%rip),%xmm0    # 0x495920
    0x4012f4 <+84>:   vmovaps %xmm0,0x60(%rsp)       # x2
    0x4012fa <+90>:   vmovaps %xmm0,0xa0(%rsp)       # y2
    0x401303 <+99>:   vmovaps 0x94625(%rip),%xmm0    # 0x495930
    0x40130b <+107>:  vmovaps %xmm0,0x70(%rsp)       # x3
    0x401311 <+113>:  vmovaps %xmm0,0xb0(%rsp)       # y3
    0x40131a <+122>:  callq   0x47fdd0 <qerrst5>
    0x40131f <+127>:  vmovups (%rsp),%xmm1           # t0
    0x401324 <+132>:  lea     0x80(%rsp),%rsi        # ->y
    0x40132c <+140>:  mov     $0x494b3e,%edi
    0x401331 <+145>:  vmovups 0x10(%rsp),%xmm2       # t1
    0x401337 <+151>:  vmovups 0x20(%rsp),%xmm3       # t2
    0x40133d <+157>:  vmovups 0x30(%rsp),%xmm4       # t3
    0x401343 <+163>:  vmovaps %xmm1,0x80(%rsp)       # y0
    0x40134c <+172>:  vmovaps %xmm2,0x90(%rsp)       # y1
    0x401355 <+181>:  vmovaps %xmm3,0xa0(%rsp)       # y2
    0x40135e <+190>:  vmovaps %xmm4,0xb0(%rsp)       # y3
    0x401367 <+199>:  callq   0x47fb60 <show>
    0x40136c <+204>:  lea     0x40(%rsp),%rdx        # ->x
    0x401371 <+209>:  mov     %r12d,%esi             # err
    0x401374 <+212>:  mov     %rsp,%rdi              # ->t
    0x401377 <+215>:  callq   0x47fdd0 <qerrst5>
    0x40137c <+220>:  vmovups (%rsp),%xmm5           # t0
    0x401381 <+225>:  lea     0x40(%rsp),%rsi        # ->x
    0x401386 <+230>:  mov     $0x4937c4,%edi
    0x40138b <+235>:  vmovups 0x10(%rsp),%xmm6       # t1
    0x401391 <+241>:  vmovups 0x20(%rsp),%xmm7       # t2
    0x401397 <+247>:  vmovups 0x30(%rsp),%xmm1       # t3
    0x40139d <+253>:  vmovaps %xmm5,0x40(%rsp)       # x0
    0x4013a3 <+259>:  vmovaps %xmm6,0x50(%rsp)       # x1
    0x4013a9 <+265>:  vmovaps %xmm7,0x60(%rsp)       # x2
    0x4013af <+271>:  vmovaps %xmm1,0x70(%rsp)       # x3
    0x4013b5 <+277>:  callq   0x47fb60 <show>
    0x4013ba <+282>:  xor     %eax,%eax
    0x4013bc <+284>:  mov     -0x8(%rbp),%r12
    0x4013c0 <+288>:  leaveq
    0x4013c1 <+289>:  retq

The caller is passing a pointer to a hidden 't' and then *itself*
copying the result to the destination of the assignment !!

It looks like the caller is taking care of the problem, so a function
returning a struct does not need to... surely ?

So I also tried:

  typedef struct { char s[64] ; } qerrst_t ;

  extern qerrst_t qerrst0(int err) ;

  int
  main(Unused int argc, Unused char* argv[])
  {
    int err = argc ;
    qerrst_t z ;

    printf("qerrst0()='%s'\n", qerrst0(err).s) ;

    z = qerrst0(err) ;
    printf("qerrst0()='%s'\n", z.s) ;

    return 0 ;
  }

and in a separate compilation unit (for completeness):

  extern qerrst_t
  qerrst0(int err)
  {
    qerrst_t st ;

    snprintf(st.s, sizeof(st.s), "errno=%d", err) ;

    return st ;
  }

which compiles to:

  Dump of assembler code for function qerrst0:
    0x47fc00 <+0>:   push    %r12
    0x47fc02 <+2>:   mov     %esi,%ecx
    0x47fc04 <+4>:   mov     %rdi,%r12
    0x47fc07 <+7>:   mov     $0x495910,%edx
    0x47fc0c <+12>:  sub     $0x40,%rsp
    0x47fc10 <+16>:  mov     $0x40,%esi
    0x47fc15 <+21>:  xor     %eax,%eax
    0x47fc17 <+23>:  mov     %rsp,%rdi
    0x47fc1a <+26>:  callq   0x4010b0 <snprintf@plt>
    0x47fc1f <+31>:  vmovaps (%rsp),%xmm0
    0x47fc24 <+36>:  mov     %r12,%rax
    0x47fc27 <+39>:  vmovaps 0x10(%rsp),%xmm1
    0x47fc2d <+45>:  vmovaps 0x20(%rsp),%xmm2
    0x47fc33 <+51>:  vmovaps 0x30(%rsp),%xmm3
    0x47fc39 <+57>:  vmovups %xmm0,(%r12)
    0x47fc3f <+63>:  vmovups %xmm1,0x10(%r12)
    0x47fc46 <+70>:  vmovups %xmm2,0x20(%r12)
    0x47fc4d <+77>:  vmovups %xmm3,0x30(%r12)
    0x47fc54 <+84>:  add     $0x40,%rsp
    0x47fc58 <+88>:  pop     %r12
    0x47fc5a <+90>:  retq

which, as before, creates a temporary, local struct which is copied to
the return struct pointed to by %rdi.

And now we see:

  Dump of assembler code for function main:
    0x401280 <+0>:    push    %rbp
    0x401281 <+1>:    mov     %edi,%esi
    0x401283 <+3>:    mov     %rsp,%rbp
    0x401286 <+6>:    push    %r12
    0x401288 <+8>:    mov     %edi,%r12d
    0x40128b <+11>:   and     $0xffffffffffffffe0,%rsp
    0x40128f <+15>:   sub     $0xc0,%rsp
    0x401296 <+22>:   lea     0x80(%rsp),%rdi        # ->t
    0x40129e <+30>:   callq   0x47fc00 <qerrst0>     # qerrst0(err).s
    0x4012a3 <+35>:   lea     0x80(%rsp),%rsi
    0x4012ab <+43>:   mov     $0x495882,%edi
    0x4012b0 <+48>:   xor     %eax,%eax
    0x4012b2 <+50>:   callq   0x4010a0 <printf@plt>  # printf(..., t)
    0x4012b7 <+55>:   mov     %r12d,%esi
    0x4012ba <+58>:   mov     %rsp,%rdi              # ->t
    0x4012bd <+61>:   callq   0x47fc00 <qerrst0>     # z = qerrst0(err) ;
    0x4012c2 <+66>:   vmovups (%rsp),%xmm0           # t0
    0x4012c7 <+71>:   lea     0x40(%rsp),%rsi        # ->z
    0x4012cc <+76>:   mov     $0x495882,%edi
    0x4012d1 <+81>:   vmovups 0x10(%rsp),%xmm1       # t1
    0x4012d7 <+87>:   vmovups 0x20(%rsp),%xmm2       # t2
    0x4012dd <+93>:   xor     %eax,%eax
    0x4012df <+95>:   vmovups 0x30(%rsp),%xmm3       # t3
    0x4012e5 <+101>:  vmovaps %xmm0,0x40(%rsp)       # z0 )
    0x4012eb <+107>:  vmovaps %xmm1,0x50(%rsp)       # z1 ) copied from t
    0x4012f1 <+113>:  vmovaps %xmm2,0x60(%rsp)       # z2 )
    0x4012f7 <+119>:  vmovaps %xmm3,0x70(%rsp)       # z3 )
    0x4012fd <+125>:  callq   0x4010a0 <printf@plt>  # printf(..., z.s)
    0x401302 <+130>:  xor     %eax,%eax
    0x401304 <+132>:  mov     -0x8(%rbp),%r12
    0x401308 <+136>:  leaveq
    0x401309 <+137>:  retq

So for:

  printf("qerrst0()='%s'\n", qerrst0(err).s) ;

there is one (spurious) copy in qerrst0(). And for:

  z = qerrst0(err) ;
  printf("qerrst0()='%s'\n", z.s) ;

there is one (spurious) copy in qerrst0() AND a *second* copy in main().

Is it just me, or is this broken ?

So, I looked at the AMD64 ABI (Draft 0.99.7 - November 17, 2014 - 15:08),
Section 3.2.3 Parameter Passing, p22:

  Returning of Values:
  ....
  2. If the type has class MEMORY, then the caller provides space for
     the return value and passes the address of this storage in %rdi
     as if it were the first argument to the function. In effect, this
     address becomes a "hidden" first argument. This storage must not
     overlap any data visible to the callee through other names than
     this argument.

So... the ABI appears to say that the callee does *not* need to do any
copying *ever*.

This pushes the problem back to the caller. If the caller can be sure
that the final destination is not visible to the callee, it too can
avoid copying.

So... why is the qerrst0() function doing a copy ?

Chris
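The "as if via a temporary" requirement discussed above can be checked with a much smaller test than the one in the message. The following is only a sketch: buf64_t, swap_first_two() and assignment_acts_on_temporary() are made-up names, and the struct is kept at 64 bytes on the assumption that this puts it in the ABI's MEMORY class, as with qerrst_t.

```c
#include <string.h>

typedef struct { char s[64]; } buf64_t;   /* hypothetical 64-byte struct */

/* Writes result byte 0 before reading src[0]. If the result storage were
 * the same memory as src, the second read would see the new value. */
buf64_t swap_first_two(const char *src)
{
    buf64_t st;

    memset(st.s, 0, sizeof st.s);
    st.s[0] = src[1];   /* write first ...                               */
    st.s[1] = src[0];   /* ... then read what must still be the old byte */

    return st;
}

/* Returns 1 if assigning the result over the argument's own storage gives
 * the answer C requires, i.e. the callee behaved as though it had filled
 * a private temporary that was copied to x only after the call returned. */
int assignment_acts_on_temporary(void)
{
    buf64_t x = { "ab" };

    x = swap_first_two(x.s);

    return x.s[0] == 'b' && x.s[1] == 'a';
}
```

A conforming build should return 1 here however the copies are arranged. If the callee wrote through the hidden pointer while that pointer aliased x, the result would be "bb" instead, which is the failure the no-overlap sentence in the ABI text is there to rule out.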
end of thread, other threads:[~2020-01-07 16:20 UTC | newest]

Thread overview: 24+ messages
2019-04-17  0:28 gcc9 snapshot 20190414 is 30x slower than gcc 6.3 Jason Mancini
2019-04-17  2:10 ` Jason Mancini
2019-04-17  2:20 ` Xi Ruoyao
2019-04-17  8:38 ` Jonathan Wakely
2019-04-17  9:07 ` Segher Boessenkool
2019-04-18 16:06 ` Jason Mancini
2019-04-18 19:07 ` Segher Boessenkool
2019-04-18 19:38 ` Jeff Law
2019-04-19  2:03 ` Jason Mancini
2019-04-22 22:01 ` Jason Mancini
2019-04-22 22:17 ` Jason Mancini
2019-04-23  0:18 ` Jason Mancini
2019-04-23 12:58 ` Jonathan Wakely
2019-04-29 20:33 ` Jason Mancini
2019-05-01 13:31 ` C11, <stdatomic.h> and atomic pointers Chris Hall
2019-05-01 14:15 ` Martin Sebor
2019-05-02  7:54 ` Jonathan Wakely
2019-06-20 11:28 ` __STDC_NO_THREADS__ and late model gcc/glibc Chris Hall
2019-06-20 12:17 ` Jonathan Wakely
2019-06-21  9:39 ` Chris Hall
2019-06-21  9:51 ` Jonathan Wakely
2020-01-06 15:09 ` Function returning struct on x86_64 (at least) Chris Hall
2020-01-06 15:19 ` Marc Glisse
2020-01-07 16:20 ` Chris Hall