public inbox for gcc-bugs@sourceware.org
* [Bug c++/95264] New: Infinite Loop When Compiling Templated C++ code at -O1 and above
@ 2020-05-21 22:08 freddie at witherden dot org
  2020-05-22  9:14 ` [Bug c++/95264] " rguenth at gcc dot gnu.org
                   ` (8 more replies)
  0 siblings, 9 replies; 10+ messages in thread
From: freddie at witherden dot org @ 2020-05-21 22:08 UTC
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95264

            Bug ID: 95264
           Summary: Infinite Loop When Compiling Templated C++ code at -O1
                    and above
           Product: gcc
           Version: 10.1.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c++
          Assignee: unassigned at gcc dot gnu.org
          Reporter: freddie at witherden dot org
  Target Milestone: ---

Created attachment 48578
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=48578&action=edit
Preprocessed source.

When attempting to compile the header-only C++ application Polyquad
(https://github.com/PyFR/Polyquad) with any recent version of GCC at any
optimization level, the compiler gets stuck and eventually dies, either with
"c++: fatal error: Killed signal terminated program cc1plus" or with an
out-of-memory error, depending on the platform.

This is believed to be an interaction between the Boost bfloat type (an
arbitrary precision numerical type) and the Eigen library (a heavily templated
matrix library).

By comparison, Clang is able to compile the application in a few minutes at any
optimization level with memory never peaking above 3-4 GiB.  GCC with -O3 -g
will happily malloc in excess of 30 GiB before dying (although this can be
curtailed somewhat by -fno-var-tracking-assignments).

The compiler command (from the uncompressed pre-processed source) is:

/usr/libexec/gcc/x86_64-pc-linux-gnu/10.1.0/cc1plus -fpreprocessed main.ii
-march=skylake -mmmx -mno-3dnow -msse -msse2 -msse3 -mssse3 -mno-sse4a -mcx16
-msahf -mmovbe -maes -mno-sha -mpclmul -mpopcnt -mabm -mno-lwp -mfma -mno-fma4
-mno-xop -mbmi -msgx -mbmi2 -mno-pconfig -mno-wbnoinvd -mno-tbm -mavx -mavx2
-msse4.2 -msse4.1 -mlzcnt -mrtm -mhle -mrdrnd -mf16c -mfsgsbase -mrdseed
-mprfchw -madx -mfxsr -mxsave -mxsaveopt -mno-avx512f -mno-avx512er
-mno-avx512cd -mno-avx512pf -mno-prefetchwt1 -mclflushopt -mxsavec -mxsaves
-mno-avx512dq -mno-avx512bw -mno-avx512vl -mno-avx512ifma -mno-avx512vbmi
-mno-avx5124fmaps -mno-avx5124vnniw -mno-clwb -mno-mwaitx -mno-clzero -mno-pku
-mno-rdpid -mno-gfni -mno-shstk -mno-avx512vbmi2 -mno-avx512vnni -mno-vaes
-mno-vpclmulqdq -mno-avx512bitalg -mno-movdiri -mno-movdir64b -mno-waitpkg
-mno-cldemote -mno-ptwrite -mno-avx512bf16 -mno-enqcmd -mno-avx512vp2intersect
--param l1-cache-size=32 --param l1-cache-line-size=64 --param
l2-cache-size=6144 -mtune=skylake -quiet -dumpbase main.cpp -auxbase-strip
CMakeFiles/polyquad.dir/src/main.cpp.o -O3 -Wno-deprecated -std=c++17 -version
-fno-var-tracking-assignments -o main.s


* [Bug c++/95264] Infinite Loop When Compiling Templated C++ code at -O1 and above
  2020-05-21 22:08 [Bug c++/95264] New: Infinite Loop When Compiling Templated C++ code at -O1 and above freddie at witherden dot org
@ 2020-05-22  9:14 ` rguenth at gcc dot gnu.org
  2020-05-22  9:35 ` rguenth at gcc dot gnu.org
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: rguenth at gcc dot gnu.org @ 2020-05-22  9:14 UTC
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95264

--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
Confirmed.  We do have (a) huuuge function here, containing 539237 basic blocks
after early inlining which is

void polyquad::BaseDomain<Derived, T, Ndim, Norbits>::expand(const VectorXT&,
polyquad::BaseDomain<Derived, T, Ndim, Norbits>::MatrixPtsT&) const [with
Derived =
polyquad::TetDomain<boost::multiprecision::number<boost::multiprecision::backends::cpp_bin_float<100>
> >; T =
boost::multiprecision::number<boost::multiprecision::backends::cpp_bin_float<100>
>; int Ndim = 3; int Norbits = 5]

Obviously every IL walk will be slow on a function of this size.  I haven't
yet found the actual wall it runs into; it is still running...


* [Bug c++/95264] Infinite Loop When Compiling Templated C++ code at -O1 and above
  2020-05-21 22:08 [Bug c++/95264] New: Infinite Loop When Compiling Templated C++ code at -O1 and above freddie at witherden dot org
  2020-05-22  9:14 ` [Bug c++/95264] " rguenth at gcc dot gnu.org
@ 2020-05-22  9:35 ` rguenth at gcc dot gnu.org
  2020-05-22  9:44 ` rguenth at gcc dot gnu.org
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: rguenth at gcc dot gnu.org @ 2020-05-22  9:35 UTC
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95264

--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
We then inline some more, costing another ~5GB on top of the early
optimization memory use of ~5GB (other IPA transforms besides inlining might
contribute as well).  The big function is meanwhile at 2 million basic
blocks...  update-SSA and friends are no fun here (the function with 2 million
BBs is eval_orthob).

Ah, you use [[gnu::flatten]] on that - so isn't it just what you asked for?

I wonder if Clang implements that at all.

Note the issue with -fvar-tracking* and -g and large functions is known...


* [Bug c++/95264] Infinite Loop When Compiling Templated C++ code at -O1 and above
  2020-05-21 22:08 [Bug c++/95264] New: Infinite Loop When Compiling Templated C++ code at -O1 and above freddie at witherden dot org
  2020-05-22  9:14 ` [Bug c++/95264] " rguenth at gcc dot gnu.org
  2020-05-22  9:35 ` rguenth at gcc dot gnu.org
@ 2020-05-22  9:44 ` rguenth at gcc dot gnu.org
  2020-05-22  9:55 ` rguenth at gcc dot gnu.org
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: rguenth at gcc dot gnu.org @ 2020-05-22  9:44 UTC
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95264

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |WAITING
   Last reconfirmed|                            |2020-05-22
     Ever confirmed|0                           |1

--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
So with the [[gnu::flatten]] attributes removed -O1 needs 80 seconds to compile
and about 3GB of memory, -O2 needs around 2 minutes (same memory), -O3
is the same as -O2.

Maybe instead of [[gnu::flatten]] you want to bump --param inline-unit-growth
or --param large-function-growth more moderately in case you can measure an
effect on runtime.

Note multiple [[gnu::flatten]] can really exponentially grow program size
since it is not apparent which functions might be used from another
translation unit until you can use -fwhole-program (single CU program)
or -flto (but there [[gnu::flatten]] is applied too early to avoid such
growth - sth we might want to fix).  Placing things not used from outside
in anonymous namespaces might help.
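
A minimal sketch of the anonymous-namespace suggestion, assuming GCC or Clang
and C++17.  The names basis_term and eval_all are hypothetical and are not
taken from Polyquad.  A helper in an anonymous namespace has internal linkage,
so the compiler can see that no other translation unit calls it and does not
need to keep a standalone copy once every call has been inlined.

```cpp
#include <cstddef>

namespace {

// Internal linkage: no other translation unit can reference this helper,
// so no out-of-line copy has to be kept once all of its calls are inlined.
constexpr double basis_term(double x, std::size_t k) {
    double r = 1.0;
    for (std::size_t i = 0; i < k; ++i)
        r *= x;  // x raised to the k-th power
    return r;
}

}  // namespace

// [[gnu::flatten]] asks the compiler to inline, where possible, every call
// made in the body of this function, including calls made by inlined callees.
[[gnu::flatten]] constexpr double eval_all(double x, std::size_t n) {
    double sum = 0.0;
    for (std::size_t k = 0; k < n; ++k)
        sum += basis_term(x, k);
    return sum;
}
```

Here eval_all(2.0, 4) sums 1 + 2 + 4 + 8.  Whether this layout actually reduces
compile time or code size for a given program has to be measured.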


* [Bug c++/95264] Infinite Loop When Compiling Templated C++ code at -O1 and above
  2020-05-21 22:08 [Bug c++/95264] New: Infinite Loop When Compiling Templated C++ code at -O1 and above freddie at witherden dot org
                   ` (2 preceding siblings ...)
  2020-05-22  9:44 ` rguenth at gcc dot gnu.org
@ 2020-05-22  9:55 ` rguenth at gcc dot gnu.org
  2020-05-22 10:19 ` rguenth at gcc dot gnu.org
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: rguenth at gcc dot gnu.org @ 2020-05-22  9:55 UTC
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95264

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |missed-optimization

--- Comment #4 from Richard Biener <rguenth at gcc dot gnu.org> ---
Clang's documentation mentions that it supports [[gnu::flatten]]; whether the
implementations match here is of course another question.

I guess for a convoluted cgraph our flatten implementation leaves sth to be
desired - if there are two calls to the same function we inline it fully
twice and have to reap the benefits of inlining all calls (recursively) in
them twice, rather than producing an optimized body for the flatten inlining
first.  One could envision some early cloning for the purpose of flattening,
pushing down the flattening attribute to the clones that end up being
inlined multiple times.  Not sure how easy that would be - Honza?
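
To put a number on "inline it fully twice", here is a back-of-the-envelope
model.  It is an assumption made for illustration, not GCC's actual cost
accounting: it pretends the call graph is a chain of helper levels in which
each level calls the next one a fixed number of times, and counts how many
copies of the innermost body a fully flattened caller would contain.  The
function name is hypothetical.

```cpp
// Simplified model (an assumption, not how GCC measures growth): a chain of
// `depth` helper levels, each calling the next level `calls_per_level` times,
// all flattened into a single caller.
constexpr long long flattened_leaf_copies(int depth, int calls_per_level) {
    long long copies = 1;
    for (int d = 0; d < depth; ++d)
        copies *= calls_per_level;  // each call site gets its own inlined copy
    return copies;
}
```

With two call sites per level, 20 levels already give 1048576 copies of the
innermost body, which is why sharing one optimized clone per callee could pay
off.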


* [Bug c++/95264] Infinite Loop When Compiling Templated C++ code at -O1 and above
  2020-05-21 22:08 [Bug c++/95264] New: Infinite Loop When Compiling Templated C++ code at -O1 and above freddie at witherden dot org
                   ` (3 preceding siblings ...)
  2020-05-22  9:55 ` rguenth at gcc dot gnu.org
@ 2020-05-22 10:19 ` rguenth at gcc dot gnu.org
  2020-05-22 11:13 ` freddie at witherden dot org
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: rguenth at gcc dot gnu.org @ 2020-05-22 10:19 UTC
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95264

--- Comment #5 from Richard Biener <rguenth at gcc dot gnu.org> ---
So confirmed we eventually blow up at -O1:

++: fatal error: Killed signal terminated program cc1plus
compilation terminated.
Command exited with non-zero status 1
3015.48user 45.01system 1:08:57elapsed 73%CPU (0avgtext+0avgdata 30682104maxresident)k
1549456inputs+47040outputs (2343major+9807077minor)pagefaults 0swaps

I didn't manage to catch where in the process of compilation that was, though;
during PTA it hovered at ~12GB.


* [Bug c++/95264] Infinite Loop When Compiling Templated C++ code at -O1 and above
  2020-05-21 22:08 [Bug c++/95264] New: Infinite Loop When Compiling Templated C++ code at -O1 and above freddie at witherden dot org
                   ` (4 preceding siblings ...)
  2020-05-22 10:19 ` rguenth at gcc dot gnu.org
@ 2020-05-22 11:13 ` freddie at witherden dot org
  2020-05-22 11:34 ` rguenther at suse dot de
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: freddie at witherden dot org @ 2020-05-22 11:13 UTC
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95264

--- Comment #6 from Freddie Witherden <freddie at witherden dot org> ---
(In reply to Richard Biener from comment #3)
> So with the [[gnu::flatten]] attributes removed -O1 needs 80 seconds to
> compile and about 3GB of memory, -O2 needs around 2 minutes (same memory),
> -O3
> is the same as -O2.
> 
> Maybe instead of [[gnu::flatten]] you want to bump --param inline-unit-growth
> or --param large-function-growth more moderately in case you can measure an
> effect on runtime.
> 
> Note multiple [[gnu::flatten]] can really exponentially grow program size
> since it is not apparent which functions might be used from another
> translation unit until you can use -fwhole-program (single CU program)
> or -flto (but there [[gnu::flatten]] is applied too early to avoid such
> growth - sth we might want to fix).  Placing things not used from outside
> in anonymous namespaces might help.

The [[gnu::flatten]] was added to get GCC's performance in the case of T =
double on a par with Clang's.  (We don't care about performance with T = bfloat
as it is just used as a final polishing pass.)  I can understand why GCC does
not want to inline it in the case of T = bfloat which is a complex type, but
for T = double the function is basically just a sequence of mov's to populate
an array.

As the function is of the form

for (int i = 0; i < N; i++) // N = template arg
  for (int j = 0; j < p[i]; j++) // runtime trip count
      foo(i, ...); // static polymorphism

with foo being a large switch-case on its first argument, the expectation was
for the compiler to inline foo, unroll the outer loop, and then prune the dead
cases such that we have something similar to

for (int j = 0; j < p[0]; j++)
    foo(0, ...); // inline i = 0 case
for (int j = 0; j < p[1]; j++)
    foo(1, ...); // inline i = 1 case
// ...
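
One way to get the pruned form without depending on the optimizer is to make
the outer index a template parameter.  The following is a sketch under that
assumption, written for C++17; it is not Polyquad's code, and foo, inner,
expand_impl and expand are placeholder names with made-up case bodies.

```cpp
#include <array>
#include <cstddef>
#include <utility>

// Placeholder for foo: the switch condition is the template parameter I, so
// it is a compile-time constant in every instantiation.
template <std::size_t I>
constexpr int foo(int j) {
    switch (I) {
        case 0:  return j;      // i = 0 case
        case 1:  return 2 * j;  // i = 1 case
        default: return 3 * j;  // all remaining cases
    }
}

// Inner loop for one fixed i; the trip count p[I] is only known at run time.
template <std::size_t I, std::size_t N>
constexpr int inner(const std::array<int, N>& p) {
    int acc = 0;
    for (int j = 0; j < p[I]; ++j)
        acc += foo<I>(j);
    return acc;
}

// The fold expression expands to inner<0>(p) + inner<1>(p) + ... + 0,
// which is the outer loop unrolled by hand.
template <std::size_t N, std::size_t... Is>
constexpr int expand_impl(const std::array<int, N>& p,
                          std::index_sequence<Is...>) {
    return (inner<Is>(p) + ... + 0);
}

template <std::size_t N>
constexpr int expand(const std::array<int, N>& p) {
    return expand_impl(p, std::make_index_sequence<N>{});
}
```

The cost is one instantiation of inner and foo per value of i, so this trades
template bloat for predictability and would need benchmarking against the
plain loop.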


* [Bug c++/95264] Infinite Loop When Compiling Templated C++ code at -O1 and above
  2020-05-21 22:08 [Bug c++/95264] New: Infinite Loop When Compiling Templated C++ code at -O1 and above freddie at witherden dot org
                   ` (5 preceding siblings ...)
  2020-05-22 11:13 ` freddie at witherden dot org
@ 2020-05-22 11:34 ` rguenther at suse dot de
  2020-05-22 11:49 ` freddie at witherden dot org
  2021-10-02  6:00 ` [Bug ipa/95264] " pinskia at gcc dot gnu.org
  8 siblings, 0 replies; 10+ messages in thread
From: rguenther at suse dot de @ 2020-05-22 11:34 UTC
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95264

--- Comment #7 from rguenther at suse dot de <rguenther at suse dot de> ---
On Fri, 22 May 2020, freddie at witherden dot org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95264
> 
> --- Comment #6 from Freddie Witherden <freddie at witherden dot org> ---
> (In reply to Richard Biener from comment #3)
> > So with the [[gnu::flatten]] attributes removed -O1 needs 80 seconds to
> > compile and about 3GB of memory, -O2 needs around 2 minutes (same memory),
> > -O3
> > is the same as -O2.
> > 
> > Maybe instead of [[gnu::flatten]] you want to bump --param inline-unit-growth
> > or --param large-function-growth more moderately in case you can measure an
> > effect on runtime.
> > 
> > Note multiple [[gnu::flatten]] can really exponentially grow program size
> > since it is not apparent which functions might be used from another
> > translation unit until you can use -fwhole-program (single CU program)
> > or -flto (but there [[gnu::flatten]] is applied too early to avoid such
> > growth - sth we might want to fix).  Placing things not used from outside
> > in anonymous namespaces might help.
> 
> The [[gnu::flatten]] was added to get GCC's performance in the case of T =
> double on a par with Clang's.  (We don't care about performance with T = bfloat
> as it is just used as a final polishing pass.)  I can understand why GCC does
> not want to inline it in the case of T = bfloat which is a complex type, but
> for T = double the function is basically just a sequence of mov's to populate
> an array.
> 
> As the function is of the form
> 
> for (int i = 0; i < N; i++) // N = template arg
>   for (int j = 0; j < p[i]; j++) // runtime trip count
>       foo(i, ...); // static polymorphism
> 
> with foo being a large switch-case on its first argument the expectation was
> for the compiler to inline foo, unroll the outer loop, and then prune the dead
> cases such that we have something similar to
> 
> for (int j = 0; j < p[0]; j++)
>     foo(0, ...); // inline i = 0 case
> for (int j = 0; j < p[1]; j++)
>     foo(1, ...); // inline i = 1 case
> // ...

Ah, interesting.  This kind of static polymorphism should be handled
by IPA-CP already but it's of course possible we're confused about
a detail in this very testcase.  Honza?

Instead of [[gnu::flatten]] you could use the 
__attribute__((always_inline)) attribute on the foo function definition
if you didn't simplify the outline above too much to make that
infeasible.  IIRC we do not have sth like

  [[gnu::inline]] foo(i, ...);

to force inlining of a specific call, nor [[gnu::noinline]] foo(i, ...);
both of which seem useful.  Not sure if the C++ syntax would support
such placement of an attribute of course.
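
For reference, a minimal sketch of the always_inline variant, assuming GCC or
Clang.  The bodies of foo and sum_case are invented for illustration.  The
attribute goes on the definition of the callee, so it affects every call of
foo rather than one chosen call site.

```cpp
// Hypothetical callee with a switch on its first argument.  The attribute is
// placed on the definition, so it applies to every call of foo.
__attribute__((always_inline)) inline constexpr int foo(int i, int j) {
    switch (i) {
        case 0:  return j;
        case 1:  return 2 * j;
        default: return 3 * j;
    }
}

// Caller: the foo call below is inlined regardless of the inliner's size
// heuristics.  Calls made inside foo itself are not covered by the attribute,
// which is the difference from [[gnu::flatten]] on the caller.
constexpr int sum_case(int i, int n) {
    int acc = 0;
    for (int j = 0; j < n; ++j)
        acc += foo(i, j);
    return acc;
}
```

Pruning the dead switch cases still depends on the optimizer seeing a constant
i at the call site, for example after the outer loop has been unrolled.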


* [Bug c++/95264] Infinite Loop When Compiling Templated C++ code at -O1 and above
  2020-05-21 22:08 [Bug c++/95264] New: Infinite Loop When Compiling Templated C++ code at -O1 and above freddie at witherden dot org
                   ` (6 preceding siblings ...)
  2020-05-22 11:34 ` rguenther at suse dot de
@ 2020-05-22 11:49 ` freddie at witherden dot org
  2021-10-02  6:00 ` [Bug ipa/95264] " pinskia at gcc dot gnu.org
  8 siblings, 0 replies; 10+ messages in thread
From: freddie at witherden dot org @ 2020-05-22 11:49 UTC
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95264

--- Comment #8 from Freddie Witherden <freddie at witherden dot org> ---
(In reply to rguenther@suse.de from comment #7)
> 
> Instead of [[gnu::flatten]] you could use the 
> __attribute__((always_inline)) attribute on the foo function definition
> if you didn't simplify the outline above too much to make that
> infeasible.  IIRC we do not have sth like
> 
>   [[gnu::inline]] foo(i, ...);
> 
> to force inlining of a specific call, nor [[gnu::noinline]] foo(i, ...);
> both of which seem useful.  Not sure if the C++ syntax would support
> such placement of an attribute of course.

So this is exactly what we had in the pre-flatten version of the code:

https://github.com/PyFR/Polyquad/commit/f24366c059d2d693222985cdd9333238bd909ad3

The issue was that while GCC would inline the annotated functions, it would go
no further.  As such, if I recall correctly, all of the constructor calls to
the relatively simple Eigen vector types were no longer inlined.  Thus a line
of code which should translate into a few register-to-memory mov instructions
results in a constructor call, an assignment call, and some cleanup.  Since
I could not add the force inline attribute to the library types I went in
search of an alternative.

For the T = bfloat eval_orthob instance, is the "if
(std::is_fundamental<T>::value)" considered before the body is inlined?
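
Independently of when GCC folds a plain "if" on a constant, which this sketch
does not answer, C++17 offers "if constexpr", which discards the untaken
branch when the template is instantiated, before any inlining decision is
made.  The type Big and the function first_component below are hypothetical
stand-ins, not Polyquad or Boost code.

```cpp
#include <type_traits>

// Hypothetical stand-in for a multiprecision type.
struct Big {
    double limbs[4];
};

template <typename T>
constexpr double first_component(const T& v) {
    if constexpr (std::is_fundamental<T>::value) {
        // Only instantiated for fundamental T such as double.
        return v;
    } else {
        // Only instantiated for class types such as Big.  With a plain "if"
        // this line would fail to compile for T = double, which shows that a
        // plain "if" keeps both branches in every instantiation.
        return v.limbs[0];
    }
}
```

Applied to a flattened function, this keeps the expensive branch out of the
instantiation for the other type altogether, so there is nothing left in it
for the inliner to expand.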


* [Bug ipa/95264] Infinite Loop When Compiling Templated C++ code at -O1 and above
  2020-05-21 22:08 [Bug c++/95264] New: Infinite Loop When Compiling Templated C++ code at -O1 and above freddie at witherden dot org
                   ` (7 preceding siblings ...)
  2020-05-22 11:49 ` freddie at witherden dot org
@ 2021-10-02  6:00 ` pinskia at gcc dot gnu.org
  8 siblings, 0 replies; 10+ messages in thread
From: pinskia at gcc dot gnu.org @ 2021-10-02  6:00 UTC
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95264

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |compile-time-hog,
                   |                            |memory-hog
                 CC|                            |marxin at gcc dot gnu.org
          Component|c++                         |ipa
             Status|WAITING                     |NEW
