public inbox for gcc-bugs@sourceware.org
* [Bug other/99903] New: 32-bit x86 frontends randomly crash while reporting timing on Windows
@ 2021-04-04  6:54 izbyshev at ispras dot ru
  2021-04-04  7:13 ` [Bug other/99903] " pinskia at gcc dot gnu.org
                   ` (3 more replies)
  0 siblings, 4 replies; 5+ messages in thread
From: izbyshev at ispras dot ru @ 2021-04-04  6:54 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99903

            Bug ID: 99903
           Summary: 32-bit x86 frontends randomly crash while reporting
                    timing on Windows
           Product: gcc
           Version: 10.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: other
          Assignee: unassigned at gcc dot gnu.org
          Reporter: izbyshev at ispras dot ru
  Target Milestone: ---
              Host: i686-w64-mingw32
            Target: i686-w64-mingw32
             Build: i686-w64-mingw32

Created attachment 50503
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=50503&action=edit
A reproducer

Running cc1 from i686-w64-mingw32-gcc[1] with timing report enabled (e.g.
without `-quiet` or with `-ftime-report`) may randomly crash on Windows.

A reproducer script is attached (tested on two Windows 10 1909 x64 machines).
The time to reproduce varied from seconds to half an hour. Running multiple
cc1 processes in parallel to increase load appears to help, but is not
required; running a single cc1 in a loop is enough. The crash manifests as a
silent exit with code 3221225477 (0xc0000005; EXCEPTION_ACCESS_VIOLATION).
Debugging showed the following.
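The attached script is not quoted in this message; a minimal loop of the same
general shape (purely hypothetical — the real repro.py may differ, and the cc1
path and input file below are assumptions) could look like:

```python
# Hypothetical reproducer sketch, NOT the attached repro.py: run cc1 in a
# loop until it dies with EXCEPTION_ACCESS_VIOLATION (0xC0000005).
import subprocess
import sys

ACCESS_VIOLATION = 0xC0000005   # NTSTATUS shown as exit code 3221225477
CC1 = "cc1.exe"                 # path to the i686-w64-mingw32 cc1 (assumption)
SOURCE = "empty.c"              # any small translation unit (assumption)

def run_once() -> int:
    # -ftime-report makes cc1 validate and print phase timings on exit.
    proc = subprocess.run([CC1, "-ftime-report", SOURCE], capture_output=True)
    return proc.returncode & 0xFFFFFFFF

def main() -> None:
    for attempt in range(1, 1_000_000):
        if run_once() == ACCESS_VIOLATION:
            print(f"crashed on attempt {attempt} "
                  f"(exit code {ACCESS_VIOLATION:#010x})")
            sys.exit(1)

if __name__ == "__main__":
    main()
```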

The eventual crash is actually a nested one. It happens at [2] while reporting
an ICE because `context->printer` is NULL (which, by itself, looks like a
separate bug).

The initial crash is triggered by `gcc_unreachable()` in
`timer::validate_phases()`[3].

#0  0x0125be91 in pp_format(pretty_printer*, text_info*) ()
#1  0x0125146b in diagnostic_report_diagnostic(diagnostic_context*,
diagnostic_info*) ()
#2  0x01251bd7 in diagnostic_impl(rich_location*, diagnostic_metadata const*,
int, char const*, char**, diagnostic_t)
    ()
#3  0x012521b8 in internal_error(char const*, ...) ()
#4  0x01528a96 in fancy_abort(char const*, int, char const*) ()
#5  0x014f9e9c in timer::validate_phases(_iobuf*) const [clone .cold] ()
#6  0x009400d3 in timer::print(_iobuf*) ()
#7  0x00942422 in toplev::~toplev() ()
#8  0x01526003 in main ()

And `gcc_unreachable()` is reached because `phase_user > total->user *
tolerance` is true. GCC doesn't print anything when run from `repro.py` because
of output buffering, but I extracted one message from memory:

Timing error: total of phase timers exceeds total time.
user    1.734723475976807094e-18 > 8.673617379884035472e-19

I couldn't reproduce this with 64-bit x86 GCC.

[1]
https://github.com/msys2/MINGW-packages/tree/5d82e17a56a3216a54dbb5924b92c96284ae5c97/mingw-w64-gcc
[2]
https://gcc.gnu.org/git?p=gcc.git;a=blob;f=gcc/diagnostic.c;h=ed52bc03d17afa2d7ac3419beb73e8f59b6f392b;hb=releases/gcc-10.2.0#l1077
[3]
https://gcc.gnu.org/git?p=gcc.git;a=blob;f=gcc/timevar.c;h=a3a882d32040db6503260ab51ee4bf338c4aa2b5;hb=releases/gcc-10.2.0#l628

^ permalink raw reply	[flat|nested] 5+ messages in thread

* [Bug other/99903] 32-bit x86 frontends randomly crash while reporting timing on Windows
  2021-04-04  6:54 [Bug other/99903] New: 32-bit x86 frontends randomly crash while reporting timing on Windows izbyshev at ispras dot ru
@ 2021-04-04  7:13 ` pinskia at gcc dot gnu.org
  2021-04-04  7:40 ` izbyshev at ispras dot ru
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 5+ messages in thread
From: pinskia at gcc dot gnu.org @ 2021-04-04  7:13 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99903

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
              Build|i686-w64-mingw32            |
             Status|UNCONFIRMED                 |WAITING
             Target|i686-w64-mingw32            |
     Ever confirmed|0                           |1
   Last reconfirmed|                            |2021-04-04

--- Comment #1 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
I don't see anything wrong with the code; `clock()` is being used here.
Maybe there is a bug in the implementation of `clock` where it does not
always increment. Since the failure is random, maybe the problem is
switching between cores.

Is there a way to bind GCC to a specific core and test again?


* [Bug other/99903] 32-bit x86 frontends randomly crash while reporting timing on Windows
  2021-04-04  6:54 [Bug other/99903] New: 32-bit x86 frontends randomly crash while reporting timing on Windows izbyshev at ispras dot ru
  2021-04-04  7:13 ` [Bug other/99903] " pinskia at gcc dot gnu.org
@ 2021-04-04  7:40 ` izbyshev at ispras dot ru
  2021-04-06 15:48 ` izbyshev at ispras dot ru
  2021-05-04 15:14 ` amonakov at gcc dot gnu.org
  3 siblings, 0 replies; 5+ messages in thread
From: izbyshev at ispras dot ru @ 2021-04-04  7:40 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99903

--- Comment #2 from Alexey Izbyshev <izbyshev at ispras dot ru> ---
> Is there a way to bind GCC to a specific core and test again?

Yes, `repro.py` can be run via `start /affinity MASK`. I've started two
experiments, with one- and two-processor masks. They haven't crashed after
several minutes, but I'll wait longer and then report the result.

Anyway, the nested crash caused by the NULL `context->printer` while
reporting the `gcc_unreachable()` ICE still looks like a bug in its own
right, since it makes it hard to understand what went wrong in the first
place.


* [Bug other/99903] 32-bit x86 frontends randomly crash while reporting timing on Windows
  2021-04-04  6:54 [Bug other/99903] New: 32-bit x86 frontends randomly crash while reporting timing on Windows izbyshev at ispras dot ru
  2021-04-04  7:13 ` [Bug other/99903] " pinskia at gcc dot gnu.org
  2021-04-04  7:40 ` izbyshev at ispras dot ru
@ 2021-04-06 15:48 ` izbyshev at ispras dot ru
  2021-05-04 15:14 ` amonakov at gcc dot gnu.org
  3 siblings, 0 replies; 5+ messages in thread
From: izbyshev at ispras dot ru @ 2021-04-06 15:48 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99903

--- Comment #3 from Alexey Izbyshev <izbyshev at ispras dot ru> ---
Crashes eventually occurred with both one- and two-processor affinity masks, so
pinning GCC to a single core doesn't help. But I've tracked the reason down.

When `get_time()` from `gcc/timevar.c` is inlined into its callers (which
happens at -O2), it "returns" the result in an x87 FPU register. Then
`timevar_accumulate()` computes the difference between this 80-bit
extended-precision value and the 64-bit double stored in the timer structure.
So when `clock()` returns 15 at both the start and end measurements, the code
subtracts 15 * (1.0 / 1000) rounded to 64-bit double precision from the same
product held at 80-bit precision, and the difference is
8.673617379884035472e-19 (exactly 2**-60) instead of zero. When `clock()`
returns 15 for all measurements during a single cc1 run, the total time and
each phase time are all equal to this same constant, and the sum of the phase
times is twice the total time:

Timing error: total of phase timers exceeds total time.
user    1.734723475976807094e-18 > 8.673617379884035472e-19

Maybe GCC should round such ridiculously small intervals to zero?
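The residual can be checked exactly without x87 hardware. Below is a small
sketch (my code, not GCC's) that rounds 15 * (1.0 / 1000) to a 64-bit and a
53-bit significand using exact rational arithmetic and takes the difference;
it yields precisely the 8.673617379884035472e-19 (i.e. 2**-60) from the error
message:

```python
# Exact check of the residual described above, using rational arithmetic
# instead of x87 hardware (illustration only, not GCC code).
from fractions import Fraction

def round_to_sig_bits(x: Fraction, bits: int) -> Fraction:
    """Round a positive rational to `bits` significand bits,
    round-to-nearest-even (no subnormal/overflow handling)."""
    e = 0
    while Fraction(2) ** (e + 1) <= x:   # find e with 2**e <= x < 2**(e+1)
        e += 1
    while Fraction(2) ** e > x:
        e -= 1
    scale = Fraction(2) ** (e - bits + 1)
    q = x / scale                        # significand in [2**(bits-1), 2**bits)
    n = q.numerator // q.denominator
    rem = q - n
    if rem > Fraction(1, 2) or (rem == Fraction(1, 2) and n % 2 == 1):
        n += 1                           # round up / break tie to even
    return n * scale

msec = Fraction(0.001)               # the double constant 1.0 / 1000, exactly
exact = 15 * msec                    # clock() == 15 converted to seconds
x87 = round_to_sig_bits(exact, 64)   # value held in the 80-bit x87 register
dbl = round_to_sig_bits(exact, 53)   # value stored in the 64-bit double field
residual = x87 - dbl
assert residual == Fraction(1, 2 ** 60)
print(float(residual))               # 8.673617379884035e-19
```

The 54-bit-significand product fits in the 64-bit x87 significand unrounded,
so the whole residual comes from the single rounding to double on store.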


* [Bug other/99903] 32-bit x86 frontends randomly crash while reporting timing on Windows
  2021-04-04  6:54 [Bug other/99903] New: 32-bit x86 frontends randomly crash while reporting timing on Windows izbyshev at ispras dot ru
                   ` (2 preceding siblings ...)
  2021-04-06 15:48 ` izbyshev at ispras dot ru
@ 2021-05-04 15:14 ` amonakov at gcc dot gnu.org
  3 siblings, 0 replies; 5+ messages in thread
From: amonakov at gcc dot gnu.org @ 2021-05-04 15:14 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99903

Alexander Monakov <amonakov at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
     Ever confirmed|1                           |0
             Status|WAITING                     |UNCONFIRMED
                 CC|                            |amonakov at gcc dot gnu.org

--- Comment #4 from Alexander Monakov <amonakov at gcc dot gnu.org> ---
32-bit Linux should also be affected (perhaps with lower probability if
clock() is more precise there). It is surprising that we track time in a
'double'; a 64-bit integer storing nanoseconds would be more appropriate.
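A minimal sketch of the integer-nanosecond idea (an illustration of the
suggestion, not a proposed patch; `time.perf_counter_ns` stands in for
whatever clock source GCC would actually use):

```python
# Illustration of tracking elapsed time as integer nanoseconds: identical
# start/end readings subtract to exactly 0, so no rounding residue can make
# the sum of phase timers exceed the total.
import time

class PhaseTimer:
    def __init__(self) -> None:
        self.elapsed_ns = 0          # accumulated nanoseconds (fits in 64 bits)
        self._start_ns = None

    def start(self) -> None:
        self._start_ns = time.perf_counter_ns()

    def stop(self) -> None:
        # Integer subtraction: no extended-precision intermediate can leak in.
        self.elapsed_ns += time.perf_counter_ns() - self._start_ns
        self._start_ns = None

# The failure mode from comment #3 cannot occur with integers:
reading = 15                         # clock() returned 15 at start and end
assert reading - reading == 0        # delta is exactly zero, not 2**-60
```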

Removing WAITING, thanks.


