public inbox for cygwin-patches@cygwin.com
 help / color / mirror / Atom feed
* Fix nanosleep returning negative rem
@ 2021-07-20 15:16 David Allsopp
  2021-07-21  8:39 ` Corinna Vinschen
  0 siblings, 1 reply; 8+ messages in thread
From: David Allsopp @ 2021-07-20 15:16 UTC (permalink / raw)
  To: cygwin-patches

[-- Attachment #1: Type: text/plain, Size: 1021 bytes --]

I've pushed a repro case for this to
https://github.com/dra27/cygwin-nanosleep-bug.git

Originally noticed as the main CI system for OCaml has been failing
sporadically for the signal.ml test mentioned in that repo. This morning I
tried hammering that test on my dev machine and discovered that it fails
very frequently. No idea if that's drivers, Windows 10 updates, number of
cores or what, but it was definitely happening, and easily.

Drilling further, it appears that NtQueryTimer is able to return a negative
value in the TimeRemaining field even when SignalState is false. The values
I've seen have always been < 15ms - i.e. less than the timer resolution, so
I wonder if there is a point at which the timer has elapsed but has not been
signalled, but WaitForMultipleObjects returns because of the EINTR signal.
Mildly surprising that it seems to be so reproducible.

Anyway, a patch is attached which simply guards a negative return value. The
test on tbi.SignalState is in theory unnecessary.

All best,


David

[-- Attachment #2: 0001-Ensure-nanosleep-2-never-returns-negative-rem.patch --]
[-- Type: application/octet-stream, Size: 1911 bytes --]

From 24f5adc3ac8246d93582ac6e4b2779b369f8b3f1 Mon Sep 17 00:00:00 2001
From: David Allsopp <david.allsopp@metastack.com>
Date: Tue, 20 Jul 2021 16:07:00 +0100
Subject: [PATCH] Ensure nanosleep(2) never returns negative rem

It appears to be the case that NtQueryTimer can return a negative time
remaining for an unsignalled timer. The value appears to be less than
the timer resolution.

Signed-off-by: David Allsopp <david.allsopp@metastack.com>
---
 winsup/cygwin/cygwait.cc    | 6 ++++--
 winsup/cygwin/release/3.2.1 | 4 ++++
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/winsup/cygwin/cygwait.cc b/winsup/cygwin/cygwait.cc
index 1d6c7c9cc..dbbe1db6e 100644
--- a/winsup/cygwin/cygwait.cc
+++ b/winsup/cygwin/cygwait.cc
@@ -104,8 +104,10 @@ cygwait (HANDLE object, PLARGE_INTEGER timeout, unsigned mask)
 		    sizeof tbi, NULL);
       /* if timer expired, TimeRemaining is negative and represents the
 	  system uptime when signalled */
-      if (timeout->QuadPart < 0LL)
-	timeout->QuadPart = tbi.SignalState ? 0LL : tbi.TimeRemaining.QuadPart;
+      if (timeout->QuadPart < 0LL) {
+	timeout->QuadPart = tbi.SignalState || tbi.TimeRemaining.QuadPart < 0LL
+                            ? 0LL : tbi.TimeRemaining.QuadPart;
+      }
       NtCancelTimer (_my_tls.locals.cw_timer, NULL);
     }
 
diff --git a/winsup/cygwin/release/3.2.1 b/winsup/cygwin/release/3.2.1
index 4f4db622a..2a339718c 100644
--- a/winsup/cygwin/release/3.2.1
+++ b/winsup/cygwin/release/3.2.1
@@ -44,3 +44,7 @@ Bug Fixes
   AF_UNSPEC.  As specified by POSIX and Linux, this is allowed on
   datagram sockets, and its effect is to reset the socket's peer
   address.
+
+- Fix nanosleep(2) returning negative rem. NtQueryTimer appears to be able to
+  return a negative remaining time (less than the timer resolution) for
+  unsignalled timers.
-- 
2.29.2.windows.2


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Fix nanosleep returning negative rem
  2021-07-20 15:16 Fix nanosleep returning negative rem David Allsopp
@ 2021-07-21  8:39 ` Corinna Vinschen
  2021-07-21  9:07   ` David Allsopp
  0 siblings, 1 reply; 8+ messages in thread
From: Corinna Vinschen @ 2021-07-21  8:39 UTC (permalink / raw)
  To: cygwin-patches

Hi David,

On Jul 20 16:16, David Allsopp wrote:
> I've pushed a repro case for this to
> https://github.com/dra27/cygwin-nanosleep-bug.git
> 
> Originally noticed as the main CI system for OCaml has been failing
> sporadically for the signal.ml test mentioned in that repo. This morning I
> tried hammering that test on my dev machine and discovered that it fails
> very frequently. No idea if that's drivers, Windows 10 updates, number of
> cores or what, but it was definitely happening, and easily.
> 
> Drilling further, it appears that NtQueryTimer is able to return a negative
> value in the TimeRemaining field even when SignalState is false. The values
> I've seen have always been < 15ms - i.e. less than the timer resolution, so
> I wonder if there is a point at which the timer has elapsed but has not been
> signalled, but WaitForMultipleObjects returns because of the EINTR signal.
> Mildly surprising that it seems to be so reproducible.
> 
> Anyway, a patch is attached which simply guards a negative return value. The
> test on tbi.SignalState is in theory unnecessary.

Thanks for the patch, I think your patch is fine.  However, I'd like
to dig a bit into this to see what exactly happens.  Do you have a
very simple testcase in plain C, by any chance?


Thanks,
Corinna

^ permalink raw reply	[flat|nested] 8+ messages in thread

* RE: Fix nanosleep returning negative rem
  2021-07-21  8:39 ` Corinna Vinschen
@ 2021-07-21  9:07   ` David Allsopp
  2021-07-21  9:30     ` Corinna Vinschen
  0 siblings, 1 reply; 8+ messages in thread
From: David Allsopp @ 2021-07-21  9:07 UTC (permalink / raw)
  To: cygwin-patches

> On Jul 20 16:16, David Allsopp wrote:
> > I've pushed a repro case for this to
> > https://github.com/dra27/cygwin-nanosleep-bug.git
> >
> > Originally noticed as the main CI system for OCaml has been failing
> > sporadically for the signal.ml test mentioned in that repo. This
> > morning I tried hammering that test on my dev machine and discovered
> > that it fails very frequently. No idea if that's drivers, Windows 10
> > updates, number of cores or what, but it was definitely happening, and
> > easily.
> >
> > Drilling further, it appears that NtQueryTimer is able to return a
> > negative value in the TimeRemaining field even when SignalState is
> > false. The values I've seen have always been < 15ms - i.e. less than
> > the timer resolution, so I wonder if there is a point at which the
> > timer has elapsed but has not been signalled, but WaitForMultipleObjects
> returns because of the EINTR signal.
> > Mildly surprising that it seems to be so reproducible.
> >
> > Anyway, a patch is attached which simply guards a negative return
> > value. The test on tbi.SignalState is in theory unnecessary.
> 
> Thanks for the patch, I think your patch is fine.  However, I'd like to
> dig a bit into this to see what exactly happens.  Do you have a very
> simple testcase in plain C, by any chance?

https://github.com/dra27/cygwin-nanosleep-bug/blob/main/signal.c was as simple as I'd gone at this stage (eliminating OCaml from the equation!). It might be possible to get it to happen without all the pthreads stuff: having confirmed it definitely wasn't OCaml and been able to put the appropriate system_printf's into cygwait to see that NtQueryTimer really was returning this small negative value, I stopped simplifying.

Does that repro case trigger on your system too?

Best,


D

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Fix nanosleep returning negative rem
  2021-07-21  9:07   ` David Allsopp
@ 2021-07-21  9:30     ` Corinna Vinschen
  2021-07-21  9:33       ` Corinna Vinschen
  0 siblings, 1 reply; 8+ messages in thread
From: Corinna Vinschen @ 2021-07-21  9:30 UTC (permalink / raw)
  To: cygwin-patches

On Jul 21 09:07, David Allsopp wrote:
> > On Jul 20 16:16, David Allsopp wrote:
> > > I've pushed a repro case for this to
> > > https://github.com/dra27/cygwin-nanosleep-bug.git
> > >
> > > Originally noticed as the main CI system for OCaml has been failing
> > > sporadically for the signal.ml test mentioned in that repo. This
> > > morning I tried hammering that test on my dev machine and discovered
> > > that it fails very frequently. No idea if that's drivers, Windows 10
> > > updates, number of cores or what, but it was definitely happening, and
> > > easily.
> > >
> > > Drilling further, it appears that NtQueryTimer is able to return a
> > > negative value in the TimeRemaining field even when SignalState is
> > > false. The values I've seen have always been < 15ms - i.e. less than
> > > the timer resolution, so I wonder if there is a point at which the
> > > timer has elapsed but has not been signalled, but WaitForMultipleObjects
> > returns because of the EINTR signal.
> > > Mildly surprising that it seems to be so reproducible.
> > >
> > > Anyway, a patch is attached which simply guards a negative return
> > > value. The test on tbi.SignalState is in theory unnecessary.
> > 
> > Thanks for the patch, I think your patch is fine.  However, I'd like to
> > dig a bit into this to see what exactly happens.  Do you have a very
> > simple testcase in plain C, by any chance?
> 
> https://github.com/dra27/cygwin-nanosleep-bug/blob/main/signal.c was
> as simple as I'd gone at this stage (eliminating OCaml from the
> equation!). It might be possible to get it to happen without all the
> pthreads stuff: having confirmed it definitely wasn't OCaml and been
> able to put the appropriate system_printf's into cygwait to see that
> NtQueryTimer really was returning this small negative value, I stopped
> simplifying.
> 
> Does that repro case trigger on your system too?

I'm not sure.  Would the output " - nanosleep failed: ..." indicate the
bug has been triggered?  If so, no, I can't reproduce this on my system.

I wrote a quick STC using the NT API calls and I can't reproduce the
problem with this code either.  The output is either

  SignalState: 1 TimeRemaining: -5354077459183

or

  SignalState: 0 TimeRemaining: 653

I never get a small negative value in the latter case.  Can you
reproduce your problem with this testcase or tweak it to reproduce it?

Either way, your patch as safe guard should be ok.


Thanks,
Corinna

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Fix nanosleep returning negative rem
  2021-07-21  9:30     ` Corinna Vinschen
@ 2021-07-21  9:33       ` Corinna Vinschen
  2021-07-21 16:02         ` David Allsopp
  0 siblings, 1 reply; 8+ messages in thread
From: Corinna Vinschen @ 2021-07-21  9:33 UTC (permalink / raw)
  To: cygwin-patches

[-- Attachment #1: Type: text/plain, Size: 457 bytes --]

On Jul 21 11:30, Corinna Vinschen wrote:
> I wrote a quick STC using the NT API calls and I can't reproduce the
> problem with this code either.  The output is either
> 
>   SignalState: 1 TimeRemaining: -5354077459183
> 
> or
> 
>   SignalState: 0 TimeRemaining: 653
> 
> I never get a small negative value in the latter case.  Can you
> reproduce your problem with this testcase or tweak it to reproduce it?

Now I actually attached the code :}


Corinna

[-- Attachment #2: timer.c --]
[-- Type: text/plain, Size: 1580 bytes --]

#include <stdio.h>
#include <w32api/windows.h>
#include <w32api/winternl.h>
#include <w32api/ntdef.h>

typedef enum _TIMER_INFORMATION_CLASS {
  TimerBasicInformation = 0
} TIMER_INFORMATION_CLASS, *PTIMER_INFORMATION_CLASS;

typedef struct _TIMER_BASIC_INFORMATION {
  LARGE_INTEGER TimeRemaining;
  BOOLEAN SignalState;
} TIMER_BASIC_INFORMATION, *PTIMER_BASIC_INFORMATION;

NTSTATUS NTAPI NtCreateTimer(PHANDLE, ACCESS_MASK, POBJECT_ATTRIBUTES,
			     TIMER_TYPE);

NTSTATUS NTAPI NtCancelTimer(HANDLE, PBOOLEAN);

NTSTATUS NTAPI NtSetTimer(HANDLE, PLARGE_INTEGER, PVOID, PVOID,
			  BOOLEAN, LONG, PBOOLEAN);

NTSTATUS NTAPI NtQueryTimer (HANDLE, TIMER_INFORMATION_CLASS, PVOID,
			     ULONG, PULONG);

int
main ()
{
  HANDLE event;
  HANDLE timer;
  NTSTATUS status;
  LARGE_INTEGER timeout;
  TIMER_BASIC_INFORMATION tbi;

  event = CreateEvent (NULL, TRUE, FALSE, NULL);
  status = NtCreateTimer (&timer, TIMER_ALL_ACCESS, NULL, NotificationTimer);
  if (!NT_SUCCESS (status))
    {
      fprintf (stderr, "NtCreateTimer: 0x%08lx\n", status);
      return 1;
    }
  timeout.QuadPart = -10000000LL;	/* 1 sec */
  status = NtSetTimer (timer, &timeout, NULL, NULL, FALSE, 0, NULL);
  if (!NT_SUCCESS (status))
    {
      fprintf (stderr, "NtSetTimer: 0x%08lx\n", status);
      return 1;
    }
  WaitForSingleObject (event, 985L);
  NtQueryTimer (timer, TimerBasicInformation, &tbi, sizeof tbi, NULL);
  printf ("SignalState: %d TimeRemaining: %lld\n", tbi.SignalState,
						   tbi.TimeRemaining.QuadPart);
  NtCancelTimer (timer, NULL);
  NtClose (timer);
  return 0;
}

^ permalink raw reply	[flat|nested] 8+ messages in thread

* RE: Fix nanosleep returning negative rem
  2021-07-21  9:33       ` Corinna Vinschen
@ 2021-07-21 16:02         ` David Allsopp
  2021-07-21 17:57           ` Lavrentiev, Anton (NIH/NLM/NCBI) [C]
  2021-07-22  8:01           ` Corinna Vinschen
  0 siblings, 2 replies; 8+ messages in thread
From: David Allsopp @ 2021-07-21 16:02 UTC (permalink / raw)
  To: cygwin-patches

Corinna Vinschen wrote:
> Sent: 21 July 2021 10:33
> To: cygwin-patches@cygwin.com
> Subject: Re: Fix nanosleep returning negative rem
> 
> On Jul 21 11:30, Corinna Vinschen wrote:
> > I wrote a quick STC using the NT API calls and I can't reproduce the
> > problem with this code either.  The output is either
> >
> >   SignalState: 1 TimeRemaining: -5354077459183
> >
> > or
> >
> >   SignalState: 0 TimeRemaining: 653
> >
> > I never get a small negative value in the latter case.  Can you
> > reproduce your problem with this testcase or tweak it to reproduce it?
> 
> Now I actually attached the code :}

:) Yes, I can reproduce - I didn't even need a loop! Third time:

  dra@Thor /cygdrive/c/Scratch/nanosleep
  $ ./timer
  SignalState: 0 TimeRemaining: -1151

That said, I can get it easily get this on my desktop (AMD Ryzen Threadripper 3990X) but not at all on my laptop (Intel Core i7-8650U). On the laptop, ignoring the couple of signalled cases, 747 runs of timer.c give values between 131597-149947 with a very narrow SD (~4000) whereas on the AMD chip, 738 runs gives a range of -2722 to 149896 with a relatively wider SD of ~23000.

The CI system where this was first seen is an virtualised Intel system so it doesn't appear to be as simple as CPU manufacturer or even core count. That said, I'm not at all familiar with the details of how this works, but I expect the timer for these things is part of the chipset, not the CPU?!

Best,


D

^ permalink raw reply	[flat|nested] 8+ messages in thread

* RE: Fix nanosleep returning negative rem
  2021-07-21 16:02         ` David Allsopp
@ 2021-07-21 17:57           ` Lavrentiev, Anton (NIH/NLM/NCBI) [C]
  2021-07-22  8:01           ` Corinna Vinschen
  1 sibling, 0 replies; 8+ messages in thread
From: Lavrentiev, Anton (NIH/NLM/NCBI) [C] @ 2021-07-21 17:57 UTC (permalink / raw)
  To: David Allsopp, cygwin-patches

> I can get it easily get this on my desktop (AMD Ryzen Threadripper 3990X) but not at all on my laptop (Intel Core i7-8650U)

Not sure if that's really related but:

Have you checked what is your default Windows timer resolution?  Some applications change it from the default
100HZ to 1000HZ (by e.g. using timeBeginPeriod(1) -- MS Edge does this, and Chrome, I think, but not Firefox -- yet
the change is actually _global_ as documented).

The following code shows the timer resolution:

#include <windows.h>
#include <timeapi.h>
#include <stdio.h>
#include <synchapi.h>

int main()
{
    unsigned int prev = timeGetTime(), next;
    Sleep(1);
    next = timeGetTime();
    printf("Sleep(1) = %u\n", next - prev);
    return 0;
}

Output:

Sleep(1) = 10
means the default 100HZ

Sleep(1) = 1
means the increased resolution of 1ms

Anton Lavrentiev
Contractor NIH/NLM/NCBI

P.S. timeBeginPeriod():

This function affects a global Windows setting. Windows uses the lowest value (that is, highest
resolution) requested by any process. Setting a higher resolution can improve the accuracy of
time-out intervals in wait functions. However, it can also reduce overall system performance,
because the thread scheduler switches tasks more often. High resolutions can also prevent the
CPU power management system from entering power-saving modes.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Fix nanosleep returning negative rem
  2021-07-21 16:02         ` David Allsopp
  2021-07-21 17:57           ` Lavrentiev, Anton (NIH/NLM/NCBI) [C]
@ 2021-07-22  8:01           ` Corinna Vinschen
  1 sibling, 0 replies; 8+ messages in thread
From: Corinna Vinschen @ 2021-07-22  8:01 UTC (permalink / raw)
  To: cygwin-patches

On Jul 21 16:02, David Allsopp wrote:
> Corinna Vinschen wrote:
> > Sent: 21 July 2021 10:33
> > To: cygwin-patches@cygwin.com
> > Subject: Re: Fix nanosleep returning negative rem
> > 
> > On Jul 21 11:30, Corinna Vinschen wrote:
> > > I wrote a quick STC using the NT API calls and I can't reproduce the
> > > problem with this code either.  The output is either
> > >
> > >   SignalState: 1 TimeRemaining: -5354077459183
> > >
> > > or
> > >
> > >   SignalState: 0 TimeRemaining: 653
> > >
> > > I never get a small negative value in the latter case.  Can you
> > > reproduce your problem with this testcase or tweak it to reproduce it?
> > 
> > Now I actually attached the code :}
> 
> :) Yes, I can reproduce - I didn't even need a loop! Third time:
> 
>   dra@Thor /cygdrive/c/Scratch/nanosleep
>   $ ./timer
>   SignalState: 0 TimeRemaining: -1151
> 
> That said, I can get it easily get this on my desktop (AMD Ryzen
> Threadripper 3990X) but not at all on my laptop (Intel Core i7-8650U).
> On the laptop, ignoring the couple of signalled cases, 747 runs of
> timer.c give values between 131597-149947 with a very narrow SD
> (~4000) whereas on the AMD chip, 738 runs gives a range of -2722 to
> 149896 with a relatively wider SD of ~23000.
> 
> The CI system where this was first seen is an virtualised Intel system
> so it doesn't appear to be as simple as CPU manufacturer or even core
> count.

Weird.  I pushed your patch.

> That said, I'm not at all familiar with the details of how this
> works, but I expect the timer for these things is part of the chipset,
> not the CPU?!

As timers go, the timer behind the NtCreateTimer scenes is a simple
interrupt based timer with a default resolution of 15 ms.


Thanks,
Corinna

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2021-07-22  8:01 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-07-20 15:16 Fix nanosleep returning negative rem David Allsopp
2021-07-21  8:39 ` Corinna Vinschen
2021-07-21  9:07   ` David Allsopp
2021-07-21  9:30     ` Corinna Vinschen
2021-07-21  9:33       ` Corinna Vinschen
2021-07-21 16:02         ` David Allsopp
2021-07-21 17:57           ` Lavrentiev, Anton (NIH/NLM/NCBI) [C]
2021-07-22  8:01           ` Corinna Vinschen

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).