public inbox for cygwin-developers@cygwin.com
* Re: load average calculation failing
       [not found] <Pine.BSF.4.63.2205051618470.42373@m0.truegem.net>
@ 2022-05-08  7:01 ` Mark Geisert
       [not found]   ` <223aa826-7bf9-281a-aed8-e16349de5b96@dronecode.org.uk>
  2022-05-09 11:29   ` load average calculation failing Jon Turney
  0 siblings, 2 replies; 20+ messages in thread
From: Mark Geisert @ 2022-05-08  7:01 UTC (permalink / raw)
  To: Cygwin-developers

Mark Geisert wrote (on the main Cygwin mailing list):
> I've recently noticed that the 'xload' I routinely run shows zero load even with 
> compute-bound processes running.  This is on both Cygwin pre-3.4.0 as well as 
> 3.3.4.  A test program, shown below, indicates that getloadavg() is returning with 
> 0 status, i.e. not an error but no elements of the passed-in array updated.
> 
> Stepping with gdb through the test program seems weird within the 
> loadavginfo::load_init method.  Single-stepping at line loadavg.cc:68 goes to 
> strace.h:52 and then to _sigbe ?!
> 
> I had recently updated both Cygwin and Windows 10 to latest at the same time so I 
> cannot say when the failure started.  Last day or two at most.
> 
> ..mark
> 
> -------------------
> #define _GNU_SOURCE
> #include <stdio.h>
> #include <stdlib.h>
> 
> int
> main (int argc, char **argv)
> {
>      double loadavg[3];
> 
>      int res = getloadavg (loadavg, 3);
>      if (res == -1)
>          return 0xFF;
>      if (res > 0)
>          for (int i = 0; i < res; i++)
>              printf ("%.2f ", loadavg[i]);
> 
>      return res;
> }

I've debugged a bit further..  Within Cygwin's loadavg.cc:load_init(), the 
PdhOpenQueryW() call returns successfully.  The subsequent PdhAddEnglishCounterW() 
call is unsuccessful.  It returns status 0x800007D0 == PDH_CSTATUS_NO_MACHINE. 
The code (at line 68 mentioned above) calls debug_printf() to conditionally 
display the error, which is what leads to the strace.h and _sigbe; that's fine.

The weird PDH_CSTATUS_NO_MACHINE is the problem.  I'll try running the example 
from an elevated shell.  Or rebooting the machine.  After that it's consulting 
some oracle TBD. :-(

..mark

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: load average calculation failing
       [not found]   ` <223aa826-7bf9-281a-aed8-e16349de5b96@dronecode.org.uk>
@ 2022-05-09  8:45     ` Corinna Vinschen
  2022-05-09  8:53       ` Corinna Vinschen
  2022-05-10  8:34       ` Mark Geisert
  0 siblings, 2 replies; 20+ messages in thread
From: Corinna Vinschen @ 2022-05-09  8:45 UTC (permalink / raw)
  To: Mark Geisert, cygwin-developers

[redirect back to cygwin-developers]

On May  8 11:27, Jon Turney wrote:
> On 08/05/2022 08:01, Mark Geisert wrote:
> > Mark Geisert wrote (on the main Cygwin mailing list):
> > > I've recently noticed that the 'xload' I routinely run shows zero
> > > load even with compute-bound processes running.  This is on both
> > > Cygwin pre-3.4.0 as well as 3.3.4.  A test program, shown below,
> > > indicates that getloadavg() is returning with 0 status, i.e. not an
> > > error but no elems
> > > of the passed-in array updated.
> > > 
> > > Stepping with gdb through the test program seems weird within the
> > > loadavginfo::load_init method.  Single-stepping at line
> > > loadavg.cc:68 goes to strace.h:52 and then to _sigbe ?!
> > > 
> > > I had recently updated both Cygwin and Windows 10 to latest at the
> > > same time so I cannot say when the failure started.  Last day or two
> > > at most.
> > > 
> [...]
> > 
> > I've debugged a bit further..  Within Cygwin's loadavg.cc:load_init(),
> > the PdhOpenQueryW() call returns successfully.  The subsequent
> > PdhAddEnglishCounterW() call is unsuccessful.  It returns status

This is a bit weird.  I tried to debug this for a while on Friday on
W11, and there I can reproduce *a* problem, too, just not the same one
you report here.

On W11 I see load_init() working fine, the calls to
PdhAddEnglishCounterW succeed.  But then the call to
PdhGetFormattedCounterValue in get_load() fails with error
PDH_INVALID_DATA.  The CStatus member of fmtvalue1 is set to
PDH_CSTATUS_NO_INSTANCE.

If I tweak get_load to call PdhCollectQueryData again after a fail,
the second call succeeds.  The only problem with this is, the returned
data doesn't make a lot of sense. It only starts to make sense if I
add a Sleep(1000) before the second PdhCollectQueryData call, which is
rather disappointing.

Jon, would it, perhaps, make sense to call PdhCollectQueryData in
load_init(), without actually checking the return value?  The idea is
to make sure to have a base for the next call to PdhCollectQueryData
from inside get_load.

But even then, the first values returned by getloadavg might not make
much sense, so I guess this is just clutching at straws...

> > 0x800007D0 == PDH_CSTATUS_NO_MACHINE. The code (at line 68 mentioned

This is a weird error.

  "The path did not contain a computer name and the function was unable
   to retrieve the local computer name."

Yeah, sure.

Mark, did you try to add the computer name to the path by calling
GetComputerName() in load_init?

I tried the patch I pasted to the end of this mail, but it did not help
the first PdhGetFormattedCounterValue call in get_load to return
success.

> > above) calls debug_printf() to conditionally display the error, which is
> > what leads to the strace.h and _sigbe; that's fine.
> > 
> > The weird PDH_CSTATUS_NO_MACHINE is the problem.  I'll try running the
> > example from an elevated shell.  Or rebooting the machine.  After that
> > it's consulting some oracle TBD. :-(
> > 
> 
> Thanks for looking into this.
> You can find the user space version of this code I initially wrote at
> https://github.com/jon-turney/windows-loadavg, which might save you some
> time.
> 
> I can't reproduce this on W10 21H1, so I think this must be due to some
> change in later Windows...

I can reproduce this on W10 21H1, too, and the problem is the one
I outlined above, with load_init working fine and just the
PdhGetFormattedCounterValue failing in get_load.


Corinna



diff --git a/winsup/cygwin/loadavg.cc b/winsup/cygwin/loadavg.cc
index 127591a2e1f5..a014c2eb758c 100644
--- a/winsup/cygwin/loadavg.cc
+++ b/winsup/cygwin/loadavg.cc
@@ -40,6 +40,7 @@
 #include <time.h>
 #include <sys/strace.h>
 #include <pdh.h>
 
 static PDH_HQUERY query;
 static PDH_HCOUNTER counter1;
@@ -55,6 +55,17 @@ static bool load_init (void)
     tried = true;
 
     PDH_STATUS status;
+    DWORD size = MAX_PATH;
+    WCHAR machine_name[MAX_PATH];
+    WCHAR counter_name[MAX_PATH + 64];
+    PWCHAR counter_p = counter_name;
+
+    if (GetComputerNameW (machine_name, &size))
+      {
+	*counter_p++ = L'\\';
+	*counter_p++ = L'\\';
+	counter_p = wcpcpy (counter_p, machine_name);
+      }
 
     status = PdhOpenQueryW (NULL, 0, &query);
     if (status != STATUS_SUCCESS)
@@ -62,18 +73,17 @@ static bool load_init (void)
 	debug_printf ("PdhOpenQueryW, status %y", status);
 	return false;
       }
-    status = PdhAddEnglishCounterW (query,
-				    L"\\Processor(_Total)\\% Processor Time",
-				    0, &counter1);
+
+    wcpcpy (counter_p, L"\\Processor(_Total)\\% Processor Time");
+    status = PdhAddEnglishCounterW (query, counter_name, 0, &counter1);
     if (status != STATUS_SUCCESS)
       {
 	debug_printf ("PdhAddEnglishCounterW(time), status %y", status);
 	return false;
       }
-    status = PdhAddEnglishCounterW (query,
-				    L"\\System\\Processor Queue Length",
-				    0, &counter2);
 
+    wcpcpy (counter_p, L"\\System\\Processor Queue Length");
+    status = PdhAddEnglishCounterW (query, counter_name, 0, &counter2);
     if (status != STATUS_SUCCESS)
       {
 	debug_printf ("PdhAddEnglishCounterW(queue length), status %y", status);

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: load average calculation failing
  2022-05-09  8:45     ` Corinna Vinschen
@ 2022-05-09  8:53       ` Corinna Vinschen
  2022-05-10  8:34       ` Mark Geisert
  1 sibling, 0 replies; 20+ messages in thread
From: Corinna Vinschen @ 2022-05-09  8:53 UTC (permalink / raw)
  To: Mark Geisert, cygwin-developers

On May  9 10:45, Corinna Vinschen wrote:
> Mark, did you try to add the computer name to the path by calling
> GetComputerName() in load_init?
> 
> I tried the patch I pasted to the end of this mail, but it did not help
> the first PdhGetFormattedCounterValue call in get_load to return
> success.
> 
> > > above) calls debug_printf() to conditionally display the error, which is
> > > what leads to the strace.h and _sigbe; that's fine.
> > > 
> > > The weird PDH_CSTATUS_NO_MACHINE is the problem.  I'll try running the
> > > example from an elevated shell.  Or rebooting the machine.  After that
> > > it's consulting some oracle TBD. :-(
> > > 
> > 
> > Thanks for looking into this.
> > You can find the user space version of this code I initially wrote at
> > https://github.com/jon-turney/windows-loadavg, which might save you some
> > time.
> > 
> > I can't reproduce this on W10 21H1, so I think this must be due to some
> > change in later Windows...
> 
> I can reproduce this on W10 21H1, too, and the problem is the one
> I outlined above, with load_init working fine and just the
> PdhGetFormattedCounterValue failing in get_load.

Btw, the other patch, which makes loadavg work for me, is the
one below.  It just doesn't really make me happy.


diff --git a/winsup/cygwin/loadavg.cc b/winsup/cygwin/loadavg.cc
index 127591a2e1f5..e08c13c86de6 100644
--- a/winsup/cygwin/loadavg.cc
+++ b/winsup/cygwin/loadavg.cc
@@ -40,6 +40,7 @@
 #include <time.h>
 #include <sys/strace.h>
 #include <pdh.h>
+#include <pdhmsg.h>
 
 static PDH_HQUERY query;
 static PDH_HCOUNTER counter1;
@@ -95,18 +96,33 @@ static bool load_init (void)
 /* estimate the current load */
 static bool get_load (double *load)
 {
+  PDH_STATUS ret;
+  PDH_FMT_COUNTERVALUE fmtvalue1;
+  bool tried_again = false;
+
   *load = 0.0;
 
-  PDH_STATUS ret = PdhCollectQueryData (query);
+try_again:
+
+  ret = PdhCollectQueryData (query);
   if (ret != ERROR_SUCCESS)
     return false;
 
   /* Estimate the number of running processes as (NumberOfProcessors) * (%
      Processor Time) */
-  PDH_FMT_COUNTERVALUE fmtvalue1;
   ret = PdhGetFormattedCounterValue (counter1, PDH_FMT_DOUBLE, NULL, &fmtvalue1);
   if (ret != ERROR_SUCCESS)
-    return false;
+    {
+      if (ret == (PDH_STATUS) PDH_INVALID_DATA
+	  && fmtvalue1.CStatus == PDH_CSTATUS_INVALID_DATA
+	  && !tried_again)
+	{
+	  tried_again = true;
+	  Sleep (1000L);
+	  goto try_again;
+	}
+      return false;
+    }
 
   double running = fmtvalue1.doubleValue * wincap.cpu_count () / 100;
 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: load average calculation failing
  2022-05-08  7:01 ` load average calculation failing Mark Geisert
       [not found]   ` <223aa826-7bf9-281a-aed8-e16349de5b96@dronecode.org.uk>
@ 2022-05-09 11:29   ` Jon Turney
  2022-05-10  8:21     ` Mark Geisert
  1 sibling, 1 reply; 20+ messages in thread
From: Jon Turney @ 2022-05-09 11:29 UTC (permalink / raw)
  To: Mark Geisert, cygwin-developers

On 08/05/2022 08:01, Mark Geisert wrote:
> Mark Geisert wrote (on the main Cygwin mailing list):
>> I've recently noticed that the 'xload' I routinely run shows zero load 
>> even with compute-bound processes running.  This is on both Cygwin 
>> pre-3.4.0 as well as 3.3.4.  A test program, shown below, indicates 
>> that getloadavg() is returning with 0 status, i.e. not an error but no 
>> elems
>> of the passed-in array updated.
>>

One other thing I forgot to mention: xload doesn't call getloadavg(), it 
directly uses the PDH interface, see:

https://gitlab.freedesktop.org/xorg/app/xload/-/blob/master/get_load.c#L71

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: load average calculation failing
  2022-05-09 11:29   ` load average calculation failing Jon Turney
@ 2022-05-10  8:21     ` Mark Geisert
  0 siblings, 0 replies; 20+ messages in thread
From: Mark Geisert @ 2022-05-10  8:21 UTC (permalink / raw)
  To: cygwin-developers

Jon Turney wrote:
> On 08/05/2022 08:01, Mark Geisert wrote:
>> Mark Geisert wrote (on the main Cygwin mailing list):
>>> I've recently noticed that the 'xload' I routinely run shows zero load even 
>>> with compute-bound processes running.  This is on both Cygwin pre-3.4.0 as well 
>>> as 3.3.4.  A test program, shown below, indicates that getloadavg() is 
>>> returning with 0 status, i.e. not an error but no elems
>>> of the passed-in array updated.
>>>
> 
> One other thing I forgot to mention: xload doesn't call getloadavg(), it directly 
> uses the PDH interface, see:
> 
> https://gitlab.freedesktop.org/xorg/app/xload/-/blob/master/get_load.c#L71

Ah, I did notice xload on the list of pdh.dll users at some point, thanks.
And thanks explicitly for the pointer to your github-hosted wmi-loadavg.cc.
That has indeed been very helpful in poking around this issue.
Cheers,

..mark

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: load average calculation failing
  2022-05-09  8:45     ` Corinna Vinschen
  2022-05-09  8:53       ` Corinna Vinschen
@ 2022-05-10  8:34       ` Mark Geisert
  2022-05-10 13:37         ` Jon Turney
  1 sibling, 1 reply; 20+ messages in thread
From: Mark Geisert @ 2022-05-10  8:34 UTC (permalink / raw)
  To: cygwin-developers

Corinna Vinschen wrote:
> [redirect back to cygwin-developers]
> 
> On May  8 11:27, Jon Turney wrote:
>> On 08/05/2022 08:01, Mark Geisert wrote:
>>> Mark Geisert wrote (on the main Cygwin mailing list):
>>>> I've recently noticed that the 'xload' I routinely run shows zero
>>>> load even with compute-bound processes running.  This is on both
>>>> Cygwin pre-3.4.0 as well as 3.3.4.  A test program, shown below,
>>>> indicates that getloadavg() is returning with 0 status, i.e. not an
>>>> error but no elems
>>>> of the passed-in array updated.
>>>>
>>>> Stepping with gdb through the test program seems weird within the
>>>> loadavginfo::load_init method.  Single-stepping at line
>>>> loadavg.cc:68 goes to strace.h:52 and then to _sigbe ?!
>>>>
>>>> I had recently updated both Cygwin and Windows 10 to latest at the
>>>> same time so I cannot say when the failure started.  Last day or two
>>>> at most.
>>>>
>> [...]
>>>
>>> I've debugged a bit further..  Within Cygwin's loadavg.cc:load_init(),
>>> the PdhOpenQueryW() call returns successfully.  The subsequent
>>> PdhAddEnglishCounterW() call is unsuccessful.  It returns status
> 
> This is a bit weird.  I tried to debug this for a while on Friday on
> W11, and there I can reproduce *a* problem, too, just not the same one
> you report here.
> 
> On W11 I see load_init() working fine, the calls to
> PdhAddEnglishCounterW succeed.  But then the call to
> PdhGetFormattedCounterValue in get_load() fails with error
> PDH_INVALID_DATA.  The CStatus member of fmtvalue1 is set to
> PDH_CSTATUS_NO_INSTANCE.
> 
> If I tweak get_load to call PdhCollectQueryData again after a fail,
> the second call succeeds.  The only problem with this is, the returned
> data doesn't make a lot of sense. It only starts to make sense if I
> add a Sleep(1000) before the second PdhCollectQueryData call, which is
> rather disappointing.
> 
> Jon, would it, perhaps, make sense to call PdhCollectQueryData in
> load_init(), without actually checking the return value?  The idea is
> to make sure to have a base for the next call to PdhCollectQueryData
> from inside get_load.
> 
> But even then, the first values returned by getloadavg might not make
> much sense, so I guess this is just clutching at straws...
> 
>>> 0x800007D0 == PDH_CSTATUS_NO_MACHINE. The code (at line 68 mentioned
> 
> This is a weird error.
> 
>    "The path did not contain a computer name and the function was unable
>     to retrieve the local computer name."
> 
> Yeah, sure.
> 
> Mark, did you try to add the computer name to the path by calling
> GetComputerName() in load_init?

I tried more ham-handedly by prepending L"\\\\hostname" or L"\\\\.".  No change.
I'm running W10 21H2 on my home machines.  One with the issue is up-to-date with 
Windows patches.  Another that still shows reasonable load averages may not have 
the very latest patches; I need to verify that.

Some web page I found while searching for PDH stuff claimed that the performance 
counters are maintained by a Windows Service, which only gets started when some 
process attaches to pdh.dll.  I have to find that page again and see if it talks 
about which Windows versions that applies to.  That might possibly explain why one 
can't get reasonable counter numbers immediately after PdhOpenQuery.  But then, my 
running xload, which does load pdh.dll, should be seeing good counters.

..mark

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: load average calculation failing
  2022-05-10  8:34       ` Mark Geisert
@ 2022-05-10 13:37         ` Jon Turney
  2022-05-11 23:40           ` load average calculation failing -- fixed by Windows update Mark Geisert
  0 siblings, 1 reply; 20+ messages in thread
From: Jon Turney @ 2022-05-10 13:37 UTC (permalink / raw)
  To: cygwin-developers

On 10/05/2022 09:34, Mark Geisert wrote:
> Corinna Vinschen wrote:
>> [redirect back to cygwin-developers]

>>
>>>> 0x800007D0 == PDH_CSTATUS_NO_MACHINE. The code (at line 68 mentioned
>>
>> This is a weird error.
>>
>>    "The path did not contain a computer name and the function was unable
>>     to retrieve the local computer name."
>>
>> Yeah, sure.
>>
>> Mark, did you try to add the computer name to the path by calling
>> GetComputerName() in load_init?

As we've seen before, this error also seems to be used for "not
authorized" problems.

https://sourceware.org/git/?p=newlib-cygwin.git;a=commitdiff;h=de7f13aa9acec022ad1e4b3f929d4dc982ddf60b


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: load average calculation failing -- fixed by Windows update
  2022-05-10 13:37         ` Jon Turney
@ 2022-05-11 23:40           ` Mark Geisert
  2022-05-12  8:17             ` Corinna Vinschen
  2022-05-12  9:48             ` Corinna Vinschen
  0 siblings, 2 replies; 20+ messages in thread
From: Mark Geisert @ 2022-05-11 23:40 UTC (permalink / raw)
  To: cygwin-developers

Jon Turney wrote:
> On 10/05/2022 09:34, Mark Geisert wrote:
>> Corinna Vinschen wrote:
>>> [redirect back to cygwin-developers]
> 
>>>
>>>>> 0x800007D0 == PDH_CSTATUS_NO_MACHINE. The code (at line 68 mentioned
>>>
>>> This is a weird error.
>>>
>>>    "The path did not contain a computer name and the function was unable
>>>     to retrieve the local computer name."
>>>
>>> Yeah, sure.
>>>
>>> Mark, did you try to add the computer name to the path by calling
>>> GetComputerName() in load_init?
> 
> As we've seen before, this error also seems to be used for "not authorized" 
> problems.
> 
> https://sourceware.org/git/?p=newlib-cygwin.git;a=commitdiff;h=de7f13aa9acec022ad1e4b3f929d4dc982ddf60b 

Sheesh.  This all seems entirely too complicated.

But thankfully, after installing latest Windows patches (from yesterday's MS Patch 
Tuesday) I find myself on W10 21H2 Build 19044.1706.  Xload, uptime, and Jon's 
initial PoC code now show good load averages.  I had previously been on Build 
19044.1645.

The first counter read now gets error 0xC0000BC6 == PDH_INVALID_DATA, but no 
errors on subsequent counter reads.  This sounds like it now matches what Corinna 
reported for W11.  I wonder if she's running build 1706 already.

It seems to me MS broke PDH or its interfacing for one build, 1645, and fixed it 
for the next, 1706.  That's all I can surmise from the data we have.

I think my work and/or damage here (on this topic) is done.
Cheers,

..mark

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: load average calculation failing -- fixed by Windows update
  2022-05-11 23:40           ` load average calculation failing -- fixed by Windows update Mark Geisert
@ 2022-05-12  8:17             ` Corinna Vinschen
  2022-05-12  8:24               ` Mark Geisert
  2022-05-12  9:48             ` Corinna Vinschen
  1 sibling, 1 reply; 20+ messages in thread
From: Corinna Vinschen @ 2022-05-12  8:17 UTC (permalink / raw)
  To: cygwin-developers

On May 11 16:40, Mark Geisert wrote:
> Jon Turney wrote:
> > On 10/05/2022 09:34, Mark Geisert wrote:
> > > Corinna Vinschen wrote:
> > > > [redirect back to cygwin-developers]
> > 
> > > > 
> > > > > > 0x800007D0 == PDH_CSTATUS_NO_MACHINE. The code (at line 68 mentioned
> > > > 
> > > > This is a weird error.
> > > > 
> > > >    "The path did not contain a computer name and the function was unable
> > > >     to retrieve the local computer name."
> > > > 
> > > > Yeah, sure.
> > > > 
> > > > Mark, did you try to add the computer name to the path by calling
> > > > GetComputerName() in load_init?
> > 
> > As we've seen before, this error also seems to be used for "not
> > authorized" problems.
> > 
> > https://sourceware.org/git/?p=newlib-cygwin.git;a=commitdiff;h=de7f13aa9acec022ad1e4b3f929d4dc982ddf60b
> 
> Sheesh.  This all seems entirely too complicated.
> 
> But thankfully, after installing latest Windows patches (from yesterday's MS
> Patch Tuesday) I find myself on W10 21H2 Build 19044.1706.  Xload, uptime,
> and Jon's initial PoC code now show good load averages.  I had previously
> been on Build 19044.1645.
> 
> The first counter read now gets error 0xC0000BC6 == PDH_INVALID_DATA, but no
> errors on subsequent counter reads.  This sounds like it now matches what
> Corinna reported for W11.  I wonder if she's running build 1706 already.
> 
> It seems to me MS broke PDH or its interfacing for one build, 1645, and
> fixed it for the next, 1706.  That's all I can surmise from the data we
> have.
> 
> I think my work and/or damage here (on this topic) is done.

You're the lucky one then.  It still doesn't work for me on W10 21H1
and W11.


Corinna

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: load average calculation failing -- fixed by Windows update
  2022-05-12  8:17             ` Corinna Vinschen
@ 2022-05-12  8:24               ` Mark Geisert
  2022-05-12  8:43                 ` Corinna Vinschen
  0 siblings, 1 reply; 20+ messages in thread
From: Mark Geisert @ 2022-05-12  8:24 UTC (permalink / raw)
  To: cygwin-developers

Corinna Vinschen wrote:
> On May 11 16:40, Mark Geisert wrote:
>> Jon Turney wrote:
>>> On 10/05/2022 09:34, Mark Geisert wrote:
>>>> Corinna Vinschen wrote:
>>>>> [redirect back to cygwin-developers]
>>>
>>>>>
>>>>>>> 0x800007D0 == PDH_CSTATUS_NO_MACHINE. The code (at line 68 mentioned
>>>>>
>>>>> This is a weird error.
>>>>>
>>>>>     "The path did not contain a computer name and the function was unable
>>>>>      to retrieve the local computer name."
>>>>>
>>>>> Yeah, sure.
>>>>>
>>>>> Mark, did you try to add the computer name to the path by calling
>>>>> GetComputerName() in load_init?
>>>
>>> As we've seen before, this error also seems to be used for "not
>>> authorized" problems.
>>>
>>> https://sourceware.org/git/?p=newlib-cygwin.git;a=commitdiff;h=de7f13aa9acec022ad1e4b3f929d4dc982ddf60b
>>
>> Sheesh.  This all seems entirely too complicated.
>>
>> But thankfully, after installing latest Windows patches (from yesterday's MS
>> Patch Tuesday) I find myself on W10 21H2 Build 19044.1706.  Xload, uptime,
>> and Jon's initial PoC code now show good load averages.  I had previously
>> been on Build 19044.1645.
>>
>> The first counter read now gets error 0xC0000BC6 == PDH_INVALID_DATA, but no
>> errors on subsequent counter reads.  This sounds like it now matches what
>> Corinna reported for W11.  I wonder if she's running build 1706 already.
>>
>> It seems to me MS broke PDH or its interfacing for one build, 1645, and
>> fixed it for the next, 1706.  That's all I can surmise from the data we
>> have.
>>
>> I think my work and/or damage here (on this topic) is done.
> 
> You're the lucky one then.  It still doesn't work for me on W10 21H1
> and W11.

What Windows build(s) are you running?  The 'ver' command inside a Command Prompt 
window is the easiest way to determine.

Also, both my machines were running build 1645 but it was a Windows Home machine 
that had the issue; the Windows Pro machine was fine.  So many variables....

..mark


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: load average calculation failing -- fixed by Windows update
  2022-05-12  8:24               ` Mark Geisert
@ 2022-05-12  8:43                 ` Corinna Vinschen
  0 siblings, 0 replies; 20+ messages in thread
From: Corinna Vinschen @ 2022-05-12  8:43 UTC (permalink / raw)
  To: cygwin-developers

On May 12 01:24, Mark Geisert wrote:
> Corinna Vinschen wrote:
> > On May 11 16:40, Mark Geisert wrote:
> > > Jon Turney wrote:
> > > > On 10/05/2022 09:34, Mark Geisert wrote:
> > > > > Corinna Vinschen wrote:
> > > > > > [redirect back to cygwin-developers]
> > > > 
> > > > > > 
> > > > > > > > 0x800007D0 == PDH_CSTATUS_NO_MACHINE. The code (at line 68 mentioned
> > > > > > 
> > > > > > This is a weird error.
> > > > > > 
> > > > > >     "The path did not contain a computer name and the function was unable
> > > > > >      to retrieve the local computer name."
> > > > > > 
> > > > > > Yeah, sure.
> > > > > > 
> > > > > > Mark, did you try to add the computer name to the path by calling
> > > > > > GetComputerName() in load_init?
> > > > 
> > > > As we've seen before, this error also seems to be used for "not
> > > > authorized" problems.
> > > > 
> > > > https://sourceware.org/git/?p=newlib-cygwin.git;a=commitdiff;h=de7f13aa9acec022ad1e4b3f929d4dc982ddf60b
> > > 
> > > Sheesh.  This all seems entirely too complicated.
> > > 
> > > But thankfully, after installing latest Windows patches (from yesterday's MS
> > > Patch Tuesday) I find myself on W10 21H2 Build 19044.1706.  Xload, uptime,
> > > and Jon's initial PoC code now show good load averages.  I had previously
> > > been on Build 19044.1645.
> > > 
> > > The first counter read now gets error 0xC0000BC6 == PDH_INVALID_DATA, but no
> > > errors on subsequent counter reads.  This sounds like it now matches what
> > > Corinna reported for W11.  I wonder if she's running build 1706 already.
> > > 
> > > It seems to me MS broke PDH or its interfacing for one build, 1645, and
> > > fixed it for the next, 1706.  That's all I can surmise from the data we
> > > have.
> > > 
> > > I think my work and/or damage here (on this topic) is done.
> > 
> > You're the lucky one then.  It still doesn't work for me on W10 21H1
> > and W11.
> 
> What Windows build(s) are you running?  The 'ver' command inside a Command
> Prompt window is the easiest way to determine.
> 
> Also, both my machines were running build 1645 but it was a Windows Home
> machine that had the issue; the Windows Pro machine was fine.  So many
> variables....

Both are domain member machines: Windows 10 version 10.0.19043.1706 and
Windows 11 version 10.0.22000.675.  Both are Enterprise editions.


Corinna

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: load average calculation failing -- fixed by Windows update
  2022-05-11 23:40           ` load average calculation failing -- fixed by Windows update Mark Geisert
  2022-05-12  8:17             ` Corinna Vinschen
@ 2022-05-12  9:48             ` Corinna Vinschen
  2022-05-13 10:34               ` Jon Turney
  1 sibling, 1 reply; 20+ messages in thread
From: Corinna Vinschen @ 2022-05-12  9:48 UTC (permalink / raw)
  To: cygwin-developers

On May 11 16:40, Mark Geisert wrote:
> Jon Turney wrote:
> > On 10/05/2022 09:34, Mark Geisert wrote:
> > > Corinna Vinschen wrote:
> > > > [redirect back to cygwin-developers]
> > 
> > > > 
> > > > > > 0x800007D0 == PDH_CSTATUS_NO_MACHINE. The code (at line 68 mentioned
> > > > 
> > > > This is a weird error.
> > > > 
> > > >    "The path did not contain a computer name and the function was unable
> > > >     to retrieve the local computer name."
> > > > 
> > > > Yeah, sure.
> > > > 
> > > > Mark, did you try to add the computer name to the path by calling
> > > > GetComputerName() in load_init?
> > 
> > As we've seen before, this error also seems to be used for "not
> > authorized" problems.
> > 
> > https://sourceware.org/git/?p=newlib-cygwin.git;a=commitdiff;h=de7f13aa9acec022ad1e4b3f929d4dc982ddf60b
> 
> Sheesh.  This all seems entirely too complicated.
> 
> But thankfully, after installing latest Windows patches (from yesterday's MS
> Patch Tuesday) I find myself on W10 21H2 Build 19044.1706.  Xload, uptime,
> and Jon's initial PoC code now show good load averages.  I had previously
> been on Build 19044.1645.
> 
> The first counter read now gets error 0xC0000BC6 == PDH_INVALID_DATA, but no
> errors on subsequent counter reads.  This sounds like it now matches what
> Corinna reported for W11.  I wonder if she's running build 1706 already.

Erm... looks like I didn't read your mail thoroughly enough.

This behaviour, the first call returning with PDH_INVALID_DATA and only
subsequent calls returning valid(?) values, is what breaks the
getloadavg function and, consequently, /proc/loadavg.  So maybe xload
now works, but Cygwin is still broken.


Corinna

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: load average calculation failing -- fixed by Windows update
  2022-05-12  9:48             ` Corinna Vinschen
@ 2022-05-13 10:34               ` Jon Turney
  2022-05-13 11:04                 ` Corinna Vinschen
  0 siblings, 1 reply; 20+ messages in thread
From: Jon Turney @ 2022-05-13 10:34 UTC (permalink / raw)
  To: cygwin-developers

On 12/05/2022 10:48, Corinna Vinschen wrote:
> On May 11 16:40, Mark Geisert wrote:
>>
>> The first counter read now gets error 0xC0000BC6 == PDH_INVALID_DATA, but no
>> errors on subsequent counter reads.  This sounds like it now matches what
>> Corinna reported for W11.  I wonder if she's running build 1706 already.
> 
> Erm... looks like I didn't read your mail thoroughly enough.
> 
> This behaviour, the first call returning with PDH_INVALID_DATA and only
> subsequent calls returning valid(?) values, is what breaks the
> getloadavg function and, consequently, /proc/loadavg.  So maybe xload
> now works, but Cygwin is still broken.

The first attempt to read '% Processor Time' is expected to fail with 
PDH_INVALID_DATA, since it doesn't have a value at a particular instant, 
but one averaged over a period of time.

This is what the following comment is meant to record:

"Note that PDH will only return data for '% Processor Time' after the 
second call to PdhCollectQueryData(), as it's computed over an interval, 
so the first attempt to estimate load will fail and 0.0 will be returned."


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: load average calculation failing -- fixed by Windows update
  2022-05-13 10:34               ` Jon Turney
@ 2022-05-13 11:04                 ` Corinna Vinschen
  2022-05-13 11:05                   ` Corinna Vinschen
  0 siblings, 1 reply; 20+ messages in thread
From: Corinna Vinschen @ 2022-05-13 11:04 UTC (permalink / raw)
  To: cygwin-developers

On May 13 11:34, Jon Turney wrote:
> On 12/05/2022 10:48, Corinna Vinschen wrote:
> > On May 11 16:40, Mark Geisert wrote:
> > > 
> > > The first counter read now gets error 0xC0000BC6 == PDH_INVALID_DATA, but no
> > > errors on subsequent counter reads.  This sounds like it now matches what
> > > Corinna reported for W11.  I wonder if she's running build 1706 already.
> > 
> > Erm... looks like I didn't read your mail thoroughly enough.
> > 
> > This behaviour, the first call returning with PDH_INVALID_DATA and only
> > subsequent calls returning valid(?) values, is what breaks the
> > getloadavg function and, consequently, /proc/loadavg.  So maybe xload
> > now works, but Cygwin is still broken.
> 
> The first attempt to read '% Processor Time' is expected to fail with
> PDH_INVALID_DATA, since it doesn't have a value at a particular instant, but
> one averaged over a period of time.
> 
> This is what the following comment is meant to record:
> 
> "Note that PDH will only return data for '% Processor Time' after the second
> call to PdhCollectQueryData(), as it's computed over an interval, so the
> first attempt to estimate load will fail and 0.0 will be returned."

But.

Every invocation of getloadavg() returns 0.  Even under load.  Calling
`cat /proc/loadavg' is an exercise in futility.

The only way to make getloadavg() work is to call it in a loop from the
same process with a 1 sec pause between invocations.  In that case, even
a parallel `cat /proc/loadavg' shows the same load values.

However, as soon as I stop the looping process, the /proc/loadavg values
are frozen in the last state they had when stopping that process.

Any suggestions how to fix this?


Corinna


* Re: load average calculation failing -- fixed by Windows update
  2022-05-13 11:04                 ` Corinna Vinschen
@ 2022-05-13 11:05                   ` Corinna Vinschen
  2022-05-16  5:25                     ` load average calculation imperfections Mark Geisert
  2022-05-17 14:48                     ` load average calculation failing -- fixed by Windows update Jon Turney
  0 siblings, 2 replies; 20+ messages in thread
From: Corinna Vinschen @ 2022-05-13 11:05 UTC (permalink / raw)
  To: cygwin-developers

On May 13 13:04, Corinna Vinschen wrote:
> On May 13 11:34, Jon Turney wrote:
> > On 12/05/2022 10:48, Corinna Vinschen wrote:
> > > On May 11 16:40, Mark Geisert wrote:
> > > > 
> > > > The first counter read now gets error 0xC0000BC6 == PDH_INVALID_DATA, but no
> > > > errors on subsequent counter reads.  This sounds like it now matches what
> > > > Corinna reported for W11.  I wonder if she's running build 1706 already.
> > > 
> > > Erm... looks like I didn't read your mail thoroughly enough.
> > > 
> > > This behaviour, the first call returning with PDH_INVALID_DATA and only
> > > subsequent calls returning valid(?) values, is what breaks the
> > > getloadavg function and, consequentially, /proc/loadavg.  So maybe xload
> > > now works, but Cygwin is still broken.
> > 
> > The first attempt to read '% Processor Time' is expected to fail with
> > PDH_INVALID_DATA, since it doesn't have a value at a particular instant, but
> > one averaged over a period of time.
> > 
> > This is what the following comment is meant to record:
> > 
> > "Note that PDH will only return data for '% Processor Time' after the second
> > call to PdhCollectQueryData(), as it's computed over an interval, so the
> > first attempt to estimate load will fail and 0.0 will be returned."
> 
> But.
> 
> Every invocation of getloadavg() returns 0.  Even under load.  Calling
> `cat /proc/loadavg' is an exercise in futility.
> 
> The only way to make getloadavg() work is to call it in a loop from the
> same process with a 1 sec pause between invocations.  In that case, even
> a parallel `cat /proc/loadavg' shows the same load values.
> 
> However, as soon as I stop the looping process, the /proc/loadavg values
> are frozen in the last state they had when stopping that process.

Oh, and, stopping and restarting all Cygwin processes in the session will
reset the loadavg to 0.

> Any suggestions how to fix this?
> 
> 
> Corinna


* Re: load average calculation imperfections
  2022-05-13 11:05                   ` Corinna Vinschen
@ 2022-05-16  5:25                     ` Mark Geisert
  2022-05-16 16:49                       ` Jon Turney
  2022-05-17 14:48                     ` load average calculation failing -- fixed by Windows update Jon Turney
  1 sibling, 1 reply; 20+ messages in thread
From: Mark Geisert @ 2022-05-16  5:25 UTC (permalink / raw)
  To: cygwin-developers

Corinna Vinschen wrote:
> On May 13 13:04, Corinna Vinschen wrote:
>> On May 13 11:34, Jon Turney wrote:
>>> On 12/05/2022 10:48, Corinna Vinschen wrote:
>>>> On May 11 16:40, Mark Geisert wrote:
>>>>>
>>>>> The first counter read now gets error 0xC0000BC6 == PDH_INVALID_DATA, but no
>>>>> errors on subsequent counter reads.  This sounds like it now matches what
>>>>> Corinna reported for W11.  I wonder if she's running build 1706 already.
>>>>
>>>> Erm... looks like I didn't read your mail thoroughly enough.
>>>>
>>>> This behaviour, the first call returning with PDH_INVALID_DATA and only
>>>> subsequent calls returning valid(?) values, is what breaks the
>>>> getloadavg function and, consequentially, /proc/loadavg.  So maybe xload
>>>> now works, but Cygwin is still broken.
>>>
>>> The first attempt to read '% Processor Time' is expected to fail with
>>> PDH_INVALID_DATA, since it doesn't have a value at a particular instant, but
>>> one averaged over a period of time.
>>>
>>> This is what the following comment is meant to record:
>>>
>>> "Note that PDH will only return data for '% Processor Time' after the second
>>> call to PdhCollectQueryData(), as it's computed over an interval, so the
>>> first attempt to estimate load will fail and 0.0 will be returned."
>>
>> But.
>>
>> Every invocation of getloadavg() returns 0.  Even under load.  Calling
>> `cat /proc/loadavg' is an exercise in futility.
>>
>> The only way to make getloadavg() work is to call it in a loop from the
>> same process with a 1 sec pause between invocations.  In that case, even
>> a parallel `cat /proc/loadavg' shows the same load values.
>>
>> However, as soon as I stop the looping process, the /proc/loadavg values
>> are frozen in the last state they had when stopping that process.
> 
> Oh, and, stopping and restarting all Cygwin processes in the session will
> reset the loadavg to 0.
> 
>> Any suggestions how to fix this?

I'm getting somewhat better behavior from repeated 'cat /proc/loadavg' with the 
following update to Cygwin's loadavg.cc:

diff --git a/winsup/cygwin/loadavg.cc b/winsup/cygwin/loadavg.cc
index 127591a2e..cceb3e9fe 100644
--- a/winsup/cygwin/loadavg.cc
+++ b/winsup/cygwin/loadavg.cc
@@ -87,6 +87,9 @@ static bool load_init (void)
      }

      initialized = true;
+
+    /* prime the data pump, hopefully */
+    (void) PdhCollectQueryData (query);
    }

    return initialized;

It's only somewhat better because it seems like multiple updaters of the load 
average act sort of independently.  It's hard to characterize what I'm seeing but 
let me try.

First let me shove xload aside by saying it shows instantaneous load and is thus a 
different animal.  It only cares about total %processor time, so its load average 
value never goes higher than ncpus, nor does it have any decay behavior built-in.

Any other Cygwin app I know of is using getloadavg() under the hood.  When it 
calculates a new set of 1,5,15 minute load averages, it uses total %processor time 
and total processor queue length.  It has a decay behavior that I think has been 
around since early Unix.  What I haven't noticed before is an "inverse" decay 
behavior that seems wrong to me, but maybe Linux has this.  That is, if you have 
just one compute-bound process the load average won't reach 1.0 until that process 
has been running for a full minute.  You don't see instantaneous load.

I guess that's all reasonable so far.  But I think the wrinkle Cygwin is adding, 
allowing the load average to be calculated by multiple updaters, makes it seem 
like updaters are not keeping in sync with each other despite the loadavginfo 
shared data.  I can't quite wrap my head around the current implementation to 
prove or disprove its correctness.

Ideally, the shared data should have the most recently calculated 1,5,15 minute 
load averages and a timestamp of when they were calculated.  And then any process 
that calls getloadavg() should independently decide whether it's time to calculate 
an updated set of values for machine-wide use.  But can the decay calculations get 
messed up due to multiple updaters?  I want to say no, but I can't quite convince 
myself.  Each updater has its own idea of the 1,5,15 timespans, doesn't it, 
because updates can occur at random, rather than at a set period like a kernel 
would do?

..mark


* Re: load average calculation imperfections
  2022-05-16  5:25                     ` load average calculation imperfections Mark Geisert
@ 2022-05-16 16:49                       ` Jon Turney
  2022-05-17  5:39                         ` Mark Geisert
  0 siblings, 1 reply; 20+ messages in thread
From: Jon Turney @ 2022-05-16 16:49 UTC (permalink / raw)
  To: cygwin-developers, Mark Geisert

On 16/05/2022 06:25, Mark Geisert wrote:
> Corinna Vinschen wrote:
>> On May 13 13:04, Corinna Vinschen wrote:
>>> On May 13 11:34, Jon Turney wrote:
>>>> On 12/05/2022 10:48, Corinna Vinschen wrote:
>>>>> On May 11 16:40, Mark Geisert wrote:
>>>>>>
>>>>>> The first counter read now gets error 0xC0000BC6 == 
>>>>>> PDH_INVALID_DATA, but no
>>>>>> errors on subsequent counter reads.  This sounds like it now 
>>>>>> matches what
>>>>>> Corinna reported for W11.  I wonder if she's running build 1706 
>>>>>> already.
>>>>>
>>>>> Erm... looks like I didn't read your mail thoroughly enough.
>>>>>
>>>>> This behaviour, the first call returning with PDH_INVALID_DATA and 
>>>>> only
>>>>> subsequent calls returning valid(?) values, is what breaks the
>>>>> getloadavg function and, consequentially, /proc/loadavg.  So maybe 
>>>>> xload
>>>>> now works, but Cygwin is still broken.
>>>>
>>>> The first attempt to read '% Processor Time' is expected to fail with
>>>> PDH_INVALID_DATA, since it doesn't have a value at a particular 
>>>> instant, but
>>>> one averaged over a period of time.
>>>>
>>>> This is what the following comment is meant to record:
>>>>
>>>> "Note that PDH will only return data for '% Processor Time' after 
>>>> the second
>>>> call to PdhCollectQueryData(), as it's computed over an interval, so 
>>>> the
>>>> first attempt to estimate load will fail and 0.0 will be returned."
>>>
>>> But.
>>>
>>> Every invocation of getloadavg() returns 0.  Even under load.  Calling
>>> `cat /proc/loadavg' is an exercise in futility.
>>>
>>> The only way to make getloadavg() work is to call it in a loop from the
>>> same process with a 1 sec pause between invocations.  In that case, even
>>> a parallel `cat /proc/loadavg' shows the same load values.
>>>
>>> However, as soon as I stop the looping process, the /proc/loadavg values
>>> are frozen in the last state they had when stopping that process.
>>
>> Oh, and, stopping and restarting all Cygwin processes in the session will
>> reset the loadavg to 0.
>>
>>> Any suggestions how to fix this?
> 
> I'm getting somewhat better behavior from repeated 'cat /proc/loadavg' 
> with the following update to Cygwin's loadavg.cc:
> 
> diff --git a/winsup/cygwin/loadavg.cc b/winsup/cygwin/loadavg.cc
> index 127591a2e..cceb3e9fe 100644
> --- a/winsup/cygwin/loadavg.cc
> +++ b/winsup/cygwin/loadavg.cc
> @@ -87,6 +87,9 @@ static bool load_init (void)
>       }
> 
>       initialized = true;
> +
> +    /* prime the data pump, hopefully */
> +    (void) PdhCollectQueryData (query);
>     }

Yeah, something like this might be a good idea, as at the moment we 
report load averages of 0 for the 5 seconds after the first time someone 
asks for it.

It's not ideal, because with this change, we go on to call 
PdhCollectQueryData() again very shortly afterwards, so the first value 
for '% Processor Time' is measured over a very short interval, and so 
may be very inaccurate.

>     return initialized;
> 
> It's only somewhat better because it seems like multiple updaters of the 
> load average act sort of independently.  It's hard to characterize what 
> I'm seeing but let me try.
> 
> First let me shove xload aside by saying it shows instantaneous load and 
> is thus a different animal.  It only cares about total %processor time, 
> so its load average value never goes higher than ncpus, nor does it have 
> any decay behavior built-in.
> 
> Any other Cygwin app I know of is using getloadavg() under the hood.  
> When it calculates a new set of 1,5,15 minute load averages, it uses 
> total %processor time and total processor queue length.  It has a decay 
> behavior that I think has been around since early Unix.  What I haven't 
> noticed before is an "inverse" decay behavior that seems wrong to me, 
> but maybe Linux has this.  That is, if you have just one compute-bound 
> process the load average won't reach 1.0 until that process has been 
> running for a full minute.  You don't see instantaneous load.

In fact it asymptotically approaches 1, so it wouldn't reach it until 
you've had a load of 1 for a long time compared to the time you are 
averaging over.

Starting from idle, a unit load after 1 minute would result in a 
1-minute load average of (1 - (1/e)) = ~0.63.   See 
https://www.brendangregg.com/blog/2017-08-08/linux-load-averages.html 
for some discussion of that.

That's just how it works, as a measure of demand, not load.

> I guess that's all reasonable so far.  But I think the wrinkle Cygwin is 
> adding, allowing the load average to be calculated by multiple updaters, 
> makes it seem like updaters are not keeping in sync with each other 
> despite the loadavginfo shared data.  I can't quite wrap my head around 
> the current implementation to prove or disprove its correctness.
> 
> Ideally, the shared data should have the most recently calculated 1,5,15 
> minute load averages and a timestamp of when they were calculated.  And 
> then any process that calls getloadavg() should independently decide 
> whether it's time to calculate an updated set of values for machine-wide 
> use.  But can the decay calculations get messed up due to multiple 
> updaters?  I want to say no, but I can't quite convince myself.  Each 
> updater has its own idea of the 1,5,15 timespans, doesn't it, because 
> updates can occur at random, rather than at a set period like a kernel 
> would do?

I think not, because last_time is part of the shared loadavginfo state, 
which is the unix epoch time that the last update was computed, and 
updating that is guarded by a mutex.

That's not to say that this code might not be wrong in some other way :)


* Re: load average calculation imperfections
  2022-05-16 16:49                       ` Jon Turney
@ 2022-05-17  5:39                         ` Mark Geisert
  0 siblings, 0 replies; 20+ messages in thread
From: Mark Geisert @ 2022-05-17  5:39 UTC (permalink / raw)
  To: cygwin-developers

Jon Turney wrote:
> On 16/05/2022 06:25, Mark Geisert wrote:
>> Corinna Vinschen wrote:
>>> On May 13 13:04, Corinna Vinschen wrote:
>>>> On May 13 11:34, Jon Turney wrote:
>>>>> On 12/05/2022 10:48, Corinna Vinschen wrote:
>>>>>> On May 11 16:40, Mark Geisert wrote:
>>>>>>>
>>>>>>> The first counter read now gets error 0xC0000BC6 == PDH_INVALID_DATA, but no
>>>>>>> errors on subsequent counter reads.  This sounds like it now matches what
>>>>>>> Corinna reported for W11.  I wonder if she's running build 1706 already.
>>>>>>
>>>>>> Erm... looks like I didn't read your mail thoroughly enough.
>>>>>>
>>>>>> This behaviour, the first call returning with PDH_INVALID_DATA and only
>>>>>> subsequent calls returning valid(?) values, is what breaks the
>>>>>> getloadavg function and, consequentially, /proc/loadavg.  So maybe xload
>>>>>> now works, but Cygwin is still broken.
>>>>>
>>>>> The first attempt to read '% Processor Time' is expected to fail with
>>>>> PDH_INVALID_DATA, since it doesn't have a value at a particular instant, but
>>>>> one averaged over a period of time.
>>>>>
>>>>> This is what the following comment is meant to record:
>>>>>
>>>>> "Note that PDH will only return data for '% Processor Time' after the second
>>>>> call to PdhCollectQueryData(), as it's computed over an interval, so the
>>>>> first attempt to estimate load will fail and 0.0 will be returned."
>>>>
>>>> But.
>>>>
>>>> Every invocation of getloadavg() returns 0.  Even under load.  Calling
>>>> `cat /proc/loadavg' is an exercise in futility.
>>>>
>>>> The only way to make getloadavg() work is to call it in a loop from the
>>>> same process with a 1 sec pause between invocations.  In that case, even
>>>> a parallel `cat /proc/loadavg' shows the same load values.
>>>>
>>>> However, as soon as I stop the looping process, the /proc/loadavg values
>>>> are frozen in the last state they had when stopping that process.
>>>
>>> Oh, and, stopping and restarting all Cygwin processes in the session will
>>> reset the loadavg to 0.
>>>
>>>> Any suggestions how to fix this?
>>
>> I'm getting somewhat better behavior from repeated 'cat /proc/loadavg' with the 
>> following update to Cygwin's loadavg.cc:
>>
>> diff --git a/winsup/cygwin/loadavg.cc b/winsup/cygwin/loadavg.cc
>> index 127591a2e..cceb3e9fe 100644
>> --- a/winsup/cygwin/loadavg.cc
>> +++ b/winsup/cygwin/loadavg.cc
>> @@ -87,6 +87,9 @@ static bool load_init (void)
>>       }
>>
>>       initialized = true;
>> +
>> +    /* prime the data pump, hopefully */
>> +    (void) PdhCollectQueryData (query);
>>     }
> 
> Yeah, something like this might be a good idea, as at the moment we report load 
> averages of 0 for the 5 seconds after the first time someone asks for it.
> 
> It's not ideal, because with this change, we go on to call PdhCollectQueryData() 
> again very shortly afterwards, so the first value for '% Processor Time' is 
> measured over a very short interval, and so may be very inaccurate.

Perhaps add a short delay, say 100ms, after that first PdhCollectQueryData()? 
Enough for anything compute-bound to be measurable but not enough to be 
human-noticeable?  Something even shorter?

[...]
>> Any other Cygwin app I know of is using getloadavg() under the hood. When it 
>> calculates a new set of 1,5,15 minute load averages, it uses total %processor 
>> time and total processor queue length.  It has a decay behavior that I think has 
>> been around since early Unix.  What I haven't noticed before is an "inverse" 
>> decay behavior that seems wrong to me, but maybe Linux has this.  That is, if 
>> you have just one compute-bound process the load average won't reach 1.0 until 
>> that process has been running for a full minute.  You don't see instantaneous load.
> 
> In fact it asymptotically approaches 1, so it wouldn't reach it until you've had a 
> load of 1 for a long time compared to the time you are averaging over.
> 
> Starting from idle, a unit load after 1 minute would result in a 1-minute load 
> average of (1 - (1/e)) = ~0.63.   See 
> https://www.brendangregg.com/blog/2017-08-08/linux-load-averages.html for some 
> discussion of that.
> 
> That's just how it works, as a measure of demand, not load.

Thanks for that link; that was interesting to read.  OK, so that's how it is: the 
ramp is even more drawn out over time than I was thinking.

[...]
>> Ideally, the shared data should have the most recently calculated 1,5,15 minute 
>> load averages and a timestamp of when they were calculated.  And then any 
>> process that calls getloadavg() should independently decide whether it's time to 
>> calculate an updated set of values for machine-wide use.  But can the decay 
>> calculations get messed up due to multiple updaters?  I want to say no, but I 
>> can't quite convince myself.  Each updater has its own idea of the 1,5,15 
>> timespans, doesn't it, because updates can occur at random, rather than at a set 
>> period like a kernel would do?
> 
> I think not, because last_time is part of the shared loadavginfo state, which is 
> the unix epoch time that the last update was computed, and updating that is 
> guarded by a mutex.
> 
> That's not to say that this code might not be wrong in some other way :)

Alright, I see the problem with how I was visualizing multiple updaters.  I was 
thinking of the "real" load average over time as a superposition (sum, I guess) of 
the decaying exponential curves of all the updaters' calculations.  But no, each 
updater replaces the current curve with a new one based on its own new data.  What 
I was envisioning would be much more complex and require more state memory.  Oof.

I can submit a patch for the added PdhCollectQueryData() plus short Sleep() if it 
would make sense to try it for a while on Cygwin head.  Other suggestions welcome.

..mark


* Re: load average calculation failing -- fixed by Windows update
  2022-05-13 11:05                   ` Corinna Vinschen
  2022-05-16  5:25                     ` load average calculation imperfections Mark Geisert
@ 2022-05-17 14:48                     ` Jon Turney
  2022-05-17 19:48                       ` Mark Geisert
  1 sibling, 1 reply; 20+ messages in thread
From: Jon Turney @ 2022-05-17 14:48 UTC (permalink / raw)
  To: cygwin-developers

On 13/05/2022 12:05, Corinna Vinschen wrote:
> On May 13 13:04, Corinna Vinschen wrote:
>> On May 13 11:34, Jon Turney wrote:
>>> On 12/05/2022 10:48, Corinna Vinschen wrote:
>>>> On May 11 16:40, Mark Geisert wrote:
>>>>>
>>>>> The first counter read now gets error 0xC0000BC6 == PDH_INVALID_DATA, but no
>>>>> errors on subsequent counter reads.  This sounds like it now matches what
>>>>> Corinna reported for W11.  I wonder if she's running build 1706 already.
>>>>
>>>> Erm... looks like I didn't read your mail thoroughly enough.
>>>>
>>>> This behaviour, the first call returning with PDH_INVALID_DATA and only
>>>> subsequent calls returning valid(?) values, is what breaks the
>>>> getloadavg function and, consequentially, /proc/loadavg.  So maybe xload
>>>> now works, but Cygwin is still broken.
>>>
>>> The first attempt to read '% Processor Time' is expected to fail with
>>> PDH_INVALID_DATA, since it doesn't have a value at a particular instant, but
>>> one averaged over a period of time.
>>>
>>> This is what the following comment is meant to record:
>>>
>>> "Note that PDH will only return data for '% Processor Time' after the second
>>> call to PdhCollectQueryData(), as it's computed over an interval, so the
>>> first attempt to estimate load will fail and 0.0 will be returned."
>>
>> But.
>>
>> Every invocation of getloadavg() returns 0.  Even under load.  Calling
>> `cat /proc/loadavg' is an exercise in futility.
>>
>> The only way to make getloadavg() work is to call it in a loop from the
>> same process with a 1 sec pause between invocations.  In that case, even
>> a parallel `cat /proc/loadavg' shows the same load values.
>>
>> However, as soon as I stop the looping process, the /proc/loadavg values
>> are frozen in the last state they had when stopping that process.
> 
> Oh, and, stopping and restarting all Cygwin processes in the session will
> reset the loadavg to 0.
> 
>> Any suggestions how to fix this?

Ah, right.  'while true ; do cat /proc/loadavg ; done' just shows a 
stream of zeroes, because each process only calls getloadavg() once, 
which doesn't update the loadavg, because the first call to fetch it 
fails with PDH_INVALID_DATA.

This isn't really simply fixable because PDH "handles" aren't shareable 
between processes.

I don't think this is new since it is mentioned in [1].

Non-solution: use top instead :)

[1] https://cygwin.com/pipermail/cygwin-patches/2017q1/008699.html



* Re: load average calculation failing -- fixed by Windows update
  2022-05-17 14:48                     ` load average calculation failing -- fixed by Windows update Jon Turney
@ 2022-05-17 19:48                       ` Mark Geisert
  0 siblings, 0 replies; 20+ messages in thread
From: Mark Geisert @ 2022-05-17 19:48 UTC (permalink / raw)
  To: cygwin-developers

Jon Turney wrote:
> On 13/05/2022 12:05, Corinna Vinschen wrote:
>> On May 13 13:04, Corinna Vinschen wrote:
>>> On May 13 11:34, Jon Turney wrote:
>>>> On 12/05/2022 10:48, Corinna Vinschen wrote:
>>>>> On May 11 16:40, Mark Geisert wrote:
>>>>>>
>>>>>> The first counter read now gets error 0xC0000BC6 == PDH_INVALID_DATA, but no
>>>>>> errors on subsequent counter reads.  This sounds like it now matches what
>>>>>> Corinna reported for W11.  I wonder if she's running build 1706 already.
>>>>>
>>>>> Erm... looks like I didn't read your mail thoroughly enough.
>>>>>
>>>>> This behaviour, the first call returning with PDH_INVALID_DATA and only
>>>>> subsequent calls returning valid(?) values, is what breaks the
>>>>> getloadavg function and, consequentially, /proc/loadavg.  So maybe xload
>>>>> now works, but Cygwin is still broken.
>>>>
>>>> The first attempt to read '% Processor Time' is expected to fail with
>>>> PDH_INVALID_DATA, since it doesn't have a value at a particular instant, but
>>>> one averaged over a period of time.
>>>>
>>>> This is what the following comment is meant to record:
>>>>
>>>> "Note that PDH will only return data for '% Processor Time' after the second
>>>> call to PdhCollectQueryData(), as it's computed over an interval, so the
>>>> first attempt to estimate load will fail and 0.0 will be returned."
>>>
>>> But.
>>>
>>> Every invocation of getloadavg() returns 0.  Even under load.  Calling
>>> `cat /proc/loadavg' is an exercise in futility.
>>>
>>> The only way to make getloadavg() work is to call it in a loop from the
>>> same process with a 1 sec pause between invocations.  In that case, even
>>> a parallel `cat /proc/loadavg' shows the same load values.
>>>
>>> However, as soon as I stop the looping process, the /proc/loadavg values
>>> are frozen in the last state they had when stopping that process.
>>
>> Oh, and, stopping and restarting all Cygwin processes in the session will
>> reset the loadavg to 0.
>>
>>> Any suggestions how to fix this?
> 
> Ah, right.  'while true ; do cat /proc/loadavg ; done' just shows a stream of 
> zeroes, because each process only calls getloadavg() once, which doesn't update 
> the loadavg, because the first call to fetch it fails with PDH_INVALID_DATA.
> 
> This isn't really simply fixable because PDH "handles" aren't shareable between 
> processes.
> 
> I don't think this is new since it is mentioned in [1].
> 
> Non-solution: use top instead :)
> 
> [1] https://cygwin.com/pipermail/cygwin-patches/2017q1/008699.html

I think my starting the "imperfections" thread only fractured the discussion :-(. 
The patch I mention in the other thread improves this repeated cat /proc/loadavg 
display, if nothing else.  It throws away that first sample that errors.

~ while true; do cat /proc/loadavg; done
0.00 0.00 0.00 1/10
0.00 0.00 0.00 1/10
0.00 0.00 0.00 1/10
0.00 0.00 0.00 1/10
0.00 0.00 0.00 1/10
0.00 0.00 0.00 1/10
0.00 0.00 0.00 1/10
0.00 0.00 0.00 1/10
0.00 0.00 0.00 1/10
0.00 0.00 0.00 1/10
0.00 0.00 0.00 1/10
0.00 0.00 0.00 1/10
0.00 0.00 0.00 1/10
0.08 0.01 0.00 1/10
0.08 0.01 0.00 1/10
0.08 0.01 0.00 1/10
0.08 0.01 0.00 1/10
0.08 0.01 0.00 1/10
0.08 0.01 0.00 1/10
0.08 0.01 0.00 1/10
0.08 0.01 0.00 1/10
0.08 0.01 0.00 1/10
0.08 0.01 0.00 1/10
0.08 0.01 0.00 1/10
0.08 0.01 0.00 1/10
0.08 0.01 0.00 1/10
0.08 0.01 0.00 1/10
0.07 0.01 0.00 1/10
0.07 0.01 0.00 1/10
0.07 0.01 0.00 1/10
...

..mark



end of thread, other threads:[~2022-05-17 19:48 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <Pine.BSF.4.63.2205051618470.42373@m0.truegem.net>
2022-05-08  7:01 ` load average calculation failing Mark Geisert
     [not found]   ` <223aa826-7bf9-281a-aed8-e16349de5b96@dronecode.org.uk>
2022-05-09  8:45     ` Corinna Vinschen
2022-05-09  8:53       ` Corinna Vinschen
2022-05-10  8:34       ` Mark Geisert
2022-05-10 13:37         ` Jon Turney
2022-05-11 23:40           ` load average calculation failing -- fixed by Windows update Mark Geisert
2022-05-12  8:17             ` Corinna Vinschen
2022-05-12  8:24               ` Mark Geisert
2022-05-12  8:43                 ` Corinna Vinschen
2022-05-12  9:48             ` Corinna Vinschen
2022-05-13 10:34               ` Jon Turney
2022-05-13 11:04                 ` Corinna Vinschen
2022-05-13 11:05                   ` Corinna Vinschen
2022-05-16  5:25                     ` load average calculation imperfections Mark Geisert
2022-05-16 16:49                       ` Jon Turney
2022-05-17  5:39                         ` Mark Geisert
2022-05-17 14:48                     ` load average calculation failing -- fixed by Windows update Jon Turney
2022-05-17 19:48                       ` Mark Geisert
2022-05-09 11:29   ` load average calculation failing Jon Turney
2022-05-10  8:21     ` Mark Geisert
