Subject: Re: load average calculation imperfections
From: Mark Geisert <mark@maxrnd.com>
To: cygwin-developers@cygwin.com
Date: Mon, 16 May 2022 22:39:45 -0700

Jon Turney wrote:
> On 16/05/2022 06:25, Mark Geisert wrote:
>> Corinna Vinschen wrote:
>>> On May 13 13:04, Corinna Vinschen wrote:
>>>> On May 13 11:34, Jon Turney wrote:
>>>>> On 12/05/2022 10:48, Corinna Vinschen wrote:
>>>>>> On May 11 16:40, Mark Geisert wrote:
>>>>>>>
>>>>>>> The first counter read now gets error 0xC0000BC6 == PDH_INVALID_DATA, but no
>>>>>>> errors on subsequent counter reads.  This sounds like it now matches what
>>>>>>> Corinna reported for W11.  I wonder if she's running build 1706 already.
>>>>>>
>>>>>> Erm... looks like I didn't read your mail thoroughly enough.
>>>>>>
>>>>>> This behaviour, the first call returning with PDH_INVALID_DATA and only
>>>>>> subsequent calls returning valid(?) values, is what breaks the
>>>>>> getloadavg function and, consequently, /proc/loadavg.  So maybe xload
>>>>>> now works, but Cygwin is still broken.
>>>>>
>>>>> The first attempt to read '% Processor Time' is expected to fail with
>>>>> PDH_INVALID_DATA, since it doesn't have a value at a particular instant, but
>>>>> one averaged over a period of time.
>>>>>
>>>>> This is what the following comment is meant to record:
>>>>>
>>>>> "Note that PDH will only return data for '% Processor Time' after the second
>>>>> call to PdhCollectQueryData(), as it's computed over an interval, so the
>>>>> first attempt to estimate load will fail and 0.0 will be returned."
>>>>
>>>> But.
>>>>
>>>> Every invocation of getloadavg() returns 0.
>>>> Even under load.  Calling `cat /proc/loadavg' is an exercise in futility.
>>>>
>>>> The only way to make getloadavg() work is to call it in a loop from the
>>>> same process with a 1 sec pause between invocations.  In that case, even
>>>> a parallel `cat /proc/loadavg' shows the same load values.
>>>>
>>>> However, as soon as I stop the looping process, the /proc/loadavg values
>>>> are frozen in the last state they had when stopping that process.
>>>
>>> Oh, and, stopping and restarting all Cygwin processes in the session will
>>> reset the loadavg to 0.
>>>
>>>> Any suggestions how to fix this?
>>
>> I'm getting somewhat better behavior from repeated 'cat /proc/loadavg' with the
>> following update to Cygwin's loadavg.cc:
>>
>> diff --git a/winsup/cygwin/loadavg.cc b/winsup/cygwin/loadavg.cc
>> index 127591a2e..cceb3e9fe 100644
>> --- a/winsup/cygwin/loadavg.cc
>> +++ b/winsup/cygwin/loadavg.cc
>> @@ -87,6 +87,9 @@ static bool load_init (void)
>>        }
>>
>>        initialized = true;
>> +
>> +    /* prime the data pump, hopefully */
>> +    (void) PdhCollectQueryData (query);
>>      }
>
> Yeah, something like this might be a good idea, as at the moment we report load
> averages of 0 for the 5 seconds after the first time someone asks for it.
>
> It's not ideal, because with this change, we go on to call PdhCollectQueryData()
> again very shortly afterwards, so the first value for '% Processor Time' is
> measured over a very short interval, and so may be very inaccurate.

Perhaps add a short delay, say 100ms, after that first PdhCollectQueryData()?
Enough for anything compute-bound to be measurable but not enough to be
human-noticeable?  Something even shorter?  (A rough sketch of this idea is at
the end of this message.)

[...]

>> Any other Cygwin app I know of is using getloadavg() under the hood.  When it
>> calculates a new set of 1,5,15 minute load averages, it uses total %processor
>> time and total processor queue length.  It has a decay behavior that I think
>> has been around since early Unix.  What I haven't noticed before is an
>> "inverse" decay behavior that seems wrong to me, but maybe Linux has this.
>> That is, if you have just one compute-bound process the load average won't
>> reach 1.0 until that process has been running for a full minute.  You don't
>> see instantaneous load.
>
> In fact it asymptotically approaches 1, so it wouldn't reach it until you've
> had a load of 1 for a long time compared to the time you are averaging over.
>
> Starting from idle, a unit load after 1 minute would result in a 1-minute load
> average of (1 - (1/e)) = ~0.63.  See
> https://www.brendangregg.com/blog/2017-08-08/linux-load-averages.html for some
> discussion of that.
>
> That's just how it works, as a measure of demand, not load.

Thanks for that link; that was interesting to read.  OK on that being how it
is; the ramp is even more drawn out over time than I was thinking.  (A small
worked example of the decay math is also at the end of this message.)

[...]

>> Ideally, the shared data should have the most recently calculated 1,5,15
>> minute load averages and a timestamp of when they were calculated.  And then
>> any process that calls getloadavg() should independently decide whether it's
>> time to calculate an updated set of values for machine-wide use.  But can the
>> decay calculations get messed up due to multiple updaters?  I want to say no,
>> but I can't quite convince myself.  Each updater has its own idea of the
>> 1,5,15 timespans, doesn't it, because updates can occur at random, rather
>> than at a set period like a kernel would do?
>
> I think not, because last_time is part of the shared loadavginfo state, which
> is the unix epoch time that the last update was computed, and updating that is
> guarded by a mutex.
>
> That's not to say that this code might not be wrong in some other way :)

Alright, I see the problem with how I was visualizing multiple updaters.  I was
thinking of the "real" load average over time as a superposition (sum, I guess)
of the decaying exponential curves of all the updaters' calculations.  But no,
each updater replaces the current curve with a new one based on its own new
data.  What I was envisioning would be much more complex and require more state
memory.  Oof.

I can submit a patch for the added PdhCollectQueryData() plus short Sleep() if
it would make sense to try it for a while on Cygwin head.  Other suggestions
welcome.
..mark
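
To make the "prime plus short delay" proposal concrete, here is a rough,
standalone sketch of the idea (not the actual Cygwin patch): open a PDH query,
collect once to establish a baseline, wait roughly 100 ms, then collect again
so the first '% Processor Time' read covers a real interval.  The 100 ms figure
is just the value floated above, and error handling is minimal.

/* pdh_prime_demo.c -- on Windows/MinGW/Cygwin, build with something like:
     gcc pdh_prime_demo.c -lpdh                                        */
#include <windows.h>
#include <pdh.h>
#include <stdio.h>

int main (void)
{
  PDH_HQUERY query;
  PDH_HCOUNTER cpu;
  PDH_FMT_COUNTERVALUE val;

  if (PdhOpenQueryW (NULL, 0, &query) != ERROR_SUCCESS)
    return 1;

  /* Same counter the loadavg code samples. */
  if (PdhAddEnglishCounterW (query, L"\\Processor(_Total)\\% Processor Time",
                             0, &cpu) != ERROR_SUCCESS)
    return 1;

  /* "Prime the pump": the first collection only establishes a baseline;
     reading the counter now would fail with PDH_INVALID_DATA. */
  (void) PdhCollectQueryData (query);

  /* The short delay under discussion (100 ms), so the next collection
     averages over a real, if small, interval. */
  Sleep (100);

  (void) PdhCollectQueryData (query);

  if (PdhGetFormattedCounterValue (cpu, PDH_FMT_DOUBLE, NULL, &val)
      == ERROR_SUCCESS)
    printf ("%% Processor Time over ~100 ms: %.1f\n", val.doubleValue);

  PdhCloseQuery (query);
  return 0;
}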
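
And for reference, a small standalone illustration of the exponential averaging
Jon describes (not the actual Cygwin implementation): starting from idle with a
constant demand of 1, the 1-minute average after 60 seconds comes out to about
1 - 1/e = ~0.63, with the 5- and 15-minute averages lagging much further
behind.  The 5-second update interval is just an assumption for the example;
the real updater runs whenever something asks for the load average.

/* loadavg_decay_demo.c -- no Windows APIs needed, just the averaging math. */
#include <math.h>
#include <stdio.h>

int main (void)
{
  const double interval = 5.0;                        /* assumed update step */
  const double periods[3] = { 60.0, 300.0, 900.0 };   /* 1, 5, 15 minutes    */
  double avg[3] = { 0.0, 0.0, 0.0 };                  /* start from idle     */
  const double demand = 1.0;                          /* one CPU-bound hog   */

  for (int step = 1; step <= 12; step++)              /* 12 * 5 s = 60 s     */
    for (int i = 0; i < 3; i++)
      {
        double decay = exp (-interval / periods[i]);
        avg[i] = avg[i] * decay + demand * (1.0 - decay);
      }

  /* Prints roughly: 1min 0.63  5min 0.18  15min 0.06
     i.e. the 1-minute average only reaches ~(1 - 1/e) after a full minute. */
  printf ("1min %.2f  5min %.2f  15min %.2f\n", avg[0], avg[1], avg[2]);
  return 0;
}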