Subject: Re: load average calculation imperfections
From: Mark Geisert <mark@maxrnd.com>
To: cygwin-developers@cygwin.com
Date: Mon, 16 May 2022 22:39:45 -0700

Jon Turney wrote:
> On 16/05/2022 06:25, Mark Geisert wrote:
>> Corinna Vinschen wrote:
>>> On May 13 13:04, Corinna Vinschen wrote:
>>>> On May 13 11:34, Jon Turney wrote:
>>>>> On 12/05/2022 10:48, Corinna Vinschen wrote:
>>>>>> On May 11 16:40, Mark Geisert wrote:
>>>>>>>
>>>>>>> The first counter read now gets error 0xC0000BC6 == PDH_INVALID_DATA, but no
>>>>>>> errors on subsequent counter reads.  This sounds like it now matches what
>>>>>>> Corinna reported for W11.  I wonder if she's running build 1706 already.
>>>>>>
>>>>>> Erm... looks like I didn't read your mail thoroughly enough.
>>>>>>
>>>>>> This behaviour, the first call returning with PDH_INVALID_DATA and only
>>>>>> subsequent calls returning valid(?) values, is what breaks the
>>>>>> getloadavg function and, consequently, /proc/loadavg.  So maybe xload
>>>>>> now works, but Cygwin is still broken.
>>>>>
>>>>> The first attempt to read '% Processor Time' is expected to fail with
>>>>> PDH_INVALID_DATA, since it doesn't have a value at a particular instant, but
>>>>> one averaged over a period of time.
>>>>>
>>>>> This is what the following comment is meant to record:
>>>>>
>>>>> "Note that PDH will only return data for '% Processor Time' after the second
>>>>> call to PdhCollectQueryData(), as it's computed over an interval, so the
>>>>> first attempt to estimate load will fail and 0.0 will be returned."
>>>>
>>>> But.
>>>>
>>>> Every invocation of getloadavg() returns 0.
>>>> Even under load.  Calling `cat /proc/loadavg' is an exercise in futility.
>>>>
>>>> The only way to make getloadavg() work is to call it in a loop from the
>>>> same process with a 1 sec pause between invocations.  In that case, even
>>>> a parallel `cat /proc/loadavg' shows the same load values.
>>>>
>>>> However, as soon as I stop the looping process, the /proc/loadavg values
>>>> are frozen in the last state they had when stopping that process.
>>>
>>> Oh, and, stopping and restarting all Cygwin processes in the session will
>>> reset the loadavg to 0.
>>>
>>>> Any suggestions how to fix this?
>>
>> I'm getting somewhat better behavior from repeated 'cat /proc/loadavg' with the
>> following update to Cygwin's loadavg.cc:
>>
>> diff --git a/winsup/cygwin/loadavg.cc b/winsup/cygwin/loadavg.cc
>> index 127591a2e..cceb3e9fe 100644
>> --- a/winsup/cygwin/loadavg.cc
>> +++ b/winsup/cygwin/loadavg.cc
>> @@ -87,6 +87,9 @@ static bool load_init (void)
>>        }
>>
>>        initialized = true;
>> +
>> +    /* prime the data pump, hopefully */
>> +    (void) PdhCollectQueryData (query);
>>      }
>
> Yeah, something like this might be a good idea, as at the moment we report load
> averages of 0 for the 5 seconds after the first time someone asks for it.
>
> It's not ideal, because with this change, we go on to call PdhCollectQueryData()
> again very shortly afterwards, so the first value for '% Processor Time' is
> measured over a very short interval, and so may be very inaccurate.

Perhaps add a short delay, say 100ms, after that first PdhCollectQueryData()?
Enough for anything compute-bound to be measurable but not enough to be
human-noticeable?  Something even shorter?  (A rough sketch of this idea is at
the end of this message.)

[...]

>> Any other Cygwin app I know of is using getloadavg() under the hood.  When it
>> calculates a new set of 1,5,15 minute load averages, it uses total %processor
>> time and total processor queue length.  It has a decay behavior that I think
>> has been around since early Unix.  What I haven't noticed before is an
>> "inverse" decay behavior that seems wrong to me, but maybe Linux has this.
>> That is, if you have just one compute-bound process the load average won't
>> reach 1.0 until that process has been running for a full minute.  You don't
>> see instantaneous load.
>
> In fact it asymptotically approaches 1, so it wouldn't reach it until you've
> had a load of 1 for a long time compared to the time you are averaging over.
>
> Starting from idle, a unit load after 1 minute would result in a 1-minute load
> average of (1 - (1/e)) = ~0.63.  See
> https://www.brendangregg.com/blog/2017-08-08/linux-load-averages.html for some
> discussion of that.
>
> That's just how it works, as a measure of demand, not load.

Thanks for that link; that was interesting to read.  OK on that being how it
is; the ramp is even more drawn out over time than I was thinking.  (A small
worked example of the decay math is also at the end of this message.)

[...]

>> Ideally, the shared data should have the most recently calculated 1,5,15
>> minute load averages and a timestamp of when they were calculated.  And then
>> any process that calls getloadavg() should independently decide whether it's
>> time to calculate an updated set of values for machine-wide use.  But can the
>> decay calculations get messed up due to multiple updaters?  I want to say no,
>> but I can't quite convince myself.  Each updater has its own idea of the
>> 1,5,15 timespans, doesn't it, because updates can occur at random, rather
>> than at a set period like a kernel would do?
>
> I think not, because last_time is part of the shared loadavginfo state, which
> is the unix epoch time that the last update was computed, and updating that is
> guarded by a mutex.
>
> That's not to say that this code might not be wrong in some other way :)

Alright, I see the problem with how I was visualizing multiple updaters.  I was
thinking of the "real" load average over time as a superposition (sum, I guess)
of the decaying exponential curves of all the updaters' calculations.  But no,
each updater replaces the current curve with a new one based on its own new
data.  What I was envisioning would be much more complex and require more state
memory.  Oof.

I can submit a patch for the added PdhCollectQueryData() plus short Sleep() if
it would make sense to try it for a while on Cygwin head.  Other suggestions
welcome.
..mark
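
To make the "prime plus short delay" proposal concrete, here is a rough,
standalone sketch of the idea (not the actual Cygwin patch): open a PDH query,
collect once to establish a baseline, wait roughly 100 ms, then collect again
so the first '% Processor Time' read covers a real interval.  The 100 ms figure
is just the value floated above, and error handling is minimal.

/* pdh_prime_demo.c -- on Windows/MinGW/Cygwin, build with something like:
     gcc pdh_prime_demo.c -lpdh                                        */
#include <windows.h>
#include <pdh.h>
#include <stdio.h>

int main (void)
{
  PDH_HQUERY query;
  PDH_HCOUNTER cpu;
  PDH_FMT_COUNTERVALUE val;

  if (PdhOpenQueryW (NULL, 0, &query) != ERROR_SUCCESS)
    return 1;

  /* Same counter the loadavg code samples. */
  if (PdhAddEnglishCounterW (query, L"\\Processor(_Total)\\% Processor Time",
                             0, &cpu) != ERROR_SUCCESS)
    return 1;

  /* "Prime the pump": the first collection only establishes a baseline;
     reading the counter now would fail with PDH_INVALID_DATA. */
  (void) PdhCollectQueryData (query);

  /* The short delay under discussion (100 ms), so the next collection
     averages over a real, if small, interval. */
  Sleep (100);

  (void) PdhCollectQueryData (query);

  if (PdhGetFormattedCounterValue (cpu, PDH_FMT_DOUBLE, NULL, &val)
      == ERROR_SUCCESS)
    printf ("%% Processor Time over ~100 ms: %.1f\n", val.doubleValue);

  PdhCloseQuery (query);
  return 0;
}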
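
And for reference, a small standalone illustration of the exponential averaging
Jon describes (not the actual Cygwin implementation): starting from idle with a
constant demand of 1, the 1-minute average after 60 seconds comes out to about
1 - 1/e = ~0.63, with the 5- and 15-minute averages lagging much further
behind.  The 5-second update interval is just an assumption for the example;
the real updater runs whenever something asks for the load average.

/* loadavg_decay_demo.c -- no Windows APIs needed, just the averaging math. */
#include <math.h>
#include <stdio.h>

int main (void)
{
  const double interval = 5.0;                        /* assumed update step */
  const double periods[3] = { 60.0, 300.0, 900.0 };   /* 1, 5, 15 minutes    */
  double avg[3] = { 0.0, 0.0, 0.0 };                  /* start from idle     */
  const double demand = 1.0;                          /* one CPU-bound hog   */

  for (int step = 1; step <= 12; step++)              /* 12 * 5 s = 60 s     */
    for (int i = 0; i < 3; i++)
      {
        double decay = exp (-interval / periods[i]);
        avg[i] = avg[i] * decay + demand * (1.0 - decay);
      }

  /* Prints roughly: 1min 0.63  5min 0.18  15min 0.06
     i.e. the 1-minute average only reaches ~(1 - 1/e) after a full minute. */
  printf ("1min %.2f  5min %.2f  15min %.2f\n", avg[0], avg[1], avg[2]);
  return 0;
}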