Subject: Re: load average calculation imperfections
To: cygwin-developers@cygwin.com
References: <3a3edd10-2617-0919-4eb0-7ca965b48963@maxrnd.com>
 <223aa826-7bf9-281a-aed8-e16349de5b96@dronecode.org.uk>
 <53664601-5858-ffd5-f854-a5c10fc25613@maxrnd.com>
 <670cea06-e202-3c90-e567-b78d737f5156@dronecode.org.uk>
 <2c7d326d-3de0-9787-897f-54c62bf3bbcc@maxrnd.com>
From: Mark Geisert <mark@maxrnd.com>
Message-ID: <5dbeb18a-92ef-4b6a-64eb-8fe1f60887fc@maxrnd.com>
Date: Sun, 15 May 2022 22:25:47 -0700

Corinna Vinschen wrote:
> On May 13 13:04, Corinna Vinschen wrote:
>> On May 13 11:34, Jon Turney wrote:
>>> On 12/05/2022 10:48, Corinna Vinschen wrote:
>>>> On May 11 16:40, Mark Geisert wrote:
>>>>>
>>>>> The first counter read now gets error 0xC0000BC6 == PDH_INVALID_DATA,
>>>>> but no errors on subsequent counter reads.  This sounds like it now
>>>>> matches what Corinna reported for W11.  I wonder if she's running
>>>>> build 1706 already.
>>>>
>>>> Erm... looks like I didn't read your mail thoroughly enough.
>>>>
>>>> This behaviour, the first call returning with PDH_INVALID_DATA and only
>>>> subsequent calls returning valid(?) values, is what breaks the
>>>> getloadavg function and, consequentially, /proc/loadavg.  So maybe
>>>> xload now works, but Cygwin is still broken.
>>>
>>> The first attempt to read '% Processor Time' is expected to fail with
>>> PDH_INVALID_DATA, since it doesn't have a value at a particular
>>> instant, but one averaged over a period of time.
>>>
>>> This is what the following comment is meant to record:
>>>
>>> "Note that PDH will only return data for '% Processor Time' after the
>>> second call to PdhCollectQueryData(), as it's computed over an
>>> interval, so the first attempt to estimate load will fail and 0.0 will
>>> be returned."
>>
>> But.
>>
>> Every invocation of getloadavg() returns 0.  Even under load.  Calling
>> `cat /proc/loadavg' is an exercise in futility.
>>
>> The only way to make getloadavg() work is to call it in a loop from the
>> same process with a 1 sec pause between invocations.  In that case, even
>> a parallel `cat /proc/loadavg' shows the same load values.
>>
>> However, as soon as I stop the looping process, the /proc/loadavg values
>> are frozen in the last state they had when stopping that process.
>
> Oh, and, stopping and restarting all Cygwin processes in the session will
> reset the loadavg to 0.
>
>> Any suggestions how to fix this?

I'm getting somewhat better behavior from repeated 'cat /proc/loadavg' with
the following update to Cygwin's loadavg.cc:

diff --git a/winsup/cygwin/loadavg.cc b/winsup/cygwin/loadavg.cc
index 127591a2e..cceb3e9fe 100644
--- a/winsup/cygwin/loadavg.cc
+++ b/winsup/cygwin/loadavg.cc
@@ -87,6 +87,9 @@ static bool load_init (void)
     }
 
     initialized = true;
+
+    /* prime the data pump, hopefully */
+    (void) PdhCollectQueryData (query);
   }
 
   return initialized;

It's only somewhat better, because it seems like multiple updaters of the
load average act somewhat independently of one another.  It's hard to
characterize what I'm seeing, but let me try.

First let me shove xload aside by saying it shows instantaneous load and is
thus a different animal.  It only cares about total %processor time, so its
load average value never goes higher than ncpus, nor does it have any decay
behavior built in.

Any other Cygwin app I know of uses getloadavg() under the hood.  When that
calculates a new set of 1, 5, and 15 minute load averages, it uses total
%processor time and total processor queue length.  It has a decay behavior
that I think has been around since early Unix.  What I hadn't noticed
before is an "inverse" decay behavior that seems wrong to me, but maybe
Linux has this too.  That is, if you have just one compute-bound process,
the load average won't reach 1.0 until that process has been running for a
full minute.  You don't see instantaneous load.  I guess that's all
reasonable so far.

But I think the wrinkle Cygwin is adding, allowing the load average to be
calculated by multiple updaters, makes it seem like updaters are not
keeping in sync with each other despite the loadavginfo shared data.  I
can't quite wrap my head around the current implementation to prove or
disprove its correctness.

Ideally, the shared data should hold the most recently calculated 1, 5, and
15 minute load averages and a timestamp of when they were calculated.  Any
process that calls getloadavg() should then independently decide whether
it's time to calculate an updated set of values for machine-wide use.

But can the decay calculations get messed up by multiple updaters?  I want
to say no, but I can't quite convince myself.  Each updater has its own
idea of the 1, 5, and 15 minute timespans, doesn't it, because updates can
occur at random rather than at a set period like a kernel would do?

..mark
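
P.S. A few rough sketches below to make the above concrete.  First, a
little standalone program (not loadavg.cc itself) that I use as a mental
model for the PDH behavior.  The counter path, the 1-second sleep, and the
minimal error handling are just whatever was convenient, but it shows the
first formatted read failing with 0xC0000BC6 (PDH_INVALID_DATA) and the
second read succeeding once there's an interval to average over.  Build
with something like gcc -o pdhtest pdhtest.c -lpdh.

/* Standalone PDH sketch: the first PdhCollectQueryData() only establishes
   a baseline sample, so a rate counter like '% Processor Time' has nothing
   to report until a second collection some time later.  */
#include <windows.h>
#include <pdh.h>
#include <pdhmsg.h>
#include <stdio.h>

int
main (void)
{
  PDH_HQUERY query;
  PDH_HCOUNTER cpu;
  PDH_FMT_COUNTERVALUE value;
  PDH_STATUS status;

  if (PdhOpenQueryW (NULL, 0, &query) != ERROR_SUCCESS)
    return 1;
  if (PdhAddEnglishCounterW (query, L"\\Processor(_Total)\\% Processor Time",
                             0, &cpu) != ERROR_SUCCESS)
    return 1;

  /* First collection: baseline only; formatting the counter now yields
     PDH_INVALID_DATA (0xC0000BC6).  */
  PdhCollectQueryData (query);
  status = PdhGetFormattedCounterValue (cpu, PDH_FMT_DOUBLE, NULL, &value);
  printf ("first read:  status %#lx\n", (unsigned long) status);

  Sleep (1000);

  /* Second collection: now there is an interval to average over.  */
  PdhCollectQueryData (query);
  status = PdhGetFormattedCounterValue (cpu, PDH_FMT_DOUBLE, NULL, &value);
  printf ("second read: status %#lx, %%cpu %.1f\n", (unsigned long) status,
          status == ERROR_SUCCESS ? value.doubleValue : 0.0);

  PdhCloseQuery (query);
  return 0;
}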
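
Second, the decay behavior.  Below is a toy model of the exponential
smoothing I believe is in play, with 60/300/900-second time constants.  The
fixed 5-second sample interval is purely an assumption for the
illustration, not something taken from loadavg.cc.  Starting from an idle
machine with one compute-bound process, the 1-minute figure is only about
1 - 1/e (roughly 0.63) after the first minute and approaches 1.0
asymptotically, which matches the sluggish ramp-up I described.

/* Toy model of exponentially smoothed load averages; an illustration of
   the usual decay formula, not Cygwin's actual calculation.  */
#include <math.h>
#include <stdio.h>

int
main (void)
{
  const double tau[3] = { 60.0, 300.0, 900.0 };  /* 1, 5, 15 min */
  const double interval = 5.0;                   /* assumed sample period */
  const double active = 1.0;   /* one compute-bound process, otherwise idle */
  double avg[3] = { 0.0, 0.0, 0.0 };

  for (int t = (int) interval; t <= 120; t += (int) interval)
    {
      for (int i = 0; i < 3; i++)
        {
          double decay = exp (-interval / tau[i]);
          avg[i] = avg[i] * decay + active * (1.0 - decay);
        }
      if (t % 30 == 0)
        printf ("t=%3ds  load: %.2f %.2f %.2f\n", t, avg[0], avg[1], avg[2]);
    }
  return 0;
}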
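
Finally, the rough shape of the shared-data idea from a few paragraphs up.
The struct layout and names are invented (this is not the existing
loadavginfo), the caller is assumed to already hold whatever lock protects
the shared region, and busy_now stands in for whatever figure we derive
from the PDH counters.

#include <math.h>

/* Invented layout: most recent machine-wide values plus when they were
   computed.  */
struct loadavg_shared
{
  double avg[3];            /* last computed 1/5/15 minute load averages */
  long long last_time_ms;   /* machine-wide timestamp of that computation */
};

/* Called from any process's getloadavg().  Only a caller that finds the
   shared values stale recomputes them; everyone else just copies.  */
static void
get_load (struct loadavg_shared *shared, double out[3],
          long long now_ms, double busy_now)
{
  const long long min_interval_ms = 5000;        /* assumed update cadence */
  static const double tau[3] = { 60.0, 300.0, 900.0 };
  long long elapsed_ms = now_ms - shared->last_time_ms;

  if (elapsed_ms >= min_interval_ms)
    {
      /* Decay by the actual elapsed time recorded in the shared data, so
         it doesn't matter which process gets here or how irregular the
         call pattern is.  */
      for (int i = 0; i < 3; i++)
        {
          double decay = exp (-(double) elapsed_ms / 1000.0 / tau[i]);
          shared->avg[i] = shared->avg[i] * decay + busy_now * (1.0 - decay);
        }
      shared->last_time_ms = now_ms;
    }

  for (int i = 0; i < 3; i++)
    out[i] = shared->avg[i];
}

If that reasoning holds, the answer to my own question is "no, multiple
updaters shouldn't be able to skew the decay", because each update is
anchored to the shared timestamp rather than to any one process's private
schedule.  But I'd appreciate another pair of eyes on that.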