From: Florian Weimer <fw@deneb.enyo.de>
To: Wilco Dijkstra via Libc-alpha
Cc: naohirot@fujitsu.com, Wilco Dijkstra, Szabolcs Nagy
Subject: Re: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
Date: Mon, 12 Apr 2021 20:53:54 +0200
Message-ID: <87tuobibb1.fsf@mid.deneb.enyo.de>
In-Reply-To: (Wilco Dijkstra via Libc-alpha's message of "Mon, 12 Apr 2021 12:52:05 +0000")

* Wilco Dijkstra via Libc-alpha:

> 5. Odd prefetches
>
> I have a hard time believing that first prefetching the data to be
> written, then clearing it using DC ZVA (???), then prefetching the
> same data a second time, before finally writing the loaded data,
> helps performance...  Generally, hardware prefetchers are able to do
> exactly the right thing, since memcpy is trivial to prefetch.  So
> what is the performance gain of each prefetch/clear step?  What is
> the difference between memcpy and memmove performance (given that
> memmove doesn't do any of this)?

Another downside is the exposure of latent concurrency bugs:

  G1: Phantom zeros in cardtable

I guess the CPU's heritage is shining through here. 8-)
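
To make the concurrency point concrete, here is a minimal,
self-contained sketch (my own illustration, not code from the A64FX
patch; the copy_with_pre_zero helper and the card variable are made up
for the example).  It shows why clearing the destination before storing
the copied data, which is what a DC ZVA inside memcpy amounts to, can
let a racing reader observe a value that the program never stored:

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

/* Stand-in for a card-table entry that another thread polls.  The old
   value is 0xff, the value being "copied" in is 0xaa.  */
static _Atomic unsigned char card = 0xff;

/* Hypothetical helper mimicking the zero-then-write pattern: the
   destination is cleared first (the analogue of DC ZVA on the
   destination cache line) and the real data arrives only afterwards.  */
static void copy_with_pre_zero(_Atomic unsigned char *dst, unsigned char src)
{
    atomic_store_explicit(dst, 0, memory_order_relaxed);   /* "DC ZVA" */
    atomic_store_explicit(dst, src, memory_order_relaxed); /* copied data */
}

static void *reader(void *arg)
{
    (void) arg;
    for (;;) {
        unsigned char v = atomic_load_explicit(&card, memory_order_relaxed);
        if (v == 0xaa)          /* saw the new value, as expected */
            return NULL;
        if (v != 0xff) {        /* neither old nor new: a phantom zero */
            printf("observed phantom byte 0x%02x\n", v);
            return NULL;
        }
    }
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, reader, NULL);
    copy_with_pre_zero(&card, 0xaa);
    pthread_join(t, NULL);
    return 0;
}

Whether the phantom zero is actually observed depends on timing, but
code like the card-table scanner only expects to see either the old or
the new byte, so an interior zero written by the string routine breaks
that assumption.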