From: Siddhesh Poyarekar
To: Wilco Dijkstra, Andrew Pinski
Cc: Szabolcs Nagy, "Ellcey, Steve", libc-alpha, nd
Subject: Re: Ping: [Patch] aarch64: Thunderx specific memcpy and memmove
Date: Thu, 25 May 2017 19:26:00 -0000
Message-ID: <135198a3-ad77-5117-9c13-b4456268e74a@gotplt.org>
References: <1493663254.29498.11.camel@cavium.com>
 <5909E2C5.7090603@arm.com>
 <1494366305.9224.26.camel@cavium.com>
 <74006e0a-fb4a-dc36-bc29-77303cef3cfb@gotplt.org>
 <5925BD04.7000902@arm.com>
 <0950612b-cff4-2256-6f81-3bacf30ce7e9@gotplt.org>

On Thursday 25 May 2017 11:19 PM, Wilco Dijkstra wrote:
> Given the number of micro architectures already existing, it would be a really
> bad situation to end up with one memcpy per micro architecture...

It's not just per micro-architecture...

> Micro architectures will tend to converge rather than diverge as performance
> level increases. So I believe it's generally best to use the same instructions for
> memcpy as for compiled code as that is what CPUs will actually encounter
> and optimize for. For the rare, very large copies we could do something different
> if it helps (eg. prefetch, non-temporals, SIMD registers etc).

... because as you say, micro-architectures may well converge over time
to some extent, but you will still end up having multiple memcpy
implementations taking advantage of different features in the aarch64
architecture over time. For example, SVE routines vs non-SVE routines.
You'll need both and, looking at how x86 has evolved, there will be much
more to come.

> An ifunc has a measurable overhead unfortunately, and that would no longer
> be trivially avoidable via static linking. Most calls to memcpy tend to be very
> small copies. Maybe we should investigate statically linking the small copy part
> of memcpy with say -O3?

Sure, that might be something to look at as a data point, but again,
getting rid of multiarch is not an option for desktop/server
implementations, especially if micro-architecture specific routines give
measurable gains over generic implementations in the general case, i.e.
dynamically linked programs that need to run out of the box and
optimally on multiple types of hardware. Static binaries unfortunately
become the edge case here.

Siddhesh