Message-ID: <9614674f-0024-830f-c3f0-4e31e5f92ff2@linaro.org>
Date: Thu, 9 Feb 2023 09:25:42 -0300
Subject: Re: [RFC PATCH 12/19] riscv: Add accelerated memcpy/memmove routines for RV64
From: Adhemerval Zanella Netto
Organization: Linaro
To: Wilco Dijkstra
Cc: 'GNU C Library'

On 09/02/23 08:43, Wilco Dijkstra wrote:
> Hi Adhemerval,
>
>> The generic routines still assume that hardware can't issue unaligned
>> memory accesses, or can only do so at prohibitive cost.  However, I think
>> we are moving in the direction of adding unaligned variants where it
>> makes sense.
>
> There is a _STRING_ARCH_unaligned define that can be set per target.  It
> needs cleaning up since it's used mostly for premature micro-optimizations
> (e.g. getenv.c) where using a fixed-size memcpy would be best (it also
> appears to have big-endian bugs).

I will add cleaning this up to my backlog.  And it is not ideal, at least
for the RISC-V plans, to have a single global flag that signals fast
unaligned access; maybe we should move to a per-file flag so files can be
recompiled to provide ifunc variants.

>> Another usual tuning is loop unrolling, which depends on the underlying
>> hardware.  Unfortunately we need to explicitly force gcc to unroll some
>> loop constructions (for instance, see
>> sysdeps/powerpc/powerpc64/power4/Makefile), so this might be another
>> approach you could use to tune the RISC-V routines.
>
> Compiler unrolling is unlikely to give improved results, especially on GCC
> where the default unroll factor is still 16, which just bloats the code...
> So all reasonable unrolling is best done by hand (and doesn't need to be
> target specific).

The Makefile snippet I posted uses max-variable-expansions-in-unroller and
max-unroll-times to limit the amount of unrolling.  This will most likely
need to be done per architecture and even per cpu (for ifunc variants).
But manual unrolling could be an option as well.
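For concreteness, here is a minimal sketch of the kind of hand-unrolled
inner loop being discussed; op_t, copy_words, and the 4x factor are
illustrative assumptions on my part, not code from any posted patch:

  #include <stddef.h>
  #include <stdint.h>

  typedef uintptr_t op_t;	/* One machine word.  */

  /* Copy NWORDS aligned words with an explicit 4x unroll, so the unroll
     factor is fixed in the source instead of depending on GCC's
     --param max-unroll-times default.  */
  static void
  copy_words (op_t *dst, const op_t *src, size_t nwords)
  {
    size_t i = 0;
    for (; i + 4 <= nwords; i += 4)
      {
	dst[i + 0] = src[i + 0];
	dst[i + 1] = src[i + 1];
	dst[i + 2] = src[i + 2];
	dst[i + 3] = src[i + 3];
      }
    /* Remaining tail, at most 3 words.  */
    for (; i < nwords; i++)
      dst[i] = src[i];
  }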
>> The memcpy, memmove, memset, and memcmp routines are a slightly
>> different subject.  Although the current generic mem routines do use
>> some explicit unrolling, they do not take unaligned access, vector
>> instructions, or special instructions (such as cache-clearing ones)
>> into consideration.  And these usually make a lot of difference.
>
> Indeed.  However it is also quite difficult to make use of all these
> without a lot of target-specific code and inline assembler.  And at that
> point you might as well use assembler...
>
>> What I would expect is that we can use a strategy similar to what Google
>> is doing with llvm libc, which bases its work on the automemcpy paper
>> [1].  It means that for unaligned access, each architecture reimplements
>> only the memory routine building blocks.  Although that project focuses
>> on static compilation, I think using hooks over assembly routines might
>> be a better approach (you can reuse code blocks or try different
>> strategies more easily).
>>
>> [1] https://storage.googleapis.com/pub-tools-public-publication-data/pdf/4f7c3da72d557ed418828823a8e59942859d677f.pdf
>
> I'm still not convinced about this strategy - it's hard to beat assembler
> using generic code.  The way it works in LLVM is that you implement a new
> set of builtins that inline an optimal memcpy for a fixed size.  But you
> don't know the alignment, so this only works on targets that support fast
> unaligned access.  And with different compiler versions/options you get
> major performance variations due to code reordering, register allocation
> differences, or failure to emit load/store pairs...
>
> I believe it is reasonable to ensure the generic string functions are
> efficient, to avoid having to write assembler for every string function.
> However it becomes crazy when you set the goal to be as close as possible
> to the best assembler version in all cases.  Most targets will add
> assembly versions for key functions like memcpy, strlen, etc.

The LLVM libc does use a lot of arch-specific code and the resulting
implementation is not really generic; but at least it showed that it is
possible to provide competitive mem routines without coding them in
assembly.  But afaiu their goals are indeed different, since they focus on
static linking and LTO, where a mem implementation using C and compiler
builtins provides more optimization opportunities.

What I would like is for the generic glibc mem routines to be at least good
enough that an arch maintainer can tune small parts without extra
boilerplate.  They most likely won't beat hand-tuned implementations, even
when using builtins and instructions that are only available with newer
compilers; but I think we can still improve our internal framework to avoid
relying on assembly implementations so much.

We can start by providing unaligned variants of memcpy, memmove, memcmp,
and memset using plain word accesses, and then move to the strategy from
the Google paper of decomposing them into fixed-size blocks tied to
compiler builtins.  That would allow an architecture to build with -mcpu or
other flags to emit vector instructions (by limiting the block size along
with the builtin, similar to what we did in strcspn.c to avoid the memset
call).
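To make that last point concrete, here is a minimal sketch of the block
decomposition; the 32-byte block size and the function names are
hypothetical, the point being that __builtin_memcpy on a compile-time size
lets the compiler expand the copy inline, with vector instructions when
-mcpu or similar flags permit:

  #include <stddef.h>

  /* Fixed-size block copy: a compile-time size lets GCC/Clang expand
     the builtin inline instead of emitting a call to memcpy.  */
  static inline void
  copy_block_32 (char *dst, const char *src)
  {
    __builtin_memcpy (dst, src, 32);
  }

  /* Hypothetical unaligned forward copy decomposed into fixed-size
     blocks, in the spirit of the automemcpy paper.  */
  static void
  memcpy_unaligned (char *dst, const char *src, size_t n)
  {
    while (n >= 32)
      {
	copy_block_32 (dst, src);
	dst += 32;
	src += 32;
	n -= 32;
      }
    /* Byte tail; a tuned version would use an overlapping block.  */
    for (; n > 0; n--)
      *dst++ = *src++;
  }

A real implementation would also need a backward variant for memmove and a
per-architecture choice of block size, but a shape like the above would let
an arch maintainer retune only copy_block_32 instead of rewriting the whole
routine.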