Message-ID: <9614674f-0024-830f-c3f0-4e31e5f92ff2@linaro.org>
Date: Thu, 9 Feb 2023 09:25:42 -0300
Subject: Re: [RFC PATCH 12/19] riscv: Add accelerated memcpy/memmove routines for RV64
From: Adhemerval Zanella Netto
Organization: Linaro
To: Wilco Dijkstra
Cc: 'GNU C Library'

On 09/02/23 08:43, Wilco Dijkstra wrote:
> Hi Adhemerval,
>
>> The generic routines still assume that hardware can't issue unaligned
>> memory accesses, or can only do so at prohibitive cost.  However, I think
>> we are moving in the direction of adding unaligned variants where it
>> makes sense.
>
> There is a _STRING_ARCH_unaligned define that can be set per target.  It
> needs cleaning up since it's used mostly for premature micro-optimizations
> (e.g. getenv.c) where using a fixed-size memcpy would be best (it also
> appears to have big-endian bugs).

I will add cleaning this up to my backlog.  And it is not ideal, at least
for the RISC-V plans, to have a single global flag that signals fast
unaligned access; maybe we should move to a per-file flag so files can be
recompiled to provide ifunc variants.

>> Another usual tuning is loop unrolling, which depends on the underlying
>> hardware.  Unfortunately we need to explicitly force gcc to unroll some
>> loop constructions (for instance, see
>> sysdeps/powerpc/powerpc64/power4/Makefile), so this might be another
>> approach you could use to tune the RISC-V routines.
>
> Compiler unrolling is unlikely to give improved results, especially on GCC
> where the default unroll factor is still 16, which just bloats the code...
> So all reasonable unrolling is best done by hand (and doesn't need to be
> target specific).

The Makefile snippet I posted uses max-variable-expansions-in-unroller and
max-unroll-times to limit the amount of unrolling.  This will most likely
need to be done per architecture and even per cpu (for ifunc variants).
But manual unrolling could be an option as well.
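For concreteness, here is a minimal sketch of the kind of hand-unrolled
inner loop being discussed; op_t, copy_words, and the 4x factor are
illustrative assumptions on my part, not code from any posted patch:

  #include <stddef.h>
  #include <stdint.h>

  typedef uintptr_t op_t;	/* One machine word.  */

  /* Copy NWORDS aligned words with an explicit 4x unroll, so the unroll
     factor is fixed in the source instead of depending on GCC's
     --param max-unroll-times default.  */
  static void
  copy_words (op_t *dst, const op_t *src, size_t nwords)
  {
    size_t i = 0;
    for (; i + 4 <= nwords; i += 4)
      {
	dst[i + 0] = src[i + 0];
	dst[i + 1] = src[i + 1];
	dst[i + 2] = src[i + 2];
	dst[i + 3] = src[i + 3];
      }
    /* Remaining tail, at most 3 words.  */
    for (; i < nwords; i++)
      dst[i] = src[i];
  }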
>> The memcpy, memmove, memset, and memcmp routines are a slightly
>> different subject.  Although the current generic mem routines do use
>> some explicit unrolling, they do not take unaligned access, vector
>> instructions, or special instructions (such as cache-clearing ones)
>> into consideration.  And these usually make a lot of difference.
>
> Indeed.  However it is also quite difficult to make use of all these
> without a lot of target-specific code and inline assembler.  And at that
> point you might as well use assembler...
>
>> What I would expect is that we can use a strategy similar to what Google
>> is doing with llvm libc, which bases its work on the automemcpy paper
>> [1].  It means that for unaligned access, each architecture reimplements
>> only the memory routine building blocks.  Although that project focuses
>> on static compilation, I think using hooks over assembly routines might
>> be a better approach (you can reuse code blocks or try different
>> strategies more easily).
>>
>> [1] https://storage.googleapis.com/pub-tools-public-publication-data/pdf/4f7c3da72d557ed418828823a8e59942859d677f.pdf
>
> I'm still not convinced about this strategy - it's hard to beat assembler
> using generic code.  The way it works in LLVM is that you implement a new
> set of builtins that inline an optimal memcpy for a fixed size.  But you
> don't know the alignment, so this only works on targets that support fast
> unaligned access.  And with different compiler versions/options you get
> major performance variations due to code reordering, register allocation
> differences, or failure to emit load/store pairs...
>
> I believe it is reasonable to ensure the generic string functions are
> efficient, to avoid having to write assembler for every string function.
> However it becomes crazy when you set the goal to be as close as possible
> to the best assembler version in all cases.  Most targets will add
> assembly versions for key functions like memcpy, strlen, etc.

The LLVM libc does use a lot of arch-specific code and the resulting
implementation is not really generic; but at least it showed that it is
possible to provide competitive mem routines without coding them in
assembly.  But afaiu their goals are indeed different, since they focus on
static linking and LTO, where a mem implementation using C and compiler
builtins provides more optimization opportunities.

What I would like is for the generic glibc mem routines to be at least good
enough that an arch maintainer can tune small parts without extra
boilerplate.  They most likely won't beat hand-tuned implementations, even
when using builtins and instructions that are only available with newer
compilers; but I think we can still improve our internal framework to avoid
relying on assembly implementations so much.

We can start by providing unaligned variants of memcpy, memmove, memcmp,
and memset using plain word accesses, and then move to the strategy from
the Google paper of decomposing them into fixed-size blocks tied to
compiler builtins.  That would allow an architecture to build with -mcpu or
other flags to emit vector instructions (by limiting the block size along
with the builtin, similar to what we did in strcspn.c to avoid the memset
call).
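To make that last point concrete, here is a minimal sketch of the block
decomposition; the 32-byte block size and the function names are
hypothetical, the point being that __builtin_memcpy on a compile-time size
lets the compiler expand the copy inline, with vector instructions when
-mcpu or similar flags permit:

  #include <stddef.h>

  /* Fixed-size block copy: a compile-time size lets GCC/Clang expand
     the builtin inline instead of emitting a call to memcpy.  */
  static inline void
  copy_block_32 (char *dst, const char *src)
  {
    __builtin_memcpy (dst, src, 32);
  }

  /* Hypothetical unaligned forward copy decomposed into fixed-size
     blocks, in the spirit of the automemcpy paper.  */
  static void
  memcpy_unaligned (char *dst, const char *src, size_t n)
  {
    while (n >= 32)
      {
	copy_block_32 (dst, src);
	dst += 32;
	src += 32;
	n -= 32;
      }
    /* Byte tail; a tuned version would use an overlapping block.  */
    for (; n > 0; n--)
      *dst++ = *src++;
  }

A real implementation would also need a backward variant for memmove and a
per-architecture choice of block size, but a shape like the above would let
an arch maintainer retune only copy_block_32 instead of rewriting the whole
routine.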