Date: Wed, 8 Feb 2023 16:48:56 -0300
Subject: Re: [RFC PATCH 18/19] riscv: Add an optimized strncmp routine
From: Adhemerval Zanella Netto
Organization: Linaro
To: Palmer Dabbelt, philipp.tomsich@vrull.eu
Cc: goldstein.w.n@gmail.com, christoph.muellner@vrull.eu, libc-alpha@sourceware.org, Darius Rad, Andrew Waterman, DJ Delorie, Vineet Gupta, kito.cheng@sifive.com, jeffreyalaw@gmail.com, heiko.stuebner@vrull.eu

On 08/02/23 14:55, Palmer Dabbelt wrote:
> On Wed, 08 Feb 2023 07:13:44 PST (-0800), philipp.tomsich@vrull.eu wrote:
>> On Tue, 7 Feb 2023 at 02:20, Noah Goldstein wrote:
>>>
>>> On Mon, Feb 6, 2023 at 6:23 PM Christoph Muellner wrote:
>>> >
>>> > From: Christoph Müllner
>>> >
>>> > The
>>> > implementation of strncmp() can be accelerated using Zbb's orc.b
>>> > instruction.  Let's add an optimized implementation that makes use
>>> > of this instruction.
>>> >
>>> > Signed-off-by: Christoph Müllner
>>>
>>> Not necessary, but imo performance patches should have at least some
>>> reference to the expected speedup versus the existing alternatives.
>>
>> Given that this is effectively a SWAR-like optimization (orc.b allows
>> us to test 8 bytes in parallel for a NUL byte), we should be able to
>> show the benefit through a reduction in dynamic instructions.  Would
>> this be considered reasonable reference data?
>
> Generally for performance improvements the only metrics that count come
> from real hardware.  Processor implementation is complex and it's not
> generally true that reducing dynamic instructions results in better
> performance (particularly when more complex flavors of instructions
> replace simpler ones).

I agree with Noah here that we need to have some baseline performance
numbers, even though we are comparing against naive implementations
(what glibc used to have as its generic implementations).

> We've not been so good about this on the RISC-V side of things, though.
> I think that's largely because we didn't have all that much complexity
> around this, but there's a ton of stuff showing up right now.  The
> general theory has been that Zbb instructions will execute faster than
> their corresponding I sequences, but nobody has proved that.  I believe
> the new JH7110 has Zba and Zbb, so maybe the right answer there is to
> just benchmark things before merging them?  That way we can get back to
> doing things sanely before we go too far down the premature
> optimization rabbit hole.
>
> FWIW: we had a pretty similar discussion in Linux land around these and
> nobody could get the JH7110 to boot, but given that we have ~6 months
> until glibc releases again hopefully that will be sorted out.
> There's a bunch of ongoing work looking at the more core issues like
> probing, so maybe it's best to focus on getting that all sorted out
> first?  It's kind of awkward to have a bunch of routines posted in a
> whole new framework that's not sorting out all the probing
> dependencies.

Just a heads up that with the latest generic string routine
optimizations, all str* routines should now use the new Zbb extensions
(if the compiler is instructed to do so).  I think you might squeeze
some cycles out of a hand-crafted assembly routine, but I would rather
focus on trying to optimize code generation instead.

The generic routines still assume that the hardware cannot issue
unaligned memory accesses, or that doing so is prohibitively expensive.
However, I think we should move in the direction of adding unaligned
variants where it makes sense.  Another usual tuning is loop unrolling,
which depends on the underlying hardware.  Unfortunately we need to
explicitly force gcc to unroll some loop constructs (for instance,
check sysdeps/powerpc/powerpc64/power4/Makefile), so this might be
another approach you can use to tune the RISC-V routines.

The memcpy, memmove, memset, and memcmp routines are a slightly
different subject.  Although the current generic mem routines do use
some explicit unrolling, they do not take into consideration unaligned
access, vector instructions, or special instructions (such as a
cache-line clear one).  And these usually make a lot of difference.
What I would expect is that maybe we can use a strategy similar to what
Google is doing with LLVM libc, which bases its work on the automemcpy
paper [1].  It means that for the unaligned cases, each architecture
reimplements the memory routine building blocks.  Although that project
focuses on static compilation, I think using such hooks instead of
assembly routines might be a better approach (you can reuse code blocks
or try different strategies more easily).

[1] https://storage.googleapis.com/pub-tools-public-publication-data/pdf/4f7c3da72d557ed418828823a8e59942859d677f.pdf