From: Sergei Lewis
Date: Thu, 2 Feb 2023 15:35:15 +0000
Subject: Re: [PATCH 2/2] riscv: vectorised mem* and str* functions
To: Adhemerval Zanella Netto
Cc: libc-alpha@sourceware.org

In general, I suggest caution with tradeoffs between function calls and code
reuse: on modern superscalar architectures, the cost of a mispredicted branch
can be huge in terms of the number of operations that could otherwise have
been retired. Although the function invocation and return are themselves
completely predictable, each of these functions contains a loop whose exit
condition is data driven, and therefore essentially random, all but
guaranteeing a mispredict per call in practice. For tiny pieces of code like
these, operating on small inputs, folding the functionality into a single
loop is a noticeable win over two calls each with its own loop.
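As a rough illustration of what I mean (hypothetical code, nothing from the
patch): copying a string via strlen followed by memcpy runs two separate
loops and walks the source twice, while a fused loop has a single exit
branch and a single pass.

    #include <stddef.h>
    #include <string.h>

    /* Two calls: two separate loops and two passes over src.  */
    size_t copy_two_calls (char *dst, const char *src)
    {
      size_t len = strlen (src);
      memcpy (dst, src, len + 1);
      return len;
    }

    /* One fused loop: a single data-driven exit branch, src walked once.  */
    size_t copy_one_loop (char *dst, const char *src)
    {
      size_t i = 0;
      while ((dst[i] = src[i]) != '\0')
        i++;
      return i;
    }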
On Thu, Feb 2, 2023 at 3:20 PM Sergei Lewis wrote:

>> I think it would be better to provide vectorized mem* and
>> str* that work independently of the compiler option used.
>
> A generic vectorized mem*/str* with no scalar fallback has no issues with
> alignment and is actually smaller and simpler code as well. The awkwardness
> here is performance of very small operations, which are a significant
> portion of the calls to these functions in practice in the wild: for
> operations much smaller than the vector length, a scalar implementation is
> faster - but this is only true if it either makes no unaligned accesses, or
> unaligned accesses are permitted and reasonably performant on the target,
> which (as others have mentioned here) may not be the case on RISCV; and
> there is a limit to how much we can check each invocation without paying
> more for the checks than we save. RISCV vector length, though, may be quite
> large, so basing the tradeoff on that, with fallback to a scalar loop that
> just processes a byte per iteration, may also be prohibitive.
>
> Using ifuncs would, of course, provide a way to address this once the
> required support / plumbing is in place. I'll look at shelving the
> microoptimisations until then and sticking to more generic code here.
> Using the newly visible OP_T_THRES from your patchset may be the way
> forward in the interim.
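(As an aside, purely to illustrate the kind of cutoff I have in mind - a
hypothetical sketch, with a stand-in threshold value rather than the real
OP_T_THRES, and plain memcpy standing in for the vectorised bulk path:)

    #include <stddef.h>
    #include <string.h>

    #ifndef OP_T_THRES
    # define OP_T_THRES 16   /* stand-in value for this sketch only */
    #endif

    /* Tiny copies take a byte loop with no alignment checks or vector
       setup; everything else goes to the bulk path.  */
    void *copy_with_small_cutoff (void *dst, const void *src, size_t n)
    {
      if (n < OP_T_THRES)
        {
          unsigned char *d = dst;
          const unsigned char *s = src;
          while (n-- > 0)
            *d++ = *s++;
          return dst;
        }
      return memcpy (dst, src, n);   /* placeholder for the vectorised path */
    }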
>> Another option, which I will add on my default string refactor, is to use
>> strlen plus memrchr:
>>
>> char *strrchr (const char *s, int c)
>> {
>>   return __memrchr (s, c, strlen(s) + 1);
>> }
>>
>> It would be only 2 function calls and if the architecture provides
>> optimized strlen and memrchr, the performance overhead should be only the
>> additional function call (with the advantage of less icache pressure).
>
> This approach still means you're reading the entire s buffer twice in the
> worst case: once to find the end, then again to scan it for c. It's not
> the end of the world - often, s will be small and c present, and s will be
> in cache for the second read, so the tradeoff is arguably less clear than
> what's in the generic code now. I'll see if I can gather some more info on
> usage in the wild.
>
>> I recall that I tested using a 256-bit bitfield instead of a 256-byte
>> table, but it incurred some overhead on most architectures (I might check
>> again). One option might be to parametrize both the table generation and
>> the table search,
>
> I'm using performance measurements here, of course; this sort of tradeoff
> might be another one for ifuncs or even selected at runtime. What I
> suggest might be slightly unfortunate here, though, is the creation of an
> internal glibc API that forces one particular design choice on all
> platforms in all situations.
>
> Note that the cost of the table generation step is almost as important as
> the cost of the scan - e.g. strtok() is implemented using these and
> generally called in a tight loop by code in the wild; use of that
> directly, or similar patterns during parsing/tokenization, means these
> functions are typically invoked many times, frequently, and the standard
> library provides no way for the table to be reused between calls even
> though the accept/reject chars remain the same. So it is likely that
> people optimising glibc for different platforms will still want to provide
> optimized paths for the table generation even if it is factored out into a
> separate module.
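To make the table generation point concrete, here is a hypothetical sketch
of the 256-bit bitfield variant of strcspn (not Adhemerval's actual code,
and untuned). The set-building loop below runs on every call, even when the
reject set is identical to the previous call - which is exactly the strtok
pattern - and whether the bit test beats a plain byte-indexed table load is
the kind of thing that varies per platform.

    #include <stddef.h>
    #include <stdint.h>

    size_t bitset_strcspn (const char *s, const char *reject)
    {
      uint32_t set[8] = { 0 };                    /* 256-bit membership set */
      const unsigned char *r = (const unsigned char *) reject;
      const unsigned char *p = (const unsigned char *) s;
      size_t n = 0;

      /* Set generation: cost proportional to the length of reject,
         paid on every call.  */
      for (; *r != '\0'; ++r)
        set[*r >> 5] |= (uint32_t) 1 << (*r & 31);

      /* Scan: one bit test per byte of s.  */
      for (; *p != '\0' && !(set[*p >> 5] & ((uint32_t) 1 << (*p & 31))); ++p)
        ++n;
      return n;
    }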