public inbox for libc-alpha@sourceware.org
 help / color / mirror / Atom feed
From: Adhemerval Zanella Netto <adhemerval.zanella@linaro.org>
To: Sergei Lewis <slewis@rivosinc.com>, Andrew Waterman <andrew@sifive.com>
Cc: libc-alpha@sourceware.org
Subject: Re: [PATCH 2/2] riscv: vectorised mem* and str* functions
Date: Fri, 3 Feb 2023 08:35:38 -0300	[thread overview]
Message-ID: <fb264b84-4900-78cb-864b-5a96157ca97f@linaro.org> (raw)
In-Reply-To: <CAE2KcLp=LTsSxpv1w=+LLQC+-VvJ+f5AYTxHBP+3mhN7e48GQw@mail.gmail.com>



On 02/02/23 12:20, Sergei Lewis wrote:
>      I think it would be better to provide vectorized mem* and 
> 
>     str* that work indendently of the compiler option used.
> 
> 
> A generic vectorized mem*/str* with no scalar fallback has no issues with alignment and is actually smaller and simpler code as well. The awkwardness here is performance of very small operations, which are a significant portion of the calls to these functions in practice in the wild: for operations much smaller than the vector length, a scalar implementation is faster - but this is only true if it either makes no unaligned accesses, or unaligned accesses are permitted and reasonably performant on the target, which (as others have mentioned here) may not be the case on RISCV; and there is a limit to how much we can check each invocation without paying more for the checks than we save. RISCV vector length, though, may be quite large, so basing the tradeoff on that with fallback to a scalar loop that just processes a byte per iteration may also be prohibitive.
> 
> Using ifuncs would, of course, provide a way to address this once the required support / plumbing is in place. I'll look at shelving the microoptimisations until then and sticking to more generic code here. Using the newly visible OP_T_THRES from your patchset may be the way forward in the interim.

Yes, this is similar to all other architecture that provides vector/simd 
instructions (aarch64 SVE for instance uses a similar strategy).  That's
not the issue in fact, the issues is having an extra compiler define
flag that affects binary distribution and incurs in extra maintenance
that is not tied to ABIs definition.  It means that we will need to also
build/check for RVV with __riscv_strict_align.

Is there a meaningful performance difference on always using the 
__riscv_strict_align code path to have a implementation that works whatever
the chip has fast unaligned memory access support?

The VLEN arbitrary upper bound and page size limit is also worrisome, as
Andrew has pointed out.  I would prefer to either have a generic implementation
that works without such limits (or assumes something more sane, like aarch64
with unaligned access that expects the minimum support 4k page size), or have
a way to fallback to default implementation if the criteria is not met.


>  
> 
>     Another option, which I will add on my default string refactor, it to use
>     strlen plus memrchr:
> 
>       char *strrchr (const char *s, int c)
>       {
>         return __memrchr (s, c, strlen(s) + 1);
>       }
> 
>     It would only 2 function calls and if the architecture provides optimized
>     strlen and memrchr, the performance overhead should be only the additional
>     functions call (which the advantage of less icache pressure).
> 
> 
> This approach still means you're reading the entire s buffer twice in the worst case: once to find the end, then again to scan it for c. It's not the end of the world - often, s will be small and c present, and s will be in cache for the second read, so the tradeoff is arguably less clear than what's in the generic code now. I'll see if I can gather some more info on usage in the wild.

Yes, ideally we will do the strrchr on only one pass over the string.  But the 
idea here is really try to use composability, specially when the symbol usage
has a very low probability to be a hotspot.

>  
> 
>     I recall that I tested using a 256-bit bitfield instead of 256-byte table, but
>     it incured in some overhead on most architecture (I might check again).
> 
>     One option might to parametrize both the table generation and the table search,
> 
> 
> I'm using performance measurements here, of course; this sort of tradeoff might be another one for ifuncs or even selected at runtime. What I suggest might be slightly unfortunate here, though, is the creation of an internal glibc api that forces one particular design choice on all platforms in all situations. 

I see the other way around, where it makes maintenance and remove complexity. To
give you an example, many memchr implementations uses a strategy to first calculate
the final address using the input address and size.  And some failed to take in
consideration that it might overflow [1].  We had to fixed on multiple 
implementations, where if we have a parametrized implementation we could just fix
on the default one.

Of course where the architecture optimization does not fit on the generic
framework, it is up to arch-maintainers to use a specialized one.  But the generic
implementation I proposing was used over the year by multiple architectures
which generates multiple arch-specific implementations.

[1] https://sourceware.org/bugzilla/show_bug.cgi?id=21182 

>  
> Note that the cost of the table generation step is almost as important as the cost of the scan - e.g. strtok() is implemented using these and generally called in a tight loop by code in the wild; use of that directly or similar patterns during parsing/tokenization means these functions are typically invoked many times, frequently, and the standard library provides no way for the table to be reused the table between the calls even though the accept/reject chars remain the same. So it is likely that people optimising glibc for different platforms will still want to provide optimized paths for the table generation even if it is factored out into a separate module.

I think using a bitfield for the table might be profitable for tight
called loops as you suggest, I will evaluate the usage on generic code.  
But again, my idea is try to parametrize only what the
architecture need to optimize, and even if compiler improves (through
autovectorization or other optimization pass) to avoid the need to such
arch-specific implementations.

  parent reply	other threads:[~2023-02-03 11:35 UTC|newest]

Thread overview: 22+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-02-01  9:52 [PATCH 1/2] riscv: sysdeps support for vectorised functions Sergei Lewis
2023-02-01  9:52 ` [PATCH 2/2] riscv: vectorised mem* and str* functions Sergei Lewis
2023-02-01 15:33   ` Jeff Law
2023-02-01 16:42     ` Florian Weimer
2023-02-01 17:07       ` Jeff Law
2023-02-02  9:34         ` Sergei Lewis
2023-02-06 12:49         ` Sergei Lewis
2023-02-01 17:17     ` Adhemerval Zanella Netto
2023-02-01 17:38   ` Adhemerval Zanella Netto
2023-02-01 18:13     ` Noah Goldstein
2023-02-02 10:02     ` Sergei Lewis
2023-02-02 14:26       ` Adhemerval Zanella Netto
2023-02-02 15:20         ` Sergei Lewis
2023-02-02 15:35           ` Sergei Lewis
2023-02-03 11:35           ` Adhemerval Zanella Netto [this message]
2023-02-03 14:04             ` Sergei Lewis
2023-02-01 18:11   ` Noah Goldstein
2023-02-01 18:13   ` Andrew Waterman
2023-02-01 19:03   ` Andrew Waterman
2023-02-03  0:13     ` Vineet Gupta
2023-02-03  0:51       ` Andrew Waterman
2023-05-03  2:11   ` Yun Hsiang

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=fb264b84-4900-78cb-864b-5a96157ca97f@linaro.org \
    --to=adhemerval.zanella@linaro.org \
    --cc=andrew@sifive.com \
    --cc=libc-alpha@sourceware.org \
    --cc=slewis@rivosinc.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).