> > I think it would be better to provide vectorized mem* and str* that
> > work independently of the compiler option used.

A generic vectorized mem*/str* with no scalar fallback has no issues
with alignment, and is actually smaller and simpler code as well.

The awkwardness here is the performance of very small operations, which
are a significant portion of the calls to these functions in practice
in the wild: for operations much smaller than the vector length, a
scalar implementation is faster - but this is only true if it either
makes no unaligned accesses, or unaligned accesses are permitted and
reasonably performant on the target, which (as others have mentioned
here) may not be the case on RISCV; and there is a limit to how much we
can check on each invocation without paying more for the checks than we
save. The RISCV vector length, though, may be quite large, so basing
the tradeoff on that, with fallback to a scalar loop that just
processes a byte per iteration, may also be prohibitive.

Using ifuncs would, of course, provide a way to address this once the
required support / plumbing is in place. I'll look at shelving the
microoptimisations until then and sticking to more generic code here.
Using the newly visible OP_T_THRES from your patchset may be the way
forward in the interim.

> Another option, which I will add in my default string refactor, is to
> use strlen plus memrchr:
>
> char *strrchr (const char *s, int c)
> {
>   return __memrchr (s, c, strlen(s) + 1);
> }
>
> It would be only 2 function calls and, if the architecture provides
> optimized strlen and memrchr, the performance overhead should be only
> the additional function call (with the advantage of less icache
> pressure).

This approach still means you're reading the entire s buffer twice in
the worst case: once to find the end, then again to scan it for c.
It's not the end of the world - often, s will be small and c present,
and s will be in cache for the second read, so the tradeoff is arguably
less clear-cut than what's in the generic code now. I'll see if I can
gather some more info on usage in the wild.

> I recall that I tested using a 256-bit bitfield instead of a 256-byte
> table, but it incurred some overhead on most architectures (I might
> check again). One option might be to parametrize both the table
> generation and the table search.

I'm using performance measurements here, of course; this sort of
tradeoff might be another one for ifuncs, or even selected at runtime.
What I suggest might be slightly unfortunate here, though, is the
creation of an internal glibc API that forces one particular design
choice on all platforms in all situations.

Note that the cost of the table generation step is almost as important
as the cost of the scan - e.g. strtok() is implemented using these and
is generally called in a tight loop by code in the wild; use of that
directly, or of similar patterns during parsing/tokenization, means
these functions are typically invoked many times, and the standard
library provides no way for the table to be reused between calls even
though the accept/reject chars remain the same. So it is likely that
people optimising glibc for different platforms will still want to
provide optimized paths for the table generation even if it is factored
out into a separate module.