On 12/24/22 00:24, Wilco Dijkstra wrote:
> Hi Alex,
> 
>> For that, we'd first need to discuss what is a typical scenario.
> 
> Like copying/concatenating strings that fit within the buffer. >
>> And also, it depends a lot on what the compiler can optimize.  If I call
>> strlcat(3) in a loop, I know that stpecpy(3) is going to be orders of magnitude
>> faster.
> 
> If you're trying to say that the 'strcat' variant is bad then yes absolutely -

I must admit that it has good things, if you have a compiler that does the magic 
for you.  GCC optimizes strcat(3) into stpcpy(3), so if you know that it will be 
optimized, it's not so bad, and the source code is cleaner to the eye.

> it's better to inline in the compiler or avoid the 'strcat' versions altogether
> (that's also why I would strongly suggest never to add more 'cat' variants).

inlining in the compiler is a good solution.  And yes, I agree on not adding cat 
variants, but at the same time, str functions are problematic in that they can't 
be chained, as opposed to stp functions.  That problem shows itself in the 
snprintf(3) bugs I mentioned.  Users need a way to catenate easily, even if not 
with a 'cat' function.

> But that doesn't say anything about whether stpecpy is better than strlcpy.
> 
>> If I call strlcpy(3) in a loop, doing what an ideal compiler might do, that
>> might be something to benchmark, but we'd also need to discuss what is a good
>> input for the benchmark.
> 
> The typical case would be copying or concatenating smallish strings to a buffer.

Okay, I'll try to prepare a benchmark.

> 
>> In the OpenBSD definition of strlcpy(), I count 4 branches, and one of them is
>> inside a while loop.  So, I'd find it very surprising if strlcpy(3) outperformed
>> stpecpy(3).
> 
> If that really is the OpenBSD implementation then this proves my point that
> non-standard string functions are often totally unoptimized.

And not only that, but I find your version much more readable.  I don't 
understand how the OpenBSD version was written that way, and hasn't been fixed 
so far.

> 
> A basic implementation of strlcpy would use strlen and memcpy so it is fast
> on every system without requiring any optimization:
> 
> size_t
> strlcpy (char *dst, const char *src, size_t size)
> {
>    size_t len = strlen (src);
> 
>    if (size == 0)
>      return len;
>    size = len >= size ? size - 1 : len;

I'd use a separate variable dlen, to differentiate it from size.  Otherwise, it 
looks like an off-by-one bug just below, since writing at a [size] usually means 
writing past the array.

>    dst[size] = 0;
>    memcpy (dst, src, size);
>    return len;
> }

Then, to compare oranges to oranges, I'll provide the equivalently optimized 
stpecpy(3):

char *
stpecpy (char *dst, char *end, const char *restrict src)
{
   size_t dsize;
   size_t dlen;
   size_t slen;

   slen = strlen(src);

   if (dst == end)
     return NULL;
   if (unlikely(dst == NULL))
     return NULL;
   if (dst > end)
     unreachable();
   dsize = end - dst;
   dlen = slen >= dsize ? dsize - 1 : slen;
   dst[dlen] = 0;
   return mempcpy(dst, src, dlen);
}

Now we can really compare them.  (unlikely() is the obvious wrapper over 
__builtin_expect(), and unreachable() is C23's equivalent of 
__builtin_unreachable();  they're just extra optimizations, but can be ignored.)

There are various decissions to take here:

-  We could leave NULL as UB, but I want to handle it for being able to combine 
with stpeprintf().  Although, we could implement stpeprintf() so that it never 
fails (we would need to implement it without the INT_MAX limitation of snprintf(3)).

-  We could call strnlen(3) instead, but strlen(3) is probably faster in the 
average use case, and has the benefit of crashing on invalid input.

The differences with your strlcpy(3) implementation are:

-  NULL check.
-  dsize = end - dst; calculation

Considering that strlcpy(3) chained calls need extra boilerplate at call site 
(remember):

               n = strlcpy(buf, "Hello ", sizeof(buf));
               if (n >= sizeof(buf))
                   goto toolong;
               n += strlcpy(buf + n, "world", sizeof(buf) - n);
               if (n >= sizeof(buf))
                   goto toolong;
               n += strlcpy(buf + n, "!", sizeof(buf) - n);
               if (n >= sizeof(buf))
                   goto toolong;
               puts(buf);

we see that there are in reality more calculations in the case of strlcpy(3); I 
see 2 '+'s and 1 '-' at strlcpy(3) call site, while we only had an extra '-' in 
the stpecpy(3) internals.  The number of conditionals seems to be the same after 
all, except for one single conditional after all the chained stpecpy(3) calls.

So, for equally optimized code, stpecpy(3) seems to win.  It's not hard to 
believe, since they perform the same operation, with the difference that 
strlcpy(3) forces the user to recalculate the buffer size (or really, the 
compiler on behalf of the user, since the intention is that users don't write 
this code, and instead write calls to strlcat(3)).  While it seems very obvious, 
since the source code is so similar that we can see the differences, I'll still 
try to write a benchmark.

And remember that performance is only second to usability, which is the main 
selling point of stpe*() functions.

[...]

>>> In contrast we can be pretty sure that the standard strlen, memcpy etc are both
>>> correct and efficient on all targets/libc's.
>>
>> Sure, but memcpy(3) is not usable in code that needs to truncate.  We need to
>> compare against stpncpy(3) (ughhh) and strlcpy(3).
> 
> The idea is that if we add new string functions, their implementation should use
> other string functions that are known to be well optimized for most targets.

Fair.  Above is my optimized version of stpecpy(3).

> 
> Cheers,
> Wilco

Cheers,

Alex

-- 
<http://www.alejandro-colomar.es/>