Hi Wilco,

On 12/23/22 15:59, Wilco Dijkstra wrote:
> Hi Alex,
> 
>>> Given strlcpy and strlcat are in POSIX next and therefore bar
>>> some extraordinary event will be in glibc, I think we should
>>> probably wait until those two land, then see if there's still
>>> an appetite for stpecpy in glibc.
>>
>> I disagree for the following reasons.
>>
>> strlcpy(3)/strlcat(3) are designed to be _slow_, in exchange for added
>> simplicity and safety.  That's what Theo told me about them.  They didn't care
>> about performance.  The two performance issues are:
> 
> We'd need actual benchmarks to confirm there is a detectable performance
> difference in typical scenarios.

For that, we'd first need to discuss what is a typical scenario.

And also, it depends a lot on what the compiler can optimize.  If I call 
strlcat(3) in a loop, I know that stpecpy(3) is going to be orders of magnitude 
faster.

If I call strlcpy(3) in a loop, doing what an ideal compiler might do, that 
might be something to benchmark, but we'd also need to discuss what is a good 
input for the benchmark.

> So calling strlcpy slow is premature. Looking
> at the proposed stpecpy, it seems it has a lot more branches and special cases
> compared to a typical strlcpy,

In my definition of stpecpy(), I see 3 branches (the other one is just an 
unreachable path to let the compiler optimize better).  And in the NULL check, 
we could use __builtin_expect to let it assume it's unlikely.

And there's one more branch inside memccpy(3), so that's 3 branches plus one 
more branch that can be inside __builtin_expect().  After those branches, it's 
all memcpy(3).

In the OpenBSD definition of strlcpy(), I count 4 branches, and one of them is 
inside a while loop.  So, I'd find it very surprising if strlcpy(3) outperformed 
stpecpy(3).

For completeness, I compared these definitions:

alx@asus5775:~/src/bsd/openbsd/src$ grepc -tfd strlcpy lib/
lib/libc/string/strlcpy.c:27:
size_t
strlcpy(char *dst, const char *src, size_t dsize)
{
	const char *osrc = src;
	size_t nleft = dsize;

	/* Copy as many bytes as will fit. */
	if (nleft != 0) {
		while (--nleft != 0) {
			if ((*dst++ = *src++) == '\0')
				break;
		}
	}

	/* Not enough room in dst, add NUL and traverse rest of src. */
	if (nleft == 0) {
		if (dsize != 0)
			*dst = '\0';		/* NUL-terminate dst */
		while (*src++)
			;
	}

	return(src - osrc - 1);	/* count does not include NUL */
}

alx@asus5775:~/src/alx/libstp/src$ grepc -tfd stpecpy
./stp/stpe/stpecpy.c:15:
char *stp_nullable
stpecpy(char *stp_nullable dst, char end[0], const char *restrict src)
{
	char  *stp_nullable p;

	if (dst == end)
		return end;
	if (stp_unlikely(dst == NULL))  // Allow chaining with stpeprintf().
		return NULL;
	if (dst > end)
		stp_unreachable();

	p = memccpy(dst, src, '\0', end - dst);
	if (p != NULL)
		return p - 1;

	/* Truncation detected. */
	end[-1] = '\0';
	return end;
}

alx@asus5775:~/src/gnu/glibc$ grepc -tfd __memccpy
./string/memccpy.c:30:
void *
__memccpy (void *dest, const void *src, int c, size_t n)
{
   void *p = memchr (src, c, n);

   if (p != NULL)
     return __mempcpy (dest, src, p - src + 1);

   memcpy (dest, src, n);
   return NULL;
}



> and that adds extra overhead. Using memccpy
> is risky too since that is often not optimized.

Well, with the current memccpy(3) I already suspect it's going to be faster than 
strlcpy(3).  If you optimize it, it would increase the chances that it's faster :)

> 
>> -  Traverse the entire input string, to make sure it's a string.  stpecpy(3)
>> instead only reads what's necessary for the copy; it stops reading after truncation.
> 
> Almost all strings are short and few cases need truncation, so I don't see the issue.
> People concerned about performance wouldn't use the standard string functions
> anyway.

That's true for copying strings without truncation; in such cases, mempcpy(3) is 
probably the best thing you can use.  That's documented in string_copying(7).

However, mempcpy(3) doesn't help when you need to truncate.  And there are cases 
where you need to truncate, especially when you need formatting (snprintf(3)).

> 
>> -  strlcat(3) finds the terminating null byte; that's something you already know
>> where it is, with functions that return a useful pointer (mempcpy(3), stpcpy(3),
>> and stpecpy(3)).
> 
> If you know the end of the destination string then don't use concatenation. Easy!

I'll quote strlcpy(3) man page:

        Since it is known how many  characters  were  copied  the  first  time,
        things can be sped up a bit by using a copy instead of an append:

              [... code example ...]

        However,  one  may question the validity of such optimizations, as they
        defeat the whole purpose of strlcpy() and strlcat().  As  a  matter  of
        fact, the first version of this manual page got it wrong.

So, not easy.

> 
> In fact compilers could inline the 'dest += strlen (dest)' part of strcat and call
> strcpy instead. This allows optimization of the strlen in case you know the size
> of the destination string. This is true for strlcpy too, a compiler could just inline
> it and optimize the strlen (src).

Fair.  Compilers already optimize strcat(3) into stpcpy(3), so I expect they 
could optimize strlcat(3) in the same way.  So, let's assume that a compiler can 
optimize strlcat(3) calls into the correct strlcpy(3) calls:

               if (strlcpy(buf, "Hello ", sizeof(buf)) >= sizeof(buf))
                   goto toolong;
               if (strlcat(buf, "world", sizeof(buf)) >= sizeof(buf))
                   goto toolong;
               len = strlcat(buf, "!", sizeof(buf));
               if (len >= sizeof(buf))
                   goto toolong;
               puts(buf);

the code above could be optimized into (assuming I didn't write any bugs, which 
is likely, due to complexity):

               n = strlcpy(buf, "Hello ", sizeof(buf));
               if (n >= sizeof(buf))
                   goto toolong;
               n += strlcpy(buf + n, "world", sizeof(buf) - n);
               if (n >= sizeof(buf))
                   goto toolong;
               n += strlcpy(buf + n, "!", sizeof(buf) - n);
               if (n >= sizeof(buf))
                   goto toolong;
               puts(buf);

This means that to the branches in the implementation of strlcpy(3), we need to 
add one more.  strlcpy(3) forces the user to check for truncation after every 
call, while stpecpy(3) allows checking truncation only once at the end of the 
chained calls.  So, here goes another performance issue of these calls.

And, while strlcpy(3)/cat(3) are simpler _when truncating is fine_, truncating 
is almost never fine.  Which means you almost always need to check for 
truncation.  Which means boilerplate code, which in turn makes these calls more 
complex to use than simply stpecpy(3).  Here goes the equivalent code to the above:

               end = buf + sizeof(buf);
               p = buf;
               p = stpecpy(p, end, "Hello ");
               p = stpecpy(p, end, "world");
               p = stpecpy(p, end, "!");
               if (p == end)
                   goto toolong;
               len = p - buf;
               puts(buf);

I find it _way_ more readable than the strlcpy(3)/cat(3) code.  Oh, and did I 
say it has less branches? :)

> 
>> The reason that triggered me wanting to add this function is seeing strncpy(3)
>> used for a patch to some glibc internals themselves.  Using strlcpy(3)/cat(3) in
>> glibc internals would be bad for performance; I would hope that glibc uses the
>> optimal internals, even if it also provides slow functions for users.
> 
> Most internal uses are unlikely to be performance critical, and correctness is kind
> of important for libraries.

Correctness comes from readability.  The less code you write, the more correct 
the code will be.  Just see the above for proof that this function actually 
promotes correct code, which in fact was the main reason for this function. 
Performance is just a nice side effect.

> 
> IMHO inventing many slightly different non-standard string functions is what
> causes performance and correctness issues. People disagree about the semantics
> (strlcpy has been argued over for a decade or so). Even if a library supports them,
> you never know which implementations are actually well optimized (obviously
> this varies between ISA and different libc's).

The good thing about stpecpy(3) is that it's difficult to make it slow.  The 
naive implementation is already quite simple and fast.  Not saying it's 
comparable to memcpy(3), but it's likely to be faster than competition, which is 
strlcpy(3).

> So which non-standard string function
> is safe to use across all targets and libraries?

I covered that in string_copying(7).  I guess in a few weeks you'll be able to 
read it in your favourite unsable distro :)

> 
> In contrast we can be pretty sure that the standard strlen, memcpy etc are both
> correct and efficient on all targets/libc's.

Sure, but memcpy(3) is not usable in code that needs to truncate.  We need to 
compare against stpncpy(3) (ughhh) and strlcpy(3).

> 
> Cheers,
> Wilco

Cheers,

Alex

-- 
<http://www.alejandro-colomar.es/>